# TOON Benchmarks
Benchmarks measuring TOON's token efficiency and retrieval accuracy compared to JSON, XML, YAML, and CSV.
> [!NOTE]
> Results are automatically embedded in the main README. This guide focuses on running the benchmarks locally.
## Quick Start

```bash
# Run token efficiency benchmark
pnpm benchmark:tokens

# Run retrieval accuracy benchmark (requires API keys)
pnpm benchmark:accuracy
```
## Token Efficiency Benchmark

Measures token count reduction across JSON, XML, YAML, CSV, and TOON:

- Generate datasets (GitHub repos, analytics, orders)
- Convert to all formats (TOON, JSON, XML, YAML, CSV)
- Tokenize using `gpt-tokenizer` (`o200k_base` encoding); see the sketch below
- Calculate savings and generate report

```bash
pnpm benchmark:tokens
```

Results are saved to `results/token-efficiency.md`.
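To make the tokenize step concrete, here is a minimal sketch of counting tokens with `gpt-tokenizer`'s `o200k_base` encoding. It is not the benchmark script itself (that lives in `scripts/token-efficiency-benchmark.ts`), and the sample payload is made up:

```ts
// Sketch of the token-counting step only, assuming gpt-tokenizer's
// per-encoding entry point for o200k_base.
import { encode } from 'gpt-tokenizer/encoding/o200k_base'

// Made-up sample payload; the real benchmark uses the generated datasets.
const data = { repos: [{ name: 'toon', stars: 1234, language: 'TypeScript' }] }
const json = JSON.stringify(data, null, 2)

// Token count for one serialization; the benchmark repeats this per format
// (TOON, JSON, XML, YAML, CSV) and reports the relative savings.
console.log('JSON tokens:', encode(json).length)
```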
## Retrieval Accuracy Benchmark

Tests how well LLMs can answer questions about data in different formats (TOON, JSON, JSON compact, XML, YAML, CSV):

- Generate ~200 questions across 6 datasets (CSV is only included for datasets with a flat/tabular structure)
- Convert each dataset to all supported formats
- Query each LLM with the formatted data + question (sketched below)
- Validate answers using `gpt-5-nano` as judge
- Aggregate metrics and generate report
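Roughly, each evaluation pairs a query call with a judge call. The following is a minimal sketch using the Vercel AI SDK's `generateText`; it is an assumption about the approach, not the code in `src/evaluate.ts`, and the prompts and the `evaluateQuestion` helper are illustrative:

```ts
import { generateText } from 'ai'
import { openai } from '@ai-sdk/openai'

// Hypothetical helper for illustration; the real logic lives in src/evaluate.ts.
async function evaluateQuestion(
  formattedData: string,
  question: string,
  expected: string,
): Promise<boolean> {
  // Ask the model under test to answer a question about the formatted payload.
  const { text: answer } = await generateText({
    model: openai('gpt-5-nano'),
    prompt: `Answer using only the data below.\n\n${formattedData}\n\nQuestion: ${question}`,
  })

  // Use gpt-5-nano as a judge to compare the answer against the expected value.
  const { text: verdict } = await generateText({
    model: openai('gpt-5-nano'),
    prompt: `Expected: ${expected}\nAnswer: ${answer}\nReply with "correct" or "incorrect".`,
  })

  return verdict.trim().toLowerCase().startsWith('correct')
}
```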
## Setup

- Edit `src/evaluate.ts` and add models to the exported `models` array:

  ```ts
  export const models: LanguageModelV2[] = [
    openai('gpt-5-nano'),
    anthropic('claude-haiku-4-5-20251001'),
    google('gemini-2.5-flash'),
    xai('grok-4-fast-non-reasoning'),
    // Add your models here
  ]
  ```

- Duplicate `.env.example` to `.env` and add your API keys:

  ```bash
  cp .env.example .env
  ```
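The provider factories in that array typically come from the Vercel AI SDK provider packages. Assuming that setup, the accompanying imports would look roughly like this (check the existing imports at the top of `src/evaluate.ts`):

```ts
// Assumed imports when using the AI SDK provider packages.
import type { LanguageModelV2 } from '@ai-sdk/provider'
import { openai } from '@ai-sdk/openai'
import { anthropic } from '@ai-sdk/anthropic'
import { google } from '@ai-sdk/google'
import { xai } from '@ai-sdk/xai'
```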
## Usage

```bash
# Full benchmark
pnpm benchmark:accuracy

# Dry run (10 questions only, for testing setup)
DRY_RUN=true pnpm benchmark:accuracy
```
Running the script will:

- Prompt you to select which models to test.
- Skip models with existing results (rerun to overwrite).
- Show progress with rate limiting.
- Save results to `results/accuracy/models/{model-id}.json`.
- Generate a report at `results/retrieval-accuracy.md`.
## Configuration

Edit `src/constants.ts` to adjust the following (a rough sketch follows the list):

- `MODEL_RPM_LIMITS` – Rate limits per model
- `DEFAULT_CONCURRENCY` – Parallel tasks (default: 10)
- `DRY_RUN_LIMITS` – Questions per dry run (default: 10)
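As a rough sketch of how these constants might be declared; the actual types and values in `src/constants.ts` may differ, and the per-model numbers below are placeholders (only the two defaults of 10 come from this guide):

```ts
// Illustrative only; check src/constants.ts for the real definitions.

// Requests-per-minute caps, keyed by model id (placeholder values).
export const MODEL_RPM_LIMITS: Record<string, number> = {
  'gpt-5-nano': 500,
  'gemini-2.5-flash': 300,
}

// How many evaluation tasks run in parallel.
export const DEFAULT_CONCURRENCY = 10

// How many questions a DRY_RUN=true run evaluates.
export const DRY_RUN_LIMITS = 10
```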
## Project Structure

```
scripts/
├── accuracy-benchmark.ts           # Retrieval accuracy benchmark
├── token-efficiency-benchmark.ts   # Token counting benchmark
└── fetch-github-repos.ts           # Update GitHub dataset

src/
├── constants.ts    # Configuration
├── datasets.ts     # Test data generators
├── evaluate.ts     # LLM evaluation
├── formatters.ts   # Format converters
├── questions.ts    # Question generation
├── report.ts       # Markdown reports
├── storage.ts      # Result caching
└── utils.ts        # Helpers

data/
└── github-repos.json   # Top 100 GitHub repos

results/
├── token-efficiency.md     # Token savings report
├── retrieval-accuracy.md   # Accuracy report
└── accuracy/models/        # Per-model results (JSON)
```