TOON Benchmarks

Benchmarks measuring TOON's token efficiency and retrieval accuracy compared to JSON, XML, YAML, and CSV.

Note

Results are automatically embedded in the main README. This guide focuses on running the benchmarks locally.

Quick Start

# Run token efficiency benchmark
pnpm benchmark:token-efficiency

# Run retrieval accuracy benchmark (requires API keys)
pnpm benchmark:accuracy

Token Efficiency Benchmark

Measures the token count reduction TOON achieves relative to JSON, XML, YAML, and CSV:

  1. Generate datasets (GitHub repos, analytics, orders)
  2. Convert to all formats (TOON, JSON, XML, YAML, CSV)
  3. Tokenize using gpt-tokenizer (o200k_base encoding)
  4. Calculate savings and generate report
Run it with:

pnpm benchmark:token-efficiency

Results are saved to results/token-efficiency.md.
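
To get a feel for steps 3 and 4, the sketch below tokenizes two renderings of the same tiny dataset with gpt-tokenizer's o200k_base encoding and computes the savings. It is a minimal illustration, not the benchmark script itself: the sample data and the hand-written TOON string are made up, and it assumes gpt-tokenizer's per-encoding entry point.

import { encode } from 'gpt-tokenizer/encoding/o200k_base'

// Tiny illustrative dataset; the real benchmark uses the generated datasets above.
const data = { users: [{ id: 1, name: 'Ada' }, { id: 2, name: 'Grace' }] }
const json = JSON.stringify(data, null, 2)

// Hand-written TOON rendering of the same data (illustrative; use the TOON encoder for real conversions).
const toon = 'users[2]{id,name}:\n  1,Ada\n  2,Grace'

// Count tokens with the same o200k_base encoding the benchmark uses.
const jsonTokens = encode(json).length
const toonTokens = encode(toon).length

// Savings relative to the formatted JSON baseline.
const savings = (1 - toonTokens / jsonTokens) * 100
console.log(`TOON uses ${savings.toFixed(1)}% fewer tokens than formatted JSON here`)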

Retrieval Accuracy Benchmark

Tests how well LLMs can answer questions about data in different formats (TOON, JSON, JSON compact, XML, YAML, CSV):

  1. Generate ~150-160 questions across 4 datasets
  2. Convert each dataset to all 6 formats
  3. Query each LLM with formatted data + question
  4. Validate answers using gpt-5-nano as the judge
  5. Aggregate metrics and generate report
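
The sketch below illustrates steps 3 and 4 using the AI SDK's generateText: one call answers a question from formatted data, a second call judges the answer against the expected value. It is a simplified outline under assumptions; the real pipeline in src/evaluate.ts adds its own prompts, rate limiting, and result caching, and the sample data, question, and prompts here are made up.

import { generateText } from 'ai'
import { openai } from '@ai-sdk/openai'

// Illustrative inputs; the real benchmark builds these from its datasets and question generators.
const formattedData = 'users[2]{id,name}:\n  1,Ada\n  2,Grace'
const question = 'How many users are listed?'
const expected = '2'

// Step 3: ask the model under test to answer from the formatted data.
const { text: answer } = await generateText({
  model: openai('gpt-5-nano'),
  prompt: `Answer using only the data below.\n\n${formattedData}\n\nQuestion: ${question}`,
})

// Step 4: a second call acts as the judge, comparing the answer to the expected value.
const { text: verdict } = await generateText({
  model: openai('gpt-5-nano'),
  prompt: `Expected: ${expected}\nAnswer: ${answer}\nReply with exactly "correct" or "incorrect".`,
})

console.log(verdict.trim().toLowerCase() === 'correct')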

Setup

  1. Edit src/evaluate.ts and add models to the exported models array:
    export const models: LanguageModelV2[] = [
      openai('gpt-5-nano'),
      anthropic('claude-haiku-4-5-20251001'),
      google('gemini-2.5-flash'),
      xai('grok-4-fast-non-reasoning'),
      // Add your models here
    ]
    
  2. Duplicate .env.example to .env and add your API keys:
    cp .env.example .env
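
Once the keys are in the environment, the AI SDK providers typically read them from their standard variables (for example OPENAI_API_KEY, ANTHROPIC_API_KEY, GOOGLE_GENERATIVE_AI_API_KEY, XAI_API_KEY). If a key lives under a different name, a provider instance can be created explicitly; this is an optional sketch, not something the repo requires, and MY_OPENAI_KEY is a hypothetical variable name.

import { createOpenAI } from '@ai-sdk/openai'

// Only needed if your key is not in the default OPENAI_API_KEY variable (MY_OPENAI_KEY is hypothetical).
const customOpenAI = createOpenAI({ apiKey: process.env.MY_OPENAI_KEY })

// Use it in the models array just like the default provider export.
export const customModel = customOpenAI('gpt-5-nano')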
    

Usage

# Full benchmark
pnpm benchmark:accuracy

# Dry run (10 questions only, for testing setup)
DRY_RUN=true pnpm benchmark:accuracy

Running the script will:

  1. Prompt you to select which models to test.
  2. Skip models with existing results (rerun to overwrite).
  3. Show progress while respecting per-model rate limits.
  4. Save results to results/accuracy/models/{model-id}.json.
  5. Generate report at results/retrieval-accuracy.md.

Configuration

Edit src/constants.ts to adjust:

  • MODEL_RPM_LIMITS: rate limits per model
  • DEFAULT_CONCURRENCY: parallel tasks (default: 10)
  • DRY_RUN_LIMITS: questions per dry run (default: 10)
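
As a rough idea of the shape these constants might take, here is a hypothetical sketch; the names come from the list above, but the types and values are placeholders, so check src/constants.ts for the actual definitions.

// Hypothetical sketch of src/constants.ts; values are placeholders, not the repo's real settings.
export const MODEL_RPM_LIMITS: Record<string, number> = {
  'gpt-5-nano': 500, // placeholder requests-per-minute limit
}

export const DEFAULT_CONCURRENCY = 10 // parallel tasks
export const DRY_RUN_LIMITS = 10 // questions per dry run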

Project Structure

scripts/
├── accuracy-benchmark.ts         # Retrieval accuracy benchmark
├── token-efficiency-benchmark.ts # Token counting benchmark
└── fetch-github-repos.ts         # Update GitHub dataset
src/
├── constants.ts                  # Configuration
├── datasets.ts                   # Test data generators
├── evaluate.ts                   # LLM evaluation
├── formatters.ts                 # Format converters
├── questions.ts                  # Question generation
├── report.ts                     # Markdown reports
├── storage.ts                    # Result caching
└── utils.ts                      # Helpers
data/
└── github-repos.json             # Top 100 GitHub repos
results/
├── token-efficiency.md           # Token savings report
├── retrieval-accuracy.md         # Accuracy report
└── accuracy/models/              # Per-model results (JSON)