# TOON Benchmarks
Benchmarks measuring TOON's token efficiency and retrieval accuracy compared to JSON, XML, YAML, and CSV.
> [!NOTE]
> Results are automatically embedded in the main README. This guide focuses on running the benchmarks locally.
## Quick Start

```bash
# Run token efficiency benchmark
pnpm benchmark:tokens

# Run retrieval accuracy benchmark (requires API keys)
pnpm benchmark:accuracy
```
## Token Efficiency Benchmark

Measures token count reduction across JSON, XML, YAML, CSV, and TOON:

- Generate datasets (GitHub repos, analytics, orders)
- Convert to all formats (TOON, JSON, XML, YAML, CSV)
- Tokenize using `gpt-tokenizer` (`o200k_base` encoding); see the sketch below
- Calculate savings and generate report

```bash
pnpm benchmark:tokens
```

Results are saved to `results/token-efficiency.md`.
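To make the tokenize step concrete, here is a minimal sketch of counting tokens with `gpt-tokenizer`'s `o200k_base` encoding. It is not the benchmark script itself (that lives in `scripts/token-efficiency-benchmark.ts`), and the sample payload is made up:

```ts
// Sketch of the token-counting step only, assuming gpt-tokenizer's
// per-encoding entry point for o200k_base.
import { encode } from 'gpt-tokenizer/encoding/o200k_base'

// Made-up sample payload; the real benchmark uses the generated datasets.
const data = { repos: [{ name: 'toon', stars: 1234, language: 'TypeScript' }] }
const json = JSON.stringify(data, null, 2)

// Token count for one serialization; the benchmark repeats this per format
// (TOON, JSON, XML, YAML, CSV) and reports the relative savings.
console.log('JSON tokens:', encode(json).length)
```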
## Retrieval Accuracy Benchmark

Tests how well LLMs can answer questions about data in different formats (TOON, JSON, JSON compact, XML, YAML, CSV):

- Generate ~200 questions across 6 datasets (CSV is only included for datasets with a flat/tabular structure)
- Convert each dataset to all supported formats
- Query each LLM with the formatted data + question (sketched below)
- Validate answers using `gpt-5-nano` as judge
- Aggregate metrics and generate report
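Roughly, each evaluation pairs a query call with a judge call. The following is a minimal sketch using the Vercel AI SDK's `generateText`; it is an assumption about the approach, not the code in `src/evaluate.ts`, and the prompts and the `evaluateQuestion` helper are illustrative:

```ts
import { generateText } from 'ai'
import { openai } from '@ai-sdk/openai'

// Hypothetical helper for illustration; the real logic lives in src/evaluate.ts.
async function evaluateQuestion(
  formattedData: string,
  question: string,
  expected: string,
): Promise<boolean> {
  // Ask the model under test to answer a question about the formatted payload.
  const { text: answer } = await generateText({
    model: openai('gpt-5-nano'),
    prompt: `Answer using only the data below.\n\n${formattedData}\n\nQuestion: ${question}`,
  })

  // Use gpt-5-nano as a judge to compare the answer against the expected value.
  const { text: verdict } = await generateText({
    model: openai('gpt-5-nano'),
    prompt: `Expected: ${expected}\nAnswer: ${answer}\nReply with "correct" or "incorrect".`,
  })

  return verdict.trim().toLowerCase().startsWith('correct')
}
```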
## Setup

- Edit `src/evaluate.ts` and add models to the exported `models` array:

  ```ts
  export const models: LanguageModelV2[] = [
    openai('gpt-5-nano'),
    anthropic('claude-haiku-4-5-20251001'),
    google('gemini-2.5-flash'),
    xai('grok-4-fast-non-reasoning'),
    // Add your models here
  ]
  ```

- Duplicate `.env.example` to `.env` and add your API keys:

  ```bash
  cp .env.example .env
  ```
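The provider factories in that array typically come from the Vercel AI SDK provider packages. Assuming that setup, the accompanying imports would look roughly like this (check the existing imports at the top of `src/evaluate.ts`):

```ts
// Assumed imports when using the AI SDK provider packages.
import type { LanguageModelV2 } from '@ai-sdk/provider'
import { openai } from '@ai-sdk/openai'
import { anthropic } from '@ai-sdk/anthropic'
import { google } from '@ai-sdk/google'
import { xai } from '@ai-sdk/xai'
```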
## Usage

```bash
# Full benchmark
pnpm benchmark:accuracy

# Dry run (10 questions only, for testing setup)
DRY_RUN=true pnpm benchmark:accuracy
```
Running the script will:

- Prompt you to select which models to test.
- Skip models with existing results (rerun to overwrite).
- Show progress with rate limiting.
- Save results to `results/accuracy/models/{model-id}.json`.
- Generate a report at `results/retrieval-accuracy.md`.
## Configuration

Edit `src/constants.ts` to adjust the following (a rough sketch follows the list):

- `MODEL_RPM_LIMITS` – Rate limits per model
- `DEFAULT_CONCURRENCY` – Parallel tasks (default: 10)
- `DRY_RUN_LIMITS` – Questions per dry run (default: 10)
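As a rough sketch of how these constants might be declared; the actual types and values in `src/constants.ts` may differ, and the per-model numbers below are placeholders (only the two defaults of 10 come from this guide):

```ts
// Illustrative only; check src/constants.ts for the real definitions.

// Requests-per-minute caps, keyed by model id (placeholder values).
export const MODEL_RPM_LIMITS: Record<string, number> = {
  'gpt-5-nano': 500,
  'gemini-2.5-flash': 300,
}

// How many evaluation tasks run in parallel.
export const DEFAULT_CONCURRENCY = 10

// How many questions a DRY_RUN=true run evaluates.
export const DRY_RUN_LIMITS = 10
```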
## Project Structure

```
scripts/
├── accuracy-benchmark.ts           # Retrieval accuracy benchmark
├── token-efficiency-benchmark.ts   # Token counting benchmark
└── fetch-github-repos.ts           # Update GitHub dataset

src/
├── constants.ts    # Configuration
├── datasets.ts     # Test data generators
├── evaluate.ts     # LLM evaluation
├── formatters.ts   # Format converters
├── questions.ts    # Question generation
├── report.ts       # Markdown reports
├── storage.ts      # Result caching
└── utils.ts        # Helpers

data/
└── github-repos.json   # Top 100 GitHub repos

results/
├── token-efficiency.md     # Token savings report
├── retrieval-accuracy.md   # Accuracy report
└── accuracy/models/        # Per-model results (JSON)
```