TOON Benchmarks

Benchmarks measuring TOON's token efficiency and retrieval accuracy compared to JSON, XML, YAML, and CSV.

Note

Results are automatically embedded in the main README. This guide focuses on running the benchmarks locally.

Quick Start

# Run token efficiency benchmark
pnpm benchmark:token-efficiency

# Run retrieval accuracy benchmark (requires API keys)
pnpm benchmark:accuracy

Token Efficiency Benchmark

Measures the token count reduction TOON achieves relative to JSON, XML, YAML, and CSV:

  1. Generate datasets (GitHub repos, analytics, orders)
  2. Convert to all formats (TOON, JSON, XML, YAML, CSV)
  3. Tokenize using gpt-tokenizer (o200k_base encoding)
  4. Calculate savings and generate report
Run it with:

pnpm benchmark:token-efficiency

Results are saved to results/token-efficiency.md.
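
To get a feel for steps 3 and 4, the sketch below tokenizes two renderings of the same tiny dataset with gpt-tokenizer's o200k_base encoding and computes the savings. It is a minimal illustration, not the benchmark script itself: the sample data and the hand-written TOON string are made up, and it assumes gpt-tokenizer's per-encoding entry point.

import { encode } from 'gpt-tokenizer/encoding/o200k_base'

// Tiny illustrative dataset; the real benchmark uses the generated datasets above.
const data = { users: [{ id: 1, name: 'Ada' }, { id: 2, name: 'Grace' }] }
const json = JSON.stringify(data, null, 2)

// Hand-written TOON rendering of the same data (illustrative; use the TOON encoder for real conversions).
const toon = 'users[2]{id,name}:\n  1,Ada\n  2,Grace'

// Count tokens with the same o200k_base encoding the benchmark uses.
const jsonTokens = encode(json).length
const toonTokens = encode(toon).length

// Savings relative to the formatted JSON baseline.
const savings = (1 - toonTokens / jsonTokens) * 100
console.log(`TOON uses ${savings.toFixed(1)}% fewer tokens than formatted JSON here`)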

Retrieval Accuracy Benchmark

Tests how well LLMs can answer questions about data in different formats (TOON, JSON, JSON compact, XML, YAML, CSV):

  1. Generate ~150-160 questions across 4 datasets
  2. Convert each dataset to all 6 formats
  3. Query each LLM with formatted data + question
  4. Validate answers using gpt-5-nano as the judge
  5. Aggregate metrics and generate report
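
The sketch below illustrates steps 3 and 4 using the AI SDK's generateText: one call answers a question from formatted data, a second call judges the answer against the expected value. It is a simplified outline under assumptions; the real pipeline in src/evaluate.ts adds its own prompts, rate limiting, and result caching, and the sample data, question, and prompts here are made up.

import { generateText } from 'ai'
import { openai } from '@ai-sdk/openai'

// Illustrative inputs; the real benchmark builds these from its datasets and question generators.
const formattedData = 'users[2]{id,name}:\n  1,Ada\n  2,Grace'
const question = 'How many users are listed?'
const expected = '2'

// Step 3: ask the model under test to answer from the formatted data.
const { text: answer } = await generateText({
  model: openai('gpt-5-nano'),
  prompt: `Answer using only the data below.\n\n${formattedData}\n\nQuestion: ${question}`,
})

// Step 4: a second call acts as the judge, comparing the answer to the expected value.
const { text: verdict } = await generateText({
  model: openai('gpt-5-nano'),
  prompt: `Expected: ${expected}\nAnswer: ${answer}\nReply with exactly "correct" or "incorrect".`,
})

console.log(verdict.trim().toLowerCase() === 'correct')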

Setup

  1. Edit src/evaluate.ts and add models to the exported models array:
    export const models: LanguageModelV2[] = [
      openai('gpt-5-nano'),
      anthropic('claude-haiku-4-5-20251001'),
      google('gemini-2.5-flash'),
      xai('grok-4-fast-non-reasoning'),
      // Add your models here
    ]
    
  2. Duplicate .env.example to .env and add your API keys:
    cp .env.example .env
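
Once the keys are in the environment, the AI SDK providers typically read them from their standard variables (for example OPENAI_API_KEY, ANTHROPIC_API_KEY, GOOGLE_GENERATIVE_AI_API_KEY, XAI_API_KEY). If a key lives under a different name, a provider instance can be created explicitly; this is an optional sketch, not something the repo requires, and MY_OPENAI_KEY is a hypothetical variable name.

import { createOpenAI } from '@ai-sdk/openai'

// Only needed if your key is not in the default OPENAI_API_KEY variable (MY_OPENAI_KEY is hypothetical).
const customOpenAI = createOpenAI({ apiKey: process.env.MY_OPENAI_KEY })

// Use it in the models array just like the default provider export.
export const customModel = customOpenAI('gpt-5-nano')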
    

Usage

# Full benchmark
pnpm benchmark:accuracy

# Dry run (10 questions only, for testing setup)
DRY_RUN=true pnpm benchmark:accuracy

Running the script will:

  1. Prompt you to select which models to test.
  2. Skip models with existing results (rerun to overwrite).
  3. Show progress while respecting per-model rate limits.
  4. Save results to results/accuracy/models/{model-id}.json.
  5. Generate report at results/retrieval-accuracy.md.

Configuration

Edit src/constants.ts to adjust:

  • MODEL_RPM_LIMITS: rate limits per model
  • DEFAULT_CONCURRENCY: parallel tasks (default: 10)
  • DRY_RUN_LIMITS: questions per dry run (default: 10)
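
As a rough idea of the shape these constants might take, here is a hypothetical sketch; the names come from the list above, but the types and values are placeholders, so check src/constants.ts for the actual definitions.

// Hypothetical sketch of src/constants.ts; values are placeholders, not the repo's real settings.
export const MODEL_RPM_LIMITS: Record<string, number> = {
  'gpt-5-nano': 500, // placeholder requests-per-minute limit
}

export const DEFAULT_CONCURRENCY = 10 // parallel tasks
export const DRY_RUN_LIMITS = 10 // questions per dry run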

Project Structure

scripts/
├── accuracy-benchmark.ts         # Retrieval accuracy benchmark
├── token-efficiency-benchmark.ts # Token counting benchmark
└── fetch-github-repos.ts         # Update GitHub dataset
src/
├── constants.ts                  # Configuration
├── datasets.ts                   # Test data generators
├── evaluate.ts                   # LLM evaluation
├── formatters.ts                 # Format converters
├── questions.ts                  # Question generation
├── report.ts                     # Markdown reports
├── storage.ts                    # Result caching
└── utils.ts                      # Helpers
data/
└── github-repos.json             # Top 100 GitHub repos
results/
├── token-efficiency.md           # Token savings report
├── retrieval-accuracy.md         # Accuracy report
└── accuracy/models/              # Per-model results (JSON)