# TOON Benchmarks
Benchmarks measuring TOON's **token efficiency** and **retrieval accuracy** compared to JSON, XML, YAML, and CSV.
> [!NOTE]
> Results are automatically embedded in the [main README](../README.md#benchmarks). This guide focuses on running the benchmarks locally.
## Quick Start
```bash
# Run token efficiency benchmark
pnpm benchmark:token-efficiency
# Run retrieval accuracy benchmark (requires API keys)
pnpm benchmark:accuracy
```
## Token Efficiency Benchmark
Measures token counts for the same data rendered as TOON, JSON, XML, YAML, and CSV, and reports the resulting savings:
1. Generate datasets (GitHub repos, analytics, orders)
2. Convert to all formats (TOON, JSON, XML, YAML, CSV)
3. Tokenize each rendering using `gpt-tokenizer` (`o200k_base` encoding), as sketched below
4. Calculate savings and generate report
```bash
pnpm benchmark:token-efficiency
```
Results are saved to `results/token-efficiency.md`.
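For orientation, the tokenization and savings calculation boil down to something like the following. This is a minimal sketch assuming `gpt-tokenizer`'s `o200k_base` subpath export; the TOON string is a hand-written placeholder, not the output of the real converters in `src/formatters.ts`.
```ts
// Minimal sketch of steps 3–4: count tokens per format and derive savings.
import { encode } from 'gpt-tokenizer/encoding/o200k_base'

const countTokens = (text: string): number => encode(text).length

const json = JSON.stringify([{ repo: 'toon', stars: 1234 }])
const toon = 'repos[1]{repo,stars}:\n  toon,1234' // placeholder TOON rendering

const savings = 1 - countTokens(toon) / countTokens(json)
console.log(`TOON uses ${(savings * 100).toFixed(1)}% fewer tokens than JSON`)
```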
## Retrieval Accuracy Benchmark
Tests how well LLMs can answer questions about data in different formats (TOON, JSON, XML, YAML, CSV):
1. Generate 154 questions across 4 datasets
2. Convert each dataset to all 5 formats
3. Query each LLM with the formatted data + question (sketched below)
4. Validate answers using `gpt-5-nano` as judge
5. Aggregate metrics and generate report
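Under the hood, step 3 amounts to prompting a model with the rendered data and a question. A rough sketch using the Vercel AI SDK that the `models` array below is built on; the prompt wording here is illustrative and may differ from the benchmark's actual template in `src/evaluate.ts`.
```ts
// Rough sketch of step 3: ask one model a question about one formatted dataset.
import { generateText } from 'ai'
import { openai } from '@ai-sdk/openai'

async function ask(formattedData: string, question: string): Promise<string> {
  const { text } = await generateText({
    model: openai('gpt-5-nano'),
    prompt: `Answer the question using only the data below.\n\n${formattedData}\n\nQuestion: ${question}`,
  })
  return text
}
```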
### Setup
1. Edit [`src/evaluate.ts`](./src/evaluate.ts) and add models to the `models` array:
```ts
export const models: LanguageModelV2[] = [
  openai('gpt-5-nano'),
  anthropic('claude-haiku-4-5-20251001'),
  google('gemini-2.5-flash'),
  xai('grok-4-fast-non-reasoning'),
  // Add your models here
]
```
2. Duplicate `.env.example` to `.env` and add your API keys:
```bash
cp .env.example .env
```
### Usage
```bash
# Full benchmark
pnpm benchmark:accuracy
# Dry run (10 questions only, to verify your setup)
DRY_RUN=true pnpm benchmark:accuracy
```
Running the script will:
1. Prompt you to select which models to test.
2. Skip models with existing results (rerun to overwrite).
3. Show progress with rate limiting.
4. Save results to `results/accuracy/models/{model-id}.json`.
5. Generate report at `results/retrieval-accuracy.md`.
### Configuration
Edit [`src/constants.ts`](./src/constants.ts) to adjust:
- `MODEL_RPM_LIMITS`: rate limits (requests per minute) per model
- `DEFAULT_CONCURRENCY`: parallel tasks (default: 10)
- `DRY_RUN_LIMITS`: questions per dry run (default: 10)
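The exact types live in the file itself; as a rough orientation, these knobs look something like the following. Values and shapes are illustrative, not authoritative.
```ts
// Illustrative shapes only; consult src/constants.ts for the real definitions.
export const MODEL_RPM_LIMITS: Record<string, number> = {
  'gpt-5-nano': 500, // requests per minute for this model id (example value)
}

export const DEFAULT_CONCURRENCY = 10 // parallel evaluation tasks
export const DRY_RUN_LIMITS = 10 // questions evaluated when DRY_RUN=true
```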
## Project Structure
```
scripts/
├── accuracy-benchmark.ts # Retrieval accuracy benchmark
├── token-efficiency-benchmark.ts # Token counting benchmark
└── fetch-github-repos.ts # Update GitHub dataset
src/
├── constants.ts # Configuration
├── datasets.ts # Test data generators
├── evaluate.ts # LLM evaluation
├── formatters.ts # Format converters
├── questions.ts # Question generation
├── report.ts # Markdown reports
├── storage.ts # Result caching
└── utils.ts # Helpers
data/
└── github-repos.json # Top 100 GitHub repos
results/
├── token-efficiency.md # Token savings report
├── retrieval-accuracy.md # Accuracy report
└── accuracy/models/ # Per-model results (JSON)
```