text(accuracy): add Grok-4-fast, remove default temperature

Johann Schopplich
2025-10-28 22:54:00 +01:00
parent e400e68ad6
commit ecf578a7dc
13 changed files with 301 additions and 117 deletions

benchmarks/README.md

@@ -0,0 +1,108 @@
# TOON Benchmarks
Benchmarks measuring TOON's **token efficiency** and **retrieval accuracy** compared to JSON, XML, YAML, and CSV.
> [!NOTE]
> Results are automatically embedded in the [main README](../README.md#benchmarks). This guide focuses on running the benchmarks locally.
## Quick Start
```bash
# Run token efficiency benchmark
pnpm benchmark:token-efficiency
# Run retrieval accuracy benchmark (requires API keys)
pnpm benchmark:accuracy
```
## Token Efficiency Benchmark
Measures TOON's token count reduction relative to JSON, XML, YAML, and CSV:
1. Generate datasets (GitHub repos, analytics, orders)
2. Convert to all formats (TOON, JSON, XML, YAML, CSV)
3. Tokenize using `gpt-tokenizer` (`o200k_base` encoding)
4. Calculate savings and generate report
```bash
pnpm benchmark:token-efficiency
```
Results are saved to `results/token-efficiency.md`.
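A minimal sketch of steps 3–4 (tokenize each serialization and compare against JSON), assuming `gpt-tokenizer`'s per-encoding subpath export; the `Formatter` type and `compareFormats` helper are illustrative, not the benchmark's actual API:
```ts
// Minimal sketch of the counting step, not the actual benchmark script.
// Assumes gpt-tokenizer's per-encoding subpath export for o200k_base.
import { encode } from 'gpt-tokenizer/encoding/o200k_base'

// Hypothetical formatter signature; the real converters live in src/formatters.ts.
type Formatter = (data: unknown) => string

function countTokens(text: string): number {
  return encode(text).length
}

export function compareFormats(data: unknown, formatters: Record<string, Formatter>) {
  const jsonTokens = countTokens(JSON.stringify(data, null, 2))
  return Object.entries(formatters).map(([format, toText]) => {
    const tokens = countTokens(toText(data))
    return {
      format,
      tokens,
      savingsVsJson: `${(((jsonTokens - tokens) / jsonTokens) * 100).toFixed(1)}%`,
    }
  })
}
```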
## Retrieval Accuracy Benchmark
Tests how well LLMs can answer questions about data in different formats (TOON, JSON, XML, YAML, CSV):
1. Generate 154 questions across 4 datasets
2. Convert each dataset to all 5 formats
3. Query each LLM with formatted data + question
4. Validate answers using `gpt-5-nano` as judge
5. Aggregate metrics and generate report
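The query-and-judge loop (steps 3–4) could look roughly like the sketch below, using the AI SDK's `generateText`; the prompts and the `Question` shape are simplified assumptions, not the actual `src/evaluate.ts` code:
```ts
// Illustrative only; the real evaluation logic lives in src/evaluate.ts.
import { generateText } from 'ai'
import { openai } from '@ai-sdk/openai'
import type { LanguageModelV2 } from '@ai-sdk/provider'

// Simplified question shape for illustration.
interface Question {
  prompt: string
  expected: string
}

// Step 3: query the model under test with the formatted dataset plus the question.
export async function answerQuestion(model: LanguageModelV2, formatted: string, question: Question) {
  const { text } = await generateText({
    model,
    prompt: `${formatted}\n\nQuestion: ${question.prompt}`,
  })
  return text
}

// Step 4: let gpt-5-nano judge whether the answer matches the expected value.
export async function judgeAnswer(answer: string, question: Question) {
  const { text } = await generateText({
    model: openai('gpt-5-nano'),
    prompt: `Expected: ${question.expected}\nAnswer: ${answer}\nReply with "correct" or "incorrect".`,
  })
  return text.trim().toLowerCase().startsWith('correct')
}
```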
### Setup
1. Edit [`src/evaluate.ts`](./src/evaluate.ts) and add models to the `models` array (see the import sketch after this list):
```ts
export const models: LanguageModelV2[] = [
openai('gpt-5-nano'),
anthropic('claude-haiku-4-5-20251001'),
google('gemini-2.5-flash'),
xai('grok-4-fast-non-reasoning'),
// Add your models here
]
```
2. Duplicate `.env.example` to `.env` and add your API keys:
```bash
cp .env.example .env
```
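The `models` array in step 1 assumes the corresponding AI SDK provider factories are already imported; a plausible import header, with package names taken from the AI SDK's standard provider packages, might be:
```ts
// Assumed imports for the `models` array above (one package per provider).
import { openai } from '@ai-sdk/openai'
import { anthropic } from '@ai-sdk/anthropic'
import { google } from '@ai-sdk/google'
import { xai } from '@ai-sdk/xai'
import type { LanguageModelV2 } from '@ai-sdk/provider'
```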
### Usage
```bash
# Full benchmark
pnpm benchmark:accuracy
# Dry run (10 questions only, for testing setup)
DRY_RUN=true pnpm benchmark:accuracy
```
Running the script will:
1. Prompt you to select which models to test.
2. Skip models with existing results (rerun to overwrite).
3. Show progress with rate limiting.
4. Save results to `results/accuracy/models/{model-id}.json`.
5. Generate report at `results/retrieval-accuracy.md`.
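A rough sketch of that flow under stated assumptions: the result paths match the layout documented here, while `evaluateModel` is a hypothetical callback standing in for the real evaluation code.
```ts
// Illustrative orchestration only; the real logic lives in scripts/accuracy-benchmark.ts.
import { existsSync } from 'node:fs'
import { mkdir, writeFile } from 'node:fs/promises'

const DRY_RUN = process.env.DRY_RUN === 'true'
const DRY_RUN_LIMIT = 10 // mirrors the DRY_RUN_LIMITS default in src/constants.ts

export async function runBenchmark(
  modelIds: string[],
  questions: unknown[],
  // Hypothetical evaluator: queries one model and returns its per-question results.
  evaluateModel: (modelId: string, questions: unknown[]) => Promise<unknown>,
) {
  const selected = DRY_RUN ? questions.slice(0, DRY_RUN_LIMIT) : questions
  await mkdir('results/accuracy/models', { recursive: true })

  for (const modelId of modelIds) {
    const resultPath = `results/accuracy/models/${modelId}.json`
    if (existsSync(resultPath)) {
      console.log(`Skipping ${modelId}: results already exist`)
      continue
    }
    const results = await evaluateModel(modelId, selected)
    await writeFile(resultPath, JSON.stringify(results, null, 2))
  }
}
```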
### Configuration
Edit [`src/constants.ts`](./src/constants.ts) to adjust:
- `MODEL_RPM_LIMITS`: Rate limits (requests per minute) per model
- `DEFAULT_CONCURRENCY`: Parallel tasks (default: 10)
- `DRY_RUN_LIMITS`: Questions per dry run (default: 10)
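For orientation, those constants might look roughly like this; the shape and the rate-limit value shown are assumptions, and only the names and the two documented defaults come from this guide:
```ts
// Assumed shape of src/constants.ts; check the file itself for the real values.
export const MODEL_RPM_LIMITS: Record<string, number> = {
  'gpt-5-nano': 500, // requests per minute, placeholder value
}

export const DEFAULT_CONCURRENCY = 10 // parallel tasks
export const DRY_RUN_LIMITS = 10 // questions per dry run
```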
## Project Structure
```
scripts/
├── accuracy-benchmark.ts # Retrieval accuracy benchmark
├── token-efficiency-benchmark.ts # Token counting benchmark
└── fetch-github-repos.ts # Update GitHub dataset
src/
├── constants.ts # Configuration
├── datasets.ts # Test data generators
├── evaluate.ts # LLM evaluation
├── formatters.ts # Format converters
├── questions.ts # Question generation
├── report.ts # Markdown reports
├── storage.ts # Result caching
└── utils.ts # Helpers
data/
└── github-repos.json # Top 100 GitHub repos
results/
├── token-efficiency.md # Token savings report
├── retrieval-accuracy.md # Accuracy report
└── accuracy/models/ # Per-model results (JSON)
```