text(accuracy): add Grok-4-fast, remove default temperature

Johann Schopplich
2025-10-28 22:54:00 +01:00
parent e400e68ad6
commit ecf578a7dc
13 changed files with 301 additions and 117 deletions

benchmarks/README.md

@@ -0,0 +1,108 @@
# TOON Benchmarks
Benchmarks measuring TOON's **token efficiency** and **retrieval accuracy** compared to JSON, XML, YAML, and CSV.
> [!NOTE]
> Results are automatically embedded in the [main README](../README.md#benchmarks). This guide focuses on running the benchmarks locally.
## Quick Start
```bash
# Run token efficiency benchmark
pnpm benchmark:token-efficiency
# Run retrieval accuracy benchmark (requires API keys)
pnpm benchmark:accuracy
```
## Token Efficiency Benchmark
Measures TOON's token count reduction relative to JSON, XML, YAML, and CSV:
1. Generate datasets (GitHub repos, analytics, orders)
2. Convert to all formats (TOON, JSON, XML, YAML, CSV)
3. Tokenize using `gpt-tokenizer` (`o200k_base` encoding)
4. Calculate savings and generate report
```bash
pnpm benchmark:token-efficiency
```
Results are saved to `results/token-efficiency.md`.
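A minimal sketch of steps 3–4 (tokenize each serialization and compare against JSON), assuming `gpt-tokenizer`'s per-encoding subpath export; the `Formatter` type and `compareFormats` helper are illustrative, not the benchmark's actual API:
```ts
// Minimal sketch of the counting step, not the actual benchmark script.
// Assumes gpt-tokenizer's per-encoding subpath export for o200k_base.
import { encode } from 'gpt-tokenizer/encoding/o200k_base'

// Hypothetical formatter signature; the real converters live in src/formatters.ts.
type Formatter = (data: unknown) => string

function countTokens(text: string): number {
  return encode(text).length
}

export function compareFormats(data: unknown, formatters: Record<string, Formatter>) {
  const jsonTokens = countTokens(JSON.stringify(data, null, 2))
  return Object.entries(formatters).map(([format, toText]) => {
    const tokens = countTokens(toText(data))
    return {
      format,
      tokens,
      savingsVsJson: `${(((jsonTokens - tokens) / jsonTokens) * 100).toFixed(1)}%`,
    }
  })
}
```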
## Retrieval Accuracy Benchmark
Tests how well LLMs can answer questions about data in different formats (TOON, JSON, XML, YAML, CSV):
1. Generate 154 questions across 4 datasets
2. Convert each dataset to all 5 formats
3. Query each LLM with formatted data + question
4. Validate answers using `gpt-5-nano` as judge
5. Aggregate metrics and generate report
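The query-and-judge loop (steps 3–4) could look roughly like the sketch below, using the AI SDK's `generateText`; the prompts and the `Question` shape are simplified assumptions, not the actual `src/evaluate.ts` code:
```ts
// Illustrative only; the real evaluation logic lives in src/evaluate.ts.
import { generateText } from 'ai'
import { openai } from '@ai-sdk/openai'
import type { LanguageModelV2 } from '@ai-sdk/provider'

// Simplified question shape for illustration.
interface Question {
  prompt: string
  expected: string
}

// Step 3: query the model under test with the formatted dataset plus the question.
export async function answerQuestion(model: LanguageModelV2, formatted: string, question: Question) {
  const { text } = await generateText({
    model,
    prompt: `${formatted}\n\nQuestion: ${question.prompt}`,
  })
  return text
}

// Step 4: let gpt-5-nano judge whether the answer matches the expected value.
export async function judgeAnswer(answer: string, question: Question) {
  const { text } = await generateText({
    model: openai('gpt-5-nano'),
    prompt: `Expected: ${question.expected}\nAnswer: ${answer}\nReply with "correct" or "incorrect".`,
  })
  return text.trim().toLowerCase().startsWith('correct')
}
```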
### Setup
1. Edit [`src/evaluate.ts`](./src/evaluate.ts) and add models to the `models` array (see the import sketch after this list):
```ts
export const models: LanguageModelV2[] = [
openai('gpt-5-nano'),
anthropic('claude-haiku-4-5-20251001'),
google('gemini-2.5-flash'),
xai('grok-4-fast-non-reasoning'),
// Add your models here
]
```
2. Duplicate `.env.example` to `.env` and add your API keys:
```bash
cp .env.example .env
```
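The `models` array in step 1 assumes the corresponding AI SDK provider factories are already imported; a plausible import header, with package names taken from the AI SDK's standard provider packages, might be:
```ts
// Assumed imports for the `models` array above (one package per provider).
import { openai } from '@ai-sdk/openai'
import { anthropic } from '@ai-sdk/anthropic'
import { google } from '@ai-sdk/google'
import { xai } from '@ai-sdk/xai'
import type { LanguageModelV2 } from '@ai-sdk/provider'
```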
### Usage
```bash
# Full benchmark
pnpm benchmark:accuracy
# Dry run (10 questions only, for testing setup)
DRY_RUN=true pnpm benchmark:accuracy
```
Running the script will:
1. Prompt you to select which models to test.
2. Skip models with existing results (rerun to overwrite).
3. Show progress with rate limiting.
4. Save results to `results/accuracy/models/{model-id}.json`.
5. Generate report at `results/retrieval-accuracy.md`.
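A rough sketch of that flow under stated assumptions: the result paths match the layout documented here, while `evaluateModel` is a hypothetical callback standing in for the real evaluation code.
```ts
// Illustrative orchestration only; the real logic lives in scripts/accuracy-benchmark.ts.
import { existsSync } from 'node:fs'
import { mkdir, writeFile } from 'node:fs/promises'

const DRY_RUN = process.env.DRY_RUN === 'true'
const DRY_RUN_LIMIT = 10 // mirrors the DRY_RUN_LIMITS default in src/constants.ts

export async function runBenchmark(
  modelIds: string[],
  questions: unknown[],
  // Hypothetical evaluator: queries one model and returns its per-question results.
  evaluateModel: (modelId: string, questions: unknown[]) => Promise<unknown>,
) {
  const selected = DRY_RUN ? questions.slice(0, DRY_RUN_LIMIT) : questions
  await mkdir('results/accuracy/models', { recursive: true })

  for (const modelId of modelIds) {
    const resultPath = `results/accuracy/models/${modelId}.json`
    if (existsSync(resultPath)) {
      console.log(`Skipping ${modelId}: results already exist`)
      continue
    }
    const results = await evaluateModel(modelId, selected)
    await writeFile(resultPath, JSON.stringify(results, null, 2))
  }
}
```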
### Configuration
Edit [`src/constants.ts`](./src/constants.ts) to adjust:
- `MODEL_RPM_LIMITS`: Rate limits (requests per minute) per model
- `DEFAULT_CONCURRENCY`: Parallel tasks (default: 10)
- `DRY_RUN_LIMITS`: Questions per dry run (default: 10)
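For orientation, those constants might look roughly like this; the shape and the rate-limit value shown are assumptions, and only the names and the two documented defaults come from this guide:
```ts
// Assumed shape of src/constants.ts; check the file itself for the real values.
export const MODEL_RPM_LIMITS: Record<string, number> = {
  'gpt-5-nano': 500, // requests per minute, placeholder value
}

export const DEFAULT_CONCURRENCY = 10 // parallel tasks
export const DRY_RUN_LIMITS = 10 // questions per dry run
```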
## Project Structure
```
scripts/
├── accuracy-benchmark.ts # Retrieval accuracy benchmark
├── token-efficiency-benchmark.ts # Token counting benchmark
└── fetch-github-repos.ts # Update GitHub dataset
src/
├── constants.ts # Configuration
├── datasets.ts # Test data generators
├── evaluate.ts # LLM evaluation
├── formatters.ts # Format converters
├── questions.ts # Question generation
├── report.ts # Markdown reports
├── storage.ts # Result caching
└── utils.ts # Helpers
data/
└── github-repos.json # Top 100 GitHub repos
results/
├── token-efficiency.md # Token savings report
├── retrieval-accuracy.md # Accuracy report
└── accuracy/models/ # Per-model results (JSON)
```