# TOON Benchmarks

Benchmarks measuring TOON's **token efficiency** and **retrieval accuracy** compared to JSON, XML, YAML, and CSV.

> [!NOTE]
> Results are automatically embedded in the [main README](../README.md#benchmarks). This guide focuses on running the benchmarks locally.

## Quick Start

```bash
# Run token efficiency benchmark
pnpm benchmark:token-efficiency

# Run retrieval accuracy benchmark (requires API keys)
pnpm benchmark:accuracy
```

## Token Efficiency Benchmark

Measures token count reduction across JSON, XML, YAML, CSV, and TOON:

1. Generate datasets (GitHub repos, analytics, orders)
2. Convert each dataset to all five formats
3. Tokenize using `gpt-tokenizer` (`o200k_base` encoding)
4. Calculate savings and generate report

```bash
pnpm benchmark:token-efficiency
```

Results are saved to `results/token-efficiency.md`.
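
Under the hood, the tokenize-and-compare steps amount to counting `o200k_base` tokens for each serialized payload and comparing against the JSON baseline. A minimal sketch (illustrative only, not the benchmark's actual code; it assumes `gpt-tokenizer`'s per-encoding entry point, and the TOON literal is hand-written):

```ts
// Count o200k_base tokens for the same record in two of the five
// formats and report TOON's savings relative to JSON.
import { encode } from 'gpt-tokenizer/encoding/o200k_base'

const asJson = JSON.stringify({ items: [{ id: 1, stars: 230_000 }] })
// Hand-written TOON equivalent (tabular array syntax), for illustration
const asToon = 'items[1]{id,stars}:\n  1,230000'

const jsonTokens = encode(asJson).length
const toonTokens = encode(asToon).length

// Savings of TOON relative to the JSON baseline, as a percentage
console.log(`TOON saves ${(100 * (1 - toonTokens / jsonTokens)).toFixed(1)}%`)
```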

## Retrieval Accuracy Benchmark

Tests how well LLMs can answer questions about data in different formats (TOON, JSON, XML, YAML, CSV):

1. Generate 154 questions across 4 datasets
2. Convert each dataset to all 5 formats
3. Query each LLM with formatted data + question
4. Validate answers using `gpt-5-nano` as judge
5. Aggregate metrics and generate report
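
Conceptually, steps 3–4 reduce to one call to the model under test per (format, question) pair, plus a judge call. A minimal sketch, assuming the Vercel AI SDK (`generateText` from `ai`) that the `models` array below builds on; the prompts and the `askAndJudge` helper are illustrative, not the benchmark's actual code:

```ts
// Evaluate one (model, format, question) cell: ask the model under
// test, then let gpt-5-nano judge the answer against the expected one.
import { generateText } from 'ai'
import { openai } from '@ai-sdk/openai'
import type { LanguageModelV2 } from '@ai-sdk/provider'

async function askAndJudge(
  model: LanguageModelV2,
  formattedData: string,
  question: string,
  expected: string,
): Promise<boolean> {
  const { text: answer } = await generateText({
    model,
    prompt: `Answer using only this data:\n\n${formattedData}\n\nQuestion: ${question}`,
  })

  // The judge compares the model's answer against the expected value
  const { text: verdict } = await generateText({
    model: openai('gpt-5-nano'),
    prompt: `Expected: ${expected}\nAnswer: ${answer}\n\nReply CORRECT or INCORRECT.`,
  })

  return verdict.trim().toUpperCase().startsWith('CORRECT')
}
```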

### Setup

1. Edit [`src/evaluate.ts`](./src/evaluate.ts) and add models to the `models` array:

   ```ts
   // Provider factories and the model type come from the AI SDK packages
   import { openai } from '@ai-sdk/openai'
   import { anthropic } from '@ai-sdk/anthropic'
   import { google } from '@ai-sdk/google'
   import { xai } from '@ai-sdk/xai'
   import type { LanguageModelV2 } from '@ai-sdk/provider'

   export const models: LanguageModelV2[] = [
     openai('gpt-5-nano'),
     anthropic('claude-haiku-4-5-20251001'),
     google('gemini-2.5-flash'),
     xai('grok-4-fast-non-reasoning'),
     // Add your models here
   ]
   ```

2. Duplicate `.env.example` to `.env` and add your API keys:

   ```bash
   cp .env.example .env
   ```

### Usage

```bash
# Full benchmark
pnpm benchmark:accuracy

# Dry run (10 questions only, for testing setup)
DRY_RUN=true pnpm benchmark:accuracy
```

Running the script will:

1. Prompt you to select which models to test.
2. Skip models with existing results (rerun to overwrite).
3. Show progress with rate limiting.
4. Save results to `results/accuracy/models/{model-id}.json`.
5. Generate report at `results/retrieval-accuracy.md`.
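
The per-model JSON files are what makes the skipping in step 2 work: a model with an existing file is not re-queried unless you overwrite it. As a rough idea of what one file might hold (a hypothetical shape, not the actual schema; field names are illustrative):

```ts
// Hypothetical shape of results/accuracy/models/{model-id}.json;
// see src/storage.ts for the real schema.
interface ModelResults {
  model: string // e.g. 'gpt-5-nano'
  answers: Array<{
    format: 'toon' | 'json' | 'xml' | 'yaml' | 'csv'
    question: string
    answer: string
    correct: boolean // verdict from the gpt-5-nano judge
  }>
}
```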

### Configuration

Edit [`src/constants.ts`](./src/constants.ts) to adjust (a rough sketch follows the list):

- `MODEL_RPM_LIMITS` – Rate limits per model (requests per minute)
- `DEFAULT_CONCURRENCY` – Parallel tasks (default: 10)
- `DRY_RUN_LIMITS` – Questions per dry run (default: 10)
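
For orientation, these might look something like the following (an illustrative sketch with made-up values, not the file's actual contents):

```ts
// Illustrative values only – check src/constants.ts for the real defaults.
export const MODEL_RPM_LIMITS: Record<string, number> = {
  'gpt-5-nano': 500, // requests per minute
}

export const DEFAULT_CONCURRENCY = 10 // parallel tasks
export const DRY_RUN_LIMITS = 10 // questions per dry run
```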

## Project Structure

```
scripts/
├── accuracy-benchmark.ts          # Retrieval accuracy benchmark
├── token-efficiency-benchmark.ts  # Token counting benchmark
└── fetch-github-repos.ts          # Update GitHub dataset
src/
├── constants.ts                   # Configuration
├── datasets.ts                    # Test data generators
├── evaluate.ts                    # LLM evaluation
├── formatters.ts                  # Format converters
├── questions.ts                   # Question generation
├── report.ts                      # Markdown reports
├── storage.ts                     # Result caching
└── utils.ts                       # Helpers
data/
└── github-repos.json              # Top 100 GitHub repos
results/
├── token-efficiency.md            # Token savings report
├── retrieval-accuracy.md          # Accuracy report
└── accuracy/models/               # Per-model results (JSON)
```