chore: split token efficiency benchmark into mixed/flat tracks

Johann Schopplich
2025-11-06 22:17:18 +01:00
parent e22884308b
commit 54433de930
13 changed files with 567 additions and 1830 deletions


@@ -9,7 +9,7 @@ Benchmarks measuring TOON's **token efficiency** and **retrieval accuracy** comp
```bash
# Run token efficiency benchmark
-pnpm benchmark:token-efficiency
+pnpm benchmark:tokens
# Run retrieval accuracy benchmark (requires API keys)
pnpm benchmark:accuracy
@@ -25,7 +25,7 @@ Measures token count reduction across JSON, XML, YAML, CSV, and TOON:
4. Calculate savings and generate report
```bash
-pnpm benchmark:token-efficiency
+pnpm benchmark:tokens
```
Results are saved to `results/token-efficiency.md`.


@@ -3,7 +3,7 @@
"type": "module",
"private": true,
"scripts": {
-    "benchmark:token-efficiency": "tsx scripts/token-efficiency-benchmark.ts",
+    "benchmark:tokens": "tsx scripts/token-efficiency-benchmark.ts",
"benchmark:accuracy": "tsx --env-file=.env scripts/accuracy-benchmark.ts",
"fetch:github-repos": "tsx scripts/fetch-github-repos.ts"
},

File diff suppressed because it is too large Load Diff

File diff suppressed because one or more lines are too long



@@ -1,108 +1,153 @@
-Benchmarks test LLM comprehension across different input formats using 154 data retrieval questions on 4 models.
+Benchmarks test LLM comprehension across different input formats using 201 data retrieval questions on 4 models.
<details>
<summary><strong>View Dataset Catalog</strong></summary>
#### Dataset Catalog
| Dataset | Rows | Structure | CSV Support | Eligibility |
| ------- | ---- | --------- | ----------- | ----------- |
| Uniform employee records | 100 | uniform | ✓ | 100% |
| E-commerce orders with nested structures | 50 | nested | ✗ | 33% |
| Time-series analytics data | 60 | uniform | ✓ | 100% |
| Top 100 GitHub repositories | 100 | uniform | ✓ | 100% |
| Semi-uniform event logs | 75 | semi-uniform | ✗ | 50% |
| Deeply nested configuration | 11 | deep | ✗ | 0% |
**Structure classes:**
- **uniform**: All objects have identical fields with primitive values
- **semi-uniform**: Mix of uniform and non-uniform structures
- **nested**: Objects with nested structures (nested objects or arrays)
- **deep**: Highly nested with minimal tabular eligibility
**CSV Support:** ✓ (supported), ✗ (not supported; would require lossy flattening)
**Eligibility:** Percentage of arrays that qualify for TOON's tabular format (uniform objects with primitive values)
</details>
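The eligibility rule above can be sketched as a small predicate (a hypothetical helper for illustration, not the benchmark's actual implementation):

```typescript
type Primitive = string | number | boolean | null

// An array qualifies for TOON's tabular format when every element is a plain
// object, all objects share the same keys, and every value is a primitive.
function isTabularEligible(rows: unknown[]): boolean {
  if (rows.length === 0)
    return false
  const isPlainObject = (value: unknown): value is Record<string, unknown> =>
    typeof value === 'object' && value !== null && !Array.isArray(value)
  if (!rows.every(value => isPlainObject(value)))
    return false
  const isPrimitive = (value: unknown): value is Primitive =>
    value === null || ['string', 'number', 'boolean'].includes(typeof value)
  const objects = rows as Record<string, unknown>[]
  const referenceKeys = Object.keys(objects[0]).sort().join('\u0000')
  return objects.every(row =>
    Object.keys(row).sort().join('\u0000') === referenceKeys
    && Object.values(row).every(isPrimitive),
  )
}
```

Under this rule the uniform employee records score 100% eligibility, while the deeply nested configuration scores 0%.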
#### Efficiency Ranking (Accuracy per 1K Tokens)
Each format's overall performance, balancing accuracy against token cost:
```
-toon         ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓ 15.0 │ 70.1% acc │ 4,678 tokens
-csv          ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓  14.3 │ 67.7% acc │ 4,745 tokens
-json-compact ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓░░░░░ 11.0 │ 65.3% acc │ 5,925 tokens
-yaml         ▓▓▓▓▓▓▓▓▓▓▓▓▓░░░░░░   9.4 │ 66.7% acc │ 7,091 tokens
-json-pretty  ▓▓▓▓▓▓▓▓▓▓░░░░░░░░░░  7.5 │ 65.4% acc │ 8,713 tokens
-xml          ▓▓▓▓▓▓▓▓▓░░░░░░░░░░   6.8 │ 67.2% acc │ 9,944 tokens
+TOON         ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓ 15.6 │ 68.7% acc │ 4,389 tokens
+CSV          ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓  15.3 │ 62.3% acc │ 4,080 tokens
+JSON compact ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓░░░ 13.5 │ 67.2% acc │ 4,982 tokens
+YAML         ▓▓▓▓▓▓▓▓▓▓▓▓▓░░░░░░  11.2 │ 66.7% acc │ 5,976 tokens
+JSON         ▓▓▓▓▓▓▓▓▓▓▓▓░░░░░░░░  9.0 │ 65.7% acc │ 7,260 tokens
+XML          ▓▓▓▓▓▓▓▓▓░░░░░░░░░░   8.1 │ 66.8% acc │ 8,251 tokens
```
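The score in this ranking is simply accuracy (in percent) per 1,000 input tokens. A small reimplementation of that metric for illustration (the report likely divides unrounded accuracy, which is why 68.7% at 4,389 tokens prints as 15.6 rather than 15.7):

```typescript
// Efficiency = accuracy percentage per 1,000 tokens of formatted input.
function efficiencyScore(accuracyPercent: number, tokens: number): number {
  return accuracyPercent / (tokens / 1000)
}

// TOON from the chart above: 68.7% accuracy at 4,389 tokens ≈ 15.65
const toonEfficiency = efficiencyScore(68.7, 4389)
```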
-TOON achieves **70.1%** accuracy (vs JSON's 65.4%) while using **46.3% fewer tokens**.
+TOON achieves **68.7%** accuracy (vs JSON's 65.7%) while using **39.5% fewer tokens**.
#### Per-Model Accuracy
-Accuracy across **4 LLMs** on 154 data retrieval questions:
+Accuracy across 4 LLMs on 201 data retrieval questions:
```
gpt-5-nano
-→ TOON         ███████████████████░ 96.1% (148/154)
-  CSV          ██████████████████░░ 91.6% (141/154)
-  YAML         ██████████████████░░ 91.6% (141/154)
-  JSON compact ██████████████████░░ 91.6% (141/154)
-  XML          ████████████████░░░  87.0% (134/154)
-  JSON         ████████████████░░░  86.4% (133/154)
+→ TOON         ██████████████████░░ 88.6% (178/201)
+  JSON compact ██████████████████░░ 88.1% (177/201)
+  CSV          ██████████████████░░ 88.0% (88/100)
+  YAML         █████████████████░░  84.6% (170/201)
+  XML          ████████████████░░░  81.6% (164/201)
+  JSON         ████████████████░░░  80.1% (161/201)
claude-haiku-4-5-20251001
-  JSON         ██████████░░░░░░░░░░ 50.0% (77/154)
-  YAML         ██████████░░░░░░░░░░ 49.4% (76/154)
-→ TOON         ██████████░░░░░░░░░░ 48.7% (75/154)
-  XML          ██████████░░░░░░░░░░ 48.1% (74/154)
-  CSV          █████████░░░░░░░░░░  47.4% (73/154)
-  JSON compact █████████░░░░░░░░░░░ 44.2% (68/154)
+  YAML         ██████████░░░░░░░░░░ 52.2% (105/201)
+→ TOON         ██████████░░░░░░░░░░ 50.7% (102/201)
+  JSON         ██████████░░░░░░░░░░ 50.2% (101/201)
+  JSON compact ██████████░░░░░░░░░░ 49.8% (100/201)
+  XML          █████████░░░░░░░░░░  49.3% (99/201)
+  CSV          ████████░░░░░░░░░░░  39.0% (39/100)
gemini-2.5-flash
-  CSV          █████████████████░░  87.7% (135/154)
-  XML          █████████████████░░  87.7% (135/154)
-→ TOON         ████████████████░░░  86.4% (133/154)
-  YAML         ████████████████░░░░ 79.9% (123/154)
-  JSON compact ████████████████░░░░ 79.9% (123/154)
-  JSON         ███████████████░░░░  76.6% (118/154)
+  XML          █████████████████░░  86.1% (173/201)
+→ TOON         █████████████████░░  84.1% (169/201)
+  CSV          ████████████████░░░  82.0% (82/100)
+  JSON compact ████████████████░░░░ 81.1% (163/201)
+  YAML         ████████████████░░░░ 81.1% (163/201)
+  JSON         ███████████████░░░░  81.1% (163/201)
grok-4-fast-non-reasoning
-→ TOON         ██████████░░░░░░░░░░ 49.4% (76/154)
-  JSON         ██████████░░░░░░░░░░ 48.7% (75/154)
-  XML          █████████░░░░░░░░░░  46.1% (71/154)
-  YAML         █████████░░░░░░░░░░  46.1% (71/154)
-  JSON compact █████████░░░░░░░░░░  45.5% (70/154)
-  CSV          ████████░░░░░░░░░░░  44.2% (68/154)
+→ TOON         ██████████░░░░░░░░░░ 51.2% (103/201)
+  JSON         ██████████░░░░░░░░░░ 51.2% (103/201)
+  XML          █████████░░░░░░░░░░  50.2% (101/201)
+  JSON compact ██████████░░░░░░░░░░ 49.8% (100/201)
+  YAML         ██████████░░░░░░░░░░ 48.8% (98/201)
+  CSV          ████████░░░░░░░░░░░  40.0% (40/100)
```
-**Key tradeoff:** TOON achieves **70.1% accuracy** (vs JSON's 65.4%) while using **46.3% fewer tokens** on these datasets.
+**Key tradeoff:** TOON achieves **68.7% accuracy** (vs JSON's 65.7%) while using **39.5% fewer tokens** on these datasets.
<details>
<summary><strong>Performance by dataset and model</strong></summary>
#### Performance by Dataset
-##### Uniform employee records (TOON optimal format)
+##### Uniform employee records
| Format | Accuracy | Tokens | Correct/Total |
| ------ | -------- | ------ | ------------- |
-| `csv` | 65.5% | 2,337 | 131/200 |
-| `toon` | 67.5% | 2,483 | 135/200 |
-| `json-compact` | 65.5% | 3,943 | 131/200 |
-| `yaml` | 68.5% | 4,969 | 137/200 |
-| `xml` | 69.5% | 7,314 | 139/200 |
-| `json-pretty` | 64.5% | 6,347 | 129/200 |
+| `toon` | 65.6% | 2,483 | 105/160 |
+| `csv` | 62.5% | 2,337 | 100/160 |
+| `json-compact` | 66.3% | 3,943 | 106/160 |
+| `yaml` | 63.7% | 4,969 | 102/160 |
+| `xml` | 67.5% | 7,314 | 108/160 |
+| `json-pretty` | 62.5% | 6,347 | 100/160 |
##### E-commerce orders with nested structures
| Format | Accuracy | Tokens | Correct/Total |
| ------ | -------- | ------ | ------------- |
-| `toon` | 78.8% | 5,967 | 126/160 |
-| `csv` | 76.3% | 6,735 | 122/160 |
-| `json-compact` | 70.6% | 5,962 | 113/160 |
-| `yaml` | 72.5% | 7,328 | 116/160 |
-| `json-pretty` | 76.9% | 9,694 | 123/160 |
-| `xml` | 73.1% | 10,992 | 117/160 |
+| `toon` | 75.6% | 7,197 | 121/160 |
+| `json-compact` | 70.6% | 6,784 | 113/160 |
+| `yaml` | 71.9% | 8,334 | 115/160 |
+| `json-pretty` | 68.8% | 10,700 | 110/160 |
+| `xml` | 71.9% | 12,013 | 115/160 |
##### Time-series analytics data
| Format | Accuracy | Tokens | Correct/Total |
| ------ | -------- | ------ | ------------- |
-| `toon` | 68.4% | 1,515 | 93/136 |
-| `csv` | 65.4% | 1,393 | 89/136 |
-| `json-compact` | 64.7% | 2,341 | 88/136 |
-| `yaml` | 66.2% | 2,938 | 90/136 |
-| `json-pretty` | 64.7% | 3,665 | 88/136 |
-| `xml` | 66.9% | 4,376 | 91/136 |
+| `csv` | 63.8% | 1,391 | 74/116 |
+| `toon` | 66.4% | 1,513 | 77/116 |
+| `json-compact` | 61.2% | 2,339 | 71/116 |
+| `yaml` | 65.5% | 2,936 | 76/116 |
+| `json-pretty` | 64.7% | 3,663 | 75/116 |
+| `xml` | 65.5% | 4,374 | 76/116 |
##### Top 100 GitHub repositories
| Format | Accuracy | Tokens | Correct/Total |
| ------ | -------- | ------ | ------------- |
-| `toon` | 65.0% | 8,745 | 78/120 |
-| `csv` | 62.5% | 8,513 | 75/120 |
-| `json-compact` | 58.3% | 11,455 | 70/120 |
-| `yaml` | 56.7% | 13,129 | 68/120 |
-| `xml` | 55.8% | 17,095 | 67/120 |
-| `json-pretty` | 52.5% | 15,145 | 63/120 |
+| `toon` | 63.7% | 8,745 | 79/124 |
+| `csv` | 60.5% | 8,513 | 75/124 |
+| `json-compact` | 56.5% | 11,455 | 70/124 |
+| `yaml` | 53.2% | 13,129 | 66/124 |
+| `json-pretty` | 53.2% | 15,145 | 66/124 |
+| `xml` | 53.2% | 17,095 | 66/124 |
##### Semi-uniform event logs
| Format | Accuracy | Tokens | Correct/Total |
| ------ | -------- | ------ | ------------- |
| `json-compact` | 55.0% | 4,809 | 66/120 |
| `yaml` | 52.5% | 5,814 | 63/120 |
| `json-pretty` | 52.5% | 6,784 | 63/120 |
| `toon` | 45.8% | 5,764 | 55/120 |
| `xml` | 50.8% | 7,699 | 61/120 |
##### Deeply nested configuration
| Format | Accuracy | Tokens | Correct/Total |
| ------ | -------- | ------ | ------------- |
| `json-compact` | 91.9% | 564 | 114/124 |
| `toon` | 92.7% | 631 | 115/124 |
| `yaml` | 91.9% | 673 | 114/124 |
| `json-pretty` | 91.9% | 919 | 114/124 |
| `xml` | 89.5% | 1,008 | 111/124 |
#### Performance by Model
@@ -110,45 +155,45 @@ grok-4-fast-non-reasoning
| Format | Accuracy | Correct/Total |
| ------ | -------- | ------------- |
-| `toon` | 96.1% | 148/154 |
-| `csv` | 91.6% | 141/154 |
-| `yaml` | 91.6% | 141/154 |
-| `json-compact` | 91.6% | 141/154 |
-| `xml` | 87.0% | 134/154 |
-| `json-pretty` | 86.4% | 133/154 |
+| `toon` | 88.6% | 178/201 |
+| `json-compact` | 88.1% | 177/201 |
+| `csv` | 88.0% | 88/100 |
+| `yaml` | 84.6% | 170/201 |
+| `xml` | 81.6% | 164/201 |
+| `json-pretty` | 80.1% | 161/201 |
##### claude-haiku-4-5-20251001
| Format | Accuracy | Correct/Total |
| ------ | -------- | ------------- |
-| `json-pretty` | 50.0% | 77/154 |
-| `yaml` | 49.4% | 76/154 |
-| `toon` | 48.7% | 75/154 |
-| `xml` | 48.1% | 74/154 |
-| `csv` | 47.4% | 73/154 |
-| `json-compact` | 44.2% | 68/154 |
+| `yaml` | 52.2% | 105/201 |
+| `toon` | 50.7% | 102/201 |
+| `json-pretty` | 50.2% | 101/201 |
+| `json-compact` | 49.8% | 100/201 |
+| `xml` | 49.3% | 99/201 |
+| `csv` | 39.0% | 39/100 |
##### gemini-2.5-flash
| Format | Accuracy | Correct/Total |
| ------ | -------- | ------------- |
-| `csv` | 87.7% | 135/154 |
-| `xml` | 87.7% | 135/154 |
-| `toon` | 86.4% | 133/154 |
-| `yaml` | 79.9% | 123/154 |
-| `json-compact` | 79.9% | 123/154 |
-| `json-pretty` | 76.6% | 118/154 |
+| `xml` | 86.1% | 173/201 |
+| `toon` | 84.1% | 169/201 |
+| `csv` | 82.0% | 82/100 |
+| `json-compact` | 81.1% | 163/201 |
+| `yaml` | 81.1% | 163/201 |
+| `json-pretty` | 81.1% | 163/201 |
##### grok-4-fast-non-reasoning
| Format | Accuracy | Correct/Total |
| ------ | -------- | ------------- |
-| `toon` | 49.4% | 76/154 |
-| `json-pretty` | 48.7% | 75/154 |
-| `xml` | 46.1% | 71/154 |
-| `yaml` | 46.1% | 71/154 |
-| `json-compact` | 45.5% | 70/154 |
-| `csv` | 44.2% | 68/154 |
+| `toon` | 51.2% | 103/201 |
+| `json-pretty` | 51.2% | 103/201 |
+| `xml` | 50.2% | 101/201 |
+| `json-compact` | 49.8% | 100/201 |
+| `yaml` | 48.8% | 98/201 |
+| `csv` | 40.0% | 40/100 |
</details>
@@ -161,34 +206,36 @@ This benchmark tests **LLM comprehension and data retrieval accuracy** across di
#### Datasets Tested
-Four datasets designed to test different structural patterns (all contain arrays of uniform objects, TOON's optimal format):
+Six datasets designed to test different structural patterns:
1. **Tabular** (100 employee records): Uniform objects with identical fields, optimal for TOON's tabular format.
2. **Nested** (50 e-commerce orders): Complex structures with nested customer objects and item arrays.
3. **Analytics** (60 days of metrics): Time-series data with dates and numeric values.
4. **GitHub** (100 repositories): Real-world data from top GitHub repos by stars.
5. **Event Logs** (75 logs): Semi-uniform data with ~50% flat logs and ~50% with nested error objects.
6. **Nested Config** (1 configuration): Deeply nested configuration with minimal tabular eligibility.
#### Question Types
-154 questions are generated dynamically across three categories:
+201 questions are generated dynamically across three categories:
-- **Field retrieval (40%)**: Direct value lookups or values that can be read straight off a record (including booleans and simple counts such as array lengths)
+- **Field retrieval (36%)**: Direct value lookups or values that can be read straight off a record (including booleans and simple counts such as array lengths)
- Example: "What is Alice's salary?" → `75000`
- Example: "How many items are in order ORD-0042?" → `3`
- Example: "What is the customer name for order ORD-0042?" → `John Doe`
-- **Aggregation (32%)**: Dataset-level totals and averages plus single-condition filters (counts, sums, min/max comparisons)
+- **Aggregation (37%)**: Dataset-level totals and averages plus single-condition filters (counts, sums, min/max comparisons)
- Example: "How many employees work in Engineering?" → `17`
- Example: "What is the total revenue across all orders?" → `45123.50`
- Example: "How many employees have salary > 80000?" → `23`
-- **Filtering (28%)**: Multi-condition queries requiring compound logic (AND constraints across fields)
+- **Filtering (27%)**: Multi-condition queries requiring compound logic (AND constraints across fields)
- Example: "How many employees in Sales have salary > 80000?" → `5`
- Example: "How many active employees have more than 10 years of experience?" → `8`
#### Evaluation Process
-1. **Format conversion**: Each dataset is converted to all 6 formats (TOON, CSV, XML, YAML, JSON, JSON compact).
+1. **Format conversion**: Each dataset is converted to all 6 formats (TOON, JSON compact, XML, YAML, JSON, CSV).
2. **Query LLM**: Each model receives formatted data + question in a prompt and extracts the answer.
3. **Validate with LLM-as-judge**: `gpt-5-nano` validates if the answer is semantically correct (e.g., `50000` = `$50,000`, `Engineering` = `engineering`, `2025-01-01` = `January 1, 2025`).
@@ -197,6 +244,6 @@ Four datasets designed to test different structural patterns (all contain arrays
- **Models tested**: `gpt-5-nano`, `claude-haiku-4-5-20251001`, `gemini-2.5-flash`, `grok-4-fast-non-reasoning`
- **Token counting**: Using `gpt-tokenizer` with `o200k_base` encoding (GPT-5 tokenizer)
- **Temperature**: Not set (models use their defaults)
-- **Total evaluations**: 154 questions × 6 formats × 4 models = 3,696 LLM calls
+- **Total evaluations**: 201 questions × 6 formats × 4 models = 4,824 LLM calls
</details>


@@ -1,79 +1,81 @@
-## Mixed-Structure Track
+#### Mixed-Structure Track
Datasets with nested or semi-uniform structures. CSV excluded as it cannot properly represent these structures.
```
-🛒 E-commerce orders with nested structures [eligibility: 33%]
-toon ▓▓▓▓▓▓▓▓▓▓▓▓░░░░░░░░  58,528 tokens
-  vs JSON (37.9%)           94,207
-  vs JSON compact (+0.9%)   57,979
-  vs YAML (17.8%)           71,223
-  vs XML (45.2%)           106,720
+🛒 E-commerce orders with nested structures ┊ Tabular: 33%
+  TOON █████████████░░░░░░░  72,743 tokens
+  ├─ vs JSON         (33.1%)  108,731 tokens
+  ├─ vs JSON compact (+5.5%)   68,936 tokens
+  ├─ vs YAML         (14.1%)   84,724 tokens
+  └─ vs XML          (40.5%)  122,313 tokens
-🧾 Semi-uniform event logs [eligibility: 50%]
-toon ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓░░░ 154,419 tokens
-  vs JSON (15.0%)          181,592
-  vs JSON compact (+19.9%) 128,836
-  vs YAML (0.9%)           155,749
-  vs XML (25.1%)           206,271
+🧾 Semi-uniform event logs ┊ Tabular: 50%
+  TOON █████████████████░░░ 153,223 tokens
+  ├─ vs JSON         (15.0%)  180,196 tokens
+  ├─ vs JSON compact (+19.9%) 127,740 tokens
+  ├─ vs YAML         (0.8%)   154,514 tokens
+  └─ vs XML          (25.2%)  204,800 tokens
-🧩 Deeply nested configuration [eligibility: 0%]
-toon ▓▓▓▓▓▓▓▓▓▓▓▓▓▓░░░░░░     630 tokens
-  vs JSON (31.4%)              918
-  vs JSON compact (+11.9%)     563
-  vs YAML (6.4%)               673
-  vs XML (37.4%)             1,007
+🧩 Deeply nested configuration ┊ Tabular: 0%
+  TOON ██████████████░░░░░░     631 tokens
+  ├─ vs JSON         (31.3%)     919 tokens
+  ├─ vs JSON compact (+11.9%)    564 tokens
+  ├─ vs YAML         (6.2%)      673 tokens
+  └─ vs XML          (37.4%)   1,008 tokens
-─────────────────────────────────────────────────────────────────────────────────
-Total
-toon ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓░░░░░ 213,577 tokens
-  vs JSON (22.8%)          276,717
-  vs JSON compact (+14.0%) 187,378
-  vs YAML (6.2%)           227,645
-  vs XML (32.0%)           313,998
+──────────────────────────────────── Total ────────────────────────────────────
+  TOON ████████████████░░░░ 226,597 tokens
+  ├─ vs JSON         (21.8%)  289,846 tokens
+  ├─ vs JSON compact (+14.9%) 197,240 tokens
+  ├─ vs YAML         (5.5%)   239,911 tokens
+  └─ vs XML          (30.9%)  328,121 tokens
```
-## Flat-Only Track
+#### Flat-Only Track
Datasets with flat tabular structures where CSV is applicable.
```
-👥 Uniform employee records (TOON optimal format) [eligibility: 100%]
-csv  ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓░  46,968 tokens
-toon ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓  49,841 tokens (+5.8% vs CSV)
-  vs JSON (60.7%)          126,886
-  vs JSON compact (36.8%)   78,882
-  vs YAML (50.0%)           99,743
-  vs XML (66.0%)           146,465
+👥 Uniform employee records ┊ Tabular: 100%
+  CSV  ███████████████████░  46,956 tokens
+  TOON ████████████████████  49,827 tokens (+6.1% vs CSV)
+  ├─ vs JSON         (60.7%)  126,854 tokens
+  ├─ vs JSON compact (36.8%)   78,850 tokens
+  ├─ vs YAML         (50.0%)   99,701 tokens
+  └─ vs XML          (66.0%)  146,440 tokens
-📈 Time-series analytics data [eligibility: 100%]
-csv  ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓░░   8,382 tokens
-toon ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓   9,114 tokens (+8.0% vs CSV)
-  vs JSON (59.0%)           22,244
-  vs JSON compact (35.9%)   14,210
-  vs YAML (49.0%)           17,857
-  vs XML (65.8%)            26,615
+📈 Time-series analytics data ┊ Tabular: 100%
+  CSV  ██████████████████░░   8,396 tokens
+  TOON ████████████████████   9,128 tokens (+8.7% vs CSV)
+  ├─ vs JSON         (59.0%)   22,258 tokens
+  ├─ vs JSON compact (35.8%)   14,224 tokens
+  ├─ vs YAML         (48.9%)   17,871 tokens
+  └─ vs XML          (65.7%)   26,629 tokens
-⭐ Top 100 GitHub repositories [eligibility: 100%]
-csv  ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓░   8,513 tokens
-toon ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓   8,745 tokens (+2.7% vs CSV)
-  vs JSON (42.3%)           15,145
-  vs JSON compact (23.7%)   11,455
-  vs YAML (33.4%)           13,129
-  vs XML (48.8%)            17,095
+⭐ Top 100 GitHub repositories ┊ Tabular: 100%
+  CSV  ███████████████████░   8,513 tokens
+  TOON ████████████████████   8,745 tokens (+2.7% vs CSV)
+  ├─ vs JSON         (42.3%)   15,145 tokens
+  ├─ vs JSON compact (23.7%)   11,455 tokens
+  ├─ vs YAML         (33.4%)   13,129 tokens
+  └─ vs XML          (48.8%)   17,095 tokens
-─────────────────────────────────────────────────────────────────────────────────
-Total
-csv  ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓░  63,863 tokens
-toon ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓  67,700 tokens (+5.7% vs CSV)
-  vs JSON (58.8%)          164,275
-  vs JSON compact (35.2%)  104,547
-  vs YAML (48.2%)          130,729
-  vs XML (64.4%)           190,175
+──────────────────────────────────── Total ────────────────────────────────────
+  CSV  ███████████████████░  63,865 tokens
+  TOON ████████████████████  67,700 tokens (+6.0% vs CSV)
+  ├─ vs JSON         (58.8%)  164,257 tokens
+  ├─ vs JSON compact (35.2%)  104,529 tokens
+  ├─ vs YAML         (48.2%)  130,701 tokens
+  └─ vs XML          (64.4%)  190,164 tokens
```
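The percentages in both tracks are token savings relative to TOON (positive means TOON uses fewer tokens; a leading `+` marks TOON overhead). A sketch of that arithmetic, checked against the flat-only totals above:

```typescript
// Token savings of TOON relative to another format, in percent.
// A negative result means TOON uses more tokens than the other format.
function savingsPercent(toonTokens: number, otherTokens: number): number {
  return ((otherTokens - toonTokens) / otherTokens) * 100
}

// Flat-only totals: TOON 67,700 vs JSON 164,257 → ≈58.8% savings.
// vs CSV 63,865 the result is ≈ -6.0%, printed in the chart as "+6.0% vs CSV".
const vsJson = savingsPercent(67700, 164257)
const vsCsv = savingsPercent(67700, 63865)
```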
<details>
<summary><strong>View detailed examples</strong></summary>
@@ -81,64 +83,64 @@ toon ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓
**Savings:** 13,130 tokens (59.0% reduction vs JSON)
-**JSON** (22,244 tokens):
+**JSON** (22,258 tokens):
```json
{
"metrics": [
{
"date": "2025-01-01",
-      "views": 4324,
-      "clicks": 146,
-      "conversions": 21,
-      "revenue": 3834.57,
-      "bounceRate": 0.4
+      "views": 7708,
+      "clicks": 595,
+      "conversions": 69,
+      "revenue": 15369.93,
+      "bounceRate": 0.35
},
{
"date": "2025-01-02",
-      "views": 6248,
-      "clicks": 407,
-      "conversions": 22,
-      "revenue": 2936.12,
-      "bounceRate": 0.62
+      "views": 5894,
+      "clicks": 381,
+      "conversions": 21,
+      "revenue": 2112.12,
+      "bounceRate": 0.3
},
{
"date": "2025-01-03",
-      "views": 7382,
-      "clicks": 270,
-      "conversions": 24,
-      "revenue": 6825.19,
-      "bounceRate": 0.7
+      "views": 6835,
+      "clicks": 422,
+      "conversions": 35,
+      "revenue": 4525.73,
+      "bounceRate": 0.5
},
{
"date": "2025-01-04",
-      "views": 4586,
-      "clicks": 267,
-      "conversions": 24,
-      "revenue": 2391.11,
-      "bounceRate": 0.64
+      "views": 5325,
+      "clicks": 305,
+      "conversions": 22,
+      "revenue": 2445.3,
+      "bounceRate": 0.44
},
{
"date": "2025-01-05",
-      "views": 6171,
-      "clicks": 227,
-      "conversions": 12,
-      "revenue": 3430.1,
-      "bounceRate": 0.39
+      "views": 2974,
+      "clicks": 61,
+      "conversions": 6,
+      "revenue": 956.57,
+      "bounceRate": 0.47
}
]
}
```
-**TOON** (9,114 tokens):
+**TOON** (9,128 tokens):
```
metrics[5]{date,views,clicks,conversions,revenue,bounceRate}:
-2025-01-01,4324,146,21,3834.57,0.4
-2025-01-02,6248,407,22,2936.12,0.62
-2025-01-03,7382,270,24,6825.19,0.7
-2025-01-04,4586,267,24,2391.11,0.64
-2025-01-05,6171,227,12,3430.1,0.39
+2025-01-01,7708,595,69,15369.93,0.35
+2025-01-02,5894,381,21,2112.12,0.3
+2025-01-03,6835,422,35,4525.73,0.5
+2025-01-04,5325,305,22,2445.3,0.44
+2025-01-05,2974,61,6,956.57,0.47
```
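For fully uniform arrays like this, the tabular encoding can be approximated in a few lines. This is a simplified sketch; the real encoder also handles quoting, delimiters, nesting, and non-uniform data:

```typescript
// Encode { key: rows } as a TOON tabular block: a `key[N]{fields}:` header
// followed by one comma-separated row of values per object.
function encodeTabular(key: string, rows: Record<string, string | number>[]): string {
  const fields = Object.keys(rows[0])
  const header = `${key}[${rows.length}]{${fields.join(',')}}:`
  const lines = rows.map(row => fields.map(field => String(row[field])).join(','))
  return [header, ...lines].join('\n')
}
```

Applied to the metrics above, this produces the same shape as the TOON block shown.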
---


@@ -32,13 +32,9 @@ const DATASET_ICONS: Record<string, string> = {
const COMPARISON_FORMAT_ORDER = ['json-pretty', 'json-compact', 'yaml', 'xml'] as const
-const PROGRESS_BAR_CONFIG = { filled: '▓', empty: '░' } as const
const PROGRESS_BAR_WIDTH = 20
const TOKEN_PADDING = 7
-const LABEL_PADDING = 60
-const COMPARISON_LABEL_PADDING = 30
-const SEPARATOR = '─────────────────────────────────────────────────────────────────────────────────'
const DEFAULT_DATASET_ICON = '📊'
const DETAILED_EXAMPLE_DATASETS = ['github', 'analytics'] as const
@@ -51,14 +47,14 @@ prompts.intro('Token Efficiency Benchmark')
/**
* Format a comparison line showing savings vs TOON
*/
-function formatComparisonLine(format: FormatMetrics): string {
+function formatComparisonLine(format: FormatMetrics, isLast: boolean = false): string {
const label = FORMATTER_DISPLAY_NAMES[format.name] || format.name.toUpperCase()
const signedPercent = format.savingsPercent >= 0
? `${format.savingsPercent.toFixed(1)}%`
: `+${Math.abs(format.savingsPercent).toFixed(1)}%`
-  const labelWithSavings = `vs ${label} (${signedPercent})`.padEnd(COMPARISON_LABEL_PADDING)
+  const connector = isLast ? '└─' : '├─'
const tokenStr = format.tokens.toLocaleString('en-US').padStart(TOKEN_PADDING)
-  return `  ${labelWithSavings}${tokenStr}`
+  return `${connector} vs ${label.padEnd(13)} ${`(${signedPercent})`.padEnd(20)} ${tokenStr} tokens`
}
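The bars in these charts come from a `createProgressBar` helper whose source is not part of this diff; a minimal sketch consistent with the call sites shown (value, max, width, and optional fill characters are assumptions from those calls):

```typescript
// Render a fixed-width bar: the value/max fraction of the width is filled,
// the remainder is padded with the empty character.
function createProgressBar(
  value: number,
  max: number,
  width: number,
  chars: { filled: string, empty: string } = { filled: '█', empty: '░' },
): string {
  const filledCount = Math.max(0, Math.min(width, Math.round((value / max) * width)))
  return chars.filled.repeat(filledCount) + chars.empty.repeat(width - filledCount)
}
```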
/**
@@ -91,36 +87,39 @@ function generateTotalLines(
totals: { name: string, tokens: number, savingsPercent: number }[],
baselineFormat?: { name: string, tokens: number },
) {
-  const lines: string[] = ['Total ']
+  const separatorHalf = '─'.repeat(36)
+  const lines: string[] = [`${separatorHalf} Total ${separatorHalf}`]
if (baselineFormat) {
// Flat-only track with CSV baseline
const csvPercentage = Math.min(100, (baselineFormat.tokens / totalToonTokens) * 100)
-    const csvBar = createProgressBar(csvPercentage, 100, PROGRESS_BAR_WIDTH, PROGRESS_BAR_CONFIG)
+    const csvBar = createProgressBar(csvPercentage, 100, PROGRESS_BAR_WIDTH)
const csvStr = baselineFormat.tokens.toLocaleString('en-US').padStart(TOKEN_PADDING)
-    lines.push(`csv  ${csvBar} ${csvStr} tokens`)
+    lines.push(`  CSV  ${csvBar} ${csvStr} tokens`)
const overheadPercent = ((totalToonTokens - baselineFormat.tokens) / baselineFormat.tokens) * 100
-    const toonBar = createProgressBar(100, 100, PROGRESS_BAR_WIDTH, PROGRESS_BAR_CONFIG)
+    const toonBar = createProgressBar(100, 100, PROGRESS_BAR_WIDTH)
const toonStr = totalToonTokens.toLocaleString('en-US').padStart(TOKEN_PADDING)
-    lines.push(`toon ${toonBar} ${toonStr} tokens (+${overheadPercent.toFixed(1)}% vs CSV)`)
+    lines.push(`  TOON ${toonBar} ${toonStr} tokens (+${overheadPercent.toFixed(1)}% vs CSV)`)
}
else {
// Mixed-structure track
const totalPercentage = Math.min(100, (totalToonTokens / totals[0]!.tokens) * 100)
-    const totalBar = createProgressBar(totalPercentage, 100, PROGRESS_BAR_WIDTH, PROGRESS_BAR_CONFIG)
+    const totalBar = createProgressBar(totalPercentage, 100, PROGRESS_BAR_WIDTH)
const toonStr = totalToonTokens.toLocaleString('en-US').padStart(TOKEN_PADDING)
-    lines.push(`toon ${totalBar} ${toonStr} tokens`)
+    lines.push(`  TOON ${totalBar} ${toonStr} tokens`)
}
// Add comparison lines
-  for (const format of totals) {
-    lines.push(formatComparisonLine({
+  for (let i = 0; i < totals.length; i++) {
+    const format = totals[i]!
+    const isLast = i === totals.length - 1
+    lines.push(`  ${formatComparisonLine({
name: format.name,
tokens: format.tokens,
savings: 0, // Not used in this context
savingsPercent: format.savingsPercent,
-    }))
+    }, isLast)}`)
}
return lines.join('\n')
@@ -136,22 +135,25 @@ function generateDatasetChart(result: BenchmarkResult): string {
const emoji = DATASET_ICONS[dataset.name] || DEFAULT_DATASET_ICON
const eligibility = dataset.metadata.tabularEligibility
-  const name = `${dataset.description} [eligibility: ${eligibility}%]`
+  const name = dataset.description
const percentage = Math.min(100, 100 - jsonPretty.savingsPercent)
-  const bar = createProgressBar(percentage, 100, PROGRESS_BAR_WIDTH, PROGRESS_BAR_CONFIG)
+  const bar = createProgressBar(percentage, 100, PROGRESS_BAR_WIDTH)
const toonStr = toon.tokens.toLocaleString('en-US')
-  const line1 = `${emoji} ${name.padEnd(LABEL_PADDING)}\ntoon ${bar} ${toonStr.padStart(TOKEN_PADDING)} tokens`
+  const line1 = `${emoji} ${name} ┊ Tabular: ${eligibility}%`
+  const line2 = ``
+  const line3 = `  TOON ${bar} ${toonStr.padStart(TOKEN_PADDING)} tokens`
-  const comparisonLines = COMPARISON_FORMAT_ORDER.map((formatName) => {
+  const comparisonLines = COMPARISON_FORMAT_ORDER.map((formatName, index, array) => {
const format = formats.find(f => f.name === formatName)
if (!format)
-      return null
+      return undefined
-    return formatComparisonLine(format)
+    return `  ${formatComparisonLine(format, index === array.length - 1)}`
}).filter(Boolean)
-  return [line1, ...comparisonLines].join('\n')
+  return [line1, line2, line3, ...comparisonLines].join('\n')
}
const results: BenchmarkResult[] = []
@@ -167,8 +169,8 @@ for (const dataset of TOKEN_EFFICIENCY_DATASETS) {
if (formatName === 'csv' && !supportsCSV(dataset))
continue
-    const formattedString = formatter(dataset.data)
-    const tokens = tokenize(formattedString)
+    const formattedData = formatter(dataset.data)
+    const tokens = tokenize(formattedData)
tokensByFormat[formatName] = tokens
}
@@ -212,35 +214,36 @@ const flatCharts = flatOnlyDatasets
const { dataset } = result
const emoji = DATASET_ICONS[dataset.name] || DEFAULT_DATASET_ICON
const eligibility = dataset.metadata.tabularEligibility
-    const name = `${dataset.description} [eligibility: ${eligibility}%]`
+    const name = dataset.description
// CSV line
const csvPercentage = Math.min(100, (csv.tokens / toon.tokens) * 100)
-    const csvBar = createProgressBar(csvPercentage, 100, PROGRESS_BAR_WIDTH, PROGRESS_BAR_CONFIG)
+    const csvBar = createProgressBar(csvPercentage, 100, PROGRESS_BAR_WIDTH)
const csvStr = csv.tokens.toLocaleString('en-US')
-    const line1 = `${emoji} ${name.padEnd(LABEL_PADDING)}\ncsv  ${csvBar} ${csvStr.padStart(TOKEN_PADDING)} tokens`
+    const line1 = `${emoji} ${name} ┊ Tabular: ${eligibility}%`
+    const line2 = ``
+    const line3 = `  CSV  ${csvBar} ${csvStr.padStart(TOKEN_PADDING)} tokens`
// TOON line with overhead vs CSV
const toonOverhead = toon.tokens - csv.tokens
const toonOverheadPercent = (toonOverhead / csv.tokens) * 100
-    const toonBar = createProgressBar(100, 100, PROGRESS_BAR_WIDTH, PROGRESS_BAR_CONFIG)
+    const toonBar = createProgressBar(100, 100, PROGRESS_BAR_WIDTH)
const toonStr = toon.tokens.toLocaleString('en-US')
const toonVsCSV = toonOverheadPercent >= 0
? `(+${toonOverheadPercent.toFixed(1)}% vs CSV)`
: `(${toonOverheadPercent.toFixed(1)}% vs CSV)`
-    const toonLine = `toon ${toonBar} ${toonStr.padStart(TOKEN_PADDING)} tokens ${toonVsCSV}`
+    const toonLine = `  TOON ${toonBar} ${toonStr.padStart(TOKEN_PADDING)} tokens ${toonVsCSV}`
// Other format comparisons (vs TOON)
-    const comparisonLines = COMPARISON_FORMAT_ORDER.map((formatName) => {
+    const comparisonLines = COMPARISON_FORMAT_ORDER.map((formatName, index, array) => {
const format = result.formats.find(f => f.name === formatName)
if (!format)
-        return null
+        return undefined
-      return formatComparisonLine(format)
+      return `  ${formatComparisonLine(format, index === array.length - 1)}`
}).filter(Boolean)
-    return [line1, toonLine, ...comparisonLines].join('\n')
+    return [line1, line2, line3, toonLine, ...comparisonLines].join('\n')
})
.join('\n\n')
@@ -257,25 +260,23 @@ const totalCSVTokensFlat = flatOnlyDatasets.reduce((sum, r) => {
const flatTotalLines = generateTotalLines(totalToonTokensFlat, flatTotals, { name: 'csv', tokens: totalCSVTokensFlat })
const barChartSection = `
-## Mixed-Structure Track
+#### Mixed-Structure Track
Datasets with nested or semi-uniform structures. CSV excluded as it cannot properly represent these structures.
\`\`\`
${mixedCharts}
-${SEPARATOR}
${mixedTotalLines}
\`\`\`
-## Flat-Only Track
+#### Flat-Only Track
Datasets with flat tabular structures where CSV is applicable.
\`\`\`
${flatCharts}
-${SEPARATOR}
${flatTotalLines}
\`\`\`
`.trim()


@@ -208,7 +208,7 @@ function generateEmployees(count: number): { employees: Employee[] } {
*/
const tabularDataset: Dataset = {
name: 'tabular',
-  description: 'Uniform employee records (TOON optimal format)',
+  description: 'Uniform employee records',
data: generateEmployees(100),
metadata: {
supportsCSV: true,
@@ -558,7 +558,7 @@ export const TOKEN_EFFICIENCY_DATASETS: Dataset[] = [
// Tabular: 2000 employees
{
name: 'tabular',
-  description: 'Uniform employee records (TOON optimal format)',
+  description: 'Uniform employee records',
data: generateEmployees(2000),
metadata: {
supportsCSV: true,


@@ -80,8 +80,13 @@ export function generateAccuracyReport(
return `
Benchmarks test LLM comprehension across different input formats using ${totalQuestions} data retrieval questions on ${modelNames.length} ${modelNames.length === 1 ? 'model' : 'models'}.
<details>
<summary><strong>View Dataset Catalog</strong></summary>
${generateDatasetCatalog(ACCURACY_DATASETS)}
</details>
#### Efficiency Ranking (Accuracy per 1K Tokens)
${generateEfficiencyRankingReport(formatResults)}
@@ -118,7 +123,7 @@ ${rows}
- **nested**: Objects with nested structures (nested objects or arrays)
- **deep**: Highly nested with minimal tabular eligibility
-**CSV Support:** ✓ (supported), ✗ (not supported - would require lossy flattening)
+**CSV Support:** ✓ (supported), ✗ (not supported; would require lossy flattening)
**Eligibility:** Percentage of arrays that qualify for TOON's tabular format (uniform objects with primitive values)
`.trim()
@@ -219,7 +224,7 @@ function generateDetailedAccuracyReport(
const totalEvaluations = totalQuestions * formatCount * modelNames.length
return `
-Accuracy across **${modelNames.length} ${modelNames.length === 1 ? 'LLM' : 'LLMs'}** on ${totalQuestions} data retrieval questions:
+Accuracy across ${modelNames.length} ${modelNames.length === 1 ? 'LLM' : 'LLMs'} on ${totalQuestions} data retrieval questions:
\`\`\`
${modelBreakdown}
@@ -453,13 +458,17 @@ function generateHorizontalEfficiencyChart(
): string {
const barWidth = 20
const maxEfficiency = Math.max(...ranking.map(r => r.efficiency))
-  const maxFormatWidth = Math.max(...ranking.map(r => r.format.length))
+  const maxFormatWidth = Math.max(...ranking.map((r) => {
+    const displayName = FORMATTER_DISPLAY_NAMES[r.format] || r.format
+    return displayName.length
+  }))
return ranking
.map((r) => {
const normalizedValue = r.efficiency / maxEfficiency
const bar = createProgressBar(normalizedValue, 1, barWidth, { filled: '▓', empty: '░' })
-      const formatName = r.format.padEnd(maxFormatWidth)
+      const displayName = FORMATTER_DISPLAY_NAMES[r.format] || r.format
+      const formatName = displayName.padEnd(maxFormatWidth)
const efficiency = r.efficiency.toFixed(1).padStart(4)
const accuracy = `${(r.accuracy * 100).toFixed(1)}%`.padStart(5)
const tokens = r.tokens.toLocaleString('en-US').padStart(5)