mirror of
https://github.com/voson-wang/toon.git
synced 2026-01-29 15:24:10 +08:00
chore: split token efficiency benchmark into mixed/flat tracks
456
README.md
@@ -101,39 +101,154 @@ The benchmarks test datasets across different structural patterns (uniform, semi
|
||||
|
||||
<!-- automd:file src="./benchmarks/results/token-efficiency.md" -->
|
||||
|
||||
#### Mixed-Structure Track
|
||||
|
||||
Datasets with nested or semi-uniform structures. CSV excluded as it cannot properly represent these structures.
|
||||
|
||||
```
|
||||
⭐ GitHub Repositories ██████████████░░░░░░░░░░░ 8,745 tokens
|
||||
vs JSON (−42.3%) 15,145
|
||||
vs JSON compact (−23.7%) 11,455
|
||||
vs YAML (−33.4%) 13,129
|
||||
vs XML (−48.8%) 17,095
|
||||
🛒 E-commerce orders with nested structures ┊ Tabular: 33%
|
||||
│
|
||||
TOON █████████████░░░░░░░ 72,743 tokens
|
||||
├─ vs JSON (−33.1%) 108,731 tokens
|
||||
├─ vs JSON compact (+5.5%) 68,936 tokens
|
||||
├─ vs YAML (−14.1%) 84,724 tokens
|
||||
└─ vs XML (−40.5%) 122,313 tokens
|
||||
|
||||
📈 Daily Analytics ██████████░░░░░░░░░░░░░░░ 4,507 tokens
|
||||
vs JSON (−58.9%) 10,977
|
||||
vs JSON compact (−35.7%) 7,013
|
||||
vs YAML (−48.8%) 8,810
|
||||
vs XML (−65.7%) 13,128
|
||||
🧾 Semi-uniform event logs ┊ Tabular: 50%
|
||||
│
|
||||
TOON █████████████████░░░ 153,223 tokens
|
||||
├─ vs JSON (−15.0%) 180,196 tokens
|
||||
├─ vs JSON compact (+19.9%) 127,740 tokens
|
||||
├─ vs YAML (−0.8%) 154,514 tokens
|
||||
└─ vs XML (−25.2%) 204,800 tokens
|
||||
|
||||
🛒 E-Commerce Order ████████████████░░░░░░░░░ 166 tokens
|
||||
vs JSON (−35.4%) 257
|
||||
vs JSON compact (−2.9%) 171
|
||||
vs YAML (−15.7%) 197
|
||||
vs XML (−38.7%) 271
|
||||
🧩 Deeply nested configuration ┊ Tabular: 0%
|
||||
│
|
||||
TOON ██████████████░░░░░░ 631 tokens
|
||||
├─ vs JSON (−31.3%) 919 tokens
|
||||
├─ vs JSON compact (+11.9%) 564 tokens
|
||||
├─ vs YAML (−6.2%) 673 tokens
|
||||
└─ vs XML (−37.4%) 1,008 tokens
|
||||
|
||||
─────────────────────────────────────────────────────────────────────
|
||||
Total ██████████████░░░░░░░░░░░ 13,418 tokens
|
||||
vs JSON (−49.1%) 26,379
|
||||
vs JSON compact (−28.0%) 18,639
|
||||
vs YAML (−39.4%) 22,136
|
||||
vs XML (−56.0%) 30,494
|
||||
──────────────────────────────────── Total ────────────────────────────────────
|
||||
TOON ████████████████░░░░ 226,597 tokens
|
||||
├─ vs JSON (−21.8%) 289,846 tokens
|
||||
├─ vs JSON compact (+14.9%) 197,240 tokens
|
||||
├─ vs YAML (−5.5%) 239,911 tokens
|
||||
└─ vs XML (−30.9%) 328,121 tokens
|
||||
```
|
||||
|
||||
#### Flat-Only Track
|
||||
|
||||
Datasets with flat tabular structures where CSV is applicable.
|
||||
|
||||
```
|
||||
👥 Uniform employee records ┊ Tabular: 100%
|
||||
│
|
||||
CSV ███████████████████░ 46,956 tokens
|
||||
TOON ████████████████████ 49,827 tokens (+6.1% vs CSV)
|
||||
├─ vs JSON (−60.7%) 126,854 tokens
|
||||
├─ vs JSON compact (−36.8%) 78,850 tokens
|
||||
├─ vs YAML (−50.0%) 99,701 tokens
|
||||
└─ vs XML (−66.0%) 146,440 tokens
|
||||
|
||||
📈 Time-series analytics data ┊ Tabular: 100%
|
||||
│
|
||||
CSV ██████████████████░░ 8,396 tokens
|
||||
TOON ████████████████████ 9,128 tokens (+8.7% vs CSV)
|
||||
├─ vs JSON (−59.0%) 22,258 tokens
|
||||
├─ vs JSON compact (−35.8%) 14,224 tokens
|
||||
├─ vs YAML (−48.9%) 17,871 tokens
|
||||
└─ vs XML (−65.7%) 26,629 tokens
|
||||
|
||||
⭐ Top 100 GitHub repositories ┊ Tabular: 100%
|
||||
│
|
||||
CSV ███████████████████░ 8,513 tokens
|
||||
TOON ████████████████████ 8,745 tokens (+2.7% vs CSV)
|
||||
├─ vs JSON (−42.3%) 15,145 tokens
|
||||
├─ vs JSON compact (−23.7%) 11,455 tokens
|
||||
├─ vs YAML (−33.4%) 13,129 tokens
|
||||
└─ vs XML (−48.8%) 17,095 tokens
|
||||
|
||||
──────────────────────────────────── Total ────────────────────────────────────
|
||||
CSV ███████████████████░ 63,865 tokens
|
||||
TOON ████████████████████ 67,700 tokens (+6.0% vs CSV)
|
||||
├─ vs JSON (−58.8%) 164,257 tokens
|
||||
├─ vs JSON compact (−35.2%) 104,529 tokens
|
||||
├─ vs YAML (−48.2%) 130,701 tokens
|
||||
└─ vs XML (−64.4%) 190,164 tokens
|
||||
```
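The signed percentages in the charts above appear to be the difference between two token counts, expressed relative to the format named on the row (or relative to CSV for the `+x% vs CSV` labels). The following is a minimal sketch under that assumption; `relativeDiff` and `formatDiff` are illustrative names, not functions from the benchmark scripts.

```typescript
// Signed percentage difference of `tokens` relative to `baseline`.
// A negative result means fewer tokens than the baseline.
function relativeDiff(tokens: number, baseline: number): number {
  return ((tokens - baseline) / baseline) * 100;
}

// Render with an explicit sign and one decimal place, matching the chart labels.
// The minus sign is U+2212, the character used in the report.
function formatDiff(percent: number): string {
  const sign = percent < 0 ? "\u2212" : "+";
  return `${sign}${Math.abs(percent).toFixed(1)}%`;
}

// Token counts taken from the "E-commerce orders with nested structures" rows.
const toonTokens = 72743;
console.log(formatDiff(relativeDiff(toonTokens, 108731))); // vs JSON: −33.1%
console.log(formatDiff(relativeDiff(toonTokens, 68936)));  // vs JSON compact: +5.5%

// Flat-only track, "Uniform employee records": TOON relative to CSV.
console.log(formatDiff(relativeDiff(49827, 46956))); // +6.1%
```

Dividing by the comparison format rather than by TOON is what makes a positive value read as "TOON costs this much more than X".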
|
||||
|
||||
<details>
|
||||
<summary><strong>View detailed examples</strong></summary>
|
||||
|
||||
#### ⭐ GitHub Repositories
|
||||
#### 📈 Time-series analytics data
|
||||
|
||||
**Configuration:** Top 100 GitHub repositories with stars, forks, and metadata
|
||||
**Savings:** 13,130 tokens (59.0% reduction vs JSON)
|
||||
|
||||
**JSON** (22,258 tokens):
|
||||
|
||||
```json
|
||||
{
|
||||
"metrics": [
|
||||
{
|
||||
"date": "2025-01-01",
|
||||
"views": 7708,
|
||||
"clicks": 595,
|
||||
"conversions": 69,
|
||||
"revenue": 15369.93,
|
||||
"bounceRate": 0.35
|
||||
},
|
||||
{
|
||||
"date": "2025-01-02",
|
||||
"views": 5894,
|
||||
"clicks": 381,
|
||||
"conversions": 21,
|
||||
"revenue": 2112.12,
|
||||
"bounceRate": 0.3
|
||||
},
|
||||
{
|
||||
"date": "2025-01-03",
|
||||
"views": 6835,
|
||||
"clicks": 422,
|
||||
"conversions": 35,
|
||||
"revenue": 4525.73,
|
||||
"bounceRate": 0.5
|
||||
},
|
||||
{
|
||||
"date": "2025-01-04",
|
||||
"views": 5325,
|
||||
"clicks": 305,
|
||||
"conversions": 22,
|
||||
"revenue": 2445.3,
|
||||
"bounceRate": 0.44
|
||||
},
|
||||
{
|
||||
"date": "2025-01-05",
|
||||
"views": 2974,
|
||||
"clicks": 61,
|
||||
"conversions": 6,
|
||||
"revenue": 956.57,
|
||||
"bounceRate": 0.47
|
||||
}
|
||||
]
|
||||
}
|
||||
```
|
||||
|
||||
**TOON** (9,128 tokens):
|
||||
|
||||
```
|
||||
metrics[5]{date,views,clicks,conversions,revenue,bounceRate}:
|
||||
2025-01-01,7708,595,69,15369.93,0.35
|
||||
2025-01-02,5894,381,21,2112.12,0.3
|
||||
2025-01-03,6835,422,35,4525.73,0.5
|
||||
2025-01-04,5325,305,22,2445.3,0.44
|
||||
2025-01-05,2974,61,6,956.57,0.47
|
||||
```
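To make the mapping from the JSON sample to the TOON sample concrete, here is a rough sketch of a tabular encoder. It is not the library's implementation: `encodeTabular` and `encodeCell` are hypothetical names, the quoting rule is a simplification inferred from the samples in this diff (timestamps containing `:` are quoted, plain text is not), and the two-space row indent is an assumption, since the rows in this extracted view appear flush left.

```typescript
type Primitive = string | number | boolean | null;
type Row = Record<string, Primitive>;

// Simplified quoting: quote strings containing a comma, colon, double quote, or newline.
function encodeCell(value: Primitive): string {
  if (value === null) return "null";
  if (typeof value !== "string") return String(value);
  return /[",:\n]/.test(value) ? JSON.stringify(value) : value;
}

// Emit `key[N]{field,...}:` followed by one delimited line per row.
// Assumes every row has the same fields as the first row (uniform objects).
function encodeTabular(key: string, rows: Row[], indent = "  "): string {
  const fields = Object.keys(rows[0] ?? {});
  const header = `${key}[${rows.length}]{${fields.join(",")}}:`;
  const lines = rows.map(
    (row) => indent + fields.map((f) => encodeCell(row[f] ?? null)).join(","),
  );
  return [header, ...lines].join("\n");
}

console.log(
  encodeTabular("metrics", [
    { date: "2025-01-01", views: 7708, clicks: 595 },
    { date: "2025-01-02", views: 5894, clicks: 381 },
  ]),
);
```

Declaring the field names once in the header, instead of repeating them in every object, is where the token savings over JSON come from on uniform data.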
|
||||
|
||||
---
|
||||
|
||||
#### ⭐ Top 100 GitHub repositories
|
||||
|
||||
**Savings:** 6,400 tokens (42.3% reduction vs JSON)
|
||||
|
||||
@@ -194,74 +309,6 @@ repositories[3]{id,name,repo,description,createdAt,updatedAt,pushedAt,stars,watc
|
||||
21737465,awesome,sindresorhus/awesome,😎 Awesome lists about all kinds of interesting topics,"2014-07-11T13:42:37Z","2025-10-28T12:40:21Z","2025-10-27T17:57:31Z",410052,8017,32029,main
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
#### 📈 Daily Analytics
|
||||
|
||||
**Configuration:** 180 days of web metrics (views, clicks, conversions, revenue)
|
||||
|
||||
**Savings:** 6,470 tokens (58.9% reduction vs JSON)
|
||||
|
||||
**JSON** (10,977 tokens):
|
||||
|
||||
```json
|
||||
{
|
||||
"metrics": [
|
||||
{
|
||||
"date": "2025-01-01",
|
||||
"views": 6890,
|
||||
"clicks": 401,
|
||||
"conversions": 23,
|
||||
"revenue": 6015.59,
|
||||
"bounceRate": 0.63
|
||||
},
|
||||
{
|
||||
"date": "2025-01-02",
|
||||
"views": 6940,
|
||||
"clicks": 323,
|
||||
"conversions": 37,
|
||||
"revenue": 9086.44,
|
||||
"bounceRate": 0.36
|
||||
},
|
||||
{
|
||||
"date": "2025-01-03",
|
||||
"views": 4390,
|
||||
"clicks": 346,
|
||||
"conversions": 26,
|
||||
"revenue": 6360.75,
|
||||
"bounceRate": 0.48
|
||||
},
|
||||
{
|
||||
"date": "2025-01-04",
|
||||
"views": 3429,
|
||||
"clicks": 231,
|
||||
"conversions": 13,
|
||||
"revenue": 2360.96,
|
||||
"bounceRate": 0.65
|
||||
},
|
||||
{
|
||||
"date": "2025-01-05",
|
||||
"views": 5804,
|
||||
"clicks": 186,
|
||||
"conversions": 22,
|
||||
"revenue": 2535.96,
|
||||
"bounceRate": 0.37
|
||||
}
|
||||
]
|
||||
}
|
||||
```
|
||||
|
||||
**TOON** (4,507 tokens):
|
||||
|
||||
```
|
||||
metrics[5]{date,views,clicks,conversions,revenue,bounceRate}:
|
||||
2025-01-01,6890,401,23,6015.59,0.63
|
||||
2025-01-02,6940,323,37,9086.44,0.36
|
||||
2025-01-03,4390,346,26,6360.75,0.48
|
||||
2025-01-04,3429,231,13,2360.96,0.65
|
||||
2025-01-05,5804,186,22,2535.96,0.37
|
||||
```
|
||||
|
||||
</details>
|
||||
|
||||
<!-- /automd -->
|
||||
@@ -270,111 +317,156 @@ metrics[5]{date,views,clicks,conversions,revenue,bounceRate}:
|
||||
|
||||
<!-- automd:file src="./benchmarks/results/retrieval-accuracy.md" -->
|
||||
|
||||
Benchmarks test LLM comprehension across different input formats using 154 data retrieval questions on 4 models.
|
||||
Benchmarks test LLM comprehension across different input formats using 201 data retrieval questions on 4 models.
|
||||
|
||||
<details>
|
||||
<summary><strong>View Dataset Catalog</strong></summary>
|
||||
|
||||
#### Dataset Catalog
|
||||
|
||||
| Dataset | Rows | Structure | CSV Support | Eligibility |
| ------- | ---- | --------- | ----------- | ----------- |
| Uniform employee records | 100 | uniform | ✓ | 100% |
| E-commerce orders with nested structures | 50 | nested | ✗ | 33% |
| Time-series analytics data | 60 | uniform | ✓ | 100% |
| Top 100 GitHub repositories | 100 | uniform | ✓ | 100% |
| Semi-uniform event logs | 75 | semi-uniform | ✗ | 50% |
| Deeply nested configuration | 11 | deep | ✗ | 0% |
|
||||
|
||||
**Structure classes:**
- **uniform**: All objects have identical fields with primitive values
- **semi-uniform**: Mix of uniform and non-uniform structures
- **nested**: Objects with nested structures (nested objects or arrays)
- **deep**: Highly nested with minimal tabular eligibility
|
||||
|
||||
**CSV Support:** ✓ (supported), ✗ (not supported – would require lossy flattening)
|
||||
|
||||
**Eligibility:** Percentage of arrays that qualify for TOON's tabular format (uniform objects with primitive values)
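
One way to read the eligibility definition above is as a check over every array in a dataset. The sketch below assumes exactly that: an array qualifies when it is non-empty and every element is a plain object with the same keys and only primitive values. How the benchmark actually computes the percentage is not shown in this diff, so `isTabularEligible` and `eligibilityPct` are illustrative only.

```typescript
type Primitive = string | number | boolean | null;

function isPrimitive(v: unknown): v is Primitive {
  return v === null || ["string", "number", "boolean"].includes(typeof v);
}

function isPlainObject(v: unknown): v is Record<string, unknown> {
  return typeof v === "object" && v !== null && !Array.isArray(v);
}

// True when all elements are objects with identical keys and primitive values.
// Treating an empty array as not eligible is an assumption of this sketch.
function isTabularEligible(arr: unknown[]): boolean {
  const first = arr[0];
  if (!isPlainObject(first)) return false;
  const keys = Object.keys(first).sort();
  return arr.every((item) => {
    if (!isPlainObject(item)) return false;
    const itemKeys = Object.keys(item).sort();
    return (
      itemKeys.length === keys.length &&
      itemKeys.every((k, i) => k === keys[i]) &&
      Object.values(item).every(isPrimitive)
    );
  });
}

// Recursively collect every array found anywhere in the value.
function collectArrays(value: unknown, out: unknown[][] = []): unknown[][] {
  if (Array.isArray(value)) {
    out.push(value);
    value.forEach((v) => collectArrays(v, out));
  } else if (isPlainObject(value)) {
    Object.values(value).forEach((v) => collectArrays(v, out));
  }
  return out;
}

// Share of arrays that qualify for the tabular form, as a percentage.
function eligibilityPct(doc: unknown): number {
  const arrays = collectArrays(doc);
  if (arrays.length === 0) return 0;
  return (arrays.filter(isTabularEligible).length / arrays.length) * 100;
}

// A flat dataset: its single array is eligible.
console.log(eligibilityPct({ metrics: [{ day: 1, views: 10 }, { day: 2, views: 20 }] }));
// A nested dataset: `orders` is not eligible (it holds an array), `items` is.
console.log(eligibilityPct({ orders: [{ id: 1, items: [{ sku: "x", qty: 2 }] }] }));
```

Under this reading, a dataset with no arrays at all scores 0%, which would be consistent with the "deep" class in the catalog.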
|
||||
|
||||
</details>
|
||||
|
||||
#### Efficiency Ranking (Accuracy per 1K Tokens)
|
||||
|
||||
Each format's overall performance, balancing accuracy against token cost:
|
||||
|
||||
```
|
||||
toon ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓ 15.0 │ 70.1% acc │ 4,678 tokens
|
||||
csv ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓░ 14.3 │ 67.7% acc │ 4,745 tokens
|
||||
json-compact ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓░░░░░ 11.0 │ 65.3% acc │ 5,925 tokens
|
||||
yaml ▓▓▓▓▓▓▓▓▓▓▓▓▓░░░░░░░ 9.4 │ 66.7% acc │ 7,091 tokens
|
||||
json-pretty ▓▓▓▓▓▓▓▓▓▓░░░░░░░░░░ 7.5 │ 65.4% acc │ 8,713 tokens
|
||||
xml ▓▓▓▓▓▓▓▓▓░░░░░░░░░░░ 6.8 │ 67.2% acc │ 9,944 tokens
|
||||
TOON ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓ 15.6 │ 68.7% acc │ 4,389 tokens
|
||||
CSV ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓ 15.3 │ 62.3% acc │ 4,080 tokens
|
||||
JSON compact ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓░░░ 13.5 │ 67.2% acc │ 4,982 tokens
|
||||
YAML ▓▓▓▓▓▓▓▓▓▓▓▓▓▓░░░░░░ 11.2 │ 66.7% acc │ 5,976 tokens
|
||||
JSON ▓▓▓▓▓▓▓▓▓▓▓▓░░░░░░░░ 9.0 │ 65.7% acc │ 7,260 tokens
|
||||
XML ▓▓▓▓▓▓▓▓▓▓░░░░░░░░░░ 8.1 │ 66.8% acc │ 8,251 tokens
|
||||
```
|
||||
|
||||
TOON achieves **70.1%** accuracy (vs JSON's 65.4%) while using **46.3% fewer tokens**.
|
||||
TOON achieves **68.7%** accuracy (vs JSON's 65.7%) while using **39.5% fewer tokens**.
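
The ranking and the "fewer tokens" figure can be approximated from the numbers in the chart. This sketch assumes the score is accuracy (in percent) divided by thousands of tokens, as the heading suggests. Because the published accuracy and token values are already rounded, a recomputed score can differ from the chart by roughly 0.1. The function names are illustrative.

```typescript
interface FormatResult {
  name: string;
  accuracyPct: number; // overall accuracy in percent
  avgTokens: number;   // token count shown in the chart
}

// Accuracy points obtained per 1,000 tokens of input.
function accuracyPer1kTokens(r: FormatResult): number {
  return r.accuracyPct / (r.avgTokens / 1000);
}

// How many percent fewer tokens `candidate` uses than `baseline`.
function tokenSavingsPct(candidate: number, baseline: number): number {
  return (1 - candidate / baseline) * 100;
}

// Values copied from the updated ranking in this diff.
const results: FormatResult[] = [
  { name: "JSON", accuracyPct: 65.7, avgTokens: 7260 },
  { name: "TOON", accuracyPct: 68.7, avgTokens: 4389 },
  { name: "XML", accuracyPct: 66.8, avgTokens: 8251 },
  { name: "CSV", accuracyPct: 62.3, avgTokens: 4080 },
  { name: "YAML", accuracyPct: 66.7, avgTokens: 5976 },
  { name: "JSON compact", accuracyPct: 67.2, avgTokens: 4982 },
];

// Sort a copy, highest efficiency first.
const ranked = [...results].sort(
  (a, b) => accuracyPer1kTokens(b) - accuracyPer1kTokens(a),
);
for (const r of ranked) {
  console.log(r.name.padEnd(14), accuracyPer1kTokens(r).toFixed(1));
}

// TOON vs JSON token savings.
console.log(tokenSavingsPct(4389, 7260).toFixed(1)); // 39.5
```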
|
||||
|
||||
#### Per-Model Accuracy
|
||||
|
||||
Accuracy across **4 LLMs** on 154 data retrieval questions:
|
||||
Accuracy across 4 LLMs on 201 data retrieval questions:
|
||||
|
||||
```
|
||||
gpt-5-nano
|
||||
→ TOON ███████████████████░ 96.1% (148/154)
|
||||
CSV ██████████████████░░ 91.6% (141/154)
|
||||
YAML ██████████████████░░ 91.6% (141/154)
|
||||
JSON compact ██████████████████░░ 91.6% (141/154)
|
||||
XML █████████████████░░░ 87.0% (134/154)
|
||||
JSON █████████████████░░░ 86.4% (133/154)
|
||||
→ TOON ██████████████████░░ 88.6% (178/201)
|
||||
JSON compact ██████████████████░░ 88.1% (177/201)
|
||||
CSV ██████████████████░░ 88.0% (88/100)
|
||||
YAML █████████████████░░░ 84.6% (170/201)
|
||||
XML ████████████████░░░░ 81.6% (164/201)
|
||||
JSON ████████████████░░░░ 80.1% (161/201)
|
||||
|
||||
claude-haiku-4-5-20251001
|
||||
JSON ██████████░░░░░░░░░░ 50.0% (77/154)
|
||||
YAML ██████████░░░░░░░░░░ 49.4% (76/154)
|
||||
→ TOON ██████████░░░░░░░░░░ 48.7% (75/154)
|
||||
XML ██████████░░░░░░░░░░ 48.1% (74/154)
|
||||
CSV █████████░░░░░░░░░░░ 47.4% (73/154)
|
||||
JSON compact █████████░░░░░░░░░░░ 44.2% (68/154)
|
||||
YAML ██████████░░░░░░░░░░ 52.2% (105/201)
|
||||
→ TOON ██████████░░░░░░░░░░ 50.7% (102/201)
|
||||
JSON ██████████░░░░░░░░░░ 50.2% (101/201)
|
||||
JSON compact ██████████░░░░░░░░░░ 49.8% (100/201)
|
||||
XML ██████████░░░░░░░░░░ 49.3% (99/201)
|
||||
CSV ████████░░░░░░░░░░░░ 39.0% (39/100)
|
||||
|
||||
gemini-2.5-flash
|
||||
CSV ██████████████████░░ 87.7% (135/154)
|
||||
XML ██████████████████░░ 87.7% (135/154)
|
||||
→ TOON █████████████████░░░ 86.4% (133/154)
|
||||
YAML ████████████████░░░░ 79.9% (123/154)
|
||||
JSON compact ████████████████░░░░ 79.9% (123/154)
|
||||
JSON ███████████████░░░░░ 76.6% (118/154)
|
||||
XML █████████████████░░░ 86.1% (173/201)
|
||||
→ TOON █████████████████░░░ 84.1% (169/201)
|
||||
CSV ████████████████░░░░ 82.0% (82/100)
|
||||
JSON compact ████████████████░░░░ 81.1% (163/201)
|
||||
YAML ████████████████░░░░ 81.1% (163/201)
|
||||
JSON ████████████████░░░░ 81.1% (163/201)
|
||||
|
||||
grok-4-fast-non-reasoning
|
||||
→ TOON ██████████░░░░░░░░░░ 49.4% (76/154)
|
||||
JSON ██████████░░░░░░░░░░ 48.7% (75/154)
|
||||
XML █████████░░░░░░░░░░░ 46.1% (71/154)
|
||||
YAML █████████░░░░░░░░░░░ 46.1% (71/154)
|
||||
JSON compact █████████░░░░░░░░░░░ 45.5% (70/154)
|
||||
CSV █████████░░░░░░░░░░░ 44.2% (68/154)
|
||||
→ TOON ██████████░░░░░░░░░░ 51.2% (103/201)
|
||||
JSON ██████████░░░░░░░░░░ 51.2% (103/201)
|
||||
XML ██████████░░░░░░░░░░ 50.2% (101/201)
|
||||
JSON compact ██████████░░░░░░░░░░ 49.8% (100/201)
|
||||
YAML ██████████░░░░░░░░░░ 48.8% (98/201)
|
||||
CSV ████████░░░░░░░░░░░░ 40.0% (40/100)
|
||||
```
|
||||
|
||||
**Key tradeoff:** TOON achieves **70.1% accuracy** (vs JSON's 65.4%) while using **46.3% fewer tokens** on these datasets.
|
||||
**Key tradeoff:** TOON achieves **68.7% accuracy** (vs JSON's 65.7%) while using **39.5% fewer tokens** on these datasets.
|
||||
|
||||
<details>
|
||||
<summary><strong>Performance by dataset and model</strong></summary>
|
||||
|
||||
#### Performance by Dataset
|
||||
|
||||
##### Uniform employee records (TOON optimal format)
|
||||
##### Uniform employee records
|
||||
|
||||
| Format | Accuracy | Tokens | Correct/Total |
|
||||
| ------ | -------- | ------ | ------------- |
|
||||
| `csv` | 65.5% | 2,337 | 131/200 |
|
||||
| `toon` | 67.5% | 2,483 | 135/200 |
|
||||
| `json-compact` | 65.5% | 3,943 | 131/200 |
|
||||
| `yaml` | 68.5% | 4,969 | 137/200 |
|
||||
| `xml` | 69.5% | 7,314 | 139/200 |
|
||||
| `json-pretty` | 64.5% | 6,347 | 129/200 |
|
||||
| `toon` | 65.6% | 2,483 | 105/160 |
|
||||
| `csv` | 62.5% | 2,337 | 100/160 |
|
||||
| `json-compact` | 66.3% | 3,943 | 106/160 |
|
||||
| `yaml` | 63.7% | 4,969 | 102/160 |
|
||||
| `xml` | 67.5% | 7,314 | 108/160 |
|
||||
| `json-pretty` | 62.5% | 6,347 | 100/160 |
|
||||
|
||||
##### E-commerce orders with nested structures
|
||||
|
||||
| Format | Accuracy | Tokens | Correct/Total |
|
||||
| ------ | -------- | ------ | ------------- |
|
||||
| `toon` | 78.8% | 5,967 | 126/160 |
|
||||
| `csv` | 76.3% | 6,735 | 122/160 |
|
||||
| `json-compact` | 70.6% | 5,962 | 113/160 |
|
||||
| `yaml` | 72.5% | 7,328 | 116/160 |
|
||||
| `json-pretty` | 76.9% | 9,694 | 123/160 |
|
||||
| `xml` | 73.1% | 10,992 | 117/160 |
|
||||
| `toon` | 75.6% | 7,197 | 121/160 |
|
||||
| `json-compact` | 70.6% | 6,784 | 113/160 |
|
||||
| `yaml` | 71.9% | 8,334 | 115/160 |
|
||||
| `json-pretty` | 68.8% | 10,700 | 110/160 |
|
||||
| `xml` | 71.9% | 12,013 | 115/160 |
|
||||
|
||||
##### Time-series analytics data
|
||||
|
||||
| Format | Accuracy | Tokens | Correct/Total |
|
||||
| ------ | -------- | ------ | ------------- |
|
||||
| `toon` | 68.4% | 1,515 | 93/136 |
|
||||
| `csv` | 65.4% | 1,393 | 89/136 |
|
||||
| `json-compact` | 64.7% | 2,341 | 88/136 |
|
||||
| `yaml` | 66.2% | 2,938 | 90/136 |
|
||||
| `json-pretty` | 64.7% | 3,665 | 88/136 |
|
||||
| `xml` | 66.9% | 4,376 | 91/136 |
|
||||
| `csv` | 63.8% | 1,391 | 74/116 |
|
||||
| `toon` | 66.4% | 1,513 | 77/116 |
|
||||
| `json-compact` | 61.2% | 2,339 | 71/116 |
|
||||
| `yaml` | 65.5% | 2,936 | 76/116 |
|
||||
| `json-pretty` | 64.7% | 3,663 | 75/116 |
|
||||
| `xml` | 65.5% | 4,374 | 76/116 |
|
||||
|
||||
##### Top 100 GitHub repositories
|
||||
|
||||
| Format | Accuracy | Tokens | Correct/Total |
|
||||
| ------ | -------- | ------ | ------------- |
|
||||
| `toon` | 65.0% | 8,745 | 78/120 |
|
||||
| `csv` | 62.5% | 8,513 | 75/120 |
|
||||
| `json-compact` | 58.3% | 11,455 | 70/120 |
|
||||
| `yaml` | 56.7% | 13,129 | 68/120 |
|
||||
| `xml` | 55.8% | 17,095 | 67/120 |
|
||||
| `json-pretty` | 52.5% | 15,145 | 63/120 |
|
||||
| `toon` | 63.7% | 8,745 | 79/124 |
|
||||
| `csv` | 60.5% | 8,513 | 75/124 |
|
||||
| `json-compact` | 56.5% | 11,455 | 70/124 |
|
||||
| `yaml` | 53.2% | 13,129 | 66/124 |
|
||||
| `json-pretty` | 53.2% | 15,145 | 66/124 |
|
||||
| `xml` | 53.2% | 17,095 | 66/124 |
|
||||
|
||||
##### Semi-uniform event logs
|
||||
|
||||
| Format | Accuracy | Tokens | Correct/Total |
|
||||
| ------ | -------- | ------ | ------------- |
|
||||
| `json-compact` | 55.0% | 4,809 | 66/120 |
|
||||
| `yaml` | 52.5% | 5,814 | 63/120 |
|
||||
| `json-pretty` | 52.5% | 6,784 | 63/120 |
|
||||
| `toon` | 45.8% | 5,764 | 55/120 |
|
||||
| `xml` | 50.8% | 7,699 | 61/120 |
|
||||
|
||||
##### Deeply nested configuration
|
||||
|
||||
| Format | Accuracy | Tokens | Correct/Total |
|
||||
| ------ | -------- | ------ | ------------- |
|
||||
| `json-compact` | 91.9% | 564 | 114/124 |
|
||||
| `toon` | 92.7% | 631 | 115/124 |
|
||||
| `yaml` | 91.9% | 673 | 114/124 |
|
||||
| `json-pretty` | 91.9% | 919 | 114/124 |
|
||||
| `xml` | 89.5% | 1,008 | 111/124 |
|
||||
|
||||
#### Performance by Model
|
||||
|
||||
@@ -382,45 +474,45 @@ grok-4-fast-non-reasoning
|
||||
|
||||
| Format | Accuracy | Correct/Total |
|
||||
| ------ | -------- | ------------- |
|
||||
| `toon` | 96.1% | 148/154 |
|
||||
| `csv` | 91.6% | 141/154 |
|
||||
| `yaml` | 91.6% | 141/154 |
|
||||
| `json-compact` | 91.6% | 141/154 |
|
||||
| `xml` | 87.0% | 134/154 |
|
||||
| `json-pretty` | 86.4% | 133/154 |
|
||||
| `toon` | 88.6% | 178/201 |
|
||||
| `json-compact` | 88.1% | 177/201 |
|
||||
| `csv` | 88.0% | 88/100 |
|
||||
| `yaml` | 84.6% | 170/201 |
|
||||
| `xml` | 81.6% | 164/201 |
|
||||
| `json-pretty` | 80.1% | 161/201 |
|
||||
|
||||
##### claude-haiku-4-5-20251001
|
||||
|
||||
| Format | Accuracy | Correct/Total |
|
||||
| ------ | -------- | ------------- |
|
||||
| `json-pretty` | 50.0% | 77/154 |
|
||||
| `yaml` | 49.4% | 76/154 |
|
||||
| `toon` | 48.7% | 75/154 |
|
||||
| `xml` | 48.1% | 74/154 |
|
||||
| `csv` | 47.4% | 73/154 |
|
||||
| `json-compact` | 44.2% | 68/154 |
|
||||
| `yaml` | 52.2% | 105/201 |
|
||||
| `toon` | 50.7% | 102/201 |
|
||||
| `json-pretty` | 50.2% | 101/201 |
|
||||
| `json-compact` | 49.8% | 100/201 |
|
||||
| `xml` | 49.3% | 99/201 |
|
||||
| `csv` | 39.0% | 39/100 |
|
||||
|
||||
##### gemini-2.5-flash
|
||||
|
||||
| Format | Accuracy | Correct/Total |
|
||||
| ------ | -------- | ------------- |
|
||||
| `csv` | 87.7% | 135/154 |
|
||||
| `xml` | 87.7% | 135/154 |
|
||||
| `toon` | 86.4% | 133/154 |
|
||||
| `yaml` | 79.9% | 123/154 |
|
||||
| `json-compact` | 79.9% | 123/154 |
|
||||
| `json-pretty` | 76.6% | 118/154 |
|
||||
| `xml` | 86.1% | 173/201 |
|
||||
| `toon` | 84.1% | 169/201 |
|
||||
| `csv` | 82.0% | 82/100 |
|
||||
| `json-compact` | 81.1% | 163/201 |
|
||||
| `yaml` | 81.1% | 163/201 |
|
||||
| `json-pretty` | 81.1% | 163/201 |
|
||||
|
||||
##### grok-4-fast-non-reasoning
|
||||
|
||||
| Format | Accuracy | Correct/Total |
|
||||
| ------ | -------- | ------------- |
|
||||
| `toon` | 49.4% | 76/154 |
|
||||
| `json-pretty` | 48.7% | 75/154 |
|
||||
| `xml` | 46.1% | 71/154 |
|
||||
| `yaml` | 46.1% | 71/154 |
|
||||
| `json-compact` | 45.5% | 70/154 |
|
||||
| `csv` | 44.2% | 68/154 |
|
||||
| `toon` | 51.2% | 103/201 |
|
||||
| `json-pretty` | 51.2% | 103/201 |
|
||||
| `xml` | 50.2% | 101/201 |
|
||||
| `json-compact` | 49.8% | 100/201 |
|
||||
| `yaml` | 48.8% | 98/201 |
|
||||
| `csv` | 40.0% | 40/100 |
|
||||
|
||||
</details>
|
||||
|
||||
@@ -433,34 +525,36 @@ This benchmark tests **LLM comprehension and data retrieval accuracy** across di
|
||||
|
||||
#### Datasets Tested
|
||||
|
||||
Four datasets designed to test different structural patterns (all contain arrays of uniform objects, TOON's optimal format):
|
||||
Six datasets designed to test different structural patterns:
|
||||
|
||||
1. **Tabular** (100 employee records): Uniform objects with identical fields – optimal for TOON's tabular format.
2. **Nested** (50 e-commerce orders): Complex structures with nested customer objects and item arrays.
3. **Analytics** (60 days of metrics): Time-series data with dates and numeric values.
4. **GitHub** (100 repositories): Real-world data from top GitHub repos by stars.
5. **Event Logs** (75 logs): Semi-uniform data with ~50% flat logs and ~50% with nested error objects.
6. **Nested Config** (1 configuration): Deeply nested configuration with minimal tabular eligibility.
|
||||
|
||||
#### Question Types
|
||||
|
||||
154 questions are generated dynamically across three categories:
|
||||
201 questions are generated dynamically across three categories:
|
||||
|
||||
- **Field retrieval (40%)**: Direct value lookups or values that can be read straight off a record (including booleans and simple counts such as array lengths)
- **Field retrieval (36%)**: Direct value lookups or values that can be read straight off a record (including booleans and simple counts such as array lengths)
  - Example: "What is Alice's salary?" → `75000`
  - Example: "How many items are in order ORD-0042?" → `3`
  - Example: "What is the customer name for order ORD-0042?" → `John Doe`

- **Aggregation (32%)**: Dataset-level totals and averages plus single-condition filters (counts, sums, min/max comparisons)
- **Aggregation (37%)**: Dataset-level totals and averages plus single-condition filters (counts, sums, min/max comparisons)
  - Example: "How many employees work in Engineering?" → `17`
  - Example: "What is the total revenue across all orders?" → `45123.50`
  - Example: "How many employees have salary > 80000?" → `23`

- **Filtering (28%)**: Multi-condition queries requiring compound logic (AND constraints across fields)
- **Filtering (27%)**: Multi-condition queries requiring compound logic (AND constraints across fields)
  - Example: "How many employees in Sales have salary > 80000?" → `5`
  - Example: "How many active employees have more than 10 years of experience?" → `8`
|
||||
|
||||
#### Evaluation Process
|
||||
|
||||
1. **Format conversion**: Each dataset is converted to all 6 formats (TOON, CSV, XML, YAML, JSON, JSON compact).
|
||||
1. **Format conversion**: Each dataset is converted to all 6 formats (TOON, JSON compact, XML, YAML, JSON, CSV).
|
||||
2. **Query LLM**: Each model receives formatted data + question in a prompt and extracts the answer.
|
||||
3. **Validate with LLM-as-judge**: `gpt-5-nano` validates if the answer is semantically correct (e.g., `50000` = `$50,000`, `Engineering` = `engineering`, `2025-01-01` = `January 1, 2025`).
|
||||
|
||||
@@ -469,7 +563,7 @@ Four datasets designed to test different structural patterns (all contain arrays
|
||||
- **Models tested**: `gpt-5-nano`, `claude-haiku-4-5-20251001`, `gemini-2.5-flash`, `grok-4-fast-non-reasoning`
- **Token counting**: Using `gpt-tokenizer` with `o200k_base` encoding (GPT-5 tokenizer)
- **Temperature**: Not set (models use their defaults)
- **Total evaluations**: 154 questions × 6 formats × 4 models = 3,696 LLM calls
- **Total evaluations**: 201 questions × 6 formats × 4 models = 4,824 LLM calls
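
The evaluation total is a straight product of the three counts. The sketch below reproduces it and, as a hedged aside, also shows what the count would be if CSV were asked only 100 questions, which is what the `x/100` CSV rows in the per-model results suggest. That second figure is an inference from the tables, not a number stated in the report.

```typescript
// Nominal number of LLM calls: every question in every format for every model.
function totalCalls(questions: number, formats: number, models: number): number {
  return questions * formats * models;
}

console.log(totalCalls(154, 6, 4)); // previous run: 3696
console.log(totalCalls(201, 6, 4)); // current run: 4824

// Assumption: CSV only receives the questions for CSV-capable datasets (100),
// while the other five formats receive all 201.
function totalCallsPerFormat(questionsPerFormat: number[], models: number): number {
  const perModel = questionsPerFormat.reduce((sum, q) => sum + q, 0);
  return perModel * models;
}

console.log(totalCallsPerFormat([201, 201, 201, 201, 201, 100], 4)); // 4420
```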
|
||||
|
||||
</details>
|
||||
|
||||
|
||||
@@ -9,7 +9,7 @@ Benchmarks measuring TOON's **token efficiency** and **retrieval accuracy** comp
|
||||
|
||||
```bash
|
||||
# Run token efficiency benchmark
|
||||
pnpm benchmark:token-efficiency
|
||||
pnpm benchmark:tokens
|
||||
|
||||
# Run retrieval accuracy benchmark (requires API keys)
|
||||
pnpm benchmark:accuracy
|
||||
@@ -25,7 +25,7 @@ Measures token count reduction across JSON, XML, YAML, CSV, and TOON:
|
||||
4. Calculate savings and generate report
|
||||
|
||||
```bash
|
||||
pnpm benchmark:token-efficiency
|
||||
pnpm benchmark:tokens
|
||||
```
|
||||
|
||||
Results are saved to `results/token-efficiency.md`.
|
||||
|
||||
@@ -3,7 +3,7 @@
|
||||
"type": "module",
|
||||
"private": true,
|
||||
"scripts": {
|
||||
"benchmark:token-efficiency": "tsx scripts/token-efficiency-benchmark.ts",
|
||||
"benchmark:tokens": "tsx scripts/token-efficiency-benchmark.ts",
|
||||
"benchmark:accuracy": "tsx --env-file=.env scripts/accuracy-benchmark.ts",
|
||||
"fetch:github-repos": "tsx scripts/fetch-github-repos.ts"
|
||||
},
|
||||
|
||||
File diff suppressed because it is too large
File diff suppressed because one or more lines are too long
@@ -1,108 +1,153 @@
|
||||
Benchmarks test LLM comprehension across different input formats using 154 data retrieval questions on 4 models.
|
||||
Benchmarks test LLM comprehension across different input formats using 201 data retrieval questions on 4 models.
|
||||
|
||||
<details>
|
||||
<summary><strong>View Dataset Catalog</strong></summary>
|
||||
|
||||
#### Dataset Catalog
|
||||
|
||||
| Dataset | Rows | Structure | CSV Support | Eligibility |
|
||||
| ------- | ---- | --------- | ----------- | ----------- |
|
||||
| Uniform employee records | 100 | uniform | ✓ | 100% |
|
||||
| E-commerce orders with nested structures | 50 | nested | ✗ | 33% |
|
||||
| Time-series analytics data | 60 | uniform | ✓ | 100% |
|
||||
| Top 100 GitHub repositories | 100 | uniform | ✓ | 100% |
|
||||
| Semi-uniform event logs | 75 | semi-uniform | ✗ | 50% |
|
||||
| Deeply nested configuration | 11 | deep | ✗ | 0% |
|
||||
|
||||
**Structure classes:**
|
||||
- **uniform**: All objects have identical fields with primitive values
|
||||
- **semi-uniform**: Mix of uniform and non-uniform structures
|
||||
- **nested**: Objects with nested structures (nested objects or arrays)
|
||||
- **deep**: Highly nested with minimal tabular eligibility
|
||||
|
||||
**CSV Support:** ✓ (supported), ✗ (not supported – would require lossy flattening)
|
||||
|
||||
**Eligibility:** Percentage of arrays that qualify for TOON's tabular format (uniform objects with primitive values)
|
||||
|
||||
</details>
|
||||
|
||||
#### Efficiency Ranking (Accuracy per 1K Tokens)
|
||||
|
||||
Each format's overall performance, balancing accuracy against token cost:
|
||||
|
||||
```
|
||||
toon ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓ 15.0 │ 70.1% acc │ 4,678 tokens
|
||||
csv ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓░ 14.3 │ 67.7% acc │ 4,745 tokens
|
||||
json-compact ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓░░░░░ 11.0 │ 65.3% acc │ 5,925 tokens
|
||||
yaml ▓▓▓▓▓▓▓▓▓▓▓▓▓░░░░░░░ 9.4 │ 66.7% acc │ 7,091 tokens
|
||||
json-pretty ▓▓▓▓▓▓▓▓▓▓░░░░░░░░░░ 7.5 │ 65.4% acc │ 8,713 tokens
|
||||
xml ▓▓▓▓▓▓▓▓▓░░░░░░░░░░░ 6.8 │ 67.2% acc │ 9,944 tokens
|
||||
TOON ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓ 15.6 │ 68.7% acc │ 4,389 tokens
|
||||
CSV ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓ 15.3 │ 62.3% acc │ 4,080 tokens
|
||||
JSON compact ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓░░░ 13.5 │ 67.2% acc │ 4,982 tokens
|
||||
YAML ▓▓▓▓▓▓▓▓▓▓▓▓▓▓░░░░░░ 11.2 │ 66.7% acc │ 5,976 tokens
|
||||
JSON ▓▓▓▓▓▓▓▓▓▓▓▓░░░░░░░░ 9.0 │ 65.7% acc │ 7,260 tokens
|
||||
XML ▓▓▓▓▓▓▓▓▓▓░░░░░░░░░░ 8.1 │ 66.8% acc │ 8,251 tokens
|
||||
```
|
||||
|
||||
TOON achieves **70.1%** accuracy (vs JSON's 65.4%) while using **46.3% fewer tokens**.
|
||||
TOON achieves **68.7%** accuracy (vs JSON's 65.7%) while using **39.5% fewer tokens**.
|
||||
|
||||
#### Per-Model Accuracy
|
||||
|
||||
Accuracy across **4 LLMs** on 154 data retrieval questions:
|
||||
Accuracy across 4 LLMs on 201 data retrieval questions:
|
||||
|
||||
```
|
||||
gpt-5-nano
|
||||
→ TOON ███████████████████░ 96.1% (148/154)
|
||||
CSV ██████████████████░░ 91.6% (141/154)
|
||||
YAML ██████████████████░░ 91.6% (141/154)
|
||||
JSON compact ██████████████████░░ 91.6% (141/154)
|
||||
XML █████████████████░░░ 87.0% (134/154)
|
||||
JSON █████████████████░░░ 86.4% (133/154)
|
||||
→ TOON ██████████████████░░ 88.6% (178/201)
|
||||
JSON compact ██████████████████░░ 88.1% (177/201)
|
||||
CSV ██████████████████░░ 88.0% (88/100)
|
||||
YAML █████████████████░░░ 84.6% (170/201)
|
||||
XML ████████████████░░░░ 81.6% (164/201)
|
||||
JSON ████████████████░░░░ 80.1% (161/201)
|
||||
|
||||
claude-haiku-4-5-20251001
|
||||
JSON ██████████░░░░░░░░░░ 50.0% (77/154)
|
||||
YAML ██████████░░░░░░░░░░ 49.4% (76/154)
|
||||
→ TOON ██████████░░░░░░░░░░ 48.7% (75/154)
|
||||
XML ██████████░░░░░░░░░░ 48.1% (74/154)
|
||||
CSV █████████░░░░░░░░░░░ 47.4% (73/154)
|
||||
JSON compact █████████░░░░░░░░░░░ 44.2% (68/154)
|
||||
YAML ██████████░░░░░░░░░░ 52.2% (105/201)
|
||||
→ TOON ██████████░░░░░░░░░░ 50.7% (102/201)
|
||||
JSON ██████████░░░░░░░░░░ 50.2% (101/201)
|
||||
JSON compact ██████████░░░░░░░░░░ 49.8% (100/201)
|
||||
XML ██████████░░░░░░░░░░ 49.3% (99/201)
|
||||
CSV ████████░░░░░░░░░░░░ 39.0% (39/100)
|
||||
|
||||
gemini-2.5-flash
|
||||
CSV ██████████████████░░ 87.7% (135/154)
|
||||
XML ██████████████████░░ 87.7% (135/154)
|
||||
→ TOON █████████████████░░░ 86.4% (133/154)
|
||||
YAML ████████████████░░░░ 79.9% (123/154)
|
||||
JSON compact ████████████████░░░░ 79.9% (123/154)
|
||||
JSON ███████████████░░░░░ 76.6% (118/154)
|
||||
XML █████████████████░░░ 86.1% (173/201)
|
||||
→ TOON █████████████████░░░ 84.1% (169/201)
|
||||
CSV ████████████████░░░░ 82.0% (82/100)
|
||||
JSON compact ████████████████░░░░ 81.1% (163/201)
|
||||
YAML ████████████████░░░░ 81.1% (163/201)
|
||||
JSON ████████████████░░░░ 81.1% (163/201)
|
||||
|
||||
grok-4-fast-non-reasoning
|
||||
→ TOON ██████████░░░░░░░░░░ 49.4% (76/154)
|
||||
JSON ██████████░░░░░░░░░░ 48.7% (75/154)
|
||||
XML █████████░░░░░░░░░░░ 46.1% (71/154)
|
||||
YAML █████████░░░░░░░░░░░ 46.1% (71/154)
|
||||
JSON compact █████████░░░░░░░░░░░ 45.5% (70/154)
|
||||
CSV █████████░░░░░░░░░░░ 44.2% (68/154)
|
||||
→ TOON ██████████░░░░░░░░░░ 51.2% (103/201)
|
||||
JSON ██████████░░░░░░░░░░ 51.2% (103/201)
|
||||
XML ██████████░░░░░░░░░░ 50.2% (101/201)
|
||||
JSON compact ██████████░░░░░░░░░░ 49.8% (100/201)
|
||||
YAML ██████████░░░░░░░░░░ 48.8% (98/201)
|
||||
CSV ████████░░░░░░░░░░░░ 40.0% (40/100)
|
||||
```
|
||||
|
||||
**Key tradeoff:** TOON achieves **70.1% accuracy** (vs JSON's 65.4%) while using **46.3% fewer tokens** on these datasets.
|
||||
**Key tradeoff:** TOON achieves **68.7% accuracy** (vs JSON's 65.7%) while using **39.5% fewer tokens** on these datasets.
|
||||
|
||||
<details>
|
||||
<summary><strong>Performance by dataset and model</strong></summary>
|
||||
|
||||
#### Performance by Dataset
|
||||
|
||||
##### Uniform employee records (TOON optimal format)
|
||||
##### Uniform employee records
|
||||
|
||||
| Format | Accuracy | Tokens | Correct/Total |
|
||||
| ------ | -------- | ------ | ------------- |
|
||||
| `csv` | 65.5% | 2,337 | 131/200 |
|
||||
| `toon` | 67.5% | 2,483 | 135/200 |
|
||||
| `json-compact` | 65.5% | 3,943 | 131/200 |
|
||||
| `yaml` | 68.5% | 4,969 | 137/200 |
|
||||
| `xml` | 69.5% | 7,314 | 139/200 |
|
||||
| `json-pretty` | 64.5% | 6,347 | 129/200 |
|
||||
| `toon` | 65.6% | 2,483 | 105/160 |
|
||||
| `csv` | 62.5% | 2,337 | 100/160 |
|
||||
| `json-compact` | 66.3% | 3,943 | 106/160 |
|
||||
| `yaml` | 63.7% | 4,969 | 102/160 |
|
||||
| `xml` | 67.5% | 7,314 | 108/160 |
|
||||
| `json-pretty` | 62.5% | 6,347 | 100/160 |
|
||||
|
||||
##### E-commerce orders with nested structures
|
||||
|
||||
| Format | Accuracy | Tokens | Correct/Total |
|
||||
| ------ | -------- | ------ | ------------- |
|
||||
| `toon` | 78.8% | 5,967 | 126/160 |
|
||||
| `csv` | 76.3% | 6,735 | 122/160 |
|
||||
| `json-compact` | 70.6% | 5,962 | 113/160 |
|
||||
| `yaml` | 72.5% | 7,328 | 116/160 |
|
||||
| `json-pretty` | 76.9% | 9,694 | 123/160 |
|
||||
| `xml` | 73.1% | 10,992 | 117/160 |
|
||||
| `toon` | 75.6% | 7,197 | 121/160 |
|
||||
| `json-compact` | 70.6% | 6,784 | 113/160 |
|
||||
| `yaml` | 71.9% | 8,334 | 115/160 |
|
||||
| `json-pretty` | 68.8% | 10,700 | 110/160 |
|
||||
| `xml` | 71.9% | 12,013 | 115/160 |
|
||||
|
||||
##### Time-series analytics data
|
||||
|
||||
| Format | Accuracy | Tokens | Correct/Total |
|
||||
| ------ | -------- | ------ | ------------- |
|
||||
| `toon` | 68.4% | 1,515 | 93/136 |
|
||||
| `csv` | 65.4% | 1,393 | 89/136 |
|
||||
| `json-compact` | 64.7% | 2,341 | 88/136 |
|
||||
| `yaml` | 66.2% | 2,938 | 90/136 |
|
||||
| `json-pretty` | 64.7% | 3,665 | 88/136 |
|
||||
| `xml` | 66.9% | 4,376 | 91/136 |
|
||||
| `csv` | 63.8% | 1,391 | 74/116 |
|
||||
| `toon` | 66.4% | 1,513 | 77/116 |
|
||||
| `json-compact` | 61.2% | 2,339 | 71/116 |
|
||||
| `yaml` | 65.5% | 2,936 | 76/116 |
|
||||
| `json-pretty` | 64.7% | 3,663 | 75/116 |
|
||||
| `xml` | 65.5% | 4,374 | 76/116 |
|
||||
|
||||
##### Top 100 GitHub repositories
|
||||
|
||||
| Format | Accuracy | Tokens | Correct/Total |
|
||||
| ------ | -------- | ------ | ------------- |
|
||||
| `toon` | 65.0% | 8,745 | 78/120 |
|
||||
| `csv` | 62.5% | 8,513 | 75/120 |
|
||||
| `json-compact` | 58.3% | 11,455 | 70/120 |
|
||||
| `yaml` | 56.7% | 13,129 | 68/120 |
|
||||
| `xml` | 55.8% | 17,095 | 67/120 |
|
||||
| `json-pretty` | 52.5% | 15,145 | 63/120 |
|
||||
| `toon` | 63.7% | 8,745 | 79/124 |
|
||||
| `csv` | 60.5% | 8,513 | 75/124 |
|
||||
| `json-compact` | 56.5% | 11,455 | 70/124 |
|
||||
| `yaml` | 53.2% | 13,129 | 66/124 |
|
||||
| `json-pretty` | 53.2% | 15,145 | 66/124 |
|
||||
| `xml` | 53.2% | 17,095 | 66/124 |
|
||||
|
||||
##### Semi-uniform event logs
|
||||
|
||||
| Format | Accuracy | Tokens | Correct/Total |
| ------ | -------- | ------ | ------------- |
| `json-compact` | 55.0% | 4,809 | 66/120 |
| `yaml` | 52.5% | 5,814 | 63/120 |
| `json-pretty` | 52.5% | 6,784 | 63/120 |
| `toon` | 45.8% | 5,764 | 55/120 |
| `xml` | 50.8% | 7,699 | 61/120 |
|
||||
|
||||
##### Deeply nested configuration
|
||||
|
||||
| Format | Accuracy | Tokens | Correct/Total |
| ------ | -------- | ------ | ------------- |
| `json-compact` | 91.9% | 564 | 114/124 |
| `toon` | 92.7% | 631 | 115/124 |
| `yaml` | 91.9% | 673 | 114/124 |
| `json-pretty` | 91.9% | 919 | 114/124 |
| `xml` | 89.5% | 1,008 | 111/124 |
|
||||
|
||||
#### Performance by Model
|
||||
|
||||
@@ -110,45 +155,45 @@ grok-4-fast-non-reasoning
|
||||
|
||||
| Format | Accuracy | Correct/Total |
|
||||
| ------ | -------- | ------------- |
|
||||
| `toon` | 96.1% | 148/154 |
|
||||
| `csv` | 91.6% | 141/154 |
|
||||
| `yaml` | 91.6% | 141/154 |
|
||||
| `json-compact` | 91.6% | 141/154 |
|
||||
| `xml` | 87.0% | 134/154 |
|
||||
| `json-pretty` | 86.4% | 133/154 |
|
||||
| `toon` | 88.6% | 178/201 |
|
||||
| `json-compact` | 88.1% | 177/201 |
|
||||
| `csv` | 88.0% | 88/100 |
|
||||
| `yaml` | 84.6% | 170/201 |
|
||||
| `xml` | 81.6% | 164/201 |
|
||||
| `json-pretty` | 80.1% | 161/201 |
|
||||
|
||||
##### claude-haiku-4-5-20251001
|
||||
|
||||
| Format | Accuracy | Correct/Total |
|
||||
| ------ | -------- | ------------- |
|
||||
| `json-pretty` | 50.0% | 77/154 |
|
||||
| `yaml` | 49.4% | 76/154 |
|
||||
| `toon` | 48.7% | 75/154 |
|
||||
| `xml` | 48.1% | 74/154 |
|
||||
| `csv` | 47.4% | 73/154 |
|
||||
| `json-compact` | 44.2% | 68/154 |
|
||||
| `yaml` | 52.2% | 105/201 |
|
||||
| `toon` | 50.7% | 102/201 |
|
||||
| `json-pretty` | 50.2% | 101/201 |
|
||||
| `json-compact` | 49.8% | 100/201 |
|
||||
| `xml` | 49.3% | 99/201 |
|
||||
| `csv` | 39.0% | 39/100 |
|
||||
|
||||
##### gemini-2.5-flash
|
||||
|
||||
| Format | Accuracy | Correct/Total |
|
||||
| ------ | -------- | ------------- |
|
||||
| `csv` | 87.7% | 135/154 |
|
||||
| `xml` | 87.7% | 135/154 |
|
||||
| `toon` | 86.4% | 133/154 |
|
||||
| `yaml` | 79.9% | 123/154 |
|
||||
| `json-compact` | 79.9% | 123/154 |
|
||||
| `json-pretty` | 76.6% | 118/154 |
|
||||
| `xml` | 86.1% | 173/201 |
|
||||
| `toon` | 84.1% | 169/201 |
|
||||
| `csv` | 82.0% | 82/100 |
|
||||
| `json-compact` | 81.1% | 163/201 |
|
||||
| `yaml` | 81.1% | 163/201 |
|
||||
| `json-pretty` | 81.1% | 163/201 |
|
||||
|
||||
##### grok-4-fast-non-reasoning
|
||||
|
||||
| Format | Accuracy | Correct/Total |
|
||||
| ------ | -------- | ------------- |
|
||||
| `toon` | 49.4% | 76/154 |
|
||||
| `json-pretty` | 48.7% | 75/154 |
|
||||
| `xml` | 46.1% | 71/154 |
|
||||
| `yaml` | 46.1% | 71/154 |
|
||||
| `json-compact` | 45.5% | 70/154 |
|
||||
| `csv` | 44.2% | 68/154 |
|
||||
| `toon` | 51.2% | 103/201 |
|
||||
| `json-pretty` | 51.2% | 103/201 |
|
||||
| `xml` | 50.2% | 101/201 |
|
||||
| `json-compact` | 49.8% | 100/201 |
|
||||
| `yaml` | 48.8% | 98/201 |
|
||||
| `csv` | 40.0% | 40/100 |
|
||||
|
||||
</details>
|
||||
|
||||
@@ -161,34 +206,36 @@ This benchmark tests **LLM comprehension and data retrieval accuracy** across di
|
||||
|
||||
#### Datasets Tested
|
||||
|
||||
Four datasets designed to test different structural patterns (all contain arrays of uniform objects, TOON's optimal format):
|
||||
Six datasets designed to test different structural patterns:
|
||||
|
||||
1. **Tabular** (100 employee records): Uniform objects with identical fields – optimal for TOON's tabular format.
2. **Nested** (50 e-commerce orders): Complex structures with nested customer objects and item arrays.
3. **Analytics** (60 days of metrics): Time-series data with dates and numeric values.
4. **GitHub** (100 repositories): Real-world data from top GitHub repos by stars.
5. **Event Logs** (75 logs): Semi-uniform data with ~50% flat logs and ~50% with nested error objects.
6. **Nested Config** (1 configuration): Deeply nested configuration with minimal tabular eligibility.
|
||||
|
||||
#### Question Types
|
||||
|
||||
154 questions are generated dynamically across three categories:
|
||||
201 questions are generated dynamically across three categories:
|
||||
|
||||
- **Field retrieval (40%)**: Direct value lookups or values that can be read straight off a record (including booleans and simple counts such as array lengths)
|
||||
- **Field retrieval (36%)**: Direct value lookups or values that can be read straight off a record (including booleans and simple counts such as array lengths)
|
||||
- Example: "What is Alice's salary?" → `75000`
|
||||
- Example: "How many items are in order ORD-0042?" → `3`
|
||||
- Example: "What is the customer name for order ORD-0042?" → `John Doe`
|
||||
|
||||
- **Aggregation (32%)**: Dataset-level totals and averages plus single-condition filters (counts, sums, min/max comparisons)
|
||||
- **Aggregation (37%)**: Dataset-level totals and averages plus single-condition filters (counts, sums, min/max comparisons)
|
||||
- Example: "How many employees work in Engineering?" → `17`
|
||||
- Example: "What is the total revenue across all orders?" → `45123.50`
|
||||
- Example: "How many employees have salary > 80000?" → `23`
|
||||
|
||||
- **Filtering (28%)**: Multi-condition queries requiring compound logic (AND constraints across fields)
|
||||
- **Filtering (27%)**: Multi-condition queries requiring compound logic (AND constraints across fields)
|
||||
- Example: "How many employees in Sales have salary > 80000?" → `5`
|
||||
- Example: "How many active employees have more than 10 years of experience?" → `8`
|
||||
|
||||
#### Evaluation Process
|
||||
|
||||
1. **Format conversion**: Each dataset is converted to all 6 formats (TOON, CSV, XML, YAML, JSON, JSON compact).
|
||||
1. **Format conversion**: Each dataset is converted to all 6 formats (TOON, JSON compact, XML, YAML, JSON, CSV).
|
||||
2. **Query LLM**: Each model receives formatted data + question in a prompt and extracts the answer.
|
||||
3. **Validate with LLM-as-judge**: `gpt-5-nano` validates if the answer is semantically correct (e.g., `50000` = `$50,000`, `Engineering` = `engineering`, `2025-01-01` = `January 1, 2025`).
|
||||
|
||||
@@ -197,6 +244,6 @@ Four datasets designed to test different structural patterns (all contain arrays
|
||||
- **Models tested**: `gpt-5-nano`, `claude-haiku-4-5-20251001`, `gemini-2.5-flash`, `grok-4-fast-non-reasoning`
|
||||
- **Token counting**: Using `gpt-tokenizer` with `o200k_base` encoding (GPT-5 tokenizer)
|
||||
- **Temperature**: Not set (models use their defaults)
|
||||
- **Total evaluations**: 154 questions × 6 formats × 4 models = 3,696 LLM calls
|
||||
- **Total evaluations**: 201 questions × 6 formats × 4 models = 4,824 LLM calls
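
The evaluation count in the last bullet is a plain product of questions, formats, and models. A minimal sketch of that arithmetic follows; the function name `totalEvaluations` is illustrative and not taken from the benchmark code.

```typescript
// Number of LLM calls if every question is asked in every format on every model.
function totalEvaluations(questions: number, formats: number, models: number): number {
  return questions * formats * models
}

// Figures from the methodology list above.
console.log(totalEvaluations(201, 6, 4).toLocaleString('en-US'))
```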
|
||||
|
||||
</details>
|
||||
|
||||
@@ -1,79 +1,81 @@
|
||||
|
||||
## Mixed-Structure Track
|
||||
#### Mixed-Structure Track
|
||||
|
||||
Datasets with nested or semi-uniform structures. CSV excluded as it cannot properly represent these structures.
|
||||
|
||||
```
|
||||
🛒 E-commerce orders with nested structures [eligibility: 33%]
|
||||
toon ▓▓▓▓▓▓▓▓▓▓▓▓░░░░░░░░ 58,528 tokens
|
||||
vs JSON (−37.9%) 94,207
|
||||
vs JSON compact (+0.9%) 57,979
|
||||
vs YAML (−17.8%) 71,223
|
||||
vs XML (−45.2%) 106,720
|
||||
🛒 E-commerce orders with nested structures ┊ Tabular: 33%
|
||||
│
|
||||
TOON █████████████░░░░░░░ 72,743 tokens
|
||||
├─ vs JSON (−33.1%) 108,731 tokens
|
||||
├─ vs JSON compact (+5.5%) 68,936 tokens
|
||||
├─ vs YAML (−14.1%) 84,724 tokens
|
||||
└─ vs XML (−40.5%) 122,313 tokens
|
||||
|
||||
🧾 Semi-uniform event logs [eligibility: 50%]
|
||||
toon ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓░░░ 154,419 tokens
|
||||
vs JSON (−15.0%) 181,592
|
||||
vs JSON compact (+19.9%) 128,836
|
||||
vs YAML (−0.9%) 155,749
|
||||
vs XML (−25.1%) 206,271
|
||||
🧾 Semi-uniform event logs ┊ Tabular: 50%
|
||||
│
|
||||
TOON █████████████████░░░ 153,223 tokens
|
||||
├─ vs JSON (−15.0%) 180,196 tokens
|
||||
├─ vs JSON compact (+19.9%) 127,740 tokens
|
||||
├─ vs YAML (−0.8%) 154,514 tokens
|
||||
└─ vs XML (−25.2%) 204,800 tokens
|
||||
|
||||
🧩 Deeply nested configuration [eligibility: 0%]
|
||||
toon ▓▓▓▓▓▓▓▓▓▓▓▓▓▓░░░░░░ 630 tokens
|
||||
vs JSON (−31.4%) 918
|
||||
vs JSON compact (+11.9%) 563
|
||||
vs YAML (−6.4%) 673
|
||||
vs XML (−37.4%) 1,007
|
||||
🧩 Deeply nested configuration ┊ Tabular: 0%
|
||||
│
|
||||
TOON ██████████████░░░░░░ 631 tokens
|
||||
├─ vs JSON (−31.3%) 919 tokens
|
||||
├─ vs JSON compact (+11.9%) 564 tokens
|
||||
├─ vs YAML (−6.2%) 673 tokens
|
||||
└─ vs XML (−37.4%) 1,008 tokens
|
||||
|
||||
─────────────────────────────────────────────────────────────────────────────────
|
||||
Total
|
||||
toon ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓░░░░░ 213,577 tokens
|
||||
vs JSON (−22.8%) 276,717
|
||||
vs JSON compact (+14.0%) 187,378
|
||||
vs YAML (−6.2%) 227,645
|
||||
vs XML (−32.0%) 313,998
|
||||
──────────────────────────────────── Total ────────────────────────────────────
|
||||
TOON ████████████████░░░░ 226,597 tokens
|
||||
├─ vs JSON (−21.8%) 289,846 tokens
|
||||
├─ vs JSON compact (+14.9%) 197,240 tokens
|
||||
├─ vs YAML (−5.5%) 239,911 tokens
|
||||
└─ vs XML (−30.9%) 328,121 tokens
|
||||
```
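
The percentages in the chart above appear to be TOON's savings relative to each comparison format, expressed as a share of that format's token count. The sketch below reproduces two of the listed figures under that assumption; `savingsPercent` and `formatSigned` are illustrative names, and the formula is inferred from the numbers rather than copied from the benchmark source.

```typescript
// Savings of TOON relative to another format, as a percentage of that format's tokens.
// Positive: TOON uses fewer tokens. Negative: TOON uses more.
function savingsPercent(toonTokens: number, otherTokens: number): number {
  return ((otherTokens - toonTokens) / otherTokens) * 100
}

// Same sign convention as the chart: "−" marks savings, "+" marks overhead.
function formatSigned(percent: number): string {
  return percent >= 0
    ? `−${percent.toFixed(1)}%`
    : `+${Math.abs(percent).toFixed(1)}%`
}

// E-commerce orders: TOON 72,743 tokens vs JSON 108,731 and JSON compact 68,936.
const toonTokens = 72743
console.log(`vs JSON (${formatSigned(savingsPercent(toonTokens, 108731))})`)
console.log(`vs JSON compact (${formatSigned(savingsPercent(toonTokens, 68936))})`)
```

Dividing by the comparison format's tokens (rather than TOON's) keeps the figure readable as "how much of that format's cost TOON removes".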
|
||||
|
||||
## Flat-Only Track
|
||||
#### Flat-Only Track
|
||||
|
||||
Datasets with flat tabular structures where CSV is applicable.
|
||||
|
||||
```
|
||||
👥 Uniform employee records (TOON optimal format) [eligibility: 100%]
|
||||
csv ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓░ 46,968 tokens
|
||||
toon ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓ 49,841 tokens (+5.8% vs CSV)
|
||||
vs JSON (−60.7%) 126,886
|
||||
vs JSON compact (−36.8%) 78,882
|
||||
vs YAML (−50.0%) 99,743
|
||||
vs XML (−66.0%) 146,465
|
||||
👥 Uniform employee records ┊ Tabular: 100%
|
||||
│
|
||||
CSV ███████████████████░ 46,956 tokens
|
||||
TOON ████████████████████ 49,827 tokens (+6.1% vs CSV)
|
||||
├─ vs JSON (−60.7%) 126,854 tokens
|
||||
├─ vs JSON compact (−36.8%) 78,850 tokens
|
||||
├─ vs YAML (−50.0%) 99,701 tokens
|
||||
└─ vs XML (−66.0%) 146,440 tokens
|
||||
|
||||
📈 Time-series analytics data [eligibility: 100%]
|
||||
csv ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓░░ 8,382 tokens
|
||||
toon ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓ 9,114 tokens (+8.0% vs CSV)
|
||||
vs JSON (−59.0%) 22,244
|
||||
vs JSON compact (−35.9%) 14,210
|
||||
vs YAML (−49.0%) 17,857
|
||||
vs XML (−65.8%) 26,615
|
||||
📈 Time-series analytics data ┊ Tabular: 100%
|
||||
│
|
||||
CSV ██████████████████░░ 8,396 tokens
|
||||
TOON ████████████████████ 9,128 tokens (+8.7% vs CSV)
|
||||
├─ vs JSON (−59.0%) 22,258 tokens
|
||||
├─ vs JSON compact (−35.8%) 14,224 tokens
|
||||
├─ vs YAML (−48.9%) 17,871 tokens
|
||||
└─ vs XML (−65.7%) 26,629 tokens
|
||||
|
||||
⭐ Top 100 GitHub repositories [eligibility: 100%]
|
||||
csv ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓░ 8,513 tokens
|
||||
toon ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓ 8,745 tokens (+2.7% vs CSV)
|
||||
vs JSON (−42.3%) 15,145
|
||||
vs JSON compact (−23.7%) 11,455
|
||||
vs YAML (−33.4%) 13,129
|
||||
vs XML (−48.8%) 17,095
|
||||
⭐ Top 100 GitHub repositories ┊ Tabular: 100%
|
||||
│
|
||||
CSV ███████████████████░ 8,513 tokens
|
||||
TOON ████████████████████ 8,745 tokens (+2.7% vs CSV)
|
||||
├─ vs JSON (−42.3%) 15,145 tokens
|
||||
├─ vs JSON compact (−23.7%) 11,455 tokens
|
||||
├─ vs YAML (−33.4%) 13,129 tokens
|
||||
└─ vs XML (−48.8%) 17,095 tokens
|
||||
|
||||
─────────────────────────────────────────────────────────────────────────────────
|
||||
Total
|
||||
csv ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓░ 63,863 tokens
|
||||
toon ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓ 67,700 tokens (+5.7% vs CSV)
|
||||
vs JSON (−58.8%) 164,275
|
||||
vs JSON compact (−35.2%) 104,547
|
||||
vs YAML (−48.2%) 130,729
|
||||
vs XML (−64.4%) 190,175
|
||||
──────────────────────────────────── Total ────────────────────────────────────
|
||||
CSV ███████████████████░ 63,865 tokens
|
||||
TOON ████████████████████ 67,700 tokens (+6.0% vs CSV)
|
||||
├─ vs JSON (−58.8%) 164,257 tokens
|
||||
├─ vs JSON compact (−35.2%) 104,529 tokens
|
||||
├─ vs YAML (−48.2%) 130,701 tokens
|
||||
└─ vs XML (−64.4%) 190,164 tokens
|
||||
```
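
In the flat-only track, the "vs CSV" figure measures TOON's extra tokens as a share of the CSV count. A minimal sketch of that calculation, using the totals shown above; the helper name `overheadVsCsv` is hypothetical.

```typescript
// TOON's token overhead relative to CSV, formatted like the chart annotation.
function overheadVsCsv(toonTokens: number, csvTokens: number): string {
  const percent = ((toonTokens - csvTokens) / csvTokens) * 100
  // Negative values already carry their own "-" from toFixed.
  const sign = percent >= 0 ? '+' : ''
  return `(${sign}${percent.toFixed(1)}% vs CSV)`
}

console.log(overheadVsCsv(49827, 46956)) // uniform employee records
console.log(overheadVsCsv(67700, 63865)) // flat-only total
```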
|
||||
|
||||
|
||||
<details>
|
||||
<summary><strong>View detailed examples</strong></summary>
|
||||
|
||||
@@ -81,64 +83,64 @@ toon ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓
|
||||
|
||||
**Savings:** 13,130 tokens (59.0% reduction vs JSON)
|
||||
|
||||
**JSON** (22,244 tokens):
|
||||
**JSON** (22,258 tokens):
|
||||
|
||||
```json
|
||||
{
|
||||
"metrics": [
|
||||
{
|
||||
"date": "2025-01-01",
|
||||
"views": 4324,
|
||||
"clicks": 146,
|
||||
"conversions": 21,
|
||||
"revenue": 3834.57,
|
||||
"bounceRate": 0.4
|
||||
"views": 7708,
|
||||
"clicks": 595,
|
||||
"conversions": 69,
|
||||
"revenue": 15369.93,
|
||||
"bounceRate": 0.35
|
||||
},
|
||||
{
|
||||
"date": "2025-01-02",
|
||||
"views": 6248,
|
||||
"clicks": 407,
|
||||
"conversions": 22,
|
||||
"revenue": 2936.12,
|
||||
"bounceRate": 0.62
|
||||
"views": 5894,
|
||||
"clicks": 381,
|
||||
"conversions": 21,
|
||||
"revenue": 2112.12,
|
||||
"bounceRate": 0.3
|
||||
},
|
||||
{
|
||||
"date": "2025-01-03",
|
||||
"views": 7382,
|
||||
"clicks": 270,
|
||||
"conversions": 24,
|
||||
"revenue": 6825.19,
|
||||
"bounceRate": 0.7
|
||||
"views": 6835,
|
||||
"clicks": 422,
|
||||
"conversions": 35,
|
||||
"revenue": 4525.73,
|
||||
"bounceRate": 0.5
|
||||
},
|
||||
{
|
||||
"date": "2025-01-04",
|
||||
"views": 4586,
|
||||
"clicks": 267,
|
||||
"conversions": 24,
|
||||
"revenue": 2391.11,
|
||||
"bounceRate": 0.64
|
||||
"views": 5325,
|
||||
"clicks": 305,
|
||||
"conversions": 22,
|
||||
"revenue": 2445.3,
|
||||
"bounceRate": 0.44
|
||||
},
|
||||
{
|
||||
"date": "2025-01-05",
|
||||
"views": 6171,
|
||||
"clicks": 227,
|
||||
"conversions": 12,
|
||||
"revenue": 3430.1,
|
||||
"bounceRate": 0.39
|
||||
"views": 2974,
|
||||
"clicks": 61,
|
||||
"conversions": 6,
|
||||
"revenue": 956.57,
|
||||
"bounceRate": 0.47
|
||||
}
|
||||
]
|
||||
}
|
||||
```
|
||||
|
||||
**TOON** (9,114 tokens):
|
||||
**TOON** (9,128 tokens):
|
||||
|
||||
```
|
||||
metrics[5]{date,views,clicks,conversions,revenue,bounceRate}:
|
||||
2025-01-01,4324,146,21,3834.57,0.4
|
||||
2025-01-02,6248,407,22,2936.12,0.62
|
||||
2025-01-03,7382,270,24,6825.19,0.7
|
||||
2025-01-04,4586,267,24,2391.11,0.64
|
||||
2025-01-05,6171,227,12,3430.1,0.39
|
||||
2025-01-01,7708,595,69,15369.93,0.35
|
||||
2025-01-02,5894,381,21,2112.12,0.3
|
||||
2025-01-03,6835,422,35,4525.73,0.5
|
||||
2025-01-04,5325,305,22,2445.3,0.44
|
||||
2025-01-05,2974,61,6,956.57,0.47
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
@@ -32,13 +32,9 @@ const DATASET_ICONS: Record<string, string> = {
|
||||
|
||||
const COMPARISON_FORMAT_ORDER = ['json-pretty', 'json-compact', 'yaml', 'xml'] as const
|
||||
|
||||
const PROGRESS_BAR_CONFIG = { filled: '▓', empty: '░' } as const
|
||||
const PROGRESS_BAR_WIDTH = 20
|
||||
const TOKEN_PADDING = 7
|
||||
const LABEL_PADDING = 60
|
||||
const COMPARISON_LABEL_PADDING = 30
|
||||
|
||||
const SEPARATOR = '─────────────────────────────────────────────────────────────────────────────────'
|
||||
const DEFAULT_DATASET_ICON = '📊'
|
||||
|
||||
const DETAILED_EXAMPLE_DATASETS = ['github', 'analytics'] as const
|
||||
@@ -51,14 +47,14 @@ prompts.intro('Token Efficiency Benchmark')
|
||||
/**
|
||||
* Format a comparison line showing savings vs TOON
|
||||
*/
|
||||
function formatComparisonLine(format: FormatMetrics): string {
|
||||
function formatComparisonLine(format: FormatMetrics, isLast: boolean = false): string {
|
||||
const label = FORMATTER_DISPLAY_NAMES[format.name] || format.name.toUpperCase()
|
||||
const signedPercent = format.savingsPercent >= 0
|
||||
? `−${format.savingsPercent.toFixed(1)}%`
|
||||
: `+${Math.abs(format.savingsPercent).toFixed(1)}%`
|
||||
const labelWithSavings = `vs ${label} (${signedPercent})`.padEnd(COMPARISON_LABEL_PADDING)
|
||||
const connector = isLast ? '└─' : '├─'
|
||||
const tokenStr = format.tokens.toLocaleString('en-US').padStart(TOKEN_PADDING)
|
||||
return ` ${labelWithSavings}${tokenStr}`
|
||||
return `${connector} vs ${label.padEnd(13)} ${`(${signedPercent})`.padEnd(20)} ${tokenStr} tokens`
|
||||
}
|
||||
|
||||
/**
|
||||
@@ -91,36 +87,39 @@ function generateTotalLines(
|
||||
totals: { name: string, tokens: number, savingsPercent: number }[],
|
||||
baselineFormat?: { name: string, tokens: number },
|
||||
) {
|
||||
const lines: string[] = ['Total ']
|
||||
const separatorHalf = '─'.repeat(36)
|
||||
const lines: string[] = [`${separatorHalf} Total ${separatorHalf}`]
|
||||
|
||||
if (baselineFormat) {
|
||||
// Flat-only track with CSV baseline
|
||||
const csvPercentage = Math.min(100, (baselineFormat.tokens / totalToonTokens) * 100)
|
||||
const csvBar = createProgressBar(csvPercentage, 100, PROGRESS_BAR_WIDTH, PROGRESS_BAR_CONFIG)
|
||||
const csvBar = createProgressBar(csvPercentage, 100, PROGRESS_BAR_WIDTH)
|
||||
const csvStr = baselineFormat.tokens.toLocaleString('en-US').padStart(TOKEN_PADDING)
|
||||
lines.push(`csv ${csvBar} ${csvStr} tokens`)
|
||||
lines.push(` CSV ${csvBar} ${csvStr} tokens`)
|
||||
|
||||
const overheadPercent = ((totalToonTokens - baselineFormat.tokens) / baselineFormat.tokens) * 100
|
||||
const toonBar = createProgressBar(100, 100, PROGRESS_BAR_WIDTH, PROGRESS_BAR_CONFIG)
|
||||
const toonBar = createProgressBar(100, 100, PROGRESS_BAR_WIDTH)
|
||||
const toonStr = totalToonTokens.toLocaleString('en-US').padStart(TOKEN_PADDING)
|
||||
lines.push(`toon ${toonBar} ${toonStr} tokens (+${overheadPercent.toFixed(1)}% vs CSV)`)
|
||||
lines.push(` TOON ${toonBar} ${toonStr} tokens (+${overheadPercent.toFixed(1)}% vs CSV)`)
|
||||
}
|
||||
else {
|
||||
// Mixed-structure track
|
||||
const totalPercentage = Math.min(100, (totalToonTokens / totals[0]!.tokens) * 100)
|
||||
const totalBar = createProgressBar(totalPercentage, 100, PROGRESS_BAR_WIDTH, PROGRESS_BAR_CONFIG)
|
||||
const totalBar = createProgressBar(totalPercentage, 100, PROGRESS_BAR_WIDTH)
|
||||
const toonStr = totalToonTokens.toLocaleString('en-US').padStart(TOKEN_PADDING)
|
||||
lines.push(`toon ${totalBar} ${toonStr} tokens`)
|
||||
lines.push(` TOON ${totalBar} ${toonStr} tokens`)
|
||||
}
|
||||
|
||||
// Add comparison lines
|
||||
for (const format of totals) {
|
||||
lines.push(formatComparisonLine({
|
||||
for (let i = 0; i < totals.length; i++) {
|
||||
const format = totals[i]!
|
||||
const isLast = i === totals.length - 1
|
||||
lines.push(` ${formatComparisonLine({
|
||||
name: format.name,
|
||||
tokens: format.tokens,
|
||||
savings: 0, // Not used in this context
|
||||
savingsPercent: format.savingsPercent,
|
||||
}))
|
||||
}, isLast)}`)
|
||||
}
|
||||
|
||||
return lines.join('\n')
|
||||
@@ -136,22 +135,25 @@ function generateDatasetChart(result: BenchmarkResult): string {
|
||||
|
||||
const emoji = DATASET_ICONS[dataset.name] || DEFAULT_DATASET_ICON
|
||||
const eligibility = dataset.metadata.tabularEligibility
|
||||
const name = `${dataset.description} [eligibility: ${eligibility}%]`
|
||||
const name = dataset.description
|
||||
|
||||
const percentage = Math.min(100, 100 - jsonPretty.savingsPercent)
|
||||
const bar = createProgressBar(percentage, 100, PROGRESS_BAR_WIDTH, PROGRESS_BAR_CONFIG)
|
||||
const bar = createProgressBar(percentage, 100, PROGRESS_BAR_WIDTH)
|
||||
const toonStr = toon.tokens.toLocaleString('en-US')
|
||||
|
||||
const line1 = `${emoji} ${name.padEnd(LABEL_PADDING)}\ntoon ${bar} ${toonStr.padStart(TOKEN_PADDING)} tokens`
|
||||
const line1 = `${emoji} ${name} ┊ Tabular: ${eligibility}%`
|
||||
const line2 = ` │`
|
||||
const line3 = ` TOON ${bar} ${toonStr.padStart(TOKEN_PADDING)} tokens`
|
||||
|
||||
const comparisonLines = COMPARISON_FORMAT_ORDER.map((formatName) => {
|
||||
const comparisonLines = COMPARISON_FORMAT_ORDER.map((formatName, index, array) => {
|
||||
const format = formats.find(f => f.name === formatName)
|
||||
if (!format)
|
||||
return null
|
||||
return undefined
|
||||
|
||||
return formatComparisonLine(format)
|
||||
return ` ${formatComparisonLine(format, index === array.length - 1)}`
|
||||
}).filter(Boolean)
|
||||
|
||||
return [line1, ...comparisonLines].join('\n')
|
||||
return [line1, line2, line3, ...comparisonLines].join('\n')
|
||||
}
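
For readers without the helper at hand, the bar rendering used by these charts can be approximated as below. This is a sketch under assumptions: `renderBar` is a hypothetical stand-in for `createProgressBar`, whose implementation is not part of this diff, and the default glyphs and the rounding rule are inferred from the rendered charts.

```typescript
// Render a fixed-width bar where `value / max` of the cells are filled.
function renderBar(
  value: number,
  max: number,
  width: number,
  filled: string = '█',
  empty: string = '░',
): string {
  // Clamp to [0, 1] so out-of-range inputs never produce a negative repeat count.
  const ratio = Math.max(0, Math.min(1, value / max))
  const filledCount = Math.round(ratio * width)
  return filled.repeat(filledCount) + empty.repeat(width - filledCount)
}

// TOON at 72,743 tokens against JSON at 108,731 tokens, on a 20-cell bar.
console.log(renderBar((72743 / 108731) * 100, 100, 20))
```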
|
||||
|
||||
const results: BenchmarkResult[] = []
|
||||
@@ -167,8 +169,8 @@ for (const dataset of TOKEN_EFFICIENCY_DATASETS) {
|
||||
if (formatName === 'csv' && !supportsCSV(dataset))
|
||||
continue
|
||||
|
||||
const formattedString = formatter(dataset.data)
|
||||
const tokens = tokenize(formattedString)
|
||||
const formattedData = formatter(dataset.data)
|
||||
const tokens = tokenize(formattedData)
|
||||
tokensByFormat[formatName] = tokens
|
||||
}
|
||||
|
||||
@@ -212,35 +214,36 @@ const flatCharts = flatOnlyDatasets
|
||||
const { dataset } = result
|
||||
const emoji = DATASET_ICONS[dataset.name] || DEFAULT_DATASET_ICON
|
||||
const eligibility = dataset.metadata.tabularEligibility
|
||||
const name = `${dataset.description} [eligibility: ${eligibility}%]`
|
||||
const name = dataset.description
|
||||
|
||||
// CSV line
|
||||
const csvPercentage = Math.min(100, (csv.tokens / toon.tokens) * 100)
|
||||
const csvBar = createProgressBar(csvPercentage, 100, PROGRESS_BAR_WIDTH, PROGRESS_BAR_CONFIG)
|
||||
const csvBar = createProgressBar(csvPercentage, 100, PROGRESS_BAR_WIDTH)
|
||||
const csvStr = csv.tokens.toLocaleString('en-US')
|
||||
|
||||
const line1 = `${emoji} ${name.padEnd(LABEL_PADDING)}\ncsv ${csvBar} ${csvStr.padStart(TOKEN_PADDING)} tokens`
|
||||
const line1 = `${emoji} ${name} ┊ Tabular: ${eligibility}%`
|
||||
const line2 = ` │`
|
||||
const line3 = ` CSV ${csvBar} ${csvStr.padStart(TOKEN_PADDING)} tokens`
|
||||
|
||||
// TOON line with overhead vs CSV
|
||||
const toonOverhead = toon.tokens - csv.tokens
|
||||
const toonOverheadPercent = (toonOverhead / csv.tokens) * 100
|
||||
const toonBar = createProgressBar(100, 100, PROGRESS_BAR_WIDTH, PROGRESS_BAR_CONFIG)
|
||||
const toonBar = createProgressBar(100, 100, PROGRESS_BAR_WIDTH)
|
||||
const toonStr = toon.tokens.toLocaleString('en-US')
|
||||
const toonVsCSV = toonOverheadPercent >= 0
|
||||
? `(+${toonOverheadPercent.toFixed(1)}% vs CSV)`
|
||||
: `(${toonOverheadPercent.toFixed(1)}% vs CSV)`
|
||||
const toonLine = `toon ${toonBar} ${toonStr.padStart(TOKEN_PADDING)} tokens ${toonVsCSV}`
|
||||
const toonLine = ` TOON ${toonBar} ${toonStr.padStart(TOKEN_PADDING)} tokens ${toonVsCSV}`
|
||||
|
||||
// Other format comparisons (vs TOON)
|
||||
const comparisonLines = COMPARISON_FORMAT_ORDER.map((formatName) => {
|
||||
const comparisonLines = COMPARISON_FORMAT_ORDER.map((formatName, index, array) => {
|
||||
const format = result.formats.find(f => f.name === formatName)
|
||||
if (!format)
|
||||
return null
|
||||
return undefined
|
||||
|
||||
return formatComparisonLine(format)
|
||||
return ` ${formatComparisonLine(format, index === array.length - 1)}`
|
||||
}).filter(Boolean)
|
||||
|
||||
return [line1, toonLine, ...comparisonLines].join('\n')
|
||||
return [line1, line2, line3, toonLine, ...comparisonLines].join('\n')
|
||||
})
|
||||
.join('\n\n')
|
||||
|
||||
@@ -257,25 +260,23 @@ const totalCSVTokensFlat = flatOnlyDatasets.reduce((sum, r) => {
|
||||
const flatTotalLines = generateTotalLines(totalToonTokensFlat, flatTotals, { name: 'csv', tokens: totalCSVTokensFlat })
|
||||
|
||||
const barChartSection = `
|
||||
## Mixed-Structure Track
|
||||
#### Mixed-Structure Track
|
||||
|
||||
Datasets with nested or semi-uniform structures. CSV excluded as it cannot properly represent these structures.
|
||||
|
||||
\`\`\`
|
||||
${mixedCharts}
|
||||
|
||||
${SEPARATOR}
|
||||
${mixedTotalLines}
|
||||
\`\`\`
|
||||
|
||||
## Flat-Only Track
|
||||
#### Flat-Only Track
|
||||
|
||||
Datasets with flat tabular structures where CSV is applicable.
|
||||
|
||||
\`\`\`
|
||||
${flatCharts}
|
||||
|
||||
${SEPARATOR}
|
||||
${flatTotalLines}
|
||||
\`\`\`
|
||||
`.trim()
|
||||
|
||||
@@ -208,7 +208,7 @@ function generateEmployees(count: number): { employees: Employee[] } {
|
||||
*/
|
||||
const tabularDataset: Dataset = {
|
||||
name: 'tabular',
|
||||
description: 'Uniform employee records (TOON optimal format)',
|
||||
description: 'Uniform employee records',
|
||||
data: generateEmployees(100),
|
||||
metadata: {
|
||||
supportsCSV: true,
|
||||
@@ -558,7 +558,7 @@ export const TOKEN_EFFICIENCY_DATASETS: Dataset[] = [
|
||||
// Tabular: 2000 employees
|
||||
{
|
||||
name: 'tabular',
|
||||
description: 'Uniform employee records (TOON optimal format)',
|
||||
description: 'Uniform employee records',
|
||||
data: generateEmployees(2000),
|
||||
metadata: {
|
||||
supportsCSV: true,
|
||||
|
||||
@@ -80,8 +80,13 @@ export function generateAccuracyReport(
|
||||
return `
|
||||
Benchmarks test LLM comprehension across different input formats using ${totalQuestions} data retrieval questions on ${modelNames.length} ${modelNames.length === 1 ? 'model' : 'models'}.
|
||||
|
||||
<details>
|
||||
<summary><strong>View Dataset Catalog</strong></summary>
|
||||
|
||||
${generateDatasetCatalog(ACCURACY_DATASETS)}
|
||||
|
||||
</details>
|
||||
|
||||
#### Efficiency Ranking (Accuracy per 1K Tokens)
|
||||
|
||||
${generateEfficiencyRankingReport(formatResults)}
|
||||
@@ -118,7 +123,7 @@ ${rows}
|
||||
- **nested**: Objects with nested structures (nested objects or arrays)
|
||||
- **deep**: Highly nested with minimal tabular eligibility
|
||||
|
||||
**CSV Support:** ✓ (supported), ✗ (not supported - would require lossy flattening)
|
||||
**CSV Support:** ✓ (supported), ✗ (not supported – would require lossy flattening)
|
||||
|
||||
**Eligibility:** Percentage of arrays that qualify for TOON's tabular format (uniform objects with primitive values)
|
||||
`.trim()
|
||||
@@ -219,7 +224,7 @@ function generateDetailedAccuracyReport(
|
||||
const totalEvaluations = totalQuestions * formatCount * modelNames.length
|
||||
|
||||
return `
|
||||
Accuracy across **${modelNames.length} ${modelNames.length === 1 ? 'LLM' : 'LLMs'}** on ${totalQuestions} data retrieval questions:
|
||||
Accuracy across ${modelNames.length} ${modelNames.length === 1 ? 'LLM' : 'LLMs'} on ${totalQuestions} data retrieval questions:
|
||||
|
||||
\`\`\`
|
||||
${modelBreakdown}
|
||||
@@ -453,13 +458,17 @@ function generateHorizontalEfficiencyChart(
|
||||
): string {
|
||||
const barWidth = 20
|
||||
const maxEfficiency = Math.max(...ranking.map(r => r.efficiency))
|
||||
const maxFormatWidth = Math.max(...ranking.map(r => r.format.length))
|
||||
const maxFormatWidth = Math.max(...ranking.map((r) => {
|
||||
const displayName = FORMATTER_DISPLAY_NAMES[r.format] || r.format
|
||||
return displayName.length
|
||||
}))
|
||||
|
||||
return ranking
|
||||
.map((r) => {
|
||||
const normalizedValue = r.efficiency / maxEfficiency
|
||||
const bar = createProgressBar(normalizedValue, 1, barWidth, { filled: '▓', empty: '░' })
|
||||
const formatName = r.format.padEnd(maxFormatWidth)
|
||||
const displayName = FORMATTER_DISPLAY_NAMES[r.format] || r.format
|
||||
const formatName = displayName.padEnd(maxFormatWidth)
|
||||
const efficiency = r.efficiency.toFixed(1).padStart(4)
|
||||
const accuracy = `${(r.accuracy * 100).toFixed(1)}%`.padStart(5)
|
||||
const tokens = r.tokens.toLocaleString('en-US').padStart(5)