chore(benchmarks): replace LLM-as-judge, new structural validation

Johann Schopplich
2025-11-07 21:28:21 +01:00
parent 9a519dd114
commit acca69c64a
25 changed files with 1311 additions and 396 deletions

414
README.md

@@ -75,7 +75,7 @@ See [benchmarks](#benchmarks) for concrete comparisons across different data str
## Key Features
- 💸 **Token-efficient:** typically 30–60% fewer tokens than JSON[^1]
- 💸 **Token-efficient:** typically 30-60% fewer tokens on large uniform arrays vs formatted JSON[^1]
- 🤿 **LLM-friendly guardrails:** explicit lengths and fields enable validation
- 🍱 **Minimal syntax:** removes redundant punctuation (braces, brackets, most quotes)
- 📐 **Indentation-based structure:** like YAML, uses whitespace instead of braces
@@ -108,19 +108,19 @@ Datasets with nested or semi-uniform structures. CSV excluded as it cannot prope
```
🛒 E-commerce orders with nested structures ┊ Tabular: 33%
TOON █████████████░░░░░░░ 72,743 tokens
├─ vs JSON (−33.1%) 108,731 tokens
├─ vs JSON compact (+5.5%) 68,936 tokens
├─ vs YAML (−14.1%) 84,724 tokens
└─ vs XML (−40.5%) 122,313 tokens
TOON █████████████░░░░░░░ 72,771 tokens
├─ vs JSON (−33.1%) 108,806 tokens
├─ vs JSON compact (+5.5%) 68,975 tokens
├─ vs YAML (−14.2%) 84,780 tokens
└─ vs XML (−40.5%) 122,406 tokens
🧾 Semi-uniform event logs ┊ Tabular: 50%
TOON █████████████████░░░ 153,223 tokens
├─ vs JSON (−15.0%) 180,196 tokens
├─ vs JSON compact (+19.9%) 127,740 tokens
├─ vs YAML (−0.8%) 154,514 tokens
└─ vs XML (−25.2%) 204,800 tokens
TOON █████████████████░░░ 153,211 tokens
├─ vs JSON (−15.0%) 180,176 tokens
├─ vs JSON compact (+19.9%) 127,731 tokens
├─ vs YAML (−0.8%) 154,505 tokens
└─ vs XML (−25.2%) 204,777 tokens
🧩 Deeply nested configuration ┊ Tabular: 0%
@@ -131,11 +131,11 @@ Datasets with nested or semi-uniform structures. CSV excluded as it cannot prope
└─ vs XML (−37.4%) 1,008 tokens
──────────────────────────────────── Total ────────────────────────────────────
TOON ████████████████░░░░ 226,597 tokens
├─ vs JSON (−21.8%) 289,846 tokens
├─ vs JSON compact (+14.9%) 197,240 tokens
├─ vs YAML (−5.5%) 239,911 tokens
└─ vs XML (−30.9%) 328,121 tokens
TOON ████████████████░░░░ 226,613 tokens
├─ vs JSON (−21.8%) 289,901 tokens
├─ vs JSON compact (+14.9%) 197,270 tokens
├─ vs YAML (−5.6%) 239,958 tokens
└─ vs XML (−31.0%) 328,191 tokens
```
#### Flat-Only Track
@@ -145,21 +145,21 @@ Datasets with flat tabular structures where CSV is applicable.
```
👥 Uniform employee records ┊ Tabular: 100%
CSV ███████████████████░ 46,956 tokens
TOON ████████████████████ 49,827 tokens (+6.1% vs CSV)
├─ vs JSON (−60.7%) 126,854 tokens
├─ vs JSON compact (−36.8%) 78,850 tokens
├─ vs YAML (−50.0%) 99,701 tokens
└─ vs XML (−66.0%) 146,440 tokens
CSV ███████████████████░ 46,954 tokens
TOON ████████████████████ 49,831 tokens (+6.1% vs CSV)
├─ vs JSON (−60.7%) 126,860 tokens
├─ vs JSON compact (−36.8%) 78,856 tokens
├─ vs YAML (−50.0%) 99,706 tokens
└─ vs XML (−66.0%) 146,444 tokens
📈 Time-series analytics data ┊ Tabular: 100%
CSV ██████████████████░░ 8,396 tokens
TOON ████████████████████ 9,128 tokens (+8.7% vs CSV)
├─ vs JSON (−59.0%) 22,258 tokens
├─ vs JSON compact (−35.8%) 14,224 tokens
├─ vs YAML (−48.9%) 17,871 tokens
└─ vs XML (−65.7%) 26,629 tokens
CSV ██████████████████░░ 8,388 tokens
TOON ████████████████████ 9,120 tokens (+8.7% vs CSV)
├─ vs JSON (−59.0%) 22,250 tokens
├─ vs JSON compact (−35.8%) 14,216 tokens
├─ vs YAML (−48.9%) 17,863 tokens
└─ vs XML (−65.7%) 26,621 tokens
⭐ Top 100 GitHub repositories ┊ Tabular: 100%
@@ -171,12 +171,12 @@ Datasets with flat tabular structures where CSV is applicable.
└─ vs XML (−48.8%) 17,095 tokens
──────────────────────────────────── Total ────────────────────────────────────
CSV ███████████████████░ 63,865 tokens
TOON ████████████████████ 67,700 tokens (+6.0% vs CSV)
├─ vs JSON (−58.8%) 164,257 tokens
├─ vs JSON compact (−35.2%) 104,529 tokens
├─ vs YAML (−48.2%) 130,701 tokens
└─ vs XML (−64.4%) 190,164 tokens
CSV ███████████████████░ 63,855 tokens
TOON ████████████████████ 67,696 tokens (+6.0% vs CSV)
├─ vs JSON (−58.8%) 164,255 tokens
├─ vs JSON compact (−35.2%) 104,527 tokens
├─ vs YAML (−48.2%) 130,698 tokens
└─ vs XML (−64.4%) 190,160 tokens
```
<details>
@@ -186,64 +186,64 @@ Datasets with flat tabular structures where CSV is applicable.
**Savings:** 13,130 tokens (59.0% reduction vs JSON)
**JSON** (22,258 tokens):
**JSON** (22,250 tokens):
```json
{
"metrics": [
{
"date": "2025-01-01",
"views": 7708,
"clicks": 595,
"conversions": 69,
"revenue": 15369.93,
"bounceRate": 0.35
"views": 5715,
"clicks": 211,
"conversions": 28,
"revenue": 7976.46,
"bounceRate": 0.47
},
{
"date": "2025-01-02",
"views": 5894,
"clicks": 381,
"conversions": 21,
"revenue": 2112.12,
"bounceRate": 0.3
"views": 7103,
"clicks": 393,
"conversions": 28,
"revenue": 8360.53,
"bounceRate": 0.32
},
{
"date": "2025-01-03",
"views": 6835,
"clicks": 422,
"conversions": 35,
"revenue": 4525.73,
"views": 7248,
"clicks": 378,
"conversions": 24,
"revenue": 3212.57,
"bounceRate": 0.5
},
{
"date": "2025-01-04",
"views": 5325,
"clicks": 305,
"conversions": 22,
"revenue": 2445.3,
"bounceRate": 0.44
"views": 2927,
"clicks": 77,
"conversions": 11,
"revenue": 1211.69,
"bounceRate": 0.62
},
{
"date": "2025-01-05",
"views": 2974,
"clicks": 61,
"conversions": 6,
"revenue": 956.57,
"bounceRate": 0.47
"views": 3530,
"clicks": 82,
"conversions": 8,
"revenue": 462.77,
"bounceRate": 0.56
}
]
}
```
**TOON** (9,128 tokens):
**TOON** (9,120 tokens):
```
metrics[5]{date,views,clicks,conversions,revenue,bounceRate}:
2025-01-01,7708,595,69,15369.93,0.35
2025-01-02,5894,381,21,2112.12,0.3
2025-01-03,6835,422,35,4525.73,0.5
2025-01-04,5325,305,22,2445.3,0.44
2025-01-05,2974,61,6,956.57,0.47
2025-01-01,5715,211,28,7976.46,0.47
2025-01-02,7103,393,28,8360.53,0.32
2025-01-03,7248,378,24,3212.57,0.5
2025-01-04,2927,77,11,1211.69,0.62
2025-01-05,3530,82,8,462.77,0.56
```
---
@@ -317,7 +317,7 @@ repositories[3]{id,name,repo,description,createdAt,updatedAt,pushedAt,stars,watc
<!-- automd:file src="./benchmarks/results/retrieval-accuracy.md" -->
Benchmarks test LLM comprehension across different input formats using 204 data retrieval questions on 4 models.
Benchmarks test LLM comprehension across different input formats using 209 data retrieval questions on 4 models.
<details>
<summary><strong>Show Dataset Catalog</strong></summary>
@@ -332,6 +332,11 @@ Benchmarks test LLM comprehension across different input formats using 204 data
| Top 100 GitHub repositories | 100 | uniform | ✓ | 100% |
| Semi-uniform event logs | 75 | semi-uniform | ✗ | 50% |
| Deeply nested configuration | 11 | deep | ✗ | 0% |
| Valid complete dataset (control) | 20 | uniform | ✓ | 100% |
| Array truncated: 3 rows removed from end | 17 | uniform | ✓ | 100% |
| Extra rows added beyond declared length | 23 | uniform | ✓ | 100% |
| Inconsistent field count (missing salary in row 10) | 20 | uniform | ✓ | 100% |
| Missing required fields (no email in multiple rows) | 20 | uniform | ✓ | 100% |
**Structure classes:**
- **uniform**: All objects have identical fields with primitive values
@@ -350,67 +355,69 @@ Benchmarks test LLM comprehension across different input formats using 204 data
Each format's overall performance, balancing accuracy against token cost:
```
TOON ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓ 17.2 │ 75.5% acc │ 4,389 tokens
CSV ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓░ 16.6 │ 67.8% acc │ 4,080 tokens
JSON compact ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓░░░ 14.7 │ 73.3% acc │ 4,982 tokens
YAML ▓▓▓▓▓▓▓▓▓▓▓▓▓▓░░░░░░ 12.1 │ 72.4% acc │ 5,976 tokens
JSON ▓▓▓▓▓▓▓▓▓▓▓▓░░░░░░░░ 10.0 │ 72.4% acc │ 7,260 tokens
XML ▓▓▓▓▓▓▓▓▓▓░░░░░░░░░░ 8.4 │ 69.0% acc │ 8,251 tokens
TOON ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓ 26.9 │ 73.9% acc │ 2,744 tokens
JSON compact ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓░░░ 22.9 │ 70.7% acc │ 3,081 tokens
YAML ▓▓▓▓▓▓▓▓▓▓▓▓▓▓░░░░░░ 18.6 │ 69.0% acc │ 3,719 tokens
JSON ▓▓▓▓▓▓▓▓▓▓▓░░░░░░░░░ 15.3 │ 69.7% acc │ 4,545 tokens
XML ▓▓▓▓▓▓▓▓▓▓░░░░░░░░░░ 13.0 │ 67.1% acc │ 5,167 tokens
```
TOON achieves **75.5%** accuracy (vs JSON's 72.4%) while using **39.5% fewer tokens**.
TOON achieves **73.9%** accuracy (vs JSON's 69.7%) while using **39.6% fewer tokens**.
**Note on CSV:** Excluded from ranking as it only supports 109 of 209 questions (flat tabular data only). While CSV is highly token-efficient for simple tabular data, it cannot represent nested structures that other formats handle.
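
The ranking scores in the chart are consistent with accuracy (in percent) divided by thousands of tokens. That relationship is inferred from the published figures, not confirmed here, so the following TypeScript is only a sketch under that assumption. The names `FormatResult`, `efficiencyScore`, and `tokenSavingsPct` are hypothetical.

```typescript
// Assumption: ranking score = accuracy (percent) per 1,000 tokens.
// Inferred from the published numbers; the benchmark's own code may differ.
interface FormatResult {
  format: string
  accuracyPct: number
  avgTokens: number
}

// Accuracy points earned per 1,000 input tokens.
function efficiencyScore(result: FormatResult): number {
  return result.accuracyPct / (result.avgTokens / 1000)
}

// Relative token reduction of a candidate format against a baseline, in percent.
function tokenSavingsPct(candidateTokens: number, baselineTokens: number): number {
  return (1 - candidateTokens / baselineTokens) * 100
}

// Figures taken from the updated ranking.
const results: FormatResult[] = [
  { format: 'TOON', accuracyPct: 73.9, avgTokens: 2744 },
  { format: 'JSON', accuracyPct: 69.7, avgTokens: 4545 },
]

for (const result of results) {
  console.log(`${result.format}: ${efficiencyScore(result).toFixed(1)}`)
}
console.log(`${tokenSavingsPct(2744, 4545).toFixed(1)}% fewer tokens`)
```

With the figures above this reproduces the reported 26.9 and 15.3 scores and the 39.6% token reduction.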
#### Per-Model Accuracy
Accuracy across 4 LLMs on 204 data retrieval questions:
Accuracy across 4 LLMs on 209 data retrieval questions:
```
claude-haiku-4-5-20251001
→ TOON ████████████░░░░░░░░ 62.3% (127/204)
JSON ███████████░░░░░░░░░ 56.9% (116/204)
YAML ███████████░░░░░░░░░ 55.9% (114/204)
JSON compact ███████████░░░░░░░░░ 54.9% (112/204)
XML ███████████░░░░░░░░░ 54.9% (112/204)
CSV █████████░░░░░░░░░░░ 47.1% (49/104)
→ TOON ████████████░░░░░░░░ 59.8% (125/209)
JSON ███████████░░░░░░░░░ 57.4% (120/209)
YAML ███████████░░░░░░░░░ 56.0% (117/209)
XML ███████████░░░░░░░░░ 55.5% (116/209)
JSON compact ███████████░░░░░░░░░ 55.0% (115/209)
CSV ██████████░░░░░░░░░░ 50.5% (55/109)
gemini-2.5-flash
→ TOON ██████████████████░░ 91.2% (186/204)
YAML ██████████████████░░ 89.7% (183/204)
JSON compact ██████████████████░░ 87.7% (179/204)
JSON ██████████████████░░ 87.7% (179/204)
XML █████████████████░░░ 87.3% (178/204)
CSV █████████████████░░░ 85.6% (89/104)
→ TOON ██████████████████░░ 87.6% (183/209)
CSV █████████████████░░░ 86.2% (94/109)
JSON compact ████████████████░░░░ 82.3% (172/209)
YAML ████████████████░░░░ 79.4% (166/209)
XML ████████████████░░░░ 79.4% (166/209)
JSON ███████████████░░░░░ 77.0% (161/209)
gpt-5-nano
JSON compact ███████████████████░ 93.6% (191/204)
CSV ██████████████████░░ 90.4% (94/104)
JSON ██████████████████░░ 89.7% (183/204)
→ TOON ██████████████████░░ 89.2% (182/204)
YAML ██████████████████░░ 89.2% (182/204)
XML ████████████████░░░░ 81.4% (166/204)
→ TOON ██████████████████░░ 90.9% (190/209)
JSON compact ██████████████████░░ 90.9% (190/209)
JSON ██████████████████░░ 89.0% (186/209)
CSV ██████████████████░░ 89.0% (97/109)
YAML █████████████████░░░ 87.1% (182/209)
XML ████████████████░░░░ 80.9% (169/209)
grok-4-fast-non-reasoning
→ TOON ████████████░░░░░░░░ 59.3% (121/204)
JSON compact ███████████░░░░░░░░░ 56.9% (116/204)
JSON ███████████░░░░░░░░░ 55.4% (113/204)
YAML ███████████░░░░░░░░░ 54.9% (112/204)
XML ██████████░░░░░░░░░░ 52.5% (107/204)
CSV ██████████░░░░░░░░░░ 48.1% (50/104)
→ TOON ███████████░░░░░░░░░ 57.4% (120/209)
JSON ███████████░░░░░░░░░ 55.5% (116/209)
JSON compact ███████████░░░░░░░░░ 54.5% (114/209)
YAML ███████████░░░░░░░░░ 53.6% (112/209)
XML ███████████░░░░░░░░░ 52.6% (110/209)
CSV ██████████░░░░░░░░░░ 52.3% (57/109)
```
**Key tradeoff:** TOON achieves **75.5% accuracy** (vs JSON's 72.4%) while using **39.5% fewer tokens** on these datasets.
**Key tradeoff:** TOON achieves **73.9% accuracy** (vs JSON's 69.7%) while using **39.6% fewer tokens** on these datasets.
<details>
<summary><strong>Performance by dataset, model, and question type</strong></summary>
#### Performance by Question Type
| Question Type | TOON | JSON compact | JSON | YAML | XML | CSV |
| Question Type | TOON | JSON compact | JSON | CSV | YAML | XML |
| ------------- | ---- | ---- | ---- | ---- | ---- | ---- |
| Field Retrieval | 100.0% | 98.9% | 99.6% | 99.3% | 98.5% | 100.0% |
| Aggregation | 56.3% | 52.4% | 53.2% | 53.2% | 47.2% | 40.5% |
| Filtering | 58.9% | 58.3% | 54.2% | 53.1% | 50.5% | 49.1% |
| Structure Awareness | 89.0% | 85.0% | 82.0% | 85.0% | 79.0% | 84.4% |
| Field Retrieval | 99.6% | 99.3% | 99.3% | 100.0% | 98.2% | 98.9% |
| Aggregation | 54.4% | 47.2% | 48.8% | 44.0% | 47.6% | 41.3% |
| Filtering | 56.3% | 57.3% | 50.5% | 49.1% | 51.0% | 47.9% |
| Structure Awareness | 88.0% | 83.0% | 83.0% | 85.9% | 80.0% | 80.0% |
| Structural Validation | 70.0% | 45.0% | 50.0% | 80.0% | 60.0% | 80.0% |
#### Performance by Dataset
@@ -418,64 +425,119 @@ grok-4-fast-non-reasoning
| Format | Accuracy | Tokens | Correct/Total |
| ------ | -------- | ------ | ------------- |
| `csv` | 70.7% | 2,337 | 116/164 |
| `toon` | 72.0% | 2,483 | 118/164 |
| `json-compact` | 71.3% | 3,943 | 117/164 |
| `yaml` | 70.1% | 4,969 | 115/164 |
| `json-pretty` | 72.6% | 6,347 | 119/164 |
| `xml` | 70.7% | 7,314 | 116/164 |
| `csv` | 72.0% | 2,352 | 118/164 |
| `toon` | 73.8% | 2,518 | 121/164 |
| `json-compact` | 69.5% | 3,953 | 114/164 |
| `yaml` | 68.3% | 4,982 | 112/164 |
| `json-pretty` | 68.3% | 6,360 | 112/164 |
| `xml` | 69.5% | 7,324 | 114/164 |
##### E-commerce orders with nested structures
| Format | Accuracy | Tokens | Correct/Total |
| ------ | -------- | ------ | ------------- |
| `toon` | 83.5% | 7,197 | 137/164 |
| `json-compact` | 79.3% | 6,784 | 130/164 |
| `yaml` | 78.7% | 8,334 | 129/164 |
| `json-pretty` | 78.7% | 10,700 | 129/164 |
| `xml` | 73.8% | 12,013 | 121/164 |
| `toon` | 81.1% | 7,232 | 133/164 |
| `json-compact` | 76.8% | 6,794 | 126/164 |
| `yaml` | 75.6% | 8,347 | 124/164 |
| `json-pretty` | 76.2% | 10,713 | 125/164 |
| `xml` | 74.4% | 12,023 | 122/164 |
##### Time-series analytics data
| Format | Accuracy | Tokens | Correct/Total |
| ------ | -------- | ------ | ------------- |
| `toon` | 75.8% | 1,513 | 91/120 |
| `csv` | 72.5% | 1,391 | 87/120 |
| `json-compact` | 70.0% | 2,339 | 84/120 |
| `yaml` | 70.0% | 2,936 | 84/120 |
| `json-pretty` | 71.7% | 3,663 | 86/120 |
| `xml` | 71.7% | 4,374 | 86/120 |
| `csv` | 73.3% | 1,406 | 88/120 |
| `toon` | 72.5% | 1,548 | 87/120 |
| `json-compact` | 71.7% | 2,349 | 86/120 |
| `yaml` | 71.7% | 2,949 | 86/120 |
| `json-pretty` | 68.3% | 3,676 | 82/120 |
| `xml` | 68.3% | 4,384 | 82/120 |
##### Top 100 GitHub repositories
| Format | Accuracy | Tokens | Correct/Total |
| ------ | -------- | ------ | ------------- |
| `toon` | 64.4% | 8,745 | 85/132 |
| `csv` | 59.8% | 8,513 | 79/132 |
| `json-compact` | 60.6% | 11,455 | 80/132 |
| `yaml` | 61.4% | 13,129 | 81/132 |
| `json-pretty` | 59.1% | 15,145 | 78/132 |
| `xml` | 51.5% | 17,095 | 68/132 |
| `toon` | 62.9% | 8,780 | 83/132 |
| `csv` | 61.4% | 8,528 | 81/132 |
| `yaml` | 59.8% | 13,142 | 79/132 |
| `json-compact` | 55.3% | 11,465 | 73/132 |
| `json-pretty` | 56.1% | 15,158 | 74/132 |
| `xml` | 48.5% | 17,105 | 64/132 |
##### Semi-uniform event logs
| Format | Accuracy | Tokens | Correct/Total |
| ------ | -------- | ------ | ------------- |
| `json-compact` | 67.5% | 4,809 | 81/120 |
| `yaml` | 63.3% | 5,814 | 76/120 |
| `toon` | 62.5% | 5,764 | 75/120 |
| `json-pretty` | 59.2% | 6,784 | 71/120 |
| `xml` | 55.0% | 7,699 | 66/120 |
| `json-compact` | 63.3% | 4,819 | 76/120 |
| `toon` | 57.5% | 5,799 | 69/120 |
| `json-pretty` | 59.2% | 6,797 | 71/120 |
| `yaml` | 48.3% | 5,827 | 58/120 |
| `xml` | 46.7% | 7,709 | 56/120 |
##### Deeply nested configuration
| Format | Accuracy | Tokens | Correct/Total |
| ------ | -------- | ------ | ------------- |
| `json-compact` | 91.4% | 564 | 106/116 |
| `toon` | 94.8% | 631 | 110/116 |
| `yaml` | 91.4% | 673 | 106/116 |
| `json-pretty` | 93.1% | 919 | 108/116 |
| `xml` | 91.4% | 1,008 | 106/116 |
| `json-compact` | 92.2% | 574 | 107/116 |
| `toon` | 95.7% | 666 | 111/116 |
| `yaml` | 91.4% | 686 | 106/116 |
| `json-pretty` | 94.0% | 932 | 109/116 |
| `xml` | 92.2% | 1,018 | 107/116 |
##### Valid complete dataset (control)
| Format | Accuracy | Tokens | Correct/Total |
| ------ | -------- | ------ | ------------- |
| `toon` | 100.0% | 544 | 4/4 |
| `json-compact` | 100.0% | 795 | 4/4 |
| `yaml` | 100.0% | 1,003 | 4/4 |
| `json-pretty` | 100.0% | 1,282 | 4/4 |
| `csv` | 25.0% | 492 | 1/4 |
| `xml` | 0.0% | 1,467 | 0/4 |
##### Array truncated: 3 rows removed from end
| Format | Accuracy | Tokens | Correct/Total |
| ------ | -------- | ------ | ------------- |
| `csv` | 100.0% | 425 | 4/4 |
| `xml` | 100.0% | 1,251 | 4/4 |
| `toon` | 0.0% | 474 | 0/4 |
| `json-compact` | 0.0% | 681 | 0/4 |
| `json-pretty` | 0.0% | 1,096 | 0/4 |
| `yaml` | 0.0% | 859 | 0/4 |
##### Extra rows added beyond declared length
| Format | Accuracy | Tokens | Correct/Total |
| ------ | -------- | ------ | ------------- |
| `csv` | 100.0% | 566 | 4/4 |
| `toon` | 75.0% | 621 | 3/4 |
| `xml` | 100.0% | 1,692 | 4/4 |
| `yaml` | 75.0% | 1,157 | 3/4 |
| `json-compact` | 50.0% | 917 | 2/4 |
| `json-pretty` | 50.0% | 1,476 | 2/4 |
##### Inconsistent field count (missing salary in row 10)
| Format | Accuracy | Tokens | Correct/Total |
| ------ | -------- | ------ | ------------- |
| `csv` | 75.0% | 489 | 3/4 |
| `yaml` | 100.0% | 996 | 4/4 |
| `toon` | 100.0% | 1,019 | 4/4 |
| `json-compact` | 75.0% | 790 | 3/4 |
| `xml` | 100.0% | 1,458 | 4/4 |
| `json-pretty` | 75.0% | 1,274 | 3/4 |
##### Missing required fields (no email in multiple rows)
| Format | Accuracy | Tokens | Correct/Total |
| ------ | -------- | ------ | ------------- |
| `csv` | 100.0% | 329 | 4/4 |
| `xml` | 100.0% | 1,411 | 4/4 |
| `toon` | 75.0% | 983 | 3/4 |
| `yaml` | 25.0% | 960 | 1/4 |
| `json-pretty` | 25.0% | 1,230 | 1/4 |
| `json-compact` | 0.0% | 755 | 0/4 |
#### Performance by Model
@@ -483,45 +545,45 @@ grok-4-fast-non-reasoning
| Format | Accuracy | Correct/Total |
| ------ | -------- | ------------- |
| `toon` | 62.3% | 127/204 |
| `json-pretty` | 56.9% | 116/204 |
| `yaml` | 55.9% | 114/204 |
| `json-compact` | 54.9% | 112/204 |
| `xml` | 54.9% | 112/204 |
| `csv` | 47.1% | 49/104 |
| `toon` | 59.8% | 125/209 |
| `json-pretty` | 57.4% | 120/209 |
| `yaml` | 56.0% | 117/209 |
| `xml` | 55.5% | 116/209 |
| `json-compact` | 55.0% | 115/209 |
| `csv` | 50.5% | 55/109 |
##### gemini-2.5-flash
| Format | Accuracy | Correct/Total |
| ------ | -------- | ------------- |
| `toon` | 91.2% | 186/204 |
| `yaml` | 89.7% | 183/204 |
| `json-compact` | 87.7% | 179/204 |
| `json-pretty` | 87.7% | 179/204 |
| `xml` | 87.3% | 178/204 |
| `csv` | 85.6% | 89/104 |
| `toon` | 87.6% | 183/209 |
| `csv` | 86.2% | 94/109 |
| `json-compact` | 82.3% | 172/209 |
| `yaml` | 79.4% | 166/209 |
| `xml` | 79.4% | 166/209 |
| `json-pretty` | 77.0% | 161/209 |
##### gpt-5-nano
| Format | Accuracy | Correct/Total |
| ------ | -------- | ------------- |
| `json-compact` | 93.6% | 191/204 |
| `csv` | 90.4% | 94/104 |
| `json-pretty` | 89.7% | 183/204 |
| `toon` | 89.2% | 182/204 |
| `yaml` | 89.2% | 182/204 |
| `xml` | 81.4% | 166/204 |
| `toon` | 90.9% | 190/209 |
| `json-compact` | 90.9% | 190/209 |
| `json-pretty` | 89.0% | 186/209 |
| `csv` | 89.0% | 97/109 |
| `yaml` | 87.1% | 182/209 |
| `xml` | 80.9% | 169/209 |
##### grok-4-fast-non-reasoning
| Format | Accuracy | Correct/Total |
| ------ | -------- | ------------- |
| `toon` | 59.3% | 121/204 |
| `json-compact` | 56.9% | 116/204 |
| `json-pretty` | 55.4% | 113/204 |
| `yaml` | 54.9% | 112/204 |
| `xml` | 52.5% | 107/204 |
| `csv` | 48.1% | 50/104 |
| `toon` | 57.4% | 120/209 |
| `json-pretty` | 55.5% | 116/209 |
| `json-compact` | 54.5% | 114/209 |
| `yaml` | 53.6% | 112/209 |
| `xml` | 52.6% | 110/209 |
| `csv` | 52.3% | 57/109 |
</details>
@@ -534,8 +596,9 @@ This benchmark tests **LLM comprehension and data retrieval accuracy** across di
#### Datasets Tested
Six datasets designed to test different structural patterns:
Eleven datasets designed to test different structural patterns and validation capabilities:
**Primary datasets:**
1. **Tabular** (100 employee records): Uniform objects with identical fields – optimal for TOON's tabular format.
2. **Nested** (50 e-commerce orders): Complex structures with nested customer objects and item arrays.
3. **Analytics** (60 days of metrics): Time-series data with dates and numeric values.
@@ -543,21 +606,28 @@ Six datasets designed to test different structural patterns:
5. **Event Logs** (75 logs): Semi-uniform data with ~50% flat logs and ~50% with nested error objects.
6. **Nested Config** (1 configuration): Deeply nested configuration with minimal tabular eligibility.
**Structural validation datasets:**
7. **Control**: Valid complete dataset (baseline for validation)
8. **Truncated**: Array with 3 rows removed from end (tests [N] length detection)
9. **Extra rows**: Array with 3 additional rows beyond declared length
10. **Width mismatch**: Inconsistent field count (missing salary in row 10)
11. **Missing fields**: Systematic field omissions (no email in multiple rows)
#### Question Types
204 questions are generated dynamically across four categories:
209 questions are generated dynamically across five categories:
- **Field retrieval (33%)**: Direct value lookups or values that can be read straight off a record (including booleans and simple counts such as array lengths)
- Example: "What is Alice's salary?" → `75000`
- Example: "How many items are in order ORD-0042?" → `3`
- Example: "What is the customer name for order ORD-0042?" → `John Doe`
- **Aggregation (31%)**: Dataset-level totals and averages plus single-condition filters (counts, sums, min/max comparisons)
- **Aggregation (30%)**: Dataset-level totals and averages plus single-condition filters (counts, sums, min/max comparisons)
- Example: "How many employees work in Engineering?" → `17`
- Example: "What is the total revenue across all orders?" → `45123.50`
- Example: "How many employees have salary > 80000?" → `23`
- **Filtering (24%)**: Multi-condition queries requiring compound logic (AND constraints across fields)
- **Filtering (23%)**: Multi-condition queries requiring compound logic (AND constraints across fields)
- Example: "How many employees in Sales have salary > 80000?" → `5`
- Example: "How many active employees have more than 10 years of experience?" → `8`
@@ -566,18 +636,23 @@ Six datasets designed to test different structural patterns:
- Example: "List the field names for employees" → `id, name, email, department, salary, yearsExperience, active`
- Example: "What is the department of the last employee?" → `Sales`
- **Structural validation (2%)**: Tests ability to detect incomplete, truncated, or corrupted data using structural metadata
- Example: "Is this data complete and valid?" → `YES` (control dataset) or `NO` (corrupted datasets)
- Tests TOON's [N] length validation and {fields} consistency checking
- Demonstrates CSV's lack of structural validation capabilities
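
The `[N]` length and `{fields}` checks described above can be illustrated with a small TypeScript sketch. `validateTabularBlock` is a hypothetical helper, not part of the TOON library, and it assumes a single tabular block whose values contain no quoted commas.

```typescript
// Hypothetical helper, not part of the TOON library: checks one tabular block
// against its declared [N] length and {fields} list.
// Assumption: values contain no quoted commas, so a plain split(',') is enough.
interface ValidationResult {
  valid: boolean
  errors: string[]
}

function validateTabularBlock(text: string): ValidationResult {
  const lines = text.split('\n').filter(line => line.trim() !== '')
  const headerLine = (lines[0] ?? '').trim()
  // Header shape: key[N]{field1,field2,...}:
  const header = headerLine.match(/^(\w+)\[(\d+)\]\{([^}]*)\}:$/)
  if (!header) {
    return { valid: false, errors: ['missing or malformed tabular header'] }
  }

  const [, , lengthText = '0', fieldList = ''] = header
  const declaredLength = Number(lengthText)
  const fields = fieldList.split(',')
  const rows = lines.slice(1)
  const errors: string[] = []

  // [N] check: the number of rows must equal the declared length.
  if (rows.length !== declaredLength) {
    errors.push(`declared ${declaredLength} rows, found ${rows.length}`)
  }

  // {fields} check: every row must carry one value per declared field.
  rows.forEach((row, index) => {
    const width = row.trim().split(',').length
    if (width !== fields.length) {
      errors.push(`row ${index + 1}: expected ${fields.length} values, found ${width}`)
    }
  })

  return { valid: errors.length === 0, errors }
}

const truncated = [
  'metrics[5]{date,views}:',
  '  2025-01-01,5715',
  '  2025-01-02,7103',
].join('\n')

console.log(validateTabularBlock(truncated))
```

A truncated block like the one above fails the length check, which is the kind of signal the validation questions ask the model to notice.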
#### Evaluation Process
1. **Format conversion**: Each dataset is converted to all 6 formats (TOON, JSON compact, JSON, YAML, XML, CSV).
1. **Format conversion**: Each dataset is converted to all 6 formats (TOON, JSON compact, JSON, CSV, YAML, XML).
2. **Query LLM**: Each model receives formatted data + question in a prompt and extracts the answer.
3. **Validate with LLM-as-judge**: `gpt-5-nano` validates if the answer is semantically correct (e.g., `50000` = `$50,000`, `Engineering` = `engineering`, `2025-01-01` = `January 1, 2025`).
3. **Validate deterministically**: Answers are validated using type-aware comparison (e.g., `50000` = `$50,000`, `Engineering` = `engineering`, `2025-01-01` = `January 1, 2025`) without requiring an LLM judge.
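
A type-aware comparison of this kind might look like the following TypeScript sketch. It is an illustration only: the benchmark's actual validator is not shown in this diff, and `toNumber`, `toIsoDate`, and `answersMatch` are hypothetical names that handle just the three example equivalences listed above.

```typescript
// Hypothetical sketch of deterministic, type-aware answer comparison.
// The benchmark's real validator may normalize answers differently.
const MONTHS = [
  'january', 'february', 'march', 'april', 'may', 'june',
  'july', 'august', 'september', 'october', 'november', 'december',
]

// Parses plain or currency-formatted numbers such as "50000" or "$50,000".
function toNumber(text: string): number | undefined {
  const cleaned = text.trim().replace(/[$,\s]/g, '')
  return /^-?\d+(\.\d+)?$/.test(cleaned) ? Number(cleaned) : undefined
}

// Normalizes "2025-01-01" and "January 1, 2025" to the same ISO date string.
function toIsoDate(text: string): string | undefined {
  const trimmed = text.trim()
  if (/^\d{4}-\d{2}-\d{2}$/.test(trimmed)) return trimmed
  const named = trimmed.match(/^([A-Za-z]+) (\d{1,2}), (\d{4})$/)
  if (!named) return undefined
  const [, monthName = '', day = '', year = ''] = named
  const monthIndex = MONTHS.indexOf(monthName.toLowerCase())
  if (monthIndex === -1) return undefined
  const month = String(monthIndex + 1).padStart(2, '0')
  return `${year}-${month}-${day.padStart(2, '0')}`
}

// Compares as numbers first, then as dates, then as case-insensitive strings.
function answersMatch(expected: string, actual: string): boolean {
  const expectedNumber = toNumber(expected)
  const actualNumber = toNumber(actual)
  if (expectedNumber !== undefined && actualNumber !== undefined) {
    return expectedNumber === actualNumber
  }
  const expectedDate = toIsoDate(expected)
  const actualDate = toIsoDate(actual)
  if (expectedDate !== undefined && actualDate !== undefined) {
    return expectedDate === actualDate
  }
  return expected.trim().toLowerCase() === actual.trim().toLowerCase()
}

console.log(answersMatch('50000', '$50,000'))
console.log(answersMatch('Engineering', 'engineering'))
console.log(answersMatch('2025-01-01', 'January 1, 2025'))
```

Trying the numeric and date interpretations before falling back to string comparison keeps the result reproducible across runs, which an LLM judge cannot guarantee.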
#### Models & Configuration
- **Models tested**: `claude-haiku-4-5-20251001`, `gemini-2.5-flash`, `gpt-5-nano`, `grok-4-fast-non-reasoning`
- **Token counting**: Using `gpt-tokenizer` with `o200k_base` encoding (GPT-5 tokenizer)
- **Temperature**: Not set (models use their defaults)
- **Total evaluations**: 204 questions × 6 formats × 4 models = 4,896 LLM calls
- **Total evaluations**: 209 questions × 6 formats × 4 models = 5,016 LLM calls
</details>
@@ -782,6 +857,9 @@ items[1]:
status: active
```
> [!NOTE]
> Tabular format requires identical field sets across all objects (same keys, order doesn't matter) and primitive values only (strings, numbers, booleans, null).
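
The eligibility rule in this note can be sketched as a TypeScript predicate. `isTabularEligible` is a hypothetical function written for illustration, not the encoder's actual logic, and treating an empty array as ineligible is an assumption made here.

```typescript
// Hypothetical check mirroring the note: same key set on every object
// (order-insensitive) and primitive values only. Not the library's real encoder logic.
type Primitive = string | number | boolean | null

function isPrimitive(value: unknown): value is Primitive {
  return value === null || ['string', 'number', 'boolean'].includes(typeof value)
}

function isPlainObject(value: unknown): value is Record<string, unknown> {
  return typeof value === 'object' && value !== null && !Array.isArray(value)
}

function isTabularEligible(items: unknown[]): boolean {
  const objects: Record<string, unknown>[] = []
  for (const item of items) {
    if (!isPlainObject(item)) return false
    objects.push(item)
  }

  const first = objects[0]
  // Assumption: an empty array is treated as not eligible in this sketch.
  if (first === undefined) return false

  // Sorting the keys makes the comparison independent of key order.
  const expectedKeys = JSON.stringify(Object.keys(first).sort())
  return objects.every(object =>
    JSON.stringify(Object.keys(object).sort()) === expectedKeys
    && Object.values(object).every(isPrimitive),
  )
}

console.log(isTabularEligible([{ id: 1, name: 'Ada' }, { name: 'Bob', id: 2 }]))
console.log(isTabularEligible([{ id: 1 }, { id: 2, name: 'Bob' }]))
```

Arrays that fail this check would fall back to the list format.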
#### Mixed and Non-Uniform Arrays
Arrays that don't meet the tabular requirements use list format: