Mirror of https://github.com/voson-wang/toon.git (synced 2026-01-29 15:24:10 +08:00)
chore(benchmarks): replace LLM-as-judge, new structural validation
414
README.md
@@ -75,7 +75,7 @@ See [benchmarks](#benchmarks) for concrete comparisons across different data str
|
||||
|
||||
## Key Features
|
||||
|
||||
- 💸 **Token-efficient:** typically 30–60% fewer tokens than JSON[^1]
|
||||
- 💸 **Token-efficient:** typically 30-60% fewer tokens on large uniform arrays vs formatted JSON[^1]
|
||||
- 🤿 **LLM-friendly guardrails:** explicit lengths and fields enable validation
|
||||
- 🍱 **Minimal syntax:** removes redundant punctuation (braces, brackets, most quotes)
|
||||
- 📐 **Indentation-based structure:** like YAML, uses whitespace instead of braces
|
||||
@@ -108,19 +108,19 @@ Datasets with nested or semi-uniform structures. CSV excluded as it cannot prope
|
||||
```
|
||||
🛒 E-commerce orders with nested structures ┊ Tabular: 33%
|
||||
│
|
||||
TOON █████████████░░░░░░░ 72,743 tokens
|
||||
├─ vs JSON (−33.1%) 108,731 tokens
|
||||
├─ vs JSON compact (+5.5%) 68,936 tokens
|
||||
├─ vs YAML (−14.1%) 84,724 tokens
|
||||
└─ vs XML (−40.5%) 122,313 tokens
|
||||
TOON █████████████░░░░░░░ 72,771 tokens
|
||||
├─ vs JSON (−33.1%) 108,806 tokens
|
||||
├─ vs JSON compact (+5.5%) 68,975 tokens
|
||||
├─ vs YAML (−14.2%) 84,780 tokens
|
||||
└─ vs XML (−40.5%) 122,406 tokens
|
||||
|
||||
🧾 Semi-uniform event logs ┊ Tabular: 50%
|
||||
│
|
||||
TOON █████████████████░░░ 153,223 tokens
|
||||
├─ vs JSON (−15.0%) 180,196 tokens
|
||||
├─ vs JSON compact (+19.9%) 127,740 tokens
|
||||
├─ vs YAML (−0.8%) 154,514 tokens
|
||||
└─ vs XML (−25.2%) 204,800 tokens
|
||||
TOON █████████████████░░░ 153,211 tokens
|
||||
├─ vs JSON (−15.0%) 180,176 tokens
|
||||
├─ vs JSON compact (+19.9%) 127,731 tokens
|
||||
├─ vs YAML (−0.8%) 154,505 tokens
|
||||
└─ vs XML (−25.2%) 204,777 tokens
|
||||
|
||||
🧩 Deeply nested configuration ┊ Tabular: 0%
|
||||
│
|
||||
@@ -131,11 +131,11 @@ Datasets with nested or semi-uniform structures. CSV excluded as it cannot prope
|
||||
└─ vs XML (−37.4%) 1,008 tokens
|
||||
|
||||
──────────────────────────────────── Total ────────────────────────────────────
|
||||
TOON ████████████████░░░░ 226,597 tokens
|
||||
├─ vs JSON (−21.8%) 289,846 tokens
|
||||
├─ vs JSON compact (+14.9%) 197,240 tokens
|
||||
├─ vs YAML (−5.5%) 239,911 tokens
|
||||
└─ vs XML (−30.9%) 328,121 tokens
|
||||
TOON ████████████████░░░░ 226,613 tokens
|
||||
├─ vs JSON (−21.8%) 289,901 tokens
|
||||
├─ vs JSON compact (+14.9%) 197,270 tokens
|
||||
├─ vs YAML (−5.6%) 239,958 tokens
|
||||
└─ vs XML (−31.0%) 328,191 tokens
|
||||
```
|
||||
|
||||
#### Flat-Only Track
|
||||
@@ -145,21 +145,21 @@ Datasets with flat tabular structures where CSV is applicable.
|
||||
```
|
||||
👥 Uniform employee records ┊ Tabular: 100%
|
||||
│
|
||||
CSV ███████████████████░ 46,956 tokens
|
||||
TOON ████████████████████ 49,827 tokens (+6.1% vs CSV)
|
||||
├─ vs JSON (−60.7%) 126,854 tokens
|
||||
├─ vs JSON compact (−36.8%) 78,850 tokens
|
||||
├─ vs YAML (−50.0%) 99,701 tokens
|
||||
└─ vs XML (−66.0%) 146,440 tokens
|
||||
CSV ███████████████████░ 46,954 tokens
|
||||
TOON ████████████████████ 49,831 tokens (+6.1% vs CSV)
|
||||
├─ vs JSON (−60.7%) 126,860 tokens
|
||||
├─ vs JSON compact (−36.8%) 78,856 tokens
|
||||
├─ vs YAML (−50.0%) 99,706 tokens
|
||||
└─ vs XML (−66.0%) 146,444 tokens
|
||||
|
||||
📈 Time-series analytics data ┊ Tabular: 100%
|
||||
│
|
||||
CSV ██████████████████░░ 8,396 tokens
|
||||
TOON ████████████████████ 9,128 tokens (+8.7% vs CSV)
|
||||
├─ vs JSON (−59.0%) 22,258 tokens
|
||||
├─ vs JSON compact (−35.8%) 14,224 tokens
|
||||
├─ vs YAML (−48.9%) 17,871 tokens
|
||||
└─ vs XML (−65.7%) 26,629 tokens
|
||||
CSV ██████████████████░░ 8,388 tokens
|
||||
TOON ████████████████████ 9,120 tokens (+8.7% vs CSV)
|
||||
├─ vs JSON (−59.0%) 22,250 tokens
|
||||
├─ vs JSON compact (−35.8%) 14,216 tokens
|
||||
├─ vs YAML (−48.9%) 17,863 tokens
|
||||
└─ vs XML (−65.7%) 26,621 tokens
|
||||
|
||||
⭐ Top 100 GitHub repositories ┊ Tabular: 100%
|
||||
│
|
||||
@@ -171,12 +171,12 @@ Datasets with flat tabular structures where CSV is applicable.
|
||||
└─ vs XML (−48.8%) 17,095 tokens
|
||||
|
||||
──────────────────────────────────── Total ────────────────────────────────────
|
||||
CSV ███████████████████░ 63,865 tokens
|
||||
TOON ████████████████████ 67,700 tokens (+6.0% vs CSV)
|
||||
├─ vs JSON (−58.8%) 164,257 tokens
|
||||
├─ vs JSON compact (−35.2%) 104,529 tokens
|
||||
├─ vs YAML (−48.2%) 130,701 tokens
|
||||
└─ vs XML (−64.4%) 190,164 tokens
|
||||
CSV ███████████████████░ 63,855 tokens
|
||||
TOON ████████████████████ 67,696 tokens (+6.0% vs CSV)
|
||||
├─ vs JSON (−58.8%) 164,255 tokens
|
||||
├─ vs JSON compact (−35.2%) 104,527 tokens
|
||||
├─ vs YAML (−48.2%) 130,698 tokens
|
||||
└─ vs XML (−64.4%) 190,160 tokens
|
||||
```
|
||||
|
||||
<details>
|
||||
@@ -186,64 +186,64 @@ Datasets with flat tabular structures where CSV is applicable.
|
||||
|
||||
**Savings:** 13,130 tokens (59.0% reduction vs JSON)
|
||||
|
||||
**JSON** (22,258 tokens):
|
||||
**JSON** (22,250 tokens):
|
||||
|
||||
```json
|
||||
{
|
||||
"metrics": [
|
||||
{
|
||||
"date": "2025-01-01",
|
||||
"views": 7708,
|
||||
"clicks": 595,
|
||||
"conversions": 69,
|
||||
"revenue": 15369.93,
|
||||
"bounceRate": 0.35
|
||||
"views": 5715,
|
||||
"clicks": 211,
|
||||
"conversions": 28,
|
||||
"revenue": 7976.46,
|
||||
"bounceRate": 0.47
|
||||
},
|
||||
{
|
||||
"date": "2025-01-02",
|
||||
"views": 5894,
|
||||
"clicks": 381,
|
||||
"conversions": 21,
|
||||
"revenue": 2112.12,
|
||||
"bounceRate": 0.3
|
||||
"views": 7103,
|
||||
"clicks": 393,
|
||||
"conversions": 28,
|
||||
"revenue": 8360.53,
|
||||
"bounceRate": 0.32
|
||||
},
|
||||
{
|
||||
"date": "2025-01-03",
|
||||
"views": 6835,
|
||||
"clicks": 422,
|
||||
"conversions": 35,
|
||||
"revenue": 4525.73,
|
||||
"views": 7248,
|
||||
"clicks": 378,
|
||||
"conversions": 24,
|
||||
"revenue": 3212.57,
|
||||
"bounceRate": 0.5
|
||||
},
|
||||
{
|
||||
"date": "2025-01-04",
|
||||
"views": 5325,
|
||||
"clicks": 305,
|
||||
"conversions": 22,
|
||||
"revenue": 2445.3,
|
||||
"bounceRate": 0.44
|
||||
"views": 2927,
|
||||
"clicks": 77,
|
||||
"conversions": 11,
|
||||
"revenue": 1211.69,
|
||||
"bounceRate": 0.62
|
||||
},
|
||||
{
|
||||
"date": "2025-01-05",
|
||||
"views": 2974,
|
||||
"clicks": 61,
|
||||
"conversions": 6,
|
||||
"revenue": 956.57,
|
||||
"bounceRate": 0.47
|
||||
"views": 3530,
|
||||
"clicks": 82,
|
||||
"conversions": 8,
|
||||
"revenue": 462.77,
|
||||
"bounceRate": 0.56
|
||||
}
|
||||
]
|
||||
}
|
||||
```
|
||||
|
||||
**TOON** (9,128 tokens):
|
||||
**TOON** (9,120 tokens):
|
||||
|
||||
```
|
||||
metrics[5]{date,views,clicks,conversions,revenue,bounceRate}:
|
||||
2025-01-01,7708,595,69,15369.93,0.35
|
||||
2025-01-02,5894,381,21,2112.12,0.3
|
||||
2025-01-03,6835,422,35,4525.73,0.5
|
||||
2025-01-04,5325,305,22,2445.3,0.44
|
||||
2025-01-05,2974,61,6,956.57,0.47
|
||||
2025-01-01,5715,211,28,7976.46,0.47
|
||||
2025-01-02,7103,393,28,8360.53,0.32
|
||||
2025-01-03,7248,378,24,3212.57,0.5
|
||||
2025-01-04,2927,77,11,1211.69,0.62
|
||||
2025-01-05,3530,82,8,462.77,0.56
|
||||
```
|
||||
|
||||
---
|
||||
@@ -317,7 +317,7 @@ repositories[3]{id,name,repo,description,createdAt,updatedAt,pushedAt,stars,watc
|
||||
|
||||
<!-- automd:file src="./benchmarks/results/retrieval-accuracy.md" -->
|
||||
|
||||
Benchmarks test LLM comprehension across different input formats using 204 data retrieval questions on 4 models.
|
||||
Benchmarks test LLM comprehension across different input formats using 209 data retrieval questions on 4 models.
|
||||
|
||||
<details>
|
||||
<summary><strong>Show Dataset Catalog</strong></summary>
|
||||
@@ -332,6 +332,11 @@ Benchmarks test LLM comprehension across different input formats using 204 data
|
||||
| Top 100 GitHub repositories | 100 | uniform | ✓ | 100% |
|
||||
| Semi-uniform event logs | 75 | semi-uniform | ✗ | 50% |
|
||||
| Deeply nested configuration | 11 | deep | ✗ | 0% |
|
||||
| Valid complete dataset (control) | 20 | uniform | ✓ | 100% |
|
||||
| Array truncated: 3 rows removed from end | 17 | uniform | ✓ | 100% |
|
||||
| Extra rows added beyond declared length | 23 | uniform | ✓ | 100% |
|
||||
| Inconsistent field count (missing salary in row 10) | 20 | uniform | ✓ | 100% |
|
||||
| Missing required fields (no email in multiple rows) | 20 | uniform | ✓ | 100% |
|
||||
|
||||
**Structure classes:**
|
||||
- **uniform**: All objects have identical fields with primitive values
|
||||
@@ -350,67 +355,69 @@ Benchmarks test LLM comprehension across different input formats using 204 data
|
||||
Each format's overall performance, balancing accuracy against token cost:
|
||||
|
||||
```
|
||||
TOON ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓ 17.2 │ 75.5% acc │ 4,389 tokens
|
||||
CSV ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓░ 16.6 │ 67.8% acc │ 4,080 tokens
|
||||
JSON compact ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓░░░ 14.7 │ 73.3% acc │ 4,982 tokens
|
||||
YAML ▓▓▓▓▓▓▓▓▓▓▓▓▓▓░░░░░░ 12.1 │ 72.4% acc │ 5,976 tokens
|
||||
JSON ▓▓▓▓▓▓▓▓▓▓▓▓░░░░░░░░ 10.0 │ 72.4% acc │ 7,260 tokens
|
||||
XML ▓▓▓▓▓▓▓▓▓▓░░░░░░░░░░ 8.4 │ 69.0% acc │ 8,251 tokens
|
||||
TOON ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓ 26.9 │ 73.9% acc │ 2,744 tokens
|
||||
JSON compact ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓░░░ 22.9 │ 70.7% acc │ 3,081 tokens
|
||||
YAML ▓▓▓▓▓▓▓▓▓▓▓▓▓▓░░░░░░ 18.6 │ 69.0% acc │ 3,719 tokens
|
||||
JSON ▓▓▓▓▓▓▓▓▓▓▓░░░░░░░░░ 15.3 │ 69.7% acc │ 4,545 tokens
|
||||
XML ▓▓▓▓▓▓▓▓▓▓░░░░░░░░░░ 13.0 │ 67.1% acc │ 5,167 tokens
|
||||
```
|
||||
|
||||
TOON achieves **75.5%** accuracy (vs JSON's 72.4%) while using **39.5% fewer tokens**.
|
||||
TOON achieves **73.9%** accuracy (vs JSON's 69.7%) while using **39.6% fewer tokens**.
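
The ranking scores above appear consistent with accuracy (in percentage points) divided by average tokens in thousands, and the "fewer tokens" figure with a plain relative difference against JSON. The following is a minimal sketch of that arithmetic, not the benchmark's actual code; the function names `efficiency` and `token_savings` are hypothetical, and the formula is an assumption inferred from the published numbers.

```python
def efficiency(accuracy_pct: float, tokens: int) -> float:
    """Accuracy percentage points per 1,000 tokens (assumed scoring formula)."""
    return accuracy_pct / (tokens / 1000)


def token_savings(tokens: int, baseline_tokens: int) -> float:
    """Relative token reduction versus a baseline, as a percentage."""
    return (1 - tokens / baseline_tokens) * 100


# (accuracy %, average tokens) copied from the updated ranking in this diff.
results = {
    "TOON": (73.9, 2744),
    "JSON compact": (70.7, 3081),
    "YAML": (69.0, 3719),
    "JSON": (69.7, 4545),
    "XML": (67.1, 5167),
}

for name, (accuracy, tokens) in results.items():
    print(f"{name:<13} {efficiency(accuracy, tokens):5.1f}")

print(f"{token_savings(2744, 4545):.1f}% fewer tokens than JSON")
```

Under this assumption the printed scores reproduce the ranking (26.9 for TOON, 15.3 for JSON) and the 39.6% savings quoted above.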
|
||||
|
||||
**Note on CSV:** Excluded from ranking as it only supports 109 of the 209 questions (flat tabular data only). While CSV is highly token-efficient for simple tabular data, it cannot represent nested structures that other formats handle.
|
||||
|
||||
#### Per-Model Accuracy
|
||||
|
||||
Accuracy across 4 LLMs on 204 data retrieval questions:
|
||||
Accuracy across 4 LLMs on 209 data retrieval questions:
|
||||
|
||||
```
|
||||
claude-haiku-4-5-20251001
|
||||
→ TOON ████████████░░░░░░░░ 62.3% (127/204)
|
||||
JSON ███████████░░░░░░░░░ 56.9% (116/204)
|
||||
YAML ███████████░░░░░░░░░ 55.9% (114/204)
|
||||
JSON compact ███████████░░░░░░░░░ 54.9% (112/204)
|
||||
XML ███████████░░░░░░░░░ 54.9% (112/204)
|
||||
CSV █████████░░░░░░░░░░░ 47.1% (49/104)
|
||||
→ TOON ████████████░░░░░░░░ 59.8% (125/209)
|
||||
JSON ███████████░░░░░░░░░ 57.4% (120/209)
|
||||
YAML ███████████░░░░░░░░░ 56.0% (117/209)
|
||||
XML ███████████░░░░░░░░░ 55.5% (116/209)
|
||||
JSON compact ███████████░░░░░░░░░ 55.0% (115/209)
|
||||
CSV ██████████░░░░░░░░░░ 50.5% (55/109)
|
||||
|
||||
gemini-2.5-flash
|
||||
→ TOON ██████████████████░░ 91.2% (186/204)
|
||||
YAML ██████████████████░░ 89.7% (183/204)
|
||||
JSON compact ██████████████████░░ 87.7% (179/204)
|
||||
JSON ██████████████████░░ 87.7% (179/204)
|
||||
XML █████████████████░░░ 87.3% (178/204)
|
||||
CSV █████████████████░░░ 85.6% (89/104)
|
||||
→ TOON ██████████████████░░ 87.6% (183/209)
|
||||
CSV █████████████████░░░ 86.2% (94/109)
|
||||
JSON compact ████████████████░░░░ 82.3% (172/209)
|
||||
YAML ████████████████░░░░ 79.4% (166/209)
|
||||
XML ████████████████░░░░ 79.4% (166/209)
|
||||
JSON ███████████████░░░░░ 77.0% (161/209)
|
||||
|
||||
gpt-5-nano
|
||||
JSON compact ███████████████████░ 93.6% (191/204)
|
||||
CSV ██████████████████░░ 90.4% (94/104)
|
||||
JSON ██████████████████░░ 89.7% (183/204)
|
||||
→ TOON ██████████████████░░ 89.2% (182/204)
|
||||
YAML ██████████████████░░ 89.2% (182/204)
|
||||
XML ████████████████░░░░ 81.4% (166/204)
|
||||
→ TOON ██████████████████░░ 90.9% (190/209)
|
||||
JSON compact ██████████████████░░ 90.9% (190/209)
|
||||
JSON ██████████████████░░ 89.0% (186/209)
|
||||
CSV ██████████████████░░ 89.0% (97/109)
|
||||
YAML █████████████████░░░ 87.1% (182/209)
|
||||
XML ████████████████░░░░ 80.9% (169/209)
|
||||
|
||||
grok-4-fast-non-reasoning
|
||||
→ TOON ████████████░░░░░░░░ 59.3% (121/204)
|
||||
JSON compact ███████████░░░░░░░░░ 56.9% (116/204)
|
||||
JSON ███████████░░░░░░░░░ 55.4% (113/204)
|
||||
YAML ███████████░░░░░░░░░ 54.9% (112/204)
|
||||
XML ██████████░░░░░░░░░░ 52.5% (107/204)
|
||||
CSV ██████████░░░░░░░░░░ 48.1% (50/104)
|
||||
→ TOON ███████████░░░░░░░░░ 57.4% (120/209)
|
||||
JSON ███████████░░░░░░░░░ 55.5% (116/209)
|
||||
JSON compact ███████████░░░░░░░░░ 54.5% (114/209)
|
||||
YAML ███████████░░░░░░░░░ 53.6% (112/209)
|
||||
XML ███████████░░░░░░░░░ 52.6% (110/209)
|
||||
CSV ██████████░░░░░░░░░░ 52.3% (57/109)
|
||||
```
|
||||
|
||||
**Key tradeoff:** TOON achieves **75.5% accuracy** (vs JSON's 72.4%) while using **39.5% fewer tokens** on these datasets.
|
||||
**Key tradeoff:** TOON achieves **73.9% accuracy** (vs JSON's 69.7%) while using **39.6% fewer tokens** on these datasets.
|
||||
|
||||
<details>
|
||||
<summary><strong>Performance by dataset, model, and question type</strong></summary>
|
||||
|
||||
#### Performance by Question Type
|
||||
|
||||
| Question Type | TOON | JSON compact | JSON | YAML | XML | CSV |
|
||||
| Question Type | TOON | JSON compact | JSON | CSV | YAML | XML |
|
||||
| ------------- | ---- | ---- | ---- | ---- | ---- | ---- |
|
||||
| Field Retrieval | 100.0% | 98.9% | 99.6% | 99.3% | 98.5% | 100.0% |
|
||||
| Aggregation | 56.3% | 52.4% | 53.2% | 53.2% | 47.2% | 40.5% |
|
||||
| Filtering | 58.9% | 58.3% | 54.2% | 53.1% | 50.5% | 49.1% |
|
||||
| Structure Awareness | 89.0% | 85.0% | 82.0% | 85.0% | 79.0% | 84.4% |
|
||||
| Field Retrieval | 99.6% | 99.3% | 99.3% | 100.0% | 98.2% | 98.9% |
|
||||
| Aggregation | 54.4% | 47.2% | 48.8% | 44.0% | 47.6% | 41.3% |
|
||||
| Filtering | 56.3% | 57.3% | 50.5% | 49.1% | 51.0% | 47.9% |
|
||||
| Structure Awareness | 88.0% | 83.0% | 83.0% | 85.9% | 80.0% | 80.0% |
|
||||
| Structural Validation | 70.0% | 45.0% | 50.0% | 80.0% | 60.0% | 80.0% |
|
||||
|
||||
#### Performance by Dataset
|
||||
|
||||
@@ -418,64 +425,119 @@ grok-4-fast-non-reasoning
|
||||
|
||||
| Format | Accuracy | Tokens | Correct/Total |
|
||||
| ------ | -------- | ------ | ------------- |
|
||||
| `csv` | 70.7% | 2,337 | 116/164 |
|
||||
| `toon` | 72.0% | 2,483 | 118/164 |
|
||||
| `json-compact` | 71.3% | 3,943 | 117/164 |
|
||||
| `yaml` | 70.1% | 4,969 | 115/164 |
|
||||
| `json-pretty` | 72.6% | 6,347 | 119/164 |
|
||||
| `xml` | 70.7% | 7,314 | 116/164 |
|
||||
| `csv` | 72.0% | 2,352 | 118/164 |
|
||||
| `toon` | 73.8% | 2,518 | 121/164 |
|
||||
| `json-compact` | 69.5% | 3,953 | 114/164 |
|
||||
| `yaml` | 68.3% | 4,982 | 112/164 |
|
||||
| `json-pretty` | 68.3% | 6,360 | 112/164 |
|
||||
| `xml` | 69.5% | 7,324 | 114/164 |
|
||||
|
||||
##### E-commerce orders with nested structures
|
||||
|
||||
| Format | Accuracy | Tokens | Correct/Total |
|
||||
| ------ | -------- | ------ | ------------- |
|
||||
| `toon` | 83.5% | 7,197 | 137/164 |
|
||||
| `json-compact` | 79.3% | 6,784 | 130/164 |
|
||||
| `yaml` | 78.7% | 8,334 | 129/164 |
|
||||
| `json-pretty` | 78.7% | 10,700 | 129/164 |
|
||||
| `xml` | 73.8% | 12,013 | 121/164 |
|
||||
| `toon` | 81.1% | 7,232 | 133/164 |
|
||||
| `json-compact` | 76.8% | 6,794 | 126/164 |
|
||||
| `yaml` | 75.6% | 8,347 | 124/164 |
|
||||
| `json-pretty` | 76.2% | 10,713 | 125/164 |
|
||||
| `xml` | 74.4% | 12,023 | 122/164 |
|
||||
|
||||
##### Time-series analytics data
|
||||
|
||||
| Format | Accuracy | Tokens | Correct/Total |
|
||||
| ------ | -------- | ------ | ------------- |
|
||||
| `toon` | 75.8% | 1,513 | 91/120 |
|
||||
| `csv` | 72.5% | 1,391 | 87/120 |
|
||||
| `json-compact` | 70.0% | 2,339 | 84/120 |
|
||||
| `yaml` | 70.0% | 2,936 | 84/120 |
|
||||
| `json-pretty` | 71.7% | 3,663 | 86/120 |
|
||||
| `xml` | 71.7% | 4,374 | 86/120 |
|
||||
| `csv` | 73.3% | 1,406 | 88/120 |
|
||||
| `toon` | 72.5% | 1,548 | 87/120 |
|
||||
| `json-compact` | 71.7% | 2,349 | 86/120 |
|
||||
| `yaml` | 71.7% | 2,949 | 86/120 |
|
||||
| `json-pretty` | 68.3% | 3,676 | 82/120 |
|
||||
| `xml` | 68.3% | 4,384 | 82/120 |
|
||||
|
||||
##### Top 100 GitHub repositories
|
||||
|
||||
| Format | Accuracy | Tokens | Correct/Total |
|
||||
| ------ | -------- | ------ | ------------- |
|
||||
| `toon` | 64.4% | 8,745 | 85/132 |
|
||||
| `csv` | 59.8% | 8,513 | 79/132 |
|
||||
| `json-compact` | 60.6% | 11,455 | 80/132 |
|
||||
| `yaml` | 61.4% | 13,129 | 81/132 |
|
||||
| `json-pretty` | 59.1% | 15,145 | 78/132 |
|
||||
| `xml` | 51.5% | 17,095 | 68/132 |
|
||||
| `toon` | 62.9% | 8,780 | 83/132 |
|
||||
| `csv` | 61.4% | 8,528 | 81/132 |
|
||||
| `yaml` | 59.8% | 13,142 | 79/132 |
|
||||
| `json-compact` | 55.3% | 11,465 | 73/132 |
|
||||
| `json-pretty` | 56.1% | 15,158 | 74/132 |
|
||||
| `xml` | 48.5% | 17,105 | 64/132 |
|
||||
|
||||
##### Semi-uniform event logs
|
||||
|
||||
| Format | Accuracy | Tokens | Correct/Total |
|
||||
| ------ | -------- | ------ | ------------- |
|
||||
| `json-compact` | 67.5% | 4,809 | 81/120 |
|
||||
| `yaml` | 63.3% | 5,814 | 76/120 |
|
||||
| `toon` | 62.5% | 5,764 | 75/120 |
|
||||
| `json-pretty` | 59.2% | 6,784 | 71/120 |
|
||||
| `xml` | 55.0% | 7,699 | 66/120 |
|
||||
| `json-compact` | 63.3% | 4,819 | 76/120 |
|
||||
| `toon` | 57.5% | 5,799 | 69/120 |
|
||||
| `json-pretty` | 59.2% | 6,797 | 71/120 |
|
||||
| `yaml` | 48.3% | 5,827 | 58/120 |
|
||||
| `xml` | 46.7% | 7,709 | 56/120 |
|
||||
|
||||
##### Deeply nested configuration
|
||||
|
||||
| Format | Accuracy | Tokens | Correct/Total |
|
||||
| ------ | -------- | ------ | ------------- |
|
||||
| `json-compact` | 91.4% | 564 | 106/116 |
|
||||
| `toon` | 94.8% | 631 | 110/116 |
|
||||
| `yaml` | 91.4% | 673 | 106/116 |
|
||||
| `json-pretty` | 93.1% | 919 | 108/116 |
|
||||
| `xml` | 91.4% | 1,008 | 106/116 |
|
||||
| `json-compact` | 92.2% | 574 | 107/116 |
|
||||
| `toon` | 95.7% | 666 | 111/116 |
|
||||
| `yaml` | 91.4% | 686 | 106/116 |
|
||||
| `json-pretty` | 94.0% | 932 | 109/116 |
|
||||
| `xml` | 92.2% | 1,018 | 107/116 |
|
||||
|
||||
##### Valid complete dataset (control)
|
||||
|
||||
| Format | Accuracy | Tokens | Correct/Total |
|
||||
| ------ | -------- | ------ | ------------- |
|
||||
| `toon` | 100.0% | 544 | 4/4 |
|
||||
| `json-compact` | 100.0% | 795 | 4/4 |
|
||||
| `yaml` | 100.0% | 1,003 | 4/4 |
|
||||
| `json-pretty` | 100.0% | 1,282 | 4/4 |
|
||||
| `csv` | 25.0% | 492 | 1/4 |
|
||||
| `xml` | 0.0% | 1,467 | 0/4 |
|
||||
|
||||
##### Array truncated: 3 rows removed from end
|
||||
|
||||
| Format | Accuracy | Tokens | Correct/Total |
|
||||
| ------ | -------- | ------ | ------------- |
|
||||
| `csv` | 100.0% | 425 | 4/4 |
|
||||
| `xml` | 100.0% | 1,251 | 4/4 |
|
||||
| `toon` | 0.0% | 474 | 0/4 |
|
||||
| `json-compact` | 0.0% | 681 | 0/4 |
|
||||
| `json-pretty` | 0.0% | 1,096 | 0/4 |
|
||||
| `yaml` | 0.0% | 859 | 0/4 |
|
||||
|
||||
##### Extra rows added beyond declared length
|
||||
|
||||
| Format | Accuracy | Tokens | Correct/Total |
|
||||
| ------ | -------- | ------ | ------------- |
|
||||
| `csv` | 100.0% | 566 | 4/4 |
|
||||
| `toon` | 75.0% | 621 | 3/4 |
|
||||
| `xml` | 100.0% | 1,692 | 4/4 |
|
||||
| `yaml` | 75.0% | 1,157 | 3/4 |
|
||||
| `json-compact` | 50.0% | 917 | 2/4 |
|
||||
| `json-pretty` | 50.0% | 1,476 | 2/4 |
|
||||
|
||||
##### Inconsistent field count (missing salary in row 10)
|
||||
|
||||
| Format | Accuracy | Tokens | Correct/Total |
|
||||
| ------ | -------- | ------ | ------------- |
|
||||
| `csv` | 75.0% | 489 | 3/4 |
|
||||
| `yaml` | 100.0% | 996 | 4/4 |
|
||||
| `toon` | 100.0% | 1,019 | 4/4 |
|
||||
| `json-compact` | 75.0% | 790 | 3/4 |
|
||||
| `xml` | 100.0% | 1,458 | 4/4 |
|
||||
| `json-pretty` | 75.0% | 1,274 | 3/4 |
|
||||
|
||||
##### Missing required fields (no email in multiple rows)
|
||||
|
||||
| Format | Accuracy | Tokens | Correct/Total |
|
||||
| ------ | -------- | ------ | ------------- |
|
||||
| `csv` | 100.0% | 329 | 4/4 |
|
||||
| `xml` | 100.0% | 1,411 | 4/4 |
|
||||
| `toon` | 75.0% | 983 | 3/4 |
|
||||
| `yaml` | 25.0% | 960 | 1/4 |
|
||||
| `json-pretty` | 25.0% | 1,230 | 1/4 |
|
||||
| `json-compact` | 0.0% | 755 | 0/4 |
|
||||
|
||||
#### Performance by Model
|
||||
|
||||
@@ -483,45 +545,45 @@ grok-4-fast-non-reasoning
|
||||
|
||||
| Format | Accuracy | Correct/Total |
|
||||
| ------ | -------- | ------------- |
|
||||
| `toon` | 62.3% | 127/204 |
|
||||
| `json-pretty` | 56.9% | 116/204 |
|
||||
| `yaml` | 55.9% | 114/204 |
|
||||
| `json-compact` | 54.9% | 112/204 |
|
||||
| `xml` | 54.9% | 112/204 |
|
||||
| `csv` | 47.1% | 49/104 |
|
||||
| `toon` | 59.8% | 125/209 |
|
||||
| `json-pretty` | 57.4% | 120/209 |
|
||||
| `yaml` | 56.0% | 117/209 |
|
||||
| `xml` | 55.5% | 116/209 |
|
||||
| `json-compact` | 55.0% | 115/209 |
|
||||
| `csv` | 50.5% | 55/109 |
|
||||
|
||||
##### gemini-2.5-flash
|
||||
|
||||
| Format | Accuracy | Correct/Total |
|
||||
| ------ | -------- | ------------- |
|
||||
| `toon` | 91.2% | 186/204 |
|
||||
| `yaml` | 89.7% | 183/204 |
|
||||
| `json-compact` | 87.7% | 179/204 |
|
||||
| `json-pretty` | 87.7% | 179/204 |
|
||||
| `xml` | 87.3% | 178/204 |
|
||||
| `csv` | 85.6% | 89/104 |
|
||||
| `toon` | 87.6% | 183/209 |
|
||||
| `csv` | 86.2% | 94/109 |
|
||||
| `json-compact` | 82.3% | 172/209 |
|
||||
| `yaml` | 79.4% | 166/209 |
|
||||
| `xml` | 79.4% | 166/209 |
|
||||
| `json-pretty` | 77.0% | 161/209 |
|
||||
|
||||
##### gpt-5-nano
|
||||
|
||||
| Format | Accuracy | Correct/Total |
|
||||
| ------ | -------- | ------------- |
|
||||
| `json-compact` | 93.6% | 191/204 |
|
||||
| `csv` | 90.4% | 94/104 |
|
||||
| `json-pretty` | 89.7% | 183/204 |
|
||||
| `toon` | 89.2% | 182/204 |
|
||||
| `yaml` | 89.2% | 182/204 |
|
||||
| `xml` | 81.4% | 166/204 |
|
||||
| `toon` | 90.9% | 190/209 |
|
||||
| `json-compact` | 90.9% | 190/209 |
|
||||
| `json-pretty` | 89.0% | 186/209 |
|
||||
| `csv` | 89.0% | 97/109 |
|
||||
| `yaml` | 87.1% | 182/209 |
|
||||
| `xml` | 80.9% | 169/209 |
|
||||
|
||||
##### grok-4-fast-non-reasoning
|
||||
|
||||
| Format | Accuracy | Correct/Total |
|
||||
| ------ | -------- | ------------- |
|
||||
| `toon` | 59.3% | 121/204 |
|
||||
| `json-compact` | 56.9% | 116/204 |
|
||||
| `json-pretty` | 55.4% | 113/204 |
|
||||
| `yaml` | 54.9% | 112/204 |
|
||||
| `xml` | 52.5% | 107/204 |
|
||||
| `csv` | 48.1% | 50/104 |
|
||||
| `toon` | 57.4% | 120/209 |
|
||||
| `json-pretty` | 55.5% | 116/209 |
|
||||
| `json-compact` | 54.5% | 114/209 |
|
||||
| `yaml` | 53.6% | 112/209 |
|
||||
| `xml` | 52.6% | 110/209 |
|
||||
| `csv` | 52.3% | 57/109 |
|
||||
|
||||
</details>
|
||||
|
||||
@@ -534,8 +596,9 @@ This benchmark tests **LLM comprehension and data retrieval accuracy** across di
|
||||
|
||||
#### Datasets Tested
|
||||
|
||||
Six datasets designed to test different structural patterns:
|
||||
Eleven datasets designed to test different structural patterns and validation capabilities:
|
||||
|
||||
**Primary datasets:**
|
||||
1. **Tabular** (100 employee records): Uniform objects with identical fields – optimal for TOON's tabular format.
|
||||
2. **Nested** (50 e-commerce orders): Complex structures with nested customer objects and item arrays.
|
||||
3. **Analytics** (60 days of metrics): Time-series data with dates and numeric values.
|
||||
@@ -543,21 +606,28 @@ Six datasets designed to test different structural patterns:
|
||||
5. **Event Logs** (75 logs): Semi-uniform data with ~50% flat logs and ~50% with nested error objects.
|
||||
6. **Nested Config** (1 configuration): Deeply nested configuration with minimal tabular eligibility.
|
||||
|
||||
**Structural validation datasets:**
|
||||
7. **Control**: Valid complete dataset (baseline for validation)
|
||||
8. **Truncated**: Array with 3 rows removed from end (tests [N] length detection)
|
||||
9. **Extra rows**: Array with 3 additional rows beyond declared length
|
||||
10. **Width mismatch**: Inconsistent field count (missing salary in row 10)
|
||||
11. **Missing fields**: Systematic field omissions (no email in multiple rows)
|
||||
|
||||
#### Question Types
|
||||
|
||||
204 questions are generated dynamically across four categories:
|
||||
209 questions are generated dynamically across five categories:
|
||||
|
||||
- **Field retrieval (33%)**: Direct value lookups or values that can be read straight off a record (including booleans and simple counts such as array lengths)
|
||||
- Example: "What is Alice's salary?" → `75000`
|
||||
- Example: "How many items are in order ORD-0042?" → `3`
|
||||
- Example: "What is the customer name for order ORD-0042?" → `John Doe`
|
||||
|
||||
- **Aggregation (31%)**: Dataset-level totals and averages plus single-condition filters (counts, sums, min/max comparisons)
|
||||
- **Aggregation (30%)**: Dataset-level totals and averages plus single-condition filters (counts, sums, min/max comparisons)
|
||||
- Example: "How many employees work in Engineering?" → `17`
|
||||
- Example: "What is the total revenue across all orders?" → `45123.50`
|
||||
- Example: "How many employees have salary > 80000?" → `23`
|
||||
|
||||
- **Filtering (24%)**: Multi-condition queries requiring compound logic (AND constraints across fields)
|
||||
- **Filtering (23%)**: Multi-condition queries requiring compound logic (AND constraints across fields)
|
||||
- Example: "How many employees in Sales have salary > 80000?" → `5`
|
||||
- Example: "How many active employees have more than 10 years of experience?" → `8`
|
||||
|
||||
@@ -566,18 +636,23 @@ Six datasets designed to test different structural patterns:
|
||||
- Example: "List the field names for employees" → `id, name, email, department, salary, yearsExperience, active`
|
||||
- Example: "What is the department of the last employee?" → `Sales`
|
||||
|
||||
- **Structural validation (2%)**: Tests ability to detect incomplete, truncated, or corrupted data using structural metadata
|
||||
- Example: "Is this data complete and valid?" → `YES` (control dataset) or `NO` (corrupted datasets)
|
||||
- Tests TOON's [N] length validation and {fields} consistency checking
|
||||
- Demonstrates CSV's lack of structural validation capabilities
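
To illustrate what the `[N]` length and `{fields}` checks make possible, here is a rough sketch of a validator for a single tabular block. It is not the TOON library's decoder: the function name `validate_tabular` is hypothetical, and it assumes a comma delimiter, no quoted values, and one row per non-empty line after the header.

```python
import re

# Matches a tabular header such as: metrics[5]{date,views,clicks}:
HEADER = re.compile(r"^(?P<key>\w+)\[(?P<n>\d+)\]\{(?P<fields>[^}]*)\}:$")


def validate_tabular(block: str) -> list[str]:
    """Return a list of structural problems; an empty list means none were found."""
    lines = [line for line in block.splitlines() if line.strip()]
    if not lines:
        return ["empty block"]

    match = HEADER.match(lines[0].strip())
    if match is None:
        return ["first line is not a tabular header"]

    declared = int(match.group("n"))
    fields = match.group("fields").split(",")
    rows = [line.strip().split(",") for line in lines[1:]]

    problems = []
    # [N] check: catches truncated arrays and extra rows.
    if len(rows) != declared:
        problems.append(f"declared {declared} rows, found {len(rows)}")
    # {fields} check: catches rows with missing or surplus values.
    for index, row in enumerate(rows, start=1):
        if len(row) != len(fields):
            problems.append(
                f"row {index}: expected {len(fields)} values, found {len(row)}"
            )
    return problems


truncated = "metrics[5]{date,views}:\n  2025-01-01,5715\n  2025-01-02,7103\n"
print(validate_tabular(truncated))  # ['declared 5 rows, found 2']
```

A plain CSV file carries neither the declared length nor a per-block field list, so the same checks have nothing to compare against.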
|
||||
|
||||
#### Evaluation Process
|
||||
|
||||
1. **Format conversion**: Each dataset is converted to all 6 formats (TOON, JSON compact, JSON, YAML, XML, CSV).
|
||||
1. **Format conversion**: Each dataset is converted to all 6 formats (TOON, JSON compact, JSON, CSV, YAML, XML).
|
||||
2. **Query LLM**: Each model receives formatted data + question in a prompt and extracts the answer.
|
||||
3. **Validate with LLM-as-judge**: `gpt-5-nano` validates if the answer is semantically correct (e.g., `50000` = `$50,000`, `Engineering` = `engineering`, `2025-01-01` = `January 1, 2025`).
|
||||
3. **Validate deterministically**: Answers are validated using type-aware comparison (e.g., `50000` = `$50,000`, `Engineering` = `engineering`, `2025-01-01` = `January 1, 2025`) without requiring an LLM judge.
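
A type-aware comparison along the lines of step 3 could look roughly like the sketch below. This is an illustration under stated assumptions, not the benchmark's validator: the helper names are hypothetical, only two date layouts are recognised, and month names are parsed in the default English locale.

```python
from datetime import date, datetime
from typing import Optional


def _as_number(text: str) -> Optional[float]:
    """Parse '50000' or '$50,000' as a float; return None if not numeric."""
    cleaned = text.strip().replace("$", "").replace(",", "")
    try:
        return float(cleaned)
    except ValueError:
        return None


def _as_date(text: str) -> Optional[date]:
    """Parse '2025-01-01' or 'January 1, 2025'; return None otherwise."""
    for fmt in ("%Y-%m-%d", "%B %d, %Y"):
        try:
            return datetime.strptime(text.strip(), fmt).date()
        except ValueError:
            continue
    return None


def answers_match(expected: str, actual: str) -> bool:
    """Compare as numbers, then as dates, then as case-insensitive strings."""
    for parse in (_as_number, _as_date):
        e, a = parse(expected), parse(actual)
        if e is not None and a is not None:
            return e == a
    return expected.strip().casefold() == actual.strip().casefold()


print(answers_match("50000", "$50,000"))                # True
print(answers_match("Engineering", "engineering"))      # True
print(answers_match("2025-01-01", "January 1, 2025"))   # True
```

Because every rule is a pure function of the two strings, the same answer always receives the same verdict, which is the main practical difference from an LLM judge.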
|
||||
|
||||
#### Models & Configuration
|
||||
|
||||
- **Models tested**: `claude-haiku-4-5-20251001`, `gemini-2.5-flash`, `gpt-5-nano`, `grok-4-fast-non-reasoning`
|
||||
- **Token counting**: Using `gpt-tokenizer` with `o200k_base` encoding (GPT-5 tokenizer)
|
||||
- **Temperature**: Not set (models use their defaults)
|
||||
- **Total evaluations**: 204 questions × 6 formats × 4 models = 4,896 LLM calls
|
||||
- **Total evaluations**: 209 questions × 6 formats × 4 models = 5,016 LLM calls
|
||||
|
||||
</details>
|
||||
|
||||
@@ -782,6 +857,9 @@ items[1]:
|
||||
status: active
|
||||
```
|
||||
|
||||
> [!NOTE]
|
||||
> Tabular format requires identical field sets across all objects (same keys, order doesn't matter) and primitive values only (strings, numbers, booleans, null).
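
The rule in the note can be expressed as a small predicate. The sketch below is an illustration, not the library's implementation: the name `is_tabular` is hypothetical, and treating an empty array as non-tabular is an assumption made only for this example.

```python
PRIMITIVES = (str, int, float, bool, type(None))


def is_tabular(items: list) -> bool:
    """True if every item is a dict with the same key set and only primitive values."""
    if not items or not all(isinstance(item, dict) for item in items):
        return False
    keys = set(items[0])  # key order is ignored by comparing sets
    return all(
        set(item) == keys
        and all(isinstance(value, PRIMITIVES) for value in item.values())
        for item in items
    )


print(is_tabular([{"id": 1, "name": "a"}, {"name": "b", "id": 2}]))  # True
print(is_tabular([{"id": 1}, {"id": {"nested": True}}]))             # False
```

Arrays that fail this check fall back to the list format described next.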
|
||||
|
||||
#### Mixed and Non-Uniform Arrays
|
||||
|
||||
Arrays that don't meet the tabular requirements use list format:
|
||||
|
||||