chore(benchmarks): finalize structure-awareness run

Johann Schopplich
2025-11-07 10:33:46 +01:00
parent 89df613059
commit c6ba6446f5
10 changed files with 259 additions and 223 deletions

230
README.md

@@ -180,7 +180,7 @@ Datasets with flat tabular structures where CSV is applicable.
```
<details>
<summary><strong>View detailed examples</strong></summary>
<summary><strong>Show detailed examples</strong></summary>
#### 📈 Time-series analytics data
@@ -317,10 +317,10 @@ repositories[3]{id,name,repo,description,createdAt,updatedAt,pushedAt,stars,watc
<!-- automd:file src="./benchmarks/results/retrieval-accuracy.md" -->
Benchmarks test LLM comprehension across different input formats using 201 data retrieval questions on 4 models.
Benchmarks test LLM comprehension across different input formats using 204 data retrieval questions on 4 models.
<details>
<summary><strong>View Dataset Catalog</strong></summary>
<summary><strong>Show Dataset Catalog</strong></summary>
#### Dataset Catalog
@@ -350,58 +350,67 @@ Benchmarks test LLM comprehension across different input formats using 201 data
Each format's overall performance, balancing accuracy against token cost:
```
TOON ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓ 15.6 │ 68.7% acc │ 4,389 tokens
CSV ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓ 15.3 │ 62.3% acc │ 4,080 tokens
JSON compact ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓░░░ 13.5 │ 67.2% acc │ 4,982 tokens
YAML ▓▓▓▓▓▓▓▓▓▓▓▓▓▓░░░░░░ 11.2 │ 66.7% acc │ 5,976 tokens
JSON ▓▓▓▓▓▓▓▓▓▓▓▓░░░░░░░░ 9.0 │ 65.7% acc │ 7,260 tokens
XML ▓▓▓▓▓▓▓▓▓▓░░░░░░░░░░ 8.1 │ 66.8% acc │ 8,251 tokens
TOON ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓ 17.2 │ 75.5% acc │ 4,389 tokens
CSV ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓ 16.6 │ 67.8% acc │ 4,080 tokens
JSON compact ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓░░░ 14.7 │ 73.3% acc │ 4,982 tokens
YAML ▓▓▓▓▓▓▓▓▓▓▓▓▓▓░░░░░░ 12.1 │ 72.4% acc │ 5,976 tokens
JSON ▓▓▓▓▓▓▓▓▓▓▓▓░░░░░░░░ 10.0 │ 72.4% acc │ 7,260 tokens
XML ▓▓▓▓▓▓▓▓▓▓░░░░░░░░░░ 8.4 │ 69.0% acc │ 8,251 tokens
```
TOON achieves **68.7%** accuracy (vs JSON's 65.7%) while using **39.5% fewer tokens**.
TOON achieves **75.5%** accuracy (vs JSON's 72.4%) while using **39.5% fewer tokens**.
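The efficiency figure in the chart above is consistent with accuracy (in percent) divided by the token count in thousands, and the savings figure with the relative token reduction against JSON. The sketch below reproduces both numbers under that assumption; the formula is inferred from the published values, and the names `FormatResult`, `efficiencyPer1kTokens`, and `tokenSavingsPercent` are hypothetical, not taken from the benchmark code.

```typescript
// Hypothetical shape for one row of the efficiency chart.
interface FormatResult {
  name: string
  accuracyPercent: number
  tokens: number
}

// Assumed definition: accuracy points per 1,000 tokens.
function efficiencyPer1kTokens(result: FormatResult): number {
  return result.accuracyPercent / (result.tokens / 1000)
}

// Relative token reduction of a candidate format against a baseline, in percent.
function tokenSavingsPercent(candidateTokens: number, baselineTokens: number): number {
  return ((baselineTokens - candidateTokens) / baselineTokens) * 100
}

const toon: FormatResult = { name: 'TOON', accuracyPercent: 75.5, tokens: 4389 }
const json: FormatResult = { name: 'JSON', accuracyPercent: 72.4, tokens: 7260 }

console.log(efficiencyPer1kTokens(toon).toFixed(1)) // "17.2"
console.log(efficiencyPer1kTokens(json).toFixed(1)) // "10.0"
console.log(tokenSavingsPercent(toon.tokens, json.tokens).toFixed(1)) // "39.5"
```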
#### Per-Model Accuracy
Accuracy across 4 LLMs on 201 data retrieval questions:
Accuracy across 4 LLMs on 204 data retrieval questions:
```
gpt-5-nano
→ TOON ██████████████████░░ 88.6% (178/201)
JSON compact ██████████████████░░ 88.1% (177/201)
CSV ██████████████████░░ 88.0% (88/100)
YAML █████████████████░░░ 84.6% (170/201)
XML ████████████████░░░░ 81.6% (164/201)
JSON ████████████████░░░░ 80.1% (161/201)
claude-haiku-4-5-20251001
YAML ██████████░░░░░░░░░░ 52.2% (105/201)
→ TOON ██████████░░░░░░░░░ 50.7% (102/201)
JSON ██████████░░░░░░░░░ 50.2% (101/201)
JSON compact ██████████░░░░░░░░░ 49.8% (100/201)
XML ██████████░░░░░░░░░ 49.3% (99/201)
CSV ████████░░░░░░░░░░░ 39.0% (39/100)
→ TOON ████████████░░░░░░░░ 62.3% (127/204)
JSON ██████████░░░░░░░░░ 56.9% (116/204)
YAML ██████████░░░░░░░░░ 55.9% (114/204)
JSON compact ██████████░░░░░░░░░ 54.9% (112/204)
XML ██████████░░░░░░░░░ 54.9% (112/204)
CSV ████████░░░░░░░░░░░ 47.1% (49/104)
gemini-2.5-flash
XML █████████████████░░ 86.1% (173/201)
→ TOON █████████████████░░ 84.1% (169/201)
CSV ████████████████░░░░ 82.0% (82/100)
JSON compact ████████████████░░░░ 81.1% (163/201)
YAML ████████████████░░░ 81.1% (163/201)
JSON ████████████████░░░ 81.1% (163/201)
→ TOON █████████████████░░ 91.2% (186/204)
YAML █████████████████░░ 89.7% (183/204)
JSON compact ██████████████████░░ 87.7% (179/204)
JSON ██████████████████░░ 87.7% (179/204)
XML ████████████████░░░ 87.3% (178/204)
CSV ████████████████░░░ 85.6% (89/104)
gpt-5-nano
JSON compact ███████████████████░ 93.6% (191/204)
CSV ██████████████████░░ 90.4% (94/104)
JSON ██████████████████░░ 89.7% (183/204)
→ TOON ██████████████████░░ 89.2% (182/204)
YAML ██████████████████░░ 89.2% (182/204)
XML ████████████████░░░░ 81.4% (166/204)
grok-4-fast-non-reasoning
→ TOON ██████████░░░░░░░░░░ 51.2% (103/201)
JSON ██████████░░░░░░░░░ 51.2% (103/201)
XML ██████████░░░░░░░░░ 50.2% (101/201)
JSON compact ██████████░░░░░░░░░ 49.8% (100/201)
YAML ██████████░░░░░░░░░░ 48.8% (98/201)
CSV ████████░░░░░░░░░░░░ 40.0% (40/100)
→ TOON ████████████░░░░░░░░ 59.3% (121/204)
JSON compact ██████████░░░░░░░░░ 56.9% (116/204)
JSON ██████████░░░░░░░░░ 55.4% (113/204)
YAML ███████████░░░░░░░░░ 54.9% (112/204)
XML ██████████░░░░░░░░░░ 52.5% (107/204)
CSV ██████████░░░░░░░░░░ 48.1% (50/104)
```
**Key tradeoff:** TOON achieves **68.7% accuracy** (vs JSON's 65.7%) while using **39.5% fewer tokens** on these datasets.
**Key tradeoff:** TOON achieves **75.5% accuracy** (vs JSON's 72.4%) while using **39.5% fewer tokens** on these datasets.
<details>
<summary><strong>Performance by dataset and model</strong></summary>
<summary><strong>Performance by dataset, model, and question type</strong></summary>
#### Performance by Question Type
| Question Type | TOON | JSON compact | JSON | YAML | XML | CSV |
| ------------- | ---- | ---- | ---- | ---- | ---- | ---- |
| Field Retrieval | 100.0% | 98.9% | 99.6% | 99.3% | 98.5% | 100.0% |
| Aggregation | 56.3% | 52.4% | 53.2% | 53.2% | 47.2% | 40.5% |
| Filtering | 58.9% | 58.3% | 54.2% | 53.1% | 50.5% | 49.1% |
| Structure Awareness | 89.0% | 85.0% | 82.0% | 85.0% | 79.0% | 84.4% |
#### Performance by Dataset
@@ -409,110 +418,110 @@ grok-4-fast-non-reasoning
| Format | Accuracy | Tokens | Correct/Total |
| ------ | -------- | ------ | ------------- |
| `toon` | 65.6% | 2,483 | 105/160 |
| `csv` | 62.5% | 2,337 | 100/160 |
| `json-compact` | 66.3% | 3,943 | 106/160 |
| `yaml` | 63.7% | 4,969 | 102/160 |
| `xml` | 67.5% | 7,314 | 108/160 |
| `json-pretty` | 62.5% | 6,347 | 100/160 |
| `csv` | 70.7% | 2,337 | 116/164 |
| `toon` | 72.0% | 2,483 | 118/164 |
| `json-compact` | 71.3% | 3,943 | 117/164 |
| `yaml` | 70.1% | 4,969 | 115/164 |
| `json-pretty` | 72.6% | 6,347 | 119/164 |
| `xml` | 70.7% | 7,314 | 116/164 |
##### E-commerce orders with nested structures
| Format | Accuracy | Tokens | Correct/Total |
| ------ | -------- | ------ | ------------- |
| `toon` | 75.6% | 7,197 | 121/160 |
| `json-compact` | 70.6% | 6,784 | 113/160 |
| `yaml` | 71.9% | 8,334 | 115/160 |
| `json-pretty` | 68.8% | 10,700 | 110/160 |
| `xml` | 71.9% | 12,013 | 115/160 |
| `toon` | 83.5% | 7,197 | 137/164 |
| `json-compact` | 79.3% | 6,784 | 130/164 |
| `yaml` | 78.7% | 8,334 | 129/164 |
| `json-pretty` | 78.7% | 10,700 | 129/164 |
| `xml` | 73.8% | 12,013 | 121/164 |
##### Time-series analytics data
| Format | Accuracy | Tokens | Correct/Total |
| ------ | -------- | ------ | ------------- |
| `csv` | 63.8% | 1,391 | 74/116 |
| `toon` | 66.4% | 1,513 | 77/116 |
| `json-compact` | 61.2% | 2,339 | 71/116 |
| `yaml` | 65.5% | 2,936 | 76/116 |
| `json-pretty` | 64.7% | 3,663 | 75/116 |
| `xml` | 65.5% | 4,374 | 76/116 |
| `toon` | 75.8% | 1,513 | 91/120 |
| `csv` | 72.5% | 1,391 | 87/120 |
| `json-compact` | 70.0% | 2,339 | 84/120 |
| `yaml` | 70.0% | 2,936 | 84/120 |
| `json-pretty` | 71.7% | 3,663 | 86/120 |
| `xml` | 71.7% | 4,374 | 86/120 |
##### Top 100 GitHub repositories
| Format | Accuracy | Tokens | Correct/Total |
| ------ | -------- | ------ | ------------- |
| `toon` | 63.7% | 8,745 | 79/124 |
| `csv` | 60.5% | 8,513 | 75/124 |
| `json-compact` | 56.5% | 11,455 | 70/124 |
| `yaml` | 53.2% | 13,129 | 66/124 |
| `json-pretty` | 53.2% | 15,145 | 66/124 |
| `xml` | 53.2% | 17,095 | 66/124 |
| `toon` | 64.4% | 8,745 | 85/132 |
| `csv` | 59.8% | 8,513 | 79/132 |
| `json-compact` | 60.6% | 11,455 | 80/132 |
| `yaml` | 61.4% | 13,129 | 81/132 |
| `json-pretty` | 59.1% | 15,145 | 78/132 |
| `xml` | 51.5% | 17,095 | 68/132 |
##### Semi-uniform event logs
| Format | Accuracy | Tokens | Correct/Total |
| ------ | -------- | ------ | ------------- |
| `json-compact` | 55.0% | 4,809 | 66/120 |
| `yaml` | 52.5% | 5,814 | 63/120 |
| `json-pretty` | 52.5% | 6,784 | 63/120 |
| `toon` | 45.8% | 5,764 | 55/120 |
| `xml` | 50.8% | 7,699 | 61/120 |
| `json-compact` | 67.5% | 4,809 | 81/120 |
| `yaml` | 63.3% | 5,814 | 76/120 |
| `toon` | 62.5% | 5,764 | 75/120 |
| `json-pretty` | 59.2% | 6,784 | 71/120 |
| `xml` | 55.0% | 7,699 | 66/120 |
##### Deeply nested configuration
| Format | Accuracy | Tokens | Correct/Total |
| ------ | -------- | ------ | ------------- |
| `json-compact` | 91.9% | 564 | 114/124 |
| `toon` | 92.7% | 631 | 115/124 |
| `yaml` | 91.9% | 673 | 114/124 |
| `json-pretty` | 91.9% | 919 | 114/124 |
| `xml` | 89.5% | 1,008 | 111/124 |
| `json-compact` | 91.4% | 564 | 106/116 |
| `toon` | 94.8% | 631 | 110/116 |
| `yaml` | 91.4% | 673 | 106/116 |
| `json-pretty` | 93.1% | 919 | 108/116 |
| `xml` | 91.4% | 1,008 | 106/116 |
#### Performance by Model
##### gpt-5-nano
| Format | Accuracy | Correct/Total |
| ------ | -------- | ------------- |
| `toon` | 88.6% | 178/201 |
| `json-compact` | 88.1% | 177/201 |
| `csv` | 88.0% | 88/100 |
| `yaml` | 84.6% | 170/201 |
| `xml` | 81.6% | 164/201 |
| `json-pretty` | 80.1% | 161/201 |
##### claude-haiku-4-5-20251001
| Format | Accuracy | Correct/Total |
| ------ | -------- | ------------- |
| `yaml` | 52.2% | 105/201 |
| `toon` | 50.7% | 102/201 |
| `json-pretty` | 50.2% | 101/201 |
| `json-compact` | 49.8% | 100/201 |
| `xml` | 49.3% | 99/201 |
| `csv` | 39.0% | 39/100 |
| `toon` | 62.3% | 127/204 |
| `json-pretty` | 56.9% | 116/204 |
| `yaml` | 55.9% | 114/204 |
| `json-compact` | 54.9% | 112/204 |
| `xml` | 54.9% | 112/204 |
| `csv` | 47.1% | 49/104 |
##### gemini-2.5-flash
| Format | Accuracy | Correct/Total |
| ------ | -------- | ------------- |
| `xml` | 86.1% | 173/201 |
| `toon` | 84.1% | 169/201 |
| `csv` | 82.0% | 82/100 |
| `json-compact` | 81.1% | 163/201 |
| `yaml` | 81.1% | 163/201 |
| `json-pretty` | 81.1% | 163/201 |
| `toon` | 91.2% | 186/204 |
| `yaml` | 89.7% | 183/204 |
| `json-compact` | 87.7% | 179/204 |
| `json-pretty` | 87.7% | 179/204 |
| `xml` | 87.3% | 178/204 |
| `csv` | 85.6% | 89/104 |
##### gpt-5-nano
| Format | Accuracy | Correct/Total |
| ------ | -------- | ------------- |
| `json-compact` | 93.6% | 191/204 |
| `csv` | 90.4% | 94/104 |
| `json-pretty` | 89.7% | 183/204 |
| `toon` | 89.2% | 182/204 |
| `yaml` | 89.2% | 182/204 |
| `xml` | 81.4% | 166/204 |
##### grok-4-fast-non-reasoning
| Format | Accuracy | Correct/Total |
| ------ | -------- | ------------- |
| `toon` | 51.2% | 103/201 |
| `json-pretty` | 51.2% | 103/201 |
| `xml` | 50.2% | 101/201 |
| `json-compact` | 49.8% | 100/201 |
| `yaml` | 48.8% | 98/201 |
| `csv` | 40.0% | 40/100 |
| `toon` | 59.3% | 121/204 |
| `json-compact` | 56.9% | 116/204 |
| `json-pretty` | 55.4% | 113/204 |
| `yaml` | 54.9% | 112/204 |
| `xml` | 52.5% | 107/204 |
| `csv` | 48.1% | 50/104 |
</details>
@@ -536,34 +545,39 @@ Six datasets designed to test different structural patterns:
#### Question Types
201 questions are generated dynamically across three categories:
204 questions are generated dynamically across four categories:
- **Field retrieval (36%)**: Direct value lookups or values that can be read straight off a record (including booleans and simple counts such as array lengths)
- **Field retrieval (33%)**: Direct value lookups or values that can be read straight off a record (including booleans and simple counts such as array lengths)
- Example: "What is Alice's salary?" → `75000`
- Example: "How many items are in order ORD-0042?" → `3`
- Example: "What is the customer name for order ORD-0042?" → `John Doe`
- **Aggregation (37%)**: Dataset-level totals and averages plus single-condition filters (counts, sums, min/max comparisons)
- **Aggregation (31%)**: Dataset-level totals and averages plus single-condition filters (counts, sums, min/max comparisons)
- Example: "How many employees work in Engineering?" → `17`
- Example: "What is the total revenue across all orders?" → `45123.50`
- Example: "How many employees have salary > 80000?" → `23`
- **Filtering (27%)**: Multi-condition queries requiring compound logic (AND constraints across fields)
- **Filtering (24%)**: Multi-condition queries requiring compound logic (AND constraints across fields)
- Example: "How many employees in Sales have salary > 80000?" → `5`
- Example: "How many active employees have more than 10 years of experience?" → `8`
- **Structure awareness (12%)**: Tests format-native structural affordances (TOON's [N] count and {fields}, CSV's header row)
- Example: "How many employees are in the dataset?" → `100`
- Example: "List the field names for employees" → `id, name, email, department, salary, yearsExperience, active`
- Example: "What is the department of the last employee?" → `Sales`
#### Evaluation Process
1. **Format conversion**: Each dataset is converted to all 6 formats (TOON, JSON compact, XML, YAML, JSON, CSV).
1. **Format conversion**: Each dataset is converted to all 6 formats (TOON, JSON compact, JSON, YAML, XML, CSV).
2. **Query LLM**: Each model receives formatted data + question in a prompt and extracts the answer.
3. **Validate with LLM-as-judge**: `gpt-5-nano` validates if the answer is semantically correct (e.g., `50000` = `$50,000`, `Engineering` = `engineering`, `2025-01-01` = `January 1, 2025`).
#### Models & Configuration
- **Models tested**: `gpt-5-nano`, `claude-haiku-4-5-20251001`, `gemini-2.5-flash`, `grok-4-fast-non-reasoning`
- **Models tested**: `claude-haiku-4-5-20251001`, `gemini-2.5-flash`, `gpt-5-nano`, `grok-4-fast-non-reasoning`
- **Token counting**: Using `gpt-tokenizer` with `o200k_base` encoding (GPT-5 tokenizer)
- **Temperature**: Not set (models use their defaults)
- **Total evaluations**: 201 questions × 6 formats × 4 models = 4,824 LLM calls
- **Total evaluations**: 204 questions × 6 formats × 4 models = 4,896 LLM calls
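The stated total can be checked with a few lines of arithmetic. The per-model tables report CSV out of 104 questions rather than 204, so the number of scored answers may be lower than the stated call count; whether calls are still issued for the questions not scored for CSV is not stated here, and the second figure below is only an illustration derived from the published denominators.

```typescript
// Figures taken from the report text and the per-model tables.
const questions = 204
const formats = 6
const models = 4
const csvQuestions = 104 // CSV denominator shown in the per-model tables

// Total as stated in the report: every question in every format on every model.
const statedTotal = questions * formats * models

// Illustration only: scored answers if CSV covers just its 104 questions.
const scoredPerModel = (formats - 1) * questions + csvQuestions
const scoredTotal = scoredPerModel * models

console.log(statedTotal) // 4896
console.log(scoredPerModel) // 1124
console.log(scoredTotal) // 4496
```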
</details>

File diff suppressed because one or more lines are too long (4 files)


@@ -1,7 +1,7 @@
Benchmarks test LLM comprehension across different input formats using 201 data retrieval questions on 4 models.
Benchmarks test LLM comprehension across different input formats using 204 data retrieval questions on 4 models.
<details>
<summary><strong>View Dataset Catalog</strong></summary>
<summary><strong>Show Dataset Catalog</strong></summary>
#### Dataset Catalog
@@ -31,58 +31,67 @@ Benchmarks test LLM comprehension across different input formats using 201 data
Each format's overall performance, balancing accuracy against token cost:
```
TOON ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓ 15.6 │ 68.7% acc │ 4,389 tokens
CSV ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓ 15.3 │ 62.3% acc │ 4,080 tokens
JSON compact ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓░░░ 13.5 │ 67.2% acc │ 4,982 tokens
YAML ▓▓▓▓▓▓▓▓▓▓▓▓▓▓░░░░░░ 11.2 │ 66.7% acc │ 5,976 tokens
JSON ▓▓▓▓▓▓▓▓▓▓▓▓░░░░░░░░ 9.0 │ 65.7% acc │ 7,260 tokens
XML ▓▓▓▓▓▓▓▓▓▓░░░░░░░░░░ 8.1 │ 66.8% acc │ 8,251 tokens
TOON ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓ 17.2 │ 75.5% acc │ 4,389 tokens
CSV ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓ 16.6 │ 67.8% acc │ 4,080 tokens
JSON compact ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓░░░ 14.7 │ 73.3% acc │ 4,982 tokens
YAML ▓▓▓▓▓▓▓▓▓▓▓▓▓▓░░░░░░ 12.1 │ 72.4% acc │ 5,976 tokens
JSON ▓▓▓▓▓▓▓▓▓▓▓▓░░░░░░░░ 10.0 │ 72.4% acc │ 7,260 tokens
XML ▓▓▓▓▓▓▓▓▓▓░░░░░░░░░░ 8.4 │ 69.0% acc │ 8,251 tokens
```
TOON achieves **68.7%** accuracy (vs JSON's 65.7%) while using **39.5% fewer tokens**.
TOON achieves **75.5%** accuracy (vs JSON's 72.4%) while using **39.5% fewer tokens**.
#### Per-Model Accuracy
Accuracy across 4 LLMs on 201 data retrieval questions:
Accuracy across 4 LLMs on 204 data retrieval questions:
```
gpt-5-nano
→ TOON ██████████████████░░ 88.6% (178/201)
JSON compact ██████████████████░░ 88.1% (177/201)
CSV ██████████████████░░ 88.0% (88/100)
YAML █████████████████░░░ 84.6% (170/201)
XML ████████████████░░░░ 81.6% (164/201)
JSON ████████████████░░░░ 80.1% (161/201)
claude-haiku-4-5-20251001
YAML ██████████░░░░░░░░░░ 52.2% (105/201)
→ TOON ██████████░░░░░░░░░ 50.7% (102/201)
JSON ██████████░░░░░░░░░ 50.2% (101/201)
JSON compact ██████████░░░░░░░░░ 49.8% (100/201)
XML ██████████░░░░░░░░░ 49.3% (99/201)
CSV ████████░░░░░░░░░░░ 39.0% (39/100)
→ TOON ████████████░░░░░░░░ 62.3% (127/204)
JSON ██████████░░░░░░░░░ 56.9% (116/204)
YAML ██████████░░░░░░░░░ 55.9% (114/204)
JSON compact ██████████░░░░░░░░░ 54.9% (112/204)
XML ██████████░░░░░░░░░ 54.9% (112/204)
CSV ████████░░░░░░░░░░░ 47.1% (49/104)
gemini-2.5-flash
XML █████████████████░░ 86.1% (173/201)
→ TOON █████████████████░░ 84.1% (169/201)
CSV ████████████████░░░░ 82.0% (82/100)
JSON compact ████████████████░░░░ 81.1% (163/201)
YAML ████████████████░░░ 81.1% (163/201)
JSON ████████████████░░░ 81.1% (163/201)
→ TOON █████████████████░░ 91.2% (186/204)
YAML █████████████████░░ 89.7% (183/204)
JSON compact ██████████████████░░ 87.7% (179/204)
JSON ██████████████████░░ 87.7% (179/204)
XML ████████████████░░░ 87.3% (178/204)
CSV ████████████████░░░ 85.6% (89/104)
gpt-5-nano
JSON compact ███████████████████░ 93.6% (191/204)
CSV ██████████████████░░ 90.4% (94/104)
JSON ██████████████████░░ 89.7% (183/204)
→ TOON ██████████████████░░ 89.2% (182/204)
YAML ██████████████████░░ 89.2% (182/204)
XML ████████████████░░░░ 81.4% (166/204)
grok-4-fast-non-reasoning
→ TOON ██████████░░░░░░░░░░ 51.2% (103/201)
JSON ██████████░░░░░░░░░ 51.2% (103/201)
XML ██████████░░░░░░░░░ 50.2% (101/201)
JSON compact ██████████░░░░░░░░░ 49.8% (100/201)
YAML ██████████░░░░░░░░░░ 48.8% (98/201)
CSV ████████░░░░░░░░░░░░ 40.0% (40/100)
→ TOON ████████████░░░░░░░░ 59.3% (121/204)
JSON compact ██████████░░░░░░░░░ 56.9% (116/204)
JSON ██████████░░░░░░░░░ 55.4% (113/204)
YAML ███████████░░░░░░░░░ 54.9% (112/204)
XML ██████████░░░░░░░░░░ 52.5% (107/204)
CSV ██████████░░░░░░░░░░ 48.1% (50/104)
```
**Key tradeoff:** TOON achieves **68.7% accuracy** (vs JSON's 65.7%) while using **39.5% fewer tokens** on these datasets.
**Key tradeoff:** TOON achieves **75.5% accuracy** (vs JSON's 72.4%) while using **39.5% fewer tokens** on these datasets.
<details>
<summary><strong>Performance by dataset and model</strong></summary>
<summary><strong>Performance by dataset, model, and question type</strong></summary>
#### Performance by Question Type
| Question Type | TOON | JSON compact | JSON | YAML | XML | CSV |
| ------------- | ---- | ---- | ---- | ---- | ---- | ---- |
| Field Retrieval | 100.0% | 98.9% | 99.6% | 99.3% | 98.5% | 100.0% |
| Aggregation | 56.3% | 52.4% | 53.2% | 53.2% | 47.2% | 40.5% |
| Filtering | 58.9% | 58.3% | 54.2% | 53.1% | 50.5% | 49.1% |
| Structure Awareness | 89.0% | 85.0% | 82.0% | 85.0% | 79.0% | 84.4% |
#### Performance by Dataset
@@ -90,110 +99,110 @@ grok-4-fast-non-reasoning
| Format | Accuracy | Tokens | Correct/Total |
| ------ | -------- | ------ | ------------- |
| `toon` | 65.6% | 2,483 | 105/160 |
| `csv` | 62.5% | 2,337 | 100/160 |
| `json-compact` | 66.3% | 3,943 | 106/160 |
| `yaml` | 63.7% | 4,969 | 102/160 |
| `xml` | 67.5% | 7,314 | 108/160 |
| `json-pretty` | 62.5% | 6,347 | 100/160 |
| `csv` | 70.7% | 2,337 | 116/164 |
| `toon` | 72.0% | 2,483 | 118/164 |
| `json-compact` | 71.3% | 3,943 | 117/164 |
| `yaml` | 70.1% | 4,969 | 115/164 |
| `json-pretty` | 72.6% | 6,347 | 119/164 |
| `xml` | 70.7% | 7,314 | 116/164 |
##### E-commerce orders with nested structures
| Format | Accuracy | Tokens | Correct/Total |
| ------ | -------- | ------ | ------------- |
| `toon` | 75.6% | 7,197 | 121/160 |
| `json-compact` | 70.6% | 6,784 | 113/160 |
| `yaml` | 71.9% | 8,334 | 115/160 |
| `json-pretty` | 68.8% | 10,700 | 110/160 |
| `xml` | 71.9% | 12,013 | 115/160 |
| `toon` | 83.5% | 7,197 | 137/164 |
| `json-compact` | 79.3% | 6,784 | 130/164 |
| `yaml` | 78.7% | 8,334 | 129/164 |
| `json-pretty` | 78.7% | 10,700 | 129/164 |
| `xml` | 73.8% | 12,013 | 121/164 |
##### Time-series analytics data
| Format | Accuracy | Tokens | Correct/Total |
| ------ | -------- | ------ | ------------- |
| `csv` | 63.8% | 1,391 | 74/116 |
| `toon` | 66.4% | 1,513 | 77/116 |
| `json-compact` | 61.2% | 2,339 | 71/116 |
| `yaml` | 65.5% | 2,936 | 76/116 |
| `json-pretty` | 64.7% | 3,663 | 75/116 |
| `xml` | 65.5% | 4,374 | 76/116 |
| `toon` | 75.8% | 1,513 | 91/120 |
| `csv` | 72.5% | 1,391 | 87/120 |
| `json-compact` | 70.0% | 2,339 | 84/120 |
| `yaml` | 70.0% | 2,936 | 84/120 |
| `json-pretty` | 71.7% | 3,663 | 86/120 |
| `xml` | 71.7% | 4,374 | 86/120 |
##### Top 100 GitHub repositories
| Format | Accuracy | Tokens | Correct/Total |
| ------ | -------- | ------ | ------------- |
| `toon` | 63.7% | 8,745 | 79/124 |
| `csv` | 60.5% | 8,513 | 75/124 |
| `json-compact` | 56.5% | 11,455 | 70/124 |
| `yaml` | 53.2% | 13,129 | 66/124 |
| `json-pretty` | 53.2% | 15,145 | 66/124 |
| `xml` | 53.2% | 17,095 | 66/124 |
| `toon` | 64.4% | 8,745 | 85/132 |
| `csv` | 59.8% | 8,513 | 79/132 |
| `json-compact` | 60.6% | 11,455 | 80/132 |
| `yaml` | 61.4% | 13,129 | 81/132 |
| `json-pretty` | 59.1% | 15,145 | 78/132 |
| `xml` | 51.5% | 17,095 | 68/132 |
##### Semi-uniform event logs
| Format | Accuracy | Tokens | Correct/Total |
| ------ | -------- | ------ | ------------- |
| `json-compact` | 55.0% | 4,809 | 66/120 |
| `yaml` | 52.5% | 5,814 | 63/120 |
| `json-pretty` | 52.5% | 6,784 | 63/120 |
| `toon` | 45.8% | 5,764 | 55/120 |
| `xml` | 50.8% | 7,699 | 61/120 |
| `json-compact` | 67.5% | 4,809 | 81/120 |
| `yaml` | 63.3% | 5,814 | 76/120 |
| `toon` | 62.5% | 5,764 | 75/120 |
| `json-pretty` | 59.2% | 6,784 | 71/120 |
| `xml` | 55.0% | 7,699 | 66/120 |
##### Deeply nested configuration
| Format | Accuracy | Tokens | Correct/Total |
| ------ | -------- | ------ | ------------- |
| `json-compact` | 91.9% | 564 | 114/124 |
| `toon` | 92.7% | 631 | 115/124 |
| `yaml` | 91.9% | 673 | 114/124 |
| `json-pretty` | 91.9% | 919 | 114/124 |
| `xml` | 89.5% | 1,008 | 111/124 |
| `json-compact` | 91.4% | 564 | 106/116 |
| `toon` | 94.8% | 631 | 110/116 |
| `yaml` | 91.4% | 673 | 106/116 |
| `json-pretty` | 93.1% | 919 | 108/116 |
| `xml` | 91.4% | 1,008 | 106/116 |
#### Performance by Model
##### gpt-5-nano
| Format | Accuracy | Correct/Total |
| ------ | -------- | ------------- |
| `toon` | 88.6% | 178/201 |
| `json-compact` | 88.1% | 177/201 |
| `csv` | 88.0% | 88/100 |
| `yaml` | 84.6% | 170/201 |
| `xml` | 81.6% | 164/201 |
| `json-pretty` | 80.1% | 161/201 |
##### claude-haiku-4-5-20251001
| Format | Accuracy | Correct/Total |
| ------ | -------- | ------------- |
| `yaml` | 52.2% | 105/201 |
| `toon` | 50.7% | 102/201 |
| `json-pretty` | 50.2% | 101/201 |
| `json-compact` | 49.8% | 100/201 |
| `xml` | 49.3% | 99/201 |
| `csv` | 39.0% | 39/100 |
| `toon` | 62.3% | 127/204 |
| `json-pretty` | 56.9% | 116/204 |
| `yaml` | 55.9% | 114/204 |
| `json-compact` | 54.9% | 112/204 |
| `xml` | 54.9% | 112/204 |
| `csv` | 47.1% | 49/104 |
##### gemini-2.5-flash
| Format | Accuracy | Correct/Total |
| ------ | -------- | ------------- |
| `xml` | 86.1% | 173/201 |
| `toon` | 84.1% | 169/201 |
| `csv` | 82.0% | 82/100 |
| `json-compact` | 81.1% | 163/201 |
| `yaml` | 81.1% | 163/201 |
| `json-pretty` | 81.1% | 163/201 |
| `toon` | 91.2% | 186/204 |
| `yaml` | 89.7% | 183/204 |
| `json-compact` | 87.7% | 179/204 |
| `json-pretty` | 87.7% | 179/204 |
| `xml` | 87.3% | 178/204 |
| `csv` | 85.6% | 89/104 |
##### gpt-5-nano
| Format | Accuracy | Correct/Total |
| ------ | -------- | ------------- |
| `json-compact` | 93.6% | 191/204 |
| `csv` | 90.4% | 94/104 |
| `json-pretty` | 89.7% | 183/204 |
| `toon` | 89.2% | 182/204 |
| `yaml` | 89.2% | 182/204 |
| `xml` | 81.4% | 166/204 |
##### grok-4-fast-non-reasoning
| Format | Accuracy | Correct/Total |
| ------ | -------- | ------------- |
| `toon` | 51.2% | 103/201 |
| `json-pretty` | 51.2% | 103/201 |
| `xml` | 50.2% | 101/201 |
| `json-compact` | 49.8% | 100/201 |
| `yaml` | 48.8% | 98/201 |
| `csv` | 40.0% | 40/100 |
| `toon` | 59.3% | 121/204 |
| `json-compact` | 56.9% | 116/204 |
| `json-pretty` | 55.4% | 113/204 |
| `yaml` | 54.9% | 112/204 |
| `xml` | 52.5% | 107/204 |
| `csv` | 48.1% | 50/104 |
</details>
@@ -217,33 +226,38 @@ Six datasets designed to test different structural patterns:
#### Question Types
201 questions are generated dynamically across three categories:
204 questions are generated dynamically across four categories:
- **Field retrieval (36%)**: Direct value lookups or values that can be read straight off a record (including booleans and simple counts such as array lengths)
- **Field retrieval (33%)**: Direct value lookups or values that can be read straight off a record (including booleans and simple counts such as array lengths)
- Example: "What is Alice's salary?" → `75000`
- Example: "How many items are in order ORD-0042?" → `3`
- Example: "What is the customer name for order ORD-0042?" → `John Doe`
- **Aggregation (37%)**: Dataset-level totals and averages plus single-condition filters (counts, sums, min/max comparisons)
- **Aggregation (31%)**: Dataset-level totals and averages plus single-condition filters (counts, sums, min/max comparisons)
- Example: "How many employees work in Engineering?" → `17`
- Example: "What is the total revenue across all orders?" → `45123.50`
- Example: "How many employees have salary > 80000?" → `23`
- **Filtering (27%)**: Multi-condition queries requiring compound logic (AND constraints across fields)
- **Filtering (24%)**: Multi-condition queries requiring compound logic (AND constraints across fields)
- Example: "How many employees in Sales have salary > 80000?" → `5`
- Example: "How many active employees have more than 10 years of experience?" → `8`
- **Structure awareness (12%)**: Tests format-native structural affordances (TOON's [N] count and {fields}, CSV's header row)
- Example: "How many employees are in the dataset?" → `100`
- Example: "List the field names for employees" → `id, name, email, department, salary, yearsExperience, active`
- Example: "What is the department of the last employee?" → `Sales`
#### Evaluation Process
1. **Format conversion**: Each dataset is converted to all 6 formats (TOON, JSON compact, XML, YAML, JSON, CSV).
1. **Format conversion**: Each dataset is converted to all 6 formats (TOON, JSON compact, JSON, YAML, XML, CSV).
2. **Query LLM**: Each model receives formatted data + question in a prompt and extracts the answer.
3. **Validate with LLM-as-judge**: `gpt-5-nano` validates if the answer is semantically correct (e.g., `50000` = `$50,000`, `Engineering` = `engineering`, `2025-01-01` = `January 1, 2025`).
#### Models & Configuration
- **Models tested**: `gpt-5-nano`, `claude-haiku-4-5-20251001`, `gemini-2.5-flash`, `grok-4-fast-non-reasoning`
- **Models tested**: `claude-haiku-4-5-20251001`, `gemini-2.5-flash`, `gpt-5-nano`, `grok-4-fast-non-reasoning`
- **Token counting**: Using `gpt-tokenizer` with `o200k_base` encoding (GPT-5 tokenizer)
- **Temperature**: Not set (models use their defaults)
- **Total evaluations**: 201 questions × 6 formats × 4 models = 4,824 LLM calls
- **Total evaluations**: 204 questions × 6 formats × 4 models = 4,896 LLM calls
</details>


@@ -77,7 +77,7 @@ Datasets with flat tabular structures where CSV is applicable.
```
<details>
<summary><strong>View detailed examples</strong></summary>
<summary><strong>Show detailed examples</strong></summary>
#### 📈 Time-series analytics data


@@ -88,7 +88,7 @@ function generateTotalLines(
baselineFormat?: { name: string, tokens: number },
) {
const separatorHalf = '─'.repeat(36)
const lines: string[] = [`${separatorHalf} Total ${separatorHalf}`]
const lines = [`${separatorHalf} Total ${separatorHalf}`]
if (baselineFormat) {
// Flat-only track with CSV baseline
@@ -300,12 +300,13 @@ const detailedExamples = results
displayData = { metrics: displayData.metrics.slice(0, ANALYTICS_METRICS_LIMIT) }
}
const separator = i < filtered.length - 1 ? '\n\n---' : ''
const emoji = DATASET_ICONS[result.dataset.name] || DEFAULT_DATASET_ICON
const json = result.formats.find(f => f.name === 'json-pretty')!
const toon = result.formats.find(f => f.name === 'toon')!
const separator = i < filtered.length - 1 ? '---' : ''
return `#### ${emoji} ${result.dataset.description}
return `
#### ${emoji} ${result.dataset.description}
**Savings:** ${json.savings.toLocaleString('en-US')} tokens (${json.savingsPercent.toFixed(1)}% reduction vs JSON)
@@ -319,7 +320,10 @@ ${JSON.stringify(displayData, undefined, 2)}
\`\`\`
${encode(displayData)}
\`\`\`${separator}`
\`\`\`
${separator}
`.trim()
})
.join('\n\n')
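The hunk above moves the separator onto its own line in the template and trims each rendered section before joining. A minimal standalone sketch of that pattern follows; `Section` and `renderSections` are hypothetical names used for illustration and are not part of the benchmark script.

```typescript
// Hypothetical input shape: one markdown section per dataset.
interface Section {
  title: string
  body: string
}

// Render each section from a multi-line template literal, trim the
// surrounding whitespace, and join the sections with a blank line.
function renderSections(sections: Section[]): string {
  return sections
    .map((section, i) => {
      // Every section except the last ends with a horizontal rule.
      const separator = i < sections.length - 1 ? '---' : ''
      return `
#### ${section.title}

${section.body}

${separator}
`.trim()
    })
    .join('\n\n')
}

const rendered = renderSections([
  { title: 'First', body: 'alpha' },
  { title: 'Second', body: 'beta' },
])
console.log(rendered)
```

Because `.trim()` runs after interpolation, the empty separator of the last section leaves no trailing blank lines, so the template can stay readable without affecting the output.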
@@ -327,7 +331,7 @@ const markdown = `
${barChartSection}
<details>
<summary><strong>View detailed examples</strong></summary>
<summary><strong>Show detailed examples</strong></summary>
${detailedExamples}


@@ -10,9 +10,9 @@ import { generateText } from 'ai'
* Models used for evaluation
*/
export const models: LanguageModelV2[] = [
openai('gpt-5-nano'),
anthropic('claude-haiku-4-5-20251001'),
google('gemini-2.5-flash'),
openai('gpt-5-nano'),
xai('grok-4-fast-non-reasoning'),
]


@@ -81,7 +81,7 @@ export function generateAccuracyReport(
Benchmarks test LLM comprehension across different input formats using ${totalQuestions} data retrieval questions on ${modelNames.length} ${modelNames.length === 1 ? 'model' : 'models'}.
<details>
<summary><strong>View Dataset Catalog</strong></summary>
<summary><strong>Show Dataset Catalog</strong></summary>
${generateDatasetCatalog(ACCURACY_DATASETS)}