docs: overhaul retrieval accuracy benchmark

Johann Schopplich
2025-10-28 20:22:43 +01:00
parent efbe4ded88
commit 67c0df8cb0
22 changed files with 1553 additions and 27288 deletions

File diff suppressed because one or more lines are too long

File diff suppressed because it is too large

@@ -1,91 +0,0 @@
{
"formatResults": [
{
"format": "toon",
"accuracy": 0.8658280922431866,
"totalTokens": 4678,
"averageLatency": 5321,
"correctCount": 413,
"totalCount": 477
},
{
"format": "xml",
"accuracy": 0.8616352201257862,
"totalTokens": 9944,
"averageLatency": 6035,
"correctCount": 411,
"totalCount": 477
},
{
"format": "csv",
"accuracy": 0.8469601677148847,
"totalTokens": 4745,
"averageLatency": 6551,
"correctCount": 404,
"totalCount": 477
},
{
"format": "json",
"accuracy": 0.8322851153039832,
"totalTokens": 8713,
"averageLatency": 7981,
"correctCount": 397,
"totalCount": 477
},
{
"format": "yaml",
"accuracy": 0.8259958071278826,
"totalTokens": 7091,
"averageLatency": 5561,
"correctCount": 394,
"totalCount": 477
}
],
"questions": 159,
"models": [
"gpt-5-nano",
"claude-haiku-4-5",
"gemini-2.5-flash"
],
"datasets": [
{
"name": "tabular",
"description": "Uniform employee records (TOON optimal format)"
},
{
"name": "nested",
"description": "E-commerce orders with nested structures"
},
{
"name": "analytics",
"description": "Time-series analytics data"
},
{
"name": "github",
"description": "Top 100 GitHub repositories"
}
],
"tokenCounts": {
"json-tabular": 6347,
"json-nested": 9694,
"json-analytics": 3665,
"json-github": 15145,
"toon-tabular": 2483,
"toon-nested": 5967,
"toon-analytics": 1515,
"toon-github": 8745,
"csv-tabular": 2337,
"csv-nested": 6735,
"csv-analytics": 1393,
"csv-github": 8513,
"xml-tabular": 7314,
"xml-nested": 10992,
"xml-analytics": 4376,
"xml-github": 17095,
"yaml-tabular": 4969,
"yaml-nested": 7328,
"yaml-analytics": 2938,
"yaml-github": 13129
},
"timestamp": "2025-10-28T07:39:09.360Z"
}
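
The relationship between the `tokenCounts` map and the README's token-savings claim can be checked by hand. The following is a minimal sketch, not part of the benchmark code: it assumes the savings figure is computed from the per-dataset token counts summed per format, and the helper names `total_tokens` and `token_savings` are hypothetical. All numbers are copied from the removed results file above.

```python
# Per-dataset token counts for two formats, copied from "tokenCounts" above.
token_counts = {
    "json": {"tabular": 6347, "nested": 9694, "analytics": 3665, "github": 15145},
    "toon": {"tabular": 2483, "nested": 5967, "analytics": 1515, "github": 8745},
}


def total_tokens(fmt: str) -> int:
    """Sum the token counts of one format over all four datasets."""
    return sum(token_counts[fmt].values())


def token_savings(baseline: str, candidate: str) -> float:
    """Fraction of tokens saved by `candidate` relative to `baseline`."""
    return 1 - total_tokens(candidate) / total_tokens(baseline)


# Accuracy is simply correctCount / totalCount from "formatResults".
toon_accuracy = 413 / 477

print(f"json tokens: {total_tokens('json')}")
print(f"toon tokens: {total_tokens('toon')}")
print(f"toon saves {token_savings('json', 'toon'):.1%} vs json")
print(f"toon accuracy: {toon_accuracy:.1%}")
```

Under that assumption the summed counts give the same 46.3% reduction the README quotes, which would also explain why the figure is unchanged by this commit: the datasets' token counts did not change, only the question set and model results.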

@@ -1,31 +1,31 @@
### Retrieval Accuracy
Accuracy across **3 LLMs** on **159 data retrieval questions**:
Accuracy across **3 LLMs** on **154 data retrieval questions**:
```
gpt-5-nano
toon ███████████████████ 99.4% (158/159)
yaml ██████████████████░ 95.0% (151/159)
csv ██████████████████░░ 92.5% (147/159)
json ██████████████████░░ 92.5% (147/159)
xml █████████████████░░ 91.2% (145/159)
claude-haiku-4-5
toon ███████████████░░░░░ 75.5% (120/159)
xml ███████████████░░░░░ 75.5% (120/159)
csv ███████████████░░░░░ 75.5% (120/159)
json ███████████████░░░░░ 75.5% (120/159)
yaml ███████████████░░░░░ 74.2% (118/159)
toon ███████████████████ 96.1% (148/154)
csv ██████████████████░ 90.3% (139/154)
yaml ██████████████████░░ 89.0% (137/154)
json ██████████████████░░ 87.7% (135/154)
xml █████████████████░░ 83.8% (129/154)
gemini-2.5-flash
xml ██████████████████░░ 91.8% (146/159)
csv █████████████████░░ 86.2% (137/159)
toon █████████████████░░░ 84.9% (135/159)
json ████████████████░░░░ 81.8% (130/159)
yaml ███████████████░░░░ 78.6% (125/159)
xml ██████████████████░░ 90.3% (139/154)
csv █████████████████░░ 89.0% (137/154)
toon █████████████████░░░ 87.0% (134/154)
json ████████████████░░░░ 79.2% (122/154)
yaml ███████████████░░░░ 76.0% (117/154)
claude-haiku-4-5-20251001
json ██████████░░░░░░░░░░ 48.7% (75/154)
toon ██████████░░░░░░░░░░ 48.1% (74/154)
xml █████████░░░░░░░░░░░ 47.4% (73/154)
yaml █████████░░░░░░░░░░░ 47.4% (73/154)
csv █████████░░░░░░░░░░░ 45.5% (70/154)
```
**Advantage:** TOON achieves **86.6% accuracy** (vs JSON's 83.2%) while using **46.3% fewer tokens**.
**Advantage:** TOON achieves **77.1% accuracy** (vs JSON's 71.9%) while using **46.3% fewer tokens**.
<details>
<summary><strong>Performance by dataset and model</strong></summary>
@@ -36,41 +36,41 @@ gemini-2.5-flash
| Format | Accuracy | Tokens | Correct/Total |
| ------ | -------- | ------ | ------------- |
| `toon` | 87.4% | 2.483 | 152/174 |
| `csv` | 82.8% | 2.337 | 144/174 |
| `yaml` | 83.9% | 4.969 | 146/174 |
| `json` | 83.9% | 6.347 | 146/174 |
| `xml` | 88.5% | 7.314 | 154/174 |
| `csv` | 74.7% | 2,337 | 112/150 |
| `toon` | 76.7% | 2,483 | 115/150 |
| `yaml` | 70.7% | 4,969 | 106/150 |
| `xml` | 77.3% | 7,314 | 116/150 |
| `json` | 69.3% | 6,347 | 104/150 |
##### E-commerce orders with nested structures
| Format | Accuracy | Tokens | Correct/Total |
| ------ | -------- | ------ | ------------- |
| `toon` | 90.9% | 5.967 | 120/132 |
| `csv` | 93.9% | 6.735 | 124/132 |
| `yaml` | 87.1% | 7.328 | 115/132 |
| `json` | 87.9% | 9.694 | 116/132 |
| `xml` | 93.2% | 10.992 | 123/132 |
| `toon` | 80.0% | 5,967 | 96/120 |
| `csv` | 75.8% | 6,735 | 91/120 |
| `yaml` | 74.2% | 7,328 | 89/120 |
| `json` | 79.2% | 9,694 | 95/120 |
| `xml` | 78.3% | 10,992 | 94/120 |
##### Time-series analytics data
| Format | Accuracy | Tokens | Correct/Total |
| ------ | -------- | ------ | ------------- |
| `csv` | 89.7% | 1.393 | 78/87 |
| `toon` | 88.5% | 1.515 | 77/87 |
| `yaml` | 83.9% | 2.938 | 73/87 |
| `json` | 88.5% | 3.665 | 77/87 |
| `xml` | 85.1% | 4.376 | 74/87 |
| `csv` | 75.5% | 1,393 | 77/102 |
| `toon` | 76.5% | 1,515 | 78/102 |
| `yaml` | 74.5% | 2,938 | 76/102 |
| `json` | 76.5% | 3,665 | 78/102 |
| `xml` | 74.5% | 4,376 | 76/102 |
##### Top 100 GitHub repositories
| Format | Accuracy | Tokens | Correct/Total |
| ------ | -------- | ------ | ------------- |
| `toon` | 76.2% | 8.745 | 64/84 |
| `csv` | 69.0% | 8.513 | 58/84 |
| `yaml` | 71.4% | 13.129 | 60/84 |
| `json` | 69.0% | 15.145 | 58/84 |
| `xml` | 71.4% | 17.095 | 60/84 |
| `toon` | 74.4% | 8,745 | 67/90 |
| `csv` | 73.3% | 8,513 | 66/90 |
| `yaml` | 62.2% | 13,129 | 56/90 |
| `json` | 61.1% | 15,145 | 55/90 |
| `xml` | 61.1% | 17,095 | 55/90 |
#### Performance by Model
@@ -78,31 +78,31 @@ gemini-2.5-flash
| Format | Accuracy | Correct/Total |
| ------ | -------- | ------------- |
| `toon` | 99.4% | 158/159 |
| `yaml` | 95.0% | 151/159 |
| `csv` | 92.5% | 147/159 |
| `json` | 92.5% | 147/159 |
| `xml` | 91.2% | 145/159 |
##### claude-haiku-4-5
| Format | Accuracy | Correct/Total |
| ------ | -------- | ------------- |
| `toon` | 75.5% | 120/159 |
| `xml` | 75.5% | 120/159 |
| `csv` | 75.5% | 120/159 |
| `json` | 75.5% | 120/159 |
| `yaml` | 74.2% | 118/159 |
| `toon` | 96.1% | 148/154 |
| `csv` | 90.3% | 139/154 |
| `yaml` | 89.0% | 137/154 |
| `json` | 87.7% | 135/154 |
| `xml` | 83.8% | 129/154 |
##### gemini-2.5-flash
| Format | Accuracy | Correct/Total |
| ------ | -------- | ------------- |
| `xml` | 91.8% | 146/159 |
| `csv` | 86.2% | 137/159 |
| `toon` | 84.9% | 135/159 |
| `json` | 81.8% | 130/159 |
| `yaml` | 78.6% | 125/159 |
| `xml` | 90.3% | 139/154 |
| `csv` | 89.0% | 137/154 |
| `toon` | 87.0% | 134/154 |
| `json` | 79.2% | 122/154 |
| `yaml` | 76.0% | 117/154 |
##### claude-haiku-4-5-20251001
| Format | Accuracy | Correct/Total |
| ------ | -------- | ------------- |
| `json` | 48.7% | 75/154 |
| `toon` | 48.1% | 74/154 |
| `xml` | 47.4% | 73/154 |
| `yaml` | 47.4% | 73/154 |
| `csv` | 45.5% | 70/154 |
</details>
@@ -124,31 +124,33 @@ Four datasets designed to test different structural patterns:
#### Question Types
159 questions are generated dynamically across three categories:
154 questions are generated dynamically across three categories:
- **Field retrieval (50%)**: Direct value lookups
- **Field retrieval (40%)**: Direct value lookups or values that can be read straight off a record (including booleans and simple counts such as array lengths)
- Example: "What is Alice's salary?" → `75000`
- Example: "How many items are in order ORD-0042?" → `3`
- Example: "What is the customer name for order ORD-0042?" → `John Doe`
- **Aggregation (25%)**: Counting and summation tasks
- **Aggregation (32%)**: Dataset-level totals and averages plus single-condition filters (counts, sums, min/max comparisons)
- Example: "How many employees work in Engineering?" → `17`
- Example: "What is the total revenue across all orders?" → `45123.50`
- Example: "How many employees have salary > 80000?" → `23`
- **Filtering (25%)**: Conditional queries
- **Filtering (28%)**: Multi-condition queries requiring compound logic (AND constraints across fields)
- Example: "How many employees in Sales have salary > 80000?" → `5`
- Example: "How many orders have total > 400?" → `12`
- Example: "How many active employees have more than 10 years of experience?" → `8`
#### Evaluation Process
1. **Format conversion:** Each dataset is converted to all 5 formats (TOON, JSON, YAML, CSV, XML).
1. **Format conversion:** Each dataset is converted to all 5 formats (TOON, CSV, XML, JSON, YAML).
2. **Query LLM**: Each model receives formatted data + question in a prompt and extracts the answer.
4. **Validate with LLM-as-judge**: `gpt-5-nano` validates if the answer is semantically correct (e.g., `50000` = `$50,000`, `Engineering` = `engineering`, `2025-01-01` = `January 1, 2025`).
3. **Validate with LLM-as-judge**: `gpt-5-nano` validates if the answer is semantically correct (e.g., `50000` = `$50,000`, `Engineering` = `engineering`, `2025-01-01` = `January 1, 2025`).
#### Models & Configuration
- **Models tested**: `gpt-5-nano`, `claude-haiku-4-5`, `gemini-2.5-flash`
- **Models tested**: `claude-haiku-4-5-20251001`, `gemini-2.5-flash`, `gpt-5-nano`
- **Token counting**: Using `gpt-tokenizer` with `o200k_base` encoding (GPT-5 tokenizer)
- **Temperature**: 0 (for non-reasoning models)
- **Total evaluations**: 159 questions × 5 formats × 3 models = 2,385 LLM calls
- **Total evaluations**: 154 questions × 5 formats × 3 models = 2,310 LLM calls
</details>
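
The updated headline figures (77.1% for TOON, 71.9% for JSON, 2,310 LLM calls) can be reproduced from the per-model counts in the new rows. This is a hedged sketch rather than the benchmark's own aggregation code: it assumes the headline accuracy pools correct answers over all three models, and that the 148/154 and 135/154 rows belong to `gpt-5-nano` (the flattened diff does not mark this explicitly). The name `pooled_accuracy` is hypothetical.

```python
QUESTIONS = 154  # questions per model and format after this commit
FORMATS = 5      # TOON, CSV, XML, JSON, YAML
MODELS = 3

# Correct answers per model, copied from the "Performance by Model" rows.
# Assumption: the 148 and 135 rows are the gpt-5-nano results.
correct = {
    "toon": {
        "gpt-5-nano": 148,
        "gemini-2.5-flash": 134,
        "claude-haiku-4-5-20251001": 74,
    },
    "json": {
        "gpt-5-nano": 135,
        "gemini-2.5-flash": 122,
        "claude-haiku-4-5-20251001": 75,
    },
}


def pooled_accuracy(fmt: str) -> float:
    """Correct answers over all models divided by all answers for one format."""
    per_model = correct[fmt]
    return sum(per_model.values()) / (QUESTIONS * len(per_model))


total_calls = QUESTIONS * FORMATS * MODELS

print(f"toon: {pooled_accuracy('toon'):.1%}")
print(f"json: {pooled_accuracy('json'):.1%}")
print(f"total LLM calls: {total_calls}")
```

The same pooled total for TOON (356 of 462) also falls out of the four per-dataset rows (115/150, 96/120, 78/102, 67/90), so the dataset and model breakdowns are consistent with each other.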

@@ -39,11 +39,11 @@ Total ████████████░░░░░
"repo": "freeCodeCamp/freeCodeCamp",
"description": "freeCodeCamp.org's open-source codebase and curriculum. Learn math, programming,…",
"createdAt": "2014-12-24T17:49:19Z",
"updatedAt": "2025-10-27T07:40:58Z",
"pushedAt": "2025-10-26T11:31:08Z",
"stars": 430828,
"watchers": 8582,
"forks": 42136,
"updatedAt": "2025-10-28T11:58:08Z",
"pushedAt": "2025-10-28T10:17:16Z",
"stars": 430886,
"watchers": 8583,
"forks": 42146,
"defaultBranch": "main"
},
{
@@ -52,11 +52,11 @@ Total ████████████░░░░░
"repo": "codecrafters-io/build-your-own-x",
"description": "Master programming by recreating your favorite technologies from scratch.",
"createdAt": "2018-05-09T12:03:18Z",
"updatedAt": "2025-10-27T07:43:25Z",
"updatedAt": "2025-10-28T12:37:11Z",
"pushedAt": "2025-10-10T18:45:01Z",
"stars": 430102,
"watchers": 6322,
"forks": 40388,
"stars": 430877,
"watchers": 6332,
"forks": 40453,
"defaultBranch": "master"
},
{
@@ -65,11 +65,11 @@ Total ████████████░░░░░
"repo": "sindresorhus/awesome",
"description": "😎 Awesome lists about all kinds of interesting topics",
"createdAt": "2014-07-11T13:42:37Z",
"updatedAt": "2025-10-27T07:44:27Z",
"pushedAt": "2025-10-23T17:26:53Z",
"stars": 409760,
"watchers": 8016,
"forks": 32015,
"updatedAt": "2025-10-28T12:40:21Z",
"pushedAt": "2025-10-27T17:57:31Z",
"stars": 410052,
"watchers": 8017,
"forks": 32029,
"defaultBranch": "main"
}
]
@@ -80,9 +80,9 @@ Total ████████████░░░░░
```
repositories[3]{id,name,repo,description,createdAt,updatedAt,pushedAt,stars,watchers,forks,defaultBranch}:
28457823,freeCodeCamp,freeCodeCamp/freeCodeCamp,"freeCodeCamp.org's open-source codebase and curriculum. Learn math, programming,…","2014-12-24T17:49:19Z","2025-10-27T07:40:58Z","2025-10-26T11:31:08Z",430828,8582,42136,main
132750724,build-your-own-x,codecrafters-io/build-your-own-x,Master programming by recreating your favorite technologies from scratch.,"2018-05-09T12:03:18Z","2025-10-27T07:43:25Z","2025-10-10T18:45:01Z",430102,6322,40388,master
21737465,awesome,sindresorhus/awesome,😎 Awesome lists about all kinds of interesting topics,"2014-07-11T13:42:37Z","2025-10-27T07:44:27Z","2025-10-23T17:26:53Z",409760,8016,32015,main
28457823,freeCodeCamp,freeCodeCamp/freeCodeCamp,"freeCodeCamp.org's open-source codebase and curriculum. Learn math, programming,…","2014-12-24T17:49:19Z","2025-10-28T11:58:08Z","2025-10-28T10:17:16Z",430886,8583,42146,main
132750724,build-your-own-x,codecrafters-io/build-your-own-x,Master programming by recreating your favorite technologies from scratch.,"2018-05-09T12:03:18Z","2025-10-28T12:37:11Z","2025-10-10T18:45:01Z",430877,6332,40453,master
21737465,awesome,sindresorhus/awesome,😎 Awesome lists about all kinds of interesting topics,"2014-07-11T13:42:37Z","2025-10-28T12:40:21Z","2025-10-27T17:57:31Z",410052,8017,32029,main
```
---