docs: overhaul retrieval accuracy benchmark

2026-01-29 15:24:10 +08:00 · 2025-10-28 20:22:43 +01:00
parent efbe4ded88
commit 67c0df8cb0
22 changed files with 1553 additions and 27288 deletions
--- a/README.md
+++ b/README.md
@@ -87,11 +87,11 @@ Total                        ████████████░░░░░
      "repo": "freeCodeCamp/freeCodeCamp",
      "description": "freeCodeCamp.org's open-source codebase and curriculum. Learn math, programming,…",
      "createdAt": "2014-12-24T17:49:19Z",
-      "updatedAt": "2025-10-27T07:40:58Z",
-      "pushedAt": "2025-10-26T11:31:08Z",
-      "stars": 430828,
-      "watchers": 8582,
-      "forks": 42136,
+      "updatedAt": "2025-10-28T11:58:08Z",
+      "pushedAt": "2025-10-28T10:17:16Z",
+      "stars": 430886,
+      "watchers": 8583,
+      "forks": 42146,
      "defaultBranch": "main"
    },
    {
@@ -100,11 +100,11 @@ Total                        ████████████░░░░░
      "repo": "codecrafters-io/build-your-own-x",
      "description": "Master programming by recreating your favorite technologies from scratch.",
      "createdAt": "2018-05-09T12:03:18Z",
-      "updatedAt": "2025-10-27T07:43:25Z",
+      "updatedAt": "2025-10-28T12:37:11Z",
      "pushedAt": "2025-10-10T18:45:01Z",
-      "stars": 430102,
-      "watchers": 6322,
-      "forks": 40388,
+      "stars": 430877,
+      "watchers": 6332,
+      "forks": 40453,
      "defaultBranch": "master"
    },
    {
@@ -113,11 +113,11 @@ Total                        ████████████░░░░░
      "repo": "sindresorhus/awesome",
      "description": "😎 Awesome lists about all kinds of interesting topics",
      "createdAt": "2014-07-11T13:42:37Z",
-      "updatedAt": "2025-10-27T07:44:27Z",
-      "pushedAt": "2025-10-23T17:26:53Z",
-      "stars": 409760,
-      "watchers": 8016,
-      "forks": 32015,
+      "updatedAt": "2025-10-28T12:40:21Z",
+      "pushedAt": "2025-10-27T17:57:31Z",
+      "stars": 410052,
+      "watchers": 8017,
+      "forks": 32029,
      "defaultBranch": "main"
    }
  ]
@@ -128,9 +128,9 @@ Total                        ████████████░░░░░

 ```
 repositories[3]{id,name,repo,description,createdAt,updatedAt,pushedAt,stars,watchers,forks,defaultBranch}:
-  28457823,freeCodeCamp,freeCodeCamp/freeCodeCamp,"freeCodeCamp.org's open-source codebase and curriculum. Learn math, programming,…","2014-12-24T17:49:19Z","2025-10-27T07:40:58Z","2025-10-26T11:31:08Z",430828,8582,42136,main
-  132750724,build-your-own-x,codecrafters-io/build-your-own-x,Master programming by recreating your favorite technologies from scratch.,"2018-05-09T12:03:18Z","2025-10-27T07:43:25Z","2025-10-10T18:45:01Z",430102,6322,40388,master
-  21737465,awesome,sindresorhus/awesome,😎 Awesome lists about all kinds of interesting topics,"2014-07-11T13:42:37Z","2025-10-27T07:44:27Z","2025-10-23T17:26:53Z",409760,8016,32015,main
+  28457823,freeCodeCamp,freeCodeCamp/freeCodeCamp,"freeCodeCamp.org's open-source codebase and curriculum. Learn math, programming,…","2014-12-24T17:49:19Z","2025-10-28T11:58:08Z","2025-10-28T10:17:16Z",430886,8583,42146,main
+  132750724,build-your-own-x,codecrafters-io/build-your-own-x,Master programming by recreating your favorite technologies from scratch.,"2018-05-09T12:03:18Z","2025-10-28T12:37:11Z","2025-10-10T18:45:01Z",430877,6332,40453,master
+  21737465,awesome,sindresorhus/awesome,😎 Awesome lists about all kinds of interesting topics,"2014-07-11T13:42:37Z","2025-10-28T12:40:21Z","2025-10-27T17:57:31Z",410052,8017,32029,main
 ```

 ---
@@ -208,36 +208,36 @@ metrics[5]{date,views,clicks,conversions,revenue,bounceRate}:
 > [!NOTE]
 > Measured with [`gpt-tokenizer`](https://github.com/niieani/gpt-tokenizer) using `o200k_base` encoding (used by GPT-5 and other modern models). Savings will vary across models and tokenizers.

-<!-- automd:file src="./benchmarks/results/accuracy/report.md" -->
+<!-- automd:file src="./benchmarks/results/retrieval-accuracy.md" -->

 ### Retrieval Accuracy

-Accuracy across **3 LLMs** on **159 data retrieval questions**:
+Accuracy across **3 LLMs** on **154 data retrieval questions**:

 ```
-gpt-5-nano
-  toon         ████████████████████  99.4% (158/159)
-  yaml         ███████████████████░  95.0% (151/159)
-  csv          ██████████████████░░  92.5% (147/159)
-  json         ██████████████████░░  92.5% (147/159)
-  xml          ██████████████████░░  91.2% (145/159)
-
-claude-haiku-4-5
-  toon         ███████████████░░░░░  75.5% (120/159)
-  xml          ███████████████░░░░░  75.5% (120/159)
-  csv          ███████████████░░░░░  75.5% (120/159)
-  json         ███████████████░░░░░  75.5% (120/159)
-  yaml         ███████████████░░░░░  74.2% (118/159)
-
 gemini-2.5-flash
-  xml          ██████████████████░░  91.8% (146/159)
-  csv          █████████████████░░░  86.2% (137/159)
-  toon         █████████████████░░░  84.9% (135/159)
-  json         ████████████████░░░░  81.8% (130/159)
-  yaml         ████████████████░░░░  78.6% (125/159)
+  xml          ██████████████████░░  90.3% (139/154)
+  csv          ██████████████████░░  89.0% (137/154)
+  toon         █████████████████░░░  87.0% (134/154)
+  json         ████████████████░░░░  79.2% (122/154)
+  yaml         ███████████████░░░░░  76.0% (117/154)
+
+gpt-5-nano
+  toon         ███████████████████░  96.1% (148/154)
+  csv          ██████████████████░░  90.3% (139/154)
+  yaml         ██████████████████░░  89.0% (137/154)
+  json         ██████████████████░░  87.7% (135/154)
+  xml          █████████████████░░░  83.8% (129/154)
+
+claude-haiku-4-5-20251001
+  json         ██████████░░░░░░░░░░  48.7% (75/154)
+  toon         ██████████░░░░░░░░░░  48.1% (74/154)
+  xml          █████████░░░░░░░░░░░  47.4% (73/154)
+  yaml         █████████░░░░░░░░░░░  47.4% (73/154)
+  csv          █████████░░░░░░░░░░░  45.5% (70/154)
 ```

-**Advantage:** TOON achieves **86.6% accuracy** (vs JSON's 83.2%) while using **46.3% fewer tokens**.
+**Advantage:** TOON achieves **77.1% accuracy** (vs JSON's 71.9%) while using **46.3% fewer tokens**.

 <details>
 <summary><strong>Performance by dataset and model</strong></summary>
@@ -248,73 +248,73 @@ gemini-2.5-flash

 | Format | Accuracy | Tokens | Correct/Total |
 | ------ | -------- | ------ | ------------- |
-| `toon` | 87.4% | 2.483 | 152/174 |
-| `csv` | 82.8% | 2.337 | 144/174 |
-| `yaml` | 83.9% | 4.969 | 146/174 |
-| `json` | 83.9% | 6.347 | 146/174 |
-| `xml` | 88.5% | 7.314 | 154/174 |
+| `csv` | 74.7% | 2,337 | 112/150 |
+| `toon` | 76.7% | 2,483 | 115/150 |
+| `yaml` | 70.7% | 4,969 | 106/150 |
+| `xml` | 77.3% | 7,314 | 116/150 |
+| `json` | 69.3% | 6,347 | 104/150 |

 ##### E-commerce orders with nested structures

 | Format | Accuracy | Tokens | Correct/Total |
 | ------ | -------- | ------ | ------------- |
-| `toon` | 90.9% | 5.967 | 120/132 |
-| `csv` | 93.9% | 6.735 | 124/132 |
-| `yaml` | 87.1% | 7.328 | 115/132 |
-| `json` | 87.9% | 9.694 | 116/132 |
-| `xml` | 93.2% | 10.992 | 123/132 |
+| `toon` | 80.0% | 5,967 | 96/120 |
+| `csv` | 75.8% | 6,735 | 91/120 |
+| `yaml` | 74.2% | 7,328 | 89/120 |
+| `json` | 79.2% | 9,694 | 95/120 |
+| `xml` | 78.3% | 10,992 | 94/120 |

 ##### Time-series analytics data

 | Format | Accuracy | Tokens | Correct/Total |
 | ------ | -------- | ------ | ------------- |
-| `csv` | 89.7% | 1.393 | 78/87 |
-| `toon` | 88.5% | 1.515 | 77/87 |
-| `yaml` | 83.9% | 2.938 | 73/87 |
-| `json` | 88.5% | 3.665 | 77/87 |
-| `xml` | 85.1% | 4.376 | 74/87 |
+| `csv` | 75.5% | 1,393 | 77/102 |
+| `toon` | 76.5% | 1,515 | 78/102 |
+| `yaml` | 74.5% | 2,938 | 76/102 |
+| `json` | 76.5% | 3,665 | 78/102 |
+| `xml` | 74.5% | 4,376 | 76/102 |

 ##### Top 100 GitHub repositories

 | Format | Accuracy | Tokens | Correct/Total |
 | ------ | -------- | ------ | ------------- |
-| `toon` | 76.2% | 8.745 | 64/84 |
-| `csv` | 69.0% | 8.513 | 58/84 |
-| `yaml` | 71.4% | 13.129 | 60/84 |
-| `json` | 69.0% | 15.145 | 58/84 |
-| `xml` | 71.4% | 17.095 | 60/84 |
+| `toon` | 74.4% | 8,745 | 67/90 |
+| `csv` | 73.3% | 8,513 | 66/90 |
+| `yaml` | 62.2% | 13,129 | 56/90 |
+| `json` | 61.1% | 15,145 | 55/90 |
+| `xml` | 61.1% | 17,095 | 55/90 |

 #### Performance by Model

-##### gpt-5-nano
-
-| Format | Accuracy | Correct/Total |
-| ------ | -------- | ------------- |
-| `toon` | 99.4% | 158/159 |
-| `yaml` | 95.0% | 151/159 |
-| `csv` | 92.5% | 147/159 |
-| `json` | 92.5% | 147/159 |
-| `xml` | 91.2% | 145/159 |
-
-##### claude-haiku-4-5
-
-| Format | Accuracy | Correct/Total |
-| ------ | -------- | ------------- |
-| `toon` | 75.5% | 120/159 |
-| `xml` | 75.5% | 120/159 |
-| `csv` | 75.5% | 120/159 |
-| `json` | 75.5% | 120/159 |
-| `yaml` | 74.2% | 118/159 |
-
 ##### gemini-2.5-flash

 | Format | Accuracy | Correct/Total |
 | ------ | -------- | ------------- |
-| `xml` | 91.8% | 146/159 |
-| `csv` | 86.2% | 137/159 |
-| `toon` | 84.9% | 135/159 |
-| `json` | 81.8% | 130/159 |
-| `yaml` | 78.6% | 125/159 |
+| `xml` | 90.3% | 139/154 |
+| `csv` | 89.0% | 137/154 |
+| `toon` | 87.0% | 134/154 |
+| `json` | 79.2% | 122/154 |
+| `yaml` | 76.0% | 117/154 |
+
+##### gpt-5-nano
+
+| Format | Accuracy | Correct/Total |
+| ------ | -------- | ------------- |
+| `toon` | 96.1% | 148/154 |
+| `csv` | 90.3% | 139/154 |
+| `yaml` | 89.0% | 137/154 |
+| `json` | 87.7% | 135/154 |
+| `xml` | 83.8% | 129/154 |
+
+##### claude-haiku-4-5-20251001
+
+| Format | Accuracy | Correct/Total |
+| ------ | -------- | ------------- |
+| `json` | 48.7% | 75/154 |
+| `toon` | 48.1% | 74/154 |
+| `xml` | 47.4% | 73/154 |
+| `yaml` | 47.4% | 73/154 |
+| `csv` | 45.5% | 70/154 |

 </details>

@@ -336,32 +336,34 @@ Four datasets designed to test different structural patterns:

 #### Question Types

-159 questions are generated dynamically across three categories:
+154 questions are generated dynamically across three categories:

- **Field retrieval (50%)**: Direct value lookups
+- **Field retrieval (40%)**: Direct lookups of values that can be read straight off a record (including booleans and simple counts such as array lengths)
  - Example: "What is Alice's salary?" → `75000`
+  - Example: "How many items are in order ORD-0042?" → `3`
  - Example: "What is the customer name for order ORD-0042?" → `John Doe`

- **Aggregation (25%)**: Counting and summation tasks
+- **Aggregation (32%)**: Dataset-level totals and averages, plus single-condition filters (counts, sums, min/max comparisons)
  - Example: "How many employees work in Engineering?" → `17`
  - Example: "What is the total revenue across all orders?" → `45123.50`
+  - Example: "How many employees have salary > 80000?" → `23`

- **Filtering (25%)**: Conditional queries
+- **Filtering (28%)**: Multi-condition queries requiring compound logic (AND constraints across fields)
  - Example: "How many employees in Sales have salary > 80000?" → `5`
-  - Example: "How many orders have total > 400?" → `12`
+  - Example: "How many active employees have more than 10 years of experience?" → `8`

 #### Evaluation Process

-1. **Format conversion:** Each dataset is converted to all 5 formats (TOON, JSON, YAML, CSV, XML).
+1. **Format conversion:** Each dataset is converted to all 5 formats (TOON, CSV, XML, JSON, YAML).
 2. **Query LLM**: Each model receives formatted data + question in a prompt and extracts the answer.
-4. **Validate with LLM-as-judge**: `gpt-5-nano` validates if the answer is semantically correct (e.g., `50000` = `$50,000`, `Engineering` = `engineering`, `2025-01-01` = `January 1, 2025`).
+3. **Validate with LLM-as-judge**: `gpt-5-nano` checks whether the answer is semantically correct (e.g., `50000` = `$50,000`, `Engineering` = `engineering`, `2025-01-01` = `January 1, 2025`).

 #### Models & Configuration

- **Models tested**: `gpt-5-nano`, `claude-haiku-4-5`, `gemini-2.5-flash`
+- **Models tested**: `gemini-2.5-flash`, `gpt-5-nano`, `claude-haiku-4-5-20251001`
 - **Token counting**: Using `gpt-tokenizer` with `o200k_base` encoding (GPT-5 tokenizer)
 - **Temperature**: 0 (for non-reasoning models)
- **Total evaluations**: 159 questions × 5 formats × 3 models = 2,385 LLM calls
+- **Total evaluations**: 154 questions × 5 formats × 3 models = 2,310 LLM calls

 </details>
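
The headline figures in this diff can be re-derived from the per-model tables. The following is a minimal Python sketch, not code from the repository: the names `CORRECT`, `pooled_accuracy`, and `total_calls` are hypothetical, and it assumes the headline accuracy pools correct answers over all three models (which, with an equal 154 questions per model, is the same as averaging the three per-model percentages).

```python
# Counts copied from the "Performance by Model" tables in the diff above.
QUESTIONS = 154
FORMATS = ("toon", "csv", "xml", "json", "yaml")

CORRECT = {
    "gemini-2.5-flash": {"xml": 139, "csv": 137, "toon": 134, "json": 122, "yaml": 117},
    "gpt-5-nano": {"toon": 148, "csv": 139, "yaml": 137, "json": 135, "xml": 129},
    "claude-haiku-4-5-20251001": {"json": 75, "toon": 74, "xml": 73, "yaml": 73, "csv": 70},
}


def pooled_accuracy(fmt: str) -> float:
    """Percentage of correct answers for one format, pooled over all models."""
    correct = sum(per_format[fmt] for per_format in CORRECT.values())
    total = QUESTIONS * len(CORRECT)
    return 100 * correct / total


def total_calls() -> int:
    """Questions x formats x models, as in the 'Total evaluations' bullet."""
    return QUESTIONS * len(FORMATS) * len(CORRECT)


if __name__ == "__main__":
    for fmt in FORMATS:
        print(f"{fmt:5s} {pooled_accuracy(fmt):.1f}%")
    print("LLM calls:", total_calls())
```

Under that assumption the sketch reproduces the figures in the diff: 77.1% for `toon`, 71.9% for `json`, and 2,310 LLM calls.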