chore(benchmarks): replace LLM-as-judge with deterministic validation, add structural validation datasets

2026-01-29 15:24:10 +08:00 · 2025-11-07 21:28:21 +01:00
parent 9a519dd114
commit acca69c64a
25 changed files with 1311 additions and 396 deletions
--- a/README.md
+++ b/README.md
@@ -75,7 +75,7 @@ See [benchmarks](#benchmarks) for concrete comparisons across different data str

 ## Key Features

- 💸 **Token-efficient:** typically 30–60% fewer tokens than JSON[^1]
+- 💸 **Token-efficient:** typically 30–60% fewer tokens on large uniform arrays vs formatted JSON[^1]
 - 🤿 **LLM-friendly guardrails:** explicit lengths and fields enable validation
 - 🍱 **Minimal syntax:** removes redundant punctuation (braces, brackets, most quotes)
 - 📐 **Indentation-based structure:** like YAML, uses whitespace instead of braces
@@ -108,19 +108,19 @@ Datasets with nested or semi-uniform structures. CSV excluded as it cannot prope
 ```
 🛒 E-commerce orders with nested structures  ┊  Tabular: 33%
   │
-   TOON                █████████████░░░░░░░    72,743 tokens
-   ├─ vs JSON          (−33.1%)               108,731 tokens
-   ├─ vs JSON compact  (+5.5%)                 68,936 tokens
-   ├─ vs YAML          (−14.1%)                84,724 tokens
-   └─ vs XML           (−40.5%)               122,313 tokens
+   TOON                █████████████░░░░░░░    72,771 tokens
+   ├─ vs JSON          (−33.1%)               108,806 tokens
+   ├─ vs JSON compact  (+5.5%)                 68,975 tokens
+   ├─ vs YAML          (−14.2%)                84,780 tokens
+   └─ vs XML           (−40.5%)               122,406 tokens

 🧾 Semi-uniform event logs  ┊  Tabular: 50%
   │
-   TOON                █████████████████░░░   153,223 tokens
-   ├─ vs JSON          (−15.0%)               180,196 tokens
-   ├─ vs JSON compact  (+19.9%)               127,740 tokens
-   ├─ vs YAML          (−0.8%)                154,514 tokens
-   └─ vs XML           (−25.2%)               204,800 tokens
+   TOON                █████████████████░░░   153,211 tokens
+   ├─ vs JSON          (−15.0%)               180,176 tokens
+   ├─ vs JSON compact  (+19.9%)               127,731 tokens
+   ├─ vs YAML          (−0.8%)                154,505 tokens
+   └─ vs XML           (−25.2%)               204,777 tokens

 🧩 Deeply nested configuration  ┊  Tabular: 0%
   │
@@ -131,11 +131,11 @@ Datasets with nested or semi-uniform structures. CSV excluded as it cannot prope
   └─ vs XML           (−37.4%)                 1,008 tokens

 ──────────────────────────────────── Total ────────────────────────────────────
-   TOON                ████████████████░░░░   226,597 tokens
-   ├─ vs JSON          (−21.8%)               289,846 tokens
-   ├─ vs JSON compact  (+14.9%)               197,240 tokens
-   ├─ vs YAML          (−5.5%)                239,911 tokens
-   └─ vs XML           (−30.9%)               328,121 tokens
+   TOON                ████████████████░░░░   226,613 tokens
+   ├─ vs JSON          (−21.8%)               289,901 tokens
+   ├─ vs JSON compact  (+14.9%)               197,270 tokens
+   ├─ vs YAML          (−5.6%)                239,958 tokens
+   └─ vs XML           (−31.0%)               328,191 tokens
 ```
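
The percentages in these charts appear to be relative differences in token count, with TOON as the numerator and each alternative format as the baseline. A minimal Python sketch under that assumption, using the totals reported for this track (`relative_diff` is an illustrative name, not part of the benchmark code):

```python
def relative_diff(toon_tokens: int, other_tokens: int) -> float:
    """Percent change in tokens when using TOON instead of another format.

    Negative means TOON is smaller; positive means TOON is larger.
    """
    return (toon_tokens / other_tokens - 1) * 100


# Totals copied from the chart above.
TOON_TOTAL = 226_613
OTHER_TOTALS = {
    "JSON": 289_901,
    "JSON compact": 197_270,
    "YAML": 239_958,
    "XML": 328_191,
}

for name, tokens in OTHER_TOTALS.items():
    print(f"vs {name:<13} {relative_diff(TOON_TOTAL, tokens):+.1f}%")
```

Rounded to one decimal, these values line up with the figures shown in the chart.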

 #### Flat-Only Track
@@ -145,21 +145,21 @@ Datasets with flat tabular structures where CSV is applicable.
 ```
 👥 Uniform employee records  ┊  Tabular: 100%
   │
-   CSV                 ███████████████████░    46,956 tokens
-   TOON                ████████████████████    49,827 tokens   (+6.1% vs CSV)
-   ├─ vs JSON          (−60.7%)               126,854 tokens
-   ├─ vs JSON compact  (−36.8%)                78,850 tokens
-   ├─ vs YAML          (−50.0%)                99,701 tokens
-   └─ vs XML           (−66.0%)               146,440 tokens
+   CSV                 ███████████████████░    46,954 tokens
+   TOON                ████████████████████    49,831 tokens   (+6.1% vs CSV)
+   ├─ vs JSON          (−60.7%)               126,860 tokens
+   ├─ vs JSON compact  (−36.8%)                78,856 tokens
+   ├─ vs YAML          (−50.0%)                99,706 tokens
+   └─ vs XML           (−66.0%)               146,444 tokens

 📈 Time-series analytics data  ┊  Tabular: 100%
   │
-   CSV                 ██████████████████░░     8,396 tokens
-   TOON                ████████████████████     9,128 tokens   (+8.7% vs CSV)
-   ├─ vs JSON          (−59.0%)                22,258 tokens
-   ├─ vs JSON compact  (−35.8%)                14,224 tokens
-   ├─ vs YAML          (−48.9%)                17,871 tokens
-   └─ vs XML           (−65.7%)                26,629 tokens
+   CSV                 ██████████████████░░     8,388 tokens
+   TOON                ████████████████████     9,120 tokens   (+8.7% vs CSV)
+   ├─ vs JSON          (−59.0%)                22,250 tokens
+   ├─ vs JSON compact  (−35.8%)                14,216 tokens
+   ├─ vs YAML          (−48.9%)                17,863 tokens
+   └─ vs XML           (−65.7%)                26,621 tokens

 ⭐ Top 100 GitHub repositories  ┊  Tabular: 100%
   │
@@ -171,12 +171,12 @@ Datasets with flat tabular structures where CSV is applicable.
   └─ vs XML           (−48.8%)                17,095 tokens

 ──────────────────────────────────── Total ────────────────────────────────────
-   CSV                 ███████████████████░    63,865 tokens
-   TOON                ████████████████████    67,700 tokens   (+6.0% vs CSV)
-   ├─ vs JSON          (−58.8%)               164,257 tokens
-   ├─ vs JSON compact  (−35.2%)               104,529 tokens
-   ├─ vs YAML          (−48.2%)               130,701 tokens
-   └─ vs XML           (−64.4%)               190,164 tokens
+   CSV                 ███████████████████░    63,855 tokens
+   TOON                ████████████████████    67,696 tokens   (+6.0% vs CSV)
+   ├─ vs JSON          (−58.8%)               164,255 tokens
+   ├─ vs JSON compact  (−35.2%)               104,527 tokens
+   ├─ vs YAML          (−48.2%)               130,698 tokens
+   └─ vs XML           (−64.4%)               190,160 tokens
 ```

 <details>
@@ -186,64 +186,64 @@ Datasets with flat tabular structures where CSV is applicable.

 **Savings:** 13,130 tokens (59.0% reduction vs JSON)

-**JSON** (22,258 tokens):
+**JSON** (22,250 tokens):

 ```json
 {
  "metrics": [
    {
      "date": "2025-01-01",
-      "views": 7708,
-      "clicks": 595,
-      "conversions": 69,
-      "revenue": 15369.93,
-      "bounceRate": 0.35
+      "views": 5715,
+      "clicks": 211,
+      "conversions": 28,
+      "revenue": 7976.46,
+      "bounceRate": 0.47
    },
    {
      "date": "2025-01-02",
-      "views": 5894,
-      "clicks": 381,
-      "conversions": 21,
-      "revenue": 2112.12,
-      "bounceRate": 0.3
+      "views": 7103,
+      "clicks": 393,
+      "conversions": 28,
+      "revenue": 8360.53,
+      "bounceRate": 0.32
    },
    {
      "date": "2025-01-03",
-      "views": 6835,
-      "clicks": 422,
-      "conversions": 35,
-      "revenue": 4525.73,
+      "views": 7248,
+      "clicks": 378,
+      "conversions": 24,
+      "revenue": 3212.57,
      "bounceRate": 0.5
    },
    {
      "date": "2025-01-04",
-      "views": 5325,
-      "clicks": 305,
-      "conversions": 22,
-      "revenue": 2445.3,
-      "bounceRate": 0.44
+      "views": 2927,
+      "clicks": 77,
+      "conversions": 11,
+      "revenue": 1211.69,
+      "bounceRate": 0.62
    },
    {
      "date": "2025-01-05",
-      "views": 2974,
-      "clicks": 61,
-      "conversions": 6,
-      "revenue": 956.57,
-      "bounceRate": 0.47
+      "views": 3530,
+      "clicks": 82,
+      "conversions": 8,
+      "revenue": 462.77,
+      "bounceRate": 0.56
    }
  ]
 }
 ```

-**TOON** (9,128 tokens):
+**TOON** (9,120 tokens):

 ```
 metrics[5]{date,views,clicks,conversions,revenue,bounceRate}:
-  2025-01-01,7708,595,69,15369.93,0.35
-  2025-01-02,5894,381,21,2112.12,0.3
-  2025-01-03,6835,422,35,4525.73,0.5
-  2025-01-04,5325,305,22,2445.3,0.44
-  2025-01-05,2974,61,6,956.57,0.47
+  2025-01-01,5715,211,28,7976.46,0.47
+  2025-01-02,7103,393,28,8360.53,0.32
+  2025-01-03,7248,378,24,3212.57,0.5
+  2025-01-04,2927,77,11,1211.69,0.62
+  2025-01-05,3530,82,8,462.77,0.56
 ```
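
To make the mapping between the two listings concrete, here is a simplified Python sketch that emits the tabular form for a uniform array. It is an illustration only, not the library's encoder: `encode_tabular` and `format_value` are hypothetical helpers, and the sketch skips the quoting and escaping that a real implementation needs for values containing delimiters.

```python
def format_value(value) -> str:
    """Render a primitive the way the listing above shows it."""
    if value is None:
        return "null"
    if isinstance(value, bool):  # check bool before numbers: True is an int in Python
        return "true" if value else "false"
    return str(value)


def encode_tabular(key: str, rows: list[dict]) -> str:
    """Encode a non-empty list of same-keyed dicts as one tabular block."""
    fields = list(rows[0])
    if any(set(row) != set(fields) for row in rows):
        raise ValueError("rows must share the same field set")
    field_list = ",".join(fields)
    # Header carries the declared row count and the field names.
    header = f"{key}[{len(rows)}]{{{field_list}}}:"
    lines = [
        "  " + ",".join(format_value(row[field]) for field in fields)
        for row in rows
    ]
    return "\n".join([header, *lines])


# First two rows of the sample above.
metrics = [
    {"date": "2025-01-01", "views": 5715, "clicks": 211,
     "conversions": 28, "revenue": 7976.46, "bounceRate": 0.47},
    {"date": "2025-01-02", "views": 7103, "clicks": 393,
     "conversions": 28, "revenue": 8360.53, "bounceRate": 0.32},
]
print(encode_tabular("metrics", metrics))
```

The field names are written once in the header instead of once per row, which is where most of the savings on uniform arrays come from.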

 ---
@@ -317,7 +317,7 @@ repositories[3]{id,name,repo,description,createdAt,updatedAt,pushedAt,stars,watc

 <!-- automd:file src="./benchmarks/results/retrieval-accuracy.md" -->

-Benchmarks test LLM comprehension across different input formats using 204 data retrieval questions on 4 models.
+Benchmarks test LLM comprehension across different input formats using 209 data retrieval questions on 4 models.

 <details>
 <summary><strong>Show Dataset Catalog</strong></summary>
@@ -332,6 +332,11 @@ Benchmarks test LLM comprehension across different input formats using 204 data
 | Top 100 GitHub repositories | 100 | uniform | ✓ | 100% |
 | Semi-uniform event logs | 75 | semi-uniform | ✗ | 50% |
 | Deeply nested configuration | 11 | deep | ✗ | 0% |
+| Valid complete dataset (control) | 20 | uniform | ✓ | 100% |
+| Array truncated: 3 rows removed from end | 17 | uniform | ✓ | 100% |
+| Extra rows added beyond declared length | 23 | uniform | ✓ | 100% |
+| Inconsistent field count (missing salary in row 10) | 20 | uniform | ✓ | 100% |
+| Missing required fields (no email in multiple rows) | 20 | uniform | ✓ | 100% |

 **Structure classes:**
 - **uniform**: All objects have identical fields with primitive values
@@ -350,67 +355,69 @@ Benchmarks test LLM comprehension across different input formats using 204 data
 Each format's overall performance, balancing accuracy against token cost:

 ```
-TOON           ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓   17.2  │  75.5% acc  │  4,389 tokens
-CSV            ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓░   16.6  │  67.8% acc  │  4,080 tokens
-JSON compact   ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓░░░   14.7  │  73.3% acc  │  4,982 tokens
-YAML           ▓▓▓▓▓▓▓▓▓▓▓▓▓▓░░░░░░   12.1  │  72.4% acc  │  5,976 tokens
-JSON           ▓▓▓▓▓▓▓▓▓▓▓▓░░░░░░░░   10.0  │  72.4% acc  │  7,260 tokens
-XML            ▓▓▓▓▓▓▓▓▓▓░░░░░░░░░░    8.4  │  69.0% acc  │  8,251 tokens
+TOON           ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓   26.9  │  73.9% acc  │  2,744 tokens
+JSON compact   ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓░░░   22.9  │  70.7% acc  │  3,081 tokens
+YAML           ▓▓▓▓▓▓▓▓▓▓▓▓▓▓░░░░░░   18.6  │  69.0% acc  │  3,719 tokens
+JSON           ▓▓▓▓▓▓▓▓▓▓▓░░░░░░░░░   15.3  │  69.7% acc  │  4,545 tokens
+XML            ▓▓▓▓▓▓▓▓▓▓░░░░░░░░░░   13.0  │  67.1% acc  │  5,167 tokens
 ```

-TOON achieves **75.5%** accuracy (vs JSON's 72.4%) while using **39.5% fewer tokens**.
+TOON achieves **73.9%** accuracy (vs JSON's 69.7%) while using **39.6% fewer tokens**.
+
+**Note on CSV:** Excluded from the ranking because it supports only 109 of the 209 questions (flat tabular data only). CSV is highly token-efficient for simple tabular data, but it cannot represent the nested structures that the other formats handle.
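
The score column is consistent with accuracy (in percent) divided by average tokens, scaled per 1,000 tokens. That reading is inferred from the figures above rather than taken from the benchmark source, so treat the following Python sketch as a sanity check and not as the benchmark's definition:

```python
# (accuracy %, average tokens) copied from the ranking above.
RANKING = {
    "TOON": (73.9, 2_744),
    "JSON compact": (70.7, 3_081),
    "YAML": (69.0, 3_719),
    "JSON": (69.7, 4_545),
    "XML": (67.1, 5_167),
}


def efficiency(accuracy_pct: float, tokens: int) -> float:
    """Accuracy points per 1,000 tokens (assumed definition of the score)."""
    return accuracy_pct / tokens * 1000


def token_savings(tokens: int, baseline_tokens: int) -> float:
    """Percent fewer tokens than the baseline format."""
    return (1 - tokens / baseline_tokens) * 100


for name, (accuracy, tokens) in RANKING.items():
    print(f"{name:<13} {efficiency(accuracy, tokens):5.1f}")

toon_tokens = RANKING["TOON"][1]
json_tokens = RANKING["JSON"][1]
print(f"TOON vs JSON: {token_savings(toon_tokens, json_tokens):.1f}% fewer tokens")
```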

 #### Per-Model Accuracy

-Accuracy across 4 LLMs on 204 data retrieval questions:
+Accuracy across 4 LLMs on 209 data retrieval questions:

 ```
 claude-haiku-4-5-20251001
-→ TOON           ████████████░░░░░░░░    62.3% (127/204)
-  JSON           ███████████░░░░░░░░░    56.9% (116/204)
-  YAML           ███████████░░░░░░░░░    55.9% (114/204)
-  JSON compact   ███████████░░░░░░░░░    54.9% (112/204)
-  XML            ███████████░░░░░░░░░    54.9% (112/204)
-  CSV            █████████░░░░░░░░░░░    47.1% (49/104)
+→ TOON           ████████████░░░░░░░░    59.8% (125/209)
+  JSON           ███████████░░░░░░░░░    57.4% (120/209)
+  YAML           ███████████░░░░░░░░░    56.0% (117/209)
+  XML            ███████████░░░░░░░░░    55.5% (116/209)
+  JSON compact   ███████████░░░░░░░░░    55.0% (115/209)
+  CSV            ██████████░░░░░░░░░░    50.5% (55/109)

 gemini-2.5-flash
-→ TOON           ██████████████████░░    91.2% (186/204)
-  YAML           ██████████████████░░    89.7% (183/204)
-  JSON compact   ██████████████████░░    87.7% (179/204)
-  JSON           ██████████████████░░    87.7% (179/204)
-  XML            █████████████████░░░    87.3% (178/204)
-  CSV            █████████████████░░░    85.6% (89/104)
+→ TOON           ██████████████████░░    87.6% (183/209)
+  CSV            █████████████████░░░    86.2% (94/109)
+  JSON compact   ████████████████░░░░    82.3% (172/209)
+  YAML           ████████████████░░░░    79.4% (166/209)
+  XML            ████████████████░░░░    79.4% (166/209)
+  JSON           ███████████████░░░░░    77.0% (161/209)

 gpt-5-nano
-  JSON compact   ███████████████████░    93.6% (191/204)
-  CSV            ██████████████████░░    90.4% (94/104)
-  JSON           ██████████████████░░    89.7% (183/204)
-→ TOON           ██████████████████░░    89.2% (182/204)
-  YAML           ██████████████████░░    89.2% (182/204)
-  XML            ████████████████░░░░    81.4% (166/204)
+→ TOON           ██████████████████░░    90.9% (190/209)
+  JSON compact   ██████████████████░░    90.9% (190/209)
+  JSON           ██████████████████░░    89.0% (186/209)
+  CSV            ██████████████████░░    89.0% (97/109)
+  YAML           █████████████████░░░    87.1% (182/209)
+  XML            ████████████████░░░░    80.9% (169/209)

 grok-4-fast-non-reasoning
-→ TOON           ████████████░░░░░░░░    59.3% (121/204)
-  JSON compact   ███████████░░░░░░░░░    56.9% (116/204)
-  JSON           ███████████░░░░░░░░░    55.4% (113/204)
-  YAML           ███████████░░░░░░░░░    54.9% (112/204)
-  XML            ██████████░░░░░░░░░░    52.5% (107/204)
-  CSV            ██████████░░░░░░░░░░    48.1% (50/104)
+→ TOON           ███████████░░░░░░░░░    57.4% (120/209)
+  JSON           ███████████░░░░░░░░░    55.5% (116/209)
+  JSON compact   ███████████░░░░░░░░░    54.5% (114/209)
+  YAML           ███████████░░░░░░░░░    53.6% (112/209)
+  XML            ███████████░░░░░░░░░    52.6% (110/209)
+  CSV            ██████████░░░░░░░░░░    52.3% (57/109)
 ```
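
The overall accuracy figures match what you get by pooling the per-model counts above (total correct answers over total questions asked). A short Python sketch of that check, assuming pooled rather than averaged accuracy; `pooled_accuracy` is an illustrative name:

```python
# Correct answers per model, copied from the chart above, in the order:
# claude-haiku-4-5-20251001, gemini-2.5-flash, gpt-5-nano, grok-4-fast-non-reasoning.
CORRECT = {
    "TOON": [125, 183, 190, 120],
    "JSON": [120, 161, 186, 116],
}
QUESTIONS_PER_MODEL = 209


def pooled_accuracy(correct_counts: list[int], questions_per_model: int) -> float:
    """Total correct answers divided by total questions, as a percentage."""
    total_questions = questions_per_model * len(correct_counts)
    return sum(correct_counts) / total_questions * 100


for name, counts in CORRECT.items():
    print(f"{name:<5} {pooled_accuracy(counts, QUESTIONS_PER_MODEL):.1f}%")
```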

-**Key tradeoff:** TOON achieves **75.5% accuracy** (vs JSON's 72.4%) while using **39.5% fewer tokens** on these datasets.
+**Key tradeoff:** TOON achieves **73.9% accuracy** (vs JSON's 69.7%) while using **39.6% fewer tokens** on these datasets.

 <details>
 <summary><strong>Performance by dataset, model, and question type</strong></summary>

 #### Performance by Question Type

-| Question Type | TOON | JSON compact | JSON | YAML | XML | CSV |
+| Question Type | TOON | JSON compact | JSON | CSV | YAML | XML |
 | ------------- | ---- | ---- | ---- | ---- | ---- | ---- |
-| Field Retrieval | 100.0% | 98.9% | 99.6% | 99.3% | 98.5% | 100.0% |
-| Aggregation | 56.3% | 52.4% | 53.2% | 53.2% | 47.2% | 40.5% |
-| Filtering | 58.9% | 58.3% | 54.2% | 53.1% | 50.5% | 49.1% |
-| Structure Awareness | 89.0% | 85.0% | 82.0% | 85.0% | 79.0% | 84.4% |
+| Field Retrieval | 99.6% | 99.3% | 99.3% | 100.0% | 98.2% | 98.9% |
+| Aggregation | 54.4% | 47.2% | 48.8% | 44.0% | 47.6% | 41.3% |
+| Filtering | 56.3% | 57.3% | 50.5% | 49.1% | 51.0% | 47.9% |
+| Structure Awareness | 88.0% | 83.0% | 83.0% | 85.9% | 80.0% | 80.0% |
+| Structural Validation | 70.0% | 45.0% | 50.0% | 80.0% | 60.0% | 80.0% |

 #### Performance by Dataset

@@ -418,64 +425,119 @@ grok-4-fast-non-reasoning

 | Format | Accuracy | Tokens | Correct/Total |
 | ------ | -------- | ------ | ------------- |
-| `csv` | 70.7% | 2,337 | 116/164 |
-| `toon` | 72.0% | 2,483 | 118/164 |
-| `json-compact` | 71.3% | 3,943 | 117/164 |
-| `yaml` | 70.1% | 4,969 | 115/164 |
-| `json-pretty` | 72.6% | 6,347 | 119/164 |
-| `xml` | 70.7% | 7,314 | 116/164 |
+| `csv` | 72.0% | 2,352 | 118/164 |
+| `toon` | 73.8% | 2,518 | 121/164 |
+| `json-compact` | 69.5% | 3,953 | 114/164 |
+| `yaml` | 68.3% | 4,982 | 112/164 |
+| `json-pretty` | 68.3% | 6,360 | 112/164 |
+| `xml` | 69.5% | 7,324 | 114/164 |

 ##### E-commerce orders with nested structures

 | Format | Accuracy | Tokens | Correct/Total |
 | ------ | -------- | ------ | ------------- |
-| `toon` | 83.5% | 7,197 | 137/164 |
-| `json-compact` | 79.3% | 6,784 | 130/164 |
-| `yaml` | 78.7% | 8,334 | 129/164 |
-| `json-pretty` | 78.7% | 10,700 | 129/164 |
-| `xml` | 73.8% | 12,013 | 121/164 |
+| `toon` | 81.1% | 7,232 | 133/164 |
+| `json-compact` | 76.8% | 6,794 | 126/164 |
+| `yaml` | 75.6% | 8,347 | 124/164 |
+| `json-pretty` | 76.2% | 10,713 | 125/164 |
+| `xml` | 74.4% | 12,023 | 122/164 |

 ##### Time-series analytics data

 | Format | Accuracy | Tokens | Correct/Total |
 | ------ | -------- | ------ | ------------- |
-| `toon` | 75.8% | 1,513 | 91/120 |
-| `csv` | 72.5% | 1,391 | 87/120 |
-| `json-compact` | 70.0% | 2,339 | 84/120 |
-| `yaml` | 70.0% | 2,936 | 84/120 |
-| `json-pretty` | 71.7% | 3,663 | 86/120 |
-| `xml` | 71.7% | 4,374 | 86/120 |
+| `csv` | 73.3% | 1,406 | 88/120 |
+| `toon` | 72.5% | 1,548 | 87/120 |
+| `json-compact` | 71.7% | 2,349 | 86/120 |
+| `yaml` | 71.7% | 2,949 | 86/120 |
+| `json-pretty` | 68.3% | 3,676 | 82/120 |
+| `xml` | 68.3% | 4,384 | 82/120 |

 ##### Top 100 GitHub repositories

 | Format | Accuracy | Tokens | Correct/Total |
 | ------ | -------- | ------ | ------------- |
-| `toon` | 64.4% | 8,745 | 85/132 |
-| `csv` | 59.8% | 8,513 | 79/132 |
-| `json-compact` | 60.6% | 11,455 | 80/132 |
-| `yaml` | 61.4% | 13,129 | 81/132 |
-| `json-pretty` | 59.1% | 15,145 | 78/132 |
-| `xml` | 51.5% | 17,095 | 68/132 |
+| `toon` | 62.9% | 8,780 | 83/132 |
+| `csv` | 61.4% | 8,528 | 81/132 |
+| `yaml` | 59.8% | 13,142 | 79/132 |
+| `json-compact` | 55.3% | 11,465 | 73/132 |
+| `json-pretty` | 56.1% | 15,158 | 74/132 |
+| `xml` | 48.5% | 17,105 | 64/132 |

 ##### Semi-uniform event logs

 | Format | Accuracy | Tokens | Correct/Total |
 | ------ | -------- | ------ | ------------- |
-| `json-compact` | 67.5% | 4,809 | 81/120 |
-| `yaml` | 63.3% | 5,814 | 76/120 |
-| `toon` | 62.5% | 5,764 | 75/120 |
-| `json-pretty` | 59.2% | 6,784 | 71/120 |
-| `xml` | 55.0% | 7,699 | 66/120 |
+| `json-compact` | 63.3% | 4,819 | 76/120 |
+| `toon` | 57.5% | 5,799 | 69/120 |
+| `json-pretty` | 59.2% | 6,797 | 71/120 |
+| `yaml` | 48.3% | 5,827 | 58/120 |
+| `xml` | 46.7% | 7,709 | 56/120 |

 ##### Deeply nested configuration

 | Format | Accuracy | Tokens | Correct/Total |
 | ------ | -------- | ------ | ------------- |
-| `json-compact` | 91.4% | 564 | 106/116 |
-| `toon` | 94.8% | 631 | 110/116 |
-| `yaml` | 91.4% | 673 | 106/116 |
-| `json-pretty` | 93.1% | 919 | 108/116 |
-| `xml` | 91.4% | 1,008 | 106/116 |
+| `json-compact` | 92.2% | 574 | 107/116 |
+| `toon` | 95.7% | 666 | 111/116 |
+| `yaml` | 91.4% | 686 | 106/116 |
+| `json-pretty` | 94.0% | 932 | 109/116 |
+| `xml` | 92.2% | 1,018 | 107/116 |
+
+##### Valid complete dataset (control)
+
+| Format | Accuracy | Tokens | Correct/Total |
+| ------ | -------- | ------ | ------------- |
+| `toon` | 100.0% | 544 | 4/4 |
+| `json-compact` | 100.0% | 795 | 4/4 |
+| `yaml` | 100.0% | 1,003 | 4/4 |
+| `json-pretty` | 100.0% | 1,282 | 4/4 |
+| `csv` | 25.0% | 492 | 1/4 |
+| `xml` | 0.0% | 1,467 | 0/4 |
+
+##### Array truncated: 3 rows removed from end
+
+| Format | Accuracy | Tokens | Correct/Total |
+| ------ | -------- | ------ | ------------- |
+| `csv` | 100.0% | 425 | 4/4 |
+| `xml` | 100.0% | 1,251 | 4/4 |
+| `toon` | 0.0% | 474 | 0/4 |
+| `json-compact` | 0.0% | 681 | 0/4 |
+| `json-pretty` | 0.0% | 1,096 | 0/4 |
+| `yaml` | 0.0% | 859 | 0/4 |
+
+##### Extra rows added beyond declared length
+
+| Format | Accuracy | Tokens | Correct/Total |
+| ------ | -------- | ------ | ------------- |
+| `csv` | 100.0% | 566 | 4/4 |
+| `toon` | 75.0% | 621 | 3/4 |
+| `xml` | 100.0% | 1,692 | 4/4 |
+| `yaml` | 75.0% | 1,157 | 3/4 |
+| `json-compact` | 50.0% | 917 | 2/4 |
+| `json-pretty` | 50.0% | 1,476 | 2/4 |
+
+##### Inconsistent field count (missing salary in row 10)
+
+| Format | Accuracy | Tokens | Correct/Total |
+| ------ | -------- | ------ | ------------- |
+| `csv` | 75.0% | 489 | 3/4 |
+| `yaml` | 100.0% | 996 | 4/4 |
+| `toon` | 100.0% | 1,019 | 4/4 |
+| `json-compact` | 75.0% | 790 | 3/4 |
+| `xml` | 100.0% | 1,458 | 4/4 |
+| `json-pretty` | 75.0% | 1,274 | 3/4 |
+
+##### Missing required fields (no email in multiple rows)
+
+| Format | Accuracy | Tokens | Correct/Total |
+| ------ | -------- | ------ | ------------- |
+| `csv` | 100.0% | 329 | 4/4 |
+| `xml` | 100.0% | 1,411 | 4/4 |
+| `toon` | 75.0% | 983 | 3/4 |
+| `yaml` | 25.0% | 960 | 1/4 |
+| `json-pretty` | 25.0% | 1,230 | 1/4 |
+| `json-compact` | 0.0% | 755 | 0/4 |

 #### Performance by Model

@@ -483,45 +545,45 @@ grok-4-fast-non-reasoning

 | Format | Accuracy | Correct/Total |
 | ------ | -------- | ------------- |
-| `toon` | 62.3% | 127/204 |
-| `json-pretty` | 56.9% | 116/204 |
-| `yaml` | 55.9% | 114/204 |
-| `json-compact` | 54.9% | 112/204 |
-| `xml` | 54.9% | 112/204 |
-| `csv` | 47.1% | 49/104 |
+| `toon` | 59.8% | 125/209 |
+| `json-pretty` | 57.4% | 120/209 |
+| `yaml` | 56.0% | 117/209 |
+| `xml` | 55.5% | 116/209 |
+| `json-compact` | 55.0% | 115/209 |
+| `csv` | 50.5% | 55/109 |

 ##### gemini-2.5-flash

 | Format | Accuracy | Correct/Total |
 | ------ | -------- | ------------- |
-| `toon` | 91.2% | 186/204 |
-| `yaml` | 89.7% | 183/204 |
-| `json-compact` | 87.7% | 179/204 |
-| `json-pretty` | 87.7% | 179/204 |
-| `xml` | 87.3% | 178/204 |
-| `csv` | 85.6% | 89/104 |
+| `toon` | 87.6% | 183/209 |
+| `csv` | 86.2% | 94/109 |
+| `json-compact` | 82.3% | 172/209 |
+| `yaml` | 79.4% | 166/209 |
+| `xml` | 79.4% | 166/209 |
+| `json-pretty` | 77.0% | 161/209 |

 ##### gpt-5-nano

 | Format | Accuracy | Correct/Total |
 | ------ | -------- | ------------- |
-| `json-compact` | 93.6% | 191/204 |
-| `csv` | 90.4% | 94/104 |
-| `json-pretty` | 89.7% | 183/204 |
-| `toon` | 89.2% | 182/204 |
-| `yaml` | 89.2% | 182/204 |
-| `xml` | 81.4% | 166/204 |
+| `toon` | 90.9% | 190/209 |
+| `json-compact` | 90.9% | 190/209 |
+| `json-pretty` | 89.0% | 186/209 |
+| `csv` | 89.0% | 97/109 |
+| `yaml` | 87.1% | 182/209 |
+| `xml` | 80.9% | 169/209 |

 ##### grok-4-fast-non-reasoning

 | Format | Accuracy | Correct/Total |
 | ------ | -------- | ------------- |
-| `toon` | 59.3% | 121/204 |
-| `json-compact` | 56.9% | 116/204 |
-| `json-pretty` | 55.4% | 113/204 |
-| `yaml` | 54.9% | 112/204 |
-| `xml` | 52.5% | 107/204 |
-| `csv` | 48.1% | 50/104 |
+| `toon` | 57.4% | 120/209 |
+| `json-pretty` | 55.5% | 116/209 |
+| `json-compact` | 54.5% | 114/209 |
+| `yaml` | 53.6% | 112/209 |
+| `xml` | 52.6% | 110/209 |
+| `csv` | 52.3% | 57/109 |

 </details>

@@ -534,8 +596,9 @@ This benchmark tests **LLM comprehension and data retrieval accuracy** across di

 #### Datasets Tested

-Six datasets designed to test different structural patterns:
+Eleven datasets designed to test different structural patterns and validation capabilities:

+**Primary datasets:**
 1. **Tabular** (100 employee records): Uniform objects with identical fields – optimal for TOON's tabular format.
 2. **Nested** (50 e-commerce orders): Complex structures with nested customer objects and item arrays.
 3. **Analytics** (60 days of metrics): Time-series data with dates and numeric values.
@@ -543,21 +606,28 @@ Six datasets designed to test different structural patterns:
 5. **Event Logs** (75 logs): Semi-uniform data with ~50% flat logs and ~50% with nested error objects.
 6. **Nested Config** (1 configuration): Deeply nested configuration with minimal tabular eligibility.

+**Structural validation datasets:**
+7. **Control**: Valid complete dataset (baseline for validation).
+8. **Truncated**: Array with 3 rows removed from the end (tests `[N]` length detection).
+9. **Extra rows**: Array with 3 additional rows beyond the declared length.
+10. **Width mismatch**: Inconsistent field count (missing salary in row 10).
+11. **Missing fields**: Systematic field omissions (no email in multiple rows).
+
 #### Question Types

-204 questions are generated dynamically across four categories:
+209 questions are generated dynamically across five categories:

 - **Field retrieval (33%)**: Direct value lookups or values that can be read straight off a record (including booleans and simple counts such as array lengths)
  - Example: "What is Alice's salary?" → `75000`
  - Example: "How many items are in order ORD-0042?" → `3`
  - Example: "What is the customer name for order ORD-0042?" → `John Doe`

- **Aggregation (31%)**: Dataset-level totals and averages plus single-condition filters (counts, sums, min/max comparisons)
+- **Aggregation (30%)**: Dataset-level totals and averages plus single-condition filters (counts, sums, min/max comparisons)
  - Example: "How many employees work in Engineering?" → `17`
  - Example: "What is the total revenue across all orders?" → `45123.50`
  - Example: "How many employees have salary > 80000?" → `23`

- **Filtering (24%)**: Multi-condition queries requiring compound logic (AND constraints across fields)
+- **Filtering (23%)**: Multi-condition queries requiring compound logic (AND constraints across fields)
  - Example: "How many employees in Sales have salary > 80000?" → `5`
  - Example: "How many active employees have more than 10 years of experience?" → `8`

@@ -566,18 +636,23 @@ Six datasets designed to test different structural patterns:
  - Example: "List the field names for employees" → `id, name, email, department, salary, yearsExperience, active`
  - Example: "What is the department of the last employee?" → `Sales`

+- **Structural validation (2%)**: Tests the ability to detect incomplete, truncated, or corrupted data using structural metadata
+  - Example: "Is this data complete and valid?" → `YES` (control dataset) or `NO` (corrupted datasets)
+  - Tests TOON's `[N]` length validation and `{fields}` consistency checking
+  - Contrasts with CSV, which has no declared row count to validate against
+
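The kind of check that `[N]` and `{fields}` make possible can be sketched in a few lines of Python. This is a simplified illustration, not the benchmark's or the library's validator: `validate_tabular` is a hypothetical helper, the sample rows are made up, and the naive comma split does not handle quoted values.

```python
import re

# Matches a tabular header such as: employees[3]{id,name,salary}:
HEADER = re.compile(r"^(?P<key>\w+)\[(?P<length>\d+)\]\{(?P<fields>[^}]*)\}:$")


def validate_tabular(block: str) -> list[str]:
    """Return structural problems; an empty list means none were found."""
    lines = [line for line in block.splitlines() if line.strip()]
    if not lines:
        return ["empty block"]
    match = HEADER.match(lines[0].strip())
    if match is None:
        return ["header is not of the form key[N]{fields}:"]

    declared = int(match["length"])
    width = len(match["fields"].split(","))
    rows = lines[1:]

    problems = []
    # Truncated or extra rows show up as a mismatch with the declared length.
    if len(rows) != declared:
        problems.append(f"declared {declared} rows, found {len(rows)}")
    # Width mismatches show up as rows with the wrong number of values.
    for number, row in enumerate(rows, start=1):
        cells = row.strip().split(",")
        if len(cells) != width:
            problems.append(f"row {number} has {len(cells)} values, expected {width}")
    return problems


truncated = "employees[3]{id,name,salary}:\n  1,Alice,75000\n  2,Bob,82000"
print(validate_tabular(truncated))
```

A format without a declared length gives a reader nothing to compare the row count against, which is the property the truncated and extra-rows datasets are meant to probe.
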
 #### Evaluation Process

-1. **Format conversion**: Each dataset is converted to all 6 formats (TOON, JSON compact, JSON, YAML, XML, CSV).
+1. **Format conversion**: Each dataset is converted to all 6 formats (TOON, JSON compact, JSON, CSV, YAML, XML).
 2. **Query LLM**: Each model receives formatted data + question in a prompt and extracts the answer.
-3. **Validate with LLM-as-judge**: `gpt-5-nano` validates if the answer is semantically correct (e.g., `50000` = `$50,000`, `Engineering` = `engineering`, `2025-01-01` = `January 1, 2025`).
+3. **Validate deterministically**: Answers are validated using type-aware comparison (e.g., `50000` = `$50,000`, `Engineering` = `engineering`, `2025-01-01` = `January 1, 2025`) without requiring an LLM judge.
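
As a rough illustration of what a type-aware comparison can look like, the Python sketch below normalizes numbers, dates, and plain strings before comparing. It is an assumption about the general approach and not the benchmark's implementation; `answers_match` and the two date formats it accepts are chosen only to cover the examples in step 3.

```python
import re
from datetime import date, datetime
from typing import Optional

_NUMBER = re.compile(r"-?\d+(?:\.\d+)?")
_DATE_FORMATS = ("%Y-%m-%d", "%B %d, %Y")


def _as_number(text: str) -> Optional[float]:
    """Parse '50000' or '$50,000' as a float; return None for anything else."""
    cleaned = re.sub(r"[$,\s]", "", text)
    return float(cleaned) if _NUMBER.fullmatch(cleaned) else None


def _as_date(text: str) -> Optional[date]:
    """Parse '2025-01-01' or 'January 1, 2025'; return None otherwise."""
    for fmt in _DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date()
        except ValueError:
            continue
    return None


def answers_match(expected: str, actual: str) -> bool:
    """Compare as numbers, then as dates, then as case-insensitive strings."""
    for parse in (_as_number, _as_date):
        left, right = parse(expected), parse(actual)
        if left is not None and right is not None:
            return left == right
    return expected.strip().casefold() == actual.strip().casefold()


print(answers_match("50000", "$50,000"))
print(answers_match("Engineering", "engineering"))
print(answers_match("2025-01-01", "January 1, 2025"))
```

Because the comparison is a pure function of the two strings, the same answer always receives the same verdict, which is not guaranteed with an LLM judge.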

 #### Models & Configuration

 - **Models tested**: `claude-haiku-4-5-20251001`, `gemini-2.5-flash`, `gpt-5-nano`, `grok-4-fast-non-reasoning`
 - **Token counting**: Using `gpt-tokenizer` with `o200k_base` encoding (GPT-5 tokenizer)
 - **Temperature**: Not set (models use their defaults)
- **Total evaluations**: 204 questions × 6 formats × 4 models = 4,896 LLM calls
+- **Total evaluations**: 209 questions × 6 formats × 4 models = 5,016 LLM calls

 </details>

@@ -782,6 +857,9 @@ items[1]:
    status: active
 ```

+> [!NOTE]
+> Tabular format requires identical field sets across all objects (same keys, order doesn't matter) and primitive values only (strings, numbers, booleans, null).
+
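The two requirements in the note translate directly into a small eligibility check. The Python sketch below is illustrative only: `is_tabular` is a hypothetical helper, and treating an empty array as ineligible is an assumption of this sketch, not something the note specifies.

```python
# Python equivalents of the primitive types listed in the note.
PRIMITIVES = (str, int, float, bool, type(None))


def is_tabular(items: list) -> bool:
    """True if every item is an object with the same keys and primitive values."""
    if not items or not all(isinstance(item, dict) for item in items):
        return False
    keys = set(items[0])  # sets ignore key order, as the note allows
    return all(
        set(item) == keys
        and all(isinstance(value, PRIMITIVES) for value in item.values())
        for item in items
    )


print(is_tabular([{"id": 1, "name": "Alice"}, {"name": "Bob", "id": 2}]))
print(is_tabular([{"id": 1, "tags": ["a", "b"]}]))
```

The first call passes because key order is ignored; the second fails because a list is not a primitive value.
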
 #### Mixed and Non-Uniform Arrays

 Arrays that don't meet the tabular requirements use list format: