Retrieval Accuracy

Tested across 3 LLMs on data-retrieval tasks:

gpt-5-nano
  toon         ████████████████████  99.4% (158/159)
  yaml         ███████████████████░  95.0% (151/159)
  csv          ██████████████████░░  92.5% (147/159)
  json         ██████████████████░░  92.5% (147/159)
  xml          ██████████████████░░  91.2% (145/159)

claude-haiku-4-5
  toon         ███████████████░░░░░  75.5% (120/159)
  xml          ███████████████░░░░░  75.5% (120/159)
  csv          ███████████████░░░░░  75.5% (120/159)
  json         ███████████████░░░░░  75.5% (120/159)
  yaml         ███████████████░░░░░  74.2% (118/159)

gemini-2.5-flash
  xml          ██████████████████░░  91.8% (146/159)
  csv          █████████████████░░░  86.2% (137/159)
  toon         █████████████████░░░  84.9% (135/159)
  json         ████████████████░░░░  81.8% (130/159)
  yaml         ████████████████░░░░  78.6% (125/159)

Advantage: across the three models, TOON achieves 86.6% overall accuracy (vs. JSON's 83.2%) while using 46.3% fewer tokens.
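
Most of the token savings comes from TOON's tabular array form, which declares the field names once in an array header and emits one compact row per record, whereas JSON repeats every key in every object. A rough illustration (the records and field names below are invented for this example):

  JSON:
    [
      { "id": 1, "name": "Alice", "role": "admin" },
      { "id": 2, "name": "Bob", "role": "user" }
    ]

  TOON:
    users[2]{id,name,role}:
      1,Alice,admin
      2,Bob,user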

The tables below give the detailed breakdown by dataset and by model.

Performance by Dataset

Uniform employee records (TOON-optimal format)

  Format   Accuracy   Tokens   Correct/Total
  toon     87.4%      2,483    152/174
  csv      82.8%      2,337    144/174
  yaml     83.9%      4,969    146/174
  json     83.9%      6,347    146/174
  xml      88.5%      7,314    154/174

E-commerce orders with nested structures

  Format   Accuracy   Tokens   Correct/Total
  toon     90.9%      5,967    120/132
  csv      93.9%      6,735    124/132
  yaml     87.1%      7,328    115/132
  json     87.9%      9,694    116/132
  xml      93.2%      10,992   123/132

Time-series analytics data

  Format   Accuracy   Tokens   Correct/Total
  csv      89.7%      1,393    78/87
  toon     88.5%      1,515    77/87
  yaml     83.9%      2,938    73/87
  json     88.5%      3,665    77/87
  xml      85.1%      4,376    74/87

Top 100 GitHub repositories

  Format   Accuracy   Tokens   Correct/Total
  toon     76.2%      8,745    64/84
  csv      69.0%      8,513    58/84
  yaml     71.4%      13,129   60/84
  json     69.0%      15,145   58/84
  xml      71.4%      17,095   60/84

Performance by Model

gpt-5-nano

  Format   Accuracy   Correct/Total
  toon     99.4%      158/159
  yaml     95.0%      151/159
  csv      92.5%      147/159
  json     92.5%      147/159
  xml      91.2%      145/159

claude-haiku-4-5

  Format   Accuracy   Correct/Total
  toon     75.5%      120/159
  xml      75.5%      120/159
  csv      75.5%      120/159
  json     75.5%      120/159
  yaml     74.2%      118/159

gemini-2.5-flash

  Format   Accuracy   Correct/Total
  xml      91.8%      146/159
  csv      86.2%      137/159
  toon     84.9%      135/159
  json     81.8%      130/159
  yaml     78.6%      125/159

Methodology

  • Semantic validation: an LLM-as-judge scores each response for semantic correctness rather than exact string matching.
  • Token counting: measured with gpt-tokenizer using the o200k_base encoding (see the sketch after this list).
  • Question types: ~160 questions across field retrieval, aggregation, and filtering tasks.
  • Datasets: Faker.js-generated datasets (seeded) + GitHub repositories.
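
For reference, the sketch below counts tokens the same way as the report: gpt-tokenizer with the o200k_base encoding, via the package's per-encoding entry point. The variable names and sample payload are illustrative only, not the benchmark's actual harness.

  // Minimal token-counting sketch; assumes the gpt-tokenizer npm package.
  import { encode } from "gpt-tokenizer/encoding/o200k_base";

  // The token figures in the dataset tables are the length of the o200k_base
  // encoding of the serialized dataset.
  function countTokens(text: string): number {
    return encode(text).length;
  }

  // Invented sample payload for illustration.
  const sample = JSON.stringify([
    { id: 1, name: "Alice", role: "admin" },
    { id: 2, name: "Bob", role: "user" },
  ]);

  console.log(`tokens: ${countTokens(sample)}`);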