chore: more work on benchmarks

Johann Schopplich
2025-11-06 15:51:31 +01:00
parent bc711ccecf
commit a9d52fc69b
15 changed files with 1647 additions and 213 deletions


@@ -34,7 +34,7 @@ Results are saved to `results/token-efficiency.md`.
Tests how well LLMs can answer questions about data in different formats (TOON, JSON, JSON compact, XML, YAML, CSV):
-1. Generate ~150-160 questions across 6 datasets (CSV only included for datasets with flat/tabular structure)
+1. Generate ~200 questions across 6 datasets (CSV only included for datasets with flat/tabular structure)
2. Convert each dataset to all supported formats
3. Query each LLM with formatted data + question
4. Validate answers using `gpt-5-nano` as judge
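The four steps above describe the accuracy benchmark loop. As a rough illustration only (not the repository's actual implementation), a minimal sketch of that loop might look like the following; `encode`, `queryModel`, and `judge` are hypothetical placeholders standing in for the real format encoders, LLM clients, and the `gpt-5-nano` judge call:

```ts
// Illustrative sketch: names and signatures are assumptions, not the repo's API.
type Format = "toon" | "json" | "json-compact" | "xml" | "yaml" | "csv";

interface Question {
  dataset: string;   // which of the 6 datasets the question targets
  prompt: string;    // the question posed to the model
  expected: string;  // ground-truth answer used by the judge
}

// Hypothetical helpers standing in for the real encoders / LLM clients.
declare function encode(dataset: string, format: Format): string;
declare function queryModel(model: string, data: string, question: string): Promise<string>;
declare function judge(expected: string, actual: string): Promise<boolean>; // e.g. backed by gpt-5-nano

async function runAccuracyBenchmark(models: string[], formats: Format[], questions: Question[]) {
  const results = new Map<string, { correct: number; total: number }>();

  for (const model of models) {
    for (const format of formats) {
      const key = `${model}/${format}`;
      const tally = { correct: 0, total: 0 };
      results.set(key, tally);

      for (const q of questions) {
        // Steps 2–3: convert the dataset to the target format, then query the LLM.
        const data = encode(q.dataset, format);
        const answer = await queryModel(model, data, q.prompt);

        // Step 4: validate the answer with an LLM judge.
        tally.total += 1;
        if (await judge(q.expected, answer)) tally.correct += 1;
      }
    }
  }

  return results; // per model/format accuracy counts
}
```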