test(benchmark): overhaul generation

2026-01-29 15:24:10 +08:00 · 2025-11-06 14:45:44 +01:00
parent 9863875706
commit bc711ccecf
19 changed files with 2254 additions and 997 deletions
--- a/benchmarks/README.md
+++ b/benchmarks/README.md
@@ -34,8 +34,8 @@ Results are saved to `results/token-efficiency.md`.

 Tests how well LLMs can answer questions about data in different formats (TOON, JSON, JSON compact, XML, YAML, CSV):

-1. Generate ~150-160 questions across 4 datasets
-2. Convert each dataset to all 6 formats
+1. Generate ~150-160 questions across 6 datasets (CSV only included for datasets with flat/tabular structure)
+2. Convert each dataset to all supported formats
 3. Query each LLM with formatted data + question
 4. Validate answers using `gpt-5-nano` as judge
 5. Aggregate metrics and generate report
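
Note on the updated steps 1 and 2: the rule that CSV is only included for flat/tabular datasets could be expressed roughly as below. This is a minimal sketch, not the benchmark's actual code. The names `FORMATS`, `is_flat_tabular`, and `supported_formats` are hypothetical, and the definition of "flat" used here (a non-empty list of objects with identical keys and only scalar values) is an assumption.

```python
# Hypothetical format identifiers; the README lists TOON, JSON, JSON compact, XML, YAML, CSV.
FORMATS = ["toon", "json", "json-compact", "xml", "yaml", "csv"]

# Values treated as scalar (assumption: anything else counts as nested).
SCALAR_TYPES = (str, int, float, bool, type(None))


def is_flat_tabular(dataset):
    """Return True if dataset is a non-empty list of dicts that all share
    the same keys and contain only scalar values."""
    if not isinstance(dataset, list) or not dataset:
        return False
    if not all(isinstance(row, dict) for row in dataset):
        return False
    keys = set(dataset[0])
    return all(
        set(row) == keys
        and all(isinstance(value, SCALAR_TYPES) for value in row.values())
        for row in dataset
    )


def supported_formats(dataset):
    """Return the formats to generate for a dataset, dropping CSV
    when the data is not flat/tabular."""
    return [f for f in FORMATS if f != "csv" or is_flat_tabular(dataset)]
```

With this sketch, a list of uniform records would be converted to all six formats, while a dataset containing nested objects or arrays would skip CSV.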