chore: more work on benchmarks

Johann Schopplich
2025-11-06 15:51:31 +01:00
parent bc711ccecf
commit a9d52fc69b
15 changed files with 1647 additions and 213 deletions


@@ -34,7 +34,7 @@ Results are saved to `results/token-efficiency.md`.
Tests how well LLMs can answer questions about data in different formats (TOON, JSON, JSON compact, XML, YAML, CSV):
-1. Generate ~150-160 questions across 6 datasets (CSV only included for datasets with flat/tabular structure)
+1. Generate ~200 questions across 6 datasets (CSV only included for datasets with flat/tabular structure)
2. Convert each dataset to all supported formats
3. Query each LLM with formatted data + question
4. Validate answers using `gpt-5-nano` as judge
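The four steps above describe the accuracy benchmark loop. As a rough illustration only (not the repository's actual implementation), a minimal sketch of that loop might look like the following; `encode`, `queryModel`, and `judge` are hypothetical placeholders standing in for the real format encoders, LLM clients, and the `gpt-5-nano` judge call:

```ts
// Illustrative sketch: names and signatures are assumptions, not the repo's API.
type Format = "toon" | "json" | "json-compact" | "xml" | "yaml" | "csv";

interface Question {
  dataset: string;   // which of the 6 datasets the question targets
  prompt: string;    // the question posed to the model
  expected: string;  // ground-truth answer used by the judge
}

// Hypothetical helpers standing in for the real encoders / LLM clients.
declare function encode(dataset: string, format: Format): string;
declare function queryModel(model: string, data: string, question: string): Promise<string>;
declare function judge(expected: string, actual: string): Promise<boolean>; // e.g. backed by gpt-5-nano

async function runAccuracyBenchmark(models: string[], formats: Format[], questions: Question[]) {
  const results = new Map<string, { correct: number; total: number }>();

  for (const model of models) {
    for (const format of formats) {
      const key = `${model}/${format}`;
      const tally = { correct: 0, total: 0 };
      results.set(key, tally);

      for (const q of questions) {
        // Steps 2–3: convert the dataset to the target format, then query the LLM.
        const data = encode(q.dataset, format);
        const answer = await queryModel(model, data, q.prompt);

        // Step 4: validate the answer with an LLM judge.
        tally.total += 1;
        if (await judge(q.expected, answer)) tally.correct += 1;
      }
    }
  }

  return results; // per model/format accuracy counts
}
```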