11 Commits

Author | SHA1 | Message | Date
Johann Schopplich | acca69c64a | chore(benchmarks): replace LLM-as-judge, new structural validation | 2025-11-07 21:28:21 +01:00
Johann Schopplich | 89df613059 | chore(benchmarks): add structure-awareness questions | 2025-11-07 09:03:51 +01:00
Johann Schopplich | a9d52fc69b | chore: more work on benchmarks | 2025-11-06 15:51:31 +01:00
Johann Schopplich | bc711ccecf | test(benchmark): overhaul generation | 2025-11-06 14:45:44 +01:00
Johann Schopplich | 983728e913 | refactor: progress bar configuration | 2025-10-30 15:24:22 +01:00
Johann Schopplich | 2c4f3c4362 | test: add benchmarks for compact vs. pretty JSON | 2025-10-30 15:02:51 +01:00
Johann Schopplich | ecf578a7dc | text(accuracy): add Grok-4-fast, remove default temperature | 2025-10-28 22:54:00 +01:00
Johann Schopplich | 67c0df8cb0 | docs: overhaul retrieval accuracy benchmark | 2025-10-28 20:22:43 +01:00
Johann Schopplich | 4ec7e84f5f | refactor: shared utils for benchmark scripts | 2025-10-27 17:37:27 +01:00
Johann Schopplich | 05b3d43023 | test: refactor accuracy benchmark generation | 2025-10-27 14:07:20 +01:00
Johann Schopplich | 3c840259fe | test: add LLM retrieval accuracy tests | 2025-10-27 11:48:33 +01:00