From 52dc9c4b3f39b39a845787c4cf83b30fd052b5e7 Mon Sep 17 00:00:00 2001
From: Johann Schopplich
Date: Tue, 28 Oct 2025 08:39:43 +0100
Subject: [PATCH] docs: clarify retrieval accuracy metrics

---
 README.md                                | 14 +++++---------
 benchmarks/results/accuracy/report.md    |  4 ++--
 benchmarks/results/accuracy/summary.json |  2 +-
 benchmarks/src/report.ts                 |  7 +++++--
 4 files changed, 13 insertions(+), 14 deletions(-)

diff --git a/README.md b/README.md
index 26e15e1..f6bdb37 100644
--- a/README.md
+++ b/README.md
@@ -212,7 +212,7 @@ metrics[5]{date,views,clicks,conversions,revenue,bounceRate}:
 
 ### Retrieval Accuracy
 
-Tested across **3 LLMs** with data retrieval tasks:
+Accuracy across **3 LLMs** on **159 data retrieval questions**:
 
 ```
 gpt-5-nano
@@ -323,7 +323,7 @@ gemini-2.5-flash
 
 #### What's Being Measured
 
-This benchmark tests **LLM comprehension and data retrieval accuracy** when data is presented in different formats. Each LLM receives formatted data and must answer questions about it (this does NOT test LLM's ability to generate TOON output).
+This benchmark tests **LLM comprehension and data retrieval accuracy** across different input formats. Each LLM receives formatted data and must answer questions about it (this does **not** test the model's ability to generate TOON output).
 
 #### Datasets Tested
 
@@ -336,7 +336,7 @@ Four datasets designed to test different structural patterns:
 
 #### Question Types
 
-~160 questions are generated dynamically across three categories:
+159 questions are generated dynamically across three categories:
 
 - **Field retrieval (50%)**: Direct value lookups
   - Example: "What is Alice's salary?" → `75000`
@@ -352,13 +352,9 @@ Four datasets designed to test different structural patterns:
 
 #### Evaluation Process
 
-1. **Format conversion**: Each dataset is converted to all 5 formats (TOON, JSON, YAML, CSV, XML).
+1. **Format conversion:** Each dataset is converted to all 5 formats (TOON, JSON, YAML, CSV, XML).
 2. **Query LLM**: Each model receives formatted data + question in a prompt and extracts the answer.
-3. **Validate with LLM-as-judge**: GPT-5-nano validates if the answer is semantically correct.
-
-#### Semantic Validation
-
-Answers are validated by an LLM judge (`gpt-5-nano`) using semantic equivalence, not exact string matching (e.g., `50000` = `$50,000`, `Engineering` = `engineering`, `2025-01-01` = `January 1, 2025`).
+3. **Validate with LLM-as-judge**: `gpt-5-nano` validates whether the answer is semantically correct (e.g., `50000` = `$50,000`, `Engineering` = `engineering`, `2025-01-01` = `January 1, 2025`).
 
 #### Models & Configuration
 
diff --git a/benchmarks/results/accuracy/report.md b/benchmarks/results/accuracy/report.md
index ff366db..3ddd0f2 100644
--- a/benchmarks/results/accuracy/report.md
+++ b/benchmarks/results/accuracy/report.md
@@ -1,6 +1,6 @@
 ### Retrieval Accuracy
 
-Tested across **3 LLMs** with data retrieval tasks:
+Accuracy across **3 LLMs** on **159 data retrieval questions**:
 
 ```
 gpt-5-nano
@@ -124,7 +124,7 @@ Four datasets designed to test different structural patterns:
 
 #### Question Types
 
-~160 questions are generated dynamically across three categories:
+159 questions are generated dynamically across three categories:
 
 - **Field retrieval (50%)**: Direct value lookups
   - Example: "What is Alice's salary?" → `75000`
diff --git a/benchmarks/results/accuracy/summary.json b/benchmarks/results/accuracy/summary.json
index f5dd2c2..b3aa797 100644
--- a/benchmarks/results/accuracy/summary.json
+++ b/benchmarks/results/accuracy/summary.json
@@ -87,5 +87,5 @@
     "yaml-analytics": 2938,
     "yaml-github": 13129
   },
-  "timestamp": "2025-10-28T06:43:10.560Z"
+  "timestamp": "2025-10-28T07:39:09.360Z"
 }
diff --git a/benchmarks/src/report.ts b/benchmarks/src/report.ts
index 65859b5..dbc5987 100644
--- a/benchmarks/src/report.ts
+++ b/benchmarks/src/report.ts
@@ -177,10 +177,13 @@ ${tableRows}
 `.trimStart()
   }).join('\n')
 
+  // Count unique question IDs across all results
+  const totalQuestions = new Set(results.map(r => r.questionId)).size
+
   return `
 ### Retrieval Accuracy
 
-Tested across **${modelCount} ${modelCount === 1 ? 'LLM' : 'LLMs'}** with data retrieval tasks:
+Accuracy across **${modelCount} ${modelCount === 1 ? 'LLM' : 'LLMs'}** on **${totalQuestions} data retrieval questions**:
 
 \`\`\`
 ${modelBreakdown}
@@ -217,7 +220,7 @@ Four datasets designed to test different structural patterns:
 
 #### Question Types
 
-~160 questions are generated dynamically across three categories:
+${totalQuestions} questions are generated dynamically across three categories:
 
 - **Field retrieval (50%)**: Direct value lookups
   - Example: "What is Alice's salary?" → \`75000\`