Mirror of https://github.com/voson-wang/toon.git (synced 2026-01-29 15:24:10 +08:00)
docs: clarify retrieval accuracy metrics
Diffstat: README.md (14) · README.md (14)
--- a/README.md
+++ b/README.md
@@ -212,7 +212,7 @@ metrics[5]{date,views,clicks,conversions,revenue,bounceRate}:
 
 ### Retrieval Accuracy
 
-Tested across **3 LLMs** with data retrieval tasks:
+Accuracy across **3 LLMs** on **159 data retrieval questions**:
 
 ```
 gpt-5-nano
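The headline metric here is plain per-model accuracy: correct answers divided by the 159 questions. A minimal sketch of that aggregation, assuming a hypothetical `RetrievalResult` record shape rather than the repo's actual types:

```ts
// Sketch: per-model retrieval accuracy. The `RetrievalResult` shape
// is an assumption for illustration, not this repo's actual type.
interface RetrievalResult {
  model: string
  questionId: string
  correct: boolean
}

function accuracyByModel(results: RetrievalResult[]): Map<string, number> {
  const tally = new Map<string, { correct: number; total: number }>()
  for (const r of results) {
    const t = tally.get(r.model) ?? { correct: 0, total: 0 }
    t.correct += r.correct ? 1 : 0
    t.total += 1
    tally.set(r.model, t)
  }
  // Accuracy = fraction of questions answered correctly per model.
  return new Map([...tally].map(([model, t]) => [model, t.correct / t.total]))
}
```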
@@ -323,7 +323,7 @@ gemini-2.5-flash
 
 #### What's Being Measured
 
-This benchmark tests **LLM comprehension and data retrieval accuracy** when data is presented in different formats. Each LLM receives formatted data and must answer questions about it (this does NOT test LLM's ability to generate TOON output).
+This benchmark tests **LLM comprehension and data retrieval accuracy** across different input formats. Each LLM receives formatted data and must answer questions about it (this does **not** test model's ability to generate TOON output).
 
 #### Datasets Tested
@@ -336,7 +336,7 @@ Four datasets designed to test different structural patterns:
 
 #### Question Types
 
-~160 questions are generated dynamically across three categories:
+159 questions are generated dynamically across three categories:
 
 - **Field retrieval (50%)**: Direct value lookups
   - Example: "What is Alice's salary?" → `75000`
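As a rough sketch of how a dynamic generator could split a question budget across weighted categories — only field retrieval (50%) is visible in this hunk, so the other category names and weights below are placeholders:

```ts
// Sketch: split a question budget across weighted categories.
// Only 'field-retrieval' at 50% is confirmed by the README hunk;
// the other two names/weights are hypothetical placeholders.
const categories = [
  { name: 'field-retrieval', weight: 0.5 },
  { name: 'category-b', weight: 0.25 }, // hypothetical
  { name: 'category-c', weight: 0.25 }, // hypothetical
] as const

function allocateQuestions(total: number) {
  return categories.map(c => ({
    category: c.name,
    count: Math.round(total * c.weight),
  }))
}

// allocateQuestions(160) → 80 / 40 / 40. Generating per dataset and
// rounding is one way a nominal "~160" can end up at 159 in practice.
```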
@@ -352,13 +352,9 @@ Four datasets designed to test different structural patterns:
 
 #### Evaluation Process
 
-1. **Format conversion**: Each dataset is converted to all 5 formats (TOON, JSON, YAML, CSV, XML).
+1. **Format conversion:** Each dataset is converted to all 5 formats (TOON, JSON, YAML, CSV, XML).
 2. **Query LLM**: Each model receives formatted data + question in a prompt and extracts the answer.
-3. **Validate with LLM-as-judge**: GPT-5-nano validates if the answer is semantically correct.
-
-#### Semantic Validation
-
-Answers are validated by an LLM judge (`gpt-5-nano`) using semantic equivalence, not exact string matching (e.g., `50000` = `$50,000`, `Engineering` = `engineering`, `2025-01-01` = `January 1, 2025`).
+4. **Validate with LLM-as-judge**: `gpt-5-nano` validates if the answer is semantically correct (e.g., `50000` = `$50,000`, `Engineering` = `engineering`, `2025-01-01` = `January 1, 2025`).
 
 #### Models & Configuration
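For readers skimming the diff, the LLM-as-judge step boils down to asking a cheap model whether two answers mean the same thing. A minimal sketch, assuming a hypothetical `callLLM` helper rather than any API from this repo:

```ts
// Sketch: semantic validation with an LLM judge, per the steps above.
// `callLLM` is a hypothetical helper that sends a prompt to a model
// (e.g. gpt-5-nano) and returns its text completion.
declare function callLLM(model: string, prompt: string): Promise<string>

async function isSemanticallyCorrect(
  question: string,
  expected: string,
  actual: string,
): Promise<boolean> {
  const prompt = `
Question: ${question}
Expected answer: ${expected}
Model answer: ${actual}

Are these answers semantically equivalent (e.g. "50000" vs "$50,000",
"Engineering" vs "engineering")? Reply with exactly "yes" or "no".
`.trimStart()
  const verdict = await callLLM('gpt-5-nano', prompt)
  return verdict.trim().toLowerCase().startsWith('yes')
}
```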
@@ -1,6 +1,6 @@
 ### Retrieval Accuracy
 
-Tested across **3 LLMs** with data retrieval tasks:
+Accuracy across **3 LLMs** on **159 data retrieval questions**:
 
 ```
 gpt-5-nano
@@ -124,7 +124,7 @@ Four datasets designed to test different structural patterns:
 
 #### Question Types
 
-~160 questions are generated dynamically across three categories:
+159 questions are generated dynamically across three categories:
 
 - **Field retrieval (50%)**: Direct value lookups
   - Example: "What is Alice's salary?" → `75000`
@@ -87,5 +87,5 @@
     "yaml-analytics": 2938,
     "yaml-github": 13129
   },
-  "timestamp": "2025-10-28T06:43:10.560Z"
+  "timestamp": "2025-10-28T07:39:09.360Z"
 }
@@ -177,10 +177,13 @@ ${tableRows}
 `.trimStart()
 }).join('\n')
 
+// Calculate total unique questions
+const totalQuestions = [...new Set(results.map(r => r.questionId))].length
+
 return `
 ### Retrieval Accuracy
 
-Tested across **${modelCount} ${modelCount === 1 ? 'LLM' : 'LLMs'}** with data retrieval tasks:
+Accuracy across **${modelCount} ${modelCount === 1 ? 'LLM' : 'LLMs'}** on **${totalQuestions} data retrieval questions**:
 
 \`\`\`
 ${modelBreakdown}
@@ -217,7 +220,7 @@ Four datasets designed to test different structural patterns:
 
 #### Question Types
 
-~160 questions are generated dynamically across three categories:
+${totalQuestions} questions are generated dynamically across three categories:
 
 - **Field retrieval (50%)**: Direct value lookups
   - Example: "What is Alice's salary?" → \`75000\`
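The added `totalQuestions` line is the usual Set-dedup idiom: the results array presumably holds one row per model/format/question combination, so counting unique `questionId` values recovers the underlying question count. A self-contained check with made-up data:

```ts
// Made-up sample: each question appears once per model, so the raw
// array is longer than the set of distinct questions.
const results = [
  { questionId: 'q1', model: 'gpt-5-nano' },
  { questionId: 'q1', model: 'gemini-2.5-flash' },
  { questionId: 'q2', model: 'gpt-5-nano' },
]

const totalQuestions = [...new Set(results.map(r => r.questionId))].length
console.log(totalQuestions) // → 2
```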