docs: clarify retrieval accuracy metrics

Johann Schopplich
2025-10-28 08:39:43 +01:00
parent cdd4a20c67
commit 52dc9c4b3f
4 changed files with 13 additions and 14 deletions

View File

@@ -212,7 +212,7 @@ metrics[5]{date,views,clicks,conversions,revenue,bounceRate}:
 ### Retrieval Accuracy
-Tested across **3 LLMs** with data retrieval tasks:
+Accuracy across **3 LLMs** on **159 data retrieval questions**:
 ```
 gpt-5-nano
@@ -323,7 +323,7 @@ gemini-2.5-flash
 #### What's Being Measured
-This benchmark tests **LLM comprehension and data retrieval accuracy** when data is presented in different formats. Each LLM receives formatted data and must answer questions about it (this does NOT test LLM's ability to generate TOON output).
+This benchmark tests **LLM comprehension and data retrieval accuracy** across different input formats. Each LLM receives formatted data and must answer questions about it (this does **not** test the model's ability to generate TOON output).
 #### Datasets Tested
@@ -336,7 +336,7 @@ Four datasets designed to test different structural patterns:
 #### Question Types
-~160 questions are generated dynamically across three categories:
+159 questions are generated dynamically across three categories:
 - **Field retrieval (50%)**: Direct value lookups
   - Example: "What is Alice's salary?" → `75000`
@@ -352,13 +352,9 @@ Four datasets designed to test different structural patterns:
 #### Evaluation Process
-1. **Format conversion**: Each dataset is converted to all 5 formats (TOON, JSON, YAML, CSV, XML).
+1. **Format conversion:** Each dataset is converted to all 5 formats (TOON, JSON, YAML, CSV, XML).
 2. **Query LLM**: Each model receives formatted data + question in a prompt and extracts the answer.
-3. **Validate with LLM-as-judge**: GPT-5-nano validates if the answer is semantically correct.
+4. **Validate with LLM-as-judge**: `gpt-5-nano` validates if the answer is semantically correct (e.g., `50000` = `$50,000`, `Engineering` = `engineering`, `2025-01-01` = `January 1, 2025`).
-#### Semantic Validation
-Answers are validated by an LLM judge (`gpt-5-nano`) using semantic equivalence, not exact string matching (e.g., `50000` = `$50,000`, `Engineering` = `engineering`, `2025-01-01` = `January 1, 2025`).
 #### Models & Configuration
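Note on the validation step above: the diff only names the judge model, not the judge code. A minimal TypeScript sketch of what such an LLM-as-judge check could look like; the `Judge` type, function name, and prompt wording are illustrative assumptions, not the repository's actual implementation.

```ts
// Any function that sends a prompt to a judge model and returns its reply.
type Judge = (prompt: string) => Promise<string>

async function isSemanticallyCorrect(
  judge: Judge,
  expected: string,
  answer: string,
): Promise<boolean> {
  // The judge model (e.g. gpt-5-nano) decides equivalence, so `50000`
  // vs. `$50,000` or `2025-01-01` vs. `January 1, 2025` both pass,
  // where exact string matching would fail.
  const verdict = await judge(
    `Expected answer: ${expected}\n`
    + `Model answer: ${answer}\n`
    + 'Are these semantically equivalent? Reply "yes" or "no".',
  )
  return verdict.trim().toLowerCase().startsWith('yes')
}
```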

View File

@@ -1,6 +1,6 @@
 ### Retrieval Accuracy
-Tested across **3 LLMs** with data retrieval tasks:
+Accuracy across **3 LLMs** on **159 data retrieval questions**:
 ```
 gpt-5-nano
@@ -124,7 +124,7 @@ Four datasets designed to test different structural patterns:
 #### Question Types
-~160 questions are generated dynamically across three categories:
+159 questions are generated dynamically across three categories:
 - **Field retrieval (50%)**: Direct value lookups
   - Example: "What is Alice's salary?" → `75000`
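The question-type hunks above say questions are "generated dynamically" from the datasets. A hypothetical sketch of how the field-retrieval half could be derived from dataset rows; the `EmployeeRow` shape and function name are assumptions, and only the Alice example comes from the docs.

```ts
// Assumed row shape for the employees dataset (illustrative only).
interface EmployeeRow {
  name: string
  salary: number
}

// One direct-lookup question per row, with the raw value as ground truth.
function fieldRetrievalQuestions(rows: EmployeeRow[]) {
  return rows.map(row => ({
    question: `What is ${row.name}'s salary?`,
    expected: String(row.salary),
  }))
}

// fieldRetrievalQuestions([{ name: 'Alice', salary: 75000 }])
// → [{ question: "What is Alice's salary?", expected: "75000" }]
```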

View File

@@ -87,5 +87,5 @@
     "yaml-analytics": 2938,
     "yaml-github": 13129
   },
-  "timestamp": "2025-10-28T06:43:10.560Z"
+  "timestamp": "2025-10-28T07:39:09.360Z"
 }

View File

@@ -177,10 +177,13 @@ ${tableRows}
 `.trimStart()
 }).join('\n')
+// Calculate total unique questions
+const totalQuestions = [...new Set(results.map(r => r.questionId))].length
+
 return `
 ### Retrieval Accuracy
-Tested across **${modelCount} ${modelCount === 1 ? 'LLM' : 'LLMs'}** with data retrieval tasks:
+Accuracy across **${modelCount} ${modelCount === 1 ? 'LLM' : 'LLMs'}** on **${totalQuestions} data retrieval questions**:
 \`\`\`
 ${modelBreakdown}
@@ -217,7 +220,7 @@ Four datasets designed to test different structural patterns:
 #### Question Types
-~160 questions are generated dynamically across three categories:
+${totalQuestions} questions are generated dynamically across three categories:
 - **Field retrieval (50%)**: Direct value lookups
   - Example: "What is Alice's salary?" → \`75000\`
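The `totalQuestions` line added in the script above counts unique question IDs rather than raw result rows. A small standalone sketch of why the `Set` is needed, assuming each `results` entry is one model-and-format run of a question (the real row shape likely carries more fields).

```ts
// Each question is answered once per model × format, so the same
// questionId appears in `results` many times; the Set collapses the
// duplicates down to the number of distinct questions.
const results = [
  { questionId: 'q1', model: 'gpt-5-nano', format: 'toon' },
  { questionId: 'q1', model: 'gemini-2.5-flash', format: 'toon' },
  { questionId: 'q2', model: 'gpt-5-nano', format: 'json' },
]
const totalQuestions = [...new Set(results.map(r => r.questionId))].length // 2
```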