Mirror of https://github.com/voson-wang/toon.git
docs: clarify retrieval accuracy metrics
README.md: 14 lines changed
README.md:

@@ -212,7 +212,7 @@ metrics[5]{date,views,clicks,conversions,revenue,bounceRate}:
 
 ### Retrieval Accuracy
 
-Tested across **3 LLMs** with data retrieval tasks:
+Accuracy across **3 LLMs** on **159 data retrieval questions**:
 
 ```
 gpt-5-nano
@@ -323,7 +323,7 @@ gemini-2.5-flash
 
 #### What's Being Measured
 
-This benchmark tests **LLM comprehension and data retrieval accuracy** when data is presented in different formats. Each LLM receives formatted data and must answer questions about it (this does NOT test LLM's ability to generate TOON output).
+This benchmark tests **LLM comprehension and data retrieval accuracy** across different input formats. Each LLM receives formatted data and must answer questions about it (this does **not** test model's ability to generate TOON output).
 
 #### Datasets Tested
 
@@ -336,7 +336,7 @@ Four datasets designed to test different structural patterns:
 
 #### Question Types
 
-~160 questions are generated dynamically across three categories:
+159 questions are generated dynamically across three categories:
 
 - **Field retrieval (50%)**: Direct value lookups
   - Example: "What is Alice's salary?" → `75000`
@@ -352,13 +352,9 @@ Four datasets designed to test different structural patterns:
 
 #### Evaluation Process
 
-1. **Format conversion**: Each dataset is converted to all 5 formats (TOON, JSON, YAML, CSV, XML).
+1. **Format conversion:** Each dataset is converted to all 5 formats (TOON, JSON, YAML, CSV, XML).
 2. **Query LLM**: Each model receives formatted data + question in a prompt and extracts the answer.
-3. **Validate with LLM-as-judge**: GPT-5-nano validates if the answer is semantically correct.
+4. **Validate with LLM-as-judge**: `gpt-5-nano` validates if the answer is semantically correct (e.g., `50000` = `$50,000`, `Engineering` = `engineering`, `2025-01-01` = `January 1, 2025`).
 
-#### Semantic Validation
-
-Answers are validated by an LLM judge (`gpt-5-nano`) using semantic equivalence, not exact string matching (e.g., `50000` = `$50,000`, `Engineering` = `engineering`, `2025-01-01` = `January 1, 2025`).
-
 #### Models & Configuration
 
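For context on the LLM-as-judge step described in the diff above, here is a minimal sketch of how semantic validation could be wired up with the OpenAI Node SDK. The `judgeAnswer` helper and the prompt wording are assumptions made for illustration; only the judge model name (`gpt-5-nano`) and the equivalence examples (`50000` = `$50,000`, etc.) come from the README text.

```ts
import OpenAI from 'openai'

const client = new OpenAI() // assumes OPENAI_API_KEY is set in the environment

// Hypothetical judge helper: asks gpt-5-nano whether the model's answer is
// semantically equivalent to the expected value (e.g. "50000" vs "$50,000"),
// rather than requiring an exact string match.
async function judgeAnswer(question: string, expected: string, actual: string): Promise<boolean> {
  const response = await client.chat.completions.create({
    model: 'gpt-5-nano',
    messages: [
      {
        role: 'user',
        content:
          `Question: ${question}\n` +
          `Expected answer: ${expected}\n` +
          `Model answer: ${actual}\n` +
          'Are the two answers semantically equivalent? Reply with only "yes" or "no".',
      },
    ],
  })
  // Treat anything starting with "yes" as a pass.
  return response.choices[0].message.content?.trim().toLowerCase().startsWith('yes') ?? false
}
```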
Second changed file:

@@ -1,6 +1,6 @@
 ### Retrieval Accuracy
 
-Tested across **3 LLMs** with data retrieval tasks:
+Accuracy across **3 LLMs** on **159 data retrieval questions**:
 
 ```
 gpt-5-nano
@@ -124,7 +124,7 @@ Four datasets designed to test different structural patterns:
 
 #### Question Types
 
-~160 questions are generated dynamically across three categories:
+159 questions are generated dynamically across three categories:
 
 - **Field retrieval (50%)**: Direct value lookups
   - Example: "What is Alice's salary?" → `75000`
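The "Question Types" lines above say the questions are generated dynamically. As a rough illustration of the field-retrieval category, here is a small sketch; the `Employee` type, the `fieldRetrievalQuestion` helper, and the sample row are invented for the example, with only the "What is Alice's salary?" → `75000` pattern taken from the README.

```ts
// Invented row shape for illustration; the real benchmark datasets differ.
interface Employee {
  name: string
  salary: number
  department: string
}

// Sketch: build a direct value-lookup ("field retrieval") question from a row.
function fieldRetrievalQuestion(row: Employee, field: keyof Employee) {
  return {
    question: `What is ${row.name}'s ${String(field)}?`,
    expected: String(row[field]),
  }
}

const alice: Employee = { name: 'Alice', salary: 75000, department: 'Engineering' }
console.log(fieldRetrievalQuestion(alice, 'salary'))
// → { question: "What is Alice's salary?", expected: "75000" }
```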
Third changed file (benchmark results JSON):

@@ -87,5 +87,5 @@
     "yaml-analytics": 2938,
     "yaml-github": 13129
   },
-  "timestamp": "2025-10-28T06:43:10.560Z"
+  "timestamp": "2025-10-28T07:39:09.360Z"
 }
Fourth changed file (script that generates the benchmark section):

@@ -177,10 +177,13 @@ ${tableRows}
 `.trimStart()
 }).join('\n')
 
+// Calculate total unique questions
+const totalQuestions = [...new Set(results.map(r => r.questionId))].length
+
 return `
 ### Retrieval Accuracy
 
-Tested across **${modelCount} ${modelCount === 1 ? 'LLM' : 'LLMs'}** with data retrieval tasks:
+Accuracy across **${modelCount} ${modelCount === 1 ? 'LLM' : 'LLMs'}** on **${totalQuestions} data retrieval questions**:
 
 \`\`\`
 ${modelBreakdown}
@@ -217,7 +220,7 @@ Four datasets designed to test different structural patterns:
 
 #### Question Types
 
-~160 questions are generated dynamically across three categories:
+${totalQuestions} questions are generated dynamically across three categories:
 
 - **Field retrieval (50%)**: Direct value lookups
   - Example: "What is Alice's salary?" → \`75000\`
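The lines added in the hunk above count unique question IDs with a `Set`. A self-contained illustration of that pattern (the sample `results` array is invented; the `[...new Set(...)].length` expression is taken directly from the diff):

```ts
// Invented sample: several models answering the same pool of question IDs.
const results = [
  { questionId: 'q1', model: 'gpt-5-nano', correct: true },
  { questionId: 'q1', model: 'gemini-2.5-flash', correct: true },
  { questionId: 'q2', model: 'gpt-5-nano', correct: false },
]

// Same expression the commit adds: dedupe question IDs, then count them.
const totalQuestions = [...new Set(results.map(r => r.questionId))].length
console.log(totalQuestions) // → 2
```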