docs: overhaul retrieval accuracy benchmark

Johann Schopplich
2025-10-28 20:22:51 +01:00
parent 67c0df8cb0
commit e400e68ad6


@@ -215,13 +215,6 @@ metrics[5]{date,views,clicks,conversions,revenue,bounceRate}:
Accuracy across **3 LLMs** on **154 data retrieval questions**:
```
-gemini-2.5-flash
-xml ██████████████████░░ 90.3% (139/154)
-csv ██████████████████░░ 89.0% (137/154)
-toon █████████████████░░░ 87.0% (134/154)
-json ████████████████░░░░ 79.2% (122/154)
-yaml ███████████████░░░░░ 76.0% (117/154)
gpt-5-nano
toon ███████████████████░ 96.1% (148/154)
csv ██████████████████░░ 90.3% (139/154)
@@ -229,6 +222,13 @@ gpt-5-nano
json ██████████████████░░ 87.7% (135/154)
xml █████████████████░░░ 83.8% (129/154)
+gemini-2.5-flash
+xml ██████████████████░░ 90.3% (139/154)
+csv ██████████████████░░ 89.0% (137/154)
+toon █████████████████░░░ 87.0% (134/154)
+json ████████████████░░░░ 79.2% (122/154)
+yaml ███████████████░░░░░ 76.0% (117/154)
claude-haiku-4-5-20251001
json ██████████░░░░░░░░░░ 48.7% (75/154)
toon ██████████░░░░░░░░░░ 48.1% (74/154)
@@ -286,16 +286,6 @@ claude-haiku-4-5-20251001
#### Performance by Model
-##### gemini-2.5-flash
-| Format | Accuracy | Correct/Total |
-| ------ | -------- | ------------- |
-| `xml` | 90.3% | 139/154 |
-| `csv` | 89.0% | 137/154 |
-| `toon` | 87.0% | 134/154 |
-| `json` | 79.2% | 122/154 |
-| `yaml` | 76.0% | 117/154 |
##### gpt-5-nano
| Format | Accuracy | Correct/Total |
@@ -306,6 +296,16 @@ claude-haiku-4-5-20251001
| `json` | 87.7% | 135/154 |
| `xml` | 83.8% | 129/154 |
+##### gemini-2.5-flash
+| Format | Accuracy | Correct/Total |
+| ------ | -------- | ------------- |
+| `xml` | 90.3% | 139/154 |
+| `csv` | 89.0% | 137/154 |
+| `toon` | 87.0% | 134/154 |
+| `json` | 79.2% | 122/154 |
+| `yaml` | 76.0% | 117/154 |
##### claude-haiku-4-5-20251001
| Format | Accuracy | Correct/Total |
@@ -360,7 +360,7 @@ Four datasets designed to test different structural patterns:
#### Models & Configuration
-- **Models tested**: `gemini-2.5-flash`, `gpt-5-nano`, `claude-haiku-4-5-20251001`
+- **Models tested**: `claude-haiku-4-5-20251001`, `gemini-2.5-flash`, `gpt-5-nano`
- **Token counting**: Using `gpt-tokenizer` with `o200k_base` encoding (GPT-5 tokenizer)
- **Temperature**: 0 (for non-reasoning models)
- **Total evaluations**: 154 questions × 5 formats × 3 models = 2,310 LLM calls
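For reference, the token-counting step listed above can be reproduced in a few lines. This is a minimal sketch, not the benchmark harness itself: the `gpt-tokenizer` import with the `o200k_base` encoding matches the configuration stated in Models & Configuration, but the sample payloads are made-up placeholders rather than the actual datasets.

```ts
// Sketch: count tokens per serialization format with gpt-tokenizer (o200k_base).
// The payloads below are illustrative only.
import { encode } from 'gpt-tokenizer/encoding/o200k_base'

const payloads: Record<string, string> = {
  json: JSON.stringify([{ date: '2025-01-01', views: 1200, clicks: 80 }]),
  csv: 'date,views,clicks\n2025-01-01,1200,80',
}

for (const [format, text] of Object.entries(payloads)) {
  // encode() returns the token ids; the array length is the token count
  console.log(`${format}: ${encode(text).length} tokens`)
}
```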