mirror of
https://github.com/voson-wang/toon.git
synced 2026-01-29 23:34:10 +08:00
docs: overhaul retrieval accuracy benchmark
36
README.md
@@ -215,13 +215,6 @@ metrics[5]{date,views,clicks,conversions,revenue,bounceRate}:
 Accuracy across **3 LLMs** on **154 data retrieval questions**:
 
 ```
-gemini-2.5-flash
-xml ██████████████████░░ 90.3% (139/154)
-csv ██████████████████░░ 89.0% (137/154)
-toon █████████████████░░░ 87.0% (134/154)
-json ████████████████░░░░ 79.2% (122/154)
-yaml ███████████████░░░░░ 76.0% (117/154)
-
 gpt-5-nano
 toon ███████████████████░ 96.1% (148/154)
 csv ██████████████████░░ 90.3% (139/154)
@@ -229,6 +222,13 @@ gpt-5-nano
 json ██████████████████░░ 87.7% (135/154)
 xml █████████████████░░░ 83.8% (129/154)
 
+gemini-2.5-flash
+xml ██████████████████░░ 90.3% (139/154)
+csv ██████████████████░░ 89.0% (137/154)
+toon █████████████████░░░ 87.0% (134/154)
+json ████████████████░░░░ 79.2% (122/154)
+yaml ███████████████░░░░░ 76.0% (117/154)
+
 claude-haiku-4-5-20251001
 json ██████████░░░░░░░░░░ 48.7% (75/154)
 toon ██████████░░░░░░░░░░ 48.1% (74/154)
@@ -286,16 +286,6 @@ claude-haiku-4-5-20251001
 
 #### Performance by Model
 
-##### gemini-2.5-flash
-
-| Format | Accuracy | Correct/Total |
-| ------ | -------- | ------------- |
-| `xml` | 90.3% | 139/154 |
-| `csv` | 89.0% | 137/154 |
-| `toon` | 87.0% | 134/154 |
-| `json` | 79.2% | 122/154 |
-| `yaml` | 76.0% | 117/154 |
-
 ##### gpt-5-nano
 
 | Format | Accuracy | Correct/Total |
@@ -306,6 +296,16 @@ claude-haiku-4-5-20251001
 | `json` | 87.7% | 135/154 |
 | `xml` | 83.8% | 129/154 |
 
+##### gemini-2.5-flash
+
+| Format | Accuracy | Correct/Total |
+| ------ | -------- | ------------- |
+| `xml` | 90.3% | 139/154 |
+| `csv` | 89.0% | 137/154 |
+| `toon` | 87.0% | 134/154 |
+| `json` | 79.2% | 122/154 |
+| `yaml` | 76.0% | 117/154 |
+
 ##### claude-haiku-4-5-20251001
 
 | Format | Accuracy | Correct/Total |
@@ -360,7 +360,7 @@ Four datasets designed to test different structural patterns:
 
 #### Models & Configuration
 
-- **Models tested**: `gemini-2.5-flash`, `gpt-5-nano`, `claude-haiku-4-5-20251001`
+- **Models tested**: `claude-haiku-4-5-20251001`, `gemini-2.5-flash`, `gpt-5-nano`
 - **Token counting**: Using `gpt-tokenizer` with `o200k_base` encoding (GPT-5 tokenizer)
 - **Temperature**: 0 (for non-reasoning models)
 - **Total evaluations**: 154 questions × 5 formats × 3 models = 2,310 LLM calls
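
The figures that this diff moves around can be cross-checked with a few lines of arithmetic. The sketch below is not part of the repository's benchmark code: `render_row` and `total_calls` are hypothetical helper names, and the 20-cell bar width, the round-to-nearest fill rule, and the single-space column layout are assumptions inferred from the chart rows shown in the hunks. Only the counts (139/154, 154 × 5 × 3) come from the README.

```python
# Hypothetical helpers for sanity-checking the README numbers.
# Assumption: each bar is 20 cells wide and filled by rounding to the nearest cell.
BAR_WIDTH = 20


def render_row(fmt: str, correct: int, total: int, width: int = BAR_WIDTH) -> str:
    """Rebuild one chart row from a correct/total pair."""
    ratio = correct / total
    filled = round(ratio * width)  # number of filled cells
    bar = "█" * filled + "░" * (width - filled)
    return f"{fmt} {bar} {ratio * 100:.1f}% ({correct}/{total})"


def total_calls(questions: int, formats: int, models: int) -> int:
    """Every question is asked once per format and per model."""
    return questions * formats * models


if __name__ == "__main__":
    print(render_row("xml", 139, 154))
    print(total_calls(154, 5, 3))
```

Under these assumptions the percentages and bar lengths in the chart rows follow directly from the `Correct/Total` column, so reordering the model blocks (as this commit does) should not change any number.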