docs: overhaul retrieval accuracy benchmark

Johann Schopplich
2025-10-28 20:22:51 +01:00
parent 67c0df8cb0
commit e400e68ad6


@@ -215,13 +215,6 @@ metrics[5]{date,views,clicks,conversions,revenue,bounceRate}:
Accuracy across **3 LLMs** on **154 data retrieval questions**:
```
-gemini-2.5-flash
-xml ██████████████████░░ 90.3% (139/154)
-csv ██████████████████░░ 89.0% (137/154)
-toon █████████████████░░░ 87.0% (134/154)
-json ████████████████░░░░ 79.2% (122/154)
-yaml ███████████████░░░░░ 76.0% (117/154)
gpt-5-nano
toon ███████████████████░ 96.1% (148/154)
csv ██████████████████░░ 90.3% (139/154)
@@ -229,6 +222,13 @@ gpt-5-nano
json ██████████████████░░ 87.7% (135/154)
xml █████████████████░░░ 83.8% (129/154)
+gemini-2.5-flash
+xml ██████████████████░░ 90.3% (139/154)
+csv ██████████████████░░ 89.0% (137/154)
+toon █████████████████░░░ 87.0% (134/154)
+json ████████████████░░░░ 79.2% (122/154)
+yaml ███████████████░░░░░ 76.0% (117/154)
claude-haiku-4-5-20251001
json ██████████░░░░░░░░░░ 48.7% (75/154)
toon ██████████░░░░░░░░░░ 48.1% (74/154)
@@ -286,16 +286,6 @@ claude-haiku-4-5-20251001
#### Performance by Model
-##### gemini-2.5-flash
-| Format | Accuracy | Correct/Total |
-| ------ | -------- | ------------- |
-| `xml` | 90.3% | 139/154 |
-| `csv` | 89.0% | 137/154 |
-| `toon` | 87.0% | 134/154 |
-| `json` | 79.2% | 122/154 |
-| `yaml` | 76.0% | 117/154 |
##### gpt-5-nano
| Format | Accuracy | Correct/Total |
@@ -306,6 +296,16 @@ claude-haiku-4-5-20251001
| `json` | 87.7% | 135/154 |
| `xml` | 83.8% | 129/154 |
+##### gemini-2.5-flash
+| Format | Accuracy | Correct/Total |
+| ------ | -------- | ------------- |
+| `xml` | 90.3% | 139/154 |
+| `csv` | 89.0% | 137/154 |
+| `toon` | 87.0% | 134/154 |
+| `json` | 79.2% | 122/154 |
+| `yaml` | 76.0% | 117/154 |
##### claude-haiku-4-5-20251001
| Format | Accuracy | Correct/Total |
@@ -360,7 +360,7 @@ Four datasets designed to test different structural patterns:
#### Models & Configuration
-- **Models tested**: `gemini-2.5-flash`, `gpt-5-nano`, `claude-haiku-4-5-20251001`
+- **Models tested**: `claude-haiku-4-5-20251001`, `gemini-2.5-flash`, `gpt-5-nano`
- **Token counting**: Using `gpt-tokenizer` with `o200k_base` encoding (GPT-5 tokenizer)
- **Temperature**: 0 (for non-reasoning models)
- **Total evaluations**: 154 questions × 5 formats × 3 models = 2,310 LLM calls
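For reference, the token-counting step listed above can be reproduced in a few lines. This is a minimal sketch, not the benchmark harness itself: the `gpt-tokenizer` import with the `o200k_base` encoding matches the configuration stated in Models & Configuration, but the sample payloads are made-up placeholders rather than the actual datasets.

```ts
// Sketch: count tokens per serialization format with gpt-tokenizer (o200k_base).
// The payloads below are illustrative only.
import { encode } from 'gpt-tokenizer/encoding/o200k_base'

const payloads: Record<string, string> = {
  json: JSON.stringify([{ date: '2025-01-01', views: 1200, clicks: 80 }]),
  csv: 'date,views,clicks\n2025-01-01,1200,80',
}

for (const [format, text] of Object.entries(payloads)) {
  // encode() returns the token ids; the array length is the token count
  console.log(`${format}: ${encode(text).length} tokens`)
}
```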