docs(website): highlight benchmarks

Johann Schopplich
2025-11-18 10:14:07 +01:00
parent 9bebbb4070
commit 0ac629a085
9 changed files with 33 additions and 46 deletions


@@ -101,7 +101,8 @@ grok-4-fast-non-reasoning
CSV ██████████░░░░░░░░░░ 52.3% (57/109)
```
**Key tradeoff:** TOON achieves **73.9% accuracy** (vs JSON's 69.7%) while using **39.6% fewer tokens** on these datasets.
> [!TIP] Results Summary
> TOON achieves **73.9% accuracy** (vs JSON's 69.7%) while using **39.6% fewer tokens** on these datasets.
<details>
<summary><strong>Performance by dataset, model, and question type</strong></summary>
@@ -284,9 +285,6 @@ grok-4-fast-non-reasoning
</details>
<details>
<summary><strong>How the benchmark works</strong></summary>
#### What's Being Measured
This benchmark tests **LLM comprehension and data retrieval accuracy** across different input formats. Each LLM receives formatted data and must answer questions about it. This does **not** test the model's ability to generate TOON output, only to read and understand it.
@@ -305,7 +303,7 @@ Eleven datasets designed to test different structural patterns and validation ca
**Structural validation datasets:**
7. **Control**: Valid complete dataset (baseline for validation)
8. **Truncated**: Array with 3 rows removed from end (tests [N] length detection)
8. **Truncated**: Array with 3 rows removed from end (tests `[N]` length detection)
9. **Extra rows**: Array with 3 additional rows beyond declared length
10. **Width mismatch**: Inconsistent field count (missing salary in row 10)
11. **Missing fields**: Systematic field omissions (no email in multiple rows)
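The structural checks these datasets target can be sketched in a few lines. This is a minimal illustration, not the benchmark's code or the TOON library's parser. It assumes a single tabular block with a header of the form `name[N]{field1,field2,...}:` followed by one comma-separated row per line, and it splits rows naively, so quoted values containing commas are not handled. The function name `check_tabular_block` and the sample rows are hypothetical.

```python
import re

# Assumed header shape: name[N]{field1,field2,...}:
HEADER = re.compile(r"^(?P<name>\w+)\[(?P<count>\d+)\]\{(?P<fields>[^}]*)\}:$")


def check_tabular_block(text: str) -> list[str]:
    """Return the structural problems found in one tabular block (empty list = valid)."""
    lines = [line for line in text.splitlines() if line.strip()]
    if not lines:
        return ["empty input"]
    match = HEADER.match(lines[0].strip())
    if match is None:
        return ["unrecognized header"]

    declared = int(match.group("count"))
    fields = match.group("fields").split(",")
    rows = [line.strip() for line in lines[1:]]

    problems = []
    # Truncated and extra-row cases both show up as a count mismatch.
    if len(rows) != declared:
        problems.append(f"declared {declared} rows, found {len(rows)}")
    for index, row in enumerate(rows, start=1):
        # Naive split: quoted values containing commas are not handled.
        width = len(row.split(","))
        if width != len(fields):
            problems.append(f"row {index} has {width} values, expected {len(fields)}")
    return problems


# Hypothetical sample data, not taken from the benchmark datasets.
control = "employees[2]{id,name,salary}:\n  1,Alice,90000\n  2,Bob,85000"
truncated = "employees[3]{id,name,salary}:\n  1,Alice,90000\n  2,Bob,85000"
width_mismatch = "employees[2]{id,name,salary}:\n  1,Alice,90000\n  2,Bob"

print(check_tabular_block(control))         # []
print(check_tabular_block(truncated))       # ['declared 3 rows, found 2']
print(check_tabular_block(width_mismatch))  # ['row 2 has 2 values, expected 3']
```

A plain CSV file carries no declared row count, so the truncated and extra-row cases have nothing to be compared against there.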
@@ -328,14 +326,14 @@ Eleven datasets designed to test different structural patterns and validation ca
- Example: "How many employees in Sales have salary > 80000?" → `5`
- Example: "How many active employees have more than 10 years of experience?" → `8`
- **Structure awareness (12%)**: Tests format-native structural affordances (TOON's [N] count and {fields}, CSV's header row)
- **Structure awareness (12%)**: Tests format-native structural affordances (TOON's `[N]` count and `{fields}`, CSV's header row)
- Example: "How many employees are in the dataset?" → `100`
- Example: "List the field names for employees" → `id, name, email, department, salary, yearsExperience, active`
- Example: "What is the department of the last employee?" → `Sales`
- **Structural validation (2%)**: Tests ability to detect incomplete, truncated, or corrupted data using structural metadata
- Example: "Is this data complete and valid?" → `YES` (control dataset) or `NO` (corrupted datasets)
- Tests TOON's [N] length validation and {fields} consistency checking
- Tests TOON's `[N]` length validation and `{fields}` consistency checking
- Demonstrates CSV's lack of structural validation capabilities
#### Evaluation Process
@@ -351,13 +349,8 @@ Eleven datasets designed to test different structural patterns and validation ca
- **Temperature**: Not set (models use their defaults)
- **Total evaluations**: 209 questions × 6 formats × 4 models = 5,016 LLM calls
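The total above can be reproduced with a quick check. The variable names are illustrative only:

```python
# Figures taken from the "Total evaluations" line above.
questions, formats, models = 209, 6, 4

total_calls = questions * formats * models
print(f"{total_calls:,}")  # 5,016
```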
</details>
<!-- /automd -->
> [!NOTE]
> **Key takeaway:** TOON achieves **73.9% accuracy** (vs JSON's 69.7%) while using **39.6% fewer tokens** on these datasets. The explicit structure (array lengths `[N]` and field lists `{fields}`) helps models track and validate data more reliably.
## Token Efficiency
Token counts are measured using the GPT-5 `o200k_base` tokenizer via [`gpt-tokenizer`](https://github.com/niieani/gpt-tokenizer). Savings are calculated against formatted JSON (2-space indentation) as the primary baseline, with additional comparisons to compact JSON (minified), YAML, and XML. Actual savings vary by model and tokenizer.
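As a rough sketch of how a savings figure relative to a baseline can be computed, assuming savings means the relative reduction in token count: the helper name `savings_percent` is hypothetical, and the token counts in the usage line are placeholder numbers, not benchmark measurements.

```python
def savings_percent(baseline_tokens: int, candidate_tokens: int) -> float:
    """Relative token reduction of a candidate format versus the baseline, in percent."""
    if baseline_tokens <= 0:
        raise ValueError("baseline_tokens must be positive")
    return (baseline_tokens - candidate_tokens) / baseline_tokens * 100


# Placeholder counts: a 1000-token baseline encoding versus a 750-token candidate encoding.
print(savings_percent(1000, 750))  # 25.0
```

A negative result means the candidate format used more tokens than the baseline.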