Mirror of https://github.com/voson-wang/toon.git, synced 2026-01-29 15:24:10 +08:00
docs(website): highlight benchmarks
16
README.md
@@ -17,7 +17,7 @@ The similarity to CSV is intentional: CSV is simple and ubiquitous, and TOON aim
 Think of it as a translation layer: use JSON programmatically, and encode it as TOON for LLM input.
 
 > [!TIP]
-> TOON is production-ready, but also an idea in progress. Nothing's set in stone – help shape where it goes by contributing to the [spec](https://github.com/toon-format/spec) or sharing feedback.
+> The TOON format is stable, but also an idea in progress. Nothing's set in stone – help shape where it goes by contributing to the [spec](https://github.com/toon-format/spec) or sharing feedback.
 
 ## Table of Contents
 
@@ -244,7 +244,8 @@ grok-4-fast-non-reasoning
 CSV ██████████░░░░░░░░░░ 52.3% (57/109)
 ```
 
-**Key tradeoff:** TOON achieves **73.9% accuracy** (vs JSON's 69.7%) while using **39.6% fewer tokens** on these datasets.
+> [!TIP] Results Summary
+> TOON achieves **73.9% accuracy** (vs JSON's 69.7%) while using **39.6% fewer tokens** on these datasets.
 
 <details>
 <summary><strong>Performance by dataset, model, and question type</strong></summary>
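The `CSV` context line in the hunk above pairs a 20-cell bar with `52.3% (57/109)`. As a minimal sketch of how such a line could be produced, assuming the number of filled cells is the accuracy ratio scaled to the bar width and rounded to the nearest cell: the function name `render_accuracy_bar` and the rounding rule are illustrative assumptions, not the benchmark's actual code.

```python
def render_accuracy_bar(correct: int, total: int, width: int = 20) -> str:
    """Render 'bar percent (correct/total)' for one format's accuracy.

    Hypothetical helper: the rounding rule is an assumption chosen so that
    57/109 reproduces the bar shown in the diff.
    """
    ratio = correct / total
    filled = round(ratio * width)              # cells drawn as full blocks
    bar = "█" * filled + "░" * (width - filled)
    return f"{bar} {ratio * 100:.1f}% ({correct}/{total})"


line = render_accuracy_bar(57, 109)
print(line)
```

Under that assumption, 57 of 109 correct answers fills 10 of the 20 cells and formats as 52.3%, which matches the figures quoted in the diff.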
@@ -427,9 +428,6 @@ grok-4-fast-non-reasoning
 
 </details>
 
-<details>
-<summary><strong>How the benchmark works</strong></summary>
-
 #### What's Being Measured
 
 This benchmark tests **LLM comprehension and data retrieval accuracy** across different input formats. Each LLM receives formatted data and must answer questions about it. This does **not** test the model's ability to generate TOON output – only to read and understand it.
@@ -448,7 +446,7 @@ Eleven datasets designed to test different structural patterns and validation ca
 
 **Structural validation datasets:**
 7. **Control**: Valid complete dataset (baseline for validation)
-8. **Truncated**: Array with 3 rows removed from end (tests [N] length detection)
+8. **Truncated**: Array with 3 rows removed from end (tests `[N]` length detection)
 9. **Extra rows**: Array with 3 additional rows beyond declared length
 10. **Width mismatch**: Inconsistent field count (missing salary in row 10)
 11. **Missing fields**: Systematic field omissions (no email in multiple rows)
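The structural validation datasets listed above all come down to two checks: does the row count match the declared `[N]`, and does every row have as many values as `{fields}` declares? A rough Python sketch of those checks follows. It assumes a single tabular block with a header of the form `name[N]{a,b,c}:` and plain comma-separated rows with no quoting or escaping; `validate_table` is a hypothetical helper, not part of any TOON library.

```python
import re

# Simplified header pattern (assumption): name[N]{field1,field2,...}:
HEADER = re.compile(r"^(\w+)\[(\d+)\]\{([^}]*)\}:$")


def validate_table(text: str) -> list[str]:
    """Return a list of structural problems; an empty list means none found."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if not lines:
        return ["empty input"]
    match = HEADER.match(lines[0].strip())
    if match is None:
        return ["unrecognized header"]
    declared = int(match.group(2))
    fields = match.group(3).split(",")
    # Naive split: this sketch ignores quoted values that contain commas.
    rows = [ln.strip().split(",") for ln in lines[1:]]

    problems = []
    if len(rows) != declared:
        problems.append(f"declared {declared} rows, found {len(rows)}")
    for i, row in enumerate(rows, start=1):
        if len(row) != len(fields):
            problems.append(
                f"row {i}: expected {len(fields)} fields, found {len(row)}"
            )
    return problems


valid = "employees[2]{id,name,salary}:\n  1,Alice,70000\n  2,Bob,65000"
truncated = "employees[3]{id,name,salary}:\n  1,Alice,70000\n  2,Bob,65000"
narrow = "employees[2]{id,name,salary}:\n  1,Alice,70000\n  2,Bob"

print(validate_table(valid))
print(validate_table(truncated))
print(validate_table(narrow))
```

The same length check catches both the truncated and the extra-rows cases, while the per-row width check covers the width-mismatch case. Plain CSV has no declared length to compare against, which is the contrast the benchmark draws.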
@@ -471,14 +469,14 @@ Eleven datasets designed to test different structural patterns and validation ca
 - Example: "How many employees in Sales have salary > 80000?" → `5`
 - Example: "How many active employees have more than 10 years of experience?" → `8`
 
-- **Structure awareness (12%)**: Tests format-native structural affordances (TOON's [N] count and {fields}, CSV's header row)
+- **Structure awareness (12%)**: Tests format-native structural affordances (TOON's `[N]` count and `{fields}`, CSV's header row)
 - Example: "How many employees are in the dataset?" → `100`
 - Example: "List the field names for employees" → `id, name, email, department, salary, yearsExperience, active`
 - Example: "What is the department of the last employee?" → `Sales`
 
 - **Structural validation (2%)**: Tests ability to detect incomplete, truncated, or corrupted data using structural metadata
 - Example: "Is this data complete and valid?" → `YES` (control dataset) or `NO` (corrupted datasets)
-- Tests TOON's [N] length validation and {fields} consistency checking
+- Tests TOON's `[N]` length validation and `{fields}` consistency checking
 - Demonstrates CSV's lack of structural validation capabilities
 
 #### Evaluation Process
@@ -494,8 +492,6 @@ Eleven datasets designed to test different structural patterns and validation ca
 - **Temperature**: Not set (models use their defaults)
 - **Total evaluations**: 209 questions × 6 formats × 4 models = 5,016 LLM calls
 
-</details>
-
 <!-- /automd -->
 
 ### Token Efficiency
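The "Total evaluations" figure in the hunk above is a plain product of the three counts it names. A short check, using only the numbers quoted in the diff (the variable names are illustrative):

```python
# Counts quoted in the README line "209 questions × 6 formats × 4 models".
questions = 209
formats = 6
models = 4

# Every question is asked once per format and per model.
total_calls = questions * formats * models
print(f"{total_calls:,} LLM calls")
```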