docs(website): highlight benchmarks

2026-01-29 23:34:10 +08:00 · 2025-11-18 10:14:07 +01:00
parent 9bebbb4070
commit 0ac629a085
9 changed files with 33 additions and 46 deletions
--- a/README.md
+++ b/README.md
@@ -17,7 +17,7 @@ The similarity to CSV is intentional: CSV is simple and ubiquitous, and TOON aim
 Think of it as a translation layer: use JSON programmatically, and encode it as TOON for LLM input.
 > [!TIP]
-> TOON is production-ready, but also an idea in progress. Nothing's set in stone – help shape where it goes by contributing to the [spec](https://github.com/toon-format/spec) or sharing feedback.
+> The TOON format is stable, but also an idea in progress. Nothing's set in stone – help shape where it goes by contributing to the [spec](https://github.com/toon-format/spec) or sharing feedback.
 ## Table of Contents
@@ -244,7 +244,8 @@ grok-4-fast-non-reasoning
  CSV            ██████████░░░░░░░░░░    52.3% (57/109)
 ```
-**Key tradeoff:** TOON achieves **73.9% accuracy** (vs JSON's 69.7%) while using **39.6% fewer tokens** on these datasets.
+> [!TIP] Results Summary
+> TOON achieves **73.9% accuracy** (vs JSON's 69.7%) while using **39.6% fewer tokens** on these datasets.
 <details>
 <summary><strong>Performance by dataset, model, and question type</strong></summary>
@@ -427,9 +428,6 @@ grok-4-fast-non-reasoning
 </details>
 <details>
 <summary><strong>How the benchmark works</strong></summary>
 #### What's Being Measured
 This benchmark tests **LLM comprehension and data retrieval accuracy** across different input formats. Each LLM receives formatted data and must answer questions about it. This does **not** test the model's ability to generate TOON output – only to read and understand it.
@@ -448,7 +446,7 @@ Eleven datasets designed to test different structural patterns and validation ca
 **Structural validation datasets:**
 7. **Control**: Valid complete dataset (baseline for validation)
-8. **Truncated**: Array with 3 rows removed from end (tests [N] length detection)
+8. **Truncated**: Array with 3 rows removed from end (tests `[N]` length detection)
 9. **Extra rows**: Array with 3 additional rows beyond declared length
 10. **Width mismatch**: Inconsistent field count (missing salary in row 10)
 11. **Missing fields**: Systematic field omissions (no email in multiple rows)
@@ -471,14 +469,14 @@ Eleven datasets designed to test different structural patterns and validation ca
  - Example: "How many employees in Sales have salary > 80000?" → `5`
  - Example: "How many active employees have more than 10 years of experience?" → `8`
-- **Structure awareness (12%)**: Tests format-native structural affordances (TOON's [N] count and {fields}, CSV's header row)
+- **Structure awareness (12%)**: Tests format-native structural affordances (TOON's `[N]` count and `{fields}`, CSV's header row)
  - Example: "How many employees are in the dataset?" → `100`
  - Example: "List the field names for employees" → `id, name, email, department, salary, yearsExperience, active`
  - Example: "What is the department of the last employee?" → `Sales`
 - **Structural validation (2%)**: Tests ability to detect incomplete, truncated, or corrupted data using structural metadata
  - Example: "Is this data complete and valid?" → `YES` (control dataset) or `NO` (corrupted datasets)
-  - Tests TOON's [N] length validation and {fields} consistency checking
+  - Tests TOON's `[N]` length validation and `{fields}` consistency checking
  - Demonstrates CSV's lack of structural validation capabilities
 #### Evaluation Process
@@ -494,8 +492,6 @@ Eleven datasets designed to test different structural patterns and validation ca
 - **Temperature**: Not set (models use their defaults)
 - **Total evaluations**: 209 questions × 6 formats × 4 models = 5,016 LLM calls
 </details>
 <!-- /automd -->
 ### Token Efficiency
--- a/benchmarks/results/retrieval-accuracy.md
+++ b/benchmarks/results/retrieval-accuracy.md
@@ -85,7 +85,8 @@ grok-4-fast-non-reasoning
  CSV            ██████████░░░░░░░░░░    52.3% (57/109)
 ```
-**Key tradeoff:** TOON achieves **73.9% accuracy** (vs JSON's 69.7%) while using **39.6% fewer tokens** on these datasets.
+> [!TIP] Results Summary
+> TOON achieves **73.9% accuracy** (vs JSON's 69.7%) while using **39.6% fewer tokens** on these datasets.
 <details>
 <summary><strong>Performance by dataset, model, and question type</strong></summary>
@@ -268,9 +269,6 @@ grok-4-fast-non-reasoning
 </details>
 <details>
 <summary><strong>How the benchmark works</strong></summary>
 #### What's Being Measured
 This benchmark tests **LLM comprehension and data retrieval accuracy** across different input formats. Each LLM receives formatted data and must answer questions about it. This does **not** test the model's ability to generate TOON output – only to read and understand it.
@@ -289,7 +287,7 @@ Eleven datasets designed to test different structural patterns and validation ca
 **Structural validation datasets:**
 7. **Control**: Valid complete dataset (baseline for validation)
-8. **Truncated**: Array with 3 rows removed from end (tests [N] length detection)
+8. **Truncated**: Array with 3 rows removed from end (tests `[N]` length detection)
 9. **Extra rows**: Array with 3 additional rows beyond declared length
 10. **Width mismatch**: Inconsistent field count (missing salary in row 10)
 11. **Missing fields**: Systematic field omissions (no email in multiple rows)
@@ -312,14 +310,14 @@ Eleven datasets designed to test different structural patterns and validation ca
  - Example: "How many employees in Sales have salary > 80000?" → `5`
  - Example: "How many active employees have more than 10 years of experience?" → `8`
-- **Structure awareness (12%)**: Tests format-native structural affordances (TOON's [N] count and {fields}, CSV's header row)
+- **Structure awareness (12%)**: Tests format-native structural affordances (TOON's `[N]` count and `{fields}`, CSV's header row)
  - Example: "How many employees are in the dataset?" → `100`
  - Example: "List the field names for employees" → `id, name, email, department, salary, yearsExperience, active`
  - Example: "What is the department of the last employee?" → `Sales`
 - **Structural validation (2%)**: Tests ability to detect incomplete, truncated, or corrupted data using structural metadata
  - Example: "Is this data complete and valid?" → `YES` (control dataset) or `NO` (corrupted datasets)
-  - Tests TOON's [N] length validation and {fields} consistency checking
+  - Tests TOON's `[N]` length validation and `{fields}` consistency checking
  - Demonstrates CSV's lack of structural validation capabilities
 #### Evaluation Process
@@ -334,5 +332,3 @@ Eleven datasets designed to test different structural patterns and validation ca
 - **Token counting**: Using `gpt-tokenizer` with `o200k_base` encoding (GPT-5 tokenizer)
 - **Temperature**: Not set (models use their defaults)
 - **Total evaluations**: 209 questions × 6 formats × 4 models = 5,016 LLM calls
 </details>
--- a/benchmarks/src/report.ts
+++ b/benchmarks/src/report.ts
@@ -275,9 +275,6 @@ ${modelPerformance}
 </details>
 <details>
 <summary><strong>How the benchmark works</strong></summary>
 #### What's Being Measured
 This benchmark tests **LLM comprehension and data retrieval accuracy** across different input formats. Each LLM receives formatted data and must answer questions about it. This does **not** test the model's ability to generate TOON output – only to read and understand it.
@@ -296,7 +293,7 @@ Eleven datasets designed to test different structural patterns and validation ca
 **Structural validation datasets:**
 7. **Control**: Valid complete dataset (baseline for validation)
-8. **Truncated**: Array with 3 rows removed from end (tests [N] length detection)
+8. **Truncated**: Array with 3 rows removed from end (tests \`[N]\` length detection)
 9. **Extra rows**: Array with 3 additional rows beyond declared length
 10. **Width mismatch**: Inconsistent field count (missing salary in row 10)
 11. **Missing fields**: Systematic field omissions (no email in multiple rows)
@@ -319,14 +316,14 @@ ${totalQuestions} questions are generated dynamically across five categories:
  - Example: "How many employees in Sales have salary > 80000?" → \`5\`
  - Example: "How many active employees have more than 10 years of experience?" → \`8\`
-- **Structure awareness (${structureAwarenessPercent}%)**: Tests format-native structural affordances (TOON's [N] count and {fields}, CSV's header row)
+- **Structure awareness (${structureAwarenessPercent}%)**: Tests format-native structural affordances (TOON's \`[N]\` count and \`{fields}\`, CSV's header row)
  - Example: "How many employees are in the dataset?" → \`100\`
  - Example: "List the field names for employees" → \`id, name, email, department, salary, yearsExperience, active\`
  - Example: "What is the department of the last employee?" → \`Sales\`
 - **Structural validation (${structuralValidationPercent}%)**: Tests ability to detect incomplete, truncated, or corrupted data using structural metadata
  - Example: "Is this data complete and valid?" → \`YES\` (control dataset) or \`NO\` (corrupted datasets)
-  - Tests TOON's [N] length validation and {fields} consistency checking
+  - Tests TOON's \`[N]\` length validation and \`{fields}\` consistency checking
  - Demonstrates CSV's lack of structural validation capabilities
 #### Evaluation Process
@@ -341,8 +338,6 @@ ${totalQuestions} questions are generated dynamically across five categories:
 - **Token counting**: Using \`gpt-tokenizer\` with \`o200k_base\` encoding (GPT-5 tokenizer)
 - **Temperature**: Not set (models use their defaults)
 - **Total evaluations**: ${totalQuestions} questions × ${formatCount} formats × ${modelNames.length} models = ${totalEvaluations.toLocaleString('en-US')} LLM calls
 </details>
 `.trim()
 }
@@ -398,7 +393,10 @@ function generateSummaryComparison(
  if (!toon || !json)
    return ''
-  return `**Key tradeoff:** TOON achieves **${(toon.accuracy * 100).toFixed(1)}% accuracy** (vs JSON's ${(json.accuracy * 100).toFixed(1)}%) while using **${((1 - toon.totalTokens / json.totalTokens) * 100).toFixed(1)}% fewer tokens** on these datasets.`
+  return `
+> [!TIP] Results Summary
+> TOON achieves **${(toon.accuracy * 100).toFixed(1)}% accuracy** (vs JSON's ${(json.accuracy * 100).toFixed(1)}%) while using **${((1 - toon.totalTokens / json.totalTokens) * 100).toFixed(1)}% fewer tokens** on these datasets.
+`.trim()
 }
 /**
--- a/docs/.vitepress/theme/overrides.css
+++ b/docs/.vitepress/theme/overrides.css
@@ -10,6 +10,10 @@ details summary {
  cursor: pointer;
 }
+.vp-doc [class*="language-"] code {
+  color: var(--vp-c-text-1)
+}
 .VPHomeHero .image-src {
  max-width: 180px !important;
  max-height: 180px !important;
--- a/docs/ecosystem/implementations.md
+++ b/docs/ecosystem/implementations.md
@@ -5,7 +5,7 @@ TOON has official and community implementations across multiple programming lang
 The code examples throughout this documentation site use the TypeScript implementation by default, but the format and concepts apply equally to all languages.
 > [!NOTE]
-> When implementing TOON in other languages, please follow the [specification](https://github.com/toon-format/spec/blob/main/SPEC.md) to ensure compatibility across implementations. The [conformance tests](https://github.com/toon-format/spec/tree/main/tests) provide language-agnostic test fixtures that validate your implementation.
+> When implementing TOON in other languages, please follow the [spec](https://github.com/toon-format/spec/blob/main/SPEC.md) to ensure compatibility across implementations. The [conformance tests](https://github.com/toon-format/spec/tree/main/tests) provide language-agnostic test fixtures that validate your implementation.
 ## Official Implementations
--- a/docs/guide/benchmarks.md
+++ b/docs/guide/benchmarks.md
@@ -101,7 +101,8 @@ grok-4-fast-non-reasoning
  CSV            ██████████░░░░░░░░░░    52.3% (57/109)
 ```
-**Key tradeoff:** TOON achieves **73.9% accuracy** (vs JSON's 69.7%) while using **39.6% fewer tokens** on these datasets.
+> [!TIP] Results Summary
+> TOON achieves **73.9% accuracy** (vs JSON's 69.7%) while using **39.6% fewer tokens** on these datasets.
 <details>
 <summary><strong>Performance by dataset, model, and question type</strong></summary>
@@ -284,9 +285,6 @@ grok-4-fast-non-reasoning
 </details>
 <details>
 <summary><strong>How the benchmark works</strong></summary>
 #### What's Being Measured
 This benchmark tests **LLM comprehension and data retrieval accuracy** across different input formats. Each LLM receives formatted data and must answer questions about it. This does **not** test the model's ability to generate TOON output – only to read and understand it.
@@ -305,7 +303,7 @@ Eleven datasets designed to test different structural patterns and validation ca
 **Structural validation datasets:**
 7. **Control**: Valid complete dataset (baseline for validation)
-8. **Truncated**: Array with 3 rows removed from end (tests [N] length detection)
+8. **Truncated**: Array with 3 rows removed from end (tests `[N]` length detection)
 9. **Extra rows**: Array with 3 additional rows beyond declared length
 10. **Width mismatch**: Inconsistent field count (missing salary in row 10)
 11. **Missing fields**: Systematic field omissions (no email in multiple rows)
@@ -328,14 +326,14 @@ Eleven datasets designed to test different structural patterns and validation ca
  - Example: "How many employees in Sales have salary > 80000?" → `5`
  - Example: "How many active employees have more than 10 years of experience?" → `8`
-- **Structure awareness (12%)**: Tests format-native structural affordances (TOON's [N] count and {fields}, CSV's header row)
+- **Structure awareness (12%)**: Tests format-native structural affordances (TOON's `[N]` count and `{fields}`, CSV's header row)
  - Example: "How many employees are in the dataset?" → `100`
  - Example: "List the field names for employees" → `id, name, email, department, salary, yearsExperience, active`
  - Example: "What is the department of the last employee?" → `Sales`
 - **Structural validation (2%)**: Tests ability to detect incomplete, truncated, or corrupted data using structural metadata
  - Example: "Is this data complete and valid?" → `YES` (control dataset) or `NO` (corrupted datasets)
-  - Tests TOON's [N] length validation and {fields} consistency checking
+  - Tests TOON's `[N]` length validation and `{fields}` consistency checking
  - Demonstrates CSV's lack of structural validation capabilities
 #### Evaluation Process
@@ -351,13 +349,8 @@ Eleven datasets designed to test different structural patterns and validation ca
 - **Temperature**: Not set (models use their defaults)
 - **Total evaluations**: 209 questions × 6 formats × 4 models = 5,016 LLM calls
 </details>
 <!-- /automd -->
 > [!NOTE]
 > **Key takeaway:** TOON achieves **73.9% accuracy** (vs JSON's 69.7%) while using **39.6% fewer tokens** on these datasets. The explicit structure (array lengths `[N]` and field lists `{fields}`) helps models track and validate data more reliably.
 ## Token Efficiency
 Token counts are measured using the GPT-5 `o200k_base` tokenizer via [`gpt-tokenizer`](https://github.com/niieani/gpt-tokenizer). Savings are calculated against formatted JSON (2-space indentation) as the primary baseline, with additional comparisons to compact JSON (minified), YAML, and XML. Actual savings vary by model and tokenizer.
--- a/docs/guide/getting-started.md
+++ b/docs/guide/getting-started.md
@@ -113,8 +113,8 @@ TOON is optimized for specific use cases. It aims to:
 TOON excels with uniform arrays of objects – data with the same structure across items. For LLM prompts, the format produces deterministic, minimally quoted text with built-in validation. Explicit array lengths (`[N]`) and field headers (`{fields}`) help detect truncation and malformed data, while the tabular structure declares fields once rather than repeating them in every row.
-::: tip Production Ready
+::: tip
-TOON is production-ready and actively maintained, with implementations in TypeScript, Python, Go, Rust, .NET, and more. The format is stable, but also an idea in progress. Nothing's set in stone – help shape where it goes by contributing to the [specification](https://github.com/toon-format/spec) or sharing feedback.
+The TOON format is stable, but also an idea in progress. Nothing's set in stone – help shape where it goes by contributing to the [spec](https://github.com/toon-format/spec) or sharing feedback.
 :::
 ## When Not to Use TOON
--- a/docs/index.md
+++ b/docs/index.md
@@ -14,8 +14,8 @@ hero:
      text: Get Started
      link: /guide/getting-started
    - theme: alt
-      text: Format Overview
+      text: Benchmarks
-      link: /guide/format-overview
+      link: /guide/benchmarks
    - theme: alt
      text: CLI
      link: /cli/
--- a/docs/reference/spec.md
+++ b/docs/reference/spec.md
@@ -5,7 +5,7 @@ The [TOON specification](https://github.com/toon-format/spec) is the authoritati
 You don't need this page to *use* TOON. It's mainly for implementers and contributors. If you're looking to learn how to use TOON, start with the [Getting Started](/guide/getting-started) guide instead.
 > [!TIP]
-> TOON is production-ready, but also an idea in progress. Nothing's set in stone – help shape where it goes by contributing to the spec or sharing feedback.
+> The TOON specification is stable, but also an idea in progress. Nothing's set in stone – help shape where it goes by contributing to it or sharing feedback!
 ## Current Version