From 0ac629a085fb4ca9f36430658218351302b8c14e Mon Sep 17 00:00:00 2001 From: Johann Schopplich Date: Tue, 18 Nov 2025 10:14:07 +0100 Subject: [PATCH] docs(website): highlight benchmarks --- README.md | 16 ++++++---------- benchmarks/results/retrieval-accuracy.md | 14 +++++--------- benchmarks/src/report.ts | 16 +++++++--------- docs/.vitepress/theme/overrides.css | 4 ++++ docs/ecosystem/implementations.md | 2 +- docs/guide/benchmarks.md | 17 +++++------------ docs/guide/getting-started.md | 4 ++-- docs/index.md | 4 ++-- docs/reference/spec.md | 2 +- 9 files changed, 33 insertions(+), 46 deletions(-) diff --git a/README.md b/README.md index b337a2c..59c855f 100644 --- a/README.md +++ b/README.md @@ -17,7 +17,7 @@ The similarity to CSV is intentional: CSV is simple and ubiquitous, and TOON aim Think of it as a translation layer: use JSON programmatically, and encode it as TOON for LLM input. > [!TIP] -> TOON is production-ready, but also an idea in progress. Nothing's set in stone – help shape where it goes by contributing to the [spec](https://github.com/toon-format/spec) or sharing feedback. +> The TOON format is stable, but also an idea in progress. Nothing's set in stone – help shape where it goes by contributing to the [spec](https://github.com/toon-format/spec) or sharing feedback. ## Table of Contents @@ -244,7 +244,8 @@ grok-4-fast-non-reasoning CSV ██████████░░░░░░░░░░ 52.3% (57/109) ``` -**Key tradeoff:** TOON achieves **73.9% accuracy** (vs JSON's 69.7%) while using **39.6% fewer tokens** on these datasets. +> [!TIP] Results Summary +> TOON achieves **73.9% accuracy** (vs JSON's 69.7%) while using **39.6% fewer tokens** on these datasets.
 
 <details>
 <summary>Performance by dataset, model, and question type</summary>
@@ -427,9 +428,6 @@ grok-4-fast-non-reasoning
 
 </details>
 
-<details>
-How the benchmark works - #### What's Being Measured This benchmark tests **LLM comprehension and data retrieval accuracy** across different input formats. Each LLM receives formatted data and must answer questions about it. This does **not** test the model's ability to generate TOON output – only to read and understand it. @@ -448,7 +446,7 @@ Eleven datasets designed to test different structural patterns and validation ca **Structural validation datasets:** 7. **Control**: Valid complete dataset (baseline for validation) -8. **Truncated**: Array with 3 rows removed from end (tests [N] length detection) +8. **Truncated**: Array with 3 rows removed from end (tests `[N]` length detection) 9. **Extra rows**: Array with 3 additional rows beyond declared length 10. **Width mismatch**: Inconsistent field count (missing salary in row 10) 11. **Missing fields**: Systematic field omissions (no email in multiple rows) @@ -471,14 +469,14 @@ Eleven datasets designed to test different structural patterns and validation ca - Example: "How many employees in Sales have salary > 80000?" → `5` - Example: "How many active employees have more than 10 years of experience?" → `8` -- **Structure awareness (12%)**: Tests format-native structural affordances (TOON's [N] count and {fields}, CSV's header row) +- **Structure awareness (12%)**: Tests format-native structural affordances (TOON's `[N]` count and `{fields}`, CSV's header row) - Example: "How many employees are in the dataset?" → `100` - Example: "List the field names for employees" → `id, name, email, department, salary, yearsExperience, active` - Example: "What is the department of the last employee?" → `Sales` - **Structural validation (2%)**: Tests ability to detect incomplete, truncated, or corrupted data using structural metadata - Example: "Is this data complete and valid?" → `YES` (control dataset) or `NO` (corrupted datasets) - - Tests TOON's [N] length validation and {fields} consistency checking + - Tests TOON's `[N]` length validation and `{fields}` consistency checking - Demonstrates CSV's lack of structural validation capabilities #### Evaluation Process @@ -494,8 +492,6 @@ Eleven datasets designed to test different structural patterns and validation ca - **Temperature**: Not set (models use their defaults) - **Total evaluations**: 209 questions × 6 formats × 4 models = 5,016 LLM calls -
- ### Token Efficiency diff --git a/benchmarks/results/retrieval-accuracy.md b/benchmarks/results/retrieval-accuracy.md index b03ffbb..5a9d02d 100644 --- a/benchmarks/results/retrieval-accuracy.md +++ b/benchmarks/results/retrieval-accuracy.md @@ -85,7 +85,8 @@ grok-4-fast-non-reasoning CSV ██████████░░░░░░░░░░ 52.3% (57/109) ``` -**Key tradeoff:** TOON achieves **73.9% accuracy** (vs JSON's 69.7%) while using **39.6% fewer tokens** on these datasets. +> [!TIP] Results Summary +> TOON achieves **73.9% accuracy** (vs JSON's 69.7%) while using **39.6% fewer tokens** on these datasets.
 
 <details>
 <summary>Performance by dataset, model, and question type</summary>
@@ -268,9 +269,6 @@ grok-4-fast-non-reasoning
 
 </details>
 
-<details>
-How the benchmark works - #### What's Being Measured This benchmark tests **LLM comprehension and data retrieval accuracy** across different input formats. Each LLM receives formatted data and must answer questions about it. This does **not** test the model's ability to generate TOON output – only to read and understand it. @@ -289,7 +287,7 @@ Eleven datasets designed to test different structural patterns and validation ca **Structural validation datasets:** 7. **Control**: Valid complete dataset (baseline for validation) -8. **Truncated**: Array with 3 rows removed from end (tests [N] length detection) +8. **Truncated**: Array with 3 rows removed from end (tests `[N]` length detection) 9. **Extra rows**: Array with 3 additional rows beyond declared length 10. **Width mismatch**: Inconsistent field count (missing salary in row 10) 11. **Missing fields**: Systematic field omissions (no email in multiple rows) @@ -312,14 +310,14 @@ Eleven datasets designed to test different structural patterns and validation ca - Example: "How many employees in Sales have salary > 80000?" → `5` - Example: "How many active employees have more than 10 years of experience?" → `8` -- **Structure awareness (12%)**: Tests format-native structural affordances (TOON's [N] count and {fields}, CSV's header row) +- **Structure awareness (12%)**: Tests format-native structural affordances (TOON's `[N]` count and `{fields}`, CSV's header row) - Example: "How many employees are in the dataset?" → `100` - Example: "List the field names for employees" → `id, name, email, department, salary, yearsExperience, active` - Example: "What is the department of the last employee?" → `Sales` - **Structural validation (2%)**: Tests ability to detect incomplete, truncated, or corrupted data using structural metadata - Example: "Is this data complete and valid?" → `YES` (control dataset) or `NO` (corrupted datasets) - - Tests TOON's [N] length validation and {fields} consistency checking + - Tests TOON's `[N]` length validation and `{fields}` consistency checking - Demonstrates CSV's lack of structural validation capabilities #### Evaluation Process @@ -334,5 +332,3 @@ Eleven datasets designed to test different structural patterns and validation ca - **Token counting**: Using `gpt-tokenizer` with `o200k_base` encoding (GPT-5 tokenizer) - **Temperature**: Not set (models use their defaults) - **Total evaluations**: 209 questions × 6 formats × 4 models = 5,016 LLM calls - -
diff --git a/benchmarks/src/report.ts b/benchmarks/src/report.ts
index 8b39873..0adafa6 100644
--- a/benchmarks/src/report.ts
+++ b/benchmarks/src/report.ts
@@ -275,9 +275,6 @@ ${modelPerformance}
 
 </details>
 
-<details>
-How the benchmark works - #### What's Being Measured This benchmark tests **LLM comprehension and data retrieval accuracy** across different input formats. Each LLM receives formatted data and must answer questions about it. This does **not** test the model's ability to generate TOON output – only to read and understand it. @@ -296,7 +293,7 @@ Eleven datasets designed to test different structural patterns and validation ca **Structural validation datasets:** 7. **Control**: Valid complete dataset (baseline for validation) -8. **Truncated**: Array with 3 rows removed from end (tests [N] length detection) +8. **Truncated**: Array with 3 rows removed from end (tests \`[N]\` length detection) 9. **Extra rows**: Array with 3 additional rows beyond declared length 10. **Width mismatch**: Inconsistent field count (missing salary in row 10) 11. **Missing fields**: Systematic field omissions (no email in multiple rows) @@ -319,14 +316,14 @@ ${totalQuestions} questions are generated dynamically across five categories: - Example: "How many employees in Sales have salary > 80000?" → \`5\` - Example: "How many active employees have more than 10 years of experience?" → \`8\` -- **Structure awareness (${structureAwarenessPercent}%)**: Tests format-native structural affordances (TOON's [N] count and {fields}, CSV's header row) +- **Structure awareness (${structureAwarenessPercent}%)**: Tests format-native structural affordances (TOON's \`[N]\` count and \`{fields}\`, CSV's header row) - Example: "How many employees are in the dataset?" → \`100\` - Example: "List the field names for employees" → \`id, name, email, department, salary, yearsExperience, active\` - Example: "What is the department of the last employee?" → \`Sales\` - **Structural validation (${structuralValidationPercent}%)**: Tests ability to detect incomplete, truncated, or corrupted data using structural metadata - Example: "Is this data complete and valid?" → \`YES\` (control dataset) or \`NO\` (corrupted datasets) - - Tests TOON's [N] length validation and {fields} consistency checking + - Tests TOON's \`[N]\` length validation and \`{fields}\` consistency checking - Demonstrates CSV's lack of structural validation capabilities #### Evaluation Process @@ -341,8 +338,6 @@ ${totalQuestions} questions are generated dynamically across five categories: - **Token counting**: Using \`gpt-tokenizer\` with \`o200k_base\` encoding (GPT-5 tokenizer) - **Temperature**: Not set (models use their defaults) - **Total evaluations**: ${totalQuestions} questions × ${formatCount} formats × ${modelNames.length} models = ${totalEvaluations.toLocaleString('en-US')} LLM calls - -
`.trim() } @@ -398,7 +393,10 @@ function generateSummaryComparison( if (!toon || !json) return '' - return `**Key tradeoff:** TOON achieves **${(toon.accuracy * 100).toFixed(1)}% accuracy** (vs JSON's ${(json.accuracy * 100).toFixed(1)}%) while using **${((1 - toon.totalTokens / json.totalTokens) * 100).toFixed(1)}% fewer tokens** on these datasets.` + return ` +> [!TIP] Results Summary +> TOON achieves **${(toon.accuracy * 100).toFixed(1)}% accuracy** (vs JSON's ${(json.accuracy * 100).toFixed(1)}%) while using **${((1 - toon.totalTokens / json.totalTokens) * 100).toFixed(1)}% fewer tokens** on these datasets. +`.trim() } /** diff --git a/docs/.vitepress/theme/overrides.css b/docs/.vitepress/theme/overrides.css index b600428..b99a9af 100644 --- a/docs/.vitepress/theme/overrides.css +++ b/docs/.vitepress/theme/overrides.css @@ -10,6 +10,10 @@ details summary { cursor: pointer; } +.vp-doc [class*="language-"] code { + color: var(--vp-c-text-1) +} + .VPHomeHero .image-src { max-width: 180px !important; max-height: 180px !important; diff --git a/docs/ecosystem/implementations.md b/docs/ecosystem/implementations.md index 8a758fb..b6d1d85 100644 --- a/docs/ecosystem/implementations.md +++ b/docs/ecosystem/implementations.md @@ -5,7 +5,7 @@ TOON has official and community implementations across multiple programming lang The code examples throughout this documentation site use the TypeScript implementation by default, but the format and concepts apply equally to all languages. > [!NOTE] -> When implementing TOON in other languages, please follow the [specification](https://github.com/toon-format/spec/blob/main/SPEC.md) to ensure compatibility across implementations. The [conformance tests](https://github.com/toon-format/spec/tree/main/tests) provide language-agnostic test fixtures that validate your implementation. +> When implementing TOON in other languages, please follow the [spec](https://github.com/toon-format/spec/blob/main/SPEC.md) to ensure compatibility across implementations. The [conformance tests](https://github.com/toon-format/spec/tree/main/tests) provide language-agnostic test fixtures that validate your implementation. ## Official Implementations diff --git a/docs/guide/benchmarks.md b/docs/guide/benchmarks.md index af6e0d2..c6c6460 100644 --- a/docs/guide/benchmarks.md +++ b/docs/guide/benchmarks.md @@ -101,7 +101,8 @@ grok-4-fast-non-reasoning CSV ██████████░░░░░░░░░░ 52.3% (57/109) ``` -**Key tradeoff:** TOON achieves **73.9% accuracy** (vs JSON's 69.7%) while using **39.6% fewer tokens** on these datasets. +> [!TIP] Results Summary +> TOON achieves **73.9% accuracy** (vs JSON's 69.7%) while using **39.6% fewer tokens** on these datasets.
 
 <details>
 <summary>Performance by dataset, model, and question type</summary>
@@ -284,9 +285,6 @@ grok-4-fast-non-reasoning
 
 </details>
 
-<details>
-How the benchmark works - #### What's Being Measured This benchmark tests **LLM comprehension and data retrieval accuracy** across different input formats. Each LLM receives formatted data and must answer questions about it. This does **not** test the model's ability to generate TOON output – only to read and understand it. @@ -305,7 +303,7 @@ Eleven datasets designed to test different structural patterns and validation ca **Structural validation datasets:** 7. **Control**: Valid complete dataset (baseline for validation) -8. **Truncated**: Array with 3 rows removed from end (tests [N] length detection) +8. **Truncated**: Array with 3 rows removed from end (tests `[N]` length detection) 9. **Extra rows**: Array with 3 additional rows beyond declared length 10. **Width mismatch**: Inconsistent field count (missing salary in row 10) 11. **Missing fields**: Systematic field omissions (no email in multiple rows) @@ -328,14 +326,14 @@ Eleven datasets designed to test different structural patterns and validation ca - Example: "How many employees in Sales have salary > 80000?" → `5` - Example: "How many active employees have more than 10 years of experience?" → `8` -- **Structure awareness (12%)**: Tests format-native structural affordances (TOON's [N] count and {fields}, CSV's header row) +- **Structure awareness (12%)**: Tests format-native structural affordances (TOON's `[N]` count and `{fields}`, CSV's header row) - Example: "How many employees are in the dataset?" → `100` - Example: "List the field names for employees" → `id, name, email, department, salary, yearsExperience, active` - Example: "What is the department of the last employee?" → `Sales` - **Structural validation (2%)**: Tests ability to detect incomplete, truncated, or corrupted data using structural metadata - Example: "Is this data complete and valid?" → `YES` (control dataset) or `NO` (corrupted datasets) - - Tests TOON's [N] length validation and {fields} consistency checking + - Tests TOON's `[N]` length validation and `{fields}` consistency checking - Demonstrates CSV's lack of structural validation capabilities #### Evaluation Process @@ -351,13 +349,8 @@ Eleven datasets designed to test different structural patterns and validation ca - **Temperature**: Not set (models use their defaults) - **Total evaluations**: 209 questions × 6 formats × 4 models = 5,016 LLM calls -
- -> [!NOTE] -> **Key takeaway:** TOON achieves **73.9% accuracy** (vs JSON's 69.7%) while using **39.6% fewer tokens** on these datasets. The explicit structure (array lengths `[N]` and field lists `{fields}`) helps models track and validate data more reliably. - ## Token Efficiency Token counts are measured using the GPT-5 `o200k_base` tokenizer via [`gpt-tokenizer`](https://github.com/niieani/gpt-tokenizer). Savings are calculated against formatted JSON (2-space indentation) as the primary baseline, with additional comparisons to compact JSON (minified), YAML, and XML. Actual savings vary by model and tokenizer. diff --git a/docs/guide/getting-started.md b/docs/guide/getting-started.md index 4199a38..f51cc15 100644 --- a/docs/guide/getting-started.md +++ b/docs/guide/getting-started.md @@ -113,8 +113,8 @@ TOON is optimized for specific use cases. It aims to: TOON excels with uniform arrays of objects – data with the same structure across items. For LLM prompts, the format produces deterministic, minimally quoted text with built-in validation. Explicit array lengths (`[N]`) and field headers (`{fields}`) help detect truncation and malformed data, while the tabular structure declares fields once rather than repeating them in every row. -::: tip Production Ready -TOON is production-ready and actively maintained, with implementations in TypeScript, Python, Go, Rust, .NET, and more. The format is stable, but also an idea in progress. Nothing's set in stone – help shape where it goes by contributing to the [specification](https://github.com/toon-format/spec) or sharing feedback. +::: tip +The TOON format is stable, but also an idea in progress. Nothing's set in stone – help shape where it goes by contributing to the [spec](https://github.com/toon-format/spec) or sharing feedback. ::: ## When Not to Use TOON diff --git a/docs/index.md b/docs/index.md index 4621f6e..360bdc7 100644 --- a/docs/index.md +++ b/docs/index.md @@ -14,8 +14,8 @@ hero: text: Get Started link: /guide/getting-started - theme: alt - text: Format Overview - link: /guide/format-overview + text: Benchmarks + link: /guide/benchmarks - theme: alt text: CLI link: /cli/ diff --git a/docs/reference/spec.md b/docs/reference/spec.md index 0ac2718..2638eb2 100644 --- a/docs/reference/spec.md +++ b/docs/reference/spec.md @@ -5,7 +5,7 @@ The [TOON specification](https://github.com/toon-format/spec) is the authoritati You don't need this page to *use* TOON. It's mainly for implementers and contributors. If you're looking to learn how to use TOON, start with the [Getting Started](/guide/getting-started) guide instead. > [!TIP] -> TOON is production-ready, but also an idea in progress. Nothing's set in stone – help shape where it goes by contributing to the spec or sharing feedback. +> The TOON specification is stable, but also an idea in progress. Nothing's set in stone – help shape where it goes by contributing to it or sharing feedback! ## Current Version
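The methodology above counts tokens with `gpt-tokenizer` using the `o200k_base` encoding and compares TOON against 2-space-indented JSON. A minimal sketch of that comparison follows; it assumes the TypeScript implementation is published as `@toon-format/toon` with an `encode` function and that `gpt-tokenizer` exposes an `o200k_base` entry point, so treat the import paths as illustrative rather than canonical.

```ts
// Sketch: reproduce the TOON-vs-JSON token comparison described in the benchmark methodology.
// Assumptions: the TypeScript implementation is `@toon-format/toon` exposing `encode`,
// and gpt-tokenizer provides an o200k_base encoding entry point exposing `encode`.
import { encode as encodeToon } from '@toon-format/toon'
import { encode as tokenize } from 'gpt-tokenizer/encoding/o200k_base'

const data = {
  employees: [
    { id: 1, name: 'Alice', department: 'Sales' },
    { id: 2, name: 'Bob', department: 'Engineering' },
    { id: 3, name: 'Carol', department: 'Marketing' },
  ],
}

// TOON declares the row count and field list once, roughly:
// employees[3]{id,name,department}:
//   1,Alice,Sales
//   2,Bob,Engineering
//   3,Carol,Marketing
const toon = encodeToon(data)
const json = JSON.stringify(data, null, 2)

console.log(`TOON tokens: ${tokenize(toon).length}`)
console.log(`JSON tokens: ${tokenize(json).length}`)
```

The commented TOON output shows the `[N]` row count and `{fields}` header that the structure-awareness and structural-validation questions in the benchmark rely on.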