docs(website): highlight benchmarks

2026-01-29 15:24:10 +08:00 · 2025-11-18 10:14:07 +01:00
parent 9bebbb4070
commit 0ac629a085
9 changed files with 33 additions and 46 deletions
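
Reviewer note: the main textual change in this commit is the summary line, which moves from a bold "Key tradeoff" sentence to a `> [!TIP]` callout. Below is a minimal sketch of how that string is assembled, mirroring the template in `generateSummaryComparison`. The `FormatSummary` interface and the `toPercent` and `renderSummaryCallout` names are hypothetical; only the `accuracy` and `totalTokens` fields and the formula itself appear in the diff. The input numbers are made up for illustration and are not benchmark results.

```typescript
// Hypothetical shape; the diff only shows `accuracy` and `totalTokens` being read.
interface FormatSummary {
  accuracy: number // fraction in [0, 1]
  totalTokens: number
}

// Formats a fraction as a percentage with one decimal place, e.g. 0.75 -> "75.0".
function toPercent(fraction: number): string {
  return (fraction * 100).toFixed(1)
}

// Builds the two-line callout, mirroring the template literal in the diff.
function renderSummaryCallout(toon: FormatSummary, json: FormatSummary): string {
  // Token savings relative to JSON: 1 - (TOON tokens / JSON tokens).
  const tokenSavings = 1 - toon.totalTokens / json.totalTokens
  return `
> [!TIP] Results Summary
> TOON achieves **${toPercent(toon.accuracy)}% accuracy** (vs JSON's ${toPercent(json.accuracy)}%) while using **${toPercent(tokenSavings)}% fewer tokens** on these datasets.
`.trim()
}

// Made-up inputs for illustration only (not benchmark data).
const callout = renderSummaryCallout(
  { accuracy: 0.75, totalTokens: 600 },
  { accuracy: 0.7, totalTokens: 1000 },
)
console.log(callout)
```

The `.trim()` call matters here: the template literal starts and ends with a newline for readability, and trimming leaves exactly the two callout lines.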
--- a/benchmarks/results/retrieval-accuracy.md
+++ b/benchmarks/results/retrieval-accuracy.md
@@ -85,7 +85,8 @@ grok-4-fast-non-reasoning
  CSV            ██████████░░░░░░░░░░    52.3% (57/109)
 ```

-**Key tradeoff:** TOON achieves **73.9% accuracy** (vs JSON's 69.7%) while using **39.6% fewer tokens** on these datasets.
+> [!TIP] Results Summary
+> TOON achieves **73.9% accuracy** (vs JSON's 69.7%) while using **39.6% fewer tokens** on these datasets.

 <details>
 <summary><strong>Performance by dataset, model, and question type</strong></summary>
@@ -268,9 +269,6 @@ grok-4-fast-non-reasoning

 </details>

-<details>
-<summary><strong>How the benchmark works</strong></summary>
-
 #### What's Being Measured

 This benchmark tests **LLM comprehension and data retrieval accuracy** across different input formats. Each LLM receives formatted data and must answer questions about it. This does **not** test the model's ability to generate TOON output – only to read and understand it.
@@ -289,7 +287,7 @@ Eleven datasets designed to test different structural patterns and validation ca

 **Structural validation datasets:**
 7. **Control**: Valid complete dataset (baseline for validation)
-8. **Truncated**: Array with 3 rows removed from end (tests [N] length detection)
+8. **Truncated**: Array with 3 rows removed from end (tests `[N]` length detection)
 9. **Extra rows**: Array with 3 additional rows beyond declared length
 10. **Width mismatch**: Inconsistent field count (missing salary in row 10)
 11. **Missing fields**: Systematic field omissions (no email in multiple rows)
@@ -312,14 +310,14 @@ Eleven datasets designed to test different structural patterns and validation ca
  - Example: "How many employees in Sales have salary > 80000?" → `5`
  - Example: "How many active employees have more than 10 years of experience?" → `8`

-- **Structure awareness (12%)**: Tests format-native structural affordances (TOON's [N] count and {fields}, CSV's header row)
+- **Structure awareness (12%)**: Tests format-native structural affordances (TOON's `[N]` count and `{fields}`, CSV's header row)
  - Example: "How many employees are in the dataset?" → `100`
  - Example: "List the field names for employees" → `id, name, email, department, salary, yearsExperience, active`
  - Example: "What is the department of the last employee?" → `Sales`

 - **Structural validation (2%)**: Tests ability to detect incomplete, truncated, or corrupted data using structural metadata
  - Example: "Is this data complete and valid?" → `YES` (control dataset) or `NO` (corrupted datasets)
-  - Tests TOON's [N] length validation and {fields} consistency checking
+  - Tests TOON's `[N]` length validation and `{fields}` consistency checking
  - Demonstrates CSV's lack of structural validation capabilities

 #### Evaluation Process
@@ -334,5 +332,3 @@ Eleven datasets designed to test different structural patterns and validation ca
 - **Token counting**: Using `gpt-tokenizer` with `o200k_base` encoding (GPT-5 tokenizer)
 - **Temperature**: Not set (models use their defaults)
 - **Total evaluations**: 209 questions × 6 formats × 4 models = 5,016 LLM calls
-
-</details>
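
Reviewer note: the "Truncated", "Extra rows", and "Width mismatch" datasets described in this file rely on the declared `[N]` length and the `{fields}` list. The sketch below shows the kind of check those markers make possible. It assumes a header line of the form `name[N]{field1,field2}:` followed by one comma-separated row per line; `LengthCheck` and `checkTabularBlock` are hypothetical names, and this is not the benchmark's or the TOON library's implementation.

```typescript
// Hypothetical result type for this sketch.
interface LengthCheck {
  declared: number // N from the header
  actual: number // number of non-empty rows found
  fieldCount: number // number of names inside {...}
  valid: boolean
}

// Checks one tabular block of the assumed form:
//   name[N]{field1,field2,...}:
//     value1,value2,...
// Simplification: rows are split on commas, so quoted values
// that contain commas are not handled.
function checkTabularBlock(text: string): LengthCheck {
  const [header, ...rest] = text.split('\n')
  const match = /^\s*\w+\[(\d+)\]\{([^}]*)\}:\s*$/.exec(header)
  if (!match)
    throw new Error('Header does not match the assumed name[N]{fields}: form')

  const declared = Number(match[1])
  const fieldCount = match[2].split(',').length
  const rows = rest.filter(line => line.trim() !== '')
  // Every row must have exactly as many values as declared fields.
  const widthsMatch = rows.every(row => row.split(',').length === fieldCount)

  return {
    declared,
    actual: rows.length,
    fieldCount,
    valid: declared === rows.length && widthsMatch,
  }
}

// Made-up sample data for illustration only.
const complete = checkTabularBlock('employees[2]{id,name}:\n  1,Ada\n  2,Linus')
const truncated = checkTabularBlock('employees[3]{id,name}:\n  1,Ada\n  2,Linus')
const narrowRow = checkTabularBlock('employees[2]{id,name}:\n  1,Ada\n  2')
console.log(complete.valid, truncated.valid, narrowRow.valid)
```

A CSV file carries no declared row count, so the truncated case has no equivalent check there; that is the contrast the structural validation questions are designed to probe.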
--- a/benchmarks/src/report.ts
+++ b/benchmarks/src/report.ts
@@ -275,9 +275,6 @@ ${modelPerformance}

 </details>

-<details>
-<summary><strong>How the benchmark works</strong></summary>
-
 #### What's Being Measured

 This benchmark tests **LLM comprehension and data retrieval accuracy** across different input formats. Each LLM receives formatted data and must answer questions about it. This does **not** test the model's ability to generate TOON output – only to read and understand it.
@@ -296,7 +293,7 @@ Eleven datasets designed to test different structural patterns and validation ca

 **Structural validation datasets:**
 7. **Control**: Valid complete dataset (baseline for validation)
-8. **Truncated**: Array with 3 rows removed from end (tests [N] length detection)
+8. **Truncated**: Array with 3 rows removed from end (tests \`[N]\` length detection)
 9. **Extra rows**: Array with 3 additional rows beyond declared length
 10. **Width mismatch**: Inconsistent field count (missing salary in row 10)
 11. **Missing fields**: Systematic field omissions (no email in multiple rows)
@@ -319,14 +316,14 @@ ${totalQuestions} questions are generated dynamically across five categories:
  - Example: "How many employees in Sales have salary > 80000?" → \`5\`
  - Example: "How many active employees have more than 10 years of experience?" → \`8\`

-- **Structure awareness (${structureAwarenessPercent}%)**: Tests format-native structural affordances (TOON's [N] count and {fields}, CSV's header row)
+- **Structure awareness (${structureAwarenessPercent}%)**: Tests format-native structural affordances (TOON's \`[N]\` count and \`{fields}\`, CSV's header row)
  - Example: "How many employees are in the dataset?" → \`100\`
  - Example: "List the field names for employees" → \`id, name, email, department, salary, yearsExperience, active\`
  - Example: "What is the department of the last employee?" → \`Sales\`

 - **Structural validation (${structuralValidationPercent}%)**: Tests ability to detect incomplete, truncated, or corrupted data using structural metadata
  - Example: "Is this data complete and valid?" → \`YES\` (control dataset) or \`NO\` (corrupted datasets)
-  - Tests TOON's [N] length validation and {fields} consistency checking
+  - Tests TOON's \`[N]\` length validation and \`{fields}\` consistency checking
  - Demonstrates CSV's lack of structural validation capabilities

 #### Evaluation Process
@@ -341,8 +338,6 @@ ${totalQuestions} questions are generated dynamically across five categories:
 - **Token counting**: Using \`gpt-tokenizer\` with \`o200k_base\` encoding (GPT-5 tokenizer)
 - **Temperature**: Not set (models use their defaults)
 - **Total evaluations**: ${totalQuestions} questions × ${formatCount} formats × ${modelNames.length} models = ${totalEvaluations.toLocaleString('en-US')} LLM calls
-
-</details>
 `.trim()
 }

@@ -398,7 +393,10 @@ function generateSummaryComparison(
  if (!toon || !json)
    return ''

-  return `**Key tradeoff:** TOON achieves **${(toon.accuracy * 100).toFixed(1)}% accuracy** (vs JSON's ${(json.accuracy * 100).toFixed(1)}%) while using **${((1 - toon.totalTokens / json.totalTokens) * 100).toFixed(1)}% fewer tokens** on these datasets.`
+  return `
+> [!TIP] Results Summary
+> TOON achieves **${(toon.accuracy * 100).toFixed(1)}% accuracy** (vs JSON's ${(json.accuracy * 100).toFixed(1)}%) while using **${((1 - toon.totalTokens / json.totalTokens) * 100).toFixed(1)}% fewer tokens** on these datasets.
+`.trim()
 }

 /**