mirror of https://github.com/voson-wang/toon.git (synced 2026-01-29 15:24:10 +08:00)

docs: add accuracy per 1k tokens report (closes #72)

README.md (31)

@@ -62,12 +62,14 @@ For small payloads, JSON/CSV/YAML work fine. TOON's value emerges at scale: when

## Key Features

- 💸 **Token-efficient:** typically 30–60% fewer tokens than JSON
- 💸 **Token-efficient:** typically 30–60% fewer tokens than JSON[^1]
- 🤿 **LLM-friendly guardrails:** explicit lengths and fields enable validation
- 🍱 **Minimal syntax:** removes redundant punctuation (braces, brackets, most quotes)
- 📐 **Indentation-based structure:** like YAML, uses whitespace instead of braces
- 🧺 **Tabular arrays:** declare keys once, stream data as rows (see the example after this list)

[^1]: For flat tabular data, CSV is more compact. TOON adds minimal overhead to provide explicit structure and validation that improves LLM reliability.
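
To illustrate the tabular-array feature, a minimal sketch of the tabular form (the `users` data is illustrative; the header syntax follows the `metrics[5]{...}` example that appears later in this diff):

```
users[2]{id,name}:
  1,Alice
  2,Bob
```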

## Benchmarks

> [!TIP]

@@ -80,12 +82,10 @@ Token counts are measured using the GPT-5 `o200k_base` tokenizer via [`gpt-token

The benchmarks use datasets optimized for TOON's strengths (uniform tabular data). Real-world performance depends on your data structure.
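
Token counts like those below can be reproduced with the tokenizer package this repo already imports (`gpt-tokenizer`, visible in the utils diff further down); a minimal sketch, with an illustrative sample string:

```ts
import { encode } from 'gpt-tokenizer'

// Count tokens for a small TOON payload the same way the benchmark's
// tokenize() helper presumably does: encode and take the length.
const sample = 'users[2]{id,name}:\n  1,Alice\n  2,Bob'
console.log(encode(sample).length)
```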

> [!NOTE]
> CSV/TSV isn't shown in the token-efficiency chart because it doesn't encode nesting without flattening. For flat datasets, see CSV token counts in the [Retrieval Accuracy](#retrieval-accuracy) tables.
> CSV/TSV doesn't support nested structures, so it's not included in this comparison. For flat datasets where CSV applies, see token counts and accuracy metrics in the [Retrieval Accuracy](#retrieval-accuracy) section.

<!-- automd:file src="./benchmarks/results/token-efficiency.md" -->

### Token Efficiency

```
⭐ GitHub Repositories ██████████████░░░░░░░░░░░  8,745 tokens
   vs JSON (−42.3%)                              15,145
```

@@ -251,9 +251,28 @@ metrics[5]{date,views,clicks,conversions,revenue,bounceRate}:

<!-- /automd -->

### Retrieval Accuracy

<!-- automd:file src="./benchmarks/results/retrieval-accuracy.md" -->

### Retrieval Accuracy

Benchmarks test LLM comprehension across different input formats using 154 data retrieval questions on 4 models.

#### Efficiency Ranking (Accuracy per 1K Tokens)

Each format's overall performance, balancing accuracy against token cost:

```
toon         ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓ 15.0 │ 70.1% acc │ 4,678 tokens
csv          ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓░ 14.3 │ 67.7% acc │ 4,745 tokens
json-compact ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓░░░░░ 11.0 │ 65.3% acc │ 5,925 tokens
yaml         ▓▓▓▓▓▓▓▓▓▓▓▓▓░░░░░░░  9.4 │ 66.7% acc │ 7,091 tokens
json-pretty  ▓▓▓▓▓▓▓▓▓▓░░░░░░░░░░  7.5 │ 65.4% acc │ 8,713 tokens
xml          ▓▓▓▓▓▓▓▓▓░░░░░░░░░░░  6.8 │ 67.2% acc │ 9,944 tokens
```

TOON achieves **70.1%** accuracy (vs JSON's 65.4%) while using **46.3% fewer tokens**.
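
The ranking metric is accuracy (as a percentage) divided by tokens in thousands, matching the `generateEfficiencyRankingReport` code further down in this diff. A quick check of the top row:

```ts
// efficiency = accuracy% / (tokens / 1000)
const efficiency = (0.701 * 100) / (4678 / 1000)
console.log(efficiency.toFixed(1)) // "15.0" — the toon row in the chart above
```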

#### Per-Model Accuracy

Accuracy across **4 LLMs** on 154 data retrieval questions:

@@ -915,7 +934,7 @@ By default, the decoder validates input strictly:

- Format familiarity and structure matter as much as token count. TOON's tabular format requires arrays of objects with identical keys and primitive values only. When this doesn't hold (due to mixed types, non-uniform objects, or nested structures), TOON switches to list format, where JSON can be more efficient at scale (see the sketch after this list).
- **TOON excels at:** Uniform arrays of objects (same fields, primitive values), especially large datasets with consistent structure.
- **JSON is better for:** Non-uniform data, deeply nested structures, and objects with varying field sets.
- **CSV is more compact for:** Flat, uniform tables without nesting. TOON adds minimal overhead (`[N]` length markers, delimiter scoping, deterministic quoting) to improve LLM reliability while staying close to CSV's token efficiency.
- **CSV is more compact for:** Flat, uniform tables without nesting. TOON adds structure (`[N]` length markers, delimiter scoping, deterministic quoting) that improves LLM reliability with minimal token overhead.
- **Token counts vary by tokenizer and model.** Benchmarks use a GPT-style tokenizer (cl100k/o200k); actual savings will differ with other models (e.g., [SentencePiece](https://github.com/google/sentencepiece)).
- **TOON is designed for LLM input** where human readability and token efficiency matter. It's **not** a drop-in replacement for JSON in APIs or storage.
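
A minimal sketch of that tabular-vs-list behavior, assuming the library exports an `encode` function (the import path `@toon-format/toon` is an assumption, and the commented output is illustrative):

```ts
import { encode } from '@toon-format/toon' // import path assumed

// Uniform rows (same keys, primitive values) allow the compact tabular form:
//   users[2]{id,name}:
//     1,Alice
//     2,Bob
console.log(encode({ users: [{ id: 1, name: 'Alice' }, { id: 2, name: 'Bob' }] }))

// Non-uniform rows (a key mismatch) fall back to the list form,
// which loses the declare-keys-once token savings.
console.log(encode({ users: [{ id: 1, name: 'Alice' }, { id: 2, email: 'b@example.com' }] }))
```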

benchmarks/results/retrieval-accuracy.md

@@ -1,4 +1,21 @@
### Retrieval Accuracy
Benchmarks test LLM comprehension across different input formats using 154 data retrieval questions on 4 models.

#### Efficiency Ranking (Accuracy per 1K Tokens)

Each format's overall performance, balancing accuracy against token cost:

```
toon         ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓ 15.0 │ 70.1% acc │ 4,678 tokens
csv          ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓░ 14.3 │ 67.7% acc │ 4,745 tokens
json-compact ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓░░░░░ 11.0 │ 65.3% acc │ 5,925 tokens
yaml         ▓▓▓▓▓▓▓▓▓▓▓▓▓░░░░░░░  9.4 │ 66.7% acc │ 7,091 tokens
json-pretty  ▓▓▓▓▓▓▓▓▓▓░░░░░░░░░░  7.5 │ 65.4% acc │ 8,713 tokens
xml          ▓▓▓▓▓▓▓▓▓░░░░░░░░░░░  6.8 │ 67.2% acc │ 9,944 tokens
```

TOON achieves **70.1%** accuracy (vs JSON's 65.4%) while using **46.3% fewer tokens**.

#### Per-Model Accuracy

Accuracy across **4 LLMs** on 154 data retrieval questions:

benchmarks/results/token-efficiency.md

@@ -1,5 +1,3 @@
### Token Efficiency

```
⭐ GitHub Repositories ██████████████░░░░░░░░░░░  8,745 tokens
   vs JSON (−42.3%)                              15,145
```

@@ -1,15 +1,17 @@
import type { Question } from '../src/types'
import * as fsp from 'node:fs/promises'
import * as path from 'node:path'
import process from 'node:process'
import * as prompts from '@clack/prompts'
import PQueue from 'p-queue'
import { DEFAULT_CONCURRENCY, DRY_RUN, DRY_RUN_LIMITS, MODEL_RPM_LIMITS, ROOT_DIR } from '../src/constants'
import { BENCHMARKS_DIR, DEFAULT_CONCURRENCY, DRY_RUN, DRY_RUN_LIMITS, MODEL_RPM_LIMITS, ROOT_DIR } from '../src/constants'
import { datasets } from '../src/datasets'
import { evaluateQuestion, models } from '../src/evaluate'
import { formatters } from '../src/formatters'
import { generateQuestions } from '../src/questions'
import { calculateFormatResults, calculateTokenCounts, saveResults } from '../src/report'
import { calculateFormatResults, calculateTokenCounts, generateAccuracyReport } from '../src/report'
import { getAllModelResults, hasModelResults, saveModelResults } from '../src/storage'
import { ensureDir } from '../src/utils'

prompts.intro('Retrieval Accuracy Benchmark')

@@ -142,13 +144,15 @@ if (allResults.length === 0) {
  process.exit(0)
}

// Calculate token counts freshly (deterministic, no need to persist)
const tokenCounts = calculateTokenCounts(formatters)

// Calculate format statistics and save report
const formatResults = calculateFormatResults(allResults, tokenCounts)
const resultsDir = await saveResults(allResults, formatResults, questions, tokenCounts)
const accuracyReport = generateAccuracyReport(allResults, formatResults, tokenCounts)

const reportPath = path.join(resultsDir, 'retrieval-accuracy.md')
prompts.log.info(`Report saved to: \`${path.relative(ROOT_DIR, reportPath)}\``)
const resultsDir = path.join(BENCHMARKS_DIR, 'results')
await ensureDir(resultsDir)

const outputFilePath = path.join(resultsDir, 'retrieval-accuracy.md')
await fsp.writeFile(outputFilePath, accuracyReport)

prompts.log.info(`Report saved to: \`${path.relative(ROOT_DIR, outputFilePath)}\``)
reportSpinner.stop('Report generation complete!')

@@ -217,4 +217,4 @@ await ensureDir(resultsDir)
const outputFilePath = path.join(resultsDir, 'token-efficiency.md')
await fsp.writeFile(outputFilePath, markdown, 'utf-8')

prompts.log.success(`Result saved to \`${path.relative(ROOT_DIR, outputFilePath)}\``)
prompts.log.success(`Report saved to \`${path.relative(ROOT_DIR, outputFilePath)}\``)

@@ -1,10 +1,30 @@
import type { EvaluationResult, FormatResult, Question } from './types'
import * as fsp from 'node:fs/promises'
import * as path from 'node:path'
import { BENCHMARKS_DIR, FORMATTER_DISPLAY_NAMES } from './constants'
import type { EfficiencyRanking, EvaluationResult, FormatResult, Question } from './types'
import { FORMATTER_DISPLAY_NAMES } from './constants'
import { datasets } from './datasets'
import { models } from './evaluate'
import { createProgressBar, ensureDir, tokenize } from './utils'
import { generateQuestions } from './questions'
import { createProgressBar, tokenize } from './utils'

const EFFICIENCY_CHART_STYLE: 'vertical' | 'horizontal' = 'horizontal'

/**
 * Calculate token counts for all format+dataset combinations
 */
export function calculateTokenCounts(
  formatters: Record<string, (data: unknown) => string>,
): Record<string, number> {
  const tokenCounts: Record<string, number> = {}

  for (const [formatName, formatter] of Object.entries(formatters)) {
    for (const dataset of datasets) {
      const formatted = formatter(dataset.data)
      const key = `${formatName}-${dataset.name}`
      tokenCounts[key] = tokenize(formatted)
    }
  }

  return tokenCounts
}

/**
 * Calculate per-format statistics from evaluation results

@@ -40,9 +60,80 @@ export function calculateFormatResults(
}

/**
 * Generate embeddable markdown report from results
 * Generate consolidated retrieval accuracy report
 */
export function generateMarkdownReport(
export function generateAccuracyReport(
  results: EvaluationResult[],
  formatResults: FormatResult[],
  tokenCounts: Record<string, number>,
): string {
  const questions = generateQuestions()
  const totalQuestions = [...new Set(results.map(r => r.questionId))].length
  const modelIds = models.map(m => m.modelId)
  const modelNames = modelIds.filter(id => results.some(r => r.model === id))

  return `
Benchmarks test LLM comprehension across different input formats using ${totalQuestions} data retrieval questions on ${modelNames.length} ${modelNames.length === 1 ? 'model' : 'models'}.

#### Efficiency Ranking (Accuracy per 1K Tokens)

${generateEfficiencyRankingReport(formatResults)}

#### Per-Model Accuracy

${generateDetailedAccuracyReport(formatResults, results, questions, tokenCounts)}
`.trimStart()
}

/**
 * Generate efficiency ranking report
 */
function generateEfficiencyRankingReport(
  formatResults: FormatResult[],
): string {
  const toon = formatResults.find(r => r.format === 'toon')
  const json = formatResults.find(r => r.format === 'json-pretty')

  // Build efficiency ranking (accuracy per 1k tokens)
  const efficiencyRanking = formatResults
    .map((fr) => {
      const efficiency = (fr.accuracy * 100) / (fr.totalTokens / 1000)
      return {
        format: fr.format,
        efficiency,
        accuracy: fr.accuracy,
        tokens: fr.totalTokens,
      }
    })
    .sort((a, b) => b.efficiency - a.efficiency)

  const efficiencyChart = EFFICIENCY_CHART_STYLE === 'vertical'
    ? generateVerticalEfficiencyChart(efficiencyRanking)
    : generateHorizontalEfficiencyChart(efficiencyRanking)

  // Build summary text
  let summary = ''
  if (toon && json) {
    const toonVsJson = `**${(toon.accuracy * 100).toFixed(1)}%** accuracy (vs JSON's ${(json.accuracy * 100).toFixed(1)}%)`
    const tokenSavings = `**${((1 - toon.totalTokens / json.totalTokens) * 100).toFixed(1)}% fewer tokens**`
    summary = `TOON achieves ${toonVsJson} while using ${tokenSavings}.`
  }

  return `
Each format's overall performance, balancing accuracy against token cost:

\`\`\`
${efficiencyChart}
\`\`\`

${summary}
`.trim()
}

/**
 * Generate detailed accuracy report with breakdowns and methodology
 */
function generateDetailedAccuracyReport(
  formatResults: FormatResult[],
  results: EvaluationResult[],
  questions: Question[],

@@ -54,125 +145,17 @@ export function generateMarkdownReport(
  const modelIds = models.map(m => m.modelId)
  const modelNames = modelIds.filter(id => results.some(r => r.model === id))

  const maxDisplayNameWidth = Math.max(
    ...Object.values(FORMATTER_DISPLAY_NAMES).map(name => name.length),
  )
  const progressBarWidth = 20
  // Generate model breakdown section
  const modelBreakdown = generateModelBreakdown(formatResults, results, modelNames)

  const modelBreakdown = modelNames.map((modelName, i) => {
    const modelResults = formatResults.map((fr) => {
      const modelFormatResults = results.filter(r => r.model === modelName && r.format === fr.format)
      const correctCount = modelFormatResults.filter(r => r.isCorrect).length
      const totalCount = modelFormatResults.length
      const accuracy = totalCount > 0 ? correctCount / totalCount : 0
  // Generate summary comparison
  const summaryComparison = generateSummaryComparison(toon, json)

      return {
        format: fr.format,
        accuracy,
        correctCount,
        totalCount,
      }
    }).sort((a, b) => b.accuracy - a.accuracy)
  // Generate performance by dataset
  const datasetBreakdown = generateDatasetBreakdown(formatResults, results, questions, tokenCounts)

    const formatLines = modelResults.map((result) => {
      const bar = createProgressBar(result.accuracy, 1, progressBarWidth)
      const accuracyString = `${(result.accuracy * 100).toFixed(1)}%`.padStart(6)
      const countString = `(${result.correctCount}/${result.totalCount})`
      const prefix = result.format === 'toon' ? '→ ' : '  '
      const displayName = FORMATTER_DISPLAY_NAMES[result.format] || result.format
      return `${prefix}${displayName.padEnd(maxDisplayNameWidth)} ${bar} ${accuracyString} ${countString}`
    }).join('\n')

    // Add blank line before model name, except for first model
    return `${i > 0 ? '\n' : ''}${modelName}\n${formatLines}`
  }).join('\n')

  // Build summary comparison
  const summaryComparison = toon && json
    ? `**Key tradeoff:** TOON achieves **${(toon.accuracy * 100).toFixed(1)}% accuracy** (vs JSON's ${(json.accuracy * 100).toFixed(1)}%) while using **${((1 - toon.totalTokens / json.totalTokens) * 100).toFixed(1)}% fewer tokens** on these datasets.`
    : ''

  // Build performance by dataset
  const datasetBreakdown = datasets.map((dataset) => {
    const datasetResults = formatResults.map((fr) => {
      const datasetFormatResults = results.filter(r => r.questionId.includes(dataset.name) || questions.find(q => q.id === r.questionId)?.dataset === dataset.name)
      if (datasetFormatResults.length === 0)
        return undefined

      const formatDatasetResults = datasetFormatResults.filter(r => r.format === fr.format)
      if (formatDatasetResults.length === 0)
        return undefined

      const correctCount = formatDatasetResults.filter(r => r.isCorrect).length
      const totalCount = formatDatasetResults.length
      const accuracy = totalCount > 0 ? correctCount / totalCount : 0

      // Get token count for this dataset+format
      const tokenKey = `${fr.format}-${dataset.name}`
      const tokens = tokenCounts[tokenKey] || fr.totalTokens

      return {
        format: fr.format,
        accuracy,
        tokens,
        correctCount,
        totalCount,
      }
    }).filter(Boolean) as { format: string, accuracy: number, tokens: number, correctCount: number, totalCount: number }[]

    if (datasetResults.length === 0)
      return ''

    // Sort by efficiency
    datasetResults.sort((a, b) => {
      const effA = (a.accuracy ** 2) / (a.tokens / 1000)
      const effB = (b.accuracy ** 2) / (b.tokens / 1000)
      return effB - effA
    })

    const tableRows = datasetResults.slice(0, 6).map(result =>
      `| \`${result.format}\` | ${(result.accuracy * 100).toFixed(1)}% | ${result.tokens.toLocaleString('en-US')} | ${result.correctCount}/${result.totalCount} |`,
    ).join('\n')

    return `
##### ${dataset.description}

| Format | Accuracy | Tokens | Correct/Total |
| ------ | -------- | ------ | ------------- |
${tableRows}
`.trimStart()
  }).filter(Boolean).join('\n').trim()

  // Build performance by model
  const modelPerformance = modelNames.map((modelName) => {
    const modelResults = formatResults.map((fr) => {
      const modelFormatResults = results.filter(r => r.model === modelName && r.format === fr.format)
      const correctCount = modelFormatResults.filter(r => r.isCorrect).length
      const totalCount = modelFormatResults.length
      const accuracy = correctCount / totalCount

      return {
        format: fr.format,
        accuracy,
        correctCount,
        totalCount,
      }
    }).sort((a, b) => b.accuracy - a.accuracy)

    const tableRows = modelResults.map(result =>
      `| \`${result.format}\` | ${(result.accuracy * 100).toFixed(1)}% | ${result.correctCount}/${result.totalCount} |`,
    ).join('\n')

    return `
##### ${modelName}

| Format | Accuracy | Correct/Total |
| ------ | -------- | ------------- |
${tableRows}
`.trimStart()
  }).join('\n').trim()

  // Calculate total unique questions
  // Generate performance by model
  const modelPerformance = generateModelPerformanceTable(formatResults, results, modelNames)
  const totalQuestions = [...new Set(results.map(r => r.questionId))].length

  // Calculate question type distribution

@@ -195,8 +178,6 @@ ${tableRows}
  const totalEvaluations = totalQuestions * formatCount * modelNames.length

  return `
### Retrieval Accuracy

Accuracy across **${modelNames.length} ${modelNames.length === 1 ? 'LLM' : 'LLMs'}** on ${totalQuestions} data retrieval questions:

\`\`\`

@@ -266,47 +247,245 @@ ${totalQuestions} questions are generated dynamically across three categories:
- **Total evaluations**: ${totalQuestions} questions × ${formatCount} formats × ${modelNames.length} models = ${totalEvaluations.toLocaleString('en-US')} LLM calls

</details>
`.trimStart()
`.trim()
}
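
Plugging in the numbers from this commit's README section (154 questions, 4 models, and the six formats shown in the efficiency chart — the format count is assumed from that chart) gives the scale of a full run:

```ts
// totalEvaluations = questions × formats × models
console.log(154 * 6 * 4) // 3696 LLM calls
```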

/**
 * Calculate token counts for all format+dataset combinations
 * Generate ASCII bar chart showing per-model accuracy across formats
 */
export function calculateTokenCounts(
  formatters: Record<string, (data: unknown) => string>,
): Record<string, number> {
  const tokenCounts: Record<string, number> = {}

  for (const [formatName, formatter] of Object.entries(formatters)) {
    for (const dataset of datasets) {
      const formatted = formatter(dataset.data)
      const key = `${formatName}-${dataset.name}`
      tokenCounts[key] = tokenize(formatted)
    }
  }

  return tokenCounts
}

/**
 * Save results to disk
 *
 * @remarks
 * Per-model results are managed separately via storage.ts
 * This function only generates the aggregated markdown report
 */
export async function saveResults(
  results: EvaluationResult[],
function generateModelBreakdown(
  formatResults: FormatResult[],
  results: EvaluationResult[],
  modelNames: string[],
): string {
  const maxDisplayNameWidth = Math.max(
    ...Object.values(FORMATTER_DISPLAY_NAMES).map(name => name.length),
  )
  const progressBarWidth = 20

  return modelNames.map((modelName, i) => {
    const modelResults = formatResults.map((fr) => {
      const modelFormatResults = results.filter(r => r.model === modelName && r.format === fr.format)
      const correctCount = modelFormatResults.filter(r => r.isCorrect).length
      const totalCount = modelFormatResults.length
      const accuracy = totalCount > 0 ? correctCount / totalCount : 0

      return {
        format: fr.format,
        accuracy,
        correctCount,
        totalCount,
      }
    }).sort((a, b) => b.accuracy - a.accuracy)

    const formatLines = modelResults.map((result) => {
      const bar = createProgressBar(result.accuracy, 1, progressBarWidth)
      const accuracyString = `${(result.accuracy * 100).toFixed(1)}%`.padStart(6)
      const countString = `(${result.correctCount}/${result.totalCount})`
      const prefix = result.format === 'toon' ? '→ ' : '  '
      const displayName = FORMATTER_DISPLAY_NAMES[result.format] || result.format
      return `${prefix}${displayName.padEnd(maxDisplayNameWidth)} ${bar} ${accuracyString} ${countString}`
    }).join('\n')

    // Add blank line before model name, except for first model
    return `${i > 0 ? '\n' : ''}${modelName}\n${formatLines}`
  }).join('\n')
}

/**
 * Generate summary comparison between TOON and JSON formats
 */
function generateSummaryComparison(
  toon: FormatResult | undefined,
  json: FormatResult | undefined,
): string {
  if (!toon || !json)
    return ''

  return `**Key tradeoff:** TOON achieves **${(toon.accuracy * 100).toFixed(1)}% accuracy** (vs JSON's ${(json.accuracy * 100).toFixed(1)}%) while using **${((1 - toon.totalTokens / json.totalTokens) * 100).toFixed(1)}% fewer tokens** on these datasets.`
}

/**
 * Generate per-dataset performance breakdown tables
 */
function generateDatasetBreakdown(
  formatResults: FormatResult[],
  results: EvaluationResult[],
  questions: Question[],
  tokenCounts: Record<string, number>,
): Promise<string> {
  const resultsDir = path.join(BENCHMARKS_DIR, 'results')
  await ensureDir(resultsDir)
): string {
  return datasets.map((dataset) => {
    const datasetResults = formatResults.map((fr) => {
      const datasetFormatResults = results.filter(r => r.questionId.includes(dataset.name) || questions.find(q => q.id === r.questionId)?.dataset === dataset.name)
      if (datasetFormatResults.length === 0)
        return undefined

  // Generate markdown report from all available model results
  const report = generateMarkdownReport(formatResults, results, questions, tokenCounts)
  await fsp.writeFile(path.join(resultsDir, 'retrieval-accuracy.md'), report)
      const formatDatasetResults = datasetFormatResults.filter(r => r.format === fr.format)
      if (formatDatasetResults.length === 0)
        return undefined

  return resultsDir
      const correctCount = formatDatasetResults.filter(r => r.isCorrect).length
      const totalCount = formatDatasetResults.length
      const accuracy = totalCount > 0 ? correctCount / totalCount : 0

      // Get token count for this dataset+format
      const tokenKey = `${fr.format}-${dataset.name}`
      const tokens = tokenCounts[tokenKey] || fr.totalTokens

      return {
        format: fr.format,
        accuracy,
        tokens,
        correctCount,
        totalCount,
      }
    }).filter(Boolean) as { format: string, accuracy: number, tokens: number, correctCount: number, totalCount: number }[]

    if (datasetResults.length === 0)
      return ''

    // Sort by efficiency
    datasetResults.sort((a, b) => {
      const effA = (a.accuracy ** 2) / (a.tokens / 1000)
      const effB = (b.accuracy ** 2) / (b.tokens / 1000)
      return effB - effA
    })

    const tableRows = datasetResults.slice(0, 6).map(result =>
      `| \`${result.format}\` | ${(result.accuracy * 100).toFixed(1)}% | ${result.tokens.toLocaleString('en-US')} | ${result.correctCount}/${result.totalCount} |`,
    ).join('\n')

    return `
##### ${dataset.description}

| Format | Accuracy | Tokens | Correct/Total |
| ------ | -------- | ------ | ------------- |
${tableRows}
`.trimStart()
  }).filter(Boolean).join('\n').trim()
}

/**
 * Generate per-model performance comparison tables
 */
function generateModelPerformanceTable(
  formatResults: FormatResult[],
  results: EvaluationResult[],
  modelNames: string[],
): string {
  return modelNames.map((modelName) => {
    const modelResults = formatResults.map((fr) => {
      const modelFormatResults = results.filter(r => r.model === modelName && r.format === fr.format)
      const correctCount = modelFormatResults.filter(r => r.isCorrect).length
      const totalCount = modelFormatResults.length
      const accuracy = correctCount / totalCount

      return {
        format: fr.format,
        accuracy,
        correctCount,
        totalCount,
      }
    }).sort((a, b) => b.accuracy - a.accuracy)

    const tableRows = modelResults.map(result =>
      `| \`${result.format}\` | ${(result.accuracy * 100).toFixed(1)}% | ${result.correctCount}/${result.totalCount} |`,
    ).join('\n')

    return `
##### ${modelName}

| Format | Accuracy | Correct/Total |
| ------ | -------- | ------------- |
${tableRows}
`.trimStart()
  }).join('\n').trim()
}

/**
 * Generate horizontal bar chart for efficiency ranking
 */
function generateHorizontalEfficiencyChart(
  ranking: EfficiencyRanking[],
): string {
  const barWidth = 20
  const maxEfficiency = Math.max(...ranking.map(r => r.efficiency))
  const maxFormatWidth = Math.max(...ranking.map(r => r.format.length))

  return ranking
    .map((r) => {
      const normalizedValue = r.efficiency / maxEfficiency
      const bar = createProgressBar(normalizedValue, 1, barWidth, { filled: '▓', empty: '░' })
      const formatName = r.format.padEnd(maxFormatWidth)
      const efficiency = r.efficiency.toFixed(1).padStart(4)
      const accuracy = `${(r.accuracy * 100).toFixed(1)}%`.padStart(5)
      const tokens = r.tokens.toLocaleString('en-US').padStart(5)

      return `${formatName} ${bar} ${efficiency} │ ${accuracy} acc │ ${tokens} tokens`
    })
    .join('\n')
}

/**
 * Generate vertical bar chart for efficiency ranking
 */
function generateVerticalEfficiencyChart(
  ranking: EfficiencyRanking[],
): string {
  const maxEfficiency = Math.max(...ranking.map(r => r.efficiency))
  const chartHeight = 8

  // Generate rows from top to bottom
  const rows: string[] = []

  // Y-axis and bars
  for (let i = chartHeight; i >= 0; i--) {
    const threshold = (i / chartHeight) * maxEfficiency
    const yLabel = i === chartHeight || i === Math.floor(chartHeight / 2) || i === 0
      ? Math.round(threshold).toString().padStart(4)
      : '    '

    const bars = ranking
      .map((r) => {
        const barHeight = (r.efficiency / maxEfficiency) * chartHeight
        let char = ' '
        if (barHeight >= i) {
          // Use different characters for visual distinction
          if (ranking.indexOf(r) === 0)
            char = '▓' // Top format
          else if (ranking.indexOf(r) <= 2)
            char = '▒' // Top 3
          else
            char = '░' // Rest
        }
        return char
      })
      .join(' ')

    rows.push(`${yLabel}│ ${bars}`)
  }

  // X-axis
  const axis = `    └──${ranking.map(() => '┴').join('────')}──`
  rows.push(axis)

  // Format labels (split long names into multiple rows)
  const formatRow1 = ranking
    .map((r) => {
      const parts = r.format.split('-')
      return (parts[0] || '').padEnd(5).substring(0, 5)
    })
    .join('')
  rows.push(`      ${formatRow1}`)

  const formatRow2 = ranking
    .map((r) => {
      const parts = r.format.split('-')
      return (parts[1] || '').padEnd(5).substring(0, 5)
    })
    .join('')
  if (formatRow2.trim())
    rows.push(`      ${formatRow2}`)

  return rows.join('\n')
}

@@ -32,3 +32,10 @@ export interface FormatResult {
  correctCount: number
  totalCount: number
}

export interface EfficiencyRanking {
  format: string
  efficiency: number
  accuracy: number
  tokens: number
}

@@ -7,16 +7,25 @@ import { encode } from 'gpt-tokenizer'
 * @param value - Current value
 * @param max - Maximum value
 * @param width - Width of the bar in characters (default: 25)
 * @returns ASCII progress bar string (`█` for filled, `░` for empty)
 * @param chars - Characters to use for filled and empty sections
 * @param chars.filled - Character for filled portion (default: '█')
 * @param chars.empty - Character for empty portion (default: '░')
 * @returns ASCII progress bar string
 *
 * @example
 * createProgressBar(75, 100, 20) // "███████████████░░░░░"
 * createProgressBar(0.5, 1, 10) // "█████░░░░░"
 * createProgressBar(0.75, 1, 20, { filled: '▓', empty: '░' }) // "▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓░░░░░"
 */
export function createProgressBar(value: number, max: number, width = 25): string {
export function createProgressBar(
  value: number,
  max: number,
  width = 25,
  chars: { filled: string, empty: string } = { filled: '█', empty: '░' },
): string {
  const filled = Math.round((value / max) * width)
  const empty = width - filled
  return '█'.repeat(filled) + '░'.repeat(empty)
  return chars.filled.repeat(filled) + chars.empty.repeat(empty)
}

/**