docs: add accuracy per 1k tokens report (closes #72)

Johann Schopplich
2025-11-05 08:21:57 +01:00
parent 9268fdf3ef
commit af17efe128
8 changed files with 413 additions and 180 deletions

View File

@@ -62,12 +62,14 @@ For small payloads, JSON/CSV/YAML work fine. TOON's value emerges at scale: when
## Key Features
- 💸 **Token-efficient:** typically 30–60% fewer tokens than JSON
- 💸 **Token-efficient:** typically 30–60% fewer tokens than JSON[^1]
- 🤿 **LLM-friendly guardrails:** explicit lengths and fields enable validation
- 🍱 **Minimal syntax:** removes redundant punctuation (braces, brackets, most quotes)
- 📐 **Indentation-based structure:** like YAML, uses whitespace instead of braces
- 🧺 **Tabular arrays:** declare keys once, stream data as rows (see the sketch after this list)
[^1]: For flat tabular data, CSV is more compact. TOON adds minimal overhead to provide explicit structure and validation that improves LLM reliability.
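To make the tabular-array idea concrete, here is a small illustrative sketch (the `users` data is invented for this example; the `array[N]{fields}:` header followed by one comma-separated row per item mirrors the `metrics[5]{…}` line further down in this diff):

```
users[2]{id,name,role}:
  1,Alice,admin
  2,Bob,user
```

The keys appear once in the header instead of being repeated for every object, which is where most of the token savings over JSON come from.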
## Benchmarks
> [!TIP]
@@ -80,12 +82,10 @@ Token counts are measured using the GPT-5 `o200k_base` tokenizer via [`gpt-token
The benchmarks use datasets optimized for TOON's strengths (uniform tabular data). Real-world performance depends on your data structure.
> [!NOTE]
> CSV/TSV isn't shown in the token-efficiency chart because it doesn't encode nesting without flattening. For flat datasets, see CSV token counts in the [Retrieval Accuracy](#retrieval-accuracy) tables.
> CSV/TSV doesn't support nested structures, so it's not included in this comparison. For flat datasets where CSV applies, see token counts and accuracy metrics in the [Retrieval Accuracy](#retrieval-accuracy) section.
<!-- automd:file src="./benchmarks/results/token-efficiency.md" -->
### Token Efficiency
```
⭐ GitHub Repositories ██████████████░░░░░░░░░░░ 8,745 tokens
vs JSON (42.3%) 15,145
@@ -251,9 +251,28 @@ metrics[5]{date,views,clicks,conversions,revenue,bounceRate}:
<!-- /automd -->
### Retrieval Accuracy
<!-- automd:file src="./benchmarks/results/retrieval-accuracy.md" -->
### Retrieval Accuracy
Benchmarks test LLM comprehension across different input formats using 154 data retrieval questions on 4 models.
#### Efficiency Ranking (Accuracy per 1K Tokens)
Each format's overall performance, balancing accuracy against token cost:
```
toon ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓ 15.0 │ 70.1% acc │ 4,678 tokens
csv ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓░ 14.3 │ 67.7% acc │ 4,745 tokens
json-compact ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓░░░░░ 11.0 │ 65.3% acc │ 5,925 tokens
yaml ▓▓▓▓▓▓▓▓▓▓▓▓▓░░░░░░░ 9.4 │ 66.7% acc │ 7,091 tokens
json-pretty ▓▓▓▓▓▓▓▓▓▓░░░░░░░░░░ 7.5 │ 65.4% acc │ 8,713 tokens
xml ▓▓▓▓▓▓▓▓▓░░░░░░░░░░░ 6.8 │ 67.2% acc │ 9,944 tokens
```
TOON achieves **70.1%** accuracy (vs JSON's 65.4%) while using **46.3% fewer tokens**.
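The ranking score is simply accuracy (in percent) divided by token count (in thousands), matching the formula in `report.ts` later in this diff. A quick TypeScript check against the numbers above:

```ts
// Accuracy per 1K tokens, as used in the efficiency ranking.
function efficiency(accuracy: number, totalTokens: number): number {
  return (accuracy * 100) / (totalTokens / 1000)
}

console.log(efficiency(0.701, 4678).toFixed(1)) // "15.0" (toon)
console.log(efficiency(0.654, 8713).toFixed(1)) // "7.5"  (json-pretty)
console.log(((1 - 4678 / 8713) * 100).toFixed(1)) // "46.3" → percent fewer tokens
```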
#### Per-Model Accuracy
Accuracy across **4 LLMs** on 154 data retrieval questions:
@@ -915,7 +934,7 @@ By default, the decoder validates input strictly:
- Format familiarity and structure matter as much as token count. TOON's tabular format requires arrays of objects with identical keys and primitive values only. When this doesn't hold (due to mixed types, non-uniform objects, or nested structures), TOON switches to list format where JSON can be more efficient at scale.
- **TOON excels at:** Uniform arrays of objects (same fields, primitive values), especially large datasets with consistent structure.
- **JSON is better for:** Non-uniform data, deeply nested structures, and objects with varying field sets.
- **CSV is more compact for:** Flat, uniform tables without nesting. TOON adds minimal overhead (`[N]` length markers, delimiter scoping, deterministic quoting) to improve LLM reliability while staying close to CSV's token efficiency.
- **CSV is more compact for:** Flat, uniform tables without nesting. TOON adds structure (`[N]` length markers, delimiter scoping, deterministic quoting) that improves LLM reliability with minimal token overhead.
- **Token counts vary by tokenizer and model.** Benchmarks use a GPT-style tokenizer (cl100k/o200k); actual savings will differ with other models (e.g., [SentencePiece](https://github.com/google/sentencepiece)). See the sketch after this list.
- **TOON is designed for LLM input** where human readability and token efficiency matter. It's **not** a drop-in replacement for JSON in APIs or storage.
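As a rough illustration of how such token counts are produced, here is a minimal sketch using the `encode` function from `gpt-tokenizer` (the same helper `utils.ts` imports near the end of this diff). The sample payload and the hand-written TOON string are invented for this example, and absolute counts will differ with other tokenizers:

```ts
import { encode } from 'gpt-tokenizer'

// Same two records, serialized as JSON and as a hand-written TOON table.
const json = JSON.stringify([{ id: 1, name: 'Alice' }, { id: 2, name: 'Bob' }])
const toon = 'users[2]{id,name}:\n  1,Alice\n  2,Bob'

// encode() returns an array of token ids; its length is the token count.
console.log('json:', encode(json).length)
console.log('toon:', encode(toon).length)
```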

View File

@@ -1,4 +1,21 @@
### Retrieval Accuracy
Benchmarks test LLM comprehension across different input formats using 154 data retrieval questions on 4 models.
#### Efficiency Ranking (Accuracy per 1K Tokens)
Each format's overall performance, balancing accuracy against token cost:
```
toon ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓ 15.0 │ 70.1% acc │ 4,678 tokens
csv ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓░ 14.3 │ 67.7% acc │ 4,745 tokens
json-compact ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓░░░░░ 11.0 │ 65.3% acc │ 5,925 tokens
yaml ▓▓▓▓▓▓▓▓▓▓▓▓▓░░░░░░░ 9.4 │ 66.7% acc │ 7,091 tokens
json-pretty ▓▓▓▓▓▓▓▓▓▓░░░░░░░░░░ 7.5 │ 65.4% acc │ 8,713 tokens
xml ▓▓▓▓▓▓▓▓▓░░░░░░░░░░░ 6.8 │ 67.2% acc │ 9,944 tokens
```
TOON achieves **70.1%** accuracy (vs JSON's 65.4%) while using **46.3% fewer tokens**.
#### Per-Model Accuracy
Accuracy across **4 LLMs** on 154 data retrieval questions:

View File

@@ -1,5 +1,3 @@
### Token Efficiency
```
⭐ GitHub Repositories ██████████████░░░░░░░░░░░ 8,745 tokens
vs JSON (42.3%) 15,145

View File

@@ -1,15 +1,17 @@
import type { Question } from '../src/types'
import * as fsp from 'node:fs/promises'
import * as path from 'node:path'
import process from 'node:process'
import * as prompts from '@clack/prompts'
import PQueue from 'p-queue'
import { DEFAULT_CONCURRENCY, DRY_RUN, DRY_RUN_LIMITS, MODEL_RPM_LIMITS, ROOT_DIR } from '../src/constants'
import { BENCHMARKS_DIR, DEFAULT_CONCURRENCY, DRY_RUN, DRY_RUN_LIMITS, MODEL_RPM_LIMITS, ROOT_DIR } from '../src/constants'
import { datasets } from '../src/datasets'
import { evaluateQuestion, models } from '../src/evaluate'
import { formatters } from '../src/formatters'
import { generateQuestions } from '../src/questions'
import { calculateFormatResults, calculateTokenCounts, saveResults } from '../src/report'
import { calculateFormatResults, calculateTokenCounts, generateAccuracyReport } from '../src/report'
import { getAllModelResults, hasModelResults, saveModelResults } from '../src/storage'
import { ensureDir } from '../src/utils'
prompts.intro('Retrieval Accuracy Benchmark')
@@ -142,13 +144,15 @@ if (allResults.length === 0) {
process.exit(0)
}
// Calculate token counts freshly (deterministic, no need to persist)
const tokenCounts = calculateTokenCounts(formatters)
// Calculate format statistics and save report
const formatResults = calculateFormatResults(allResults, tokenCounts)
const resultsDir = await saveResults(allResults, formatResults, questions, tokenCounts)
const accuracyReport = generateAccuracyReport(allResults, formatResults, tokenCounts)
const reportPath = path.join(resultsDir, 'retrieval-accuracy.md')
const resultsDir = path.join(BENCHMARKS_DIR, 'results')
prompts.log.info(`Report saved to: \`${path.relative(ROOT_DIR, reportPath)}\``)
await ensureDir(resultsDir)
const outputFilePath = path.join(resultsDir, 'retrieval-accuracy.md')
await fsp.writeFile(outputFilePath, accuracyReport)
prompts.log.info(`Report saved to: \`${path.relative(ROOT_DIR, outputFilePath)}\``)
reportSpinner.stop('Report generation complete!')

View File

@@ -217,4 +217,4 @@ await ensureDir(resultsDir)
const outputFilePath = path.join(resultsDir, 'token-efficiency.md')
await fsp.writeFile(outputFilePath, markdown, 'utf-8')
prompts.log.success(`Result saved to \`${path.relative(ROOT_DIR, outputFilePath)}\``)
prompts.log.success(`Report saved to \`${path.relative(ROOT_DIR, outputFilePath)}\``)

View File

@@ -1,10 +1,30 @@
import type { EvaluationResult, FormatResult, Question } from './types'
import type { EfficiencyRanking, EvaluationResult, FormatResult, Question } from './types'
import * as fsp from 'node:fs/promises'
import { FORMATTER_DISPLAY_NAMES } from './constants'
import * as path from 'node:path'
import { BENCHMARKS_DIR, FORMATTER_DISPLAY_NAMES } from './constants'
import { datasets } from './datasets'
import { models } from './evaluate'
import { createProgressBar, ensureDir, tokenize } from './utils'
import { generateQuestions } from './questions'
import { createProgressBar, tokenize } from './utils'
const EFFICIENCY_CHART_STYLE: 'vertical' | 'horizontal' = 'horizontal'
/**
* Calculate token counts for all format+dataset combinations
*/
export function calculateTokenCounts(
formatters: Record<string, (data: unknown) => string>,
): Record<string, number> {
const tokenCounts: Record<string, number> = {}
for (const [formatName, formatter] of Object.entries(formatters)) {
for (const dataset of datasets) {
const formatted = formatter(dataset.data)
const key = `${formatName}-${dataset.name}`
tokenCounts[key] = tokenize(formatted)
}
}
return tokenCounts
}
/**
* Calculate per-format statistics from evaluation results
@@ -40,9 +60,80 @@ export function calculateFormatResults(
}
/**
* Generate embeddable markdown report from results
* Generate consolidated retrieval accuracy report
*/
export function generateMarkdownReport(
export function generateAccuracyReport(
results: EvaluationResult[],
formatResults: FormatResult[],
tokenCounts: Record<string, number>,
): string {
const questions = generateQuestions()
const totalQuestions = [...new Set(results.map(r => r.questionId))].length
const modelIds = models.map(m => m.modelId)
const modelNames = modelIds.filter(id => results.some(r => r.model === id))
return `
Benchmarks test LLM comprehension across different input formats using ${totalQuestions} data retrieval questions on ${modelNames.length} ${modelNames.length === 1 ? 'model' : 'models'}.
#### Efficiency Ranking (Accuracy per 1K Tokens)
${generateEfficiencyRankingReport(formatResults)}
#### Per-Model Accuracy
${generateDetailedAccuracyReport(formatResults, results, questions, tokenCounts)}
`.trimStart()
}
/**
* Generate efficiency ranking report
*/
function generateEfficiencyRankingReport(
formatResults: FormatResult[],
): string {
const toon = formatResults.find(r => r.format === 'toon')
const json = formatResults.find(r => r.format === 'json-pretty')
// Build efficiency ranking (accuracy per 1k tokens)
const efficiencyRanking = formatResults
.map((fr) => {
const efficiency = (fr.accuracy * 100) / (fr.totalTokens / 1000)
return {
format: fr.format,
efficiency,
accuracy: fr.accuracy,
tokens: fr.totalTokens,
}
})
.sort((a, b) => b.efficiency - a.efficiency)
const efficiencyChart = EFFICIENCY_CHART_STYLE === 'vertical'
? generateVerticalEfficiencyChart(efficiencyRanking)
: generateHorizontalEfficiencyChart(efficiencyRanking)
// Build summary text
let summary = ''
if (toon && json) {
const toonVsJson = `**${(toon.accuracy * 100).toFixed(1)}%** accuracy (vs JSON's ${(json.accuracy * 100).toFixed(1)}%)`
const tokenSavings = `**${((1 - toon.totalTokens / json.totalTokens) * 100).toFixed(1)}% fewer tokens**`
summary = `TOON achieves ${toonVsJson} while using ${tokenSavings}.`
}
return `
Each format's overall performance, balancing accuracy against token cost:
\`\`\`
${efficiencyChart}
\`\`\`
${summary}
`.trim()
}
/**
* Generate detailed accuracy report with breakdowns and methodology
*/
function generateDetailedAccuracyReport(
formatResults: FormatResult[],
results: EvaluationResult[],
questions: Question[],
@@ -54,125 +145,17 @@ export function generateMarkdownReport(
const modelIds = models.map(m => m.modelId)
const modelNames = modelIds.filter(id => results.some(r => r.model === id))
const maxDisplayNameWidth = Math.max(
// Generate model breakdown section
...Object.values(FORMATTER_DISPLAY_NAMES).map(name => name.length),
const modelBreakdown = generateModelBreakdown(formatResults, results, modelNames)
)
const progressBarWidth = 20
const modelBreakdown = modelNames.map((modelName, i) => {
// Generate summary comparison
const modelResults = formatResults.map((fr) => {
const summaryComparison = generateSummaryComparison(toon, json)
const modelFormatResults = results.filter(r => r.model === modelName && r.format === fr.format)
const correctCount = modelFormatResults.filter(r => r.isCorrect).length
const totalCount = modelFormatResults.length
const accuracy = totalCount > 0 ? correctCount / totalCount : 0
return {
// Generate performance by dataset
format: fr.format,
const datasetBreakdown = generateDatasetBreakdown(formatResults, results, questions, tokenCounts)
accuracy,
correctCount,
totalCount,
}
}).sort((a, b) => b.accuracy - a.accuracy)
const formatLines = modelResults.map((result) => {
// Generate performance by model
const bar = createProgressBar(result.accuracy, 1, progressBarWidth)
const modelPerformance = generateModelPerformanceTable(formatResults, results, modelNames)
const accuracyString = `${(result.accuracy * 100).toFixed(1)}%`.padStart(6)
const countString = `(${result.correctCount}/${result.totalCount})`
const prefix = result.format === 'toon' ? '→ ' : ' '
const displayName = FORMATTER_DISPLAY_NAMES[result.format] || result.format
return `${prefix}${displayName.padEnd(maxDisplayNameWidth)} ${bar} ${accuracyString} ${countString}`
}).join('\n')
// Add blank line before model name, except for first model
return `${i > 0 ? '\n' : ''}${modelName}\n${formatLines}`
}).join('\n')
// Build summary comparison
const summaryComparison = toon && json
? `**Key tradeoff:** TOON achieves **${(toon.accuracy * 100).toFixed(1)}% accuracy** (vs JSON's ${(json.accuracy * 100).toFixed(1)}%) while using **${((1 - toon.totalTokens / json.totalTokens) * 100).toFixed(1)}% fewer tokens** on these datasets.`
: ''
// Build performance by dataset
const datasetBreakdown = datasets.map((dataset) => {
const datasetResults = formatResults.map((fr) => {
const datasetFormatResults = results.filter(r => r.questionId.includes(dataset.name) || questions.find(q => q.id === r.questionId)?.dataset === dataset.name)
if (datasetFormatResults.length === 0)
return undefined
const formatDatasetResults = datasetFormatResults.filter(r => r.format === fr.format)
if (formatDatasetResults.length === 0)
return undefined
const correctCount = formatDatasetResults.filter(r => r.isCorrect).length
const totalCount = formatDatasetResults.length
const accuracy = totalCount > 0 ? correctCount / totalCount : 0
// Get token count for this dataset+format
const tokenKey = `${fr.format}-${dataset.name}`
const tokens = tokenCounts[tokenKey] || fr.totalTokens
return {
format: fr.format,
accuracy,
tokens,
correctCount,
totalCount,
}
}).filter(Boolean) as { format: string, accuracy: number, tokens: number, correctCount: number, totalCount: number }[]
if (datasetResults.length === 0)
return ''
// Sort by efficiency
datasetResults.sort((a, b) => {
const effA = (a.accuracy ** 2) / (a.tokens / 1000)
const effB = (b.accuracy ** 2) / (b.tokens / 1000)
return effB - effA
})
const tableRows = datasetResults.slice(0, 6).map(result =>
`| \`${result.format}\` | ${(result.accuracy * 100).toFixed(1)}% | ${result.tokens.toLocaleString('en-US')} | ${result.correctCount}/${result.totalCount} |`,
).join('\n')
return `
##### ${dataset.description}
| Format | Accuracy | Tokens | Correct/Total |
| ------ | -------- | ------ | ------------- |
${tableRows}
`.trimStart()
}).filter(Boolean).join('\n').trim()
// Build performance by model
const modelPerformance = modelNames.map((modelName) => {
const modelResults = formatResults.map((fr) => {
const modelFormatResults = results.filter(r => r.model === modelName && r.format === fr.format)
const correctCount = modelFormatResults.filter(r => r.isCorrect).length
const totalCount = modelFormatResults.length
const accuracy = correctCount / totalCount
return {
format: fr.format,
accuracy,
correctCount,
totalCount,
}
}).sort((a, b) => b.accuracy - a.accuracy)
const tableRows = modelResults.map(result =>
`| \`${result.format}\` | ${(result.accuracy * 100).toFixed(1)}% | ${result.correctCount}/${result.totalCount} |`,
).join('\n')
return `
##### ${modelName}
| Format | Accuracy | Correct/Total |
| ------ | -------- | ------------- |
${tableRows}
`.trimStart()
}).join('\n').trim()
// Calculate total unique questions
const totalQuestions = [...new Set(results.map(r => r.questionId))].length
// Calculate question type distribution
@@ -195,8 +178,6 @@ ${tableRows}
const totalEvaluations = totalQuestions * formatCount * modelNames.length
return `
### Retrieval Accuracy
Accuracy across **${modelNames.length} ${modelNames.length === 1 ? 'LLM' : 'LLMs'}** on ${totalQuestions} data retrieval questions:
\`\`\`
@@ -266,47 +247,245 @@ ${totalQuestions} questions are generated dynamically across three categories:
- **Total evaluations**: ${totalQuestions} questions × ${formatCount} formats × ${modelNames.length} models = ${totalEvaluations.toLocaleString('en-US')} LLM calls
</details>
`.trimStart()
`.trim()
}
/**
* Calculate token counts for all format+dataset combinations
* Generate ASCII bar chart showing per-model accuracy across formats
*/
export function calculateTokenCounts(
function generateModelBreakdown(
formatters: Record<string, (data: unknown) => string>,
): Record<string, number> {
const tokenCounts: Record<string, number> = {}
for (const [formatName, formatter] of Object.entries(formatters)) {
for (const dataset of datasets) {
const formatted = formatter(dataset.data)
const key = `${formatName}-${dataset.name}`
tokenCounts[key] = tokenize(formatted)
}
}
return tokenCounts
}
/**
* Save results to disk
*
* @remarks
* Per-model results are managed separately via storage.ts
* This function only generates the aggregated markdown report
*/
export async function saveResults(
results: EvaluationResult[],
formatResults: FormatResult[],
results: EvaluationResult[],
modelNames: string[],
): string {
const maxDisplayNameWidth = Math.max(
...Object.values(FORMATTER_DISPLAY_NAMES).map(name => name.length),
)
const progressBarWidth = 20
return modelNames.map((modelName, i) => {
const modelResults = formatResults.map((fr) => {
const modelFormatResults = results.filter(r => r.model === modelName && r.format === fr.format)
const correctCount = modelFormatResults.filter(r => r.isCorrect).length
const totalCount = modelFormatResults.length
const accuracy = totalCount > 0 ? correctCount / totalCount : 0
return {
format: fr.format,
accuracy,
correctCount,
totalCount,
}
}).sort((a, b) => b.accuracy - a.accuracy)
const formatLines = modelResults.map((result) => {
const bar = createProgressBar(result.accuracy, 1, progressBarWidth)
const accuracyString = `${(result.accuracy * 100).toFixed(1)}%`.padStart(6)
const countString = `(${result.correctCount}/${result.totalCount})`
const prefix = result.format === 'toon' ? '→ ' : ' '
const displayName = FORMATTER_DISPLAY_NAMES[result.format] || result.format
return `${prefix}${displayName.padEnd(maxDisplayNameWidth)} ${bar} ${accuracyString} ${countString}`
}).join('\n')
// Add blank line before model name, except for first model
return `${i > 0 ? '\n' : ''}${modelName}\n${formatLines}`
}).join('\n')
}
/**
* Generate summary comparison between TOON and JSON formats
*/
function generateSummaryComparison(
toon: FormatResult | undefined,
json: FormatResult | undefined,
): string {
if (!toon || !json)
return ''
return `**Key tradeoff:** TOON achieves **${(toon.accuracy * 100).toFixed(1)}% accuracy** (vs JSON's ${(json.accuracy * 100).toFixed(1)}%) while using **${((1 - toon.totalTokens / json.totalTokens) * 100).toFixed(1)}% fewer tokens** on these datasets.`
}
/**
* Generate per-dataset performance breakdown tables
*/
function generateDatasetBreakdown(
formatResults: FormatResult[],
results: EvaluationResult[],
questions: Question[],
tokenCounts: Record<string, number>,
): Promise<string> {
): string {
const resultsDir = path.join(BENCHMARKS_DIR, 'results')
return datasets.map((dataset) => {
await ensureDir(resultsDir)
const datasetResults = formatResults.map((fr) => {
const datasetFormatResults = results.filter(r => r.questionId.includes(dataset.name) || questions.find(q => q.id === r.questionId)?.dataset === dataset.name)
if (datasetFormatResults.length === 0)
return undefined
// Generate markdown report from all available model results
const formatDatasetResults = datasetFormatResults.filter(r => r.format === fr.format)
const report = generateMarkdownReport(formatResults, results, questions, tokenCounts)
if (formatDatasetResults.length === 0)
await fsp.writeFile(path.join(resultsDir, 'retrieval-accuracy.md'), report)
return undefined
return resultsDir
const correctCount = formatDatasetResults.filter(r => r.isCorrect).length
const totalCount = formatDatasetResults.length
const accuracy = totalCount > 0 ? correctCount / totalCount : 0
// Get token count for this dataset+format
const tokenKey = `${fr.format}-${dataset.name}`
const tokens = tokenCounts[tokenKey] || fr.totalTokens
return {
format: fr.format,
accuracy,
tokens,
correctCount,
totalCount,
}
}).filter(Boolean) as { format: string, accuracy: number, tokens: number, correctCount: number, totalCount: number }[]
if (datasetResults.length === 0)
return ''
// Sort by efficiency
datasetResults.sort((a, b) => {
const effA = (a.accuracy ** 2) / (a.tokens / 1000)
const effB = (b.accuracy ** 2) / (b.tokens / 1000)
return effB - effA
})
const tableRows = datasetResults.slice(0, 6).map(result =>
`| \`${result.format}\` | ${(result.accuracy * 100).toFixed(1)}% | ${result.tokens.toLocaleString('en-US')} | ${result.correctCount}/${result.totalCount} |`,
).join('\n')
return `
##### ${dataset.description}
| Format | Accuracy | Tokens | Correct/Total |
| ------ | -------- | ------ | ------------- |
${tableRows}
`.trimStart()
}).filter(Boolean).join('\n').trim()
}
/**
* Generate per-model performance comparison tables
*/
function generateModelPerformanceTable(
formatResults: FormatResult[],
results: EvaluationResult[],
modelNames: string[],
): string {
return modelNames.map((modelName) => {
const modelResults = formatResults.map((fr) => {
const modelFormatResults = results.filter(r => r.model === modelName && r.format === fr.format)
const correctCount = modelFormatResults.filter(r => r.isCorrect).length
const totalCount = modelFormatResults.length
const accuracy = correctCount / totalCount
return {
format: fr.format,
accuracy,
correctCount,
totalCount,
}
}).sort((a, b) => b.accuracy - a.accuracy)
const tableRows = modelResults.map(result =>
`| \`${result.format}\` | ${(result.accuracy * 100).toFixed(1)}% | ${result.correctCount}/${result.totalCount} |`,
).join('\n')
return `
##### ${modelName}
| Format | Accuracy | Correct/Total |
| ------ | -------- | ------------- |
${tableRows}
`.trimStart()
}).join('\n').trim()
}
/**
* Generate horizontal bar chart for efficiency ranking
*/
function generateHorizontalEfficiencyChart(
ranking: EfficiencyRanking[],
): string {
const barWidth = 20
const maxEfficiency = Math.max(...ranking.map(r => r.efficiency))
const maxFormatWidth = Math.max(...ranking.map(r => r.format.length))
return ranking
.map((r) => {
const normalizedValue = r.efficiency / maxEfficiency
const bar = createProgressBar(normalizedValue, 1, barWidth, { filled: '▓', empty: '░' })
const formatName = r.format.padEnd(maxFormatWidth)
const efficiency = r.efficiency.toFixed(1).padStart(4)
const accuracy = `${(r.accuracy * 100).toFixed(1)}%`.padStart(5)
const tokens = r.tokens.toLocaleString('en-US').padStart(5)
return `${formatName} ${bar} ${efficiency} │ ${accuracy} acc │ ${tokens} tokens`
})
.join('\n')
}
/**
* Generate vertical bar chart for efficiency ranking
*/
function generateVerticalEfficiencyChart(
ranking: EfficiencyRanking[],
): string {
const maxEfficiency = Math.max(...ranking.map(r => r.efficiency))
const chartHeight = 8
// Generate rows from top to bottom
const rows: string[] = []
// Y-axis and bars
for (let i = chartHeight; i >= 0; i--) {
const threshold = (i / chartHeight) * maxEfficiency
const yLabel = i === chartHeight || i === Math.floor(chartHeight / 2) || i === 0
? Math.round(threshold).toString().padStart(4)
: ' '
const bars = ranking
.map((r) => {
const barHeight = (r.efficiency / maxEfficiency) * chartHeight
let char = ' '
if (barHeight >= i) {
// Use different characters for visual distinction
if (ranking.indexOf(r) === 0)
char = '▓' // Top format
else if (ranking.indexOf(r) <= 2)
char = '▒' // Top 3
else
char = '░' // Rest
}
return char
})
.join(' ')
rows.push(`${yLabel}${bars}`)
}
// X-axis
const axis = ` └──${ranking.map(() => '┴').join('────')}──`
rows.push(axis)
// Format labels (split long names into multiple rows)
const formatRow1 = ranking
.map((r) => {
const parts = r.format.split('-')
return (parts[0] || '').padEnd(5).substring(0, 5)
})
.join('')
rows.push(` ${formatRow1}`)
const formatRow2 = ranking
.map((r) => {
const parts = r.format.split('-')
return (parts[1] || '').padEnd(5).substring(0, 5)
})
.join('')
if (formatRow2.trim())
rows.push(` ${formatRow2}`)
return rows.join('\n')
}

View File

@@ -32,3 +32,10 @@ export interface FormatResult {
correctCount: number
totalCount: number
}
export interface EfficiencyRanking {
format: string
efficiency: number
accuracy: number
tokens: number
}

View File

@@ -7,16 +7,25 @@ import { encode } from 'gpt-tokenizer'
* @param value - Current value
* @param max - Maximum value
* @param width - Width of the bar in characters (default: 25)
* @returns ASCII progress bar string (`█` for filled, `░` for empty)
* @param chars - Characters to use for filled and empty sections
* @param chars.filled - Character for filled portion (default: '█')
* @param chars.empty - Character for empty portion (default: '░')
* @returns ASCII progress bar string
*
* @example
* createProgressBar(75, 100, 20) // "███████████████░░░░░"
* createProgressBar(0.5, 1, 10) // "█████░░░░░"
* createProgressBar(0.75, 1, 20, { filled: '▓', empty: '░' }) // "▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓░░░░░"
*/
export function createProgressBar(value: number, max: number, width = 25): string {
export function createProgressBar(
value: number,
max: number,
width = 25,
chars: { filled: string, empty: string } = { filled: '█', empty: '░' },
): string {
const filled = Math.round((value / max) * width)
const empty = width - filled
return '█'.repeat(filled) + '░'.repeat(empty)
return chars.filled.repeat(filled) + chars.empty.repeat(empty)
}
/**