diff --git a/README.md b/README.md index 3eb1226..36f1dc0 100644 --- a/README.md +++ b/README.md @@ -60,6 +60,19 @@ For small payloads, JSON/CSV/YAML work fine. TOON's value emerges at scale: when +
+## When NOT to use TOON + +TOON excels with uniform arrays of objects, but there are cases where other formats are better: + +- **Deeply nested or non-uniform structures** (tabular eligibility ≈ 0%): compact JSON often uses fewer tokens. Example: complex configuration objects with many nested levels. +- **Semi-uniform arrays** (~40–60% tabular eligibility): token savings diminish. Prefer JSON if your pipelines already rely on it. +- **Flat CSV use cases**: CSV is smaller than TOON for pure tabular data. TOON adds minimal overhead (~5–10%) in exchange for structure (length markers, field headers, delimiter scoping) that improves LLM reliability. + +See [benchmarks](#benchmarks) for concrete comparisons across different data structures. +
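The rules of thumb above can be condensed into a small decision helper. This is an illustrative sketch only — `suggestFormat` and its thresholds are not part of the library; the input is the tabular-eligibility percentage that the benchmarks report per dataset:

```typescript
// Illustrative format chooser based on the guidance above (hypothetical helper).
// Thresholds are approximate; tune them for your own data.
type Format = 'csv' | 'toon' | 'json-compact'

function suggestFormat(tabularEligibilityPercent: number, isFlat: boolean): Format {
  // Pure tabular, flat data: CSV is smallest (TOON adds ~5-10% for structure)
  if (isFlat && tabularEligibilityPercent === 100)
    return 'csv'
  // Mostly uniform arrays of objects: TOON's tabular form pays off
  if (tabularEligibilityPercent >= 60)
    return 'toon'
  // Deeply nested or semi-uniform data: compact JSON is often cheaper
  return 'json-compact'
}
```

Note the trade-off on the CSV branch: CSV wins on raw token count for flat data, but TOON's length markers and field headers buy retrieval reliability.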
+ ## Key Features - 💸 **Token-efficient:** typically 30–60% fewer tokens than JSON[^1] @@ -75,14 +88,16 @@ For small payloads, JSON/CSV/YAML work fine. TOON's value emerges at scale: when > [!TIP] > Try the interactive [Format Tokenization Playground](https://www.curiouslychase.com/playground/format-tokenization-exploration) to compare token usage across CSV, JSON, YAML, and TOON with your own data. +Benchmarks are organized into two tracks to ensure fair comparisons: + +- **Mixed-Structure Track**: Datasets with nested or semi-uniform structures (TOON vs JSON, YAML, XML). CSV excluded as it cannot properly represent these structures. +- **Flat-Only Track**: Datasets with flat tabular structures where CSV is applicable (CSV vs TOON vs JSON, YAML, XML). + ### Token Efficiency Token counts are measured using the GPT-5 `o200k_base` tokenizer via [`gpt-tokenizer`](https://github.com/niieani/gpt-tokenizer). Savings are calculated against formatted JSON (2-space indentation) as the primary baseline, with additional comparisons to compact JSON (minified), YAML, and XML. Actual savings vary by model and tokenizer. -The benchmarks use datasets optimized for TOON's strengths (uniform tabular data). Real-world performance depends on your data structure. - -> [!NOTE] -> CSV/TSV doesn't support nested structures, so it's not included in this comparison. For flat datasets where CSV applies, see token counts and accuracy metrics in the [Retrieval Accuracy](#retrieval-accuracy) section. +The benchmarks test datasets across different structural patterns (uniform, semi-uniform, nested, deeply nested) to show where TOON excels and where other formats may be better. diff --git a/benchmarks/README.md b/benchmarks/README.md index 2a80526..81e78c5 100644 --- a/benchmarks/README.md +++ b/benchmarks/README.md @@ -34,8 +34,8 @@ Results are saved to `results/token-efficiency.md`.
Tests how well LLMs can answer questions about data in different formats (TOON, JSON, JSON compact, XML, YAML, CSV): -1. Generate ~150-160 questions across 4 datasets -2. Convert each dataset to all 6 formats +1. Generate ~150-160 questions across 6 datasets (CSV only included for datasets with flat/tabular structure) +2. Convert each dataset to all supported formats 3. Query each LLM with formatted data + question 4. Validate answers using `gpt-5-nano` as judge 5. Aggregate metrics and generate report diff --git a/benchmarks/results/token-efficiency.md b/benchmarks/results/token-efficiency.md index 8b7e9a2..5808256 100644 --- a/benchmarks/results/token-efficiency.md +++ b/benchmarks/results/token-efficiency.md @@ -1,36 +1,149 @@ + +## Mixed-Structure Track + +Datasets with nested or semi-uniform structures. CSV excluded as it cannot properly represent these structures. + ``` -⭐ GitHub Repositories ██████████████░░░░░░░░░░░ 8,745 tokens - vs JSON (−42.3%) 15,145 - vs JSON compact (−23.7%) 11,455 - vs YAML (−33.4%) 13,129 - vs XML (−48.8%) 17,095 +🛒 E-commerce orders with nested structures [eligibility: 33%] +toon ▓▓▓▓▓▓▓▓▓▓▓▓░░░░░░░░ 58,528 tokens + vs JSON (−37.9%) 94,207 + vs JSON compact (+0.9%) 57,979 + vs YAML (−17.8%) 71,223 + vs XML (−45.2%) 106,720 -📈 Daily Analytics ██████████░░░░░░░░░░░░░░░ 4,507 tokens - vs JSON (−58.9%) 10,977 - vs JSON compact (−35.7%) 7,013 - vs YAML (−48.8%) 8,810 - vs XML (−65.7%) 13,128 +🧾 Semi-uniform event logs [eligibility: 50%] +toon ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓░░░ 154,419 tokens + vs JSON (−15.0%) 181,592 + vs JSON compact (+19.9%) 128,836 + vs YAML (−0.9%) 155,749 + vs XML (−25.1%) 206,271 -🛒 E-Commerce Order ████████████████░░░░░░░░░ 166 tokens - vs JSON (−35.4%) 257
- vs JSON compact (−2.9%) 171 - vs YAML (−15.7%) 197 - vs XML (−38.7%) 271 +🧩 Deeply nested configuration [eligibility: 0%] +toon ▓▓▓▓▓▓▓▓▓▓▓▓▓▓░░░░░░ 630 tokens + vs JSON (−31.4%) 918 + vs JSON compact (+11.9%) 563 + vs YAML (−6.4%) 673 + vs XML (−37.4%) 1,007 -─────────────────────────────────────────────────────────────────────────
-Total ██████████████░░░░░░░░░░░ 13,418 tokens - vs JSON (−49.1%) 26,379 - vs JSON compact (−28.0%) 18,639 - vs YAML (−39.4%) 22,136 - vs XML (−56.0%) 30,494 +─────────────────────────────────────────────────────────────────────────────── +Total +toon ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓░░░░░ 213,577 tokens + vs JSON (−22.8%) 276,717 + vs JSON compact (+14.0%) 187,378 + vs YAML (−6.2%) 227,645 + vs XML (−32.0%) 313,998 ``` +## Flat-Only Track + +Datasets with flat tabular structures where CSV is applicable.
+ +``` +👥 Uniform employee records (TOON optimal format) [eligibility: 100%] +csv ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓░ 46,968 tokens +toon ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓ 49,841 tokens (+5.8% vs CSV) + vs JSON (−60.7%) 126,886 + vs JSON compact (−36.8%) 78,882 + vs YAML (−50.0%) 99,743 + vs XML (−66.0%) 146,465 + +📈 Time-series analytics data [eligibility: 100%] +csv ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓░░ 8,382 tokens +toon ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓ 9,114 tokens (+8.0% vs CSV) + vs JSON (−59.0%) 22,244 + vs JSON compact (−35.9%) 14,210 + vs YAML (−49.0%) 17,857 + vs XML (−65.8%) 26,615 + +⭐ Top 100 GitHub repositories [eligibility: 100%] +csv ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓░ 8,513 tokens +toon ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓ 8,745 tokens (+2.7% vs CSV) + vs JSON (−42.3%) 15,145 + vs JSON compact (−23.7%) 11,455 + vs YAML (−33.4%) 13,129 + vs XML (−48.8%) 17,095 + +─────────────────────────────────────────────────────────────────────────────── +Total +csv ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓░ 63,863 tokens +toon ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓ 67,700 tokens (+5.7% vs CSV) + vs JSON (−58.8%) 164,275 + vs JSON compact (−35.2%) 104,547 + vs YAML (−48.2%) 130,729 + vs XML (−64.4%) 190,175 +``` + +
View detailed examples -#### ⭐ GitHub Repositories +#### 📈 Time-series analytics data -**Configuration:** Top 100 GitHub repositories with stars, forks, and metadata +**Savings:** 13,130 tokens (59.0% reduction vs JSON) + +**JSON** (22,244 tokens): + +```json +{ + "metrics": [ + { + "date": "2025-01-01", + "views": 4324, + "clicks": 146, + "conversions": 21, + "revenue": 3834.57, + "bounceRate": 0.4 + }, + { + "date": "2025-01-02", + "views": 6248, + "clicks": 407, + "conversions": 22, + "revenue": 2936.12, + "bounceRate": 0.62 + }, + { + "date": "2025-01-03", + "views": 7382, + "clicks": 270, + "conversions": 24, + "revenue": 6825.19, + "bounceRate": 0.7 + }, + { + "date": "2025-01-04", + "views": 4586, + "clicks": 267, + "conversions": 24, + "revenue": 2391.11, + "bounceRate": 0.64 + }, + { + "date": "2025-01-05", + "views": 6171, + "clicks": 227, + "conversions": 12, + "revenue": 3430.1, + "bounceRate": 0.39 + } + ] +} +``` + +**TOON** (9,114 tokens): + +``` +metrics[5]{date,views,clicks,conversions,revenue,bounceRate}: + 2025-01-01,4324,146,21,3834.57,0.4 + 2025-01-02,6248,407,22,2936.12,0.62 + 2025-01-03,7382,270,24,6825.19,0.7 + 2025-01-04,4586,267,24,2391.11,0.64 + 2025-01-05,6171,227,12,3430.1,0.39 +``` + +--- + +#### ⭐ Top 100 GitHub repositories **Savings:** 6,400 tokens (42.3% reduction vs JSON) @@ -91,72 +204,4 @@ repositories[3]{id,name,repo,description,createdAt,updatedAt,pushedAt,stars,watc 21737465,awesome,sindresorhus/awesome,😎 Awesome lists about all kinds of interesting topics,"2014-07-11T13:42:37Z","2025-10-28T12:40:21Z","2025-10-27T17:57:31Z",410052,8017,32029,main ``` ---- - -#### 📈 Daily Analytics - -**Configuration:** 180 days of web metrics (views, clicks, conversions, revenue) - -**Savings:** 6,470 tokens (58.9% reduction vs JSON) - -**JSON** (10,977 tokens): - -```json -{ - "metrics": [ - { - "date": "2025-01-01", - "views": 6890, - "clicks": 401, - "conversions": 23, - "revenue": 6015.59, - "bounceRate": 0.63 - }, - {
"date": "2025-01-02", - "views": 6940, - "clicks": 323, - "conversions": 37, - "revenue": 9086.44, - "bounceRate": 0.36 - }, - { - "date": "2025-01-03", - "views": 4390, - "clicks": 346, - "conversions": 26, - "revenue": 6360.75, - "bounceRate": 0.48 - }, - { - "date": "2025-01-04", - "views": 3429, - "clicks": 231, - "conversions": 13, - "revenue": 2360.96, - "bounceRate": 0.65 - }, - { - "date": "2025-01-05", - "views": 5804, - "clicks": 186, - "conversions": 22, - "revenue": 2535.96, - "bounceRate": 0.37 - } - ] -} -``` - -**TOON** (4,507 tokens): - -``` -metrics[5]{date,views,clicks,conversions,revenue,bounceRate}: - 2025-01-01,6890,401,23,6015.59,0.63 - 2025-01-02,6940,323,37,9086.44,0.36 - 2025-01-03,4390,346,26,6360.75,0.48 - 2025-01-04,3429,231,13,2360.96,0.65 - 2025-01-05,5804,186,22,2535.96,0.37 -``` -
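The savings percentages throughout the report above all come from one formula: the token reduction relative to each baseline format. A minimal sketch, using the published numbers from two of the rows:

```typescript
// savings% = (baseline − TOON) / baseline × 100, as printed in the charts
function savingsPercent(baselineTokens: number, toonTokens: number): number {
  return ((baselineTokens - toonTokens) / baselineTokens) * 100
}

// GitHub repositories row: TOON 8,745 tokens vs pretty-printed JSON 15,145
console.log(savingsPercent(15_145, 8_745).toFixed(1)) // "42.3"

// Time-series analytics row: TOON 9,114 tokens vs JSON 22,244
console.log(savingsPercent(22_244, 9_114).toFixed(1)) // "59.0"
```

The positive "+x.x%" entries in the mixed-structure track are the same quantity with a negative result: there, TOON uses more tokens than the baseline (e.g. compact JSON on deeply nested data).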
diff --git a/benchmarks/scripts/accuracy-benchmark.ts b/benchmarks/scripts/accuracy-benchmark.ts index ad4f5db..4e273bb 100644 --- a/benchmarks/scripts/accuracy-benchmark.ts +++ b/benchmarks/scripts/accuracy-benchmark.ts @@ -5,16 +5,83 @@ import process from 'node:process' import * as prompts from '@clack/prompts' import PQueue from 'p-queue' import { BENCHMARKS_DIR, DEFAULT_CONCURRENCY, DRY_RUN, DRY_RUN_LIMITS, MODEL_RPM_LIMITS, ROOT_DIR } from '../src/constants' -import { datasets } from '../src/datasets' +import { ACCURACY_DATASETS } from '../src/datasets' import { evaluateQuestion, models } from '../src/evaluate' -import { formatters } from '../src/formatters' +import { formatters, supportsCSV } from '../src/formatters' import { generateQuestions } from '../src/questions' import { calculateFormatResults, calculateTokenCounts, generateAccuracyReport } from '../src/report' import { getAllModelResults, hasModelResults, saveModelResults } from '../src/storage' import { ensureDir } from '../src/utils' +// Constants +const PROGRESS_UPDATE_INTERVAL = 10 +const RATE_LIMIT_INTERVAL_MS = 60_000 + prompts.intro('Retrieval Accuracy Benchmark') +/** + * Generate evaluation tasks for a model + */ +function generateEvaluationTasks(questions: Question[]): { question: Question, formatName: string }[] { + const tasks: { question: Question, formatName: string }[] = [] + + for (const question of questions) { + for (const formatName of Object.keys(formatters)) { + // Skip CSV for datasets that don't support it + const dataset = ACCURACY_DATASETS.find(d => d.name === question.dataset) + if (formatName === 'csv' && dataset && !supportsCSV(dataset)) + continue + + tasks.push({ question, formatName }) + } + } + + return tasks +} + +/** + * Check which models already have saved results + */ +async function checkExistingResults(activeModels: typeof models) { + const existingModelResults: Record<string, boolean> = {} + + for (const model of activeModels) { + const existingResult = await
hasModelResults(model.modelId) + if (existingResult) + existingModelResults[model.modelId] = existingResult + } + + return existingModelResults +} + +/** + * Create a progress updater function + */ +function createProgressUpdater(spinner: ReturnType<typeof prompts.spinner>, total: number) { + let completed = 0 + + return () => { + completed++ + if (completed % PROGRESS_UPDATE_INTERVAL === 0 || completed === total) { + const percent = ((completed / total) * 100).toFixed(1) + spinner.message(`Progress: ${completed}/${total} (${percent}%)`) + } + } +} + +/** + * Create a rate-limited queue for model evaluation + */ +function createEvaluationQueue(modelId: string) { + const rpmLimit = MODEL_RPM_LIMITS[modelId] + + return new PQueue({ + concurrency: DEFAULT_CONCURRENCY, + intervalCap: rpmLimit ?? Infinity, + interval: rpmLimit ? RATE_LIMIT_INTERVAL_MS : 0, + }) +} + // Prompt user to select which models to benchmark const modelChoices = models.map(({ modelId }) => ({ value: modelId, @@ -37,15 +104,10 @@ const activeModels = models.filter(m => selectedModels.includes(m.modelId)) prompts.log.info(`Selected ${activeModels.length} model(s): ${activeModels.map(m => m.modelId).join(', ')}`) // Check which models already have results -const existingModelResults: Record<string, boolean> = {} -for (const model of activeModels) { - const existingResult = await hasModelResults(model.modelId) - if (existingResult) - existingModelResults[model.modelId] = existingResult -} +const existingModelResults = await checkExistingResults(activeModels) if (Object.keys(existingModelResults).length > 0) { - prompts.log.info(`Found existing results for ${Object.values(existingModelResults).length} model(s)`) + prompts.log.info(`Found existing results for ${Object.keys(existingModelResults).length} model(s)`) } if (DRY_RUN) { @@ -75,31 +137,22 @@ for (const model of activeModels) { prompts.log.step(`Running benchmark for ${modelId}`) // Generate evaluation tasks for this model - const tasks: { question: Question, formatName: string }[] =
[] - for (const question of questions) { - for (const [formatName] of Object.entries(formatters)) { - tasks.push({ question, formatName }) - } - } + const tasks = generateEvaluationTasks(questions) const total = tasks.length const rpmLimit = MODEL_RPM_LIMITS[modelId] - const queue = new PQueue({ - concurrency: DEFAULT_CONCURRENCY, - intervalCap: rpmLimit ?? Infinity, - interval: rpmLimit ? 60_000 : 0, - }) + const queue = createEvaluationQueue(modelId) const evalSpinner = prompts.spinner() evalSpinner.start(`Running ${total} evaluations (concurrency: ${DEFAULT_CONCURRENCY}, RPM limit: ${rpmLimit ?? 'unlimited'})`) - let completed = 0 + const updateProgress = createProgressUpdater(evalSpinner, total) // Queue all tasks const modelResultPromises = tasks.map(task => queue.add(async () => { // Format data on-demand - const dataset = datasets.find(d => d.name === task.question.dataset)! + const dataset = ACCURACY_DATASETS.find(d => d.name === task.question.dataset)! const formatter = formatters[task.formatName]! 
const formattedData = formatter(dataset.data) @@ -111,11 +164,7 @@ for (const model of activeModels) { }) // Progress update after task completes - completed++ - if (completed % 10 === 0 || completed === total) { - const percent = ((completed / total) * 100).toFixed(1) - evalSpinner.message(`Progress: ${completed}/${total} (${percent}%)`) - } + updateProgress() return result }), @@ -154,5 +203,5 @@ await ensureDir(resultsDir) const outputFilePath = path.join(resultsDir, 'retrieval-accuracy.md') await fsp.writeFile(outputFilePath, accuracyReport) -prompts.log.info(`Report saved to: \`${path.relative(ROOT_DIR, outputFilePath)}\``) reportSpinner.stop('Report generation complete!') +prompts.log.info(`Report saved to: \`${path.relative(ROOT_DIR, outputFilePath)}\``) diff --git a/benchmarks/scripts/token-efficiency-benchmark.ts b/benchmarks/scripts/token-efficiency-benchmark.ts index b5d4ebe..1e36a13 100644 --- a/benchmarks/scripts/token-efficiency-benchmark.ts +++ b/benchmarks/scripts/token-efficiency-benchmark.ts @@ -1,11 +1,11 @@ +import type { Dataset } from '../src/types' import * as fsp from 'node:fs/promises' import * as path from 'node:path' import * as prompts from '@clack/prompts' import { encode } from '../../packages/toon/src' -import githubRepos from '../data/github-repos.json' with { type: 'json' } import { BENCHMARKS_DIR, FORMATTER_DISPLAY_NAMES, ROOT_DIR } from '../src/constants' -import { generateAnalyticsData, generateOrderData } from '../src/datasets' -import { formatters } from '../src/formatters' +import { TOKEN_EFFICIENCY_DATASETS } from '../src/datasets' +import { formatters, supportsCSV } from '../src/formatters' import { createProgressBar, ensureDir, tokenize } from '../src/utils' interface FormatMetrics { @@ -16,55 +16,160 @@ interface FormatMetrics { } interface BenchmarkResult { - name: string - emoji: string - description: string - data: Record + dataset: Dataset formats: FormatMetrics[] - showDetailed: boolean } -const BENCHMARK_EXAMPLES = [ 
- { - name: 'GitHub Repositories', - emoji: '⭐', - description: 'Top 100 GitHub repositories with stars, forks, and metadata', - getData: () => ({ repositories: githubRepos }), - showDetailed: true, - }, - { - name: 'Daily Analytics', - emoji: '📈', - description: '180 days of web metrics (views, clicks, conversions, revenue)', - getData: () => generateAnalyticsData(180), - showDetailed: true, - }, - { - name: 'E-Commerce Order', - emoji: '🛒', - description: 'Single nested order with customer and items', - getData: generateOrderData, - showDetailed: false, - }, -] as const +// Constants +const DATASET_ICONS: Record<string, string> = { + 'tabular': '👥', + 'nested': '🛒', + 'analytics': '📈', + 'github': '⭐', + 'event-logs': '🧾', + 'nested-config': '🧩', +} + +const COMPARISON_FORMAT_ORDER = ['json-pretty', 'json-compact', 'yaml', 'xml'] as const + +const PROGRESS_BAR_CONFIG = { filled: '▓', empty: '░' } as const +const PROGRESS_BAR_WIDTH = 20 +const TOKEN_PADDING = 7 +const LABEL_PADDING = 60 +const COMPARISON_LABEL_PADDING = 30 + +const SEPARATOR = '───────────────────────────────────────────────────────────────────────────────' +const DEFAULT_DATASET_ICON = '📊' + +const DETAILED_EXAMPLE_DATASETS = ['github', 'analytics'] as const +const GITHUB_REPO_LIMIT = 3 +const GITHUB_DESC_LIMIT = 80 +const ANALYTICS_METRICS_LIMIT = 5 prompts.intro('Token Efficiency Benchmark') +/** + * Format a comparison line showing savings vs TOON + */ +function formatComparisonLine(format: FormatMetrics): string { + const label = FORMATTER_DISPLAY_NAMES[format.name] || format.name.toUpperCase() + const signedPercent = format.savingsPercent >= 0 + ?
`−${format.savingsPercent.toFixed(1)}%` + : `+${Math.abs(format.savingsPercent).toFixed(1)}%` + const labelWithSavings = `vs ${label} (${signedPercent})`.padEnd(COMPARISON_LABEL_PADDING) + const tokenStr = format.tokens.toLocaleString('en-US').padStart(TOKEN_PADDING) + return ` ${labelWithSavings}${tokenStr}` +} + +/** + * Calculate total tokens and savings for a set of datasets + */ +function calculateTotalMetrics(datasets: BenchmarkResult[], formatNames: readonly string[]) { + const totalToonTokens = datasets.reduce((sum, r) => { + const toon = r.formats.find(f => f.name === 'toon')! + return sum + toon.tokens + }, 0) + + const totals = formatNames.map((formatName) => { + const totalTokens = datasets.reduce((sum, r) => { + const format = r.formats.find(f => f.name === formatName) + return sum + (format?.tokens || 0) + }, 0) + const savings = totalTokens - totalToonTokens + const savingsPercent = (savings / totalTokens) * 100 + return { name: formatName, tokens: totalTokens, savingsPercent } + }) + + return { totalToonTokens, totals } +} + +/** + * Generate total lines for a track + */ +function generateTotalLines( + totalToonTokens: number, + totals: { name: string, tokens: number, savingsPercent: number }[], + baselineFormat?: { name: string, tokens: number }, +) { + const lines: string[] = ['Total '] + + if (baselineFormat) { + // Flat-only track with CSV baseline + const csvPercentage = Math.min(100, (baselineFormat.tokens / totalToonTokens) * 100) + const csvBar = createProgressBar(csvPercentage, 100, PROGRESS_BAR_WIDTH, PROGRESS_BAR_CONFIG) + const csvStr = baselineFormat.tokens.toLocaleString('en-US').padStart(TOKEN_PADDING) + lines.push(`csv ${csvBar} ${csvStr} tokens`) + + const overheadPercent = ((totalToonTokens - baselineFormat.tokens) / totalToonTokens) * 100 + const toonBar = createProgressBar(100, 100, PROGRESS_BAR_WIDTH, PROGRESS_BAR_CONFIG) + const toonStr = totalToonTokens.toLocaleString('en-US').padStart(TOKEN_PADDING) + lines.push(`toon
${toonBar} ${toonStr} tokens (+${overheadPercent.toFixed(1)}% vs CSV)`) + } + else { + // Mixed-structure track + const totalPercentage = Math.min(100, (totalToonTokens / totals[0]!.tokens) * 100) + const totalBar = createProgressBar(totalPercentage, 100, PROGRESS_BAR_WIDTH, PROGRESS_BAR_CONFIG) + const toonStr = totalToonTokens.toLocaleString('en-US').padStart(TOKEN_PADDING) + lines.push(`toon ${totalBar} ${toonStr} tokens`) + } + + // Add comparison lines + for (const format of totals) { + lines.push(formatComparisonLine({ + name: format.name, + tokens: format.tokens, + savings: 0, // Not used in this context + savingsPercent: format.savingsPercent, + })) + } + + return lines.join('\n') +} + +/** + * Generate bar chart for a dataset + */ +function generateDatasetChart(result: BenchmarkResult): string { + const { dataset, formats } = result + const toon = formats.find(f => f.name === 'toon')! + const jsonPretty = formats.find(f => f.name === 'json-pretty')! + + const emoji = DATASET_ICONS[dataset.name] || DEFAULT_DATASET_ICON + const eligibility = dataset.metadata.tabularEligibility + const name = `${dataset.description} [eligibility: ${eligibility}%]` + const percentage = Math.min(100, 100 - jsonPretty.savingsPercent) + const bar = createProgressBar(percentage, 100, PROGRESS_BAR_WIDTH, PROGRESS_BAR_CONFIG) + const toonStr = toon.tokens.toLocaleString('en-US') + + const line1 = `${emoji} ${name.padEnd(LABEL_PADDING)}\ntoon ${bar} ${toonStr.padStart(TOKEN_PADDING)} tokens` + + const comparisonLines = COMPARISON_FORMAT_ORDER.map((formatName) => { + const format = formats.find(f => f.name === formatName) + if (!format) + return null + + return formatComparisonLine(format) + }).filter(Boolean) + + return [line1, ...comparisonLines].join('\n') +} + const results: BenchmarkResult[] = [] -const totalTokensByFormat: Record = {} -for (const example of BENCHMARK_EXAMPLES) { - const data = example.getData() - - // Calculate tokens for each format +// Calculate token counts 
for all datasets +for (const dataset of TOKEN_EFFICIENCY_DATASETS) { const formatMetrics: FormatMetrics[] = [] const tokensByFormat: Record = {} + // Calculate tokens for each format for (const [formatName, formatter] of Object.entries(formatters)) { - const formattedString = formatter(data) + // Skip CSV for datasets that don't support it + if (formatName === 'csv' && !supportsCSV(dataset)) + continue + + const formattedString = formatter(dataset.data) const tokens = tokenize(formattedString) tokensByFormat[formatName] = tokens - totalTokensByFormat[formatName] = (totalTokensByFormat[formatName] || 0) + tokens } // Calculate savings vs TOON @@ -80,105 +185,126 @@ for (const example of BENCHMARK_EXAMPLES) { } results.push({ - name: example.name, - emoji: example.emoji, - description: example.description, - data, + dataset, formats: formatMetrics, - showDetailed: example.showDetailed, }) } -// Calculate total savings percentages -const totalToonTokens = totalTokensByFormat.toon! -const totalSavingsPercent: Record = {} -for (const [formatName, totalTokens] of Object.entries(totalTokensByFormat)) { - if (formatName === 'toon') { - totalSavingsPercent[formatName] = 0 - } - else { - const savings = totalTokens - totalToonTokens - totalSavingsPercent[formatName] = (savings / totalTokens) * 100 - } -} +// Separate datasets by CSV support +const mixedStructureDatasets = results.filter(r => !supportsCSV(r.dataset)) +const flatOnlyDatasets = results.filter(r => supportsCSV(r.dataset)) -// Generate ASCII bar chart visualization (stacked compact format) -const formatOrder = ['json-pretty', 'json-compact', 'yaml', 'xml'] -const datasetRows = results +// Mixed-Structure Track (no CSV) +const mixedCharts = mixedStructureDatasets + .map(result => generateDatasetChart(result)) + .join('\n\n') + +// Flat-Only Track (with CSV) +const flatCharts = flatOnlyDatasets .map((result) => { + const csv = result.formats.find(f => f.name === 'csv') const toon = result.formats.find(f => f.name 
=== 'toon')! - const percentage = result.formats.find(f => f.name === 'json-pretty')!.savingsPercent - const bar = createProgressBar(100 - percentage, 100) // Invert to show TOON tokens + + if (!csv) + return generateDatasetChart(result) + + // Special handling to show CSV first with TOON overhead + const { dataset } = result + const emoji = DATASET_ICONS[dataset.name] || DEFAULT_DATASET_ICON + const eligibility = dataset.metadata.tabularEligibility + const name = `${dataset.description} [eligibility: ${eligibility}%]` + + // CSV line + const csvPercentage = Math.min(100, (csv.tokens / toon.tokens) * 100) + const csvBar = createProgressBar(csvPercentage, 100, PROGRESS_BAR_WIDTH, PROGRESS_BAR_CONFIG) + const csvStr = csv.tokens.toLocaleString('en-US') + + const line1 = `${emoji} ${name.padEnd(LABEL_PADDING)}\ncsv ${csvBar} ${csvStr.padStart(TOKEN_PADDING)} tokens` + + // TOON line with overhead vs CSV + const toonOverhead = toon.tokens - csv.tokens + const toonOverheadPercent = (toonOverhead / toon.tokens) * 100 + const toonBar = createProgressBar(100, 100, PROGRESS_BAR_WIDTH, PROGRESS_BAR_CONFIG) const toonStr = toon.tokens.toLocaleString('en-US') + const toonVsCSV = toonOverheadPercent >= 0 + ? `(+${toonOverheadPercent.toFixed(1)}% vs CSV)` + : `(${toonOverheadPercent.toFixed(1)}% vs CSV)` + const toonLine = `toon ${toonBar} ${toonStr.padStart(TOKEN_PADDING)} tokens ${toonVsCSV}` - const line1 = `${result.emoji} ${result.name.padEnd(25)} ${bar} ${toonStr.padStart(6)} tokens` + // Other format comparisons (vs TOON) + const comparisonLines = COMPARISON_FORMAT_ORDER.map((formatName) => { + const format = result.formats.find(f => f.name === formatName) + if (!format) + return null - const comparisonLines = formatOrder.map((formatName) => { - const format = result.formats.find(f => f.name === formatName)! - const label = FORMATTER_DISPLAY_NAMES[formatName] || formatName.toUpperCase() - const signedPercent = format.savingsPercent >= 0 - ? 
`−${format.savingsPercent.toFixed(1)}%` - : `+${Math.abs(format.savingsPercent).toFixed(1)}%` - const labelWithSavings = `vs ${label} (${signedPercent})`.padEnd(27) - const tokenStr = format.tokens.toLocaleString('en-US').padStart(6) - return ` ${labelWithSavings}${tokenStr}` - }) + return formatComparisonLine(format) + }).filter(Boolean) - return [line1, ...comparisonLines].join('\n') + return [line1, toonLine, ...comparisonLines].join('\n') }) .join('\n\n') -// Add separator and totals row -const separator = '─────────────────────────────────────────────────────────────────────────' +// Calculate totals for mixed structure +const { totalToonTokens: totalToonTokensMixed, totals: mixedTotals } = calculateTotalMetrics(mixedStructureDatasets, COMPARISON_FORMAT_ORDER) +const mixedTotalLines = generateTotalLines(totalToonTokensMixed, mixedTotals) -// Calculate bar for totals (TOON vs average of comparison formats) -const comparisonTokens = formatOrder.map(name => totalTokensByFormat[name]!)
-const averageComparisonTokens = comparisonTokens.reduce((a, b) => a + b, 0) / comparisonTokens.length -const totalPercentage = (totalToonTokens / averageComparisonTokens) * 100 -const totalBar = createProgressBar(totalPercentage, 100) +// Calculate totals for flat-only +const { totalToonTokens: totalToonTokensFlat, totals: flatTotals } = calculateTotalMetrics(flatOnlyDatasets, COMPARISON_FORMAT_ORDER) +const totalCSVTokensFlat = flatOnlyDatasets.reduce((sum, r) => { + const csv = r.formats.find(f => f.name === 'csv') + return sum + (csv?.tokens || 0) +}, 0) +const flatTotalLines = generateTotalLines(totalToonTokensFlat, flatTotals, { name: 'csv', tokens: totalCSVTokensFlat }) -const totalLine1 = `Total ${totalBar} ${totalToonTokens.toLocaleString('en-US').padStart(6)} tokens` +const barChartSection = ` +## Mixed-Structure Track -const totalComparisonLines = formatOrder.map((formatName) => { - const label = FORMATTER_DISPLAY_NAMES[formatName] || formatName.toUpperCase() - const tokens = totalTokensByFormat[formatName]! - const percent = totalSavingsPercent[formatName]! - const signedPercent = percent >= 0 ? `−${percent.toFixed(1)}%` : `+${Math.abs(percent).toFixed(1)}%` - const labelWithSavings = `vs ${label} (${signedPercent})`.padEnd(27) - const tokenStr = tokens.toLocaleString('en-US').padStart(6) - return ` ${labelWithSavings}${tokenStr}` -}) +Datasets with nested or semi-uniform structures. CSV excluded as it cannot properly represent these structures. -const barChartSection = `${datasetRows}\n\n${separator}\n${totalLine1}\n${totalComparisonLines.join('\n')}` +\`\`\` +${mixedCharts} -// Generate detailed examples (only for selected examples) -// Note: Large datasets are truncated for display readability in the report. -// Token counts are calculated from the full datasets, not the truncated versions. +${SEPARATOR} +${mixedTotalLines} +\`\`\` + +## Flat-Only Track + +Datasets with flat tabular structures where CSV is applicable.
+ +\`\`\` +${flatCharts} + +${SEPARATOR} +${flatTotalLines} +\`\`\` +`.trim() + +// Generate detailed examples (optional: show a few examples) const detailedExamples = results - .filter(result => result.showDetailed) + .filter(r => DETAILED_EXAMPLE_DATASETS.includes(r.dataset.name as any)) .map((result, i, filtered) => { - // Truncate large datasets for display - let displayData = result.data - if (result.name === 'GitHub Repositories') { + let displayData = result.dataset.data + + // Truncate for display + if (result.dataset.name === 'github') { displayData = { - repositories: result.data.repositories.slice(0, 3).map((repo: Record) => ({ + repositories: displayData.repositories.slice(0, GITHUB_REPO_LIMIT).map((repo: Record) => ({ ...repo, - description: repo.description?.slice(0, 80) + (repo.description?.length > 80 ? 'โ€ฆ' : ''), + description: repo.description?.slice(0, GITHUB_DESC_LIMIT) + (repo.description?.length > GITHUB_DESC_LIMIT ? 'โ€ฆ' : ''), })), } } - else if (result.name === 'Daily Analytics') { - displayData = { metrics: result.data.metrics.slice(0, 5) } + else if (result.dataset.name === 'analytics') { + displayData = { metrics: displayData.metrics.slice(0, ANALYTICS_METRICS_LIMIT) } } const separator = i < filtered.length - 1 ? '\n\n---' : '' - + const emoji = DATASET_ICONS[result.dataset.name] || DEFAULT_DATASET_ICON const json = result.formats.find(f => f.name === 'json-pretty')! const toon = result.formats.find(f => f.name === 'toon')! - return `#### ${result.emoji} ${result.name} - -**Configuration:** ${result.description} + return `#### ${emoji} ${result.dataset.description} **Savings:** ${json.savings.toLocaleString('en-US')} tokens (${json.savingsPercent.toFixed(1)}% reduction vs JSON) @@ -197,9 +323,7 @@ ${encode(displayData)} .join('\n\n') const markdown = ` -\`\`\` ${barChartSection} -\`\`\`
View detailed examples @@ -209,7 +333,7 @@ ${detailedExamples}
 `.trimStart()
-prompts.log.message(`${barChartSection}\n`)
+prompts.log.message(barChartSection)
 
 const resultsDir = path.join(BENCHMARKS_DIR, 'results')
 await ensureDir(resultsDir)
diff --git a/benchmarks/src/constants.ts b/benchmarks/src/constants.ts
index adf5327..05daee3 100644
--- a/benchmarks/src/constants.ts
+++ b/benchmarks/src/constants.ts
@@ -8,7 +8,7 @@ export const BENCHMARKS_DIR: string = url.fileURLToPath(new URL('../', import.me
  * Model-specific RPM (requests per minute) limits to handle API quotas
  *
  * @remarks
- * Set `undefined` for models without specific limits
+ * Set `undefined` for models without specific limits.
  */
 /// keep-sorted
 export const MODEL_RPM_LIMITS: Record<string, number | undefined> = {
@@ -39,7 +39,7 @@ export const FORMATTER_DISPLAY_NAMES: Record<string, string> = {
  * Enable dry run mode for quick testing with limited AI requests
  *
  * @remarks
- * Set via environment variable: `DRY_RUN=true`
+ * Set via environment variable: `DRY_RUN=true`.
  */
 export const DRY_RUN: boolean = process.env.DRY_RUN === 'true'
@@ -123,4 +123,14 @@ export const QUESTION_LIMITS = {
     aggregationBranches: 2,
     filteringStarsAndForks: 8,
   },
+  eventLogs: {
+    fieldRetrieval: 10,
+    aggregationEndpoints: 3,
+    filteringLevelAndStatus: 2,
+    filteringEndpointAndStatus: 2,
+  },
+  nestedConfig: {
+    fieldRetrieval: 5,
+    filteringComplex: 2,
+  },
 } as const
diff --git a/benchmarks/src/datasets.ts b/benchmarks/src/datasets.ts
index fc2d274..e763856 100644
--- a/benchmarks/src/datasets.ts
+++ b/benchmarks/src/datasets.ts
@@ -5,6 +5,67 @@ import githubRepos from '../data/github-repos.json' with { type: 'json' }
 
 // Seed for reproducibility
 faker.seed(12345)
 
+/**
+ * Calculate the tabular eligibility percentage of a data structure
+ *
+ * @remarks
+ * Recursively analyzes data to determine what percentage of arrays qualify
+ * for TOON's tabular format (uniform objects with primitive values only).
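+ *
+ * @example
+ * // Hypothetical payload (not one of the benchmark datasets): one uniform
+ * // array and one array with mismatched keys, so 1 of 2 arrays is tabular.
+ * calculateTabularEligibility({
+ *   users: [{ id: 1, name: 'Ada' }, { id: 2, name: 'Lin' }],
+ *   mixed: [{ id: 1 }, { name: 'x' }],
+ * })
+ * // => 50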
+ */
+export function calculateTabularEligibility(data: unknown): number {
+  let totalArrays = 0
+  let tabularArrays = 0
+
+  function isTabularArray(arr: unknown[]): boolean {
+    if (arr.length === 0)
+      return false
+
+    // Check if all elements are objects
+    if (!arr.every(item => typeof item === 'object' && item !== null && !Array.isArray(item)))
+      return false
+
+    // Get keys from first object
+    const firstKeys = Object.keys(arr[0] as Record<string, unknown>)
+    if (firstKeys.length === 0)
+      return false
+
+    // Check if all objects have the same keys and only primitive values
+    return arr.every((item) => {
+      const itemObj = item as Record<string, unknown>
+      const itemKeys = Object.keys(itemObj)
+      if (itemKeys.length !== firstKeys.length)
+        return false
+      if (!firstKeys.every(key => itemKeys.includes(key)))
+        return false
+
+      // Check if all values are primitives (no nested objects or arrays)
+      return firstKeys.every((key) => {
+        const value = itemObj[key]
+        return value === null || ['string', 'number', 'boolean'].includes(typeof value)
+      })
+    })
+  }
+
+  function traverse(obj: unknown): void {
+    if (Array.isArray(obj)) {
+      totalArrays++
+      if (isTabularArray(obj))
+        tabularArrays++
+
+      // Continue traversing array elements
+      obj.forEach(item => traverse(item))
+    }
+    else if (typeof obj === 'object' && obj !== null) {
+      // Traverse object properties
+      Object.values(obj).forEach(value => traverse(value))
+    }
+  }
+
+  traverse(data)
+
+  return totalArrays === 0 ? 0 : Math.round((tabularArrays / totalArrays) * 100)
+}
+
 /**
  * Employee record structure for tabular dataset
  */
@@ -73,6 +134,78 @@ export interface Repository {
   pushedAt: string
 }
 
+/**
+ * Event log structure for semi-uniform dataset
+ */
+export interface EventLog {
+  timestamp: string
+  level: 'info' | 'warn' | 'error'
+  endpoint: string
+  statusCode: number
+  responseTime: number
+  userId: number
+  error?: {
+    message: string
+    stack: string
+    retryable: boolean
+  }
+}
+
+/**
+ * Nested configuration structure for deeply nested dataset
+ */
+export interface NestedConfig {
+  environment: string
+  version: string
+  database: {
+    host: string
+    port: number
+    name: string
+    pool: {
+      min: number
+      max: number
+      idleTimeout: number
+    }
+    replicas: {
+      host: string
+      port: number
+      priority: number
+    }[]
+  }
+  features: Record<string, {
+    enabled: boolean
+    rollout: number
+    variants: {
+      name: string
+      weight: number
+      config: Record<string, unknown>
+    }[]
+  }>
+  authentication: {
+    providers: {
+      name: string
+      clientId: string
+      scopes: string[]
+      config: Record<string, string>
+    }[]
+    session: {
+      secret: string
+      duration: number
+      refreshThreshold: number
+    }
+  }
+  permissions: {
+    roles: Record<string, { permissions: string[], inherits: string[] }>
+    groups: Record<string, { members: string[], roles: string[] }>
+  }
+}
+
 /**
  * Generate analytics time-series data
  */
@@ -108,17 +241,13 @@ export function generateAnalyticsData(days: number, startDate = '2025-01-01'): {
 }
 
 /**
- * Tabular dataset: 100 uniform employee records
- *
- * @remarks
- * Tests TOON's tabular array format
+ * Generate employee data (uniform tabular structure)
  */
 const departments: readonly string[] = ['Engineering', 'Sales', 'Marketing', 'HR', 'Operations', 'Finance'] as const
-const tabularDataset: Dataset = {
-  name: 'tabular',
-  description: 'Uniform employee records (TOON optimal format)',
-  data: {
-    employees: Array.from({ length: 100 }, (_, i): Employee => {
+
+function generateEmployees(count: number): { employees: Employee[] } {
+  return {
+    employees: Array.from({ length: count }, (_, i): Employee => {
       const yearsExp = faker.number.int({ min: 1, max: 25 })
       return {
         id: i + 1,
@@ -130,72 +259,132 @@ const tabularDataset: Dataset = {
         active: faker.datatype.boolean(0.8), // 80% active
       }
     }),
+  }
+}
+
+/**
+ * Tabular dataset: Uniform employee records
+ *
+ * @remarks
+ * Tests TOON's tabular array format.
+ */
+const tabularDataset: Dataset = {
+  name: 'tabular',
+  description: 'Uniform employee records (TOON optimal format)',
+  data: generateEmployees(100),
+  metadata: {
+    supportsCSV: true,
+    structureClass: 'uniform',
+    tabularEligibility: 100,
   },
 }
 
 /**
- * Nested dataset: 50 e-commerce orders with nested structures
- *
- * @remarks
- * Tests TOON's handling of complex nested objects
+ * Generate e-commerce orders (nested structure)
  */
-const productNames: readonly string[] = ['Wireless Mouse', 'USB Cable', 'Laptop Stand', 'Keyboard', 'Webcam', 'Headphones', 'Monitor', 'Desk Lamp'] as const
-const statuses: readonly string[] = ['pending', 'processing', 'shipped', 'delivered', 'cancelled'] as const
+const PRODUCT_NAMES = ['Wireless Mouse', 'USB Cable', 'Laptop Stand', 'Keyboard', 'Webcam', 'Headphones', 'Monitor', 'Desk Lamp'] as const
+const ORDER_STATUSES = ['pending', 'processing', 'shipped', 'delivered', 'cancelled'] as const
 
-const nestedDataset: Dataset = {
-  name: 'nested',
-  description: 'E-commerce orders with nested structures',
-  data: {
-    orders: Array.from({ length: 50 }, (_, i) => {
-      const customerId = (i % 20) + 1
-      const itemCount = faker.number.int({ min: 1, max: 4 })
+const ORDER_CONSTANTS = {
+  CUSTOMER_ID_MOD: 20,
+  MIN_ITEMS: 1,
+  MAX_ITEMS: 4,
+  MIN_ITEM_PRICE: 9.99,
+  MAX_ITEM_PRICE: 199.99,
+  MIN_ITEM_QUANTITY: 1,
+  MAX_ITEM_QUANTITY: 5,
+  SKU_LENGTH: 6,
+  ORDER_ID_PADDING: 4,
+  RECENT_DAYS: 90,
+  TAX_RATE: 0.08,
+} as const
+
+function generateOrders(count: number): { orders: Order[] } {
+  return {
+    orders: Array.from({ length: count }, (_, i) => {
+      const customerId = (i % ORDER_CONSTANTS.CUSTOMER_ID_MOD) + 1
+      const itemCount = faker.number.int({ min: ORDER_CONSTANTS.MIN_ITEMS, max: ORDER_CONSTANTS.MAX_ITEMS })
       const items = Array.from({ length: itemCount }, (_, j) => {
-        const price = faker.number.float({ min: 9.99, max: 199.99, fractionDigits: 2 })
-        const quantity = faker.number.int({ min: 1, max: 5 })
+        const price = faker.number.float({
+          min: ORDER_CONSTANTS.MIN_ITEM_PRICE,
+          max: ORDER_CONSTANTS.MAX_ITEM_PRICE,
+          fractionDigits: 2,
+        })
+        const quantity = faker.number.int({
+          min: ORDER_CONSTANTS.MIN_ITEM_QUANTITY,
+          max: ORDER_CONSTANTS.MAX_ITEM_QUANTITY,
+        })
         return {
-          sku: `SKU-${faker.string.alphanumeric({ length: 6 }).toUpperCase()}`,
-          name: productNames[j % productNames.length]!,
+          sku: `SKU-${faker.string.alphanumeric({ length: ORDER_CONSTANTS.SKU_LENGTH }).toUpperCase()}`,
+          name: PRODUCT_NAMES[j % PRODUCT_NAMES.length]!,
           quantity,
           price,
         }
       })
-      const total = Number(items.reduce((sum, item) => sum + (item.price * item.quantity), 0).toFixed(2))
+      const subtotal = Number(items.reduce((sum, item) => sum + (item.price * item.quantity), 0).toFixed(2))
+      const tax = Number((subtotal * ORDER_CONSTANTS.TAX_RATE).toFixed(2))
+      const total = Number((subtotal + tax).toFixed(2))
       return {
-        orderId: `ORD-${String(i + 1).padStart(4, '0')}`,
+        orderId: `ORD-${String(i + 1).padStart(ORDER_CONSTANTS.ORDER_ID_PADDING, '0')}`,
         customer: {
           id: customerId,
           name: faker.person.fullName(),
           email: faker.internet.email().toLowerCase(),
+          phone: faker.phone.number(),
         },
         items,
+        subtotal,
+        tax,
         total,
-        status: statuses[i % statuses.length]!,
-        orderDate: faker.date.recent({ days: 90 }).toISOString().split('T')[0],
+        status: ORDER_STATUSES[i % ORDER_STATUSES.length]!,
+        orderDate: faker.date.recent({ days: ORDER_CONSTANTS.RECENT_DAYS }).toISOString().split('T')[0],
       }
     }),
+  }
+}
+
+/**
+ * Nested dataset: E-commerce orders with nested structures
+ *
+ * @remarks
+ * Tests TOON's handling of complex nested objects.
+ */
+const nestedDataset: Dataset = {
+  name: 'nested',
+  description: 'E-commerce orders with nested structures',
+  data: generateOrders(50),
+  metadata: {
+    supportsCSV: false,
+    structureClass: 'nested',
+    tabularEligibility: 33, // orders array is not tabular, but items arrays within are
   },
 }
 
 /**
- * Analytics dataset: 60 days of time-series metrics
+ * Analytics dataset: Time-series metrics
  *
  * @remarks
- * Tests TOON's handling of numeric data and date fields
+ * Tests TOON's handling of numeric data and date fields.
  */
 const analyticsDataset: Dataset = {
   name: 'analytics',
   description: 'Time-series analytics data',
   data: generateAnalyticsData(60),
+  metadata: {
+    supportsCSV: true,
+    structureClass: 'uniform',
+    tabularEligibility: 100,
+  },
 }
 
 /**
  * Real-world dataset: Top 100 starred GitHub repositories
  *
  * @remarks
- * Tests TOON's tabular format
+ * Tests TOON's tabular format with real data.
  */
 const githubDataset: Dataset = {
@@ -203,13 +392,18 @@ const githubDataset: Dataset = {
   data: {
     repositories: githubRepos,
   },
+  metadata: {
+    supportsCSV: true,
+    structureClass: 'uniform',
+    tabularEligibility: 100,
+  },
 }
 
 /**
  * Generate a single e-commerce order with nested structure
  *
  * @remarks
- * Used for token efficiency benchmarks
+ * Used for token efficiency benchmarks.
  */
 export function generateOrderData(): Order {
   return {
@@ -235,11 +429,257 @@ export function generateOrderData(): Order {
 }
 
 /**
- * All datasets used in the benchmark
+ * Generate event logs (semi-uniform structure)
+ *
+ * @remarks
+ * Approximately 50% of logs include nested error objects, 50% are flat.
+ * This creates ~45% tabular eligibility.
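+ *
+ * @example
+ * // Shape sketch; field values are faker-generated, so only the structure
+ * // is stable across runs:
+ * const { logs } = generateEventLogs(2)
+ * // every log has timestamp, level, endpoint, statusCode, responseTime, userId;
+ * // error-level logs additionally carry error: { message, stack, retryable }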
*/ -export const datasets: Dataset[] = [ - tabularDataset, - nestedDataset, - analyticsDataset, - githubDataset, +export function generateEventLogs(count: number): { logs: EventLog[] } { + const endpoints = ['/api/users', '/api/orders', '/api/products', '/api/auth', '/api/payments'] + const levels = ['info', 'warn', 'error'] as const + + return { + logs: Array.from({ length: count }, () => { + const level = faker.helpers.arrayElement(levels) + const hasError = level === 'error' || (level === 'warn' && faker.datatype.boolean(0.3)) + + const log: EventLog = { + timestamp: faker.date.recent({ days: 7 }).toISOString(), + level, + endpoint: faker.helpers.arrayElement(endpoints), + statusCode: hasError + ? faker.number.int({ min: 400, max: 599 }) + : faker.number.int({ min: 200, max: 299 }), + responseTime: faker.number.int({ min: 10, max: 5000 }), + userId: faker.number.int({ min: 1000, max: 9999 }), + } + + if (hasError) { + log.error = { + message: faker.helpers.arrayElement([ + 'Database connection timeout', + 'Invalid authentication token', + 'Resource not found', + 'Internal server error', + 'Rate limit exceeded', + ]), + stack: `Error: ${faker.lorem.sentence()}\n at ${faker.lorem.word()}\n at ${faker.lorem.word()}`, + retryable: faker.datatype.boolean(0.6), + } + } + + return log + }), + } +} + +/** + * Generate deeply nested configuration + * + * @remarks + * Creates a complex nested structure with minimal tabular eligibility (~0%). 
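+ *
+ * @example
+ * // The top-level shape is fixed; only leaf values vary run to run:
+ * const config = generateNestedConfig()
+ * Object.keys(config)
+ * // => ['environment', 'version', 'database', 'features', 'authentication', 'permissions']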
+ */ +export function generateNestedConfig(): NestedConfig { + return { + environment: faker.helpers.arrayElement(['production', 'staging', 'development']), + version: faker.system.semver(), + database: { + host: faker.internet.domainName(), + port: 5432, + name: faker.database.type(), + pool: { + min: 2, + max: faker.number.int({ min: 10, max: 50 }), + idleTimeout: 30000, + }, + replicas: Array.from({ length: 3 }, (_, i) => ({ + host: `replica-${i + 1}.${faker.internet.domainName()}`, + port: 5432, + priority: i + 1, + })), + }, + features: { + darkMode: { + enabled: faker.datatype.boolean(), + rollout: faker.number.int({ min: 0, max: 100 }), + variants: [ + { + name: 'default', + weight: 70, + config: { theme: 'dark', animations: true }, + }, + { + name: 'minimal', + weight: 30, + config: { theme: 'dark', animations: false }, + }, + ], + }, + analytics: { + enabled: faker.datatype.boolean(), + rollout: faker.number.int({ min: 0, max: 100 }), + variants: [ + { + name: 'full', + weight: 100, + config: { tracking: 'all', sampling: 1.0 }, + }, + ], + }, + }, + authentication: { + providers: [ + { + name: 'oauth2', + clientId: faker.string.uuid(), + scopes: ['read', 'write', 'admin'], + config: { + authUrl: faker.internet.url(), + tokenUrl: faker.internet.url(), + }, + }, + { + name: 'saml', + clientId: faker.string.uuid(), + scopes: ['read'], + config: { + entryPoint: faker.internet.url(), + cert: faker.string.alphanumeric({ length: 64 }), + }, + }, + ], + session: { + secret: faker.string.alphanumeric({ length: 32 }), + duration: 86400, + refreshThreshold: 3600, + }, + }, + permissions: { + roles: { + admin: { + permissions: ['read', 'write', 'delete', 'manage_users', 'manage_roles'], + inherits: [], + }, + editor: { + permissions: ['read', 'write'], + inherits: ['viewer'], + }, + viewer: { + permissions: ['read'], + inherits: [], + }, + }, + groups: { + engineering: { + members: Array.from({ length: 5 }, () => faker.internet.email()), + roles: ['admin', 'editor'], 
+ }, + support: { + members: Array.from({ length: 3 }, () => faker.internet.email()), + roles: ['viewer'], + }, + }, + }, + } +} + +/** + * Event logs dataset: Semi-uniform structure + * + * @remarks + * Tests TOON with semi-uniform data (~50% flat, ~50% with nested errors). + */ +const eventLogsDataset: Dataset = { + name: 'event-logs', + description: 'Semi-uniform event logs', + data: generateEventLogs(75), + metadata: { + supportsCSV: false, + structureClass: 'semi-uniform', + tabularEligibility: 50, // ~50% of logs have nested error objects + }, +} + +/** + * Nested config dataset: Deeply nested structure + * + * @remarks + * Tests TOON's worst-case scenario with deeply nested configuration. + */ +const nestedConfigDataset: Dataset = { + name: 'nested-config', + description: 'Deeply nested configuration', + data: generateNestedConfig(), + metadata: { + supportsCSV: false, + structureClass: 'deep', + tabularEligibility: 0, // Highly nested, minimal tabular arrays + }, +} + +/** + * Datasets for accuracy benchmarks (smaller sizes for faster evaluation) + */ +export const ACCURACY_DATASETS: Dataset[] = [ + tabularDataset, // 100 employees + nestedDataset, // 50 orders + analyticsDataset, // 60 days + githubDataset, // 100 repos + eventLogsDataset, // 75 logs + nestedConfigDataset, // 1 config +] + +/** + * Datasets for token efficiency benchmarks (larger sizes to amplify token differences) + */ +export const TOKEN_EFFICIENCY_DATASETS: Dataset[] = [ + // Tabular: 2000 employees + { + name: 'tabular', + description: 'Uniform employee records (TOON optimal format)', + data: generateEmployees(2000), + metadata: { + supportsCSV: true, + structureClass: 'uniform', + tabularEligibility: 100, + }, + }, + // Nested: 500 orders + { + name: 'nested', + description: 'E-commerce orders with nested structures', + data: generateOrders(500), + metadata: { + supportsCSV: false, + structureClass: 'nested', + tabularEligibility: 33, + }, + }, + // Analytics: 365 days + { + name: 
'analytics', + description: 'Time-series analytics data', + data: generateAnalyticsData(365), + metadata: { + supportsCSV: true, + structureClass: 'uniform', + tabularEligibility: 100, + }, + }, + // GitHub: 100 repos (same as accuracy) + githubDataset, + // Event logs: 2000 logs + { + name: 'event-logs', + description: 'Semi-uniform event logs', + data: generateEventLogs(2000), + metadata: { + supportsCSV: false, + structureClass: 'semi-uniform', + tabularEligibility: 50, + }, + }, + // Nested config: 1 config (same as accuracy) + nestedConfigDataset, ] diff --git a/benchmarks/src/formatters.ts b/benchmarks/src/formatters.ts index 98a4fa0..5a9c226 100644 --- a/benchmarks/src/formatters.ts +++ b/benchmarks/src/formatters.ts @@ -1,3 +1,4 @@ +import type { Dataset } from './types' import { stringify as stringifyCSV } from 'csv-stringify/sync' import { XMLBuilder } from 'fast-xml-parser' import { stringify as stringifyYAML } from 'yaml' @@ -75,3 +76,14 @@ function toXML(data: unknown): string { return builder.build(data) } + +/** + * Check if a dataset supports CSV format + * + * @remarks + * CSV is only suitable for flat tabular data. Datasets with nested structures + * should not be compared using CSV as it cannot properly represent the data. + */ +export function supportsCSV(dataset: Dataset): boolean { + return dataset.metadata.supportsCSV +} diff --git a/benchmarks/src/questions.ts b/benchmarks/src/questions.ts deleted file mode 100644 index a644ec2..0000000 --- a/benchmarks/src/questions.ts +++ /dev/null @@ -1,711 +0,0 @@ -/** - * Question generation for TOON benchmarks - * - * Generates ~150-160 questions across different question types and datasets: - * - Field Retrieval: Direct field access with no computation - * Examples: "What is X's salary?", "What is the status of order Y?" - * - Aggregation: Counts, sums, averages, min/max operations (including single-condition filters) - * Examples: "How many X?", "What is the total/average?", "How many X > threshold?" 
- * - Filtering: Multi-condition queries requiring complex logical operations - * Examples: "How many X WHERE condition1 AND condition2?" - */ - -import type { AnalyticsMetric, Employee, Order, Repository } from './datasets' -import type { Question } from './types' -import { QUESTION_LIMITS, QUESTION_THRESHOLDS } from './constants' -import { datasets } from './datasets' - -/** - * Generate all questions from datasets - */ -export function generateQuestions(): Question[] { - const questions: Question[] = [] - let idCounter = 1 - - // Get datasets with proper typing - const tabular = (datasets.find(d => d.name === 'tabular')?.data.employees as Employee[]) ?? [] - const nested = (datasets.find(d => d.name === 'nested')?.data.orders as Order[]) ?? [] - const analytics = (datasets.find(d => d.name === 'analytics')?.data.metrics as AnalyticsMetric[]) ?? [] - const github = (datasets.find(d => d.name === 'github')?.data.repositories as Repository[]) ?? [] - - if (tabular.length > 0) { - // Field retrieval: specific employees - for (let i = 0; i < Math.min(QUESTION_LIMITS.tabular.fieldRetrieval, tabular.length); i++) { - const emp = tabular[i * 2] || tabular[i] - if (!emp) - continue - - // Rotate through all field types - if (i % 5 === 0) { - questions.push({ - id: `q${idCounter++}`, - prompt: `What is the salary of ${emp.name}?`, - groundTruth: String(emp.salary), - type: 'field-retrieval', - dataset: 'tabular', - }) - } - else if (i % 5 === 1) { - questions.push({ - id: `q${idCounter++}`, - prompt: `What department does ${emp.name} work in?`, - groundTruth: emp.department, - type: 'field-retrieval', - dataset: 'tabular', - }) - } - else if (i % 5 === 2) { - questions.push({ - id: `q${idCounter++}`, - prompt: `What is the email address of ${emp.name}?`, - groundTruth: emp.email, - type: 'field-retrieval', - dataset: 'tabular', - }) - } - else if (i % 5 === 3) { - questions.push({ - id: `q${idCounter++}`, - prompt: `How many years of experience does ${emp.name} have?`, - 
groundTruth: String(emp.yearsExperience), - type: 'field-retrieval', - dataset: 'tabular', - }) - } - else { - questions.push({ - id: `q${idCounter++}`, - prompt: `Is ${emp.name} an active employee?`, - groundTruth: emp.active ? 'yes' : 'no', - type: 'field-retrieval', - dataset: 'tabular', - }) - } - } - - // Aggregation: count by department - const departments = [...new Set(tabular.map(e => e.department))] - for (const dept of departments.slice(0, QUESTION_LIMITS.tabular.aggregationDepartments)) { - const count = tabular.filter(e => e.department === dept).length - questions.push({ - id: `q${idCounter++}`, - prompt: `How many employees work in ${dept}?`, - groundTruth: String(count), - type: 'aggregation', - dataset: 'tabular', - }) - } - - // Aggregation: salary ranges (single-condition filters) - for (const threshold of QUESTION_THRESHOLDS.tabular.salaryRanges) { - const count = tabular.filter(e => e.salary > threshold).length - questions.push({ - id: `q${idCounter++}`, - prompt: `How many employees have a salary greater than ${threshold}?`, - groundTruth: String(count), - type: 'aggregation', - dataset: 'tabular', - }) - } - - // Aggregation: totals and averages - const totalEmployees = tabular.length - const avgSalary = Math.round(tabular.reduce((sum, e) => sum + e.salary, 0) / totalEmployees) - const activeCount = tabular.filter(e => e.active).length - const inactiveCount = tabular.filter(e => !e.active).length - - questions.push( - { - id: `q${idCounter++}`, - prompt: 'How many employees are in the dataset?', - groundTruth: String(totalEmployees), - type: 'aggregation', - dataset: 'tabular', - }, - { - id: `q${idCounter++}`, - prompt: 'What is the average salary across all employees?', - groundTruth: String(avgSalary), - type: 'aggregation', - dataset: 'tabular', - }, - { - id: `q${idCounter++}`, - prompt: 'How many employees are active?', - groundTruth: String(activeCount), - type: 'aggregation', - dataset: 'tabular', - }, - { - id: `q${idCounter++}`, - 
prompt: 'How many employees are inactive?', - groundTruth: String(inactiveCount), - type: 'aggregation', - dataset: 'tabular', - }, - ) - - // Filtering: count by department with salary filter (multi-condition) - for (const dept of departments.slice(0, QUESTION_LIMITS.tabular.filteringMultiConditionDepartments)) { - const count = tabular.filter(e => e.department === dept && e.salary > QUESTION_THRESHOLDS.tabular.departmentSalaryThreshold).length - questions.push({ - id: `q${idCounter++}`, - prompt: `How many employees in ${dept} have a salary greater than ${QUESTION_THRESHOLDS.tabular.departmentSalaryThreshold}?`, - groundTruth: String(count), - type: 'filtering', - dataset: 'tabular', - }) - } - - // Filtering: active employees by experience (multi-condition) - for (const exp of QUESTION_THRESHOLDS.tabular.experienceYears.slice(0, QUESTION_LIMITS.tabular.filteringExperience)) { - const count = tabular.filter(e => e.yearsExperience > exp && e.active).length - questions.push({ - id: `q${idCounter++}`, - prompt: `How many active employees have more than ${exp} years of experience?`, - groundTruth: String(count), - type: 'filtering', - dataset: 'tabular', - }) - } - - // Filtering: department by experience (multi-condition) - for (const dept of departments.slice(0, QUESTION_LIMITS.tabular.filteringDepartmentExp)) { - const count = tabular.filter(e => e.department === dept && e.yearsExperience > QUESTION_THRESHOLDS.tabular.departmentExperienceThreshold).length - questions.push({ - id: `q${idCounter++}`, - prompt: `How many employees in ${dept} have more than ${QUESTION_THRESHOLDS.tabular.departmentExperienceThreshold} years of experience?`, - groundTruth: String(count), - type: 'filtering', - dataset: 'tabular', - }) - } - - // Filtering: department by active status (multi-condition) - for (const dept of departments.slice(0, QUESTION_LIMITS.tabular.filteringDepartmentActive)) { - const count = tabular.filter(e => e.department === dept && e.active).length - 
questions.push({ - id: `q${idCounter++}`, - prompt: `How many active employees work in ${dept}?`, - groundTruth: String(count), - type: 'filtering', - dataset: 'tabular', - }) - } - } - - if (nested.length > 0) { - // Field retrieval: order totals and statuses - for (let i = 0; i < Math.min(QUESTION_LIMITS.nested.fieldRetrievalOrders, nested.length); i++) { - const order = nested[i * 2] || nested[i] - if (!order) - continue - - if (i % 2 === 0) { - questions.push({ - id: `q${idCounter++}`, - prompt: `What is the total for order ${order.orderId}?`, - groundTruth: String(order.total), - type: 'field-retrieval', - dataset: 'nested', - }) - } - else { - questions.push({ - id: `q${idCounter++}`, - prompt: `What is the status of order ${order.orderId}?`, - groundTruth: order.status, - type: 'field-retrieval', - dataset: 'nested', - }) - } - } - - // Field retrieval: customer info and order dates (expanded) - for (let i = 0; i < Math.min(QUESTION_LIMITS.nested.fieldRetrievalCustomers, nested.length); i++) { - const order = nested[i * 2 + 1] || nested[i] - if (!order) - continue - - if (i % 4 === 0) { - questions.push({ - id: `q${idCounter++}`, - prompt: `What is the customer name for order ${order.orderId}?`, - groundTruth: order.customer.name, - type: 'field-retrieval', - dataset: 'nested', - }) - } - else if (i % 4 === 1) { - questions.push({ - id: `q${idCounter++}`, - prompt: `What is the customer email for order ${order.orderId}?`, - groundTruth: order.customer.email, - type: 'field-retrieval', - dataset: 'nested', - }) - } - else if (i % 4 === 2) { - questions.push({ - id: `q${idCounter++}`, - prompt: `What is the order date for order ${order.orderId}?`, - groundTruth: order.orderDate || '', - type: 'field-retrieval', - dataset: 'nested', - }) - } - else { - questions.push({ - id: `q${idCounter++}`, - prompt: `How many items are in order ${order.orderId}?`, - groundTruth: String(order.items.length), - type: 'field-retrieval', - dataset: 'nested', - }) - } - } - - // 
Aggregation: totals and averages - const totalRevenue = nested.reduce((sum, o) => sum + o.total, 0) - const avgOrderValue = totalRevenue / nested.length - const totalOrders = nested.length - const maxOrderValue = Math.max(...nested.map(o => o.total)) - - // Count by status - const statuses = [...new Set(nested.map(o => o.status))] - for (const status of statuses.slice(0, QUESTION_LIMITS.nested.aggregationStatuses)) { - const count = nested.filter(o => o.status === status).length - questions.push({ - id: `q${idCounter++}`, - prompt: `How many orders have status "${status}"?`, - groundTruth: String(count), - type: 'aggregation', - dataset: 'nested', - }) - } - - questions.push( - { - id: `q${idCounter++}`, - prompt: 'What is the total revenue across all orders?', - groundTruth: String(totalRevenue.toFixed(2)), - type: 'aggregation', - dataset: 'nested', - }, - { - id: `q${idCounter++}`, - prompt: 'What is the average order value?', - groundTruth: String(avgOrderValue.toFixed(2)), - type: 'aggregation', - dataset: 'nested', - }, - { - id: `q${idCounter++}`, - prompt: 'How many orders are in the dataset?', - groundTruth: String(totalOrders), - type: 'aggregation', - dataset: 'nested', - }, - { - id: `q${idCounter++}`, - prompt: 'What is the highest order total?', - groundTruth: String(maxOrderValue.toFixed(2)), - type: 'aggregation', - dataset: 'nested', - }, - ) - - // Aggregation: high-value orders (single-condition filter) - for (const threshold of QUESTION_THRESHOLDS.nested.highValueOrders) { - const count = nested.filter(o => o.total > threshold).length - questions.push({ - id: `q${idCounter++}`, - prompt: `How many orders have a total greater than ${threshold}?`, - groundTruth: String(count), - type: 'aggregation', - dataset: 'nested', - }) - } - - // Filtering: multi-condition queries (status AND value) - const orderStatuses = [...new Set(nested.map(o => o.status))] - for (const status of orderStatuses.slice(0, QUESTION_LIMITS.nested.filteringStatusAndValue)) { 
- const count = nested.filter(o => o.status === status && o.total > QUESTION_THRESHOLDS.nested.statusValueThreshold).length - questions.push({ - id: `q${idCounter++}`, - prompt: `How many orders have status "${status}" and total greater than ${QUESTION_THRESHOLDS.nested.statusValueThreshold}?`, - groundTruth: String(count), - type: 'filtering', - dataset: 'nested', - }) - } - - // Filtering: status AND items count (multi-condition) - for (const status of orderStatuses.slice(0, QUESTION_LIMITS.nested.filteringStatusAndItems)) { - const count = nested.filter(o => o.status === status && o.items.length >= QUESTION_THRESHOLDS.nested.itemCountThreshold).length - questions.push({ - id: `q${idCounter++}`, - prompt: `How many orders have status "${status}" and at least ${QUESTION_THRESHOLDS.nested.itemCountThreshold} items?`, - groundTruth: String(count), - type: 'filtering', - dataset: 'nested', - }) - } - - // Filtering: total AND items count (multi-condition) - for (const threshold of QUESTION_THRESHOLDS.nested.totalThresholdsForItems) { - const count = nested.filter(o => o.total > threshold && o.items.length >= QUESTION_THRESHOLDS.nested.itemCountThreshold).length - questions.push({ - id: `q${idCounter++}`, - prompt: `How many orders have a total greater than ${threshold} and at least ${QUESTION_THRESHOLDS.nested.itemCountThreshold} items?`, - groundTruth: String(count), - type: 'filtering', - dataset: 'nested', - }) - } - } - - if (analytics.length > 0) { - // Field retrieval: specific dates (expanded with all metrics) - for (let i = 0; i < Math.min(QUESTION_LIMITS.analytics.fieldRetrievalDates, analytics.length); i++) { - const metric = analytics[i * 3] || analytics[i] - if (!metric) - continue - - if (i % 5 === 0) { - questions.push({ - id: `q${idCounter++}`, - prompt: `How many views were recorded on ${metric.date}?`, - groundTruth: String(metric.views), - type: 'field-retrieval', - dataset: 'analytics', - }) - } - else if (i % 5 === 1) { - questions.push({ - id: 
`q${idCounter++}`, - prompt: `What was the revenue on ${metric.date}?`, - groundTruth: String(metric.revenue), - type: 'field-retrieval', - dataset: 'analytics', - }) - } - else if (i % 5 === 2) { - questions.push({ - id: `q${idCounter++}`, - prompt: `What was the conversion count on ${metric.date}?`, - groundTruth: String(metric.conversions), - type: 'field-retrieval', - dataset: 'analytics', - }) - } - else if (i % 5 === 3) { - questions.push({ - id: `q${idCounter++}`, - prompt: `How many clicks were recorded on ${metric.date}?`, - groundTruth: String(metric.clicks), - type: 'field-retrieval', - dataset: 'analytics', - }) - } - else { - questions.push({ - id: `q${idCounter++}`, - prompt: `What was the bounce rate on ${metric.date}?`, - groundTruth: String(metric.bounceRate), - type: 'field-retrieval', - dataset: 'analytics', - }) - } - } - - // Aggregation: totals and averages - const totalViews = analytics.reduce((sum, m) => sum + m.views, 0) - const totalRevenue = analytics.reduce((sum, m) => sum + m.revenue, 0) - const totalConversions = analytics.reduce((sum, m) => sum + m.conversions, 0) - const avgViews = Math.round(totalViews / analytics.length) - const avgRevenue = totalRevenue / analytics.length - const avgConversions = Math.round(totalConversions / analytics.length) - - questions.push( - { - id: `q${idCounter++}`, - prompt: 'What is the total number of views across all dates?', - groundTruth: String(totalViews), - type: 'aggregation', - dataset: 'analytics', - }, - { - id: `q${idCounter++}`, - prompt: 'What is the total revenue across all dates?', - groundTruth: String(totalRevenue.toFixed(2)), - type: 'aggregation', - dataset: 'analytics', - }, - { - id: `q${idCounter++}`, - prompt: 'What is the total number of conversions across all dates?', - groundTruth: String(totalConversions), - type: 'aggregation', - dataset: 'analytics', - }, - { - id: `q${idCounter++}`, - prompt: 'What is the average number of views per day?', - groundTruth: String(avgViews), 
-      type: 'aggregation',
-      dataset: 'analytics',
-    },
-    {
-      id: `q${idCounter++}`,
-      prompt: 'What is the average revenue per day?',
-      groundTruth: String(avgRevenue.toFixed(2)),
-      type: 'aggregation',
-      dataset: 'analytics',
-    },
-    {
-      id: `q${idCounter++}`,
-      prompt: 'What is the average number of conversions per day?',
-      groundTruth: String(avgConversions),
-      type: 'aggregation',
-      dataset: 'analytics',
-    },
-    {
-      id: `q${idCounter++}`,
-      prompt: 'How many days are included in the analytics data?',
-      groundTruth: String(analytics.length),
-      type: 'aggregation',
-      dataset: 'analytics',
-    },
-    {
-      id: `q${idCounter++}`,
-      prompt: 'What is the highest number of views recorded in a single day?',
-      groundTruth: String(Math.max(...analytics.map(m => m.views))),
-      type: 'aggregation',
-      dataset: 'analytics',
-    },
-  )
-
-  // Aggregation: high-performing days (single-condition filters)
-  for (const threshold of QUESTION_THRESHOLDS.analytics.views) {
-    const count = analytics.filter(m => m.views > threshold).length
-    questions.push({
-      id: `q${idCounter++}`,
-      prompt: `How many days had more than ${threshold} views?`,
-      groundTruth: String(count),
-      type: 'aggregation',
-      dataset: 'analytics',
-    })
-  }
-
-  // Filtering: multi-condition queries (views AND conversions)
-  for (const viewThreshold of QUESTION_THRESHOLDS.analytics.viewsForFiltering) {
-    const count = analytics.filter(m => m.views > viewThreshold && m.conversions > QUESTION_THRESHOLDS.analytics.conversionsForFiltering).length
-    questions.push({
-      id: `q${idCounter++}`,
-      prompt: `How many days had more than ${viewThreshold} views and more than ${QUESTION_THRESHOLDS.analytics.conversionsForFiltering} conversions?`,
-      groundTruth: String(count),
-      type: 'filtering',
-      dataset: 'analytics',
-    })
-  }
-
-  // Filtering: views AND revenue (expanded)
-  for (const revenueThreshold of QUESTION_THRESHOLDS.analytics.revenueThresholds.slice(0, 5)) {
-    const count = analytics.filter(m => m.views > QUESTION_THRESHOLDS.analytics.viewsThresholdForRevenue && m.revenue > revenueThreshold).length
-    questions.push({
-      id: `q${idCounter++}`,
-      prompt: `How many days had more than ${QUESTION_THRESHOLDS.analytics.viewsThresholdForRevenue} views and revenue greater than ${revenueThreshold}?`,
-      groundTruth: String(count),
-      type: 'filtering',
-      dataset: 'analytics',
-    })
-  }
-
-  // Filtering: clicks AND conversions (multi-condition)
-  for (const clickThreshold of QUESTION_THRESHOLDS.analytics.clicksForFiltering) {
-    const count = analytics.filter(m => m.clicks > clickThreshold && m.conversions > QUESTION_THRESHOLDS.analytics.conversionsForClickFiltering).length
-    questions.push({
-      id: `q${idCounter++}`,
-      prompt: `How many days had more than ${clickThreshold} clicks and more than ${QUESTION_THRESHOLDS.analytics.conversionsForClickFiltering} conversions?`,
-      groundTruth: String(count),
-      type: 'filtering',
-      dataset: 'analytics',
-    })
-  }
-
-  // Filtering: revenue AND bounce rate (multi-condition)
-  for (const revenueThreshold of QUESTION_THRESHOLDS.analytics.revenueForBounceRate) {
-    const count = analytics.filter(m => m.revenue > revenueThreshold && m.bounceRate < QUESTION_THRESHOLDS.analytics.bounceRateThreshold).length
-    questions.push({
-      id: `q${idCounter++}`,
-      prompt: `How many days had revenue greater than ${revenueThreshold} and bounce rate less than ${QUESTION_THRESHOLDS.analytics.bounceRateThreshold}?`,
-      groundTruth: String(count),
-      type: 'filtering',
-      dataset: 'analytics',
-    })
-  }
-}
-
-if (github.length > 0) {
-  // Helper to extract owner from repo field
-  const getOwner = (repoFullName: string) => repoFullName.split('/')[0]!
-
-  // Field retrieval: specific repos (diverse fields)
-  for (let i = 0; i < Math.min(QUESTION_LIMITS.github.fieldRetrievalRepos, github.length); i++) {
-    const repo = github[i * 7]
-    if (!repo)
-      continue
-
-    if (i % 5 === 0) {
-      questions.push({
-        id: `q${idCounter++}`,
-        prompt: `How many stars does ${repo.repo} have?`,
-        groundTruth: String(repo.stars),
-        type: 'field-retrieval',
-        dataset: 'github',
-      })
-    }
-    else if (i % 5 === 1) {
-      questions.push({
-        id: `q${idCounter++}`,
-        prompt: `How many forks does ${repo.repo} have?`,
-        groundTruth: String(repo.forks),
-        type: 'field-retrieval',
-        dataset: 'github',
-      })
-    }
-    else if (i % 5 === 2) {
-      questions.push({
-        id: `q${idCounter++}`,
-        prompt: `Who is the owner of ${repo.repo}?`,
-        groundTruth: getOwner(repo.repo),
-        type: 'field-retrieval',
-        dataset: 'github',
-      })
-    }
-    else if (i % 5 === 3) {
-      questions.push({
-        id: `q${idCounter++}`,
-        prompt: `What is the default branch of ${repo.repo}?`,
-        groundTruth: repo.defaultBranch,
-        type: 'field-retrieval',
-        dataset: 'github',
-      })
-    }
-    else {
-      questions.push({
-        id: `q${idCounter++}`,
-        prompt: `How many watchers does ${repo.repo} have?`,
-        groundTruth: String(repo.watchers),
-        type: 'field-retrieval',
-        dataset: 'github',
-      })
-    }
-  }
-
-  // Aggregation: popular repositories
-  const totalStars = github.reduce((sum, r) => sum + r.stars, 0)
-  const totalRepos = github.length
-  const avgStars = Math.round(totalStars / totalRepos)
-
-  questions.push(
-    {
-      id: `q${idCounter++}`,
-      prompt: 'What is the total number of stars across all repositories?',
-      groundTruth: String(totalStars),
-      type: 'aggregation',
-      dataset: 'github',
-    },
-    {
-      id: `q${idCounter++}`,
-      prompt: 'How many repositories are in the dataset?',
-      groundTruth: String(totalRepos),
-      type: 'aggregation',
-      dataset: 'github',
-    },
-    {
-      id: `q${idCounter++}`,
-      prompt: 'What is the average number of stars per repository?',
-      groundTruth: String(avgStars),
-      type: 'aggregation',
-      dataset: 'github',
-    },
-  )
-
-  // Aggregation: star thresholds (single-condition filters)
-  for (const threshold of QUESTION_THRESHOLDS.github.stars) {
-    const count = github.filter(r => r.stars > threshold).length
-    questions.push({
-      id: `q${idCounter++}`,
-      prompt: `How many repositories have more than ${threshold} stars?`,
-      groundTruth: String(count),
-      type: 'aggregation',
-      dataset: 'github',
-    })
-  }
-
-  // Aggregation: fork thresholds (single-condition filters)
-  for (const threshold of QUESTION_THRESHOLDS.github.forks) {
-    const count = github.filter(r => r.forks > threshold).length
-    questions.push({
-      id: `q${idCounter++}`,
-      prompt: `How many repositories have more than ${threshold} forks?`,
-      groundTruth: String(count),
-      type: 'aggregation',
-      dataset: 'github',
-    })
-  }
-
-  // Aggregation: watcher thresholds (single-condition filters)
-  for (const threshold of QUESTION_THRESHOLDS.github.watchers) {
-    const count = github.filter(r => r.watchers > threshold).length
-    questions.push({
-      id: `q${idCounter++}`,
-      prompt: `How many repositories have more than ${threshold} watchers?`,
-      groundTruth: String(count),
-      type: 'aggregation',
-      dataset: 'github',
-    })
-  }
-
-  // Aggregation: default branch counts
-  const branches = [...new Set(github.map(r => r.defaultBranch))]
-  for (const branch of branches.slice(0, QUESTION_LIMITS.github.aggregationBranches)) {
-    const count = github.filter(r => r.defaultBranch === branch).length
-    questions.push({
-      id: `q${idCounter++}`,
-      prompt: `How many repositories use "${branch}" as their default branch?`,
-      groundTruth: String(count),
-      type: 'aggregation',
-      dataset: 'github',
-    })
-  }
-
-  // Filtering: multi-condition queries (stars AND forks)
-  for (const combo of QUESTION_THRESHOLDS.github.starForkCombinations.slice(0, QUESTION_LIMITS.github.filteringStarsAndForks)) {
-    const count = github.filter(r => r.stars > combo.stars && r.forks > combo.forks).length
-    questions.push({
-      id: `q${idCounter++}`,
-      prompt: `How many repositories have more than ${combo.stars} stars and more than ${combo.forks} forks?`,
-      groundTruth: String(count),
-      type: 'filtering',
-      dataset: 'github',
-    })
-  }
-
-  // Filtering: stars AND watchers (multi-condition)
-  for (const combo of QUESTION_THRESHOLDS.github.starWatcherCombinations) {
-    const count = github.filter(r => r.stars > combo.stars && r.watchers > combo.watchers).length
-    questions.push({
-      id: `q${idCounter++}`,
-      prompt: `How many repositories have more than ${combo.stars} stars and more than ${combo.watchers} watchers?`,
-      groundTruth: String(count),
-      type: 'filtering',
-      dataset: 'github',
-    })
-  }
-}
-
-return questions
-}
diff --git a/benchmarks/src/questions/analytics.ts b/benchmarks/src/questions/analytics.ts
new file mode 100644
index 0000000..4c58639
--- /dev/null
+++ b/benchmarks/src/questions/analytics.ts
@@ -0,0 +1,196 @@
+import type { AnalyticsMetric } from '../datasets'
+import type { Question } from '../types'
+import { QUESTION_LIMITS, QUESTION_THRESHOLDS } from '../constants'
+import { countByPredicate, QuestionBuilder, rotateQuestions, SAMPLE_STRIDES } from './utils'
+
+/**
+ * Generate analytics (website metrics) questions
+ */
+export function generateAnalyticsQuestions(metrics: AnalyticsMetric[], getId: () => string): Question[] {
+  const questions: Question[] = []
+
+  if (metrics.length === 0)
+    return questions
+
+  // Field retrieval: date-based metrics
+  const metricFieldGenerators: Array<(metric: AnalyticsMetric, getId: () => string) => Question> = [
+    (metric, getId) => new QuestionBuilder()
+      .id(getId())
+      .prompt(`What are the views for ${metric.date}?`)
+      .groundTruth(String(metric.views))
+      .type('field-retrieval')
+      .dataset('analytics')
+      .build(),
+    (metric, getId) => new QuestionBuilder()
+      .id(getId())
+      .prompt(`What is the revenue for ${metric.date}?`)
+      .groundTruth(String(metric.revenue))
+      .type('field-retrieval')
+      .dataset('analytics')
+      .build(),
+    (metric, getId) => new QuestionBuilder()
+      .id(getId())
+      .prompt(`What is the bounce rate for ${metric.date}?`)
+      .groundTruth(String(metric.bounceRate))
+      .type('field-retrieval')
+      .dataset('analytics')
+      .build(),
+    (metric, getId) => new QuestionBuilder()
+      .id(getId())
+      .prompt(`How many conversions were there on ${metric.date}?`)
+      .groundTruth(String(metric.conversions))
+      .type('field-retrieval')
+      .dataset('analytics')
+      .build(),
+  ]
+
+  questions.push(...rotateQuestions(
+    metrics,
+    metricFieldGenerators,
+    QUESTION_LIMITS.analytics.fieldRetrievalDates,
+    SAMPLE_STRIDES.ANALYTICS_FIELD,
+    getId,
+  ))
+
+  // Aggregation: basic statistics
+  const totalDays = metrics.length
+  const totalViews = metrics.reduce((sum, m) => sum + m.views, 0)
+  const totalConversions = metrics.reduce((sum, m) => sum + m.conversions, 0)
+  const totalRevenue = metrics.reduce((sum, m) => sum + m.revenue, 0)
+  const avgBounceRate = metrics.reduce((sum, m) => sum + m.bounceRate, 0) / metrics.length
+
+  questions.push(
+    new QuestionBuilder()
+      .id(getId())
+      .prompt('How many days of data are in the dataset?')
+      .groundTruth(String(totalDays))
+      .type('aggregation')
+      .dataset('analytics')
+      .build(),
+    new QuestionBuilder()
+      .id(getId())
+      .prompt('What is the total number of views across all dates?')
+      .groundTruth(String(totalViews))
+      .type('aggregation')
+      .dataset('analytics')
+      .build(),
+    new QuestionBuilder()
+      .id(getId())
+      .prompt('What is the total number of conversions across all dates?')
+      .groundTruth(String(totalConversions))
+      .type('aggregation')
+      .dataset('analytics')
+      .build(),
+    new QuestionBuilder()
+      .id(getId())
+      .prompt('What is the total revenue across all dates?')
+      .groundTruth(String(totalRevenue.toFixed(2)))
+      .type('aggregation')
+      .dataset('analytics')
+      .build(),
+    new QuestionBuilder()
+      .id(getId())
+      .prompt('What is the average bounce rate?')
+      .groundTruth(String(avgBounceRate.toFixed(2)))
+      .type('aggregation')
+      .dataset('analytics')
+      .build(),
+  )
+
+  // Aggregation: high views/conversions
+  for (const threshold of QUESTION_THRESHOLDS.analytics.views) {
+    const count = countByPredicate(metrics, m => m.views > threshold)
+    questions.push(
+      new QuestionBuilder()
+        .id(getId())
+        .prompt(`How many days had more than ${threshold} views?`)
+        .groundTruth(String(count))
+        .type('aggregation')
+        .dataset('analytics')
+        .build(),
+    )
+  }
+
+  for (const threshold of QUESTION_THRESHOLDS.analytics.conversions) {
+    const count = countByPredicate(metrics, m => m.conversions > threshold)
+    questions.push(
+      new QuestionBuilder()
+        .id(getId())
+        .prompt(`How many days had more than ${threshold} conversions?`)
+        .groundTruth(String(count))
+        .type('aggregation')
+        .dataset('analytics')
+        .build(),
+    )
+  }
+
+  // Filtering: multi-condition (views AND revenue)
+  for (const threshold of QUESTION_THRESHOLDS.analytics.viewsForFiltering) {
+    const count = countByPredicate(
+      metrics,
+      m => m.views > threshold && m.conversions > QUESTION_THRESHOLDS.analytics.conversionsForFiltering,
+    )
+    questions.push(
+      new QuestionBuilder()
+        .id(getId())
+        .prompt(`How many days had more than ${threshold} views and more than ${QUESTION_THRESHOLDS.analytics.conversionsForFiltering} conversions?`)
+        .groundTruth(String(count))
+        .type('filtering')
+        .dataset('analytics')
+        .build(),
+    )
+  }
+
+  // Filtering: revenue thresholds
+  for (const threshold of QUESTION_THRESHOLDS.analytics.revenueThresholds) {
+    const count = countByPredicate(
+      metrics,
+      m => m.revenue > threshold && m.views > QUESTION_THRESHOLDS.analytics.viewsThresholdForRevenue,
+    )
+    questions.push(
+      new QuestionBuilder()
+        .id(getId())
+        .prompt(`How many days had revenue greater than ${threshold} with views above ${QUESTION_THRESHOLDS.analytics.viewsThresholdForRevenue}?`)
+        .groundTruth(String(count))
+        .type('filtering')
+        .dataset('analytics')
+        .build(),
+    )
+  }
+
+  // Filtering: clicks and conversions
+  for (const threshold of QUESTION_THRESHOLDS.analytics.clicksForFiltering) {
+    const count = countByPredicate(
+      metrics,
+      m => m.clicks > threshold && m.conversions > QUESTION_THRESHOLDS.analytics.conversionsForClickFiltering,
+    )
+    questions.push(
+      new QuestionBuilder()
+        .id(getId())
+        .prompt(`How many days had more than ${threshold} clicks and more than ${QUESTION_THRESHOLDS.analytics.conversionsForClickFiltering} conversions?`)
+        .groundTruth(String(count))
+        .type('filtering')
+        .dataset('analytics')
+        .build(),
+    )
+  }
+
+  // Filtering: revenue and bounce rate
+  for (const threshold of QUESTION_THRESHOLDS.analytics.revenueForBounceRate) {
+    const count = countByPredicate(
+      metrics,
+      m => m.revenue > threshold && m.bounceRate < QUESTION_THRESHOLDS.analytics.bounceRateThreshold,
+    )
+    questions.push(
+      new QuestionBuilder()
+        .id(getId())
+        .prompt(`How many days had revenue greater than ${threshold} with bounce rate below ${QUESTION_THRESHOLDS.analytics.bounceRateThreshold}?`)
+        .groundTruth(String(count))
+        .type('filtering')
+        .dataset('analytics')
+        .build(),
+    )
+  }
+
+  return questions
+}
diff --git a/benchmarks/src/questions/event-logs.ts b/benchmarks/src/questions/event-logs.ts
new file mode 100644
index 0000000..3e4650a
--- /dev/null
+++ b/benchmarks/src/questions/event-logs.ts
@@ -0,0 +1,162 @@
+import type { EventLog } from '../datasets'
+import type { Question } from '../types'
+import { QUESTION_LIMITS } from '../constants'
+import { countByPredicate, QuestionBuilder, rotateQuestions, SAMPLE_STRIDES } from './utils'
+
+/**
+ * Generate event log questions
+ */
+export function generateEventLogsQuestions(logs: EventLog[], getId: () => string): Question[] {
+  const questions: Question[] = []
+
+  if (logs.length === 0)
+    return questions
+
+  // Field retrieval: log metadata
+  const logFieldGenerators: Array<(log: EventLog, getId: () => string) => Question> = [
+    (log, getId) => new QuestionBuilder()
+      .id(getId())
+      .prompt(`What is the level of the log at ${log.timestamp}?`)
+      .groundTruth(log.level)
+      .type('field-retrieval')
+      .dataset('event-logs')
+      .build(),
+    (log, getId) => new QuestionBuilder()
+      .id(getId())
+      .prompt(`What is the endpoint for the log at ${log.timestamp}?`)
+      .groundTruth(log.endpoint)
+      .type('field-retrieval')
+      .dataset('event-logs')
+      .build(),
+    (log, getId) => new QuestionBuilder()
+      .id(getId())
+      .prompt(`What is the status code for the log at ${log.timestamp}?`)
+      .groundTruth(String(log.statusCode))
+      .type('field-retrieval')
+      .dataset('event-logs')
+      .build(),
+    (log, getId) => new QuestionBuilder()
+      .id(getId())
+      .prompt(`What is the response time for the log at ${log.timestamp}?`)
+      .groundTruth(String(log.responseTime))
+      .type('field-retrieval')
+      .dataset('event-logs')
+      .build(),
+  ]
+
+  questions.push(...rotateQuestions(
+    logs,
+    logFieldGenerators,
+    QUESTION_LIMITS.eventLogs.fieldRetrieval,
+    SAMPLE_STRIDES.EVENT_LOG_FIELD,
+    getId,
+  ))
+
+  // Aggregation: basic statistics
+  const totalLogs = logs.length
+  const avgResponseTime = logs.reduce((sum, l) => sum + l.responseTime, 0) / logs.length
+
+  questions.push(
+    new QuestionBuilder()
+      .id(getId())
+      .prompt('How many log entries are in the dataset?')
+      .groundTruth(String(totalLogs))
+      .type('aggregation')
+      .dataset('event-logs')
+      .build(),
+    new QuestionBuilder()
+      .id(getId())
+      .prompt('What is the average response time across all logs?')
+      .groundTruth(String(avgResponseTime.toFixed(2)))
+      .type('aggregation')
+      .dataset('event-logs')
+      .build(),
+  )
+
+  // Aggregation: by level
+  const levels = [...new Set(logs.map(l => l.level))]
+  for (const level of levels) {
+    const count = countByPredicate(logs, l => l.level === level)
+    questions.push(
+      new QuestionBuilder()
+        .id(getId())
+        .prompt(`How many log entries have level "${level}"?`)
+        .groundTruth(String(count))
+        .type('aggregation')
+        .dataset('event-logs')
+        .build(),
+    )
+  }
+
+  // Aggregation: by endpoint
+  const endpoints = [...new Set(logs.map(l => l.endpoint))]
+  for (const endpoint of endpoints.slice(0, QUESTION_LIMITS.eventLogs.aggregationEndpoints)) {
+    const count = countByPredicate(logs, l => l.endpoint === endpoint)
+    questions.push(
+      new QuestionBuilder()
+        .id(getId())
+        .prompt(`How many log entries are for endpoint "${endpoint}"?`)
+        .groundTruth(String(count))
+        .type('aggregation')
+        .dataset('event-logs')
+        .build(),
+    )
+  }
+
+  // Aggregation: by status code range
+  const errorCount = countByPredicate(logs, l => l.statusCode >= 400)
+  const successCount = countByPredicate(logs, l => l.statusCode >= 200 && l.statusCode < 300)
+
+  questions.push(
+    new QuestionBuilder()
+      .id(getId())
+      .prompt('How many log entries have a status code indicating an error (>= 400)?')
+      .groundTruth(String(errorCount))
+      .type('aggregation')
+      .dataset('event-logs')
+      .build(),
+    new QuestionBuilder()
+      .id(getId())
+      .prompt('How many log entries have a successful status code (200-299)?')
+      .groundTruth(String(successCount))
+      .type('aggregation')
+      .dataset('event-logs')
+      .build(),
+  )
+
+  // Filtering: multi-condition (level AND status)
+  for (const level of levels.slice(0, QUESTION_LIMITS.eventLogs.filteringLevelAndStatus)) {
+    const count = countByPredicate(
+      logs,
+      l => l.level === level && l.statusCode >= 400,
+    )
+    questions.push(
+      new QuestionBuilder()
+        .id(getId())
+        .prompt(`How many log entries have level "${level}" and status code >= 400?`)
+        .groundTruth(String(count))
+        .type('filtering')
+        .dataset('event-logs')
+        .build(),
+    )
+  }
+
+  // Filtering: endpoint AND status
+  for (const endpoint of endpoints.slice(0, QUESTION_LIMITS.eventLogs.filteringEndpointAndStatus)) {
+    const count = countByPredicate(
+      logs,
+      l => l.endpoint === endpoint && l.statusCode >= 500,
+    )
+    questions.push(
+      new QuestionBuilder()
+        .id(getId())
+        .prompt(`How many log entries are for endpoint "${endpoint}" with status code >= 500?`)
+        .groundTruth(String(count))
+        .type('filtering')
+        .dataset('event-logs')
+        .build(),
+    )
+  }
+
+  return questions
+}
diff --git a/benchmarks/src/questions/github.ts b/benchmarks/src/questions/github.ts
new file mode 100644
index 0000000..f9b4bd3
--- /dev/null
+++ b/benchmarks/src/questions/github.ts
@@ -0,0 +1,184 @@
+import type { Repository } from '../datasets'
+import type { Question } from '../types'
+import { QUESTION_LIMITS, QUESTION_THRESHOLDS } from '../constants'
+import { countByPredicate, QuestionBuilder, rotateQuestions, SAMPLE_STRIDES } from './utils'
+
+/**
+ * Generate GitHub repository questions
+ */
+export function generateGithubQuestions(repos: Repository[], getId: () => string): Question[] {
+  const questions: Question[] = []
+
+  if (repos.length === 0)
+    return questions
+
+  // Field retrieval: repository metadata
+  const repoFieldGenerators: Array<(repo: Repository, getId: () => string) => Question> = [
+    (repo, getId) => new QuestionBuilder()
+      .id(getId())
+      .prompt(`How many stars does ${repo.owner}/${repo.name} have?`)
+      .groundTruth(String(repo.stars))
+      .type('field-retrieval')
+      .dataset('github')
+      .build(),
+    (repo, getId) => new QuestionBuilder()
+      .id(getId())
+      .prompt(`How many forks does ${repo.owner}/${repo.name} have?`)
+      .groundTruth(String(repo.forks))
+      .type('field-retrieval')
+      .dataset('github')
+      .build(),
+    (repo, getId) => new QuestionBuilder()
+      .id(getId())
+      .prompt(`How many watchers does ${repo.owner}/${repo.name} have?`)
+      .groundTruth(String(repo.watchers))
+      .type('field-retrieval')
+      .dataset('github')
+      .build(),
+    (repo, getId) => new QuestionBuilder()
+      .id(getId())
+      .prompt(`What is the main branch of ${repo.owner}/${repo.name}?`)
+      .groundTruth(repo.defaultBranch)
+      .type('field-retrieval')
+      .dataset('github')
+      .build(),
+  ]
+
+  questions.push(...rotateQuestions(
+    repos,
+    repoFieldGenerators,
+    QUESTION_LIMITS.github.fieldRetrievalRepos,
+    SAMPLE_STRIDES.REPO_FIELD,
+    getId,
+  ))
+
+  // Aggregation: basic statistics
+  const totalRepos = repos.length
+  const totalStars = repos.reduce((sum, r) => sum + r.stars, 0)
+  const totalForks = repos.reduce((sum, r) => sum + r.forks, 0)
+  const avgStars = totalStars / totalRepos
+
+  questions.push(
+    new QuestionBuilder()
+      .id(getId())
+      .prompt('How many repositories are in the dataset?')
+      .groundTruth(String(totalRepos))
+      .type('aggregation')
+      .dataset('github')
+      .build(),
+    new QuestionBuilder()
+      .id(getId())
+      .prompt('What is the total number of stars across all repositories?')
+      .groundTruth(String(totalStars))
+      .type('aggregation')
+      .dataset('github')
+      .build(),
+    new QuestionBuilder()
+      .id(getId())
+      .prompt('What is the total number of forks across all repositories?')
+      .groundTruth(String(totalForks))
+      .type('aggregation')
+      .dataset('github')
+      .build(),
+    new QuestionBuilder()
+      .id(getId())
+      .prompt('What is the average number of stars per repository?')
+      .groundTruth(String(Math.round(avgStars)))
+      .type('aggregation')
+      .dataset('github')
+      .build(),
+  )
+
+  // Aggregation: by default branch
+  const branches = [...new Set(repos.map(r => r.defaultBranch))]
+  for (const branch of branches.slice(0, QUESTION_LIMITS.github.aggregationBranches)) {
+    const count = countByPredicate(repos, r => r.defaultBranch === branch)
+    questions.push(
+      new QuestionBuilder()
+        .id(getId())
+        .prompt(`How many repositories use "${branch}" as their default branch?`)
+        .groundTruth(String(count))
+        .type('aggregation')
+        .dataset('github')
+        .build(),
+    )
+  }
+
+  // Aggregation: high star counts
+  for (const threshold of QUESTION_THRESHOLDS.github.stars) {
+    const count = countByPredicate(repos, r => r.stars > threshold)
+    questions.push(
+      new QuestionBuilder()
+        .id(getId())
+        .prompt(`How many repositories have more than ${threshold} stars?`)
+        .groundTruth(String(count))
+        .type('aggregation')
+        .dataset('github')
+        .build(),
+    )
+  }
+
+  // Aggregation: high fork counts
+  for (const threshold of QUESTION_THRESHOLDS.github.forks) {
+    const count = countByPredicate(repos, r => r.forks > threshold)
+    questions.push(
+      new QuestionBuilder()
+        .id(getId())
+        .prompt(`How many repositories have more than ${threshold} forks?`)
+        .groundTruth(String(count))
+        .type('aggregation')
+        .dataset('github')
+        .build(),
+    )
+  }
+
+  // Aggregation: high watcher counts
+  for (const threshold of QUESTION_THRESHOLDS.github.watchers) {
+    const count = countByPredicate(repos, r => r.watchers > threshold)
+    questions.push(
+      new QuestionBuilder()
+        .id(getId())
+        .prompt(`How many repositories have more than ${threshold} watchers?`)
+        .groundTruth(String(count))
+        .type('aggregation')
+        .dataset('github')
+        .build(),
+    )
+  }
+
+  // Filtering: multi-condition (stars AND forks)
+  for (const combo of QUESTION_THRESHOLDS.github.starForkCombinations.slice(0, QUESTION_LIMITS.github.filteringStarsAndForks)) {
+    const count = countByPredicate(
+      repos,
+      r => r.stars > combo.stars && r.forks > combo.forks,
+    )
+    questions.push(
+      new QuestionBuilder()
+        .id(getId())
+        .prompt(`How many repositories have more than ${combo.stars} stars and more than ${combo.forks} forks?`)
+        .groundTruth(String(count))
+        .type('filtering')
+        .dataset('github')
+        .build(),
+    )
+  }
+
+  // Filtering: stars AND watchers
+  for (const combo of QUESTION_THRESHOLDS.github.starWatcherCombinations) {
+    const count = countByPredicate(
+      repos,
+      r => r.stars > combo.stars && r.watchers > combo.watchers,
+    )
+    questions.push(
+      new QuestionBuilder()
+        .id(getId())
+        .prompt(`How many repositories have more than ${combo.stars} stars and more than ${combo.watchers} watchers?`)
+        .groundTruth(String(count))
+        .type('filtering')
+        .dataset('github')
+        .build(),
+    )
+  }
+
+  return questions
+}
diff --git a/benchmarks/src/questions/index.ts b/benchmarks/src/questions/index.ts
new file mode 100644
index 0000000..9bac171
--- /dev/null
+++ b/benchmarks/src/questions/index.ts
@@ -0,0 +1,46 @@
+import type { AnalyticsMetric, Employee, EventLog, NestedConfig, Order, Repository } from '../datasets'
+import type { Question } from '../types'
+import { ACCURACY_DATASETS } from '../datasets'
+import { generateAnalyticsQuestions } from './analytics'
+import { generateEventLogsQuestions } from './event-logs'
+import { generateGithubQuestions } from './github'
+import { generateNestedQuestions } from './nested'
+import { generateNestedConfigQuestions } from './nested-config'
+import { generateTabularQuestions } from './tabular'
+import { createIdGenerator } from './utils'
+
+/**
+ * Generate all questions from datasets
+ *
+ * @remarks
+ * Generates ~150-160 questions across different question types and datasets:
+ * - Field Retrieval: Direct field access with no computation
+ *   Examples: "What is X's salary?", "What is the status of order Y?"
+ * - Aggregation: Counts, sums, averages, min/max operations (including single-condition filters)
+ *   Examples: "How many X?", "What is the total/average?", "How many X > threshold?"
+ * - Filtering: Multi-condition queries requiring complex logical operations
+ *   Examples: "How many X WHERE condition1 AND condition2?"
+ */
+export function generateQuestions(): Question[] {
+  const questions: Question[] = []
+  const idGen = createIdGenerator()
+  const getId = () => idGen.next().value
+
+  // Get datasets with proper typing
+  const tabular = (ACCURACY_DATASETS.find(d => d.name === 'tabular')?.data.employees as Employee[]) ?? []
+  const nested = (ACCURACY_DATASETS.find(d => d.name === 'nested')?.data.orders as Order[]) ?? []
+  const analytics = (ACCURACY_DATASETS.find(d => d.name === 'analytics')?.data.metrics as AnalyticsMetric[]) ?? []
+  const github = (ACCURACY_DATASETS.find(d => d.name === 'github')?.data.repositories as Repository[]) ?? []
+  const eventLogs = (ACCURACY_DATASETS.find(d => d.name === 'event-logs')?.data.logs as EventLog[]) ?? []
+  const nestedConfig = ACCURACY_DATASETS.find(d => d.name === 'nested-config')?.data as NestedConfig | undefined
+
+  // Generate questions for each dataset
+  questions.push(...generateTabularQuestions(tabular, getId))
+  questions.push(...generateNestedQuestions(nested, getId))
+  questions.push(...generateAnalyticsQuestions(analytics, getId))
+  questions.push(...generateGithubQuestions(github, getId))
+  questions.push(...generateEventLogsQuestions(eventLogs, getId))
+  questions.push(...generateNestedConfigQuestions(nestedConfig, getId))
+
+  return questions
+}
diff --git a/benchmarks/src/questions/nested-config.ts b/benchmarks/src/questions/nested-config.ts
new file mode 100644
index 0000000..8ebc9f6
--- /dev/null
+++ b/benchmarks/src/questions/nested-config.ts
@@ -0,0 +1,147 @@
+import type { NestedConfig } from '../datasets'
+import type { Question } from '../types'
+import { QUESTION_LIMITS } from '../constants'
+import { QuestionBuilder } from './utils'
+
+/**
+ * Generate nested configuration questions
+ */
+export function generateNestedConfigQuestions(config: NestedConfig | undefined, getId: () => string): Question[] {
+  const questions: Question[] = []
+
+  if (!config)
+    return questions
+
+  // Field retrieval: top-level config values
+  const fieldRetrievalQuestions = [
+    {
+      prompt: 'What is the environment in the configuration?',
+      groundTruth: config.environment,
+    },
+    {
+      prompt: 'What is the database host?',
+      groundTruth: config.database.host,
+    },
+    {
+      prompt: 'What is the database port?',
+      groundTruth: String(config.database.port),
+    },
+    {
+      prompt: 'What is the maximum connection pool size?',
+      groundTruth: String(config.database.pool.max),
+    },
+    {
+      prompt: 'What is the session duration?',
+      groundTruth: String(config.authentication.session.duration),
+    },
+  ]
+
+  for (const q of fieldRetrievalQuestions.slice(0, QUESTION_LIMITS.nestedConfig.fieldRetrieval)) {
+    questions.push(
+      new QuestionBuilder()
+        .id(getId())
+        .prompt(q.prompt)
+        .groundTruth(q.groundTruth)
+        .type('field-retrieval')
+        .dataset('nested-config')
+        .build(),
+    )
+  }
+
+  // Aggregation: counts of nested structures
+  const roleCount = Object.keys(config.permissions.roles).length
+  const groupCount = Object.keys(config.permissions.groups).length
+  const providerCount = config.authentication.providers.length
+  const featureCount = Object.keys(config.features).length
+  const replicaCount = config.database.replicas.length
+
+  questions.push(
+    new QuestionBuilder()
+      .id(getId())
+      .prompt('How many roles are defined in permissions?')
+      .groundTruth(String(roleCount))
+      .type('aggregation')
+      .dataset('nested-config')
+      .build(),
+    new QuestionBuilder()
+      .id(getId())
+      .prompt('How many groups are defined in permissions?')
+      .groundTruth(String(groupCount))
+      .type('aggregation')
+      .dataset('nested-config')
+      .build(),
+    new QuestionBuilder()
+      .id(getId())
+      .prompt('How many authentication providers are configured?')
+      .groundTruth(String(providerCount))
+      .type('aggregation')
+      .dataset('nested-config')
+      .build(),
+    new QuestionBuilder()
+      .id(getId())
+      .prompt('How many feature flags are defined?')
+      .groundTruth(String(featureCount))
+      .type('aggregation')
+      .dataset('nested-config')
+      .build(),
+    new QuestionBuilder()
+      .id(getId())
+      .prompt('How many database replicas are configured?')
+      .groundTruth(String(replicaCount))
+      .type('aggregation')
+      .dataset('nested-config')
+      .build(),
+  )
+
+  // Aggregation: feature flag details
+  const enabledFeatures = Object.entries(config.features).filter(([_, f]) => f.enabled).length
+  questions.push(
+    new QuestionBuilder()
+      .id(getId())
+      .prompt('How many feature flags are enabled?')
+      .groundTruth(String(enabledFeatures))
+      .type('aggregation')
+      .dataset('nested-config')
+      .build(),
+  )
+
+  // Aggregation: role permissions
+  const adminPermissions = config.permissions.roles.admin?.permissions.length ?? 0
+  questions.push(
+    new QuestionBuilder()
+      .id(getId())
+      .prompt('How many permissions does the admin role have?')
+      .groundTruth(String(adminPermissions))
+      .type('aggregation')
+      .dataset('nested-config')
+      .build(),
+  )
+
+  // Filtering: complex multi-condition queries
+  const filteringQuestions = [
+    {
+      prompt: 'How many feature flags are enabled with rollout greater than 50%?',
+      groundTruth: String(Object.entries(config.features)
+        .filter(([_, f]) => f.enabled && f.rollout > 50).length),
+    },
+    {
+      prompt: 'How many groups have the admin role?',
+      groundTruth: String(Object.entries(config.permissions.groups)
+        .filter(([_, g]) => g.roles.includes('admin')).length),
+    },
+  ]
+
+  for (const q of filteringQuestions.slice(0, QUESTION_LIMITS.nestedConfig.filteringComplex)) {
+    questions.push(
+      new QuestionBuilder()
+        .id(getId())
+        .prompt(q.prompt)
+        .groundTruth(q.groundTruth)
+        .type('filtering')
+        .dataset('nested-config')
+        .build(),
+    )
+  }
+
+  return questions
+}
diff --git a/benchmarks/src/questions/nested.ts b/benchmarks/src/questions/nested.ts
new file mode 100644
index 0000000..e54512b
--- /dev/null
+++ b/benchmarks/src/questions/nested.ts
@@ -0,0 +1,202 @@
+import type { Order } from '../datasets'
+import type { Question } from '../types'
+import { QUESTION_LIMITS, QUESTION_THRESHOLDS } from '../constants'
+import { countByPredicate, QuestionBuilder, rotateQuestions, SAMPLE_STRIDES } from './utils'
+
+/**
+ * Generate nested (orders) questions
+ */
+export function generateNestedQuestions(orders: Order[], getId: () => string): Question[] {
+  const questions: Question[] = []
+
+  if (orders.length === 0)
+    return questions
+
+  // Field retrieval: order totals and statuses
+  const orderFieldGenerators: Array<(order: Order, getId: () => string) => Question> = [
+    (order, getId) => new QuestionBuilder()
+      .id(getId())
+      .prompt(`What is the total for order ${order.orderId}?`)
+      .groundTruth(String(order.total))
+      .type('field-retrieval')
+      .dataset('nested')
+      .build(),
+    (order, getId) => new QuestionBuilder()
+      .id(getId())
+      .prompt(`What is the status of order ${order.orderId}?`)
+      .groundTruth(order.status)
+      .type('field-retrieval')
+      .dataset('nested')
+      .build(),
+  ]
+
+  questions.push(...rotateQuestions(
+    orders,
+    orderFieldGenerators,
+    QUESTION_LIMITS.nested.fieldRetrievalOrders,
+    SAMPLE_STRIDES.ORDER_FIELD,
+    getId,
+  ))
+
+  // Field retrieval: customer info and order dates
+  const customerFieldGenerators: Array<(order: Order, getId: () => string) => Question> = [
+    (order, getId) => new QuestionBuilder()
+      .id(getId())
+      .prompt(`What is the customer name for order ${order.orderId}?`)
+      .groundTruth(order.customer.name)
+      .type('field-retrieval')
+      .dataset('nested')
+      .build(),
+    (order, getId) => new QuestionBuilder()
+      .id(getId())
+      .prompt(`What is the customer email for order ${order.orderId}?`)
+      .groundTruth(order.customer.email)
+      .type('field-retrieval')
+      .dataset('nested')
+      .build(),
+    (order, getId) => new QuestionBuilder()
+      .id(getId())
+      .prompt(`What is the order date for order ${order.orderId}?`)
+      .groundTruth(order.orderDate || '')
+      .type('field-retrieval')
+      .dataset('nested')
+      .build(),
+    (order, getId) => new QuestionBuilder()
+      .id(getId())
+      .prompt(`How many items are in order ${order.orderId}?`)
+      .groundTruth(String(order.items.length))
+      .type('field-retrieval')
+      .dataset('nested')
+      .build(),
+  ]
+
+  // Use stride + 1 for customer fields to offset from order fields
+  const customerOrders = orders.map((_, i) => orders[i * SAMPLE_STRIDES.CUSTOMER_FIELD + 1] || orders[i]).filter(Boolean) as Order[]
+  questions.push(...rotateQuestions(
+    customerOrders,
+    customerFieldGenerators,
+    QUESTION_LIMITS.nested.fieldRetrievalCustomers,
+    1,
+    getId,
+  ))
+
+  // Aggregation: totals and averages
+  const totalRevenue = orders.reduce((sum, o) => sum + o.total, 0)
+  const avgOrderValue = totalRevenue / orders.length
+  const totalOrders = orders.length
+  const maxOrderValue = Math.max(...orders.map(o => o.total))
+
+  // Count by status
+  const statuses = [...new Set(orders.map(o => o.status))]
+  for (const status of statuses.slice(0, QUESTION_LIMITS.nested.aggregationStatuses)) {
+    const count = countByPredicate(orders, o => o.status === status)
+    questions.push(
+      new QuestionBuilder()
+        .id(getId())
+        .prompt(`How many orders have status "${status}"?`)
+        .groundTruth(String(count))
+        .type('aggregation')
+        .dataset('nested')
+        .build(),
+    )
+  }
+
+  questions.push(
+    new QuestionBuilder()
+      .id(getId())
+      .prompt('What is the total revenue across all orders?')
+      .groundTruth(String(totalRevenue.toFixed(2)))
+      .type('aggregation')
+      .dataset('nested')
+      .build(),
+    new QuestionBuilder()
+      .id(getId())
+      .prompt('What is the average order value?')
+      .groundTruth(String(avgOrderValue.toFixed(2)))
+      .type('aggregation')
+      .dataset('nested')
+      .build(),
+    new QuestionBuilder()
+      .id(getId())
+      .prompt('How many orders are in the dataset?')
+      .groundTruth(String(totalOrders))
+      .type('aggregation')
+      .dataset('nested')
+      .build(),
+    new QuestionBuilder()
+      .id(getId())
+      .prompt('What is the highest order total?')
+      .groundTruth(String(maxOrderValue.toFixed(2)))
+      .type('aggregation')
+      .dataset('nested')
+      .build(),
+  )
+
+  // Aggregation: high-value orders (single-condition filter)
+  for (const threshold of QUESTION_THRESHOLDS.nested.highValueOrders) {
+    const count = countByPredicate(orders, o => o.total > threshold)
+    questions.push(
+      new QuestionBuilder()
+        .id(getId())
+        .prompt(`How many orders have a total greater than ${threshold}?`)
+        .groundTruth(String(count))
+        .type('aggregation')
+        .dataset('nested')
+        .build(),
+    )
+  }
+
+  // Filtering: multi-condition queries (status AND value)
+  const orderStatuses = [...new Set(orders.map(o => o.status))]
+  for (const status of orderStatuses.slice(0, QUESTION_LIMITS.nested.filteringStatusAndValue)) {
+    const count = countByPredicate(
+      orders,
+      o => o.status === status &&
o.total > QUESTION_THRESHOLDS.nested.statusValueThreshold, + ) + questions.push( + new QuestionBuilder() + .id(getId()) + .prompt(`How many orders have status "${status}" and total greater than ${QUESTION_THRESHOLDS.nested.statusValueThreshold}?`) + .groundTruth(String(count)) + .type('filtering') + .dataset('nested') + .build(), + ) + } + + // Filtering: status AND items count (multi-condition) + for (const status of orderStatuses.slice(0, QUESTION_LIMITS.nested.filteringStatusAndItems)) { + const count = countByPredicate( + orders, + o => o.status === status && o.items.length >= QUESTION_THRESHOLDS.nested.itemCountThreshold, + ) + questions.push( + new QuestionBuilder() + .id(getId()) + .prompt(`How many orders have status "${status}" and at least ${QUESTION_THRESHOLDS.nested.itemCountThreshold} items?`) + .groundTruth(String(count)) + .type('filtering') + .dataset('nested') + .build(), + ) + } + + // Filtering: total AND items count (multi-condition) + for (const threshold of QUESTION_THRESHOLDS.nested.totalThresholdsForItems) { + const count = countByPredicate( + orders, + o => o.total > threshold && o.items.length >= QUESTION_THRESHOLDS.nested.itemCountThreshold, + ) + questions.push( + new QuestionBuilder() + .id(getId()) + .prompt(`How many orders have a total greater than ${threshold} and at least ${QUESTION_THRESHOLDS.nested.itemCountThreshold} items?`) + .groundTruth(String(count)) + .type('filtering') + .dataset('nested') + .build(), + ) + } + + return questions +} diff --git a/benchmarks/src/questions/tabular.ts b/benchmarks/src/questions/tabular.ts new file mode 100644 index 0000000..951bfdb --- /dev/null +++ b/benchmarks/src/questions/tabular.ts @@ -0,0 +1,191 @@ +import type { Employee } from '../datasets' +import type { Question } from '../types' +import { QUESTION_LIMITS, QUESTION_THRESHOLDS } from '../constants' +import { countByPredicate, QuestionBuilder, rotateQuestions, SAMPLE_STRIDES } from './utils' + +/** + * Generate tabular (employee) 
questions + */ +export function generateTabularQuestions(employees: Employee[], getId: () => string): Question[] { + const questions: Question[] = [] + + if (employees.length === 0) + return questions + + // Field retrieval: specific employees + const fieldGenerators: Array<(emp: Employee, getId: () => string) => Question> = [ + (emp, getId) => new QuestionBuilder() + .id(getId()) + .prompt(`What is the salary of ${emp.name}?`) + .groundTruth(String(emp.salary)) + .type('field-retrieval') + .dataset('tabular') + .build(), + (emp, getId) => new QuestionBuilder() + .id(getId()) + .prompt(`What department does ${emp.name} work in?`) + .groundTruth(emp.department) + .type('field-retrieval') + .dataset('tabular') + .build(), + (emp, getId) => new QuestionBuilder() + .id(getId()) + .prompt(`What is the email address of ${emp.name}?`) + .groundTruth(emp.email) + .type('field-retrieval') + .dataset('tabular') + .build(), + (emp, getId) => new QuestionBuilder() + .id(getId()) + .prompt(`How many years of experience does ${emp.name} have?`) + .groundTruth(String(emp.yearsExperience)) + .type('field-retrieval') + .dataset('tabular') + .build(), + (emp, getId) => new QuestionBuilder() + .id(getId()) + .prompt(`Is ${emp.name} an active employee?`) + .groundTruth(emp.active ? 
'yes' : 'no') + .type('field-retrieval') + .dataset('tabular') + .build(), + ] + + questions.push(...rotateQuestions( + employees, + fieldGenerators, + QUESTION_LIMITS.tabular.fieldRetrieval, + SAMPLE_STRIDES.EMPLOYEE_FIELD, + getId, + )) + + // Aggregation: count by department + const departments = [...new Set(employees.map(e => e.department))] + for (const dept of departments.slice(0, QUESTION_LIMITS.tabular.aggregationDepartments)) { + const count = countByPredicate(employees, e => e.department === dept) + questions.push( + new QuestionBuilder() + .id(getId()) + .prompt(`How many employees work in ${dept}?`) + .groundTruth(String(count)) + .type('aggregation') + .dataset('tabular') + .build(), + ) + } + + // Aggregation: salary ranges (single-condition filters) + for (const threshold of QUESTION_THRESHOLDS.tabular.salaryRanges) { + const count = countByPredicate(employees, e => e.salary > threshold) + questions.push( + new QuestionBuilder() + .id(getId()) + .prompt(`How many employees have a salary greater than ${threshold}?`) + .groundTruth(String(count)) + .type('aggregation') + .dataset('tabular') + .build(), + ) + } + + // Aggregation: totals and averages + const totalEmployees = employees.length + const avgSalary = Math.round(employees.reduce((sum, e) => sum + e.salary, 0) / totalEmployees) + const activeCount = countByPredicate(employees, e => e.active) + const inactiveCount = countByPredicate(employees, e => !e.active) + + questions.push( + new QuestionBuilder() + .id(getId()) + .prompt('How many employees are in the dataset?') + .groundTruth(String(totalEmployees)) + .type('aggregation') + .dataset('tabular') + .build(), + new QuestionBuilder() + .id(getId()) + .prompt('What is the average salary across all employees?') + .groundTruth(String(avgSalary)) + .type('aggregation') + .dataset('tabular') + .build(), + new QuestionBuilder() + .id(getId()) + .prompt('How many employees are active?') + .groundTruth(String(activeCount)) + .type('aggregation') + 
.dataset('tabular') + .build(), + new QuestionBuilder() + .id(getId()) + .prompt('How many employees are inactive?') + .groundTruth(String(inactiveCount)) + .type('aggregation') + .dataset('tabular') + .build(), + ) + + // Filtering: count by department with salary filter (multi-condition) + for (const dept of departments.slice(0, QUESTION_LIMITS.tabular.filteringMultiConditionDepartments)) { + const count = countByPredicate( + employees, + e => e.department === dept && e.salary > QUESTION_THRESHOLDS.tabular.departmentSalaryThreshold, + ) + questions.push( + new QuestionBuilder() + .id(getId()) + .prompt(`How many employees in ${dept} have a salary greater than ${QUESTION_THRESHOLDS.tabular.departmentSalaryThreshold}?`) + .groundTruth(String(count)) + .type('filtering') + .dataset('tabular') + .build(), + ) + } + + // Filtering: active employees by experience (multi-condition) + for (const exp of QUESTION_THRESHOLDS.tabular.experienceYears.slice(0, QUESTION_LIMITS.tabular.filteringExperience)) { + const count = countByPredicate(employees, e => e.yearsExperience > exp && e.active) + questions.push( + new QuestionBuilder() + .id(getId()) + .prompt(`How many active employees have more than ${exp} years of experience?`) + .groundTruth(String(count)) + .type('filtering') + .dataset('tabular') + .build(), + ) + } + + // Filtering: department by experience (multi-condition) + for (const dept of departments.slice(0, QUESTION_LIMITS.tabular.filteringDepartmentExp)) { + const count = countByPredicate( + employees, + e => e.department === dept && e.yearsExperience > QUESTION_THRESHOLDS.tabular.departmentExperienceThreshold, + ) + questions.push( + new QuestionBuilder() + .id(getId()) + .prompt(`How many employees in ${dept} have more than ${QUESTION_THRESHOLDS.tabular.departmentExperienceThreshold} years of experience?`) + .groundTruth(String(count)) + .type('filtering') + .dataset('tabular') + .build(), + ) + } + + // Filtering: department by active status (multi-condition) 
+ for (const dept of departments.slice(0, QUESTION_LIMITS.tabular.filteringDepartmentActive)) { + const count = countByPredicate(employees, e => e.department === dept && e.active) + questions.push( + new QuestionBuilder() + .id(getId()) + .prompt(`How many active employees work in ${dept}?`) + .groundTruth(String(count)) + .type('filtering') + .dataset('tabular') + .build(), + ) + } + + return questions +} diff --git a/benchmarks/src/questions/utils.ts b/benchmarks/src/questions/utils.ts new file mode 100644 index 0000000..45c2c58 --- /dev/null +++ b/benchmarks/src/questions/utils.ts @@ -0,0 +1,95 @@ +import type { Question } from '../types' + +// Constants for sampling strides +export const SAMPLE_STRIDES = { + EMPLOYEE_FIELD: 2, + ORDER_FIELD: 2, + CUSTOMER_FIELD: 2, + ANALYTICS_FIELD: 3, + METRIC_FIELD: 3, + REPO_FIELD: 7, + EVENT_LOG_FIELD: 5, +} as const + +/** + * ID Generator + */ +export function* createIdGenerator(): Generator { + let id = 1 + while (true) { + yield `q${id++}` + } +} + +/** + * Question Builder class for fluent question creation + */ +export class QuestionBuilder { + private question: Partial = {} + + id(id: string): this { + this.question.id = id + return this + } + + prompt(prompt: string): this { + this.question.prompt = prompt + return this + } + + groundTruth(groundTruth: string): this { + this.question.groundTruth = groundTruth + return this + } + + type(type: Question['type']): this { + this.question.type = type + return this + } + + dataset(dataset: Question['dataset']): this { + this.question.dataset = dataset + return this + } + + build(): Question { + if (!this.question.id || !this.question.prompt || !this.question.groundTruth || !this.question.type || !this.question.dataset) { + throw new Error('Incomplete question') + } + return this.question as Question + } +} + +/** + * Helper: Count items matching a predicate + */ +export function countByPredicate(items: T[], predicate: (item: T) => boolean): number { + return 
items.filter(predicate).length +} + +/** + * Helper: Rotate through question generators + */ +export function rotateQuestions( + items: T[], + generators: Array<(item: T, getId: () => string) => Question>, + limit: number, + stride: number, + getId: () => string, +): Question[] { + const questions: Question[] = [] + + for (let i = 0; i < Math.min(limit, items.length); i++) { + const item = items[i * stride] || items[i] + if (!item) + continue + + const generatorIndex = i % generators.length + const generator = generators[generatorIndex] + if (generator) { + questions.push(generator(item, getId)) + } + } + + return questions +} diff --git a/benchmarks/src/report.ts b/benchmarks/src/report.ts index 9f18e58..4848cb4 100644 --- a/benchmarks/src/report.ts +++ b/benchmarks/src/report.ts @@ -1,7 +1,8 @@ -import type { EfficiencyRanking, EvaluationResult, FormatResult, Question } from './types' +import type { Dataset, EfficiencyRanking, EvaluationResult, FormatResult, Question } from './types' import { FORMATTER_DISPLAY_NAMES } from './constants' -import { datasets } from './datasets' +import { ACCURACY_DATASETS } from './datasets' import { models } from './evaluate' +import { supportsCSV } from './formatters' import { generateQuestions } from './questions' import { createProgressBar, tokenize } from './utils' @@ -16,7 +17,11 @@ export function calculateTokenCounts( const tokenCounts: Record = {} for (const [formatName, formatter] of Object.entries(formatters)) { - for (const dataset of datasets) { + for (const dataset of ACCURACY_DATASETS) { + // Skip CSV for datasets that don't support it + if (formatName === 'csv' && !supportsCSV(dataset)) + continue + const formatted = formatter(dataset.data) const key = `${formatName}-${dataset.name}` tokenCounts[key] = tokenize(formatted) @@ -42,9 +47,9 @@ export function calculateFormatResults( const accuracy = correctCount / totalCount // Calculate average tokens across all datasets for this format - const avgTokens = 
Object.entries(tokenCounts) + const formatTokenEntries = Object.entries(tokenCounts) .filter(([key]) => key.startsWith(`${formatName}-`)) - .reduce((sum, [, tokens]) => sum + tokens, 0) / datasets.length + const avgTokens = formatTokenEntries.reduce((sum, [, tokens]) => sum + tokens, 0) / formatTokenEntries.length const averageLatency = formatResults.reduce((sum, r) => sum + r.latencyMs, 0) / totalCount @@ -75,6 +80,8 @@ export function generateAccuracyReport( return ` Benchmarks test LLM comprehension across different input formats using ${totalQuestions} data retrieval questions on ${modelNames.length} ${modelNames.length === 1 ? 'model' : 'models'}. +${generateDatasetCatalog(ACCURACY_DATASETS)} + #### Efficiency Ranking (Accuracy per 1K Tokens) ${generateEfficiencyRankingReport(formatResults)} @@ -85,6 +92,38 @@ ${generateDetailedAccuracyReport(formatResults, results, questions, tokenCounts) `.trimStart() } +/** + * Generate dataset catalog section + */ +function generateDatasetCatalog(datasets: Dataset[]): string { + const rows = datasets.map((dataset) => { + const csvSupport = supportsCSV(dataset) ? 'โœ“' : 'โœ—' + const rowCount = Object.values(dataset.data)[0]?.length ?? 
1 + const structure = dataset.metadata.structureClass + const eligibility = `${dataset.metadata.tabularEligibility}%` + + return `| ${dataset.description} | ${rowCount} | ${structure} | ${csvSupport} | ${eligibility} |` + }).join('\n') + + return ` +#### Dataset Catalog + +| Dataset | Rows | Structure | CSV Support | Eligibility | +| ------- | ---- | --------- | ----------- | ----------- | +${rows} + +**Structure classes:** +- **uniform**: All objects have identical fields with primitive values +- **semi-uniform**: Mix of uniform and non-uniform structures +- **nested**: Objects with nested structures (nested objects or arrays) +- **deep**: Highly nested with minimal tabular eligibility + +**CSV Support:** โœ“ (supported), โœ— (not supported - would require lossy flattening) + +**Eligibility:** Percentage of arrays that qualify for TOON's tabular format (uniform objects with primitive values) +`.trim() +} + /** * Generate efficiency ranking report */ @@ -168,10 +207,12 @@ function generateDetailedAccuracyReport( const filteringPercent = ((filteringCount / totalQuestions) * 100).toFixed(0) // Calculate dataset sizes - const tabularSize = datasets.find(d => d.name === 'tabular')?.data.employees?.length || 0 - const nestedSize = datasets.find(d => d.name === 'nested')?.data.orders?.length || 0 - const analyticsSize = datasets.find(d => d.name === 'analytics')?.data.metrics?.length || 0 - const githubSize = datasets.find(d => d.name === 'github')?.data.repositories?.length || 0 + const tabularSize = ACCURACY_DATASETS.find(d => d.name === 'tabular')?.data.employees?.length || 0 + const nestedSize = ACCURACY_DATASETS.find(d => d.name === 'nested')?.data.orders?.length || 0 + const analyticsSize = ACCURACY_DATASETS.find(d => d.name === 'analytics')?.data.metrics?.length || 0 + const githubSize = ACCURACY_DATASETS.find(d => d.name === 'github')?.data.repositories?.length || 0 + const eventLogsSize = ACCURACY_DATASETS.find(d => d.name === 'event-logs')?.data.logs?.length || 
0 + const nestedConfigSize = 1 // Single config object // Calculate number of formats and evaluations const formatCount = formatResults.length @@ -208,12 +249,14 @@ This benchmark tests **LLM comprehension and data retrieval accuracy** across di #### Datasets Tested -Four datasets designed to test different structural patterns (all contain arrays of uniform objects, TOON's optimal format): +Six datasets designed to test different structural patterns: 1. **Tabular** (${tabularSize} employee records): Uniform objects with identical fields โ€“ optimal for TOON's tabular format. 2. **Nested** (${nestedSize} e-commerce orders): Complex structures with nested customer objects and item arrays. 3. **Analytics** (${analyticsSize} days of metrics): Time-series data with dates and numeric values. 4. **GitHub** (${githubSize} repositories): Real-world data from top GitHub repos by stars. +5. **Event Logs** (${eventLogsSize} logs): Semi-uniform data with ~50% flat logs and ~50% with nested error objects. +6. **Nested Config** (${nestedConfigSize} configuration): Deeply nested configuration with minimal tabular eligibility. 
#### Question Types @@ -314,7 +357,7 @@ function generateDatasetBreakdown( questions: Question[], tokenCounts: Record, ): string { - return datasets.map((dataset) => { + return ACCURACY_DATASETS.map((dataset) => { const datasetResults = formatResults.map((fr) => { const datasetFormatResults = results.filter(r => r.questionId.includes(dataset.name) || questions.find(q => q.id === r.questionId)?.dataset === dataset.name) if (datasetFormatResults.length === 0) diff --git a/benchmarks/src/types.ts b/benchmarks/src/types.ts index 0b3da4c..5676920 100644 --- a/benchmarks/src/types.ts +++ b/benchmarks/src/types.ts @@ -1,7 +1,14 @@ +export interface DatasetMetadata { + supportsCSV: boolean + structureClass: 'uniform' | 'semi-uniform' | 'nested' | 'deep' + tabularEligibility: number +} + export interface Dataset { name: string description: string data: Record + metadata: DatasetMetadata } export interface Question {