test(benchmark): overhaul generation

Johann Schopplich
2025-11-06 14:45:44 +01:00
parent 9863875706
commit bc711ccecf
19 changed files with 2254 additions and 997 deletions

View File

@@ -60,6 +60,19 @@ For small payloads, JSON/CSV/YAML work fine. TOON's value emerges at scale: when
</details>
<details>
<summary><strong>When NOT to use TOON</strong></summary>
TOON excels with uniform arrays of objects, but there are cases where other formats are better:
- **Deeply nested or non-uniform structures** (tabular eligibility ≈ 0%): JSON-compact often uses fewer tokens. Example: complex configuration objects with many nested levels.
- **Semi-uniform arrays** (~40–60% tabular eligibility): Token savings diminish. Prefer JSON if your pipelines already rely on it.
- **Flat CSV use-cases**: CSV is smaller than TOON for pure tabular data. TOON adds minimal overhead (~5-10%) to provide structure (length markers, field headers, delimiter scoping) that improves LLM reliability.
See [benchmarks](#benchmarks) for concrete comparisons across different data structures.
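As a rough, hand-written illustration (not benchmark output), the same two rows in CSV and in TOON's tabular syntax differ only by TOON's length marker and field header, mirroring the examples in the [benchmarks](#benchmarks) section:

**CSV:**

```
date,views
2025-01-01,4324
2025-01-02,6248
```

**TOON:**

```
metrics[2]{date,views}:
2025-01-01,4324
2025-01-02,6248
```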
</details>
## Key Features

- 💸 **Token-efficient:** typically 30–60% fewer tokens than JSON[^1]
@@ -75,14 +88,16 @@ For small payloads, JSON/CSV/YAML work fine. TOON's value emerges at scale: when
> [!TIP]
> Try the interactive [Format Tokenization Playground](https://www.curiouslychase.com/playground/format-tokenization-exploration) to compare token usage across CSV, JSON, YAML, and TOON with your own data.
Benchmarks are organized into two tracks to ensure fair comparisons:
- **Mixed-Structure Track**: Datasets with nested or semi-uniform structures (TOON vs JSON, YAML, XML). CSV excluded as it cannot properly represent these structures.
- **Flat-Only Track**: Datasets with flat tabular structures where CSV is applicable (CSV vs TOON vs JSON, YAML, XML).
### Token Efficiency

Token counts are measured using the GPT-5 `o200k_base` tokenizer via [`gpt-tokenizer`](https://github.com/niieani/gpt-tokenizer). Savings are calculated against formatted JSON (2-space indentation) as the primary baseline, with additional comparisons to compact JSON (minified), YAML, and XML. Actual savings vary by model and tokenizer.

The benchmarks test datasets across different structural patterns (uniform, semi-uniform, nested, deeply nested) to show where TOON excels and where other formats may be better.
> [!NOTE]
> CSV/TSV doesn't support nested structures, so it's not included in this comparison. For flat datasets where CSV applies, see token counts and accuracy metrics in the [Retrieval Accuracy](#retrieval-accuracy) section.
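As a minimal sketch of how such counts can be reproduced (this assumes `gpt-tokenizer`'s `o200k_base` entry point; the `countTokens` helper below is illustrative, not part of the benchmark code):

```ts
// Illustrative token counting with gpt-tokenizer's o200k_base encoding.
// `countTokens` is a hypothetical helper, not the benchmark's own utility.
import { encode } from 'gpt-tokenizer/encoding/o200k_base'

function countTokens(payload: string): number {
  return encode(payload).length
}

const json = JSON.stringify({ users: [{ id: 1, name: 'Ada' }] }, null, 2)
console.log(countTokens(json)) // token count of the formatted JSON payload
```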
<!-- automd:file src="./benchmarks/results/token-efficiency.md" -->

View File

@@ -34,8 +34,8 @@ Results are saved to `results/token-efficiency.md`.
Tests how well LLMs can answer questions about data in different formats (TOON, JSON, JSON compact, XML, YAML, CSV):

1. Generate ~150-160 questions across 6 datasets (CSV only included for datasets with flat/tabular structure)
2. Convert each dataset to all supported formats
3. Query each LLM with formatted data + question
4. Validate answers using `gpt-5-nano` as judge
5. Aggregate metrics and generate report
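The judge step (4) can be pictured as a small wrapper like the sketch below; `callModel` is a stand-in for whatever client reaches `gpt-5-nano`, not the actual implementation:

```ts
// Sketch of LLM-as-judge validation (step 4). `callModel` is a hypothetical
// helper that sends a prompt to a model and resolves with its text response.
async function judgeAnswer(
  callModel: (modelId: string, prompt: string) => Promise<string>,
  question: string,
  groundTruth: string,
  modelAnswer: string,
): Promise<boolean> {
  const verdict = await callModel('gpt-5-nano', [
    `Question: ${question}`,
    `Expected answer: ${groundTruth}`,
    `Model answer: ${modelAnswer}`,
    'Reply with exactly "correct" or "incorrect".',
  ].join('\n'))
  return verdict.trim().toLowerCase().startsWith('correct')
}
```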

View File

@@ -1,36 +1,149 @@
## Mixed-Structure Track
Datasets with nested or semi-uniform structures. CSV excluded as it cannot properly represent these structures.
```
🛒 E-commerce orders with nested structures [eligibility: 33%]
toon ▓▓▓▓▓▓▓▓▓▓▓▓░░░░░░░░  58,528 tokens
vs JSON (37.9%) 94,207
vs JSON compact (+0.9%) 57,979
vs YAML (17.8%) 71,223
vs XML (45.2%) 106,720
🧾 Semi-uniform event logs [eligibility: 50%]
toon ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓░░░ 154,419 tokens
vs JSON (15.0%) 181,592
vs JSON compact (+19.9%) 128,836
vs YAML (0.9%) 155,749
vs XML (25.1%) 206,271
🧩 Deeply nested configuration [eligibility: 0%]
toon ▓▓▓▓▓▓▓▓▓▓▓▓▓▓░░░░░░ 630 tokens
vs JSON (31.4%) 918
vs JSON compact (+11.9%) 563
vs YAML (6.4%) 673
vs XML (37.4%) 1,007
─────────────────────────────────────────────────────────────────────────────────
Total
toon ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓░░░░░ 213,577 tokens
vs JSON (22.8%) 276,717
vs JSON compact (+14.0%) 187,378
vs YAML (6.2%) 227,645
vs XML (32.0%) 313,998
```
## Flat-Only Track
Datasets with flat tabular structures where CSV is applicable.
```
👥 Uniform employee records (TOON optimal format) [eligibility: 100%]
csv ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓░ 46,968 tokens
toon ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓ 49,841 tokens (+5.8% vs CSV)
vs JSON (60.7%) 126,886
vs JSON compact (36.8%) 78,882
vs YAML (50.0%) 99,743
vs XML (66.0%) 146,465
📈 Time-series analytics data [eligibility: 100%]
csv ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓░░ 8,382 tokens
toon ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓ 9,114 tokens (+8.0% vs CSV)
vs JSON (59.0%) 22,244
vs JSON compact (35.9%) 14,210
vs YAML (49.0%) 17,857
vs XML (65.8%) 26,615
⭐ Top 100 GitHub repositories [eligibility: 100%]
csv ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓░ 8,513 tokens
toon ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓ 8,745 tokens (+2.7% vs CSV)
vs JSON (42.3%) 15,145
vs JSON compact (23.7%) 11,455
vs YAML (33.4%) 13,129
vs XML (48.8%) 17,095
─────────────────────────────────────────────────────────────────────────────────
Total
csv ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓░ 63,863 tokens
toon ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓ 67,700 tokens (+5.7% vs CSV)
vs JSON (58.8%) 164,275
vs JSON compact (35.2%) 104,547
vs YAML (48.2%) 130,729
vs XML (64.4%) 190,175
```
<details>
<summary><strong>View detailed examples</strong></summary>

#### 📈 Time-series analytics data

**Savings:** 13,130 tokens (59.0% reduction vs JSON)
**JSON** (22,244 tokens):
```json
{
"metrics": [
{
"date": "2025-01-01",
"views": 4324,
"clicks": 146,
"conversions": 21,
"revenue": 3834.57,
"bounceRate": 0.4
},
{
"date": "2025-01-02",
"views": 6248,
"clicks": 407,
"conversions": 22,
"revenue": 2936.12,
"bounceRate": 0.62
},
{
"date": "2025-01-03",
"views": 7382,
"clicks": 270,
"conversions": 24,
"revenue": 6825.19,
"bounceRate": 0.7
},
{
"date": "2025-01-04",
"views": 4586,
"clicks": 267,
"conversions": 24,
"revenue": 2391.11,
"bounceRate": 0.64
},
{
"date": "2025-01-05",
"views": 6171,
"clicks": 227,
"conversions": 12,
"revenue": 3430.1,
"bounceRate": 0.39
}
]
}
```
**TOON** (9,114 tokens):
```
metrics[5]{date,views,clicks,conversions,revenue,bounceRate}:
2025-01-01,4324,146,21,3834.57,0.4
2025-01-02,6248,407,22,2936.12,0.62
2025-01-03,7382,270,24,6825.19,0.7
2025-01-04,4586,267,24,2391.11,0.64
2025-01-05,6171,227,12,3430.1,0.39
```
---
#### ⭐ Top 100 GitHub repositories
**Savings:** 6,400 tokens (42.3% reduction vs JSON)
@@ -91,72 +204,4 @@ repositories[3]{id,name,repo,description,createdAt,updatedAt,pushedAt,stars,watc
21737465,awesome,sindresorhus/awesome,😎 Awesome lists about all kinds of interesting topics,"2014-07-11T13:42:37Z","2025-10-28T12:40:21Z","2025-10-27T17:57:31Z",410052,8017,32029,main
```
---
#### 📈 Daily Analytics
**Configuration:** 180 days of web metrics (views, clicks, conversions, revenue)
**Savings:** 6,470 tokens (58.9% reduction vs JSON)
**JSON** (10,977 tokens):
```json
{
"metrics": [
{
"date": "2025-01-01",
"views": 6890,
"clicks": 401,
"conversions": 23,
"revenue": 6015.59,
"bounceRate": 0.63
},
{
"date": "2025-01-02",
"views": 6940,
"clicks": 323,
"conversions": 37,
"revenue": 9086.44,
"bounceRate": 0.36
},
{
"date": "2025-01-03",
"views": 4390,
"clicks": 346,
"conversions": 26,
"revenue": 6360.75,
"bounceRate": 0.48
},
{
"date": "2025-01-04",
"views": 3429,
"clicks": 231,
"conversions": 13,
"revenue": 2360.96,
"bounceRate": 0.65
},
{
"date": "2025-01-05",
"views": 5804,
"clicks": 186,
"conversions": 22,
"revenue": 2535.96,
"bounceRate": 0.37
}
]
}
```
**TOON** (4,507 tokens):
```
metrics[5]{date,views,clicks,conversions,revenue,bounceRate}:
2025-01-01,6890,401,23,6015.59,0.63
2025-01-02,6940,323,37,9086.44,0.36
2025-01-03,4390,346,26,6360.75,0.48
2025-01-04,3429,231,13,2360.96,0.65
2025-01-05,5804,186,22,2535.96,0.37
```
</details>

View File

@@ -5,16 +5,83 @@ import process from 'node:process'
import * as prompts from '@clack/prompts'
import PQueue from 'p-queue'
import { BENCHMARKS_DIR, DEFAULT_CONCURRENCY, DRY_RUN, DRY_RUN_LIMITS, MODEL_RPM_LIMITS, ROOT_DIR } from '../src/constants'
import { ACCURACY_DATASETS } from '../src/datasets'
import { evaluateQuestion, models } from '../src/evaluate'
import { formatters, supportsCSV } from '../src/formatters'
import { generateQuestions } from '../src/questions'
import { calculateFormatResults, calculateTokenCounts, generateAccuracyReport } from '../src/report'
import { getAllModelResults, hasModelResults, saveModelResults } from '../src/storage'
import { ensureDir } from '../src/utils'
// Constants
const PROGRESS_UPDATE_INTERVAL = 10
const RATE_LIMIT_INTERVAL_MS = 60_000
prompts.intro('Retrieval Accuracy Benchmark')
/**
* Generate evaluation tasks for a model
*/
function generateEvaluationTasks(questions: Question[]): { question: Question, formatName: string }[] {
const tasks: { question: Question, formatName: string }[] = []
for (const question of questions) {
for (const [formatName] of Object.entries(formatters)) {
// Skip CSV for datasets that don't support it
const dataset = ACCURACY_DATASETS.find(d => d.name === question.dataset)
if (formatName === 'csv' && dataset && !supportsCSV(dataset))
continue
tasks.push({ question, formatName })
}
}
return tasks
}
/**
* Check which models already have saved results
*/
async function checkExistingResults(activeModels: typeof models) {
const existingModelResults: Record<string, boolean> = {}
for (const model of activeModels) {
const existingResult = await hasModelResults(model.modelId)
if (existingResult)
existingModelResults[model.modelId] = existingResult
}
return existingModelResults
}
/**
* Create a progress updater function
*/
function createProgressUpdater(spinner: ReturnType<typeof prompts.spinner>, total: number) {
let completed = 0
return () => {
completed++
if (completed % PROGRESS_UPDATE_INTERVAL === 0 || completed === total) {
const percent = ((completed / total) * 100).toFixed(1)
spinner.message(`Progress: ${completed}/${total} (${percent}%)`)
}
}
}
/**
* Create a rate-limited queue for model evaluation
*/
function createEvaluationQueue(modelId: string) {
const rpmLimit = MODEL_RPM_LIMITS[modelId]
return new PQueue({
concurrency: DEFAULT_CONCURRENCY,
intervalCap: rpmLimit ?? Infinity,
interval: rpmLimit ? RATE_LIMIT_INTERVAL_MS : 0,
})
}
// Prompt user to select which models to benchmark
const modelChoices = models.map(({ modelId }) => ({
value: modelId,
@@ -37,15 +104,10 @@ const activeModels = models.filter(m => selectedModels.includes(m.modelId))
prompts.log.info(`Selected ${activeModels.length} model(s): ${activeModels.map(m => m.modelId).join(', ')}`)
// Check which models already have results
const existingModelResults = await checkExistingResults(activeModels)

if (Object.keys(existingModelResults).length > 0) {
prompts.log.info(`Found existing results for ${Object.keys(existingModelResults).length} model(s)`)
}
if (DRY_RUN) {
@@ -75,31 +137,22 @@ for (const model of activeModels) {
prompts.log.step(`Running benchmark for ${modelId}`)
// Generate evaluation tasks for this model
const tasks = generateEvaluationTasks(questions)
const total = tasks.length
const rpmLimit = MODEL_RPM_LIMITS[modelId]
const queue = createEvaluationQueue(modelId)

const evalSpinner = prompts.spinner()
evalSpinner.start(`Running ${total} evaluations (concurrency: ${DEFAULT_CONCURRENCY}, RPM limit: ${rpmLimit ?? 'unlimited'})`)
const updateProgress = createProgressUpdater(evalSpinner, total)

// Queue all tasks
const modelResultPromises = tasks.map(task =>
queue.add(async () => {
// Format data on-demand
const dataset = ACCURACY_DATASETS.find(d => d.name === task.question.dataset)!
const formatter = formatters[task.formatName]!
const formattedData = formatter(dataset.data)
@@ -111,11 +164,7 @@ for (const model of activeModels) {
})

// Progress update after task completes
updateProgress()

return result
}),
@@ -154,5 +203,5 @@ await ensureDir(resultsDir)
const outputFilePath = path.join(resultsDir, 'retrieval-accuracy.md')
await fsp.writeFile(outputFilePath, accuracyReport)

reportSpinner.stop('Report generation complete!')
prompts.log.info(`Report saved to: \`${path.relative(ROOT_DIR, outputFilePath)}\``)

View File

@@ -1,11 +1,11 @@
import type { Dataset } from '../src/types'
import * as fsp from 'node:fs/promises'
import * as path from 'node:path'
import * as prompts from '@clack/prompts'
import { encode } from '../../packages/toon/src'
import { BENCHMARKS_DIR, FORMATTER_DISPLAY_NAMES, ROOT_DIR } from '../src/constants'
import { TOKEN_EFFICIENCY_DATASETS } from '../src/datasets'
import { formatters, supportsCSV } from '../src/formatters'
import { createProgressBar, ensureDir, tokenize } from '../src/utils'
interface FormatMetrics {
@@ -16,55 +16,160 @@ interface FormatMetrics {
}

interface BenchmarkResult {
dataset: Dataset
formats: FormatMetrics[]
}
// Constants
const DATASET_ICONS: Record<string, string> = {
'tabular': '👥',
'nested': '🛒',
'analytics': '📈',
'github': '⭐',
'event-logs': '🧾',
'nested-config': '🧩',
}

const COMPARISON_FORMAT_ORDER = ['json-pretty', 'json-compact', 'yaml', 'xml'] as const

const PROGRESS_BAR_CONFIG = { filled: '▓', empty: '░' } as const
const PROGRESS_BAR_WIDTH = 20
const TOKEN_PADDING = 7
const LABEL_PADDING = 60
const COMPARISON_LABEL_PADDING = 30

const SEPARATOR = '─────────────────────────────────────────────────────────────────────────────────'
const DEFAULT_DATASET_ICON = '📊'

const DETAILED_EXAMPLE_DATASETS = ['github', 'analytics'] as const
const GITHUB_REPO_LIMIT = 3
const GITHUB_DESC_LIMIT = 80
const ANALYTICS_METRICS_LIMIT = 5
prompts.intro('Token Efficiency Benchmark')
/**
* Format a comparison line showing savings vs TOON
*/
function formatComparisonLine(format: FormatMetrics): string {
const label = FORMATTER_DISPLAY_NAMES[format.name] || format.name.toUpperCase()
const signedPercent = format.savingsPercent >= 0
? `${format.savingsPercent.toFixed(1)}%`
: `+${Math.abs(format.savingsPercent).toFixed(1)}%`
const labelWithSavings = `vs ${label} (${signedPercent})`.padEnd(COMPARISON_LABEL_PADDING)
const tokenStr = format.tokens.toLocaleString('en-US').padStart(TOKEN_PADDING)
return ` ${labelWithSavings}${tokenStr}`
}
/**
* Calculate total tokens and savings for a set of datasets
*/
function calculateTotalMetrics(datasets: BenchmarkResult[], formatNames: readonly string[]) {
const totalToonTokens = datasets.reduce((sum, r) => {
const toon = r.formats.find(f => f.name === 'toon')!
return sum + toon.tokens
}, 0)
const totals = formatNames.map((formatName) => {
const totalTokens = datasets.reduce((sum, r) => {
const format = r.formats.find(f => f.name === formatName)
return sum + (format?.tokens || 0)
}, 0)
const savings = totalTokens - totalToonTokens
const savingsPercent = (savings / totalTokens) * 100
return { name: formatName, tokens: totalTokens, savingsPercent }
})
return { totalToonTokens, totals }
}
/**
* Generate total lines for a track
*/
function generateTotalLines(
totalToonTokens: number,
totals: { name: string, tokens: number, savingsPercent: number }[],
baselineFormat?: { name: string, tokens: number },
) {
const lines: string[] = ['Total ']
if (baselineFormat) {
// Flat-only track with CSV baseline
const csvPercentage = Math.min(100, (baselineFormat.tokens / totalToonTokens) * 100)
const csvBar = createProgressBar(csvPercentage, 100, PROGRESS_BAR_WIDTH, PROGRESS_BAR_CONFIG)
const csvStr = baselineFormat.tokens.toLocaleString('en-US').padStart(TOKEN_PADDING)
lines.push(`csv ${csvBar} ${csvStr} tokens`)
const overheadPercent = ((totalToonTokens - baselineFormat.tokens) / totalToonTokens) * 100
const toonBar = createProgressBar(100, 100, PROGRESS_BAR_WIDTH, PROGRESS_BAR_CONFIG)
const toonStr = totalToonTokens.toLocaleString('en-US').padStart(TOKEN_PADDING)
lines.push(`toon ${toonBar} ${toonStr} tokens (+${overheadPercent.toFixed(1)}% vs CSV)`)
}
else {
// Mixed-structure track
const totalPercentage = Math.min(100, (totalToonTokens / totals[0]!.tokens) * 100)
const totalBar = createProgressBar(totalPercentage, 100, PROGRESS_BAR_WIDTH, PROGRESS_BAR_CONFIG)
const toonStr = totalToonTokens.toLocaleString('en-US').padStart(TOKEN_PADDING)
lines.push(`toon ${totalBar} ${toonStr} tokens`)
}
// Add comparison lines
for (const format of totals) {
lines.push(formatComparisonLine({
name: format.name,
tokens: format.tokens,
savings: 0, // Not used in this context
savingsPercent: format.savingsPercent,
}))
}
return lines.join('\n')
}
/**
* Generate bar chart for a dataset
*/
function generateDatasetChart(result: BenchmarkResult): string {
const { dataset, formats } = result
const toon = formats.find(f => f.name === 'toon')!
const jsonPretty = formats.find(f => f.name === 'json-pretty')!
const emoji = DATASET_ICONS[dataset.name] || DEFAULT_DATASET_ICON
const eligibility = dataset.metadata.tabularEligibility
const name = `${dataset.description} [eligibility: ${eligibility}%]`
const percentage = Math.min(100, 100 - jsonPretty.savingsPercent)
const bar = createProgressBar(percentage, 100, PROGRESS_BAR_WIDTH, PROGRESS_BAR_CONFIG)
const toonStr = toon.tokens.toLocaleString('en-US')
const line1 = `${emoji} ${name.padEnd(LABEL_PADDING)}\ntoon ${bar} ${toonStr.padStart(TOKEN_PADDING)} tokens`
const comparisonLines = COMPARISON_FORMAT_ORDER.map((formatName) => {
const format = formats.find(f => f.name === formatName)
if (!format)
return null
return formatComparisonLine(format)
}).filter(Boolean)
return [line1, ...comparisonLines].join('\n')
}
const results: BenchmarkResult[] = []

// Calculate token counts for all datasets
for (const dataset of TOKEN_EFFICIENCY_DATASETS) {
const formatMetrics: FormatMetrics[] = []
const tokensByFormat: Record<string, number> = {}

// Calculate tokens for each format
for (const [formatName, formatter] of Object.entries(formatters)) {
// Skip CSV for datasets that don't support it
if (formatName === 'csv' && !supportsCSV(dataset))
continue

const formattedString = formatter(dataset.data)
const tokens = tokenize(formattedString)
tokensByFormat[formatName] = tokens
}
// Calculate savings vs TOON
@@ -80,105 +185,126 @@ for (const example of BENCHMARK_EXAMPLES) {
}

results.push({
dataset,
formats: formatMetrics,
})
}
// Separate datasets by CSV support
const mixedStructureDatasets = results.filter(r => !supportsCSV(r.dataset))
const flatOnlyDatasets = results.filter(r => supportsCSV(r.dataset))
// Mixed-Structure Track (no CSV)
const mixedCharts = mixedStructureDatasets
.map(result => generateDatasetChart(result))
.join('\n\n')

// Flat-Only Track (with CSV)
const flatCharts = flatOnlyDatasets
.map((result) => {
const csv = result.formats.find(f => f.name === 'csv')
const toon = result.formats.find(f => f.name === 'toon')!

if (!csv)
return generateDatasetChart(result)

// Special handling to show CSV first with TOON overhead
const { dataset } = result
const emoji = DATASET_ICONS[dataset.name] || DEFAULT_DATASET_ICON
const eligibility = dataset.metadata.tabularEligibility
const name = `${dataset.description} [eligibility: ${eligibility}%]`

// CSV line
const csvPercentage = Math.min(100, (csv.tokens / toon.tokens) * 100)
const csvBar = createProgressBar(csvPercentage, 100, PROGRESS_BAR_WIDTH, PROGRESS_BAR_CONFIG)
const csvStr = csv.tokens.toLocaleString('en-US')
const line1 = `${emoji} ${name.padEnd(LABEL_PADDING)}\ncsv ${csvBar} ${csvStr.padStart(TOKEN_PADDING)} tokens`

// TOON line with overhead vs CSV
const toonOverhead = toon.tokens - csv.tokens
const toonOverheadPercent = (toonOverhead / toon.tokens) * 100
const toonBar = createProgressBar(100, 100, PROGRESS_BAR_WIDTH, PROGRESS_BAR_CONFIG)
const toonStr = toon.tokens.toLocaleString('en-US')
const toonVsCSV = toonOverheadPercent >= 0
? `(+${toonOverheadPercent.toFixed(1)}% vs CSV)`
: `(${toonOverheadPercent.toFixed(1)}% vs CSV)`
const toonLine = `toon ${toonBar} ${toonStr.padStart(TOKEN_PADDING)} tokens ${toonVsCSV}`

// Other format comparisons (vs TOON)
const comparisonLines = COMPARISON_FORMAT_ORDER.map((formatName) => {
const format = result.formats.find(f => f.name === formatName)
if (!format)
return null
return formatComparisonLine(format)
}).filter(Boolean)

return [line1, toonLine, ...comparisonLines].join('\n')
})
.join('\n\n')
// Calculate totals for mixed structure
const { totalToonTokens: totalToonTokensMixed, totals: mixedTotals } = calculateTotalMetrics(mixedStructureDatasets, COMPARISON_FORMAT_ORDER)
const mixedTotalLines = generateTotalLines(totalToonTokensMixed, mixedTotals)

// Calculate totals for flat-only
const { totalToonTokens: totalToonTokensFlat, totals: flatTotals } = calculateTotalMetrics(flatOnlyDatasets, COMPARISON_FORMAT_ORDER)
const totalCSVTokensFlat = flatOnlyDatasets.reduce((sum, r) => {
const csv = r.formats.find(f => f.name === 'csv')
return sum + (csv?.tokens || 0)
}, 0)
const flatTotalLines = generateTotalLines(totalToonTokensFlat, flatTotals, { name: 'csv', tokens: totalCSVTokensFlat })

const barChartSection = `
## Mixed-Structure Track
Datasets with nested or semi-uniform structures. CSV excluded as it cannot properly represent these structures.
\`\`\`
${mixedCharts}
${SEPARATOR}
${mixedTotalLines}
\`\`\`
## Flat-Only Track
Datasets with flat tabular structures where CSV is applicable.
\`\`\`
${flatCharts}
${SEPARATOR}
${flatTotalLines}
\`\`\`
`.trim()
// Generate detailed examples (optional: show a few examples)
const detailedExamples = results
.filter(r => DETAILED_EXAMPLE_DATASETS.includes(r.dataset.name as any))
.map((result, i, filtered) => {
let displayData = result.dataset.data

// Truncate for display
if (result.dataset.name === 'github') {
displayData = {
repositories: displayData.repositories.slice(0, GITHUB_REPO_LIMIT).map((repo: Record<string, any>) => ({
...repo,
description: repo.description?.slice(0, GITHUB_DESC_LIMIT) + (repo.description?.length > GITHUB_DESC_LIMIT ? '…' : ''),
})),
}
}
else if (result.dataset.name === 'analytics') {
displayData = { metrics: displayData.metrics.slice(0, ANALYTICS_METRICS_LIMIT) }
}

const separator = i < filtered.length - 1 ? '\n\n---' : ''
const emoji = DATASET_ICONS[result.dataset.name] || DEFAULT_DATASET_ICON
const json = result.formats.find(f => f.name === 'json-pretty')!
const toon = result.formats.find(f => f.name === 'toon')!

return `#### ${emoji} ${result.dataset.description}
**Savings:** ${json.savings.toLocaleString('en-US')} tokens (${json.savingsPercent.toFixed(1)}% reduction vs JSON)
@@ -197,9 +323,7 @@ ${encode(displayData)}
.join('\n\n')

const markdown = `
${barChartSection}
<details>
<summary><strong>View detailed examples</strong></summary>
@@ -209,7 +333,7 @@ ${detailedExamples}
</details>
`.trimStart()

prompts.log.message(barChartSection)

const resultsDir = path.join(BENCHMARKS_DIR, 'results')
await ensureDir(resultsDir)

View File

@@ -8,7 +8,7 @@ export const BENCHMARKS_DIR: string = url.fileURLToPath(new URL('../', import.me
 * Model-specific RPM (requests per minute) limits to handle API quotas
 *
 * @remarks
 * Set `undefined` for models without specific limits.
 */
/// keep-sorted
export const MODEL_RPM_LIMITS: Record<string, number | undefined> = {
@@ -39,7 +39,7 @@ export const FORMATTER_DISPLAY_NAMES: Record<string, string> = {
 * Enable dry run mode for quick testing with limited AI requests
 *
 * @remarks
 * Set via environment variable: `DRY_RUN=true`.
 */
export const DRY_RUN: boolean = process.env.DRY_RUN === 'true'
@@ -123,4 +123,14 @@ export const QUESTION_LIMITS = {
aggregationBranches: 2,
filteringStarsAndForks: 8,
},
eventLogs: {
fieldRetrieval: 10,
aggregationEndpoints: 3,
filteringLevelAndStatus: 2,
filteringEndpointAndStatus: 2,
},
nestedConfig: {
fieldRetrieval: 5,
filteringComplex: 2,
},
} as const

View File

@@ -5,6 +5,67 @@ import githubRepos from '../data/github-repos.json' with { type: 'json' }
// Seed for reproducibility
faker.seed(12345)
/**
* Calculate the tabular eligibility percentage of a data structure
*
* @remarks
* Recursively analyzes data to determine what percentage of arrays qualify
* for TOON's tabular format (uniform objects with primitive values only).
*/
export function calculateTabularEligibility(data: unknown): number {
let totalArrays = 0
let tabularArrays = 0
function isTabularArray(arr: unknown[]): boolean {
if (arr.length === 0)
return false
// Check if all elements are objects
if (!arr.every(item => typeof item === 'object' && item !== null && !Array.isArray(item)))
return false
// Get keys from first object
const firstKeys = Object.keys(arr[0] as Record<string, unknown>)
if (firstKeys.length === 0)
return false
// Check if all objects have the same keys and only primitive values
return arr.every((item) => {
const itemObj = item as Record<string, unknown>
const itemKeys = Object.keys(itemObj)
if (itemKeys.length !== firstKeys.length)
return false
if (!firstKeys.every(key => itemKeys.includes(key)))
return false
// Check if all values are primitives (no nested objects or arrays)
return firstKeys.every((key) => {
const value = itemObj[key]
return value === null || ['string', 'number', 'boolean'].includes(typeof value)
})
})
}
function traverse(obj: unknown): void {
if (Array.isArray(obj)) {
totalArrays++
if (isTabularArray(obj))
tabularArrays++
// Continue traversing array elements
obj.forEach(item => traverse(item))
}
else if (typeof obj === 'object' && obj !== null) {
// Traverse object properties
Object.values(obj).forEach(value => traverse(value))
}
}
traverse(data)
return totalArrays === 0 ? 0 : Math.round((tabularArrays / totalArrays) * 100)
}
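// Illustrative example (not part of the module): one uniform array of
// primitive-valued objects plus one array whose objects contain a nested
// object gives 1 of 2 tabular arrays, i.e. 50% eligibility:
//
//   calculateTabularEligibility({
//     users: [{ id: 1, name: 'Ada' }, { id: 2, name: 'Bo' }],
//     logs: [{ level: 'error', error: { message: 'boom' } }],
//   }) // => 50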
/**
 * Employee record structure for tabular dataset
 */
@@ -73,6 +134,78 @@ export interface Repository {
pushedAt: string
}
/**
* Event log structure for semi-uniform dataset
*/
export interface EventLog {
timestamp: string
level: 'info' | 'warn' | 'error'
endpoint: string
statusCode: number
responseTime: number
userId: number
error?: {
message: string
stack: string
retryable: boolean
}
}
/**
* Nested configuration structure for deeply nested dataset
*/
export interface NestedConfig {
environment: string
version: string
database: {
host: string
port: number
name: string
pool: {
min: number
max: number
idleTimeout: number
}
replicas: {
host: string
port: number
priority: number
}[]
}
features: Record<string, {
enabled: boolean
rollout: number
variants: {
name: string
weight: number
config: Record<string, any>
}[]
}>
authentication: {
providers: {
name: string
clientId: string
scopes: string[]
config: Record<string, any>
}[]
session: {
secret: string
duration: number
refreshThreshold: number
}
}
permissions: {
roles: Record<string, {
permissions: string[]
inherits: string[]
}>
groups: Record<string, {
members: string[]
roles: string[]
}>
}
}
/**
 * Generate analytics time-series data
 */
@@ -108,17 +241,13 @@ export function generateAnalyticsData(days: number, startDate = '2025-01-01'): {
}
/**
 * Generate employee data (uniform tabular structure)
 */
const departments: readonly string[] = ['Engineering', 'Sales', 'Marketing', 'HR', 'Operations', 'Finance'] as const

function generateEmployees(count: number): { employees: Employee[] } {
return {
employees: Array.from({ length: count }, (_, i): Employee => {
const yearsExp = faker.number.int({ min: 1, max: 25 })
return {
id: i + 1,
@@ -130,72 +259,132 @@ const tabularDataset: Dataset = {
active: faker.datatype.boolean(0.8), // 80% active
}
}),
}
}
/**
* Tabular dataset: Uniform employee records
*
* @remarks
* Tests TOON's tabular array format.
*/
const tabularDataset: Dataset = {
name: 'tabular',
description: 'Uniform employee records (TOON optimal format)',
data: generateEmployees(100),
metadata: {
supportsCSV: true,
structureClass: 'uniform',
tabularEligibility: 100,
},
}
/**
 * Generate e-commerce orders (nested structure)
 */
const PRODUCT_NAMES = ['Wireless Mouse', 'USB Cable', 'Laptop Stand', 'Keyboard', 'Webcam', 'Headphones', 'Monitor', 'Desk Lamp'] as const
const ORDER_STATUSES = ['pending', 'processing', 'shipped', 'delivered', 'cancelled'] as const

const ORDER_CONSTANTS = {
CUSTOMER_ID_MOD: 20,
MIN_ITEMS: 1,
MAX_ITEMS: 4,
MIN_ITEM_PRICE: 9.99,
MAX_ITEM_PRICE: 199.99,
MIN_ITEM_QUANTITY: 1,
MAX_ITEM_QUANTITY: 5,
SKU_LENGTH: 6,
ORDER_ID_PADDING: 4,
RECENT_DAYS: 90,
TAX_RATE: 0.08,
} as const

function generateOrders(count: number): { orders: Order[] } {
return {
orders: Array.from({ length: count }, (_, i) => {
const customerId = (i % ORDER_CONSTANTS.CUSTOMER_ID_MOD) + 1
const itemCount = faker.number.int({ min: ORDER_CONSTANTS.MIN_ITEMS, max: ORDER_CONSTANTS.MAX_ITEMS })

const items = Array.from({ length: itemCount }, (_, j) => {
const price = faker.number.float({
min: ORDER_CONSTANTS.MIN_ITEM_PRICE,
max: ORDER_CONSTANTS.MAX_ITEM_PRICE,
fractionDigits: 2,
})
const quantity = faker.number.int({
min: ORDER_CONSTANTS.MIN_ITEM_QUANTITY,
max: ORDER_CONSTANTS.MAX_ITEM_QUANTITY,
})
return {
sku: `SKU-${faker.string.alphanumeric({ length: ORDER_CONSTANTS.SKU_LENGTH }).toUpperCase()}`,
name: PRODUCT_NAMES[j % PRODUCT_NAMES.length]!,
quantity,
price,
}
})

const subtotal = Number(items.reduce((sum, item) => sum + (item.price * item.quantity), 0).toFixed(2))
const tax = Number((subtotal * ORDER_CONSTANTS.TAX_RATE).toFixed(2))
const total = Number((subtotal + tax).toFixed(2))

return {
orderId: `ORD-${String(i + 1).padStart(ORDER_CONSTANTS.ORDER_ID_PADDING, '0')}`,
customer: {
id: customerId,
name: faker.person.fullName(),
email: faker.internet.email().toLowerCase(),
phone: faker.phone.number(),
},
items,
subtotal,
tax,
total,
status: ORDER_STATUSES[i % ORDER_STATUSES.length]!,
orderDate: faker.date.recent({ days: ORDER_CONSTANTS.RECENT_DAYS }).toISOString().split('T')[0],
}
}),
}
}
/**
* Nested dataset: E-commerce orders with nested structures
*
* @remarks
* Tests TOON's handling of complex nested objects.
*/
const nestedDataset: Dataset = {
name: 'nested',
description: 'E-commerce orders with nested structures',
data: generateOrders(50),
metadata: {
supportsCSV: false,
structureClass: 'nested',
tabularEligibility: 33, // orders array is not tabular, but items arrays within are
},
}
/**
 * Analytics dataset: Time-series metrics
 *
 * @remarks
 * Tests TOON's handling of numeric data and date fields.
 */
const analyticsDataset: Dataset = {
name: 'analytics',
description: 'Time-series analytics data',
data: generateAnalyticsData(60),
metadata: {
supportsCSV: true,
structureClass: 'uniform',
tabularEligibility: 100,
},
}
/**
 * Real-world dataset: Top 100 starred GitHub repositories
 *
 * @remarks
 * Tests TOON's tabular format with real data.
 */
const githubDataset: Dataset = {
name: 'github',
@@ -203,13 +392,18 @@ const githubDataset: Dataset = {
data: {
repositories: githubRepos,
},
metadata: {
supportsCSV: true,
structureClass: 'uniform',
tabularEligibility: 100,
},
}
/**
 * Generate a single e-commerce order with nested structure
 *
 * @remarks
 * Used for token efficiency benchmarks.
 */
export function generateOrderData(): Order {
return {
@@ -235,11 +429,257 @@ export function generateOrderData(): Order {
}
/**
 * Generate event logs (semi-uniform structure)
 *
 * @remarks
 * Approximately 50% of logs include nested error objects, 50% are flat.
 * This creates ~45% tabular eligibility.
 */
export function generateEventLogs(count: number): { logs: EventLog[] } {
const endpoints = ['/api/users', '/api/orders', '/api/products', '/api/auth', '/api/payments']
const levels = ['info', 'warn', 'error'] as const

return {
logs: Array.from({ length: count }, () => {
const level = faker.helpers.arrayElement(levels)
const hasError = level === 'error' || (level === 'warn' && faker.datatype.boolean(0.3))
const log: EventLog = {
timestamp: faker.date.recent({ days: 7 }).toISOString(),
level,
endpoint: faker.helpers.arrayElement(endpoints),
statusCode: hasError
? faker.number.int({ min: 400, max: 599 })
: faker.number.int({ min: 200, max: 299 }),
responseTime: faker.number.int({ min: 10, max: 5000 }),
userId: faker.number.int({ min: 1000, max: 9999 }),
}
if (hasError) {
log.error = {
message: faker.helpers.arrayElement([
'Database connection timeout',
'Invalid authentication token',
'Resource not found',
'Internal server error',
'Rate limit exceeded',
]),
stack: `Error: ${faker.lorem.sentence()}\n at ${faker.lorem.word()}\n at ${faker.lorem.word()}`,
retryable: faker.datatype.boolean(0.6),
}
}
return log
}),
}
}
/**
* Generate deeply nested configuration
*
* @remarks
* Creates a complex nested structure with minimal tabular eligibility (~0%).
*/
export function generateNestedConfig(): NestedConfig {
return {
environment: faker.helpers.arrayElement(['production', 'staging', 'development']),
version: faker.system.semver(),
database: {
host: faker.internet.domainName(),
port: 5432,
name: faker.database.type(),
pool: {
min: 2,
max: faker.number.int({ min: 10, max: 50 }),
idleTimeout: 30000,
},
replicas: Array.from({ length: 3 }, (_, i) => ({
host: `replica-${i + 1}.${faker.internet.domainName()}`,
port: 5432,
priority: i + 1,
})),
},
features: {
darkMode: {
enabled: faker.datatype.boolean(),
rollout: faker.number.int({ min: 0, max: 100 }),
variants: [
{
name: 'default',
weight: 70,
config: { theme: 'dark', animations: true },
},
{
name: 'minimal',
weight: 30,
config: { theme: 'dark', animations: false },
},
],
},
analytics: {
enabled: faker.datatype.boolean(),
rollout: faker.number.int({ min: 0, max: 100 }),
variants: [
{
name: 'full',
weight: 100,
config: { tracking: 'all', sampling: 1.0 },
},
],
},
},
authentication: {
providers: [
{
name: 'oauth2',
clientId: faker.string.uuid(),
scopes: ['read', 'write', 'admin'],
config: {
authUrl: faker.internet.url(),
tokenUrl: faker.internet.url(),
},
},
{
name: 'saml',
clientId: faker.string.uuid(),
scopes: ['read'],
config: {
entryPoint: faker.internet.url(),
cert: faker.string.alphanumeric({ length: 64 }),
},
},
],
session: {
secret: faker.string.alphanumeric({ length: 32 }),
duration: 86400,
refreshThreshold: 3600,
},
},
permissions: {
roles: {
admin: {
permissions: ['read', 'write', 'delete', 'manage_users', 'manage_roles'],
inherits: [],
},
editor: {
permissions: ['read', 'write'],
inherits: ['viewer'],
},
viewer: {
permissions: ['read'],
inherits: [],
},
},
groups: {
engineering: {
members: Array.from({ length: 5 }, () => faker.internet.email()),
roles: ['admin', 'editor'],
},
support: {
members: Array.from({ length: 3 }, () => faker.internet.email()),
roles: ['viewer'],
},
},
},
}
}
/**
* Event logs dataset: Semi-uniform structure
*
* @remarks
* Tests TOON with semi-uniform data (~50% flat, ~50% with nested errors).
*/
const eventLogsDataset: Dataset = {
name: 'event-logs',
description: 'Semi-uniform event logs',
data: generateEventLogs(75),
metadata: {
supportsCSV: false,
structureClass: 'semi-uniform',
tabularEligibility: 50, // ~50% of logs have nested error objects
},
}
/**
* Nested config dataset: Deeply nested structure
*
* @remarks
* Tests TOON's worst-case scenario with deeply nested configuration.
*/
const nestedConfigDataset: Dataset = {
name: 'nested-config',
description: 'Deeply nested configuration',
data: generateNestedConfig(),
metadata: {
supportsCSV: false,
structureClass: 'deep',
tabularEligibility: 0, // Highly nested, minimal tabular arrays
},
}
/**
* Datasets for accuracy benchmarks (smaller sizes for faster evaluation)
*/
export const ACCURACY_DATASETS: Dataset[] = [
tabularDataset, // 100 employees
nestedDataset, // 50 orders
analyticsDataset, // 60 days
githubDataset, // 100 repos
eventLogsDataset, // 75 logs
nestedConfigDataset, // 1 config
]
/**
* Datasets for token efficiency benchmarks (larger sizes to amplify token differences)
*/
export const TOKEN_EFFICIENCY_DATASETS: Dataset[] = [
// Tabular: 2000 employees
{
name: 'tabular',
description: 'Uniform employee records (TOON optimal format)',
data: generateEmployees(2000),
metadata: {
supportsCSV: true,
structureClass: 'uniform',
tabularEligibility: 100,
},
},
// Nested: 500 orders
{
name: 'nested',
description: 'E-commerce orders with nested structures',
data: generateOrders(500),
metadata: {
supportsCSV: false,
structureClass: 'nested',
tabularEligibility: 33,
},
},
// Analytics: 365 days
{
name: 'analytics',
description: 'Time-series analytics data',
data: generateAnalyticsData(365),
metadata: {
supportsCSV: true,
structureClass: 'uniform',
tabularEligibility: 100,
},
},
// GitHub: 100 repos (same as accuracy)
githubDataset,
// Event logs: 2000 logs
{
name: 'event-logs',
description: 'Semi-uniform event logs',
data: generateEventLogs(2000),
metadata: {
supportsCSV: false,
structureClass: 'semi-uniform',
tabularEligibility: 50,
},
},
// Nested config: 1 config (same as accuracy)
nestedConfigDataset,
]

View File

@@ -1,3 +1,4 @@
import type { Dataset } from './types'
import { stringify as stringifyCSV } from 'csv-stringify/sync'
import { XMLBuilder } from 'fast-xml-parser'
import { stringify as stringifyYAML } from 'yaml'
@@ -75,3 +76,14 @@ function toXML(data: unknown): string {
return builder.build(data)
}
/**
* Check if a dataset supports CSV format
*
* @remarks
* CSV is only suitable for flat tabular data. Datasets with nested structures
* should not be compared using CSV as it cannot properly represent the data.
*/
export function supportsCSV(dataset: Dataset): boolean {
return dataset.metadata.supportsCSV
}

View File

@@ -1,711 +0,0 @@
/**
* Question generation for TOON benchmarks
*
* Generates ~150-160 questions across different question types and datasets:
* - Field Retrieval: Direct field access with no computation
* Examples: "What is X's salary?", "What is the status of order Y?"
* - Aggregation: Counts, sums, averages, min/max operations (including single-condition filters)
* Examples: "How many X?", "What is the total/average?", "How many X > threshold?"
* - Filtering: Multi-condition queries requiring complex logical operations
* Examples: "How many X WHERE condition1 AND condition2?"
*/
import type { AnalyticsMetric, Employee, Order, Repository } from './datasets'
import type { Question } from './types'
import { QUESTION_LIMITS, QUESTION_THRESHOLDS } from './constants'
import { datasets } from './datasets'
/**
* Generate all questions from datasets
*/
export function generateQuestions(): Question[] {
const questions: Question[] = []
let idCounter = 1
// Get datasets with proper typing
const tabular = (datasets.find(d => d.name === 'tabular')?.data.employees as Employee[]) ?? []
const nested = (datasets.find(d => d.name === 'nested')?.data.orders as Order[]) ?? []
const analytics = (datasets.find(d => d.name === 'analytics')?.data.metrics as AnalyticsMetric[]) ?? []
const github = (datasets.find(d => d.name === 'github')?.data.repositories as Repository[]) ?? []
if (tabular.length > 0) {
// Field retrieval: specific employees
for (let i = 0; i < Math.min(QUESTION_LIMITS.tabular.fieldRetrieval, tabular.length); i++) {
const emp = tabular[i * 2] || tabular[i]
if (!emp)
continue
// Rotate through all field types
if (i % 5 === 0) {
questions.push({
id: `q${idCounter++}`,
prompt: `What is the salary of ${emp.name}?`,
groundTruth: String(emp.salary),
type: 'field-retrieval',
dataset: 'tabular',
})
}
else if (i % 5 === 1) {
questions.push({
id: `q${idCounter++}`,
prompt: `What department does ${emp.name} work in?`,
groundTruth: emp.department,
type: 'field-retrieval',
dataset: 'tabular',
})
}
else if (i % 5 === 2) {
questions.push({
id: `q${idCounter++}`,
prompt: `What is the email address of ${emp.name}?`,
groundTruth: emp.email,
type: 'field-retrieval',
dataset: 'tabular',
})
}
else if (i % 5 === 3) {
questions.push({
id: `q${idCounter++}`,
prompt: `How many years of experience does ${emp.name} have?`,
groundTruth: String(emp.yearsExperience),
type: 'field-retrieval',
dataset: 'tabular',
})
}
else {
questions.push({
id: `q${idCounter++}`,
prompt: `Is ${emp.name} an active employee?`,
groundTruth: emp.active ? 'yes' : 'no',
type: 'field-retrieval',
dataset: 'tabular',
})
}
}
// Aggregation: count by department
const departments = [...new Set(tabular.map(e => e.department))]
for (const dept of departments.slice(0, QUESTION_LIMITS.tabular.aggregationDepartments)) {
const count = tabular.filter(e => e.department === dept).length
questions.push({
id: `q${idCounter++}`,
prompt: `How many employees work in ${dept}?`,
groundTruth: String(count),
type: 'aggregation',
dataset: 'tabular',
})
}
// Aggregation: salary ranges (single-condition filters)
for (const threshold of QUESTION_THRESHOLDS.tabular.salaryRanges) {
const count = tabular.filter(e => e.salary > threshold).length
questions.push({
id: `q${idCounter++}`,
prompt: `How many employees have a salary greater than ${threshold}?`,
groundTruth: String(count),
type: 'aggregation',
dataset: 'tabular',
})
}
// Aggregation: totals and averages
const totalEmployees = tabular.length
const avgSalary = Math.round(tabular.reduce((sum, e) => sum + e.salary, 0) / totalEmployees)
const activeCount = tabular.filter(e => e.active).length
const inactiveCount = tabular.filter(e => !e.active).length
questions.push(
{
id: `q${idCounter++}`,
prompt: 'How many employees are in the dataset?',
groundTruth: String(totalEmployees),
type: 'aggregation',
dataset: 'tabular',
},
{
id: `q${idCounter++}`,
prompt: 'What is the average salary across all employees?',
groundTruth: String(avgSalary),
type: 'aggregation',
dataset: 'tabular',
},
{
id: `q${idCounter++}`,
prompt: 'How many employees are active?',
groundTruth: String(activeCount),
type: 'aggregation',
dataset: 'tabular',
},
{
id: `q${idCounter++}`,
prompt: 'How many employees are inactive?',
groundTruth: String(inactiveCount),
type: 'aggregation',
dataset: 'tabular',
},
)
// Filtering: count by department with salary filter (multi-condition)
for (const dept of departments.slice(0, QUESTION_LIMITS.tabular.filteringMultiConditionDepartments)) {
const count = tabular.filter(e => e.department === dept && e.salary > QUESTION_THRESHOLDS.tabular.departmentSalaryThreshold).length
questions.push({
id: `q${idCounter++}`,
prompt: `How many employees in ${dept} have a salary greater than ${QUESTION_THRESHOLDS.tabular.departmentSalaryThreshold}?`,
groundTruth: String(count),
type: 'filtering',
dataset: 'tabular',
})
}
// Filtering: active employees by experience (multi-condition)
for (const exp of QUESTION_THRESHOLDS.tabular.experienceYears.slice(0, QUESTION_LIMITS.tabular.filteringExperience)) {
const count = tabular.filter(e => e.yearsExperience > exp && e.active).length
questions.push({
id: `q${idCounter++}`,
prompt: `How many active employees have more than ${exp} years of experience?`,
groundTruth: String(count),
type: 'filtering',
dataset: 'tabular',
})
}
// Filtering: department by experience (multi-condition)
for (const dept of departments.slice(0, QUESTION_LIMITS.tabular.filteringDepartmentExp)) {
const count = tabular.filter(e => e.department === dept && e.yearsExperience > QUESTION_THRESHOLDS.tabular.departmentExperienceThreshold).length
questions.push({
id: `q${idCounter++}`,
prompt: `How many employees in ${dept} have more than ${QUESTION_THRESHOLDS.tabular.departmentExperienceThreshold} years of experience?`,
groundTruth: String(count),
type: 'filtering',
dataset: 'tabular',
})
}
// Filtering: department by active status (multi-condition)
for (const dept of departments.slice(0, QUESTION_LIMITS.tabular.filteringDepartmentActive)) {
const count = tabular.filter(e => e.department === dept && e.active).length
questions.push({
id: `q${idCounter++}`,
prompt: `How many active employees work in ${dept}?`,
groundTruth: String(count),
type: 'filtering',
dataset: 'tabular',
})
}
}
if (nested.length > 0) {
// Field retrieval: order totals and statuses
for (let i = 0; i < Math.min(QUESTION_LIMITS.nested.fieldRetrievalOrders, nested.length); i++) {
const order = nested[i * 2] || nested[i]
if (!order)
continue
if (i % 2 === 0) {
questions.push({
id: `q${idCounter++}`,
prompt: `What is the total for order ${order.orderId}?`,
groundTruth: String(order.total),
type: 'field-retrieval',
dataset: 'nested',
})
}
else {
questions.push({
id: `q${idCounter++}`,
prompt: `What is the status of order ${order.orderId}?`,
groundTruth: order.status,
type: 'field-retrieval',
dataset: 'nested',
})
}
}
// Field retrieval: customer info and order dates (expanded)
for (let i = 0; i < Math.min(QUESTION_LIMITS.nested.fieldRetrievalCustomers, nested.length); i++) {
const order = nested[i * 2 + 1] || nested[i]
if (!order)
continue
if (i % 4 === 0) {
questions.push({
id: `q${idCounter++}`,
prompt: `What is the customer name for order ${order.orderId}?`,
groundTruth: order.customer.name,
type: 'field-retrieval',
dataset: 'nested',
})
}
else if (i % 4 === 1) {
questions.push({
id: `q${idCounter++}`,
prompt: `What is the customer email for order ${order.orderId}?`,
groundTruth: order.customer.email,
type: 'field-retrieval',
dataset: 'nested',
})
}
else if (i % 4 === 2) {
questions.push({
id: `q${idCounter++}`,
prompt: `What is the order date for order ${order.orderId}?`,
groundTruth: order.orderDate || '',
type: 'field-retrieval',
dataset: 'nested',
})
}
else {
questions.push({
id: `q${idCounter++}`,
prompt: `How many items are in order ${order.orderId}?`,
groundTruth: String(order.items.length),
type: 'field-retrieval',
dataset: 'nested',
})
}
}
// Aggregation: totals and averages
const totalRevenue = nested.reduce((sum, o) => sum + o.total, 0)
const avgOrderValue = totalRevenue / nested.length
const totalOrders = nested.length
const maxOrderValue = Math.max(...nested.map(o => o.total))
// Count by status
const statuses = [...new Set(nested.map(o => o.status))]
for (const status of statuses.slice(0, QUESTION_LIMITS.nested.aggregationStatuses)) {
const count = nested.filter(o => o.status === status).length
questions.push({
id: `q${idCounter++}`,
prompt: `How many orders have status "${status}"?`,
groundTruth: String(count),
type: 'aggregation',
dataset: 'nested',
})
}
questions.push(
{
id: `q${idCounter++}`,
prompt: 'What is the total revenue across all orders?',
groundTruth: String(totalRevenue.toFixed(2)),
type: 'aggregation',
dataset: 'nested',
},
{
id: `q${idCounter++}`,
prompt: 'What is the average order value?',
groundTruth: String(avgOrderValue.toFixed(2)),
type: 'aggregation',
dataset: 'nested',
},
{
id: `q${idCounter++}`,
prompt: 'How many orders are in the dataset?',
groundTruth: String(totalOrders),
type: 'aggregation',
dataset: 'nested',
},
{
id: `q${idCounter++}`,
prompt: 'What is the highest order total?',
groundTruth: String(maxOrderValue.toFixed(2)),
type: 'aggregation',
dataset: 'nested',
},
)
// Aggregation: high-value orders (single-condition filter)
for (const threshold of QUESTION_THRESHOLDS.nested.highValueOrders) {
const count = nested.filter(o => o.total > threshold).length
questions.push({
id: `q${idCounter++}`,
prompt: `How many orders have a total greater than ${threshold}?`,
groundTruth: String(count),
type: 'aggregation',
dataset: 'nested',
})
}
// Filtering: multi-condition queries (status AND value)
const orderStatuses = [...new Set(nested.map(o => o.status))]
for (const status of orderStatuses.slice(0, QUESTION_LIMITS.nested.filteringStatusAndValue)) {
const count = nested.filter(o => o.status === status && o.total > QUESTION_THRESHOLDS.nested.statusValueThreshold).length
questions.push({
id: `q${idCounter++}`,
prompt: `How many orders have status "${status}" and total greater than ${QUESTION_THRESHOLDS.nested.statusValueThreshold}?`,
groundTruth: String(count),
type: 'filtering',
dataset: 'nested',
})
}
// Filtering: status AND items count (multi-condition)
for (const status of orderStatuses.slice(0, QUESTION_LIMITS.nested.filteringStatusAndItems)) {
const count = nested.filter(o => o.status === status && o.items.length >= QUESTION_THRESHOLDS.nested.itemCountThreshold).length
questions.push({
id: `q${idCounter++}`,
prompt: `How many orders have status "${status}" and at least ${QUESTION_THRESHOLDS.nested.itemCountThreshold} items?`,
groundTruth: String(count),
type: 'filtering',
dataset: 'nested',
})
}
// Filtering: total AND items count (multi-condition)
for (const threshold of QUESTION_THRESHOLDS.nested.totalThresholdsForItems) {
const count = nested.filter(o => o.total > threshold && o.items.length >= QUESTION_THRESHOLDS.nested.itemCountThreshold).length
questions.push({
id: `q${idCounter++}`,
prompt: `How many orders have a total greater than ${threshold} and at least ${QUESTION_THRESHOLDS.nested.itemCountThreshold} items?`,
groundTruth: String(count),
type: 'filtering',
dataset: 'nested',
})
}
}
if (analytics.length > 0) {
// Field retrieval: specific dates (expanded with all metrics)
for (let i = 0; i < Math.min(QUESTION_LIMITS.analytics.fieldRetrievalDates, analytics.length); i++) {
const metric = analytics[i * 3] || analytics[i]
if (!metric)
continue
if (i % 5 === 0) {
questions.push({
id: `q${idCounter++}`,
prompt: `How many views were recorded on ${metric.date}?`,
groundTruth: String(metric.views),
type: 'field-retrieval',
dataset: 'analytics',
})
}
else if (i % 5 === 1) {
questions.push({
id: `q${idCounter++}`,
prompt: `What was the revenue on ${metric.date}?`,
groundTruth: String(metric.revenue),
type: 'field-retrieval',
dataset: 'analytics',
})
}
else if (i % 5 === 2) {
questions.push({
id: `q${idCounter++}`,
prompt: `What was the conversion count on ${metric.date}?`,
groundTruth: String(metric.conversions),
type: 'field-retrieval',
dataset: 'analytics',
})
}
else if (i % 5 === 3) {
questions.push({
id: `q${idCounter++}`,
prompt: `How many clicks were recorded on ${metric.date}?`,
groundTruth: String(metric.clicks),
type: 'field-retrieval',
dataset: 'analytics',
})
}
else {
questions.push({
id: `q${idCounter++}`,
prompt: `What was the bounce rate on ${metric.date}?`,
groundTruth: String(metric.bounceRate),
type: 'field-retrieval',
dataset: 'analytics',
})
}
}
// Aggregation: totals and averages
const totalViews = analytics.reduce((sum, m) => sum + m.views, 0)
const totalRevenue = analytics.reduce((sum, m) => sum + m.revenue, 0)
const totalConversions = analytics.reduce((sum, m) => sum + m.conversions, 0)
const avgViews = Math.round(totalViews / analytics.length)
const avgRevenue = totalRevenue / analytics.length
const avgConversions = Math.round(totalConversions / analytics.length)
questions.push(
{
id: `q${idCounter++}`,
prompt: 'What is the total number of views across all dates?',
groundTruth: String(totalViews),
type: 'aggregation',
dataset: 'analytics',
},
{
id: `q${idCounter++}`,
prompt: 'What is the total revenue across all dates?',
groundTruth: String(totalRevenue.toFixed(2)),
type: 'aggregation',
dataset: 'analytics',
},
{
id: `q${idCounter++}`,
prompt: 'What is the total number of conversions across all dates?',
groundTruth: String(totalConversions),
type: 'aggregation',
dataset: 'analytics',
},
{
id: `q${idCounter++}`,
prompt: 'What is the average number of views per day?',
groundTruth: String(avgViews),
type: 'aggregation',
dataset: 'analytics',
},
{
id: `q${idCounter++}`,
prompt: 'What is the average revenue per day?',
groundTruth: String(avgRevenue.toFixed(2)),
type: 'aggregation',
dataset: 'analytics',
},
{
id: `q${idCounter++}`,
prompt: 'What is the average number of conversions per day?',
groundTruth: String(avgConversions),
type: 'aggregation',
dataset: 'analytics',
},
{
id: `q${idCounter++}`,
prompt: 'How many days are included in the analytics data?',
groundTruth: String(analytics.length),
type: 'aggregation',
dataset: 'analytics',
},
{
id: `q${idCounter++}`,
prompt: 'What is the highest number of views recorded in a single day?',
groundTruth: String(Math.max(...analytics.map(m => m.views))),
type: 'aggregation',
dataset: 'analytics',
},
)
// Aggregation: high-performing days (single-condition filters)
for (const threshold of QUESTION_THRESHOLDS.analytics.views) {
const count = analytics.filter(m => m.views > threshold).length
questions.push({
id: `q${idCounter++}`,
prompt: `How many days had more than ${threshold} views?`,
groundTruth: String(count),
type: 'aggregation',
dataset: 'analytics',
})
}
// Filtering: multi-condition queries (views AND conversions)
for (const viewThreshold of QUESTION_THRESHOLDS.analytics.viewsForFiltering) {
const count = analytics.filter(m => m.views > viewThreshold && m.conversions > QUESTION_THRESHOLDS.analytics.conversionsForFiltering).length
questions.push({
id: `q${idCounter++}`,
prompt: `How many days had more than ${viewThreshold} views and more than ${QUESTION_THRESHOLDS.analytics.conversionsForFiltering} conversions?`,
groundTruth: String(count),
type: 'filtering',
dataset: 'analytics',
})
}
// Filtering: views AND revenue (expanded)
for (const revenueThreshold of QUESTION_THRESHOLDS.analytics.revenueThresholds.slice(0, 5)) {
const count = analytics.filter(m => m.views > QUESTION_THRESHOLDS.analytics.viewsThresholdForRevenue && m.revenue > revenueThreshold).length
questions.push({
id: `q${idCounter++}`,
prompt: `How many days had more than ${QUESTION_THRESHOLDS.analytics.viewsThresholdForRevenue} views and revenue greater than ${revenueThreshold}?`,
groundTruth: String(count),
type: 'filtering',
dataset: 'analytics',
})
}
// Filtering: clicks AND conversions (multi-condition)
for (const clickThreshold of QUESTION_THRESHOLDS.analytics.clicksForFiltering) {
const count = analytics.filter(m => m.clicks > clickThreshold && m.conversions > QUESTION_THRESHOLDS.analytics.conversionsForClickFiltering).length
questions.push({
id: `q${idCounter++}`,
prompt: `How many days had more than ${clickThreshold} clicks and more than ${QUESTION_THRESHOLDS.analytics.conversionsForClickFiltering} conversions?`,
groundTruth: String(count),
type: 'filtering',
dataset: 'analytics',
})
}
// Filtering: revenue AND bounce rate (multi-condition)
for (const revenueThreshold of QUESTION_THRESHOLDS.analytics.revenueForBounceRate) {
const count = analytics.filter(m => m.revenue > revenueThreshold && m.bounceRate < QUESTION_THRESHOLDS.analytics.bounceRateThreshold).length
questions.push({
id: `q${idCounter++}`,
prompt: `How many days had revenue greater than ${revenueThreshold} and bounce rate less than ${QUESTION_THRESHOLDS.analytics.bounceRateThreshold}?`,
groundTruth: String(count),
type: 'filtering',
dataset: 'analytics',
})
}
}
if (github.length > 0) {
// Helper to extract owner from repo field
const getOwner = (repoFullName: string) => repoFullName.split('/')[0]!
// Field retrieval: specific repos (diverse fields)
for (let i = 0; i < Math.min(QUESTION_LIMITS.github.fieldRetrievalRepos, github.length); i++) {
const repo = github[i * 7]
if (!repo)
continue
if (i % 5 === 0) {
questions.push({
id: `q${idCounter++}`,
prompt: `How many stars does ${repo.repo} have?`,
groundTruth: String(repo.stars),
type: 'field-retrieval',
dataset: 'github',
})
}
else if (i % 5 === 1) {
questions.push({
id: `q${idCounter++}`,
prompt: `How many forks does ${repo.repo} have?`,
groundTruth: String(repo.forks),
type: 'field-retrieval',
dataset: 'github',
})
}
else if (i % 5 === 2) {
questions.push({
id: `q${idCounter++}`,
prompt: `Who is the owner of ${repo.repo}?`,
groundTruth: getOwner(repo.repo),
type: 'field-retrieval',
dataset: 'github',
})
}
else if (i % 5 === 3) {
questions.push({
id: `q${idCounter++}`,
prompt: `What is the default branch of ${repo.repo}?`,
groundTruth: repo.defaultBranch,
type: 'field-retrieval',
dataset: 'github',
})
}
else {
questions.push({
id: `q${idCounter++}`,
prompt: `How many watchers does ${repo.repo} have?`,
groundTruth: String(repo.watchers),
type: 'field-retrieval',
dataset: 'github',
})
}
}
// Aggregation: popular repositories
const totalStars = github.reduce((sum, r) => sum + r.stars, 0)
const totalRepos = github.length
const avgStars = Math.round(totalStars / totalRepos)
questions.push(
{
id: `q${idCounter++}`,
prompt: 'What is the total number of stars across all repositories?',
groundTruth: String(totalStars),
type: 'aggregation',
dataset: 'github',
},
{
id: `q${idCounter++}`,
prompt: 'How many repositories are in the dataset?',
groundTruth: String(totalRepos),
type: 'aggregation',
dataset: 'github',
},
{
id: `q${idCounter++}`,
prompt: 'What is the average number of stars per repository?',
groundTruth: String(avgStars),
type: 'aggregation',
dataset: 'github',
},
)
// Aggregation: star thresholds (single-condition filters)
for (const threshold of QUESTION_THRESHOLDS.github.stars) {
const count = github.filter(r => r.stars > threshold).length
questions.push({
id: `q${idCounter++}`,
prompt: `How many repositories have more than ${threshold} stars?`,
groundTruth: String(count),
type: 'aggregation',
dataset: 'github',
})
}
// Aggregation: fork thresholds (single-condition filters)
for (const threshold of QUESTION_THRESHOLDS.github.forks) {
const count = github.filter(r => r.forks > threshold).length
questions.push({
id: `q${idCounter++}`,
prompt: `How many repositories have more than ${threshold} forks?`,
groundTruth: String(count),
type: 'aggregation',
dataset: 'github',
})
}
// Aggregation: watcher thresholds (single-condition filters)
for (const threshold of QUESTION_THRESHOLDS.github.watchers) {
const count = github.filter(r => r.watchers > threshold).length
questions.push({
id: `q${idCounter++}`,
prompt: `How many repositories have more than ${threshold} watchers?`,
groundTruth: String(count),
type: 'aggregation',
dataset: 'github',
})
}
// Aggregation: default branch counts
const branches = [...new Set(github.map(r => r.defaultBranch))]
for (const branch of branches.slice(0, QUESTION_LIMITS.github.aggregationBranches)) {
const count = github.filter(r => r.defaultBranch === branch).length
questions.push({
id: `q${idCounter++}`,
prompt: `How many repositories use "${branch}" as their default branch?`,
groundTruth: String(count),
type: 'aggregation',
dataset: 'github',
})
}
// Filtering: multi-condition queries (stars AND forks)
for (const combo of QUESTION_THRESHOLDS.github.starForkCombinations.slice(0, QUESTION_LIMITS.github.filteringStarsAndForks)) {
const count = github.filter(r => r.stars > combo.stars && r.forks > combo.forks).length
questions.push({
id: `q${idCounter++}`,
prompt: `How many repositories have more than ${combo.stars} stars and more than ${combo.forks} forks?`,
groundTruth: String(count),
type: 'filtering',
dataset: 'github',
})
}
// Filtering: stars AND watchers (multi-condition)
for (const combo of QUESTION_THRESHOLDS.github.starWatcherCombinations) {
const count = github.filter(r => r.stars > combo.stars && r.watchers > combo.watchers).length
questions.push({
id: `q${idCounter++}`,
prompt: `How many repositories have more than ${combo.stars} stars and more than ${combo.watchers} watchers?`,
groundTruth: String(count),
type: 'filtering',
dataset: 'github',
})
}
}
return questions
}

View File

@@ -0,0 +1,196 @@
import type { AnalyticsMetric } from '../datasets'
import type { Question } from '../types'
import { QUESTION_LIMITS, QUESTION_THRESHOLDS } from '../constants'
import { countByPredicate, QuestionBuilder, rotateQuestions, SAMPLE_STRIDES } from './utils'
/**
* Generate analytics (website metrics) questions
*/
export function generateAnalyticsQuestions(metrics: AnalyticsMetric[], getId: () => string): Question[] {
const questions: Question[] = []
if (metrics.length === 0)
return questions
// Field retrieval: date-based metrics
const metricFieldGenerators: Array<(metric: AnalyticsMetric, getId: () => string) => Question> = [
(metric, getId) => new QuestionBuilder()
.id(getId())
.prompt(`What are the views for ${metric.date}?`)
.groundTruth(String(metric.views))
.type('field-retrieval')
.dataset('analytics')
.build(),
(metric, getId) => new QuestionBuilder()
.id(getId())
.prompt(`What is the revenue for ${metric.date}?`)
.groundTruth(String(metric.revenue))
.type('field-retrieval')
.dataset('analytics')
.build(),
(metric, getId) => new QuestionBuilder()
.id(getId())
.prompt(`What is the bounce rate for ${metric.date}?`)
.groundTruth(String(metric.bounceRate))
.type('field-retrieval')
.dataset('analytics')
.build(),
(metric, getId) => new QuestionBuilder()
.id(getId())
.prompt(`How many conversions were there on ${metric.date}?`)
.groundTruth(String(metric.conversions))
.type('field-retrieval')
.dataset('analytics')
.build(),
]
questions.push(...rotateQuestions(
metrics,
metricFieldGenerators,
QUESTION_LIMITS.analytics.fieldRetrievalDates,
SAMPLE_STRIDES.ANALYTICS_FIELD,
getId,
))
// Aggregation: basic statistics
const totalDays = metrics.length
const totalViews = metrics.reduce((sum, m) => sum + m.views, 0)
const totalConversions = metrics.reduce((sum, m) => sum + m.conversions, 0)
const totalRevenue = metrics.reduce((sum, m) => sum + m.revenue, 0)
const avgBounceRate = metrics.reduce((sum, m) => sum + m.bounceRate, 0) / metrics.length
questions.push(
new QuestionBuilder()
.id(getId())
.prompt('How many days of data are in the dataset?')
.groundTruth(String(totalDays))
.type('aggregation')
.dataset('analytics')
.build(),
new QuestionBuilder()
.id(getId())
.prompt('What is the total number of views across all dates?')
.groundTruth(String(totalViews))
.type('aggregation')
.dataset('analytics')
.build(),
new QuestionBuilder()
.id(getId())
.prompt('What is the total number of conversions across all dates?')
.groundTruth(String(totalConversions))
.type('aggregation')
.dataset('analytics')
.build(),
new QuestionBuilder()
.id(getId())
.prompt('What is the total revenue across all dates?')
.groundTruth(String(totalRevenue.toFixed(2)))
.type('aggregation')
.dataset('analytics')
.build(),
new QuestionBuilder()
.id(getId())
.prompt('What is the average bounce rate?')
.groundTruth(String(avgBounceRate.toFixed(2)))
.type('aggregation')
.dataset('analytics')
.build(),
)
// Aggregation: high views/conversions
for (const threshold of QUESTION_THRESHOLDS.analytics.views) {
const count = countByPredicate(metrics, m => m.views > threshold)
questions.push(
new QuestionBuilder()
.id(getId())
.prompt(`How many days had more than ${threshold} views?`)
.groundTruth(String(count))
.type('aggregation')
.dataset('analytics')
.build(),
)
}
for (const threshold of QUESTION_THRESHOLDS.analytics.conversions) {
const count = countByPredicate(metrics, m => m.conversions > threshold)
questions.push(
new QuestionBuilder()
.id(getId())
.prompt(`How many days had more than ${threshold} conversions?`)
.groundTruth(String(count))
.type('aggregation')
.dataset('analytics')
.build(),
)
}
// Filtering: multi-condition (views AND revenue)
for (const threshold of QUESTION_THRESHOLDS.analytics.viewsForFiltering) {
const count = countByPredicate(
metrics,
m => m.views > threshold && m.conversions > QUESTION_THRESHOLDS.analytics.conversionsForFiltering,
)
questions.push(
new QuestionBuilder()
.id(getId())
.prompt(`How many days had more than ${threshold} views and more than ${QUESTION_THRESHOLDS.analytics.conversionsForFiltering} conversions?`)
.groundTruth(String(count))
.type('filtering')
.dataset('analytics')
.build(),
)
}
// Filtering: revenue thresholds
for (const threshold of QUESTION_THRESHOLDS.analytics.revenueThresholds) {
const count = countByPredicate(
metrics,
m => m.revenue > threshold && m.views > QUESTION_THRESHOLDS.analytics.viewsThresholdForRevenue,
)
questions.push(
new QuestionBuilder()
.id(getId())
.prompt(`How many days had revenue greater than ${threshold} with views above ${QUESTION_THRESHOLDS.analytics.viewsThresholdForRevenue}?`)
.groundTruth(String(count))
.type('filtering')
.dataset('analytics')
.build(),
)
}
// Filtering: clicks and conversions
for (const threshold of QUESTION_THRESHOLDS.analytics.clicksForFiltering) {
const count = countByPredicate(
metrics,
m => m.clicks > threshold && m.conversions > QUESTION_THRESHOLDS.analytics.conversionsForClickFiltering,
)
questions.push(
new QuestionBuilder()
.id(getId())
.prompt(`How many days had more than ${threshold} clicks and more than ${QUESTION_THRESHOLDS.analytics.conversionsForClickFiltering} conversions?`)
.groundTruth(String(count))
.type('filtering')
.dataset('analytics')
.build(),
)
}
// Filtering: revenue and bounce rate
for (const threshold of QUESTION_THRESHOLDS.analytics.revenueForBounceRate) {
const count = countByPredicate(
metrics,
m => m.revenue > threshold && m.bounceRate < QUESTION_THRESHOLDS.analytics.bounceRateThreshold,
)
questions.push(
new QuestionBuilder()
.id(getId())
.prompt(`How many days had revenue greater than ${threshold} with bounce rate below ${QUESTION_THRESHOLDS.analytics.bounceRateThreshold}?`)
.groundTruth(String(count))
.type('filtering')
.dataset('analytics')
.build(),
)
}
return questions
}

View File

@@ -0,0 +1,162 @@
import type { EventLog } from '../datasets'
import type { Question } from '../types'
import { QUESTION_LIMITS } from '../constants'
import { countByPredicate, QuestionBuilder, rotateQuestions, SAMPLE_STRIDES } from './utils'
/**
* Generate event log questions
*/
export function generateEventLogsQuestions(logs: EventLog[], getId: () => string): Question[] {
const questions: Question[] = []
if (logs.length === 0)
return questions
// Field retrieval: log metadata
const logFieldGenerators: Array<(log: EventLog, getId: () => string) => Question> = [
(log, getId) => new QuestionBuilder()
.id(getId())
.prompt(`What is the level of the log at ${log.timestamp}?`)
.groundTruth(log.level)
.type('field-retrieval')
.dataset('event-logs')
.build(),
(log, getId) => new QuestionBuilder()
.id(getId())
.prompt(`What is the endpoint for the log at ${log.timestamp}?`)
.groundTruth(log.endpoint)
.type('field-retrieval')
.dataset('event-logs')
.build(),
(log, getId) => new QuestionBuilder()
.id(getId())
.prompt(`What is the status code for the log at ${log.timestamp}?`)
.groundTruth(String(log.statusCode))
.type('field-retrieval')
.dataset('event-logs')
.build(),
(log, getId) => new QuestionBuilder()
.id(getId())
.prompt(`What is the response time for the log at ${log.timestamp}?`)
.groundTruth(String(log.responseTime))
.type('field-retrieval')
.dataset('event-logs')
.build(),
]
questions.push(...rotateQuestions(
logs,
logFieldGenerators,
QUESTION_LIMITS.eventLogs.fieldRetrieval,
SAMPLE_STRIDES.EVENT_LOG_FIELD,
getId,
))
// Aggregation: basic statistics
const totalLogs = logs.length
const avgResponseTime = logs.reduce((sum, l) => sum + l.responseTime, 0) / logs.length
questions.push(
new QuestionBuilder()
.id(getId())
.prompt('How many log entries are in the dataset?')
.groundTruth(String(totalLogs))
.type('aggregation')
.dataset('event-logs')
.build(),
new QuestionBuilder()
.id(getId())
.prompt('What is the average response time across all logs?')
.groundTruth(String(avgResponseTime.toFixed(2)))
.type('aggregation')
.dataset('event-logs')
.build(),
)
// Aggregation: by level
const levels = [...new Set(logs.map(l => l.level))]
for (const level of levels) {
const count = countByPredicate(logs, l => l.level === level)
questions.push(
new QuestionBuilder()
.id(getId())
.prompt(`How many log entries have level "${level}"?`)
.groundTruth(String(count))
.type('aggregation')
.dataset('event-logs')
.build(),
)
}
// Aggregation: by endpoint
const endpoints = [...new Set(logs.map(l => l.endpoint))]
for (const endpoint of endpoints.slice(0, QUESTION_LIMITS.eventLogs.aggregationEndpoints)) {
const count = countByPredicate(logs, l => l.endpoint === endpoint)
questions.push(
new QuestionBuilder()
.id(getId())
.prompt(`How many log entries are for endpoint "${endpoint}"?`)
.groundTruth(String(count))
.type('aggregation')
.dataset('event-logs')
.build(),
)
}
// Aggregation: by status code range
const errorCount = countByPredicate(logs, l => l.statusCode >= 400)
const successCount = countByPredicate(logs, l => l.statusCode >= 200 && l.statusCode < 300)
questions.push(
new QuestionBuilder()
.id(getId())
.prompt('How many log entries have a status code indicating an error (>= 400)?')
.groundTruth(String(errorCount))
.type('aggregation')
.dataset('event-logs')
.build(),
new QuestionBuilder()
.id(getId())
.prompt('How many log entries have a successful status code (200-299)?')
.groundTruth(String(successCount))
.type('aggregation')
.dataset('event-logs')
.build(),
)
// Filtering: multi-condition (level AND status)
for (const level of levels.slice(0, QUESTION_LIMITS.eventLogs.filteringLevelAndStatus)) {
const count = countByPredicate(
logs,
l => l.level === level && l.statusCode >= 400,
)
questions.push(
new QuestionBuilder()
.id(getId())
.prompt(`How many log entries have level "${level}" and status code >= 400?`)
.groundTruth(String(count))
.type('filtering')
.dataset('event-logs')
.build(),
)
}
// Filtering: endpoint AND status
for (const endpoint of endpoints.slice(0, QUESTION_LIMITS.eventLogs.filteringEndpointAndStatus)) {
const count = countByPredicate(
logs,
l => l.endpoint === endpoint && l.statusCode >= 500,
)
questions.push(
new QuestionBuilder()
.id(getId())
.prompt(`How many log entries are for endpoint "${endpoint}" with status code >= 500?`)
.groundTruth(String(count))
.type('filtering')
.dataset('event-logs')
.build(),
)
}
return questions
}

View File

@@ -0,0 +1,184 @@
import type { Repository } from '../datasets'
import type { Question } from '../types'
import { QUESTION_LIMITS, QUESTION_THRESHOLDS } from '../constants'
import { countByPredicate, QuestionBuilder, rotateQuestions, SAMPLE_STRIDES } from './utils'
/**
* Generate GitHub repository questions
*/
export function generateGithubQuestions(repos: Repository[], getId: () => string): Question[] {
const questions: Question[] = []
if (repos.length === 0)
return questions
// Field retrieval: repository metadata
const repoFieldGenerators: Array<(repo: Repository, getId: () => string) => Question> = [
(repo, getId) => new QuestionBuilder()
.id(getId())
.prompt(`How many stars does ${repo.owner}/${repo.name} have?`)
.groundTruth(String(repo.stars))
.type('field-retrieval')
.dataset('github')
.build(),
(repo, getId) => new QuestionBuilder()
.id(getId())
.prompt(`How many forks does ${repo.owner}/${repo.name} have?`)
.groundTruth(String(repo.forks))
.type('field-retrieval')
.dataset('github')
.build(),
(repo, getId) => new QuestionBuilder()
.id(getId())
.prompt(`How many watchers does ${repo.owner}/${repo.name} have?`)
.groundTruth(String(repo.watchers))
.type('field-retrieval')
.dataset('github')
.build(),
(repo, getId) => new QuestionBuilder()
.id(getId())
.prompt(`What is the default branch of ${repo.owner}/${repo.name}?`)
.groundTruth(repo.defaultBranch)
.type('field-retrieval')
.dataset('github')
.build(),
]
questions.push(...rotateQuestions(
repos,
repoFieldGenerators,
QUESTION_LIMITS.github.fieldRetrievalRepos,
SAMPLE_STRIDES.REPO_FIELD,
getId,
))
// Aggregation: basic statistics
const totalRepos = repos.length
const totalStars = repos.reduce((sum, r) => sum + r.stars, 0)
const totalForks = repos.reduce((sum, r) => sum + r.forks, 0)
const avgStars = totalStars / totalRepos
questions.push(
new QuestionBuilder()
.id(getId())
.prompt('How many repositories are in the dataset?')
.groundTruth(String(totalRepos))
.type('aggregation')
.dataset('github')
.build(),
new QuestionBuilder()
.id(getId())
.prompt('What is the total number of stars across all repositories?')
.groundTruth(String(totalStars))
.type('aggregation')
.dataset('github')
.build(),
new QuestionBuilder()
.id(getId())
.prompt('What is the total number of forks across all repositories?')
.groundTruth(String(totalForks))
.type('aggregation')
.dataset('github')
.build(),
new QuestionBuilder()
.id(getId())
.prompt('What is the average number of stars per repository?')
.groundTruth(String(Math.round(avgStars)))
.type('aggregation')
.dataset('github')
.build(),
)
// Aggregation: by default branch
const branches = [...new Set(repos.map(r => r.defaultBranch))]
for (const branch of branches.slice(0, QUESTION_LIMITS.github.aggregationBranches)) {
const count = countByPredicate(repos, r => r.defaultBranch === branch)
questions.push(
new QuestionBuilder()
.id(getId())
.prompt(`How many repositories use "${branch}" as their default branch?`)
.groundTruth(String(count))
.type('aggregation')
.dataset('github')
.build(),
)
}
// Aggregation: high star counts
for (const threshold of QUESTION_THRESHOLDS.github.stars) {
const count = countByPredicate(repos, r => r.stars > threshold)
questions.push(
new QuestionBuilder()
.id(getId())
.prompt(`How many repositories have more than ${threshold} stars?`)
.groundTruth(String(count))
.type('aggregation')
.dataset('github')
.build(),
)
}
// Aggregation: high fork counts
for (const threshold of QUESTION_THRESHOLDS.github.forks) {
const count = countByPredicate(repos, r => r.forks > threshold)
questions.push(
new QuestionBuilder()
.id(getId())
.prompt(`How many repositories have more than ${threshold} forks?`)
.groundTruth(String(count))
.type('aggregation')
.dataset('github')
.build(),
)
}
// Aggregation: high watcher counts
for (const threshold of QUESTION_THRESHOLDS.github.watchers) {
const count = countByPredicate(repos, r => r.watchers > threshold)
questions.push(
new QuestionBuilder()
.id(getId())
.prompt(`How many repositories have more than ${threshold} watchers?`)
.groundTruth(String(count))
.type('aggregation')
.dataset('github')
.build(),
)
}
// Filtering: multi-condition (stars AND forks)
for (const combo of QUESTION_THRESHOLDS.github.starForkCombinations.slice(0, QUESTION_LIMITS.github.filteringStarsAndForks)) {
const count = countByPredicate(
repos,
r => r.stars > combo.stars && r.forks > combo.forks,
)
questions.push(
new QuestionBuilder()
.id(getId())
.prompt(`How many repositories have more than ${combo.stars} stars and more than ${combo.forks} forks?`)
.groundTruth(String(count))
.type('filtering')
.dataset('github')
.build(),
)
}
// Filtering: stars AND watchers
for (const combo of QUESTION_THRESHOLDS.github.starWatcherCombinations) {
const count = countByPredicate(
repos,
r => r.stars > combo.stars && r.watchers > combo.watchers,
)
questions.push(
new QuestionBuilder()
.id(getId())
.prompt(`How many repositories have more than ${combo.stars} stars and more than ${combo.watchers} watchers?`)
.groundTruth(String(count))
.type('filtering')
.dataset('github')
.build(),
)
}
return questions
}

View File

@@ -0,0 +1,46 @@
import type { AnalyticsMetric, Employee, EventLog, NestedConfig, Order, Repository } from '../datasets'
import type { Question } from '../types'
import { ACCURACY_DATASETS } from '../datasets'
import { generateAnalyticsQuestions } from './analytics'
import { generateEventLogsQuestions } from './event-logs'
import { generateGithubQuestions } from './github'
import { generateNestedQuestions } from './nested'
import { generateNestedConfigQuestions } from './nested-config'
import { generateTabularQuestions } from './tabular'
import { createIdGenerator } from './utils'
/**
* Generate all questions from datasets
*
* @remarks
* Generates ~150-160 questions across different question types and datasets:
* - Field Retrieval: Direct field access with no computation
* Examples: "What is X's salary?", "What is the status of order Y?"
* - Aggregation: Counts, sums, averages, min/max operations (including single-condition filters)
* Examples: "How many X?", "What is the total/average?", "How many X > threshold?"
* - Filtering: Multi-condition queries requiring complex logical operations
* Examples: "How many X WHERE condition1 AND condition2?"
*/
export function generateQuestions(): Question[] {
const questions: Question[] = []
const idGen = createIdGenerator()
const getId = () => idGen.next().value
// Get datasets with proper typing
const tabular = (ACCURACY_DATASETS.find(d => d.name === 'tabular')?.data.employees as Employee[]) ?? []
const nested = (ACCURACY_DATASETS.find(d => d.name === 'nested')?.data.orders as Order[]) ?? []
const analytics = (ACCURACY_DATASETS.find(d => d.name === 'analytics')?.data.metrics as AnalyticsMetric[]) ?? []
const github = (ACCURACY_DATASETS.find(d => d.name === 'github')?.data.repositories as Repository[]) ?? []
const eventLogs = (ACCURACY_DATASETS.find(d => d.name === 'event-logs')?.data.logs as EventLog[]) ?? []
const nestedConfig = ACCURACY_DATASETS.find(d => d.name === 'nested-config')?.data as NestedConfig | undefined
// Generate questions for each dataset
questions.push(...generateTabularQuestions(tabular, getId))
questions.push(...generateNestedQuestions(nested, getId))
questions.push(...generateAnalyticsQuestions(analytics, getId))
questions.push(...generateGithubQuestions(github, getId))
questions.push(...generateEventLogsQuestions(eventLogs, getId))
questions.push(...generateNestedConfigQuestions(nestedConfig, getId))
return questions
}
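// Illustrative usage only (not part of the benchmark pipeline): counting questions per type
// is a quick way to sanity-check the ~150-160 question distribution described above.
//
//   const byType = generateQuestions().reduce<Record<string, number>>((acc, q) => {
//     acc[q.type] = (acc[q.type] ?? 0) + 1
//     return acc
//   }, {})
//   // e.g. { 'field-retrieval': ..., 'aggregation': ..., 'filtering': ... }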

View File

@@ -0,0 +1,147 @@
import type { NestedConfig } from '../datasets'
import type { Question } from '../types'
import { QUESTION_LIMITS } from '../constants'
import { QuestionBuilder } from './utils'
/**
* Generate nested configuration questions
*/
export function generateNestedConfigQuestions(config: NestedConfig | undefined, getId: () => string): Question[] {
const questions: Question[] = []
if (!config)
return questions
// Field retrieval: top-level config values
const fieldRetrievalQuestions = [
{
prompt: 'What is the environment in the configuration?',
groundTruth: config.environment,
},
{
prompt: 'What is the database host?',
groundTruth: config.database.host,
},
{
prompt: 'What is the database port?',
groundTruth: String(config.database.port),
},
{
prompt: 'What is the maximum connection pool size?',
groundTruth: String(config.database.pool.max),
},
{
prompt: 'What is the session duration?',
groundTruth: String(config.authentication.session.duration),
},
]
for (const q of fieldRetrievalQuestions.slice(0, QUESTION_LIMITS.nestedConfig.fieldRetrieval)) {
questions.push(
new QuestionBuilder()
.id(getId())
.prompt(q.prompt)
.groundTruth(q.groundTruth)
.type('field-retrieval')
.dataset('nested-config')
.build(),
)
}
// Aggregation: counts of nested structures
const roleCount = Object.keys(config.permissions.roles).length
const groupCount = Object.keys(config.permissions.groups).length
const providerCount = config.authentication.providers.length
const featureCount = Object.keys(config.features).length
const replicaCount = config.database.replicas.length
questions.push(
new QuestionBuilder()
.id(getId())
.prompt('How many roles are defined in permissions?')
.groundTruth(String(roleCount))
.type('aggregation')
.dataset('nested-config')
.build(),
new QuestionBuilder()
.id(getId())
.prompt('How many groups are defined in permissions?')
.groundTruth(String(groupCount))
.type('aggregation')
.dataset('nested-config')
.build(),
new QuestionBuilder()
.id(getId())
.prompt('How many authentication providers are configured?')
.groundTruth(String(providerCount))
.type('aggregation')
.dataset('nested-config')
.build(),
new QuestionBuilder()
.id(getId())
.prompt('How many feature flags are defined?')
.groundTruth(String(featureCount))
.type('aggregation')
.dataset('nested-config')
.build(),
new QuestionBuilder()
.id(getId())
.prompt('How many database replicas are configured?')
.groundTruth(String(replicaCount))
.type('aggregation')
.dataset('nested-config')
.build(),
)
// Aggregation: feature flag details
const enabledFeatures = Object.entries(config.features).filter(([_, f]) => f.enabled).length
questions.push(
new QuestionBuilder()
.id(getId())
.prompt('How many feature flags are enabled?')
.groundTruth(String(enabledFeatures))
.type('aggregation')
.dataset('nested-config')
.build(),
)
// Aggregation: role permissions
const adminPermissions = config.permissions.roles.admin?.permissions.length ?? 0
questions.push(
new QuestionBuilder()
.id(getId())
.prompt('How many permissions does the admin role have?')
.groundTruth(String(adminPermissions))
.type('aggregation')
.dataset('nested-config')
.build(),
)
// Filtering: complex multi-condition queries
const filteringQuestions = [
{
prompt: 'How many feature flags are enabled with rollout greater than 50%?',
groundTruth: String(Object.entries(config.features)
.filter(([_, f]) => f.enabled && f.rollout > 50).length),
},
{
prompt: 'How many groups have the admin role?',
groundTruth: String(Object.entries(config.permissions.groups)
.filter(([_, g]) => g.roles.includes('admin')).length),
},
]
for (const q of filteringQuestions.slice(0, QUESTION_LIMITS.nestedConfig.filteringComplex)) {
questions.push(
new QuestionBuilder()
.id(getId())
.prompt(q.prompt)
.groundTruth(q.groundTruth)
.type('filtering')
.dataset('nested-config')
.build(),
)
}
return questions
}

View File

@@ -0,0 +1,202 @@
import type { Order } from '../datasets'
import type { Question } from '../types'
import { QUESTION_LIMITS, QUESTION_THRESHOLDS } from '../constants'
import { countByPredicate, QuestionBuilder, rotateQuestions, SAMPLE_STRIDES } from './utils'
/**
* Generate nested (orders) questions
*/
export function generateNestedQuestions(orders: Order[], getId: () => string): Question[] {
const questions: Question[] = []
if (orders.length === 0)
return questions
// Field retrieval: order totals and statuses
const orderFieldGenerators: Array<(order: Order, getId: () => string) => Question> = [
(order, getId) => new QuestionBuilder()
.id(getId())
.prompt(`What is the total for order ${order.orderId}?`)
.groundTruth(String(order.total))
.type('field-retrieval')
.dataset('nested')
.build(),
(order, getId) => new QuestionBuilder()
.id(getId())
.prompt(`What is the status of order ${order.orderId}?`)
.groundTruth(order.status)
.type('field-retrieval')
.dataset('nested')
.build(),
]
questions.push(...rotateQuestions(
orders,
orderFieldGenerators,
QUESTION_LIMITS.nested.fieldRetrievalOrders,
SAMPLE_STRIDES.ORDER_FIELD,
getId,
))
// Field retrieval: customer info and order dates
const customerFieldGenerators: Array<(order: Order, getId: () => string) => Question> = [
(order, getId) => new QuestionBuilder()
.id(getId())
.prompt(`What is the customer name for order ${order.orderId}?`)
.groundTruth(order.customer.name)
.type('field-retrieval')
.dataset('nested')
.build(),
(order, getId) => new QuestionBuilder()
.id(getId())
.prompt(`What is the customer email for order ${order.orderId}?`)
.groundTruth(order.customer.email)
.type('field-retrieval')
.dataset('nested')
.build(),
(order, getId) => new QuestionBuilder()
.id(getId())
.prompt(`What is the order date for order ${order.orderId}?`)
.groundTruth(order.orderDate || '')
.type('field-retrieval')
.dataset('nested')
.build(),
(order, getId) => new QuestionBuilder()
.id(getId())
.prompt(`How many items are in order ${order.orderId}?`)
.groundTruth(String(order.items.length))
.type('field-retrieval')
.dataset('nested')
.build(),
]
// Use stride + 1 for customer fields to offset from order fields
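// e.g. with both strides set to 2, the order-level prompts above sample indices 0, 2, 4, ... while these customer prompts sample 1, 3, 5, ...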
const customerOrders = orders.map((_, i) => orders[i * SAMPLE_STRIDES.CUSTOMER_FIELD + 1] || orders[i]).filter(Boolean) as Order[]
questions.push(...rotateQuestions(
customerOrders,
customerFieldGenerators,
QUESTION_LIMITS.nested.fieldRetrievalCustomers,
1,
getId,
))
// Aggregation: totals and averages
const totalRevenue = orders.reduce((sum, o) => sum + o.total, 0)
const avgOrderValue = totalRevenue / orders.length
const totalOrders = orders.length
const maxOrderValue = Math.max(...orders.map(o => o.total))
// Count by status
const statuses = [...new Set(orders.map(o => o.status))]
for (const status of statuses.slice(0, QUESTION_LIMITS.nested.aggregationStatuses)) {
const count = countByPredicate(orders, o => o.status === status)
questions.push(
new QuestionBuilder()
.id(getId())
.prompt(`How many orders have status "${status}"?`)
.groundTruth(String(count))
.type('aggregation')
.dataset('nested')
.build(),
)
}
questions.push(
new QuestionBuilder()
.id(getId())
.prompt('What is the total revenue across all orders?')
.groundTruth(String(totalRevenue.toFixed(2)))
.type('aggregation')
.dataset('nested')
.build(),
new QuestionBuilder()
.id(getId())
.prompt('What is the average order value?')
.groundTruth(String(avgOrderValue.toFixed(2)))
.type('aggregation')
.dataset('nested')
.build(),
new QuestionBuilder()
.id(getId())
.prompt('How many orders are in the dataset?')
.groundTruth(String(totalOrders))
.type('aggregation')
.dataset('nested')
.build(),
new QuestionBuilder()
.id(getId())
.prompt('What is the highest order total?')
.groundTruth(String(maxOrderValue.toFixed(2)))
.type('aggregation')
.dataset('nested')
.build(),
)
// Aggregation: high-value orders (single-condition filter)
for (const threshold of QUESTION_THRESHOLDS.nested.highValueOrders) {
const count = countByPredicate(orders, o => o.total > threshold)
questions.push(
new QuestionBuilder()
.id(getId())
.prompt(`How many orders have a total greater than ${threshold}?`)
.groundTruth(String(count))
.type('aggregation')
.dataset('nested')
.build(),
)
}
// Filtering: multi-condition queries (status AND value)
const orderStatuses = [...new Set(orders.map(o => o.status))]
for (const status of orderStatuses.slice(0, QUESTION_LIMITS.nested.filteringStatusAndValue)) {
const count = countByPredicate(
orders,
o => o.status === status && o.total > QUESTION_THRESHOLDS.nested.statusValueThreshold,
)
questions.push(
new QuestionBuilder()
.id(getId())
.prompt(`How many orders have status "${status}" and total greater than ${QUESTION_THRESHOLDS.nested.statusValueThreshold}?`)
.groundTruth(String(count))
.type('filtering')
.dataset('nested')
.build(),
)
}
// Filtering: status AND items count (multi-condition)
for (const status of orderStatuses.slice(0, QUESTION_LIMITS.nested.filteringStatusAndItems)) {
const count = countByPredicate(
orders,
o => o.status === status && o.items.length >= QUESTION_THRESHOLDS.nested.itemCountThreshold,
)
questions.push(
new QuestionBuilder()
.id(getId())
.prompt(`How many orders have status "${status}" and at least ${QUESTION_THRESHOLDS.nested.itemCountThreshold} items?`)
.groundTruth(String(count))
.type('filtering')
.dataset('nested')
.build(),
)
}
// Filtering: total AND items count (multi-condition)
for (const threshold of QUESTION_THRESHOLDS.nested.totalThresholdsForItems) {
const count = countByPredicate(
orders,
o => o.total > threshold && o.items.length >= QUESTION_THRESHOLDS.nested.itemCountThreshold,
)
questions.push(
new QuestionBuilder()
.id(getId())
.prompt(`How many orders have a total greater than ${threshold} and at least ${QUESTION_THRESHOLDS.nested.itemCountThreshold} items?`)
.groundTruth(String(count))
.type('filtering')
.dataset('nested')
.build(),
)
}
return questions
}

View File

@@ -0,0 +1,191 @@
import type { Employee } from '../datasets'
import type { Question } from '../types'
import { QUESTION_LIMITS, QUESTION_THRESHOLDS } from '../constants'
import { countByPredicate, QuestionBuilder, rotateQuestions, SAMPLE_STRIDES } from './utils'
/**
* Generate tabular (employee) questions
*/
export function generateTabularQuestions(employees: Employee[], getId: () => string): Question[] {
const questions: Question[] = []
if (employees.length === 0)
return questions
// Field retrieval: specific employees
const fieldGenerators: Array<(emp: Employee, getId: () => string) => Question> = [
(emp, getId) => new QuestionBuilder()
.id(getId())
.prompt(`What is the salary of ${emp.name}?`)
.groundTruth(String(emp.salary))
.type('field-retrieval')
.dataset('tabular')
.build(),
(emp, getId) => new QuestionBuilder()
.id(getId())
.prompt(`What department does ${emp.name} work in?`)
.groundTruth(emp.department)
.type('field-retrieval')
.dataset('tabular')
.build(),
(emp, getId) => new QuestionBuilder()
.id(getId())
.prompt(`What is the email address of ${emp.name}?`)
.groundTruth(emp.email)
.type('field-retrieval')
.dataset('tabular')
.build(),
(emp, getId) => new QuestionBuilder()
.id(getId())
.prompt(`How many years of experience does ${emp.name} have?`)
.groundTruth(String(emp.yearsExperience))
.type('field-retrieval')
.dataset('tabular')
.build(),
(emp, getId) => new QuestionBuilder()
.id(getId())
.prompt(`Is ${emp.name} an active employee?`)
.groundTruth(emp.active ? 'yes' : 'no')
.type('field-retrieval')
.dataset('tabular')
.build(),
]
questions.push(...rotateQuestions(
employees,
fieldGenerators,
QUESTION_LIMITS.tabular.fieldRetrieval,
SAMPLE_STRIDES.EMPLOYEE_FIELD,
getId,
))
// Aggregation: count by department
const departments = [...new Set(employees.map(e => e.department))]
for (const dept of departments.slice(0, QUESTION_LIMITS.tabular.aggregationDepartments)) {
const count = countByPredicate(employees, e => e.department === dept)
questions.push(
new QuestionBuilder()
.id(getId())
.prompt(`How many employees work in ${dept}?`)
.groundTruth(String(count))
.type('aggregation')
.dataset('tabular')
.build(),
)
}
// Aggregation: salary ranges (single-condition filters)
for (const threshold of QUESTION_THRESHOLDS.tabular.salaryRanges) {
const count = countByPredicate(employees, e => e.salary > threshold)
questions.push(
new QuestionBuilder()
.id(getId())
.prompt(`How many employees have a salary greater than ${threshold}?`)
.groundTruth(String(count))
.type('aggregation')
.dataset('tabular')
.build(),
)
}
// Aggregation: totals and averages
const totalEmployees = employees.length
const avgSalary = Math.round(employees.reduce((sum, e) => sum + e.salary, 0) / totalEmployees)
const activeCount = countByPredicate(employees, e => e.active)
const inactiveCount = countByPredicate(employees, e => !e.active)
questions.push(
new QuestionBuilder()
.id(getId())
.prompt('How many employees are in the dataset?')
.groundTruth(String(totalEmployees))
.type('aggregation')
.dataset('tabular')
.build(),
new QuestionBuilder()
.id(getId())
.prompt('What is the average salary across all employees?')
.groundTruth(String(avgSalary))
.type('aggregation')
.dataset('tabular')
.build(),
new QuestionBuilder()
.id(getId())
.prompt('How many employees are active?')
.groundTruth(String(activeCount))
.type('aggregation')
.dataset('tabular')
.build(),
new QuestionBuilder()
.id(getId())
.prompt('How many employees are inactive?')
.groundTruth(String(inactiveCount))
.type('aggregation')
.dataset('tabular')
.build(),
)
// Filtering: count by department with salary filter (multi-condition)
for (const dept of departments.slice(0, QUESTION_LIMITS.tabular.filteringMultiConditionDepartments)) {
const count = countByPredicate(
employees,
e => e.department === dept && e.salary > QUESTION_THRESHOLDS.tabular.departmentSalaryThreshold,
)
questions.push(
new QuestionBuilder()
.id(getId())
.prompt(`How many employees in ${dept} have a salary greater than ${QUESTION_THRESHOLDS.tabular.departmentSalaryThreshold}?`)
.groundTruth(String(count))
.type('filtering')
.dataset('tabular')
.build(),
)
}
// Filtering: active employees by experience (multi-condition)
for (const exp of QUESTION_THRESHOLDS.tabular.experienceYears.slice(0, QUESTION_LIMITS.tabular.filteringExperience)) {
const count = countByPredicate(employees, e => e.yearsExperience > exp && e.active)
questions.push(
new QuestionBuilder()
.id(getId())
.prompt(`How many active employees have more than ${exp} years of experience?`)
.groundTruth(String(count))
.type('filtering')
.dataset('tabular')
.build(),
)
}
// Filtering: department by experience (multi-condition)
for (const dept of departments.slice(0, QUESTION_LIMITS.tabular.filteringDepartmentExp)) {
const count = countByPredicate(
employees,
e => e.department === dept && e.yearsExperience > QUESTION_THRESHOLDS.tabular.departmentExperienceThreshold,
)
questions.push(
new QuestionBuilder()
.id(getId())
.prompt(`How many employees in ${dept} have more than ${QUESTION_THRESHOLDS.tabular.departmentExperienceThreshold} years of experience?`)
.groundTruth(String(count))
.type('filtering')
.dataset('tabular')
.build(),
)
}
// Filtering: department by active status (multi-condition)
for (const dept of departments.slice(0, QUESTION_LIMITS.tabular.filteringDepartmentActive)) {
const count = countByPredicate(employees, e => e.department === dept && e.active)
questions.push(
new QuestionBuilder()
.id(getId())
.prompt(`How many active employees work in ${dept}?`)
.groundTruth(String(count))
.type('filtering')
.dataset('tabular')
.build(),
)
}
return questions
}

View File

@@ -0,0 +1,95 @@
import type { Question } from '../types'
// Sampling strides: field-retrieval generators pick every Nth item (items[i * stride]) so questions are spread across a dataset instead of clustering at its start
export const SAMPLE_STRIDES = {
EMPLOYEE_FIELD: 2,
ORDER_FIELD: 2,
CUSTOMER_FIELD: 2,
ANALYTICS_FIELD: 3,
METRIC_FIELD: 3,
REPO_FIELD: 7,
EVENT_LOG_FIELD: 5,
} as const
/**
* ID Generator
*/
export function* createIdGenerator(): Generator<string, never, never> {
let id = 1
while (true) {
yield `q${id++}`
}
}
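// Example: the first call to next() yields 'q1', the second 'q2', and so on; the generator never completes.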
/**
* Question Builder class for fluent question creation
*/
export class QuestionBuilder {
private question: Partial<Question> = {}
id(id: string): this {
this.question.id = id
return this
}
prompt(prompt: string): this {
this.question.prompt = prompt
return this
}
groundTruth(groundTruth: string): this {
this.question.groundTruth = groundTruth
return this
}
type(type: Question['type']): this {
this.question.type = type
return this
}
dataset(dataset: Question['dataset']): this {
this.question.dataset = dataset
return this
}
build(): Question {
if (!this.question.id || !this.question.prompt || !this.question.groundTruth || !this.question.type || !this.question.dataset) {
throw new Error('Incomplete question')
}
return this.question as Question
}
}
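// Usage sketch (illustrative, mirroring the generator modules): questions are assembled fluently, e.g.
//   new QuestionBuilder()
//     .id(getId())
//     .prompt('How many employees are active?')
//     .groundTruth(String(activeCount))
//     .type('aggregation')
//     .dataset('tabular')
//     .build()
// build() throws if any field is missing, so malformed questions fail at generation time rather than during evaluation.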
/**
* Helper: Count items matching a predicate
*/
export function countByPredicate<T>(items: T[], predicate: (item: T) => boolean): number {
return items.filter(predicate).length
}
/**
* Helper: Rotate through question generators
*/
export function rotateQuestions<T>(
items: T[],
generators: Array<(item: T, getId: () => string) => Question>,
limit: number,
stride: number,
getId: () => string,
): Question[] {
const questions: Question[] = []
for (let i = 0; i < Math.min(limit, items.length); i++) {
const item = items[i * stride] || items[i]
if (!item)
continue
const generatorIndex = i % generators.length
const generator = generators[generatorIndex]
if (generator) {
questions.push(generator(item, getId))
}
}
return questions
}
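// Worked example: generateTabularQuestions passes five field generators with
// SAMPLE_STRIDES.EMPLOYEE_FIELD = 2, so iteration i builds a question from generator (i % 5)
// about employees[i * 2], falling back to employees[i] once the stride overshoots the array.
// The stride spreads questions across the dataset; the rotation varies which field is asked about.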

View File

@@ -1,7 +1,8 @@
-import type { EfficiencyRanking, EvaluationResult, FormatResult, Question } from './types'
+import type { Dataset, EfficiencyRanking, EvaluationResult, FormatResult, Question } from './types'
import { FORMATTER_DISPLAY_NAMES } from './constants'
-import { datasets } from './datasets'
+import { ACCURACY_DATASETS } from './datasets'
import { models } from './evaluate'
import { supportsCSV } from './formatters'
import { generateQuestions } from './questions'
import { createProgressBar, tokenize } from './utils'
@@ -16,7 +17,11 @@ export function calculateTokenCounts(
const tokenCounts: Record<string, number> = {}
for (const [formatName, formatter] of Object.entries(formatters)) {
-for (const dataset of datasets) {
+for (const dataset of ACCURACY_DATASETS) {
// Skip CSV for datasets that don't support it
if (formatName === 'csv' && !supportsCSV(dataset))
continue
const formatted = formatter(dataset.data)
const key = `${formatName}-${dataset.name}`
tokenCounts[key] = tokenize(formatted)
@@ -42,9 +47,9 @@ export function calculateFormatResults(
const accuracy = correctCount / totalCount
// Calculate average tokens across all datasets for this format
-const avgTokens = Object.entries(tokenCounts)
+const formatTokenEntries = Object.entries(tokenCounts)
.filter(([key]) => key.startsWith(`${formatName}-`))
-.reduce((sum, [, tokens]) => sum + tokens, 0) / datasets.length
+const avgTokens = formatTokenEntries.reduce((sum, [, tokens]) => sum + tokens, 0) / formatTokenEntries.length
const averageLatency = formatResults.reduce((sum, r) => sum + r.latencyMs, 0) / totalCount
@@ -75,6 +80,8 @@ export function generateAccuracyReport(
return `
Benchmarks test LLM comprehension across different input formats using ${totalQuestions} data retrieval questions on ${modelNames.length} ${modelNames.length === 1 ? 'model' : 'models'}.
${generateDatasetCatalog(ACCURACY_DATASETS)}
#### Efficiency Ranking (Accuracy per 1K Tokens)
${generateEfficiencyRankingReport(formatResults)}
@@ -85,6 +92,38 @@ ${generateDetailedAccuracyReport(formatResults, results, questions, tokenCounts)
`.trimStart()
}
/**
* Generate dataset catalog section
*/
function generateDatasetCatalog(datasets: Dataset[]): string {
const rows = datasets.map((dataset) => {
const csvSupport = supportsCSV(dataset) ? '✓' : '✗'
const rowCount = Object.values(dataset.data)[0]?.length ?? 1
const structure = dataset.metadata.structureClass
const eligibility = `${dataset.metadata.tabularEligibility}%`
return `| ${dataset.description} | ${rowCount} | ${structure} | ${csvSupport} | ${eligibility} |`
}).join('\n')
return `
#### Dataset Catalog
| Dataset | Rows | Structure | CSV Support | Eligibility |
| ------- | ---- | --------- | ----------- | ----------- |
${rows}
**Structure classes:**
- **uniform**: All objects have identical fields with primitive values
- **semi-uniform**: Mix of uniform and non-uniform structures
- **nested**: Objects with nested structures (nested objects or arrays)
- **deep**: Highly nested with minimal tabular eligibility
**CSV Support:** ✓ (supported), ✗ (not supported - would require lossy flattening)
**Eligibility:** Percentage of arrays that qualify for TOON's tabular format (uniform objects with primitive values)
`.trim()
}
/**
 * Generate efficiency ranking report
 */
@@ -168,10 +207,12 @@ function generateDetailedAccuracyReport(
const filteringPercent = ((filteringCount / totalQuestions) * 100).toFixed(0)
// Calculate dataset sizes
-const tabularSize = datasets.find(d => d.name === 'tabular')?.data.employees?.length || 0
+const tabularSize = ACCURACY_DATASETS.find(d => d.name === 'tabular')?.data.employees?.length || 0
-const nestedSize = datasets.find(d => d.name === 'nested')?.data.orders?.length || 0
+const nestedSize = ACCURACY_DATASETS.find(d => d.name === 'nested')?.data.orders?.length || 0
-const analyticsSize = datasets.find(d => d.name === 'analytics')?.data.metrics?.length || 0
+const analyticsSize = ACCURACY_DATASETS.find(d => d.name === 'analytics')?.data.metrics?.length || 0
-const githubSize = datasets.find(d => d.name === 'github')?.data.repositories?.length || 0
+const githubSize = ACCURACY_DATASETS.find(d => d.name === 'github')?.data.repositories?.length || 0
const eventLogsSize = ACCURACY_DATASETS.find(d => d.name === 'event-logs')?.data.logs?.length || 0
const nestedConfigSize = 1 // Single config object
// Calculate number of formats and evaluations
const formatCount = formatResults.length
@@ -208,12 +249,14 @@ This benchmark tests **LLM comprehension and data retrieval accuracy** across di
#### Datasets Tested
-Four datasets designed to test different structural patterns (all contain arrays of uniform objects, TOON's optimal format):
+Six datasets designed to test different structural patterns:
1. **Tabular** (${tabularSize} employee records): Uniform objects with identical fields optimal for TOON's tabular format.
2. **Nested** (${nestedSize} e-commerce orders): Complex structures with nested customer objects and item arrays.
3. **Analytics** (${analyticsSize} days of metrics): Time-series data with dates and numeric values.
4. **GitHub** (${githubSize} repositories): Real-world data from top GitHub repos by stars.
5. **Event Logs** (${eventLogsSize} logs): Semi-uniform data with ~50% flat logs and ~50% with nested error objects.
6. **Nested Config** (${nestedConfigSize} configuration): Deeply nested configuration with minimal tabular eligibility.
#### Question Types
@@ -314,7 +357,7 @@ function generateDatasetBreakdown(
questions: Question[],
tokenCounts: Record<string, number>,
): string {
-return datasets.map((dataset) => {
+return ACCURACY_DATASETS.map((dataset) => {
const datasetResults = formatResults.map((fr) => {
const datasetFormatResults = results.filter(r => r.questionId.includes(dataset.name) || questions.find(q => q.id === r.questionId)?.dataset === dataset.name)
if (datasetFormatResults.length === 0)

View File

@@ -1,7 +1,14 @@
export interface DatasetMetadata {
supportsCSV: boolean
structureClass: 'uniform' | 'semi-uniform' | 'nested' | 'deep'
tabularEligibility: number
}
export interface Dataset {
name: string
description: string
data: Record<string, any>
metadata: DatasetMetadata
}
export interface Question {