mirror of https://github.com/voson-wang/toon.git
synced 2026-01-29 23:34:10 +08:00

test(benchmark): overhaul generation

README.md
@@ -60,6 +60,19 @@ For small payloads, JSON/CSV/YAML work fine. TOON's value emerges at scale: when

</details>

<details>
<summary><strong>When NOT to use TOON</strong></summary>

TOON excels with uniform arrays of objects, but there are cases where other formats are better:

- **Deeply nested or non-uniform structures** (tabular eligibility ≈ 0%): compact JSON often uses fewer tokens. Example: complex configuration objects with many nested levels.
- **Semi-uniform arrays** (~40–60% tabular eligibility): token savings diminish. Prefer JSON if your pipelines already rely on it.
- **Flat CSV use cases**: CSV is smaller than TOON for pure tabular data. TOON adds minimal overhead (~5–10%) to provide structure (length markers, field headers, delimiter scoping) that improves LLM reliability.

See [benchmarks](#benchmarks) for concrete comparisons across different data structures.

</details>

## Key Features

- 💸 **Token-efficient:** typically 30–60% fewer tokens than JSON[^1]
@@ -75,14 +88,16 @@ For small payloads, JSON/CSV/YAML work fine. TOON's value emerges at scale: when

> [!TIP]
> Try the interactive [Format Tokenization Playground](https://www.curiouslychase.com/playground/format-tokenization-exploration) to compare token usage across CSV, JSON, YAML, and TOON with your own data.

Benchmarks are organized into two tracks to ensure fair comparisons:

- **Mixed-Structure Track**: datasets with nested or semi-uniform structures (TOON vs JSON, YAML, XML). CSV is excluded because it cannot properly represent these structures.
- **Flat-Only Track**: datasets with flat tabular structures where CSV is applicable (CSV vs TOON vs JSON, YAML, XML).

### Token Efficiency

Token counts are measured using the GPT-5 `o200k_base` tokenizer via [`gpt-tokenizer`](https://github.com/niieani/gpt-tokenizer). Savings are calculated against formatted JSON (2-space indentation) as the primary baseline, with additional comparisons to compact JSON (minified), YAML, and XML. Actual savings vary by model and tokenizer.

The benchmarks test datasets across different structural patterns (uniform, semi-uniform, nested, deeply nested) to show where TOON excels and where other formats may be better.

> [!NOTE]
> CSV/TSV doesn't support nested structures, so it's not included in this comparison. For flat datasets where CSV applies, see token counts and accuracy metrics in the [Retrieval Accuracy](#retrieval-accuracy) section.

<!-- automd:file src="./benchmarks/results/token-efficiency.md" -->
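The savings figures reported against each baseline follow a simple formula. A minimal sketch in plain TypeScript (independent of the repo's own helpers), using the sign convention from the charts — `−` when TOON saves tokens, `+` when it costs more:

```typescript
// Percentage of the baseline's tokens saved by TOON:
// (baseline - toon) / baseline * 100, rendered with '−' for savings
// and '+' for overhead, matching the benchmark charts.
function savingsPercent(baselineTokens: number, toonTokens: number): string {
  const percent = ((baselineTokens - toonTokens) / baselineTokens) * 100;
  return `${percent >= 0 ? '−' : '+'}${Math.abs(percent).toFixed(1)}%`;
}

// Example from the flat-only track: TOON 49,841 tokens vs formatted JSON 126,886.
console.log(savingsPercent(126886, 49841)); // −60.7%
```

The same function reproduces the overhead cases, e.g. `savingsPercent(57979, 58528)` yields `+0.9%` for TOON vs compact JSON on the nested-orders dataset.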
@@ -34,8 +34,8 @@ Results are saved to `results/token-efficiency.md`.

Tests how well LLMs can answer questions about data in different formats (TOON, JSON, JSON compact, XML, YAML, CSV):

1. Generate ~150-160 questions across 6 datasets (CSV only included for datasets with flat/tabular structure)
2. Convert each dataset to all supported formats
3. Query each LLM with formatted data + question
4. Validate answers using `gpt-5-nano` as judge
5. Aggregate metrics and generate report
@@ -1,36 +1,149 @@

## Mixed-Structure Track

Datasets with nested or semi-uniform structures. CSV excluded as it cannot properly represent these structures.

```
🛒 E-commerce orders with nested structures [eligibility: 33%]
toon ▓▓▓▓▓▓▓▓▓▓▓▓░░░░░░░░  58,528 tokens
  vs JSON (−37.9%)              94,207
  vs JSON compact (+0.9%)       57,979
  vs YAML (−17.8%)              71,223
  vs XML (−45.2%)              106,720

🧾 Semi-uniform event logs [eligibility: 50%]
toon ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓░░░ 154,419 tokens
  vs JSON (−15.0%)             181,592
  vs JSON compact (+19.9%)     128,836
  vs YAML (−0.9%)              155,749
  vs XML (−25.1%)              206,271

🧩 Deeply nested configuration [eligibility: 0%]
toon ▓▓▓▓▓▓▓▓▓▓▓▓▓▓░░░░░░     630 tokens
  vs JSON (−31.4%)                 918
  vs JSON compact (+11.9%)         563
  vs YAML (−6.4%)                  673
  vs XML (−37.4%)                1,007

─────────────────────────────────────────────────────────────────────────────────
Total
toon ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓░░░░░ 213,577 tokens
  vs JSON (−22.8%)             276,717
  vs JSON compact (+14.0%)     187,378
  vs YAML (−6.2%)              227,645
  vs XML (−32.0%)              313,998
```

## Flat-Only Track

Datasets with flat tabular structures where CSV is applicable.

```
👥 Uniform employee records (TOON optimal format) [eligibility: 100%]
csv  ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓░  46,968 tokens
toon ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓  49,841 tokens (+5.8% vs CSV)
  vs JSON (−60.7%)             126,886
  vs JSON compact (−36.8%)      78,882
  vs YAML (−50.0%)              99,743
  vs XML (−66.0%)              146,465

📈 Time-series analytics data [eligibility: 100%]
csv  ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓░░   8,382 tokens
toon ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓   9,114 tokens (+8.0% vs CSV)
  vs JSON (−59.0%)              22,244
  vs JSON compact (−35.9%)      14,210
  vs YAML (−49.0%)              17,857
  vs XML (−65.8%)               26,615

⭐ Top 100 GitHub repositories [eligibility: 100%]
csv  ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓░   8,513 tokens
toon ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓   8,745 tokens (+2.7% vs CSV)
  vs JSON (−42.3%)              15,145
  vs JSON compact (−23.7%)      11,455
  vs YAML (−33.4%)              13,129
  vs XML (−48.8%)               17,095

─────────────────────────────────────────────────────────────────────────────────
Total
csv  ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓░  63,863 tokens
toon ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓  67,700 tokens (+5.7% vs CSV)
  vs JSON (−58.8%)             164,275
  vs JSON compact (−35.2%)     104,547
  vs YAML (−48.2%)             130,729
  vs XML (−64.4%)              190,175
```
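Note that the `+X% vs CSV` annotations in the flat-only track are computed relative to TOON's own token count, mirroring the overhead formula in the report generator script. A small sketch of that calculation (plain TypeScript, using the totals above):

```typescript
// TOON's overhead over the CSV baseline, expressed relative to TOON's
// token count, as rendered in the flat-only track totals.
function overheadVsCsv(toonTokens: number, csvTokens: number): string {
  const percent = ((toonTokens - csvTokens) / toonTokens) * 100;
  return `+${percent.toFixed(1)}% vs CSV`;
}

console.log(overheadVsCsv(67700, 63863)); // +5.7% vs CSV
```

The per-dataset annotations follow the same convention, e.g. `overheadVsCsv(49841, 46968)` gives `+5.8% vs CSV` for the employee-records dataset.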
<details>
<summary><strong>View detailed examples</strong></summary>

#### 📈 Time-series analytics data

**Savings:** 13,130 tokens (59.0% reduction vs JSON)

**JSON** (22,244 tokens):

```json
{
  "metrics": [
    {
      "date": "2025-01-01",
      "views": 4324,
      "clicks": 146,
      "conversions": 21,
      "revenue": 3834.57,
      "bounceRate": 0.4
    },
    {
      "date": "2025-01-02",
      "views": 6248,
      "clicks": 407,
      "conversions": 22,
      "revenue": 2936.12,
      "bounceRate": 0.62
    },
    {
      "date": "2025-01-03",
      "views": 7382,
      "clicks": 270,
      "conversions": 24,
      "revenue": 6825.19,
      "bounceRate": 0.7
    },
    {
      "date": "2025-01-04",
      "views": 4586,
      "clicks": 267,
      "conversions": 24,
      "revenue": 2391.11,
      "bounceRate": 0.64
    },
    {
      "date": "2025-01-05",
      "views": 6171,
      "clicks": 227,
      "conversions": 12,
      "revenue": 3430.1,
      "bounceRate": 0.39
    }
  ]
}
```

**TOON** (9,114 tokens):

```
metrics[5]{date,views,clicks,conversions,revenue,bounceRate}:
  2025-01-01,4324,146,21,3834.57,0.4
  2025-01-02,6248,407,22,2936.12,0.62
  2025-01-03,7382,270,24,6825.19,0.7
  2025-01-04,4586,267,24,2391.11,0.64
  2025-01-05,6171,227,12,3430.1,0.39
```

---

#### ⭐ Top 100 GitHub repositories

**Savings:** 6,400 tokens (42.3% reduction vs JSON)

@@ -91,72 +204,4 @@ repositories[3]{id,name,repo,description,createdAt,updatedAt,pushedAt,stars,watc

```
  21737465,awesome,sindresorhus/awesome,😎 Awesome lists about all kinds of interesting topics,"2014-07-11T13:42:37Z","2025-10-28T12:40:21Z","2025-10-27T17:57:31Z",410052,8017,32029,main
```

</details>
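The tabular TOON blocks above (`metrics[5]{...}:` followed by comma-joined rows) can be sketched in a few lines. This is an illustration only, not the `toon` package's actual encoder, which additionally handles quoting, escaping, nested values, and delimiter options:

```typescript
// Minimal sketch of TOON's tabular array form: a header carrying the array
// length and field names, then one indented comma-joined row per object.
// Illustrative only — quoting, escaping, and nested values are omitted.
function encodeTabular(key: string, rows: Record<string, string | number>[]): string {
  const fields = Object.keys(rows[0] ?? {});
  const header = `${key}[${rows.length}]{${fields.join(',')}}:`;
  const body = rows.map(row => `  ${fields.map(f => String(row[f])).join(',')}`);
  return [header, ...body].join('\n');
}

const toon = encodeTabular('metrics', [
  { date: '2025-01-01', views: 4324, clicks: 146 },
  { date: '2025-01-02', views: 6248, clicks: 407 },
]);
console.log(toon);
// metrics[2]{date,views,clicks}:
//   2025-01-01,4324,146
//   2025-01-02,6248,407
```

The explicit row count and field header are what let an LLM verify it has seen every row — the structural overhead the README trades against CSV's smaller size.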
@@ -5,16 +5,83 @@ import process from 'node:process'

```ts
import * as prompts from '@clack/prompts'
import PQueue from 'p-queue'
import { BENCHMARKS_DIR, DEFAULT_CONCURRENCY, DRY_RUN, DRY_RUN_LIMITS, MODEL_RPM_LIMITS, ROOT_DIR } from '../src/constants'
import { ACCURACY_DATASETS } from '../src/datasets'
import { evaluateQuestion, models } from '../src/evaluate'
import { formatters, supportsCSV } from '../src/formatters'
import { generateQuestions } from '../src/questions'
import { calculateFormatResults, calculateTokenCounts, generateAccuracyReport } from '../src/report'
import { getAllModelResults, hasModelResults, saveModelResults } from '../src/storage'
import { ensureDir } from '../src/utils'

// Constants
const PROGRESS_UPDATE_INTERVAL = 10
const RATE_LIMIT_INTERVAL_MS = 60_000

prompts.intro('Retrieval Accuracy Benchmark')

/**
 * Generate evaluation tasks for a model
 */
function generateEvaluationTasks(questions: Question[]): { question: Question, formatName: string }[] {
  const tasks: { question: Question, formatName: string }[] = []

  for (const question of questions) {
    for (const [formatName] of Object.entries(formatters)) {
      // Skip CSV for datasets that don't support it
      const dataset = ACCURACY_DATASETS.find(d => d.name === question.dataset)
      if (formatName === 'csv' && dataset && !supportsCSV(dataset))
        continue

      tasks.push({ question, formatName })
    }
  }

  return tasks
}

/**
 * Check which models already have saved results
 */
async function checkExistingResults(activeModels: typeof models) {
  const existingModelResults: Record<string, boolean> = {}

  for (const model of activeModels) {
    const existingResult = await hasModelResults(model.modelId)
    if (existingResult)
      existingModelResults[model.modelId] = existingResult
  }

  return existingModelResults
}

/**
 * Create a progress updater function
 */
function createProgressUpdater(spinner: ReturnType<typeof prompts.spinner>, total: number) {
  let completed = 0

  return () => {
    completed++
    if (completed % PROGRESS_UPDATE_INTERVAL === 0 || completed === total) {
      const percent = ((completed / total) * 100).toFixed(1)
      spinner.message(`Progress: ${completed}/${total} (${percent}%)`)
    }
  }
}

/**
 * Create a rate-limited queue for model evaluation
 */
function createEvaluationQueue(modelId: string) {
  const rpmLimit = MODEL_RPM_LIMITS[modelId]

  return new PQueue({
    concurrency: DEFAULT_CONCURRENCY,
    intervalCap: rpmLimit ?? Infinity,
    interval: rpmLimit ? RATE_LIMIT_INTERVAL_MS : 0,
  })
}

// Prompt user to select which models to benchmark
const modelChoices = models.map(({ modelId }) => ({
  value: modelId,
```
@@ -37,15 +104,10 @@ const activeModels = models.filter(m => selectedModels.includes(m.modelId))

```ts
prompts.log.info(`Selected ${activeModels.length} model(s): ${activeModels.map(m => m.modelId).join(', ')}`)

// Check which models already have results
const existingModelResults = await checkExistingResults(activeModels)

if (Object.keys(existingModelResults).length > 0) {
  prompts.log.info(`Found existing results for ${Object.keys(existingModelResults).length} model(s)`)
}

if (DRY_RUN) {
```
@@ -75,31 +137,22 @@ for (const model of activeModels) {

```ts
  prompts.log.step(`Running benchmark for ${modelId}`)

  // Generate evaluation tasks for this model
  const tasks = generateEvaluationTasks(questions)

  const total = tasks.length
  const rpmLimit = MODEL_RPM_LIMITS[modelId]
  const queue = createEvaluationQueue(modelId)

  const evalSpinner = prompts.spinner()
  evalSpinner.start(`Running ${total} evaluations (concurrency: ${DEFAULT_CONCURRENCY}, RPM limit: ${rpmLimit ?? 'unlimited'})`)

  const updateProgress = createProgressUpdater(evalSpinner, total)

  // Queue all tasks
  const modelResultPromises = tasks.map(task =>
    queue.add(async () => {
      // Format data on-demand
      const dataset = ACCURACY_DATASETS.find(d => d.name === task.question.dataset)!
      const formatter = formatters[task.formatName]!
      const formattedData = formatter(dataset.data)
```
@@ -111,11 +164,7 @@

```ts
      })

      // Progress update after task completes
      updateProgress()

      return result
    }),
```
@@ -154,5 +203,5 @@ await ensureDir(resultsDir)

```ts
const outputFilePath = path.join(resultsDir, 'retrieval-accuracy.md')
await fsp.writeFile(outputFilePath, accuracyReport)

reportSpinner.stop('Report generation complete!')

prompts.log.info(`Report saved to: \`${path.relative(ROOT_DIR, outputFilePath)}\``)
```
@@ -1,11 +1,11 @@

```ts
import type { Dataset } from '../src/types'
import * as fsp from 'node:fs/promises'
import * as path from 'node:path'
import * as prompts from '@clack/prompts'
import { encode } from '../../packages/toon/src'
import { BENCHMARKS_DIR, FORMATTER_DISPLAY_NAMES, ROOT_DIR } from '../src/constants'
import { TOKEN_EFFICIENCY_DATASETS } from '../src/datasets'
import { formatters, supportsCSV } from '../src/formatters'
import { createProgressBar, ensureDir, tokenize } from '../src/utils'

interface FormatMetrics {
```
@@ -16,55 +16,160 @@ interface FormatMetrics {
|
|||||||
}
|
}
|
||||||
|
|
||||||
interface BenchmarkResult {
|
interface BenchmarkResult {
|
||||||
name: string
|
dataset: Dataset
|
||||||
emoji: string
|
|
||||||
description: string
|
|
||||||
data: Record<string, any>
|
|
||||||
formats: FormatMetrics[]
|
formats: FormatMetrics[]
|
||||||
showDetailed: boolean
|
|
||||||
}
|
}
|
||||||
|
|
||||||
const BENCHMARK_EXAMPLES = [
|
// Constants
|
||||||
{
|
const DATASET_ICONS: Record<string, string> = {
|
||||||
name: 'GitHub Repositories',
|
'tabular': '👥',
|
||||||
emoji: '⭐',
|
'nested': '🛒',
|
||||||
description: 'Top 100 GitHub repositories with stars, forks, and metadata',
|
'analytics': '📈',
|
||||||
getData: () => ({ repositories: githubRepos }),
|
'github': '⭐',
|
||||||
showDetailed: true,
|
'event-logs': '🧾',
|
||||||
},
|
'nested-config': '🧩',
|
||||||
{
|
}
|
||||||
name: 'Daily Analytics',
|
|
||||||
emoji: '📈',
|
const COMPARISON_FORMAT_ORDER = ['json-pretty', 'json-compact', 'yaml', 'xml'] as const
|
||||||
description: '180 days of web metrics (views, clicks, conversions, revenue)',
|
|
||||||
getData: () => generateAnalyticsData(180),
|
const PROGRESS_BAR_CONFIG = { filled: '▓', empty: '░' } as const
|
||||||
showDetailed: true,
|
const PROGRESS_BAR_WIDTH = 20
|
||||||
},
|
const TOKEN_PADDING = 7
|
||||||
{
|
const LABEL_PADDING = 60
|
||||||
name: 'E-Commerce Order',
|
const COMPARISON_LABEL_PADDING = 30
|
||||||
emoji: '🛒',
|
|
||||||
description: 'Single nested order with customer and items',
|
const SEPARATOR = '─────────────────────────────────────────────────────────────────────────────────'
|
||||||
getData: generateOrderData,
|
const DEFAULT_DATASET_ICON = '📊'
|
||||||
showDetailed: false,
|
|
||||||
},
|
const DETAILED_EXAMPLE_DATASETS = ['github', 'analytics'] as const
|
||||||
] as const
|
const GITHUB_REPO_LIMIT = 3
|
||||||
|
const GITHUB_DESC_LIMIT = 80
|
||||||
|
const ANALYTICS_METRICS_LIMIT = 5
|
||||||
|
|
||||||
prompts.intro('Token Efficiency Benchmark')
|
prompts.intro('Token Efficiency Benchmark')
|
||||||
|
|
||||||
|
/**
|
||||||
|
* Format a comparison line showing savings vs TOON
|
||||||
|
*/
|
||||||
|
function formatComparisonLine(format: FormatMetrics): string {
|
||||||
|
const label = FORMATTER_DISPLAY_NAMES[format.name] || format.name.toUpperCase()
|
||||||
|
const signedPercent = format.savingsPercent >= 0
|
||||||
|
? `−${format.savingsPercent.toFixed(1)}%`
|
||||||
|
: `+${Math.abs(format.savingsPercent).toFixed(1)}%`
|
||||||
|
const labelWithSavings = `vs ${label} (${signedPercent})`.padEnd(COMPARISON_LABEL_PADDING)
|
||||||
|
const tokenStr = format.tokens.toLocaleString('en-US').padStart(TOKEN_PADDING)
|
||||||
|
return ` ${labelWithSavings}${tokenStr}`
|
||||||
|
}
|
||||||
|
|
||||||
|
/**
|
||||||
|
* Calculate total tokens and savings for a set of datasets
|
||||||
|
*/
|
||||||
|
function calculateTotalMetrics(datasets: BenchmarkResult[], formatNames: readonly string[]) {
|
||||||
|
const totalToonTokens = datasets.reduce((sum, r) => {
|
||||||
|
const toon = r.formats.find(f => f.name === 'toon')!
|
||||||
|
return sum + toon.tokens
|
||||||
|
}, 0)
|
||||||
|
|
||||||
|
const totals = formatNames.map((formatName) => {
|
||||||
|
const totalTokens = datasets.reduce((sum, r) => {
|
||||||
|
const format = r.formats.find(f => f.name === formatName)
|
||||||
|
return sum + (format?.tokens || 0)
|
||||||
|
}, 0)
|
||||||
|
const savings = totalTokens - totalToonTokens
|
||||||
|
const savingsPercent = (savings / totalTokens) * 100
|
||||||
|
return { name: formatName, tokens: totalTokens, savingsPercent }
|
||||||
|
})
|
||||||
|
|
||||||
|
return { totalToonTokens, totals }
|
||||||
|
}
|
||||||
|
|
||||||
|
/**
|
||||||
|
* Generate total lines for a track
|
||||||
|
*/
|
||||||
|
function generateTotalLines(
|
||||||
|
totalToonTokens: number,
|
||||||
|
totals: { name: string, tokens: number, savingsPercent: number }[],
|
||||||
|
baselineFormat?: { name: string, tokens: number },
|
||||||
|
) {
|
||||||
|
const lines: string[] = ['Total ']
|
||||||
|
|
||||||
|
if (baselineFormat) {
|
||||||
|
// Flat-only track with CSV baseline
|
||||||
|
const csvPercentage = Math.min(100, (baselineFormat.tokens / totalToonTokens) * 100)
|
||||||
|
const csvBar = createProgressBar(csvPercentage, 100, PROGRESS_BAR_WIDTH, PROGRESS_BAR_CONFIG)
|
||||||
|
const csvStr = baselineFormat.tokens.toLocaleString('en-US').padStart(TOKEN_PADDING)
|
||||||
|
lines.push(`csv ${csvBar} ${csvStr} tokens`)
|
||||||
|
|
||||||
|
const overheadPercent = ((totalToonTokens - baselineFormat.tokens) / totalToonTokens) * 100
|
||||||
|
const toonBar = createProgressBar(100, 100, PROGRESS_BAR_WIDTH, PROGRESS_BAR_CONFIG)
|
||||||
|
const toonStr = totalToonTokens.toLocaleString('en-US').padStart(TOKEN_PADDING)
|
||||||
|
lines.push(`toon ${toonBar} ${toonStr} tokens (+${overheadPercent.toFixed(1)}% vs CSV)`)
|
||||||
|
}
|
||||||
|
else {
|
||||||
|
// Mixed-structure track
|
||||||
|
const totalPercentage = Math.min(100, (totalToonTokens / totals[0]!.tokens) * 100)
|
||||||
|
const totalBar = createProgressBar(totalPercentage, 100, PROGRESS_BAR_WIDTH, PROGRESS_BAR_CONFIG)
|
||||||
|
const toonStr = totalToonTokens.toLocaleString('en-US').padStart(TOKEN_PADDING)
|
||||||
|
lines.push(`toon ${totalBar} ${toonStr} tokens`)
|
||||||
|
}
|
||||||
|
|
||||||
|
// Add comparison lines
|
||||||
|
for (const format of totals) {
|
||||||
|
lines.push(formatComparisonLine({
|
||||||
|
name: format.name,
|
||||||
|
tokens: format.tokens,
|
||||||
|
savings: 0, // Not used in this context
|
||||||
|
savingsPercent: format.savingsPercent,
|
||||||
|
}))
|
||||||
|
}
|
||||||
|
|
||||||
|
return lines.join('\n')
|
||||||
|
}
|
||||||
|
|
||||||
|
/**
|
||||||
|
* Generate bar chart for a dataset
|
||||||
|
*/
|
||||||
|
function generateDatasetChart(result: BenchmarkResult): string {
|
||||||
|
const { dataset, formats } = result
|
||||||
|
const toon = formats.find(f => f.name === 'toon')!
|
||||||
|
const jsonPretty = formats.find(f => f.name === 'json-pretty')!
|
||||||
|
|
||||||
|
const emoji = DATASET_ICONS[dataset.name] || DEFAULT_DATASET_ICON
|
||||||
|
const eligibility = dataset.metadata.tabularEligibility
|
||||||
|
const name = `${dataset.description} [eligibility: ${eligibility}%]`
|
||||||
|
const percentage = Math.min(100, 100 - jsonPretty.savingsPercent)
|
||||||
|
const bar = createProgressBar(percentage, 100, PROGRESS_BAR_WIDTH, PROGRESS_BAR_CONFIG)
|
||||||
|
const toonStr = toon.tokens.toLocaleString('en-US')
|
||||||
|
|
||||||
|
const line1 = `${emoji} ${name.padEnd(LABEL_PADDING)}\ntoon ${bar} ${toonStr.padStart(TOKEN_PADDING)} tokens`
|
||||||
|
|
||||||
|
const comparisonLines = COMPARISON_FORMAT_ORDER.map((formatName) => {
|
||||||
|
const format = formats.find(f => f.name === formatName)
|
||||||
|
if (!format)
|
||||||
|
return null
|
||||||
|
|
||||||
|
return formatComparisonLine(format)
|
||||||
|
}).filter(Boolean)
|
||||||
|
|
||||||
|
return [line1, ...comparisonLines].join('\n')
|
||||||
|
}
|
||||||
|
|
||||||
const results: BenchmarkResult[] = []
|
const results: BenchmarkResult[] = []
|
||||||
const totalTokensByFormat: Record<string, number> = {}
|
|
||||||
|
|
||||||
for (const example of BENCHMARK_EXAMPLES) {
|
// Calculate token counts for all datasets
|
||||||
const data = example.getData()
|
for (const dataset of TOKEN_EFFICIENCY_DATASETS) {
|
||||||
|
|
||||||
// Calculate tokens for each format
|
|
||||||
const formatMetrics: FormatMetrics[] = []
|
const formatMetrics: FormatMetrics[] = []
|
||||||
const tokensByFormat: Record<string, number> = {}
|
const tokensByFormat: Record<string, number> = {}
|
||||||
|
|
||||||
|
// Calculate tokens for each format
|
||||||
for (const [formatName, formatter] of Object.entries(formatters)) {
|
for (const [formatName, formatter] of Object.entries(formatters)) {
|
||||||
const formattedString = formatter(data)
|
// Skip CSV for datasets that don't support it
|
||||||
|
if (formatName === 'csv' && !supportsCSV(dataset))
|
||||||
|
continue
|
||||||
|
|
||||||
|
const formattedString = formatter(dataset.data)
|
||||||
const tokens = tokenize(formattedString)
|
const tokens = tokenize(formattedString)
|
||||||
tokensByFormat[formatName] = tokens
|
tokensByFormat[formatName] = tokens
|
||||||
totalTokensByFormat[formatName] = (totalTokensByFormat[formatName] || 0) + tokens
|
|
||||||
}
|
}
|
||||||
|
|
||||||
// Calculate savings vs TOON
|
// Calculate savings vs TOON
|
||||||
@@ -80,105 +185,126 @@ for (const example of BENCHMARK_EXAMPLES) {
  }

  results.push({
    dataset,
    formats: formatMetrics,
  })
}

// Separate datasets by CSV support
const mixedStructureDatasets = results.filter(r => !supportsCSV(r.dataset))
const flatOnlyDatasets = results.filter(r => supportsCSV(r.dataset))

// Mixed-Structure Track (no CSV)
const mixedCharts = mixedStructureDatasets
  .map(result => generateDatasetChart(result))
  .join('\n\n')

// Flat-Only Track (with CSV)
const flatCharts = flatOnlyDatasets
  .map((result) => {
    const csv = result.formats.find(f => f.name === 'csv')
    const toon = result.formats.find(f => f.name === 'toon')!
    if (!csv)
      return generateDatasetChart(result)

    // Special handling to show CSV first with TOON overhead
    const { dataset } = result
    const emoji = DATASET_ICONS[dataset.name] || DEFAULT_DATASET_ICON
    const eligibility = dataset.metadata.tabularEligibility
    const name = `${dataset.description} [eligibility: ${eligibility}%]`

    // CSV line
    const csvPercentage = Math.min(100, (csv.tokens / toon.tokens) * 100)
    const csvBar = createProgressBar(csvPercentage, 100, PROGRESS_BAR_WIDTH, PROGRESS_BAR_CONFIG)
    const csvStr = csv.tokens.toLocaleString('en-US')

    const line1 = `${emoji} ${name.padEnd(LABEL_PADDING)}\ncsv ${csvBar} ${csvStr.padStart(TOKEN_PADDING)} tokens`

    // TOON line with overhead vs CSV
    const toonOverhead = toon.tokens - csv.tokens
    const toonOverheadPercent = (toonOverhead / toon.tokens) * 100
    const toonBar = createProgressBar(100, 100, PROGRESS_BAR_WIDTH, PROGRESS_BAR_CONFIG)
    const toonStr = toon.tokens.toLocaleString('en-US')
    const toonVsCSV = toonOverheadPercent >= 0
      ? `(+${toonOverheadPercent.toFixed(1)}% vs CSV)`
      : `(${toonOverheadPercent.toFixed(1)}% vs CSV)`
    const toonLine = `toon ${toonBar} ${toonStr.padStart(TOKEN_PADDING)} tokens ${toonVsCSV}`

    // Other format comparisons (vs TOON)
    const comparisonLines = COMPARISON_FORMAT_ORDER.map((formatName) => {
      const format = result.formats.find(f => f.name === formatName)
      if (!format)
        return null

      return formatComparisonLine(format)
    }).filter(Boolean)

    return [line1, toonLine, ...comparisonLines].join('\n')
  })
  .join('\n\n')

// Calculate totals for mixed structure
const { totalToonTokens: totalToonTokensMixed, totals: mixedTotals } = calculateTotalMetrics(mixedStructureDatasets, COMPARISON_FORMAT_ORDER)
const mixedTotalLines = generateTotalLines(totalToonTokensMixed, mixedTotals)

// Calculate totals for flat-only
const { totalToonTokens: totalToonTokensFlat, totals: flatTotals } = calculateTotalMetrics(flatOnlyDatasets, COMPARISON_FORMAT_ORDER)
const totalCSVTokensFlat = flatOnlyDatasets.reduce((sum, r) => {
  const csv = r.formats.find(f => f.name === 'csv')
  return sum + (csv?.tokens || 0)
}, 0)
const flatTotalLines = generateTotalLines(totalToonTokensFlat, flatTotals, { name: 'csv', tokens: totalCSVTokensFlat })

const barChartSection = `
## Mixed-Structure Track

Datasets with nested or semi-uniform structures. CSV excluded as it cannot properly represent these structures.

\`\`\`
${mixedCharts}

${SEPARATOR}
${mixedTotalLines}
\`\`\`

## Flat-Only Track

Datasets with flat tabular structures where CSV is applicable.

\`\`\`
${flatCharts}

${SEPARATOR}
${flatTotalLines}
\`\`\`
`.trim()

// Generate detailed examples (optional: show a few examples)
const detailedExamples = results
  .filter(r => DETAILED_EXAMPLE_DATASETS.includes(r.dataset.name as any))
  .map((result, i, filtered) => {
    let displayData = result.dataset.data

    // Truncate for display
    if (result.dataset.name === 'github') {
      displayData = {
        repositories: displayData.repositories.slice(0, GITHUB_REPO_LIMIT).map((repo: Record<string, any>) => ({
          ...repo,
          description: repo.description?.slice(0, GITHUB_DESC_LIMIT) + (repo.description?.length > GITHUB_DESC_LIMIT ? '…' : ''),
        })),
      }
    }
    else if (result.dataset.name === 'analytics') {
      displayData = { metrics: displayData.metrics.slice(0, ANALYTICS_METRICS_LIMIT) }
    }

    const separator = i < filtered.length - 1 ? '\n\n---' : ''
    const emoji = DATASET_ICONS[result.dataset.name] || DEFAULT_DATASET_ICON
    const json = result.formats.find(f => f.name === 'json-pretty')!
    const toon = result.formats.find(f => f.name === 'toon')!

    return `#### ${emoji} ${result.dataset.description}

**Savings:** ${json.savings.toLocaleString('en-US')} tokens (${json.savingsPercent.toFixed(1)}% reduction vs JSON)

@@ -197,9 +323,7 @@ ${encode(displayData)}
  .join('\n\n')

const markdown = `
${barChartSection}

<details>
<summary><strong>View detailed examples</strong></summary>

@@ -209,7 +333,7 @@ ${detailedExamples}
</details>
`.trimStart()

prompts.log.message(barChartSection)

const resultsDir = path.join(BENCHMARKS_DIR, 'results')
await ensureDir(resultsDir)
@@ -8,7 +8,7 @@ export const BENCHMARKS_DIR: string = url.fileURLToPath(new URL('../', import.me
 * Model-specific RPM (requests per minute) limits to handle API quotas
 *
 * @remarks
 * Set `undefined` for models without specific limits.
 */
/// keep-sorted
export const MODEL_RPM_LIMITS: Record<string, number | undefined> = {
@@ -39,7 +39,7 @@ export const FORMATTER_DISPLAY_NAMES: Record<string, string> = {
 * Enable dry run mode for quick testing with limited AI requests
 *
 * @remarks
 * Set via environment variable: `DRY_RUN=true`.
 */
export const DRY_RUN: boolean = process.env.DRY_RUN === 'true'

@@ -123,4 +123,14 @@ export const QUESTION_LIMITS = {
    aggregationBranches: 2,
    filteringStarsAndForks: 8,
  },
  eventLogs: {
    fieldRetrieval: 10,
    aggregationEndpoints: 3,
    filteringLevelAndStatus: 2,
    filteringEndpointAndStatus: 2,
  },
  nestedConfig: {
    fieldRetrieval: 5,
    filteringComplex: 2,
  },
} as const
@@ -5,6 +5,67 @@ import githubRepos from '../data/github-repos.json' with { type: 'json' }
// Seed for reproducibility
faker.seed(12345)

/**
 * Calculate the tabular eligibility percentage of a data structure
 *
 * @remarks
 * Recursively analyzes data to determine what percentage of arrays qualify
 * for TOON's tabular format (uniform objects with primitive values only).
 */
export function calculateTabularEligibility(data: unknown): number {
  let totalArrays = 0
  let tabularArrays = 0

  function isTabularArray(arr: unknown[]): boolean {
    if (arr.length === 0)
      return false

    // Check if all elements are objects
    if (!arr.every(item => typeof item === 'object' && item !== null && !Array.isArray(item)))
      return false

    // Get keys from first object
    const firstKeys = Object.keys(arr[0] as Record<string, unknown>)
    if (firstKeys.length === 0)
      return false

    // Check if all objects have the same keys and only primitive values
    return arr.every((item) => {
      const itemObj = item as Record<string, unknown>
      const itemKeys = Object.keys(itemObj)
      if (itemKeys.length !== firstKeys.length)
        return false
      if (!firstKeys.every(key => itemKeys.includes(key)))
        return false

      // Check if all values are primitives (no nested objects or arrays)
      return firstKeys.every((key) => {
        const value = itemObj[key]
        return value === null || ['string', 'number', 'boolean'].includes(typeof value)
      })
    })
  }

  function traverse(obj: unknown): void {
    if (Array.isArray(obj)) {
      totalArrays++
      if (isTabularArray(obj))
        tabularArrays++

      // Continue traversing array elements
      obj.forEach(item => traverse(item))
    }
    else if (typeof obj === 'object' && obj !== null) {
      // Traverse object properties
      Object.values(obj).forEach(value => traverse(value))
    }
  }

  traverse(data)

  return totalArrays === 0 ? 0 : Math.round((tabularArrays / totalArrays) * 100)
}

/**
 * Employee record structure for tabular dataset
 */
@@ -73,6 +134,78 @@ export interface Repository {
  pushedAt: string
}

/**
 * Event log structure for semi-uniform dataset
 */
export interface EventLog {
  timestamp: string
  level: 'info' | 'warn' | 'error'
  endpoint: string
  statusCode: number
  responseTime: number
  userId: number
  error?: {
    message: string
    stack: string
    retryable: boolean
  }
}

/**
 * Nested configuration structure for deeply nested dataset
 */
export interface NestedConfig {
  environment: string
  version: string
  database: {
    host: string
    port: number
    name: string
    pool: {
      min: number
      max: number
      idleTimeout: number
    }
    replicas: {
      host: string
      port: number
      priority: number
    }[]
  }
  features: Record<string, {
    enabled: boolean
    rollout: number
    variants: {
      name: string
      weight: number
      config: Record<string, any>
    }[]
  }>
  authentication: {
    providers: {
      name: string
      clientId: string
      scopes: string[]
      config: Record<string, any>
    }[]
    session: {
      secret: string
      duration: number
      refreshThreshold: number
    }
  }
  permissions: {
    roles: Record<string, {
      permissions: string[]
      inherits: string[]
    }>
    groups: Record<string, {
      members: string[]
      roles: string[]
    }>
  }
}

/**
 * Generate analytics time-series data
 */
@@ -108,17 +241,13 @@ export function generateAnalyticsData(days: number, startDate = '2025-01-01'): {
}

/**
 * Generate employee data (uniform tabular structure)
 */
const departments: readonly string[] = ['Engineering', 'Sales', 'Marketing', 'HR', 'Operations', 'Finance'] as const
function generateEmployees(count: number): { employees: Employee[] } {
  return {
    employees: Array.from({ length: count }, (_, i): Employee => {
      const yearsExp = faker.number.int({ min: 1, max: 25 })
      return {
        id: i + 1,
@@ -130,72 +259,132 @@ const tabularDataset: Dataset = {
        active: faker.datatype.boolean(0.8), // 80% active
      }
    }),
  }
}

/**
 * Tabular dataset: Uniform employee records
 *
 * @remarks
 * Tests TOON's tabular array format.
 */
const tabularDataset: Dataset = {
  name: 'tabular',
  description: 'Uniform employee records (TOON optimal format)',
  data: generateEmployees(100),
  metadata: {
    supportsCSV: true,
    structureClass: 'uniform',
    tabularEligibility: 100,
  },
}

/**
 * Generate e-commerce orders (nested structure)
 */
const PRODUCT_NAMES = ['Wireless Mouse', 'USB Cable', 'Laptop Stand', 'Keyboard', 'Webcam', 'Headphones', 'Monitor', 'Desk Lamp'] as const
const ORDER_STATUSES = ['pending', 'processing', 'shipped', 'delivered', 'cancelled'] as const

const ORDER_CONSTANTS = {
  CUSTOMER_ID_MOD: 20,
  MIN_ITEMS: 1,
  MAX_ITEMS: 4,
  MIN_ITEM_PRICE: 9.99,
  MAX_ITEM_PRICE: 199.99,
  MIN_ITEM_QUANTITY: 1,
  MAX_ITEM_QUANTITY: 5,
  SKU_LENGTH: 6,
  ORDER_ID_PADDING: 4,
  RECENT_DAYS: 90,
  TAX_RATE: 0.08,
} as const

function generateOrders(count: number): { orders: Order[] } {
  return {
    orders: Array.from({ length: count }, (_, i) => {
      const customerId = (i % ORDER_CONSTANTS.CUSTOMER_ID_MOD) + 1
      const itemCount = faker.number.int({ min: ORDER_CONSTANTS.MIN_ITEMS, max: ORDER_CONSTANTS.MAX_ITEMS })

      const items = Array.from({ length: itemCount }, (_, j) => {
        const price = faker.number.float({
          min: ORDER_CONSTANTS.MIN_ITEM_PRICE,
          max: ORDER_CONSTANTS.MAX_ITEM_PRICE,
          fractionDigits: 2,
        })
        const quantity = faker.number.int({
          min: ORDER_CONSTANTS.MIN_ITEM_QUANTITY,
          max: ORDER_CONSTANTS.MAX_ITEM_QUANTITY,
        })
        return {
          sku: `SKU-${faker.string.alphanumeric({ length: ORDER_CONSTANTS.SKU_LENGTH }).toUpperCase()}`,
          name: PRODUCT_NAMES[j % PRODUCT_NAMES.length]!,
          quantity,
          price,
        }
      })

      const subtotal = Number(items.reduce((sum, item) => sum + (item.price * item.quantity), 0).toFixed(2))
      const tax = Number((subtotal * ORDER_CONSTANTS.TAX_RATE).toFixed(2))
      const total = Number((subtotal + tax).toFixed(2))

      return {
        orderId: `ORD-${String(i + 1).padStart(ORDER_CONSTANTS.ORDER_ID_PADDING, '0')}`,
        customer: {
          id: customerId,
          name: faker.person.fullName(),
          email: faker.internet.email().toLowerCase(),
          phone: faker.phone.number(),
        },
        items,
        subtotal,
        tax,
        total,
        status: ORDER_STATUSES[i % ORDER_STATUSES.length]!,
        orderDate: faker.date.recent({ days: ORDER_CONSTANTS.RECENT_DAYS }).toISOString().split('T')[0],
      }
    }),
  }
}

/**
 * Nested dataset: E-commerce orders with nested structures
 *
 * @remarks
 * Tests TOON's handling of complex nested objects.
 */
const nestedDataset: Dataset = {
  name: 'nested',
  description: 'E-commerce orders with nested structures',
  data: generateOrders(50),
  metadata: {
    supportsCSV: false,
    structureClass: 'nested',
    tabularEligibility: 33, // orders array is not tabular, but items arrays within are
  },
}

/**
 * Analytics dataset: Time-series metrics
 *
 * @remarks
 * Tests TOON's handling of numeric data and date fields.
 */
const analyticsDataset: Dataset = {
  name: 'analytics',
  description: 'Time-series analytics data',
  data: generateAnalyticsData(60),
  metadata: {
    supportsCSV: true,
    structureClass: 'uniform',
    tabularEligibility: 100,
  },
}

/**
 * Real-world dataset: Top 100 starred GitHub repositories
 *
 * @remarks
 * Tests TOON's tabular format with real data.
 */
const githubDataset: Dataset = {
  name: 'github',
@@ -203,13 +392,18 @@ const githubDataset: Dataset = {
  data: {
    repositories: githubRepos,
  },
  metadata: {
    supportsCSV: true,
    structureClass: 'uniform',
    tabularEligibility: 100,
  },
}

/**
 * Generate a single e-commerce order with nested structure
 *
 * @remarks
 * Used for token efficiency benchmarks.
 */
export function generateOrderData(): Order {
  return {
@@ -235,11 +429,257 @@ export function generateOrderData(): Order {
}

/**
 * Generate event logs (semi-uniform structure)
 *
 * @remarks
 * Approximately 50% of logs include nested error objects, 50% are flat.
 * This creates ~45% tabular eligibility.
 */
export function generateEventLogs(count: number): { logs: EventLog[] } {
  const endpoints = ['/api/users', '/api/orders', '/api/products', '/api/auth', '/api/payments']
  const levels = ['info', 'warn', 'error'] as const

  return {
    logs: Array.from({ length: count }, () => {
      const level = faker.helpers.arrayElement(levels)
      const hasError = level === 'error' || (level === 'warn' && faker.datatype.boolean(0.3))

      const log: EventLog = {
        timestamp: faker.date.recent({ days: 7 }).toISOString(),
        level,
        endpoint: faker.helpers.arrayElement(endpoints),
        statusCode: hasError
          ? faker.number.int({ min: 400, max: 599 })
          : faker.number.int({ min: 200, max: 299 }),
        responseTime: faker.number.int({ min: 10, max: 5000 }),
        userId: faker.number.int({ min: 1000, max: 9999 }),
      }

      if (hasError) {
        log.error = {
          message: faker.helpers.arrayElement([
            'Database connection timeout',
            'Invalid authentication token',
            'Resource not found',
            'Internal server error',
            'Rate limit exceeded',
          ]),
          stack: `Error: ${faker.lorem.sentence()}\n at ${faker.lorem.word()}\n at ${faker.lorem.word()}`,
          retryable: faker.datatype.boolean(0.6),
        }
      }

      return log
    }),
  }
}

/**
 * Generate deeply nested configuration
 *
 * @remarks
 * Creates a complex nested structure with minimal tabular eligibility (~0%).
 */
export function generateNestedConfig(): NestedConfig {
  return {
    environment: faker.helpers.arrayElement(['production', 'staging', 'development']),
    version: faker.system.semver(),
    database: {
      host: faker.internet.domainName(),
      port: 5432,
      name: faker.database.type(),
      pool: {
        min: 2,
        max: faker.number.int({ min: 10, max: 50 }),
        idleTimeout: 30000,
      },
      replicas: Array.from({ length: 3 }, (_, i) => ({
        host: `replica-${i + 1}.${faker.internet.domainName()}`,
        port: 5432,
        priority: i + 1,
      })),
    },
    features: {
      darkMode: {
        enabled: faker.datatype.boolean(),
        rollout: faker.number.int({ min: 0, max: 100 }),
        variants: [
          {
            name: 'default',
            weight: 70,
            config: { theme: 'dark', animations: true },
          },
          {
            name: 'minimal',
            weight: 30,
            config: { theme: 'dark', animations: false },
          },
        ],
      },
      analytics: {
        enabled: faker.datatype.boolean(),
        rollout: faker.number.int({ min: 0, max: 100 }),
        variants: [
          {
            name: 'full',
            weight: 100,
            config: { tracking: 'all', sampling: 1.0 },
          },
        ],
      },
    },
    authentication: {
      providers: [
        {
          name: 'oauth2',
          clientId: faker.string.uuid(),
          scopes: ['read', 'write', 'admin'],
          config: {
            authUrl: faker.internet.url(),
            tokenUrl: faker.internet.url(),
          },
        },
        {
          name: 'saml',
          clientId: faker.string.uuid(),
          scopes: ['read'],
          config: {
            entryPoint: faker.internet.url(),
            cert: faker.string.alphanumeric({ length: 64 }),
          },
        },
      ],
      session: {
        secret: faker.string.alphanumeric({ length: 32 }),
        duration: 86400,
        refreshThreshold: 3600,
      },
    },
    permissions: {
      roles: {
        admin: {
          permissions: ['read', 'write', 'delete', 'manage_users', 'manage_roles'],
          inherits: [],
        },
        editor: {
          permissions: ['read', 'write'],
          inherits: ['viewer'],
        },
        viewer: {
          permissions: ['read'],
          inherits: [],
        },
      },
      groups: {
        engineering: {
          members: Array.from({ length: 5 }, () => faker.internet.email()),
          roles: ['admin', 'editor'],
        },
        support: {
          members: Array.from({ length: 3 }, () => faker.internet.email()),
          roles: ['viewer'],
        },
      },
    },
  }
}

/**
 * Event logs dataset: Semi-uniform structure
 *
 * @remarks
 * Tests TOON with semi-uniform data (~50% flat, ~50% with nested errors).
 */
const eventLogsDataset: Dataset = {
  name: 'event-logs',
  description: 'Semi-uniform event logs',
  data: generateEventLogs(75),
  metadata: {
    supportsCSV: false,
    structureClass: 'semi-uniform',
    tabularEligibility: 50, // ~50% of logs have nested error objects
  },
}

/**
 * Nested config dataset: Deeply nested structure
 *
 * @remarks
 * Tests TOON's worst-case scenario with deeply nested configuration.
 */
const nestedConfigDataset: Dataset = {
  name: 'nested-config',
  description: 'Deeply nested configuration',
  data: generateNestedConfig(),
  metadata: {
    supportsCSV: false,
    structureClass: 'deep',
    tabularEligibility: 0, // Highly nested, minimal tabular arrays
  },
}

/**
 * Datasets for accuracy benchmarks (smaller sizes for faster evaluation)
 */
export const ACCURACY_DATASETS: Dataset[] = [
  tabularDataset, // 100 employees
  nestedDataset, // 50 orders
  analyticsDataset, // 60 days
  githubDataset, // 100 repos
  eventLogsDataset, // 75 logs
  nestedConfigDataset, // 1 config
]

/**
 * Datasets for token efficiency benchmarks (larger sizes to amplify token differences)
 */
export const TOKEN_EFFICIENCY_DATASETS: Dataset[] = [
  // Tabular: 2000 employees
  {
    name: 'tabular',
    description: 'Uniform employee records (TOON optimal format)',
    data: generateEmployees(2000),
    metadata: {
      supportsCSV: true,
      structureClass: 'uniform',
      tabularEligibility: 100,
    },
  },
  // Nested: 500 orders
  {
    name: 'nested',
    description: 'E-commerce orders with nested structures',
    data: generateOrders(500),
    metadata: {
      supportsCSV: false,
      structureClass: 'nested',
      tabularEligibility: 33,
    },
  },
  // Analytics: 365 days
  {
    name: 'analytics',
    description: 'Time-series analytics data',
    data: generateAnalyticsData(365),
    metadata: {
      supportsCSV: true,
      structureClass: 'uniform',
      tabularEligibility: 100,
    },
  },
  // GitHub: 100 repos (same as accuracy)
|
||||||
|
githubDataset,
|
||||||
|
// Event logs: 2000 logs
|
||||||
|
{
|
||||||
|
name: 'event-logs',
|
||||||
|
description: 'Semi-uniform event logs',
|
||||||
|
data: generateEventLogs(2000),
|
||||||
|
metadata: {
|
||||||
|
supportsCSV: false,
|
||||||
|
structureClass: 'semi-uniform',
|
||||||
|
tabularEligibility: 50,
|
||||||
|
},
|
||||||
|
},
|
||||||
|
// Nested config: 1 config (same as accuracy)
|
||||||
|
nestedConfigDataset,
|
||||||
]
|
]
|
||||||
|
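The `tabularEligibility` figures in the dataset metadata above are hand-assigned per dataset. As an aside, a rough estimate of such a share can be computed mechanically: count the rows that are flat objects sharing one uniform key set. This is only an illustrative sketch; the heuristic and the function name are not part of the benchmark code.

```typescript
// Rough sketch: estimate what percentage of an array's rows are "tabular",
// i.e. flat objects whose key set matches the first row's. Illustrative only.
type Row = Record<string, unknown>

function tabularEligibility(rows: Row[]): number {
  const first = rows[0]
  if (!first)
    return 0
  const referenceKeys = JSON.stringify(Object.keys(first).sort())
  const eligible = rows.filter(r =>
    JSON.stringify(Object.keys(r).sort()) === referenceKeys
    // nested objects/arrays disqualify a row from the flat tabular form
    && Object.values(r).every(v => v === null || typeof v !== 'object'),
  ).length
  return Math.round((eligible / rows.length) * 100)
}

const logs: Row[] = [
  { id: 1, level: 'info' },
  { id: 2, level: 'error', error: { code: 500 } },
]
console.log(tabularEligibility(logs)) // 50: one of the two rows is flat and uniform
```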
@@ -1,3 +1,4 @@
import type { Dataset } from './types'
import { stringify as stringifyCSV } from 'csv-stringify/sync'
import { XMLBuilder } from 'fast-xml-parser'
import { stringify as stringifyYAML } from 'yaml'
@@ -75,3 +76,14 @@ function toXML(data: unknown): string {
  return builder.build(data)
}

/**
 * Check if a dataset supports CSV format
 *
 * @remarks
 * CSV is only suitable for flat tabular data. Datasets with nested structures
 * should not be compared using CSV as it cannot properly represent the data.
 */
export function supportsCSV(dataset: Dataset): boolean {
  return dataset.metadata.supportsCSV
}
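The new `supportsCSV` helper lets the benchmark runner exclude CSV for datasets whose structure CSV cannot represent. A hypothetical call site might look like the following; the format list and `formatsFor` name are illustrative, not from the repository.

```typescript
// Hypothetical call site: pick the encodings to benchmark for a dataset,
// adding CSV only when its metadata marks it as flat tabular data.
interface DatasetMetadata { supportsCSV: boolean }
interface Dataset { name: string, metadata: DatasetMetadata }

function supportsCSV(dataset: Dataset): boolean {
  return dataset.metadata.supportsCSV
}

function formatsFor(dataset: Dataset): string[] {
  const base = ['json', 'yaml', 'xml', 'toon']
  return supportsCSV(dataset) ? [...base, 'csv'] : base
}

const nestedOrders: Dataset = { name: 'nested', metadata: { supportsCSV: false } }
console.log(formatsFor(nestedOrders)) // CSV omitted for nested data
```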
@@ -1,711 +0,0 @@
/**
|
|
||||||
* Question generation for TOON benchmarks
|
|
||||||
*
|
|
||||||
* Generates ~150-160 questions across different question types and datasets:
|
|
||||||
* - Field Retrieval: Direct field access with no computation
|
|
||||||
* Examples: "What is X's salary?", "What is the status of order Y?"
|
|
||||||
* - Aggregation: Counts, sums, averages, min/max operations (including single-condition filters)
|
|
||||||
* Examples: "How many X?", "What is the total/average?", "How many X > threshold?"
|
|
||||||
* - Filtering: Multi-condition queries requiring complex logical operations
|
|
||||||
* Examples: "How many X WHERE condition1 AND condition2?"
|
|
||||||
*/
|
|
||||||
|
|
||||||
import type { AnalyticsMetric, Employee, Order, Repository } from './datasets'
|
|
||||||
import type { Question } from './types'
|
|
||||||
import { QUESTION_LIMITS, QUESTION_THRESHOLDS } from './constants'
|
|
||||||
import { datasets } from './datasets'
|
|
||||||
|
|
||||||
/**
|
|
||||||
* Generate all questions from datasets
|
|
||||||
*/
|
|
||||||
export function generateQuestions(): Question[] {
|
|
||||||
const questions: Question[] = []
|
|
||||||
let idCounter = 1
|
|
||||||
|
|
||||||
// Get datasets with proper typing
|
|
||||||
const tabular = (datasets.find(d => d.name === 'tabular')?.data.employees as Employee[]) ?? []
|
|
||||||
const nested = (datasets.find(d => d.name === 'nested')?.data.orders as Order[]) ?? []
|
|
||||||
const analytics = (datasets.find(d => d.name === 'analytics')?.data.metrics as AnalyticsMetric[]) ?? []
|
|
||||||
const github = (datasets.find(d => d.name === 'github')?.data.repositories as Repository[]) ?? []
|
|
||||||
|
|
||||||
if (tabular.length > 0) {
|
|
||||||
// Field retrieval: specific employees
|
|
||||||
for (let i = 0; i < Math.min(QUESTION_LIMITS.tabular.fieldRetrieval, tabular.length); i++) {
|
|
||||||
const emp = tabular[i * 2] || tabular[i]
|
|
||||||
if (!emp)
|
|
||||||
continue
|
|
||||||
|
|
||||||
// Rotate through all field types
|
|
||||||
if (i % 5 === 0) {
|
|
||||||
questions.push({
|
|
||||||
id: `q${idCounter++}`,
|
|
||||||
prompt: `What is the salary of ${emp.name}?`,
|
|
||||||
groundTruth: String(emp.salary),
|
|
||||||
type: 'field-retrieval',
|
|
||||||
dataset: 'tabular',
|
|
||||||
})
|
|
||||||
}
|
|
||||||
else if (i % 5 === 1) {
|
|
||||||
questions.push({
|
|
||||||
id: `q${idCounter++}`,
|
|
||||||
prompt: `What department does ${emp.name} work in?`,
|
|
||||||
groundTruth: emp.department,
|
|
||||||
type: 'field-retrieval',
|
|
||||||
dataset: 'tabular',
|
|
||||||
})
|
|
||||||
}
|
|
||||||
else if (i % 5 === 2) {
|
|
||||||
questions.push({
|
|
||||||
id: `q${idCounter++}`,
|
|
||||||
prompt: `What is the email address of ${emp.name}?`,
|
|
||||||
groundTruth: emp.email,
|
|
||||||
type: 'field-retrieval',
|
|
||||||
dataset: 'tabular',
|
|
||||||
})
|
|
||||||
}
|
|
||||||
else if (i % 5 === 3) {
|
|
||||||
questions.push({
|
|
||||||
id: `q${idCounter++}`,
|
|
||||||
prompt: `How many years of experience does ${emp.name} have?`,
|
|
||||||
groundTruth: String(emp.yearsExperience),
|
|
||||||
type: 'field-retrieval',
|
|
||||||
dataset: 'tabular',
|
|
||||||
})
|
|
||||||
}
|
|
||||||
else {
|
|
||||||
questions.push({
|
|
||||||
id: `q${idCounter++}`,
|
|
||||||
prompt: `Is ${emp.name} an active employee?`,
|
|
||||||
groundTruth: emp.active ? 'yes' : 'no',
|
|
||||||
type: 'field-retrieval',
|
|
||||||
dataset: 'tabular',
|
|
||||||
})
|
|
||||||
}
|
|
||||||
}
|
|
||||||
|
|
||||||
// Aggregation: count by department
|
|
||||||
const departments = [...new Set(tabular.map(e => e.department))]
|
|
||||||
for (const dept of departments.slice(0, QUESTION_LIMITS.tabular.aggregationDepartments)) {
|
|
||||||
const count = tabular.filter(e => e.department === dept).length
|
|
||||||
questions.push({
|
|
||||||
id: `q${idCounter++}`,
|
|
||||||
prompt: `How many employees work in ${dept}?`,
|
|
||||||
groundTruth: String(count),
|
|
||||||
type: 'aggregation',
|
|
||||||
dataset: 'tabular',
|
|
||||||
})
|
|
||||||
}
|
|
||||||
|
|
||||||
// Aggregation: salary ranges (single-condition filters)
|
|
||||||
for (const threshold of QUESTION_THRESHOLDS.tabular.salaryRanges) {
|
|
||||||
const count = tabular.filter(e => e.salary > threshold).length
|
|
||||||
questions.push({
|
|
||||||
id: `q${idCounter++}`,
|
|
||||||
prompt: `How many employees have a salary greater than ${threshold}?`,
|
|
||||||
groundTruth: String(count),
|
|
||||||
type: 'aggregation',
|
|
||||||
dataset: 'tabular',
|
|
||||||
})
|
|
||||||
}
|
|
||||||
|
|
||||||
// Aggregation: totals and averages
|
|
||||||
const totalEmployees = tabular.length
|
|
||||||
const avgSalary = Math.round(tabular.reduce((sum, e) => sum + e.salary, 0) / totalEmployees)
|
|
||||||
const activeCount = tabular.filter(e => e.active).length
|
|
||||||
const inactiveCount = tabular.filter(e => !e.active).length
|
|
||||||
|
|
||||||
questions.push(
|
|
||||||
{
|
|
||||||
id: `q${idCounter++}`,
|
|
||||||
prompt: 'How many employees are in the dataset?',
|
|
||||||
groundTruth: String(totalEmployees),
|
|
||||||
type: 'aggregation',
|
|
||||||
dataset: 'tabular',
|
|
||||||
},
|
|
||||||
{
|
|
||||||
id: `q${idCounter++}`,
|
|
||||||
prompt: 'What is the average salary across all employees?',
|
|
||||||
groundTruth: String(avgSalary),
|
|
||||||
type: 'aggregation',
|
|
||||||
dataset: 'tabular',
|
|
||||||
},
|
|
||||||
{
|
|
||||||
id: `q${idCounter++}`,
|
|
||||||
prompt: 'How many employees are active?',
|
|
||||||
groundTruth: String(activeCount),
|
|
||||||
type: 'aggregation',
|
|
||||||
dataset: 'tabular',
|
|
||||||
},
|
|
||||||
{
|
|
||||||
id: `q${idCounter++}`,
|
|
||||||
prompt: 'How many employees are inactive?',
|
|
||||||
groundTruth: String(inactiveCount),
|
|
||||||
type: 'aggregation',
|
|
||||||
dataset: 'tabular',
|
|
||||||
},
|
|
||||||
)
|
|
||||||
|
|
||||||
// Filtering: count by department with salary filter (multi-condition)
|
|
||||||
for (const dept of departments.slice(0, QUESTION_LIMITS.tabular.filteringMultiConditionDepartments)) {
|
|
||||||
const count = tabular.filter(e => e.department === dept && e.salary > QUESTION_THRESHOLDS.tabular.departmentSalaryThreshold).length
|
|
||||||
questions.push({
|
|
||||||
id: `q${idCounter++}`,
|
|
||||||
prompt: `How many employees in ${dept} have a salary greater than ${QUESTION_THRESHOLDS.tabular.departmentSalaryThreshold}?`,
|
|
||||||
groundTruth: String(count),
|
|
||||||
type: 'filtering',
|
|
||||||
dataset: 'tabular',
|
|
||||||
})
|
|
||||||
}
|
|
||||||
|
|
||||||
// Filtering: active employees by experience (multi-condition)
|
|
||||||
for (const exp of QUESTION_THRESHOLDS.tabular.experienceYears.slice(0, QUESTION_LIMITS.tabular.filteringExperience)) {
|
|
||||||
const count = tabular.filter(e => e.yearsExperience > exp && e.active).length
|
|
||||||
questions.push({
|
|
||||||
id: `q${idCounter++}`,
|
|
||||||
prompt: `How many active employees have more than ${exp} years of experience?`,
|
|
||||||
groundTruth: String(count),
|
|
||||||
type: 'filtering',
|
|
||||||
dataset: 'tabular',
|
|
||||||
})
|
|
||||||
}
|
|
||||||
|
|
||||||
// Filtering: department by experience (multi-condition)
|
|
||||||
for (const dept of departments.slice(0, QUESTION_LIMITS.tabular.filteringDepartmentExp)) {
|
|
||||||
const count = tabular.filter(e => e.department === dept && e.yearsExperience > QUESTION_THRESHOLDS.tabular.departmentExperienceThreshold).length
|
|
||||||
questions.push({
|
|
||||||
id: `q${idCounter++}`,
|
|
||||||
prompt: `How many employees in ${dept} have more than ${QUESTION_THRESHOLDS.tabular.departmentExperienceThreshold} years of experience?`,
|
|
||||||
groundTruth: String(count),
|
|
||||||
type: 'filtering',
|
|
||||||
dataset: 'tabular',
|
|
||||||
})
|
|
||||||
}
|
|
||||||
|
|
||||||
// Filtering: department by active status (multi-condition)
|
|
||||||
for (const dept of departments.slice(0, QUESTION_LIMITS.tabular.filteringDepartmentActive)) {
|
|
||||||
const count = tabular.filter(e => e.department === dept && e.active).length
|
|
||||||
questions.push({
|
|
||||||
id: `q${idCounter++}`,
|
|
||||||
prompt: `How many active employees work in ${dept}?`,
|
|
||||||
groundTruth: String(count),
|
|
||||||
type: 'filtering',
|
|
||||||
dataset: 'tabular',
|
|
||||||
})
|
|
||||||
}
|
|
||||||
}
|
|
||||||
|
|
||||||
if (nested.length > 0) {
|
|
||||||
// Field retrieval: order totals and statuses
|
|
||||||
for (let i = 0; i < Math.min(QUESTION_LIMITS.nested.fieldRetrievalOrders, nested.length); i++) {
|
|
||||||
const order = nested[i * 2] || nested[i]
|
|
||||||
if (!order)
|
|
||||||
continue
|
|
||||||
|
|
||||||
if (i % 2 === 0) {
|
|
||||||
questions.push({
|
|
||||||
id: `q${idCounter++}`,
|
|
||||||
prompt: `What is the total for order ${order.orderId}?`,
|
|
||||||
groundTruth: String(order.total),
|
|
||||||
type: 'field-retrieval',
|
|
||||||
dataset: 'nested',
|
|
||||||
})
|
|
||||||
}
|
|
||||||
else {
|
|
||||||
questions.push({
|
|
||||||
id: `q${idCounter++}`,
|
|
||||||
prompt: `What is the status of order ${order.orderId}?`,
|
|
||||||
groundTruth: order.status,
|
|
||||||
type: 'field-retrieval',
|
|
||||||
dataset: 'nested',
|
|
||||||
})
|
|
||||||
}
|
|
||||||
}
|
|
||||||
|
|
||||||
// Field retrieval: customer info and order dates (expanded)
|
|
||||||
for (let i = 0; i < Math.min(QUESTION_LIMITS.nested.fieldRetrievalCustomers, nested.length); i++) {
|
|
||||||
const order = nested[i * 2 + 1] || nested[i]
|
|
||||||
if (!order)
|
|
||||||
continue
|
|
||||||
|
|
||||||
if (i % 4 === 0) {
|
|
||||||
questions.push({
|
|
||||||
id: `q${idCounter++}`,
|
|
||||||
prompt: `What is the customer name for order ${order.orderId}?`,
|
|
||||||
groundTruth: order.customer.name,
|
|
||||||
type: 'field-retrieval',
|
|
||||||
dataset: 'nested',
|
|
||||||
})
|
|
||||||
}
|
|
||||||
else if (i % 4 === 1) {
|
|
||||||
questions.push({
|
|
||||||
id: `q${idCounter++}`,
|
|
||||||
prompt: `What is the customer email for order ${order.orderId}?`,
|
|
||||||
groundTruth: order.customer.email,
|
|
||||||
type: 'field-retrieval',
|
|
||||||
dataset: 'nested',
|
|
||||||
})
|
|
||||||
}
|
|
||||||
else if (i % 4 === 2) {
|
|
||||||
questions.push({
|
|
||||||
id: `q${idCounter++}`,
|
|
||||||
prompt: `What is the order date for order ${order.orderId}?`,
|
|
||||||
groundTruth: order.orderDate || '',
|
|
||||||
type: 'field-retrieval',
|
|
||||||
dataset: 'nested',
|
|
||||||
})
|
|
||||||
}
|
|
||||||
else {
|
|
||||||
questions.push({
|
|
||||||
id: `q${idCounter++}`,
|
|
||||||
prompt: `How many items are in order ${order.orderId}?`,
|
|
||||||
groundTruth: String(order.items.length),
|
|
||||||
type: 'field-retrieval',
|
|
||||||
dataset: 'nested',
|
|
||||||
})
|
|
||||||
}
|
|
||||||
}
|
|
||||||
|
|
||||||
// Aggregation: totals and averages
|
|
||||||
const totalRevenue = nested.reduce((sum, o) => sum + o.total, 0)
|
|
||||||
const avgOrderValue = totalRevenue / nested.length
|
|
||||||
const totalOrders = nested.length
|
|
||||||
const maxOrderValue = Math.max(...nested.map(o => o.total))
|
|
||||||
|
|
||||||
// Count by status
|
|
||||||
const statuses = [...new Set(nested.map(o => o.status))]
|
|
||||||
for (const status of statuses.slice(0, QUESTION_LIMITS.nested.aggregationStatuses)) {
|
|
||||||
const count = nested.filter(o => o.status === status).length
|
|
||||||
questions.push({
|
|
||||||
id: `q${idCounter++}`,
|
|
||||||
prompt: `How many orders have status "${status}"?`,
|
|
||||||
groundTruth: String(count),
|
|
||||||
type: 'aggregation',
|
|
||||||
dataset: 'nested',
|
|
||||||
})
|
|
||||||
}
|
|
||||||
|
|
||||||
questions.push(
|
|
||||||
{
|
|
||||||
id: `q${idCounter++}`,
|
|
||||||
prompt: 'What is the total revenue across all orders?',
|
|
||||||
groundTruth: String(totalRevenue.toFixed(2)),
|
|
||||||
type: 'aggregation',
|
|
||||||
dataset: 'nested',
|
|
||||||
},
|
|
||||||
{
|
|
||||||
id: `q${idCounter++}`,
|
|
||||||
prompt: 'What is the average order value?',
|
|
||||||
groundTruth: String(avgOrderValue.toFixed(2)),
|
|
||||||
type: 'aggregation',
|
|
||||||
dataset: 'nested',
|
|
||||||
},
|
|
||||||
{
|
|
||||||
id: `q${idCounter++}`,
|
|
||||||
prompt: 'How many orders are in the dataset?',
|
|
||||||
groundTruth: String(totalOrders),
|
|
||||||
type: 'aggregation',
|
|
||||||
dataset: 'nested',
|
|
||||||
},
|
|
||||||
{
|
|
||||||
id: `q${idCounter++}`,
|
|
||||||
prompt: 'What is the highest order total?',
|
|
||||||
groundTruth: String(maxOrderValue.toFixed(2)),
|
|
||||||
type: 'aggregation',
|
|
||||||
dataset: 'nested',
|
|
||||||
},
|
|
||||||
)
|
|
||||||
|
|
||||||
// Aggregation: high-value orders (single-condition filter)
|
|
||||||
for (const threshold of QUESTION_THRESHOLDS.nested.highValueOrders) {
|
|
||||||
const count = nested.filter(o => o.total > threshold).length
|
|
||||||
questions.push({
|
|
||||||
id: `q${idCounter++}`,
|
|
||||||
prompt: `How many orders have a total greater than ${threshold}?`,
|
|
||||||
groundTruth: String(count),
|
|
||||||
type: 'aggregation',
|
|
||||||
dataset: 'nested',
|
|
||||||
})
|
|
||||||
}
|
|
||||||
|
|
||||||
// Filtering: multi-condition queries (status AND value)
|
|
||||||
const orderStatuses = [...new Set(nested.map(o => o.status))]
|
|
||||||
for (const status of orderStatuses.slice(0, QUESTION_LIMITS.nested.filteringStatusAndValue)) {
|
|
||||||
const count = nested.filter(o => o.status === status && o.total > QUESTION_THRESHOLDS.nested.statusValueThreshold).length
|
|
||||||
questions.push({
|
|
||||||
id: `q${idCounter++}`,
|
|
||||||
prompt: `How many orders have status "${status}" and total greater than ${QUESTION_THRESHOLDS.nested.statusValueThreshold}?`,
|
|
||||||
groundTruth: String(count),
|
|
||||||
type: 'filtering',
|
|
||||||
dataset: 'nested',
|
|
||||||
})
|
|
||||||
}
|
|
||||||
|
|
||||||
// Filtering: status AND items count (multi-condition)
|
|
||||||
for (const status of orderStatuses.slice(0, QUESTION_LIMITS.nested.filteringStatusAndItems)) {
|
|
||||||
const count = nested.filter(o => o.status === status && o.items.length >= QUESTION_THRESHOLDS.nested.itemCountThreshold).length
|
|
||||||
questions.push({
|
|
||||||
id: `q${idCounter++}`,
|
|
||||||
prompt: `How many orders have status "${status}" and at least ${QUESTION_THRESHOLDS.nested.itemCountThreshold} items?`,
|
|
||||||
groundTruth: String(count),
|
|
||||||
type: 'filtering',
|
|
||||||
dataset: 'nested',
|
|
||||||
})
|
|
||||||
}
|
|
||||||
|
|
||||||
// Filtering: total AND items count (multi-condition)
|
|
||||||
for (const threshold of QUESTION_THRESHOLDS.nested.totalThresholdsForItems) {
|
|
||||||
const count = nested.filter(o => o.total > threshold && o.items.length >= QUESTION_THRESHOLDS.nested.itemCountThreshold).length
|
|
||||||
questions.push({
|
|
||||||
id: `q${idCounter++}`,
|
|
||||||
prompt: `How many orders have a total greater than ${threshold} and at least ${QUESTION_THRESHOLDS.nested.itemCountThreshold} items?`,
|
|
||||||
groundTruth: String(count),
|
|
||||||
type: 'filtering',
|
|
||||||
dataset: 'nested',
|
|
||||||
})
|
|
||||||
}
|
|
||||||
}
|
|
||||||
|
|
||||||
if (analytics.length > 0) {
|
|
||||||
// Field retrieval: specific dates (expanded with all metrics)
|
|
||||||
for (let i = 0; i < Math.min(QUESTION_LIMITS.analytics.fieldRetrievalDates, analytics.length); i++) {
|
|
||||||
const metric = analytics[i * 3] || analytics[i]
|
|
||||||
if (!metric)
|
|
||||||
continue
|
|
||||||
|
|
||||||
if (i % 5 === 0) {
|
|
||||||
questions.push({
|
|
||||||
id: `q${idCounter++}`,
|
|
||||||
prompt: `How many views were recorded on ${metric.date}?`,
|
|
||||||
groundTruth: String(metric.views),
|
|
||||||
type: 'field-retrieval',
|
|
||||||
dataset: 'analytics',
|
|
||||||
})
|
|
||||||
}
|
|
||||||
else if (i % 5 === 1) {
|
|
||||||
questions.push({
|
|
||||||
id: `q${idCounter++}`,
|
|
||||||
prompt: `What was the revenue on ${metric.date}?`,
|
|
||||||
groundTruth: String(metric.revenue),
|
|
||||||
type: 'field-retrieval',
|
|
||||||
dataset: 'analytics',
|
|
||||||
})
|
|
||||||
}
|
|
||||||
else if (i % 5 === 2) {
|
|
||||||
questions.push({
|
|
||||||
id: `q${idCounter++}`,
|
|
||||||
prompt: `What was the conversion count on ${metric.date}?`,
|
|
||||||
groundTruth: String(metric.conversions),
|
|
||||||
type: 'field-retrieval',
|
|
||||||
dataset: 'analytics',
|
|
||||||
})
|
|
||||||
}
|
|
||||||
else if (i % 5 === 3) {
|
|
||||||
questions.push({
|
|
||||||
id: `q${idCounter++}`,
|
|
||||||
prompt: `How many clicks were recorded on ${metric.date}?`,
|
|
||||||
groundTruth: String(metric.clicks),
|
|
||||||
type: 'field-retrieval',
|
|
||||||
dataset: 'analytics',
|
|
||||||
})
|
|
||||||
}
|
|
||||||
else {
|
|
||||||
questions.push({
|
|
||||||
id: `q${idCounter++}`,
|
|
||||||
prompt: `What was the bounce rate on ${metric.date}?`,
|
|
||||||
groundTruth: String(metric.bounceRate),
|
|
||||||
type: 'field-retrieval',
|
|
||||||
dataset: 'analytics',
|
|
||||||
})
|
|
||||||
}
|
|
||||||
}
|
|
||||||
|
|
||||||
// Aggregation: totals and averages
|
|
||||||
const totalViews = analytics.reduce((sum, m) => sum + m.views, 0)
|
|
||||||
const totalRevenue = analytics.reduce((sum, m) => sum + m.revenue, 0)
|
|
||||||
const totalConversions = analytics.reduce((sum, m) => sum + m.conversions, 0)
|
|
||||||
const avgViews = Math.round(totalViews / analytics.length)
|
|
||||||
const avgRevenue = totalRevenue / analytics.length
|
|
||||||
const avgConversions = Math.round(totalConversions / analytics.length)
|
|
||||||
|
|
||||||
questions.push(
|
|
||||||
{
|
|
||||||
id: `q${idCounter++}`,
|
|
||||||
prompt: 'What is the total number of views across all dates?',
|
|
||||||
groundTruth: String(totalViews),
|
|
||||||
type: 'aggregation',
|
|
||||||
dataset: 'analytics',
|
|
||||||
},
|
|
||||||
{
|
|
||||||
id: `q${idCounter++}`,
|
|
||||||
prompt: 'What is the total revenue across all dates?',
|
|
||||||
groundTruth: String(totalRevenue.toFixed(2)),
|
|
||||||
type: 'aggregation',
|
|
||||||
dataset: 'analytics',
|
|
||||||
},
|
|
||||||
{
|
|
||||||
id: `q${idCounter++}`,
|
|
||||||
prompt: 'What is the total number of conversions across all dates?',
|
|
||||||
groundTruth: String(totalConversions),
|
|
||||||
type: 'aggregation',
|
|
||||||
dataset: 'analytics',
|
|
||||||
},
|
|
||||||
{
|
|
||||||
id: `q${idCounter++}`,
|
|
||||||
prompt: 'What is the average number of views per day?',
|
|
||||||
groundTruth: String(avgViews),
|
|
||||||
type: 'aggregation',
|
|
||||||
dataset: 'analytics',
|
|
||||||
},
|
|
||||||
{
|
|
||||||
id: `q${idCounter++}`,
|
|
||||||
prompt: 'What is the average revenue per day?',
|
|
||||||
groundTruth: String(avgRevenue.toFixed(2)),
|
|
||||||
type: 'aggregation',
|
|
||||||
dataset: 'analytics',
|
|
||||||
},
|
|
||||||
{
|
|
||||||
id: `q${idCounter++}`,
|
|
||||||
prompt: 'What is the average number of conversions per day?',
|
|
||||||
groundTruth: String(avgConversions),
|
|
||||||
type: 'aggregation',
|
|
||||||
dataset: 'analytics',
|
|
||||||
},
|
|
||||||
{
|
|
||||||
id: `q${idCounter++}`,
|
|
||||||
prompt: 'How many days are included in the analytics data?',
|
|
||||||
groundTruth: String(analytics.length),
|
|
||||||
type: 'aggregation',
|
|
||||||
dataset: 'analytics',
|
|
||||||
},
|
|
||||||
{
|
|
||||||
id: `q${idCounter++}`,
|
|
||||||
prompt: 'What is the highest number of views recorded in a single day?',
|
|
||||||
groundTruth: String(Math.max(...analytics.map(m => m.views))),
|
|
||||||
type: 'aggregation',
|
|
||||||
dataset: 'analytics',
|
|
||||||
},
|
|
||||||
)
|
|
||||||
|
|
||||||
// Aggregation: high-performing days (single-condition filters)
|
|
||||||
for (const threshold of QUESTION_THRESHOLDS.analytics.views) {
|
|
||||||
const count = analytics.filter(m => m.views > threshold).length
|
|
||||||
questions.push({
|
|
||||||
id: `q${idCounter++}`,
|
|
||||||
prompt: `How many days had more than ${threshold} views?`,
|
|
||||||
groundTruth: String(count),
|
|
||||||
type: 'aggregation',
|
|
||||||
dataset: 'analytics',
|
|
||||||
})
|
|
||||||
}
|
|
||||||
|
|
||||||
// Filtering: multi-condition queries (views AND conversions)
|
|
||||||
for (const viewThreshold of QUESTION_THRESHOLDS.analytics.viewsForFiltering) {
|
|
||||||
const count = analytics.filter(m => m.views > viewThreshold && m.conversions > QUESTION_THRESHOLDS.analytics.conversionsForFiltering).length
|
|
||||||
questions.push({
|
|
||||||
id: `q${idCounter++}`,
|
|
||||||
prompt: `How many days had more than ${viewThreshold} views and more than ${QUESTION_THRESHOLDS.analytics.conversionsForFiltering} conversions?`,
|
|
||||||
groundTruth: String(count),
|
|
||||||
type: 'filtering',
|
|
||||||
dataset: 'analytics',
|
|
||||||
})
|
|
||||||
}
|
|
||||||
|
|
||||||
// Filtering: views AND revenue (expanded)
|
|
||||||
for (const revenueThreshold of QUESTION_THRESHOLDS.analytics.revenueThresholds.slice(0, 5)) {
|
|
||||||
const count = analytics.filter(m => m.views > QUESTION_THRESHOLDS.analytics.viewsThresholdForRevenue && m.revenue > revenueThreshold).length
|
|
||||||
questions.push({
|
|
||||||
id: `q${idCounter++}`,
|
|
||||||
prompt: `How many days had more than ${QUESTION_THRESHOLDS.analytics.viewsThresholdForRevenue} views and revenue greater than ${revenueThreshold}?`,
|
|
||||||
groundTruth: String(count),
|
|
||||||
type: 'filtering',
|
|
||||||
dataset: 'analytics',
|
|
||||||
})
|
|
||||||
}
|
|
||||||
|
|
||||||
// Filtering: clicks AND conversions (multi-condition)
|
|
||||||
for (const clickThreshold of QUESTION_THRESHOLDS.analytics.clicksForFiltering) {
|
|
||||||
const count = analytics.filter(m => m.clicks > clickThreshold && m.conversions > QUESTION_THRESHOLDS.analytics.conversionsForClickFiltering).length
|
|
||||||
questions.push({
|
|
||||||
id: `q${idCounter++}`,
|
|
||||||
prompt: `How many days had more than ${clickThreshold} clicks and more than ${QUESTION_THRESHOLDS.analytics.conversionsForClickFiltering} conversions?`,
|
|
||||||
groundTruth: String(count),
|
|
||||||
type: 'filtering',
|
|
||||||
dataset: 'analytics',
|
|
||||||
})
|
|
||||||
}
|
|
||||||
|
|
||||||
// Filtering: revenue AND bounce rate (multi-condition)
|
|
||||||
for (const revenueThreshold of QUESTION_THRESHOLDS.analytics.revenueForBounceRate) {
|
|
||||||
const count = analytics.filter(m => m.revenue > revenueThreshold && m.bounceRate < QUESTION_THRESHOLDS.analytics.bounceRateThreshold).length
|
|
||||||
questions.push({
|
|
||||||
id: `q${idCounter++}`,
|
|
||||||
prompt: `How many days had revenue greater than ${revenueThreshold} and bounce rate less than ${QUESTION_THRESHOLDS.analytics.bounceRateThreshold}?`,
|
|
||||||
groundTruth: String(count),
|
|
||||||
type: 'filtering',
|
|
||||||
dataset: 'analytics',
|
|
||||||
})
|
|
||||||
}
|
|
||||||
}
|
|
||||||
|
|
||||||
if (github.length > 0) {
|
|
||||||
// Helper to extract owner from repo field
|
|
||||||
const getOwner = (repoFullName: string) => repoFullName.split('/')[0]!
|
|
||||||
|
|
||||||
// Field retrieval: specific repos (diverse fields)
|
|
||||||
for (let i = 0; i < Math.min(QUESTION_LIMITS.github.fieldRetrievalRepos, github.length); i++) {
|
|
||||||
const repo = github[i * 7]
|
|
||||||
if (!repo)
|
|
||||||
continue
|
|
||||||
|
|
||||||
if (i % 5 === 0) {
|
|
||||||
questions.push({
|
|
||||||
id: `q${idCounter++}`,
|
|
||||||
prompt: `How many stars does ${repo.repo} have?`,
|
|
||||||
groundTruth: String(repo.stars),
|
|
||||||
type: 'field-retrieval',
|
|
||||||
dataset: 'github',
|
|
||||||
})
|
|
||||||
}
|
|
||||||
else if (i % 5 === 1) {
|
|
||||||
questions.push({
|
|
||||||
id: `q${idCounter++}`,
|
|
||||||
prompt: `How many forks does ${repo.repo} have?`,
|
|
||||||
groundTruth: String(repo.forks),
|
|
||||||
type: 'field-retrieval',
|
|
||||||
dataset: 'github',
|
|
||||||
})
|
|
||||||
}
|
|
||||||
else if (i % 5 === 2) {
|
|
||||||
questions.push({
|
|
||||||
id: `q${idCounter++}`,
|
|
||||||
prompt: `Who is the owner of ${repo.repo}?`,
|
|
||||||
groundTruth: getOwner(repo.repo),
|
|
||||||
type: 'field-retrieval',
|
|
||||||
dataset: 'github',
|
|
||||||
})
|
|
||||||
}
|
|
||||||
else if (i % 5 === 3) {
|
|
||||||
questions.push({
|
|
||||||
id: `q${idCounter++}`,
|
|
||||||
prompt: `What is the default branch of ${repo.repo}?`,
|
|
||||||
groundTruth: repo.defaultBranch,
|
|
||||||
type: 'field-retrieval',
|
|
||||||
dataset: 'github',
|
|
||||||
})
|
|
||||||
}
|
|
||||||
else {
|
|
||||||
questions.push({
|
|
||||||
id: `q${idCounter++}`,
|
|
||||||
prompt: `How many watchers does ${repo.repo} have?`,
|
|
||||||
groundTruth: String(repo.watchers),
|
|
||||||
type: 'field-retrieval',
|
|
||||||
dataset: 'github',
|
|
||||||
})
|
|
||||||
}
|
|
||||||
}
|
|
||||||
|
|
||||||
// Aggregation: popular repositories
|
|
||||||
const totalStars = github.reduce((sum, r) => sum + r.stars, 0)
|
|
||||||
const totalRepos = github.length
|
|
||||||
const avgStars = Math.round(totalStars / totalRepos)
|
|
||||||
|
|
||||||
questions.push(
|
|
||||||
{
|
|
||||||
id: `q${idCounter++}`,
|
|
||||||
prompt: 'What is the total number of stars across all repositories?',
|
|
||||||
groundTruth: String(totalStars),
|
|
||||||
type: 'aggregation',
|
|
||||||
dataset: 'github',
|
|
||||||
},
|
|
||||||
{
|
|
||||||
id: `q${idCounter++}`,
|
|
||||||
prompt: 'How many repositories are in the dataset?',
|
|
||||||
groundTruth: String(totalRepos),
|
|
||||||
type: 'aggregation',
|
|
||||||
dataset: 'github',
|
|
||||||
},
|
|
||||||
{
|
|
||||||
id: `q${idCounter++}`,
|
|
||||||
prompt: 'What is the average number of stars per repository?',
|
|
||||||
groundTruth: String(avgStars),
|
|
||||||
type: 'aggregation',
|
|
||||||
dataset: 'github',
|
|
||||||
},
|
|
||||||
)
|
|
||||||
|
|
||||||
// Aggregation: star thresholds (single-condition filters)
|
|
||||||
for (const threshold of QUESTION_THRESHOLDS.github.stars) {
|
|
||||||
const count = github.filter(r => r.stars > threshold).length
|
|
||||||
questions.push({
|
|
||||||
id: `q${idCounter++}`,
|
|
||||||
prompt: `How many repositories have more than ${threshold} stars?`,
|
|
||||||
groundTruth: String(count),
|
|
||||||
type: 'aggregation',
|
|
||||||
dataset: 'github',
|
|
||||||
})
|
|
||||||
}
|
|
||||||
|
|
||||||
// Aggregation: fork thresholds (single-condition filters)
|
|
||||||
for (const threshold of QUESTION_THRESHOLDS.github.forks) {
|
|
      const count = github.filter(r => r.forks > threshold).length
      questions.push({
        id: `q${idCounter++}`,
        prompt: `How many repositories have more than ${threshold} forks?`,
        groundTruth: String(count),
        type: 'aggregation',
        dataset: 'github',
      })
    }

    // Aggregation: watcher thresholds (single-condition filters)
    for (const threshold of QUESTION_THRESHOLDS.github.watchers) {
      const count = github.filter(r => r.watchers > threshold).length
      questions.push({
        id: `q${idCounter++}`,
        prompt: `How many repositories have more than ${threshold} watchers?`,
        groundTruth: String(count),
        type: 'aggregation',
        dataset: 'github',
      })
    }

    // Aggregation: default branch counts
    const branches = [...new Set(github.map(r => r.defaultBranch))]
    for (const branch of branches.slice(0, QUESTION_LIMITS.github.aggregationBranches)) {
      const count = github.filter(r => r.defaultBranch === branch).length
      questions.push({
        id: `q${idCounter++}`,
        prompt: `How many repositories use "${branch}" as their default branch?`,
        groundTruth: String(count),
        type: 'aggregation',
        dataset: 'github',
      })
    }

    // Filtering: multi-condition queries (stars AND forks)
    for (const combo of QUESTION_THRESHOLDS.github.starForkCombinations.slice(0, QUESTION_LIMITS.github.filteringStarsAndForks)) {
      const count = github.filter(r => r.stars > combo.stars && r.forks > combo.forks).length
      questions.push({
        id: `q${idCounter++}`,
        prompt: `How many repositories have more than ${combo.stars} stars and more than ${combo.forks} forks?`,
        groundTruth: String(count),
        type: 'filtering',
        dataset: 'github',
      })
    }

    // Filtering: stars AND watchers (multi-condition)
    for (const combo of QUESTION_THRESHOLDS.github.starWatcherCombinations) {
      const count = github.filter(r => r.stars > combo.stars && r.watchers > combo.watchers).length
      questions.push({
        id: `q${idCounter++}`,
        prompt: `How many repositories have more than ${combo.stars} stars and more than ${combo.watchers} watchers?`,
        groundTruth: String(count),
        type: 'filtering',
        dataset: 'github',
      })
    }
  }

  return questions
}
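The new per-dataset generators below replace these inline object literals with a `QuestionBuilder` fluent API imported from `./utils`. That file is not part of this chunk, so the following is only a minimal sketch of what such a builder could look like, inferred from the call sites (the field names match the `Question` shape used above; the validation in `build()` is an assumption):

```typescript
// Hypothetical sketch of the QuestionBuilder used by the new generators.
// The real implementation lives in benchmarks/src/questions/utils.ts.
interface Question {
  id: string
  prompt: string
  groundTruth: string
  type: 'field-retrieval' | 'aggregation' | 'filtering'
  dataset: string
}

class QuestionBuilder {
  private q: Partial<Question> = {}

  id(id: string): this { this.q.id = id; return this }
  prompt(prompt: string): this { this.q.prompt = prompt; return this }
  groundTruth(value: string): this { this.q.groundTruth = value; return this }
  type(type: Question['type']): this { this.q.type = type; return this }
  dataset(dataset: string): this { this.q.dataset = dataset; return this }

  build(): Question {
    // Fail fast if a generator forgot a field (assumed behavior).
    for (const key of ['id', 'prompt', 'groundTruth', 'type', 'dataset'] as const) {
      if (this.q[key] === undefined)
        throw new Error(`QuestionBuilder: missing ${key}`)
    }
    return this.q as Question
  }
}
```

Compared to the literals above, the builder centralizes the `Question` shape, so adding or renaming a field touches one place instead of every generator.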
196  benchmarks/src/questions/analytics.ts  Normal file
@@ -0,0 +1,196 @@
import type { AnalyticsMetric } from '../datasets'
import type { Question } from '../types'
import { QUESTION_LIMITS, QUESTION_THRESHOLDS } from '../constants'
import { countByPredicate, QuestionBuilder, rotateQuestions, SAMPLE_STRIDES } from './utils'

/**
 * Generate analytics (website metrics) questions
 */
export function generateAnalyticsQuestions(metrics: AnalyticsMetric[], getId: () => string): Question[] {
  const questions: Question[] = []

  if (metrics.length === 0)
    return questions

  // Field retrieval: date-based metrics
  const metricFieldGenerators: Array<(metric: AnalyticsMetric, getId: () => string) => Question> = [
    (metric, getId) => new QuestionBuilder()
      .id(getId())
      .prompt(`What are the views for ${metric.date}?`)
      .groundTruth(String(metric.views))
      .type('field-retrieval')
      .dataset('analytics')
      .build(),
    (metric, getId) => new QuestionBuilder()
      .id(getId())
      .prompt(`What is the revenue for ${metric.date}?`)
      .groundTruth(String(metric.revenue))
      .type('field-retrieval')
      .dataset('analytics')
      .build(),
    (metric, getId) => new QuestionBuilder()
      .id(getId())
      .prompt(`What is the bounce rate for ${metric.date}?`)
      .groundTruth(String(metric.bounceRate))
      .type('field-retrieval')
      .dataset('analytics')
      .build(),
    (metric, getId) => new QuestionBuilder()
      .id(getId())
      .prompt(`How many conversions were there on ${metric.date}?`)
      .groundTruth(String(metric.conversions))
      .type('field-retrieval')
      .dataset('analytics')
      .build(),
  ]

  questions.push(...rotateQuestions(
    metrics,
    metricFieldGenerators,
    QUESTION_LIMITS.analytics.fieldRetrievalDates,
    SAMPLE_STRIDES.ANALYTICS_FIELD,
    getId,
  ))

  // Aggregation: basic statistics
  const totalDays = metrics.length
  const totalViews = metrics.reduce((sum, m) => sum + m.views, 0)
  const totalConversions = metrics.reduce((sum, m) => sum + m.conversions, 0)
  const totalRevenue = metrics.reduce((sum, m) => sum + m.revenue, 0)
  const avgBounceRate = metrics.reduce((sum, m) => sum + m.bounceRate, 0) / metrics.length

  questions.push(
    new QuestionBuilder()
      .id(getId())
      .prompt('How many days of data are in the dataset?')
      .groundTruth(String(totalDays))
      .type('aggregation')
      .dataset('analytics')
      .build(),
    new QuestionBuilder()
      .id(getId())
      .prompt('What is the total number of views across all dates?')
      .groundTruth(String(totalViews))
      .type('aggregation')
      .dataset('analytics')
      .build(),
    new QuestionBuilder()
      .id(getId())
      .prompt('What is the total number of conversions across all dates?')
      .groundTruth(String(totalConversions))
      .type('aggregation')
      .dataset('analytics')
      .build(),
    new QuestionBuilder()
      .id(getId())
      .prompt('What is the total revenue across all dates?')
      .groundTruth(String(totalRevenue.toFixed(2)))
      .type('aggregation')
      .dataset('analytics')
      .build(),
    new QuestionBuilder()
      .id(getId())
      .prompt('What is the average bounce rate?')
      .groundTruth(String(avgBounceRate.toFixed(2)))
      .type('aggregation')
      .dataset('analytics')
      .build(),
  )

  // Aggregation: high views/conversions
  for (const threshold of QUESTION_THRESHOLDS.analytics.views) {
    const count = countByPredicate(metrics, m => m.views > threshold)
    questions.push(
      new QuestionBuilder()
        .id(getId())
        .prompt(`How many days had more than ${threshold} views?`)
        .groundTruth(String(count))
        .type('aggregation')
        .dataset('analytics')
        .build(),
    )
  }

  for (const threshold of QUESTION_THRESHOLDS.analytics.conversions) {
    const count = countByPredicate(metrics, m => m.conversions > threshold)
    questions.push(
      new QuestionBuilder()
        .id(getId())
        .prompt(`How many days had more than ${threshold} conversions?`)
        .groundTruth(String(count))
        .type('aggregation')
        .dataset('analytics')
        .build(),
    )
  }

  // Filtering: multi-condition (views AND conversions)
  for (const threshold of QUESTION_THRESHOLDS.analytics.viewsForFiltering) {
    const count = countByPredicate(
      metrics,
      m => m.views > threshold && m.conversions > QUESTION_THRESHOLDS.analytics.conversionsForFiltering,
    )
    questions.push(
      new QuestionBuilder()
        .id(getId())
        .prompt(`How many days had more than ${threshold} views and more than ${QUESTION_THRESHOLDS.analytics.conversionsForFiltering} conversions?`)
        .groundTruth(String(count))
        .type('filtering')
        .dataset('analytics')
        .build(),
    )
  }

  // Filtering: revenue thresholds (with a views floor)
  for (const threshold of QUESTION_THRESHOLDS.analytics.revenueThresholds) {
    const count = countByPredicate(
      metrics,
      m => m.revenue > threshold && m.views > QUESTION_THRESHOLDS.analytics.viewsThresholdForRevenue,
    )
    questions.push(
      new QuestionBuilder()
        .id(getId())
        .prompt(`How many days had revenue greater than ${threshold} with views above ${QUESTION_THRESHOLDS.analytics.viewsThresholdForRevenue}?`)
        .groundTruth(String(count))
        .type('filtering')
        .dataset('analytics')
        .build(),
    )
  }

  // Filtering: clicks and conversions
  for (const threshold of QUESTION_THRESHOLDS.analytics.clicksForFiltering) {
    const count = countByPredicate(
      metrics,
      m => m.clicks > threshold && m.conversions > QUESTION_THRESHOLDS.analytics.conversionsForClickFiltering,
    )
    questions.push(
      new QuestionBuilder()
        .id(getId())
        .prompt(`How many days had more than ${threshold} clicks and more than ${QUESTION_THRESHOLDS.analytics.conversionsForClickFiltering} conversions?`)
        .groundTruth(String(count))
        .type('filtering')
        .dataset('analytics')
        .build(),
    )
  }

  // Filtering: revenue and bounce rate
  for (const threshold of QUESTION_THRESHOLDS.analytics.revenueForBounceRate) {
    const count = countByPredicate(
      metrics,
      m => m.revenue > threshold && m.bounceRate < QUESTION_THRESHOLDS.analytics.bounceRateThreshold,
    )
    questions.push(
      new QuestionBuilder()
        .id(getId())
        .prompt(`How many days had revenue greater than ${threshold} with bounce rate below ${QUESTION_THRESHOLDS.analytics.bounceRateThreshold}?`)
        .groundTruth(String(count))
        .type('filtering')
        .dataset('analytics')
        .build(),
    )
  }

  return questions
}
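Every generator file drives its field-retrieval questions through `rotateQuestions` from `./utils`, which is not included in this diff chunk. Inferred from the call sites (items, generators, a question limit, a sampling stride, and the id factory), a plausible sketch is: sample items at a fixed stride and cycle through the generators so a small question budget still covers both many items and all fields. The exact rotation order is an assumption:

```typescript
// Hypothetical sketch of rotateQuestions, inferred from its call sites.
// The real helper lives in benchmarks/src/questions/utils.ts.
type Gen<T, Q> = (item: T, getId: () => string) => Q

function rotateQuestions<T, Q>(
  items: T[],
  generators: Gen<T, Q>[],
  limit: number,
  stride: number,
  getId: () => string,
): Q[] {
  const out: Q[] = []
  for (let i = 0, n = 0; i < items.length && n < limit; i += stride, n++) {
    // Cycle generators so consecutive sampled items ask about different fields.
    const gen = generators[n % generators.length]
    out.push(gen(items[i], getId))
  }
  return out
}
```

Under this reading, `SAMPLE_STRIDES.ANALYTICS_FIELD` spreads the sampled dates across the dataset rather than clustering them at the start, which makes the field-retrieval questions exercise the whole context window.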
162  benchmarks/src/questions/event-logs.ts  Normal file
@@ -0,0 +1,162 @@
import type { EventLog } from '../datasets'
import type { Question } from '../types'
import { QUESTION_LIMITS } from '../constants'
import { countByPredicate, QuestionBuilder, rotateQuestions, SAMPLE_STRIDES } from './utils'

/**
 * Generate event log questions
 */
export function generateEventLogsQuestions(logs: EventLog[], getId: () => string): Question[] {
  const questions: Question[] = []

  if (logs.length === 0)
    return questions

  // Field retrieval: log metadata
  const logFieldGenerators: Array<(log: EventLog, getId: () => string) => Question> = [
    (log, getId) => new QuestionBuilder()
      .id(getId())
      .prompt(`What is the level of the log at ${log.timestamp}?`)
      .groundTruth(log.level)
      .type('field-retrieval')
      .dataset('event-logs')
      .build(),
    (log, getId) => new QuestionBuilder()
      .id(getId())
      .prompt(`What is the endpoint for the log at ${log.timestamp}?`)
      .groundTruth(log.endpoint)
      .type('field-retrieval')
      .dataset('event-logs')
      .build(),
    (log, getId) => new QuestionBuilder()
      .id(getId())
      .prompt(`What is the status code for the log at ${log.timestamp}?`)
      .groundTruth(String(log.statusCode))
      .type('field-retrieval')
      .dataset('event-logs')
      .build(),
    (log, getId) => new QuestionBuilder()
      .id(getId())
      .prompt(`What is the response time for the log at ${log.timestamp}?`)
      .groundTruth(String(log.responseTime))
      .type('field-retrieval')
      .dataset('event-logs')
      .build(),
  ]

  questions.push(...rotateQuestions(
    logs,
    logFieldGenerators,
    QUESTION_LIMITS.eventLogs.fieldRetrieval,
    SAMPLE_STRIDES.EVENT_LOG_FIELD,
    getId,
  ))

  // Aggregation: basic statistics
  const totalLogs = logs.length
  const avgResponseTime = logs.reduce((sum, l) => sum + l.responseTime, 0) / logs.length

  questions.push(
    new QuestionBuilder()
      .id(getId())
      .prompt('How many log entries are in the dataset?')
      .groundTruth(String(totalLogs))
      .type('aggregation')
      .dataset('event-logs')
      .build(),
    new QuestionBuilder()
      .id(getId())
      .prompt('What is the average response time across all logs?')
      .groundTruth(String(avgResponseTime.toFixed(2)))
      .type('aggregation')
      .dataset('event-logs')
      .build(),
  )

  // Aggregation: by level
  const levels = [...new Set(logs.map(l => l.level))]
  for (const level of levels) {
    const count = countByPredicate(logs, l => l.level === level)
    questions.push(
      new QuestionBuilder()
        .id(getId())
        .prompt(`How many log entries have level "${level}"?`)
        .groundTruth(String(count))
        .type('aggregation')
        .dataset('event-logs')
        .build(),
    )
  }

  // Aggregation: by endpoint
  const endpoints = [...new Set(logs.map(l => l.endpoint))]
  for (const endpoint of endpoints.slice(0, QUESTION_LIMITS.eventLogs.aggregationEndpoints)) {
    const count = countByPredicate(logs, l => l.endpoint === endpoint)
    questions.push(
      new QuestionBuilder()
        .id(getId())
        .prompt(`How many log entries are for endpoint "${endpoint}"?`)
        .groundTruth(String(count))
        .type('aggregation')
        .dataset('event-logs')
        .build(),
    )
  }

  // Aggregation: by status code range
  const errorCount = countByPredicate(logs, l => l.statusCode >= 400)
  const successCount = countByPredicate(logs, l => l.statusCode >= 200 && l.statusCode < 300)

  questions.push(
    new QuestionBuilder()
      .id(getId())
      .prompt('How many log entries have a status code indicating an error (>= 400)?')
      .groundTruth(String(errorCount))
      .type('aggregation')
      .dataset('event-logs')
      .build(),
    new QuestionBuilder()
      .id(getId())
      .prompt('How many log entries have a successful status code (200-299)?')
      .groundTruth(String(successCount))
      .type('aggregation')
      .dataset('event-logs')
      .build(),
  )

  // Filtering: multi-condition (level AND status)
  for (const level of levels.slice(0, QUESTION_LIMITS.eventLogs.filteringLevelAndStatus)) {
    const count = countByPredicate(
      logs,
      l => l.level === level && l.statusCode >= 400,
    )
    questions.push(
      new QuestionBuilder()
        .id(getId())
        .prompt(`How many log entries have level "${level}" and status code >= 400?`)
        .groundTruth(String(count))
        .type('filtering')
        .dataset('event-logs')
        .build(),
    )
  }

  // Filtering: endpoint AND status
  for (const endpoint of endpoints.slice(0, QUESTION_LIMITS.eventLogs.filteringEndpointAndStatus)) {
    const count = countByPredicate(
      logs,
      l => l.endpoint === endpoint && l.statusCode >= 500,
    )
    questions.push(
      new QuestionBuilder()
        .id(getId())
        .prompt(`How many log entries are for endpoint "${endpoint}" with status code >= 500?`)
        .groundTruth(String(count))
        .type('filtering')
        .dataset('event-logs')
        .build(),
    )
  }

  return questions
}
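The aggregation and filtering sections above lean on `countByPredicate` from `./utils`, also outside this chunk. From how it is called, it is presumably a thin wrapper that keeps the counting code declarative; a one-line sketch under that assumption:

```typescript
// Hypothetical sketch of countByPredicate; the real helper lives in
// benchmarks/src/questions/utils.ts.
function countByPredicate<T>(items: T[], predicate: (item: T) => boolean): number {
  return items.filter(predicate).length
}
```

The value is readability rather than behavior: `countByPredicate(logs, l => l.statusCode >= 400)` names the intent where an inline `filter(...).length` would repeat the mechanics at every call site.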
184  benchmarks/src/questions/github.ts  Normal file
@@ -0,0 +1,184 @@
import type { Repository } from '../datasets'
import type { Question } from '../types'
import { QUESTION_LIMITS, QUESTION_THRESHOLDS } from '../constants'
import { countByPredicate, QuestionBuilder, rotateQuestions, SAMPLE_STRIDES } from './utils'

/**
 * Generate GitHub repository questions
 */
export function generateGithubQuestions(repos: Repository[], getId: () => string): Question[] {
  const questions: Question[] = []

  if (repos.length === 0)
    return questions

  // Field retrieval: repository metadata
  const repoFieldGenerators: Array<(repo: Repository, getId: () => string) => Question> = [
    (repo, getId) => new QuestionBuilder()
      .id(getId())
      .prompt(`How many stars does ${repo.owner}/${repo.name} have?`)
      .groundTruth(String(repo.stars))
      .type('field-retrieval')
      .dataset('github')
      .build(),
    (repo, getId) => new QuestionBuilder()
      .id(getId())
      .prompt(`How many forks does ${repo.owner}/${repo.name} have?`)
      .groundTruth(String(repo.forks))
      .type('field-retrieval')
      .dataset('github')
      .build(),
    (repo, getId) => new QuestionBuilder()
      .id(getId())
      .prompt(`How many watchers does ${repo.owner}/${repo.name} have?`)
      .groundTruth(String(repo.watchers))
      .type('field-retrieval')
      .dataset('github')
      .build(),
    (repo, getId) => new QuestionBuilder()
      .id(getId())
      .prompt(`What is the main branch of ${repo.owner}/${repo.name}?`)
      .groundTruth(repo.defaultBranch)
      .type('field-retrieval')
      .dataset('github')
      .build(),
  ]

  questions.push(...rotateQuestions(
    repos,
    repoFieldGenerators,
    QUESTION_LIMITS.github.fieldRetrievalRepos,
    SAMPLE_STRIDES.REPO_FIELD,
    getId,
  ))

  // Aggregation: basic statistics
  const totalRepos = repos.length
  const totalStars = repos.reduce((sum, r) => sum + r.stars, 0)
  const totalForks = repos.reduce((sum, r) => sum + r.forks, 0)
  const avgStars = totalStars / totalRepos

  questions.push(
    new QuestionBuilder()
      .id(getId())
      .prompt('How many repositories are in the dataset?')
      .groundTruth(String(totalRepos))
      .type('aggregation')
      .dataset('github')
      .build(),
    new QuestionBuilder()
      .id(getId())
      .prompt('What is the total number of stars across all repositories?')
      .groundTruth(String(totalStars))
      .type('aggregation')
      .dataset('github')
      .build(),
    new QuestionBuilder()
      .id(getId())
      .prompt('What is the total number of forks across all repositories?')
      .groundTruth(String(totalForks))
      .type('aggregation')
      .dataset('github')
      .build(),
    new QuestionBuilder()
      .id(getId())
      .prompt('What is the average number of stars per repository?')
      .groundTruth(String(Math.round(avgStars)))
      .type('aggregation')
      .dataset('github')
      .build(),
  )

  // Aggregation: by default branch
  const branches = [...new Set(repos.map(r => r.defaultBranch))]
  for (const branch of branches.slice(0, QUESTION_LIMITS.github.aggregationBranches)) {
    const count = countByPredicate(repos, r => r.defaultBranch === branch)
    questions.push(
      new QuestionBuilder()
        .id(getId())
        .prompt(`How many repositories use "${branch}" as their default branch?`)
        .groundTruth(String(count))
        .type('aggregation')
        .dataset('github')
        .build(),
    )
  }

  // Aggregation: high star counts
  for (const threshold of QUESTION_THRESHOLDS.github.stars) {
    const count = countByPredicate(repos, r => r.stars > threshold)
    questions.push(
      new QuestionBuilder()
        .id(getId())
        .prompt(`How many repositories have more than ${threshold} stars?`)
        .groundTruth(String(count))
        .type('aggregation')
        .dataset('github')
        .build(),
    )
  }

  // Aggregation: high fork counts
  for (const threshold of QUESTION_THRESHOLDS.github.forks) {
    const count = countByPredicate(repos, r => r.forks > threshold)
    questions.push(
      new QuestionBuilder()
        .id(getId())
        .prompt(`How many repositories have more than ${threshold} forks?`)
        .groundTruth(String(count))
        .type('aggregation')
        .dataset('github')
        .build(),
    )
  }

  // Aggregation: high watcher counts
  for (const threshold of QUESTION_THRESHOLDS.github.watchers) {
    const count = countByPredicate(repos, r => r.watchers > threshold)
    questions.push(
      new QuestionBuilder()
        .id(getId())
        .prompt(`How many repositories have more than ${threshold} watchers?`)
        .groundTruth(String(count))
        .type('aggregation')
        .dataset('github')
        .build(),
    )
  }

  // Filtering: multi-condition (stars AND forks)
  for (const combo of QUESTION_THRESHOLDS.github.starForkCombinations.slice(0, QUESTION_LIMITS.github.filteringStarsAndForks)) {
    const count = countByPredicate(
      repos,
      r => r.stars > combo.stars && r.forks > combo.forks,
    )
    questions.push(
      new QuestionBuilder()
        .id(getId())
        .prompt(`How many repositories have more than ${combo.stars} stars and more than ${combo.forks} forks?`)
        .groundTruth(String(count))
        .type('filtering')
        .dataset('github')
        .build(),
    )
  }

  // Filtering: stars AND watchers
  for (const combo of QUESTION_THRESHOLDS.github.starWatcherCombinations) {
    const count = countByPredicate(
      repos,
      r => r.stars > combo.stars && r.watchers > combo.watchers,
    )
    questions.push(
      new QuestionBuilder()
        .id(getId())
        .prompt(`How many repositories have more than ${combo.stars} stars and more than ${combo.watchers} watchers?`)
        .groundTruth(String(count))
        .type('filtering')
        .dataset('github')
        .build(),
    )
  }

  return questions
}
46  benchmarks/src/questions/index.ts  Normal file
@@ -0,0 +1,46 @@
import type { AnalyticsMetric, Employee, EventLog, NestedConfig, Order, Repository } from '../datasets'
import type { Question } from '../types'
import { ACCURACY_DATASETS } from '../datasets'
import { generateAnalyticsQuestions } from './analytics'
import { generateEventLogsQuestions } from './event-logs'
import { generateGithubQuestions } from './github'
import { generateNestedQuestions } from './nested'
import { generateNestedConfigQuestions } from './nested-config'
import { generateTabularQuestions } from './tabular'
import { createIdGenerator } from './utils'

/**
 * Generate all questions from datasets
 *
 * @remarks
 * Generates ~150-160 questions across different question types and datasets:
 * - Field Retrieval: Direct field access with no computation
 *   Examples: "What is X's salary?", "What is the status of order Y?"
 * - Aggregation: Counts, sums, averages, min/max operations (including single-condition filters)
 *   Examples: "How many X?", "What is the total/average?", "How many X > threshold?"
 * - Filtering: Multi-condition queries requiring complex logical operations
 *   Examples: "How many X WHERE condition1 AND condition2?"
 */
export function generateQuestions(): Question[] {
  const questions: Question[] = []
  const idGen = createIdGenerator()
  const getId = () => idGen.next().value

  // Get datasets with proper typing
  const tabular = (ACCURACY_DATASETS.find(d => d.name === 'tabular')?.data.employees as Employee[]) ?? []
  const nested = (ACCURACY_DATASETS.find(d => d.name === 'nested')?.data.orders as Order[]) ?? []
  const analytics = (ACCURACY_DATASETS.find(d => d.name === 'analytics')?.data.metrics as AnalyticsMetric[]) ?? []
  const github = (ACCURACY_DATASETS.find(d => d.name === 'github')?.data.repositories as Repository[]) ?? []
  const eventLogs = (ACCURACY_DATASETS.find(d => d.name === 'event-logs')?.data.logs as EventLog[]) ?? []
  const nestedConfig = ACCURACY_DATASETS.find(d => d.name === 'nested-config')?.data as NestedConfig | undefined

  // Generate questions for each dataset
  questions.push(...generateTabularQuestions(tabular, getId))
  questions.push(...generateNestedQuestions(nested, getId))
  questions.push(...generateAnalyticsQuestions(analytics, getId))
  questions.push(...generateGithubQuestions(github, getId))
  questions.push(...generateEventLogsQuestions(eventLogs, getId))
  questions.push(...generateNestedConfigQuestions(nestedConfig, getId))

  return questions
}
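The id plumbing in `generateQuestions` (`createIdGenerator()` plus `const getId = () => idGen.next().value`) suggests an infinite generator yielding sequential question ids shared across all datasets, which would match the `q${idCounter++}` ids of the old monolithic generator. The real implementation is in `./utils` and not shown here; a minimal sketch under that assumption:

```typescript
// Hypothetical sketch of createIdGenerator; the real helper lives in
// benchmarks/src/questions/utils.ts.
function* createIdGenerator(): Generator<string, string, unknown> {
  let counter = 1
  // Infinite sequence: q1, q2, q3, ...
  while (true)
    yield `q${counter++}`
}
```

Sharing one generator through the `getId` closure keeps ids globally unique even though each dataset's questions are produced by a separate module.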
147  benchmarks/src/questions/nested-config.ts  Normal file
@@ -0,0 +1,147 @@
import type { NestedConfig } from '../datasets'
import type { Question } from '../types'
import { QUESTION_LIMITS } from '../constants'
import { QuestionBuilder } from './utils'

/**
 * Generate nested configuration questions
 */
export function generateNestedConfigQuestions(config: NestedConfig | undefined, getId: () => string): Question[] {
  const questions: Question[] = []

  if (!config)
    return questions

  // Field retrieval: top-level config values
  const fieldRetrievalQuestions = [
    {
      prompt: 'What is the environment in the configuration?',
      groundTruth: config.environment,
    },
    {
      prompt: 'What is the database host?',
      groundTruth: config.database.host,
    },
    {
      prompt: 'What is the database port?',
      groundTruth: String(config.database.port),
    },
    {
      prompt: 'What is the maximum connection pool size?',
      groundTruth: String(config.database.pool.max),
    },
    {
      prompt: 'What is the session duration?',
      groundTruth: String(config.authentication.session.duration),
    },
  ]

  for (const q of fieldRetrievalQuestions.slice(0, QUESTION_LIMITS.nestedConfig.fieldRetrieval)) {
    questions.push(
      new QuestionBuilder()
        .id(getId())
        .prompt(q.prompt)
        .groundTruth(q.groundTruth)
        .type('field-retrieval')
        .dataset('nested-config')
        .build(),
    )
  }

  // Aggregation: counts of nested structures
  const roleCount = Object.keys(config.permissions.roles).length
  const groupCount = Object.keys(config.permissions.groups).length
  const providerCount = config.authentication.providers.length
  const featureCount = Object.keys(config.features).length
  const replicaCount = config.database.replicas.length

  questions.push(
    new QuestionBuilder()
      .id(getId())
      .prompt('How many roles are defined in permissions?')
      .groundTruth(String(roleCount))
      .type('aggregation')
      .dataset('nested-config')
      .build(),
    new QuestionBuilder()
      .id(getId())
      .prompt('How many groups are defined in permissions?')
      .groundTruth(String(groupCount))
      .type('aggregation')
      .dataset('nested-config')
      .build(),
    new QuestionBuilder()
      .id(getId())
      .prompt('How many authentication providers are configured?')
      .groundTruth(String(providerCount))
      .type('aggregation')
      .dataset('nested-config')
      .build(),
    new QuestionBuilder()
      .id(getId())
      .prompt('How many feature flags are defined?')
      .groundTruth(String(featureCount))
      .type('aggregation')
      .dataset('nested-config')
      .build(),
    new QuestionBuilder()
      .id(getId())
      .prompt('How many database replicas are configured?')
      .groundTruth(String(replicaCount))
      .type('aggregation')
      .dataset('nested-config')
      .build(),
  )

  // Aggregation: feature flag details
  const enabledFeatures = Object.entries(config.features).filter(([_, f]) => f.enabled).length
  questions.push(
    new QuestionBuilder()
      .id(getId())
      .prompt('How many feature flags are enabled?')
      .groundTruth(String(enabledFeatures))
      .type('aggregation')
      .dataset('nested-config')
      .build(),
  )

  // Aggregation: role permissions
  const adminPermissions = config.permissions.roles.admin?.permissions.length ?? 0
  questions.push(
    new QuestionBuilder()
      .id(getId())
      .prompt('How many permissions does the admin role have?')
      .groundTruth(String(adminPermissions))
      .type('aggregation')
      .dataset('nested-config')
      .build(),
  )

  // Filtering: complex multi-condition queries
  const filteringQuestions = [
    {
      prompt: 'How many feature flags are enabled with rollout greater than 50%?',
      groundTruth: String(Object.entries(config.features)
        .filter(([_, f]) => f.enabled && f.rollout > 50).length),
    },
    {
      prompt: 'How many groups have the admin role?',
      groundTruth: String(Object.entries(config.permissions.groups)
        .filter(([_, g]) => g.roles.includes('admin')).length),
    },
  ]

  for (const q of filteringQuestions.slice(0, QUESTION_LIMITS.nestedConfig.filteringComplex)) {
    questions.push(
      new QuestionBuilder()
        .id(getId())
        .prompt(q.prompt)
        .groundTruth(q.groundTruth)
        .type('filtering')
        .dataset('nested-config')
        .build(),
    )
  }

  return questions
}
202  benchmarks/src/questions/nested.ts  Normal file
@@ -0,0 +1,202 @@
import type { Order } from '../datasets'
import type { Question } from '../types'
import { QUESTION_LIMITS, QUESTION_THRESHOLDS } from '../constants'
import { countByPredicate, QuestionBuilder, rotateQuestions, SAMPLE_STRIDES } from './utils'

/**
 * Generate nested (orders) questions
 */
export function generateNestedQuestions(orders: Order[], getId: () => string): Question[] {
  const questions: Question[] = []

  if (orders.length === 0)
    return questions

  // Field retrieval: order totals and statuses
  const orderFieldGenerators: Array<(order: Order, getId: () => string) => Question> = [
    (order, getId) => new QuestionBuilder()
      .id(getId())
      .prompt(`What is the total for order ${order.orderId}?`)
      .groundTruth(String(order.total))
      .type('field-retrieval')
      .dataset('nested')
      .build(),
    (order, getId) => new QuestionBuilder()
      .id(getId())
      .prompt(`What is the status of order ${order.orderId}?`)
      .groundTruth(order.status)
      .type('field-retrieval')
      .dataset('nested')
      .build(),
  ]

  questions.push(...rotateQuestions(
    orders,
    orderFieldGenerators,
    QUESTION_LIMITS.nested.fieldRetrievalOrders,
    SAMPLE_STRIDES.ORDER_FIELD,
    getId,
  ))

  // Field retrieval: customer info and order dates
  const customerFieldGenerators: Array<(order: Order, getId: () => string) => Question> = [
    (order, getId) => new QuestionBuilder()
      .id(getId())
      .prompt(`What is the customer name for order ${order.orderId}?`)
      .groundTruth(order.customer.name)
      .type('field-retrieval')
      .dataset('nested')
      .build(),
    (order, getId) => new QuestionBuilder()
      .id(getId())
      .prompt(`What is the customer email for order ${order.orderId}?`)
      .groundTruth(order.customer.email)
      .type('field-retrieval')
      .dataset('nested')
      .build(),
    (order, getId) => new QuestionBuilder()
      .id(getId())
      .prompt(`What is the order date for order ${order.orderId}?`)
      .groundTruth(order.orderDate || '')
      .type('field-retrieval')
      .dataset('nested')
      .build(),
    (order, getId) => new QuestionBuilder()
      .id(getId())
      .prompt(`How many items are in order ${order.orderId}?`)
      .groundTruth(String(order.items.length))
      .type('field-retrieval')
      .dataset('nested')
      .build(),
  ]

  // Use stride + 1 for customer fields to offset from order fields
  const customerOrders = orders.map((_, i) => orders[i * SAMPLE_STRIDES.CUSTOMER_FIELD + 1] || orders[i]).filter(Boolean) as Order[]
  questions.push(...rotateQuestions(
    customerOrders,
    customerFieldGenerators,
    QUESTION_LIMITS.nested.fieldRetrievalCustomers,
    1,
    getId,
  ))

  // Aggregation: totals and averages
  const totalRevenue = orders.reduce((sum, o) => sum + o.total, 0)
  const avgOrderValue = totalRevenue / orders.length
  const totalOrders = orders.length
  const maxOrderValue = Math.max(...orders.map(o => o.total))

  // Count by status
  const statuses = [...new Set(orders.map(o => o.status))]
  for (const status of statuses.slice(0, QUESTION_LIMITS.nested.aggregationStatuses)) {
    const count = countByPredicate(orders, o => o.status === status)
    questions.push(
      new QuestionBuilder()
        .id(getId())
        .prompt(`How many orders have status "${status}"?`)
        .groundTruth(String(count))
        .type('aggregation')
        .dataset('nested')
        .build(),
    )
  }

  questions.push(
    new QuestionBuilder()
      .id(getId())
      .prompt('What is the total revenue across all orders?')
      .groundTruth(String(totalRevenue.toFixed(2)))
      .type('aggregation')
      .dataset('nested')
      .build(),
    new QuestionBuilder()
      .id(getId())
      .prompt('What is the average order value?')
      .groundTruth(String(avgOrderValue.toFixed(2)))
      .type('aggregation')
      .dataset('nested')
      .build(),
    new QuestionBuilder()
      .id(getId())
      .prompt('How many orders are in the dataset?')
      .groundTruth(String(totalOrders))
      .type('aggregation')
      .dataset('nested')
      .build(),
    new QuestionBuilder()
      .id(getId())
      .prompt('What is the highest order total?')
      .groundTruth(String(maxOrderValue.toFixed(2)))
      .type('aggregation')
      .dataset('nested')
      .build(),
  )

  // Aggregation: high-value orders (single-condition filter)
  for (const threshold of QUESTION_THRESHOLDS.nested.highValueOrders) {
    const count = countByPredicate(orders, o => o.total > threshold)
    questions.push(
      new QuestionBuilder()
        .id(getId())
        .prompt(`How many orders have a total greater than ${threshold}?`)
        .groundTruth(String(count))
        .type('aggregation')
        .dataset('nested')
        .build(),
    )
  }

  // Filtering: multi-condition queries (status AND value)
  const orderStatuses = [...new Set(orders.map(o => o.status))]
  for (const status of orderStatuses.slice(0, QUESTION_LIMITS.nested.filteringStatusAndValue)) {
    const count = countByPredicate(
      orders,
      o => o.status === status && o.total > QUESTION_THRESHOLDS.nested.statusValueThreshold,
    )
    questions.push(
      new QuestionBuilder()
        .id(getId())
        .prompt(`How many orders have status "${status}" and total greater than ${QUESTION_THRESHOLDS.nested.statusValueThreshold}?`)
        .groundTruth(String(count))
        .type('filtering')
        .dataset('nested')
        .build(),
    )
  }

  // Filtering: status AND items count (multi-condition)
  for (const status of orderStatuses.slice(0, QUESTION_LIMITS.nested.filteringStatusAndItems)) {
    const count = countByPredicate(
      orders,
      o => o.status === status && o.items.length >= QUESTION_THRESHOLDS.nested.itemCountThreshold,
    )
    questions.push(
      new QuestionBuilder()
        .id(getId())
        .prompt(`How many orders have status "${status}" and at least ${QUESTION_THRESHOLDS.nested.itemCountThreshold} items?`)
        .groundTruth(String(count))
        .type('filtering')
        .dataset('nested')
        .build(),
    )
  }

  // Filtering: total AND items count (multi-condition)
  for (const threshold of QUESTION_THRESHOLDS.nested.totalThresholdsForItems) {
    const count = countByPredicate(
      orders,
      o => o.total > threshold && o.items.length >= QUESTION_THRESHOLDS.nested.itemCountThreshold,
    )
    questions.push(
      new QuestionBuilder()
        .id(getId())
        .prompt(`How many orders have a total greater than ${threshold} and at least ${QUESTION_THRESHOLDS.nested.itemCountThreshold} items?`)
        .groundTruth(String(count))
        .type('filtering')
        .dataset('nested')
        .build(),
    )
  }

  return questions
}
benchmarks/src/questions/tabular.ts (191 lines, Normal file)
@@ -0,0 +1,191 @@
import type { Employee } from '../datasets'
import type { Question } from '../types'
import { QUESTION_LIMITS, QUESTION_THRESHOLDS } from '../constants'
import { countByPredicate, QuestionBuilder, rotateQuestions, SAMPLE_STRIDES } from './utils'

/**
 * Generate tabular (employee) questions
 */
export function generateTabularQuestions(employees: Employee[], getId: () => string): Question[] {
  const questions: Question[] = []

  if (employees.length === 0)
    return questions

  // Field retrieval: specific employees
  const fieldGenerators: Array<(emp: Employee, getId: () => string) => Question> = [
    (emp, getId) => new QuestionBuilder()
      .id(getId())
      .prompt(`What is the salary of ${emp.name}?`)
      .groundTruth(String(emp.salary))
      .type('field-retrieval')
      .dataset('tabular')
      .build(),
    (emp, getId) => new QuestionBuilder()
      .id(getId())
      .prompt(`What department does ${emp.name} work in?`)
      .groundTruth(emp.department)
      .type('field-retrieval')
      .dataset('tabular')
      .build(),
    (emp, getId) => new QuestionBuilder()
      .id(getId())
      .prompt(`What is the email address of ${emp.name}?`)
      .groundTruth(emp.email)
      .type('field-retrieval')
      .dataset('tabular')
      .build(),
    (emp, getId) => new QuestionBuilder()
      .id(getId())
      .prompt(`How many years of experience does ${emp.name} have?`)
      .groundTruth(String(emp.yearsExperience))
      .type('field-retrieval')
      .dataset('tabular')
      .build(),
    (emp, getId) => new QuestionBuilder()
      .id(getId())
      .prompt(`Is ${emp.name} an active employee?`)
      .groundTruth(emp.active ? 'yes' : 'no')
      .type('field-retrieval')
      .dataset('tabular')
      .build(),
  ]

  questions.push(...rotateQuestions(
    employees,
    fieldGenerators,
    QUESTION_LIMITS.tabular.fieldRetrieval,
    SAMPLE_STRIDES.EMPLOYEE_FIELD,
    getId,
  ))

  // Aggregation: count by department
  const departments = [...new Set(employees.map(e => e.department))]
  for (const dept of departments.slice(0, QUESTION_LIMITS.tabular.aggregationDepartments)) {
    const count = countByPredicate(employees, e => e.department === dept)
    questions.push(
      new QuestionBuilder()
        .id(getId())
        .prompt(`How many employees work in ${dept}?`)
        .groundTruth(String(count))
        .type('aggregation')
        .dataset('tabular')
        .build(),
    )
  }

  // Aggregation: salary ranges (single-condition filters)
  for (const threshold of QUESTION_THRESHOLDS.tabular.salaryRanges) {
    const count = countByPredicate(employees, e => e.salary > threshold)
    questions.push(
      new QuestionBuilder()
        .id(getId())
        .prompt(`How many employees have a salary greater than ${threshold}?`)
        .groundTruth(String(count))
        .type('aggregation')
        .dataset('tabular')
        .build(),
    )
  }

  // Aggregation: totals and averages
  const totalEmployees = employees.length
  const avgSalary = Math.round(employees.reduce((sum, e) => sum + e.salary, 0) / totalEmployees)
  const activeCount = countByPredicate(employees, e => e.active)
  const inactiveCount = countByPredicate(employees, e => !e.active)

  questions.push(
    new QuestionBuilder()
      .id(getId())
      .prompt('How many employees are in the dataset?')
      .groundTruth(String(totalEmployees))
      .type('aggregation')
      .dataset('tabular')
      .build(),
    new QuestionBuilder()
      .id(getId())
      .prompt('What is the average salary across all employees?')
      .groundTruth(String(avgSalary))
      .type('aggregation')
      .dataset('tabular')
      .build(),
    new QuestionBuilder()
      .id(getId())
      .prompt('How many employees are active?')
      .groundTruth(String(activeCount))
      .type('aggregation')
      .dataset('tabular')
      .build(),
    new QuestionBuilder()
      .id(getId())
      .prompt('How many employees are inactive?')
      .groundTruth(String(inactiveCount))
      .type('aggregation')
      .dataset('tabular')
      .build(),
  )

  // Filtering: count by department with salary filter (multi-condition)
  for (const dept of departments.slice(0, QUESTION_LIMITS.tabular.filteringMultiConditionDepartments)) {
    const count = countByPredicate(
      employees,
      e => e.department === dept && e.salary > QUESTION_THRESHOLDS.tabular.departmentSalaryThreshold,
    )
    questions.push(
      new QuestionBuilder()
        .id(getId())
        .prompt(`How many employees in ${dept} have a salary greater than ${QUESTION_THRESHOLDS.tabular.departmentSalaryThreshold}?`)
        .groundTruth(String(count))
        .type('filtering')
        .dataset('tabular')
        .build(),
    )
  }

  // Filtering: active employees by experience (multi-condition)
  for (const exp of QUESTION_THRESHOLDS.tabular.experienceYears.slice(0, QUESTION_LIMITS.tabular.filteringExperience)) {
    const count = countByPredicate(employees, e => e.yearsExperience > exp && e.active)
    questions.push(
      new QuestionBuilder()
        .id(getId())
        .prompt(`How many active employees have more than ${exp} years of experience?`)
        .groundTruth(String(count))
        .type('filtering')
        .dataset('tabular')
        .build(),
    )
  }

  // Filtering: department by experience (multi-condition)
  for (const dept of departments.slice(0, QUESTION_LIMITS.tabular.filteringDepartmentExp)) {
    const count = countByPredicate(
      employees,
      e => e.department === dept && e.yearsExperience > QUESTION_THRESHOLDS.tabular.departmentExperienceThreshold,
    )
    questions.push(
      new QuestionBuilder()
        .id(getId())
        .prompt(`How many employees in ${dept} have more than ${QUESTION_THRESHOLDS.tabular.departmentExperienceThreshold} years of experience?`)
        .groundTruth(String(count))
        .type('filtering')
        .dataset('tabular')
        .build(),
    )
  }

  // Filtering: department by active status (multi-condition)
  for (const dept of departments.slice(0, QUESTION_LIMITS.tabular.filteringDepartmentActive)) {
    const count = countByPredicate(employees, e => e.department === dept && e.active)
    questions.push(
      new QuestionBuilder()
        .id(getId())
        .prompt(`How many active employees work in ${dept}?`)
        .groundTruth(String(count))
        .type('filtering')
        .dataset('tabular')
        .build(),
    )
  }

  return questions
}
benchmarks/src/questions/utils.ts (95 lines, Normal file)
@@ -0,0 +1,95 @@
import type { Question } from '../types'

// Constants for sampling strides
export const SAMPLE_STRIDES = {
  EMPLOYEE_FIELD: 2,
  ORDER_FIELD: 2,
  CUSTOMER_FIELD: 2,
  ANALYTICS_FIELD: 3,
  METRIC_FIELD: 3,
  REPO_FIELD: 7,
  EVENT_LOG_FIELD: 5,
} as const

/**
 * ID Generator
 */
export function* createIdGenerator(): Generator<string, never, never> {
  let id = 1
  while (true) {
    yield `q${id++}`
  }
}

/**
 * Question Builder class for fluent question creation
 */
export class QuestionBuilder {
  private question: Partial<Question> = {}

  id(id: string): this {
    this.question.id = id
    return this
  }

  prompt(prompt: string): this {
    this.question.prompt = prompt
    return this
  }

  groundTruth(groundTruth: string): this {
    this.question.groundTruth = groundTruth
    return this
  }

  type(type: Question['type']): this {
    this.question.type = type
    return this
  }

  dataset(dataset: Question['dataset']): this {
    this.question.dataset = dataset
    return this
  }

  build(): Question {
    if (!this.question.id || !this.question.prompt || !this.question.groundTruth || !this.question.type || !this.question.dataset) {
      throw new Error('Incomplete question')
    }
    return this.question as Question
  }
}

/**
 * Helper: Count items matching a predicate
 */
export function countByPredicate<T>(items: T[], predicate: (item: T) => boolean): number {
  return items.filter(predicate).length
}

/**
 * Helper: Rotate through question generators
 */
export function rotateQuestions<T>(
  items: T[],
  generators: Array<(item: T, getId: () => string) => Question>,
  limit: number,
  stride: number,
  getId: () => string,
): Question[] {
  const questions: Question[] = []

  for (let i = 0; i < Math.min(limit, items.length); i++) {
    const item = items[i * stride] || items[i]
    if (!item)
      continue

    const generatorIndex = i % generators.length
    const generator = generators[generatorIndex]
    if (generator) {
      questions.push(generator(item, getId))
    }
  }

  return questions
}
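The helpers above compose like this. A minimal, self-contained sketch: the `Question` shape is reduced to three fields and the names/data are illustrative, not the benchmark's real datasets; the `rotateQuestions` body mirrors the helper introduced in this commit.

```typescript
// Illustrative Question shape; the benchmark's real type lives in src/types.
interface Question {
  id: string
  prompt: string
  groundTruth: string
}

// Mirrors the rotateQuestions helper: stride over items, cycle generators,
// cap output at `limit`, falling back to items[i] when the strided index overruns.
function rotateQuestions<T>(
  items: T[],
  generators: Array<(item: T, getId: () => string) => Question>,
  limit: number,
  stride: number,
  getId: () => string,
): Question[] {
  const questions: Question[] = []
  for (let i = 0; i < Math.min(limit, items.length); i++) {
    const item = items[i * stride] || items[i]
    if (!item)
      continue
    const generator = generators[i % generators.length]
    if (generator)
      questions.push(generator(item, getId))
  }
  return questions
}

function* createIdGenerator(): Generator<string, never, never> {
  let id = 1
  while (true) yield `q${id++}`
}

const gen = createIdGenerator()
const getId = () => gen.next().value

const names = ['Ada', 'Grace', 'Edsger', 'Barbara']
const questions = rotateQuestions(
  names,
  [
    (name, getId) => ({ id: getId(), prompt: `Salary of ${name}?`, groundTruth: 'n/a' }),
    (name, getId) => ({ id: getId(), prompt: `Department of ${name}?`, groundTruth: 'n/a' }),
  ],
  3, // limit
  2, // stride: sample every other item
  getId,
)

console.log(questions.map(q => q.prompt))
// → [ 'Salary of Ada?', 'Department of Edsger?', 'Salary of Edsger?' ]
```

Note the fallback in action: with stride 2 over 4 items, the third iteration would index `items[4]`, so it falls back to `items[2]` while the generator rotation keeps cycling.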
@@ -1,7 +1,8 @@
-import type { EfficiencyRanking, EvaluationResult, FormatResult, Question } from './types'
+import type { Dataset, EfficiencyRanking, EvaluationResult, FormatResult, Question } from './types'
 import { FORMATTER_DISPLAY_NAMES } from './constants'
-import { datasets } from './datasets'
+import { ACCURACY_DATASETS } from './datasets'
 import { models } from './evaluate'
+import { supportsCSV } from './formatters'
 import { generateQuestions } from './questions'
 import { createProgressBar, tokenize } from './utils'

@@ -16,7 +17,11 @@ export function calculateTokenCounts(
   const tokenCounts: Record<string, number> = {}

   for (const [formatName, formatter] of Object.entries(formatters)) {
-    for (const dataset of datasets) {
+    for (const dataset of ACCURACY_DATASETS) {
+      // Skip CSV for datasets that don't support it
+      if (formatName === 'csv' && !supportsCSV(dataset))
+        continue
+
       const formatted = formatter(dataset.data)
       const key = `${formatName}-${dataset.name}`
       tokenCounts[key] = tokenize(formatted)
@@ -42,9 +47,9 @@
   const accuracy = correctCount / totalCount

   // Calculate average tokens across all datasets for this format
-  const avgTokens = Object.entries(tokenCounts)
+  const formatTokenEntries = Object.entries(tokenCounts)
     .filter(([key]) => key.startsWith(`${formatName}-`))
-    .reduce((sum, [, tokens]) => sum + tokens, 0) / datasets.length
+  const avgTokens = formatTokenEntries.reduce((sum, [, tokens]) => sum + tokens, 0) / formatTokenEntries.length

   const averageLatency = formatResults.reduce((sum, r) => sum + r.latencyMs, 0) / totalCount

@@ -75,6 +80,8 @@ export function generateAccuracyReport(
   return `
 Benchmarks test LLM comprehension across different input formats using ${totalQuestions} data retrieval questions on ${modelNames.length} ${modelNames.length === 1 ? 'model' : 'models'}.

+${generateDatasetCatalog(ACCURACY_DATASETS)}
+
 #### Efficiency Ranking (Accuracy per 1K Tokens)

 ${generateEfficiencyRankingReport(formatResults)}
@@ -85,6 +92,38 @@ ${generateDetailedAccuracyReport(formatResults, results, questions, tokenCounts)
 `.trimStart()
 }

+/**
+ * Generate dataset catalog section
+ */
+function generateDatasetCatalog(datasets: Dataset[]): string {
+  const rows = datasets.map((dataset) => {
+    const csvSupport = supportsCSV(dataset) ? '✓' : '✗'
+    const rowCount = Object.values(dataset.data)[0]?.length ?? 1
+    const structure = dataset.metadata.structureClass
+    const eligibility = `${dataset.metadata.tabularEligibility}%`
+
+    return `| ${dataset.description} | ${rowCount} | ${structure} | ${csvSupport} | ${eligibility} |`
+  }).join('\n')
+
+  return `
+#### Dataset Catalog
+
+| Dataset | Rows | Structure | CSV Support | Eligibility |
+| ------- | ---- | --------- | ----------- | ----------- |
+${rows}
+
+**Structure classes:**
+- **uniform**: All objects have identical fields with primitive values
+- **semi-uniform**: Mix of uniform and non-uniform structures
+- **nested**: Objects with nested structures (nested objects or arrays)
+- **deep**: Highly nested with minimal tabular eligibility
+
+**CSV Support:** ✓ (supported), ✗ (not supported - would require lossy flattening)
+
+**Eligibility:** Percentage of arrays that qualify for TOON's tabular format (uniform objects with primitive values)
+`.trim()
+}
+
 /**
  * Generate efficiency ranking report
  */
@@ -168,10 +207,12 @@ function generateDetailedAccuracyReport(
   const filteringPercent = ((filteringCount / totalQuestions) * 100).toFixed(0)

   // Calculate dataset sizes
-  const tabularSize = datasets.find(d => d.name === 'tabular')?.data.employees?.length || 0
-  const nestedSize = datasets.find(d => d.name === 'nested')?.data.orders?.length || 0
-  const analyticsSize = datasets.find(d => d.name === 'analytics')?.data.metrics?.length || 0
-  const githubSize = datasets.find(d => d.name === 'github')?.data.repositories?.length || 0
+  const tabularSize = ACCURACY_DATASETS.find(d => d.name === 'tabular')?.data.employees?.length || 0
+  const nestedSize = ACCURACY_DATASETS.find(d => d.name === 'nested')?.data.orders?.length || 0
+  const analyticsSize = ACCURACY_DATASETS.find(d => d.name === 'analytics')?.data.metrics?.length || 0
+  const githubSize = ACCURACY_DATASETS.find(d => d.name === 'github')?.data.repositories?.length || 0
+  const eventLogsSize = ACCURACY_DATASETS.find(d => d.name === 'event-logs')?.data.logs?.length || 0
+  const nestedConfigSize = 1 // Single config object

   // Calculate number of formats and evaluations
   const formatCount = formatResults.length
@@ -208,12 +249,14 @@ This benchmark tests **LLM comprehension and data retrieval accuracy** across di

 #### Datasets Tested

-Four datasets designed to test different structural patterns (all contain arrays of uniform objects, TOON's optimal format):
+Six datasets designed to test different structural patterns:

 1. **Tabular** (${tabularSize} employee records): Uniform objects with identical fields – optimal for TOON's tabular format.
 2. **Nested** (${nestedSize} e-commerce orders): Complex structures with nested customer objects and item arrays.
 3. **Analytics** (${analyticsSize} days of metrics): Time-series data with dates and numeric values.
 4. **GitHub** (${githubSize} repositories): Real-world data from top GitHub repos by stars.
+5. **Event Logs** (${eventLogsSize} logs): Semi-uniform data with ~50% flat logs and ~50% with nested error objects.
+6. **Nested Config** (${nestedConfigSize} configuration): Deeply nested configuration with minimal tabular eligibility.

 #### Question Types

@@ -314,7 +357,7 @@ function generateDatasetBreakdown(
   questions: Question[],
   tokenCounts: Record<string, number>,
 ): string {
-  return datasets.map((dataset) => {
+  return ACCURACY_DATASETS.map((dataset) => {
     const datasetResults = formatResults.map((fr) => {
       const datasetFormatResults = results.filter(r => r.questionId.includes(dataset.name) || questions.find(q => q.id === r.questionId)?.dataset === dataset.name)
       if (datasetFormatResults.length === 0)
@@ -1,7 +1,14 @@
+export interface DatasetMetadata {
+  supportsCSV: boolean
+  structureClass: 'uniform' | 'semi-uniform' | 'nested' | 'deep'
+  tabularEligibility: number
+}
+
 export interface Dataset {
   name: string
   description: string
   data: Record<string, any>
+  metadata: DatasetMetadata
 }

 export interface Question {
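Concretely, a dataset entry carrying the new metadata could look like the sketch below. The interfaces match the diff; the `event-logs` payload itself is made up for illustration, not the benchmark's actual data.

```typescript
interface DatasetMetadata {
  supportsCSV: boolean
  structureClass: 'uniform' | 'semi-uniform' | 'nested' | 'deep'
  tabularEligibility: number
}

interface Dataset {
  name: string
  description: string
  data: Record<string, any>
  metadata: DatasetMetadata
}

// Hypothetical event-logs entry: half the rows are flat, half carry a
// nested `error` object, so CSV support would require lossy flattening.
const eventLogs: Dataset = {
  name: 'event-logs',
  description: 'Event Logs',
  data: {
    logs: [
      { level: 'info', message: 'started' },
      { level: 'error', message: 'failed', error: { code: 500 } },
    ],
  },
  metadata: {
    supportsCSV: false,
    structureClass: 'semi-uniform',
    tabularEligibility: 50, // ~50% of rows are flat, uniform objects
  },
}

console.log(`${eventLogs.name}: ${eventLogs.metadata.tabularEligibility}% tabular-eligible`)
```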