mirror of
https://github.com/voson-wang/toon.git
synced 2026-01-29 15:24:10 +08:00
test(benchmark): overhaul generation
23
README.md
@@ -60,6 +60,19 @@ For small payloads, JSON/CSV/YAML work fine. TOON's value emerges at scale: when

</details>

<details>
<summary><strong>When NOT to use TOON</strong></summary>

TOON excels with uniform arrays of objects, but there are cases where other formats are better:

- **Deeply nested or non-uniform structures** (tabular eligibility ≈ 0%): JSON-compact often uses fewer tokens. Example: complex configuration objects with many nested levels.
- **Semi-uniform arrays** (~40–60% tabular eligibility): Token savings diminish. Prefer JSON if your pipelines already rely on it.
- **Flat CSV use cases**: CSV is smaller than TOON for pure tabular data. TOON adds minimal overhead (~5–10%) to provide structure (length markers, field headers, delimiter scoping) that improves LLM reliability.

See [benchmarks](#benchmarks) for concrete comparisons across different data structures.
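To make the CSV-versus-TOON trade-off concrete, here is a small hand-written sketch. The field names and data are made up, and the TOON string is written by hand to match the documented tabular layout (length marker, field header, indented rows) rather than produced by the library:

```typescript
// Two toy records rendered as CSV and as a TOON-style tabular block.
const rows = [
  { sku: 'A1', qty: 2 },
  { sku: 'B2', qty: 5 },
]

// CSV: header row plus one comma-joined line per record.
const csv = ['sku,qty', ...rows.map(r => `${r.sku},${r.qty}`)].join('\n')

// TOON: same rows, but with an explicit length marker and field list.
const toon = [
  `items[${rows.length}]{sku,qty}:`,
  ...rows.map(r => `  ${r.sku},${r.qty}`),
].join('\n')

console.log(csv)
// sku,qty
// A1,2
// B2,5
console.log(toon)
// items[2]{sku,qty}:
//   A1,2
//   B2,5
```

The extra characters (`items[2]{…}:` and the row indentation) are the "structure overhead" the bullet above refers to.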

</details>

## Key Features

- 💸 **Token-efficient:** typically 30–60% fewer tokens than JSON[^1]
@@ -75,14 +88,16 @@ For small payloads, JSON/CSV/YAML work fine. TOON's value emerges at scale: when
> [!TIP]
> Try the interactive [Format Tokenization Playground](https://www.curiouslychase.com/playground/format-tokenization-exploration) to compare token usage across CSV, JSON, YAML, and TOON with your own data.

Benchmarks are organized into two tracks to ensure fair comparisons:

- **Mixed-Structure Track**: Datasets with nested or semi-uniform structures (TOON vs JSON, YAML, XML). CSV is excluded because it cannot properly represent these structures.
- **Flat-Only Track**: Datasets with flat tabular structures where CSV is applicable (CSV vs TOON vs JSON, YAML, XML).

### Token Efficiency

Token counts are measured using the GPT-5 `o200k_base` tokenizer via [`gpt-tokenizer`](https://github.com/niieani/gpt-tokenizer). Savings are calculated against formatted JSON (2-space indentation) as the primary baseline, with additional comparisons to compact JSON (minified), YAML, and XML. Actual savings vary by model and tokenizer.

The benchmarks use datasets optimized for TOON's strengths (uniform tabular data). Real-world performance depends on your data structure.
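The savings calculation described above can be sketched in a few lines. This mirrors the report's formula (savings are expressed as a percentage of the baseline format's token count); the function name is illustrative, and the numbers are taken from the GitHub Repositories dataset in this report:

```typescript
// Percentage of baseline tokens saved by TOON.
// Positive = TOON is smaller than the baseline format.
function savingsPercent(baselineTokens: number, toonTokens: number): number {
  return ((baselineTokens - toonTokens) / baselineTokens) * 100
}

savingsPercent(15_145, 8_745) // ≈ 42.3 (TOON vs formatted JSON)
```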

> [!NOTE]
> CSV/TSV doesn't support nested structures, so it's not included in this comparison. For flat datasets where CSV applies, see token counts and accuracy metrics in the [Retrieval Accuracy](#retrieval-accuracy) section.
The benchmarks test datasets across different structural patterns (uniform, semi-uniform, nested, deeply nested) to show where TOON excels and where other formats may be better.

<!-- automd:file src="./benchmarks/results/token-efficiency.md" -->

@@ -34,8 +34,8 @@ Results are saved to `results/token-efficiency.md`.

Tests how well LLMs can answer questions about data in different formats (TOON, JSON, JSON compact, XML, YAML, CSV):

1. Generate ~150-160 questions across 4 datasets
2. Convert each dataset to all 6 formats
1. Generate ~150-160 questions across 6 datasets (CSV only included for datasets with flat/tabular structure)
2. Convert each dataset to all supported formats
3. Query each LLM with formatted data + question
4. Validate answers using `gpt-5-nano` as judge
5. Aggregate metrics and generate report

@@ -1,36 +1,149 @@

## Mixed-Structure Track

Datasets with nested or semi-uniform structures. CSV excluded as it cannot properly represent these structures.

```
⭐ GitHub Repositories ██████████████░░░░░░░░░░░ 8,745 tokens
vs JSON (−42.3%) 15,145
vs JSON compact (−23.7%) 11,455
vs YAML (−33.4%) 13,129
vs XML (−48.8%) 17,095
🛒 E-commerce orders with nested structures [eligibility: 33%]
toon ▓▓▓▓▓▓▓▓▓▓▓▓░░░░░░░░ 58,528 tokens
vs JSON (−37.9%) 94,207
vs JSON compact (+0.9%) 57,979
vs YAML (−17.8%) 71,223
vs XML (−45.2%) 106,720

📈 Daily Analytics ██████████░░░░░░░░░░░░░░░ 4,507 tokens
vs JSON (−58.9%) 10,977
vs JSON compact (−35.7%) 7,013
vs YAML (−48.8%) 8,810
vs XML (−65.7%) 13,128
🧾 Semi-uniform event logs [eligibility: 50%]
toon ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓░░░ 154,419 tokens
vs JSON (−15.0%) 181,592
vs JSON compact (+19.9%) 128,836
vs YAML (−0.9%) 155,749
vs XML (−25.1%) 206,271

🛒 E-Commerce Order ████████████████░░░░░░░░░ 166 tokens
vs JSON (−35.4%) 257
vs JSON compact (−2.9%) 171
vs YAML (−15.7%) 197
vs XML (−38.7%) 271
🧩 Deeply nested configuration [eligibility: 0%]
toon ▓▓▓▓▓▓▓▓▓▓▓▓▓▓░░░░░░ 630 tokens
vs JSON (−31.4%) 918
vs JSON compact (+11.9%) 563
vs YAML (−6.4%) 673
vs XML (−37.4%) 1,007

─────────────────────────────────────────────────────────────────────
Total ██████████████░░░░░░░░░░░ 13,418 tokens
vs JSON (−49.1%) 26,379
vs JSON compact (−28.0%) 18,639
vs YAML (−39.4%) 22,136
vs XML (−56.0%) 30,494
─────────────────────────────────────────────────────────────────────────────────
Total
toon ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓░░░░░ 213,577 tokens
vs JSON (−22.8%) 276,717
vs JSON compact (+14.0%) 187,378
vs YAML (−6.2%) 227,645
vs XML (−32.0%) 313,998
```

## Flat-Only Track

Datasets with flat tabular structures where CSV is applicable.

```
👥 Uniform employee records (TOON optimal format) [eligibility: 100%]
csv ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓░ 46,968 tokens
toon ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓ 49,841 tokens (+5.8% vs CSV)
vs JSON (−60.7%) 126,886
vs JSON compact (−36.8%) 78,882
vs YAML (−50.0%) 99,743
vs XML (−66.0%) 146,465

📈 Time-series analytics data [eligibility: 100%]
csv ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓░░ 8,382 tokens
toon ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓ 9,114 tokens (+8.0% vs CSV)
vs JSON (−59.0%) 22,244
vs JSON compact (−35.9%) 14,210
vs YAML (−49.0%) 17,857
vs XML (−65.8%) 26,615

⭐ Top 100 GitHub repositories [eligibility: 100%]
csv ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓░ 8,513 tokens
toon ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓ 8,745 tokens (+2.7% vs CSV)
vs JSON (−42.3%) 15,145
vs JSON compact (−23.7%) 11,455
vs YAML (−33.4%) 13,129
vs XML (−48.8%) 17,095

─────────────────────────────────────────────────────────────────────────────────
Total
csv ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓░ 63,863 tokens
toon ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓ 67,700 tokens (+5.7% vs CSV)
vs JSON (−58.8%) 164,275
vs JSON compact (−35.2%) 104,547
vs YAML (−48.2%) 130,729
vs XML (−64.4%) 190,175
```

<details>
<summary><strong>View detailed examples</strong></summary>

#### ⭐ GitHub Repositories
#### 📈 Time-series analytics data

**Configuration:** Top 100 GitHub repositories with stars, forks, and metadata
**Savings:** 13,130 tokens (59.0% reduction vs JSON)

**JSON** (22,244 tokens):

```json
{
  "metrics": [
    {
      "date": "2025-01-01",
      "views": 4324,
      "clicks": 146,
      "conversions": 21,
      "revenue": 3834.57,
      "bounceRate": 0.4
    },
    {
      "date": "2025-01-02",
      "views": 6248,
      "clicks": 407,
      "conversions": 22,
      "revenue": 2936.12,
      "bounceRate": 0.62
    },
    {
      "date": "2025-01-03",
      "views": 7382,
      "clicks": 270,
      "conversions": 24,
      "revenue": 6825.19,
      "bounceRate": 0.7
    },
    {
      "date": "2025-01-04",
      "views": 4586,
      "clicks": 267,
      "conversions": 24,
      "revenue": 2391.11,
      "bounceRate": 0.64
    },
    {
      "date": "2025-01-05",
      "views": 6171,
      "clicks": 227,
      "conversions": 12,
      "revenue": 3430.1,
      "bounceRate": 0.39
    }
  ]
}
```

**TOON** (9,114 tokens):

```
metrics[5]{date,views,clicks,conversions,revenue,bounceRate}:
  2025-01-01,4324,146,21,3834.57,0.4
  2025-01-02,6248,407,22,2936.12,0.62
  2025-01-03,7382,270,24,6825.19,0.7
  2025-01-04,4586,267,24,2391.11,0.64
  2025-01-05,6171,227,12,3430.1,0.39
```
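The tabular block above can be produced by a few lines of string building. This is an illustrative sketch, not the real TOON encoder from this repository (which also handles quoting, delimiters, nesting, and non-uniform data); the function name `encodeTabular` is made up:

```typescript
// Encode a uniform array of objects as a TOON-style tabular block:
// one header with a length marker and field list, then one indented,
// comma-joined row per record.
function encodeTabular(key: string, rows: Record<string, unknown>[]): string {
  const fields = Object.keys(rows[0] ?? {})
  const header = `${key}[${rows.length}]{${fields.join(',')}}:`
  const lines = rows.map(r => `  ${fields.map(f => String(r[f])).join(',')}`)
  return [header, ...lines].join('\n')
}

encodeTabular('metrics', [
  { date: '2025-01-01', views: 4324 },
  { date: '2025-01-02', views: 6248 },
])
// metrics[2]{date,views}:
//   2025-01-01,4324
//   2025-01-02,6248
```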

---

#### ⭐ Top 100 GitHub repositories

**Savings:** 6,400 tokens (42.3% reduction vs JSON)

@@ -91,72 +204,4 @@ repositories[3]{id,name,repo,description,createdAt,updatedAt,pushedAt,stars,watc
  21737465,awesome,sindresorhus/awesome,😎 Awesome lists about all kinds of interesting topics,"2014-07-11T13:42:37Z","2025-10-28T12:40:21Z","2025-10-27T17:57:31Z",410052,8017,32029,main
```

---

#### 📈 Daily Analytics

**Configuration:** 180 days of web metrics (views, clicks, conversions, revenue)

**Savings:** 6,470 tokens (58.9% reduction vs JSON)

**JSON** (10,977 tokens):

```json
{
  "metrics": [
    {
      "date": "2025-01-01",
      "views": 6890,
      "clicks": 401,
      "conversions": 23,
      "revenue": 6015.59,
      "bounceRate": 0.63
    },
    {
      "date": "2025-01-02",
      "views": 6940,
      "clicks": 323,
      "conversions": 37,
      "revenue": 9086.44,
      "bounceRate": 0.36
    },
    {
      "date": "2025-01-03",
      "views": 4390,
      "clicks": 346,
      "conversions": 26,
      "revenue": 6360.75,
      "bounceRate": 0.48
    },
    {
      "date": "2025-01-04",
      "views": 3429,
      "clicks": 231,
      "conversions": 13,
      "revenue": 2360.96,
      "bounceRate": 0.65
    },
    {
      "date": "2025-01-05",
      "views": 5804,
      "clicks": 186,
      "conversions": 22,
      "revenue": 2535.96,
      "bounceRate": 0.37
    }
  ]
}
```

**TOON** (4,507 tokens):

```
metrics[5]{date,views,clicks,conversions,revenue,bounceRate}:
  2025-01-01,6890,401,23,6015.59,0.63
  2025-01-02,6940,323,37,9086.44,0.36
  2025-01-03,4390,346,26,6360.75,0.48
  2025-01-04,3429,231,13,2360.96,0.65
  2025-01-05,5804,186,22,2535.96,0.37
```

</details>

@@ -5,16 +5,83 @@ import process from 'node:process'
import * as prompts from '@clack/prompts'
import PQueue from 'p-queue'
import { BENCHMARKS_DIR, DEFAULT_CONCURRENCY, DRY_RUN, DRY_RUN_LIMITS, MODEL_RPM_LIMITS, ROOT_DIR } from '../src/constants'
import { datasets } from '../src/datasets'
import { ACCURACY_DATASETS } from '../src/datasets'
import { evaluateQuestion, models } from '../src/evaluate'
import { formatters } from '../src/formatters'
import { formatters, supportsCSV } from '../src/formatters'
import { generateQuestions } from '../src/questions'
import { calculateFormatResults, calculateTokenCounts, generateAccuracyReport } from '../src/report'
import { getAllModelResults, hasModelResults, saveModelResults } from '../src/storage'
import { ensureDir } from '../src/utils'

// Constants
const PROGRESS_UPDATE_INTERVAL = 10
const RATE_LIMIT_INTERVAL_MS = 60_000

prompts.intro('Retrieval Accuracy Benchmark')

/**
 * Generate evaluation tasks for a model
 */
function generateEvaluationTasks(questions: Question[]): { question: Question, formatName: string }[] {
  const tasks: { question: Question, formatName: string }[] = []

  for (const question of questions) {
    for (const [formatName] of Object.entries(formatters)) {
      // Skip CSV for datasets that don't support it
      const dataset = ACCURACY_DATASETS.find(d => d.name === question.dataset)
      if (formatName === 'csv' && dataset && !supportsCSV(dataset))
        continue

      tasks.push({ question, formatName })
    }
  }

  return tasks
}

/**
 * Check which models already have saved results
 */
async function checkExistingResults(activeModels: typeof models) {
  const existingModelResults: Record<string, boolean> = {}

  for (const model of activeModels) {
    const existingResult = await hasModelResults(model.modelId)
    if (existingResult)
      existingModelResults[model.modelId] = existingResult
  }

  return existingModelResults
}

/**
 * Create a progress updater function
 */
function createProgressUpdater(spinner: ReturnType<typeof prompts.spinner>, total: number) {
  let completed = 0

  return () => {
    completed++
    if (completed % PROGRESS_UPDATE_INTERVAL === 0 || completed === total) {
      const percent = ((completed / total) * 100).toFixed(1)
      spinner.message(`Progress: ${completed}/${total} (${percent}%)`)
    }
  }
}

/**
 * Create a rate-limited queue for model evaluation
 */
function createEvaluationQueue(modelId: string) {
  const rpmLimit = MODEL_RPM_LIMITS[modelId]

  return new PQueue({
    concurrency: DEFAULT_CONCURRENCY,
    intervalCap: rpmLimit ?? Infinity,
    interval: rpmLimit ? RATE_LIMIT_INTERVAL_MS : 0,
  })
}

// Prompt user to select which models to benchmark
const modelChoices = models.map(({ modelId }) => ({
  value: modelId,
@@ -37,15 +104,10 @@ const activeModels = models.filter(m => selectedModels.includes(m.modelId))
prompts.log.info(`Selected ${activeModels.length} model(s): ${activeModels.map(m => m.modelId).join(', ')}`)

// Check which models already have results
const existingModelResults: Record<string, boolean> = {}
for (const model of activeModels) {
  const existingResult = await hasModelResults(model.modelId)
  if (existingResult)
    existingModelResults[model.modelId] = existingResult
}
const existingModelResults = await checkExistingResults(activeModels)

if (Object.keys(existingModelResults).length > 0) {
  prompts.log.info(`Found existing results for ${Object.values(existingModelResults).length} model(s)`)
  prompts.log.info(`Found existing results for ${Object.keys(existingModelResults).length} model(s)`)
}

if (DRY_RUN) {
@@ -75,31 +137,22 @@ for (const model of activeModels) {
  prompts.log.step(`Running benchmark for ${modelId}`)

  // Generate evaluation tasks for this model
  const tasks: { question: Question, formatName: string }[] = []
  for (const question of questions) {
    for (const [formatName] of Object.entries(formatters)) {
      tasks.push({ question, formatName })
    }
  }
  const tasks = generateEvaluationTasks(questions)

  const total = tasks.length
  const rpmLimit = MODEL_RPM_LIMITS[modelId]
  const queue = new PQueue({
    concurrency: DEFAULT_CONCURRENCY,
    intervalCap: rpmLimit ?? Infinity,
    interval: rpmLimit ? 60_000 : 0,
  })
  const queue = createEvaluationQueue(modelId)

  const evalSpinner = prompts.spinner()
  evalSpinner.start(`Running ${total} evaluations (concurrency: ${DEFAULT_CONCURRENCY}, RPM limit: ${rpmLimit ?? 'unlimited'})`)

  let completed = 0
  const updateProgress = createProgressUpdater(evalSpinner, total)

  // Queue all tasks
  const modelResultPromises = tasks.map(task =>
    queue.add(async () => {
      // Format data on-demand
      const dataset = datasets.find(d => d.name === task.question.dataset)!
      const dataset = ACCURACY_DATASETS.find(d => d.name === task.question.dataset)!
      const formatter = formatters[task.formatName]!
      const formattedData = formatter(dataset.data)

@@ -111,11 +164,7 @@ for (const model of activeModels) {
      })

      // Progress update after task completes
      completed++
      if (completed % 10 === 0 || completed === total) {
        const percent = ((completed / total) * 100).toFixed(1)
        evalSpinner.message(`Progress: ${completed}/${total} (${percent}%)`)
      }
      updateProgress()

      return result
    }),
@@ -154,5 +203,5 @@ await ensureDir(resultsDir)
const outputFilePath = path.join(resultsDir, 'retrieval-accuracy.md')
await fsp.writeFile(outputFilePath, accuracyReport)

prompts.log.info(`Report saved to: \`${path.relative(ROOT_DIR, outputFilePath)}\``)
reportSpinner.stop('Report generation complete!')
prompts.log.info(`Report saved to: \`${path.relative(ROOT_DIR, outputFilePath)}\``)

@@ -1,11 +1,11 @@
import type { Dataset } from '../src/types'
import * as fsp from 'node:fs/promises'
import * as path from 'node:path'
import * as prompts from '@clack/prompts'
import { encode } from '../../packages/toon/src'
import githubRepos from '../data/github-repos.json' with { type: 'json' }
import { BENCHMARKS_DIR, FORMATTER_DISPLAY_NAMES, ROOT_DIR } from '../src/constants'
import { generateAnalyticsData, generateOrderData } from '../src/datasets'
import { formatters } from '../src/formatters'
import { TOKEN_EFFICIENCY_DATASETS } from '../src/datasets'
import { formatters, supportsCSV } from '../src/formatters'
import { createProgressBar, ensureDir, tokenize } from '../src/utils'

interface FormatMetrics {
@@ -16,55 +16,160 @@
}

interface BenchmarkResult {
  name: string
  emoji: string
  description: string
  data: Record<string, any>
  dataset: Dataset
  formats: FormatMetrics[]
  showDetailed: boolean
}

const BENCHMARK_EXAMPLES = [
  {
    name: 'GitHub Repositories',
    emoji: '⭐',
    description: 'Top 100 GitHub repositories with stars, forks, and metadata',
    getData: () => ({ repositories: githubRepos }),
    showDetailed: true,
  },
  {
    name: 'Daily Analytics',
    emoji: '📈',
    description: '180 days of web metrics (views, clicks, conversions, revenue)',
    getData: () => generateAnalyticsData(180),
    showDetailed: true,
  },
  {
    name: 'E-Commerce Order',
    emoji: '🛒',
    description: 'Single nested order with customer and items',
    getData: generateOrderData,
    showDetailed: false,
  },
] as const
// Constants
const DATASET_ICONS: Record<string, string> = {
  'tabular': '👥',
  'nested': '🛒',
  'analytics': '📈',
  'github': '⭐',
  'event-logs': '🧾',
  'nested-config': '🧩',
}

const COMPARISON_FORMAT_ORDER = ['json-pretty', 'json-compact', 'yaml', 'xml'] as const

const PROGRESS_BAR_CONFIG = { filled: '▓', empty: '░' } as const
const PROGRESS_BAR_WIDTH = 20
const TOKEN_PADDING = 7
const LABEL_PADDING = 60
const COMPARISON_LABEL_PADDING = 30

const SEPARATOR = '─────────────────────────────────────────────────────────────────────────────────'
const DEFAULT_DATASET_ICON = '📊'

const DETAILED_EXAMPLE_DATASETS = ['github', 'analytics'] as const
const GITHUB_REPO_LIMIT = 3
const GITHUB_DESC_LIMIT = 80
const ANALYTICS_METRICS_LIMIT = 5

prompts.intro('Token Efficiency Benchmark')

/**
 * Format a comparison line showing savings vs TOON
 */
function formatComparisonLine(format: FormatMetrics): string {
  const label = FORMATTER_DISPLAY_NAMES[format.name] || format.name.toUpperCase()
  const signedPercent = format.savingsPercent >= 0
    ? `−${format.savingsPercent.toFixed(1)}%`
    : `+${Math.abs(format.savingsPercent).toFixed(1)}%`
  const labelWithSavings = `vs ${label} (${signedPercent})`.padEnd(COMPARISON_LABEL_PADDING)
  const tokenStr = format.tokens.toLocaleString('en-US').padStart(TOKEN_PADDING)
  return ` ${labelWithSavings}${tokenStr}`
}

/**
 * Calculate total tokens and savings for a set of datasets
 */
function calculateTotalMetrics(datasets: BenchmarkResult[], formatNames: readonly string[]) {
  const totalToonTokens = datasets.reduce((sum, r) => {
    const toon = r.formats.find(f => f.name === 'toon')!
    return sum + toon.tokens
  }, 0)

  const totals = formatNames.map((formatName) => {
    const totalTokens = datasets.reduce((sum, r) => {
      const format = r.formats.find(f => f.name === formatName)
      return sum + (format?.tokens || 0)
    }, 0)
    const savings = totalTokens - totalToonTokens
    const savingsPercent = (savings / totalTokens) * 100
    return { name: formatName, tokens: totalTokens, savingsPercent }
  })

  return { totalToonTokens, totals }
}

/**
 * Generate total lines for a track
 */
function generateTotalLines(
  totalToonTokens: number,
  totals: { name: string, tokens: number, savingsPercent: number }[],
  baselineFormat?: { name: string, tokens: number },
) {
  const lines: string[] = ['Total ']

  if (baselineFormat) {
    // Flat-only track with CSV baseline
    const csvPercentage = Math.min(100, (baselineFormat.tokens / totalToonTokens) * 100)
    const csvBar = createProgressBar(csvPercentage, 100, PROGRESS_BAR_WIDTH, PROGRESS_BAR_CONFIG)
    const csvStr = baselineFormat.tokens.toLocaleString('en-US').padStart(TOKEN_PADDING)
    lines.push(`csv ${csvBar} ${csvStr} tokens`)

    const overheadPercent = ((totalToonTokens - baselineFormat.tokens) / totalToonTokens) * 100
    const toonBar = createProgressBar(100, 100, PROGRESS_BAR_WIDTH, PROGRESS_BAR_CONFIG)
    const toonStr = totalToonTokens.toLocaleString('en-US').padStart(TOKEN_PADDING)
    lines.push(`toon ${toonBar} ${toonStr} tokens (+${overheadPercent.toFixed(1)}% vs CSV)`)
  }
  else {
    // Mixed-structure track
    const totalPercentage = Math.min(100, (totalToonTokens / totals[0]!.tokens) * 100)
    const totalBar = createProgressBar(totalPercentage, 100, PROGRESS_BAR_WIDTH, PROGRESS_BAR_CONFIG)
    const toonStr = totalToonTokens.toLocaleString('en-US').padStart(TOKEN_PADDING)
    lines.push(`toon ${totalBar} ${toonStr} tokens`)
  }

  // Add comparison lines
  for (const format of totals) {
    lines.push(formatComparisonLine({
      name: format.name,
      tokens: format.tokens,
      savings: 0, // Not used in this context
      savingsPercent: format.savingsPercent,
    }))
  }

  return lines.join('\n')
}

/**
 * Generate bar chart for a dataset
 */
function generateDatasetChart(result: BenchmarkResult): string {
  const { dataset, formats } = result
  const toon = formats.find(f => f.name === 'toon')!
  const jsonPretty = formats.find(f => f.name === 'json-pretty')!

  const emoji = DATASET_ICONS[dataset.name] || DEFAULT_DATASET_ICON
  const eligibility = dataset.metadata.tabularEligibility
  const name = `${dataset.description} [eligibility: ${eligibility}%]`
  const percentage = Math.min(100, 100 - jsonPretty.savingsPercent)
  const bar = createProgressBar(percentage, 100, PROGRESS_BAR_WIDTH, PROGRESS_BAR_CONFIG)
  const toonStr = toon.tokens.toLocaleString('en-US')

  const line1 = `${emoji} ${name.padEnd(LABEL_PADDING)}\ntoon ${bar} ${toonStr.padStart(TOKEN_PADDING)} tokens`

  const comparisonLines = COMPARISON_FORMAT_ORDER.map((formatName) => {
    const format = formats.find(f => f.name === formatName)
    if (!format)
      return null

    return formatComparisonLine(format)
  }).filter(Boolean)

  return [line1, ...comparisonLines].join('\n')
}

const results: BenchmarkResult[] = []
const totalTokensByFormat: Record<string, number> = {}

for (const example of BENCHMARK_EXAMPLES) {
  const data = example.getData()

  // Calculate tokens for each format
// Calculate token counts for all datasets
for (const dataset of TOKEN_EFFICIENCY_DATASETS) {
  const formatMetrics: FormatMetrics[] = []
  const tokensByFormat: Record<string, number> = {}

  // Calculate tokens for each format
  for (const [formatName, formatter] of Object.entries(formatters)) {
    const formattedString = formatter(data)
    // Skip CSV for datasets that don't support it
    if (formatName === 'csv' && !supportsCSV(dataset))
      continue

    const formattedString = formatter(dataset.data)
    const tokens = tokenize(formattedString)
    tokensByFormat[formatName] = tokens
    totalTokensByFormat[formatName] = (totalTokensByFormat[formatName] || 0) + tokens
  }

  // Calculate savings vs TOON
@@ -80,105 +185,126 @@ for (const example of BENCHMARK_EXAMPLES) {
  }

  results.push({
    name: example.name,
    emoji: example.emoji,
    description: example.description,
    data,
    dataset,
    formats: formatMetrics,
    showDetailed: example.showDetailed,
  })
}

// Calculate total savings percentages
const totalToonTokens = totalTokensByFormat.toon!
const totalSavingsPercent: Record<string, number> = {}
for (const [formatName, totalTokens] of Object.entries(totalTokensByFormat)) {
  if (formatName === 'toon') {
    totalSavingsPercent[formatName] = 0
  }
  else {
    const savings = totalTokens - totalToonTokens
    totalSavingsPercent[formatName] = (savings / totalTokens) * 100
  }
}
// Separate datasets by CSV support
const mixedStructureDatasets = results.filter(r => !supportsCSV(r.dataset))
const flatOnlyDatasets = results.filter(r => supportsCSV(r.dataset))

// Generate ASCII bar chart visualization (stacked compact format)
const formatOrder = ['json-pretty', 'json-compact', 'yaml', 'xml']
const datasetRows = results
// Mixed-Structure Track (no CSV)
const mixedCharts = mixedStructureDatasets
  .map(result => generateDatasetChart(result))
  .join('\n\n')

// Flat-Only Track (with CSV)
const flatCharts = flatOnlyDatasets
  .map((result) => {
    const csv = result.formats.find(f => f.name === 'csv')
    const toon = result.formats.find(f => f.name === 'toon')!
    const percentage = result.formats.find(f => f.name === 'json-pretty')!.savingsPercent
    const bar = createProgressBar(100 - percentage, 100) // Invert to show TOON tokens

    if (!csv)
      return generateDatasetChart(result)

    // Special handling to show CSV first with TOON overhead
    const { dataset } = result
    const emoji = DATASET_ICONS[dataset.name] || DEFAULT_DATASET_ICON
    const eligibility = dataset.metadata.tabularEligibility
    const name = `${dataset.description} [eligibility: ${eligibility}%]`

    // CSV line
    const csvPercentage = Math.min(100, (csv.tokens / toon.tokens) * 100)
    const csvBar = createProgressBar(csvPercentage, 100, PROGRESS_BAR_WIDTH, PROGRESS_BAR_CONFIG)
    const csvStr = csv.tokens.toLocaleString('en-US')

    const line1 = `${emoji} ${name.padEnd(LABEL_PADDING)}\ncsv ${csvBar} ${csvStr.padStart(TOKEN_PADDING)} tokens`

    // TOON line with overhead vs CSV
    const toonOverhead = toon.tokens - csv.tokens
    const toonOverheadPercent = (toonOverhead / toon.tokens) * 100
    const toonBar = createProgressBar(100, 100, PROGRESS_BAR_WIDTH, PROGRESS_BAR_CONFIG)
    const toonStr = toon.tokens.toLocaleString('en-US')
    const toonVsCSV = toonOverheadPercent >= 0
      ? `(+${toonOverheadPercent.toFixed(1)}% vs CSV)`
      : `(${toonOverheadPercent.toFixed(1)}% vs CSV)`
    const toonLine = `toon ${toonBar} ${toonStr.padStart(TOKEN_PADDING)} tokens ${toonVsCSV}`

    const line1 = `${result.emoji} ${result.name.padEnd(25)} ${bar} ${toonStr.padStart(6)} tokens`
    // Other format comparisons (vs TOON)
    const comparisonLines = COMPARISON_FORMAT_ORDER.map((formatName) => {
      const format = result.formats.find(f => f.name === formatName)
      if (!format)
        return null

    const comparisonLines = formatOrder.map((formatName) => {
      const format = result.formats.find(f => f.name === formatName)!
      const label = FORMATTER_DISPLAY_NAMES[formatName] || formatName.toUpperCase()
      const signedPercent = format.savingsPercent >= 0
        ? `−${format.savingsPercent.toFixed(1)}%`
        : `+${Math.abs(format.savingsPercent).toFixed(1)}%`
      const labelWithSavings = `vs ${label} (${signedPercent})`.padEnd(27)
      const tokenStr = format.tokens.toLocaleString('en-US').padStart(6)
      return ` ${labelWithSavings}${tokenStr}`
    })
      return formatComparisonLine(format)
    }).filter(Boolean)

    return [line1, ...comparisonLines].join('\n')
    return [line1, toonLine, ...comparisonLines].join('\n')
  })
  .join('\n\n')

// Add separator and totals row
const separator = '─────────────────────────────────────────────────────────────────────'
// Calculate totals for mixed structure
const { totalToonTokens: totalToonTokensMixed, totals: mixedTotals } = calculateTotalMetrics(mixedStructureDatasets, COMPARISON_FORMAT_ORDER)
const mixedTotalLines = generateTotalLines(totalToonTokensMixed, mixedTotals)

// Calculate bar for totals (TOON vs average of comparison formats)
const comparisonTokens = formatOrder.map(name => totalTokensByFormat[name]!)
const averageComparisonTokens = comparisonTokens.reduce((a, b) => a + b, 0) / comparisonTokens.length
const totalPercentage = (totalToonTokens / averageComparisonTokens) * 100
const totalBar = createProgressBar(totalPercentage, 100)
// Calculate totals for flat-only
const { totalToonTokens: totalToonTokensFlat, totals: flatTotals } = calculateTotalMetrics(flatOnlyDatasets, COMPARISON_FORMAT_ORDER)
const totalCSVTokensFlat = flatOnlyDatasets.reduce((sum, r) => {
  const csv = r.formats.find(f => f.name === 'csv')
  return sum + (csv?.tokens || 0)
}, 0)
const flatTotalLines = generateTotalLines(totalToonTokensFlat, flatTotals, { name: 'csv', tokens: totalCSVTokensFlat })

const totalLine1 = `Total ${totalBar} ${totalToonTokens.toLocaleString('en-US').padStart(6)} tokens`
const barChartSection = `
## Mixed-Structure Track

const totalComparisonLines = formatOrder.map((formatName) => {
  const label = FORMATTER_DISPLAY_NAMES[formatName] || formatName.toUpperCase()
  const tokens = totalTokensByFormat[formatName]!
  const percent = totalSavingsPercent[formatName]!
  const signedPercent = percent >= 0 ? `−${percent.toFixed(1)}%` : `+${Math.abs(percent).toFixed(1)}%`
  const labelWithSavings = `vs ${label} (${signedPercent})`.padEnd(27)
  const tokenStr = tokens.toLocaleString('en-US').padStart(6)
  return ` ${labelWithSavings}${tokenStr}`
})
Datasets with nested or semi-uniform structures. CSV excluded as it cannot properly represent these structures.

const barChartSection = `${datasetRows}\n\n${separator}\n${totalLine1}\n${totalComparisonLines.join('\n')}`
\`\`\`
${mixedCharts}

// Generate detailed examples (only for selected examples)
// Note: Large datasets are truncated for display readability in the report.
// Token counts are calculated from the full datasets, not the truncated versions.
${SEPARATOR}
${mixedTotalLines}
\`\`\`

## Flat-Only Track

Datasets with flat tabular structures where CSV is applicable.

\`\`\`
${flatCharts}

${SEPARATOR}
${flatTotalLines}
\`\`\`
`.trim()

// Generate detailed examples (optional: show a few examples)
const detailedExamples = results
  .filter(result => result.showDetailed)
  .filter(r => DETAILED_EXAMPLE_DATASETS.includes(r.dataset.name as any))
  .map((result, i, filtered) => {
    // Truncate large datasets for display
    let displayData = result.data
    if (result.name === 'GitHub Repositories') {
    let displayData = result.dataset.data

    // Truncate for display
    if (result.dataset.name === 'github') {
      displayData = {
        repositories: result.data.repositories.slice(0, 3).map((repo: Record<string, any>) => ({
        repositories: displayData.repositories.slice(0, GITHUB_REPO_LIMIT).map((repo: Record<string, any>) => ({
          ...repo,
          description: repo.description?.slice(0, 80) + (repo.description?.length > 80 ? '…' : ''),
|
||||
description: repo.description?.slice(0, GITHUB_DESC_LIMIT) + (repo.description?.length > GITHUB_DESC_LIMIT ? '…' : ''),
|
||||
})),
|
||||
}
|
||||
}
|
||||
else if (result.name === 'Daily Analytics') {
|
||||
displayData = { metrics: result.data.metrics.slice(0, 5) }
|
||||
else if (result.dataset.name === 'analytics') {
|
||||
displayData = { metrics: displayData.metrics.slice(0, ANALYTICS_METRICS_LIMIT) }
|
||||
}
|
||||
|
||||
const separator = i < filtered.length - 1 ? '\n\n---' : ''
|
||||
|
||||
const emoji = DATASET_ICONS[result.dataset.name] || DEFAULT_DATASET_ICON
|
||||
const json = result.formats.find(f => f.name === 'json-pretty')!
|
||||
const toon = result.formats.find(f => f.name === 'toon')!
|
||||
|
||||
return `#### ${result.emoji} ${result.name}
|
||||
|
||||
**Configuration:** ${result.description}
|
||||
return `#### ${emoji} ${result.dataset.description}
|
||||
|
||||
**Savings:** ${json.savings.toLocaleString('en-US')} tokens (${json.savingsPercent.toFixed(1)}% reduction vs JSON)
|
||||
|
||||
@@ -197,9 +323,7 @@ ${encode(displayData)}
|
||||
.join('\n\n')
|
||||
|
||||
const markdown = `
|
||||
\`\`\`
|
||||
${barChartSection}
|
||||
\`\`\`
|
||||
|
||||
<details>
|
||||
<summary><strong>View detailed examples</strong></summary>
|
||||
@@ -209,7 +333,7 @@ ${detailedExamples}
|
||||
</details>
|
||||
`.trimStart()
|
||||
|
||||
prompts.log.message(`${barChartSection}\n`)
|
||||
prompts.log.message(barChartSection)
|
||||
|
||||
const resultsDir = path.join(BENCHMARKS_DIR, 'results')
|
||||
await ensureDir(resultsDir)
|
||||
|
||||
@@ -8,7 +8,7 @@ export const BENCHMARKS_DIR: string = url.fileURLToPath(new URL('../', import.me
|
||||
* Model-specific RPM (requests per minute) limits to handle API quotas
|
||||
*
|
||||
* @remarks
|
||||
* Set `undefined` for models without specific limits
|
||||
* Set `undefined` for models without specific limits.
|
||||
*/
|
||||
/// keep-sorted
|
||||
export const MODEL_RPM_LIMITS: Record<string, number | undefined> = {
|
||||
@@ -39,7 +39,7 @@ export const FORMATTER_DISPLAY_NAMES: Record<string, string> = {
|
||||
* Enable dry run mode for quick testing with limited AI requests
|
||||
*
|
||||
* @remarks
|
||||
* Set via environment variable: `DRY_RUN=true`
|
||||
* Set via environment variable: `DRY_RUN=true`.
|
||||
*/
|
||||
export const DRY_RUN: boolean = process.env.DRY_RUN === 'true'
|
||||
|
||||
@@ -123,4 +123,14 @@ export const QUESTION_LIMITS = {
|
||||
aggregationBranches: 2,
|
||||
filteringStarsAndForks: 8,
|
||||
},
|
||||
eventLogs: {
|
||||
fieldRetrieval: 10,
|
||||
aggregationEndpoints: 3,
|
||||
filteringLevelAndStatus: 2,
|
||||
filteringEndpointAndStatus: 2,
|
||||
},
|
||||
nestedConfig: {
|
||||
fieldRetrieval: 5,
|
||||
filteringComplex: 2,
|
||||
},
|
||||
} as const
|
||||
|
||||
@@ -5,6 +5,67 @@ import githubRepos from '../data/github-repos.json' with { type: 'json' }
|
||||
// Seed for reproducibility
|
||||
faker.seed(12345)
|
||||
|
||||
/**
|
||||
* Calculate the tabular eligibility percentage of a data structure
|
||||
*
|
||||
* @remarks
|
||||
* Recursively analyzes data to determine what percentage of arrays qualify
|
||||
* for TOON's tabular format (uniform objects with primitive values only).
|
||||
*/
|
||||
export function calculateTabularEligibility(data: unknown): number {
|
||||
let totalArrays = 0
|
||||
let tabularArrays = 0
|
||||
|
||||
function isTabularArray(arr: unknown[]): boolean {
|
||||
if (arr.length === 0)
|
||||
return false
|
||||
|
||||
// Check if all elements are objects
|
||||
if (!arr.every(item => typeof item === 'object' && item !== null && !Array.isArray(item)))
|
||||
return false
|
||||
|
||||
// Get keys from first object
|
||||
const firstKeys = Object.keys(arr[0] as Record<string, unknown>)
|
||||
if (firstKeys.length === 0)
|
||||
return false
|
||||
|
||||
// Check if all objects have the same keys and only primitive values
|
||||
return arr.every((item) => {
|
||||
const itemObj = item as Record<string, unknown>
|
||||
const itemKeys = Object.keys(itemObj)
|
||||
if (itemKeys.length !== firstKeys.length)
|
||||
return false
|
||||
if (!firstKeys.every(key => itemKeys.includes(key)))
|
||||
return false
|
||||
|
||||
// Check if all values are primitives (no nested objects or arrays)
|
||||
return firstKeys.every((key) => {
|
||||
const value = itemObj[key]
|
||||
return value === null || ['string', 'number', 'boolean'].includes(typeof value)
|
||||
})
|
||||
})
|
||||
}
|
||||
|
||||
function traverse(obj: unknown): void {
|
||||
if (Array.isArray(obj)) {
|
||||
totalArrays++
|
||||
if (isTabularArray(obj))
|
||||
tabularArrays++
|
||||
|
||||
// Continue traversing array elements
|
||||
obj.forEach(item => traverse(item))
|
||||
}
|
||||
else if (typeof obj === 'object' && obj !== null) {
|
||||
// Traverse object properties
|
||||
Object.values(obj).forEach(value => traverse(value))
|
||||
}
|
||||
}
|
||||
|
||||
traverse(data)
|
||||
|
||||
return totalArrays === 0 ? 0 : Math.round((tabularArrays / totalArrays) * 100)
|
||||
}
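The eligibility rule introduced above can be exercised in isolation. The sketch below re-implements just the array-level check as a standalone function (an illustrative re-implementation under the rules described in the docblock, not an import of the benchmark module): an array is "tabular" only when every element is a plain object with the same keys and primitive-only values.

```typescript
// Standalone sketch of the tabular-eligibility rule (assumed behavior for
// illustration): uniform keys + primitive values => tabular.
function isTabular(arr: unknown[]): boolean {
  if (arr.length === 0)
    return false
  // Every element must be a plain (non-array) object
  if (!arr.every(x => typeof x === 'object' && x !== null && !Array.isArray(x)))
    return false
  const keys = Object.keys(arr[0] as Record<string, unknown>)
  if (keys.length === 0)
    return false
  return arr.every((x) => {
    const obj = x as Record<string, unknown>
    const objKeys = Object.keys(obj)
    return objKeys.length === keys.length
      && keys.every(key => objKeys.includes(key))
      // Values must be primitives — no nested objects or arrays
      && keys.every(key => obj[key] === null || ['string', 'number', 'boolean'].includes(typeof obj[key]))
  })
}

console.log(isTabular([{ id: 1, name: 'a' }, { id: 2, name: 'b' }])) // true
console.log(isTabular([{ id: 1 }, { id: 2, extra: true }]))          // false
```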

/**
* Employee record structure for tabular dataset
*/
@@ -73,6 +134,78 @@ export interface Repository {
pushedAt: string
}

/**
* Event log structure for semi-uniform dataset
*/
export interface EventLog {
timestamp: string
level: 'info' | 'warn' | 'error'
endpoint: string
statusCode: number
responseTime: number
userId: number
error?: {
message: string
stack: string
retryable: boolean
}
}

/**
* Nested configuration structure for deeply nested dataset
*/
export interface NestedConfig {
environment: string
version: string
database: {
host: string
port: number
name: string
pool: {
min: number
max: number
idleTimeout: number
}
replicas: {
host: string
port: number
priority: number
}[]
}
features: Record<string, {
enabled: boolean
rollout: number
variants: {
name: string
weight: number
config: Record<string, any>
}[]
}>
authentication: {
providers: {
name: string
clientId: string
scopes: string[]
config: Record<string, any>
}[]
session: {
secret: string
duration: number
refreshThreshold: number
}
}
permissions: {
roles: Record<string, {
permissions: string[]
inherits: string[]
}>
groups: Record<string, {
members: string[]
roles: string[]
}>
}
}

/**
* Generate analytics time-series data
*/
@@ -108,17 +241,13 @@ export function generateAnalyticsData(days: number, startDate = '2025-01-01'): {
}

/**
* Tabular dataset: 100 uniform employee records
*
* @remarks
* Tests TOON's tabular array format
* Generate employee data (uniform tabular structure)
*/
const departments: readonly string[] = ['Engineering', 'Sales', 'Marketing', 'HR', 'Operations', 'Finance'] as const
const tabularDataset: Dataset = {
name: 'tabular',
description: 'Uniform employee records (TOON optimal format)',
data: {
employees: Array.from({ length: 100 }, (_, i): Employee => {

function generateEmployees(count: number): { employees: Employee[] } {
return {
employees: Array.from({ length: count }, (_, i): Employee => {
const yearsExp = faker.number.int({ min: 1, max: 25 })
return {
id: i + 1,
@@ -130,72 +259,132 @@ const tabularDataset: Dataset = {
active: faker.datatype.boolean(0.8), // 80% active
}
}),
}
}

/**
* Tabular dataset: Uniform employee records
*
* @remarks
* Tests TOON's tabular array format.
*/
const tabularDataset: Dataset = {
name: 'tabular',
description: 'Uniform employee records (TOON optimal format)',
data: generateEmployees(100),
metadata: {
supportsCSV: true,
structureClass: 'uniform',
tabularEligibility: 100,
},
}

/**
* Nested dataset: 50 e-commerce orders with nested structures
*
* @remarks
* Tests TOON's handling of complex nested objects
* Generate e-commerce orders (nested structure)
*/
const productNames: readonly string[] = ['Wireless Mouse', 'USB Cable', 'Laptop Stand', 'Keyboard', 'Webcam', 'Headphones', 'Monitor', 'Desk Lamp'] as const
const statuses: readonly string[] = ['pending', 'processing', 'shipped', 'delivered', 'cancelled'] as const
const PRODUCT_NAMES = ['Wireless Mouse', 'USB Cable', 'Laptop Stand', 'Keyboard', 'Webcam', 'Headphones', 'Monitor', 'Desk Lamp'] as const
const ORDER_STATUSES = ['pending', 'processing', 'shipped', 'delivered', 'cancelled'] as const

const nestedDataset: Dataset = {
name: 'nested',
description: 'E-commerce orders with nested structures',
data: {
orders: Array.from({ length: 50 }, (_, i) => {
const customerId = (i % 20) + 1
const itemCount = faker.number.int({ min: 1, max: 4 })
const ORDER_CONSTANTS = {
CUSTOMER_ID_MOD: 20,
MIN_ITEMS: 1,
MAX_ITEMS: 4,
MIN_ITEM_PRICE: 9.99,
MAX_ITEM_PRICE: 199.99,
MIN_ITEM_QUANTITY: 1,
MAX_ITEM_QUANTITY: 5,
SKU_LENGTH: 6,
ORDER_ID_PADDING: 4,
RECENT_DAYS: 90,
TAX_RATE: 0.08,
} as const

function generateOrders(count: number): { orders: Order[] } {
return {
orders: Array.from({ length: count }, (_, i) => {
const customerId = (i % ORDER_CONSTANTS.CUSTOMER_ID_MOD) + 1
const itemCount = faker.number.int({ min: ORDER_CONSTANTS.MIN_ITEMS, max: ORDER_CONSTANTS.MAX_ITEMS })

const items = Array.from({ length: itemCount }, (_, j) => {
const price = faker.number.float({ min: 9.99, max: 199.99, fractionDigits: 2 })
const quantity = faker.number.int({ min: 1, max: 5 })
const price = faker.number.float({
min: ORDER_CONSTANTS.MIN_ITEM_PRICE,
max: ORDER_CONSTANTS.MAX_ITEM_PRICE,
fractionDigits: 2,
})
const quantity = faker.number.int({
min: ORDER_CONSTANTS.MIN_ITEM_QUANTITY,
max: ORDER_CONSTANTS.MAX_ITEM_QUANTITY,
})
return {
sku: `SKU-${faker.string.alphanumeric({ length: 6 }).toUpperCase()}`,
name: productNames[j % productNames.length]!,
sku: `SKU-${faker.string.alphanumeric({ length: ORDER_CONSTANTS.SKU_LENGTH }).toUpperCase()}`,
name: PRODUCT_NAMES[j % PRODUCT_NAMES.length]!,
quantity,
price,
}
})

const total = Number(items.reduce((sum, item) => sum + (item.price * item.quantity), 0).toFixed(2))
const subtotal = Number(items.reduce((sum, item) => sum + (item.price * item.quantity), 0).toFixed(2))
const tax = Number((subtotal * ORDER_CONSTANTS.TAX_RATE).toFixed(2))
const total = Number((subtotal + tax).toFixed(2))

return {
orderId: `ORD-${String(i + 1).padStart(4, '0')}`,
orderId: `ORD-${String(i + 1).padStart(ORDER_CONSTANTS.ORDER_ID_PADDING, '0')}`,
customer: {
id: customerId,
name: faker.person.fullName(),
email: faker.internet.email().toLowerCase(),
phone: faker.phone.number(),
},
items,
subtotal,
tax,
total,
status: statuses[i % statuses.length]!,
orderDate: faker.date.recent({ days: 90 }).toISOString().split('T')[0],
status: ORDER_STATUSES[i % ORDER_STATUSES.length]!,
orderDate: faker.date.recent({ days: ORDER_CONSTANTS.RECENT_DAYS }).toISOString().split('T')[0],
}
}),
}
}
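The subtotal/tax/total arithmetic introduced in the order generator above can be checked by hand. A minimal sketch with illustrative figures (the 8% rate mirrors `ORDER_CONSTANTS.TAX_RATE`; the item prices are made up for the example):

```typescript
// Sketch of the order-total arithmetic: subtotal = Σ price × quantity,
// tax = 8% of subtotal, each value rounded to 2 decimals via toFixed(2).
const items = [
  { price: 19.99, quantity: 2 },
  { price: 4.5, quantity: 1 },
]
const subtotal = Number(items.reduce((s, i) => s + i.price * i.quantity, 0).toFixed(2)) // 44.48
const tax = Number((subtotal * 0.08).toFixed(2))                                        // 3.56
const total = Number((subtotal + tax).toFixed(2))                                       // 48.04
console.log(subtotal, tax, total)
```

Rounding the subtotal before applying the tax rate keeps the three stored figures mutually consistent (`total === subtotal + tax` after rounding), which the earlier single-`total` version could not guarantee.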

/**
* Nested dataset: E-commerce orders with nested structures
*
* @remarks
* Tests TOON's handling of complex nested objects.
*/
const nestedDataset: Dataset = {
name: 'nested',
description: 'E-commerce orders with nested structures',
data: generateOrders(50),
metadata: {
supportsCSV: false,
structureClass: 'nested',
tabularEligibility: 33, // orders array is not tabular, but items arrays within are
},
}

/**
* Analytics dataset: 60 days of time-series metrics
* Analytics dataset: Time-series metrics
*
* @remarks
* Tests TOON's handling of numeric data and date fields
* Tests TOON's handling of numeric data and date fields.
*/
const analyticsDataset: Dataset = {
name: 'analytics',
description: 'Time-series analytics data',
data: generateAnalyticsData(60),
metadata: {
supportsCSV: true,
structureClass: 'uniform',
tabularEligibility: 100,
},
}

/**
* Real-world dataset: Top 100 starred GitHub repositories
*
* @remarks
* Tests TOON's tabular format
* Tests TOON's tabular format with real data.
*/
const githubDataset: Dataset = {
name: 'github',
@@ -203,13 +392,18 @@ const githubDataset: Dataset = {
data: {
repositories: githubRepos,
},
metadata: {
supportsCSV: true,
structureClass: 'uniform',
tabularEligibility: 100,
},
}

/**
* Generate a single e-commerce order with nested structure
*
* @remarks
* Used for token efficiency benchmarks
* Used for token efficiency benchmarks.
*/
export function generateOrderData(): Order {
return {
@@ -235,11 +429,257 @@ export function generateOrderData(): Order {
}

/**
* All datasets used in the benchmark
* Generate event logs (semi-uniform structure)
*
* @remarks
* Approximately 50% of logs include nested error objects, 50% are flat.
* This creates ~45% tabular eligibility.
*/
export const datasets: Dataset[] = [
tabularDataset,
nestedDataset,
analyticsDataset,
githubDataset,
export function generateEventLogs(count: number): { logs: EventLog[] } {
const endpoints = ['/api/users', '/api/orders', '/api/products', '/api/auth', '/api/payments']
const levels = ['info', 'warn', 'error'] as const

return {
logs: Array.from({ length: count }, () => {
const level = faker.helpers.arrayElement(levels)
const hasError = level === 'error' || (level === 'warn' && faker.datatype.boolean(0.3))

const log: EventLog = {
timestamp: faker.date.recent({ days: 7 }).toISOString(),
level,
endpoint: faker.helpers.arrayElement(endpoints),
statusCode: hasError
? faker.number.int({ min: 400, max: 599 })
: faker.number.int({ min: 200, max: 299 }),
responseTime: faker.number.int({ min: 10, max: 5000 }),
userId: faker.number.int({ min: 1000, max: 9999 }),
}

if (hasError) {
log.error = {
message: faker.helpers.arrayElement([
'Database connection timeout',
'Invalid authentication token',
'Resource not found',
'Internal server error',
'Rate limit exceeded',
]),
stack: `Error: ${faker.lorem.sentence()}\n at ${faker.lorem.word()}\n at ${faker.lorem.word()}`,
retryable: faker.datatype.boolean(0.6),
}
}

return log
}),
}
}

/**
* Generate deeply nested configuration
*
* @remarks
* Creates a complex nested structure with minimal tabular eligibility (~0%).
*/
export function generateNestedConfig(): NestedConfig {
return {
environment: faker.helpers.arrayElement(['production', 'staging', 'development']),
version: faker.system.semver(),
database: {
host: faker.internet.domainName(),
port: 5432,
name: faker.database.type(),
pool: {
min: 2,
max: faker.number.int({ min: 10, max: 50 }),
idleTimeout: 30000,
},
replicas: Array.from({ length: 3 }, (_, i) => ({
host: `replica-${i + 1}.${faker.internet.domainName()}`,
port: 5432,
priority: i + 1,
})),
},
features: {
darkMode: {
enabled: faker.datatype.boolean(),
rollout: faker.number.int({ min: 0, max: 100 }),
variants: [
{
name: 'default',
weight: 70,
config: { theme: 'dark', animations: true },
},
{
name: 'minimal',
weight: 30,
config: { theme: 'dark', animations: false },
},
],
},
analytics: {
enabled: faker.datatype.boolean(),
rollout: faker.number.int({ min: 0, max: 100 }),
variants: [
{
name: 'full',
weight: 100,
config: { tracking: 'all', sampling: 1.0 },
},
],
},
},
authentication: {
providers: [
{
name: 'oauth2',
clientId: faker.string.uuid(),
scopes: ['read', 'write', 'admin'],
config: {
authUrl: faker.internet.url(),
tokenUrl: faker.internet.url(),
},
},
{
name: 'saml',
clientId: faker.string.uuid(),
scopes: ['read'],
config: {
entryPoint: faker.internet.url(),
cert: faker.string.alphanumeric({ length: 64 }),
},
},
],
session: {
secret: faker.string.alphanumeric({ length: 32 }),
duration: 86400,
refreshThreshold: 3600,
},
},
permissions: {
roles: {
admin: {
permissions: ['read', 'write', 'delete', 'manage_users', 'manage_roles'],
inherits: [],
},
editor: {
permissions: ['read', 'write'],
inherits: ['viewer'],
},
viewer: {
permissions: ['read'],
inherits: [],
},
},
groups: {
engineering: {
members: Array.from({ length: 5 }, () => faker.internet.email()),
roles: ['admin', 'editor'],
},
support: {
members: Array.from({ length: 3 }, () => faker.internet.email()),
roles: ['viewer'],
},
},
},
}
}

/**
* Event logs dataset: Semi-uniform structure
*
* @remarks
* Tests TOON with semi-uniform data (~50% flat, ~50% with nested errors).
*/
const eventLogsDataset: Dataset = {
name: 'event-logs',
description: 'Semi-uniform event logs',
data: generateEventLogs(75),
metadata: {
supportsCSV: false,
structureClass: 'semi-uniform',
tabularEligibility: 50, // ~50% of logs have nested error objects
},
}

/**
* Nested config dataset: Deeply nested structure
*
* @remarks
* Tests TOON's worst-case scenario with deeply nested configuration.
*/
const nestedConfigDataset: Dataset = {
name: 'nested-config',
description: 'Deeply nested configuration',
data: generateNestedConfig(),
metadata: {
supportsCSV: false,
structureClass: 'deep',
tabularEligibility: 0, // Highly nested, minimal tabular arrays
},
}

/**
* Datasets for accuracy benchmarks (smaller sizes for faster evaluation)
*/
export const ACCURACY_DATASETS: Dataset[] = [
tabularDataset, // 100 employees
nestedDataset, // 50 orders
analyticsDataset, // 60 days
githubDataset, // 100 repos
eventLogsDataset, // 75 logs
nestedConfigDataset, // 1 config
]

/**
* Datasets for token efficiency benchmarks (larger sizes to amplify token differences)
*/
export const TOKEN_EFFICIENCY_DATASETS: Dataset[] = [
// Tabular: 2000 employees
{
name: 'tabular',
description: 'Uniform employee records (TOON optimal format)',
data: generateEmployees(2000),
metadata: {
supportsCSV: true,
structureClass: 'uniform',
tabularEligibility: 100,
},
},
// Nested: 500 orders
{
name: 'nested',
description: 'E-commerce orders with nested structures',
data: generateOrders(500),
metadata: {
supportsCSV: false,
structureClass: 'nested',
tabularEligibility: 33,
},
},
// Analytics: 365 days
{
name: 'analytics',
description: 'Time-series analytics data',
data: generateAnalyticsData(365),
metadata: {
supportsCSV: true,
structureClass: 'uniform',
tabularEligibility: 100,
},
},
// GitHub: 100 repos (same as accuracy)
githubDataset,
// Event logs: 2000 logs
{
name: 'event-logs',
description: 'Semi-uniform event logs',
data: generateEventLogs(2000),
metadata: {
supportsCSV: false,
structureClass: 'semi-uniform',
tabularEligibility: 50,
},
},
// Nested config: 1 config (same as accuracy)
nestedConfigDataset,
]

@@ -1,3 +1,4 @@
import type { Dataset } from './types'
import { stringify as stringifyCSV } from 'csv-stringify/sync'
import { XMLBuilder } from 'fast-xml-parser'
import { stringify as stringifyYAML } from 'yaml'
@@ -75,3 +76,14 @@ function toXML(data: unknown): string {

return builder.build(data)
}

/**
* Check if a dataset supports CSV format
*
* @remarks
* CSV is only suitable for flat tabular data. Datasets with nested structures
* should not be compared using CSV as it cannot properly represent the data.
*/
export function supportsCSV(dataset: Dataset): boolean {
return dataset.metadata.supportsCSV
}

@@ -1,711 +0,0 @@
/**
* Question generation for TOON benchmarks
*
* Generates ~150-160 questions across different question types and datasets:
* - Field Retrieval: Direct field access with no computation
* Examples: "What is X's salary?", "What is the status of order Y?"
* - Aggregation: Counts, sums, averages, min/max operations (including single-condition filters)
* Examples: "How many X?", "What is the total/average?", "How many X > threshold?"
* - Filtering: Multi-condition queries requiring complex logical operations
* Examples: "How many X WHERE condition1 AND condition2?"
*/

import type { AnalyticsMetric, Employee, Order, Repository } from './datasets'
import type { Question } from './types'
import { QUESTION_LIMITS, QUESTION_THRESHOLDS } from './constants'
import { datasets } from './datasets'

/**
* Generate all questions from datasets
*/
export function generateQuestions(): Question[] {
const questions: Question[] = []
let idCounter = 1

// Get datasets with proper typing
const tabular = (datasets.find(d => d.name === 'tabular')?.data.employees as Employee[]) ?? []
const nested = (datasets.find(d => d.name === 'nested')?.data.orders as Order[]) ?? []
const analytics = (datasets.find(d => d.name === 'analytics')?.data.metrics as AnalyticsMetric[]) ?? []
const github = (datasets.find(d => d.name === 'github')?.data.repositories as Repository[]) ?? []

if (tabular.length > 0) {
// Field retrieval: specific employees
for (let i = 0; i < Math.min(QUESTION_LIMITS.tabular.fieldRetrieval, tabular.length); i++) {
const emp = tabular[i * 2] || tabular[i]
if (!emp)
continue

// Rotate through all field types
if (i % 5 === 0) {
questions.push({
id: `q${idCounter++}`,
prompt: `What is the salary of ${emp.name}?`,
groundTruth: String(emp.salary),
type: 'field-retrieval',
dataset: 'tabular',
})
}
else if (i % 5 === 1) {
questions.push({
id: `q${idCounter++}`,
prompt: `What department does ${emp.name} work in?`,
groundTruth: emp.department,
type: 'field-retrieval',
dataset: 'tabular',
})
}
else if (i % 5 === 2) {
questions.push({
id: `q${idCounter++}`,
prompt: `What is the email address of ${emp.name}?`,
groundTruth: emp.email,
type: 'field-retrieval',
dataset: 'tabular',
})
}
else if (i % 5 === 3) {
questions.push({
id: `q${idCounter++}`,
prompt: `How many years of experience does ${emp.name} have?`,
groundTruth: String(emp.yearsExperience),
type: 'field-retrieval',
dataset: 'tabular',
})
}
else {
questions.push({
id: `q${idCounter++}`,
prompt: `Is ${emp.name} an active employee?`,
groundTruth: emp.active ? 'yes' : 'no',
type: 'field-retrieval',
dataset: 'tabular',
})
}
}

// Aggregation: count by department
const departments = [...new Set(tabular.map(e => e.department))]
for (const dept of departments.slice(0, QUESTION_LIMITS.tabular.aggregationDepartments)) {
const count = tabular.filter(e => e.department === dept).length
questions.push({
id: `q${idCounter++}`,
prompt: `How many employees work in ${dept}?`,
groundTruth: String(count),
type: 'aggregation',
dataset: 'tabular',
})
}

// Aggregation: salary ranges (single-condition filters)
for (const threshold of QUESTION_THRESHOLDS.tabular.salaryRanges) {
const count = tabular.filter(e => e.salary > threshold).length
questions.push({
id: `q${idCounter++}`,
prompt: `How many employees have a salary greater than ${threshold}?`,
groundTruth: String(count),
type: 'aggregation',
dataset: 'tabular',
})
}

// Aggregation: totals and averages
const totalEmployees = tabular.length
const avgSalary = Math.round(tabular.reduce((sum, e) => sum + e.salary, 0) / totalEmployees)
const activeCount = tabular.filter(e => e.active).length
const inactiveCount = tabular.filter(e => !e.active).length

questions.push(
{
id: `q${idCounter++}`,
prompt: 'How many employees are in the dataset?',
groundTruth: String(totalEmployees),
type: 'aggregation',
dataset: 'tabular',
},
{
id: `q${idCounter++}`,
prompt: 'What is the average salary across all employees?',
groundTruth: String(avgSalary),
type: 'aggregation',
dataset: 'tabular',
},
{
id: `q${idCounter++}`,
prompt: 'How many employees are active?',
groundTruth: String(activeCount),
type: 'aggregation',
dataset: 'tabular',
},
{
id: `q${idCounter++}`,
prompt: 'How many employees are inactive?',
groundTruth: String(inactiveCount),
type: 'aggregation',
dataset: 'tabular',
},
)

// Filtering: count by department with salary filter (multi-condition)
for (const dept of departments.slice(0, QUESTION_LIMITS.tabular.filteringMultiConditionDepartments)) {
const count = tabular.filter(e => e.department === dept && e.salary > QUESTION_THRESHOLDS.tabular.departmentSalaryThreshold).length
questions.push({
id: `q${idCounter++}`,
prompt: `How many employees in ${dept} have a salary greater than ${QUESTION_THRESHOLDS.tabular.departmentSalaryThreshold}?`,
groundTruth: String(count),
type: 'filtering',
dataset: 'tabular',
})
}

// Filtering: active employees by experience (multi-condition)
for (const exp of QUESTION_THRESHOLDS.tabular.experienceYears.slice(0, QUESTION_LIMITS.tabular.filteringExperience)) {
const count = tabular.filter(e => e.yearsExperience > exp && e.active).length
questions.push({
id: `q${idCounter++}`,
prompt: `How many active employees have more than ${exp} years of experience?`,
groundTruth: String(count),
type: 'filtering',
dataset: 'tabular',
})
}

// Filtering: department by experience (multi-condition)
for (const dept of departments.slice(0, QUESTION_LIMITS.tabular.filteringDepartmentExp)) {
const count = tabular.filter(e => e.department === dept && e.yearsExperience > QUESTION_THRESHOLDS.tabular.departmentExperienceThreshold).length
questions.push({
id: `q${idCounter++}`,
prompt: `How many employees in ${dept} have more than ${QUESTION_THRESHOLDS.tabular.departmentExperienceThreshold} years of experience?`,
groundTruth: String(count),
type: 'filtering',
dataset: 'tabular',
})
}

// Filtering: department by active status (multi-condition)
for (const dept of departments.slice(0, QUESTION_LIMITS.tabular.filteringDepartmentActive)) {
const count = tabular.filter(e => e.department === dept && e.active).length
questions.push({
id: `q${idCounter++}`,
prompt: `How many active employees work in ${dept}?`,
groundTruth: String(count),
type: 'filtering',
dataset: 'tabular',
})
}
}

if (nested.length > 0) {
// Field retrieval: order totals and statuses
for (let i = 0; i < Math.min(QUESTION_LIMITS.nested.fieldRetrievalOrders, nested.length); i++) {
const order = nested[i * 2] || nested[i]
if (!order)
continue

if (i % 2 === 0) {
questions.push({
id: `q${idCounter++}`,
prompt: `What is the total for order ${order.orderId}?`,
groundTruth: String(order.total),
type: 'field-retrieval',
dataset: 'nested',
})
}
else {
questions.push({
id: `q${idCounter++}`,
prompt: `What is the status of order ${order.orderId}?`,
groundTruth: order.status,
type: 'field-retrieval',
dataset: 'nested',
})
}
}

// Field retrieval: customer info and order dates (expanded)
for (let i = 0; i < Math.min(QUESTION_LIMITS.nested.fieldRetrievalCustomers, nested.length); i++) {
const order = nested[i * 2 + 1] || nested[i]
if (!order)
continue

if (i % 4 === 0) {
questions.push({
id: `q${idCounter++}`,
prompt: `What is the customer name for order ${order.orderId}?`,
groundTruth: order.customer.name,
type: 'field-retrieval',
dataset: 'nested',
})
}
else if (i % 4 === 1) {
questions.push({
id: `q${idCounter++}`,
prompt: `What is the customer email for order ${order.orderId}?`,
groundTruth: order.customer.email,
type: 'field-retrieval',
dataset: 'nested',
})
}
else if (i % 4 === 2) {
questions.push({
id: `q${idCounter++}`,
prompt: `What is the order date for order ${order.orderId}?`,
groundTruth: order.orderDate || '',
type: 'field-retrieval',
dataset: 'nested',
})
}
else {
questions.push({
|
||||
id: `q${idCounter++}`,
|
||||
prompt: `How many items are in order ${order.orderId}?`,
|
||||
groundTruth: String(order.items.length),
|
||||
type: 'field-retrieval',
|
||||
dataset: 'nested',
|
||||
})
|
||||
}
|
||||
}
|
||||
|
||||
// Aggregation: totals and averages
|
||||
const totalRevenue = nested.reduce((sum, o) => sum + o.total, 0)
|
||||
const avgOrderValue = totalRevenue / nested.length
|
||||
const totalOrders = nested.length
|
||||
const maxOrderValue = Math.max(...nested.map(o => o.total))
|
||||
|
||||
// Count by status
|
||||
const statuses = [...new Set(nested.map(o => o.status))]
|
||||
for (const status of statuses.slice(0, QUESTION_LIMITS.nested.aggregationStatuses)) {
|
||||
const count = nested.filter(o => o.status === status).length
|
||||
questions.push({
|
||||
id: `q${idCounter++}`,
|
||||
prompt: `How many orders have status "${status}"?`,
|
||||
groundTruth: String(count),
|
||||
type: 'aggregation',
|
||||
dataset: 'nested',
|
||||
})
|
||||
}
|
||||
|
||||
questions.push(
|
||||
{
|
||||
id: `q${idCounter++}`,
|
||||
prompt: 'What is the total revenue across all orders?',
|
||||
groundTruth: String(totalRevenue.toFixed(2)),
|
||||
type: 'aggregation',
|
||||
dataset: 'nested',
|
||||
},
|
||||
{
|
||||
id: `q${idCounter++}`,
|
||||
prompt: 'What is the average order value?',
|
||||
groundTruth: String(avgOrderValue.toFixed(2)),
|
||||
type: 'aggregation',
|
||||
dataset: 'nested',
|
||||
},
|
||||
{
|
||||
id: `q${idCounter++}`,
|
||||
prompt: 'How many orders are in the dataset?',
|
||||
groundTruth: String(totalOrders),
|
||||
type: 'aggregation',
|
||||
dataset: 'nested',
|
||||
},
|
||||
{
|
||||
id: `q${idCounter++}`,
|
||||
prompt: 'What is the highest order total?',
|
||||
groundTruth: String(maxOrderValue.toFixed(2)),
|
||||
type: 'aggregation',
|
||||
dataset: 'nested',
|
||||
},
|
||||
)
|
||||
|
||||
// Aggregation: high-value orders (single-condition filter)
|
||||
for (const threshold of QUESTION_THRESHOLDS.nested.highValueOrders) {
|
||||
const count = nested.filter(o => o.total > threshold).length
|
||||
questions.push({
|
||||
id: `q${idCounter++}`,
|
||||
prompt: `How many orders have a total greater than ${threshold}?`,
|
||||
groundTruth: String(count),
|
||||
type: 'aggregation',
|
||||
dataset: 'nested',
|
||||
})
|
||||
}
|
||||
|
||||
// Filtering: multi-condition queries (status AND value)
|
||||
const orderStatuses = [...new Set(nested.map(o => o.status))]
|
||||
for (const status of orderStatuses.slice(0, QUESTION_LIMITS.nested.filteringStatusAndValue)) {
|
||||
const count = nested.filter(o => o.status === status && o.total > QUESTION_THRESHOLDS.nested.statusValueThreshold).length
|
||||
questions.push({
|
||||
id: `q${idCounter++}`,
|
||||
prompt: `How many orders have status "${status}" and total greater than ${QUESTION_THRESHOLDS.nested.statusValueThreshold}?`,
|
||||
groundTruth: String(count),
|
||||
type: 'filtering',
|
||||
dataset: 'nested',
|
||||
})
|
||||
}
|
||||
|
||||
// Filtering: status AND items count (multi-condition)
|
||||
for (const status of orderStatuses.slice(0, QUESTION_LIMITS.nested.filteringStatusAndItems)) {
|
||||
const count = nested.filter(o => o.status === status && o.items.length >= QUESTION_THRESHOLDS.nested.itemCountThreshold).length
|
||||
questions.push({
|
||||
id: `q${idCounter++}`,
|
||||
prompt: `How many orders have status "${status}" and at least ${QUESTION_THRESHOLDS.nested.itemCountThreshold} items?`,
|
||||
groundTruth: String(count),
|
||||
type: 'filtering',
|
||||
dataset: 'nested',
|
||||
})
|
||||
}
|
||||
|
||||
// Filtering: total AND items count (multi-condition)
|
||||
for (const threshold of QUESTION_THRESHOLDS.nested.totalThresholdsForItems) {
|
||||
const count = nested.filter(o => o.total > threshold && o.items.length >= QUESTION_THRESHOLDS.nested.itemCountThreshold).length
|
||||
questions.push({
|
||||
id: `q${idCounter++}`,
|
||||
prompt: `How many orders have a total greater than ${threshold} and at least ${QUESTION_THRESHOLDS.nested.itemCountThreshold} items?`,
|
||||
groundTruth: String(count),
|
||||
type: 'filtering',
|
||||
dataset: 'nested',
|
||||
})
|
||||
}
|
||||
}
|
||||
|
||||
if (analytics.length > 0) {
|
||||
// Field retrieval: specific dates (expanded with all metrics)
|
||||
for (let i = 0; i < Math.min(QUESTION_LIMITS.analytics.fieldRetrievalDates, analytics.length); i++) {
|
||||
const metric = analytics[i * 3] || analytics[i]
|
||||
if (!metric)
|
||||
continue
|
||||
|
||||
if (i % 5 === 0) {
|
||||
questions.push({
|
||||
id: `q${idCounter++}`,
|
||||
prompt: `How many views were recorded on ${metric.date}?`,
|
||||
groundTruth: String(metric.views),
|
||||
type: 'field-retrieval',
|
||||
dataset: 'analytics',
|
||||
})
|
||||
}
|
||||
else if (i % 5 === 1) {
|
||||
questions.push({
|
||||
id: `q${idCounter++}`,
|
||||
prompt: `What was the revenue on ${metric.date}?`,
|
||||
groundTruth: String(metric.revenue),
|
||||
type: 'field-retrieval',
|
||||
dataset: 'analytics',
|
||||
})
|
||||
}
|
||||
else if (i % 5 === 2) {
|
||||
questions.push({
|
||||
id: `q${idCounter++}`,
|
||||
prompt: `What was the conversion count on ${metric.date}?`,
|
||||
groundTruth: String(metric.conversions),
|
||||
type: 'field-retrieval',
|
||||
dataset: 'analytics',
|
||||
})
|
||||
}
|
||||
else if (i % 5 === 3) {
|
||||
questions.push({
|
||||
id: `q${idCounter++}`,
|
||||
prompt: `How many clicks were recorded on ${metric.date}?`,
|
||||
groundTruth: String(metric.clicks),
|
||||
type: 'field-retrieval',
|
||||
dataset: 'analytics',
|
||||
})
|
||||
}
|
||||
else {
|
||||
questions.push({
|
||||
id: `q${idCounter++}`,
|
||||
prompt: `What was the bounce rate on ${metric.date}?`,
|
||||
groundTruth: String(metric.bounceRate),
|
||||
type: 'field-retrieval',
|
||||
dataset: 'analytics',
|
||||
})
|
||||
}
|
||||
}
|
||||
|
||||
// Aggregation: totals and averages
|
||||
const totalViews = analytics.reduce((sum, m) => sum + m.views, 0)
|
||||
const totalRevenue = analytics.reduce((sum, m) => sum + m.revenue, 0)
|
||||
const totalConversions = analytics.reduce((sum, m) => sum + m.conversions, 0)
|
||||
const avgViews = Math.round(totalViews / analytics.length)
|
||||
const avgRevenue = totalRevenue / analytics.length
|
||||
const avgConversions = Math.round(totalConversions / analytics.length)
|
||||
|
||||
questions.push(
|
||||
{
|
||||
id: `q${idCounter++}`,
|
||||
prompt: 'What is the total number of views across all dates?',
|
||||
groundTruth: String(totalViews),
|
||||
type: 'aggregation',
|
||||
dataset: 'analytics',
|
||||
},
|
||||
{
|
||||
id: `q${idCounter++}`,
|
||||
prompt: 'What is the total revenue across all dates?',
|
||||
groundTruth: String(totalRevenue.toFixed(2)),
|
||||
type: 'aggregation',
|
||||
dataset: 'analytics',
|
||||
},
|
||||
{
|
||||
id: `q${idCounter++}`,
|
||||
prompt: 'What is the total number of conversions across all dates?',
|
||||
groundTruth: String(totalConversions),
|
||||
type: 'aggregation',
|
||||
dataset: 'analytics',
|
||||
},
|
||||
{
|
||||
id: `q${idCounter++}`,
|
||||
prompt: 'What is the average number of views per day?',
|
||||
groundTruth: String(avgViews),
|
||||
type: 'aggregation',
|
||||
dataset: 'analytics',
|
||||
},
|
||||
{
|
||||
id: `q${idCounter++}`,
|
||||
prompt: 'What is the average revenue per day?',
|
||||
groundTruth: String(avgRevenue.toFixed(2)),
|
||||
type: 'aggregation',
|
||||
dataset: 'analytics',
|
||||
},
|
||||
{
|
||||
id: `q${idCounter++}`,
|
||||
prompt: 'What is the average number of conversions per day?',
|
||||
groundTruth: String(avgConversions),
|
||||
type: 'aggregation',
|
||||
dataset: 'analytics',
|
||||
},
|
||||
{
|
||||
id: `q${idCounter++}`,
|
||||
prompt: 'How many days are included in the analytics data?',
|
||||
groundTruth: String(analytics.length),
|
||||
type: 'aggregation',
|
||||
dataset: 'analytics',
|
||||
},
|
||||
{
|
||||
id: `q${idCounter++}`,
|
||||
prompt: 'What is the highest number of views recorded in a single day?',
|
||||
groundTruth: String(Math.max(...analytics.map(m => m.views))),
|
||||
type: 'aggregation',
|
||||
dataset: 'analytics',
|
||||
},
|
||||
)
|
||||
|
||||
// Aggregation: high-performing days (single-condition filters)
|
||||
for (const threshold of QUESTION_THRESHOLDS.analytics.views) {
|
||||
const count = analytics.filter(m => m.views > threshold).length
|
||||
questions.push({
|
||||
id: `q${idCounter++}`,
|
||||
prompt: `How many days had more than ${threshold} views?`,
|
||||
groundTruth: String(count),
|
||||
type: 'aggregation',
|
||||
dataset: 'analytics',
|
||||
})
|
||||
}
|
||||
|
||||
// Filtering: multi-condition queries (views AND conversions)
|
||||
for (const viewThreshold of QUESTION_THRESHOLDS.analytics.viewsForFiltering) {
|
||||
const count = analytics.filter(m => m.views > viewThreshold && m.conversions > QUESTION_THRESHOLDS.analytics.conversionsForFiltering).length
|
||||
questions.push({
|
||||
id: `q${idCounter++}`,
|
||||
prompt: `How many days had more than ${viewThreshold} views and more than ${QUESTION_THRESHOLDS.analytics.conversionsForFiltering} conversions?`,
|
||||
groundTruth: String(count),
|
||||
type: 'filtering',
|
||||
dataset: 'analytics',
|
||||
})
|
||||
}
|
||||
|
||||
// Filtering: views AND revenue (expanded)
|
||||
for (const revenueThreshold of QUESTION_THRESHOLDS.analytics.revenueThresholds.slice(0, 5)) {
|
||||
const count = analytics.filter(m => m.views > QUESTION_THRESHOLDS.analytics.viewsThresholdForRevenue && m.revenue > revenueThreshold).length
|
||||
questions.push({
|
||||
id: `q${idCounter++}`,
|
||||
prompt: `How many days had more than ${QUESTION_THRESHOLDS.analytics.viewsThresholdForRevenue} views and revenue greater than ${revenueThreshold}?`,
|
||||
groundTruth: String(count),
|
||||
type: 'filtering',
|
||||
dataset: 'analytics',
|
||||
})
|
||||
}
|
||||
|
||||
// Filtering: clicks AND conversions (multi-condition)
|
||||
for (const clickThreshold of QUESTION_THRESHOLDS.analytics.clicksForFiltering) {
|
||||
const count = analytics.filter(m => m.clicks > clickThreshold && m.conversions > QUESTION_THRESHOLDS.analytics.conversionsForClickFiltering).length
|
||||
questions.push({
|
||||
id: `q${idCounter++}`,
|
||||
prompt: `How many days had more than ${clickThreshold} clicks and more than ${QUESTION_THRESHOLDS.analytics.conversionsForClickFiltering} conversions?`,
|
||||
groundTruth: String(count),
|
||||
type: 'filtering',
|
||||
dataset: 'analytics',
|
||||
})
|
||||
}
|
||||
|
||||
// Filtering: revenue AND bounce rate (multi-condition)
|
||||
for (const revenueThreshold of QUESTION_THRESHOLDS.analytics.revenueForBounceRate) {
|
||||
const count = analytics.filter(m => m.revenue > revenueThreshold && m.bounceRate < QUESTION_THRESHOLDS.analytics.bounceRateThreshold).length
|
||||
questions.push({
|
||||
id: `q${idCounter++}`,
|
||||
prompt: `How many days had revenue greater than ${revenueThreshold} and bounce rate less than ${QUESTION_THRESHOLDS.analytics.bounceRateThreshold}?`,
|
||||
groundTruth: String(count),
|
||||
type: 'filtering',
|
||||
dataset: 'analytics',
|
||||
})
|
||||
}
|
||||
}
|
||||
|
||||
if (github.length > 0) {
|
||||
// Helper to extract owner from repo field
|
||||
const getOwner = (repoFullName: string) => repoFullName.split('/')[0]!
|
||||
|
||||
// Field retrieval: specific repos (diverse fields)
|
||||
for (let i = 0; i < Math.min(QUESTION_LIMITS.github.fieldRetrievalRepos, github.length); i++) {
|
||||
const repo = github[i * 7]
|
||||
if (!repo)
|
||||
continue
|
||||
|
||||
if (i % 5 === 0) {
|
||||
questions.push({
|
||||
id: `q${idCounter++}`,
|
||||
prompt: `How many stars does ${repo.repo} have?`,
|
||||
groundTruth: String(repo.stars),
|
||||
type: 'field-retrieval',
|
||||
dataset: 'github',
|
||||
})
|
||||
}
|
||||
else if (i % 5 === 1) {
|
||||
questions.push({
|
||||
id: `q${idCounter++}`,
|
||||
prompt: `How many forks does ${repo.repo} have?`,
|
||||
groundTruth: String(repo.forks),
|
||||
type: 'field-retrieval',
|
||||
dataset: 'github',
|
||||
})
|
||||
}
|
||||
else if (i % 5 === 2) {
|
||||
questions.push({
|
||||
id: `q${idCounter++}`,
|
||||
prompt: `Who is the owner of ${repo.repo}?`,
|
||||
groundTruth: getOwner(repo.repo),
|
||||
type: 'field-retrieval',
|
||||
dataset: 'github',
|
||||
})
|
||||
}
|
||||
else if (i % 5 === 3) {
|
||||
questions.push({
|
||||
id: `q${idCounter++}`,
|
||||
prompt: `What is the default branch of ${repo.repo}?`,
|
||||
groundTruth: repo.defaultBranch,
|
||||
type: 'field-retrieval',
|
||||
dataset: 'github',
|
||||
})
|
||||
}
|
||||
else {
|
||||
questions.push({
|
||||
id: `q${idCounter++}`,
|
||||
prompt: `How many watchers does ${repo.repo} have?`,
|
||||
groundTruth: String(repo.watchers),
|
||||
type: 'field-retrieval',
|
||||
dataset: 'github',
|
||||
})
|
||||
}
|
||||
}
|
||||
|
||||
// Aggregation: popular repositories
|
||||
const totalStars = github.reduce((sum, r) => sum + r.stars, 0)
|
||||
const totalRepos = github.length
|
||||
const avgStars = Math.round(totalStars / totalRepos)
|
||||
|
||||
questions.push(
|
||||
{
|
||||
id: `q${idCounter++}`,
|
||||
prompt: 'What is the total number of stars across all repositories?',
|
||||
groundTruth: String(totalStars),
|
||||
type: 'aggregation',
|
||||
dataset: 'github',
|
||||
},
|
||||
{
|
||||
id: `q${idCounter++}`,
|
||||
prompt: 'How many repositories are in the dataset?',
|
||||
groundTruth: String(totalRepos),
|
||||
type: 'aggregation',
|
||||
dataset: 'github',
|
||||
},
|
||||
{
|
||||
id: `q${idCounter++}`,
|
||||
prompt: 'What is the average number of stars per repository?',
|
||||
groundTruth: String(avgStars),
|
||||
type: 'aggregation',
|
||||
dataset: 'github',
|
||||
},
|
||||
)
|
||||
|
||||
// Aggregation: star thresholds (single-condition filters)
|
||||
for (const threshold of QUESTION_THRESHOLDS.github.stars) {
|
||||
const count = github.filter(r => r.stars > threshold).length
|
||||
questions.push({
|
||||
id: `q${idCounter++}`,
|
||||
prompt: `How many repositories have more than ${threshold} stars?`,
|
||||
groundTruth: String(count),
|
||||
type: 'aggregation',
|
||||
dataset: 'github',
|
||||
})
|
||||
}
|
||||
|
||||
// Aggregation: fork thresholds (single-condition filters)
|
||||
for (const threshold of QUESTION_THRESHOLDS.github.forks) {
|
||||
const count = github.filter(r => r.forks > threshold).length
|
||||
questions.push({
|
||||
id: `q${idCounter++}`,
|
||||
prompt: `How many repositories have more than ${threshold} forks?`,
|
||||
groundTruth: String(count),
|
||||
type: 'aggregation',
|
||||
dataset: 'github',
|
||||
})
|
||||
}
|
||||
|
||||
// Aggregation: watcher thresholds (single-condition filters)
|
||||
for (const threshold of QUESTION_THRESHOLDS.github.watchers) {
|
||||
const count = github.filter(r => r.watchers > threshold).length
|
||||
questions.push({
|
||||
id: `q${idCounter++}`,
|
||||
prompt: `How many repositories have more than ${threshold} watchers?`,
|
||||
groundTruth: String(count),
|
||||
type: 'aggregation',
|
||||
dataset: 'github',
|
||||
})
|
||||
}
|
||||
|
||||
// Aggregation: default branch counts
|
||||
const branches = [...new Set(github.map(r => r.defaultBranch))]
|
||||
for (const branch of branches.slice(0, QUESTION_LIMITS.github.aggregationBranches)) {
|
||||
const count = github.filter(r => r.defaultBranch === branch).length
|
||||
questions.push({
|
||||
id: `q${idCounter++}`,
|
||||
prompt: `How many repositories use "${branch}" as their default branch?`,
|
||||
groundTruth: String(count),
|
||||
type: 'aggregation',
|
||||
dataset: 'github',
|
||||
})
|
||||
}
|
||||
|
||||
// Filtering: multi-condition queries (stars AND forks)
|
||||
for (const combo of QUESTION_THRESHOLDS.github.starForkCombinations.slice(0, QUESTION_LIMITS.github.filteringStarsAndForks)) {
|
||||
const count = github.filter(r => r.stars > combo.stars && r.forks > combo.forks).length
|
||||
questions.push({
|
||||
id: `q${idCounter++}`,
|
||||
prompt: `How many repositories have more than ${combo.stars} stars and more than ${combo.forks} forks?`,
|
||||
groundTruth: String(count),
|
||||
type: 'filtering',
|
||||
dataset: 'github',
|
||||
})
|
||||
}
|
||||
|
||||
// Filtering: stars AND watchers (multi-condition)
|
||||
for (const combo of QUESTION_THRESHOLDS.github.starWatcherCombinations) {
|
||||
const count = github.filter(r => r.stars > combo.stars && r.watchers > combo.watchers).length
|
||||
questions.push({
|
||||
id: `q${idCounter++}`,
|
||||
prompt: `How many repositories have more than ${combo.stars} stars and more than ${combo.watchers} watchers?`,
|
||||
groundTruth: String(count),
|
||||
type: 'filtering',
|
||||
dataset: 'github',
|
||||
})
|
||||
}
|
||||
}
|
||||
|
||||
return questions
|
||||
}
|
||||
196 benchmarks/src/questions/analytics.ts Normal file
@@ -0,0 +1,196 @@
import type { AnalyticsMetric } from '../datasets'
import type { Question } from '../types'
import { QUESTION_LIMITS, QUESTION_THRESHOLDS } from '../constants'
import { countByPredicate, QuestionBuilder, rotateQuestions, SAMPLE_STRIDES } from './utils'

/**
 * Generate analytics (website metrics) questions
 */
export function generateAnalyticsQuestions(metrics: AnalyticsMetric[], getId: () => string): Question[] {
  const questions: Question[] = []

  if (metrics.length === 0)
    return questions

  // Field retrieval: date-based metrics
  const metricFieldGenerators: Array<(metric: AnalyticsMetric, getId: () => string) => Question> = [
    (metric, getId) => new QuestionBuilder()
      .id(getId())
      .prompt(`What are the views for ${metric.date}?`)
      .groundTruth(String(metric.views))
      .type('field-retrieval')
      .dataset('analytics')
      .build(),
    (metric, getId) => new QuestionBuilder()
      .id(getId())
      .prompt(`What is the revenue for ${metric.date}?`)
      .groundTruth(String(metric.revenue))
      .type('field-retrieval')
      .dataset('analytics')
      .build(),
    (metric, getId) => new QuestionBuilder()
      .id(getId())
      .prompt(`What is the bounce rate for ${metric.date}?`)
      .groundTruth(String(metric.bounceRate))
      .type('field-retrieval')
      .dataset('analytics')
      .build(),
    (metric, getId) => new QuestionBuilder()
      .id(getId())
      .prompt(`How many conversions were there on ${metric.date}?`)
      .groundTruth(String(metric.conversions))
      .type('field-retrieval')
      .dataset('analytics')
      .build(),
  ]

  questions.push(...rotateQuestions(
    metrics,
    metricFieldGenerators,
    QUESTION_LIMITS.analytics.fieldRetrievalDates,
    SAMPLE_STRIDES.ANALYTICS_FIELD,
    getId,
  ))

  // Aggregation: basic statistics
  const totalDays = metrics.length
  const totalViews = metrics.reduce((sum, m) => sum + m.views, 0)
  const totalConversions = metrics.reduce((sum, m) => sum + m.conversions, 0)
  const totalRevenue = metrics.reduce((sum, m) => sum + m.revenue, 0)
  const avgBounceRate = metrics.reduce((sum, m) => sum + m.bounceRate, 0) / metrics.length

  questions.push(
    new QuestionBuilder()
      .id(getId())
      .prompt('How many days of data are in the dataset?')
      .groundTruth(String(totalDays))
      .type('aggregation')
      .dataset('analytics')
      .build(),
    new QuestionBuilder()
      .id(getId())
      .prompt('What is the total number of views across all dates?')
      .groundTruth(String(totalViews))
      .type('aggregation')
      .dataset('analytics')
      .build(),
    new QuestionBuilder()
      .id(getId())
      .prompt('What is the total number of conversions across all dates?')
      .groundTruth(String(totalConversions))
      .type('aggregation')
      .dataset('analytics')
      .build(),
    new QuestionBuilder()
      .id(getId())
      .prompt('What is the total revenue across all dates?')
      .groundTruth(String(totalRevenue.toFixed(2)))
      .type('aggregation')
      .dataset('analytics')
      .build(),
    new QuestionBuilder()
      .id(getId())
      .prompt('What is the average bounce rate?')
      .groundTruth(String(avgBounceRate.toFixed(2)))
      .type('aggregation')
      .dataset('analytics')
      .build(),
  )

  // Aggregation: high views/conversions
  for (const threshold of QUESTION_THRESHOLDS.analytics.views) {
    const count = countByPredicate(metrics, m => m.views > threshold)
    questions.push(
      new QuestionBuilder()
        .id(getId())
        .prompt(`How many days had more than ${threshold} views?`)
        .groundTruth(String(count))
        .type('aggregation')
        .dataset('analytics')
        .build(),
    )
  }

  for (const threshold of QUESTION_THRESHOLDS.analytics.conversions) {
    const count = countByPredicate(metrics, m => m.conversions > threshold)
    questions.push(
      new QuestionBuilder()
        .id(getId())
        .prompt(`How many days had more than ${threshold} conversions?`)
        .groundTruth(String(count))
        .type('aggregation')
        .dataset('analytics')
        .build(),
    )
  }

  // Filtering: multi-condition (views AND conversions)
  for (const threshold of QUESTION_THRESHOLDS.analytics.viewsForFiltering) {
    const count = countByPredicate(
      metrics,
      m => m.views > threshold && m.conversions > QUESTION_THRESHOLDS.analytics.conversionsForFiltering,
    )
    questions.push(
      new QuestionBuilder()
        .id(getId())
        .prompt(`How many days had more than ${threshold} views and more than ${QUESTION_THRESHOLDS.analytics.conversionsForFiltering} conversions?`)
        .groundTruth(String(count))
        .type('filtering')
        .dataset('analytics')
        .build(),
    )
  }

  // Filtering: revenue thresholds
  for (const threshold of QUESTION_THRESHOLDS.analytics.revenueThresholds) {
    const count = countByPredicate(
      metrics,
      m => m.revenue > threshold && m.views > QUESTION_THRESHOLDS.analytics.viewsThresholdForRevenue,
    )
    questions.push(
      new QuestionBuilder()
        .id(getId())
        .prompt(`How many days had revenue greater than ${threshold} with views above ${QUESTION_THRESHOLDS.analytics.viewsThresholdForRevenue}?`)
        .groundTruth(String(count))
        .type('filtering')
        .dataset('analytics')
        .build(),
    )
  }

  // Filtering: clicks and conversions
  for (const threshold of QUESTION_THRESHOLDS.analytics.clicksForFiltering) {
    const count = countByPredicate(
      metrics,
      m => m.clicks > threshold && m.conversions > QUESTION_THRESHOLDS.analytics.conversionsForClickFiltering,
    )
    questions.push(
      new QuestionBuilder()
        .id(getId())
        .prompt(`How many days had more than ${threshold} clicks and more than ${QUESTION_THRESHOLDS.analytics.conversionsForClickFiltering} conversions?`)
        .groundTruth(String(count))
        .type('filtering')
        .dataset('analytics')
        .build(),
    )
  }

  // Filtering: revenue and bounce rate
  for (const threshold of QUESTION_THRESHOLDS.analytics.revenueForBounceRate) {
    const count = countByPredicate(
      metrics,
      m => m.revenue > threshold && m.bounceRate < QUESTION_THRESHOLDS.analytics.bounceRateThreshold,
    )
    questions.push(
      new QuestionBuilder()
        .id(getId())
        .prompt(`How many days had revenue greater than ${threshold} with bounce rate below ${QUESTION_THRESHOLDS.analytics.bounceRateThreshold}?`)
        .groundTruth(String(count))
        .type('filtering')
        .dataset('analytics')
        .build(),
    )
  }

  return questions
}
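The refactored generators lean on helpers imported from `./utils` (`QuestionBuilder`, `rotateQuestions`, `countByPredicate`, `SAMPLE_STRIDES`) that are not part of this diff. Below is a minimal sketch of what `rotateQuestions` and `countByPredicate` might look like, inferred purely from the call sites above and from the strided `items[i * stride] || items[i]` pattern the old code used; the actual implementations in the repo may differ.

```typescript
// Hypothetical sketch of the ./utils helpers, inferred from usage in this diff.
interface Question {
  id: string
  prompt: string
  groundTruth: string
  type: string
  dataset: string
}

// Count items matching a predicate (thin wrapper over Array.prototype.filter).
function countByPredicate<T>(items: T[], predicate: (item: T) => boolean): number {
  return items.filter(predicate).length
}

// Walk the dataset with a stride, cycling through the generators so each
// sampled item yields a different question type. Falls back to items[i]
// when the strided index runs past the end, like the old inline loops.
function rotateQuestions<T>(
  items: T[],
  generators: Array<(item: T, getId: () => string) => Question>,
  limit: number,
  stride: number,
  getId: () => string,
): Question[] {
  const questions: Question[] = []
  for (let i = 0; i < Math.min(limit, items.length); i++) {
    const item = items[i * stride] ?? items[i]
    if (!item)
      continue
    const generate = generators[i % generators.length]!
    questions.push(generate(item, getId))
  }
  return questions
}
```

Centralizing the stride/rotation logic this way is what lets each dataset module shrink to a list of per-field generator lambdas, which appears to be the point of the overhaul.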
162 benchmarks/src/questions/event-logs.ts Normal file
@@ -0,0 +1,162 @@
import type { EventLog } from '../datasets'
import type { Question } from '../types'
import { QUESTION_LIMITS } from '../constants'
import { countByPredicate, QuestionBuilder, rotateQuestions, SAMPLE_STRIDES } from './utils'

/**
 * Generate event log questions
 */
export function generateEventLogsQuestions(logs: EventLog[], getId: () => string): Question[] {
  const questions: Question[] = []

  if (logs.length === 0)
    return questions

  // Field retrieval: log metadata
  const logFieldGenerators: Array<(log: EventLog, getId: () => string) => Question> = [
    (log, getId) => new QuestionBuilder()
      .id(getId())
      .prompt(`What is the level of the log at ${log.timestamp}?`)
      .groundTruth(log.level)
      .type('field-retrieval')
      .dataset('event-logs')
      .build(),
    (log, getId) => new QuestionBuilder()
      .id(getId())
      .prompt(`What is the endpoint for the log at ${log.timestamp}?`)
      .groundTruth(log.endpoint)
      .type('field-retrieval')
      .dataset('event-logs')
      .build(),
    (log, getId) => new QuestionBuilder()
      .id(getId())
      .prompt(`What is the status code for the log at ${log.timestamp}?`)
      .groundTruth(String(log.statusCode))
      .type('field-retrieval')
      .dataset('event-logs')
      .build(),
    (log, getId) => new QuestionBuilder()
      .id(getId())
      .prompt(`What is the response time for the log at ${log.timestamp}?`)
      .groundTruth(String(log.responseTime))
      .type('field-retrieval')
      .dataset('event-logs')
      .build(),
  ]

  questions.push(...rotateQuestions(
    logs,
    logFieldGenerators,
    QUESTION_LIMITS.eventLogs.fieldRetrieval,
    SAMPLE_STRIDES.EVENT_LOG_FIELD,
    getId,
  ))

  // Aggregation: basic statistics
  const totalLogs = logs.length
  const avgResponseTime = logs.reduce((sum, l) => sum + l.responseTime, 0) / logs.length

  questions.push(
    new QuestionBuilder()
      .id(getId())
      .prompt('How many log entries are in the dataset?')
      .groundTruth(String(totalLogs))
      .type('aggregation')
      .dataset('event-logs')
      .build(),
    new QuestionBuilder()
      .id(getId())
      .prompt('What is the average response time across all logs?')
      .groundTruth(String(avgResponseTime.toFixed(2)))
      .type('aggregation')
      .dataset('event-logs')
      .build(),
  )

  // Aggregation: by level
  const levels = [...new Set(logs.map(l => l.level))]
  for (const level of levels) {
    const count = countByPredicate(logs, l => l.level === level)
    questions.push(
      new QuestionBuilder()
        .id(getId())
        .prompt(`How many log entries have level "${level}"?`)
        .groundTruth(String(count))
        .type('aggregation')
        .dataset('event-logs')
        .build(),
    )
  }

  // Aggregation: by endpoint
  const endpoints = [...new Set(logs.map(l => l.endpoint))]
  for (const endpoint of endpoints.slice(0, QUESTION_LIMITS.eventLogs.aggregationEndpoints)) {
    const count = countByPredicate(logs, l => l.endpoint === endpoint)
    questions.push(
      new QuestionBuilder()
        .id(getId())
        .prompt(`How many log entries are for endpoint "${endpoint}"?`)
        .groundTruth(String(count))
        .type('aggregation')
        .dataset('event-logs')
        .build(),
    )
  }

  // Aggregation: by status code range
  const errorCount = countByPredicate(logs, l => l.statusCode >= 400)
  const successCount = countByPredicate(logs, l => l.statusCode >= 200 && l.statusCode < 300)

  questions.push(
    new QuestionBuilder()
      .id(getId())
      .prompt('How many log entries have a status code indicating an error (>= 400)?')
      .groundTruth(String(errorCount))
      .type('aggregation')
      .dataset('event-logs')
      .build(),
    new QuestionBuilder()
      .id(getId())
      .prompt('How many log entries have a successful status code (200-299)?')
      .groundTruth(String(successCount))
      .type('aggregation')
      .dataset('event-logs')
      .build(),
  )

  // Filtering: multi-condition (level AND status)
  for (const level of levels.slice(0, QUESTION_LIMITS.eventLogs.filteringLevelAndStatus)) {
    const count = countByPredicate(
      logs,
      l => l.level === level && l.statusCode >= 400,
    )
    questions.push(
      new QuestionBuilder()
        .id(getId())
        .prompt(`How many log entries have level "${level}" and status code >= 400?`)
        .groundTruth(String(count))
        .type('filtering')
        .dataset('event-logs')
        .build(),
    )
  }

  // Filtering: endpoint AND status
  for (const endpoint of endpoints.slice(0, QUESTION_LIMITS.eventLogs.filteringEndpointAndStatus)) {
    const count = countByPredicate(
      logs,
      l => l.endpoint === endpoint && l.statusCode >= 500,
    )
    questions.push(
      new QuestionBuilder()
        .id(getId())
        .prompt(`How many log entries are for endpoint "${endpoint}" with status code >= 500?`)
        .groundTruth(String(count))
        .type('filtering')
        .dataset('event-logs')
        .build(),
    )
  }

  return questions
}
184 benchmarks/src/questions/github.ts Normal file
@@ -0,0 +1,184 @@
import type { Repository } from '../datasets'
import type { Question } from '../types'
import { QUESTION_LIMITS, QUESTION_THRESHOLDS } from '../constants'
import { countByPredicate, QuestionBuilder, rotateQuestions, SAMPLE_STRIDES } from './utils'

/**
 * Generate GitHub repository questions
 */
export function generateGithubQuestions(repos: Repository[], getId: () => string): Question[] {
  const questions: Question[] = []

  if (repos.length === 0)
    return questions

  // Field retrieval: repository metadata
  const repoFieldGenerators: Array<(repo: Repository, getId: () => string) => Question> = [
    (repo, getId) => new QuestionBuilder()
      .id(getId())
      .prompt(`How many stars does ${repo.owner}/${repo.name} have?`)
      .groundTruth(String(repo.stars))
      .type('field-retrieval')
      .dataset('github')
      .build(),
    (repo, getId) => new QuestionBuilder()
      .id(getId())
      .prompt(`How many forks does ${repo.owner}/${repo.name} have?`)
      .groundTruth(String(repo.forks))
      .type('field-retrieval')
      .dataset('github')
      .build(),
    (repo, getId) => new QuestionBuilder()
      .id(getId())
      .prompt(`How many watchers does ${repo.owner}/${repo.name} have?`)
      .groundTruth(String(repo.watchers))
      .type('field-retrieval')
      .dataset('github')
      .build(),
    (repo, getId) => new QuestionBuilder()
      .id(getId())
      .prompt(`What is the main branch of ${repo.owner}/${repo.name}?`)
      .groundTruth(repo.defaultBranch)
      .type('field-retrieval')
      .dataset('github')
      .build(),
  ]

  questions.push(...rotateQuestions(
    repos,
    repoFieldGenerators,
    QUESTION_LIMITS.github.fieldRetrievalRepos,
    SAMPLE_STRIDES.REPO_FIELD,
    getId,
  ))

  // Aggregation: basic statistics
  const totalRepos = repos.length
  const totalStars = repos.reduce((sum, r) => sum + r.stars, 0)
  const totalForks = repos.reduce((sum, r) => sum + r.forks, 0)
  const avgStars = totalStars / totalRepos

  questions.push(
    new QuestionBuilder()
      .id(getId())
      .prompt('How many repositories are in the dataset?')
      .groundTruth(String(totalRepos))
      .type('aggregation')
      .dataset('github')
      .build(),
    new QuestionBuilder()
      .id(getId())
      .prompt('What is the total number of stars across all repositories?')
      .groundTruth(String(totalStars))
      .type('aggregation')
      .dataset('github')
      .build(),
    new QuestionBuilder()
      .id(getId())
      .prompt('What is the total number of forks across all repositories?')
      .groundTruth(String(totalForks))
      .type('aggregation')
      .dataset('github')
      .build(),
    new QuestionBuilder()
      .id(getId())
      .prompt('What is the average number of stars per repository?')
      .groundTruth(String(Math.round(avgStars)))
      .type('aggregation')
      .dataset('github')
      .build(),
  )

  // Aggregation: by default branch
  const branches = [...new Set(repos.map(r => r.defaultBranch))]
  for (const branch of branches.slice(0, QUESTION_LIMITS.github.aggregationBranches)) {
    const count = countByPredicate(repos, r => r.defaultBranch === branch)
    questions.push(
      new QuestionBuilder()
        .id(getId())
        .prompt(`How many repositories use "${branch}" as their default branch?`)
        .groundTruth(String(count))
        .type('aggregation')
        .dataset('github')
        .build(),
    )
  }

  // Aggregation: high star counts
  for (const threshold of QUESTION_THRESHOLDS.github.stars) {
    const count = countByPredicate(repos, r => r.stars > threshold)
    questions.push(
      new QuestionBuilder()
        .id(getId())
        .prompt(`How many repositories have more than ${threshold} stars?`)
        .groundTruth(String(count))
        .type('aggregation')
        .dataset('github')
        .build(),
    )
  }

  // Aggregation: high fork counts
  for (const threshold of QUESTION_THRESHOLDS.github.forks) {
    const count = countByPredicate(repos, r => r.forks > threshold)
    questions.push(
      new QuestionBuilder()
        .id(getId())
        .prompt(`How many repositories have more than ${threshold} forks?`)
        .groundTruth(String(count))
        .type('aggregation')
        .dataset('github')
        .build(),
    )
  }

  // Aggregation: high watcher counts
  for (const threshold of QUESTION_THRESHOLDS.github.watchers) {
    const count = countByPredicate(repos, r => r.watchers > threshold)
    questions.push(
      new QuestionBuilder()
        .id(getId())
        .prompt(`How many repositories have more than ${threshold} watchers?`)
        .groundTruth(String(count))
        .type('aggregation')
        .dataset('github')
        .build(),
    )
  }

  // Filtering: multi-condition (stars AND forks)
  for (const combo of QUESTION_THRESHOLDS.github.starForkCombinations.slice(0, QUESTION_LIMITS.github.filteringStarsAndForks)) {
    const count = countByPredicate(
      repos,
      r => r.stars > combo.stars && r.forks > combo.forks,
    )
    questions.push(
      new QuestionBuilder()
        .id(getId())
        .prompt(`How many repositories have more than ${combo.stars} stars and more than ${combo.forks} forks?`)
        .groundTruth(String(count))
        .type('filtering')
        .dataset('github')
        .build(),
    )
  }

  // Filtering: stars AND watchers
  for (const combo of QUESTION_THRESHOLDS.github.starWatcherCombinations) {
    const count = countByPredicate(
      repos,
      r => r.stars > combo.stars && r.watchers > combo.watchers,
    )
    questions.push(
      new QuestionBuilder()
        .id(getId())
        .prompt(`How many repositories have more than ${combo.stars} stars and more than ${combo.watchers} watchers?`)
        .groundTruth(String(count))
        .type('filtering')
        .dataset('github')
        .build(),
    )
  }

  return questions
}
46 benchmarks/src/questions/index.ts Normal file
@@ -0,0 +1,46 @@
import type { AnalyticsMetric, Employee, EventLog, NestedConfig, Order, Repository } from '../datasets'
import type { Question } from '../types'
import { ACCURACY_DATASETS } from '../datasets'
import { generateAnalyticsQuestions } from './analytics'
import { generateEventLogsQuestions } from './event-logs'
import { generateGithubQuestions } from './github'
import { generateNestedQuestions } from './nested'
import { generateNestedConfigQuestions } from './nested-config'
import { generateTabularQuestions } from './tabular'
import { createIdGenerator } from './utils'

/**
 * Generate all questions from datasets
 *
 * @remarks
 * Generates ~150-160 questions across different question types and datasets:
 * - Field Retrieval: Direct field access with no computation
 *   Examples: "What is X's salary?", "What is the status of order Y?"
 * - Aggregation: Counts, sums, averages, min/max operations (including single-condition filters)
 *   Examples: "How many X?", "What is the total/average?", "How many X > threshold?"
 * - Filtering: Multi-condition queries requiring complex logical operations
 *   Examples: "How many X WHERE condition1 AND condition2?"
 */
export function generateQuestions(): Question[] {
  const questions: Question[] = []
  const idGen = createIdGenerator()
  const getId = () => idGen.next().value

  // Get datasets with proper typing
  const tabular = (ACCURACY_DATASETS.find(d => d.name === 'tabular')?.data.employees as Employee[]) ?? []
  const nested = (ACCURACY_DATASETS.find(d => d.name === 'nested')?.data.orders as Order[]) ?? []
  const analytics = (ACCURACY_DATASETS.find(d => d.name === 'analytics')?.data.metrics as AnalyticsMetric[]) ?? []
  const github = (ACCURACY_DATASETS.find(d => d.name === 'github')?.data.repositories as Repository[]) ?? []
  const eventLogs = (ACCURACY_DATASETS.find(d => d.name === 'event-logs')?.data.logs as EventLog[]) ?? []
  const nestedConfig = ACCURACY_DATASETS.find(d => d.name === 'nested-config')?.data as NestedConfig | undefined

  // Generate questions for each dataset
  questions.push(...generateTabularQuestions(tabular, getId))
  questions.push(...generateNestedQuestions(nested, getId))
  questions.push(...generateAnalyticsQuestions(analytics, getId))
  questions.push(...generateGithubQuestions(github, getId))
  questions.push(...generateEventLogsQuestions(eventLogs, getId))
  questions.push(...generateNestedConfigQuestions(nestedConfig, getId))

  return questions
}
147 benchmarks/src/questions/nested-config.ts Normal file
@@ -0,0 +1,147 @@
import type { NestedConfig } from '../datasets'
import type { Question } from '../types'
import { QUESTION_LIMITS } from '../constants'
import { QuestionBuilder } from './utils'

/**
 * Generate nested configuration questions
 */
export function generateNestedConfigQuestions(config: NestedConfig | undefined, getId: () => string): Question[] {
  const questions: Question[] = []

  if (!config)
    return questions

  // Field retrieval: top-level config values
  const fieldRetrievalQuestions = [
    {
      prompt: 'What is the environment in the configuration?',
      groundTruth: config.environment,
    },
    {
      prompt: 'What is the database host?',
      groundTruth: config.database.host,
    },
    {
      prompt: 'What is the database port?',
      groundTruth: String(config.database.port),
    },
    {
      prompt: 'What is the maximum connection pool size?',
      groundTruth: String(config.database.pool.max),
    },
    {
      prompt: 'What is the session duration?',
      groundTruth: String(config.authentication.session.duration),
    },
  ]

  for (const q of fieldRetrievalQuestions.slice(0, QUESTION_LIMITS.nestedConfig.fieldRetrieval)) {
    questions.push(
      new QuestionBuilder()
        .id(getId())
        .prompt(q.prompt)
        .groundTruth(q.groundTruth)
        .type('field-retrieval')
        .dataset('nested-config')
        .build(),
    )
  }

  // Aggregation: counts of nested structures
  const roleCount = Object.keys(config.permissions.roles).length
  const groupCount = Object.keys(config.permissions.groups).length
  const providerCount = config.authentication.providers.length
  const featureCount = Object.keys(config.features).length
  const replicaCount = config.database.replicas.length

  questions.push(
    new QuestionBuilder()
      .id(getId())
      .prompt('How many roles are defined in permissions?')
      .groundTruth(String(roleCount))
      .type('aggregation')
      .dataset('nested-config')
      .build(),
    new QuestionBuilder()
      .id(getId())
      .prompt('How many groups are defined in permissions?')
      .groundTruth(String(groupCount))
      .type('aggregation')
      .dataset('nested-config')
      .build(),
    new QuestionBuilder()
      .id(getId())
      .prompt('How many authentication providers are configured?')
      .groundTruth(String(providerCount))
      .type('aggregation')
      .dataset('nested-config')
      .build(),
    new QuestionBuilder()
      .id(getId())
      .prompt('How many feature flags are defined?')
      .groundTruth(String(featureCount))
      .type('aggregation')
      .dataset('nested-config')
      .build(),
    new QuestionBuilder()
      .id(getId())
      .prompt('How many database replicas are configured?')
      .groundTruth(String(replicaCount))
      .type('aggregation')
      .dataset('nested-config')
      .build(),
  )

  // Aggregation: feature flag details
  const enabledFeatures = Object.entries(config.features).filter(([_, f]) => f.enabled).length
  questions.push(
    new QuestionBuilder()
      .id(getId())
      .prompt('How many feature flags are enabled?')
      .groundTruth(String(enabledFeatures))
      .type('aggregation')
      .dataset('nested-config')
      .build(),
  )

  // Aggregation: role permissions
  const adminPermissions = config.permissions.roles.admin?.permissions.length ?? 0
  questions.push(
    new QuestionBuilder()
      .id(getId())
      .prompt('How many permissions does the admin role have?')
      .groundTruth(String(adminPermissions))
      .type('aggregation')
      .dataset('nested-config')
      .build(),
  )

  // Filtering: complex multi-condition queries
  const filteringQuestions = [
    {
      prompt: 'How many feature flags are enabled with rollout greater than 50%?',
      groundTruth: String(Object.entries(config.features)
        .filter(([_, f]) => f.enabled && f.rollout > 50).length),
    },
    {
      prompt: 'How many groups have the admin role?',
      groundTruth: String(Object.entries(config.permissions.groups)
        .filter(([_, g]) => g.roles.includes('admin')).length),
    },
  ]

  for (const q of filteringQuestions.slice(0, QUESTION_LIMITS.nestedConfig.filteringComplex)) {
    questions.push(
      new QuestionBuilder()
        .id(getId())
        .prompt(q.prompt)
        .groundTruth(q.groundTruth)
        .type('filtering')
        .dataset('nested-config')
        .build(),
    )
  }

  return questions
}
202 benchmarks/src/questions/nested.ts Normal file
@@ -0,0 +1,202 @@
import type { Order } from '../datasets'
import type { Question } from '../types'
import { QUESTION_LIMITS, QUESTION_THRESHOLDS } from '../constants'
import { countByPredicate, QuestionBuilder, rotateQuestions, SAMPLE_STRIDES } from './utils'

/**
 * Generate nested (orders) questions
 */
export function generateNestedQuestions(orders: Order[], getId: () => string): Question[] {
  const questions: Question[] = []

  if (orders.length === 0)
    return questions

  // Field retrieval: order totals and statuses
  const orderFieldGenerators: Array<(order: Order, getId: () => string) => Question> = [
    (order, getId) => new QuestionBuilder()
      .id(getId())
      .prompt(`What is the total for order ${order.orderId}?`)
      .groundTruth(String(order.total))
      .type('field-retrieval')
      .dataset('nested')
      .build(),
    (order, getId) => new QuestionBuilder()
      .id(getId())
      .prompt(`What is the status of order ${order.orderId}?`)
      .groundTruth(order.status)
      .type('field-retrieval')
      .dataset('nested')
      .build(),
  ]

  questions.push(...rotateQuestions(
    orders,
    orderFieldGenerators,
    QUESTION_LIMITS.nested.fieldRetrievalOrders,
    SAMPLE_STRIDES.ORDER_FIELD,
    getId,
  ))

  // Field retrieval: customer info and order dates
  const customerFieldGenerators: Array<(order: Order, getId: () => string) => Question> = [
    (order, getId) => new QuestionBuilder()
      .id(getId())
      .prompt(`What is the customer name for order ${order.orderId}?`)
      .groundTruth(order.customer.name)
      .type('field-retrieval')
      .dataset('nested')
      .build(),
    (order, getId) => new QuestionBuilder()
      .id(getId())
      .prompt(`What is the customer email for order ${order.orderId}?`)
      .groundTruth(order.customer.email)
      .type('field-retrieval')
      .dataset('nested')
      .build(),
    (order, getId) => new QuestionBuilder()
      .id(getId())
      .prompt(`What is the order date for order ${order.orderId}?`)
      .groundTruth(order.orderDate || '')
      .type('field-retrieval')
      .dataset('nested')
      .build(),
    (order, getId) => new QuestionBuilder()
      .id(getId())
      .prompt(`How many items are in order ${order.orderId}?`)
      .groundTruth(String(order.items.length))
      .type('field-retrieval')
      .dataset('nested')
      .build(),
  ]

  // Use stride + 1 for customer fields to offset from order fields
  const customerOrders = orders.map((_, i) => orders[i * SAMPLE_STRIDES.CUSTOMER_FIELD + 1] || orders[i]).filter(Boolean) as Order[]
  questions.push(...rotateQuestions(
    customerOrders,
    customerFieldGenerators,
    QUESTION_LIMITS.nested.fieldRetrievalCustomers,
    1,
    getId,
  ))

  // Aggregation: totals and averages
  const totalRevenue = orders.reduce((sum, o) => sum + o.total, 0)
  const avgOrderValue = totalRevenue / orders.length
  const totalOrders = orders.length
  const maxOrderValue = Math.max(...orders.map(o => o.total))

  // Count by status
  const statuses = [...new Set(orders.map(o => o.status))]
  for (const status of statuses.slice(0, QUESTION_LIMITS.nested.aggregationStatuses)) {
    const count = countByPredicate(orders, o => o.status === status)
    questions.push(
      new QuestionBuilder()
        .id(getId())
        .prompt(`How many orders have status "${status}"?`)
        .groundTruth(String(count))
        .type('aggregation')
        .dataset('nested')
        .build(),
    )
  }

  questions.push(
    new QuestionBuilder()
      .id(getId())
      .prompt('What is the total revenue across all orders?')
      .groundTruth(String(totalRevenue.toFixed(2)))
      .type('aggregation')
      .dataset('nested')
      .build(),
    new QuestionBuilder()
      .id(getId())
      .prompt('What is the average order value?')
      .groundTruth(String(avgOrderValue.toFixed(2)))
      .type('aggregation')
      .dataset('nested')
      .build(),
    new QuestionBuilder()
      .id(getId())
      .prompt('How many orders are in the dataset?')
      .groundTruth(String(totalOrders))
      .type('aggregation')
      .dataset('nested')
      .build(),
    new QuestionBuilder()
      .id(getId())
      .prompt('What is the highest order total?')
      .groundTruth(String(maxOrderValue.toFixed(2)))
      .type('aggregation')
      .dataset('nested')
      .build(),
  )

  // Aggregation: high-value orders (single-condition filter)
  for (const threshold of QUESTION_THRESHOLDS.nested.highValueOrders) {
    const count = countByPredicate(orders, o => o.total > threshold)
    questions.push(
      new QuestionBuilder()
        .id(getId())
        .prompt(`How many orders have a total greater than ${threshold}?`)
        .groundTruth(String(count))
        .type('aggregation')
        .dataset('nested')
        .build(),
    )
  }

  // Filtering: multi-condition queries (status AND value)
  const orderStatuses = [...new Set(orders.map(o => o.status))]
  for (const status of orderStatuses.slice(0, QUESTION_LIMITS.nested.filteringStatusAndValue)) {
    const count = countByPredicate(
      orders,
      o => o.status === status && o.total > QUESTION_THRESHOLDS.nested.statusValueThreshold,
    )
    questions.push(
      new QuestionBuilder()
        .id(getId())
        .prompt(`How many orders have status "${status}" and total greater than ${QUESTION_THRESHOLDS.nested.statusValueThreshold}?`)
        .groundTruth(String(count))
        .type('filtering')
        .dataset('nested')
        .build(),
    )
  }

  // Filtering: status AND items count (multi-condition)
  for (const status of orderStatuses.slice(0, QUESTION_LIMITS.nested.filteringStatusAndItems)) {
    const count = countByPredicate(
      orders,
      o => o.status === status && o.items.length >= QUESTION_THRESHOLDS.nested.itemCountThreshold,
    )
    questions.push(
      new QuestionBuilder()
        .id(getId())
        .prompt(`How many orders have status "${status}" and at least ${QUESTION_THRESHOLDS.nested.itemCountThreshold} items?`)
        .groundTruth(String(count))
        .type('filtering')
        .dataset('nested')
        .build(),
    )
  }

  // Filtering: total AND items count (multi-condition)
  for (const threshold of QUESTION_THRESHOLDS.nested.totalThresholdsForItems) {
    const count = countByPredicate(
      orders,
      o => o.total > threshold && o.items.length >= QUESTION_THRESHOLDS.nested.itemCountThreshold,
    )
    questions.push(
      new QuestionBuilder()
        .id(getId())
        .prompt(`How many orders have a total greater than ${threshold} and at least ${QUESTION_THRESHOLDS.nested.itemCountThreshold} items?`)
        .groundTruth(String(count))
        .type('filtering')
        .dataset('nested')
        .build(),
    )
  }

  return questions
}
191 benchmarks/src/questions/tabular.ts Normal file
@@ -0,0 +1,191 @@
import type { Employee } from '../datasets'
import type { Question } from '../types'
import { QUESTION_LIMITS, QUESTION_THRESHOLDS } from '../constants'
import { countByPredicate, QuestionBuilder, rotateQuestions, SAMPLE_STRIDES } from './utils'

/**
 * Generate tabular (employee) questions
 */
export function generateTabularQuestions(employees: Employee[], getId: () => string): Question[] {
  const questions: Question[] = []

  if (employees.length === 0)
    return questions

  // Field retrieval: specific employees
  const fieldGenerators: Array<(emp: Employee, getId: () => string) => Question> = [
    (emp, getId) => new QuestionBuilder()
      .id(getId())
      .prompt(`What is the salary of ${emp.name}?`)
      .groundTruth(String(emp.salary))
      .type('field-retrieval')
      .dataset('tabular')
      .build(),
    (emp, getId) => new QuestionBuilder()
      .id(getId())
      .prompt(`What department does ${emp.name} work in?`)
      .groundTruth(emp.department)
      .type('field-retrieval')
      .dataset('tabular')
      .build(),
    (emp, getId) => new QuestionBuilder()
      .id(getId())
      .prompt(`What is the email address of ${emp.name}?`)
      .groundTruth(emp.email)
      .type('field-retrieval')
      .dataset('tabular')
      .build(),
    (emp, getId) => new QuestionBuilder()
      .id(getId())
      .prompt(`How many years of experience does ${emp.name} have?`)
      .groundTruth(String(emp.yearsExperience))
      .type('field-retrieval')
      .dataset('tabular')
      .build(),
    (emp, getId) => new QuestionBuilder()
      .id(getId())
      .prompt(`Is ${emp.name} an active employee?`)
      .groundTruth(emp.active ? 'yes' : 'no')
      .type('field-retrieval')
      .dataset('tabular')
      .build(),
  ]

  questions.push(...rotateQuestions(
    employees,
    fieldGenerators,
    QUESTION_LIMITS.tabular.fieldRetrieval,
    SAMPLE_STRIDES.EMPLOYEE_FIELD,
    getId,
  ))

  // Aggregation: count by department
  const departments = [...new Set(employees.map(e => e.department))]
  for (const dept of departments.slice(0, QUESTION_LIMITS.tabular.aggregationDepartments)) {
    const count = countByPredicate(employees, e => e.department === dept)
    questions.push(
      new QuestionBuilder()
        .id(getId())
        .prompt(`How many employees work in ${dept}?`)
        .groundTruth(String(count))
        .type('aggregation')
        .dataset('tabular')
        .build(),
    )
  }

  // Aggregation: salary ranges (single-condition filters)
  for (const threshold of QUESTION_THRESHOLDS.tabular.salaryRanges) {
    const count = countByPredicate(employees, e => e.salary > threshold)
    questions.push(
      new QuestionBuilder()
        .id(getId())
        .prompt(`How many employees have a salary greater than ${threshold}?`)
        .groundTruth(String(count))
        .type('aggregation')
        .dataset('tabular')
        .build(),
    )
  }

  // Aggregation: totals and averages
  const totalEmployees = employees.length
  const avgSalary = Math.round(employees.reduce((sum, e) => sum + e.salary, 0) / totalEmployees)
  const activeCount = countByPredicate(employees, e => e.active)
  const inactiveCount = countByPredicate(employees, e => !e.active)

  questions.push(
    new QuestionBuilder()
      .id(getId())
      .prompt('How many employees are in the dataset?')
      .groundTruth(String(totalEmployees))
      .type('aggregation')
      .dataset('tabular')
      .build(),
    new QuestionBuilder()
      .id(getId())
      .prompt('What is the average salary across all employees?')
      .groundTruth(String(avgSalary))
      .type('aggregation')
      .dataset('tabular')
      .build(),
    new QuestionBuilder()
      .id(getId())
      .prompt('How many employees are active?')
      .groundTruth(String(activeCount))
      .type('aggregation')
      .dataset('tabular')
      .build(),
    new QuestionBuilder()
      .id(getId())
      .prompt('How many employees are inactive?')
      .groundTruth(String(inactiveCount))
      .type('aggregation')
      .dataset('tabular')
      .build(),
  )

  // Filtering: count by department with salary filter (multi-condition)
  for (const dept of departments.slice(0, QUESTION_LIMITS.tabular.filteringMultiConditionDepartments)) {
    const count = countByPredicate(
      employees,
      e => e.department === dept && e.salary > QUESTION_THRESHOLDS.tabular.departmentSalaryThreshold,
    )
    questions.push(
      new QuestionBuilder()
        .id(getId())
        .prompt(`How many employees in ${dept} have a salary greater than ${QUESTION_THRESHOLDS.tabular.departmentSalaryThreshold}?`)
        .groundTruth(String(count))
        .type('filtering')
        .dataset('tabular')
        .build(),
    )
  }

  // Filtering: active employees by experience (multi-condition)
  for (const exp of QUESTION_THRESHOLDS.tabular.experienceYears.slice(0, QUESTION_LIMITS.tabular.filteringExperience)) {
    const count = countByPredicate(employees, e => e.yearsExperience > exp && e.active)
    questions.push(
      new QuestionBuilder()
        .id(getId())
        .prompt(`How many active employees have more than ${exp} years of experience?`)
        .groundTruth(String(count))
        .type('filtering')
        .dataset('tabular')
        .build(),
    )
  }

  // Filtering: department by experience (multi-condition)
  for (const dept of departments.slice(0, QUESTION_LIMITS.tabular.filteringDepartmentExp)) {
    const count = countByPredicate(
      employees,
      e => e.department === dept && e.yearsExperience > QUESTION_THRESHOLDS.tabular.departmentExperienceThreshold,
    )
    questions.push(
      new QuestionBuilder()
        .id(getId())
        .prompt(`How many employees in ${dept} have more than ${QUESTION_THRESHOLDS.tabular.departmentExperienceThreshold} years of experience?`)
        .groundTruth(String(count))
        .type('filtering')
        .dataset('tabular')
        .build(),
    )
  }

  // Filtering: department by active status (multi-condition)
  for (const dept of departments.slice(0, QUESTION_LIMITS.tabular.filteringDepartmentActive)) {
    const count = countByPredicate(employees, e => e.department === dept && e.active)
    questions.push(
      new QuestionBuilder()
        .id(getId())
        .prompt(`How many active employees work in ${dept}?`)
        .groundTruth(String(count))
        .type('filtering')
        .dataset('tabular')
        .build(),
    )
  }

  return questions
}
95 benchmarks/src/questions/utils.ts Normal file
@@ -0,0 +1,95 @@
import type { Question } from '../types'

// Constants for sampling strides
export const SAMPLE_STRIDES = {
  EMPLOYEE_FIELD: 2,
  ORDER_FIELD: 2,
  CUSTOMER_FIELD: 2,
  ANALYTICS_FIELD: 3,
  METRIC_FIELD: 3,
  REPO_FIELD: 7,
  EVENT_LOG_FIELD: 5,
} as const

/**
 * ID Generator
 */
export function* createIdGenerator(): Generator<string, never, never> {
  let id = 1
  while (true) {
    yield `q${id++}`
  }
}

/**
 * Question Builder class for fluent question creation
 */
export class QuestionBuilder {
  private question: Partial<Question> = {}

  id(id: string): this {
    this.question.id = id
    return this
  }

  prompt(prompt: string): this {
    this.question.prompt = prompt
    return this
  }

  groundTruth(groundTruth: string): this {
    this.question.groundTruth = groundTruth
    return this
  }

  type(type: Question['type']): this {
    this.question.type = type
    return this
  }

  dataset(dataset: Question['dataset']): this {
    this.question.dataset = dataset
    return this
  }

  build(): Question {
    if (!this.question.id || !this.question.prompt || !this.question.groundTruth || !this.question.type || !this.question.dataset) {
      throw new Error('Incomplete question')
    }
    return this.question as Question
  }
}

/**
 * Helper: Count items matching a predicate
 */
export function countByPredicate<T>(items: T[], predicate: (item: T) => boolean): number {
  return items.filter(predicate).length
}

/**
 * Helper: Rotate through question generators
 */
export function rotateQuestions<T>(
  items: T[],
  generators: Array<(item: T, getId: () => string) => Question>,
  limit: number,
  stride: number,
  getId: () => string,
): Question[] {
  const questions: Question[] = []

  for (let i = 0; i < Math.min(limit, items.length); i++) {
    const item = items[i * stride] || items[i]
    if (!item)
      continue

    const generatorIndex = i % generators.length
    const generator = generators[generatorIndex]
    if (generator) {
      questions.push(generator(item, getId))
    }
  }

  return questions
}
@@ -1,7 +1,8 @@
|
||||
import type { EfficiencyRanking, EvaluationResult, FormatResult, Question } from './types'
|
||||
import type { Dataset, EfficiencyRanking, EvaluationResult, FormatResult, Question } from './types'
|
||||
import { FORMATTER_DISPLAY_NAMES } from './constants'
|
||||
import { datasets } from './datasets'
|
||||
import { ACCURACY_DATASETS } from './datasets'
|
||||
import { models } from './evaluate'
|
||||
import { supportsCSV } from './formatters'
|
||||
import { generateQuestions } from './questions'
|
||||
import { createProgressBar, tokenize } from './utils'
|
||||
|
||||
@@ -16,7 +17,11 @@ export function calculateTokenCounts(
  const tokenCounts: Record<string, number> = {}

  for (const [formatName, formatter] of Object.entries(formatters)) {
    for (const dataset of datasets) {
    for (const dataset of ACCURACY_DATASETS) {
      // Skip CSV for datasets that don't support it
      if (formatName === 'csv' && !supportsCSV(dataset))
        continue

      const formatted = formatter(dataset.data)
      const key = `${formatName}-${dataset.name}`
      tokenCounts[key] = tokenize(formatted)
@@ -42,9 +47,9 @@ export function calculateFormatResults(
  const accuracy = correctCount / totalCount

  // Calculate average tokens across all datasets for this format
  const avgTokens = Object.entries(tokenCounts)
  const formatTokenEntries = Object.entries(tokenCounts)
    .filter(([key]) => key.startsWith(`${formatName}-`))
    .reduce((sum, [, tokens]) => sum + tokens, 0) / datasets.length
  const avgTokens = formatTokenEntries.reduce((sum, [, tokens]) => sum + tokens, 0) / formatTokenEntries.length

  const averageLatency = formatResults.reduce((sum, r) => sum + r.latencyMs, 0) / totalCount
@@ -75,6 +80,8 @@ export function generateAccuracyReport(
  return `
Benchmarks test LLM comprehension across different input formats using ${totalQuestions} data retrieval questions on ${modelNames.length} ${modelNames.length === 1 ? 'model' : 'models'}.

${generateDatasetCatalog(ACCURACY_DATASETS)}

#### Efficiency Ranking (Accuracy per 1K Tokens)

${generateEfficiencyRankingReport(formatResults)}
@@ -85,6 +92,38 @@ ${generateDetailedAccuracyReport(formatResults, results, questions, tokenCounts)
`.trimStart()
}

/**
 * Generate dataset catalog section
 */
function generateDatasetCatalog(datasets: Dataset[]): string {
  const rows = datasets.map((dataset) => {
    const csvSupport = supportsCSV(dataset) ? '✓' : '✗'
    const rowCount = Object.values(dataset.data)[0]?.length ?? 1
    const structure = dataset.metadata.structureClass
    const eligibility = `${dataset.metadata.tabularEligibility}%`

    return `| ${dataset.description} | ${rowCount} | ${structure} | ${csvSupport} | ${eligibility} |`
  }).join('\n')

  return `
#### Dataset Catalog

| Dataset | Rows | Structure | CSV Support | Eligibility |
| ------- | ---- | --------- | ----------- | ----------- |
${rows}

**Structure classes:**
- **uniform**: All objects have identical fields with primitive values
- **semi-uniform**: Mix of uniform and non-uniform structures
- **nested**: Objects with nested structures (nested objects or arrays)
- **deep**: Highly nested with minimal tabular eligibility

**CSV Support:** ✓ (supported), ✗ (not supported - would require lossy flattening)

**Eligibility:** Percentage of arrays that qualify for TOON's tabular format (uniform objects with primitive values)
`.trim()
}

/**
 * Generate efficiency ranking report
 */
@@ -168,10 +207,12 @@ function generateDetailedAccuracyReport(
  const filteringPercent = ((filteringCount / totalQuestions) * 100).toFixed(0)

  // Calculate dataset sizes
  const tabularSize = datasets.find(d => d.name === 'tabular')?.data.employees?.length || 0
  const nestedSize = datasets.find(d => d.name === 'nested')?.data.orders?.length || 0
  const analyticsSize = datasets.find(d => d.name === 'analytics')?.data.metrics?.length || 0
  const githubSize = datasets.find(d => d.name === 'github')?.data.repositories?.length || 0
  const tabularSize = ACCURACY_DATASETS.find(d => d.name === 'tabular')?.data.employees?.length || 0
  const nestedSize = ACCURACY_DATASETS.find(d => d.name === 'nested')?.data.orders?.length || 0
  const analyticsSize = ACCURACY_DATASETS.find(d => d.name === 'analytics')?.data.metrics?.length || 0
  const githubSize = ACCURACY_DATASETS.find(d => d.name === 'github')?.data.repositories?.length || 0
  const eventLogsSize = ACCURACY_DATASETS.find(d => d.name === 'event-logs')?.data.logs?.length || 0
  const nestedConfigSize = 1 // Single config object

  // Calculate number of formats and evaluations
  const formatCount = formatResults.length
@@ -208,12 +249,14 @@ This benchmark tests **LLM comprehension and data retrieval accuracy** across di

#### Datasets Tested

Four datasets designed to test different structural patterns (all contain arrays of uniform objects, TOON's optimal format):
Six datasets designed to test different structural patterns:

1. **Tabular** (${tabularSize} employee records): Uniform objects with identical fields – optimal for TOON's tabular format.
2. **Nested** (${nestedSize} e-commerce orders): Complex structures with nested customer objects and item arrays.
3. **Analytics** (${analyticsSize} days of metrics): Time-series data with dates and numeric values.
4. **GitHub** (${githubSize} repositories): Real-world data from top GitHub repos by stars.
5. **Event Logs** (${eventLogsSize} logs): Semi-uniform data with ~50% flat logs and ~50% with nested error objects.
6. **Nested Config** (${nestedConfigSize} configuration): Deeply nested configuration with minimal tabular eligibility.

#### Question Types
@@ -314,7 +357,7 @@ function generateDatasetBreakdown(
  questions: Question[],
  tokenCounts: Record<string, number>,
): string {
  return datasets.map((dataset) => {
  return ACCURACY_DATASETS.map((dataset) => {
    const datasetResults = formatResults.map((fr) => {
      const datasetFormatResults = results.filter(r => r.questionId.includes(dataset.name) || questions.find(q => q.id === r.questionId)?.dataset === dataset.name)
      if (datasetFormatResults.length === 0)
@@ -1,7 +1,14 @@
export interface DatasetMetadata {
  supportsCSV: boolean
  structureClass: 'uniform' | 'semi-uniform' | 'nested' | 'deep'
  tabularEligibility: number
}

export interface Dataset {
  name: string
  description: string
  data: Record<string, any>
  metadata: DatasetMetadata
}

export interface Question {