mirror of https://github.com/voson-wang/toon.git
synced 2026-01-29 23:34:10 +08:00

test(benchmark): overhaul generation

README.md
@@ -60,6 +60,19 @@ For small payloads, JSON/CSV/YAML work fine. TOON's value emerges at scale: when

</details>

<details>
<summary><strong>When NOT to use TOON</strong></summary>

TOON excels with uniform arrays of objects, but there are cases where other formats are better:

- **Deeply nested or non-uniform structures** (tabular eligibility ≈ 0%): compact JSON often uses fewer tokens. Example: complex configuration objects with many nested levels.
- **Semi-uniform arrays** (~40–60% tabular eligibility): token savings diminish. Prefer JSON if your pipelines already rely on it.
- **Flat CSV use cases**: CSV is smaller than TOON for pure tabular data. TOON adds minimal overhead (~5–10%) to provide structure (length markers, field headers, delimiter scoping) that improves LLM reliability.

See [benchmarks](#benchmarks) for concrete comparisons across different data structures.

</details>

## Key Features

- 💸 **Token-efficient:** typically 30–60% fewer tokens than JSON[^1]
@@ -75,14 +88,16 @@ For small payloads, JSON/CSV/YAML work fine. TOON's value emerges at scale: when

> [!TIP]
> Try the interactive [Format Tokenization Playground](https://www.curiouslychase.com/playground/format-tokenization-exploration) to compare token usage across CSV, JSON, YAML, and TOON with your own data.

Benchmarks are organized into two tracks to ensure fair comparisons:

- **Mixed-Structure Track**: datasets with nested or semi-uniform structures (TOON vs JSON, YAML, XML). CSV is excluded because it cannot properly represent these structures.
- **Flat-Only Track**: datasets with flat tabular structures where CSV is applicable (CSV vs TOON vs JSON, YAML, XML).

### Token Efficiency

Token counts are measured using the GPT-5 `o200k_base` tokenizer via [`gpt-tokenizer`](https://github.com/niieani/gpt-tokenizer). Savings are calculated against formatted JSON (2-space indentation) as the primary baseline, with additional comparisons to compact JSON (minified), YAML, and XML. Actual savings vary by model and tokenizer.

The benchmarks test datasets across different structural patterns (uniform, semi-uniform, nested, deeply nested) to show where TOON excels and where other formats may be better.

> [!NOTE]
> CSV/TSV doesn't support nested structures, so it's not included in this comparison. For flat datasets where CSV applies, see token counts and accuracy metrics in the [Retrieval Accuracy](#retrieval-accuracy) section.

<!-- automd:file src="./benchmarks/results/token-efficiency.md" -->
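The savings figures reported against each baseline follow a simple formula. A minimal sketch in plain TypeScript (independent of the repo's own helpers), using the sign convention from the charts — `−` when TOON saves tokens, `+` when it costs more:

```typescript
// Percentage of the baseline's tokens saved by TOON:
// (baseline - toon) / baseline * 100, rendered with '−' for savings
// and '+' for overhead, matching the benchmark charts.
function savingsPercent(baselineTokens: number, toonTokens: number): string {
  const percent = ((baselineTokens - toonTokens) / baselineTokens) * 100;
  return `${percent >= 0 ? '−' : '+'}${Math.abs(percent).toFixed(1)}%`;
}

// Example from the flat-only track: TOON 49,841 tokens vs formatted JSON 126,886.
console.log(savingsPercent(126886, 49841)); // −60.7%
```

The same function reproduces the overhead cases, e.g. `savingsPercent(57979, 58528)` yields `+0.9%` for TOON vs compact JSON on the nested-orders dataset.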
@@ -34,8 +34,8 @@ Results are saved to `results/token-efficiency.md`.

Tests how well LLMs can answer questions about data in different formats (TOON, JSON, JSON compact, XML, YAML, CSV):

1. Generate ~150-160 questions across 6 datasets (CSV only included for datasets with flat/tabular structure)
2. Convert each dataset to all supported formats
3. Query each LLM with formatted data + question
4. Validate answers using `gpt-5-nano` as judge
5. Aggregate metrics and generate report
@@ -1,36 +1,149 @@

## Mixed-Structure Track

Datasets with nested or semi-uniform structures. CSV excluded as it cannot properly represent these structures.

```
🛒 E-commerce orders with nested structures [eligibility: 33%]
toon ▓▓▓▓▓▓▓▓▓▓▓▓░░░░░░░░  58,528 tokens
  vs JSON (−37.9%)              94,207
  vs JSON compact (+0.9%)       57,979
  vs YAML (−17.8%)              71,223
  vs XML (−45.2%)              106,720

🧾 Semi-uniform event logs [eligibility: 50%]
toon ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓░░░ 154,419 tokens
  vs JSON (−15.0%)             181,592
  vs JSON compact (+19.9%)     128,836
  vs YAML (−0.9%)              155,749
  vs XML (−25.1%)              206,271

🧩 Deeply nested configuration [eligibility: 0%]
toon ▓▓▓▓▓▓▓▓▓▓▓▓▓▓░░░░░░     630 tokens
  vs JSON (−31.4%)                 918
  vs JSON compact (+11.9%)         563
  vs YAML (−6.4%)                  673
  vs XML (−37.4%)                1,007

─────────────────────────────────────────────────────────────────────────────────
Total
toon ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓░░░░░ 213,577 tokens
  vs JSON (−22.8%)             276,717
  vs JSON compact (+14.0%)     187,378
  vs YAML (−6.2%)              227,645
  vs XML (−32.0%)              313,998
```

## Flat-Only Track

Datasets with flat tabular structures where CSV is applicable.

```
👥 Uniform employee records (TOON optimal format) [eligibility: 100%]
csv  ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓░  46,968 tokens
toon ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓  49,841 tokens (+5.8% vs CSV)
  vs JSON (−60.7%)             126,886
  vs JSON compact (−36.8%)      78,882
  vs YAML (−50.0%)              99,743
  vs XML (−66.0%)              146,465

📈 Time-series analytics data [eligibility: 100%]
csv  ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓░░   8,382 tokens
toon ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓   9,114 tokens (+8.0% vs CSV)
  vs JSON (−59.0%)              22,244
  vs JSON compact (−35.9%)      14,210
  vs YAML (−49.0%)              17,857
  vs XML (−65.8%)               26,615

⭐ Top 100 GitHub repositories [eligibility: 100%]
csv  ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓░   8,513 tokens
toon ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓   8,745 tokens (+2.7% vs CSV)
  vs JSON (−42.3%)              15,145
  vs JSON compact (−23.7%)      11,455
  vs YAML (−33.4%)              13,129
  vs XML (−48.8%)               17,095

─────────────────────────────────────────────────────────────────────────────────
Total
csv  ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓░  63,863 tokens
toon ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓  67,700 tokens (+5.7% vs CSV)
  vs JSON (−58.8%)             164,275
  vs JSON compact (−35.2%)     104,547
  vs YAML (−48.2%)             130,729
  vs XML (−64.4%)              190,175
```
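Note that the `+X% vs CSV` annotations in the flat-only track are computed relative to TOON's own token count, mirroring the overhead formula in the report generator script. A small sketch of that calculation (plain TypeScript, using the totals above):

```typescript
// TOON's overhead over the CSV baseline, expressed relative to TOON's
// token count, as rendered in the flat-only track totals.
function overheadVsCsv(toonTokens: number, csvTokens: number): string {
  const percent = ((toonTokens - csvTokens) / toonTokens) * 100;
  return `+${percent.toFixed(1)}% vs CSV`;
}

console.log(overheadVsCsv(67700, 63863)); // +5.7% vs CSV
```

The per-dataset annotations follow the same convention, e.g. `overheadVsCsv(49841, 46968)` gives `+5.8% vs CSV` for the employee-records dataset.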
<details>
<summary><strong>View detailed examples</strong></summary>

#### 📈 Time-series analytics data

**Savings:** 13,130 tokens (59.0% reduction vs JSON)

**JSON** (22,244 tokens):

```json
{
  "metrics": [
    {
      "date": "2025-01-01",
      "views": 4324,
      "clicks": 146,
      "conversions": 21,
      "revenue": 3834.57,
      "bounceRate": 0.4
    },
    {
      "date": "2025-01-02",
      "views": 6248,
      "clicks": 407,
      "conversions": 22,
      "revenue": 2936.12,
      "bounceRate": 0.62
    },
    {
      "date": "2025-01-03",
      "views": 7382,
      "clicks": 270,
      "conversions": 24,
      "revenue": 6825.19,
      "bounceRate": 0.7
    },
    {
      "date": "2025-01-04",
      "views": 4586,
      "clicks": 267,
      "conversions": 24,
      "revenue": 2391.11,
      "bounceRate": 0.64
    },
    {
      "date": "2025-01-05",
      "views": 6171,
      "clicks": 227,
      "conversions": 12,
      "revenue": 3430.1,
      "bounceRate": 0.39
    }
  ]
}
```

**TOON** (9,114 tokens):

```
metrics[5]{date,views,clicks,conversions,revenue,bounceRate}:
  2025-01-01,4324,146,21,3834.57,0.4
  2025-01-02,6248,407,22,2936.12,0.62
  2025-01-03,7382,270,24,6825.19,0.7
  2025-01-04,4586,267,24,2391.11,0.64
  2025-01-05,6171,227,12,3430.1,0.39
```

---

#### ⭐ Top 100 GitHub repositories

**Savings:** 6,400 tokens (42.3% reduction vs JSON)

@@ -91,72 +204,4 @@ repositories[3]{id,name,repo,description,createdAt,updatedAt,pushedAt,stars,watc

```
  21737465,awesome,sindresorhus/awesome,😎 Awesome lists about all kinds of interesting topics,"2014-07-11T13:42:37Z","2025-10-28T12:40:21Z","2025-10-27T17:57:31Z",410052,8017,32029,main
```

</details>
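The tabular TOON blocks above (`metrics[5]{...}:` followed by comma-joined rows) can be sketched in a few lines. This is an illustration only, not the `toon` package's actual encoder, which additionally handles quoting, escaping, nested values, and delimiter options:

```typescript
// Minimal sketch of TOON's tabular array form: a header carrying the array
// length and field names, then one indented comma-joined row per object.
// Illustrative only — quoting, escaping, and nested values are omitted.
function encodeTabular(key: string, rows: Record<string, string | number>[]): string {
  const fields = Object.keys(rows[0] ?? {});
  const header = `${key}[${rows.length}]{${fields.join(',')}}:`;
  const body = rows.map(row => `  ${fields.map(f => String(row[f])).join(',')}`);
  return [header, ...body].join('\n');
}

const toon = encodeTabular('metrics', [
  { date: '2025-01-01', views: 4324, clicks: 146 },
  { date: '2025-01-02', views: 6248, clicks: 407 },
]);
console.log(toon);
// metrics[2]{date,views,clicks}:
//   2025-01-01,4324,146
//   2025-01-02,6248,407
```

The explicit row count and field header are what let an LLM verify it has seen every row — the structural overhead the README trades against CSV's smaller size.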
@@ -5,16 +5,83 @@ import process from 'node:process'

```ts
import * as prompts from '@clack/prompts'
import PQueue from 'p-queue'
import { BENCHMARKS_DIR, DEFAULT_CONCURRENCY, DRY_RUN, DRY_RUN_LIMITS, MODEL_RPM_LIMITS, ROOT_DIR } from '../src/constants'
import { ACCURACY_DATASETS } from '../src/datasets'
import { evaluateQuestion, models } from '../src/evaluate'
import { formatters, supportsCSV } from '../src/formatters'
import { generateQuestions } from '../src/questions'
import { calculateFormatResults, calculateTokenCounts, generateAccuracyReport } from '../src/report'
import { getAllModelResults, hasModelResults, saveModelResults } from '../src/storage'
import { ensureDir } from '../src/utils'

// Constants
const PROGRESS_UPDATE_INTERVAL = 10
const RATE_LIMIT_INTERVAL_MS = 60_000

prompts.intro('Retrieval Accuracy Benchmark')

/**
 * Generate evaluation tasks for a model
 */
function generateEvaluationTasks(questions: Question[]): { question: Question, formatName: string }[] {
  const tasks: { question: Question, formatName: string }[] = []

  for (const question of questions) {
    for (const [formatName] of Object.entries(formatters)) {
      // Skip CSV for datasets that don't support it
      const dataset = ACCURACY_DATASETS.find(d => d.name === question.dataset)
      if (formatName === 'csv' && dataset && !supportsCSV(dataset))
        continue

      tasks.push({ question, formatName })
    }
  }

  return tasks
}

/**
 * Check which models already have saved results
 */
async function checkExistingResults(activeModels: typeof models) {
  const existingModelResults: Record<string, boolean> = {}

  for (const model of activeModels) {
    const existingResult = await hasModelResults(model.modelId)
    if (existingResult)
      existingModelResults[model.modelId] = existingResult
  }

  return existingModelResults
}

/**
 * Create a progress updater function
 */
function createProgressUpdater(spinner: ReturnType<typeof prompts.spinner>, total: number) {
  let completed = 0

  return () => {
    completed++
    if (completed % PROGRESS_UPDATE_INTERVAL === 0 || completed === total) {
      const percent = ((completed / total) * 100).toFixed(1)
      spinner.message(`Progress: ${completed}/${total} (${percent}%)`)
    }
  }
}

/**
 * Create a rate-limited queue for model evaluation
 */
function createEvaluationQueue(modelId: string) {
  const rpmLimit = MODEL_RPM_LIMITS[modelId]

  return new PQueue({
    concurrency: DEFAULT_CONCURRENCY,
    intervalCap: rpmLimit ?? Infinity,
    interval: rpmLimit ? RATE_LIMIT_INTERVAL_MS : 0,
  })
}

// Prompt user to select which models to benchmark
const modelChoices = models.map(({ modelId }) => ({
  value: modelId,
```
@@ -37,15 +104,10 @@ const activeModels = models.filter(m => selectedModels.includes(m.modelId))

```ts
prompts.log.info(`Selected ${activeModels.length} model(s): ${activeModels.map(m => m.modelId).join(', ')}`)

// Check which models already have results
const existingModelResults = await checkExistingResults(activeModels)

if (Object.keys(existingModelResults).length > 0) {
  prompts.log.info(`Found existing results for ${Object.keys(existingModelResults).length} model(s)`)
}

if (DRY_RUN) {
```
@@ -75,31 +137,22 @@ for (const model of activeModels) {

```ts
  prompts.log.step(`Running benchmark for ${modelId}`)

  // Generate evaluation tasks for this model
  const tasks = generateEvaluationTasks(questions)

  const total = tasks.length
  const rpmLimit = MODEL_RPM_LIMITS[modelId]
  const queue = createEvaluationQueue(modelId)

  const evalSpinner = prompts.spinner()
  evalSpinner.start(`Running ${total} evaluations (concurrency: ${DEFAULT_CONCURRENCY}, RPM limit: ${rpmLimit ?? 'unlimited'})`)

  const updateProgress = createProgressUpdater(evalSpinner, total)

  // Queue all tasks
  const modelResultPromises = tasks.map(task =>
    queue.add(async () => {
      // Format data on-demand
      const dataset = ACCURACY_DATASETS.find(d => d.name === task.question.dataset)!
      const formatter = formatters[task.formatName]!
      const formattedData = formatter(dataset.data)
```
@@ -111,11 +164,7 @@

```ts
      })

      // Progress update after task completes
      updateProgress()

      return result
    }),
```
@@ -154,5 +203,5 @@ await ensureDir(resultsDir)

```ts
const outputFilePath = path.join(resultsDir, 'retrieval-accuracy.md')
await fsp.writeFile(outputFilePath, accuracyReport)

reportSpinner.stop('Report generation complete!')

prompts.log.info(`Report saved to: \`${path.relative(ROOT_DIR, outputFilePath)}\``)
```
@@ -1,11 +1,11 @@

```ts
import type { Dataset } from '../src/types'
import * as fsp from 'node:fs/promises'
import * as path from 'node:path'
import * as prompts from '@clack/prompts'
import { encode } from '../../packages/toon/src'
import { BENCHMARKS_DIR, FORMATTER_DISPLAY_NAMES, ROOT_DIR } from '../src/constants'
import { TOKEN_EFFICIENCY_DATASETS } from '../src/datasets'
import { formatters, supportsCSV } from '../src/formatters'
import { createProgressBar, ensureDir, tokenize } from '../src/utils'

interface FormatMetrics {
```
@@ -16,55 +16,160 @@ interface FormatMetrics {
|
|||||||
}
|
}
|
||||||
|
|
||||||
interface BenchmarkResult {
|
interface BenchmarkResult {
|
||||||
name: string
|
dataset: Dataset
|
||||||
emoji: string
|
|
||||||
description: string
|
|
||||||
data: Record<string, any>
|
|
||||||
formats: FormatMetrics[]
|
formats: FormatMetrics[]
|
||||||
showDetailed: boolean
|
|
||||||
}
|
}
|
||||||
|
|
||||||
const BENCHMARK_EXAMPLES = [
|
// Constants
|
||||||
{
|
const DATASET_ICONS: Record<string, string> = {
|
||||||
name: 'GitHub Repositories',
|
'tabular': '👥',
|
||||||
emoji: '⭐',
|
'nested': '🛒',
|
||||||
description: 'Top 100 GitHub repositories with stars, forks, and metadata',
|
'analytics': '📈',
|
||||||
getData: () => ({ repositories: githubRepos }),
|
'github': '⭐',
|
||||||
showDetailed: true,
|
'event-logs': '🧾',
|
||||||
},
|
'nested-config': '🧩',
|
||||||
{
|
}
|
||||||
name: 'Daily Analytics',
|
|
||||||
emoji: '📈',
|
const COMPARISON_FORMAT_ORDER = ['json-pretty', 'json-compact', 'yaml', 'xml'] as const
|
||||||
description: '180 days of web metrics (views, clicks, conversions, revenue)',
|
|
||||||
getData: () => generateAnalyticsData(180),
|
const PROGRESS_BAR_CONFIG = { filled: '▓', empty: '░' } as const
|
||||||
showDetailed: true,
|
const PROGRESS_BAR_WIDTH = 20
|
||||||
},
|
const TOKEN_PADDING = 7
|
||||||
{
|
const LABEL_PADDING = 60
|
||||||
name: 'E-Commerce Order',
|
const COMPARISON_LABEL_PADDING = 30
|
||||||
emoji: '🛒',
|
|
||||||
description: 'Single nested order with customer and items',
|
const SEPARATOR = '─────────────────────────────────────────────────────────────────────────────────'
|
||||||
getData: generateOrderData,
|
const DEFAULT_DATASET_ICON = '📊'
|
||||||
showDetailed: false,
|
|
||||||
},
|
const DETAILED_EXAMPLE_DATASETS = ['github', 'analytics'] as const
|
||||||
] as const
|
const GITHUB_REPO_LIMIT = 3
|
||||||
|
const GITHUB_DESC_LIMIT = 80
|
||||||
|
const ANALYTICS_METRICS_LIMIT = 5
|
||||||
|
|
||||||
prompts.intro('Token Efficiency Benchmark')
|
prompts.intro('Token Efficiency Benchmark')
|
||||||
|
|
||||||
|
/**
|
||||||
|
* Format a comparison line showing savings vs TOON
|
||||||
|
*/
|
||||||
|
function formatComparisonLine(format: FormatMetrics): string {
|
||||||
|
const label = FORMATTER_DISPLAY_NAMES[format.name] || format.name.toUpperCase()
|
||||||
|
const signedPercent = format.savingsPercent >= 0
|
||||||
|
? `−${format.savingsPercent.toFixed(1)}%`
|
||||||
|
: `+${Math.abs(format.savingsPercent).toFixed(1)}%`
|
||||||
|
const labelWithSavings = `vs ${label} (${signedPercent})`.padEnd(COMPARISON_LABEL_PADDING)
|
||||||
|
const tokenStr = format.tokens.toLocaleString('en-US').padStart(TOKEN_PADDING)
|
||||||
|
return ` ${labelWithSavings}${tokenStr}`
|
||||||
|
}
|
||||||
|
|
||||||
|
/**
|
||||||
|
* Calculate total tokens and savings for a set of datasets
|
||||||
|
*/
|
||||||
|
function calculateTotalMetrics(datasets: BenchmarkResult[], formatNames: readonly string[]) {
|
||||||
|
const totalToonTokens = datasets.reduce((sum, r) => {
|
||||||
|
const toon = r.formats.find(f => f.name === 'toon')!
|
||||||
|
return sum + toon.tokens
|
||||||
|
}, 0)
|
||||||
|
|
||||||
|
const totals = formatNames.map((formatName) => {
|
||||||
|
const totalTokens = datasets.reduce((sum, r) => {
|
||||||
|
const format = r.formats.find(f => f.name === formatName)
|
||||||
|
return sum + (format?.tokens || 0)
|
||||||
|
}, 0)
|
||||||
|
const savings = totalTokens - totalToonTokens
|
||||||
|
const savingsPercent = (savings / totalTokens) * 100
|
||||||
|
return { name: formatName, tokens: totalTokens, savingsPercent }
|
||||||
|
})
|
||||||
|
|
||||||
|
return { totalToonTokens, totals }
|
||||||
|
}
|
||||||
|
|
||||||
|
/**
|
||||||
|
* Generate total lines for a track
|
||||||
|
*/
|
||||||
|
function generateTotalLines(
|
||||||
|
totalToonTokens: number,
|
||||||
|
totals: { name: string, tokens: number, savingsPercent: number }[],
|
||||||
|
baselineFormat?: { name: string, tokens: number },
|
||||||
|
) {
|
||||||
|
const lines: string[] = ['Total ']
|
||||||
|
|
||||||
|
if (baselineFormat) {
|
||||||
|
// Flat-only track with CSV baseline
|
||||||
|
const csvPercentage = Math.min(100, (baselineFormat.tokens / totalToonTokens) * 100)
|
||||||
|
const csvBar = createProgressBar(csvPercentage, 100, PROGRESS_BAR_WIDTH, PROGRESS_BAR_CONFIG)
|
||||||
|
const csvStr = baselineFormat.tokens.toLocaleString('en-US').padStart(TOKEN_PADDING)
|
||||||
|
lines.push(`csv ${csvBar} ${csvStr} tokens`)
|
||||||
|
|
||||||
|
const overheadPercent = ((totalToonTokens - baselineFormat.tokens) / totalToonTokens) * 100
|
||||||
|
const toonBar = createProgressBar(100, 100, PROGRESS_BAR_WIDTH, PROGRESS_BAR_CONFIG)
|
||||||
|
const toonStr = totalToonTokens.toLocaleString('en-US').padStart(TOKEN_PADDING)
|
||||||
|
lines.push(`toon ${toonBar} ${toonStr} tokens (+${overheadPercent.toFixed(1)}% vs CSV)`)
|
||||||
|
}
|
||||||
|
else {
|
||||||
|
// Mixed-structure track
|
||||||
|
const totalPercentage = Math.min(100, (totalToonTokens / totals[0]!.tokens) * 100)
|
||||||
|
const totalBar = createProgressBar(totalPercentage, 100, PROGRESS_BAR_WIDTH, PROGRESS_BAR_CONFIG)
|
||||||
|
const toonStr = totalToonTokens.toLocaleString('en-US').padStart(TOKEN_PADDING)
|
||||||
|
lines.push(`toon ${totalBar} ${toonStr} tokens`)
|
||||||
|
}
|
||||||
|
|
||||||
|
// Add comparison lines
|
||||||
|
for (const format of totals) {
|
||||||
|
lines.push(formatComparisonLine({
|
||||||
|
name: format.name,
|
||||||
|
tokens: format.tokens,
|
||||||
|
savings: 0, // Not used in this context
|
||||||
|
savingsPercent: format.savingsPercent,
|
||||||
|
}))
|
||||||
|
}
|
||||||
|
|
||||||
|
return lines.join('\n')
|
||||||
|
}
|
||||||
|
|
||||||
|
/**
|
||||||
|
* Generate bar chart for a dataset
|
||||||
|
*/
|
||||||
|
function generateDatasetChart(result: BenchmarkResult): string {
|
||||||
|
const { dataset, formats } = result
|
||||||
|
const toon = formats.find(f => f.name === 'toon')!
|
||||||
|
const jsonPretty = formats.find(f => f.name === 'json-pretty')!
|
||||||
|
|
||||||
|
const emoji = DATASET_ICONS[dataset.name] || DEFAULT_DATASET_ICON
|
||||||
|
const eligibility = dataset.metadata.tabularEligibility
|
||||||
|
const name = `${dataset.description} [eligibility: ${eligibility}%]`
|
||||||
|
const percentage = Math.min(100, 100 - jsonPretty.savingsPercent)
|
||||||
|
const bar = createProgressBar(percentage, 100, PROGRESS_BAR_WIDTH, PROGRESS_BAR_CONFIG)
|
||||||
|
const toonStr = toon.tokens.toLocaleString('en-US')
|
||||||
|
|
||||||
|
const line1 = `${emoji} ${name.padEnd(LABEL_PADDING)}\ntoon ${bar} ${toonStr.padStart(TOKEN_PADDING)} tokens`
|
||||||
|
|
||||||
|
const comparisonLines = COMPARISON_FORMAT_ORDER.map((formatName) => {
|
||||||
|
const format = formats.find(f => f.name === formatName)
|
||||||
|
if (!format)
|
||||||
|
return null
|
||||||
|
|
||||||
|
return formatComparisonLine(format)
|
||||||
|
}).filter(Boolean)
|
||||||
|
|
||||||
|
return [line1, ...comparisonLines].join('\n')
|
||||||
|
}
|
||||||
|
|
||||||
const results: BenchmarkResult[] = []
|
const results: BenchmarkResult[] = []
|
||||||
const totalTokensByFormat: Record<string, number> = {}
|
|
||||||
|
|
||||||
for (const example of BENCHMARK_EXAMPLES) {
|
// Calculate token counts for all datasets
|
||||||
const data = example.getData()
|
for (const dataset of TOKEN_EFFICIENCY_DATASETS) {
|
||||||
|
|
||||||
// Calculate tokens for each format
|
|
||||||
const formatMetrics: FormatMetrics[] = []
|
const formatMetrics: FormatMetrics[] = []
|
||||||
const tokensByFormat: Record<string, number> = {}
|
const tokensByFormat: Record<string, number> = {}
|
||||||
|
|
||||||
|
// Calculate tokens for each format
|
||||||
for (const [formatName, formatter] of Object.entries(formatters)) {
|
for (const [formatName, formatter] of Object.entries(formatters)) {
|
||||||
const formattedString = formatter(data)
|
// Skip CSV for datasets that don't support it
|
||||||
|
if (formatName === 'csv' && !supportsCSV(dataset))
|
||||||
|
continue
|
||||||
|
|
||||||
|
const formattedString = formatter(dataset.data)
|
||||||
const tokens = tokenize(formattedString)
|
const tokens = tokenize(formattedString)
|
||||||
tokensByFormat[formatName] = tokens
|
tokensByFormat[formatName] = tokens
|
||||||
totalTokensByFormat[formatName] = (totalTokensByFormat[formatName] || 0) + tokens
|
|
||||||
}
|
}
|
||||||
|
|
||||||
// Calculate savings vs TOON
|
// Calculate savings vs TOON
|
||||||
@@ -80,105 +185,126 @@ for (const example of BENCHMARK_EXAMPLES) {
  }

  results.push({
    dataset,
    formats: formatMetrics,
  })
}

// Separate datasets by CSV support
const mixedStructureDatasets = results.filter(r => !supportsCSV(r.dataset))
const flatOnlyDatasets = results.filter(r => supportsCSV(r.dataset))

// Mixed-Structure Track (no CSV)
const mixedCharts = mixedStructureDatasets
  .map(result => generateDatasetChart(result))
  .join('\n\n')

// Flat-Only Track (with CSV)
const flatCharts = flatOnlyDatasets
  .map((result) => {
    const csv = result.formats.find(f => f.name === 'csv')
    const toon = result.formats.find(f => f.name === 'toon')!
    if (!csv)
      return generateDatasetChart(result)

    // Special handling to show CSV first with TOON overhead
    const { dataset } = result
    const emoji = DATASET_ICONS[dataset.name] || DEFAULT_DATASET_ICON
    const eligibility = dataset.metadata.tabularEligibility
    const name = `${dataset.description} [eligibility: ${eligibility}%]`

    // CSV line
    const csvPercentage = Math.min(100, (csv.tokens / toon.tokens) * 100)
    const csvBar = createProgressBar(csvPercentage, 100, PROGRESS_BAR_WIDTH, PROGRESS_BAR_CONFIG)
    const csvStr = csv.tokens.toLocaleString('en-US')

    const line1 = `${emoji} ${name.padEnd(LABEL_PADDING)}\ncsv ${csvBar} ${csvStr.padStart(TOKEN_PADDING)} tokens`

    // TOON line with overhead vs CSV
    const toonOverhead = toon.tokens - csv.tokens
    const toonOverheadPercent = (toonOverhead / toon.tokens) * 100
    const toonBar = createProgressBar(100, 100, PROGRESS_BAR_WIDTH, PROGRESS_BAR_CONFIG)
    const toonStr = toon.tokens.toLocaleString('en-US')
    const toonVsCSV = toonOverheadPercent >= 0
      ? `(+${toonOverheadPercent.toFixed(1)}% vs CSV)`
      : `(${toonOverheadPercent.toFixed(1)}% vs CSV)`
    const toonLine = `toon ${toonBar} ${toonStr.padStart(TOKEN_PADDING)} tokens ${toonVsCSV}`

    // Other format comparisons (vs TOON)
    const comparisonLines = COMPARISON_FORMAT_ORDER.map((formatName) => {
      const format = result.formats.find(f => f.name === formatName)
      if (!format)
        return null

      return formatComparisonLine(format)
    }).filter(Boolean)

    return [line1, toonLine, ...comparisonLines].join('\n')
  })
  .join('\n\n')

// Calculate totals for mixed structure
const { totalToonTokens: totalToonTokensMixed, totals: mixedTotals } = calculateTotalMetrics(mixedStructureDatasets, COMPARISON_FORMAT_ORDER)
const mixedTotalLines = generateTotalLines(totalToonTokensMixed, mixedTotals)

// Calculate totals for flat-only
const { totalToonTokens: totalToonTokensFlat, totals: flatTotals } = calculateTotalMetrics(flatOnlyDatasets, COMPARISON_FORMAT_ORDER)
const totalCSVTokensFlat = flatOnlyDatasets.reduce((sum, r) => {
  const csv = r.formats.find(f => f.name === 'csv')
  return sum + (csv?.tokens || 0)
}, 0)
const flatTotalLines = generateTotalLines(totalToonTokensFlat, flatTotals, { name: 'csv', tokens: totalCSVTokensFlat })

const barChartSection = `
## Mixed-Structure Track

Datasets with nested or semi-uniform structures. CSV excluded as it cannot properly represent these structures.

\`\`\`
${mixedCharts}

${SEPARATOR}
${mixedTotalLines}
\`\`\`

## Flat-Only Track

Datasets with flat tabular structures where CSV is applicable.

\`\`\`
${flatCharts}

${SEPARATOR}
${flatTotalLines}
\`\`\`
`.trim()

// Generate detailed examples (optional: show a few examples)
const detailedExamples = results
  .filter(r => DETAILED_EXAMPLE_DATASETS.includes(r.dataset.name as any))
  .map((result, i, filtered) => {
    let displayData = result.dataset.data

    // Truncate for display
    if (result.dataset.name === 'github') {
      displayData = {
        repositories: displayData.repositories.slice(0, GITHUB_REPO_LIMIT).map((repo: Record<string, any>) => ({
          ...repo,
          description: repo.description?.slice(0, GITHUB_DESC_LIMIT) + (repo.description?.length > GITHUB_DESC_LIMIT ? '…' : ''),
        })),
      }
    }
    else if (result.dataset.name === 'analytics') {
      displayData = { metrics: displayData.metrics.slice(0, ANALYTICS_METRICS_LIMIT) }
    }

    const separator = i < filtered.length - 1 ? '\n\n---' : ''
    const emoji = DATASET_ICONS[result.dataset.name] || DEFAULT_DATASET_ICON
    const json = result.formats.find(f => f.name === 'json-pretty')!
    const toon = result.formats.find(f => f.name === 'toon')!

    return `#### ${emoji} ${result.dataset.description}

**Savings:** ${json.savings.toLocaleString('en-US')} tokens (${json.savingsPercent.toFixed(1)}% reduction vs JSON)

@@ -197,9 +323,7 @@ ${encode(displayData)}
  .join('\n\n')

const markdown = `
${barChartSection}

<details>
<summary><strong>View detailed examples</strong></summary>

@@ -209,7 +333,7 @@ ${detailedExamples}
</details>
`.trimStart()

prompts.log.message(barChartSection)

const resultsDir = path.join(BENCHMARKS_DIR, 'results')
await ensureDir(resultsDir)
@@ -8,7 +8,7 @@ export const BENCHMARKS_DIR: string = url.fileURLToPath(new URL('../', import.me
 * Model-specific RPM (requests per minute) limits to handle API quotas
 *
 * @remarks
 * Set `undefined` for models without specific limits.
 */
/// keep-sorted
export const MODEL_RPM_LIMITS: Record<string, number | undefined> = {
@@ -39,7 +39,7 @@ export const FORMATTER_DISPLAY_NAMES: Record<string, string> = {
 * Enable dry run mode for quick testing with limited AI requests
 *
 * @remarks
 * Set via environment variable: `DRY_RUN=true`.
 */
export const DRY_RUN: boolean = process.env.DRY_RUN === 'true'

@@ -123,4 +123,14 @@ export const QUESTION_LIMITS = {
    aggregationBranches: 2,
    filteringStarsAndForks: 8,
  },
  eventLogs: {
    fieldRetrieval: 10,
    aggregationEndpoints: 3,
    filteringLevelAndStatus: 2,
    filteringEndpointAndStatus: 2,
  },
  nestedConfig: {
    fieldRetrieval: 5,
    filteringComplex: 2,
  },
} as const
@@ -5,6 +5,67 @@ import githubRepos from '../data/github-repos.json' with { type: 'json' }
// Seed for reproducibility
faker.seed(12345)

/**
 * Calculate the tabular eligibility percentage of a data structure
 *
 * @remarks
 * Recursively analyzes data to determine what percentage of arrays qualify
 * for TOON's tabular format (uniform objects with primitive values only).
 */
export function calculateTabularEligibility(data: unknown): number {
  let totalArrays = 0
  let tabularArrays = 0

  function isTabularArray(arr: unknown[]): boolean {
    if (arr.length === 0)
      return false

    // Check if all elements are objects
    if (!arr.every(item => typeof item === 'object' && item !== null && !Array.isArray(item)))
      return false

    // Get keys from first object
    const firstKeys = Object.keys(arr[0] as Record<string, unknown>)
    if (firstKeys.length === 0)
      return false

    // Check if all objects have the same keys and only primitive values
    return arr.every((item) => {
      const itemObj = item as Record<string, unknown>
      const itemKeys = Object.keys(itemObj)
      if (itemKeys.length !== firstKeys.length)
        return false
      if (!firstKeys.every(key => itemKeys.includes(key)))
        return false

      // Check if all values are primitives (no nested objects or arrays)
      return firstKeys.every((key) => {
        const value = itemObj[key]
        return value === null || ['string', 'number', 'boolean'].includes(typeof value)
      })
    })
  }

  function traverse(obj: unknown): void {
    if (Array.isArray(obj)) {
      totalArrays++
      if (isTabularArray(obj))
        tabularArrays++

      // Continue traversing array elements
      obj.forEach(item => traverse(item))
    }
    else if (typeof obj === 'object' && obj !== null) {
      // Traverse object properties
      Object.values(obj).forEach(value => traverse(value))
    }
  }

  traverse(data)

  return totalArrays === 0 ? 0 : Math.round((tabularArrays / totalArrays) * 100)
}

/**
 * Employee record structure for tabular dataset
 */
@@ -73,6 +134,78 @@ export interface Repository {
  pushedAt: string
}

/**
 * Event log structure for semi-uniform dataset
 */
export interface EventLog {
  timestamp: string
  level: 'info' | 'warn' | 'error'
  endpoint: string
  statusCode: number
  responseTime: number
  userId: number
  error?: {
    message: string
    stack: string
    retryable: boolean
  }
}

/**
 * Nested configuration structure for deeply nested dataset
 */
export interface NestedConfig {
  environment: string
  version: string
  database: {
    host: string
    port: number
    name: string
    pool: {
      min: number
      max: number
      idleTimeout: number
    }
    replicas: {
      host: string
      port: number
      priority: number
    }[]
  }
  features: Record<string, {
    enabled: boolean
    rollout: number
    variants: {
      name: string
      weight: number
      config: Record<string, any>
    }[]
  }>
  authentication: {
    providers: {
      name: string
      clientId: string
      scopes: string[]
      config: Record<string, any>
    }[]
    session: {
      secret: string
      duration: number
      refreshThreshold: number
    }
  }
  permissions: {
    roles: Record<string, {
      permissions: string[]
      inherits: string[]
    }>
    groups: Record<string, {
      members: string[]
      roles: string[]
    }>
  }
}

/**
 * Generate analytics time-series data
 */
@@ -108,17 +241,13 @@ export function generateAnalyticsData(days: number, startDate = '2025-01-01'): {
}

/**
 * Generate employee data (uniform tabular structure)
 */
const departments: readonly string[] = ['Engineering', 'Sales', 'Marketing', 'HR', 'Operations', 'Finance'] as const
function generateEmployees(count: number): { employees: Employee[] } {
  return {
    employees: Array.from({ length: count }, (_, i): Employee => {
      const yearsExp = faker.number.int({ min: 1, max: 25 })
      return {
        id: i + 1,
@@ -130,72 +259,132 @@ const tabularDataset: Dataset = {
        active: faker.datatype.boolean(0.8), // 80% active
      }
    }),
  }
}

/**
 * Tabular dataset: Uniform employee records
 *
 * @remarks
 * Tests TOON's tabular array format.
 */
const tabularDataset: Dataset = {
  name: 'tabular',
  description: 'Uniform employee records (TOON optimal format)',
  data: generateEmployees(100),
  metadata: {
    supportsCSV: true,
    structureClass: 'uniform',
    tabularEligibility: 100,
  },
}

/**
 * Generate e-commerce orders (nested structure)
 */
const PRODUCT_NAMES = ['Wireless Mouse', 'USB Cable', 'Laptop Stand', 'Keyboard', 'Webcam', 'Headphones', 'Monitor', 'Desk Lamp'] as const
const ORDER_STATUSES = ['pending', 'processing', 'shipped', 'delivered', 'cancelled'] as const

const ORDER_CONSTANTS = {
  CUSTOMER_ID_MOD: 20,
  MIN_ITEMS: 1,
  MAX_ITEMS: 4,
  MIN_ITEM_PRICE: 9.99,
  MAX_ITEM_PRICE: 199.99,
  MIN_ITEM_QUANTITY: 1,
  MAX_ITEM_QUANTITY: 5,
  SKU_LENGTH: 6,
  ORDER_ID_PADDING: 4,
  RECENT_DAYS: 90,
  TAX_RATE: 0.08,
} as const

function generateOrders(count: number): { orders: Order[] } {
  return {
    orders: Array.from({ length: count }, (_, i) => {
      const customerId = (i % ORDER_CONSTANTS.CUSTOMER_ID_MOD) + 1
      const itemCount = faker.number.int({ min: ORDER_CONSTANTS.MIN_ITEMS, max: ORDER_CONSTANTS.MAX_ITEMS })

      const items = Array.from({ length: itemCount }, (_, j) => {
        const price = faker.number.float({
          min: ORDER_CONSTANTS.MIN_ITEM_PRICE,
          max: ORDER_CONSTANTS.MAX_ITEM_PRICE,
          fractionDigits: 2,
        })
        const quantity = faker.number.int({
          min: ORDER_CONSTANTS.MIN_ITEM_QUANTITY,
          max: ORDER_CONSTANTS.MAX_ITEM_QUANTITY,
        })
        return {
          sku: `SKU-${faker.string.alphanumeric({ length: ORDER_CONSTANTS.SKU_LENGTH }).toUpperCase()}`,
          name: PRODUCT_NAMES[j % PRODUCT_NAMES.length]!,
          quantity,
          price,
        }
      })

      const subtotal = Number(items.reduce((sum, item) => sum + (item.price * item.quantity), 0).toFixed(2))
      const tax = Number((subtotal * ORDER_CONSTANTS.TAX_RATE).toFixed(2))
      const total = Number((subtotal + tax).toFixed(2))

      return {
        orderId: `ORD-${String(i + 1).padStart(ORDER_CONSTANTS.ORDER_ID_PADDING, '0')}`,
        customer: {
          id: customerId,
          name: faker.person.fullName(),
          email: faker.internet.email().toLowerCase(),
          phone: faker.phone.number(),
        },
        items,
        subtotal,
        tax,
        total,
        status: ORDER_STATUSES[i % ORDER_STATUSES.length]!,
        orderDate: faker.date.recent({ days: ORDER_CONSTANTS.RECENT_DAYS }).toISOString().split('T')[0],
      }
    }),
  }
}

/**
 * Nested dataset: E-commerce orders with nested structures
 *
 * @remarks
 * Tests TOON's handling of complex nested objects.
 */
const nestedDataset: Dataset = {
  name: 'nested',
  description: 'E-commerce orders with nested structures',
  data: generateOrders(50),
  metadata: {
    supportsCSV: false,
    structureClass: 'nested',
    tabularEligibility: 33, // orders array is not tabular, but items arrays within are
  },
}

/**
 * Analytics dataset: Time-series metrics
 *
 * @remarks
 * Tests TOON's handling of numeric data and date fields.
 */
const analyticsDataset: Dataset = {
  name: 'analytics',
  description: 'Time-series analytics data',
  data: generateAnalyticsData(60),
  metadata: {
    supportsCSV: true,
    structureClass: 'uniform',
    tabularEligibility: 100,
  },
}

/**
 * Real-world dataset: Top 100 starred GitHub repositories
 *
 * @remarks
 * Tests TOON's tabular format with real data.
 */
const githubDataset: Dataset = {
  name: 'github',
@@ -203,13 +392,18 @@ const githubDataset: Dataset = {
  data: {
    repositories: githubRepos,
  },
  metadata: {
    supportsCSV: true,
    structureClass: 'uniform',
    tabularEligibility: 100,
  },
}

/**
 * Generate a single e-commerce order with nested structure
 *
 * @remarks
 * Used for token efficiency benchmarks.
 */
export function generateOrderData(): Order {
  return {
@@ -235,11 +429,257 @@ export function generateOrderData(): Order {
}

/**
 * Generate event logs (semi-uniform structure)
 *
 * @remarks
 * Approximately 50% of logs include nested error objects, 50% are flat.
 * This creates ~45% tabular eligibility.
 */
export function generateEventLogs(count: number): { logs: EventLog[] } {
  const endpoints = ['/api/users', '/api/orders', '/api/products', '/api/auth', '/api/payments']
  const levels = ['info', 'warn', 'error'] as const

  return {
    logs: Array.from({ length: count }, () => {
      const level = faker.helpers.arrayElement(levels)
      const hasError = level === 'error' || (level === 'warn' && faker.datatype.boolean(0.3))

      const log: EventLog = {
        timestamp: faker.date.recent({ days: 7 }).toISOString(),
        level,
        endpoint: faker.helpers.arrayElement(endpoints),
        statusCode: hasError
          ? faker.number.int({ min: 400, max: 599 })
          : faker.number.int({ min: 200, max: 299 }),
        responseTime: faker.number.int({ min: 10, max: 5000 }),
        userId: faker.number.int({ min: 1000, max: 9999 }),
      }

      if (hasError) {
        log.error = {
          message: faker.helpers.arrayElement([
            'Database connection timeout',
            'Invalid authentication token',
            'Resource not found',
            'Internal server error',
            'Rate limit exceeded',
          ]),
          stack: `Error: ${faker.lorem.sentence()}\n at ${faker.lorem.word()}\n at ${faker.lorem.word()}`,
          retryable: faker.datatype.boolean(0.6),
        }
      }

      return log
    }),
  }
}

/**
 * Generate deeply nested configuration
 *
 * @remarks
 * Creates a complex nested structure with minimal tabular eligibility (~0%).
 */
export function generateNestedConfig(): NestedConfig {
  return {
    environment: faker.helpers.arrayElement(['production', 'staging', 'development']),
    version: faker.system.semver(),
    database: {
      host: faker.internet.domainName(),
      port: 5432,
      name: faker.database.type(),
      pool: {
        min: 2,
        max: faker.number.int({ min: 10, max: 50 }),
        idleTimeout: 30000,
      },
      replicas: Array.from({ length: 3 }, (_, i) => ({
        host: `replica-${i + 1}.${faker.internet.domainName()}`,
        port: 5432,
        priority: i + 1,
      })),
    },
    features: {
      darkMode: {
        enabled: faker.datatype.boolean(),
        rollout: faker.number.int({ min: 0, max: 100 }),
        variants: [
          {
            name: 'default',
            weight: 70,
            config: { theme: 'dark', animations: true },
          },
          {
            name: 'minimal',
            weight: 30,
            config: { theme: 'dark', animations: false },
          },
        ],
      },
      analytics: {
        enabled: faker.datatype.boolean(),
        rollout: faker.number.int({ min: 0, max: 100 }),
        variants: [
          {
            name: 'full',
            weight: 100,
            config: { tracking: 'all', sampling: 1.0 },
          },
        ],
      },
    },
    authentication: {
      providers: [
        {
          name: 'oauth2',
          clientId: faker.string.uuid(),
          scopes: ['read', 'write', 'admin'],
          config: {
            authUrl: faker.internet.url(),
            tokenUrl: faker.internet.url(),
          },
        },
        {
          name: 'saml',
          clientId: faker.string.uuid(),
          scopes: ['read'],
          config: {
            entryPoint: faker.internet.url(),
            cert: faker.string.alphanumeric({ length: 64 }),
          },
        },
      ],
      session: {
        secret: faker.string.alphanumeric({ length: 32 }),
        duration: 86400,
        refreshThreshold: 3600,
      },
    },
    permissions: {
      roles: {
        admin: {
          permissions: ['read', 'write', 'delete', 'manage_users', 'manage_roles'],
          inherits: [],
        },
        editor: {
          permissions: ['read', 'write'],
          inherits: ['viewer'],
        },
        viewer: {
          permissions: ['read'],
          inherits: [],
        },
      },
      groups: {
        engineering: {
          members: Array.from({ length: 5 }, () => faker.internet.email()),
          roles: ['admin', 'editor'],
        },
        support: {
          members: Array.from({ length: 3 }, () => faker.internet.email()),
          roles: ['viewer'],
        },
      },
    },
  }
}

/**
 * Event logs dataset: Semi-uniform structure
 *
 * @remarks
 * Tests TOON with semi-uniform data (~50% flat, ~50% with nested errors).
 */
const eventLogsDataset: Dataset = {
  name: 'event-logs',
  description: 'Semi-uniform event logs',
  data: generateEventLogs(75),
  metadata: {
    supportsCSV: false,
    structureClass: 'semi-uniform',
    tabularEligibility: 50, // ~50% of logs have nested error objects
  },
}

/**
 * Nested config dataset: Deeply nested structure
 *
 * @remarks
 * Tests TOON's worst-case scenario with deeply nested configuration.
 */
const nestedConfigDataset: Dataset = {
  name: 'nested-config',
  description: 'Deeply nested configuration',
  data: generateNestedConfig(),
  metadata: {
    supportsCSV: false,
    structureClass: 'deep',
    tabularEligibility: 0, // Highly nested, minimal tabular arrays
  },
}

/**
 * Datasets for accuracy benchmarks (smaller sizes for faster evaluation)
 */
export const ACCURACY_DATASETS: Dataset[] = [
  tabularDataset, // 100 employees
  nestedDataset, // 50 orders
  analyticsDataset, // 60 days
  githubDataset, // 100 repos
  eventLogsDataset, // 75 logs
  nestedConfigDataset, // 1 config
]

/**
 * Datasets for token efficiency benchmarks (larger sizes to amplify token differences)
 */
export const TOKEN_EFFICIENCY_DATASETS: Dataset[] = [
  // Tabular: 2000 employees
  {
    name: 'tabular',
    description: 'Uniform employee records (TOON optimal format)',
    data: generateEmployees(2000),
    metadata: {
      supportsCSV: true,
      structureClass: 'uniform',
      tabularEligibility: 100,
    },
  },
  // Nested: 500 orders
  {
    name: 'nested',
    description: 'E-commerce orders with nested structures',
    data: generateOrders(500),
    metadata: {
      supportsCSV: false,
      structureClass: 'nested',
      tabularEligibility: 33,
    },
  },
  // Analytics: 365 days
  {
    name: 'analytics',
    description: 'Time-series analytics data',
    data: generateAnalyticsData(365),
    metadata: {
      supportsCSV: true,
      structureClass: 'uniform',
      tabularEligibility: 100,
    },
  },
  // GitHub: 100 repos (same as accuracy)
|
||||||
|
githubDataset,
|
||||||
|
// Event logs: 2000 logs
|
||||||
|
{
|
||||||
|
name: 'event-logs',
|
||||||
|
description: 'Semi-uniform event logs',
|
||||||
|
data: generateEventLogs(2000),
|
||||||
|
metadata: {
|
||||||
|
supportsCSV: false,
|
||||||
|
structureClass: 'semi-uniform',
|
||||||
|
tabularEligibility: 50,
|
||||||
|
},
|
||||||
|
},
|
||||||
|
// Nested config: 1 config (same as accuracy)
|
||||||
|
nestedConfigDataset,
|
||||||
]
|
]
|
||||||
|
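The `tabularEligibility` figures in the dataset metadata above are hand-assigned per dataset. As an aside, a rough estimate of such a share can be computed mechanically: count the rows that are flat objects sharing one uniform key set. This is only an illustrative sketch; the heuristic and the function name are not part of the benchmark code.

```typescript
// Rough sketch: estimate what percentage of an array's rows are "tabular",
// i.e. flat objects whose key set matches the first row's. Illustrative only.
type Row = Record<string, unknown>

function tabularEligibility(rows: Row[]): number {
  const first = rows[0]
  if (!first)
    return 0
  const referenceKeys = JSON.stringify(Object.keys(first).sort())
  const eligible = rows.filter(r =>
    JSON.stringify(Object.keys(r).sort()) === referenceKeys
    // nested objects/arrays disqualify a row from the flat tabular form
    && Object.values(r).every(v => v === null || typeof v !== 'object'),
  ).length
  return Math.round((eligible / rows.length) * 100)
}

const logs: Row[] = [
  { id: 1, level: 'info' },
  { id: 2, level: 'error', error: { code: 500 } },
]
console.log(tabularEligibility(logs)) // 50: one of the two rows is flat and uniform
```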
@@ -1,3 +1,4 @@
import type { Dataset } from './types'
import { stringify as stringifyCSV } from 'csv-stringify/sync'
import { XMLBuilder } from 'fast-xml-parser'
import { stringify as stringifyYAML } from 'yaml'
@@ -75,3 +76,14 @@ function toXML(data: unknown): string {
  return builder.build(data)
}

/**
 * Check if a dataset supports CSV format
 *
 * @remarks
 * CSV is only suitable for flat tabular data. Datasets with nested structures
 * should not be compared using CSV as it cannot properly represent the data.
 */
export function supportsCSV(dataset: Dataset): boolean {
  return dataset.metadata.supportsCSV
}
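The new `supportsCSV` helper lets the benchmark runner exclude CSV for datasets whose structure CSV cannot represent. A hypothetical call site might look like the following; the format list and `formatsFor` name are illustrative, not from the repository.

```typescript
// Hypothetical call site: pick the encodings to benchmark for a dataset,
// adding CSV only when its metadata marks it as flat tabular data.
interface DatasetMetadata { supportsCSV: boolean }
interface Dataset { name: string, metadata: DatasetMetadata }

function supportsCSV(dataset: Dataset): boolean {
  return dataset.metadata.supportsCSV
}

function formatsFor(dataset: Dataset): string[] {
  const base = ['json', 'yaml', 'xml', 'toon']
  return supportsCSV(dataset) ? [...base, 'csv'] : base
}

const nestedOrders: Dataset = { name: 'nested', metadata: { supportsCSV: false } }
console.log(formatsFor(nestedOrders)) // CSV omitted for nested data
```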
@@ -1,711 +0,0 @@
/**
|
|
||||||
* Question generation for TOON benchmarks
|
|
||||||
*
|
|
||||||
* Generates ~150-160 questions across different question types and datasets:
|
|
||||||
* - Field Retrieval: Direct field access with no computation
|
|
||||||
* Examples: "What is X's salary?", "What is the status of order Y?"
|
|
||||||
* - Aggregation: Counts, sums, averages, min/max operations (including single-condition filters)
|
|
||||||
* Examples: "How many X?", "What is the total/average?", "How many X > threshold?"
|
|
||||||
* - Filtering: Multi-condition queries requiring complex logical operations
|
|
||||||
* Examples: "How many X WHERE condition1 AND condition2?"
|
|
||||||
*/
|
|
||||||
|
|
||||||
import type { AnalyticsMetric, Employee, Order, Repository } from './datasets'
|
|
||||||
import type { Question } from './types'
|
|
||||||
import { QUESTION_LIMITS, QUESTION_THRESHOLDS } from './constants'
|
|
||||||
import { datasets } from './datasets'
|
|
||||||
|
|
||||||
/**
|
|
||||||
* Generate all questions from datasets
|
|
||||||
*/
|
|
||||||
export function generateQuestions(): Question[] {
|
|
||||||
const questions: Question[] = []
|
|
||||||
let idCounter = 1
|
|
||||||
|
|
||||||
// Get datasets with proper typing
|
|
||||||
const tabular = (datasets.find(d => d.name === 'tabular')?.data.employees as Employee[]) ?? []
|
|
||||||
const nested = (datasets.find(d => d.name === 'nested')?.data.orders as Order[]) ?? []
|
|
||||||
const analytics = (datasets.find(d => d.name === 'analytics')?.data.metrics as AnalyticsMetric[]) ?? []
|
|
||||||
const github = (datasets.find(d => d.name === 'github')?.data.repositories as Repository[]) ?? []
|
|
||||||
|
|
||||||
if (tabular.length > 0) {
|
|
||||||
// Field retrieval: specific employees
|
|
||||||
for (let i = 0; i < Math.min(QUESTION_LIMITS.tabular.fieldRetrieval, tabular.length); i++) {
|
|
||||||
const emp = tabular[i * 2] || tabular[i]
|
|
||||||
if (!emp)
|
|
||||||
continue
|
|
||||||
|
|
||||||
// Rotate through all field types
|
|
||||||
if (i % 5 === 0) {
|
|
||||||
questions.push({
|
|
||||||
id: `q${idCounter++}`,
|
|
||||||
prompt: `What is the salary of ${emp.name}?`,
|
|
||||||
groundTruth: String(emp.salary),
|
|
||||||
type: 'field-retrieval',
|
|
||||||
dataset: 'tabular',
|
|
||||||
})
|
|
||||||
}
|
|
||||||
else if (i % 5 === 1) {
|
|
||||||
questions.push({
|
|
||||||
id: `q${idCounter++}`,
|
|
||||||
prompt: `What department does ${emp.name} work in?`,
|
|
||||||
groundTruth: emp.department,
|
|
||||||
type: 'field-retrieval',
|
|
||||||
dataset: 'tabular',
|
|
||||||
})
|
|
||||||
}
|
|
||||||
else if (i % 5 === 2) {
|
|
||||||
questions.push({
|
|
||||||
id: `q${idCounter++}`,
|
|
||||||
prompt: `What is the email address of ${emp.name}?`,
|
|
||||||
groundTruth: emp.email,
|
|
||||||
type: 'field-retrieval',
|
|
||||||
dataset: 'tabular',
|
|
||||||
})
|
|
||||||
}
|
|
||||||
else if (i % 5 === 3) {
|
|
||||||
questions.push({
|
|
||||||
id: `q${idCounter++}`,
|
|
||||||
prompt: `How many years of experience does ${emp.name} have?`,
|
|
||||||
groundTruth: String(emp.yearsExperience),
|
|
||||||
type: 'field-retrieval',
|
|
||||||
dataset: 'tabular',
|
|
||||||
})
|
|
||||||
}
|
|
||||||
else {
|
|
||||||
questions.push({
|
|
||||||
id: `q${idCounter++}`,
|
|
||||||
prompt: `Is ${emp.name} an active employee?`,
|
|
||||||
groundTruth: emp.active ? 'yes' : 'no',
|
|
||||||
type: 'field-retrieval',
|
|
||||||
dataset: 'tabular',
|
|
||||||
})
|
|
||||||
}
|
|
||||||
}
|
|
||||||
|
|
||||||
// Aggregation: count by department
|
|
||||||
const departments = [...new Set(tabular.map(e => e.department))]
|
|
||||||
for (const dept of departments.slice(0, QUESTION_LIMITS.tabular.aggregationDepartments)) {
|
|
||||||
const count = tabular.filter(e => e.department === dept).length
|
|
||||||
questions.push({
|
|
||||||
id: `q${idCounter++}`,
|
|
||||||
prompt: `How many employees work in ${dept}?`,
|
|
||||||
groundTruth: String(count),
|
|
||||||
type: 'aggregation',
|
|
||||||
dataset: 'tabular',
|
|
||||||
})
|
|
||||||
}
|
|
||||||
|
|
||||||
// Aggregation: salary ranges (single-condition filters)
|
|
||||||
for (const threshold of QUESTION_THRESHOLDS.tabular.salaryRanges) {
|
|
||||||
const count = tabular.filter(e => e.salary > threshold).length
|
|
||||||
questions.push({
|
|
||||||
id: `q${idCounter++}`,
|
|
||||||
prompt: `How many employees have a salary greater than ${threshold}?`,
|
|
||||||
groundTruth: String(count),
|
|
||||||
type: 'aggregation',
|
|
||||||
dataset: 'tabular',
|
|
||||||
})
|
|
||||||
}
|
|
||||||
|
|
||||||
// Aggregation: totals and averages
|
|
||||||
const totalEmployees = tabular.length
|
|
||||||
const avgSalary = Math.round(tabular.reduce((sum, e) => sum + e.salary, 0) / totalEmployees)
|
|
||||||
const activeCount = tabular.filter(e => e.active).length
|
|
||||||
const inactiveCount = tabular.filter(e => !e.active).length
|
|
||||||
|
|
||||||
questions.push(
|
|
||||||
{
|
|
||||||
id: `q${idCounter++}`,
|
|
||||||
prompt: 'How many employees are in the dataset?',
|
|
||||||
groundTruth: String(totalEmployees),
|
|
||||||
type: 'aggregation',
|
|
||||||
dataset: 'tabular',
|
|
||||||
},
|
|
||||||
{
|
|
||||||
id: `q${idCounter++}`,
|
|
||||||
prompt: 'What is the average salary across all employees?',
|
|
||||||
groundTruth: String(avgSalary),
|
|
||||||
type: 'aggregation',
|
|
||||||
dataset: 'tabular',
|
|
||||||
},
|
|
||||||
{
|
|
||||||
id: `q${idCounter++}`,
|
|
||||||
prompt: 'How many employees are active?',
|
|
||||||
groundTruth: String(activeCount),
|
|
||||||
type: 'aggregation',
|
|
||||||
dataset: 'tabular',
|
|
||||||
},
|
|
||||||
{
|
|
||||||
id: `q${idCounter++}`,
|
|
||||||
prompt: 'How many employees are inactive?',
|
|
||||||
groundTruth: String(inactiveCount),
|
|
||||||
type: 'aggregation',
|
|
||||||
dataset: 'tabular',
|
|
||||||
},
|
|
||||||
)
|
|
||||||
|
|
||||||
// Filtering: count by department with salary filter (multi-condition)
|
|
||||||
for (const dept of departments.slice(0, QUESTION_LIMITS.tabular.filteringMultiConditionDepartments)) {
|
|
||||||
const count = tabular.filter(e => e.department === dept && e.salary > QUESTION_THRESHOLDS.tabular.departmentSalaryThreshold).length
|
|
||||||
questions.push({
|
|
||||||
id: `q${idCounter++}`,
|
|
||||||
prompt: `How many employees in ${dept} have a salary greater than ${QUESTION_THRESHOLDS.tabular.departmentSalaryThreshold}?`,
|
|
||||||
groundTruth: String(count),
|
|
||||||
type: 'filtering',
|
|
||||||
dataset: 'tabular',
|
|
||||||
})
|
|
||||||
}
|
|
||||||
|
|
||||||
// Filtering: active employees by experience (multi-condition)
|
|
||||||
for (const exp of QUESTION_THRESHOLDS.tabular.experienceYears.slice(0, QUESTION_LIMITS.tabular.filteringExperience)) {
|
|
||||||
const count = tabular.filter(e => e.yearsExperience > exp && e.active).length
|
|
||||||
questions.push({
|
|
||||||
id: `q${idCounter++}`,
|
|
||||||
prompt: `How many active employees have more than ${exp} years of experience?`,
|
|
||||||
groundTruth: String(count),
|
|
||||||
type: 'filtering',
|
|
||||||
dataset: 'tabular',
|
|
||||||
})
|
|
||||||
}
|
|
||||||
|
|
||||||
// Filtering: department by experience (multi-condition)
|
|
||||||
for (const dept of departments.slice(0, QUESTION_LIMITS.tabular.filteringDepartmentExp)) {
|
|
||||||
const count = tabular.filter(e => e.department === dept && e.yearsExperience > QUESTION_THRESHOLDS.tabular.departmentExperienceThreshold).length
|
|
||||||
questions.push({
|
|
||||||
id: `q${idCounter++}`,
|
|
||||||
prompt: `How many employees in ${dept} have more than ${QUESTION_THRESHOLDS.tabular.departmentExperienceThreshold} years of experience?`,
|
|
||||||
groundTruth: String(count),
|
|
||||||
type: 'filtering',
|
|
||||||
dataset: 'tabular',
|
|
||||||
})
|
|
||||||
}
|
|
||||||
|
|
||||||
// Filtering: department by active status (multi-condition)
|
|
||||||
for (const dept of departments.slice(0, QUESTION_LIMITS.tabular.filteringDepartmentActive)) {
|
|
||||||
const count = tabular.filter(e => e.department === dept && e.active).length
|
|
||||||
questions.push({
|
|
||||||
id: `q${idCounter++}`,
|
|
||||||
prompt: `How many active employees work in ${dept}?`,
|
|
||||||
groundTruth: String(count),
|
|
||||||
type: 'filtering',
|
|
||||||
dataset: 'tabular',
|
|
||||||
})
|
|
||||||
}
|
|
||||||
}
|
|
||||||
|
|
||||||
if (nested.length > 0) {
|
|
||||||
// Field retrieval: order totals and statuses
|
|
||||||
for (let i = 0; i < Math.min(QUESTION_LIMITS.nested.fieldRetrievalOrders, nested.length); i++) {
|
|
||||||
const order = nested[i * 2] || nested[i]
|
|
||||||
if (!order)
|
|
||||||
continue
|
|
||||||
|
|
||||||
if (i % 2 === 0) {
|
|
||||||
questions.push({
|
|
||||||
id: `q${idCounter++}`,
|
|
||||||
prompt: `What is the total for order ${order.orderId}?`,
|
|
||||||
groundTruth: String(order.total),
|
|
||||||
type: 'field-retrieval',
|
|
||||||
dataset: 'nested',
|
|
||||||
})
|
|
||||||
}
|
|
||||||
else {
|
|
||||||
questions.push({
|
|
||||||
id: `q${idCounter++}`,
|
|
||||||
prompt: `What is the status of order ${order.orderId}?`,
|
|
||||||
groundTruth: order.status,
|
|
||||||
type: 'field-retrieval',
|
|
||||||
dataset: 'nested',
|
|
||||||
})
|
|
||||||
}
|
|
||||||
}
|
|
||||||
|
|
||||||
// Field retrieval: customer info and order dates (expanded)
|
|
||||||
for (let i = 0; i < Math.min(QUESTION_LIMITS.nested.fieldRetrievalCustomers, nested.length); i++) {
|
|
||||||
const order = nested[i * 2 + 1] || nested[i]
|
|
||||||
if (!order)
|
|
||||||
continue
|
|
||||||
|
|
||||||
if (i % 4 === 0) {
|
|
||||||
questions.push({
|
|
||||||
id: `q${idCounter++}`,
|
|
||||||
prompt: `What is the customer name for order ${order.orderId}?`,
|
|
||||||
groundTruth: order.customer.name,
|
|
||||||
type: 'field-retrieval',
|
|
||||||
dataset: 'nested',
|
|
||||||
})
|
|
||||||
}
|
|
||||||
else if (i % 4 === 1) {
|
|
||||||
questions.push({
|
|
||||||
id: `q${idCounter++}`,
|
|
||||||
prompt: `What is the customer email for order ${order.orderId}?`,
|
|
||||||
groundTruth: order.customer.email,
|
|
||||||
type: 'field-retrieval',
|
|
||||||
dataset: 'nested',
|
|
||||||
})
|
|
||||||
}
|
|
||||||
else if (i % 4 === 2) {
|
|
||||||
questions.push({
|
|
||||||
id: `q${idCounter++}`,
|
|
||||||
prompt: `What is the order date for order ${order.orderId}?`,
|
|
||||||
groundTruth: order.orderDate || '',
|
|
||||||
type: 'field-retrieval',
|
|
||||||
dataset: 'nested',
|
|
||||||
})
|
|
||||||
}
|
|
||||||
else {
|
|
||||||
questions.push({
|
|
||||||
id: `q${idCounter++}`,
|
|
||||||
prompt: `How many items are in order ${order.orderId}?`,
|
|
||||||
groundTruth: String(order.items.length),
|
|
||||||
type: 'field-retrieval',
|
|
||||||
dataset: 'nested',
|
|
||||||
})
|
|
||||||
}
|
|
||||||
}
|
|
||||||
|
|
||||||
// Aggregation: totals and averages
|
|
||||||
const totalRevenue = nested.reduce((sum, o) => sum + o.total, 0)
|
|
||||||
const avgOrderValue = totalRevenue / nested.length
|
|
||||||
const totalOrders = nested.length
|
|
||||||
const maxOrderValue = Math.max(...nested.map(o => o.total))
|
|
||||||
|
|
||||||
// Count by status
|
|
||||||
const statuses = [...new Set(nested.map(o => o.status))]
|
|
||||||
for (const status of statuses.slice(0, QUESTION_LIMITS.nested.aggregationStatuses)) {
|
|
||||||
const count = nested.filter(o => o.status === status).length
|
|
||||||
questions.push({
|
|
||||||
id: `q${idCounter++}`,
|
|
||||||
prompt: `How many orders have status "${status}"?`,
|
|
||||||
groundTruth: String(count),
|
|
||||||
type: 'aggregation',
|
|
||||||
dataset: 'nested',
|
|
||||||
})
|
|
||||||
}
|
|
||||||
|
|
||||||
questions.push(
|
|
||||||
{
|
|
||||||
id: `q${idCounter++}`,
|
|
||||||
prompt: 'What is the total revenue across all orders?',
|
|
||||||
groundTruth: String(totalRevenue.toFixed(2)),
|
|
||||||
type: 'aggregation',
|
|
||||||
dataset: 'nested',
|
|
||||||
},
|
|
||||||
{
|
|
||||||
id: `q${idCounter++}`,
|
|
||||||
prompt: 'What is the average order value?',
|
|
||||||
groundTruth: String(avgOrderValue.toFixed(2)),
|
|
||||||
type: 'aggregation',
|
|
||||||
dataset: 'nested',
|
|
||||||
},
|
|
||||||
{
|
|
||||||
id: `q${idCounter++}`,
|
|
||||||
prompt: 'How many orders are in the dataset?',
|
|
||||||
groundTruth: String(totalOrders),
|
|
||||||
type: 'aggregation',
|
|
||||||
dataset: 'nested',
|
|
||||||
},
|
|
||||||
{
|
|
||||||
id: `q${idCounter++}`,
|
|
||||||
prompt: 'What is the highest order total?',
|
|
||||||
groundTruth: String(maxOrderValue.toFixed(2)),
|
|
||||||
type: 'aggregation',
|
|
||||||
dataset: 'nested',
|
|
||||||
},
|
|
||||||
)
|
|
||||||
|
|
||||||
// Aggregation: high-value orders (single-condition filter)
|
|
||||||
for (const threshold of QUESTION_THRESHOLDS.nested.highValueOrders) {
|
|
||||||
const count = nested.filter(o => o.total > threshold).length
|
|
||||||
questions.push({
|
|
||||||
id: `q${idCounter++}`,
|
|
||||||
prompt: `How many orders have a total greater than ${threshold}?`,
|
|
||||||
groundTruth: String(count),
|
|
||||||
type: 'aggregation',
|
|
||||||
dataset: 'nested',
|
|
||||||
})
|
|
||||||
}
|
|
||||||
|
|
||||||
// Filtering: multi-condition queries (status AND value)
|
|
||||||
const orderStatuses = [...new Set(nested.map(o => o.status))]
|
|
||||||
for (const status of orderStatuses.slice(0, QUESTION_LIMITS.nested.filteringStatusAndValue)) {
|
|
||||||
const count = nested.filter(o => o.status === status && o.total > QUESTION_THRESHOLDS.nested.statusValueThreshold).length
|
|
||||||
questions.push({
|
|
||||||
id: `q${idCounter++}`,
|
|
||||||
prompt: `How many orders have status "${status}" and total greater than ${QUESTION_THRESHOLDS.nested.statusValueThreshold}?`,
|
|
||||||
groundTruth: String(count),
|
|
||||||
type: 'filtering',
|
|
||||||
dataset: 'nested',
|
|
||||||
})
|
|
||||||
}
|
|
||||||
|
|
||||||
// Filtering: status AND items count (multi-condition)
|
|
||||||
for (const status of orderStatuses.slice(0, QUESTION_LIMITS.nested.filteringStatusAndItems)) {
|
|
||||||
const count = nested.filter(o => o.status === status && o.items.length >= QUESTION_THRESHOLDS.nested.itemCountThreshold).length
|
|
||||||
questions.push({
|
|
||||||
id: `q${idCounter++}`,
|
|
||||||
prompt: `How many orders have status "${status}" and at least ${QUESTION_THRESHOLDS.nested.itemCountThreshold} items?`,
|
|
||||||
groundTruth: String(count),
|
|
||||||
type: 'filtering',
|
|
||||||
dataset: 'nested',
|
|
||||||
})
|
|
||||||
}
|
|
||||||
|
|
||||||
// Filtering: total AND items count (multi-condition)
|
|
||||||
for (const threshold of QUESTION_THRESHOLDS.nested.totalThresholdsForItems) {
|
|
||||||
const count = nested.filter(o => o.total > threshold && o.items.length >= QUESTION_THRESHOLDS.nested.itemCountThreshold).length
|
|
||||||
questions.push({
|
|
||||||
id: `q${idCounter++}`,
|
|
||||||
prompt: `How many orders have a total greater than ${threshold} and at least ${QUESTION_THRESHOLDS.nested.itemCountThreshold} items?`,
|
|
||||||
groundTruth: String(count),
|
|
||||||
type: 'filtering',
|
|
||||||
dataset: 'nested',
|
|
||||||
})
|
|
||||||
}
|
|
||||||
}
|
|
||||||
|
|
||||||
if (analytics.length > 0) {
|
|
||||||
// Field retrieval: specific dates (expanded with all metrics)
|
|
||||||
for (let i = 0; i < Math.min(QUESTION_LIMITS.analytics.fieldRetrievalDates, analytics.length); i++) {
|
|
||||||
const metric = analytics[i * 3] || analytics[i]
|
|
||||||
if (!metric)
|
|
||||||
continue
|
|
||||||
|
|
||||||
if (i % 5 === 0) {
|
|
||||||
questions.push({
|
|
||||||
id: `q${idCounter++}`,
|
|
||||||
prompt: `How many views were recorded on ${metric.date}?`,
|
|
||||||
groundTruth: String(metric.views),
|
|
||||||
type: 'field-retrieval',
|
|
||||||
dataset: 'analytics',
|
|
||||||
})
|
|
||||||
}
|
|
||||||
else if (i % 5 === 1) {
|
|
||||||
questions.push({
|
|
||||||
id: `q${idCounter++}`,
|
|
||||||
prompt: `What was the revenue on ${metric.date}?`,
|
|
||||||
groundTruth: String(metric.revenue),
|
|
||||||
type: 'field-retrieval',
|
|
||||||
dataset: 'analytics',
|
|
||||||
})
|
|
||||||
}
|
|
||||||
else if (i % 5 === 2) {
|
|
||||||
questions.push({
|
|
||||||
id: `q${idCounter++}`,
|
|
||||||
prompt: `What was the conversion count on ${metric.date}?`,
|
|
||||||
groundTruth: String(metric.conversions),
|
|
||||||
type: 'field-retrieval',
|
|
||||||
dataset: 'analytics',
|
|
||||||
})
|
|
||||||
}
|
|
||||||
else if (i % 5 === 3) {
|
|
||||||
questions.push({
|
|
||||||
id: `q${idCounter++}`,
|
|
||||||
prompt: `How many clicks were recorded on ${metric.date}?`,
|
|
||||||
groundTruth: String(metric.clicks),
|
|
||||||
type: 'field-retrieval',
|
|
||||||
dataset: 'analytics',
|
|
||||||
})
|
|
||||||
}
|
|
||||||
else {
|
|
||||||
questions.push({
|
|
||||||
id: `q${idCounter++}`,
|
|
||||||
prompt: `What was the bounce rate on ${metric.date}?`,
|
|
||||||
groundTruth: String(metric.bounceRate),
|
|
||||||
type: 'field-retrieval',
|
|
||||||
dataset: 'analytics',
|
|
||||||
})
|
|
||||||
}
|
|
||||||
}
|
|
||||||
|
|
||||||
// Aggregation: totals and averages
|
|
||||||
const totalViews = analytics.reduce((sum, m) => sum + m.views, 0)
|
|
||||||
const totalRevenue = analytics.reduce((sum, m) => sum + m.revenue, 0)
|
|
||||||
const totalConversions = analytics.reduce((sum, m) => sum + m.conversions, 0)
|
|
||||||
const avgViews = Math.round(totalViews / analytics.length)
|
|
||||||
const avgRevenue = totalRevenue / analytics.length
|
|
||||||
const avgConversions = Math.round(totalConversions / analytics.length)
|
|
||||||
|
|
||||||
questions.push(
|
|
||||||
{
|
|
||||||
id: `q${idCounter++}`,
|
|
||||||
prompt: 'What is the total number of views across all dates?',
|
|
||||||
groundTruth: String(totalViews),
|
|
||||||
type: 'aggregation',
|
|
||||||
dataset: 'analytics',
|
|
||||||
},
|
|
||||||
{
|
|
||||||
id: `q${idCounter++}`,
|
|
||||||
prompt: 'What is the total revenue across all dates?',
|
|
||||||
groundTruth: String(totalRevenue.toFixed(2)),
|
|
||||||
type: 'aggregation',
|
|
||||||
dataset: 'analytics',
|
|
||||||
},
|
|
||||||
{
|
|
||||||
id: `q${idCounter++}`,
|
|
||||||
prompt: 'What is the total number of conversions across all dates?',
|
|
||||||
groundTruth: String(totalConversions),
|
|
||||||
type: 'aggregation',
|
|
||||||
dataset: 'analytics',
|
|
||||||
},
|
|
||||||
{
|
|
||||||
id: `q${idCounter++}`,
|
|
||||||
prompt: 'What is the average number of views per day?',
|
|
||||||
groundTruth: String(avgViews),
|
|
||||||
type: 'aggregation',
|
|
||||||
dataset: 'analytics',
|
|
||||||
},
|
|
||||||
{
|
|
||||||
id: `q${idCounter++}`,
|
|
||||||
prompt: 'What is the average revenue per day?',
|
|
||||||
groundTruth: String(avgRevenue.toFixed(2)),
|
|
||||||
type: 'aggregation',
|
|
||||||
dataset: 'analytics',
|
|
||||||
},
|
|
||||||
{
|
|
||||||
id: `q${idCounter++}`,
|
|
||||||
prompt: 'What is the average number of conversions per day?',
|
|
||||||
groundTruth: String(avgConversions),
|
|
||||||
type: 'aggregation',
|
|
||||||
dataset: 'analytics',
|
|
||||||
},
|
|
||||||
{
|
|
||||||
id: `q${idCounter++}`,
|
|
||||||
prompt: 'How many days are included in the analytics data?',
|
|
||||||
groundTruth: String(analytics.length),
|
|
||||||
type: 'aggregation',
|
|
||||||
dataset: 'analytics',
|
|
||||||
},
|
|
||||||
{
|
|
||||||
id: `q${idCounter++}`,
|
|
||||||
prompt: 'What is the highest number of views recorded in a single day?',
|
|
||||||
groundTruth: String(Math.max(...analytics.map(m => m.views))),
|
|
||||||
type: 'aggregation',
|
|
||||||
dataset: 'analytics',
|
|
||||||
},
|
|
||||||
)
|
|
||||||
|
|
||||||
// Aggregation: high-performing days (single-condition filters)
|
|
||||||
for (const threshold of QUESTION_THRESHOLDS.analytics.views) {
|
|
||||||
const count = analytics.filter(m => m.views > threshold).length
|
|
||||||
questions.push({
|
|
||||||
id: `q${idCounter++}`,
|
|
||||||
prompt: `How many days had more than ${threshold} views?`,
|
|
||||||
groundTruth: String(count),
|
|
||||||
type: 'aggregation',
|
|
||||||
dataset: 'analytics',
|
|
||||||
})
|
|
||||||
}
|
|
||||||
|
|
||||||
// Filtering: multi-condition queries (views AND conversions)
|
|
||||||
for (const viewThreshold of QUESTION_THRESHOLDS.analytics.viewsForFiltering) {
|
|
||||||
const count = analytics.filter(m => m.views > viewThreshold && m.conversions > QUESTION_THRESHOLDS.analytics.conversionsForFiltering).length
|
|
||||||
questions.push({
|
|
||||||
id: `q${idCounter++}`,
|
|
||||||
prompt: `How many days had more than ${viewThreshold} views and more than ${QUESTION_THRESHOLDS.analytics.conversionsForFiltering} conversions?`,
|
|
||||||
groundTruth: String(count),
|
|
||||||
type: 'filtering',
|
|
||||||
dataset: 'analytics',
|
|
||||||
})
|
|
||||||
}
|
|
||||||
|
|
||||||
// Filtering: views AND revenue (expanded)
|
|
||||||
for (const revenueThreshold of QUESTION_THRESHOLDS.analytics.revenueThresholds.slice(0, 5)) {
|
|
||||||
const count = analytics.filter(m => m.views > QUESTION_THRESHOLDS.analytics.viewsThresholdForRevenue && m.revenue > revenueThreshold).length
|
|
||||||
questions.push({
|
|
||||||
id: `q${idCounter++}`,
|
|
||||||
prompt: `How many days had more than ${QUESTION_THRESHOLDS.analytics.viewsThresholdForRevenue} views and revenue greater than ${revenueThreshold}?`,
|
|
||||||
groundTruth: String(count),
|
|
||||||
type: 'filtering',
|
|
||||||
dataset: 'analytics',
|
|
||||||
})
|
|
||||||
}
|
|
||||||
|
|
||||||
// Filtering: clicks AND conversions (multi-condition)
|
|
||||||
for (const clickThreshold of QUESTION_THRESHOLDS.analytics.clicksForFiltering) {
|
|
||||||
const count = analytics.filter(m => m.clicks > clickThreshold && m.conversions > QUESTION_THRESHOLDS.analytics.conversionsForClickFiltering).length
|
|
||||||
questions.push({
|
|
||||||
id: `q${idCounter++}`,
|
|
||||||
prompt: `How many days had more than ${clickThreshold} clicks and more than ${QUESTION_THRESHOLDS.analytics.conversionsForClickFiltering} conversions?`,
|
|
||||||
groundTruth: String(count),
|
|
||||||
type: 'filtering',
|
|
||||||
dataset: 'analytics',
|
|
||||||
})
|
|
||||||
}
|
|
||||||
|
|
||||||
// Filtering: revenue AND bounce rate (multi-condition)
|
|
||||||
for (const revenueThreshold of QUESTION_THRESHOLDS.analytics.revenueForBounceRate) {
|
|
||||||
const count = analytics.filter(m => m.revenue > revenueThreshold && m.bounceRate < QUESTION_THRESHOLDS.analytics.bounceRateThreshold).length
|
|
||||||
questions.push({
|
|
||||||
id: `q${idCounter++}`,
|
|
||||||
prompt: `How many days had revenue greater than ${revenueThreshold} and bounce rate less than ${QUESTION_THRESHOLDS.analytics.bounceRateThreshold}?`,
|
|
||||||
groundTruth: String(count),
|
|
||||||
type: 'filtering',
|
|
||||||
dataset: 'analytics',
|
|
||||||
})
|
|
||||||
}
|
|
||||||
}
|
|
||||||
|
|
||||||
if (github.length > 0) {
|
|
||||||
// Helper to extract owner from repo field
|
|
||||||
const getOwner = (repoFullName: string) => repoFullName.split('/')[0]!
|
|
||||||
|
|
||||||
// Field retrieval: specific repos (diverse fields)
|
|
||||||
for (let i = 0; i < Math.min(QUESTION_LIMITS.github.fieldRetrievalRepos, github.length); i++) {
|
|
||||||
const repo = github[i * 7]
|
|
||||||
if (!repo)
|
|
||||||
continue
|
|
||||||
|
|
||||||
if (i % 5 === 0) {
|
|
||||||
questions.push({
|
|
||||||
id: `q${idCounter++}`,
|
|
||||||
prompt: `How many stars does ${repo.repo} have?`,
|
|
||||||
groundTruth: String(repo.stars),
|
|
||||||
type: 'field-retrieval',
|
|
||||||
dataset: 'github',
|
|
||||||
})
|
|
||||||
}
|
|
||||||
else if (i % 5 === 1) {
|
|
||||||
questions.push({
|
|
||||||
id: `q${idCounter++}`,
|
|
||||||
prompt: `How many forks does ${repo.repo} have?`,
|
|
||||||
groundTruth: String(repo.forks),
|
|
||||||
type: 'field-retrieval',
|
|
||||||
dataset: 'github',
|
|
||||||
})
|
|
||||||
}
|
|
||||||
else if (i % 5 === 2) {
|
|
||||||
questions.push({
|
|
||||||
id: `q${idCounter++}`,
|
|
||||||
prompt: `Who is the owner of ${repo.repo}?`,
|
|
||||||
groundTruth: getOwner(repo.repo),
|
|
||||||
type: 'field-retrieval',
|
|
||||||
dataset: 'github',
|
|
||||||
})
|
|
||||||
}
|
|
||||||
else if (i % 5 === 3) {
|
|
||||||
questions.push({
|
|
||||||
id: `q${idCounter++}`,
|
|
||||||
prompt: `What is the default branch of ${repo.repo}?`,
|
|
||||||
groundTruth: repo.defaultBranch,
|
|
||||||
type: 'field-retrieval',
|
|
||||||
dataset: 'github',
|
|
||||||
})
|
|
||||||
}
|
|
||||||
else {
|
|
||||||
questions.push({
|
|
||||||
id: `q${idCounter++}`,
|
|
||||||
prompt: `How many watchers does ${repo.repo} have?`,
|
|
||||||
groundTruth: String(repo.watchers),
|
|
||||||
type: 'field-retrieval',
|
|
||||||
dataset: 'github',
|
|
||||||
})
|
|
||||||
}
|
|
||||||
}
|
|
||||||
|
|
||||||
// Aggregation: popular repositories
|
|
||||||
const totalStars = github.reduce((sum, r) => sum + r.stars, 0)
|
|
||||||
const totalRepos = github.length
|
|
||||||
const avgStars = Math.round(totalStars / totalRepos)
|
|
||||||
|
|
||||||
questions.push(
|
|
||||||
{
|
|
||||||
id: `q${idCounter++}`,
|
|
||||||
prompt: 'What is the total number of stars across all repositories?',
|
|
||||||
groundTruth: String(totalStars),
|
|
||||||
type: 'aggregation',
|
|
||||||
dataset: 'github',
|
|
||||||
},
|
|
||||||
{
|
|
||||||
id: `q${idCounter++}`,
|
|
||||||
prompt: 'How many repositories are in the dataset?',
|
|
||||||
groundTruth: String(totalRepos),
|
|
||||||
type: 'aggregation',
|
|
||||||
dataset: 'github',
|
|
||||||
},
|
|
||||||
{
|
|
||||||
id: `q${idCounter++}`,
|
|
||||||
prompt: 'What is the average number of stars per repository?',
|
|
||||||
groundTruth: String(avgStars),
|
|
||||||
type: 'aggregation',
|
|
||||||
dataset: 'github',
|
|
||||||
},
|
|
||||||
)
|
|
||||||
|
|
||||||
// Aggregation: star thresholds (single-condition filters)
|
|
||||||
for (const threshold of QUESTION_THRESHOLDS.github.stars) {
|
|
||||||
const count = github.filter(r => r.stars > threshold).length
|
|
||||||
questions.push({
|
|
||||||
id: `q${idCounter++}`,
|
|
||||||
prompt: `How many repositories have more than ${threshold} stars?`,
|
|
||||||
groundTruth: String(count),
|
|
||||||
type: 'aggregation',
|
|
||||||
dataset: 'github',
|
|
||||||
})
|
|
||||||
}
|
|
||||||
|
|
||||||
// Aggregation: fork thresholds (single-condition filters)
|
|
||||||
for (const threshold of QUESTION_THRESHOLDS.github.forks) {
|
|
      const count = github.filter(r => r.forks > threshold).length
      questions.push({
        id: `q${idCounter++}`,
        prompt: `How many repositories have more than ${threshold} forks?`,
        groundTruth: String(count),
        type: 'aggregation',
        dataset: 'github',
      })
    }

    // Aggregation: watcher thresholds (single-condition filters)
    for (const threshold of QUESTION_THRESHOLDS.github.watchers) {
      const count = github.filter(r => r.watchers > threshold).length
      questions.push({
        id: `q${idCounter++}`,
        prompt: `How many repositories have more than ${threshold} watchers?`,
        groundTruth: String(count),
        type: 'aggregation',
        dataset: 'github',
      })
    }

    // Aggregation: default branch counts
    const branches = [...new Set(github.map(r => r.defaultBranch))]
    for (const branch of branches.slice(0, QUESTION_LIMITS.github.aggregationBranches)) {
      const count = github.filter(r => r.defaultBranch === branch).length
      questions.push({
        id: `q${idCounter++}`,
        prompt: `How many repositories use "${branch}" as their default branch?`,
        groundTruth: String(count),
        type: 'aggregation',
        dataset: 'github',
      })
    }

    // Filtering: multi-condition queries (stars AND forks)
    for (const combo of QUESTION_THRESHOLDS.github.starForkCombinations.slice(0, QUESTION_LIMITS.github.filteringStarsAndForks)) {
      const count = github.filter(r => r.stars > combo.stars && r.forks > combo.forks).length
      questions.push({
        id: `q${idCounter++}`,
        prompt: `How many repositories have more than ${combo.stars} stars and more than ${combo.forks} forks?`,
        groundTruth: String(count),
        type: 'filtering',
        dataset: 'github',
      })
    }

    // Filtering: stars AND watchers (multi-condition)
    for (const combo of QUESTION_THRESHOLDS.github.starWatcherCombinations) {
      const count = github.filter(r => r.stars > combo.stars && r.watchers > combo.watchers).length
      questions.push({
        id: `q${idCounter++}`,
        prompt: `How many repositories have more than ${combo.stars} stars and more than ${combo.watchers} watchers?`,
        groundTruth: String(count),
        type: 'filtering',
        dataset: 'github',
      })
    }
  }

  return questions
}
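The new per-dataset generators below replace these inline object literals with a `QuestionBuilder` fluent API imported from `./utils`. That file is not part of this chunk, so the following is only a minimal sketch of what such a builder could look like, inferred from the call sites (the field names match the `Question` shape used above; the validation in `build()` is an assumption):

```typescript
// Hypothetical sketch of the QuestionBuilder used by the new generators.
// The real implementation lives in benchmarks/src/questions/utils.ts.
interface Question {
  id: string
  prompt: string
  groundTruth: string
  type: 'field-retrieval' | 'aggregation' | 'filtering'
  dataset: string
}

class QuestionBuilder {
  private q: Partial<Question> = {}

  id(id: string): this { this.q.id = id; return this }
  prompt(prompt: string): this { this.q.prompt = prompt; return this }
  groundTruth(value: string): this { this.q.groundTruth = value; return this }
  type(type: Question['type']): this { this.q.type = type; return this }
  dataset(dataset: string): this { this.q.dataset = dataset; return this }

  build(): Question {
    // Fail fast if a generator forgot a field (assumed behavior).
    for (const key of ['id', 'prompt', 'groundTruth', 'type', 'dataset'] as const) {
      if (this.q[key] === undefined)
        throw new Error(`QuestionBuilder: missing ${key}`)
    }
    return this.q as Question
  }
}
```

Compared to the literals above, the builder centralizes the `Question` shape, so adding or renaming a field touches one place instead of every generator.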
196  benchmarks/src/questions/analytics.ts  Normal file
@@ -0,0 +1,196 @@
import type { AnalyticsMetric } from '../datasets'
import type { Question } from '../types'
import { QUESTION_LIMITS, QUESTION_THRESHOLDS } from '../constants'
import { countByPredicate, QuestionBuilder, rotateQuestions, SAMPLE_STRIDES } from './utils'

/**
 * Generate analytics (website metrics) questions
 */
export function generateAnalyticsQuestions(metrics: AnalyticsMetric[], getId: () => string): Question[] {
  const questions: Question[] = []

  if (metrics.length === 0)
    return questions

  // Field retrieval: date-based metrics
  const metricFieldGenerators: Array<(metric: AnalyticsMetric, getId: () => string) => Question> = [
    (metric, getId) => new QuestionBuilder()
      .id(getId())
      .prompt(`What are the views for ${metric.date}?`)
      .groundTruth(String(metric.views))
      .type('field-retrieval')
      .dataset('analytics')
      .build(),
    (metric, getId) => new QuestionBuilder()
      .id(getId())
      .prompt(`What is the revenue for ${metric.date}?`)
      .groundTruth(String(metric.revenue))
      .type('field-retrieval')
      .dataset('analytics')
      .build(),
    (metric, getId) => new QuestionBuilder()
      .id(getId())
      .prompt(`What is the bounce rate for ${metric.date}?`)
      .groundTruth(String(metric.bounceRate))
      .type('field-retrieval')
      .dataset('analytics')
      .build(),
    (metric, getId) => new QuestionBuilder()
      .id(getId())
      .prompt(`How many conversions were there on ${metric.date}?`)
      .groundTruth(String(metric.conversions))
      .type('field-retrieval')
      .dataset('analytics')
      .build(),
  ]

  questions.push(...rotateQuestions(
    metrics,
    metricFieldGenerators,
    QUESTION_LIMITS.analytics.fieldRetrievalDates,
    SAMPLE_STRIDES.ANALYTICS_FIELD,
    getId,
  ))

  // Aggregation: basic statistics
  const totalDays = metrics.length
  const totalViews = metrics.reduce((sum, m) => sum + m.views, 0)
  const totalConversions = metrics.reduce((sum, m) => sum + m.conversions, 0)
  const totalRevenue = metrics.reduce((sum, m) => sum + m.revenue, 0)
  const avgBounceRate = metrics.reduce((sum, m) => sum + m.bounceRate, 0) / metrics.length

  questions.push(
    new QuestionBuilder()
      .id(getId())
      .prompt('How many days of data are in the dataset?')
      .groundTruth(String(totalDays))
      .type('aggregation')
      .dataset('analytics')
      .build(),
    new QuestionBuilder()
      .id(getId())
      .prompt('What is the total number of views across all dates?')
      .groundTruth(String(totalViews))
      .type('aggregation')
      .dataset('analytics')
      .build(),
    new QuestionBuilder()
      .id(getId())
      .prompt('What is the total number of conversions across all dates?')
      .groundTruth(String(totalConversions))
      .type('aggregation')
      .dataset('analytics')
      .build(),
    new QuestionBuilder()
      .id(getId())
      .prompt('What is the total revenue across all dates?')
      .groundTruth(String(totalRevenue.toFixed(2)))
      .type('aggregation')
      .dataset('analytics')
      .build(),
    new QuestionBuilder()
      .id(getId())
      .prompt('What is the average bounce rate?')
      .groundTruth(String(avgBounceRate.toFixed(2)))
      .type('aggregation')
      .dataset('analytics')
      .build(),
  )

  // Aggregation: high views/conversions
  for (const threshold of QUESTION_THRESHOLDS.analytics.views) {
    const count = countByPredicate(metrics, m => m.views > threshold)
    questions.push(
      new QuestionBuilder()
        .id(getId())
        .prompt(`How many days had more than ${threshold} views?`)
        .groundTruth(String(count))
        .type('aggregation')
        .dataset('analytics')
        .build(),
    )
  }

  for (const threshold of QUESTION_THRESHOLDS.analytics.conversions) {
    const count = countByPredicate(metrics, m => m.conversions > threshold)
    questions.push(
      new QuestionBuilder()
        .id(getId())
        .prompt(`How many days had more than ${threshold} conversions?`)
        .groundTruth(String(count))
        .type('aggregation')
        .dataset('analytics')
        .build(),
    )
  }

  // Filtering: multi-condition (views AND conversions)
  for (const threshold of QUESTION_THRESHOLDS.analytics.viewsForFiltering) {
    const count = countByPredicate(
      metrics,
      m => m.views > threshold && m.conversions > QUESTION_THRESHOLDS.analytics.conversionsForFiltering,
    )
    questions.push(
      new QuestionBuilder()
        .id(getId())
        .prompt(`How many days had more than ${threshold} views and more than ${QUESTION_THRESHOLDS.analytics.conversionsForFiltering} conversions?`)
        .groundTruth(String(count))
        .type('filtering')
        .dataset('analytics')
        .build(),
    )
  }

  // Filtering: revenue thresholds (with a views floor)
  for (const threshold of QUESTION_THRESHOLDS.analytics.revenueThresholds) {
    const count = countByPredicate(
      metrics,
      m => m.revenue > threshold && m.views > QUESTION_THRESHOLDS.analytics.viewsThresholdForRevenue,
    )
    questions.push(
      new QuestionBuilder()
        .id(getId())
        .prompt(`How many days had revenue greater than ${threshold} with views above ${QUESTION_THRESHOLDS.analytics.viewsThresholdForRevenue}?`)
        .groundTruth(String(count))
        .type('filtering')
        .dataset('analytics')
        .build(),
    )
  }

  // Filtering: clicks and conversions
  for (const threshold of QUESTION_THRESHOLDS.analytics.clicksForFiltering) {
    const count = countByPredicate(
      metrics,
      m => m.clicks > threshold && m.conversions > QUESTION_THRESHOLDS.analytics.conversionsForClickFiltering,
    )
    questions.push(
      new QuestionBuilder()
        .id(getId())
        .prompt(`How many days had more than ${threshold} clicks and more than ${QUESTION_THRESHOLDS.analytics.conversionsForClickFiltering} conversions?`)
        .groundTruth(String(count))
        .type('filtering')
        .dataset('analytics')
        .build(),
    )
  }

  // Filtering: revenue and bounce rate
  for (const threshold of QUESTION_THRESHOLDS.analytics.revenueForBounceRate) {
    const count = countByPredicate(
      metrics,
      m => m.revenue > threshold && m.bounceRate < QUESTION_THRESHOLDS.analytics.bounceRateThreshold,
    )
    questions.push(
      new QuestionBuilder()
        .id(getId())
        .prompt(`How many days had revenue greater than ${threshold} with bounce rate below ${QUESTION_THRESHOLDS.analytics.bounceRateThreshold}?`)
        .groundTruth(String(count))
        .type('filtering')
        .dataset('analytics')
        .build(),
    )
  }

  return questions
}
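Every generator file drives its field-retrieval questions through `rotateQuestions` from `./utils`, which is not included in this diff chunk. Inferred from the call sites (items, generators, a question limit, a sampling stride, and the id factory), a plausible sketch is: sample items at a fixed stride and cycle through the generators so a small question budget still covers both many items and all fields. The exact rotation order is an assumption:

```typescript
// Hypothetical sketch of rotateQuestions, inferred from its call sites.
// The real helper lives in benchmarks/src/questions/utils.ts.
type Gen<T, Q> = (item: T, getId: () => string) => Q

function rotateQuestions<T, Q>(
  items: T[],
  generators: Gen<T, Q>[],
  limit: number,
  stride: number,
  getId: () => string,
): Q[] {
  const out: Q[] = []
  for (let i = 0, n = 0; i < items.length && n < limit; i += stride, n++) {
    // Cycle generators so consecutive sampled items ask about different fields.
    const gen = generators[n % generators.length]
    out.push(gen(items[i], getId))
  }
  return out
}
```

Under this reading, `SAMPLE_STRIDES.ANALYTICS_FIELD` spreads the sampled dates across the dataset rather than clustering them at the start, which makes the field-retrieval questions exercise the whole context window.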
162  benchmarks/src/questions/event-logs.ts  Normal file
@@ -0,0 +1,162 @@
import type { EventLog } from '../datasets'
import type { Question } from '../types'
import { QUESTION_LIMITS } from '../constants'
import { countByPredicate, QuestionBuilder, rotateQuestions, SAMPLE_STRIDES } from './utils'

/**
 * Generate event log questions
 */
export function generateEventLogsQuestions(logs: EventLog[], getId: () => string): Question[] {
  const questions: Question[] = []

  if (logs.length === 0)
    return questions

  // Field retrieval: log metadata
  const logFieldGenerators: Array<(log: EventLog, getId: () => string) => Question> = [
    (log, getId) => new QuestionBuilder()
      .id(getId())
      .prompt(`What is the level of the log at ${log.timestamp}?`)
      .groundTruth(log.level)
      .type('field-retrieval')
      .dataset('event-logs')
      .build(),
    (log, getId) => new QuestionBuilder()
      .id(getId())
      .prompt(`What is the endpoint for the log at ${log.timestamp}?`)
      .groundTruth(log.endpoint)
      .type('field-retrieval')
      .dataset('event-logs')
      .build(),
    (log, getId) => new QuestionBuilder()
      .id(getId())
      .prompt(`What is the status code for the log at ${log.timestamp}?`)
      .groundTruth(String(log.statusCode))
      .type('field-retrieval')
      .dataset('event-logs')
      .build(),
    (log, getId) => new QuestionBuilder()
      .id(getId())
      .prompt(`What is the response time for the log at ${log.timestamp}?`)
      .groundTruth(String(log.responseTime))
      .type('field-retrieval')
      .dataset('event-logs')
      .build(),
  ]

  questions.push(...rotateQuestions(
    logs,
    logFieldGenerators,
    QUESTION_LIMITS.eventLogs.fieldRetrieval,
    SAMPLE_STRIDES.EVENT_LOG_FIELD,
    getId,
  ))

  // Aggregation: basic statistics
  const totalLogs = logs.length
  const avgResponseTime = logs.reduce((sum, l) => sum + l.responseTime, 0) / logs.length

  questions.push(
    new QuestionBuilder()
      .id(getId())
      .prompt('How many log entries are in the dataset?')
      .groundTruth(String(totalLogs))
      .type('aggregation')
      .dataset('event-logs')
      .build(),
    new QuestionBuilder()
      .id(getId())
      .prompt('What is the average response time across all logs?')
      .groundTruth(String(avgResponseTime.toFixed(2)))
      .type('aggregation')
      .dataset('event-logs')
      .build(),
  )

  // Aggregation: by level
  const levels = [...new Set(logs.map(l => l.level))]
  for (const level of levels) {
    const count = countByPredicate(logs, l => l.level === level)
    questions.push(
      new QuestionBuilder()
        .id(getId())
        .prompt(`How many log entries have level "${level}"?`)
        .groundTruth(String(count))
        .type('aggregation')
        .dataset('event-logs')
        .build(),
    )
  }

  // Aggregation: by endpoint
  const endpoints = [...new Set(logs.map(l => l.endpoint))]
  for (const endpoint of endpoints.slice(0, QUESTION_LIMITS.eventLogs.aggregationEndpoints)) {
    const count = countByPredicate(logs, l => l.endpoint === endpoint)
    questions.push(
      new QuestionBuilder()
        .id(getId())
        .prompt(`How many log entries are for endpoint "${endpoint}"?`)
        .groundTruth(String(count))
        .type('aggregation')
        .dataset('event-logs')
        .build(),
    )
  }

  // Aggregation: by status code range
  const errorCount = countByPredicate(logs, l => l.statusCode >= 400)
  const successCount = countByPredicate(logs, l => l.statusCode >= 200 && l.statusCode < 300)

  questions.push(
    new QuestionBuilder()
      .id(getId())
      .prompt('How many log entries have a status code indicating an error (>= 400)?')
      .groundTruth(String(errorCount))
      .type('aggregation')
      .dataset('event-logs')
      .build(),
    new QuestionBuilder()
      .id(getId())
      .prompt('How many log entries have a successful status code (200-299)?')
      .groundTruth(String(successCount))
      .type('aggregation')
      .dataset('event-logs')
      .build(),
  )

  // Filtering: multi-condition (level AND status)
  for (const level of levels.slice(0, QUESTION_LIMITS.eventLogs.filteringLevelAndStatus)) {
    const count = countByPredicate(
      logs,
      l => l.level === level && l.statusCode >= 400,
    )
    questions.push(
      new QuestionBuilder()
        .id(getId())
        .prompt(`How many log entries have level "${level}" and status code >= 400?`)
        .groundTruth(String(count))
        .type('filtering')
        .dataset('event-logs')
        .build(),
    )
  }

  // Filtering: endpoint AND status
  for (const endpoint of endpoints.slice(0, QUESTION_LIMITS.eventLogs.filteringEndpointAndStatus)) {
    const count = countByPredicate(
      logs,
      l => l.endpoint === endpoint && l.statusCode >= 500,
    )
    questions.push(
      new QuestionBuilder()
        .id(getId())
        .prompt(`How many log entries are for endpoint "${endpoint}" with status code >= 500?`)
        .groundTruth(String(count))
        .type('filtering')
        .dataset('event-logs')
        .build(),
    )
  }

  return questions
}
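The aggregation and filtering sections above lean on `countByPredicate` from `./utils`, also outside this chunk. From how it is called, it is presumably a thin wrapper that keeps the counting code declarative; a one-line sketch under that assumption:

```typescript
// Hypothetical sketch of countByPredicate; the real helper lives in
// benchmarks/src/questions/utils.ts.
function countByPredicate<T>(items: T[], predicate: (item: T) => boolean): number {
  return items.filter(predicate).length
}
```

The value is readability rather than behavior: `countByPredicate(logs, l => l.statusCode >= 400)` names the intent where an inline `filter(...).length` would repeat the mechanics at every call site.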
184  benchmarks/src/questions/github.ts  Normal file
@@ -0,0 +1,184 @@
import type { Repository } from '../datasets'
import type { Question } from '../types'
import { QUESTION_LIMITS, QUESTION_THRESHOLDS } from '../constants'
import { countByPredicate, QuestionBuilder, rotateQuestions, SAMPLE_STRIDES } from './utils'

/**
 * Generate GitHub repository questions
 */
export function generateGithubQuestions(repos: Repository[], getId: () => string): Question[] {
  const questions: Question[] = []

  if (repos.length === 0)
    return questions

  // Field retrieval: repository metadata
  const repoFieldGenerators: Array<(repo: Repository, getId: () => string) => Question> = [
    (repo, getId) => new QuestionBuilder()
      .id(getId())
      .prompt(`How many stars does ${repo.owner}/${repo.name} have?`)
      .groundTruth(String(repo.stars))
      .type('field-retrieval')
      .dataset('github')
      .build(),
    (repo, getId) => new QuestionBuilder()
      .id(getId())
      .prompt(`How many forks does ${repo.owner}/${repo.name} have?`)
      .groundTruth(String(repo.forks))
      .type('field-retrieval')
      .dataset('github')
      .build(),
    (repo, getId) => new QuestionBuilder()
      .id(getId())
      .prompt(`How many watchers does ${repo.owner}/${repo.name} have?`)
      .groundTruth(String(repo.watchers))
      .type('field-retrieval')
      .dataset('github')
      .build(),
    (repo, getId) => new QuestionBuilder()
      .id(getId())
      .prompt(`What is the main branch of ${repo.owner}/${repo.name}?`)
      .groundTruth(repo.defaultBranch)
      .type('field-retrieval')
      .dataset('github')
      .build(),
  ]

  questions.push(...rotateQuestions(
    repos,
    repoFieldGenerators,
    QUESTION_LIMITS.github.fieldRetrievalRepos,
    SAMPLE_STRIDES.REPO_FIELD,
    getId,
  ))

  // Aggregation: basic statistics
  const totalRepos = repos.length
  const totalStars = repos.reduce((sum, r) => sum + r.stars, 0)
  const totalForks = repos.reduce((sum, r) => sum + r.forks, 0)
  const avgStars = totalStars / totalRepos

  questions.push(
    new QuestionBuilder()
      .id(getId())
      .prompt('How many repositories are in the dataset?')
      .groundTruth(String(totalRepos))
      .type('aggregation')
      .dataset('github')
      .build(),
    new QuestionBuilder()
      .id(getId())
      .prompt('What is the total number of stars across all repositories?')
      .groundTruth(String(totalStars))
      .type('aggregation')
      .dataset('github')
      .build(),
    new QuestionBuilder()
      .id(getId())
      .prompt('What is the total number of forks across all repositories?')
      .groundTruth(String(totalForks))
      .type('aggregation')
      .dataset('github')
      .build(),
    new QuestionBuilder()
      .id(getId())
      .prompt('What is the average number of stars per repository?')
      .groundTruth(String(Math.round(avgStars)))
      .type('aggregation')
      .dataset('github')
      .build(),
  )

  // Aggregation: by default branch
  const branches = [...new Set(repos.map(r => r.defaultBranch))]
  for (const branch of branches.slice(0, QUESTION_LIMITS.github.aggregationBranches)) {
    const count = countByPredicate(repos, r => r.defaultBranch === branch)
    questions.push(
      new QuestionBuilder()
        .id(getId())
        .prompt(`How many repositories use "${branch}" as their default branch?`)
        .groundTruth(String(count))
        .type('aggregation')
        .dataset('github')
        .build(),
    )
  }

  // Aggregation: high star counts
  for (const threshold of QUESTION_THRESHOLDS.github.stars) {
    const count = countByPredicate(repos, r => r.stars > threshold)
    questions.push(
      new QuestionBuilder()
        .id(getId())
        .prompt(`How many repositories have more than ${threshold} stars?`)
        .groundTruth(String(count))
        .type('aggregation')
        .dataset('github')
        .build(),
    )
  }

  // Aggregation: high fork counts
  for (const threshold of QUESTION_THRESHOLDS.github.forks) {
    const count = countByPredicate(repos, r => r.forks > threshold)
    questions.push(
      new QuestionBuilder()
        .id(getId())
        .prompt(`How many repositories have more than ${threshold} forks?`)
        .groundTruth(String(count))
        .type('aggregation')
        .dataset('github')
        .build(),
    )
  }

  // Aggregation: high watcher counts
  for (const threshold of QUESTION_THRESHOLDS.github.watchers) {
    const count = countByPredicate(repos, r => r.watchers > threshold)
    questions.push(
      new QuestionBuilder()
        .id(getId())
        .prompt(`How many repositories have more than ${threshold} watchers?`)
        .groundTruth(String(count))
        .type('aggregation')
        .dataset('github')
        .build(),
    )
  }

  // Filtering: multi-condition (stars AND forks)
  for (const combo of QUESTION_THRESHOLDS.github.starForkCombinations.slice(0, QUESTION_LIMITS.github.filteringStarsAndForks)) {
    const count = countByPredicate(
      repos,
      r => r.stars > combo.stars && r.forks > combo.forks,
    )
    questions.push(
      new QuestionBuilder()
        .id(getId())
        .prompt(`How many repositories have more than ${combo.stars} stars and more than ${combo.forks} forks?`)
        .groundTruth(String(count))
        .type('filtering')
        .dataset('github')
        .build(),
    )
  }

  // Filtering: stars AND watchers
  for (const combo of QUESTION_THRESHOLDS.github.starWatcherCombinations) {
    const count = countByPredicate(
      repos,
      r => r.stars > combo.stars && r.watchers > combo.watchers,
    )
    questions.push(
      new QuestionBuilder()
        .id(getId())
        .prompt(`How many repositories have more than ${combo.stars} stars and more than ${combo.watchers} watchers?`)
        .groundTruth(String(count))
        .type('filtering')
        .dataset('github')
        .build(),
    )
  }

  return questions
}
46  benchmarks/src/questions/index.ts  Normal file
@@ -0,0 +1,46 @@
import type { AnalyticsMetric, Employee, EventLog, NestedConfig, Order, Repository } from '../datasets'
import type { Question } from '../types'
import { ACCURACY_DATASETS } from '../datasets'
import { generateAnalyticsQuestions } from './analytics'
import { generateEventLogsQuestions } from './event-logs'
import { generateGithubQuestions } from './github'
import { generateNestedQuestions } from './nested'
import { generateNestedConfigQuestions } from './nested-config'
import { generateTabularQuestions } from './tabular'
import { createIdGenerator } from './utils'

/**
 * Generate all questions from datasets
 *
 * @remarks
 * Generates ~150-160 questions across different question types and datasets:
 * - Field Retrieval: Direct field access with no computation
 *   Examples: "What is X's salary?", "What is the status of order Y?"
 * - Aggregation: Counts, sums, averages, min/max operations (including single-condition filters)
 *   Examples: "How many X?", "What is the total/average?", "How many X > threshold?"
 * - Filtering: Multi-condition queries requiring complex logical operations
 *   Examples: "How many X WHERE condition1 AND condition2?"
 */
export function generateQuestions(): Question[] {
  const questions: Question[] = []
  const idGen = createIdGenerator()
  const getId = () => idGen.next().value

  // Get datasets with proper typing
  const tabular = (ACCURACY_DATASETS.find(d => d.name === 'tabular')?.data.employees as Employee[]) ?? []
  const nested = (ACCURACY_DATASETS.find(d => d.name === 'nested')?.data.orders as Order[]) ?? []
  const analytics = (ACCURACY_DATASETS.find(d => d.name === 'analytics')?.data.metrics as AnalyticsMetric[]) ?? []
  const github = (ACCURACY_DATASETS.find(d => d.name === 'github')?.data.repositories as Repository[]) ?? []
  const eventLogs = (ACCURACY_DATASETS.find(d => d.name === 'event-logs')?.data.logs as EventLog[]) ?? []
  const nestedConfig = ACCURACY_DATASETS.find(d => d.name === 'nested-config')?.data as NestedConfig | undefined

  // Generate questions for each dataset
  questions.push(...generateTabularQuestions(tabular, getId))
  questions.push(...generateNestedQuestions(nested, getId))
  questions.push(...generateAnalyticsQuestions(analytics, getId))
  questions.push(...generateGithubQuestions(github, getId))
  questions.push(...generateEventLogsQuestions(eventLogs, getId))
  questions.push(...generateNestedConfigQuestions(nestedConfig, getId))

  return questions
}
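The id plumbing in `generateQuestions` (`createIdGenerator()` plus `const getId = () => idGen.next().value`) suggests an infinite generator yielding sequential question ids shared across all datasets, which would match the `q${idCounter++}` ids of the old monolithic generator. The real implementation is in `./utils` and not shown here; a minimal sketch under that assumption:

```typescript
// Hypothetical sketch of createIdGenerator; the real helper lives in
// benchmarks/src/questions/utils.ts.
function* createIdGenerator(): Generator<string, string, unknown> {
  let counter = 1
  // Infinite sequence: q1, q2, q3, ...
  while (true)
    yield `q${counter++}`
}
```

Sharing one generator through the `getId` closure keeps ids globally unique even though each dataset's questions are produced by a separate module.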
147  benchmarks/src/questions/nested-config.ts  Normal file
@@ -0,0 +1,147 @@
import type { NestedConfig } from '../datasets'
import type { Question } from '../types'
import { QUESTION_LIMITS } from '../constants'
import { QuestionBuilder } from './utils'

/**
 * Generate nested configuration questions
 */
export function generateNestedConfigQuestions(config: NestedConfig | undefined, getId: () => string): Question[] {
  const questions: Question[] = []

  if (!config)
    return questions

  // Field retrieval: top-level config values
  const fieldRetrievalQuestions = [
    {
      prompt: 'What is the environment in the configuration?',
      groundTruth: config.environment,
    },
    {
      prompt: 'What is the database host?',
      groundTruth: config.database.host,
    },
    {
      prompt: 'What is the database port?',
      groundTruth: String(config.database.port),
    },
    {
      prompt: 'What is the maximum connection pool size?',
      groundTruth: String(config.database.pool.max),
    },
    {
      prompt: 'What is the session duration?',
      groundTruth: String(config.authentication.session.duration),
    },
  ]

  for (const q of fieldRetrievalQuestions.slice(0, QUESTION_LIMITS.nestedConfig.fieldRetrieval)) {
    questions.push(
      new QuestionBuilder()
        .id(getId())
        .prompt(q.prompt)
        .groundTruth(q.groundTruth)
        .type('field-retrieval')
        .dataset('nested-config')
        .build(),
    )
  }

  // Aggregation: counts of nested structures
  const roleCount = Object.keys(config.permissions.roles).length
  const groupCount = Object.keys(config.permissions.groups).length
  const providerCount = config.authentication.providers.length
  const featureCount = Object.keys(config.features).length
  const replicaCount = config.database.replicas.length

  questions.push(
    new QuestionBuilder()
      .id(getId())
      .prompt('How many roles are defined in permissions?')
      .groundTruth(String(roleCount))
      .type('aggregation')
      .dataset('nested-config')
      .build(),
    new QuestionBuilder()
      .id(getId())
      .prompt('How many groups are defined in permissions?')
      .groundTruth(String(groupCount))
      .type('aggregation')
      .dataset('nested-config')
      .build(),
    new QuestionBuilder()
      .id(getId())
      .prompt('How many authentication providers are configured?')
      .groundTruth(String(providerCount))
      .type('aggregation')
      .dataset('nested-config')
      .build(),
    new QuestionBuilder()
      .id(getId())
      .prompt('How many feature flags are defined?')
      .groundTruth(String(featureCount))
      .type('aggregation')
      .dataset('nested-config')
      .build(),
    new QuestionBuilder()
      .id(getId())
      .prompt('How many database replicas are configured?')
      .groundTruth(String(replicaCount))
      .type('aggregation')
      .dataset('nested-config')
      .build(),
  )

  // Aggregation: feature flag details
  const enabledFeatures = Object.entries(config.features).filter(([_, f]) => f.enabled).length
  questions.push(
    new QuestionBuilder()
      .id(getId())
      .prompt('How many feature flags are enabled?')
      .groundTruth(String(enabledFeatures))
      .type('aggregation')
      .dataset('nested-config')
      .build(),
  )

  // Aggregation: role permissions
  const adminPermissions = config.permissions.roles.admin?.permissions.length ?? 0
  questions.push(
    new QuestionBuilder()
      .id(getId())
      .prompt('How many permissions does the admin role have?')
      .groundTruth(String(adminPermissions))
      .type('aggregation')
      .dataset('nested-config')
      .build(),
  )

  // Filtering: complex multi-condition queries
  const filteringQuestions = [
    {
      prompt: 'How many feature flags are enabled with rollout greater than 50%?',
      groundTruth: String(Object.entries(config.features)
        .filter(([_, f]) => f.enabled && f.rollout > 50).length),
    },
    {
      prompt: 'How many groups have the admin role?',
      groundTruth: String(Object.entries(config.permissions.groups)
        .filter(([_, g]) => g.roles.includes('admin')).length),
    },
  ]

  for (const q of filteringQuestions.slice(0, QUESTION_LIMITS.nestedConfig.filteringComplex)) {
    questions.push(
      new QuestionBuilder()
        .id(getId())
        .prompt(q.prompt)
        .groundTruth(q.groundTruth)
        .type('filtering')
        .dataset('nested-config')
        .build(),
    )
  }

  return questions
}
202  benchmarks/src/questions/nested.ts  Normal file
@@ -0,0 +1,202 @@
import type { Order } from '../datasets'
import type { Question } from '../types'
import { QUESTION_LIMITS, QUESTION_THRESHOLDS } from '../constants'
import { countByPredicate, QuestionBuilder, rotateQuestions, SAMPLE_STRIDES } from './utils'

/**
 * Generate nested (orders) questions
 */
export function generateNestedQuestions(orders: Order[], getId: () => string): Question[] {
  const questions: Question[] = []

  if (orders.length === 0)
    return questions

  // Field retrieval: order totals and statuses
  const orderFieldGenerators: Array<(order: Order, getId: () => string) => Question> = [
    (order, getId) => new QuestionBuilder()
      .id(getId())
      .prompt(`What is the total for order ${order.orderId}?`)
      .groundTruth(String(order.total))
      .type('field-retrieval')
      .dataset('nested')
      .build(),
    (order, getId) => new QuestionBuilder()
      .id(getId())
      .prompt(`What is the status of order ${order.orderId}?`)
      .groundTruth(order.status)
      .type('field-retrieval')
      .dataset('nested')
      .build(),
  ]

  questions.push(...rotateQuestions(
    orders,
    orderFieldGenerators,
    QUESTION_LIMITS.nested.fieldRetrievalOrders,
    SAMPLE_STRIDES.ORDER_FIELD,
    getId,
  ))

  // Field retrieval: customer info and order dates
  const customerFieldGenerators: Array<(order: Order, getId: () => string) => Question> = [
    (order, getId) => new QuestionBuilder()
      .id(getId())
      .prompt(`What is the customer name for order ${order.orderId}?`)
      .groundTruth(order.customer.name)
      .type('field-retrieval')
      .dataset('nested')
      .build(),
    (order, getId) => new QuestionBuilder()
      .id(getId())
      .prompt(`What is the customer email for order ${order.orderId}?`)
      .groundTruth(order.customer.email)
      .type('field-retrieval')
      .dataset('nested')
      .build(),
    (order, getId) => new QuestionBuilder()
      .id(getId())
      .prompt(`What is the order date for order ${order.orderId}?`)
      .groundTruth(order.orderDate || '')
      .type('field-retrieval')
      .dataset('nested')
      .build(),
    (order, getId) => new QuestionBuilder()
      .id(getId())
      .prompt(`How many items are in order ${order.orderId}?`)
      .groundTruth(String(order.items.length))
      .type('field-retrieval')
      .dataset('nested')
      .build(),
  ]

  // Use stride + 1 for customer fields to offset from order fields
  const customerOrders = orders.map((_, i) => orders[i * SAMPLE_STRIDES.CUSTOMER_FIELD + 1] || orders[i]).filter(Boolean) as Order[]
  questions.push(...rotateQuestions(
    customerOrders,
    customerFieldGenerators,
    QUESTION_LIMITS.nested.fieldRetrievalCustomers,
    1,
    getId,
  ))

  // Aggregation: totals and averages
  const totalRevenue = orders.reduce((sum, o) => sum + o.total, 0)
  const avgOrderValue = totalRevenue / orders.length
  const totalOrders = orders.length
  const maxOrderValue = Math.max(...orders.map(o => o.total))

  // Count by status
  const statuses = [...new Set(orders.map(o => o.status))]
  for (const status of statuses.slice(0, QUESTION_LIMITS.nested.aggregationStatuses)) {
    const count = countByPredicate(orders, o => o.status === status)
    questions.push(
      new QuestionBuilder()
        .id(getId())
        .prompt(`How many orders have status "${status}"?`)
        .groundTruth(String(count))
        .type('aggregation')
        .dataset('nested')
        .build(),
    )
  }

  questions.push(
    new QuestionBuilder()
      .id(getId())
      .prompt('What is the total revenue across all orders?')
      .groundTruth(String(totalRevenue.toFixed(2)))
      .type('aggregation')
      .dataset('nested')
      .build(),
    new QuestionBuilder()
      .id(getId())
      .prompt('What is the average order value?')
      .groundTruth(String(avgOrderValue.toFixed(2)))
      .type('aggregation')
      .dataset('nested')
      .build(),
    new QuestionBuilder()
      .id(getId())
      .prompt('How many orders are in the dataset?')
      .groundTruth(String(totalOrders))
      .type('aggregation')
      .dataset('nested')
      .build(),
    new QuestionBuilder()
      .id(getId())
      .prompt('What is the highest order total?')
      .groundTruth(String(maxOrderValue.toFixed(2)))
      .type('aggregation')
      .dataset('nested')
      .build(),
  )

  // Aggregation: high-value orders (single-condition filter)
  for (const threshold of QUESTION_THRESHOLDS.nested.highValueOrders) {
    const count = countByPredicate(orders, o => o.total > threshold)
    questions.push(
      new QuestionBuilder()
        .id(getId())
        .prompt(`How many orders have a total greater than ${threshold}?`)
        .groundTruth(String(count))
        .type('aggregation')
        .dataset('nested')
        .build(),
    )
  }

  // Filtering: multi-condition queries (status AND value)
  const orderStatuses = [...new Set(orders.map(o => o.status))]
  for (const status of orderStatuses.slice(0, QUESTION_LIMITS.nested.filteringStatusAndValue)) {
    const count = countByPredicate(
      orders,
      o => o.status === status && o.total > QUESTION_THRESHOLDS.nested.statusValueThreshold,
    )
    questions.push(
      new QuestionBuilder()
        .id(getId())
        .prompt(`How many orders have status "${status}" and total greater than ${QUESTION_THRESHOLDS.nested.statusValueThreshold}?`)
        .groundTruth(String(count))
        .type('filtering')
        .dataset('nested')
        .build(),
    )
  }

  // Filtering: status AND items count (multi-condition)
  for (const status of orderStatuses.slice(0, QUESTION_LIMITS.nested.filteringStatusAndItems)) {
    const count = countByPredicate(
      orders,
      o => o.status === status && o.items.length >= QUESTION_THRESHOLDS.nested.itemCountThreshold,
    )
    questions.push(
      new QuestionBuilder()
        .id(getId())
        .prompt(`How many orders have status "${status}" and at least ${QUESTION_THRESHOLDS.nested.itemCountThreshold} items?`)
        .groundTruth(String(count))
        .type('filtering')
        .dataset('nested')
        .build(),
    )
  }

  // Filtering: total AND items count (multi-condition)
  for (const threshold of QUESTION_THRESHOLDS.nested.totalThresholdsForItems) {
    const count = countByPredicate(
      orders,
      o => o.total > threshold && o.items.length >= QUESTION_THRESHOLDS.nested.itemCountThreshold,
    )
    questions.push(
      new QuestionBuilder()
        .id(getId())
        .prompt(`How many orders have a total greater than ${threshold} and at least ${QUESTION_THRESHOLDS.nested.itemCountThreshold} items?`)
        .groundTruth(String(count))
        .type('filtering')
        .dataset('nested')
        .build(),
    )
  }

  return questions
}
benchmarks/src/questions/tabular.ts (191 lines, Normal file)
@@ -0,0 +1,191 @@
import type { Employee } from '../datasets'
import type { Question } from '../types'
import { QUESTION_LIMITS, QUESTION_THRESHOLDS } from '../constants'
import { countByPredicate, QuestionBuilder, rotateQuestions, SAMPLE_STRIDES } from './utils'

/**
 * Generate tabular (employee) questions
 */
export function generateTabularQuestions(employees: Employee[], getId: () => string): Question[] {
  const questions: Question[] = []

  if (employees.length === 0)
    return questions

  // Field retrieval: specific employees
  const fieldGenerators: Array<(emp: Employee, getId: () => string) => Question> = [
    (emp, getId) => new QuestionBuilder()
      .id(getId())
      .prompt(`What is the salary of ${emp.name}?`)
      .groundTruth(String(emp.salary))
      .type('field-retrieval')
      .dataset('tabular')
      .build(),
    (emp, getId) => new QuestionBuilder()
      .id(getId())
      .prompt(`What department does ${emp.name} work in?`)
      .groundTruth(emp.department)
      .type('field-retrieval')
      .dataset('tabular')
      .build(),
    (emp, getId) => new QuestionBuilder()
      .id(getId())
      .prompt(`What is the email address of ${emp.name}?`)
      .groundTruth(emp.email)
      .type('field-retrieval')
      .dataset('tabular')
      .build(),
    (emp, getId) => new QuestionBuilder()
      .id(getId())
      .prompt(`How many years of experience does ${emp.name} have?`)
      .groundTruth(String(emp.yearsExperience))
      .type('field-retrieval')
      .dataset('tabular')
      .build(),
    (emp, getId) => new QuestionBuilder()
      .id(getId())
      .prompt(`Is ${emp.name} an active employee?`)
      .groundTruth(emp.active ? 'yes' : 'no')
      .type('field-retrieval')
      .dataset('tabular')
      .build(),
  ]

  questions.push(...rotateQuestions(
    employees,
    fieldGenerators,
    QUESTION_LIMITS.tabular.fieldRetrieval,
    SAMPLE_STRIDES.EMPLOYEE_FIELD,
    getId,
  ))

  // Aggregation: count by department
  const departments = [...new Set(employees.map(e => e.department))]
  for (const dept of departments.slice(0, QUESTION_LIMITS.tabular.aggregationDepartments)) {
    const count = countByPredicate(employees, e => e.department === dept)
    questions.push(
      new QuestionBuilder()
        .id(getId())
        .prompt(`How many employees work in ${dept}?`)
        .groundTruth(String(count))
        .type('aggregation')
        .dataset('tabular')
        .build(),
    )
  }

  // Aggregation: salary ranges (single-condition filters)
  for (const threshold of QUESTION_THRESHOLDS.tabular.salaryRanges) {
    const count = countByPredicate(employees, e => e.salary > threshold)
    questions.push(
      new QuestionBuilder()
        .id(getId())
        .prompt(`How many employees have a salary greater than ${threshold}?`)
        .groundTruth(String(count))
        .type('aggregation')
        .dataset('tabular')
        .build(),
    )
  }

  // Aggregation: totals and averages
  const totalEmployees = employees.length
  const avgSalary = Math.round(employees.reduce((sum, e) => sum + e.salary, 0) / totalEmployees)
  const activeCount = countByPredicate(employees, e => e.active)
  const inactiveCount = countByPredicate(employees, e => !e.active)

  questions.push(
    new QuestionBuilder()
      .id(getId())
      .prompt('How many employees are in the dataset?')
      .groundTruth(String(totalEmployees))
      .type('aggregation')
      .dataset('tabular')
      .build(),
    new QuestionBuilder()
      .id(getId())
      .prompt('What is the average salary across all employees?')
      .groundTruth(String(avgSalary))
      .type('aggregation')
      .dataset('tabular')
      .build(),
    new QuestionBuilder()
      .id(getId())
      .prompt('How many employees are active?')
      .groundTruth(String(activeCount))
      .type('aggregation')
      .dataset('tabular')
      .build(),
    new QuestionBuilder()
      .id(getId())
      .prompt('How many employees are inactive?')
      .groundTruth(String(inactiveCount))
      .type('aggregation')
      .dataset('tabular')
      .build(),
  )

  // Filtering: count by department with salary filter (multi-condition)
  for (const dept of departments.slice(0, QUESTION_LIMITS.tabular.filteringMultiConditionDepartments)) {
    const count = countByPredicate(
      employees,
      e => e.department === dept && e.salary > QUESTION_THRESHOLDS.tabular.departmentSalaryThreshold,
    )
    questions.push(
      new QuestionBuilder()
        .id(getId())
        .prompt(`How many employees in ${dept} have a salary greater than ${QUESTION_THRESHOLDS.tabular.departmentSalaryThreshold}?`)
        .groundTruth(String(count))
        .type('filtering')
        .dataset('tabular')
        .build(),
    )
  }

  // Filtering: active employees by experience (multi-condition)
  for (const exp of QUESTION_THRESHOLDS.tabular.experienceYears.slice(0, QUESTION_LIMITS.tabular.filteringExperience)) {
    const count = countByPredicate(employees, e => e.yearsExperience > exp && e.active)
    questions.push(
      new QuestionBuilder()
        .id(getId())
        .prompt(`How many active employees have more than ${exp} years of experience?`)
        .groundTruth(String(count))
        .type('filtering')
        .dataset('tabular')
        .build(),
    )
  }

  // Filtering: department by experience (multi-condition)
  for (const dept of departments.slice(0, QUESTION_LIMITS.tabular.filteringDepartmentExp)) {
    const count = countByPredicate(
      employees,
      e => e.department === dept && e.yearsExperience > QUESTION_THRESHOLDS.tabular.departmentExperienceThreshold,
    )
    questions.push(
      new QuestionBuilder()
        .id(getId())
        .prompt(`How many employees in ${dept} have more than ${QUESTION_THRESHOLDS.tabular.departmentExperienceThreshold} years of experience?`)
        .groundTruth(String(count))
        .type('filtering')
        .dataset('tabular')
        .build(),
    )
  }

  // Filtering: department by active status (multi-condition)
  for (const dept of departments.slice(0, QUESTION_LIMITS.tabular.filteringDepartmentActive)) {
    const count = countByPredicate(employees, e => e.department === dept && e.active)
    questions.push(
      new QuestionBuilder()
        .id(getId())
        .prompt(`How many active employees work in ${dept}?`)
        .groundTruth(String(count))
        .type('filtering')
        .dataset('tabular')
        .build(),
    )
  }

  return questions
}
benchmarks/src/questions/utils.ts (95 lines, Normal file)
@@ -0,0 +1,95 @@
import type { Question } from '../types'

// Constants for sampling strides
export const SAMPLE_STRIDES = {
  EMPLOYEE_FIELD: 2,
  ORDER_FIELD: 2,
  CUSTOMER_FIELD: 2,
  ANALYTICS_FIELD: 3,
  METRIC_FIELD: 3,
  REPO_FIELD: 7,
  EVENT_LOG_FIELD: 5,
} as const

/**
 * ID Generator
 */
export function* createIdGenerator(): Generator<string, never, never> {
  let id = 1
  while (true) {
    yield `q${id++}`
  }
}

/**
 * Question Builder class for fluent question creation
 */
export class QuestionBuilder {
  private question: Partial<Question> = {}

  id(id: string): this {
    this.question.id = id
    return this
  }

  prompt(prompt: string): this {
    this.question.prompt = prompt
    return this
  }

  groundTruth(groundTruth: string): this {
    this.question.groundTruth = groundTruth
    return this
  }

  type(type: Question['type']): this {
    this.question.type = type
    return this
  }

  dataset(dataset: Question['dataset']): this {
    this.question.dataset = dataset
    return this
  }

  build(): Question {
    if (!this.question.id || !this.question.prompt || !this.question.groundTruth || !this.question.type || !this.question.dataset) {
      throw new Error('Incomplete question')
    }
    return this.question as Question
  }
}

/**
 * Helper: Count items matching a predicate
 */
export function countByPredicate<T>(items: T[], predicate: (item: T) => boolean): number {
  return items.filter(predicate).length
}

/**
 * Helper: Rotate through question generators
 */
export function rotateQuestions<T>(
  items: T[],
  generators: Array<(item: T, getId: () => string) => Question>,
  limit: number,
  stride: number,
  getId: () => string,
): Question[] {
  const questions: Question[] = []

  for (let i = 0; i < Math.min(limit, items.length); i++) {
    const item = items[i * stride] || items[i]
    if (!item)
      continue

    const generatorIndex = i % generators.length
    const generator = generators[generatorIndex]
    if (generator) {
      questions.push(generator(item, getId))
    }
  }

  return questions
}
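The helpers above compose like this. A minimal, self-contained sketch: the `Question` shape is reduced to three fields and the names/data are illustrative, not the benchmark's real datasets; the `rotateQuestions` body mirrors the helper introduced in this commit.

```typescript
// Illustrative Question shape; the benchmark's real type lives in src/types.
interface Question {
  id: string
  prompt: string
  groundTruth: string
}

// Mirrors the rotateQuestions helper: stride over items, cycle generators,
// cap output at `limit`, falling back to items[i] when the strided index overruns.
function rotateQuestions<T>(
  items: T[],
  generators: Array<(item: T, getId: () => string) => Question>,
  limit: number,
  stride: number,
  getId: () => string,
): Question[] {
  const questions: Question[] = []
  for (let i = 0; i < Math.min(limit, items.length); i++) {
    const item = items[i * stride] || items[i]
    if (!item)
      continue
    const generator = generators[i % generators.length]
    if (generator)
      questions.push(generator(item, getId))
  }
  return questions
}

function* createIdGenerator(): Generator<string, never, never> {
  let id = 1
  while (true) yield `q${id++}`
}

const gen = createIdGenerator()
const getId = () => gen.next().value

const names = ['Ada', 'Grace', 'Edsger', 'Barbara']
const questions = rotateQuestions(
  names,
  [
    (name, getId) => ({ id: getId(), prompt: `Salary of ${name}?`, groundTruth: 'n/a' }),
    (name, getId) => ({ id: getId(), prompt: `Department of ${name}?`, groundTruth: 'n/a' }),
  ],
  3, // limit
  2, // stride: sample every other item
  getId,
)

console.log(questions.map(q => q.prompt))
// → [ 'Salary of Ada?', 'Department of Edsger?', 'Salary of Edsger?' ]
```

Note the fallback in action: with stride 2 over 4 items, the third iteration would index `items[4]`, so it falls back to `items[2]` while the generator rotation keeps cycling.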
@@ -1,7 +1,8 @@
-import type { EfficiencyRanking, EvaluationResult, FormatResult, Question } from './types'
+import type { Dataset, EfficiencyRanking, EvaluationResult, FormatResult, Question } from './types'
 import { FORMATTER_DISPLAY_NAMES } from './constants'
-import { datasets } from './datasets'
+import { ACCURACY_DATASETS } from './datasets'
 import { models } from './evaluate'
+import { supportsCSV } from './formatters'
 import { generateQuestions } from './questions'
 import { createProgressBar, tokenize } from './utils'

@@ -16,7 +17,11 @@ export function calculateTokenCounts(
   const tokenCounts: Record<string, number> = {}

   for (const [formatName, formatter] of Object.entries(formatters)) {
-    for (const dataset of datasets) {
+    for (const dataset of ACCURACY_DATASETS) {
+      // Skip CSV for datasets that don't support it
+      if (formatName === 'csv' && !supportsCSV(dataset))
+        continue
+
       const formatted = formatter(dataset.data)
       const key = `${formatName}-${dataset.name}`
       tokenCounts[key] = tokenize(formatted)
@@ -42,9 +47,9 @@
   const accuracy = correctCount / totalCount

   // Calculate average tokens across all datasets for this format
-  const avgTokens = Object.entries(tokenCounts)
+  const formatTokenEntries = Object.entries(tokenCounts)
     .filter(([key]) => key.startsWith(`${formatName}-`))
-    .reduce((sum, [, tokens]) => sum + tokens, 0) / datasets.length
+  const avgTokens = formatTokenEntries.reduce((sum, [, tokens]) => sum + tokens, 0) / formatTokenEntries.length

   const averageLatency = formatResults.reduce((sum, r) => sum + r.latencyMs, 0) / totalCount

@@ -75,6 +80,8 @@ export function generateAccuracyReport(
   return `
 Benchmarks test LLM comprehension across different input formats using ${totalQuestions} data retrieval questions on ${modelNames.length} ${modelNames.length === 1 ? 'model' : 'models'}.

+${generateDatasetCatalog(ACCURACY_DATASETS)}
+
 #### Efficiency Ranking (Accuracy per 1K Tokens)

 ${generateEfficiencyRankingReport(formatResults)}
@@ -85,6 +92,38 @@ ${generateDetailedAccuracyReport(formatResults, results, questions, tokenCounts)
 `.trimStart()
 }

+/**
+ * Generate dataset catalog section
+ */
+function generateDatasetCatalog(datasets: Dataset[]): string {
+  const rows = datasets.map((dataset) => {
+    const csvSupport = supportsCSV(dataset) ? '✓' : '✗'
+    const rowCount = Object.values(dataset.data)[0]?.length ?? 1
+    const structure = dataset.metadata.structureClass
+    const eligibility = `${dataset.metadata.tabularEligibility}%`
+
+    return `| ${dataset.description} | ${rowCount} | ${structure} | ${csvSupport} | ${eligibility} |`
+  }).join('\n')
+
+  return `
+#### Dataset Catalog
+
+| Dataset | Rows | Structure | CSV Support | Eligibility |
+| ------- | ---- | --------- | ----------- | ----------- |
+${rows}
+
+**Structure classes:**
+- **uniform**: All objects have identical fields with primitive values
+- **semi-uniform**: Mix of uniform and non-uniform structures
+- **nested**: Objects with nested structures (nested objects or arrays)
+- **deep**: Highly nested with minimal tabular eligibility
+
+**CSV Support:** ✓ (supported), ✗ (not supported - would require lossy flattening)
+
+**Eligibility:** Percentage of arrays that qualify for TOON's tabular format (uniform objects with primitive values)
+`.trim()
+}
+
 /**
  * Generate efficiency ranking report
  */
@@ -168,10 +207,12 @@ function generateDetailedAccuracyReport(
   const filteringPercent = ((filteringCount / totalQuestions) * 100).toFixed(0)

   // Calculate dataset sizes
-  const tabularSize = datasets.find(d => d.name === 'tabular')?.data.employees?.length || 0
-  const nestedSize = datasets.find(d => d.name === 'nested')?.data.orders?.length || 0
-  const analyticsSize = datasets.find(d => d.name === 'analytics')?.data.metrics?.length || 0
-  const githubSize = datasets.find(d => d.name === 'github')?.data.repositories?.length || 0
+  const tabularSize = ACCURACY_DATASETS.find(d => d.name === 'tabular')?.data.employees?.length || 0
+  const nestedSize = ACCURACY_DATASETS.find(d => d.name === 'nested')?.data.orders?.length || 0
+  const analyticsSize = ACCURACY_DATASETS.find(d => d.name === 'analytics')?.data.metrics?.length || 0
+  const githubSize = ACCURACY_DATASETS.find(d => d.name === 'github')?.data.repositories?.length || 0
+  const eventLogsSize = ACCURACY_DATASETS.find(d => d.name === 'event-logs')?.data.logs?.length || 0
+  const nestedConfigSize = 1 // Single config object

   // Calculate number of formats and evaluations
   const formatCount = formatResults.length
@@ -208,12 +249,14 @@ This benchmark tests **LLM comprehension and data retrieval accuracy** across di

 #### Datasets Tested

-Four datasets designed to test different structural patterns (all contain arrays of uniform objects, TOON's optimal format):
+Six datasets designed to test different structural patterns:

 1. **Tabular** (${tabularSize} employee records): Uniform objects with identical fields – optimal for TOON's tabular format.
 2. **Nested** (${nestedSize} e-commerce orders): Complex structures with nested customer objects and item arrays.
 3. **Analytics** (${analyticsSize} days of metrics): Time-series data with dates and numeric values.
 4. **GitHub** (${githubSize} repositories): Real-world data from top GitHub repos by stars.
+5. **Event Logs** (${eventLogsSize} logs): Semi-uniform data with ~50% flat logs and ~50% with nested error objects.
+6. **Nested Config** (${nestedConfigSize} configuration): Deeply nested configuration with minimal tabular eligibility.

 #### Question Types

@@ -314,7 +357,7 @@ function generateDatasetBreakdown(
   questions: Question[],
   tokenCounts: Record<string, number>,
 ): string {
-  return datasets.map((dataset) => {
+  return ACCURACY_DATASETS.map((dataset) => {
     const datasetResults = formatResults.map((fr) => {
       const datasetFormatResults = results.filter(r => r.questionId.includes(dataset.name) || questions.find(q => q.id === r.questionId)?.dataset === dataset.name)
       if (datasetFormatResults.length === 0)
@@ -1,7 +1,14 @@
+export interface DatasetMetadata {
+  supportsCSV: boolean
+  structureClass: 'uniform' | 'semi-uniform' | 'nested' | 'deep'
+  tabularEligibility: number
+}
+
 export interface Dataset {
   name: string
   description: string
   data: Record<string, any>
+  metadata: DatasetMetadata
 }

 export interface Question {
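Concretely, a dataset entry carrying the new metadata could look like the sketch below. The interfaces match the diff; the `event-logs` payload itself is made up for illustration, not the benchmark's actual data.

```typescript
interface DatasetMetadata {
  supportsCSV: boolean
  structureClass: 'uniform' | 'semi-uniform' | 'nested' | 'deep'
  tabularEligibility: number
}

interface Dataset {
  name: string
  description: string
  data: Record<string, any>
  metadata: DatasetMetadata
}

// Hypothetical event-logs entry: half the rows are flat, half carry a
// nested `error` object, so CSV support would require lossy flattening.
const eventLogs: Dataset = {
  name: 'event-logs',
  description: 'Event Logs',
  data: {
    logs: [
      { level: 'info', message: 'started' },
      { level: 'error', message: 'failed', error: { code: 500 } },
    ],
  },
  metadata: {
    supportsCSV: false,
    structureClass: 'semi-uniform',
    tabularEligibility: 50, // ~50% of rows are flat, uniform objects
  },
}

console.log(`${eventLogs.name}: ${eventLogs.metadata.tabularEligibility}% tabular-eligible`)
```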