diff --git a/README.md b/README.md
index 73163c7..4d765ad 100644
--- a/README.md
+++ b/README.md
@@ -250,7 +250,7 @@ gemini-2.5-flash
 **Advantage:** TOON achieves **86.6% accuracy** (vs JSON's 83.2%) while using **46.3% fewer tokens**.
 
 <details>
-<summary>View detailed breakdown by dataset and model</summary>
+<summary>Performance by dataset and model</summary>
 
 #### Performance by Dataset
 
@@ -326,12 +326,61 @@ gemini-2.5-flash
 | `json` | 81.8% | 130/159 |
 | `yaml` | 78.6% | 125/159 |
 
-#### Methodology
+</details>
 
-- **Semantic validation**: LLM-as-judge validates responses semantically (not exact string matching).
-- **Token counting**: Using `gpt-tokenizer` with `o200k_base` encoding.
-- **Question types**: ~160 questions across field retrieval, aggregation, and filtering tasks.
-- **Datasets**: Faker.js-generated datasets (seeded) + GitHub repositories.
+<details>
+<summary>How the benchmark works</summary>
+
+#### What's Being Measured
+
+This benchmark tests **LLM comprehension and data retrieval accuracy** when data is presented in different formats. Each LLM receives formatted data and must answer questions about it (it does **not** test a model's ability to generate TOON output).
+
+#### Datasets Tested
+
+Four datasets, each designed to test a different structural pattern:
+
+1. **Tabular** (100 employee records): Uniform objects with identical fields (optimal for TOON's tabular format).
+2. **Nested** (50 e-commerce orders): Complex structures with nested customer objects and item arrays.
+3. **Analytics** (60 days of metrics): Time-series data with dates and numeric values.
+4. **GitHub** (100 repositories): Real-world data from the top GitHub repos by stars.
+
+#### Question Types
+
+~160 questions are generated dynamically across three categories:
+
+- **Field retrieval (50%)**: Direct value lookups
+  - Example: "What is Alice's salary?" → `75000`
+  - Example: "What is the customer name for order ORD-0042?" → `John Doe`
+
+- **Aggregation (25%)**: Counting and summation tasks
+  - Example: "How many employees work in Engineering?" → `17`
+  - Example: "What is the total revenue across all orders?" → `45123.50`
+
+- **Filtering (25%)**: Conditional queries
+  - Example: "How many employees in Sales have salary > 80000?" → `5`
+  - Example: "How many orders have total > 400?" → `12`
+
+#### Evaluation Process
+
+1. **Format conversion**: Each dataset is converted to all 5 formats (TOON, JSON, YAML, CSV, XML).
+2. **Query LLM**: Each model receives the formatted data and a question in a single prompt.
+3. **LLM responds**: The model extracts the answer from the data.
+4. **Validate with LLM-as-judge**: GPT-5-nano checks whether the answer is semantically correct.
+
+#### Semantic Validation
+
+Answers are validated by an LLM judge (`gpt-5-nano`) using semantic equivalence rather than exact string matching:
+
+- **Numeric formats**: `50000` = `$50,000` = `50000 dollars` ✓
+- **Case insensitive**: `Engineering` = `engineering` = `ENGINEERING` ✓
+- **Minor formatting**: `2025-01-01` = `January 1, 2025` ✓
+
+#### Models & Configuration
+
+- **Models tested**: `gpt-5-nano`, `claude-haiku-4-5`, `gemini-2.5-flash`
+- **Token counting**: Using `gpt-tokenizer` with the `o200k_base` encoding (GPT-5 tokenizer)
+- **Temperature**: 0 (for non-reasoning models)
+- **Total evaluations**: 159 questions × 5 formats × 3 models = 2,385 LLM calls
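+
+To make the validation step concrete, here is a minimal, illustrative sketch of how an LLM-as-judge call could look using the OpenAI SDK. It is not the benchmark's actual code; the prompt wording and the `isSemanticallyCorrect` helper are hypothetical.
+
+```ts
+import OpenAI from 'openai'
+
+const openai = new OpenAI()
+
+// Ask a small judge model whether the answer matches the expected value
+// semantically (e.g. `50000` vs `$50,000`). Returns true on a "YES".
+async function isSemanticallyCorrect(question: string, expected: string, answer: string): Promise<boolean> {
+  const res = await openai.chat.completions.create({
+    model: 'gpt-5-nano',
+    messages: [{
+      role: 'user',
+      content:
+        `Question: ${question}\nExpected answer: ${expected}\nGiven answer: ${answer}\n` +
+        'Do the expected and given answers refer to the same value, ignoring formatting, casing, and units? Reply YES or NO.',
+    }],
+  })
+  return (res.choices[0]?.message.content ?? '').trim().toUpperCase().startsWith('YES')
+}
+```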
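+
+Token counts can be reproduced along these lines (a sketch, assuming `gpt-tokenizer`'s encoding-specific entry point; the benchmark's own counting code may differ):
+
+```ts
+import { encode } from 'gpt-tokenizer/encoding/o200k_base'
+
+// Token count for a serialized payload = length of its o200k_base encoding.
+const countTokens = (text: string): number => encode(text).length
+
+const json = JSON.stringify([{ id: 1, name: 'Alice', role: 'admin' }])
+console.log(countTokens(json)) // e.g. compare the same data serialized as TOON vs. JSON
+```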
diff --git a/benchmarks/results/accuracy/report.md b/benchmarks/results/accuracy/report.md
index 9cb96ae..0aea84f 100644
--- a/benchmarks/results/accuracy/report.md
+++ b/benchmarks/results/accuracy/report.md
@@ -28,7 +28,7 @@ gemini-2.5-flash
 **Advantage:** TOON achieves **86.6% accuracy** (vs JSON's 83.2%) while using **46.3% fewer tokens**.
 
 <details>
-<summary>View detailed breakdown by dataset and model</summary>
+<summary>Performance by dataset and model</summary>
 
 #### Performance by Dataset
 
@@ -104,11 +104,60 @@ gemini-2.5-flash
 | `json` | 81.8% | 130/159 |
 | `yaml` | 78.6% | 125/159 |
 
-#### Methodology
+</details>
 
-- **Semantic validation**: LLM-as-judge validates responses semantically (not exact string matching).
-- **Token counting**: Using `gpt-tokenizer` with `o200k_base` encoding.
-- **Question types**: ~160 questions across field retrieval, aggregation, and filtering tasks.
-- **Datasets**: Faker.js-generated datasets (seeded) + GitHub repositories.
+<details>
+<summary>How the benchmark works</summary>
+
+#### What's Being Measured
+
+This benchmark tests **LLM comprehension and data retrieval accuracy** when data is presented in different formats. Each LLM receives formatted data and must answer questions about it (it does **not** test a model's ability to generate TOON output).
+
+#### Datasets Tested
+
+Four datasets, each designed to test a different structural pattern:
+
+1. **Tabular** (100 employee records): Uniform objects with identical fields (optimal for TOON's tabular format).
+2. **Nested** (50 e-commerce orders): Complex structures with nested customer objects and item arrays.
+3. **Analytics** (60 days of metrics): Time-series data with dates and numeric values.
+4. **GitHub** (100 repositories): Real-world data from the top GitHub repos by stars.
+
+#### Question Types
+
+~160 questions are generated dynamically across three categories:
+
+- **Field retrieval (50%)**: Direct value lookups
+  - Example: "What is Alice's salary?" → `75000`
+  - Example: "What is the customer name for order ORD-0042?" → `John Doe`
+
+- **Aggregation (25%)**: Counting and summation tasks
+  - Example: "How many employees work in Engineering?" → `17`
+  - Example: "What is the total revenue across all orders?" → `45123.50`
+
+- **Filtering (25%)**: Conditional queries
+  - Example: "How many employees in Sales have salary > 80000?" → `5`
+  - Example: "How many orders have total > 400?" → `12`
+
+#### Evaluation Process
+
+1. **Format conversion**: Each dataset is converted to all 5 formats (TOON, JSON, YAML, CSV, XML).
+2. **Query LLM**: Each model receives the formatted data and a question in a single prompt.
+3. **LLM responds**: The model extracts the answer from the data.
+4. **Validate with LLM-as-judge**: GPT-5-nano checks whether the answer is semantically correct.
+
+#### Semantic Validation
+
+Answers are validated by an LLM judge (`gpt-5-nano`) using semantic equivalence rather than exact string matching:
+
+- **Numeric formats**: `50000` = `$50,000` = `50000 dollars` ✓
+- **Case insensitive**: `Engineering` = `engineering` = `ENGINEERING` ✓
+- **Minor formatting**: `2025-01-01` = `January 1, 2025` ✓
+
+#### Models & Configuration
+
+- **Models tested**: `gpt-5-nano`, `claude-haiku-4-5`, `gemini-2.5-flash`
+- **Token counting**: Using `gpt-tokenizer` with the `o200k_base` encoding (GPT-5 tokenizer)
+- **Temperature**: 0 (for non-reasoning models)
+- **Total evaluations**: 159 questions × 5 formats × 3 models = 2,385 LLM calls
diff --git a/benchmarks/results/accuracy/summary.json b/benchmarks/results/accuracy/summary.json
index 688a296..69d1ae1 100644
--- a/benchmarks/results/accuracy/summary.json
+++ b/benchmarks/results/accuracy/summary.json
@@ -87,5 +87,5 @@
     "yaml-analytics": 2938,
     "yaml-github": 13129
   },
-  "timestamp": "2025-10-27T15:01:57.523Z"
+  "timestamp": "2025-10-27T19:35:05.310Z"
 }
diff --git a/benchmarks/src/report.ts b/benchmarks/src/report.ts
index 3a8fdda..e1a109a 100644
--- a/benchmarks/src/report.ts
+++ b/benchmarks/src/report.ts
@@ -189,7 +189,7 @@ ${modelBreakdown}
 
 ${summaryComparison}
 
 <details>
-<summary>View detailed breakdown by dataset and model</summary>
+<summary>Performance by dataset and model</summary>
 
 #### Performance by Dataset
@@ -197,12 +197,61 @@ ${datasetBreakdown}
 
 #### Performance by Model
 ${modelPerformance}
 
-#### Methodology
+</details>
 
-- **Semantic validation**: LLM-as-judge validates responses semantically (not exact string matching).
-- **Token counting**: Using \`gpt-tokenizer\` with \`o200k_base\` encoding.
-- **Question types**: ~160 questions across field retrieval, aggregation, and filtering tasks.
-- **Datasets**: Faker.js-generated datasets (seeded) + GitHub repositories.
+<details>
+<summary>How the benchmark works</summary>
+
+#### What's Being Measured
+
+This benchmark tests **LLM comprehension and data retrieval accuracy** when data is presented in different formats. Each LLM receives formatted data and must answer questions about it (it does **not** test a model's ability to generate TOON output).
+
+#### Datasets Tested
+
+Four datasets, each designed to test a different structural pattern:
+
+1. **Tabular** (100 employee records): Uniform objects with identical fields (optimal for TOON's tabular format).
+2. **Nested** (50 e-commerce orders): Complex structures with nested customer objects and item arrays.
+3. **Analytics** (60 days of metrics): Time-series data with dates and numeric values.
+4. **GitHub** (100 repositories): Real-world data from the top GitHub repos by stars.
+
+#### Question Types
+
+~160 questions are generated dynamically across three categories:
+
+- **Field retrieval (50%)**: Direct value lookups
+  - Example: "What is Alice's salary?" → \`75000\`
+  - Example: "What is the customer name for order ORD-0042?" → \`John Doe\`
+
+- **Aggregation (25%)**: Counting and summation tasks
+  - Example: "How many employees work in Engineering?" → \`17\`
+  - Example: "What is the total revenue across all orders?" → \`45123.50\`
+
+- **Filtering (25%)**: Conditional queries
+  - Example: "How many employees in Sales have salary > 80000?" → \`5\`
+  - Example: "How many orders have total > 400?" → \`12\`
+
+#### Evaluation Process
+
+1. **Format conversion**: Each dataset is converted to all 5 formats (TOON, JSON, YAML, CSV, XML).
+2. **Query LLM**: Each model receives the formatted data and a question in a single prompt.
+3. **LLM responds**: The model extracts the answer from the data.
+4. **Validate with LLM-as-judge**: GPT-5-nano checks whether the answer is semantically correct.
+
+#### Semantic Validation
+
+Answers are validated by an LLM judge (\`gpt-5-nano\`) using semantic equivalence rather than exact string matching:
+
+- **Numeric formats**: \`50000\` = \`$50,000\` = \`50000 dollars\` ✓
+- **Case insensitive**: \`Engineering\` = \`engineering\` = \`ENGINEERING\` ✓
+- **Minor formatting**: \`2025-01-01\` = \`January 1, 2025\` ✓
+
+#### Models & Configuration
+
+- **Models tested**: \`gpt-5-nano\`, \`claude-haiku-4-5\`, \`gemini-2.5-flash\`
+- **Token counting**: Using \`gpt-tokenizer\` with the \`o200k_base\` encoding (GPT-5 tokenizer)
+- **Temperature**: 0 (for non-reasoning models)
+- **Total evaluations**: 159 questions × 5 formats × 3 models = 2,385 LLM calls
 
 </details>
 `.trimStart()