refactor: token efficiency benchmark code
@@ -81,7 +81,8 @@ async function validateAnswer(
 }:
 { actual: string, expected: string, question: string },
 ): Promise<boolean> {
-  const prompt = `You are validating answers to questions about structured data.
+  const prompt = `
+You are validating answers to questions about structured data.
 
 Question: ${question}
 Expected answer: ${expected}
@@ -93,7 +94,8 @@ Is the actual answer correct? Consider:
 - Minor formatting differences are acceptable
 - Case-insensitive comparison for text
 
-Respond with only "YES" or "NO".`
+Respond with only "YES" or "NO".
+`.trim()
 
   try {
     const { text } = await generateText({
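For orientation, here is a minimal sketch of how the refactored `validateAnswer` plausibly reads in full, assuming the Vercel AI SDK's `generateText` with an OpenAI provider. Only the prompt edges, the signature fragments, and the `generateText` call appear in the hunks above; the imports, the `Actual answer` line, the model wiring, and the YES/NO check are assumptions:

```ts
import { generateText } from 'ai'
import { openai } from '@ai-sdk/openai'

async function validateAnswer(
  { actual, expected, question }:
  { actual: string, expected: string, question: string },
): Promise<boolean> {
  // The template starts at column 0 and is .trim()'d, matching the refactor.
  const prompt = `
You are validating answers to questions about structured data.

Question: ${question}
Expected answer: ${expected}
Actual answer: ${actual}

Is the actual answer correct? Consider:
- Minor formatting differences are acceptable
- Case-insensitive comparison for text

Respond with only "YES" or "NO".
`.trim()

  try {
    // gpt-5-nano is the judge model named in the benchmark README below.
    const { text } = await generateText({
      model: openai('gpt-5-nano'),
      prompt,
    })
    return text.trim().toUpperCase().startsWith('YES')
  } catch {
    // Treat judge errors as a failed validation rather than aborting the run.
    return false
  }
}
```

The point of the `.trim()` change is that the template can start at column 0, so each prompt line stays unindented in the source without leaking a leading newline into the prompt itself.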
@@ -204,7 +204,7 @@ ${modelPerformance}
 
 #### What's Being Measured
 
-This benchmark tests **LLM comprehension and data retrieval accuracy** when data is presented in different formats. Each LLM receives formatted data and must answer questions about it (this does NOT test LLM's ability to generate TOON output).
+This benchmark tests **LLM comprehension and data retrieval accuracy** across different input formats. Each LLM receives formatted data and must answer questions about it (this does **not** test the model's ability to generate TOON output).
 
 #### Datasets Tested
 
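To make "different input formats" concrete, here is one sketch of the same two records in JSON and in TOON's tabular-array form (syntax per the TOON README; the sample rows are invented for illustration):

```json
{"users":[{"id":1,"name":"Alice","role":"admin"},{"id":2,"name":"Bob","role":"user"}]}
```

```
users[2]{id,name,role}:
  1,Alice,admin
  2,Bob,user
```

TOON declares the field names once in a header row rather than repeating them per record, which is where its token savings on uniform arrays come from.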
@@ -233,18 +233,9 @@ Four datasets designed to test different structural patterns:
 
 #### Evaluation Process
 
-1. **Format conversion**: Each dataset is converted to all 5 formats (TOON, JSON, YAML, CSV, XML).
-2. **Query LLM**: Each model receives formatted data + question in a prompt.
-3. **LLM responds**: Model extracts the answer from the data.
-4. **Validate with LLM-as-judge**: GPT-5-nano validates if the answer is semantically correct.
-
-#### Semantic Validation
-
-Answers are validated by an LLM judge (\`gpt-5-nano\`) using semantic equivalence, not exact string matching:
-
-- **Numeric formats**: \`50000\` = \`$50,000\` = \`50000 dollars\` ✓
-- **Case insensitive**: \`Engineering\` = \`engineering\` = \`ENGINEERING\` ✓
-- **Minor formatting**: \`2025-01-01\` = \`January 1, 2025\` ✓
+1. **Format conversion**: Each dataset is converted to all 5 formats (TOON, JSON, YAML, CSV, XML).
+2. **Query LLM**: Each model receives formatted data + question in a prompt and extracts the answer.
+3. **Validate with LLM-as-judge**: \`gpt-5-nano\` validates if the answer is semantically correct (e.g., \`50000\` = \`$50,000\`, \`Engineering\` = \`engineering\`, \`2025-01-01\` = \`January 1, 2025\`).
 
 #### Models & Configuration
 
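As a sketch of how these steps might compose into the benchmark loop: `encodeAs` and `askModel` are hypothetical placeholder names (only `validateAnswer`, sketched earlier, is grounded in the diff), and the loop structure is an assumption based on the numbered steps above:

```ts
type Format = 'toon' | 'json' | 'yaml' | 'csv' | 'xml'

// Placeholder stubs: the real benchmark would wire these to the format
// encoders and to the AI SDK; the names are illustrative only.
const encodeAs = (data: unknown, _format: Format): string => JSON.stringify(data)
const askModel = async (_model: string, _prompt: string): Promise<string> => ''

async function runBenchmark(
  dataset: unknown,
  questions: { question: string, expected: string }[],
  models: string[],
): Promise<{ model: string, format: Format, correct: boolean }[]> {
  const formats: Format[] = ['toon', 'json', 'yaml', 'csv', 'xml']
  const results: { model: string, format: Format, correct: boolean }[] = []

  for (const format of formats) {
    const encoded = encodeAs(dataset, format) // 1. format conversion
    for (const model of models) {
      for (const { question, expected } of questions) {
        // 2. formatted data + question in a single prompt
        const actual = await askModel(model, `${encoded}\n\n${question}`)
        // 3. LLM-as-judge semantic validation (validateAnswer sketched above)
        const correct = await validateAnswer({ actual, expected, question })
        results.push({ model, format, correct })
      }
    }
  }
  return results
}
```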