From 352e936370fafd8a3d9957954743ae10a93c74f3 Mon Sep 17 00:00:00 2001 From: Johann Schopplich Date: Tue, 28 Oct 2025 07:44:35 +0100 Subject: [PATCH] docs: update notes & limitations guide --- README.md | 45 +++++-------------- benchmarks/results/accuracy/report.md | 17 ++----- benchmarks/results/accuracy/summary.json | 2 +- .../scripts/token-efficiency-benchmark.ts | 2 +- 4 files changed, 18 insertions(+), 48 deletions(-) diff --git a/README.md b/README.md index 1e891a3..baf7ffd 100644 --- a/README.md +++ b/README.md @@ -2,7 +2,7 @@ # Token-Oriented Object Notation (TOON) -**Token-Oriented Object Notation** is a compact, human-readable format designed for passing structured data to Large Language Models with significantly reduced token usage. +**Token-Oriented Object Notation** is a compact, human-readable format designed for passing structured data to Large Language Models with significantly reduced token usage. It's intended for LLM input, not output. TOON excels at **uniform complex objects** – multiple fields per row, same structure across items. It borrows YAML's indentation-based structure for nested objects and CSV's tabular format for uniform data rows, then optimizes both for token efficiency in LLM contexts. @@ -34,16 +34,6 @@ users[2]{id,name,role}: -## Format Comparison - -Format familiarity matters as much as token count. - -- **CSV:** best for uniform tables. -- **JSON:** best for non-uniform data. -- **TOON:** best for uniform complex (but not deeply nested) objects. - -TOON switches to list format for non-uniform arrays. In those cases, JSON can be cheaper at scale. - ## Key Features - 💸 **Token-efficient:** typically 30–60% fewer tokens than JSON @@ -363,17 +353,12 @@ Four datasets designed to test different structural patterns: #### Evaluation Process 1. **Format conversion**: Each dataset is converted to all 5 formats (TOON, JSON, YAML, CSV, XML). -2. **Query LLM**: Each model receives formatted data + question in a prompt. -3. **LLM responds**: Model extracts the answer from the data. -4. **Validate with LLM-as-judge**: GPT-5-nano validates if the answer is semantically correct. +2. **Query LLM**: Each model receives formatted data + question in a prompt and extracts the answer. +3. **Validate with LLM-as-judge**: GPT-5-nano validates if the answer is semantically correct. #### Semantic Validation -Answers are validated by an LLM judge (`gpt-5-nano`) using semantic equivalence, not exact string matching: - -- **Numeric formats**: `50000` = `$50,000` = `50000 dollars` ✓ -- **Case insensitive**: `Engineering` = `engineering` = `ENGINEERING` ✓ -- **Minor formatting**: `2025-01-01` = `January 1, 2025` ✓ +Answers are validated by an LLM judge (`gpt-5-nano`) using semantic equivalence, not exact string matching (e.g., `50000` = `$50,000`, `Engineering` = `engineering`, `2025-01-01` = `January 1, 2025`). #### Models & Configuration @@ -810,6 +795,14 @@ console.log(encode(data, { lengthMarker: '#', delimiter: '|' })) // B2|1|14.5 ``` +## Notes and Limitations + +- Format familiarity matters as much as token count. TOON's tabular format requires arrays of objects with identical keys and primitive values only – when this doesn't hold (due to mixed types, non-uniform objects, or nested structures), TOON switches to list format where JSON can be cheaper at scale. + - **TOON** is best for uniform complex (but not deeply nested) objects, especially large arrays of such objects. + - **JSON** is best for non-uniform data and deeply nested structures. 
+- **Token counts vary by tokenizer and model.** Benchmarks use a GPT-style tokenizer (cl100k/o200k); actual savings will differ with other models (e.g., [SentencePiece](https://github.com/google/sentencepiece)).
+- **TOON is designed for LLM input** where human readability and token efficiency matter. It's **not** a drop-in replacement for JSON in APIs or storage.
+
 ## Using TOON in LLM Prompts
 
 TOON works best when you show the format instead of describing it. The structure is self-documenting – models parse it naturally once they see the pattern.
@@ -843,20 +836,6 @@ Task: Return only users with role "user" as TOON. Use the same header. Set [N] t
 
 > [!TIP]
 > For large uniform tables, use `encode(data, { delimiter: '\t' })` and tell the model "fields are tab-separated." Tabs often tokenize better than commas and reduce the need for quote-escaping.
 
-## Notes and Limitations
-
-- **Token counts vary by tokenizer and model.** Benchmarks use a GPT-style tokenizer (cl100k/o200k); actual savings will differ with other models (e.g., SentencePiece).
-- **TOON is designed for LLM contexts** where human readability and token efficiency matter. It's **not** a drop-in replacement for JSON in APIs or storage.
-- **Tabular arrays** require all objects to have exactly the same keys with primitive values only. Arrays with mixed types (primitives + objects/arrays), non-uniform objects, or nested structures will use a more verbose list format.
-- **Object key order** is preserved from the input. In tabular arrays, header order follows the first object's keys.
-- **Arrays mixing primitives and objects/arrays** always use list form:
-  ```
-  items[2]:
-  - a: 1
-  - [2]: 1,2
-  ```
-- **Deterministic formatting:** 2-space indentation, stable key order, no trailing spaces/newline.
-
 ## Quick Reference
 
 ```
diff --git a/benchmarks/results/accuracy/report.md b/benchmarks/results/accuracy/report.md
index 0aea84f..ff366db 100644
--- a/benchmarks/results/accuracy/report.md
+++ b/benchmarks/results/accuracy/report.md
@@ -111,7 +111,7 @@ gemini-2.5-flash
 
 #### What's Being Measured
 
-This benchmark tests **LLM comprehension and data retrieval accuracy** when data is presented in different formats. Each LLM receives formatted data and must answer questions about it (this does NOT test LLM's ability to generate TOON output).
+This benchmark tests **LLM comprehension and data retrieval accuracy** across different input formats. Each LLM receives formatted data and must answer questions about it (this does **not** test the model's ability to generate TOON output).
 
 #### Datasets Tested
 
@@ -140,18 +140,9 @@ Four datasets designed to test different structural patterns:
 
 #### Evaluation Process
 
-1. **Format conversion**: Each dataset is converted to all 5 formats (TOON, JSON, YAML, CSV, XML).
-2. **Query LLM**: Each model receives formatted data + question in a prompt.
-3. **LLM responds**: Model extracts the answer from the data.
-4. **Validate with LLM-as-judge**: GPT-5-nano validates if the answer is semantically correct.
-
-#### Semantic Validation
-
-Answers are validated by an LLM judge (`gpt-5-nano`) using semantic equivalence, not exact string matching:
-
-- **Numeric formats**: `50000` = `$50,000` = `50000 dollars` ✓
-- **Case insensitive**: `Engineering` = `engineering` = `ENGINEERING` ✓
-- **Minor formatting**: `2025-01-01` = `January 1, 2025` ✓
+1. **Format conversion**: Each dataset is converted to all 5 formats (TOON, JSON, YAML, CSV, XML).
+2. **Query LLM**: Each model receives formatted data + question in a prompt and extracts the answer.
+3. **Validate with LLM-as-judge**: `gpt-5-nano` validates if the answer is semantically correct (e.g., `50000` = `$50,000`, `Engineering` = `engineering`, `2025-01-01` = `January 1, 2025`).
 
 #### Models & Configuration
 
diff --git a/benchmarks/results/accuracy/summary.json b/benchmarks/results/accuracy/summary.json
index 69d1ae1..f5dd2c2 100644
--- a/benchmarks/results/accuracy/summary.json
+++ b/benchmarks/results/accuracy/summary.json
@@ -87,5 +87,5 @@
     "yaml-analytics": 2938,
     "yaml-github": 13129
   },
-  "timestamp": "2025-10-27T19:35:05.310Z"
+  "timestamp": "2025-10-28T06:43:10.560Z"
 }
diff --git a/benchmarks/scripts/token-efficiency-benchmark.ts b/benchmarks/scripts/token-efficiency-benchmark.ts
index 88ddf8d..34c8c32 100644
--- a/benchmarks/scripts/token-efficiency-benchmark.ts
+++ b/benchmarks/scripts/token-efficiency-benchmark.ts
@@ -204,7 +204,7 @@
 
 ${detailedExamples}
 `.trimStart()
 
-console.log(markdown)
+console.log(`${barChartSection}\n`)
 
 await ensureDir(path.join(BENCHMARKS_DIR, 'results'))
 await fsp.writeFile(outputFilePath, markdown, 'utf-8')
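The LLM-as-judge validation step described in the evaluation process above can be sketched as follows. This is a minimal illustration assuming the `openai` Node SDK and a hypothetical `isSemanticallyCorrect` helper; the actual benchmark harness may be wired differently.

```ts
// Minimal sketch of LLM-as-judge semantic validation (hypothetical helper,
// not the benchmark's actual harness).
import OpenAI from 'openai'

const openai = new OpenAI() // reads OPENAI_API_KEY from the environment

async function isSemanticallyCorrect(question: string, expected: string, answer: string): Promise<boolean> {
  const response = await openai.chat.completions.create({
    model: 'gpt-5-nano',
    messages: [
      {
        role: 'system',
        content:
          'You are a strict grader. Reply with exactly "yes" or "no": is the answer semantically equivalent to the expected answer? Ignore casing, punctuation, and number formatting (e.g. "50000" matches "$50,000").',
      },
      { role: 'user', content: `Question: ${question}\nExpected: ${expected}\nAnswer: ${answer}` },
    ],
  })

  // Treat anything other than a clear "yes" as a failed check.
  return response.choices[0]?.message.content?.trim().toLowerCase().startsWith('yes') ?? false
}

// These should all pass despite formatting differences.
console.log(await isSemanticallyCorrect('What is the salary?', '50000', '$50,000'))
console.log(await isSemanticallyCorrect('Which department?', 'Engineering', 'engineering'))
```

A single yes/no judgment per answer keeps validation cheap while tolerating formatting differences such as `$50,000` vs `50000` or `Engineering` vs `engineering`.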