From 352e936370fafd8a3d9957954743ae10a93c74f3 Mon Sep 17 00:00:00 2001 From: Johann Schopplich Date: Tue, 28 Oct 2025 07:44:35 +0100 Subject: [PATCH] docs: update notes & limitations guide --- README.md | 45 +++++-------------- benchmarks/results/accuracy/report.md | 17 ++----- benchmarks/results/accuracy/summary.json | 2 +- .../scripts/token-efficiency-benchmark.ts | 2 +- 4 files changed, 18 insertions(+), 48 deletions(-) diff --git a/README.md b/README.md index 1e891a3..baf7ffd 100644 --- a/README.md +++ b/README.md @@ -2,7 +2,7 @@ # Token-Oriented Object Notation (TOON) -**Token-Oriented Object Notation** is a compact, human-readable format designed for passing structured data to Large Language Models with significantly reduced token usage. +**Token-Oriented Object Notation** is a compact, human-readable format designed for passing structured data to Large Language Models with significantly reduced token usage. It's intended for LLM input, not output. TOON excels at **uniform complex objects** – multiple fields per row, same structure across items. It borrows YAML's indentation-based structure for nested objects and CSV's tabular format for uniform data rows, then optimizes both for token efficiency in LLM contexts. @@ -34,16 +34,6 @@ users[2]{id,name,role}: -## Format Comparison - -Format familiarity matters as much as token count. - -- **CSV:** best for uniform tables. -- **JSON:** best for non-uniform data. -- **TOON:** best for uniform complex (but not deeply nested) objects. - -TOON switches to list format for non-uniform arrays. In those cases, JSON can be cheaper at scale. - ## Key Features - 💸 **Token-efficient:** typically 30–60% fewer tokens than JSON @@ -363,17 +353,12 @@ Four datasets designed to test different structural patterns: #### Evaluation Process 1. **Format conversion**: Each dataset is converted to all 5 formats (TOON, JSON, YAML, CSV, XML). -2. **Query LLM**: Each model receives formatted data + question in a prompt. -3. **LLM responds**: Model extracts the answer from the data. -4. **Validate with LLM-as-judge**: GPT-5-nano validates if the answer is semantically correct. +2. **Query LLM**: Each model receives formatted data + question in a prompt and extracts the answer. +3. **Validate with LLM-as-judge**: GPT-5-nano validates if the answer is semantically correct. #### Semantic Validation -Answers are validated by an LLM judge (`gpt-5-nano`) using semantic equivalence, not exact string matching: - -- **Numeric formats**: `50000` = `$50,000` = `50000 dollars` ✓ -- **Case insensitive**: `Engineering` = `engineering` = `ENGINEERING` ✓ -- **Minor formatting**: `2025-01-01` = `January 1, 2025` ✓ +Answers are validated by an LLM judge (`gpt-5-nano`) using semantic equivalence, not exact string matching (e.g., `50000` = `$50,000`, `Engineering` = `engineering`, `2025-01-01` = `January 1, 2025`). #### Models & Configuration @@ -810,6 +795,14 @@ console.log(encode(data, { lengthMarker: '#', delimiter: '|' })) // B2|1|14.5 ``` +## Notes and Limitations + +- Format familiarity matters as much as token count. TOON's tabular format requires arrays of objects with identical keys and primitive values only – when this doesn't hold (due to mixed types, non-uniform objects, or nested structures), TOON switches to list format where JSON can be cheaper at scale. + - **TOON** is best for uniform complex (but not deeply nested) objects, especially large arrays of such objects. + - **JSON** is best for non-uniform data and deeply nested structures. 
+- **Token counts vary by tokenizer and model.** Benchmarks use a GPT-style tokenizer (cl100k/o200k); actual savings will differ with other models (e.g., [SentencePiece](https://github.com/google/sentencepiece)).
+- **TOON is designed for LLM input** where human readability and token efficiency matter. It's **not** a drop-in replacement for JSON in APIs or storage.
+
 ## Using TOON in LLM Prompts
 
 TOON works best when you show the format instead of describing it. The structure is self-documenting – models parse it naturally once they see the pattern.
@@ -843,20 +836,6 @@ Task: Return only users with role "user" as TOON. Use the same header. Set [N] t
 
 > [!TIP]
 > For large uniform tables, use `encode(data, { delimiter: '\t' })` and tell the model "fields are tab-separated." Tabs often tokenize better than commas and reduce the need for quote-escaping.
 
-## Notes and Limitations
-
-- **Token counts vary by tokenizer and model.** Benchmarks use a GPT-style tokenizer (cl100k/o200k); actual savings will differ with other models (e.g., SentencePiece).
-- **TOON is designed for LLM contexts** where human readability and token efficiency matter. It's **not** a drop-in replacement for JSON in APIs or storage.
-- **Tabular arrays** require all objects to have exactly the same keys with primitive values only. Arrays with mixed types (primitives + objects/arrays), non-uniform objects, or nested structures will use a more verbose list format.
-- **Object key order** is preserved from the input. In tabular arrays, header order follows the first object's keys.
-- **Arrays mixing primitives and objects/arrays** always use list form:
-  ```
-  items[2]:
-  - a: 1
-  - [2]: 1,2
-  ```
-- **Deterministic formatting:** 2-space indentation, stable key order, no trailing spaces/newline.
-
 ## Quick Reference
 
 ```
diff --git a/benchmarks/results/accuracy/report.md b/benchmarks/results/accuracy/report.md
index 0aea84f..ff366db 100644
--- a/benchmarks/results/accuracy/report.md
+++ b/benchmarks/results/accuracy/report.md
@@ -111,7 +111,7 @@ gemini-2.5-flash
 
 #### What's Being Measured
 
-This benchmark tests **LLM comprehension and data retrieval accuracy** when data is presented in different formats. Each LLM receives formatted data and must answer questions about it (this does NOT test LLM's ability to generate TOON output).
+This benchmark tests **LLM comprehension and data retrieval accuracy** across different input formats. Each LLM receives formatted data and must answer questions about it (this does **not** test the model's ability to generate TOON output).
 
 #### Datasets Tested
 
@@ -140,18 +140,9 @@ Four datasets designed to test different structural patterns:
 
 #### Evaluation Process
 
-1. **Format conversion**: Each dataset is converted to all 5 formats (TOON, JSON, YAML, CSV, XML).
-2. **Query LLM**: Each model receives formatted data + question in a prompt.
-3. **LLM responds**: Model extracts the answer from the data.
-4. **Validate with LLM-as-judge**: GPT-5-nano validates if the answer is semantically correct.
-
-#### Semantic Validation
-
-Answers are validated by an LLM judge (`gpt-5-nano`) using semantic equivalence, not exact string matching:
-
-- **Numeric formats**: `50000` = `$50,000` = `50000 dollars` ✓
-- **Case insensitive**: `Engineering` = `engineering` = `ENGINEERING` ✓
-- **Minor formatting**: `2025-01-01` = `January 1, 2025` ✓
+1. **Format conversion**: Each dataset is converted to all 5 formats (TOON, JSON, YAML, CSV, XML).
+2. **Query LLM**: Each model receives formatted data + question in a prompt and extracts the answer.
+3. **Validate with LLM-as-judge**: `gpt-5-nano` validates if the answer is semantically correct (e.g., `50000` = `$50,000`, `Engineering` = `engineering`, `2025-01-01` = `January 1, 2025`).
 
 #### Models & Configuration
 
diff --git a/benchmarks/results/accuracy/summary.json b/benchmarks/results/accuracy/summary.json
index 69d1ae1..f5dd2c2 100644
--- a/benchmarks/results/accuracy/summary.json
+++ b/benchmarks/results/accuracy/summary.json
@@ -87,5 +87,5 @@
     "yaml-analytics": 2938,
     "yaml-github": 13129
   },
-  "timestamp": "2025-10-27T19:35:05.310Z"
+  "timestamp": "2025-10-28T06:43:10.560Z"
 }
diff --git a/benchmarks/scripts/token-efficiency-benchmark.ts b/benchmarks/scripts/token-efficiency-benchmark.ts
index 88ddf8d..34c8c32 100644
--- a/benchmarks/scripts/token-efficiency-benchmark.ts
+++ b/benchmarks/scripts/token-efficiency-benchmark.ts
@@ -204,7 +204,7 @@
 
 ${detailedExamples}
 `.trimStart()
 
-console.log(markdown)
+console.log(`${barChartSection}\n`)
 
 await ensureDir(path.join(BENCHMARKS_DIR, 'results'))
 await fsp.writeFile(outputFilePath, markdown, 'utf-8')
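The LLM-as-judge validation step described in the evaluation process above can be sketched as follows. This is a minimal illustration assuming the `openai` Node SDK and a hypothetical `isSemanticallyCorrect` helper; the actual benchmark harness may be wired differently.

```ts
// Minimal sketch of LLM-as-judge semantic validation (hypothetical helper,
// not the benchmark's actual harness).
import OpenAI from 'openai'

const openai = new OpenAI() // reads OPENAI_API_KEY from the environment

async function isSemanticallyCorrect(question: string, expected: string, answer: string): Promise<boolean> {
  const response = await openai.chat.completions.create({
    model: 'gpt-5-nano',
    messages: [
      {
        role: 'system',
        content:
          'You are a strict grader. Reply with exactly "yes" or "no": is the answer semantically equivalent to the expected answer? Ignore casing, punctuation, and number formatting (e.g. "50000" matches "$50,000").',
      },
      { role: 'user', content: `Question: ${question}\nExpected: ${expected}\nAnswer: ${answer}` },
    ],
  })

  // Treat anything other than a clear "yes" as a failed check.
  return response.choices[0]?.message.content?.trim().toLowerCase().startsWith('yes') ?? false
}

// These should all pass despite formatting differences.
console.log(await isSemanticallyCorrect('What is the salary?', '50000', '$50,000'))
console.log(await isSemanticallyCorrect('Which department?', 'Engineering', 'engineering'))
```

A single yes/no judgment per answer keeps validation cheap while tolerating formatting differences such as `$50,000` vs `50000` or `Engineering` vs `engineering`.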