chore: initial commit

2026-01-29 15:24:10 +08:00 · 2025-10-22 20:16:02 +02:00
commit f105551c3e
24 changed files with 6983 additions and 0 deletions
--- a/README.md
+++ b/README.md
@@ -0,0 +1,602 @@
+# Token-Oriented Object Notation (TOON)
+
+AI is becoming cheaper and more accessible, and larger context windows invite larger data inputs. **LLM tokens still cost money**, and this is where TOON comes in.
+
+**Token-Oriented Object Notation** is a compact, human-readable format designed for passing structured data to Large Language Models. It reduces token usage compared to JSON by:
+
+- Removing redundant punctuation (braces/brackets, most quotes)
+- Using indentation for structure
+- Tabularizing arrays of objects
+- Writing primitive arrays inline, without spaces after commas
+
+## Token Benchmarks
+
+<!-- automd:file src="./docs/benchmarks.md" -->
+
+| Example | JSON | TOON | Saved | Reduction |
+|---------|------|------|-------|-----------|
+| 👤 Simple user object | 31 | 18 | 13 | **41.9%** |
+| 🏷️ User with tags | 48 | 28 | 20 | **41.7%** |
+| 📦 Small product catalog | 117 | 49 | 68 | **58.1%** |
+| 👥 API response with users | 123 | 53 | 70 | **56.9%** |
+| ⚙️ Nested configuration | 67 | 41 | 26 | **38.8%** |
+| 🛒 E-commerce order | 163 | 94 | 69 | **42.3%** |
+| 📊 Analytics data | 209 | 94 | 115 | **55.0%** |
+| 📈 Large dataset (50 records) | 2159 | 762 | 1397 | **64.7%** |
+| **Total** | **2917** | **1139** | **1778** | **61.0%** |
+
+<details>
+<summary><strong>View detailed examples</strong></summary>
+
+### 📦 Small product catalog
+
+**Savings: 68 tokens (58.1% reduction)**
+
+**JSON** (117 tokens):
+
+```json
+{
+  "items": [
+    {
+      "sku": "A1",
+      "name": "Widget",
+      "qty": 2,
+      "price": 9.99
+    },
+    {
+      "sku": "B2",
+      "name": "Gadget",
+      "qty": 1,
+      "price": 14.5
+    },
+    {
+      "sku": "C3",
+      "name": "Doohickey",
+      "qty": 5,
+      "price": 7.25
+    }
+  ]
+}
+```
+
+**TOON** (49 tokens):
+
+```
+items[3]{sku,name,qty,price}:
+  A1,Widget,2,9.99
+  B2,Gadget,1,14.5
+  C3,Doohickey,5,7.25
+```
+
+---
+
+### 👥 API response with users
+
+**Savings: 70 tokens (56.9% reduction)**
+
+**JSON** (123 tokens):
+
+```json
+{
+  "users": [
+    {
+      "id": 1,
+      "name": "Alice",
+      "email": "alice@example.com",
+      "active": true
+    },
+    {
+      "id": 2,
+      "name": "Bob",
+      "email": "bob@example.com",
+      "active": true
+    },
+    {
+      "id": 3,
+      "name": "Charlie",
+      "email": "charlie@example.com",
+      "active": false
+    }
+  ],
+  "total": 3,
+  "page": 1
+}
+```
+
+**TOON** (53 tokens):
+
+```
+users[3]{id,name,email,active}:
+  1,Alice,alice@example.com,true
+  2,Bob,bob@example.com,true
+  3,Charlie,charlie@example.com,false
+total: 3
+page: 1
+```
+
+---
+
+### 📊 Analytics data
+
+**Savings: 115 tokens (55.0% reduction)**
+
+**JSON** (209 tokens):
+
+```json
+{
+  "metrics": [
+    {
+      "date": "2025-01-01",
+      "views": 1234,
+      "clicks": 89,
+      "conversions": 12
+    },
+    {
+      "date": "2025-01-02",
+      "views": 2345,
+      "clicks": 156,
+      "conversions": 23
+    },
+    {
+      "date": "2025-01-03",
+      "views": 1890,
+      "clicks": 123,
+      "conversions": 18
+    },
+    {
+      "date": "2025-01-04",
+      "views": 3456,
+      "clicks": 234,
+      "conversions": 34
+    },
+    {
+      "date": "2025-01-05",
+      "views": 2789,
+      "clicks": 178,
+      "conversions": 27
+    }
+  ]
+}
+```
+
+**TOON** (94 tokens):
+
+```
+metrics[5]{date,views,clicks,conversions}:
+  2025-01-01,1234,89,12
+  2025-01-02,2345,156,23
+  2025-01-03,1890,123,18
+  2025-01-04,3456,234,34
+  2025-01-05,2789,178,27
+```
+
+</details>
+
+<!-- /automd -->
+
+> [!NOTE]
+> Measured with [`gpt-tokenizer`](https://github.com/niieani/gpt-tokenizer) using `o200k_base` encoding (used by GPT-5 and other modern models). Savings will vary across models and tokenizers.
+
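The Saved and Reduction columns in the benchmark table follow directly from the two token counts. A minimal sketch of that arithmetic, using a hypothetical `reduction` helper (an illustration, not part of the package) and numbers taken from the table:

```typescript
// Hypothetical helper: derive the "Saved" and "Reduction" columns
// from a JSON token count and a TOON token count.
function reduction(jsonTokens: number, toonTokens: number): { saved: number; percent: string } {
  const saved = jsonTokens - toonTokens
  // One decimal place, matching the formatting used in the table.
  const percent = ((saved / jsonTokens) * 100).toFixed(1) + '%'
  return { saved, percent }
}

const catalog = reduction(117, 49) // "Small product catalog" row
const total = reduction(2917, 1139) // "Total" row
console.log(catalog.saved, catalog.percent) // 68 58.1%
console.log(total.saved, total.percent) // 1778 61.0%
```

Token counts themselves depend on the tokenizer, so only the arithmetic is fixed here.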
+## Why TOON?
+
+Standard JSON is verbose and token-expensive in LLM contexts:
+
+```json
+{
+  "users": [
+    { "id": 1, "name": "Alice", "role": "admin" },
+    { "id": 2, "name": "Bob", "role": "user" }
+  ]
+}
+```
+
+TOON conveys the same information with **fewer tokens**:
+
+```
+users[2]{id,name,role}:
+  1,Alice,admin
+  2,Bob,user
+```
+
+## Key Features
+
+- 📉 **Token-efficient:** typically 30–60% fewer tokens vs JSON on GPT-style tokenizers
+- 📊 **Tabular arrays:** write object keys once, list rows beneath
+- ✂️ **Minimal quoting:** only when required (e.g., commas, colons, ambiguous primitives)
+- 📐 **Indentation-based structure:** no braces/brackets for objects
+- 🎯 **Inline primitive arrays:** written without spaces after commas
+- 🎲 **Deterministic:** stable key order, no trailing spaces/newline
+
+## Installation
+
+```bash
+# npm
+npm install toon
+
+# pnpm
+pnpm add toon
+
+# yarn
+yarn add toon
+```
+
+## Quick Start
+
+```ts
+import { encode } from 'toon'
+
+const data = {
+  user: {
+    id: 123,
+    name: 'Ada',
+    tags: ['admin', 'ops'],
+    active: true
+  }
+}
+
+console.log(encode(data))
+```
+
+Output:
+
+```
+user:
+  id: 123
+  name: Ada
+  tags[2]: admin,ops
+  active: true
+```
+
+## Canonical Formatting Rules
+
+TOON formatting is deterministic and minimal:
+
+- **Indentation**: 2 spaces per nesting level.
+- **Lines**:
+  - `key: value` for primitives (single space after colon).
+  - `key:` for nested/empty objects (no trailing space on that line).
+- **Arrays**:
+  - Primitive arrays inline: `key[N]: v1,v2` (no spaces after commas).
+  - List items: two spaces, hyphen, space (`"  - …"`).
+- **Whitespace invariants**:
+  - No trailing spaces at end of any line.
+  - No trailing newline at end of output.
+
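The whitespace rules above are easy to check mechanically. The following is a rough sketch of such a check; `checkWhitespace` is a hypothetical name, and it only tests necessary conditions (even indentation, no trailing whitespace, no trailing newline), not full TOON validity:

```typescript
// Hypothetical checker for the whitespace rules listed above.
// Returns human-readable problems; an empty array means the text passes these checks.
function checkWhitespace(toon: string): string[] {
  const problems: string[] = []
  if (/\n$/.test(toon)) problems.push('trailing newline at end of output')
  toon.split('\n').forEach((line, i) => {
    if (/[ \t]+$/.test(line)) problems.push(`line ${i + 1}: trailing whitespace`)
    // With 2 spaces per nesting level, leading spaces must come in pairs.
    const match = /^ */.exec(line)
    const indent = match ? match[0].length : 0
    if (indent % 2 !== 0) problems.push(`line ${i + 1}: indentation is not a multiple of 2`)
  })
  return problems
}

console.log(checkWhitespace('user:\n  id: 123\n  name: Ada')) // no problems reported
console.log(checkWhitespace('user:\n  id: 123 ')) // reports trailing whitespace on line 2
```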
+## Format Overview
+
+### Objects
+
+Simple objects with primitive values:
+
+```ts
+encode({
+  id: 123,
+  name: 'Ada',
+  active: true
+})
+```
+
+```
+id: 123
+name: Ada
+active: true
+```
+
+Nested objects:
+
+```ts
+encode({
+  user: {
+    id: 123,
+    name: 'Ada'
+  }
+})
+```
+
+```
+user:
+  id: 123
+  name: Ada
+```
+
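To make the indentation rule concrete, here is a simplified sketch of how nested objects could be flattened into lines. `encodeObjectSketch` is a hypothetical function written for illustration; it is not the library's implementation, and it ignores arrays and the quoting rules:

```typescript
type Primitive = string | number | boolean | null
interface Nested { [key: string]: Primitive | Nested }

// Simplified sketch: handles only primitives and nested objects.
function encodeObjectSketch(obj: Nested, indent = ''): string {
  const lines: string[] = []
  for (const key of Object.keys(obj)) {
    const value = obj[key]
    if (value !== null && typeof value === 'object') {
      // Nested or empty object: a bare "key:" line, children indented by 2 more spaces.
      lines.push(`${indent}${key}:`)
      const child = encodeObjectSketch(value, indent + '  ')
      if (child !== '') lines.push(child)
    }
    else {
      // Primitive: "key: value" with a single space after the colon.
      lines.push(`${indent}${key}: ${String(value)}`)
    }
  }
  return lines.join('\n')
}

console.log(encodeObjectSketch({ user: { id: 123, name: 'Ada' } }))
```

Joining lines with `\n` rather than appending it to each line is what keeps the output free of a trailing newline.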
+### Arrays
+
+#### Primitive Arrays (Inline)
+
+```ts
+encode({
+  tags: ['admin', 'ops', 'dev']
+})
+```
+
+```
+tags[3]: admin,ops,dev
+```
+
+#### Arrays of Objects (Tabular)
+
+When every object in an array has the same keys and only primitive values, TOON uses an efficient **tabular format**:
+
+```ts
+encode({
+  items: [
+    { sku: 'A1', qty: 2, price: 9.99 },
+    { sku: 'B2', qty: 1, price: 14.5 }
+  ]
+})
+```
+
+```
+items[2]{sku,qty,price}:
+  A1,2,9.99
+  B2,1,14.5
+```
+
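The tabular layout can be sketched in a few lines. `encodeTableSketch` below is a hypothetical illustration, not the package's code; it assumes the rows already qualify for tabular form and that no value needs quoting:

```typescript
type Cell = string | number | boolean | null
type Row = { [key: string]: Cell }

// Simplified sketch: header line with length and field names, then one row per object.
function encodeTableSketch(key: string, rows: Row[]): string {
  const first = rows[0]
  if (first === undefined) return `${key}[0]:`
  const fields = Object.keys(first) // header order is taken from the first object
  const header = `${key}[${rows.length}]{${fields.join(',')}}:`
  const body = rows.map(row => '  ' + fields.map(f => String(row[f])).join(','))
  return [header].concat(body).join('\n')
}

console.log(encodeTableSketch('items', [
  { sku: 'A1', qty: 2, price: 9.99 },
  { sku: 'B2', qty: 1, price: 14.5 },
]))
```

Because the keys appear once in the header, each additional row costs only its values and separators.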
+#### Mixed and Non-Uniform Arrays
+
+Arrays that don't meet the tabular requirements use list format:
+
+```
+items[3]:
+  - 1
+  - a: 1
+  - text
+```
+
+When objects appear in list format, the first field is placed on the hyphen line:
+
+```
+items[2]:
+  - id: 1
+    name: First
+  - id: 2
+    name: Second
+    extra: true
+```
+
+#### Arrays of Arrays
+
+Arrays whose elements are primitive arrays use list format, with each inner array written inline:
+
+```ts
+encode({
+  pairs: [
+    [1, 2],
+    [3, 4]
+  ]
+})
+```
+
+```
+pairs[2]:
+  - [2]: 1,2
+  - [2]: 3,4
+```
+
+#### Empty Arrays and Objects
+
+Empty containers have special representations:
+
+```ts
+encode({ items: [] }) // items[0]:
+encode([]) // [0]:
+encode({}) // (empty output)
+encode({ config: {} }) // config:
+```
+
+### Quoting Rules
+
+TOON quotes strings **only when necessary** to maximize token efficiency. Inner spaces are allowed; leading or trailing spaces force quotes. Unicode and emoji are safe unquoted.
+
+#### Keys
+
+Keys are quoted when any of the following is true:
+
+| Condition | Examples |
+|---|---|
+| Contains spaces, commas, colons, quotes, control chars | `"full name"`, `"a,b"`, `"order:id"`, `"tab\there"` |
+| Contains brackets or braces | `"[index]"`, `"{key}"` |
+| Leading hyphen | `"-lead"` |
+| Numeric-only key | `"123"` |
+| Empty key | `""` |
+
+**Notes:**
+
+- Quotes and control characters in keys are escaped (e.g., `"he said \"hi\""`, `"line\nbreak"`).
+
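The key conditions in the table can be expressed as a small predicate. This is a sketch under assumptions: `keyNeedsQuotes` and `formatKey` are hypothetical names, and escaping backslashes inside quoted keys is an assumption the table does not spell out:

```typescript
// Hypothetical predicate mirroring the key-quoting table above.
function keyNeedsQuotes(key: string): boolean {
  if (key === '') return true // empty key
  if (/^-/.test(key)) return true // leading hyphen
  if (/^\d+$/.test(key)) return true // numeric-only key
  if (/[\s,:"\[\]{}]/.test(key)) return true // whitespace, comma, colon, quote, brackets, braces
  if (/[\u0000-\u001f]/.test(key)) return true // other control characters
  return false
}

// Hypothetical formatter: quote and escape only when the predicate says so.
function formatKey(key: string): string {
  if (!keyNeedsQuotes(key)) return key
  const escaped = key
    .replace(/\\/g, '\\\\') // assumption: backslashes are escaped as well
    .replace(/"/g, '\\"')
    .replace(/\n/g, '\\n')
    .replace(/\t/g, '\\t')
  return `"${escaped}"`
}

console.log(formatKey('id')) // id
console.log(formatKey('full name')) // "full name"
```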
+#### String Values
+
+String values are quoted when any of the following is true:
+
+| Condition | Examples |
+|---|---|
+| Empty string | `""` |
+| Contains comma, colon, quote, backslash, or control chars | `"a,b"`, `"a:b"`, `"say \"hi\""`, `"C:\\Users"`, `"line1\nline2"` |
+| Leading or trailing spaces | `" padded "`, `"  "` |
+| Looks like boolean/number/null | `"true"`, `"false"`, `"null"`, `"42"`, `"-3.14"`, `"1e-6"`, `"05"` |
+| Starts with `"- "` (list-like) | `"- item"` |
+| Looks like structural token | `"[5]"`, `"{key}"`, `"[3]: x,y"` |
+
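The string-value conditions can be sketched the same way. `valueNeedsQuotes` is a hypothetical predicate written against the table above. The number pattern and the "starts with `[` or `{`" test for structural tokens are assumptions about how those rows might be implemented:

```typescript
// Hypothetical predicate mirroring the string-value quoting table above.
function valueNeedsQuotes(value: string): boolean {
  if (value === '') return true // empty string
  if (value !== value.trim()) return true // leading or trailing spaces
  if (/[,:"\\]/.test(value)) return true // comma, colon, quote, backslash
  if (/[\u0000-\u001f]/.test(value)) return true // control characters
  if (/^(true|false|null)$/.test(value)) return true // looks like boolean/null
  if (/^-?\d+(\.\d+)?(e[+-]?\d+)?$/i.test(value)) return true // looks like a number, e.g. "05"
  if (/^- /.test(value)) return true // list-like
  if (/^[\[{]/.test(value)) return true // assumption: structural tokens start with [ or {
  return false
}

console.log(valueNeedsQuotes('hello world')) // false: inner spaces are allowed
console.log(valueNeedsQuotes('true')) // true: would otherwise read as a boolean
```

Quoting the string `"true"` is what lets a reader tell it apart from the boolean `true` in the same array.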
+#### Examples
+
+```
+note: "hello, world"
+items[3]: x,"true","- item"
+hello 👋 world         // unquoted
+" padded "             // quoted
+value: null            // null value
+name: ""               // empty string (quoted)
+text: "line1\nline2"   // multi-line string (escaped)
+```
+
+### Tabular Format Requirements
+
+For arrays of objects to use the efficient tabular format, all of the following must be true:
+
+| Requirement | Detail |
+|---|---|
+| All elements are objects | No primitives in the array |
+| Identical key sets | No missing or extra keys across rows |
+| Primitive values only | No nested arrays or objects |
+| Header key order | Taken from the first object |
+| Header key quoting | Same rules as object keys |
+| Row value quoting | Same rules as string values |
+
+If any condition fails, TOON falls back to list format.
+
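The first three requirements amount to a simple eligibility check. A minimal sketch, assuming hypothetical helper names (`isPrimitive`, `isPlainObject`, `isTabular`) and treating an empty array as not tabular, which is a choice made here for illustration:

```typescript
function isPrimitive(v: unknown): boolean {
  return v === null || typeof v === 'string' || typeof v === 'number' || typeof v === 'boolean'
}

function isPlainObject(v: unknown): v is Record<string, unknown> {
  return typeof v === 'object' && v !== null && !Array.isArray(v)
}

// Hypothetical check: can this array use the tabular format?
function isTabular(items: unknown[]): boolean {
  const first = items[0]
  if (!isPlainObject(first)) return false // also covers the empty array
  const header = Object.keys(first) // header key order comes from the first object
  return items.every((item) => {
    if (!isPlainObject(item)) return false // all elements are objects
    if (Object.keys(item).length !== header.length) return false // no extra keys
    return header.every(k =>
      Object.prototype.hasOwnProperty.call(item, k) // no missing keys
      && isPrimitive(item[k])) // primitive values only
  })
}

console.log(isTabular([{ sku: 'A1', qty: 2 }, { sku: 'B2', qty: 1 }])) // true
console.log(isTabular([{ id: 1 }, { id: 2, extra: true }])) // false
```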
+## Type Conversions
+
+Some non-JSON types are automatically normalized for LLM-safe output:
+
+| Input | Output |
+|---|---|
+| Number (finite) | Decimal form, no scientific notation; `-0` → `0` |
+| Number (`NaN`, `±Infinity`) | `null` |
+| `BigInt` | Decimal digits (no quotes) |
+| `Date` | ISO string in quotes (e.g., `"2025-01-01T00:00:00.000Z"`) |
+| `undefined` | `null` |
+| `function` | `null` |
+| `symbol` | `null` |
+
+Number normalization examples:
+
+```
+-0    → 0
+1e6   → 1000000
+1e-6  → 0.000001
+```
+
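JavaScript's default number-to-string conversion switches to exponent notation for very large and very small magnitudes, so producing plain decimal form takes a little work. The sketch below shows one way to do it; `normalizeNumberSketch` is a hypothetical helper and not necessarily how the package does it:

```typescript
// Hypothetical helper: render a number as described in the table above.
function normalizeNumberSketch(n: number): string {
  if (!isFinite(n)) return 'null' // NaN, Infinity, -Infinity
  const s = String(n) // String(-0) is already "0"
  const eIndex = s.indexOf('e')
  if (eIndex === -1) return s // already plain decimal form

  // Expand exponent notation such as "1e-7" or "1e+21" by moving the decimal point.
  const negative = s.charAt(0) === '-'
  const mantissa = s.slice(negative ? 1 : 0, eIndex)
  const exponent = Number(s.slice(eIndex + 1))
  const dot = mantissa.indexOf('.')
  const intPart = dot === -1 ? mantissa : mantissa.slice(0, dot)
  const fracPart = dot === -1 ? '' : mantissa.slice(dot + 1)
  const digits = intPart + fracPart
  const pointIndex = intPart.length + exponent
  const zeros = (count: number): string => new Array(count + 1).join('0')

  let out: string
  if (pointIndex <= 0) out = '0.' + zeros(-pointIndex) + digits
  else if (pointIndex >= digits.length) out = digits + zeros(pointIndex - digits.length)
  else out = digits.slice(0, pointIndex) + '.' + digits.slice(pointIndex)
  return (negative ? '-' : '') + out
}

console.log(normalizeNumberSketch(1e6)) // 1000000
console.log(normalizeNumberSketch(1e-7)) // 0.0000001
```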
+## API
+
+### `encode(value: unknown): string`
+
+Converts any JSON-serializable value to TOON format.
+
+**Parameters:**
+
+- `value` – Any JSON-serializable value (object, array, primitive, or nested structure). Non-JSON-serializable values (functions, symbols, undefined, non-finite numbers) are converted to `null`. Dates are converted to ISO strings, and BigInts are emitted as decimal integers (no quotes).
+
+**Returns:**
+
+A TOON-formatted string with no trailing newline or spaces.
+
+**Example:**
+
+```ts
+import { encode } from 'toon'
+
+const items = [
+  { sku: 'A1', qty: 2, price: 9.99 },
+  { sku: 'B2', qty: 1, price: 14.5 }
+]
+
+console.log(encode({ items }))
+```
+
+**Output:**
+
+```
+items[2]{sku,qty,price}:
+  A1,2,9.99
+  B2,1,14.5
+```
+
+## Using TOON in LLM Prompts
+
+When incorporating TOON into your LLM workflows:
+
+- Wrap TOON data in a fenced code block in your prompt.
+- Tell the model: "Do not add extra punctuation or spaces; follow the exact TOON format."
+- When asking the model to generate TOON, specify the same rules (2-space indentation, no trailing spaces, quoting rules).
+
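As an illustration of the first two tips, a prompt could be assembled like this. `buildPromptSketch` and the wording of the surrounding prompt are hypothetical; the input is assumed to be a string that has already been encoded as TOON:

```typescript
// Three backticks, spelled with escapes so this snippet stays fence-safe.
const FENCE = '\u0060\u0060\u0060'

// Hypothetical helper: wrap TOON data in a fenced block and add the format reminder.
function buildPromptSketch(toon: string, question: string): string {
  return [
    'Data in TOON format:',
    FENCE,
    toon,
    FENCE,
    'Do not add extra punctuation or spaces; follow the exact TOON format.',
    question,
  ].join('\n')
}

const promptText = buildPromptSketch(
  'users[2]{id,name,role}:\n  1,Alice,admin\n  2,Bob,user',
  'Which user is an admin?',
)
console.log(promptText)
```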
+## Token Savings Example
+
+Here's a realistic API response to illustrate the token savings:
+
+**JSON:**
+
+```json
+{
+  "users": [
+    { "id": 1, "name": "Alice", "email": "alice@example.com", "active": true },
+    { "id": 2, "name": "Bob", "email": "bob@example.com", "active": true },
+    { "id": 3, "name": "Charlie", "email": "charlie@example.com", "active": false }
+  ]
+}
+```
+
+**TOON:**
+
+```
+users[3]{id,name,email,active}:
+  1,Alice,alice@example.com,true
+  2,Bob,bob@example.com,true
+  3,Charlie,charlie@example.com,false
+```
+
+Typical savings vs JSON are in the **30–60% range** on GPT-style tokenizers, driven by:
+
+- Tabular arrays of objects (keys written once)
+- No structural braces/brackets
+- Minimal quoting
+- No spaces after commas
+
+## Notes and Limitations
+
+- **Token counts vary by tokenizer and model.** The benchmarks use a GPT-style tokenizer (`o200k_base`); actual savings will differ with other tokenizers (e.g., SentencePiece-based models).
+- **TOON is designed for LLM contexts** where human readability and token efficiency matter. It's **not** a drop-in replacement for JSON in APIs or storage.
+- **Tabular arrays** require all objects to have exactly the same keys with primitive values only. Arrays with mixed types (primitives + objects/arrays), non-uniform objects, or nested structures will use a more verbose list format.
+- **Object key order** is preserved from the input. In tabular arrays, header order follows the first object's keys.
+- **Arrays that mix element types** (primitives, objects, or arrays) always use list form:
+  ```
+  items[2]:
+    - a: 1
+    - [2]: 1,2
+  ```
+- **Deterministic formatting:** 2-space indentation, stable key order, no trailing spaces/newline.
+
+## Quick Reference
+
+```
+// Object
+{ id: 1, name: 'Ada' }          → id: 1
+                                   name: Ada
+
+// Nested object
+{ user: { id: 1 } }             → user:
+                                     id: 1
+
+// Primitive array (inline)
+{ tags: ['a', 'b'] }            → tags[2]: a,b
+
+// Tabular array (uniform objects)
+{ items: [                      → items[2]{id,qty}:
+  { id: 1, qty: 5 },                1,5
+  { id: 2, qty: 3 }                 2,3
+]}
+
+// Mixed / non-uniform (list)
+{ items: [1, { a: 1 }, 'x'] }   → items[3]:
+                                     - 1
+                                     - a: 1
+                                     - x
+
+// Array of arrays
+{ pairs: [[1, 2], [3, 4]] }     → pairs[2]:
+                                     - [2]: 1,2
+                                     - [2]: 3,4
+
+// Root array
+['x', 'y']                      → [2]: x,y
+
+// Empty containers
+{}                              → (empty output)
+{ items: [] }                   → items[0]:
+
+// Special quoting
+{ note: 'hello, world' }        → note: "hello, world"
+{ items: ['true', true] }       → items[2]: "true",true
+```
+
+## License
+
+[MIT](./LICENSE) License © 2025-PRESENT [Johann Schopplich](https://github.com/johannschopplich)