# Token-Oriented Object Notation (TOON) AI is becoming cheaper and more accessible, but larger context windows allow for larger data inputs as well. **LLM tokens still cost money** – this is where TOON comes in. **Token-Oriented Object Notation** is a compact, human-readable format designed for passing structured data to Large Language Models. It reduces token usage compared to JSON by: - Removing redundant punctuation (braces/brackets, most quotes) - Using indentation for structure - Tabularizing arrays of objects - Writing inline primitive arrays without spaces ## Token Benchmarks | Example | JSON | TOON | Saved | Reduction | |---------|------|------|-------|-----------| | πŸ‘€ Simple user object | 31 | 18 | 13 | **41.9%** | | 🏷️ User with tags | 48 | 28 | 20 | **41.7%** | | πŸ“¦ Small product catalog | 117 | 49 | 68 | **58.1%** | | πŸ‘₯ API response with users | 123 | 53 | 70 | **56.9%** | | βš™οΈ Nested configuration | 67 | 41 | 26 | **38.8%** | | πŸ›’ E-commerce order | 163 | 94 | 69 | **42.3%** | | πŸ“Š Analytics data | 209 | 94 | 115 | **55.0%** | | πŸ“ˆ Large dataset (50 records) | 2159 | 762 | 1397 | **64.7%** | | **Total** | **2917** | **1139** | **1778** | **61.0%** |
View detailed examples ### πŸ“¦ Small product catalog **Savings: 68 tokens (58.1% reduction)** **JSON** (117 tokens): ```json { "items": [ { "sku": "A1", "name": "Widget", "qty": 2, "price": 9.99 }, { "sku": "B2", "name": "Gadget", "qty": 1, "price": 14.5 }, { "sku": "C3", "name": "Doohickey", "qty": 5, "price": 7.25 } ] } ``` **TOON** (49 tokens): ``` items[3]{sku,name,qty,price}: A1,Widget,2,9.99 B2,Gadget,1,14.5 C3,Doohickey,5,7.25 ``` --- ### πŸ‘₯ API response with users **Savings: 70 tokens (56.9% reduction)** **JSON** (123 tokens): ```json { "users": [ { "id": 1, "name": "Alice", "email": "alice@example.com", "active": true }, { "id": 2, "name": "Bob", "email": "bob@example.com", "active": true }, { "id": 3, "name": "Charlie", "email": "charlie@example.com", "active": false } ], "total": 3, "page": 1 } ``` **TOON** (53 tokens): ``` users[3]{id,name,email,active}: 1,Alice,alice@example.com,true 2,Bob,bob@example.com,true 3,Charlie,charlie@example.com,false total: 3 page: 1 ``` --- ### πŸ“Š Analytics data **Savings: 115 tokens (55.0% reduction)** **JSON** (209 tokens): ```json { "metrics": [ { "date": "2025-01-01", "views": 1234, "clicks": 89, "conversions": 12 }, { "date": "2025-01-02", "views": 2345, "clicks": 156, "conversions": 23 }, { "date": "2025-01-03", "views": 1890, "clicks": 123, "conversions": 18 }, { "date": "2025-01-04", "views": 3456, "clicks": 234, "conversions": 34 }, { "date": "2025-01-05", "views": 2789, "clicks": 178, "conversions": 27 } ] } ``` **TOON** (94 tokens): ``` metrics[5]{date,views,clicks,conversions}: 2025-01-01,1234,89,12 2025-01-02,2345,156,23 2025-01-03,1890,123,18 2025-01-04,3456,234,34 2025-01-05,2789,178,27 ```
> [!NOTE] > Measured with [`gpt-tokenizer`](https://github.com/niieani/gpt-tokenizer) using `o200k_base` encoding (used by GPT-5 and other modern models). Savings will vary across models and tokenizers. ## Why TOON? Standard JSON is verbose and token-expensive in LLM contexts: ```json { "users": [ { "id": 1, "name": "Alice", "role": "admin" }, { "id": 2, "name": "Bob", "role": "user" } ] } ``` TOON conveys the same information with **fewer tokens**: ``` users[2]{id,name,role}: 1,Alice,admin 2,Bob,user ``` ## Key Features - πŸ“‰ **Token-efficient:** typically 30–60% fewer tokens vs JSON on GPT-style tokenizers - πŸ“Š **Tabular arrays:** write object keys once, list rows beneath - βœ‚οΈ **Minimal quoting:** only when required (e.g., commas, colons, ambiguous primitives) - πŸ“ **Indentation-based structure:** no braces/brackets for objects - 🎯 **Inline primitive arrays:** written without spaces after commas - 🎲 **Deterministic:** stable key order, no trailing spaces/newline ## Installation ```bash # npm npm install @byjohann/toon # pnpm pnpm add @byjohann/toon # yarn yarn add @byjohann/toon ``` ## Quick Start ```ts import { encode } from '@byjohann/toon' const data = { user: { id: 123, name: 'Ada', tags: ['admin', 'ops'], active: true } } console.log(encode(data)) ``` Output: ``` user: id: 123 name: Ada tags[2]: admin,ops active: true ``` ## Canonical Formatting Rules TOON formatting is deterministic and minimal: - **Indentation**: 2 spaces per nesting level. - **Lines**: - `key: value` for primitives (single space after colon). - `key:` for nested/empty objects (no trailing space on that line). - **Arrays**: - Primitive arrays inline: `key[N]: v1,v2` (no spaces after commas). - List items: two spaces, hyphen, space (`" - …"`). - **Whitespace invariants**: - No trailing spaces at end of any line. - No trailing newline at end of output. ## Format Overview ### Objects Simple objects with primitive values: ```ts encode({ id: 123, name: 'Ada', active: true }) ``` ``` id: 123 name: Ada active: true ``` Nested objects: ```ts encode({ user: { id: 123, name: 'Ada' } }) ``` ``` user: id: 123 name: Ada ``` ### Arrays #### Primitive Arrays (Inline) ```ts encode({ tags: ['admin', 'ops', 'dev'] }) ``` ``` tags[3]: admin,ops,dev ``` #### Arrays of Objects (Tabular) When all objects share the same primitive fields, TOON uses an efficient **tabular format**: ```ts encode({ items: [ { sku: 'A1', qty: 2, price: 9.99 }, { sku: 'B2', qty: 1, price: 14.5 } ] }) ``` ``` items[2]{sku,qty,price}: A1,2,9.99 B2,1,14.5 ``` #### Mixed and Non-Uniform Arrays Arrays that don't meet the tabular requirements use list format: ``` items[3]: - 1 - a: 1 - text ``` When objects appear in list format, the first field is placed on the hyphen line: ``` items[2]: - id: 1 name: First - id: 2 name: Second extra: true ``` #### Arrays of Arrays When you have arrays containing primitive inner arrays: ```ts encode({ pairs: [ [1, 2], [3, 4] ] }) ``` ``` pairs[2]: - [2]: 1,2 - [2]: 3,4 ``` #### Empty Arrays and Objects Empty containers have special representations: ```ts encode({ items: [] }) // items[0]: encode([]) // [0]: encode({}) // (empty output) encode({ config: {} }) // config: ``` ### Quoting Rules TOON quotes strings **only when necessary** to maximize token efficiency. Inner spaces are allowed; leading or trailing spaces force quotes. Unicode and emoji are safe unquoted. > [!NOTE] > When using alternative delimiters (tab or pipe), the quoting rules adapt automatically. Strings containing the active delimiter will be quoted, while other delimiters remain safe. #### Keys Keys are quoted when any of the following is true: | Condition | Examples | |---|---| | Contains spaces, commas, colons, quotes, control chars | `"full name"`, `"a,b"`, `"order:id"`, `"tab\there"` | | Contains brackets or braces | `"[index]"`, `"{key}"` | | Leading hyphen | `"-lead"` | | Numeric-only key | `"123"` | | Empty key | `""` | **Notes:** - Quotes and control characters in keys are escaped (e.g., `"he said \"hi\""`, `"line\nbreak"`). #### String Values String values are quoted when any of the following is true: | Condition | Examples | |---|---| | Empty string | `""` | | Contains active delimiter, colon, quote, backslash, or control chars | `"a,b"` (comma), `"a\tb"` (tab), `"a\|b"` (pipe), `"a:b"`, `"say \"hi\""`, `"C:\\Users"`, `"line1\\nline2"` | | Leading or trailing spaces | `" padded "`, `" "` | | Looks like boolean/number/null | `"true"`, `"false"`, `"null"`, `"42"`, `"-3.14"`, `"1e-6"`, `"05"` | | Starts with `"- "` (list-like) | `"- item"` | | Looks like structural token | `"[5]"`, `"{key}"`, `"[3]: x,y"` | > [!NOTE] > **Delimiter-aware quoting:** The quoting rules are context-sensitive. When using tab or pipe delimiters, commas don't need quoting. Only the active delimiter triggers quoting – this applies to both array values and object values. #### Examples ``` note: "hello, world" items[3]: x,"true","- item" hello πŸ‘‹ world // unquoted " padded " // quoted value: null // null value name: "" // empty string (quoted) text: "line1\nline2" // multi-line string (escaped) ``` ### Tabular Format Requirements For arrays of objects to use the efficient tabular format, all of the following must be true: | Requirement | Detail | |---|---| | All elements are objects | No primitives in the array | | Identical key sets | No missing or extra keys across rows | | Primitive values only | No nested arrays or objects | | Header key order | Taken from the first object | | Header key quoting | Same rules as object keys | | Row value quoting | Same rules as string values | If any condition fails, TOON falls back to list format. ## Type Conversions Some non-JSON types are automatically normalized for LLM-safe output: | Input | Output | |---|---| | Number (finite) | Decimal form, no scientific notation; `-0` β†’ `0` | | Number (`NaN`, `Β±Infinity`) | `null` | | `BigInt` | Decimal digits (no quotes) | | `Date` | ISO string in quotes (e.g., `"2025-01-01T00:00:00.000Z"`) | | `undefined` | `null` | | `function` | `null` | | `symbol` | `null` | Number normalization examples: ``` -0 β†’ 0 1e6 β†’ 1000000 1e-6 β†’ 0.000001 ``` ## API ### `encode(value: unknown, options?: EncodeOptions): string` Converts any JSON-serializable value to TOON format. **Parameters:** - `value` – Any JSON-serializable value (object, array, primitive, or nested structure). Non-JSON-serializable values (functions, symbols, undefined, non-finite numbers) are converted to `null`. Dates are converted to ISO strings, and BigInts are emitted as decimal integers (no quotes). - `options` – Optional encoding options: - `indent?: number` – Number of spaces per indentation level (default: `2`) - `delimiter?: ',' | '\t' | '|'` – Delimiter for array values and tabular rows (default: `','`) **Returns:** A TOON-formatted string with no trailing newline or spaces. **Example:** ```ts import { encode } from '@byjohann/toon' const items = [ { sku: 'A1', qty: 2, price: 9.99 }, { sku: 'B2', qty: 1, price: 14.5 } ] console.log(encode({ items })) ``` **Output:** ``` items[2]{sku,qty,price}: A1,2,9.99 B2,1,14.5 ``` #### Delimiter Options The `delimiter` option allows you to choose between comma (default), tab, or pipe delimiters for array values and tabular rows. Alternative delimiters can provide additional token savings in specific contexts. ##### Tab Delimiter (`\t`) Using tab delimiters instead of commas can reduce token count further, especially for tabular data: ```ts import { encode } from '@byjohann/toon' const data = { items: [ { sku: 'A1', name: 'Widget', qty: 2, price: 9.99 }, { sku: 'B2', name: 'Gadget', qty: 1, price: 14.5 } ] } console.log(encode(data, { delimiter: '\t' })) ``` **Output:** ``` items[2]{sku,name,qty,price}: A1 Widget 2 9.99 B2 Gadget 1 14.5 ``` **Benefits:** - Tabs are single characters and often tokenize more efficiently than commas - Tabs rarely appear in natural text, reducing the need for quote-escaping **Considerations:** - Some terminals and editors may collapse or expand tabs visually - String values containing tabs will still require quoting ##### Pipe Delimiter (`|`) Pipe delimiters offer a middle ground between commas and tabs: ```ts console.log(encode(data, { delimiter: '|' })) ``` **Output:** ``` items[2]{sku,name,qty,price}: A1|Widget|2|9.99 B2|Gadget|1|14.5 ``` ## Using TOON in LLM Prompts When incorporating TOON into your LLM workflows: - Wrap TOON data in a fenced code block in your prompt. - Tell the model: "Do not add extra punctuation or spaces; follow the exact TOON format." - When asking the model to generate TOON, specify the same rules (2-space indentation, no trailing spaces, quoting rules). ## Token Savings Example Here's a realistic API response to illustrate the token savings: **JSON:** ```json { "users": [ { "id": 1, "name": "Alice", "email": "alice@example.com", "active": true }, { "id": 2, "name": "Bob", "email": "bob@example.com", "active": true }, { "id": 3, "name": "Charlie", "email": "charlie@example.com", "active": false } ] } ``` **TOON:** ``` users[3]{id,name,email,active}: 1,Alice,alice@example.com,true 2,Bob,bob@example.com,true 3,Charlie,charlie@example.com,false ``` Typical savings vs JSON are in the **30–60% range** on GPT-style tokenizers, driven by: - Tabular arrays of objects (keys written once) - No structural braces/brackets - Minimal quoting - No spaces after commas ## Notes and Limitations - **Token counts vary by tokenizer and model.** Benchmarks use a GPT-style tokenizer (cl100k/o200k); actual savings will differ with other models (e.g., SentencePiece). - **TOON is designed for LLM contexts** where human readability and token efficiency matter. It's **not** a drop-in replacement for JSON in APIs or storage. - **Tabular arrays** require all objects to have exactly the same keys with primitive values only. Arrays with mixed types (primitives + objects/arrays), non-uniform objects, or nested structures will use a more verbose list format. - **Object key order** is preserved from the input. In tabular arrays, header order follows the first object's keys. - **Arrays mixing primitives and objects/arrays** always use list form: ``` items[2]: - a: 1 - [2]: 1,2 ``` - **Deterministic formatting:** 2-space indentation, stable key order, no trailing spaces/newline. ## Quick Reference ``` // Object { id: 1, name: 'Ada' } β†’ id: 1 name: Ada // Nested object { user: { id: 1 } } β†’ user: id: 1 // Primitive array (inline) { tags: ['a', 'b'] } β†’ tags[2]: a,b // Tabular array (uniform objects) { items: [ β†’ items[2]{id,qty}: { id: 1, qty: 5 }, 1,5 { id: 2, qty: 3 } 2,3 ]} // Mixed / non-uniform (list) { items: [1, { a: 1 }, 'x'] } β†’ items[3]: - 1 - a: 1 - x // Array of arrays { pairs: [[1, 2], [3, 4]] } β†’ pairs[2]: - [2]: 1,2 - [2]: 3,4 // Root array ['x', 'y'] β†’ [2]: x,y // Empty containers {} β†’ (empty output) { items: [] } β†’ items[0]: // Special quoting { note: 'hello, world' } β†’ note: "hello, world" { items: ['true', true] } β†’ items[2]: "true",true ``` ## License [MIT](./LICENSE) License Β© 2025-PRESENT [Johann Schopplich](https://github.com/johannschopplich)