# Token-Oriented Object Notation (TOON)

AI is becoming cheaper and more accessible, and larger context windows invite larger data inputs. **LLM tokens still cost money** – this is where TOON comes in.

**Token-Oriented Object Notation** is a compact, human-readable format designed for passing structured data to Large Language Models. It reduces token usage compared to JSON by:

- Removing redundant punctuation (braces/brackets, most quotes)
- Using indentation for structure
- Tabularizing arrays of objects
- Writing inline primitive arrays without spaces

## Token Benchmarks

<!-- automd:file src="./docs/benchmarks.md" -->

| Example | JSON | TOON | Saved | Reduction |
|---------|------|------|-------|-----------|
| 👤 Simple user object | 31 | 18 | 13 | **41.9%** |
| 🏷️ User with tags | 48 | 28 | 20 | **41.7%** |
| 📦 Small product catalog | 117 | 49 | 68 | **58.1%** |
| 👥 API response with users | 123 | 53 | 70 | **56.9%** |
| ⚙️ Nested configuration | 67 | 41 | 26 | **38.8%** |
| 🛒 E-commerce order | 163 | 94 | 69 | **42.3%** |
| 📊 Analytics data | 209 | 94 | 115 | **55.0%** |
| 📈 Large dataset (50 records) | 2159 | 762 | 1397 | **64.7%** |
| **Total** | **2917** | **1139** | **1778** | **61.0%** |

<details>
<summary><strong>View detailed examples</strong></summary>

### 📦 Small product catalog

**Savings: 68 tokens (58.1% reduction)**

**JSON** (117 tokens):

```json
{
  "items": [
    {
      "sku": "A1",
      "name": "Widget",
      "qty": 2,
      "price": 9.99
    },
    {
      "sku": "B2",
      "name": "Gadget",
      "qty": 1,
      "price": 14.5
    },
    {
      "sku": "C3",
      "name": "Doohickey",
      "qty": 5,
      "price": 7.25
    }
  ]
}
```

**TOON** (49 tokens):

```
items[3]{sku,name,qty,price}:
  A1,Widget,2,9.99
  B2,Gadget,1,14.5
  C3,Doohickey,5,7.25
```

---

### 👥 API response with users

**Savings: 70 tokens (56.9% reduction)**

**JSON** (123 tokens):

```json
{
  "users": [
    {
      "id": 1,
      "name": "Alice",
      "email": "alice@example.com",
      "active": true
    },
    {
      "id": 2,
      "name": "Bob",
      "email": "bob@example.com",
      "active": true
    },
    {
      "id": 3,
      "name": "Charlie",
      "email": "charlie@example.com",
      "active": false
    }
  ],
  "total": 3,
  "page": 1
}
```

**TOON** (53 tokens):

```
users[3]{id,name,email,active}:
  1,Alice,alice@example.com,true
  2,Bob,bob@example.com,true
  3,Charlie,charlie@example.com,false
total: 3
page: 1
```

---

### 📊 Analytics data

**Savings: 115 tokens (55.0% reduction)**

**JSON** (209 tokens):

```json
{
  "metrics": [
    {
      "date": "2025-01-01",
      "views": 1234,
      "clicks": 89,
      "conversions": 12
    },
    {
      "date": "2025-01-02",
      "views": 2345,
      "clicks": 156,
      "conversions": 23
    },
    {
      "date": "2025-01-03",
      "views": 1890,
      "clicks": 123,
      "conversions": 18
    },
    {
      "date": "2025-01-04",
      "views": 3456,
      "clicks": 234,
      "conversions": 34
    },
    {
      "date": "2025-01-05",
      "views": 2789,
      "clicks": 178,
      "conversions": 27
    }
  ]
}
```

**TOON** (94 tokens):

```
metrics[5]{date,views,clicks,conversions}:
  2025-01-01,1234,89,12
  2025-01-02,2345,156,23
  2025-01-03,1890,123,18
  2025-01-04,3456,234,34
  2025-01-05,2789,178,27
```

</details>

<!-- /automd -->

> [!NOTE]
> Measured with [`gpt-tokenizer`](https://github.com/niieani/gpt-tokenizer) using `o200k_base` encoding (used by GPT-5 and other modern models). Savings will vary across models and tokenizers.
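
The "Saved" and "Reduction" columns follow directly from the two token counts. As a minimal sketch of that arithmetic (`tokenSavings` is a hypothetical helper for illustration, not part of the package):

```ts
// Hypothetical helper (not part of the TOON package): derives the
// "Saved" and "Reduction" columns from two token counts.
interface Savings {
  saved: number
  reductionPercent: number
}

function tokenSavings(jsonTokens: number, toonTokens: number): Savings {
  const saved = jsonTokens - toonTokens
  // Round to one decimal place, as in the benchmark table.
  const reductionPercent = Math.round((saved / jsonTokens) * 1000) / 10
  return { saved, reductionPercent }
}

console.log(tokenSavings(117, 49)) // "Small product catalog" row
console.log(tokenSavings(2917, 1139)) // "Total" row
```

Note that the total reduction is computed from the summed token counts, not as an average of the per-row percentages, so the large dataset dominates it.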

## Why TOON?

Standard JSON is verbose and token-expensive in LLM contexts:

```json
{
  "users": [
    { "id": 1, "name": "Alice", "role": "admin" },
    { "id": 2, "name": "Bob", "role": "user" }
  ]
}
```

TOON conveys the same information with **fewer tokens**:

```
users[2]{id,name,role}:
  1,Alice,admin
  2,Bob,user
```

## Key Features

- 📉 **Token-efficient:** typically 30–60% fewer tokens vs JSON on GPT-style tokenizers
- 📊 **Tabular arrays:** write object keys once, list rows beneath
- ✂️ **Minimal quoting:** only when required (e.g., commas, colons, ambiguous primitives)
- 📐 **Indentation-based structure:** no braces/brackets for objects
- 🎯 **Inline primitive arrays:** written without spaces after commas
- 🎲 **Deterministic:** stable key order, no trailing spaces/newline

## Installation

```bash
# npm
npm install @byjohann/toon

# pnpm
pnpm add @byjohann/toon

# yarn
yarn add @byjohann/toon
```

## Quick Start

```ts
import { encode } from '@byjohann/toon'

const data = {
  user: {
    id: 123,
    name: 'Ada',
    tags: ['admin', 'ops'],
    active: true
  }
}

console.log(encode(data))
```

Output:

```
user:
  id: 123
  name: Ada
  tags[2]: admin,ops
  active: true
```

## Canonical Formatting Rules

TOON formatting is deterministic and minimal:

- **Indentation**: 2 spaces per nesting level.
- **Lines**:
  - `key: value` for primitives (single space after colon).
  - `key:` for nested/empty objects (no trailing space on that line).
- **Arrays**:
  - Primitive arrays inline: `key[N]: v1,v2` (no spaces after commas).
  - List items: two spaces, hyphen, space (`"  - …"`).
- **Whitespace invariants**:
  - No trailing spaces at end of any line.
  - No trailing newline at end of output.
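
The whitespace rules above are easy to check mechanically, for example when validating TOON produced by a model. The following is a rough sketch only; `checkWhitespace` is a hypothetical helper, not something the package exports, and it does not validate any other part of the format:

```ts
// Hypothetical lint helper (not part of the TOON package).
// Returns a list of human-readable problems; an empty list means the
// text satisfies the indentation and whitespace rules listed above.
function checkWhitespace(toon: string, indent = 2): string[] {
  const problems: string[] = []
  if (toon.endsWith('\n')) {
    problems.push('output ends with a trailing newline')
  }
  toon.split('\n').forEach((line, index) => {
    const lineNumber = index + 1
    if (/ +$/.test(line)) {
      problems.push(`line ${lineNumber}: trailing spaces`)
    }
    // Count leading spaces; every nesting level adds `indent` spaces.
    const leadingSpaces = line.length - line.replace(/^ +/, '').length
    if (leadingSpaces % indent !== 0) {
      problems.push(`line ${lineNumber}: indentation is not a multiple of ${indent}`)
    }
  })
  return problems
}

console.log(checkWhitespace('user:\n  id: 123\n  name: Ada'))
console.log(checkWhitespace('user: \n   id: 1\n'))
```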

## Format Overview

### Objects

Simple objects with primitive values:

```ts
encode({
  id: 123,
  name: 'Ada',
  active: true
})
```

```
id: 123
name: Ada
active: true
```

Nested objects:

```ts
encode({
  user: {
    id: 123,
    name: 'Ada'
  }
})
```

```
user:
  id: 123
  name: Ada
```

### Arrays

#### Primitive Arrays (Inline)

```ts
encode({
  tags: ['admin', 'ops', 'dev']
})
```

```
tags[3]: admin,ops,dev
```

#### Arrays of Objects (Tabular)

When all objects share the same keys and contain only primitive values, TOON uses an efficient **tabular format**:

```ts
encode({
  items: [
    { sku: 'A1', qty: 2, price: 9.99 },
    { sku: 'B2', qty: 1, price: 14.5 }
  ]
})
```

```
items[2]{sku,qty,price}:
  A1,2,9.99
  B2,1,14.5
```
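
To make the layout concrete, here is a deliberately simplified sketch of how such a block could be assembled by hand. `renderTable` is a hypothetical function, not the package's implementation; it assumes uniform rows and that no key or value needs quoting:

```ts
type Primitive = string | number | boolean | null

// Hypothetical, simplified renderer (not the package's implementation).
// Assumes every row has the same keys and nothing needs quoting.
function renderTable(key: string, rows: Record<string, Primitive>[]): string {
  const first = rows[0]
  if (first === undefined) return `${key}[0]:` // empty array form
  const fields = Object.keys(first) // header order comes from the first row
  const header = `${key}[${rows.length}]{${fields.join(',')}}:`
  const lines = rows.map(row => '  ' + fields.map(field => String(row[field])).join(','))
  return [header, ...lines].join('\n')
}

console.log(renderTable('items', [
  { sku: 'A1', qty: 2, price: 9.99 },
  { sku: 'B2', qty: 1, price: 14.5 }
]))
```

The header carries the array length and the field names once; each row then repeats only the values, which is where most of the savings over JSON come from.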

#### Mixed and Non-Uniform Arrays

Arrays that don't meet the tabular requirements use list format:

```
items[3]:
  - 1
  - a: 1
  - text
```

When objects appear in list format, the first field is placed on the hyphen line:

```
items[2]:
  - id: 1
    name: First
  - id: 2
    name: Second
    extra: true
```

#### Arrays of Arrays

Arrays whose elements are themselves primitive arrays use list format, with each inner array written inline:

```ts
encode({
  pairs: [
    [1, 2],
    [3, 4]
  ]
})
```

```
pairs[2]:
  - [2]: 1,2
  - [2]: 3,4
```

#### Empty Arrays and Objects

Empty containers have special representations:

```ts
encode({ items: [] }) // items[0]:
encode([]) // [0]:
encode({}) // (empty output)
encode({ config: {} }) // config:
```

### Quoting Rules

TOON quotes strings **only when necessary** to maximize token efficiency. Inner spaces are allowed; leading or trailing spaces force quotes. Unicode and emoji are safe unquoted.

> [!NOTE]
> When using alternative delimiters (tab or pipe), the quoting rules adapt automatically. Strings containing the active delimiter will be quoted, while other delimiters remain safe.

#### Keys

Keys are quoted when any of the following is true:

| Condition | Examples |
|---|---|
| Contains spaces, commas, colons, quotes, control chars | `"full name"`, `"a,b"`, `"order:id"`, `"tab\there"` |
| Contains brackets or braces | `"[index]"`, `"{key}"` |
| Leading hyphen | `"-lead"` |
| Numeric-only key | `"123"` |
| Empty key | `""` |

**Notes:**

- Quotes and control characters in keys are escaped (e.g., `"he said \"hi\""`, `"line\nbreak"`).
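
The conditions in the table can be read as a single predicate. The sketch below is one possible reading; `keyNeedsQuotes` is a hypothetical name, and the package's actual checks may differ in edge cases (for example, in what counts as "numeric-only"):

```ts
// Hypothetical predicate mirroring the key-quoting table above.
function keyNeedsQuotes(key: string): boolean {
  if (key === '') return true // empty key
  if (/^\d+$/.test(key)) return true // numeric-only key (digits only, by assumption)
  if (key.startsWith('-')) return true // leading hyphen
  if (/[\[\]{}]/.test(key)) return true // brackets or braces
  // Spaces, commas, colons, double quotes, or control characters.
  return /[ ,:"\u0000-\u001f]/.test(key)
}

for (const key of ['name', 'full name', 'order:id', '-lead', '123']) {
  console.log(JSON.stringify(key), keyNeedsQuotes(key))
}
```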

#### String Values

String values are quoted when any of the following is true:

| Condition | Examples |
|---|---|
| Empty string | `""` |
| Contains active delimiter, colon, quote, backslash, or control chars | `"a,b"` (comma), `"a\tb"` (tab), `"a\|b"` (pipe), `"a:b"`, `"say \"hi\""`, `"C:\\Users"`, `"line1\\nline2"` |
| Leading or trailing spaces | `" padded "`, `"  "` |
| Looks like boolean/number/null | `"true"`, `"false"`, `"null"`, `"42"`, `"-3.14"`, `"1e-6"`, `"05"` |
| Starts with `"- "` (list-like) | `"- item"` |
| Looks like structural token | `"[5]"`, `"{key}"`, `"[3]: x,y"` |

> [!NOTE]
> **Delimiter-aware quoting:** The quoting rules are context-sensitive. When using tab or pipe delimiters, commas don't need quoting. Only the active delimiter triggers quoting – this applies to both array values and object values.
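
As an illustration of how these conditions combine with the active delimiter, here is an approximate predicate. `valueNeedsQuotes` is hypothetical and is not the package's implementation. Two simplifications are assumptions of this sketch: every character below U+0020 counts as a control character (so a tab is quoted even when it is not the active delimiter), and "looks like a structural token" is reduced to "starts with `[` or `{`".

```ts
type Delimiter = ',' | '\t' | '|'

// Hypothetical predicate approximating the string-value table above.
function valueNeedsQuotes(value: string, delimiter: Delimiter = ','): boolean {
  if (value === '') return true // empty string
  if (value !== value.trim()) return true // leading or trailing whitespace
  if (value.includes(delimiter)) return true // only the active delimiter counts
  if (/[:"\\\u0000-\u001f]/.test(value)) return true // colon, quote, backslash, control chars
  if (/^(true|false|null)$/.test(value)) return true // looks like boolean or null
  if (/^-?\d+(\.\d+)?(e[+-]?\d+)?$/i.test(value)) return true // looks like a number
  if (value.startsWith('- ')) return true // list-like
  if (/^[\[{]/.test(value)) return true // approximation of "structural token"
  return false
}

console.log(valueNeedsQuotes('a,b')) // comma is the default delimiter
console.log(valueNeedsQuotes('a,b', '|')) // comma is not the active delimiter here
console.log(valueNeedsQuotes('hello 👋 world'))
```

Passing the delimiter explicitly keeps the rule context-sensitive: the same string can be safe under one delimiter and require quotes under another.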

#### Examples

```
note: "hello, world"
items[3]: x,"true","- item"
hello 👋 world         // unquoted
" padded "             // quoted
value: null            // null value
name: ""               // empty string (quoted)
text: "line1\nline2"   // multi-line string (escaped)
```

### Tabular Format Requirements

For arrays of objects to use the efficient tabular format, all of the following must be true:

| Requirement | Detail |
|---|---|
| All elements are objects | No primitives in the array |
| Identical key sets | No missing or extra keys across rows |
| Primitive values only | No nested arrays or objects |
| Header key order | Taken from the first object |
| Header key quoting | Same rules as object keys |
| Row value quoting | Same rules as string values |

If any condition fails, TOON falls back to list format.
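
The first three requirements amount to a simple eligibility check. The following sketch shows one way to express it; `isTabular` and its helpers are hypothetical names, and the code assumes the input has already been normalized to JSON values. Treating an empty array as non-tabular is also an assumption of this sketch.

```ts
function isPrimitive(value: unknown): boolean {
  return value === null || typeof value === 'string' || typeof value === 'number' || typeof value === 'boolean'
}

function isPlainObject(value: unknown): value is Record<string, unknown> {
  return typeof value === 'object' && value !== null && !Array.isArray(value)
}

// Hypothetical check (not the package's implementation): true when
// every element is an object with the same key set and only primitive values.
function isTabular(items: unknown[]): boolean {
  const first = items[0]
  if (!isPlainObject(first)) return false // also covers the empty array
  const header = Object.keys(first) // header order comes from the first object
  return items.every(item => {
    if (!isPlainObject(item)) return false
    if (Object.keys(item).length !== header.length) return false
    return header.every(key =>
      Object.prototype.hasOwnProperty.call(item, key) && isPrimitive(item[key])
    )
  })
}

console.log(isTabular([{ sku: 'A1', qty: 2 }, { sku: 'B2', qty: 1 }]))
console.log(isTabular([{ id: 1 }, { id: 2, extra: true }]))
```

Key order within later objects does not matter here: only the key set is compared, while the header order is fixed by the first object.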

## Type Conversions

Some non-JSON types are automatically normalized for LLM-safe output:

| Input | Output |
|---|---|
| Number (finite) | Decimal form, no scientific notation; `-0` → `0` |
| Number (`NaN`, `±Infinity`) | `null` |
| `BigInt` | Decimal digits (no quotes) |
| `Date` | ISO string in quotes (e.g., `"2025-01-01T00:00:00.000Z"`) |
| `undefined` | `null` |
| `function` | `null` |
| `symbol` | `null` |

Number normalization examples:

```
-0    → 0
1e6   → 1000000
1e-6  → 0.000001
```
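
The table can also be read as a small decision function. The sketch below is illustrative only: `normalizeToText` is a hypothetical name, and it relies on JavaScript's default number-to-string conversion, which switches to exponent notation for very large or very small magnitudes. A complete implementation would have to expand those cases too.

```ts
// Hypothetical sketch of the type-conversion table (not the package's code).
// Returns the text that would appear in the output for the given value.
function normalizeToText(value: unknown): string {
  if (typeof value === 'number') {
    if (!Number.isFinite(value)) return 'null' // NaN, Infinity, -Infinity
    if (Object.is(value, -0)) return '0' // -0 is written as 0
    // Caveat: String() uses exponent notation for magnitudes such as
    // 1e21 or 1e-7; this sketch does not expand them.
    return String(value)
  }
  if (typeof value === 'bigint') return String(value) // decimal digits, no quotes
  // Quoted ISO string; assumes a valid Date (toISOString throws otherwise).
  if (value instanceof Date) return JSON.stringify(value.toISOString())
  if (value === undefined || typeof value === 'function' || typeof value === 'symbol') {
    return 'null'
  }
  throw new TypeError('value is not covered by this sketch')
}

console.log(normalizeToText(1e6))
console.log(normalizeToText(Number.NaN))
console.log(normalizeToText(new Date('2025-01-01T00:00:00.000Z')))
```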

## API

### `encode(value: unknown, options?: EncodeOptions): string`

Converts any JSON-serializable value to TOON format.

**Parameters:**

- `value` – Any JSON-serializable value (object, array, primitive, or nested structure). Non-JSON-serializable values (functions, symbols, undefined, non-finite numbers) are converted to `null`. Dates are converted to ISO strings, and BigInts are emitted as decimal integers (no quotes).
- `options` – Optional encoding options:
  - `indent?: number` – Number of spaces per indentation level (default: `2`)
  - `delimiter?: ',' | '\t' | '|'` – Delimiter for array values and tabular rows (default: `','`)

**Returns:**

A TOON-formatted string with no trailing newline or spaces.

**Example:**

```ts
import { encode } from '@byjohann/toon'

const items = [
  { sku: 'A1', qty: 2, price: 9.99 },
  { sku: 'B2', qty: 1, price: 14.5 }
]

console.log(encode({ items }))
```

**Output:**

```
items[2]{sku,qty,price}:
  A1,2,9.99
  B2,1,14.5
```

#### Delimiter Options

The `delimiter` option allows you to choose between comma (default), tab, or pipe delimiters for array values and tabular rows. Alternative delimiters can provide additional token savings in specific contexts.

##### Tab Delimiter (`\t`)

Using tab delimiters instead of commas can reduce token count further, especially for tabular data:

```ts
import { encode } from '@byjohann/toon'

const data = {
  items: [
    { sku: 'A1', name: 'Widget', qty: 2, price: 9.99 },
    { sku: 'B2', name: 'Gadget', qty: 1, price: 14.5 }
  ]
}

console.log(encode(data, { delimiter: '\t' }))
```

**Output:**

```
items[2]{sku,name,qty,price}:
  A1	Widget	2	9.99
  B2	Gadget	1	14.5
```

**Benefits:**

- Tabs often tokenize more efficiently than commas
- Tabs rarely appear in natural text, reducing the need for quote-escaping

**Considerations:**

- Some terminals and editors may collapse or expand tabs visually
- String values containing tabs will still require quoting

##### Pipe Delimiter (`|`)

Pipe delimiters offer a middle ground between commas and tabs:

```ts
// Reuses `encode` and `data` from the tab delimiter example above
console.log(encode(data, { delimiter: '|' }))
```

**Output:**

```
items[2]{sku,name,qty,price}:
  A1|Widget|2|9.99
  B2|Gadget|1|14.5
```

## Using TOON in LLM Prompts

When incorporating TOON into your LLM workflows:

- Wrap TOON data in a fenced code block in your prompt.
- Tell the model: "Do not add extra punctuation or spaces; follow the exact TOON format."
- When asking the model to generate TOON, specify the same rules (2-space indentation, no trailing spaces, quoting rules).
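
As a rough illustration of these tips, a prompt can be assembled around an already-encoded TOON string. `buildPrompt` and the exact wording of the instructions are hypothetical; adapt them to your model and task.

```ts
// Hypothetical prompt builder; `toon` is assumed to be the string
// returned by encode().
function buildPrompt(toon: string, question: string): string {
  // Built at runtime so this example's own code fence stays intact.
  const fence = '`'.repeat(3)
  return [
    'The data below is in TOON format (2-space indentation, no trailing spaces).',
    'Do not add extra punctuation or spaces; follow the exact TOON format.',
    '',
    fence,
    toon,
    fence,
    '',
    question
  ].join('\n')
}

const toon = 'users[2]{id,name,role}:\n  1,Alice,admin\n  2,Bob,user'
console.log(buildPrompt(toon, 'Which users have the admin role?'))
```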

## Token Savings Example

Here's a realistic API response to illustrate the token savings:

**JSON:**
```json
{
  "users": [
    { "id": 1, "name": "Alice", "email": "alice@example.com", "active": true },
    { "id": 2, "name": "Bob", "email": "bob@example.com", "active": true },
    { "id": 3, "name": "Charlie", "email": "charlie@example.com", "active": false }
  ]
}
```

**TOON:**

```
users[3]{id,name,email,active}:
  1,Alice,alice@example.com,true
  2,Bob,bob@example.com,true
  3,Charlie,charlie@example.com,false
```

Typical savings vs JSON are in the **30–60% range** on GPT-style tokenizers, driven by:

- Tabular arrays of objects (keys written once)
- No structural braces/brackets
- Minimal quoting
- No spaces after commas

## Notes and Limitations

- **Token counts vary by tokenizer and model.** Benchmarks use a GPT-style tokenizer (`o200k_base`); actual savings will differ with other models and tokenizers (e.g., SentencePiece).
- **TOON is designed for LLM contexts** where human readability and token efficiency matter. It's **not** a drop-in replacement for JSON in APIs or storage.
- **Tabular arrays** require all objects to have exactly the same keys with primitive values only. Arrays with mixed types (primitives + objects/arrays), non-uniform objects, or nested structures will use a more verbose list format.
- **Object key order** is preserved from the input. In tabular arrays, header order follows the first object's keys.
- **Arrays mixing element kinds** (primitives, objects, or arrays) always use list form:
  ```
  items[2]:
    - a: 1
    - [2]: 1,2
  ```
- **Deterministic formatting:** 2-space indentation, stable key order, no trailing spaces/newline.

## Quick Reference

```
// Object
{ id: 1, name: 'Ada' }          → id: 1
                                   name: Ada

// Nested object
{ user: { id: 1 } }             → user:
                                     id: 1

// Primitive array (inline)
{ tags: ['a', 'b'] }            → tags[2]: a,b

// Tabular array (uniform objects)
{ items: [                      → items[2]{id,qty}:
  { id: 1, qty: 5 },                1,5
  { id: 2, qty: 3 }                 2,3
]}

// Mixed / non-uniform (list)
{ items: [1, { a: 1 }, 'x'] }   → items[3]:
                                     - 1
                                     - a: 1
                                     - x

// Array of arrays
{ pairs: [[1, 2], [3, 4]] }     → pairs[2]:
                                     - [2]: 1,2
                                     - [2]: 3,4

// Root array
['x', 'y']                      → [2]: x,y

// Empty containers
{}                              → (empty output)
{ items: [] }                   → items[0]:

// Special quoting
{ note: 'hello, world' }        → note: "hello, world"
{ items: ['true', true] }       → items[2]: "true",true
```

## License

[MIT](./LICENSE) License © 2025-PRESENT [Johann Schopplich](https://github.com/johannschopplich)