mirror of
https://github.com/voson-wang/toon.git
synced 2026-01-29 23:34:10 +08:00
# Getting Started

## What is TOON?

**Token-Oriented Object Notation** is a compact, human-readable encoding of the JSON data model for LLM prompts. It provides a lossless serialization of the same objects, arrays, and primitives as JSON, but in a syntax that minimizes tokens and makes structure easy for models to follow.

TOON combines YAML's indentation-based structure for nested objects with a CSV-style tabular layout for uniform arrays. TOON's sweet spot is uniform arrays of objects (multiple fields per row, same structure across items), achieving CSV-like compactness while adding explicit structure that helps LLMs parse and validate data reliably.

Think of it as a translation layer: use JSON programmatically, and encode it as TOON for LLM input.

### Why TOON?

Standard JSON is verbose and token-expensive. For uniform arrays of objects, JSON repeats every field name for every record:

```json
{
  "users": [
    { "id": 1, "name": "Alice", "role": "admin" },
    { "id": 2, "name": "Bob", "role": "user" }
  ]
}
```

YAML already reduces some redundancy with indentation instead of braces:

```yaml
users:
  - id: 1
    name: Alice
    role: admin
  - id: 2
    name: Bob
    role: user
```

TOON goes further by declaring fields once and streaming data as rows:

```yaml
users[2]{id,name,role}:
  1,Alice,admin
  2,Bob,user
```

The `[2]` declares the array length, enabling LLMs to answer dataset size questions and detect truncation. The `{id,name,role}` declares the field names. Each row is then a compact, comma-separated list of values. This is the core pattern: declare structure once, stream data compactly. The format approaches CSV's efficiency while adding explicit structure.

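To make the "declare once, stream rows" pattern concrete, here is a minimal sketch of how such a tabular block could be produced by hand. `encodeTable` is a hypothetical helper written for this illustration, not the library's encoder: it assumes a non-empty array whose rows all share the keys of the first row, and it skips quoting and escaping entirely.

```typescript
type Primitive = string | number | boolean | null
type Row = Record<string, Primitive>

// Hypothetical helper for illustration only (not part of @toon-format/toon).
// Assumes a non-empty array of rows with identical keys and no values that
// would need quoting (for example, strings containing commas).
function encodeTable(key: string, rows: Row[]): string {
  if (rows.length === 0) {
    throw new Error('this sketch only handles non-empty arrays')
  }
  // Field names come from the first row and are declared exactly once.
  const fields = Object.keys(rows[0])
  const header = `${key}[${rows.length}]{${fields.join(',')}}:`
  // Each row is indented by two spaces and carries values only.
  const lines = rows.map(
    row => '  ' + fields.map(field => String(row[field])).join(','),
  )
  return [header, ...lines].join('\n')
}

const table = encodeTable('users', [
  { id: 1, name: 'Alice', role: 'admin' },
  { id: 2, name: 'Bob', role: 'user' },
])
console.log(table)
```

For real data, prefer the library's `encode`, which also handles quoting, nesting, and non-uniform arrays.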
For a more realistic example, here's how TOON handles a dataset with both nested objects and tabular arrays:

::: code-group

```json [JSON (235 tokens)]
{
  "context": {
    "task": "Our favorite hikes together",
    "location": "Boulder",
    "season": "spring_2025"
  },
  "friends": ["ana", "luis", "sam"],
  "hikes": [
    {
      "id": 1,
      "name": "Blue Lake Trail",
      "distanceKm": 7.5,
      "elevationGain": 320,
      "companion": "ana",
      "wasSunny": true
    },
    {
      "id": 2,
      "name": "Ridge Overlook",
      "distanceKm": 9.2,
      "elevationGain": 540,
      "companion": "luis",
      "wasSunny": false
    },
    {
      "id": 3,
      "name": "Wildflower Loop",
      "distanceKm": 5.1,
      "elevationGain": 180,
      "companion": "sam",
      "wasSunny": true
    }
  ]
}
```

```yaml [TOON (106 tokens)]
context:
  task: Our favorite hikes together
  location: Boulder
  season: spring_2025
friends[3]: ana,luis,sam
hikes[3]{id,name,distanceKm,elevationGain,companion,wasSunny}:
  1,Blue Lake Trail,7.5,320,ana,true
  2,Ridge Overlook,9.2,540,luis,false
  3,Wildflower Loop,5.1,180,sam,true
```

:::

Notice how TOON combines YAML's indentation for the `context` object with inline format for the primitive `friends` array and tabular format for the structured `hikes` array. Each format is chosen automatically based on the data structure.

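The choice of layout can be pictured as a small classification step over each value. The sketch below is an illustration under stated assumptions, not the library's implementation: `pickLayout` and its layout labels are hypothetical names, and "tabular" is assumed to mean an array of objects that share the same keys and hold only primitive values. The authoritative eligibility rules are in the specification.

```typescript
// Labels are this sketch's own names, not identifiers from the library.
type Layout = 'primitive' | 'nested' | 'inline' | 'tabular' | 'list'

const isPrimitive = (v: unknown): boolean =>
  v === null || ['string', 'number', 'boolean'].includes(typeof v)

const isPlainObject = (v: unknown): v is Record<string, unknown> =>
  typeof v === 'object' && v !== null && !Array.isArray(v)

const sameKeys = (a: string[], b: string[]): boolean =>
  a.length === b.length && a.every(key => b.includes(key))

function pickLayout(value: unknown): Layout {
  if (isPrimitive(value)) return 'primitive'
  if (isPlainObject(value)) return 'nested' // indentation-based, like `context`
  if (!Array.isArray(value)) throw new TypeError('not a JSON value')
  const items: unknown[] = value
  // Arrays of primitives fit on one line, like `friends`.
  if (items.every(isPrimitive)) return 'inline'
  if (items.every(isPlainObject)) {
    const keys = Object.keys(items[0])
    const uniform = items.every(
      item =>
        sameKeys(Object.keys(item), keys) &&
        Object.values(item).every(isPrimitive),
    )
    // Uniform rows of primitives can share one header, like `hikes`.
    if (uniform) return 'tabular'
  }
  // Anything else falls back to a per-item layout in this sketch.
  return 'list'
}

console.log(pickLayout({ task: 'Our favorite hikes together' }))
console.log(pickLayout(['ana', 'luis', 'sam']))
console.log(pickLayout([{ id: 1, name: 'Blue Lake Trail' }, { id: 2, name: 'Ridge Overlook' }]))
```

Applied to the dataset above, this classifies `context` as nested, `friends` as inline, and `hikes` as tabular.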
### Design Goals

TOON is optimized for specific use cases. It aims to:

- Make uniform arrays of objects as compact as possible by declaring structure once and streaming data.
- Stay fully lossless and deterministic – round-trips preserve all data and structure.
- Keep parsing simple and robust for both LLMs and humans through explicit structure markers.
- Provide validation guardrails (array lengths, field counts) that help detect truncation and malformed output.

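The validation guardrails in the last goal can be sketched as a simple check of a tabular block against its own header. `checkTable` below is a hypothetical helper for illustration, not a library function; it assumes a single tabular block whose values contain no quoted commas.

```typescript
interface TableCheck {
  ok: boolean
  problems: string[]
}

// Hypothetical checker: compares the declared [N] and {fields} in the header
// with the rows that actually follow. Splits rows on bare commas only.
function checkTable(block: string): TableCheck {
  const lines = block.split('\n').filter(line => line.trim() !== '')
  if (lines.length === 0) return { ok: false, problems: ['empty input'] }

  const [header, ...rows] = lines
  const match = /^(\w+)\[(\d+)\]\{([^}]*)\}:$/.exec(header.trim())
  if (!match) return { ok: false, problems: ['unrecognized header'] }

  const declaredLength = Number(match[2])
  const fieldCount = match[3].split(',').length
  const problems: string[] = []

  // A row count below the declared length suggests truncated output.
  if (rows.length !== declaredLength) {
    problems.push(`expected ${declaredLength} rows, found ${rows.length}`)
  }
  // A row with the wrong number of cells suggests malformed output.
  rows.forEach((row, index) => {
    const cells = row.trim().split(',').length
    if (cells !== fieldCount) {
      problems.push(`row ${index + 1}: expected ${fieldCount} fields, found ${cells}`)
    }
  })
  return { ok: problems.length === 0, problems }
}

const truncated = 'users[3]{id,name,role}:\n  1,Alice,admin\n  2,Bob'
console.log(checkTable(truncated).problems)
```

For the truncated input above, the check reports both the missing row and the short second row, which is the kind of signal the explicit `[N]` and `{fields}` markers are meant to provide.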
## When to Use TOON

TOON excels with uniform arrays of objects – data with the same structure across items. For LLM prompts, the format produces deterministic, minimally quoted text with built-in validation. Explicit array lengths (`[N]`) and field headers (`{fields}`) help detect truncation and malformed data, while the tabular structure declares fields once rather than repeating them in every row.

::: tip Production Ready
TOON is production-ready and actively maintained, with implementations in TypeScript, Python, Go, Rust, .NET, and more. The format is stable, but also an idea in progress. Nothing's set in stone – help shape where it goes by contributing to the [specification](https://github.com/toon-format/spec) or sharing feedback.
:::

## When Not to Use TOON

TOON is not always the best choice. Consider alternatives when:

- **Deeply nested or non-uniform structures** (tabular eligibility ≈ 0%): JSON-compact often uses fewer tokens. Example: complex configuration objects with many nested levels.
- **Semi-uniform arrays** (~40–60% tabular eligibility): Token savings diminish. Prefer JSON if your pipelines already rely on it.
- **Pure tabular data**: CSV is smaller than TOON for flat tables. TOON adds minimal overhead (~5–10%) to provide structure (array length declarations, field headers, delimiter scoping) that improves LLM reliability.
- **Latency-critical applications**: Benchmark on your exact setup. Some deployments (especially local/quantized models) may process compact JSON faster despite TOON's lower token count.

> [!NOTE]
> For data-driven comparisons across different structures, see [benchmarks](/guide/benchmarks). When optimizing for latency, measure TTFT, tokens/sec, and total time for both TOON and JSON-compact and use whichever performs better in your specific environment.

## Installation

### TypeScript Library

Install the library via your preferred package manager:

::: code-group

```bash [npm]
npm install @toon-format/toon
```

```bash [pnpm]
pnpm add @toon-format/toon
```

```bash [yarn]
yarn add @toon-format/toon
```

:::

### CLI

The CLI can be used without installation via `npx`, or installed globally:

::: code-group

```bash [npx (no install)]
npx @toon-format/cli input.json -o output.toon
```

```bash [npm]
npm install -g @toon-format/cli
```

```bash [pnpm]
pnpm add -g @toon-format/cli
```

```bash [yarn]
yarn global add @toon-format/cli
```

:::

For full CLI documentation, see the [CLI reference](/cli/).

## Your First Example

The examples below use the TypeScript library for demonstration, but the same operations work in any language with a TOON implementation.

Let's encode a simple dataset:

```ts
import { encode } from '@toon-format/toon'

const data = {
  users: [
    { id: 1, name: 'Alice', role: 'admin' },
    { id: 2, name: 'Bob', role: 'user' }
  ]
}

console.log(encode(data))
```

**Output:**

```yaml
users[2]{id,name,role}:
  1,Alice,admin
  2,Bob,user
```

### Decoding Back to JSON

Decoding is just as simple:

```ts
import { decode } from '@toon-format/toon'

const toon = `
users[2]{id,name,role}:
  1,Alice,admin
  2,Bob,user
`

const data = decode(toon)
console.log(JSON.stringify(data, null, 2))
```

**Output:**

```json
{
  "users": [
    { "id": 1, "name": "Alice", "role": "admin" },
    { "id": 2, "name": "Bob", "role": "user" }
  ]
}
```

Round-tripping is lossless: `decode(encode(x))` always equals `x` (after normalization of non-JSON types like `Date`, `NaN`, etc.).

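To see what "after normalization" means in practice, the sketch below uses plain `JSON.stringify` and `JSON.parse` as an analogy. It does not call the TOON library, and TOON's own normalization rules (defined in the specification) may differ in detail; the point is only that non-JSON values are mapped to JSON values once, and the result is stable from then on.

```typescript
// Plain JSON round trip, used only as an analogy for normalization.
const original = { when: new Date(0), score: NaN, name: 'Alice' }

// Date becomes an ISO 8601 string and NaN becomes null, because neither
// has a direct representation in the JSON data model.
const roundTripped = JSON.parse(JSON.stringify(original))
console.log(roundTripped)

// A second round trip changes nothing: the normalized value is stable.
const again = JSON.parse(JSON.stringify(roundTripped))
console.log(JSON.stringify(again) === JSON.stringify(roundTripped))
```

In other words, the lossless guarantee is best read as applying to values that are already within the JSON data model.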
## Where to Go Next

Now that you've seen your first TOON document, read the [Format Overview](/guide/format-overview) for complete syntax details (objects, arrays, quoting rules, key folding), then explore [Using TOON with LLMs](/guide/llm-prompts) to see how to use it effectively in prompts. For implementation details, check the [API reference](/reference/api) (TypeScript) or the [specification](/reference/spec) (language-agnostic normative rules).