diff --git a/README.md b/README.md
index b862d27..7b3ccba 100644
--- a/README.md
+++ b/README.md
@@ -2,18 +2,40 @@
 # Token-Oriented Object Notation (TOON)
 
-AI is becoming cheaper and more accessible, but larger context windows allow for larger data inputs as well. **LLM tokens still cost money** – this is where TOON comes in.
-
-**Token-Oriented Object Notation** is a compact, human-readable format designed for passing structured data to Large Language Models. It reduces token usage compared to JSON by:
-
-- Removing redundant punctuation (braces/brackets, most quotes)
-- Using indentation for structure
-- Tabularizing arrays of objects
-- Writing inline primitive arrays without spaces
+**Token-Oriented Object Notation** is a compact, human-readable format designed for passing structured data to Large Language Models with significantly reduced token usage.
 
 > [!TIP]
 > Wrap your JSON in `encode()` before sending it to LLMs and save ~1/2 of the token cost for structured data!
 
+## Why TOON?
+
+AI is becoming cheaper and more accessible, but larger context windows allow for larger data inputs as well. **LLM tokens still cost money** – and standard JSON is verbose and token-expensive:
+
+```json
+{
+  "users": [
+    { "id": 1, "name": "Alice", "role": "admin" },
+    { "id": 2, "name": "Bob", "role": "user" }
+  ]
+}
+```
+
+TOON conveys the same information with **fewer tokens**:
+
+```
+users[2]{id,name,role}:
+  1,Alice,admin
+  2,Bob,user
+```
+
+## Key Features
+
+- 💸 **Token-efficient:** typically 30–60% fewer tokens than JSON
+- 🤿 **LLM-friendly guardrails:** explicit lengths and field lists help models validate output
+- 🍱 **Minimal syntax:** removes redundant punctuation (braces, brackets, most quotes)
+- 📏 **Indentation-based structure:** replaces braces with whitespace for better readability
+- 🧺 **Tabular arrays:** declare keys once, then stream rows without repetition
+
 ## Token Benchmarks
 
@@ -182,35 +204,6 @@ metrics[5]{date,views,clicks,conversions}:
 > [!NOTE]
 > Measured with [`gpt-tokenizer`](https://github.com/niieani/gpt-tokenizer) using `o200k_base` encoding (used by GPT-5 and other modern models). Savings will vary across models and tokenizers.
 
-## Why TOON?
-
-Standard JSON is verbose and token-expensive in LLM contexts:
-
-```json
-{
-  "users": [
-    { "id": 1, "name": "Alice", "role": "admin" },
-    { "id": 2, "name": "Bob", "role": "user" }
-  ]
-}
-```
-
-TOON conveys the same information with **fewer tokens**:
-
-```
-users[2]{id,name,role}:
-  1,Alice,admin
-  2,Bob,user
-```
-
-## Key Features
-
-- 💸 **Token-efficient:** typically 30–60% fewer tokens vs JSON on GPT-style tokenizers, based on real benchmarks
-- 🎛️ **Deterministic, tokenizer-aware output:** minimal quoting and stable ordering keep payloads compact and reproducible
-- 🧺 **Tabular arrays without repetition:** declare uniform keys once, then stream rows for dense datasets
-- 📏 **Readable yet concise structure:** indentation replaces braces so nested data stays scannable without extra tokens
-- 🔢 **LLM-friendly guardrails:** explicit lengths and field lists help models validate and reproduce structured responses
-
 ## Installation
 
 ```bash
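
The `encode()` call referenced in the TIP above is never shown in use within this diff. A minimal sketch of how it would be exercised, assuming the package exposes a named `encode` export; the `@toon-format/toon` import path is an assumption, since the Installation section is truncated here:

```ts
// Sketch of the encode() usage mentioned in the README's TIP.
// Assumption: the package ships a named `encode` export; the import
// path below is hypothetical and not confirmed by this diff.
import { encode } from "@toon-format/toon";

const data = {
  users: [
    { id: 1, name: "Alice", role: "admin" },
    { id: 2, name: "Bob", role: "user" },
  ],
};

// Per the README's own example, this should print:
// users[2]{id,name,role}:
//   1,Alice,admin
//   2,Bob,user
console.log(encode(data));
```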