# TOON Specification (v1)
Status: Draft, normative where indicated. This version specifies encoding (producer behavior). A formal decoding spec is out of scope for v1.
- Normative statements use RFC 2119/8174 keywords: MUST, MUST NOT, SHOULD, SHOULD NOT, MAY.
- This spec targets implementers of encoders/validators, tool authors, and practitioners embedding TOON in LLM prompts.
Changelog:
- v1: Initial encoding + normalization + conformance rules based on reference encoder and test suite.
Scope:
- This document defines the data model, normalization (for the reference JavaScript/TypeScript encoder), concrete syntax, and conformance requirements for producing TOON. Decoding is informative only and not standardized in v1.
## 1. Terminology and Conventions
- TOON document: A sequence of UTF-8 text lines formatted according to this spec.
- Line: A sequence of non-newline characters terminated by LF (U+000A) in serialized form. TOON output MUST use LF line endings.
- Indentation level (depth): The number of indentation units (spaces) applied to a line. Depth 0 lines have no leading indentation.
- Indentation unit: A fixed number of spaces per level (default 2). Tabs MUST NOT be used for indentation.
- Header: The bracketed declaration for arrays, optionally followed by a field list, and terminating with a colon: e.g., key[3]: or items[2]{a,b}:.
- Field list: The brace-enclosed, delimiter-separated list of field names for tabular arrays: {f1<delim>f2}.
- List item: A line beginning with a hyphen and a space at a given depth (“- ”), representing an element in an expanded array form.
- Delimiter: The character used to separate array/tabular values: comma (default), tab, or pipe.
- Length marker: An optional “#” prefix for array lengths in headers, e.g., [#3].
- Primitive: string, number, boolean, or null.
- Object: Mapping from string keys to JsonValue.
- Array: Ordered sequence of JsonValue.
- JsonValue: Primitive | Object | Array.
Notation:
- Regular expressions appear in slash-delimited form.
- Examples are informative unless stated otherwise.
## 2. Data Model
- TOON models data as:
  - JsonPrimitive: string | number | boolean | null
  - JsonObject: { [string]: JsonValue }
  - JsonArray: JsonValue[]
- Ordering:
  - Array order is preserved.
  - Object key order is preserved as encountered by the encoder.
- Numeric canonicalization:
  - -0 MUST be normalized to 0.
  - Finite numbers MUST be rendered without scientific notation (e.g., 1e6 → 1000000, 1e-6 → 0.000001), as per host-language number-to-string rules that avoid exponent notation in these cases.
- Null semantics: null is represented as the literal null.
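
The numeric canonicalization rules above can be sketched in TypeScript as follows. This is a minimal illustration, not the reference encoder's code: the name `formatNumber` is hypothetical, and returning the text `null` for non-finite input is an assumption that simply mirrors the normalization described in Section 3.

```typescript
// Hypothetical helper: render a number in base 10 without exponent notation.
function formatNumber(n: number): string {
  // Assumption: non-finite values are treated as null (see Section 3).
  if (!Number.isFinite(n)) return "null";
  // -0 is normalized to 0.
  if (Object.is(n, -0)) return "0";

  const text = String(n);
  const match = /^(-?)(\d+)(?:\.(\d+))?e([+-]\d+)$/.exec(text);
  // String(n) already avoids exponents for many magnitudes (e.g. 1e6, 1e-6).
  if (match === null) return text;

  const sign = match[1] ?? "";
  const intPart = match[2] ?? "";
  const fracPart = match[3] ?? "";
  const exponent = Number(match[4]);
  const digits = intPart + fracPart;
  // Index within `digits` where the decimal point belongs after shifting.
  const pointIndex = intPart.length + exponent;

  if (pointIndex <= 0) {
    return sign + "0." + "0".repeat(-pointIndex) + digits;
  }
  if (pointIndex >= digits.length) {
    return sign + digits + "0".repeat(pointIndex - digits.length);
  }
  return sign + digits.slice(0, pointIndex) + "." + digits.slice(pointIndex);
}
```

The sketch only rewrites the exponent form that JavaScript produces for very large or very small magnitudes; all other values pass through `String(n)` unchanged.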
## 3. Host-Language Normalization (Reference Encoder)
The reference encoder normalizes non-JSON values to the data model as follows:
- Number:
  - Finite: retained as number. -0 → 0. Non-exponential canonical form is required.
  - NaN, +Infinity, -Infinity: normalized to null.
- BigInt (JavaScript):
  - If within Number.MIN_SAFE_INTEGER..Number.MAX_SAFE_INTEGER: converted to number.
  - Otherwise: converted to a decimal string (e.g., "9007199254740993"). This string is then encoded using the string rules (see Section 6), and because it is numeric-like, it will be quoted.
- Date: converted to ISO string (e.g., "2025-01-01T00:00:00.000Z").
- Set: converted to array by iterating entries and normalizing each element.
- Map: converted to object using String(key) for keys and normalizing values.
- Plain object: own enumerable string keys are preserved in encounter order, values normalized recursively.
- Function, symbol, undefined, or unrecognized types: normalized to null.
Note: Other language ports SHOULD apply analogous normalization strategies consistent with this spec's data model and encoding rules.
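
A minimal TypeScript sketch of the normalization rules in this section is shown below. The names `JsonValue` and `normalize` are hypothetical and the code is not taken from the reference encoder. It assumes a runtime with BigInt support and that Date values are valid (an invalid Date would make `toISOString` throw).

```typescript
type JsonValue =
  | string
  | number
  | boolean
  | null
  | JsonValue[]
  | { [key: string]: JsonValue };

// Hypothetical helper mirroring the normalization rules listed above.
function normalize(value: unknown): JsonValue {
  if (value === null) return null;
  if (typeof value === "string" || typeof value === "boolean") return value;
  if (typeof value === "number") {
    if (!Number.isFinite(value)) return null; // NaN, +Infinity, -Infinity
    return Object.is(value, -0) ? 0 : value;
  }
  if (typeof value === "bigint") {
    const safe =
      value >= BigInt(Number.MIN_SAFE_INTEGER) &&
      value <= BigInt(Number.MAX_SAFE_INTEGER);
    // Out-of-range values become decimal strings (quoted later as numeric-like).
    return safe ? Number(value) : value.toString();
  }
  if (value instanceof Date) return value.toISOString();
  if (Array.isArray(value)) return value.map((item) => normalize(item));
  if (value instanceof Set) {
    const items: JsonValue[] = [];
    value.forEach((item) => items.push(normalize(item)));
    return items;
  }
  if (value instanceof Map) {
    const out: { [key: string]: JsonValue } = {};
    value.forEach((v, k) => {
      out[String(k)] = normalize(v);
    });
    return out;
  }
  if (typeof value === "object") {
    const proto = Object.getPrototypeOf(value);
    if (proto === Object.prototype || proto === null) {
      const source = value as { [key: string]: unknown };
      const out: { [key: string]: JsonValue } = {};
      // Object.keys yields own enumerable string keys in encounter order.
      for (const key of Object.keys(source)) out[key] = normalize(source[key]);
      return out;
    }
  }
  return null; // function, symbol, undefined, or unrecognized type
}
```

Treating class instances other than Date, Set, and Map as "unrecognized" is one reading of the last rule; a port may reasonably choose a different boundary for what counts as a plain object.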
## 4. Concrete Syntax Overview
TOON is a deterministic, line-oriented, indentation-based notation:
- Objects:
  - key: value for primitives.
  - key: alone for nested or empty objects, with nested fields indented one level.
- Arrays:
  - Primitive arrays are inline: key[N<delim?>]: v1<delim>v2.
  - Arrays of arrays (primitives): expanded list under a header: key[N<delim?>]: then “- [M<delim?>]: …” lines.
  - Arrays of objects:
    - Tabular form when uniform and primitive-only: key[N<delim?>]{f1<delim>f2}: then one row per line.
    - Otherwise expanded list: key[N<delim?>]: with “- …” items, following object-as-list-item rules.
- Whitespace invariants:
  - No trailing spaces at the end of any line.
  - No trailing newline at the end of the document.
  - Exactly one space after the colon in key: value lines, and after the colon of an array header when inline values follow (non-empty primitive arrays).
## 5. Tokens and Lexical Elements
- Structural characters: colon (:), hyphen (-), brackets ([ ]), braces ({ }), double-quote ("), backslash (\).
- Delimiters:
  - Comma (,) is the default.
  - Tab (\t) and pipe (|) are supported alternatives.
  - The active delimiter MAY appear inside array headers (see Section 7).
- Indentation unit: default 2 spaces per level; configurable at encode-time.
- List item markers: “- ” (hyphen + single space) at the appropriate indentation level. An empty object as a list item is represented as a lone hyphen (“-”).
- Character set: UTF-8. Tabs MUST NOT appear as indentation but MAY appear as the chosen delimiter or inside quoted strings via escapes.
## 6. String and Key Encoding
6.1 Escaping
The following characters in quoted strings and keys MUST be escaped:
- Backslash: "\\" → "\\\\"
- Double quote: "\"" → "\\\""
- Newline: U+000A → "\\n"
- Carriage return: U+000D → "\\r"
- Tab: U+0009 → "\\t"
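
The escape table above could be applied as in the following sketch. The function name `escapeString` is hypothetical. Backslashes are replaced first so that the backslashes introduced by the later replacements are not doubled.

```typescript
// Hypothetical helper: apply the five escapes required inside quoted strings and keys.
function escapeString(s: string): string {
  return s
    .replace(/\\/g, "\\\\") // backslash first, so later escapes stay intact
    .replace(/"/g, '\\"')
    .replace(/\n/g, "\\n")
    .replace(/\r/g, "\\r")
    .replace(/\t/g, "\\t");
}
```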
6.2 Quoting Rules for String Values
A string value MUST be quoted (with escaping as above) if any of the following is true:
- It is empty ("").
- It has leading or trailing whitespace.
- It equals true, false, or null (case-sensitive matches of these literals).
- It is numeric-like:
  - Matches /^-?\d+(?:\.\d+)?(?:e[+-]?\d+)?$/i (e.g., "42", "-3.14", "1e-6").
  - Or matches /^0\d+$/ (leading-zero decimals such as "05").
- It contains a colon (:), double quote ("), or backslash (\).
- It contains brackets or braces ([, ], {, }).
- It contains control characters such as newline, carriage return, or tab.
- It contains the active delimiter (comma, tab, or pipe).
- It starts with a hyphen (-), to avoid ambiguity with list markers.
If none of the conditions above apply, the string MAY be emitted without quotes. Unicode, emoji, and strings with internal (non-leading/trailing) spaces are safe unquoted provided they do not violate the conditions.
6.3 Key Encoding
Object keys and tabular field names:
- MAY be unquoted only if they match the pattern: /^[A-Za-z_][\w.]*$/.
- Otherwise, they MUST be quoted using the escaping rules above.
Note: Keys containing spaces, punctuation (e.g., colon, pipe, hyphen), or starting with a digit MUST be quoted.
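
As an illustration of the key rules above, the sketch below emits a key bare when it matches the safe pattern and otherwise quotes it with the Section 6.1 escapes. The name `encodeKey` is hypothetical.

```typescript
// Hypothetical helper: encode an object key or tabular field name.
function encodeKey(key: string): string {
  // Safe keys start with a letter or underscore, then word characters or dots.
  if (/^[A-Za-z_][\w.]*$/.test(key)) return key;
  const escaped = key
    .replace(/\\/g, "\\\\")
    .replace(/"/g, '\\"')
    .replace(/\n/g, "\\n")
    .replace(/\r/g, "\\r")
    .replace(/\t/g, "\\t");
  return '"' + escaped + '"';
}
```

For example, `encodeKey("user.name")` stays bare, while `encodeKey("full name")` and `encodeKey("2nd")` come back quoted.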
## 7. Array Headers
General header syntax:
- Without key (root arrays): [<marker?>N<delim?>]:
- With key: key[<marker?>N<delim?>]:
- With tabular fields: key[<marker?>N<delim?>]{field1<delim>field2}:
Where:
- N is the array length (non-negative integer).
- <marker?> is optional “#” when the length marker option is enabled (Section 10).
- <delim?> is:
  - Absent when the delimiter is comma.
  - Present and equal to the active delimiter when the delimiter is tab or pipe.
- Field names within braces are separated by the active delimiter and encoded using key rules (Section 6.3).
- Every header line MUST end with a colon.
Spacing:
- When an inline list of values follows a header on the same line (non-empty primitive arrays), there MUST be exactly one space after the colon before the first value.
- Otherwise, no trailing space follows the colon on the header line.
## 8. Primitive Encoding
- null: literal null.
- boolean: true or false (lowercase).
- number:
  - Finite: base-10 non-exponential representation, preserving sign except -0 normalized to 0.
  - Non-finite (NaN, ±Infinity): treated as null via normalization (Section 3).
- string: encoded per Section 6 with delimiter-aware quoting.
## 9. Object Syntax
- Primitive fields: key: value (single space after colon).
- Nested or empty objects: key: on its own line; if non-empty, nested fields appear at one more indentation level.
- Key order: Implementations MUST preserve the encounter order when emitting fields.
- An empty object at the root results in an empty document (no lines).
## 10. Arrays
10.1 Primitive Arrays (Inline)
- Non-empty arrays: key[N<delim?>]: v1<delim>v2<delim>… where each vi is encoded as a primitive (Section 8) with delimiter-aware quoting (Section 6).
- Empty arrays: key[0<delim?>]: (no values following).
- Root arrays use the same rules without a key: [N<delim?>]: v1<delim>…
10.2 Arrays of Arrays (Primitives Only) — Expanded List
- Parent header: key[N<delim?>]: on its own line.
- Each inner primitive array is a list item:
  - - [M<delim?>]: v1<delim>v2<delim>…
  - Empty inner arrays: - [0<delim?>]:
- Root arrays of arrays use [N<delim?>]: as the parent header with the same rules.
10.3 Arrays of Objects — Tabular Form
Tabular detection (MUST hold for all rows):
- Every element is an object.
- All objects have the same set of keys (order per object MAY vary).
- All values across these keys are primitives (no nested arrays/objects).
When satisfied:
- Header: key[N<delim?>]{f1<delim>f2<delim>…}: where the field order is the encounter order of the first object's keys.
- Field names encoded as keys (Section 6.3), delimiter-aware.
- Rows: one line per object at one indentation level under the header, values joined by the active delimiter. Each value encoded as a primitive (Section 8) with delimiter-aware quoting (Section 6).
- Root tabular arrays omit the key: [N<delim?>]{…}: then rows.
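
The tabular form can be sketched as below. This is an illustrative TypeScript sketch under several assumptions: `encodeCell` and `encodeTabular` are hypothetical names, the caller has already confirmed that the array qualifies for tabular form, field names are safe unquoted keys, and numbers do not need exponent expansion.

```typescript
type Primitive = string | number | boolean | null;
type Row = { [field: string]: Primitive };

// Hypothetical helper: encode one cell with delimiter-aware quoting (Section 6.2).
function encodeCell(value: Primitive, delimiter: string): string {
  if (value === null) return "null";
  if (typeof value !== "string") return String(value); // exponent expansion omitted
  const mustQuote =
    value === "" ||
    value !== value.trim() ||
    /^(?:true|false|null)$/.test(value) ||
    /^-?\d+(?:\.\d+)?(?:e[+-]?\d+)?$/i.test(value) ||
    /^0\d+$/.test(value) ||
    /[:"\\\[\]{}\n\r\t]/.test(value) ||
    value.indexOf(delimiter) !== -1 ||
    value.charAt(0) === "-";
  if (!mustQuote) return value;
  const escaped = value
    .replace(/\\/g, "\\\\")
    .replace(/"/g, '\\"')
    .replace(/\n/g, "\\n")
    .replace(/\r/g, "\\r")
    .replace(/\t/g, "\\t");
  return '"' + escaped + '"';
}

// Assumes `rows` already passed tabular detection and field names need no quoting.
function encodeTabular(
  key: string,
  rows: Row[],
  delimiter: string = ",",
  indent: string = "  "
): string {
  const first = rows[0];
  if (first === undefined) throw new Error("encodeTabular expects a non-empty array");
  // Field order is the encounter order of the first object's keys.
  const fields = Object.keys(first);
  const delimSymbol = delimiter === "," ? "" : delimiter;
  const header =
    key + "[" + String(rows.length) + delimSymbol + "]{" + fields.join(delimiter) + "}:";
  const lines = rows.map(
    (row) => indent + fields.map((f) => encodeCell(row[f] ?? null, delimiter)).join(delimiter)
  );
  // Lines are joined with LF and no trailing newline is added.
  return [header].concat(lines).join("\n");
}
```

Because rows are read through the first object's field order, objects whose keys appear in a different order still produce aligned columns.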
10.4 Mixed / Non-Uniform Arrays — Expanded List
When tabular requirements are not met:
- Header: key[N<delim?>]:
- Each element is rendered as a list item at one indentation level under the header:
  - Primitive: - <primitive>
  - Primitive array: - [M<delim?>]: v1<delim>…
  - Object: formatted using “objects as list items” (Section 11).
  - Complex arrays (e.g., arrays of arrays with mixed shapes): - key'[M<delim?>]: followed by nested items as appropriate.
## 11. Objects as List Items
For an object appearing as a list item:
- If the object is empty, render a single “-” at the list item indentation level.
- Otherwise, place the first field on the hyphen line using the following rules:
  - If the first field's value is a primitive: - key: value
  - If the first field's value is a primitive array: - key[M<delim?>]: v1<delim>…
  - If the first field's value is an array of objects that qualifies for tabular form:
    - - key[N<delim?>]{fields}:
    - Followed by tabular rows at one more indentation level.
  - If the first field's value is a non-uniform array of objects:
    - - key[N<delim?>]:
    - Followed by list items at one more indentation level (apply these same rules recursively).
  - If the first field's value is a complex array (e.g., arrays of arrays, nested mixed arrays):
    - - key[N<delim?>]:
    - Followed by nested encodings (e.g., “- [M<delim?>]: …”) at one more indentation level.
  - If the first field's value is an object:
    - - key:
    - Nested object fields appear at two more indentation levels (i.e., one deeper than subsequent sibling fields of the same list item).
- Remaining fields of the same object appear at one indentation level under the hyphen line, in encounter order, using normal object field rules.
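
The indentation rules for list-item objects can be sketched as follows. This is a deliberately reduced illustration: it handles only primitive fields and nested objects (array-valued fields are out of scope), it skips key and string quoting, and the names `encodeObjectLines` and `encodeListItem` are hypothetical.

```typescript
type Prim = string | number | boolean | null;
type Obj = { [key: string]: Prim | Obj };

// Hypothetical helper: normal object field rules at a given depth (no quoting applied).
function encodeObjectLines(obj: Obj, depth: number, unit: string): string[] {
  const lines: string[] = [];
  for (const key of Object.keys(obj)) {
    const value = obj[key];
    if (value !== null && typeof value === "object") {
      lines.push(unit.repeat(depth) + key + ":");
      lines.push(...encodeObjectLines(value, depth + 1, unit));
    } else {
      lines.push(unit.repeat(depth) + key + ": " + String(value));
    }
  }
  return lines;
}

// Hypothetical helper: render one object as a list item whose hyphen sits at `depth`.
function encodeListItem(obj: Obj, depth: number, unit: string = "  "): string[] {
  const keys = Object.keys(obj);
  const pad = unit.repeat(depth);
  if (keys.length === 0) return [pad + "-"]; // empty object: lone hyphen

  const firstKey = keys[0] as string;
  const first = obj[firstKey];
  const lines: string[] = [];
  if (first !== null && typeof first === "object") {
    // First field is an object: its content goes two levels below the hyphen.
    lines.push(pad + "- " + firstKey + ":");
    lines.push(...encodeObjectLines(first, depth + 2, unit));
  } else {
    lines.push(pad + "- " + firstKey + ": " + String(first));
  }

  // Remaining fields sit one level below the hyphen line, in encounter order.
  const rest: Obj = {};
  for (const key of keys.slice(1)) rest[key] = obj[key] as Prim | Obj;
  lines.push(...encodeObjectLines(rest, depth + 1, unit));
  return lines;
}
```

With the default two-space unit, `encodeListItem({ meta: { a: 1 }, id: 2 }, 1)` places `a: 1` at six spaces and `id: 2` at four, which is the "+2 versus +1" distinction described above.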
## 12. Delimiters
- Supported delimiters:
  - Comma (default): header omits the delimiter symbol.
  - Tab: header includes the tab character inside brackets and braces (e.g., [N<TAB>], {a<TAB>b}); rows/inline arrays use tabs to separate values.
  - Pipe: header includes “|” inside brackets and braces; rows/inline arrays use “|”.
- Delimiter-aware quoting:
  - Strings containing the active delimiter MUST be quoted across object values, array values, and tabular rows.
  - Strings containing non-active delimiters (e.g., commas when using tab) do not require quoting unless another quoting condition applies.
  - Changing the delimiter does not relax other quoting rules (colon, brackets/braces, leading hyphen, numeric-like, boolean/null-like).
## 13. Length Marker
- When enabled, the length marker “#” MUST appear immediately before the length in every array header, including nested arrays and tabular headers:
  - key[#N<delim?>]: …
  - key[#N<delim?>]{…}:
  - - [#M<delim?>]: …
- Semantics: purely informational to emphasize counts; no change to other parsing or formatting rules.
## 14. Indentation and Whitespace Invariants
- Indentation:
  - The encoder MUST use a consistent number of spaces per level (default 2; configurable).
  - Tabs MUST NOT be used for indentation.
- Spacing:
  - Exactly one space after the colon in key: value lines.
  - Exactly one space after the colon of an array header when inline values follow (non-empty primitive arrays).
- End-of-line:
  - No trailing spaces at the end of any line.
  - No trailing newline at the end of the document.
## 15. Conformance
Conformance classes:
- Encoder:
  - MUST produce output adhering to all normative rules in Sections 2-14.
  - MUST be deterministic with respect to:
    - Object field order (encounter order).
    - Tabular detection (either uniformly tabular or not, given the input).
    - Quoting decisions for given values and active delimiter.
- Validator:
  - SHOULD verify structural conformance (headers, indentation, list markers).
  - SHOULD verify whitespace invariants.
  - SHOULD verify delimiter consistency between headers and rows.
- Parser/Decoder:
  - Out of scope for v1; MAY be implemented. Implementers SHOULD follow the invariants in this spec for robust parsing (e.g., delimiter discovery from headers, length counts as consistency checks).
Options:
- indent (default: 2 spaces)
- delimiter (default: comma; alternatives: tab, pipe)
- lengthMarker (default: disabled)
## 16. Error Handling and Diagnostics
- Inputs that cannot be represented in the data model (Section 2) are normalized (Section 3) before encoding (e.g., NaN → null).
- Tabular fallback:
  - If any tabular condition fails (Section 10.3), encoders MUST use expanded list format (Section 10.4).
- Validators SHOULD report:
  - Trailing spaces, trailing newlines.
  - Headers missing delimiters when non-comma is active.
  - Mismatched row counts vs. declared [N].
  - Values violating delimiter-aware quoting rules.
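
A validator's whitespace checks might look like the sketch below. It is a partial illustration only: the name `lintWhitespace` is hypothetical, and the header-delimiter, row-count, and quoting diagnostics listed above are not attempted because they require structural parsing.

```typescript
// Hypothetical lint helper covering only the whitespace-related diagnostics.
function lintWhitespace(text: string): string[] {
  const problems: string[] = [];
  if (text.length > 0 && text.charAt(text.length - 1) === "\n") {
    problems.push("document ends with a trailing newline");
  }
  const lines = text.split("\n");
  lines.forEach((line, index) => {
    const where = "line " + String(index + 1) + ": ";
    if (/ $/.test(line)) problems.push(where + "trailing space");
    // Tabs are allowed as a delimiter, but never as leading indentation.
    if (/^ *\t/.test(line)) problems.push(where + "tab used for indentation");
    // Raw carriage returns never appear in valid output (they are escaped in strings).
    if (line.indexOf("\r") !== -1) problems.push(where + "carriage return found (LF required)");
  });
  return problems;
}
```

An empty result means only that these particular invariants hold; it does not establish that the document is valid TOON.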
## 17. Security Considerations
- Injection and ambiguity are mitigated by quoting rules:
  - Strings with colon, active delimiter, leading hyphen, control characters, brackets/braces MUST be quoted.
- Encoders SHOULD avoid excessive memory use on large inputs; implement streaming/tabular row emission where feasible.
- Unicode inputs:
  - Encoders SHOULD avoid altering Unicode content beyond required escaping; validators SHOULD accept all valid Unicode in quoted strings and keys (with escapes as required).
## 18. Internationalization
- TOON supports full Unicode in keys and values, subject to quoting and escaping rules.
- Encoders MUST NOT apply locale-dependent formatting for numbers or booleans (e.g., no thousands separators).
- ISO 8601 strings SHOULD be used for date representations when normalizing host Date types.
## 19. Interoperability and Mappings (Informative)
- JSON:
  - TOON is designed for deterministic encoding of JSON-compatible data (after normalization).
  - Arrays of uniform objects map to CSV-like rows; other structures map to YAML-like nested forms.
- CSV:
  - TOON tabular sections generalize CSV with explicit lengths, field lists, and flexible delimiter choice.
- YAML:
  - TOON borrows indentation and list-item patterns but uses fewer quotes and explicit array headers to constrain ambiguity in LLM contexts.
## 20. Media Type and File Extensions (Provisional)
- Suggested media type: text/toon
- Suggested file extension: .toon
- Encoding: UTF-8
- Line endings: LF (U+000A)
## 21. Examples (Informative)
Objects:
```
id: 123
name: Ada
active: true
```
Nested objects:
```
user:
  id: 123
  name: Ada
```
Primitive arrays:
```
tags[3]: admin,ops,dev
```
Arrays of arrays (primitives):
```
pairs[2]:
  - [2]: 1,2
  - [2]: 3,4
```
Tabular arrays:
```
items[2]{sku,qty,price}:
  A1,2,9.99
  B2,1,14.5
```
Mixed arrays:
```
items[3]:
  - 1
  - a: 1
  - text
```
Objects as list items (first field on hyphen line):
```
items[2]:
  - id: 1
    name: First
  - id: 2
    name: Second
    extra: true
```
Nested tabular inside a list item:
```
items[1]:
  - users[2]{id,name}:
    1,Ada
    2,Bob
    status: active
```
Delimiter variations:
```
# Tab delimiter
items[2	]{sku	name	qty	price}:
  A1	Widget	2	9.99
  B2	Gadget	1	14.5
# Pipe delimiter
tags[3|]: reading|gaming|coding
```
Length marker:
```
tags[#3]: reading,gaming,coding
pairs[#2]:
  - [#2]: a,b
  - [#2]: c,d
```
## 22. Reference Algorithms (Informative)
22.1 Tabular Detection
Given an array rows:
- If rows is empty → not tabular (fall back to expanded format).
- Let header = keys of the first row in encounter order; if header is empty → not tabular.
- For each row:
  - If row's key count ≠ header length → not tabular.
  - For each key in header:
    - If key missing in row → not tabular.
    - If row[key] is not a primitive → not tabular.
- Otherwise tabular with header from the first row.
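
The detection steps above translate fairly directly into TypeScript. In this sketch the names `isRecord`, `isPrimitive`, and `detectTabularHeader` are hypothetical; the function returns the header (field list) when the array is tabular and null otherwise.

```typescript
function isRecord(v: unknown): v is { [key: string]: unknown } {
  return typeof v === "object" && v !== null && !Array.isArray(v);
}

function isPrimitive(v: unknown): boolean {
  return (
    v === null ||
    typeof v === "string" ||
    typeof v === "number" ||
    typeof v === "boolean"
  );
}

// Hypothetical helper: returns the field list if `rows` is tabular, else null.
function detectTabularHeader(rows: unknown[]): string[] | null {
  if (rows.length === 0) return null;
  const first = rows[0];
  if (!isRecord(first)) return null;
  const header = Object.keys(first); // encounter order of the first row
  if (header.length === 0) return null;

  for (const row of rows) {
    if (!isRecord(row)) return null;
    if (Object.keys(row).length !== header.length) return null;
    for (const key of header) {
      if (!Object.prototype.hasOwnProperty.call(row, key)) return null;
      if (!isPrimitive(row[key])) return null;
    }
  }
  return header;
}
```

Checking the key count together with the presence of every header key is what allows per-object key order to vary while still rejecting rows that carry extra or missing fields.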
22.2 Safe-Unquoted String Decision
Given a string s and active delimiter d:
- If s is empty or s !== s.trim() → quote.
- If s ∈ {true,false,null} → quote.
- If s is numeric-like (regexes in Section 6.2) → quote.
- If s contains ":" or "\"" or "\\" → quote.
- If s contains any of "[", "]", "{", "}" → quote.
- If s contains any of "\n", "\r", "\t" → quote.
- If s contains the active delimiter d → quote.
- If s starts with "-" → quote.
- Else unquoted.
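
The decision list above can be written as a single predicate. The sketch below follows the steps in order; the name `needsQuoting` is hypothetical, and the delimiter is passed as the literal character (",", a tab, or "|").

```typescript
// Hypothetical helper: true when a string value must be quoted for the active delimiter.
function needsQuoting(s: string, delimiter: string = ","): boolean {
  if (s === "" || s !== s.trim()) return true; // empty, or leading/trailing whitespace
  if (s === "true" || s === "false" || s === "null") return true;
  if (/^-?\d+(?:\.\d+)?(?:e[+-]?\d+)?$/i.test(s)) return true; // numeric-like
  if (/^0\d+$/.test(s)) return true; // leading-zero decimals such as "05"
  if (/[:"\\]/.test(s)) return true; // colon, double quote, backslash
  if (/[\[\]{}]/.test(s)) return true; // brackets and braces
  if (/[\n\r\t]/.test(s)) return true; // control characters
  if (s.indexOf(delimiter) !== -1) return true; // active delimiter
  if (s.charAt(0) === "-") return true; // could be read as a list marker
  return false;
}
```

For instance, `needsQuoting("a,b")` is true with the default comma delimiter, while `needsQuoting("a,b", "|")` is false because the comma is no longer the active delimiter.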
22.3 Header Formatting
- Start with optional key (encoded as per key rules).
- Append “[<marker?>N<delim?>]”, where:
  - <marker?> is “#” if enabled.
  - <delim?> is absent for comma, or is the delimiter symbol for tab/pipe.
- If tabular, append “{field1<delim>field2}” where field names are key-encoded and joined by the active delimiter.
- Append “:”.
- For non-empty primitive arrays on a single line, append a space and the joined values (each primitive-encoded with delimiter-aware quoting), joined by the active delimiter.
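
The header-formatting steps above could be assembled as in this sketch. The `HeaderInput` shape and the name `formatHeader` are hypothetical, and the sketch assumes the key, field names, and inline values have already been encoded by the caller.

```typescript
type Delimiter = "," | "\t" | "|";

interface HeaderInput {
  key?: string; // already key-encoded; omitted for root arrays
  length: number;
  delimiter?: Delimiter; // defaults to comma
  lengthMarker?: boolean; // when true, "#" precedes the length
  fields?: string[]; // already key-encoded; present only for tabular arrays
  inlineValues?: string[]; // already primitive-encoded; inline primitive arrays only
}

// Hypothetical helper: build an array header line following the steps above.
function formatHeader(input: HeaderInput): string {
  const delimiter = input.delimiter ?? ",";
  const marker = input.lengthMarker ? "#" : "";
  // The comma delimiter is implicit; tab and pipe are written inside the brackets.
  const delimSymbol = delimiter === "," ? "" : delimiter;

  let header =
    (input.key ?? "") + "[" + marker + String(input.length) + delimSymbol + "]";
  if (input.fields !== undefined) {
    header += "{" + input.fields.join(delimiter) + "}";
  }
  header += ":";
  if (input.inlineValues !== undefined && input.inlineValues.length > 0) {
    // Exactly one space separates the colon from the first inline value.
    header += " " + input.inlineValues.join(delimiter);
  }
  return header;
}
```

An empty array produces a header with nothing after the colon, which keeps the no-trailing-space invariant intact.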
## 23. ABNF Sketch (Informative)
This sketch omits full Unicode and escaping details; it illustrates structure only.
```
document = *(line LF) [line]
line = indent (object-line / array-header / list-item / row)
indent = *SP ; multiple of indent unit (default 2 SP)
object-line = key ":" [SP primitive]
array-header = [key] "[" [marker] length [delimsym] "]" [fields] ":" [SP inline-values]
marker = "#"
length = 1*DIGIT
delimsym = "|" / HTAB
fields = "{" fieldname *(delim fieldname) "}"
fieldname = key
inline-values = primitive *(delim primitive)
delim = delimsym / "," ; actual active delimiter for the array
list-item = "- " ( primitive
                 / inline-array
                 / object-head
                 / nested-array-head )
inline-array = "[" [marker] length [delimsym] "]" ":" [SP inline-values]
object-head = key ":" ; followed by nested object at deeper indent
nested-array-head = key "[" [marker] length [delimsym] "]" ":" [LF] ; followed by nested items
row = primitive *(delim primitive)
key = unquoted-key / quoted
unquoted-key = ( ALPHA / "_" ) *( ALPHA / DIGIT / "_" / "." )
quoted = DQUOTE *(escaped-char / safe-char) DQUOTE
primitive = null / boolean / number / string
null = "null"
boolean = "true" / "false"
number = 1*DIGIT / "-" 1*DIGIT / 1*DIGIT "." 1*DIGIT / ...
string = quoted / safe-unquoted-string
```
Notes:
- Safe-unquoted-string constraints are defined in Section 6.2.
- Actual tokenization relies on the declared header delimiter and quoting rules.
## 24. Test Suite and Compliance (Informative)
- Implementations are encouraged to validate against a comprehensive test suite covering:
  - Primitive encoding, quoting, control-character escaping.
  - Object key encoding and order preservation.
  - Primitive arrays (inline), empty arrays.
  - Arrays of arrays (expanded), mixed-length and empty inner arrays.
  - Tabular detection and encoding, including delimiter variations.
  - Mixed arrays and objects-as-list-items behavior, including nested arrays and objects.
  - Whitespace invariants (no trailing spaces/newline).
  - Normalization (BigInt, Date, undefined, NaN/Infinity, functions, symbols).
The provided reference tests in the repository mirror these conditions and SHOULD be used to ensure conformance.
## 25. Rationale (Informative)
- Token efficiency: Removing repeated keys and braces for uniform arrays markedly reduces tokens vs. JSON.
- LLM-friendly guardrails: Declared lengths and field lists help models parse and validate structure.
- Determinism: Strict quoting/spacing/ordering yields outputs that are easy to compare, cache, and validate.
- Delimiters: Tab and pipe often reduce quoting needs (e.g., commas in natural language), and can tokenize more efficiently.
## 26. Versioning and Extensibility
- Backward-compatible evolutions SHOULD preserve current headers, quoting rules, and indentation semantics.
- Reserved/structural characters (colon, brackets, braces, hyphen) MUST retain current meanings.
- Future work (non-normative): decoding/parsing spec, schemas, comments/annotations, additional delimiter profiles.
## 27. Acknowledgments and License
- Credits: Author and contributors; ports in other languages (Elixir, PHP, Python, Ruby, Java, .NET, Swift).
- License: MIT (see repository for details).
---
Appendix: Cross-check With Reference Behavior (Informative)
- All normative behaviors specified herein are implemented and validated by the reference encoder and its test suite, including:
  - Safe-unquoted string rules and delimiter-aware quoting.
  - Object and tabular header formation using the active delimiter (comma implicit; tab/pipe explicit).
  - Length marker propagation to nested arrays.
  - Tabular detection requiring uniform keys and primitive-only values.
  - Objects-as-list-items formatting (first field on hyphen line, subsequent fields at +1 indent; nested object content at +2).
  - Whitespace invariants and no trailing newline.