docs: update TOON specs to v1.1 with decoding behavior

Johann Schopplich
2025-10-29 08:27:44 +01:00
parent 92b51db825
commit 61fb751540

361
SPEC.md

@@ -1,31 +1,34 @@
# TOON Specification (v1) # TOON Specification (v1.1)
Status: Draft, normative where indicated. This version specifies encoding (producer behavior). A formal decoding spec is out of scope for v1. Status: Draft, normative where indicated. This version specifies both encoding (producer behavior) and decoding (parser behavior).
- Normative statements use RFC 2119/8174 keywords: MUST, MUST NOT, SHOULD, SHOULD NOT, MAY. - Normative statements use RFC 2119/8174 keywords: MUST, MUST NOT, SHOULD, SHOULD NOT, MAY.
- This spec targets implementers of encoders/validators, tool authors, and practitioners embedding TOON in LLM prompts. - This spec targets implementers of encoders/decoders/validators, tool authors, and practitioners embedding TOON in LLM prompts.
Changelog: Changelog:
- v1.1: Made decoding behavior normative; added decoding semantics, strict-mode validation rules, delimiter-aware parsing, and reference decoding algorithms. Added decoder options (indent, strict).
- v1: Initial encoding + normalization + conformance rules based on reference encoder and test suite. - v1: Initial encoding + normalization + conformance rules based on reference encoder and test suite.
Scope: Scope:
- This document defines the data model, normalization (for the reference JavaScript/TypeScript encoder), concrete syntax, and conformance requirements for producing TOON. Decoding is informative only and not standardized in v1. - This document defines the data model, encoding normalization (for the reference JavaScript/TypeScript encoder), concrete syntax, decoding semantics, and conformance requirements for producing and consuming TOON.
## 1. Terminology and Conventions ## 1. Terminology and Conventions
- TOON document: A sequence of UTF-8 text lines formatted according to this spec. - TOON document: A sequence of UTF-8 text lines formatted according to this spec.
- Line: A sequence of non-newline characters terminated by LF (U+000A) in serialized form. TOON output MUST use LF line endings. - Line: A sequence of non-newline characters terminated by LF (U+000A) in serialized form. Encoders MUST use LF line endings.
- Indentation level (depth): The number of indentation units (spaces) applied to a line. Depth 0 lines have no leading indentation. - Indentation level (depth): The number of indentation units (spaces) applied to a line. Depth 0 lines have no leading indentation.
- Indentation unit: A fixed number of spaces per level (default 2). Tabs MUST NOT be used for indentation. - Indentation unit: A fixed number of spaces per level (default 2). Tabs MUST NOT be used for indentation.
- Header: The bracketed declaration for arrays, optionally followed by a field list, and terminating with a colon: e.g., key[3]: or items[2]{a,b}:. - Header: The bracketed declaration for arrays, optionally followed by a field list, and terminating with a colon: e.g., key[3]: or items[2]{a,b}:.
- Field list: The brace-enclosed, delimiter-separated list of field names for tabular arrays: {f1<delim>f2}. - Field list: The brace-enclosed, delimiter-separated list of field names for tabular arrays: {f1<delim>f2}.
- List item: A line beginning with a hyphen and a space at a given depth (- ), representing an element in an expanded array form. - List item: A line beginning with a hyphen and a space at a given depth ("- "), representing an element in an expanded array form.
- Delimiter: The character used to separate array/tabular values: comma (default), tab, or pipe. - Delimiter: The character used to separate array/tabular values: comma (default), tab, or pipe.
- Length marker: An optional “#” prefix for array lengths in headers, e.g., [#3]. - Active delimiter: The delimiter declared by the closest array header in scope. Used to split inline primitive arrays and tabular rows under that header.
- Length marker: An optional "#" prefix for array lengths in headers, e.g., [#3]. Decoders MUST accept and ignore the marker semantically.
- Primitive: string, number, boolean, or null. - Primitive: string, number, boolean, or null.
- Object: Mapping from string keys to JsonValue. - Object: Mapping from string keys to JsonValue.
- Array: Ordered sequence of JsonValue. - Array: Ordered sequence of JsonValue.
- JsonValue: Primitive | Object | Array. - JsonValue: Primitive | Object | Array.
- Strict mode: Decoder mode that enforces array lengths, tabular row counts, and delimiter consistency; also rejects invalid escapes and missing colons (default: true).
Notation: Notation:
- Regular expressions appear in slash-delimited form. - Regular expressions appear in slash-delimited form.
@@ -40,7 +43,7 @@ Notation:
- Ordering: - Ordering:
- Array order is preserved. - Array order is preserved.
- Object key order is preserved as encountered by the encoder. - Object key order is preserved as encountered by the encoder.
- Numeric canonicalization: - Numeric canonicalization (encoding):
- -0 MUST be normalized to 0. - -0 MUST be normalized to 0.
- Finite numbers MUST be rendered without scientific notation (e.g., 1e6 → 1000000, 1e-6 → 0.000001), as per host-language number-to-string rules that avoid exponent notation in these cases. - Finite numbers MUST be rendered without scientific notation (e.g., 1e6 → 1000000, 1e-6 → 0.000001), as per host-language number-to-string rules that avoid exponent notation in these cases.
- Null semantics: null is represented as the literal null. - Null semantics: null is represented as the literal null.
@@ -63,6 +66,31 @@ The reference encoder normalizes non-JSON values to the data model as follows:
Note: Other language ports SHOULD apply analogous normalization strategies consistent with this spec's data model and encoding rules. Note: Other language ports SHOULD apply analogous normalization strategies consistent with this spec's data model and encoding rules.
## 3A. Host-Language Interpretation (Reference Decoder)
Decoders map text tokens to host values as follows:
- Quoted tokens (strings and keys):
- MUST be unescaped using only these escape sequences:
- "\\" → backslash
- "\"" → double quote
- "\n" → newline
- "\r" → carriage return
- "\t" → tab
- Any other escape (e.g., "\x", trailing backslash) MUST be rejected.
- Unterminated quotes MUST be rejected.
- Quoted primitives remain strings even if they lexically resemble numbers, booleans, or null (e.g., "true" → "true").
- Unquoted value tokens:
- The exact tokens true, false, null map to booleans/null.
- Numeric parsing:
- MUST accept standard decimal and exponent forms (e.g., 42, -3.14, 1e-6).
- MUST reject leading-zero decimals (e.g., "05", "0001"); such tokens MUST be treated as strings.
- Only finite numbers are represented in TOON text; non-finite are not expected from conforming encoders.
- Otherwise, the token is a string.
- Keys:
- Decoded as strings. Quoted keys MUST be unescaped as above.
- Missing colon after a (quoted or unquoted) key MUST be treated as an error.
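The token interpretation rules of Section 3A can be sketched as follows. This is a minimal illustration, not the reference decoder: the names `unescapeQuoted` and `parseValueToken` are hypothetical, the numeric pattern is borrowed from Section 6.2, and the leading-zero check is assumed to cover only the `/^0\d+$/` form given there.

```typescript
type Primitive = string | number | boolean | null;

// The five escapes permitted by Section 3A, keyed by the character after the backslash.
const ESCAPES = new Map<string, string>([
  ["\\", "\\"],
  ['"', '"'],
  ["n", "\n"],
  ["r", "\r"],
  ["t", "\t"],
]);

// Patterns taken from Section 6.2 (assumed to apply symmetrically when decoding).
const NUMERIC = /^-?\d+(?:\.\d+)?(?:e[+-]?\d+)?$/i;
const LEADING_ZERO = /^0\d+$/;

// Unescape the body of a quoted token (surrounding quotes already removed).
function unescapeQuoted(body: string): string {
  let out = "";
  for (let i = 0; i < body.length; i++) {
    const ch = body.charAt(i);
    if (ch === '"') throw new SyntaxError("Unescaped quote inside quoted token");
    if (ch !== "\\") {
      out += ch;
      continue;
    }
    // charAt past the end yields "", so a trailing backslash is rejected too.
    const mapped = ESCAPES.get(body.charAt(i + 1));
    if (mapped === undefined) throw new SyntaxError(`Invalid escape at index ${i}`);
    out += mapped;
    i++;
  }
  return out;
}

// Map one value token to a host value.
function parseValueToken(raw: string): Primitive {
  const token = raw.trim();
  if (token.startsWith('"')) {
    if (token.length < 2 || !token.endsWith('"')) {
      throw new SyntaxError("Unterminated quoted string");
    }
    return unescapeQuoted(token.slice(1, -1)); // quoted tokens always stay strings
  }
  if (token === "true") return true;
  if (token === "false") return false;
  if (token === "null") return null;
  if (NUMERIC.test(token) && !LEADING_ZERO.test(token)) return Number(token);
  return token; // anything else, including "05", is a string
}
```

For example, `parseValueToken("05")` yields the string `"05"`, while `parseValueToken('"true"')` yields the string `"true"` rather than a boolean.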
## 4. Concrete Syntax Overview ## 4. Concrete Syntax Overview
TOON is a deterministic, line-oriented, indentation-based notation: TOON is a deterministic, line-oriented, indentation-based notation:
@@ -72,14 +100,18 @@ TOON is a deterministic, line-oriented, indentation-based notation:
- key: alone for nested or empty objects, with nested fields indented one level. - key: alone for nested or empty objects, with nested fields indented one level.
- Arrays: - Arrays:
- Primitive arrays are inline: key[N<delim?>]: v1<delim>v2. - Primitive arrays are inline: key[N<delim?>]: v1<delim>v2.
- Arrays of arrays (primitives): expanded list under a header: key[N<delim?>]: then - [M<delim?>]: … lines. - Arrays of arrays (primitives): expanded list under a header: key[N<delim?>]: then "- [M<delim?>]: …" lines.
- Arrays of objects: - Arrays of objects:
- Tabular form when uniform and primitive-only: key[N<delim?>]{f1<delim>f2}: then one row per line. - Tabular form when uniform and primitive-only: key[N<delim?>]{f1<delim>f2}: then one row per line.
- Otherwise expanded list: key[N<delim?>]: with - … items, following object-as-list-item rules. - Otherwise expanded list: key[N<delim?>]: with "- …" items, following object-as-list-item rules.
- Whitespace invariants: - Whitespace invariants (encoding):
- No trailing spaces at the end of any line. - No trailing spaces at the end of any line.
- No trailing newline at the end of the document. - No trailing newline at the end of the document.
- One space after : in key: value lines and after array headers when followed by inline values (non-empty primitive arrays). - One space after ": " in key: value lines and after array headers when followed by inline values (non-empty primitive arrays).
- Decoder discovery:
- If the first non-empty depth-0 line is a valid root array header ("[ … ]:"), decode a root array.
- If the document has a single line that is neither a valid array header nor a key-value line, decode it as a single primitive.
- Otherwise, decode an object.
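The discovery rules above could be approximated as below. This is a sketch under stated assumptions: `detectRootForm`, `ROOT_HEADER`, and `KEY_LINE` are hypothetical names, the two regular expressions are simplified stand-ins for full header and key-value parsing, and empty input throws in line with the SHOULD in Section 16.

```typescript
type RootForm = "array" | "primitive" | "object";

// Simplified root array header: [N], optional "#", optional tab/pipe, optional {fields}, then ":".
const ROOT_HEADER = /^\[#?\d+[\t|]?\](?:\{.*\})?:/;

// Simplified key-value line: unquoted or quoted key, optional array header, then ":".
const KEY_LINE = /^(?:[A-Za-z_][\w.]*|"(?:[^"\\]|\\.)*")(?:\[[^\]]*\](?:\{.*\})?)?:/;

function detectRootForm(text: string): RootForm {
  const lines = text.split("\n").filter((line) => line.trim() !== "");
  const first = lines[0];
  if (first === undefined) throw new SyntaxError("Empty input");
  // ROOT_HEADER is anchored at column 0, so an indented header does not qualify.
  if (ROOT_HEADER.test(first)) return "array";
  if (lines.length === 1 && !KEY_LINE.test(first)) return "primitive";
  return "object";
}
```

With these assumptions, `detectRootForm("[3]: 1,2,3")` reports `"array"`, `detectRootForm("hello")` reports `"primitive"`, and `detectRootForm("items[2]: a,b")` reports `"object"` because the header carries a key.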
## 5. Tokens and Lexical Elements ## 5. Tokens and Lexical Elements
@@ -88,13 +120,16 @@ TOON is a deterministic, line-oriented, indentation-based notation:
- Comma (,) is the default. - Comma (,) is the default.
- Tab (\t) and pipe (|) are supported alternatives. - Tab (\t) and pipe (|) are supported alternatives.
- The active delimiter MAY appear inside array headers (see Section 7). - The active delimiter MAY appear inside array headers (see Section 7).
- Indentation unit: default 2 spaces per level; configurable at encode-time. - Indentation unit: default 2 spaces per level; configurable at encode-time and decode-time. Tabs MUST NOT be used for indentation.
- List item markers: - (hyphen + single space) at the appropriate indentation level. An empty object as a list item is represented as a lone hyphen (“-”). - List item markers: "- " (hyphen + single space) at the appropriate indentation level. An empty object as a list item is represented as a lone hyphen ("-").
- Character set: UTF-8. Tabs MUST NOT appear as indentation but MAY appear as the chosen delimiter or inside quoted strings via escapes. - Character set: UTF-8. Tabs MUST NOT appear as indentation but MAY appear as the chosen delimiter or inside quoted strings via escapes.
- Decoding constraints:
- Quoted strings and keys MUST use only the five escapes listed in Section 3A; others MUST error.
- Decoders MUST locate the colon that follows the header (after any [..] and optional {..}) for arrays; missing colon MUST error.
## 6. String and Key Encoding ## 6. Strings and Keys (Encoding and Decoding)
6.1 Escaping 6.1 Escaping (Encoding and Decoding)
The following characters in quoted strings and keys MUST be escaped: The following characters in quoted strings and keys MUST be escaped:
- Backslash: "\\" → "\\\\" - Backslash: "\\" → "\\\\"
@@ -103,12 +138,14 @@ The following characters in quoted strings and keys MUST be escaped:
- Carriage return: U+000D → "\\r" - Carriage return: U+000D → "\\r"
- Tab: U+0009 → "\\t" - Tab: U+0009 → "\\t"
6.2 Quoting Rules for String Values Decoders MUST reject any other escape sequence and unterminated strings.
6.2 Quoting Rules for String Values (Encoding)
A string value MUST be quoted (with escaping as above) if any of the following is true: A string value MUST be quoted (with escaping as above) if any of the following is true:
- It is empty (""). - It is empty ("").
- It has leading or trailing whitespace. - It has leading or trailing whitespace.
- It equals true, false, or null (case-sensitive matches of these literals). - It equals true, false, or null (case-sensitive).
- It is numeric-like: - It is numeric-like:
- Matches /^-?\d+(?:\.\d+)?(?:e[+-]?\d+)?$/i (e.g., "42", "-3.14", "1e-6"). - Matches /^-?\d+(?:\.\d+)?(?:e[+-]?\d+)?$/i (e.g., "42", "-3.14", "1e-6").
- Or matches /^0\d+$/ (leading-zero decimals such as "05"). - Or matches /^0\d+$/ (leading-zero decimals such as "05").
@@ -120,7 +157,7 @@ A string value MUST be quoted (with escaping as above) if any of the following i
If none of the conditions above apply, the string MAY be emitted without quotes. Unicode, emoji, and strings with internal (non-leading/trailing) spaces are safe unquoted provided they do not violate the conditions. If none of the conditions above apply, the string MAY be emitted without quotes. Unicode, emoji, and strings with internal (non-leading/trailing) spaces are safe unquoted provided they do not violate the conditions.
6.3 Key Encoding 6.3 Key Encoding (Encoding)
Object keys and tabular field names: Object keys and tabular field names:
- MAY be unquoted only if they match the pattern: ^[A-Za-z_][\w.]*$. - MAY be unquoted only if they match the pattern: ^[A-Za-z_][\w.]*$.
@@ -128,6 +165,15 @@ Object keys and tabular field names:
Note: Keys containing spaces, punctuation (e.g., colon, pipe, hyphen), or starting with a digit MUST be quoted. Note: Keys containing spaces, punctuation (e.g., colon, pipe, hyphen), or starting with a digit MUST be quoted.
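As an illustration of the key rule, the following sketch emits a key unquoted only when it matches the pattern above and otherwise quotes it with the escapes from Section 6.1. The function name `encodeKey` is hypothetical and not taken from the reference encoder.

```typescript
// Pattern from Section 6.3; \w in JavaScript regular expressions is ASCII-only.
const UNQUOTED_KEY = /^[A-Za-z_][\w.]*$/;

function encodeKey(key: string): string {
  if (UNQUOTED_KEY.test(key)) return key;
  // Backslash must be escaped first so later replacements are not double-escaped.
  const escaped = key
    .replace(/\\/g, "\\\\")
    .replace(/"/g, '\\"')
    .replace(/\n/g, "\\n")
    .replace(/\r/g, "\\r")
    .replace(/\t/g, "\\t");
  return `"${escaped}"`;
}
```

Under this sketch `user.name` stays unquoted, while `first name`, `a:b`, and `2fa` are all emitted in double quotes.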
6.4 Decoding Rules for Strings and Keys (Decoding)
- Quoted strings and keys MUST be unescaped using only the five escapes in 6.1. Any other escape MUST error. Quoted primitives remain strings.
- Unquoted values:
- true/false/null → boolean/null
- Numeric tokens → numbers (with the leading-zero rule from 3A)
- Otherwise → strings
- Keys (quoted or unquoted) MUST be followed by ":"; missing colon MUST error.
## 7. Array Headers ## 7. Array Headers
General header syntax: General header syntax:
@@ -138,7 +184,7 @@ General header syntax:
Where: Where:
- N is the array length (non-negative integer). - N is the array length (non-negative integer).
- <marker?> is optional “#” when the length marker option is enabled (Section 10). - <marker?> is optional "#" when the length marker option is enabled (Section 13).
- <delim?> is: - <delim?> is:
- Absent when the delimiter is comma. - Absent when the delimiter is comma.
- Present and equal to the active delimiter when the delimiter is tab or pipe. - Present and equal to the active delimiter when the delimiter is tab or pipe.
@@ -149,6 +195,13 @@ Spacing:
- When an inline list of values follows a header on the same line (non-empty primitive arrays), there MUST be exactly one space after the colon before the first value. - When an inline list of values follows a header on the same line (non-empty primitive arrays), there MUST be exactly one space after the colon before the first value.
- Otherwise, no trailing space follows the colon on the header line. - Otherwise, no trailing space follows the colon on the header line.
Decoding requirements:
- The bracket segment "[ … ]" MUST parse as a non-negative integer length. If present, a trailing tab or pipe inside the brackets selects the active delimiter for the header; otherwise comma is the active delimiter.
- An optional "#" MAY precede the length and MUST be ignored semantically.
- If a brace-enclosed fields segment "{ … }" is present, field names MUST be parsed using the active delimiter, and quoted field names MUST be unescaped per Section 6.1.
- A colon MUST follow the bracket (and fields) segment; missing colon MUST error.
- Inline values, if present on the same line, are split using the header's active delimiter.
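The header decoding requirements might be sketched as a single regular-expression pass. The names `ArrayHeader` and `parseArrayHeader` are hypothetical, and the sketch makes two simplifying assumptions: keys are unquoted, and field names are unquoted and do not contain `}`. A full decoder would also unescape quoted keys and field names per Section 6.1.

```typescript
type Delimiter = "," | "\t" | "|";

interface ArrayHeader {
  key: string | null;      // null for root arrays and list-item arrays
  length: number;          // declared N
  delimiter: Delimiter;    // active delimiter selected by the header
  fields: string[] | null; // tabular field names, if a {...} segment is present
  rest: string;            // inline text after the colon, if any
}

function parseArrayHeader(line: string): ArrayHeader {
  const m = /^([A-Za-z_][\w.]*)?\[#?(\d+)([\t|]?)\](?:\{([^}]*)\})?(.*)$/.exec(line);
  if (m === null) throw new SyntaxError("Not an array header");
  const [, key, len, delim, fieldList, tail] = m;

  // A colon MUST follow the bracket (and fields) segment.
  if (tail === undefined || !tail.startsWith(":")) {
    throw new SyntaxError("Missing colon after array header");
  }

  // A trailing tab or pipe inside the brackets selects the delimiter; otherwise comma.
  let delimiter: Delimiter = ",";
  if (delim === "\t") delimiter = "\t";
  else if (delim === "|") delimiter = "|";

  return {
    key: key ?? null,
    length: Number(len), // the optional "#" was matched and then ignored
    delimiter,
    fields: fieldList === undefined ? null : fieldList.split(delimiter),
    rest: tail.slice(1).replace(/^ /, ""),
  };
}
```

For instance, `parseArrayHeader("tags[#3|]: a|b|c")` reports length 3, delimiter `|`, no fields, and `a|b|c` as the inline remainder.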
## 8. Primitive Encoding ## 8. Primitive Encoding
- null: literal null. - null: literal null.
@@ -158,106 +211,147 @@ Spacing:
- Non-finite (NaN, ±Infinity): treated as null via normalization (Section 3). - Non-finite (NaN, ±Infinity): treated as null via normalization (Section 3).
- string: encoded per Section 6 with delimiter-aware quoting. - string: encoded per Section 6 with delimiter-aware quoting.
Decoding note:
- Primitive tokens are interpreted per Section 3A (quoted → string; unquoted → boolean/null/number/string with leading-zero rule).
## 9. Object Syntax ## 9. Object Syntax
- Primitive fields: key: value (single space after colon). - Encoding:
- Nested or empty objects: key: on its own line; if non-empty, nested fields appear at one more indentation level. - Primitive fields: key: value (single space after colon).
- Key order: Implementations MUST preserve the encounter order when emitting fields. - Nested or empty objects: key: on its own line; if non-empty, nested fields appear at one more indentation level.
- An empty object at the root results in an empty document (no lines). - Key order: Implementations MUST preserve the encounter order when emitting fields.
- An empty object at the root results in an empty document (no lines).
- Decoding:
- A line "key:" with nothing after the colon at depth d opens an object; subsequent lines at depth > d belong to that object until the depth decreases to ≤ d.
- Lines with "key: value" at the same depth are sibling fields.
- Missing colon after a key (quoted or unquoted) MUST error.
- Quoted keys MUST be followed immediately by ":"; missing colon MUST error.
## 10. Arrays ## 10. Arrays
10.1 Primitive Arrays (Inline) 10.1 Primitive Arrays (Inline)
- Non-empty arrays: key[N<delim?>]: v1<delim>v2<delim>… where each vi is encoded as a primitive (Section 8) with delimiter-aware quoting (Section 6). - Encoding:
- Empty arrays: key[0<delim?>]: (no values following). - Non-empty arrays: key[N<delim?>]: v1<delim>v2<delim>… where each vi is encoded as a primitive (Section 8) with delimiter-aware quoting (Section 6).
- Root arrays use the same rules without a key: [N<delim?>]: v1<delim>… - Empty arrays: key[0<delim?>]: (no values following).
- Root arrays use the same rules without a key: [N<delim?>]: v1<delim>…
- Decoding:
- Inline arrays are split using the active delimiter declared by the header; non-active delimiters MUST NOT split values.
- In strict mode, the number of decoded values MUST equal N; otherwise error.
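A delimiter-aware split with the strict-mode count check could look like the sketch below. `splitDelimited` and `splitInlineArray` are hypothetical names; the sketch assumes that delimiters inside double quotes never split and that tokens are returned still quoted, leaving unescaping to a later step.

```typescript
// Split on the active delimiter only, ignoring any delimiter inside quotes.
function splitDelimited(text: string, delimiter: string): string[] {
  const tokens: string[] = [];
  let current = "";
  let inQuotes = false;
  for (let i = 0; i < text.length; i++) {
    const ch = text.charAt(i);
    if (inQuotes && ch === "\\") {
      // Keep the escape pair intact so an escaped quote does not end the string.
      current += ch + text.charAt(i + 1);
      i++;
    } else if (ch === '"') {
      inQuotes = !inQuotes;
      current += ch;
    } else if (ch === delimiter && !inQuotes) {
      tokens.push(current.trim());
      current = "";
    } else {
      current += ch;
    }
  }
  if (inQuotes) throw new SyntaxError("Unterminated quoted string");
  tokens.push(current.trim());
  return tokens;
}

// Split the inline part of a primitive array and enforce the declared length.
function splitInlineArray(
  text: string,
  delimiter: string,
  declared: number,
  strict = true,
): string[] {
  const tokens = text === "" ? [] : splitDelimited(text, delimiter);
  if (strict && tokens.length !== declared) {
    throw new RangeError(`Expected ${declared} values, got ${tokens.length}`);
  }
  return tokens;
}
```

Note that with `|` as the active delimiter, the text `a,b|c` splits into two values (`a,b` and `c`), since the comma is not active under that header.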
10.2 Arrays of Arrays (Primitives Only) — Expanded List 10.2 Arrays of Arrays (Primitives Only) — Expanded List
- Parent header: key[N<delim?>]: on its own line. - Encoding:
- Each inner primitive array is a list item: - Parent header: key[N<delim?>]: on its own line.
- Each inner primitive array is a list item:
- - [M<delim?>]: v1<delim>v2<delim>… - - [M<delim?>]: v1<delim>v2<delim>…
- Empty inner arrays: - [0<delim?>]: - Empty inner arrays: - [0<delim?>]:
- Root arrays of arrays use [N<delim?>]: as the parent header with the same rules. - Decoding:
- Items appear at one deeper depth, each starting with "- " and an inner array header "[M<delim?>]: …".
- Inner arrays are split using their own active delimiter; in strict mode, counts MUST match M.
- In strict mode, the number of list items MUST equal outer N.
10.3 Arrays of Objects — Tabular Form 10.3 Arrays of Objects — Tabular Form
Tabular detection (MUST hold for all rows): Tabular detection (encoding; MUST hold for all rows):
- Every element is an object. - Every element is an object.
- All objects have the same set of keys (order per object MAY vary). - All objects have the same set of keys (order per object MAY vary).
- All values across these keys are primitives (no nested arrays/objects). - All values across these keys are primitives (no nested arrays/objects).
When satisfied: When satisfied (encoding):
- Header: key[N<delim?>]{f1<delim>f2<delim>…}: where the field order is the encounter order of the first object's keys. - Header: key[N<delim?>]{f1<delim>f2<delim>…}: where the field order is the encounter order of the first object's keys.
- Field names encoded as keys (Section 6.3), delimiter-aware. - Field names encoded as keys (Section 6.3), delimiter-aware.
- Rows: one line per object at one indentation level under the header, values joined by the active delimiter. Each value encoded as a primitive (Section 8) with delimiter-aware quoting (Section 6). - Rows: one line per object at one indentation level under the header, values joined by the active delimiter. Each value encoded as a primitive (Section 8) with delimiter-aware quoting (Section 6).
- Root tabular arrays omit the key: [N<delim?>]{…}: then rows. - Root tabular arrays omit the key: [N<delim?>]{…}: then rows.
Decoding:
- A tabular header declares the active delimiter and the ordered field list.
- Rows appear at one deeper depth as value lines separated by the active delimiter.
- Each row's value count MUST equal the field count in strict mode; otherwise error.
- The number of rows MUST equal N in strict mode; otherwise error.
- Disambiguation at row depth:
- If a line has no colon → it is a data row.
- If a line has both a colon and the active delimiter, compare first occurrences:
- Delimiter before colon → row.
- Colon before delimiter → key-value line (end of rows).
- If a line has a colon but no active delimiter → key-value line (end of rows).
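The row disambiguation rules translate into a small classifier. In this sketch `classifyTabularLine` and `firstUnquotedIndex` are hypothetical names, and skipping colons and delimiters that appear inside quoted tokens is an assumption; the rules above speak only of "first occurrences".

```typescript
type RowKind = "row" | "key-value";

// Index of the first occurrence of target outside double quotes, or -1.
function firstUnquotedIndex(line: string, target: string): number {
  let inQuotes = false;
  for (let i = 0; i < line.length; i++) {
    const ch = line.charAt(i);
    if (inQuotes && ch === "\\") {
      i++; // skip the escaped character
      continue;
    }
    if (ch === '"') {
      inQuotes = !inQuotes;
      continue;
    }
    if (!inQuotes && ch === target) return i;
  }
  return -1;
}

// Decide whether a line at row depth is a data row or ends the rows.
function classifyTabularLine(line: string, delimiter: string): RowKind {
  const colon = firstUnquotedIndex(line, ":");
  const delim = firstUnquotedIndex(line, delimiter);
  if (colon === -1) return "row";           // no colon: data row
  if (delim === -1) return "key-value";     // colon but no delimiter
  return delim < colon ? "row" : "key-value"; // compare first occurrences
}
```

So `1,Alice,admin` is a row, `total: 3` is a key-value line, and `note: a,b` is a key-value line because its colon precedes the first comma.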
10.4 Mixed / Non-Uniform Arrays — Expanded List 10.4 Mixed / Non-Uniform Arrays — Expanded List
When tabular requirements are not met: When tabular requirements are not met (encoding):
- Header: key[N<delim?>]: - Header: key[N<delim?>]:
- Each element is rendered as a list item at one indentation level under the header: - Each element is rendered as a list item at one indentation level under the header:
- Primitive: - <primitive> - Primitive: - <primitive>
- Primitive array: - [M<delim?>]: v1<delim>… - Primitive array: - [M<delim?>]: v1<delim>…
- Object: formatted using objects as list items (Section 11). - Object: formatted using "objects as list items" (Section 11).
- Complex arrays (e.g., arrays of arrays with mixed shapes): - key'[M<delim?>]: followed by nested items as appropriate. - Complex arrays (e.g., arrays of arrays with mixed shapes): - key'[M<delim?>]: followed by nested items as appropriate.
Decoding:
- Header declares the list length N and active delimiter for nested inline arrays.
- Each list item starts with "- " at one deeper depth and is parsed as:
- Primitive (no colon or array header),
- Inline primitive array (- [M<delim?>]: …),
- First-field-on-hyphen object (- key: … or - key[N…]{…}: …),
- Or complex nested arrays (e.g., arrays of arrays) using nested headers.
- In strict mode, the number of list items MUST equal N; otherwise error.
## 11. Objects as List Items ## 11. Objects as List Items
For an object appearing as a list item: For an object appearing as a list item:
- If the object is empty, render a single “-” at the list item indentation level. - Empty object list item: a single "-" at the list item indentation level.
- First field on the hyphen line:
- Otherwise, place the first field on the hyphen line using the following rules: - Primitive: - key: value
- If the first field's value is a primitive: - key: value - Primitive array: - key[M<delim?>]: v1<delim>…
- If the first field's value is a primitive array: - key[M<delim?>]: v1<delim>… - Tabular array: - key[N<delim?>]{fields}:
- If the first field's value is an array of objects that qualifies for tabular form: - Followed by tabular rows at one more indentation level (relative to the hyphen line).
- - key[N<delim?>]{fields}: - Non-uniform array of objects: - key[N<delim?>]:
- Followed by tabular rows at one more indentation level. - Followed by list items at one more indentation level.
- If the first field's value is a non-uniform array of objects: - Object: - key:
- - key[N<delim?>]:
- Followed by list items at one more indentation level (apply these same rules recursively).
- If the first field's value is a complex array (e.g., arrays of arrays, nested mixed arrays):
- - key[N<delim?>]:
- Followed by nested encodings (e.g., “- [M<delim?>]: …”) at one more indentation level.
- If the first field's value is an object:
- - key:
- Nested object fields appear at two more indentation levels (i.e., one deeper than subsequent sibling fields of the same list item). - Nested object fields appear at two more indentation levels (i.e., one deeper than subsequent sibling fields of the same list item).
- Remaining fields of the same object appear at one indentation level under the hyphen line, in encounter order, using normal object field rules. - Remaining fields of the same object appear at one indentation level under the hyphen line, in encounter order, using normal object field rules.
Decoding:
- The first field is parsed from the hyphen line. If it is a nested object (- key:), nested fields are at +2 depth relative to the hyphen line; subsequent fields of the same list item are at +1 depth.
- If the first field is a tabular header on the hyphen line, its rows are at +1 depth and then subsequent sibling fields continue at +1 depth after the rows.
## 12. Delimiters ## 12. Delimiters
- Supported delimiters: - Supported delimiters:
- Comma (default): header omits the delimiter symbol. - Comma (default): header omits the delimiter symbol.
- Tab: header includes the tab character inside brackets and braces (e.g., [N<TAB>], {a<TAB>b}); rows/inline arrays use tabs to separate values. - Tab: header includes the tab character inside brackets and braces (e.g., [N<TAB>], {a<TAB>b}); rows/inline arrays use tabs to separate values.
- Pipe: header includes “|” inside brackets and braces; rows/inline arrays use “|”. - Pipe: header includes "|" inside brackets and braces; rows/inline arrays use "|".
- Delimiter-aware quoting: - Delimiter-aware quoting (encoding):
- Strings containing the active delimiter MUST be quoted across object values, array values, and tabular rows. - Strings containing the active delimiter MUST be quoted across object values, array values, and tabular rows.
- Strings containing non-active delimiters (e.g., commas when using tab) do not require quoting unless another quoting condition applies. - Strings containing non-active delimiters (e.g., commas when using tab) do not require quoting unless another quoting condition applies.
- Changing the delimiter does not relax other quoting rules (colon, brackets/braces, leading hyphen, numeric-like, boolean/null-like). - Delimiter-aware parsing (decoding):
- Inline arrays and tabular rows MUST be split only on the active delimiter declared by the nearest array header.
- Strings containing the active delimiter MUST be quoted to avoid splitting; non-active delimiters MUST NOT cause splits.
- Nested headers may change the active delimiter; decoding MUST use the delimiter declared by the nearest header.
## 13. Length Marker ## 13. Length Marker
- When enabled, the length marker “#” MUST appear immediately before the length in every array header, including nested arrays and tabular headers: - When enabled by an encoder, the length marker "#" MUST appear immediately before the length in every array header, including nested arrays and tabular headers:
- key[#N<delim?>]: … - key[#N<delim?>]: …
- key[#N<delim?>]{…}: - key[#N<delim?>]{…}:
- - [#M<delim?>]: … - - [#M<delim?>]: …
- Semantics: purely informational to emphasize counts; no change to other parsing or formatting rules. - Decoding:
- The marker MUST be accepted and ignored semantically.
- In strict mode, declared lengths MUST match actual counts (rows/items/inline values); mismatches MUST error.
## 14. Indentation and Whitespace Invariants ## 14. Indentation and Whitespace Invariants
- Indentation: - Encoding:
- The encoder MUST use a consistent number of spaces per level (default 2; configurable). - The encoder MUST use a consistent number of spaces per level (default 2; configurable).
- Tabs MUST NOT be used for indentation. - Tabs MUST NOT be used for indentation.
- Spacing: - Exactly one space after ": " in key: value lines.
- Exactly one space after “: ” in key: value lines.
- Exactly one space after array headers when followed by inline values (non-empty primitive arrays). - Exactly one space after array headers when followed by inline values (non-empty primitive arrays).
- End-of-line:
- No trailing spaces at the end of any line. - No trailing spaces at the end of any line.
- No trailing newline at the end of the document. - No trailing newline at the end of the document.
- Decoding:
- Depth is derived from the number of leading spaces and the configured indent size. Implementations SHOULD accept inputs where depth is computed as floor(indentSpaces / indentSize).
- Decoders SHOULD be resilient to surrounding whitespace around tokens; internal token semantics follow quoting rules.
- Tabs used as indentation are non-conforming; behavior is undefined (validators MAY flag this).
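The depth rule can be written in a few lines. `computeDepth` is a hypothetical name; the sketch counts leading spaces only and leaves tab indentation unhandled, since that behavior is undefined above.

```typescript
// Depth = floor(leading spaces / indent size), per the decoding rule in this section.
function computeDepth(line: string, indentSize = 2): number {
  if (!Number.isInteger(indentSize) || indentSize <= 0) {
    throw new RangeError("indentSize must be a positive integer");
  }
  const match = /^ */.exec(line); // always matches, possibly with zero spaces
  const spaces = match === null ? 0 : match[0].length;
  return Math.floor(spaces / indentSize);
}
```

With the default indent of 2, a line with three leading spaces resolves to depth 1 rather than raising an error, which is the lenient behavior the SHOULD describes.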
## 15. Conformance ## 15. Conformance
@@ -270,37 +364,61 @@ Conformance classes:
- Tabular detection (either uniformly tabular or not, given the input). - Tabular detection (either uniformly tabular or not, given the input).
- Quoting decisions for given values and active delimiter. - Quoting decisions for given values and active delimiter.
- Decoder:
- MUST implement tokenization, escaping, and type interpretation per Sections 3A and 6.4.
- MUST parse array headers per Section 7 and apply the declared active delimiter to inline arrays and tabular rows.
- MUST implement structures and depth rules per Sections 9–12, including objects-as-list-items placement.
- In strict mode (default true), MUST enforce:
- Inline primitive array value count equals the declared length.
- Tabular row count equals the declared length.
- Tabular row value count equals the field count.
- Invalid escapes and unterminated strings error.
- Missing colon in key-value context errors.
- Delimiter mismatches (e.g., rows not split by the active delimiter) provoke errors via count checks.
- Validator: - Validator:
- SHOULD verify structural conformance (headers, indentation, list markers). - SHOULD verify structural conformance (headers, indentation, list markers).
- SHOULD verify whitespace invariants. - SHOULD verify whitespace invariants.
- SHOULD verify delimiter consistency between headers and rows. - SHOULD verify delimiter consistency between headers and rows.
- SHOULD verify length counts vs. declared [N].
- Parser/Decoder:
- Out of scope for v1; MAY be implemented. Implementers SHOULD follow the invariants in this spec for robust parsing (e.g., delimiter discovery from headers, length counts as consistency checks).
Options: Options:
- indent (default: 2 spaces) - Encoder options:
- delimiter (default: comma; alternatives: tab, pipe) - indent (default: 2 spaces)
- lengthMarker (default: disabled) - delimiter (default: comma; alternatives: tab, pipe)
- lengthMarker (default: disabled)
- Decoder options:
- indent (default: 2 spaces)
- strict (default: true)
## 16. Error Handling and Diagnostics ## 16. Error Handling and Diagnostics
- Inputs that cannot be represented in the data model (Section 2) are normalized (Section 3) before encoding (e.g., NaN → null). - Encoding normalization:
- Tabular fallback: - Inputs that cannot be represented in the data model (Section 2) are normalized (Section 3) before encoding (e.g., NaN → null).
- Tabular fallback (encoding):
- If any tabular condition fails (Section 10.3), encoders MUST use expanded list format (Section 10.4). - If any tabular condition fails (Section 10.3), encoders MUST use expanded list format (Section 10.4).
- Decoding errors (strict mode):
- Array length mismatch (inline arrays and list/tabular forms) MUST error.
- Tabular row value count mismatch vs. field count MUST error.
- Tabular row count mismatch vs. declared length MUST error.
- Invalid escape sequences or unterminated strings MUST error.
- Missing colon in key-value context MUST error.
- Delimiter mismatch (e.g., rows joined by a different delimiter) MUST error via count checks.
- Empty input is invalid and SHOULD error.
- Validators SHOULD report: - Validators SHOULD report:
- Trailing spaces, trailing newlines. - Trailing spaces, trailing newlines (encoder invariants).
- Headers missing delimiters when non-comma is active. - Headers missing delimiter marks when non-comma delimiter is in use.
- Mismatched row counts vs. declared [N]. - Mismatched row counts vs. declared [N].
- Values violating delimiter-aware quoting rules. - Values violating delimiter-aware quoting rules.
## 17. Security Considerations ## 17. Security Considerations
- Injection and ambiguity are mitigated by quoting rules: - Injection and ambiguity are mitigated by quoting rules:
- Strings with colon, active delimiter, leading hyphen, control characters, brackets/braces MUST be quoted. - Strings with colon, the active delimiter, leading hyphen, control characters, brackets/braces MUST be quoted.
- Decoders in strict mode reject malformed strings/escapes and structural inconsistencies (length/row counts), helping detect truncation or injected rows.
- Encoders SHOULD avoid excessive memory use on large inputs; implement streaming/tabular row emission where feasible. - Encoders SHOULD avoid excessive memory use on large inputs; implement streaming/tabular row emission where feasible.
- Unicode inputs: - Unicode inputs:
- Encoders SHOULD avoid altering Unicode content beyond required escaping; validators SHOULD accept all valid Unicode in quoted strings and keys (with escapes as required). - Encoders SHOULD avoid altering Unicode content beyond required escaping; decoders SHOULD accept all valid Unicode in quoted strings and keys (with escapes as required).
## 18. Internationalization ## 18. Internationalization
@@ -408,7 +526,7 @@ pairs[#2]:
## 22. Reference Algorithms (Informative) ## 22. Reference Algorithms (Informative)
22.1 Tabular Detection 22.1 Tabular Detection (Encoding)
Given an array rows: Given an array rows:
- If rows is empty → not tabular (fall back to expanded format). - If rows is empty → not tabular (fall back to expanded format).
@@ -420,7 +538,7 @@ Given an array rows:
- If row[key] is not a primitive → not tabular. - If row[key] is not a primitive → not tabular.
- Otherwise tabular with header from the first row. - Otherwise tabular with header from the first row.
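A minimal sketch of the detection steps, assuming the requirement stated elsewhere in this spec that rows have uniform keys and primitive-only values. The function name `detectTabular` and the helper predicates are hypothetical, and the rejection of rows with no keys is an assumption rather than something this excerpt states.

```typescript
type Primitive = string | number | boolean | null;

function isPrimitive(v: unknown): v is Primitive {
  return v === null || typeof v === "string" || typeof v === "number" || typeof v === "boolean";
}

function isPlainObject(v: unknown): v is Record<string, unknown> {
  return typeof v === "object" && v !== null && !Array.isArray(v);
}

// Returns the header (field names from the first row) if the array qualifies
// for tabular form, or null if the encoder should fall back to expanded format.
function detectTabular(rows: unknown[]): string[] | null {
  if (rows.length === 0) return null;
  const first = rows[0];
  if (!isPlainObject(first)) return null;
  const header = Object.keys(first);
  if (header.length === 0) return null; // assumption: empty objects are not tabular
  for (const row of rows) {
    if (!isPlainObject(row)) return null;
    if (Object.keys(row).length !== header.length) return null; // key sets must match
    for (const key of header) {
      if (!Object.prototype.hasOwnProperty.call(row, key)) return null;
      if (!isPrimitive(row[key])) return null;
    }
  }
  return header;
}
```

Key order within later rows is not compared here; only the key set is, and the header order comes from the first row.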
22.2 Safe-Unquoted String Decision 22.2 Safe-Unquoted String Decision (Encoding)
Given a string s and active delimiter d: Given a string s and active delimiter d:
- If s is empty or s !== s.trim() → quote. - If s is empty or s !== s.trim() → quote.
@@ -433,16 +551,73 @@ Given a string s and active delimiter d:
- If s starts with "-" → quote. - If s starts with "-" → quote.
- Else unquoted. - Else unquoted.
22.3 Header Formatting 22.3 Header Formatting (Encoding)
- Start with optional key (encoded as per key rules). - Start with optional key (encoded as per key rules).
- Append [<marker?>N<delim?>], where: - Append "[<marker?>N<delim?>]", where:
- <marker?> is “#” if enabled. - <marker?> is "#" if enabled.
- <delim?> is absent for comma, or is the delimiter symbol for tab/pipe. - <delim?> is absent for comma, or is the delimiter symbol for tab/pipe.
- If tabular, append {field1<delim>field2} where field names are key-encoded and joined by the active delimiter. - If tabular, append "{field1<delim>field2}" where field names are key-encoded and joined by the active delimiter.
- Append “:”. - Append ":".
- For non-empty primitive arrays on a single line, append a space and the joined values (each primitive-encoded with delimiter-aware quoting), joined by the active delimiter. - For non-empty primitive arrays on a single line, append a space and the joined values (each primitive-encoded with delimiter-aware quoting), joined by the active delimiter.
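The header steps can be sketched as a single string-building function. `formatHeader` and `HeaderInput` are hypothetical names; the sketch assumes the key, field names, and inline values have already been encoded and quoted by the caller.

```typescript
type Delimiter = "," | "\t" | "|";

interface HeaderInput {
  key?: string;            // already key-encoded; omitted for a root array
  length: number;          // N
  delimiter: Delimiter;    // active delimiter
  lengthMarker: boolean;   // emit "#" before N
  fields?: string[];       // already key-encoded; present only for tabular arrays
  inlineValues?: string[]; // already primitive-encoded; present only for inline arrays
}

function formatHeader(h: HeaderInput): string {
  let out = h.key ?? "";
  const marker = h.lengthMarker ? "#" : "";
  const delimSuffix = h.delimiter === "," ? "" : h.delimiter; // comma is implicit
  out += `[${marker}${h.length}${delimSuffix}]`;
  if (h.fields) out += `{${h.fields.join(h.delimiter)}}`;
  out += ":";
  if (h.inlineValues && h.inlineValues.length > 0) {
    out += " " + h.inlineValues.join(h.delimiter);
  }
  return out;
}

// Example: an inline primitive array and a pipe-delimited tabular header.
const inlineHeader = formatHeader({
  key: "tags", length: 3, delimiter: ",", lengthMarker: false, inlineValues: ["a", "b", "c"],
}); // "tags[3]: a,b,c"
const tabularHeader = formatHeader({
  key: "items", length: 2, delimiter: "|", lengthMarker: true, fields: ["id", "name"],
}); // "items[#2|]{id|name}:"
```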
22.4 Decoding Overview
- Split input into lines; compute depth from leading spaces and indent size (default 2). Depth computation MAY be floor(indentSpaces / indentSize).
- Decide root form:
- If first non-empty depth-0 line is a valid root array header: decode a root array.
- Else if the input consists of exactly one line and that line is not a key-value line: decode a single primitive.
- Else: decode an object.
- For objects at depth d: process lines at depth d; for arrays at depth d: read rows/list items at depth d+1.
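A sketch of the depth computation and the root-form decision. The function names are hypothetical. The root-header test is a simplified regular expression that covers only unkeyed headers, and the key-value test is approximated as "contains a colon outside double quotes"; both are assumptions made to keep the example short.

```typescript
// Depth is the number of leading spaces divided by the indent size, rounded down.
function computeDepth(line: string, indentSize = 2): number {
  let spaces = 0;
  while (spaces < line.length && line.charAt(spaces) === " ") spaces++;
  return Math.floor(spaces / indentSize);
}

// Simplified key-value test: true if a colon appears outside double quotes.
function hasUnquotedColon(s: string): boolean {
  let inQuotes = false;
  for (let i = 0; i < s.length; i++) {
    const ch = s.charAt(i);
    if (inQuotes && ch === "\\") { i++; continue; } // skip the escaped character
    if (ch === '"') inQuotes = !inQuotes;
    else if (ch === ":" && !inQuotes) return true;
  }
  return false;
}

type RootForm = "array" | "primitive" | "object";

function decideRootForm(input: string, indentSize = 2): RootForm {
  // Blank lines are ignored when counting lines (an assumption of this sketch).
  const lines = input.split("\n").filter((l) => l.trim() !== "");
  const first = lines.find((l) => computeDepth(l, indentSize) === 0);
  // Simplified root array header: [#?N<delim?>] optionally followed by {fields}, then ":".
  if (first !== undefined && /^\[#?\d+[\t|]?\](\{.*\})?:/.test(first)) return "array";
  if (lines.length === 1 && !hasUnquotedColon(lines[0] ?? "")) return "primitive";
  return "object";
}
```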
22.5 Array Header Parsing (Decoding)
- Locate the first "[ … ]" segment on the line; parse:
- Optional leading "#" marker (ignored semantically).
- Length N as decimal integer.
- Optional delimiter marker at the end: tab or pipe (comma otherwise).
- If a "{ … }" fields segment occurs between the "]" and the ":", parse field names using the active delimiter; for each name, if quoted, unescape it (Section 6.1).
- A colon MUST appear after the bracket/fields segment; otherwise error.
- Return the header (key, length, delimiter, fields?, hasLengthMarker) and any inline values after the colon.
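The header-parsing steps can be sketched as follows. `parseArrayHeader` and `ArrayHeader` are hypothetical names. To stay short, the sketch assumes unquoted keys and unquoted field names, so it omits the unescaping step and would mis-handle a quoted key that contains a bracket.

```typescript
type Delimiter = "," | "\t" | "|";

interface ArrayHeader {
  key: string | null;
  length: number;
  delimiter: Delimiter;
  fields: string[] | null;
  hasLengthMarker: boolean;
}

function parseArrayHeader(line: string): { header: ArrayHeader; inline: string } {
  const open = line.indexOf("[");
  if (open < 0) throw new Error("no bracket segment");
  const close = line.indexOf("]", open + 1);
  if (close < 0) throw new Error("unterminated bracket segment");

  let body = line.slice(open + 1, close);
  const hasLengthMarker = body.startsWith("#"); // accepted, no semantic effect
  if (hasLengthMarker) body = body.slice(1);

  // A trailing tab or pipe declares the delimiter; otherwise it is comma.
  let delimiter: Delimiter = ",";
  if (body.endsWith("\t")) delimiter = "\t";
  else if (body.endsWith("|")) delimiter = "|";
  if (delimiter !== ",") body = body.slice(0, -1);

  if (!/^\d+$/.test(body)) throw new Error("invalid array length");
  const length = Number.parseInt(body, 10);

  let rest = line.slice(close + 1);
  let fields: string[] | null = null;
  if (rest.startsWith("{")) {
    const end = rest.indexOf("}");
    if (end < 0) throw new Error("unterminated fields segment");
    fields = rest.slice(1, end).split(delimiter).map((f) => f.trim());
    rest = rest.slice(end + 1);
  }
  if (!rest.startsWith(":")) throw new Error("missing colon after array header");

  const rawKey = line.slice(0, open).trim();
  return {
    header: { key: rawKey === "" ? null : rawKey, length, delimiter, fields, hasLengthMarker },
    inline: rest.slice(1).replace(/^ /, ""), // drop the single separating space
  };
}
```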
22.6 parseDelimitedValues (Decoding)
- Iterate characters left-to-right keeping:
- current token, inQuotes flag.
- If encountering a double quote, toggle inQuotes.
- While inQuotes, treat backslash + next char as a literal pair (to be validated later by the string parser).
- Only split on the active delimiter when not in quotes.
- Trim surrounding spaces around each token.
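A direct sketch of the splitter described above. Tokens are returned raw, with quotes and escapes intact, so that a later string parser can validate them. Trimming removes spaces only, not tabs, which is one reading of "surrounding spaces".

```typescript
function parseDelimitedValues(input: string, delimiter: string): string[] {
  const trimSpaces = (s: string): string => s.replace(/^ +| +$/g, "");
  const tokens: string[] = [];
  let current = "";
  let inQuotes = false;

  for (let i = 0; i < input.length; i++) {
    const ch = input.charAt(i);
    if (inQuotes && ch === "\\" && i + 1 < input.length) {
      // Keep backslash + next char as a literal pair; validated later.
      current += ch + input.charAt(i + 1);
      i++;
    } else if (ch === '"') {
      inQuotes = !inQuotes;
      current += ch;
    } else if (ch === delimiter && !inQuotes) {
      tokens.push(trimSpaces(current));
      current = "";
    } else {
      current += ch;
    }
  }
  tokens.push(trimSpaces(current));
  return tokens;
}

// Example: the comma inside the quoted token does not split.
const example = parseDelimitedValues('a, "b,c" ,d', ","); // ["a", "\"b,c\"", "d"]
```

Note that an empty input yields a single empty token, so callers would need to special-case arrays declared with length 0.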
22.7 Primitive Token Parsing (Decoding)
- If token starts with a quote, it MUST be a properly quoted string (no trailing characters after the closing quote). Unescape it using only the five escapes defined in Section 6.1; any other escape sequence MUST error.
- Else if token is true/false/null → boolean/null.
- Else if token is numeric without forbidden leading zeros and finite → number.
- Else → string.
- Empty tokens decode to empty string.
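A sketch of the token rules. Two points are assumptions, because the sections that define them are outside this excerpt: the five escapes are taken to be `\\`, `\"`, `\n`, `\r`, and `\t`, and the numeric grammar is taken to be JSON-like. The function names are hypothetical.

```typescript
// Assumed escape set (Section 6.1 is not reproduced here).
const ESCAPES = new Map<string, string>([
  ["\\", "\\"], ['"', '"'], ["n", "\n"], ["r", "\r"], ["t", "\t"],
]);

// Decode a token that begins with a double quote.
function unescapeQuoted(token: string): string {
  let out = "";
  let i = 1; // skip the opening quote
  while (i < token.length) {
    const ch = token.charAt(i);
    if (ch === "\\") {
      const mapped = ESCAPES.get(token.charAt(i + 1));
      if (mapped === undefined) throw new Error("invalid escape sequence");
      out += mapped;
      i += 2;
    } else if (ch === '"') {
      if (i !== token.length - 1) throw new Error("unexpected characters after closing quote");
      return out;
    } else {
      out += ch;
      i++;
    }
  }
  throw new Error("unterminated string");
}

function parsePrimitiveToken(token: string): string | number | boolean | null {
  if (token === "") return "";
  if (token.startsWith('"')) return unescapeQuoted(token);
  if (token === "true") return true;
  if (token === "false") return false;
  if (token === "null") return null;
  // Assumed JSON-like number syntax; leading zeros such as "05" are rejected.
  if (/^-?(0|[1-9]\d*)(\.\d+)?([eE][+-]?\d+)?$/.test(token)) {
    const n = Number(token);
    if (Number.isFinite(n)) return n;
  }
  return token; // anything else is an unquoted string
}
```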
22.8 Object and List Item Parsing (Decoding)
- Key-value line: parse a (quoted or unquoted) key up to the first colon; missing colon → error. Rest of the line is the primitive value (if present).
- Nested object: "key:" with nothing after colon opens a nested object. If this is:
- A field inside a regular object: nested fields at +1 depth relative to that line.
- The first field on a list-item hyphen line: nested fields at +2 depth relative to the hyphen line; subsequent sibling fields at +1 depth.
- List items:
- Lines start with "- " at one deeper depth than the parent array header.
- After "- ":
- If "[ … ]:" appears → an inline array item; decode with its own header and active delimiter.
- Else if a colon appears → object with first field on hyphen line; parse first field and then subsequent fields as above.
- Else → primitive token.
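The three-way decision for the content after "- " can be sketched as a classifier. `classifyListItem` is a hypothetical name. The array-header test is a simplified regular expression anchored at the start of the item, and the object test is approximated as "contains a colon outside double quotes"; under these assumptions a keyed array such as `- tags[2]: a,b` is classified as an object whose first field sits on the hyphen line.

```typescript
type ListItemKind = "array" | "object" | "primitive";

// True if a colon appears outside double quotes.
function hasUnquotedColon(s: string): boolean {
  let inQuotes = false;
  for (let i = 0; i < s.length; i++) {
    const ch = s.charAt(i);
    if (inQuotes && ch === "\\") { i++; continue; } // skip the escaped character
    if (ch === '"') inQuotes = !inQuotes;
    else if (ch === ":" && !inQuotes) return true;
  }
  return false;
}

function classifyListItem(line: string): { kind: ListItemKind; content: string } {
  const trimmed = line.replace(/^ +/, "");
  if (!trimmed.startsWith("- ")) throw new Error('list item must start with "- "');
  const content = trimmed.slice(2);
  // Simplified unkeyed header: [#?N<delim?>] optionally followed by {fields}, then ":".
  if (/^\[#?\d+[\t|]?\](\{.*\})?:/.test(content)) return { kind: "array", content };
  if (hasUnquotedColon(content)) return { kind: "object", content };
  return { kind: "primitive", content };
}
```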
22.9 Strict Mode Count Checks (Decoding)
- After decoding:
- Inline arrays: item count MUST equal N.
- List arrays: number of items MUST equal N.
- Tabular arrays: number of rows MUST equal N; each row's value count MUST equal field count.
- For tabular arrays, at row depth after N rows, if another same-depth line looks like a row (per disambiguation in 10.3), it MUST error in strict mode.
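The count checks can be sketched as two small assertions that a decoder might call once an array has been read. The function names and the choice of `RangeError` are hypothetical; what a non-strict decoder does with a mismatch is left open here.

```typescript
type ArrayForm = "inline" | "list" | "tabular";

// Declared length N must match the number of decoded items or rows.
function assertCount(form: ArrayForm, declared: number, actual: number, strict: boolean): void {
  if (strict && declared !== actual) {
    throw new RangeError(`${form} array: declared length ${declared}, found ${actual}`);
  }
}

// Every tabular row must have exactly one value per declared field.
function assertRowWidths(rows: string[][], fieldCount: number, strict: boolean): void {
  if (!strict) return;
  rows.forEach((row, index) => {
    if (row.length !== fieldCount) {
      throw new RangeError(`row ${index}: expected ${fieldCount} values, found ${row.length}`);
    }
  });
}
```

A delimiter mismatch surfaces through the same checks: a row written as `1|Ada` under a comma-delimited two-field header splits into one token, so the row-width check fails.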
## 23. ABNF Sketch (Informative) ## 23. ABNF Sketch (Informative)
This sketch omits full Unicode and escaping details; it illustrates structure only. This sketch omits full Unicode and escaping details; it illustrates structure only.
@@ -484,20 +659,22 @@ string = quoted / safe-unquoted-string
``` ```
Notes: Notes:
- Safe-unquoted-string constraints are defined in Section 6.2. - Safe-unquoted-string constraints are defined in Section 6.2 (encoding).
- Actual tokenization relies on the declared header delimiter and quoting rules. - Quoted strings/keys accept only the five escapes in Section 6.1; others MUST error in decoding.
- Row/key-value disambiguation at tabular row depth is defined in 10.3.
## 24. Test Suite and Compliance (Informative) ## 24. Test Suite and Compliance (Informative)
- Implementations are encouraged to validate against a comprehensive test suite covering: - Implementations are encouraged to validate against a comprehensive test suite covering:
- Primitive encoding, quoting, control-character escaping. - Primitive encoding/decoding, quoting, control-character escaping.
- Object key encoding and order preservation. - Object key encoding/decoding and order preservation.
- Primitive arrays (inline), empty arrays. - Primitive arrays (inline), empty arrays.
- Arrays of arrays (expanded), mixed-length and empty inner arrays. - Arrays of arrays (expanded), mixed-length and empty inner arrays.
- Tabular detection and encoding, including delimiter variations. - Tabular detection and formatting, including delimiter variations.
- Mixed arrays and objects-as-list-items behavior, including nested arrays and objects. - Mixed arrays and objects-as-list-items behavior, including nested arrays and objects.
- Whitespace invariants (no trailing spaces/newline). - Whitespace invariants (no trailing spaces/newline).
- Normalization (BigInt, Date, undefined, NaN/Infinity, functions, symbols). - Normalization (BigInt, Date, undefined, NaN/Infinity, functions, symbols).
- Decoder strict-mode errors: count mismatches, invalid escapes, missing colon, delimiter mismatches.
The provided reference tests in the repository mirror these conditions and SHOULD be used to ensure conformance. The provided reference tests in the repository mirror these conditions and SHOULD be used to ensure conformance.
@@ -512,21 +689,21 @@ The provided reference tests in the repository mirror these conditions and SHOUL
- Backward-compatible evolutions SHOULD preserve current headers, quoting rules, and indentation semantics. - Backward-compatible evolutions SHOULD preserve current headers, quoting rules, and indentation semantics.
- Reserved/structural characters (colon, brackets, braces, hyphen) MUST retain current meanings. - Reserved/structural characters (colon, brackets, braces, hyphen) MUST retain current meanings.
- Future work (non-normative): decoding/parsing spec, schemas, comments/annotations, additional delimiter profiles. - Future work (non-normative): schemas, comments/annotations, additional delimiter profiles.
## 27. Acknowledgments and License ## 27. Acknowledgments and License
- Credits: Author and contributors; ports in other languages (Elixir, PHP, Python, Ruby, Java, .NET, Swift). - Credits: Author and contributors; ports in other languages (Elixir, PHP, Python, Ruby, Java, .NET, Swift, Go).
- License: MIT (see repository for details). - License: MIT (see repository for details).
--- ---
Appendix: Cross-check With Reference Behavior (Informative) Appendix: Cross-check With Reference Behavior (Informative)
- All normative behaviors specified herein are implemented and validated by the reference encoder and its test suite, including: - All normative behaviors specified herein are implemented and validated by the reference encoder and decoder test suites, including:
- Safe-unquoted string rules and delimiter-aware quoting. - Safe-unquoted string rules and delimiter-aware quoting.
- Object and tabular header formation using the active delimiter (comma implicit; tab/pipe explicit). - Object and tabular header formation using the active delimiter (comma implicit; tab/pipe explicit), and delimiter-aware parsing.
- Length marker propagation to nested arrays. - Length marker propagation (encoding) and acceptance (decoding).
- Tabular detection requiring uniform keys and primitive-only values. - Tabular detection requiring uniform keys and primitive-only values (encoding).
- Objects-as-list-items formatting (first field on hyphen line, subsequent fields at +1 indent; nested object content at +2). - Objects-as-list-items formatting and decoding (first field on hyphen line, nested object content at +2; subsequent fields at +1).
- Whitespace invariants and no trailing newline. - Whitespace invariants for encoding and depth-based parsing for decoding.