Spinel AST format

The parser front-end consumes a Ruby source file and produces a text-based AST that the rest of the compiler reads. This file documents that text format -- the record types, the flattening conventions, and the per-node fields the consumers rely on.

Pipeline position

.rb  ──[sp_parse_file_to_text]──▶  text AST  ──[nt_load_text]──▶  NodeTable
        ──[analyze_program + codegen_program]──▶  .c

Spinel is a single binary (spinel). sp_parse_file_to_text (src/spinel_parse.c) is the only stage that talks to libprism; it returns the flattened text AST as an in-memory string -- there is no on-disk .ast intermediate (though spinel --dump-ast prints it). nt_load_text (src/node_table.c) parses that text into a NodeTable, and the analyzer and code generator both read the table through the nt_* accessor API. The text format is the stable contract between Prism (a moving upstream) and Spinel's fixed-shape backend.

There is one parser implementation: src/spinel_parse.c, which links libprism directly (no Ruby is needed at build time once make deps has fetched the sources). The retired Ruby self-host had a spinel_parse.rb fallback; the C compiler does not.

File format

Plain UTF-8 text, line-oriented (\n-terminated). Every line is one record. Fields are space-separated.

Records are not length-limited by the format. Large source constructs can produce A records longer than 4096 bytes, so parser emitters must size output records dynamically instead of formatting through a fixed line buffer.

The first record is always:

ROOT <int>

naming the AST node id that the rest of the file roots at (always 0 in practice, since flatten() walks the tree top-down and assigns ids in DFS order). It is followed by the source-file header records:

SOURCE_FILE <path>        # the primary source path (unescaped)
FILE <fid> <path>         # file-id -> path map, one per distinct source file

The FILE records back the multi-file #line map: a node's node_file integer attribute (below) indexes this table, so an error in generated C points at the right original .rb even across an inlined require. nt_file_path(nt, fid) resolves the id.

Every subsequent record begins with a single-letter tag:

Tag	Shape	Meaning
`N`	`N <id> <NodeType>`	Node header. Introduces a new node; binds its id to its Prism type.
`S`	`S <id> <field> <string>`	String-valued attribute (`name`, `content`, `pattern`, etc.).
`I`	`I <id> <field> <int>`	Integer attribute (`value`, `flags`, and the `node_line` / `node_file` / `node_col` source position).
`F`	`F <id> <field> <float>`	Float attribute (`value`). Stored as the node's `content` string.
`R`	`R <id> <field> <child-id-or--1>`	Single child reference. `-1` means the child slot is empty.
`A`	`A <id> <field> <id,id,…>`	Array of child references. Body is comma-separated; empty allowed.

Every node carries node_line / node_file / node_col integer attributes (the source position Prism reports); codegen reads them via nt_int(nt, id, "node_line", 0) etc. to emit #line directives under --line-map (on by default) and --debug.

A node is defined by its N line followed by zero or more attribute lines (S/I/F/R/A) that share the same <id>. Attribute lines appear in source order -- but downstream code addresses attributes by field name, not position, so the order is informational rather than load-bearing.

Example

puts "hi"

flattens to:

ROOT 0
N 0 ProgramNode
N 1 StatementsNode
N 2 CallNode
S 2 name puts
R 2 receiver -1
N 3 ArgumentsNode
N 4 StringNode
S 4 content hi
A 3 arguments 4
R 2 arguments 3
R 2 block -1
A 1 body 2
R 0 statements 1

Node ids are assigned in DFS pre-order during parse, so a child's id is always greater than its parent's N line id. The attribute lines for a parent may interleave with the child subtree -- the consumer matches by id, not by position.

Encoding rules

String values (S records) are produced by escape_str() in spinel_parse.c:

Char	Encoded as
`\n`	`\\n`
`\t`	`\\t`
`\r`	`\\r`
`\\`	`\\\\`
`"`	`\\"`
`\0`	`\\0`
byte ≥ `0x80`	passed through verbatim (UTF-8)
other	passed through

The encoding is reversible by the consumer's unescape_str helper.

The space between fields is a single ASCII space; any spaces inside the string payload survive (they are not separators after the third field). Newlines do not appear in string payloads -- they are encoded as \n.

Node-type taxonomy

src/spinel_parse.c's flatten() enumerates every Prism node it understands. Adding support for a new Ruby idiom usually means adding one case PM_<NODE>: arm to the switch and a matching consumer-side handler in src/analyze*.c / src/codegen*.c. The list below groups the recognised nodes by category; field names per node are the ones Prism documents, with a few Spinel-specific aliases for clarity.

Program / scopes

Node	Fields used	Notes
`ProgramNode`	`statements` (R)	Top-level wrapper. The root node.
`StatementsNode`	`body` (A)	Sequence of statements -- methods bodies, blocks, etc.
`ClassNode`	`constant_path` (R), `superclass` (R), `body` (R)	Class definition. `superclass` is `-1` for no parent.
`ModuleNode`	`constant_path` (R), `body` (R)	Module definition.
`SingletonClassNode`	`expression` (R), `body` (R)	`class << recv; ... end`.
`DefNode`	`name` (S), `parameters` (R), `body` (R), `receiver` (R)	Method definition. `receiver` is non-`-1` for `def self.x` / `def obj.x`.
`LambdaNode`	`parameters` (R), `body` (R)	`-> { ... }` / `lambda { ... }`.
`BlockNode`	`parameters` (R), `body` (R)	`do ... end` / `{ ... }` attached to a call.
`ParametersNode`	`requireds` (A), `optionals` (A), `keywords` (A), `posts` (A), `rest` (R), `block` (R)	Per-method or per-block parameter list.
`NumberedParametersNode`	`value` (I) (max param `_1` .. `_N`)	`{ _1 + _2 }` shorthand.

Control flow

Node	Fields used
`IfNode`	`predicate` (R), `statements` (R), `subsequent` (R)
`UnlessNode`	`predicate` (R), `statements` (R), `consequent` (R)
`ElseNode`	`statements` (R)
`CaseNode`	`predicate` (R), `conditions` (A), `else_clause` (R)
`CaseMatchNode`	`predicate` (R), `conditions` (A), `else_clause` (R)
`WhenNode`	`conditions` (A), `statements` (R)
`InNode`	`pattern` (R), `statements` (R)
`WhileNode`	`predicate` (R), `statements` (R)
`UntilNode`	`predicate` (R), `statements` (R)
`BreakNode` / `NextNode` / `RedoNode` / `ReturnNode`	`arguments` (R)
`BeginNode`	`statements` (R), `rescue_clause` (R), `else_clause` (R), `ensure_clause` (R)
`RescueNode`	`exceptions` (A), `reference` (R), `statements` (R), `subsequent` (R)
`EnsureNode`	`statements` (R)
`YieldNode`	`arguments` (R)
`SuperNode` / `ForwardingSuperNode`	`arguments` (R), `block` (R)
`CatchNode` / `ThrowNode`	(encoded as CallNode-style; no dedicated Prism node -- the parser injects helpers)

Calls

CallNode is the workhorse: every method invocation -- including operators, attribute access, indexing -- compiles to one. The disambiguating field is name, plus the recv / args / block triple.

Node	Fields used	Notes
`CallNode`	`name` (S), `receiver` (R), `arguments` (R), `block` (R), `flags` (I)	Universal method call. `flags` bit 0 = `&.` safe navigation.
`ArgumentsNode`	`arguments` (A)	Positional + keyword + splat args.
`KeywordHashNode`	`elements` (A)	Trailing keyword hash literal.
`AssocNode`	`key` (R), `value` (R)	One key=>value pair inside a Hash or kwargs.
`SplatNode`	`expression` (R)	`*args` -- splat in args or LHS.
`BlockArgumentNode`	`expression` (R)	`&block` -- pass block by name.
`ForwardingArgumentsNode`	(no fields)	`def f(...); g(...); end` shape.
`BlockParameterNode`	`name` (S)	`&blk` in a method's parameter list.

Literals

Node	Fields used	Notes
`IntegerNode`	`value` (I)	Decimal, hex, oct, bin.
`FloatNode`	`value` (F)	Includes exponent forms.
`RationalNode` / `ImaginaryNode`	`numeric` (R)	Wrapping number suffix.
`StringNode`	`content` (S)	Literal string.
`SymbolNode`	`content` (S)	Symbol literal.
`InterpolatedStringNode` / `InterpolatedSymbolNode`	`parts` (A)	Each part is either a StringNode or an `EmbeddedStatementsNode`.
`EmbeddedStatementsNode`	`statements` (R)	`#{...}` inside a string.
`RegularExpressionNode`	`content` (S), `flags` (I)	`/pat/`.
`InterpolatedRegularExpressionNode`	`parts` (A), `flags` (I)	`/#{x}/`.
`ArrayNode`	`elements` (A)	`[1, 2, 3]`.
`HashNode`	`elements` (A) -- each an AssocNode	`{a: 1}`.
`RangeNode`	`left` (R), `right` (R), `flags` (I)	`flags` bit 1 = exclusive end.
`LambdaNode`	(see above)	`-> x { x*2 }`.
`TrueNode` / `FalseNode` / `NilNode` / `SelfNode`	(no fields)	Singleton literals.

References & assignment

Node	Fields used
`LocalVariableReadNode`	`name` (S)
`LocalVariableWriteNode`	`name` (S), `value` (R)
`LocalVariableTargetNode`	`name` (S) -- LHS of `MultiWriteNode`
`LocalVariableOperatorWriteNode`	`name` (S), `value` (R), `operator` (S)
`LocalVariableOrWriteNode` / `LocalVariableAndWriteNode`	`name` (S), `value` (R)
`InstanceVariableReadNode`	`name` (S)
`InstanceVariableWriteNode`	`name` (S), `value` (R)
`InstanceVariableOperatorWriteNode`	(same triple, plus `operator` S)
`ClassVariableReadNode` / `ClassVariableWriteNode`	`name` (S), `value` (R)
`GlobalVariableReadNode` / `GlobalVariableWriteNode`	`name` (S), `value` (R)
`ConstantReadNode`	`name` (S)
`ConstantPathNode`	`parent` (R), `name` (S)
`ConstantWriteNode`	`name` (S), `value` (R)
`MultiWriteNode`	`targets` (A), `expression` (R), `rest` (R)
`MultiTargetNode`	`lefts` (A), `rest` (R), `rights` (A)

Index access

Node	Fields used
`IndexOrWriteNode` / `IndexAndWriteNode` / `IndexOperatorWriteNode`	`receiver` (R), `arguments` (R), `value` (R), plus `operator` for op-write
`CallOperatorWriteNode` / `CallAndWriteNode` / `CallOrWriteNode`	`receiver` (R), `read_name` (S), `write_name` (S), `value` (R), `operator` (S) -- `obj.x op= v` shape
(`a[i]` reads itself parse as a `CallNode` with `name == "[]"`. The `OrWrite` / `AndWrite` / `*OperatorWrite` nodes only appear for the assignment shape.)

Pattern matching

Node	Fields used	Notes
`ArrayPatternNode`	`requireds` (A), `rest` (R), `posts` (A)	`case x; in [a, b]`
`HashPatternNode`	`elements` (A), `rest` (R)	`case x; in {a: 1}`
`FindPatternNode`	`left` (R), `requireds` (A), `right` (R)	`case x; in [, a, ]`
`PinnedVariableNode`	`variable` (R)	`^var` in a pattern.
`ImplicitNode`	`value` (R)	Sugar elision (`a:` short-hand).

Forwarding / synthetic

Node	Fields used
`ForwardingArgumentsNode`	(no fields)
`ForwardingParameterNode`	(no fields)

The parser also injects synthetic CallNodes for built-in idioms that Prism parses as separate constructs: catch / throw, defined?, and the *-write shorthand operators. Those are documented at their emit sites in src/spinel_parse.c.

Field semantics by category

`R <id> <field> <child>`

A single reference. -1 means "no child" -- consumers must check before indexing arrays by this id. A handful of field names appear repeatedly across node types with a consistent meaning:

receiver -- the recv of a CallNode; -1 for bare calls.
arguments -- usually points at an ArgumentsNode, sometimes directly at the single expression (BreakNode / ReturnNode).
block -- points at a BlockNode (literal block) or BlockArgumentNode (&block forward) attached to a call; -1 otherwise.
value / expression -- the RHS of an assignment / write.
statements -- the body of a control-flow construct.
parameters -- a ParametersNode attached to a method / lambda / block.

`A <id> <field> <id,id,…>`

A list of references, comma-separated, empty body allowed. The empty string after <field> means "zero elements". Used for collections whose size isn't fixed at parse time: statement bodies, argument lists, WhenNode condition lists, hash element lists, etc.

The order is the source order. Consumers walk left-to-right via get_args(args_id) (or parse_id_list for raw lists), so reordering within an A record changes program semantics.

`S <id> <field> <string>`

The string is escape_str-encoded (see "Encoding rules" above). Common fields:

name -- the method name on CallNode, the variable name on LocalVariableReadNode, etc.
content -- the literal text of StringNode / SymbolNode / RegularExpressionNode.
operator -- +=, -=, *=, etc. on the op-write nodes.
pattern -- regex source for RegularExpressionNode.

`I <id> <field> <int>`

Decimal integer, possibly signed. Used for IntegerNode.value (the parsed literal), flags bitfields, and Prism's NumberedParametersNode arity.

`F <id> <field> <float>`

Float literal in printf("%.17g") form, always with a decimal point.

How consumers read the AST

nt_load_text parses the text into a NodeTable: an array of SpNode records indexed by node id (src/node_table.h). Unlike the retired Ruby self-host -- which built one flat @nd_<field> array per attribute -- each SpNode stores its own small field lists (string / int / ref / array), and the analyzer and code generator both query them by name through the accessor API:

Accessor	Reads
`nt_type(nt, id)`	Node type name (the `N` record); `NULL` if unset.
`nt_str(nt, id, "name")`	An `S` string attribute; `NULL` if absent.
`nt_int(nt, id, "value", dflt)`	An `I` integer attribute; `dflt` if absent.
`nt_ref(nt, id, "receiver")`	An `R` child reference; `-1` if absent / empty.
`nt_arr(nt, id, "arguments", &n)`	An `A` child-id list and its length; `NULL,0` if absent.
`nt_kind(nt, id)`	The cached `NodeKind` enum (a fast switch key vs. `strcmp` on the type string).

A reference field absent from a node reads back as -1 and an absent array as empty, so consumers check before indexing. The loader skips records it does not recognise; an attribute the parser emits but no accessor ever reads is simply inert. New parser fields are picked up automatically by nt_str / nt_int / nt_ref / nt_arr once a consumer asks for them by name -- there is no separate per-field loader arm to keep in sync.

The analyzer's results do not live in the NodeTable; they live in the Compiler struct (the per-node type cache plus the class / method / scope tables). The NodeTable is the immutable parsed program; the Compiler is the derived analysis. See ANALYZE-IR.md.

Limitations

Integer overflow: IntegerNode.value is encoded as a signed 64-bit decimal. Literals outside [INT64_MIN, INT64_MAX] are truncated by pm_int_value(); the AOT compiler then auto-promotes to bigint at use sites it can recognise (loop multiplication, fibonacci-style addition). See README.md §Bigint.
Floating point: %.17g round-trips IEEE-754 doubles but the surface text isn't human-readable for boundary cases. Codegen uses the literal text directly in C output, so the encoded form drives what the C compiler sees.
Encoding: bytes ≥ 0x80 pass through verbatim, so the parser assumes UTF-8. ASCII-7 sources are always safe; mixed-encoding sources are not supported (matches the README's "no encoding" limitation).
Source locations: emitted. Each node carries node_line / node_file / node_col integer attributes and the file table is carried in SOURCE_FILE / FILE records, so codegen can map a generated-C line back to the original .rb (the --line-map / --debug #line directives, and the --line-map diagnostics that re-point cc errors at Ruby source).

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Spinel AST format

Pipeline position

File format

Example

Encoding rules

Node-type taxonomy

Program / scopes

Control flow

Calls

Literals

References & assignment

Index access

Pattern matching

Forwarding / synthetic

Field semantics by category

`R <id> <field> <child>`

`A <id> <field> <id,id,…>`

`S <id> <field> <string>`

`I <id> <field> <int>`

`F <id> <field> <float>`

How consumers read the AST

Limitations

See also

FilesExpand file tree

AST.md

Latest commit

History

AST.md

File metadata and controls

Spinel AST format

Pipeline position

File format

Example

Encoding rules

Node-type taxonomy

Program / scopes

Control flow

Calls

Literals

References & assignment

Index access

Pattern matching

Forwarding / synthetic

Field semantics by category

R <id> <field> <child>

A <id> <field> <id,id,…>

S <id> <field> <string>

I <id> <field> <int>

F <id> <field> <float>

How consumers read the AST

Limitations

See also

`R <id> <field> <child>`

`A <id> <field> <id,id,…>`

`S <id> <field> <string>`

`I <id> <field> <int>`

`F <id> <field> <float>`