The parser front-end consumes a Ruby source file and produces a text-based AST that the rest of the compiler reads. This file documents that text format -- the record types, the flattening conventions, and the per-node fields the consumers rely on.
.rb ──[sp_parse_file_to_text]──▶ text AST ──[nt_load_text]──▶ NodeTable
──[analyze_program + codegen_program]──▶ .c
Spinel is a single binary (spinel). sp_parse_file_to_text
(src/spinel_parse.c) is the only stage that talks to libprism; it
returns the flattened text AST as an in-memory string -- there is no
on-disk .ast intermediate (though spinel --dump-ast prints it).
nt_load_text (src/node_table.c) parses that text into a NodeTable,
and the analyzer and code generator both read the table through the
nt_* accessor API. The text format is the stable contract between
Prism (a moving upstream) and Spinel's fixed-shape backend.
There is one parser implementation: src/spinel_parse.c, which links
libprism directly (no Ruby is needed at build time once make deps has
fetched the sources). The retired Ruby self-host had a spinel_parse.rb
fallback; the C compiler does not.
Plain UTF-8 text, line-oriented (\n-terminated). Every line is one
record. Fields are space-separated.
Records are not length-limited by the format. Large source constructs
can produce A records longer than 4096 bytes, so parser emitters must
size output records dynamically instead of formatting through a fixed
line buffer.
The first record is always:
ROOT <int>
naming the AST node id that the rest of the file roots at (always 0
in practice, since flatten() walks the tree top-down and assigns ids
in DFS order). It is followed by the source-file header records:
SOURCE_FILE <path> # the primary source path (unescaped)
FILE <fid> <path> # file-id -> path map, one per distinct source file
The FILE records back the multi-file #line map: a node's
node_file integer attribute (below) indexes this table, so an error in
generated C points at the right original .rb even across an inlined
require. nt_file_path(nt, fid) resolves the id.
Every subsequent record begins with a single-letter tag:
| Tag | Shape | Meaning |
|---|---|---|
N |
N <id> <NodeType> |
Node header. Introduces a new node; binds its id to its Prism type. |
S |
S <id> <field> <string> |
String-valued attribute (name, content, pattern, etc.). |
I |
I <id> <field> <int> |
Integer attribute (value, flags, and the node_line / node_file / node_col source position). |
F |
F <id> <field> <float> |
Float attribute (value). Stored as the node's content string. |
R |
R <id> <field> <child-id-or--1> |
Single child reference. -1 means the child slot is empty. |
A |
A <id> <field> <id,id,…> |
Array of child references. Body is comma-separated; empty allowed. |
Every node carries node_line / node_file / node_col integer
attributes (the source position Prism reports); codegen reads them via
nt_int(nt, id, "node_line", 0) etc. to emit #line directives under
--line-map (on by default) and --debug.
A node is defined by its N line followed by zero or more attribute
lines (S/I/F/R/A) that share the same <id>. Attribute lines
appear in source order -- but downstream code addresses attributes by
field name, not position, so the order is informational rather than
load-bearing.
puts "hi"flattens to:
ROOT 0
N 0 ProgramNode
N 1 StatementsNode
N 2 CallNode
S 2 name puts
R 2 receiver -1
N 3 ArgumentsNode
N 4 StringNode
S 4 content hi
A 3 arguments 4
R 2 arguments 3
R 2 block -1
A 1 body 2
R 0 statements 1
Node ids are assigned in DFS pre-order during parse, so a child's id is
always greater than its parent's N line id. The attribute lines for a
parent may interleave with the child subtree -- the consumer matches by
id, not by position.
String values (S records) are produced by escape_str() in
spinel_parse.c:
| Char | Encoded as |
|---|---|
\n |
\\n |
\t |
\\t |
\r |
\\r |
\\ |
\\\\ |
" |
\\" |
\0 |
\\0 |
byte ≥ 0x80 |
passed through verbatim (UTF-8) |
| other | passed through |
The encoding is reversible by the consumer's unescape_str helper.
The space between fields is a single ASCII space; any spaces inside the
string payload survive (they are not separators after the third field).
Newlines do not appear in string payloads -- they are encoded as \n.
src/spinel_parse.c's flatten() enumerates every Prism node it
understands. Adding support for a new Ruby idiom usually means adding
one case PM_<NODE>: arm to the switch and a matching consumer-side
handler in src/analyze*.c / src/codegen*.c. The list below groups
the recognised nodes by category; field names per node are the ones
Prism documents, with a few Spinel-specific aliases for clarity.
| Node | Fields used | Notes |
|---|---|---|
ProgramNode |
statements (R) |
Top-level wrapper. The root node. |
StatementsNode |
body (A) |
Sequence of statements -- methods bodies, blocks, etc. |
ClassNode |
constant_path (R), superclass (R), body (R) |
Class definition. superclass is -1 for no parent. |
ModuleNode |
constant_path (R), body (R) |
Module definition. |
SingletonClassNode |
expression (R), body (R) |
class << recv; ... end. |
DefNode |
name (S), parameters (R), body (R), receiver (R) |
Method definition. receiver is non--1 for def self.x / def obj.x. |
LambdaNode |
parameters (R), body (R) |
-> { ... } / lambda { ... }. |
BlockNode |
parameters (R), body (R) |
do ... end / { ... } attached to a call. |
ParametersNode |
requireds (A), optionals (A), keywords (A), posts (A), rest (R), block (R) |
Per-method or per-block parameter list. |
NumberedParametersNode |
value (I) (max param _1 .. _N) |
{ _1 + _2 } shorthand. |
| Node | Fields used |
|---|---|
IfNode |
predicate (R), statements (R), subsequent (R) |
UnlessNode |
predicate (R), statements (R), consequent (R) |
ElseNode |
statements (R) |
CaseNode |
predicate (R), conditions (A), else_clause (R) |
CaseMatchNode |
predicate (R), conditions (A), else_clause (R) |
WhenNode |
conditions (A), statements (R) |
InNode |
pattern (R), statements (R) |
WhileNode |
predicate (R), statements (R) |
UntilNode |
predicate (R), statements (R) |
BreakNode / NextNode / RedoNode / ReturnNode |
arguments (R) |
BeginNode |
statements (R), rescue_clause (R), else_clause (R), ensure_clause (R) |
RescueNode |
exceptions (A), reference (R), statements (R), subsequent (R) |
EnsureNode |
statements (R) |
YieldNode |
arguments (R) |
SuperNode / ForwardingSuperNode |
arguments (R), block (R) |
CatchNode / ThrowNode |
(encoded as CallNode-style; no dedicated Prism node -- the parser injects helpers) |
CallNode is the workhorse: every method invocation -- including
operators, attribute access, indexing -- compiles to one. The
disambiguating field is name, plus the recv / args / block triple.
| Node | Fields used | Notes |
|---|---|---|
CallNode |
name (S), receiver (R), arguments (R), block (R), flags (I) |
Universal method call. flags bit 0 = &. safe navigation. |
ArgumentsNode |
arguments (A) |
Positional + keyword + splat args. |
KeywordHashNode |
elements (A) |
Trailing keyword hash literal. |
AssocNode |
key (R), value (R) |
One key=>value pair inside a Hash or kwargs. |
SplatNode |
expression (R) |
*args -- splat in args or LHS. |
BlockArgumentNode |
expression (R) |
&block -- pass block by name. |
ForwardingArgumentsNode |
(no fields) | def f(...); g(...); end shape. |
BlockParameterNode |
name (S) |
&blk in a method's parameter list. |
| Node | Fields used | Notes |
|---|---|---|
IntegerNode |
value (I) |
Decimal, hex, oct, bin. |
FloatNode |
value (F) |
Includes exponent forms. |
RationalNode / ImaginaryNode |
numeric (R) |
Wrapping number suffix. |
StringNode |
content (S) |
Literal string. |
SymbolNode |
content (S) |
Symbol literal. |
InterpolatedStringNode / InterpolatedSymbolNode |
parts (A) |
Each part is either a StringNode or an EmbeddedStatementsNode. |
EmbeddedStatementsNode |
statements (R) |
#{...} inside a string. |
RegularExpressionNode |
content (S), flags (I) |
/pat/. |
InterpolatedRegularExpressionNode |
parts (A), flags (I) |
/#{x}/. |
ArrayNode |
elements (A) |
[1, 2, 3]. |
HashNode |
elements (A) -- each an AssocNode |
{a: 1}. |
RangeNode |
left (R), right (R), flags (I) |
flags bit 1 = exclusive end. |
LambdaNode |
(see above) | -> x { x*2 }. |
TrueNode / FalseNode / NilNode / SelfNode |
(no fields) | Singleton literals. |
| Node | Fields used |
|---|---|
LocalVariableReadNode |
name (S) |
LocalVariableWriteNode |
name (S), value (R) |
LocalVariableTargetNode |
name (S) -- LHS of MultiWriteNode |
LocalVariableOperatorWriteNode |
name (S), value (R), operator (S) |
LocalVariableOrWriteNode / LocalVariableAndWriteNode |
name (S), value (R) |
InstanceVariableReadNode |
name (S) |
InstanceVariableWriteNode |
name (S), value (R) |
InstanceVariableOperatorWriteNode |
(same triple, plus operator S) |
ClassVariableReadNode / ClassVariableWriteNode |
name (S), value (R) |
GlobalVariableReadNode / GlobalVariableWriteNode |
name (S), value (R) |
ConstantReadNode |
name (S) |
ConstantPathNode |
parent (R), name (S) |
ConstantWriteNode |
name (S), value (R) |
MultiWriteNode |
targets (A), expression (R), rest (R) |
MultiTargetNode |
lefts (A), rest (R), rights (A) |
| Node | Fields used |
|---|---|
IndexOrWriteNode / IndexAndWriteNode / IndexOperatorWriteNode |
receiver (R), arguments (R), value (R), plus operator for op-write |
CallOperatorWriteNode / CallAndWriteNode / CallOrWriteNode |
receiver (R), read_name (S), write_name (S), value (R), operator (S) -- obj.x op= v shape |
(a[i] reads itself parse as a CallNode with name == "[]". The *OrWrite / *AndWrite / *OperatorWrite nodes only appear for the assignment shape.) |
| Node | Fields used | Notes |
|---|---|---|
ArrayPatternNode |
requireds (A), rest (R), posts (A) |
case x; in [a, b] |
HashPatternNode |
elements (A), rest (R) |
case x; in {a: 1} |
FindPatternNode |
left (R), requireds (A), right (R) |
case x; in [*, a, *] |
PinnedVariableNode |
variable (R) |
^var in a pattern. |
ImplicitNode |
value (R) |
Sugar elision (a: short-hand). |
| Node | Fields used |
|---|---|
ForwardingArgumentsNode |
(no fields) |
ForwardingParameterNode |
(no fields) |
The parser also injects synthetic CallNodes for built-in idioms that
Prism parses as separate constructs: catch / throw, defined?, and
the *-write shorthand operators. Those are documented at their
emit sites in src/spinel_parse.c.
A single reference. -1 means "no child" -- consumers must check before
indexing arrays by this id. A handful of field names appear repeatedly
across node types with a consistent meaning:
receiver-- the recv of a CallNode;-1for bare calls.arguments-- usually points at an ArgumentsNode, sometimes directly at the single expression (BreakNode / ReturnNode).block-- points at a BlockNode (literal block) or BlockArgumentNode (&blockforward) attached to a call;-1otherwise.value/expression-- the RHS of an assignment / write.statements-- the body of a control-flow construct.parameters-- a ParametersNode attached to a method / lambda / block.
A list of references, comma-separated, empty body allowed. The empty
string after <field> means "zero elements". Used for collections
whose size isn't fixed at parse time: statement bodies, argument lists,
WhenNode condition lists, hash element lists, etc.
The order is the source order. Consumers walk left-to-right via
get_args(args_id) (or parse_id_list for raw lists), so reordering
within an A record changes program semantics.
The string is escape_str-encoded (see "Encoding rules" above). Common fields:
name-- the method name on CallNode, the variable name onLocalVariableReadNode, etc.content-- the literal text ofStringNode/SymbolNode/RegularExpressionNode.operator--+=,-=,*=, etc. on the op-write nodes.pattern-- regex source forRegularExpressionNode.
Decimal integer, possibly signed. Used for IntegerNode.value (the
parsed literal), flags bitfields, and Prism's NumberedParametersNode
arity.
Float literal in printf("%.17g") form, always with a decimal point.
nt_load_text parses the text into a NodeTable: an array of SpNode
records indexed by node id (src/node_table.h). Unlike the retired Ruby
self-host -- which built one flat @nd_<field> array per attribute --
each SpNode stores its own small field lists (string / int / ref /
array), and the analyzer and code generator both query them by name
through the accessor API:
| Accessor | Reads |
|---|---|
nt_type(nt, id) |
Node type name (the N record); NULL if unset. |
nt_str(nt, id, "name") |
An S string attribute; NULL if absent. |
nt_int(nt, id, "value", dflt) |
An I integer attribute; dflt if absent. |
nt_ref(nt, id, "receiver") |
An R child reference; -1 if absent / empty. |
nt_arr(nt, id, "arguments", &n) |
An A child-id list and its length; NULL,0 if absent. |
nt_kind(nt, id) |
The cached NodeKind enum (a fast switch key vs. strcmp on the type string). |
A reference field absent from a node reads back as -1 and an absent
array as empty, so consumers check before indexing. The loader skips
records it does not recognise; an attribute the parser emits but no
accessor ever reads is simply inert. New parser fields are picked up
automatically by nt_str / nt_int / nt_ref / nt_arr once a
consumer asks for them by name -- there is no separate per-field loader
arm to keep in sync.
The analyzer's results do not live in the NodeTable; they live in
the Compiler struct (the per-node type cache plus the class / method /
scope tables). The NodeTable is the immutable parsed program; the
Compiler is the derived analysis. See
ANALYZE-IR.md.
- Integer overflow:
IntegerNode.valueis encoded as a signed 64-bit decimal. Literals outside[INT64_MIN, INT64_MAX]are truncated bypm_int_value(); the AOT compiler then auto-promotes to bigint at use sites it can recognise (loop multiplication, fibonacci-style addition). See README.md §Bigint. - Floating point:
%.17ground-trips IEEE-754 doubles but the surface text isn't human-readable for boundary cases. Codegen uses the literal text directly in C output, so the encoded form drives what the C compiler sees. - Encoding: bytes ≥ 0x80 pass through verbatim, so the parser assumes UTF-8. ASCII-7 sources are always safe; mixed-encoding sources are not supported (matches the README's "no encoding" limitation).
- Source locations: emitted. Each node carries
node_line/node_file/node_colinteger attributes and the file table is carried inSOURCE_FILE/FILErecords, so codegen can map a generated-C line back to the original.rb(the--line-map/--debug#linedirectives, and the--line-mapdiagnostics that re-point cc errors at Ruby source).
- docs/ANALYZE-IR.md -- the in-memory analyze ↔ codegen
contract (the
Compilerstruct that replaced the old.irfile). - docs/FFI.md -- direct C-call declarations.
src/spinel_parse.c-- the C front-end that produces this format.src/node_table.[ch]--nt_load_textand thent_*accessor API.src/analyze*.c-- type inference; fills theCompilernode type cache.src/codegen*.c-- C emission; reads theNodeTable+Compiler.