Experiment: Port MySQL-on-SQLite to LALR(1) parser #432

Draft
JanJakes wants to merge 9 commits into lalr-parser from lalr-parser-driver

@JanJakes

Member

Stacked on #429. This experiment ports the MySQL-on-SQLite driver end-to-end from the hand-written recursive parser to the LALR(1) parser, consumed as a proper Composer dependency. All driver tests pass: 543 tests / 7,172 assertions in mysql-on-sqlite, and 41 tests / 1,051 assertions in mysql-parser, including the 69,577-query MySQL server corpus pin. Net diff: +1,611 / −7,180 lines.

Composer-based package reuse

The wordpress/mysql-parser package gets a classmap Composer autoloader (the WordPress-style file naming rules out PSR-4) and exposes WP_MySQL_Parser::PARSE_TABLE_PATH, since the generated parse table is data the autoloader cannot cover. The wordpress/mysql-on-sqlite package requires it through a Composer path repository pointing at the monorepo sibling, so development has a single source of truth (a vendor symlink) and nothing is duplicated: the driver's old parser machinery — grammar, lexer, parse tree classes, and the native Rust parser fork bound to the old grammar contract — is removed entirely.

Parser changes the port surfaced

  • Empty reductions produce no AST nodes. Empty optionals (opt_*) and Bison's mid-rule action rules ($@N) carry no information, so they no longer appear in the tree, and consumers see an optional clause only when it is present.
  • WP_Parser_Node::get_flattened_child_nodes() iterates left-recursive grammar lists (list: list ',' item) as if they were flat.
  • ANSI_QUOTES SQL mode in the lexer, with a driver-side parse retry. MySQL rejects double-quoted identifiers without ANSI_QUOTES, but WordPress relies on them (dbDelta can produce double-quoted index names) and the previous parser accepted them. The retry accepts these statements while preserving string-literal semantics for everything that parses without it.
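To illustrate the flattened list accessor, here is a minimal sketch of how a left-recursive list can be walked as if it were flat. It is not the actual WP_Parser_Node implementation: the Sketch_Node class and the sketch_flatten_child_nodes() function are hypothetical stand-ins, and the sketch assumes that tokens (such as the ',' separator) are skipped and only child nodes are collected.

```php
<?php
// Hypothetical stand-in for a parse tree node; not the real WP_Parser_Node.
final class Sketch_Node {
	public $rule_name;
	public $children; // Child nodes, or plain strings standing in for tokens.

	public function __construct( string $rule_name, array $children ) {
		$this->rule_name = $rule_name;
		$this->children  = $children;
	}
}

/**
 * Collect the child nodes of a left-recursive list in source order.
 *
 * For "list: list ',' item | item", a child with the same rule name is a
 * nested list; its items are spliced in place instead of returning the node.
 */
function sketch_flatten_child_nodes( Sketch_Node $node ): array {
	$items = array();
	foreach ( $node->children as $child ) {
		if ( ! $child instanceof Sketch_Node ) {
			continue; // A token, e.g. the ',' separator.
		}
		if ( $child->rule_name === $node->rule_name ) {
			$items = array_merge( $items, sketch_flatten_child_nodes( $child ) );
		} else {
			$items[] = $child;
		}
	}
	return $items;
}

// The list "a, b, c" nested the way a left-recursive rule builds it: ((a), b), c.
$a    = new Sketch_Node( 'item', array( 'a' ) );
$b    = new Sketch_Node( 'item', array( 'b' ) );
$c    = new Sketch_Node( 'item', array( 'c' ) );
$list = new Sketch_Node(
	'list',
	array(
		new Sketch_Node( 'list', array( new Sketch_Node( 'list', array( $a ) ), ',', $b ) ),
		',',
		$c,
	)
);

foreach ( sketch_flatten_child_nodes( $list ) as $item ) {
	echo $item->children[0], "\n"; // Prints a, b, c on separate lines.
}
```

Because the nesting goes through the first child, the recursion reaches the leftmost item first, so items come out in source order without any reversal step.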

Driver port

The statement dispatch, the query translation layer, and the information schema builder are re-keyed to the official sql_yacc.yy rule names and tree shapes. Multi-statement input is split on top-level ; separators, as the grammar parses a single statement (this is how MySQL clients split multi-statement input); the old create_parser()/next_query() API is replaced by parse_mysql_query().
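The splitting step can be sketched as follows. This is only an illustration under stated assumptions, not the driver's code: the token shape (arrays with 'type' and 'text' keys), the 'SEMICOLON' type name, and the sketch_split_statements() function are all hypothetical. It relies on the input being already lexed, so a ';' inside a string literal is part of a string token and never acts as a separator. It does not handle any nesting the real driver may have to account for.

```php
<?php
/**
 * Split a flat token list into statements on ';' separator tokens.
 *
 * Tokens are hypothetical array( 'type' => ..., 'text' => ... ) pairs.
 * Empty statements (e.g. from ";;") are dropped.
 */
function sketch_split_statements( array $tokens ): array {
	$statements = array();
	$current    = array();
	foreach ( $tokens as $token ) {
		if ( 'SEMICOLON' === $token['type'] ) {
			if ( $current ) {
				$statements[] = $current;
			}
			$current = array();
			continue;
		}
		$current[] = $token;
	}
	if ( $current ) {
		$statements[] = $current; // The last statement needs no trailing ';'.
	}
	return $statements;
}

// Tokens for: SELECT 'a;b'; SELECT 2
$tokens = array(
	array( 'type' => 'KEYWORD', 'text' => 'SELECT' ),
	array( 'type' => 'STRING', 'text' => "'a;b'" ),
	array( 'type' => 'SEMICOLON', 'text' => ';' ),
	array( 'type' => 'KEYWORD', 'text' => 'SELECT' ),
	array( 'type' => 'NUMBER', 'text' => '2' ),
);

foreach ( sketch_split_statements( $tokens ) as $statement ) {
	echo implode( ' ', array_column( $statement, 'text' ) ), "\n";
}
```

Each resulting token group can then be handed to the single-statement parser on its own.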

The information schema builder was verified byte-exact against the old parser and builder over a DDL battery covering all supported data types, constraints, indexes, and table options. Multi-column ADD COLUMN (a INT, b INT) is now recorded correctly; the previous builder crashed on it.

Deployment and CI

The WordPress Docker environments install the driver's Composer dependencies and mount the package vendor directory and the parser package into the containers. The plugin zip build bundles the driver's production dependencies, resolving the path-repository symlink into a real, pruned copy of the parser package. The driver test workflow now also triggers on parser package changes, and the native parser extension jobs and scripts are removed (packages/php-ext-wp-mysql-parser is orphaned by this branch).

Testing

cd packages/mysql-on-sqlite && composer install && composer run test
cd packages/mysql-parser && composer install && composer run test
composer run build-sqlite-plugin-zip

@github-actions

github-actions Bot commented Jun 12, 2026

Contributor

🤖 Lexer benchmark

Changes to lexer-related files were detected and triggered a benchmark:

| Config      | Base (QPS) | This PR (QPS) | Speedup |
|-------------|------------|---------------|---------|
| no JIT      | 74,170     | 55,056        | 0.74×   |
| tracing JIT | 152,834    | 114,823       | 0.75×   |

Note: Hosted runners are noisy, and absolute numbers vary. Treat the results with caution and verify them locally.

To reproduce locally:

cd packages/mysql-on-sqlite && composer run bench-lexer

@JanJakes JanJakes force-pushed the lalr-parser-driver branch 4 times, most recently from 6ede829 to bee436b Compare June 12, 2026 09:01
@JanJakes JanJakes force-pushed the lalr-parser-driver branch from bee436b to 076e3db Compare June 12, 2026 09:50
@JanJakes JanJakes force-pushed the lalr-parser-driver branch from 076e3db to 40a90cf Compare June 12, 2026 14:39
@JanJakes JanJakes force-pushed the lalr-parser-driver branch from 40a90cf to e75997a Compare June 12, 2026 14:56
@JanJakes JanJakes force-pushed the lalr-parser-driver branch from e75997a to 982dd0f Compare June 12, 2026 15:28
@JanJakes JanJakes force-pushed the lalr-parser-driver branch from 982dd0f to 7730323 Compare June 12, 2026 19:09
@JanJakes JanJakes force-pushed the lalr-parser-driver branch from 7730323 to 9504ce3 Compare June 12, 2026 19:19
@JanJakes JanJakes force-pushed the lalr-parser-driver branch from 9504ce3 to d5ddec4 Compare June 12, 2026 20:36
@JanJakes JanJakes force-pushed the lalr-parser-driver branch from d5ddec4 to b59b50b Compare June 13, 2026 13:08
@JanJakes JanJakes force-pushed the lalr-parser-driver branch from b59b50b to 9aa6c4e Compare June 13, 2026 13:19
JanJakes added 7 commits June 13, 2026 16:10
The parse table is data, not a class, so the Composer autoloader does not
cover it. Expose its path as a class constant, so consumers can load it
without depending on the package file layout:
new WP_MySQL_Parser( require WP_MySQL_Parser::PARSE_TABLE_PATH ).

A reduction with no children carries no information: empty optionals
(opt_*) and Bison's mid-rule action rules ($@n) only add noise to the
tree. Produce no node for them, so consumers see an optional clause
only when it is present.

Left-recursive grammar list rules nest through their own rule name
("list: list ',' item | item"). The new accessor collects child nodes
of the whole nested chain in source order, as if the list were flat,
which is how AST consumers want to iterate list items.

With the ANSI_QUOTES SQL mode, MySQL treats double-quoted text as a
quoted identifier instead of a string literal. Emit an identifier token
for it, so identifier positions accept double-quoted names.

Replace the hand-written recursive parser with the table-driven LALR(1)
parser generated from MySQL's official grammar, consumed as a Composer
dependency:

- Require wordpress/mysql-parser, resolved from the monorepo sibling
  package via a Composer path repository, and load it through the
  Composer autoloader in the driver loader.
- Drop the old parser machinery (WP_Parser, WP_Parser_Grammar, the
  lexer, the parse tree classes, and mysql-grammar.php), all provided
  by the parser package now, and the native parser fork, which is bound
  to the old grammar contract.
- Parse multi-statement input by splitting the token stream on top-level
  ';' separators, as the grammar parses a single statement (this is how
  MySQL clients split multi-statement input).
- Re-key the statement dispatch to the sql_yacc.yy rule names and map
  keyword token constants to the grammar keyword table.

The translation layer still needs to be ported to the new AST shapes.

Re-key the SQL-to-SQLite translation from the old hand-written grammar
to the sql_yacc.yy rule names and tree shapes:

- Rewrite the translate() special cases and per-statement handlers
  (SELECT, INSERT/REPLACE, UPDATE, DELETE, DDL, SHOW, SET, USE,
  transactions and locking, administration statements).
- Iterate grammar lists with the flattened child node accessor, as
  lists are left-recursive in the new grammar.
- Walk JOINs recursively when building the table reference map, as
  joins nest through the left operand in the new grammar.
- Retry parsing with the ANSI_QUOTES SQL mode when a query fails to
  parse. MySQL rejects double-quoted identifiers without ANSI_QUOTES,
  but WordPress relies on them (dbDelta can produce double-quoted index
  names) and the previous parser accepted them.
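The retry described in the last item can be sketched like this. It is a minimal illustration, not the driver's implementation: sketch_parse_with_retry() and the $parse callable signature are hypothetical, and the toy parser below only mimics the relevant behavior (rejecting a double-quoted identifier unless ANSI_QUOTES is on).

```php
<?php
/**
 * Parse a query, retrying once with ANSI_QUOTES enabled on failure.
 *
 * $parse is a hypothetical callable( string $sql, bool $ansi_quotes ) that
 * returns an AST (any non-null value), or null when the query does not parse.
 */
function sketch_parse_with_retry( callable $parse, string $sql, bool $ansi_quotes = false ) {
	$ast = $parse( $sql, $ansi_quotes );
	if ( null !== $ast || $ansi_quotes ) {
		return $ast; // Parsed, or ANSI_QUOTES was already on: nothing to retry.
	}
	return $parse( $sql, true );
}

// Toy parser: a double-quoted name after KEY only parses with ANSI_QUOTES.
$toy_parse = function ( string $sql, bool $ansi_quotes ) {
	$has_quoted_identifier = 1 === preg_match( '/\bKEY\s+"/', $sql );
	if ( $has_quoted_identifier && ! $ansi_quotes ) {
		return null;
	}
	return array(
		'sql'         => $sql,
		'ansi_quotes' => $ansi_quotes,
	);
};

// Parses on the first attempt, so "text" keeps string-literal semantics.
var_dump( sketch_parse_with_retry( $toy_parse, 'SELECT "text"' )['ansi_quotes'] );

// Fails first, then parses on the retry with ANSI_QUOTES enabled.
var_dump( sketch_parse_with_retry( $toy_parse, 'ALTER TABLE t ADD KEY "idx" (a)' )['ansi_quotes'] );
```

Trying the default mode first is what keeps double-quoted text a string literal wherever the statement is valid without ANSI_QUOTES; the stricter interpretation is used only as a fallback.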
Re-key CREATE TABLE, ALTER TABLE, and index statement analysis to the
sql_yacc.yy rule names and tree shapes. The recorded information schema
rows are unchanged: a battery of DDL statements covering all supported
data types, constraints, indexes, and table options produces the exact
same rows as the previous parser and builder.

Multi-column ADD COLUMN (a INT, b INT) is now recorded correctly; the
previous builder crashed on it.

JanJakes added 2 commits June 13, 2026 16:10
The lexer, parser, token data, and parse tree classes are tested in the
wordpress/mysql-parser package now:

- Remove the lexer and parser test suites from the driver package (the
  corpus data stays here; the parser package corpus test reads it from
  the sibling package and skips when it is not available).
- Move the parse tree node tests to the parser package and cover the
  new flattened child node accessor.
- Remove the native parser extension tests and tools, which are bound
  to the old grammar contract.
- Update the AST dump and benchmark tools to the new parser API.

The SQLite driver now loads the MySQL parser as a Composer dependency,
and the native parser extension bound to the old grammar is gone:

- Install the driver Composer dependencies in the WordPress test setup
  and mount the package vendor directory and the parser package into
  the WordPress containers.
- Bundle the driver's production Composer dependencies into the plugin
  zip, resolving the path-repository symlink into a real copy of the
  parser package.
- Run the driver test workflow against changes to the parser package
  and drop the native parser extension jobs and setup scripts.
- Install the driver Composer dependencies in the lexer benchmark
  workflow.