fix(parser): stop DOCX parser dropping bare-integer text runs (#120) by thewrz · Pull Request #121 · wrzonance/SpecR

thewrz · 2026-06-08T16:18:02Z

Fixes #120.

Problem

The DOCX parser silently dropped any Word text run whose entire content was a bare integer. Word splits numbers across runs at edit/rsid/spell-check boundaries, so a CSI section like 09 91 26 — stored as runs ["09 ", "9", "1 26"] — lost its bare "9" run and rendered as 09 1 26. This affected any standalone integer in DOCX body text (quantities, years, dimensions), and corrupted section numbers feeding cross-reference resolution.

Root cause

src/parser/docx/document.ts built its XMLParser without parseTagValue: false. fast-xml-parser therefore coerced <w:t>9</w:t> to the JS number 9; extractRunText() only handles string / object-with-#text and fell through to return '', deleting the run.
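The failure mode described above can be sketched in isolation. This is a hypothetical reconstruction, not the project's code: `extractRunTextSketch`, `ParsedText`, and `joinRuns` are illustrative names, and the real `extractRunText` in src/parser/docx/document.ts may be shaped differently. The arrays stand in for what the XML parser would hand back with and without tag-value coercion.

```typescript
// Hypothetical reconstruction of the failure mode; names and shapes are
// assumptions, not the project's actual extractRunText.
type ParsedText = string | number | { "#text"?: unknown } | undefined;

function extractRunTextSketch(t: ParsedText): string {
  if (typeof t === "string") return t;
  if (typeof t === "object" && t !== null) {
    const text = t["#text"];
    if (typeof text === "string") return text;
  }
  // Numbers (and anything else) fall through to here, so the run vanishes.
  return "";
}

// With tag-value coercion on, the bare "9" run arrives as the number 9.
const coercedRuns: ParsedText[] = ["09 ", 9, "1 26"];
// With <w:t> content kept as strings, every run survives.
const stringRuns: ParsedText[] = ["09 ", "9", "1 26"];

const joinRuns = (runs: ParsedText[]): string =>
  runs.map(extractRunTextSketch).join("");

console.log(joinRuns(coercedRuns)); // "09 1 26"
console.log(joinRuns(stringRuns)); // "09 91 26"
```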

Fix

Add parseTagValue: false — <w:t> is document text and must never be type-coerced (same family as the existing processEntities: true / trimValues: false settings). Attribute-derived numbers (numId, ilvl, w:ind/@w:left, w:outlineLvl) are parsed via getAttrNumVal/parseInt and are unaffected.
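As a rough sketch of the settings named above: the three option names are real fast-xml-parser options, but the rest of the project's XMLParser configuration is not shown in this PR and is left out. `docxParserOptions` and `attrToNumber` are hypothetical names; `attrToNumber` only stands in for the project's `getAttrNumVal`, whose real signature is not shown here. It assumes attribute values arrive as strings, which is fast-xml-parser's default behaviour when `parseAttributeValue` is not enabled.

```typescript
// Option names are real fast-xml-parser options; the object name and the
// omission of other settings are assumptions for illustration.
const docxParserOptions = {
  processEntities: true, // decode XML entities such as &amp; in text
  trimValues: false, // keep leading/trailing whitespace inside <w:t>
  parseTagValue: false, // keep tag text as strings: "9" stays "9", not 9
};

// Hypothetical stand-in for getAttrNumVal: attribute strings are converted
// to numbers explicitly, at the point of use, so the setting above does not
// affect them.
function attrToNumber(raw: string | undefined): number | undefined {
  if (raw === undefined) return undefined;
  const n = parseInt(raw, 10);
  return Number.isNaN(n) ? undefined : n;
}

console.log(docxParserOptions.parseTagValue); // false
console.log(attrToNumber("720")); // 720
console.log(attrToNumber(undefined)); // undefined
```

An options object like this would typically be passed as `new XMLParser(docxParserOptions)`. Keeping numeric conversion at the call site means only values known to be numbers are ever parsed as numbers.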

Test plan

pnpm test → 49 files, 589 unit tests pass.
New regression test in src/parser/docx/document.test.ts pins the exact symptom: runs ["09 ", "9", "1 26"] → 09 91 26 (was 09 1 26). Verified RED before the fix, GREEN after.
pnpm lint (eslint + tsc --noEmit + prettier) clean.

Scope / out of scope

One-line parser fix + one regression test. document.ts is identical on main and mockup; this also lands on mockup for the live demo.
NOT included: the separate client-side staleness in public/js/tree.js (its section regex has drifted from src/lib/section-number.ts). Noted in #120 (DOCX parser silently drops bare-integer text runs, e.g. 09 91 26 → 09 1 26) for a follow-up issue.

🤖 Generated with Claude Code

Summary by CodeRabbit

Bug Fixes
- Fixed an issue where numeric sequences split across runs in Word documents would incorrectly lose digits during parsing (e.g., "09 91 26" now correctly preserves all numbers).
Tests
- Added regression test to ensure numeric sequences remain intact when parsing Word documents with distributed text runs.

fast-xml-parser coerced numeric <w:t> content to JS numbers, so a run whose text was a bare integer (<w:t>9</w:t>) became the number 9. extractRunText handles only string / object-with-#text and silently dropped it, deleting digits from numbers Word splits across runs: "09 91 26" stored as ["09 ", "9", "1 26"] rendered as "09 1 26".

Set parseTagValue: false so w:t stays a string. Attribute-derived numbers (numId, ilvl, w:ind) parse via getAttrNumVal/parseInt and are unaffected. Regression test pins the exact run-split symptom.

Fixes #120

coderabbitai · 2026-06-08T16:19:13Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: 2c695044-d238-4e14-b34b-1fc1ba0af3eb

📥 Commits

Reviewing files that changed from the base of the PR and between 7e45816 and 857f738.

📒 Files selected for processing (2)

src/parser/docx/document.test.ts
src/parser/docx/document.ts

📝 Walkthrough

This PR fixes a bug in the DOCX parser where bare-integer text runs were silently dropped. When Word splits numeric sequences like CSI section numbers across multiple runs, the parser was type-coercing runs containing only digits into numbers, which were then dropped during text extraction. The fix configures the XMLParser to keep text runs as strings and validates the correction with a regression test.

Changes

DOCX Parser Numeric Run Handling

Layer / File(s)	Summary
Parser configuration fix and regression test `src/parser/docx/document.ts`, `src/parser/docx/document.test.ts`	XMLParser config updated to set `parseTagValue: false`, preventing fast-xml-parser from coercing `<w:t>` text node values to numbers. Documentation clarifies the effect of `trimValues: false` (whitespace preservation) and the new setting (string preservation). Regression test constructs DOCX XML with numeric sequence split across runs and verifies correct concatenation into `'09 91 26'`.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Poem

A rabbit once parsed a text run,
But numerals turned into one—
A bare digit 9 just vanished away,
Till parseTagValue: false saved the day! 🐰✨

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	Title accurately and concisely describes the main fix—preventing DOCX parser from dropping bare-integer text runs—matching the changeset's core purpose.
Linked Issues check	✅ Passed	All objectives from issue `#120` are met: parseTagValue: false added to XMLParser config, extractRunText preserves bare-integer runs, regression test included verifying the fix, and attribute-derived numeric parsing unchanged.
Out of Scope Changes check	✅ Passed	All changes directly address issue `#120`: one-line parser fix in document.ts and regression test in document.test.ts. No unrelated modifications present.
Docstring Coverage	✅ Passed	No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.


thewrz merged commit a844386 into main Jun 8, 2026
5 checks passed

thewrz deleted the fix/docx-numeric-run-drop branch June 8, 2026 16:42

thewrz merged 1 commit into main from fix/docx-numeric-run-drop
