fix(parser): stop DOCX parser dropping bare-integer text runs (#120) #121

Merged
thewrz merged 1 commit into main from fix/docx-numeric-run-drop on Jun 8, 2026

thewrz commented Jun 8, 2026

Contributor

Fixes #120.

Problem

The DOCX parser silently dropped any Word text run whose entire content was a bare integer. Word splits numbers across runs at edit/rsid/spell-check boundaries, so a CSI section like 09 91 26 — stored as runs ["09 ", "9", "1 26"] — lost its bare "9" run and rendered as 09 1 26. This affected any standalone integer in DOCX body text (quantities, years, dimensions), and corrupted section numbers feeding cross-reference resolution.

Root cause

src/parser/docx/document.ts built its XMLParser without parseTagValue: false. fast-xml-parser therefore coerced <w:t>9</w:t> to the JS number 9; extractRunText() only handles string / object-with-#text and fell through to return '', deleting the run.
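
The fall-through described above can be modeled without the real parser. This is a minimal sketch only: `extractRunTextSketch` and `joinRunsSketch` are hypothetical stand-ins (the actual `extractRunText()` body is not shown in this PR), and the input arrays simply mirror the values the description says each `<w:t>` produced before and after the fix.

```typescript
// Hypothetical stand-in for extractRunText(); the real function is not shown in this PR.
// It mirrors the described behavior: accept a string or an object with '#text', else ''.
function extractRunTextSketch(t: unknown): string {
  if (typeof t === 'string') return t;
  if (typeof t === 'object' && t !== null) {
    const inner = (t as Record<string, unknown>)['#text'];
    if (typeof inner === 'string') return inner;
  }
  return ''; // a coerced JS number lands here and the run's text is lost
}

// Concatenate the text of consecutive runs, as paragraph assembly would.
function joinRunsSketch(values: unknown[]): string {
  return values.map(extractRunTextSketch).join('');
}

// Values as described in the PR: the bare "9" run arrives as the number 9.
const coercedRuns: unknown[] = ['09 ', 9, '1 26'];
// Values once <w:t> content stays a string.
const stringRuns: unknown[] = ['09 ', '9', '1 26'];

console.log(joinRunsSketch(coercedRuns)); // "09 1 26"
console.log(joinRunsSketch(stringRuns)); // "09 91 26"
```

The sketch also shows why the failure was silent: the empty-string fallback is indistinguishable from a genuinely empty run.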

Fix

Add parseTagValue: false. <w:t> is document text and must never be type-coerced (same family as the existing processEntities: true / trimValues: false settings). Attribute-derived numbers (numId, ilvl, w:ind/@w:left, w:outlineLvl) are parsed via getAttrNumVal/parseInt and are unaffected.
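
As a rough illustration, the three settings named above could sit together in one options object. This is a sketch, not the contents of document.ts: the name `docxParserOptionsSketch` is hypothetical, and the real config may carry other options that this PR does not mention.

```typescript
// Hypothetical options object; only these three keys are named in the PR description.
const docxParserOptionsSketch = {
  processEntities: true, // existing setting: decode XML entities in text
  trimValues: false, // existing setting: keep leading/trailing whitespace in runs
  parseTagValue: false, // the fix: leave tag text such as <w:t>9</w:t> as the string "9"
};

// Assumed usage with fast-xml-parser, left as a comment to keep the sketch dependency-free:
//   import { XMLParser } from 'fast-xml-parser';
//   const parser = new XMLParser(docxParserOptionsSketch);

console.log(docxParserOptionsSketch.parseTagValue); // false
```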

Test plan

  • pnpm test → 49 files, 589 unit tests pass.
  • New regression test in src/parser/docx/document.test.ts pins the exact symptom: runs ["09 ", "9", "1 26"] → 09 91 26 (was 09 1 26). Verified RED before the fix, GREEN after.
  • pnpm lint (eslint + tsc --noEmit + prettier) clean.
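
The shape of the regression fixture can be sketched as below. Everything here is an assumption for illustration: the XML string, the `xml:space="preserve"` attribute, and `collectRunTextsSketch` (a regex stand-in, not the project's parser or its test helpers) are not taken from document.test.ts.

```typescript
// Fixture shaped like the symptom in the PR: one section number split across three runs.
const splitRunsXmlSketch =
  '<w:p>' +
  '<w:r><w:t xml:space="preserve">09 </w:t></w:r>' +
  '<w:r><w:t>9</w:t></w:r>' +
  '<w:r><w:t>1 26</w:t></w:r>' +
  '</w:p>';

// Stand-in extractor (an assumption, NOT the project's parser):
// reads each <w:t> element's content as a raw string, with no type coercion.
function collectRunTextsSketch(xml: string): string[] {
  const pattern = /<w:t(?:\s[^>]*)?>([^<]*)<\/w:t>/g;
  const texts: string[] = [];
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(xml)) !== null) {
    texts.push(match[1] ?? '');
  }
  return texts;
}

const paragraphTextSketch = collectRunTextsSketch(splitRunsXmlSketch).join('');
console.log(paragraphTextSketch); // "09 91 26"
```

A real test would feed the same fixture through the project's parser and assert on the concatenated paragraph text.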

Scope / out of scope

🤖 Generated with Claude Code

Summary by CodeRabbit

  • Bug Fixes

    • Fixed an issue where numeric sequences split across runs in Word documents would incorrectly lose digits during parsing (e.g., "09 91 26" now correctly preserves all numbers).
  • Tests

    • Added regression test to ensure numeric sequences remain intact when parsing Word documents with distributed text runs.

fast-xml-parser coerced numeric <w:t> content to JS numbers, so a run
whose text was a bare integer (<w:t>9</w:t>) became the number 9.
extractRunText handles only string / object-with-#text and silently
dropped it, deleting digits from numbers Word splits across runs —
"09 91 26" stored as ["09 ", "9", "1 26"] rendered as "09 1 26".

Set parseTagValue: false so w:t stays a string. Attribute-derived
numbers (numId, ilvl, w:ind) parse via getAttrNumVal/parseInt and are
unaffected. Regression test pins the exact run-split symptom.

Fixes #120

coderabbitai Bot commented Jun 8, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: 2c695044-d238-4e14-b34b-1fc1ba0af3eb

📥 Commits

Reviewing files that changed from the base of the PR and between 7e45816 and 857f738.

📒 Files selected for processing (2)
  • src/parser/docx/document.test.ts
  • src/parser/docx/document.ts

📝 Walkthrough

This PR fixes a bug in the DOCX parser where bare-integer text runs were silently dropped. When Word splits numeric sequences like CSI section numbers across multiple runs, the parser was type-coercing runs containing only digits into numbers, which were then dropped during text extraction. The fix configures the XMLParser to keep text runs as strings and validates the correction with a regression test.

Changes

DOCX Parser Numeric Run Handling

| Layer / File(s) | Summary |
| --- | --- |
| Parser configuration fix and regression test: src/parser/docx/document.ts, src/parser/docx/document.test.ts | XMLParser config updated to set parseTagValue: false, preventing fast-xml-parser from coercing <w:t> text node values to numbers. Documentation clarifies the effect of trimValues: false (whitespace preservation) and the new setting (string preservation). Regression test constructs DOCX XML with numeric sequence split across runs and verifies correct concatenation into '09 91 26'. |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Poem

A rabbit once parsed a text run,
But numerals turned into one—
A bare digit 9 just vanished away,
Till parseTagValue: false saved the day! 🐰✨

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | Title accurately and concisely describes the main fix (preventing the DOCX parser from dropping bare-integer text runs), matching the changeset's core purpose. |
| Linked Issues check | ✅ Passed | All objectives from issue #120 are met: parseTagValue: false added to XMLParser config, extractRunText preserves bare-integer runs, regression test included verifying the fix, and attribute-derived numeric parsing unchanged. |
| Out of Scope Changes check | ✅ Passed | All changes directly address issue #120: one-line parser fix in document.ts and regression test in document.test.ts. No unrelated modifications present. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Comment @coderabbitai help to get the list of available commands and usage tips.

thewrz merged commit a844386 into main on Jun 8, 2026
5 checks passed
thewrz deleted the fix/docx-numeric-run-drop branch June 8, 2026 16:42
Development

Successfully merging this pull request may close these issues.

DOCX parser silently drops bare-integer text runs (e.g. 09 91 26 → 09 1 26)
