fix(parser): stop DOCX parser dropping bare-integer text runs (#120) #121
fast-xml-parser coerced numeric <w:t> content to JS numbers, so a run whose text was a bare integer (<w:t>9</w:t>) became the number 9. extractRunText handles only string / object-with-#text and silently dropped it, deleting digits from numbers Word splits across runs — "09 91 26" stored as ["09 ", "9", "1 26"] rendered as "09 1 26". Set parseTagValue: false so w:t stays a string. Attribute-derived numbers (numId, ilvl, w:ind) parse via getAttrNumVal/parseInt and are unaffected. Regression test pins the exact run-split symptom. Fixes #120
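The drop described above can be reproduced without the XML library at all. The following is a minimal sketch, not the project's code: `extractRunTextSketch` and `joinRuns` are hypothetical stand-ins that model only the two shapes the description says `extractRunText` handles, and the arrays stand for the values a parser would hand over with and without tag-value coercion.

```typescript
// Hypothetical stand-in for the real extractRunText: it models only the two
// handled shapes named in the description (string, object with '#text').
function extractRunTextSketch(node: unknown): string {
  if (typeof node === 'string') return node;
  if (typeof node === 'object' && node !== null && '#text' in node) {
    const text = (node as Record<string, unknown>)['#text'];
    // Assumption: a non-string '#text' is treated as empty in this sketch.
    return typeof text === 'string' ? text : '';
  }
  // A JS number such as 9 matches neither branch and is silently lost here.
  return '';
}

// Hypothetical helper: concatenate the text of consecutive runs.
function joinRuns(runs: unknown[]): string {
  return runs.map(extractRunTextSketch).join('');
}

// With tag values coerced, the middle run arrives as the number 9.
const coercedRuns: unknown[] = ['09 ', 9, '1 26'];
// With coercion disabled, every run is still a string.
const stringRuns: unknown[] = ['09 ', '9', '1 26'];

console.log(JSON.stringify(joinRuns(coercedRuns))); // "09 1 26"
console.log(JSON.stringify(joinRuns(stringRuns))); // "09 91 26"
```

The sketch suggests why the fix lives in the parser configuration rather than in the extractor: once the text has become a number, the extractor can no longer tell what the original characters were.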
Walkthrough
This PR fixes a bug in the DOCX parser where bare-integer text runs were silently dropped. When Word splits numeric sequences like CSI section numbers across multiple runs, the parser was type-coercing runs containing only digits into numbers, which were then dropped during text extraction. The fix configures the XMLParser to keep text runs as strings and validates the correction with a regression test.
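Keeping element text as strings does not have to change how numeric attributes are read, since the PR description says those go through `getAttrNumVal`/`parseInt`. A minimal sketch of that kind of explicit conversion follows. The helper name `getAttrNumValSketch`, the `XmlNode` type, and the `@_` attribute prefix are assumptions made for illustration; the real helper's signature is not shown in this PR.

```typescript
// Hypothetical node shape: attributes stored as string values under '@_' keys.
type XmlNode = Record<string, unknown>;

// Hypothetical sketch of a getAttrNumVal-style helper: the attribute is
// converted explicitly with parseInt, so it does not rely on the XML parser
// coercing anything.
function getAttrNumValSketch(
  node: XmlNode | undefined,
  attrName: string,
): number | undefined {
  const raw = node?.[`@_${attrName}`];
  if (typeof raw !== 'string' && typeof raw !== 'number') return undefined;
  const parsed = parseInt(String(raw), 10);
  return Number.isNaN(parsed) ? undefined : parsed;
}

// Illustrative nodes only; the values are made up for the example.
const ilvlNode: XmlNode = { '@_w:val': '2' };
const indNode: XmlNode = { '@_w:left': '720' };

console.log(getAttrNumValSketch(ilvlNode, 'w:val')); // 2
console.log(getAttrNumValSketch(indNode, 'w:left')); // 720
console.log(getAttrNumValSketch(indNode, 'w:right')); // undefined
```

Because the conversion happens at the call site, a helper like this behaves the same whether the parser leaves values as strings or not.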
Fixes #120.
Problem
The DOCX parser silently dropped any Word text run whose entire content was a bare integer. Word splits numbers across runs at edit/rsid/spell-check boundaries, so a CSI section like "09 91 26", stored as runs ["09 ", "9", "1 26"], lost its bare "9" run and rendered as "09 1 26". This affected any standalone integer in DOCX body text (quantities, years, dimensions), and corrupted section numbers feeding cross-reference resolution.
Root cause
src/parser/docx/document.ts built its XMLParser without parseTagValue: false. fast-xml-parser therefore coerced <w:t>9</w:t> to the JS number 9; extractRunText() only handles string / object-with-#text and fell through to return '', deleting the run.
Fix
Add parseTagValue: false, since <w:t> is document text and must never be type-coerced (same family as the existing processEntities: true / trimValues: false settings). Attribute-derived numbers (numId, ilvl, w:ind/@w:left, w:outlineLvl) are parsed via getAttrNumVal/parseInt and are unaffected.
Test plan
pnpm test → 49 files, 589 unit tests pass.
src/parser/docx/document.test.ts pins the exact symptom: runs ["09 ", "9", "1 26"] → "09 91 26" (was "09 1 26"). Verified RED before the fix, GREEN after.
pnpm lint (eslint + tsc --noEmit + prettier) clean.
Scope / out of scope
document.ts is identical on main and mockup; this also lands on mockup for the live demo.
public/js/tree.js (its section regex has drifted from src/lib/section-number.ts). Noted in "DOCX parser silently drops bare-integer text runs (e.g. 09 91 26 → 09 1 26)" #120 for a follow-up issue.
🤖 Generated with Claude Code