From the 2026-06-10 codebase-health review. Priority: MED (Worth exploring). Related: #3, #4 (both live inside this file's concerns).
src/data/loaders/mail_archive_x.py bundles three concerns with different change cadences in one class:
- HTML→text extraction (_HTMLTextExtractor + supporting regexes)
- EML parsing / header decoding
- reply-chain stripping (6 regex patterns)
Each can only be reached (and tested) through the whole loader, and src/data/loaders/azure_blob.py:13 couples to the entire bundle just to reuse the EML parsing.
Fix: extract src/data/html_text.py (html_to_text(s)) and optionally src/data/reply_strip.py (strip_replies(s)); the loader keeps file discovery + orchestration. Local and Azure loaders then both depend on the shared parsing modules instead of Azure delegating into Local. Each extracted module is deep (a small interface over fiddly regex implementations) and unit-testable in isolation, which would also make #3 and #4 easier to fix safely.
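A rough sketch of what the two extracted modules could look like, shown in one block for brevity. Only the names html_to_text, strip_replies and _HTMLTextExtractor come from this issue; the tag sets, the whitespace handling and the two reply-marker patterns are illustrative assumptions, not the six patterns currently in the loader.

```python
import re
from html.parser import HTMLParser

# --- would live in src/data/html_text.py (contents are assumptions) ---
_SKIP_TAGS = {"script", "style"}             # content dropped entirely
_BLOCK_TAGS = {"p", "div", "br", "li", "tr"}  # rendered as line breaks


class _HTMLTextExtractor(HTMLParser):
    """Collects visible text, skipping script/style content."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in _SKIP_TAGS:
            self._skip_depth += 1
        elif tag in _BLOCK_TAGS:
            self._chunks.append("\n")

    def handle_endtag(self, tag):
        if tag in _SKIP_TAGS and self._skip_depth:
            self._skip_depth -= 1
        elif tag in _BLOCK_TAGS:
            self._chunks.append("\n")

    def handle_data(self, data):
        if not self._skip_depth:
            self._chunks.append(data)

    def text(self):
        return "".join(self._chunks)


def html_to_text(s: str) -> str:
    """Small public interface: HTML string in, plain text out."""
    parser = _HTMLTextExtractor()
    parser.feed(s)
    parser.close()
    # Collapse runs of spaces/tabs and drop empty lines.
    lines = (re.sub(r"[ \t]+", " ", ln).strip() for ln in parser.text().splitlines())
    return "\n".join(ln for ln in lines if ln)


# --- would live in src/data/reply_strip.py (patterns are placeholders) ---
_REPLY_MARKERS = [
    re.compile(r"^On .+ wrote:\s*$", re.MULTILINE),
    re.compile(r"^-{2,}\s*Original Message\s*-{2,}\s*$", re.MULTILINE | re.IGNORECASE),
]


def strip_replies(s: str) -> str:
    """Cut the text at the earliest reply marker, if any."""
    cut = len(s)
    for pattern in _REPLY_MARKERS:
        match = pattern.search(s)
        if match:
            cut = min(cut, match.start())
    return s[:cut].rstrip()
```

With this shape, each function can be tested directly on a string without constructing a loader, and both loaders would import these functions rather than one loader reaching into the other.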