Split mail_archive_x.py (549 LOC): HTML→text, EML parsing and reply-stripping are three modules in one file (#48)

From the 2026-06-10 codebase-health review. **Priority: MED (Worth exploring).** Related: #3, #4 (both live inside this file's concerns).

`src/data/loaders/mail_archive_x.py` bundles three concerns with different change cadences in one file:
1. HTML→text extraction (`_HTMLTextExtractor` + supporting regexes)
2. EML parsing / header decoding
3. reply-chain stripping (6 regex patterns)

Each can only be reached (and tested) through the whole loader, and `src/data/loaders/azure_blob.py:13` couples to the entire bundle just to reuse the EML parsing.

**Fix:** extract `src/data/html_text.py` (`html_to_text(s)`) and optionally `src/data/reply_strip.py` (`strip_replies(s)`); the loader keeps file discovery + orchestration. Local and Azure loaders then both depend on the shared parsing modules instead of Azure delegating into Local. Each extracted module is deep — small interface over fiddly regex implementations — and unit-testable in isolation (which would also make #3 and #4 easier to fix safely).
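
A minimal sketch of what the two extracted modules could look like, shown in one block for brevity. The names `html_to_text`, `strip_replies` and `_HTMLTextExtractor` come from this issue; everything else is an assumption for illustration. That includes the tag sets, the whitespace handling, and the two reply-marker patterns, which are stand-ins and not the six patterns currently in the loader.

```python
# Hypothetical sketch of src/data/html_text.py and src/data/reply_strip.py.
# Bodies are illustrative assumptions, not the code in mail_archive_x.py.
import re
from html.parser import HTMLParser

# Assumed tag sets; the real extractor may treat other tags specially.
_BLOCK_TAGS = {"p", "div", "br", "li", "tr", "h1", "h2", "h3"}
_SKIP_TAGS = {"script", "style"}


class _HTMLTextExtractor(HTMLParser):
    """Collects visible text, turning block-level tags into line breaks."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in _SKIP_TAGS:
            self._skip_depth += 1
        elif tag in _BLOCK_TAGS:
            self._chunks.append("\n")

    def handle_endtag(self, tag):
        if tag in _SKIP_TAGS and self._skip_depth:
            self._skip_depth -= 1
        elif tag in _BLOCK_TAGS:
            self._chunks.append("\n")

    def handle_data(self, data):
        # Drop the contents of <script> and <style>.
        if not self._skip_depth:
            self._chunks.append(data)

    def text(self):
        raw = re.sub(r"[ \t]+", " ", "".join(self._chunks))
        lines = [line.strip() for line in raw.splitlines()]
        return "\n".join(line for line in lines if line)


def html_to_text(s: str) -> str:
    """Small public interface over the parser above."""
    parser = _HTMLTextExtractor()
    parser.feed(s)
    parser.close()  # flush any buffered trailing text
    return parser.text()


# Two illustrative reply markers (assumed, not the loader's actual patterns).
_REPLY_MARKERS = [
    re.compile(r"^On .+ wrote:\s*$", re.MULTILINE),
    re.compile(r"^-{2,}\s*Original Message\s*-{2,}\s*$",
               re.MULTILINE | re.IGNORECASE),
]


def strip_replies(s: str) -> str:
    """Cut the text at the earliest reply marker, if any."""
    cut = len(s)
    for pattern in _REPLY_MARKERS:
        match = pattern.search(s)
        if match:
            cut = min(cut, match.start())
    return s[:cut].rstrip()
```

With this shape, each function can be unit-tested on a plain string without constructing a loader or touching the filesystem, for example `strip_replies(html_to_text(body))` on a fixture email body.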
