Split mail_archive_x.py (549 LOC) — HTML→text, EML parsing and reply-stripping are three modules in one file #48

@fmasi

From the 2026-06-10 codebase-health review. Priority: MED (Worth exploring). Related: #3, #4 (both live inside this file's concerns).

src/data/loaders/mail_archive_x.py bundles three concerns with different change cadences in one file:

  1. HTML→text extraction (_HTMLTextExtractor + supporting regexes)
  2. EML parsing / header decoding
  3. reply-chain stripping (6 regex patterns)

Each can only be reached (and tested) through the whole loader, and src/data/loaders/azure_blob.py:13 couples to the entire bundle just to reuse the EML parsing.

Fix: extract src/data/html_text.py (html_to_text(s)) and, optionally, src/data/reply_strip.py (strip_replies(s)); the loader keeps file discovery and orchestration. The Local and Azure loaders then both depend on the shared parsing modules instead of Azure delegating into Local. Each extracted module is deep (a small interface over fiddly regex implementations) and unit-testable in isolation, which would also make #3 and #4 easier to fix safely.
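A rough sketch of what the two extracted interfaces could look like, to make the "small interface, fiddly internals" shape concrete. Only the names html_to_text, strip_replies and _HTMLTextExtractor come from this issue; the tag sets, the three reply patterns and the whitespace rules below are illustrative assumptions and not the six patterns currently in mail_archive_x.py. Both functions are shown in one block for brevity; in the proposal they would live in separate modules.

```python
# Hypothetical sketch of src/data/html_text.py and src/data/reply_strip.py.
# Patterns and tag sets are placeholders, not the real ones from the loader.
import re
from html.parser import HTMLParser

_BLOCK_TAGS = {"p", "div", "br", "li", "tr"}   # assumed: tags that imply a line break
_SKIP_TAGS = {"script", "style"}               # assumed: tags whose body is never visible


class _HTMLTextExtractor(HTMLParser):
    """Collects visible text and drops <script>/<style> bodies."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in _SKIP_TAGS:
            self._skip_depth += 1
        elif tag in _BLOCK_TAGS:
            self._parts.append("\n")

    def handle_endtag(self, tag):
        if tag in _SKIP_TAGS and self._skip_depth:
            self._skip_depth -= 1
        elif tag in _BLOCK_TAGS:
            self._parts.append("\n")

    def handle_data(self, data):
        if not self._skip_depth:
            self._parts.append(data)

    def text(self):
        return "".join(self._parts)


def html_to_text(s: str) -> str:
    """Public interface of the HTML module: HTML string in, plain text out."""
    parser = _HTMLTextExtractor()
    parser.feed(s)
    parser.close()
    text = re.sub(r"[ \t]+", " ", parser.text())   # collapse runs of spaces/tabs
    text = re.sub(r" ?\n ?", "\n", text)           # trim spaces around newlines
    text = re.sub(r"\n{3,}", "\n\n", text)         # at most one blank line in a row
    return text.strip()


# Placeholder reply markers; the real module would carry the loader's six patterns.
_REPLY_MARKERS = [
    re.compile(r"^On .+ wrote:$"),
    re.compile(r"^-{2,}\s*Original Message\s*-{2,}$", re.IGNORECASE),
    re.compile(r"^>"),
]


def strip_replies(s: str) -> str:
    """Public interface of the reply module: keep text above the first reply marker."""
    kept = []
    for line in s.splitlines():
        if any(p.match(line.strip()) for p in _REPLY_MARKERS):
            break
        kept.append(line)
    return "\n".join(kept).rstrip()
```

With this shape, each concern can be exercised directly with plain strings (no .eml fixtures and no loader instance), which is the property that would help when touching #3 and #4.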

Labels: code-health (Tidiness / refactoring findings from codebase-health reviews)
