Skip to content

feat: full-text article extraction via Mozilla Readability#43

Merged
bahdotsh merged 9 commits into
mainfrom
feat/2-full-text-article-extraction-readability
May 15, 2026
Merged

feat: full-text article extraction via Mozilla Readability#43
bahdotsh merged 9 commits into
mainfrom
feat/2-full-text-article-extraction-readability

Conversation

@bahdotsh
Copy link
Copy Markdown
Owner

@bahdotsh bahdotsh commented May 15, 2026

Summary

Adds Mozilla-Readability full-text article extraction to feedr. Press Shift+F in the article detail view to fetch the linked URL and render the actual article inline, replacing the typically-truncated feed summary. Press again to toggle back. Per-feed fulltext = true auto-extracts newly-seen items on refresh.

Closes the daily paper-cut of summary-only feeds without leaving the terminal.

Design choices

  • Crate: dom_smoothie — pure sync, MSRV matches feedr's exactly (1.75), actively maintained (v0.17.0, March 2026), MIT, configurable parse budget. Picked over article_scraper because that one is async-only and would drag tokio + a full async reqwest stack into a codebase that has been deliberately single-threaded synchronous from day one.
  • Threading: each extraction runs on a std::thread::spawn worker that posts the result back over a long-lived mpsc::channel. Same pattern feedr already uses for spawn_feed_refresh. The UI never blocks on extraction.
  • State: App::extracted: HashMap<item_id, ExtractionState> with LRU cap of 500. In-memory only — no disk persistence in v1. A restart re-fetches on demand.
  • Security: per-feed Authorization headers are deliberately NOT forwarded to article URLs. Article URLs are third-party hosts; propagating bearer tokens would be a credential leak.
  • Auto-fetch scope: newly-seen items only (same mark_feed_seen semantics as exec_on_new — first observation of a feed seeds silently to avoid a firehose).
  • Failure handling: pages with <200 chars of extracted body, non-HTML content-type, or >5 MB response are rejected with a clear reason. The detail view falls back to the summary and lets the user retry with Shift+F.

What changed

  • Cargo.toml: add dom_smoothie = "0.17"
  • src/config.rs: optional fulltext field on DefaultFeed (additive, backward-compatible)
  • src/feed.rs: new ExtractedArticle + extract_article (HTTP) + extract_from_html (pure test seam)
  • src/app.rs: ExtractionState enum, App::extracted / extracted_order / fulltext_feeds / show_extracted / pending_extraction_requests, LRU cache helpers, prune-on-feed-remove
  • src/keybindings.rs: KeyAction::FetchFullText + default bind Shift+F
  • src/events.rs: dispatch in View::FeedItemDetail + macro whitelist
  • src/tui.rs: long-lived mpsc channel, spawn_pending_extractions worker pool, hoisted mark_feed_seen so exec_on_new and fulltext share one mark per feed arrival
  • src/ui/detail.rs: render extracted vs summary, four-state indicator (Summary / Extracting… / Full-text / Full-text failed), inline failure reason
  • src/ui/modals.rs: help overlay entry
  • README.md + CLAUDE.md: docs

Test plan

  • cargo test --all-features — 130 lib + 4 integration tests pass (9 new)
  • cargo clippy --all-targets --all-features -- -D warnings — clean
  • cargo fmt --all -- --check — clean
  • Manual smoke: add a summary-only feed (e.g. Hacker News), open an item, press Shift+F, verify the four states (SummaryExtracting…Full-text / Full-text failed) and toggle behavior
  • Manual smoke: set fulltext = true on a feed, refresh, confirm newly-seen items auto-extract in the background without blocking the UI
  • Manual smoke: try a JS-rendered / paywalled page (Bloomberg / NYT) and confirm graceful fallback to the summary with a clear failure reason
  • Manual smoke: confirm ? help overlay shows Shift+F Toggle/fetch full-text (Readability) in the Item Detail section

bahdotsh added 9 commits May 15, 2026 13:58
Most RSS feeds ship truncated summaries and call it a day. The user
then has to leave the terminal, open a browser, and re-read the
metadata they just read in feedr. This is the single biggest paper
cut in the whole experience.

Add a full-text extraction path that fetches the article URL and
runs Mozilla Readability over the HTML to pull out the actual
content. Bind it to Shift+F in the detail view (toggle on success,
retry on failure, no-op while pending), or set fulltext = true on
a feed to auto-extract newly-seen items on refresh.

The crate of choice is dom_smoothie — pure sync, MSRV matches ours
exactly, actively maintained, has a parse budget. The obvious
alternative (article_scraper) is async-only and would drag tokio
plus a full reqwest async stack into a codebase that has been
deliberately single-threaded synchronous from day one. Not
happening.

Extractions run on per-request std::thread::spawn workers that
post results back over an mpsc::channel — same shape as
spawn_feed_refresh. State lives in App::extracted, keyed by the
existing item-id scheme, with an LRU cap of 500 and explicit
in-memory-only semantics. No disk persistence in v1 — cache
invalidation is a problem I would rather not invent for a feature
whose entire point is "fetch fresh content on demand."

Per-feed Authorization headers are deliberately NOT forwarded to
article URLs. They are third-party hosts. Leaking bearer tokens to
random blog domains is the kind of thing that ends up in a CVE.

While at it, mark_feed_seen is hoisted out of collect_exec_on_new
so exec_on_new and the new auto-fulltext path share one mark per
feed arrival. Calling it twice would double-mark and the second
consumer would see an empty newly_seen list and silently do
nothing. The gate also covers both consumers now, so users with
neither hook still pay zero disk for seen-set tracking — same
behavior as before, just with a second consumer wired in.
While wiring Shift+F in, I only updated the help *overlay* (the
scrollable `?` thing) and the README. The bottom help bar that's
always on screen, the compact-mode help bar, and the macro-whitelist
documentation in the auto-generated config.toml comment block all
still pretended the action didn't exist.

Add the FetchFullText keybinding to both render_help_bar and
render_compact_help_bar for the FeedItemDetail view, and tack
fetch-full-text onto the list of macro-callable actions in the
config doc-comment.

The dispatch_action whitelist in events.rs already allowed it from
macros, but a user reading the config file would never have known.
The review of the Readability extraction PR turned up a small
pile of issues. None individually catastrophic, but together they
made the feature unsafe to enable on any real-world feed list.
Let's fix them in one go.

The big ones:

  * `String::from_utf8_lossy` on the response body meant every
    non-UTF8 page (Windows-1252, Shift_JIS, GBK, ...) decoded into
    U+FFFD soup, which Readability then dutifully rejected as
    "page appears empty." Route through `encoding_rs` with
    Content-Type charset > <meta charset> sniff > UTF-8.

  * `response.bytes()` read the entire body before checking the
    5 MB cap, which is a great way to allocate 200 MB on a hostile
    server. Switch to `Response::take(MAX+1).read_to_end(...)` and
    pre-check Content-Length while we're at it.

  * `spawn_pending_extractions` spawned one OS thread per queued
    request, unconditionally, with zero per-domain throttling.
    A refresh across several `fulltext = true` feeds could fire
    20+ concurrent requests at the same host. Cap concurrent
    workers at 4 via an Arc<AtomicUsize>, and reuse
    `last_domain_fetch` so article fetches respect the same
    politeness budget feed fetches already do.

  * Auto-fulltext blindly fetched whatever the feed XML put in
    <link>. A hostile feed could point that at
    169.254.169.254 or 10.0.0.x and use the reader to probe the
    user's network. Add a small allowlist (`is_safe_auto_url`):
    http/https only, no RFC1918/loopback/link-local/CGNAT, no
    `.local` or `localhost` names. Manual Shift+F bypasses it --
    that's the user's explicit click, not the user's problem.

  * The LRU cap was soft. When the head was Pending, eviction
    bailed out and inserted anyway, so a wave of stuck workers
    could grow the cache past EXTRACTED_CACHE_CAP. Always evict
    the head; teach `record_extraction_result` to drop late
    results for non-Pending slots so we don't resurrect evicted
    or removed entries.

  * `dom_smoothie` panicking on hostile HTML stranded the slot
    on Pending forever with no path to retry. Wrap the worker in
    `catch_unwind` and surface panics as Failed.

While at it: detail-view body-source lookup now uses
`current_article_indices()` so it matches the action handler;
`toggle_or_request_fulltext` shows a success-message hint when
fired from a non-detail view (the macro path was silent before);
`remove_extraction` helper replaces the two open-coded map+deque
removals. Tests cover the new behaviors.

The Readability dep didn't need to come with this many footguns
on day one. This is what reviews are for.
A staff-level review of the previous hardening commit turned up four
real gaps that the upfront `is_safe_auto_url` check didn't actually
close.

The big one: the auto-fulltext client used reqwest's stock
`redirect::Policy::limited(10)`, which silently follows redirects to
*anywhere*. A hostile feed publishes
`<link>http://attacker.com</link>`, the attacker returns
`302 Location: http://169.254.169.254/...`, and the entire SSRF
gate is sidestepped. The whole point of the allowlist was keeping
feed-supplied URLs out of the user's internal network, and we were
waving them through on the redirect path.

Not great.

Add `Feed::build_safe_redirect_client`, whose custom policy re-runs
`is_safe_auto_url` on every hop. Wire an
`ExtractionRequest::safe_redirects` flag through so auto-path
requests pick that client. Manual `Shift+F` keeps the permissive
one — the user already chose the article, same trust model as
opening it in a browser.

While at it, three smaller things:

- `spawn_pending_extractions` happily popped requests whose
  `extracted` slot had been evicted by LRU or pruned by feed
  removal, did a real HTTP fetch, then dropped the result. Gate
  each pop on the slot still being `Pending` before spawning.

- `std::thread::spawn` panics on OS thread-creation failure with
  the inflight counter already bumped. That's a TUI crash with a
  counter we can't decrement. Switch to
  `thread::Builder::new().spawn()`, release the slot, re-queue
  the request.

- The IPv6 SSRF list was missing 6to4 (`2002::/16`), NAT64
  (`64:ff9b::/96`), deprecated site-local (`fec0::/10`), and
  multicast (`ff00::/8`). 6to4 is the practical concern:
  `2002:a9fe:a9fe::` reaches 169.254.169.254 wherever a relay
  exists.

Also dedup the summary-formatting in the detail view so the Failed
branch isn't carrying a second copy of it. Four new tests; 148/148
passing.
Another pass over the Readability code turned up six more things
that wanted dealing with. None individually awful, but together
they made the feature feel half-thought-through.

The biggest one is the *Pending*-state trap. When the user fires
`Shift+F`, the detail view replaced the entire body with "Fetching
full text…" — and `Shift+F` is documented as a no-op while
Pending, so the user is now stuck staring at a placeholder until
the worker finishes. On a slow site behind the per-domain rate
limit, that's many seconds with nothing to read.

Not great.

Render the summary as the body while extraction is in flight; the
scroll-indicator title already says "Extracting…" so the user
knows what's happening. Same trick we already use for Failed.

Second one is a priority inversion: `request_extraction` always
`push_back`'d, so an explicit `Shift+F` could land behind a whole
refresh's worth of auto-extractions to the same host. Add a
`priority_front: bool`; manual passes true, auto passes false.
User-explicit actions should not wait behind background work.

While at it:

 - `last_domain_fetch` was stamped *before* `thread::Builder::spawn`,
   so a rare OS spawn failure re-queued the request with a fresh
   stamp it hadn't earned. Stamp after the spawn succeeds.

 - `sniff_meta_charset` matched the first `charset` substring
   anywhere in the leading 1KB, so `<div data-charset="bogus">` or
   a `charset=evil` string inside `<script>` could pick the wrong
   encoding. Restrict to `charset=` *inside* a `<meta>` tag, the
   way the spec actually defines it.

 - `manual_client` and `safe_client` were locals inside
   `spawn_pending_extractions`, so a busy refresh built fresh
   `reqwest::blocking::Client`s every tick. Hoist to `run_app`
   scope, build once per session per path.

 - IPv6 SSRF comment said "carrier-grade /15". CGNAT is /10 and is
   already covered above by the v4 check. The comment was just
   wrong; fix it.

Five new tests covering the priority jump and the four sniffer
hardening cases (data-attribute, script literal, `<metadata>` tag,
http-equiv form). 153/153 green.
The PR review found two real bugs hiding in the extraction worker
pool, plus a handful of smaller things worth cleaning up. Time to
actually fix them instead of waving hands at them.

The first one: if a Pending slot gets LRU-evicted and then
re-requested for the same id, the *original* worker is still out
there, blissfully fetching. When it eventually delivers, the
staleness gate sees Pending and writes through — so the "manual
retry" the user just performed gets silently satisfied by the
stale fetch. The id maps 1:1 to URL so the content is at least
the right URL, but it's not the fresh data the user asked for.

Fix: tag every ExtractionRequest with a monotonic generation.
Pending carries it too. record_extraction_result requires both
the state AND generation to match. The worker echoes its
generation back through the channel; stale results get dropped.

The second one: catch_unwind covers Rust panics, but if the OS
decides to kill the worker (OOM, SIGSEGV in a native dep), the
slot stays Pending forever. toggle_or_request_fulltext's
contains_key guard then blocks the user from ever retrying.

Fix: add a watchdog. Pending now carries started_at; every loop
tick we flip anything older than http_timeout * 3 to
Failed("timed out"). The inflight counter is intentionally *not*
decremented — if the worker is still alive and exits normally,
its own fetch_sub runs; double-decrementing would underflow.
After enough OS kills the budget gets pinned and the user has to
restart, but at least retries work.

While at it:

- The meta-charset sniffer was matching the first \`charset\`
  substring it found in a meta tag, which meant
  \`<meta name="charset" charset="utf-8">\` extracted the empty
  value out of name= and gave up before finding the real one.
  Now requires a word boundary before \`charset\` and \`=\` after.

- The detail-view "Press F to retry" hint was hardcoded. If the
  user rebinds fetch_full_text, the hint lies. Use key_display().

- The doc claim that body allocation was "hard-capped at ~5 MB"
  wasn't quite true — Vec::read_to_end's doubling growth from a
  64 KiB hint peaks at ~8 MiB for a 5 MiB body. Preallocate at
  the cap. Modern allocators don't commit physical pages until
  written, so small pages cost nothing.

- The inflight counter was using SeqCst, which is overkill for a
  plain bound. Relaxed everywhere; the mpsc channel carries the
  result-handoff ordering.

- New regression test for the "mark_feed_seen is called exactly
  once per feed arrival" contract — the promise was in the docs
  but not in tests.
…lticast

Code review against the full-text extraction path turned up three
things worth fixing before merge.

The biggest one: the stale-Pending watchdog flipped slots to Failed
but *intentionally* didn't decrement `extract_inflight`. The reasoning
was sound — if the worker is actually still alive, it will decrement
on its own and we'd otherwise underflow the counter. The unstated
consequence is that a truly-dead worker (OS kill, SIGSEGV in a native
dep) leaks one budget slot for the rest of the session. After four
such events the pool is permanently saturated and Shift+F just sits
there doing nothing.

Let's fix that. `prune_stale_pending_extractions` now returns the
count of slots it timed out, and the caller releases that many
inflight budget slots using a saturating decrement. The worker exit
path uses the same saturating decrement, so a slow worker that
completes *after* the watchdog already released its slot clamps at
zero instead of wrapping to `usize::MAX` — which would turn the bound
into "spawn as many workers as you want until the heap runs out."
Worst case is a transient one-slot overcommit, which is a much better
failure mode than either the leak or the underflow.

While at it: `show_extracted` was a single global bool. If you
toggled to summary view on article A and walked to article B, B
silently rendered as summary even though its title bar advertised
"Full-text" as the body. Anchor the toggle to the focused indices
and reset it when focus changes.

And the IPv4 branch of `is_global_ip` wasn't rejecting `224.0.0.0/4`
even though the IPv6 branch rejected `ff00::/8`. Multicast HTTP
servers are not really a thing, but asymmetric allowlists *are* a
thing reviewers reasonably complain about. Added `v4.is_multicast()`
and called it a day.

Tests added for all three. cargo fmt and cargo clippy -D warnings
clean; 165 tests pass.
Code review caught a slow leak in the extraction worker pool. The
4-worker cap turned out to be advisory at best.

It turns out `ExtractionState::Pending.started_at` was stamped at
`request_extraction` time, but the watchdog uses the same field to
identify stuck workers — and releases one inflight permit per pruned
slot. A request that aged out while still *queued* (rate-limited
behind a long backlog) never claimed a permit in the first place. The
watchdog "released" it anyway, silently widening the effective pool
by one per spurious release. Saturating sub kept it from
underflowing, so this wasn't a crash — just drift in the wrong
direction. Not great.

Fix is structural: add `spawned: bool` to Pending, default false,
flip to true only after `thread::Builder::spawn` returns Ok. The
watchdog now counts only spawned-pruned slots toward the inflight
release. Queue-stranded slots still flip to Failed so manual retry
works — only the permit accounting is gated. Adding a field forces
every construction site to declare its spawned-state, which beats
comments.

While at it: switch the `dom_smoothie` spec from `0.17` to `~0.17` —
same constraint in cargo (pre-1.0 caret == tilde), but the intent is
explicit at the manifest for a security-critical dep. Reword the
misleading "peak allocation ~ FULLTEXT_MAX_BYTES" comment (DOM
allocation inside dom_smoothie is separate working-set, not bounded
here). Swap `format!("{}", e)` for `format!("{:#}", e)` in
`record_extraction_result` so anyhow's cause chain survives.

Also added five hermetic stub tests for `extract_article`: oversized
Content-Length, wrong Content-Type, and the safe-redirect client
refusing to follow into RFC1918. The pure-Readability path was
well-tested; the wrapping fetch path's security gates were not.
Another pass over the extraction worker pool turned up two real
correctness issues, plus a couple of load-bearing invariants the
code was relying on without anything actually locking them.

The big one: `insert_extraction` unconditionally evicted the oldest
deque entry when the cache hit its cap. If that entry happened to
be a `Pending` slot whose worker was still alive — and the cache
only fills under sustained extraction load, so by definition the
workers ARE alive — the slot disappears from `extracted`. The
stale-Pending watchdog only walks `extracted`, so once the slot is
gone, the watchdog can't find it. If that worker is silently dead
(OS kill, SIGSEGV in a native dep), its inflight permit leaks for
the rest of the session. Eventually the pool saturates and Shift+F
sits there doing nothing.

Not great.

Fix is to scan the deque from the head for the first entry that's
NOT a `Pending { spawned: true }` and evict THAT one. Soft cap
instead of hard — the effective bound becomes
`EXTRACTED_CACHE_CAP + EXTRACTION_MAX_INFLIGHT`, i.e. 504 entries
instead of 500. The worker pool size caps how many we can ever be
protecting, so this can't grow without bound. If literally every
entry is a spawned Pending (would mean the whole pool has been
stuck for a long time), we let the cache transiently grow; the
watchdog gets to those slots on the next tick and restores
evictability on the next insert.

The second one: `prune_stale_pending_extractions` flipped a stale
Pending to Failed by calling `insert_extraction`, which moves the
entry to MRU. So a watchdog-timed-out slot was suddenly *fresher*
than every Ready entry that came in after it. Inverted eviction
order, for absolutely no reason.

Mutate the state in place instead. Preserves the original LRU
position so timed-out entries get evicted ahead of newer Ready
results, which is what you'd actually want.

While at it, the comment on the watchdog's inflight release said
"transiently overcommit by one slot." That's wrong. If all four
pool slots stall, the watchdog releases four, four new workers
spawn, and now we have eight actually running. The real bound is
`EXTRACTION_MAX_INFLIGHT` slots of overcommit, not one. Fix the
comment to say what the code actually does.

Two contracts that were load-bearing but had no test coverage:

  * `extract_article` does not send Authorization headers. The
    function signature deliberately omits a headers parameter, but
    nothing failed loudly if a future refactor added one to the
    static header list. Added a capturing stub server and a test
    that asserts on the raw bytes on the wire. If anyone ever
    wires per-feed auth into article extraction, this fails
    immediately.

  * `enqueue_fulltext_for_new` short-circuits on items that already
    have any extraction state (Ready / Failed / Pending). The
    guard lives in `request_extraction`'s `contains_key` check,
    but a refactor could quietly remove it and a refresh would
    suddenly start re-extracting every cached article on every
    cycle. Locked that too.

cargo fmt clean, clippy -D warnings clean, 177/177 tests passing.
@bahdotsh bahdotsh merged commit 3715058 into main May 15, 2026
8 checks passed
bahdotsh added a commit that referenced this pull request May 15, 2026
* feat(detail): add in-article find with smart-case matching

Full-text extraction landed in v0.7.0 (#43), which is great right
up until you open a 5000-word article and want to find a specific
phrase. Then it's just you, j/k, and your patience.

Before this commit `/` in the detail view opened the *cross-feed*
search modal, which made approximately zero sense — you're staring
at one article and pressing slash drops a full-screen overlay that
filters your entire bookmark list. Let's fix that.

Add a real find-in-page: `/` enters a new `ArticleSearch` input
mode, live-highlights matches as you type, `n`/`N` cycle through
them with wrap-around, and `ESC` clears. The query persists across
`Shift+F` toggling between summary and extracted body, so the same
search works whether you're reading the feed-provided HTML or the
Readability-extracted plain text. Focus change to a different
article drops the query — a stale `/foo` following you to the next
article is the kind of UX that makes people uninstall things.

Matching is ripgrep-style smart-case: case-sensitive when the
query has any uppercase character, case-insensitive (ASCII fold)
otherwise. Highlights render as styled spans inside a `Vec<Line>`
fed to the existing `Paragraph` widget, so soft-wrap continues to
work for free. The current match (the one `n` lands on) gets the
filled-block treatment; non-current matches stay subtle so the
page doesn't look like a Christmas tree.

Jump-to-match uses a sibling helper to `count_wrapped_lines` that
returns the cumulative wrapped-row index for a given source line,
so the viewport scrolls exactly where the wrapped output puts the
match — no drift, no math at three different abstraction layers.
Body and content-width are cached on `App` at render time so the
event handler can resolve match → scroll position without
re-deriving the body. State lives only in `App`; nothing persisted.

22 new tests cover the helpers, the state machine, and the event
dispatch.

* refactor(detail): tighten in-article find after self-review

Self-review of the in-article-find diff turned up four things worth
fixing before someone else trips on them. None are user-visible
today; they're all latent.

*exit_detail_view* cleared the search query but left *input_mode*
alone. Today no code path triggers the bad case — but the day
someone wires a macro or remote-trigger that swaps views, the user
gets stranded in ArticleSearch mode in a view that has no footer.
One line, fixed.

The detail renderer was cloning the entire rendered body into
*article_body_cache* every frame regardless of whether search was
active — about 500KB/s of churn on a 50KB full-text article for a
feature nobody was using. Gate it on "search footer visible" and
the steady state goes to zero allocations.

The doc on *find_article_matches* called the ASCII-only fold a "v1
limitation." It is not a limitation, it is a load-bearing
*invariant*: the highlight render slices the original line at the
byte offsets the search reported, which only works because
*to_ascii_lowercase* is byte-length-preserving. *str::to_lowercase*
— which someone will reach for when they want proper Unicode — is
*not* length-preserving (\`"İ".to_lowercase()\` is two code points)
and would silently invalidate the offset model. Reframe the
comment so a future maintainer can't miss it.

While at it: *build_styled_body* had a dead *'a* lifetime and was
*to_string()*-ing every unmatched line. Borrow from *body* via
*Line::from(&str)* and *Span::raw(&str)*, replace the HashMap
bucketing with a linear walk over already-sorted matches. Tests
added for the invariant, multi-byte offset correctness,
*input_mode* reset, and *build_styled_body* coverage.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant