feat: full-text article extraction via Mozilla Readability by bahdotsh · Pull Request #43 · bahdotsh/feedr

bahdotsh · 2026-05-15T11:59:02Z

Summary

Adds Mozilla-Readability full-text article extraction to feedr. Press Shift+F in the article detail view to fetch the linked URL and render the actual article inline, replacing the typically-truncated feed summary. Press again to toggle back. Per-feed fulltext = true auto-extracts newly-seen items on refresh.

Closes the daily paper-cut of summary-only feeds without leaving the terminal.

Design choices

Crate: dom_smoothie — pure sync, MSRV matches feedr's exactly (1.75), actively maintained (v0.17.0, March 2026), MIT, configurable parse budget. Picked over article_scraper because that one is async-only and would drag tokio + a full async reqwest stack into a codebase that has been deliberately single-threaded synchronous from day one.
Threading: each extraction runs on a std::thread::spawn worker that posts the result back over a long-lived mpsc::channel. Same pattern feedr already uses for spawn_feed_refresh. The UI never blocks on extraction.
State: App::extracted: HashMap<item_id, ExtractionState> with LRU cap of 500. In-memory only — no disk persistence in v1. A restart re-fetches on demand.
Security: per-feed Authorization headers are deliberately NOT forwarded to article URLs. Article URLs are third-party hosts; propagating bearer tokens would be a credential leak.
Auto-fetch scope: newly-seen items only (same mark_feed_seen semantics as exec_on_new — first observation of a feed seeds silently to avoid a firehose).
Failure handling: pages with <200 chars of extracted body, non-HTML content-type, or >5 MB response are rejected with a clear reason. The detail view falls back to the summary and lets the user retry with Shift+F.

What changed

Cargo.toml: add dom_smoothie = "0.17"
src/config.rs: optional fulltext field on DefaultFeed (additive, backward-compatible)
src/feed.rs: new ExtractedArticle + extract_article (HTTP) + extract_from_html (pure test seam)
src/app.rs: ExtractionState enum, App::extracted / extracted_order / fulltext_feeds / show_extracted / pending_extraction_requests, LRU cache helpers, prune-on-feed-remove
src/keybindings.rs: KeyAction::FetchFullText + default bind Shift+F
src/events.rs: dispatch in View::FeedItemDetail + macro whitelist
src/tui.rs: long-lived mpsc channel, spawn_pending_extractions worker pool, hoisted mark_feed_seen so exec_on_new and fulltext share one mark per feed arrival
src/ui/detail.rs: render extracted vs summary, four-state indicator (Summary / Extracting… / Full-text / Full-text failed), inline failure reason
src/ui/modals.rs: help overlay entry
README.md + CLAUDE.md: docs

Test plan

cargo test --all-features — 130 lib + 4 integration tests pass (9 new)
cargo clippy --all-targets --all-features -- -D warnings — clean
cargo fmt --all -- --check — clean
Manual smoke: add a summary-only feed (e.g. Hacker News), open an item, press Shift+F, verify the four states (Summary → Extracting… → Full-text / Full-text failed) and toggle behavior
Manual smoke: set fulltext = true on a feed, refresh, confirm newly-seen items auto-extract in the background without blocking the UI
Manual smoke: try a JS-rendered / paywalled page (Bloomberg / NYT) and confirm graceful fallback to the summary with a clear failure reason
Manual smoke: confirm ? help overlay shows Shift+F Toggle/fetch full-text (Readability) in the Item Detail section

Most RSS feeds ship truncated summaries and call it a day. The user then has to leave the terminal, open a browser, and re-read the metadata they just read in feedr. This is the single biggest paper cut in the whole experience. Add a full-text extraction path that fetches the article URL and runs Mozilla Readability over the HTML to pull out the actual content. Bind it to Shift+F in the detail view (toggle on success, retry on failure, no-op while pending), or set fulltext = true on a feed to auto-extract newly-seen items on refresh. The crate of choice is dom_smoothie — pure sync, MSRV matches ours exactly, actively maintained, has a parse budget. The obvious alternative (article_scraper) is async-only and would drag tokio plus a full reqwest async stack into a codebase that has been deliberately single-threaded synchronous from day one. Not happening. Extractions run on per-request std::thread::spawn workers that post results back over an mpsc::channel — same shape as spawn_feed_refresh. State lives in App::extracted, keyed by the existing item-id scheme, with an LRU cap of 500 and explicit in-memory-only semantics. No disk persistence in v1 — cache invalidation is a problem I would rather not invent for a feature whose entire point is "fetch fresh content on demand." Per-feed Authorization headers are deliberately NOT forwarded to article URLs. They are third-party hosts. Leaking bearer tokens to random blog domains is the kind of thing that ends up in a CVE. While at it, mark_feed_seen is hoisted out of collect_exec_on_new so exec_on_new and the new auto-fulltext path share one mark per feed arrival. Calling it twice would double-mark and the second consumer would see an empty newly_seen list and silently do nothing. The gate also covers both consumers now, so users with neither hook still pay zero disk for seen-set tracking — same behavior as before, just with a second consumer wired in.

While wiring Shift+F in, I only updated the help *overlay* (the scrollable `?` thing) and the README. The bottom help bar that's always on screen, the compact-mode help bar, and the macro-whitelist documentation in the auto-generated config.toml comment block all still pretended the action didn't exist. Add the FetchFullText keybinding to both render_help_bar and render_compact_help_bar for the FeedItemDetail view, and tack fetch-full-text onto the list of macro-callable actions in the config doc-comment. The dispatch_action whitelist in events.rs already allowed it from macros, but a user reading the config file would never have known.

The review of the Readability extraction PR turned up a small pile of issues. None individually catastrophic, but together they made the feature unsafe to enable on any real-world feed list. Let's fix them in one go. The big ones: * `String::from_utf8_lossy` on the response body meant every non-UTF8 page (Windows-1252, Shift_JIS, GBK, ...) decoded into U+FFFD soup, which Readability then dutifully rejected as "page appears empty." Route through `encoding_rs` with Content-Type charset > <meta charset> sniff > UTF-8. * `response.bytes()` read the entire body before checking the 5 MB cap, which is a great way to allocate 200 MB on a hostile server. Switch to `Response::take(MAX+1).read_to_end(...)` and pre-check Content-Length while we're at it. * `spawn_pending_extractions` spawned one OS thread per queued request, unconditionally, with zero per-domain throttling. A refresh across several `fulltext = true` feeds could fire 20+ concurrent requests at the same host. Cap concurrent workers at 4 via an Arc<AtomicUsize>, and reuse `last_domain_fetch` so article fetches respect the same politeness budget feed fetches already do. * Auto-fulltext blindly fetched whatever the feed XML put in <link>. A hostile feed could point that at 169.254.169.254 or 10.0.0.x and use the reader to probe the user's network. Add a small allowlist (`is_safe_auto_url`): http/https only, no RFC1918/loopback/link-local/CGNAT, no `.local` or `localhost` names. Manual Shift+F bypasses it -- that's the user's explicit click, not the user's problem. * The LRU cap was soft. When the head was Pending, eviction bailed out and inserted anyway, so a wave of stuck workers could grow the cache past EXTRACTED_CACHE_CAP. Always evict the head; teach `record_extraction_result` to drop late results for non-Pending slots so we don't resurrect evicted or removed entries. * `dom_smoothie` panicking on hostile HTML stranded the slot on Pending forever with no path to retry. Wrap the worker in `catch_unwind` and surface panics as Failed. While at it: detail-view body-source lookup now uses `current_article_indices()` so it matches the action handler; `toggle_or_request_fulltext` shows a success-message hint when fired from a non-detail view (the macro path was silent before); `remove_extraction` helper replaces the two open-coded map+deque removals. Tests cover the new behaviors. The Readability dep didn't need to come with this many footguns on day one. This is what reviews are for.

A staff-level review of the previous hardening commit turned up four real gaps that the upfront `is_safe_auto_url` check didn't actually close. The big one: the auto-fulltext client used reqwest's stock `redirect::Policy::limited(10)`, which silently follows redirects to *anywhere*. A hostile feed publishes `<link>http://attacker.com</link>`, the attacker returns `302 Location: http://169.254.169.254/...`, and the entire SSRF gate is sidestepped. The whole point of the allowlist was keeping feed-supplied URLs out of the user's internal network, and we were waving them through on the redirect path. Not great. Add `Feed::build_safe_redirect_client`, whose custom policy re-runs `is_safe_auto_url` on every hop. Wire an `ExtractionRequest::safe_redirects` flag through so auto-path requests pick that client. Manual `Shift+F` keeps the permissive one — the user already chose the article, same trust model as opening it in a browser. While at it, three smaller things: - `spawn_pending_extractions` happily popped requests whose `extracted` slot had been evicted by LRU or pruned by feed removal, did a real HTTP fetch, then dropped the result. Gate each pop on the slot still being `Pending` before spawning. - `std::thread::spawn` panics on OS thread-creation failure with the inflight counter already bumped. That's a TUI crash with a counter we can't decrement. Switch to `thread::Builder::new().spawn()`, release the slot, re-queue the request. - The IPv6 SSRF list was missing 6to4 (`2002::/16`), NAT64 (`64:ff9b::/96`), deprecated site-local (`fec0::/10`), and multicast (`ff00::/8`). 6to4 is the practical concern: `2002:a9fe:a9fe::` reaches 169.254.169.254 wherever a relay exists. Also dedup the summary-formatting in the detail view so the Failed branch isn't carrying a second copy of it. Four new tests; 148/148 passing.

Another pass over the Readability code turned up six more things that wanted dealing with. None individually awful, but together they made the feature feel half-thought-through. The biggest one is the *Pending*-state trap. When the user fires `Shift+F`, the detail view replaced the entire body with "Fetching full text…" — and `Shift+F` is documented as a no-op while Pending, so the user is now stuck staring at a placeholder until the worker finishes. On a slow site behind the per-domain rate limit, that's many seconds with nothing to read. Not great. Render the summary as the body while extraction is in flight; the scroll-indicator title already says "Extracting…" so the user knows what's happening. Same trick we already use for Failed. Second one is a priority inversion: `request_extraction` always `push_back`'d, so an explicit `Shift+F` could land behind a whole refresh's worth of auto-extractions to the same host. Add a `priority_front: bool`; manual passes true, auto passes false. User-explicit actions should not wait behind background work. While at it: - `last_domain_fetch` was stamped *before* `thread::Builder::spawn`, so a rare OS spawn failure re-queued the request with a fresh stamp it hadn't earned. Stamp after the spawn succeeds. - `sniff_meta_charset` matched the first `charset` substring anywhere in the leading 1KB, so `<div data-charset="bogus">` or a `charset=evil` string inside `<script>` could pick the wrong encoding. Restrict to `charset=` *inside* a `<meta>` tag, the way the spec actually defines it. - `manual_client` and `safe_client` were locals inside `spawn_pending_extractions`, so a busy refresh built fresh `reqwest::blocking::Client`s every tick. Hoist to `run_app` scope, build once per session per path. - IPv6 SSRF comment said "carrier-grade /15". CGNAT is /10 and is already covered above by the v4 check. The comment was just wrong; fix it. Five new tests covering the priority jump and the four sniffer hardening cases (data-attribute, script literal, `<metadata>` tag, http-equiv form). 153/153 green.

The PR review found two real bugs hiding in the extraction worker pool, plus a handful of smaller things worth cleaning up. Time to actually fix them instead of waving hands at them. The first one: if a Pending slot gets LRU-evicted and then re-requested for the same id, the *original* worker is still out there, blissfully fetching. When it eventually delivers, the staleness gate sees Pending and writes through — so the "manual retry" the user just performed gets silently satisfied by the stale fetch. The id maps 1:1 to URL so the content is at least the right URL, but it's not the fresh data the user asked for. Fix: tag every ExtractionRequest with a monotonic generation. Pending carries it too. record_extraction_result requires both the state AND generation to match. The worker echoes its generation back through the channel; stale results get dropped. The second one: catch_unwind covers Rust panics, but if the OS decides to kill the worker (OOM, SIGSEGV in a native dep), the slot stays Pending forever. toggle_or_request_fulltext's contains_key guard then blocks the user from ever retrying. Fix: add a watchdog. Pending now carries started_at; every loop tick we flip anything older than http_timeout * 3 to Failed("timed out"). The inflight counter is intentionally *not* decremented — if the worker is still alive and exits normally, its own fetch_sub runs; double-decrementing would underflow. After enough OS kills the budget gets pinned and the user has to restart, but at least retries work. While at it: - The meta-charset sniffer was matching the first \`charset\` substring it found in a meta tag, which meant \`<meta name="charset" charset="utf-8">\` extracted the empty value out of name= and gave up before finding the real one. Now requires a word boundary before \`charset\` and \`=\` after. - The detail-view "Press F to retry" hint was hardcoded. If the user rebinds fetch_full_text, the hint lies. Use key_display(). - The doc claim that body allocation was "hard-capped at ~5 MB" wasn't quite true — Vec::read_to_end's doubling growth from a 64 KiB hint peaks at ~8 MiB for a 5 MiB body. Preallocate at the cap. Modern allocators don't commit physical pages until written, so small pages cost nothing. - The inflight counter was using SeqCst, which is overkill for a plain bound. Relaxed everywhere; the mpsc channel carries the result-handoff ordering. - New regression test for the "mark_feed_seen is called exactly once per feed arrival" contract — the promise was in the docs but not in tests.

…lticast Code review against the full-text extraction path turned up three things worth fixing before merge. The biggest one: the stale-Pending watchdog flipped slots to Failed but *intentionally* didn't decrement `extract_inflight`. The reasoning was sound — if the worker is actually still alive, it will decrement on its own and we'd otherwise underflow the counter. The unstated consequence is that a truly-dead worker (OS kill, SIGSEGV in a native dep) leaks one budget slot for the rest of the session. After four such events the pool is permanently saturated and Shift+F just sits there doing nothing. Let's fix that. `prune_stale_pending_extractions` now returns the count of slots it timed out, and the caller releases that many inflight budget slots using a saturating decrement. The worker exit path uses the same saturating decrement, so a slow worker that completes *after* the watchdog already released its slot clamps at zero instead of wrapping to `usize::MAX` — which would turn the bound into "spawn as many workers as you want until the heap runs out." Worst case is a transient one-slot overcommit, which is a much better failure mode than either the leak or the underflow. While at it: `show_extracted` was a single global bool. If you toggled to summary view on article A and walked to article B, B silently rendered as summary even though its title bar advertised "Full-text" as the body. Anchor the toggle to the focused indices and reset it when focus changes. And the IPv4 branch of `is_global_ip` wasn't rejecting `224.0.0.0/4` even though the IPv6 branch rejected `ff00::/8`. Multicast HTTP servers are not really a thing, but asymmetric allowlists *are* a thing reviewers reasonably complain about. Added `v4.is_multicast()` and called it a day. Tests added for all three. cargo fmt and cargo clippy -D warnings clean; 165 tests pass.

Code review caught a slow leak in the extraction worker pool. The 4-worker cap turned out to be advisory at best. It turns out `ExtractionState::Pending.started_at` was stamped at `request_extraction` time, but the watchdog uses the same field to identify stuck workers — and releases one inflight permit per pruned slot. A request that aged out while still *queued* (rate-limited behind a long backlog) never claimed a permit in the first place. The watchdog "released" it anyway, silently widening the effective pool by one per spurious release. Saturating sub kept it from underflowing, so this wasn't a crash — just drift in the wrong direction. Not great. Fix is structural: add `spawned: bool` to Pending, default false, flip to true only after `thread::Builder::spawn` returns Ok. The watchdog now counts only spawned-pruned slots toward the inflight release. Queue-stranded slots still flip to Failed so manual retry works — only the permit accounting is gated. Adding a field forces every construction site to declare its spawned-state, which beats comments. While at it: switch the `dom_smoothie` spec from `0.17` to `~0.17` — same constraint in cargo (pre-1.0 caret == tilde), but the intent is explicit at the manifest for a security-critical dep. Reword the misleading "peak allocation ~ FULLTEXT_MAX_BYTES" comment (DOM allocation inside dom_smoothie is separate working-set, not bounded here). Swap `format!("{}", e)` for `format!("{:#}", e)` in `record_extraction_result` so anyhow's cause chain survives. Also added five hermetic stub tests for `extract_article`: oversized Content-Length, wrong Content-Type, and the safe-redirect client refusing to follow into RFC1918. The pure-Readability path was well-tested; the wrapping fetch path's security gates were not.

Another pass over the extraction worker pool turned up two real correctness issues, plus a couple of load-bearing invariants the code was relying on without anything actually locking them. The big one: `insert_extraction` unconditionally evicted the oldest deque entry when the cache hit its cap. If that entry happened to be a `Pending` slot whose worker was still alive — and the cache only fills under sustained extraction load, so by definition the workers ARE alive — the slot disappears from `extracted`. The stale-Pending watchdog only walks `extracted`, so once the slot is gone, the watchdog can't find it. If that worker is silently dead (OS kill, SIGSEGV in a native dep), its inflight permit leaks for the rest of the session. Eventually the pool saturates and Shift+F sits there doing nothing. Not great. Fix is to scan the deque from the head for the first entry that's NOT a `Pending { spawned: true }` and evict THAT one. Soft cap instead of hard — the effective bound becomes `EXTRACTED_CACHE_CAP + EXTRACTION_MAX_INFLIGHT`, i.e. 504 entries instead of 500. The worker pool size caps how many we can ever be protecting, so this can't grow without bound. If literally every entry is a spawned Pending (would mean the whole pool has been stuck for a long time), we let the cache transiently grow; the watchdog gets to those slots on the next tick and restores evictability on the next insert. The second one: `prune_stale_pending_extractions` flipped a stale Pending to Failed by calling `insert_extraction`, which moves the entry to MRU. So a watchdog-timed-out slot was suddenly *fresher* than every Ready entry that came in after it. Inverted eviction order, for absolutely no reason. Mutate the state in place instead. Preserves the original LRU position so timed-out entries get evicted ahead of newer Ready results, which is what you'd actually want. While at it, the comment on the watchdog's inflight release said "transiently overcommit by one slot." That's wrong. If all four pool slots stall, the watchdog releases four, four new workers spawn, and now we have eight actually running. The real bound is `EXTRACTION_MAX_INFLIGHT` slots of overcommit, not one. Fix the comment to say what the code actually does. Two contracts that were load-bearing but had no test coverage: * `extract_article` does not send Authorization headers. The function signature deliberately omits a headers parameter, but nothing failed loudly if a future refactor added one to the static header list. Added a capturing stub server and a test that asserts on the raw bytes on the wire. If anyone ever wires per-feed auth into article extraction, this fails immediately. * `enqueue_fulltext_for_new` short-circuits on items that already have any extraction state (Ready / Failed / Pending). The guard lives in `request_extraction`'s `contains_key` check, but a refactor could quietly remove it and a refresh would suddenly start re-extracting every cached article on every cycle. Locked that too. cargo fmt clean, clippy -D warnings clean, 177/177 tests passing.

* feat(detail): add in-article find with smart-case matching Full-text extraction landed in v0.7.0 (#43), which is great right up until you open a 5000-word article and want to find a specific phrase. Then it's just you, j/k, and your patience. Before this commit `/` in the detail view opened the *cross-feed* search modal, which made approximately zero sense — you're staring at one article and pressing slash drops a full-screen overlay that filters your entire bookmark list. Let's fix that. Add a real find-in-page: `/` enters a new `ArticleSearch` input mode, live-highlights matches as you type, `n`/`N` cycle through them with wrap-around, and `ESC` clears. The query persists across `Shift+F` toggling between summary and extracted body, so the same search works whether you're reading the feed-provided HTML or the Readability-extracted plain text. Focus change to a different article drops the query — a stale `/foo` following you to the next article is the kind of UX that makes people uninstall things. Matching is ripgrep-style smart-case: case-sensitive when the query has any uppercase character, case-insensitive (ASCII fold) otherwise. Highlights render as styled spans inside a `Vec<Line>` fed to the existing `Paragraph` widget, so soft-wrap continues to work for free. The current match (the one `n` lands on) gets the filled-block treatment; non-current matches stay subtle so the page doesn't look like a Christmas tree. Jump-to-match uses a sibling helper to `count_wrapped_lines` that returns the cumulative wrapped-row index for a given source line, so the viewport scrolls exactly where the wrapped output puts the match — no drift, no math at three different abstraction layers. Body and content-width are cached on `App` at render time so the event handler can resolve match → scroll position without re-deriving the body. State lives only in `App`; nothing persisted. 22 new tests cover the helpers, the state machine, and the event dispatch. * refactor(detail): tighten in-article find after self-review Self-review of the in-article-find diff turned up four things worth fixing before someone else trips on them. None are user-visible today; they're all latent. *exit_detail_view* cleared the search query but left *input_mode* alone. Today no code path triggers the bad case — but the day someone wires a macro or remote-trigger that swaps views, the user gets stranded in ArticleSearch mode in a view that has no footer. One line, fixed. The detail renderer was cloning the entire rendered body into *article_body_cache* every frame regardless of whether search was active — about 500KB/s of churn on a 50KB full-text article for a feature nobody was using. Gate it on "search footer visible" and the steady state goes to zero allocations. The doc on *find_article_matches* called the ASCII-only fold a "v1 limitation." It is not a limitation, it is a load-bearing *invariant*: the highlight render slices the original line at the byte offsets the search reported, which only works because *to_ascii_lowercase* is byte-length-preserving. *str::to_lowercase* — which someone will reach for when they want proper Unicode — is *not* length-preserving (\`"İ".to_lowercase()\` is two code points) and would silently invalidate the offset model. Reframe the comment so a future maintainer can't miss it. While at it: *build_styled_body* had a dead *'a* lifetime and was *to_string()*-ing every unmatched line. Borrow from *body* via *Line::from(&str)* and *Span::raw(&str)*, replace the HashMap bucketing with a linear walk over already-sorted matches. Tests added for the invariant, multi-byte offset correctness, *input_mode* reset, and *build_styled_body* coverage.

bahdotsh added 9 commits May 15, 2026 13:58

bahdotsh merged commit 3715058 into main May 15, 2026
8 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

feat: full-text article extraction via Mozilla Readability#43

feat: full-text article extraction via Mozilla Readability#43
bahdotsh merged 9 commits into
mainfrom
feat/2-full-text-article-extraction-readability

bahdotsh commented May 15, 2026 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

bahdotsh commented May 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Design choices

What changed

Test plan

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

bahdotsh commented May 15, 2026 •

edited

Loading