feat: full-text article extraction via Mozilla Readability#43
Merged
Conversation
Most RSS feeds ship truncated summaries and call it a day. The user then has to leave the terminal, open a browser, and re-read the metadata they just read in feedr. This is the single biggest paper cut in the whole experience. Add a full-text extraction path that fetches the article URL and runs Mozilla Readability over the HTML to pull out the actual content. Bind it to Shift+F in the detail view (toggle on success, retry on failure, no-op while pending), or set fulltext = true on a feed to auto-extract newly-seen items on refresh. The crate of choice is dom_smoothie — pure sync, MSRV matches ours exactly, actively maintained, has a parse budget. The obvious alternative (article_scraper) is async-only and would drag tokio plus a full reqwest async stack into a codebase that has been deliberately single-threaded synchronous from day one. Not happening. Extractions run on per-request std::thread::spawn workers that post results back over an mpsc::channel — same shape as spawn_feed_refresh. State lives in App::extracted, keyed by the existing item-id scheme, with an LRU cap of 500 and explicit in-memory-only semantics. No disk persistence in v1 — cache invalidation is a problem I would rather not invent for a feature whose entire point is "fetch fresh content on demand." Per-feed Authorization headers are deliberately NOT forwarded to article URLs. They are third-party hosts. Leaking bearer tokens to random blog domains is the kind of thing that ends up in a CVE. While at it, mark_feed_seen is hoisted out of collect_exec_on_new so exec_on_new and the new auto-fulltext path share one mark per feed arrival. Calling it twice would double-mark and the second consumer would see an empty newly_seen list and silently do nothing. The gate also covers both consumers now, so users with neither hook still pay zero disk for seen-set tracking — same behavior as before, just with a second consumer wired in.
While wiring Shift+F in, I only updated the help *overlay* (the scrollable `?` thing) and the README. The bottom help bar that's always on screen, the compact-mode help bar, and the macro-whitelist documentation in the auto-generated config.toml comment block all still pretended the action didn't exist. Add the FetchFullText keybinding to both render_help_bar and render_compact_help_bar for the FeedItemDetail view, and tack fetch-full-text onto the list of macro-callable actions in the config doc-comment. The dispatch_action whitelist in events.rs already allowed it from macros, but a user reading the config file would never have known.
The review of the Readability extraction PR turned up a small
pile of issues. None individually catastrophic, but together they
made the feature unsafe to enable on any real-world feed list.
Let's fix them in one go.
The big ones:
* `String::from_utf8_lossy` on the response body meant every
non-UTF8 page (Windows-1252, Shift_JIS, GBK, ...) decoded into
U+FFFD soup, which Readability then dutifully rejected as
"page appears empty." Route through `encoding_rs` with
Content-Type charset > <meta charset> sniff > UTF-8.
* `response.bytes()` read the entire body before checking the
5 MB cap, which is a great way to allocate 200 MB on a hostile
server. Switch to `Response::take(MAX+1).read_to_end(...)` and
pre-check Content-Length while we're at it.
* `spawn_pending_extractions` spawned one OS thread per queued
request, unconditionally, with zero per-domain throttling.
A refresh across several `fulltext = true` feeds could fire
20+ concurrent requests at the same host. Cap concurrent
workers at 4 via an Arc<AtomicUsize>, and reuse
`last_domain_fetch` so article fetches respect the same
politeness budget feed fetches already do.
* Auto-fulltext blindly fetched whatever the feed XML put in
<link>. A hostile feed could point that at
169.254.169.254 or 10.0.0.x and use the reader to probe the
user's network. Add a small allowlist (`is_safe_auto_url`):
http/https only, no RFC1918/loopback/link-local/CGNAT, no
`.local` or `localhost` names. Manual Shift+F bypasses it --
that's the user's explicit click, not the user's problem.
* The LRU cap was soft. When the head was Pending, eviction
bailed out and inserted anyway, so a wave of stuck workers
could grow the cache past EXTRACTED_CACHE_CAP. Always evict
the head; teach `record_extraction_result` to drop late
results for non-Pending slots so we don't resurrect evicted
or removed entries.
* `dom_smoothie` panicking on hostile HTML stranded the slot
on Pending forever with no path to retry. Wrap the worker in
`catch_unwind` and surface panics as Failed.
While at it: detail-view body-source lookup now uses
`current_article_indices()` so it matches the action handler;
`toggle_or_request_fulltext` shows a success-message hint when
fired from a non-detail view (the macro path was silent before);
`remove_extraction` helper replaces the two open-coded map+deque
removals. Tests cover the new behaviors.
The Readability dep didn't need to come with this many footguns
on day one. This is what reviews are for.
A staff-level review of the previous hardening commit turned up four real gaps that the upfront `is_safe_auto_url` check didn't actually close. The big one: the auto-fulltext client used reqwest's stock `redirect::Policy::limited(10)`, which silently follows redirects to *anywhere*. A hostile feed publishes `<link>http://attacker.com</link>`, the attacker returns `302 Location: http://169.254.169.254/...`, and the entire SSRF gate is sidestepped. The whole point of the allowlist was keeping feed-supplied URLs out of the user's internal network, and we were waving them through on the redirect path. Not great. Add `Feed::build_safe_redirect_client`, whose custom policy re-runs `is_safe_auto_url` on every hop. Wire an `ExtractionRequest::safe_redirects` flag through so auto-path requests pick that client. Manual `Shift+F` keeps the permissive one — the user already chose the article, same trust model as opening it in a browser. While at it, three smaller things: - `spawn_pending_extractions` happily popped requests whose `extracted` slot had been evicted by LRU or pruned by feed removal, did a real HTTP fetch, then dropped the result. Gate each pop on the slot still being `Pending` before spawning. - `std::thread::spawn` panics on OS thread-creation failure with the inflight counter already bumped. That's a TUI crash with a counter we can't decrement. Switch to `thread::Builder::new().spawn()`, release the slot, re-queue the request. - The IPv6 SSRF list was missing 6to4 (`2002::/16`), NAT64 (`64:ff9b::/96`), deprecated site-local (`fec0::/10`), and multicast (`ff00::/8`). 6to4 is the practical concern: `2002:a9fe:a9fe::` reaches 169.254.169.254 wherever a relay exists. Also dedup the summary-formatting in the detail view so the Failed branch isn't carrying a second copy of it. Four new tests; 148/148 passing.
Another pass over the Readability code turned up six more things that wanted dealing with. None individually awful, but together they made the feature feel half-thought-through. The biggest one is the *Pending*-state trap. When the user fires `Shift+F`, the detail view replaced the entire body with "Fetching full text…" — and `Shift+F` is documented as a no-op while Pending, so the user is now stuck staring at a placeholder until the worker finishes. On a slow site behind the per-domain rate limit, that's many seconds with nothing to read. Not great. Render the summary as the body while extraction is in flight; the scroll-indicator title already says "Extracting…" so the user knows what's happening. Same trick we already use for Failed. Second one is a priority inversion: `request_extraction` always `push_back`'d, so an explicit `Shift+F` could land behind a whole refresh's worth of auto-extractions to the same host. Add a `priority_front: bool`; manual passes true, auto passes false. User-explicit actions should not wait behind background work. While at it: - `last_domain_fetch` was stamped *before* `thread::Builder::spawn`, so a rare OS spawn failure re-queued the request with a fresh stamp it hadn't earned. Stamp after the spawn succeeds. - `sniff_meta_charset` matched the first `charset` substring anywhere in the leading 1KB, so `<div data-charset="bogus">` or a `charset=evil` string inside `<script>` could pick the wrong encoding. Restrict to `charset=` *inside* a `<meta>` tag, the way the spec actually defines it. - `manual_client` and `safe_client` were locals inside `spawn_pending_extractions`, so a busy refresh built fresh `reqwest::blocking::Client`s every tick. Hoist to `run_app` scope, build once per session per path. - IPv6 SSRF comment said "carrier-grade /15". CGNAT is /10 and is already covered above by the v4 check. The comment was just wrong; fix it. Five new tests covering the priority jump and the four sniffer hardening cases (data-attribute, script literal, `<metadata>` tag, http-equiv form). 153/153 green.
The PR review found two real bugs hiding in the extraction worker
pool, plus a handful of smaller things worth cleaning up. Time to
actually fix them instead of waving hands at them.
The first one: if a Pending slot gets LRU-evicted and then
re-requested for the same id, the *original* worker is still out
there, blissfully fetching. When it eventually delivers, the
staleness gate sees Pending and writes through — so the "manual
retry" the user just performed gets silently satisfied by the
stale fetch. The id maps 1:1 to URL so the content is at least
the right URL, but it's not the fresh data the user asked for.
Fix: tag every ExtractionRequest with a monotonic generation.
Pending carries it too. record_extraction_result requires both
the state AND generation to match. The worker echoes its
generation back through the channel; stale results get dropped.
The second one: catch_unwind covers Rust panics, but if the OS
decides to kill the worker (OOM, SIGSEGV in a native dep), the
slot stays Pending forever. toggle_or_request_fulltext's
contains_key guard then blocks the user from ever retrying.
Fix: add a watchdog. Pending now carries started_at; every loop
tick we flip anything older than http_timeout * 3 to
Failed("timed out"). The inflight counter is intentionally *not*
decremented — if the worker is still alive and exits normally,
its own fetch_sub runs; double-decrementing would underflow.
After enough OS kills the budget gets pinned and the user has to
restart, but at least retries work.
While at it:
- The meta-charset sniffer was matching the first \`charset\`
substring it found in a meta tag, which meant
\`<meta name="charset" charset="utf-8">\` extracted the empty
value out of name= and gave up before finding the real one.
Now requires a word boundary before \`charset\` and \`=\` after.
- The detail-view "Press F to retry" hint was hardcoded. If the
user rebinds fetch_full_text, the hint lies. Use key_display().
- The doc claim that body allocation was "hard-capped at ~5 MB"
wasn't quite true — Vec::read_to_end's doubling growth from a
64 KiB hint peaks at ~8 MiB for a 5 MiB body. Preallocate at
the cap. Modern allocators don't commit physical pages until
written, so small pages cost nothing.
- The inflight counter was using SeqCst, which is overkill for a
plain bound. Relaxed everywhere; the mpsc channel carries the
result-handoff ordering.
- New regression test for the "mark_feed_seen is called exactly
once per feed arrival" contract — the promise was in the docs
but not in tests.
…lticast Code review against the full-text extraction path turned up three things worth fixing before merge. The biggest one: the stale-Pending watchdog flipped slots to Failed but *intentionally* didn't decrement `extract_inflight`. The reasoning was sound — if the worker is actually still alive, it will decrement on its own and we'd otherwise underflow the counter. The unstated consequence is that a truly-dead worker (OS kill, SIGSEGV in a native dep) leaks one budget slot for the rest of the session. After four such events the pool is permanently saturated and Shift+F just sits there doing nothing. Let's fix that. `prune_stale_pending_extractions` now returns the count of slots it timed out, and the caller releases that many inflight budget slots using a saturating decrement. The worker exit path uses the same saturating decrement, so a slow worker that completes *after* the watchdog already released its slot clamps at zero instead of wrapping to `usize::MAX` — which would turn the bound into "spawn as many workers as you want until the heap runs out." Worst case is a transient one-slot overcommit, which is a much better failure mode than either the leak or the underflow. While at it: `show_extracted` was a single global bool. If you toggled to summary view on article A and walked to article B, B silently rendered as summary even though its title bar advertised "Full-text" as the body. Anchor the toggle to the focused indices and reset it when focus changes. And the IPv4 branch of `is_global_ip` wasn't rejecting `224.0.0.0/4` even though the IPv6 branch rejected `ff00::/8`. Multicast HTTP servers are not really a thing, but asymmetric allowlists *are* a thing reviewers reasonably complain about. Added `v4.is_multicast()` and called it a day. Tests added for all three. cargo fmt and cargo clippy -D warnings clean; 165 tests pass.
Code review caught a slow leak in the extraction worker pool. The
4-worker cap turned out to be advisory at best.
It turns out `ExtractionState::Pending.started_at` was stamped at
`request_extraction` time, but the watchdog uses the same field to
identify stuck workers — and releases one inflight permit per pruned
slot. A request that aged out while still *queued* (rate-limited
behind a long backlog) never claimed a permit in the first place. The
watchdog "released" it anyway, silently widening the effective pool
by one per spurious release. Saturating sub kept it from
underflowing, so this wasn't a crash — just drift in the wrong
direction. Not great.
Fix is structural: add `spawned: bool` to Pending, default false,
flip to true only after `thread::Builder::spawn` returns Ok. The
watchdog now counts only spawned-pruned slots toward the inflight
release. Queue-stranded slots still flip to Failed so manual retry
works — only the permit accounting is gated. Adding a field forces
every construction site to declare its spawned-state, which beats
comments.
While at it: switch the `dom_smoothie` spec from `0.17` to `~0.17` —
same constraint in cargo (pre-1.0 caret == tilde), but the intent is
explicit at the manifest for a security-critical dep. Reword the
misleading "peak allocation ~ FULLTEXT_MAX_BYTES" comment (DOM
allocation inside dom_smoothie is separate working-set, not bounded
here). Swap `format!("{}", e)` for `format!("{:#}", e)` in
`record_extraction_result` so anyhow's cause chain survives.
Also added five hermetic stub tests for `extract_article`: oversized
Content-Length, wrong Content-Type, and the safe-redirect client
refusing to follow into RFC1918. The pure-Readability path was
well-tested; the wrapping fetch path's security gates were not.
Another pass over the extraction worker pool turned up two real
correctness issues, plus a couple of load-bearing invariants the
code was relying on without anything actually locking them.
The big one: `insert_extraction` unconditionally evicted the oldest
deque entry when the cache hit its cap. If that entry happened to
be a `Pending` slot whose worker was still alive — and the cache
only fills under sustained extraction load, so by definition the
workers ARE alive — the slot disappears from `extracted`. The
stale-Pending watchdog only walks `extracted`, so once the slot is
gone, the watchdog can't find it. If that worker is silently dead
(OS kill, SIGSEGV in a native dep), its inflight permit leaks for
the rest of the session. Eventually the pool saturates and Shift+F
sits there doing nothing.
Not great.
Fix is to scan the deque from the head for the first entry that's
NOT a `Pending { spawned: true }` and evict THAT one. Soft cap
instead of hard — the effective bound becomes
`EXTRACTED_CACHE_CAP + EXTRACTION_MAX_INFLIGHT`, i.e. 504 entries
instead of 500. The worker pool size caps how many we can ever be
protecting, so this can't grow without bound. If literally every
entry is a spawned Pending (would mean the whole pool has been
stuck for a long time), we let the cache transiently grow; the
watchdog gets to those slots on the next tick and restores
evictability on the next insert.
The second one: `prune_stale_pending_extractions` flipped a stale
Pending to Failed by calling `insert_extraction`, which moves the
entry to MRU. So a watchdog-timed-out slot was suddenly *fresher*
than every Ready entry that came in after it. Inverted eviction
order, for absolutely no reason.
Mutate the state in place instead. Preserves the original LRU
position so timed-out entries get evicted ahead of newer Ready
results, which is what you'd actually want.
While at it, the comment on the watchdog's inflight release said
"transiently overcommit by one slot." That's wrong. If all four
pool slots stall, the watchdog releases four, four new workers
spawn, and now we have eight actually running. The real bound is
`EXTRACTION_MAX_INFLIGHT` slots of overcommit, not one. Fix the
comment to say what the code actually does.
Two contracts that were load-bearing but had no test coverage:
* `extract_article` does not send Authorization headers. The
function signature deliberately omits a headers parameter, but
nothing failed loudly if a future refactor added one to the
static header list. Added a capturing stub server and a test
that asserts on the raw bytes on the wire. If anyone ever
wires per-feed auth into article extraction, this fails
immediately.
* `enqueue_fulltext_for_new` short-circuits on items that already
have any extraction state (Ready / Failed / Pending). The
guard lives in `request_extraction`'s `contains_key` check,
but a refactor could quietly remove it and a refresh would
suddenly start re-extracting every cached article on every
cycle. Locked that too.
cargo fmt clean, clippy -D warnings clean, 177/177 tests passing.
bahdotsh
added a commit
that referenced
this pull request
May 15, 2026
* feat(detail): add in-article find with smart-case matching Full-text extraction landed in v0.7.0 (#43), which is great right up until you open a 5000-word article and want to find a specific phrase. Then it's just you, j/k, and your patience. Before this commit `/` in the detail view opened the *cross-feed* search modal, which made approximately zero sense — you're staring at one article and pressing slash drops a full-screen overlay that filters your entire bookmark list. Let's fix that. Add a real find-in-page: `/` enters a new `ArticleSearch` input mode, live-highlights matches as you type, `n`/`N` cycle through them with wrap-around, and `ESC` clears. The query persists across `Shift+F` toggling between summary and extracted body, so the same search works whether you're reading the feed-provided HTML or the Readability-extracted plain text. Focus change to a different article drops the query — a stale `/foo` following you to the next article is the kind of UX that makes people uninstall things. Matching is ripgrep-style smart-case: case-sensitive when the query has any uppercase character, case-insensitive (ASCII fold) otherwise. Highlights render as styled spans inside a `Vec<Line>` fed to the existing `Paragraph` widget, so soft-wrap continues to work for free. The current match (the one `n` lands on) gets the filled-block treatment; non-current matches stay subtle so the page doesn't look like a Christmas tree. Jump-to-match uses a sibling helper to `count_wrapped_lines` that returns the cumulative wrapped-row index for a given source line, so the viewport scrolls exactly where the wrapped output puts the match — no drift, no math at three different abstraction layers. Body and content-width are cached on `App` at render time so the event handler can resolve match → scroll position without re-deriving the body. State lives only in `App`; nothing persisted. 22 new tests cover the helpers, the state machine, and the event dispatch. * refactor(detail): tighten in-article find after self-review Self-review of the in-article-find diff turned up four things worth fixing before someone else trips on them. None are user-visible today; they're all latent. *exit_detail_view* cleared the search query but left *input_mode* alone. Today no code path triggers the bad case — but the day someone wires a macro or remote-trigger that swaps views, the user gets stranded in ArticleSearch mode in a view that has no footer. One line, fixed. The detail renderer was cloning the entire rendered body into *article_body_cache* every frame regardless of whether search was active — about 500KB/s of churn on a 50KB full-text article for a feature nobody was using. Gate it on "search footer visible" and the steady state goes to zero allocations. The doc on *find_article_matches* called the ASCII-only fold a "v1 limitation." It is not a limitation, it is a load-bearing *invariant*: the highlight render slices the original line at the byte offsets the search reported, which only works because *to_ascii_lowercase* is byte-length-preserving. *str::to_lowercase* — which someone will reach for when they want proper Unicode — is *not* length-preserving (\`"İ".to_lowercase()\` is two code points) and would silently invalidate the offset model. Reframe the comment so a future maintainer can't miss it. While at it: *build_styled_body* had a dead *'a* lifetime and was *to_string()*-ing every unmatched line. Borrow from *body* via *Line::from(&str)* and *Span::raw(&str)*, replace the HashMap bucketing with a linear walk over already-sorted matches. Tests added for the invariant, multi-byte offset correctness, *input_mode* reset, and *build_styled_body* coverage.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Adds Mozilla-Readability full-text article extraction to feedr. Press
Shift+Fin the article detail view to fetch the linked URL and render the actual article inline, replacing the typically-truncated feed summary. Press again to toggle back. Per-feedfulltext = trueauto-extracts newly-seen items on refresh.Closes the daily paper-cut of summary-only feeds without leaving the terminal.
Design choices
dom_smoothie— pure sync, MSRV matches feedr's exactly (1.75), actively maintained (v0.17.0, March 2026), MIT, configurable parse budget. Picked overarticle_scraperbecause that one is async-only and would drag tokio + a full async reqwest stack into a codebase that has been deliberately single-threaded synchronous from day one.std::thread::spawnworker that posts the result back over a long-livedmpsc::channel. Same pattern feedr already uses forspawn_feed_refresh. The UI never blocks on extraction.App::extracted: HashMap<item_id, ExtractionState>with LRU cap of 500. In-memory only — no disk persistence in v1. A restart re-fetches on demand.Authorizationheaders are deliberately NOT forwarded to article URLs. Article URLs are third-party hosts; propagating bearer tokens would be a credential leak.mark_feed_seensemantics asexec_on_new— first observation of a feed seeds silently to avoid a firehose).<200chars of extracted body, non-HTML content-type, or>5 MBresponse are rejected with a clear reason. The detail view falls back to the summary and lets the user retry withShift+F.What changed
Cargo.toml: adddom_smoothie = "0.17"src/config.rs: optionalfulltextfield onDefaultFeed(additive, backward-compatible)src/feed.rs: newExtractedArticle+extract_article(HTTP) +extract_from_html(pure test seam)src/app.rs:ExtractionStateenum,App::extracted/extracted_order/fulltext_feeds/show_extracted/pending_extraction_requests, LRU cache helpers, prune-on-feed-removesrc/keybindings.rs:KeyAction::FetchFullText+ default bindShift+Fsrc/events.rs: dispatch inView::FeedItemDetail+ macro whitelistsrc/tui.rs: long-livedmpscchannel,spawn_pending_extractionsworker pool, hoistedmark_feed_seensoexec_on_newand fulltext share one mark per feed arrivalsrc/ui/detail.rs: render extracted vs summary, four-state indicator (Summary/Extracting…/Full-text/Full-text failed), inline failure reasonsrc/ui/modals.rs: help overlay entryREADME.md+CLAUDE.md: docsTest plan
cargo test --all-features— 130 lib + 4 integration tests pass (9 new)cargo clippy --all-targets --all-features -- -D warnings— cleancargo fmt --all -- --check— cleanShift+F, verify the four states (Summary→Extracting…→Full-text/Full-text failed) and toggle behaviorfulltext = trueon a feed, refresh, confirm newly-seen items auto-extract in the background without blocking the UI?help overlay showsShift+F Toggle/fetch full-text (Readability)in the Item Detail section