feat(enrichers): make domain_to_website extractions opt-out by rachit367 · Pull Request #186 · reconurge/flowsint

rachit367 · 2026-06-09T06:04:50Z

What

Adds optional toggles to domain_to_website so users can skip the heavier extractions (page content, technologies, headers) for faster, lighter scans.

Closes #90

Why

From the issue thread: domain_to_website "extracts a lot and can be slow." Today it unconditionally fetches the page and runs full text extraction (up to 5000 chars), technology detection, and header capture on every run — even when a user only wants liveness + title. This follows the configurable-transformer convention from #60 (safe defaults, opt-in depth).

Changes

New get_params_schema() with three select params — extract_content, extract_technologies, extract_headers — all defaulting to "true", so existing behavior is unchanged. Setting any to "false" skips that work.
title, description, status_code, and active are always captured (they're cheap), so disabling the heavy options still gives a useful result.
Refactor: the two near-identical HTTPS/HTTP blocks in scan() are collapsed into a for scheme in ("https", "http") loop plus a _build_website_data() helper (so the param-gating lives in one place, not duplicated). Header selection moved to _extract_headers(). Behavior is preserved: HTTPS is tried first, HTTP is the fallback, and a failed/≥400 fetch yields an inactive Website. Also drops two unused imports.
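The gating and scheme fallback described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the schema shape, the `fetch` callable (standing in for `requests` so the sketch runs offline), `scan_domain`, `_enabled`, `_extract_title`, and the `Server`-header "technology detection" are all assumptions made for the example. Only the three param names, the `"true"` defaults, the HTTPS-then-HTTP order, and the 5000-character cap come from the description.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical response shape: (status_code, headers, html).
Response = Tuple[int, Dict[str, str], str]
Fetch = Callable[[str], Optional[Response]]

HEAVY_PARAMS = ("extract_content", "extract_technologies", "extract_headers")


def get_params_schema() -> List[dict]:
    """Three select params, all defaulting to "true" (assumed schema shape)."""
    return [{"name": name, "type": "select", "default": "true"} for name in HEAVY_PARAMS]


def _enabled(params: dict, name: str) -> bool:
    # A missing param falls back to "true", so existing behavior is unchanged.
    return str(params.get(name, "true")).lower() == "true"


def _extract_title(html: str) -> Optional[str]:
    # Simplified stand-in for real HTML parsing.
    lower = html.lower()
    start = lower.find("<title>")
    if start == -1:
        return None
    end = lower.find("</title>", start)
    if end == -1:
        return None
    return html[start + len("<title>"):end].strip() or None


def _build_website_data(url: str, response: Response, params: dict) -> dict:
    status, headers, html = response
    # Cheap core fields are always captured.
    data = {"url": url, "status_code": status, "active": True,
            "title": _extract_title(html)}
    # Heavy extractions are gated in one place.
    if _enabled(params, "extract_content"):
        data["content"] = html[:5000]  # raw HTML here; the real code extracts text
    if _enabled(params, "extract_headers"):
        data["headers"] = dict(headers)
    if _enabled(params, "extract_technologies"):
        # Placeholder detection: only looks at the Server header.
        data["technologies"] = [headers["Server"]] if "Server" in headers else []
    return data


def scan_domain(domain: str, fetch: Fetch, params: Optional[dict] = None) -> dict:
    params = params or {}
    for scheme in ("https", "http"):  # HTTPS first, HTTP as fallback
        url = f"{scheme}://{domain}"
        response = fetch(url)
        if response is not None and response[0] < 400:
            return _build_website_data(url, response, params)
    # Failed or >=400 on both schemes: report an inactive website.
    return {"url": f"https://{domain}", "active": False}
```

Keeping the gating inside a single helper means a new toggle only needs one extra `if`, rather than a change in each scheme branch.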

Testing

uv run --package flowsint-enrichers pytest flowsint-enrichers/tests/enrichers/ -q
10 passed

New tests (requests mocked, no network): params schema shape + safe defaults, default full extraction (regression), each toggle off independently, all-heavy-off still keeping core fields, and the request-failure → inactive path.

The issue also mentioned iocsearcher for richer extraction — the maintainer noted that as a "good suggestion for the future." This PR addresses the concrete follow-up in the thread (an opt-out to control cost); richer extractors can layer on top later.

domain_to_website always fetched the page and extracted headers, the full page text (up to 5000 chars), and technologies on every run, which is slow/noisy when a user only wants liveness + title (reconurge#90). Adds three optional select params (extract_content, extract_technologies, extract_headers), all defaulting to "true" so existing behavior is unchanged. Setting any to "false" skips that work. Title, description, status_code and active are always captured (cheap). Also de-duplicates the near-identical HTTPS/HTTP branches in scan() into a scheme loop + a _build_website_data helper, and drops two unused imports. This follows the params_schema convention from reconurge#60. Tests (requests mocked, no network): params schema, default full extraction, each toggle off, all-heavy-off keeping core fields, and the request-failure -> inactive path. Closes reconurge#90

dextmorgn · 2026-06-18T06:50:58Z

Hey @rachit367,

Thanks for your work! For this to work, we have to implement params settings in the UI. I'll work on this next week.

rachit367 wants to merge 1 commit into reconurge:main from rachit367:feat/website-extraction-params
