feat(enrichers): make domain_to_website extractions opt-out #186

Open
rachit367 wants to merge 1 commit into reconurge:main from rachit367:feat/website-extraction-params

Conversation

@rachit367

What

Adds optional toggles to domain_to_website so users can skip the heavier extractions (page content, technologies, headers) for faster, lighter scans.

Closes #90

Why

From the issue thread: domain_to_website "extracts a lot and can be slow." Today it unconditionally fetches the page and runs full text extraction (up to 5000 chars), technology detection, and header capture on every run — even when a user only wants liveness + title. This follows the configurable-transformer convention from #60 (safe defaults, opt-in depth).

Changes

  • New get_params_schema() with three select params (extract_content, extract_technologies, extract_headers), all defaulting to "true", so existing behavior is unchanged. Setting any to "false" skips that work.
  • title, description, status_code, and active are always captured (they're cheap), so disabling the heavy options still gives a useful result.
  • Refactor: the two near-identical HTTPS/HTTP blocks in scan() are collapsed into a for scheme in ("https", "http") loop plus a _build_website_data() helper (so the param-gating lives in one place, not duplicated). Header selection moved to _extract_headers(). Behavior is preserved: HTTPS is tried first, HTTP is the fallback, and a failed/≥400 fetch yields an inactive Website. Also drops two unused imports.
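
A minimal sketch of how the changes above fit together, not the code in this PR. The schema shape (a list of dicts with name/type/options/default), the `fetch` callable, the `Page` dict, and the helper names `_enabled`, `build_website_data`, and `scan_domain` are assumptions made for illustration; the actual enricher uses its own types and HTTP client.

```python
from typing import Any, Callable, Dict, List, Optional

# Hypothetical stand-ins: a fetched page reduced to plain data, and a fetcher
# that returns None when the request fails.
Page = Dict[str, Any]
Fetch = Callable[[str], Optional[Page]]

HEAVY_PARAMS = ("extract_content", "extract_technologies", "extract_headers")


def get_params_schema() -> List[Dict[str, Any]]:
    """Assumed schema shape: three select params, all defaulting to "true"."""
    return [
        {"name": name, "type": "select", "options": ["true", "false"], "default": "true"}
        for name in HEAVY_PARAMS
    ]


def _enabled(params: Dict[str, str], name: str) -> bool:
    # A missing param counts as "true", so existing behavior is unchanged.
    return str(params.get(name, "true")).lower() != "false"


def build_website_data(url: str, page: Page, params: Dict[str, str]) -> Dict[str, Any]:
    """Single place where the param gating lives."""
    # Cheap fields are always captured.
    data: Dict[str, Any] = {
        "url": url,
        "active": True,
        "status_code": page["status_code"],
        "title": page.get("title"),
        "description": page.get("description"),
    }
    # Heavy fields are only added when their toggle is on.
    if _enabled(params, "extract_content"):
        data["content"] = page.get("text", "")[:5000]
    if _enabled(params, "extract_technologies"):
        data["technologies"] = page.get("technologies", [])
    if _enabled(params, "extract_headers"):
        data["headers"] = page.get("headers", {})
    return data


def scan_domain(domain: str, fetch: Fetch, params: Optional[Dict[str, str]] = None) -> Dict[str, Any]:
    """Try HTTPS first, fall back to HTTP, else report an inactive website."""
    params = params or {}
    for scheme in ("https", "http"):
        url = f"{scheme}://{domain}"
        page = fetch(url)
        if page is not None and page["status_code"] < 400:
            return build_website_data(url, page, params)
    # Failed or >=400 on both schemes. Reporting the HTTPS URL is an assumption.
    return {"url": f"https://{domain}", "active": False}
```

Keeping the gating in one helper means both schemes share the same toggle logic, which is the point of collapsing the two blocks into a loop.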

Testing

uv run --package flowsint-enrichers pytest flowsint-enrichers/tests/enrichers/ -q
10 passed

New tests (requests mocked, no network): params schema shape + safe defaults, default full extraction (regression), each toggle off independently, all-heavy-off still keeping core fields, and the request-failure → inactive path.
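
As a rough illustration of the mocked, no-network style described above, the request-failure → inactive path could be tested along these lines. `check_liveness` and its `http_get` argument are hypothetical stand-ins written for this sketch; the PR's tests patch the real enricher and its HTTP client instead.

```python
from unittest.mock import Mock


def check_liveness(domain, http_get):
    """Hypothetical stand-in for the enricher's fetch path (HTTPS, then HTTP)."""
    for scheme in ("https", "http"):
        url = f"{scheme}://{domain}"
        try:
            response = http_get(url, timeout=10)
        except Exception:
            # Any request error counts as a failed fetch for this scheme.
            continue
        if response.status_code < 400:
            return {"url": url, "active": True, "status_code": response.status_code}
    return {"url": f"https://{domain}", "active": False}


def test_request_failure_yields_inactive():
    # Every call raises, so both schemes fail and no network is touched.
    http_get = Mock(side_effect=ConnectionError("unreachable"))
    result = check_liveness("example.com", http_get)
    assert result["active"] is False
    assert http_get.call_count == 2


def test_http_fallback_after_https_failure():
    # First call (HTTPS) raises, second call (HTTP) returns a 200 response.
    http_get = Mock(side_effect=[ConnectionError("tls"), Mock(status_code=200)])
    result = check_liveness("example.com", http_get)
    assert result["active"] is True
    assert result["url"] == "http://example.com"
```

Both functions follow pytest's `test_` naming, so they would be collected by a command like the one shown above.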

The issue also mentioned iocsearcher for richer extraction — the maintainer noted that as a "good suggestion for the future." This PR addresses the concrete follow-up in the thread (an opt-out to control cost); richer extractors can layer on top later.

domain_to_website always fetched the page and extracted headers, the
full page text (up to 5000 chars), and technologies on every run, which
is slow/noisy when a user only wants liveness + title (reconurge#90).

Adds three optional select params (extract_content, extract_technologies,
extract_headers), all defaulting to "true" so existing behavior is
unchanged. Setting any to "false" skips that work. Title, description,
status_code and active are always captured (cheap).

Also de-duplicates the near-identical HTTPS/HTTP branches in scan() into
a scheme loop + a _build_website_data helper, and drops two unused
imports. This follows the params_schema convention from reconurge#60.

Tests (requests mocked, no network): params schema, default full
extraction, each toggle off, all-heavy-off keeping core fields, and the
request-failure -> inactive path.

Closes reconurge#90
@dextmorgn

Collaborator

Hey @rachit367,

Thanks for your work! For this to work, we first have to implement params settings in the UI. I'll work on this next week.

Development

Successfully merging this pull request may close these issues.

[Request] Extract More From A Webpage