feat(enrichers): Arabic media enrichers (Sabq, Argaam, Al Arabiya, Nitter) #151

Open
SocialMDev wants to merge 1 commit into reconurge:main from SocialMDev:feat/arabic-osint-enrichers

@SocialMDev

Summary

Adds 8 enrichers that surface Arabic-language mentions of an Individual or a Phrase (topic) and link them into the graph as Website nodes with source-specific relationship labels.

| Enricher | Input → Output | Relationship | Source |
| --- | --- | --- | --- |
| individual_to_sabq / phrase_to_sabq | Individual / Phrase → Website | MENTIONED_IN_SABQ | sabq.org HTML search |
| individual_to_argaam / phrase_to_argaam | Individual / Phrase → Website | MENTIONED_IN_ARGAAM | argaam.com HTML search |
| individual_to_alarabiya / phrase_to_alarabiya | Individual / Phrase → Website | MENTIONED_IN_ALARABIYA | Google News RSS (site:alarabiya.net) |
| individual_to_arabic_tweets / phrase_to_arabic_tweets | Individual / Phrase → Website | MENTIONED_ON_TWITTER_AR | Nitter mirrors → Google dork fallback |
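
For reviewers unfamiliar with the Al Arabiya source, here is a minimal sketch of how a Google News RSS search URL restricted to alarabiya.net could be built. The function name, the `hl` / `gl` / `ceid` locale values, and the quoting of the search term are assumptions for illustration; the actual AlArabiyaTool may build its query differently.

```python
from urllib.parse import urlencode

# Public Google News RSS search endpoint.
GOOGLE_NEWS_RSS = "https://news.google.com/rss/search"


def build_alarabiya_rss_url(term: str, lang: str = "ar", country: str = "SA") -> str:
    """Return an RSS search URL for `term`, limited to alarabiya.net.

    Hypothetical helper: the locale defaults are assumptions, not values
    taken from the PR.
    """
    # Quote the term so multi-word names are matched as a phrase.
    query = f'"{term}" site:alarabiya.net'
    params = {"q": query, "hl": lang, "gl": country, "ceid": f"{country}:{lang}"}
    # urlencode percent-encodes quotes, colons, and non-ASCII characters.
    return f"{GOOGLE_NEWS_RSS}?{urlencode(params)}"


if __name__ == "__main__":
    print(build_alarabiya_rss_url("Faisal Aldeghaither"))
```

The response body would then be parsed with defusedxml, as described under Security notes.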

What changed

flowsint-enrichers/
├── pyproject.toml                    +1 dep: defusedxml
├── src/
│   ├── flowsint_enrichers/
│   │   ├── individual/
│   │   │   ├── to_sabq.py            NEW
│   │   │   ├── to_argaam.py          NEW
│   │   │   ├── to_alarabiya.py       NEW
│   │   │   └── to_arabic_tweets.py   NEW
│   │   └── phrase/                   NEW dir (Phrase-input variants)
│   │       ├── __init__.py
│   │       ├── to_sabq.py
│   │       ├── to_argaam.py
│   │       ├── to_alarabiya.py
│   │       └── to_arabic_tweets.py
│   └── tools/arabic_media/           NEW dir
│       ├── __init__.py
│       ├── sabq.py                   SabqTool
│       ├── argaam.py                 ArgaamTool
│       ├── alarabiya.py              AlArabiyaTool (uses defusedxml for RSS)
│       └── nitter.py                 NitterArabicTool
└── tests/enrichers/
    ├── test_arabic_sabq.py
    ├── test_arabic_argaam.py
    ├── test_arabic_alarabiya.py
    └── test_arabic_tweets.py

Why a new phrase/ category

Sabq / Argaam / Al Arabiya all support searching for topics, not just people. Phrase was already in flowsint-types but had no enrichers — this PR adds the first set. Topic search is useful for journalists / OSINT investigators tracking issues rather than individuals.

Security notes

  • defusedxml is used in AlArabiyaTool for parsing Google News RSS, to avoid XXE / billion-laughs attacks on untrusted XML. Added as a dependency (>=0.7,<0.8).
  • All four scrapers respect a 10s timeout and degrade gracefully on non-200 responses.
  • Nitter tool tries each mirror in NITTER_INSTANCES then falls back to a Google dork; tests cover both paths via mocking.
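
The mirror-then-dork behaviour in the last bullet can be sketched roughly as follows. This is an illustration under stated assumptions, not the NitterArabicTool code: the mirror hostnames are placeholders, the `fetch` callable is a hypothetical stand-in for the HTTP layer (it returns the body on HTTP 200 and `None` otherwise), and the `site:twitter.com` dork is a guess at the fallback query.

```python
from typing import Callable, Optional, Sequence, Tuple
from urllib.parse import quote_plus

# Placeholder mirrors; the real NITTER_INSTANCES values are not shown in the PR.
NITTER_INSTANCES = [
    "https://nitter.example-a.invalid",
    "https://nitter.example-b.invalid",
]
TIMEOUT_SECONDS = 10  # matches the 10s timeout stated above

# Hypothetical fetcher: returns the response body on HTTP 200, else None.
Fetcher = Callable[[str, int], Optional[str]]


def _safe_fetch(fetch: Fetcher, url: str) -> Optional[str]:
    """Treat network errors and timeouts the same as a non-200 response."""
    try:
        return fetch(url, TIMEOUT_SECONDS)
    except OSError:  # covers TimeoutError and urllib.error.URLError
        return None


def search_with_fallback(
    query: str,
    fetch: Fetcher,
    instances: Sequence[str] = NITTER_INSTANCES,
) -> Tuple[str, str, Optional[str]]:
    """Try each mirror in order, then fall back to a Google dork.

    Returns (source, url, body); body is None if the fallback also failed.
    """
    encoded = quote_plus(query)
    for base in instances:
        url = f"{base}/search?f=tweets&q={encoded}"
        body = _safe_fetch(fetch, url)
        if body:
            return ("nitter", url, body)
    # No mirror answered: fall back to a search-engine query (assumed dork).
    dork_url = "https://www.google.com/search?q=" + quote_plus(f"site:twitter.com {query}")
    return ("google_dork", dork_url, _safe_fetch(fetch, dork_url))
```

Injecting the fetcher keeps both paths testable without network access, which is in the spirit of the mocked tests described in the test plan.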

Demo

Brought up docker-compose.dev.yml infra (postgres + redis + neo4j) and ran individual_to_sabq against real Neo4j with HTTP mocked to return 3 fixture article hits for "Faisal Aldeghaither":

MATCH (i:individual)-[r:MENTIONED_IN_SABQ]->(w:website)
RETURN i.`nodeProperties.full_name` AS person, type(r) AS rel, w.`nodeProperties.url` AS url, w.`nodeProperties.title` AS title;
person, rel, url, title
"Faisal Aldeghaither", "MENTIONED_IN_SABQ", "https://sabq.org/news/202605/saudi-vision-update-1", "تحديثات رؤية المملكة 2030"
"Faisal Aldeghaither", "MENTIONED_IN_SABQ", "https://sabq.org/news/202605/saudi-vision-update-2", "مقابلة حصرية"
"Faisal Aldeghaither", "MENTIONED_IN_SABQ", "https://sabq.org/news/202605/saudi-vision-update-3", "تقرير اقتصادي"

The Neo4j Browser visualisation showing the Individual → 3 Website subgraph with MENTIONED_IN_SABQ edges and Arabic article titles is attached in the first PR comment.

Test plan

  • pytest tests/enrichers/test_arabic_*.py — 19 new tests, all green
  • Existing tests/enrichers/test_registry.py still passes (21/21 passing when run together with the new suite)
  • All 8 enrichers register via @flowsint_enricher and appear in ENRICHER_REGISTRY
  • Postprocess writes to real Neo4j (verified with cypher-shell on demo stack)
  • Dedup logic: re-running with duplicate URLs creates each Website + relationship only once
  • No live network calls in tests — SabqTool, ArgaamTool, AlArabiyaTool, NitterArabicTool are mocked
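
The dedup property in the test plan can be illustrated with a small in-memory stand-in for the graph. This is only a sketch of the idempotency being claimed: `InMemoryGraph`, `merge_mention`, and `postprocess_hits` are hypothetical names, and the real postprocess step writes to Neo4j rather than to Python collections.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, Mapping, Set, Tuple


@dataclass
class InMemoryGraph:
    """Hypothetical stand-in for the graph: Website nodes keyed by URL."""
    websites: Dict[str, str] = field(default_factory=dict)  # url -> title
    edges: Set[Tuple[str, str, str]] = field(default_factory=set)

    def merge_mention(self, person: str, rel: str, url: str, title: str) -> None:
        # Create-if-absent semantics: an existing node or edge is left untouched.
        self.websites.setdefault(url, title)
        self.edges.add((person, rel, url))


def postprocess_hits(
    graph: InMemoryGraph, person: str, rel: str, hits: Iterable[Mapping[str, str]]
) -> None:
    """Write each distinct URL once, even if the scraper returned it twice."""
    seen: Set[str] = set()
    for hit in hits:
        url = hit["url"]
        if url in seen:
            continue
        seen.add(url)
        graph.merge_mention(person, rel, url, hit.get("title", ""))


if __name__ == "__main__":
    g = InMemoryGraph()
    hits = [
        {"url": "https://sabq.org/news/202605/saudi-vision-update-1", "title": "A"},
        {"url": "https://sabq.org/news/202605/saudi-vision-update-1", "title": "A"},
        {"url": "https://sabq.org/news/202605/saudi-vision-update-2", "title": "B"},
    ]
    # Running twice models re-running the enricher on the same input.
    postprocess_hits(g, "Faisal Aldeghaither", "MENTIONED_IN_SABQ", hits)
    postprocess_hits(g, "Faisal Aldeghaither", "MENTIONED_IN_SABQ", hits)
    print(len(g.websites), len(g.edges))
```

Keying nodes by URL and edges by (source, relationship, target) is what makes a second run a no-op in this sketch.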

Notes for maintainer

  • The existing to_domains.py reference enricher served as the architectural template (preprocess / scan / postprocess split, @flowsint_enricher decorator, module-level InputType / OutputType re-export). I tried to match style and structure exactly; happy to adjust if you want different conventions for the new phrase/ category.
  • HTML selectors for sabq.org and argaam.com are defensive (multiple fallbacks via comma-separated CSS selectors) but will need maintenance if those sites change their markup.
  • Discord-friendly: I can open follow-ups to add similar enrichers for other Arabic media (Asharq, Okaz, Riyadh Daily) if the pattern lands well.

…, Nitter)

Adds 8 new enrichers that surface Arabic-language mentions of an Individual
or a Phrase (topic) and link them into the graph as Website nodes:

- individual_to_sabq        / phrase_to_sabq          MENTIONED_IN_SABQ
- individual_to_argaam      / phrase_to_argaam        MENTIONED_IN_ARGAAM
- individual_to_alarabiya   / phrase_to_alarabiya     MENTIONED_IN_ALARABIYA
- individual_to_arabic_tweets / phrase_to_arabic_tweets MENTIONED_ON_TWITTER_AR

Shared scrapers live under src/tools/arabic_media/ and follow the existing
Tool base class. Al Arabiya uses Google News RSS filtered to alarabiya.net.
Arabic Twitter uses Nitter mirrors with a Google dork fallback.

XML parsing uses defusedxml to avoid XXE / billion-laughs on the RSS feed.

Tests mock the underlying Tool layer; no real network calls.

19 new tests, all green alongside the existing test_registry suite.
@SocialMDev

Demo evidence — individual_to_sabq executed against the dev docker-compose.dev.yml infra (postgres + redis + neo4j), HTTP mocked with 3 fixture article hits. Neo4j Browser shows the resulting subgraph with the Faisal Individual node connected via MENTIONED_IN_SABQ to 3 Arabic-titled Website nodes (screenshot attached separately by branch owner).