feat(enrichers): Arabic media enrichers (Sabq, Argaam, Al Arabiya, Nitter)#151
Open
SocialMDev wants to merge 1 commit into
…, Nitter)

Adds 8 new enrichers that surface Arabic-language mentions of an Individual or a Phrase (topic) and link them into the graph as Website nodes:

- individual_to_sabq / phrase_to_sabq: MENTIONED_IN_SABQ
- individual_to_argaam / phrase_to_argaam: MENTIONED_IN_ARGAAM
- individual_to_alarabiya / phrase_to_alarabiya: MENTIONED_IN_ALARABIYA
- individual_to_arabic_tweets / phrase_to_arabic_tweets: MENTIONED_ON_TWITTER_AR

Shared scrapers live under src/tools/arabic_media/ and follow the existing Tool base class. Al Arabiya uses Google News RSS filtered to alarabiya.net. Arabic Twitter uses Nitter mirrors with a Google dork fallback. XML parsing uses defusedxml to avoid XXE / billion-laughs on the RSS feed.

Tests mock the underlying Tool layer; no real network calls. 19 new tests, all green alongside the existing test_registry suite.
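The mirror-then-fallback behaviour described for Arabic Twitter can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code in this PR: the mirror URLs, the function name `search_arabic_tweets`, and the two injected callables are hypothetical, and the real `NITTER_INSTANCES` values are not shown here.

```python
from typing import Callable, List, Sequence

# Hypothetical placeholder mirrors; the PR's actual NITTER_INSTANCES list is not reproduced.
NITTER_INSTANCES: List[str] = [
    "https://nitter-a.invalid",
    "https://nitter-b.invalid",
]

MirrorFetcher = Callable[[str, str], List[str]]  # (base_url, query) -> result URLs
DorkSearch = Callable[[str], List[str]]          # (query) -> result URLs


def search_arabic_tweets(
    query: str,
    fetch_from_mirror: MirrorFetcher,
    dork_search: DorkSearch,
    instances: Sequence[str] = NITTER_INSTANCES,
) -> List[str]:
    """Try each mirror in order; use the dork search only if no mirror yields hits."""
    for base_url in instances:
        try:
            hits = fetch_from_mirror(base_url, query)
        except Exception:
            # Broad on purpose for this sketch: a dead or rate-limited
            # mirror should not abort the whole lookup.
            continue
        if hits:
            return hits
    # Every mirror failed or returned nothing.
    return dork_search(query)
```

Passing the fetchers in as callables keeps the control flow testable without network access: one test can hand in a fetcher that always raises to exercise the fallback path, and another can hand in a fetcher that succeeds on the second mirror.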
Demo evidence — |
Summary
Adds 8 enrichers that surface Arabic-language mentions of an `Individual` or a `Phrase` (topic) and link them into the graph as `Website` nodes with source-specific relationship labels.

- `individual_to_sabq` / `phrase_to_sabq`: `MENTIONED_IN_SABQ`
- `individual_to_argaam` / `phrase_to_argaam`: `MENTIONED_IN_ARGAAM`
- `individual_to_alarabiya` / `phrase_to_alarabiya`: `MENTIONED_IN_ALARABIYA` (Google News RSS, `site:alarabiya.net`)
- `individual_to_arabic_tweets` / `phrase_to_arabic_tweets`: `MENTIONED_ON_TWITTER_AR`

What changed
Why a new `phrase/` category

Sabq / Argaam / Al Arabiya all support searching for topics, not just people. `Phrase` was already in `flowsint-types` but had no enrichers; this PR adds the first set. Topic search is useful for journalists and OSINT investigators tracking issues rather than individuals.

Security notes
- `defusedxml` is used in `AlArabiyaTool` for parsing Google News RSS, to avoid XXE / billion-laughs attacks on untrusted XML. Added as a dependency (`>=0.7,<0.8`).
- The Arabic Twitter lookup tries the mirrors in `NITTER_INSTANCES`, then falls back to a Google dork; tests cover both paths via mocking.

Demo
Brought up the `docker-compose.dev.yml` infra (postgres + redis + neo4j) and ran `individual_to_sabq` against real Neo4j, with HTTP mocked to return 3 fixture article hits for "Faisal Aldeghaither".

The Neo4j Browser visualisation showing the Individual → 3 Website subgraph with `MENTIONED_IN_SABQ` edges and Arabic article titles is attached in the first PR comment.

Test plan
- `pytest tests/enrichers/test_arabic_*.py`: 19 new tests, all green
- `tests/enrichers/test_registry.py` still passes (21/21 with new suite)
- All 8 enrichers are decorated with `@flowsint_enricher` and appear in `ENRICHER_REGISTRY`
- No real network calls: `SabqTool`, `ArgaamTool`, `AlArabiyaTool`, `NitterArabicTool` are mocked

Notes for maintainer
- The `to_domains.py` reference enricher served as the architectural template (preprocess / scan / postprocess split, `@flowsint_enricher` decorator, module-level `InputType` / `OutputType` re-export). I tried to match style and structure exactly; happy to adjust if you want different conventions for the new `phrase/` category.
- The selectors for `sabq.org` and `argaam.com` are defensive (multiple fallbacks via comma-separated CSS selectors) but will need maintenance if those sites change their markup.
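For reviewers who want a feel for how the tests keep the Tool layer off the network, here is a minimal sketch using `unittest.mock`. It is an assumption-laden illustration, not the PR's test code: the `SabqTool.search` method, the synchronous `individual_to_sabq` function, and the dict shape of the results are all hypothetical stand-ins.

```python
from typing import Dict, List
from unittest.mock import patch


class SabqTool:
    """Hypothetical stand-in for the real tool; refuses to touch the network."""

    def search(self, query: str) -> List[Dict[str, str]]:
        raise RuntimeError("network access is not allowed in tests")


def individual_to_sabq(full_name: str) -> List[Dict[str, str]]:
    """Hypothetical enricher body: map tool hits to Website-like dicts."""
    hits = SabqTool().search(full_name)
    return [
        {"url": hit["url"], "title": hit["title"], "label": "MENTIONED_IN_SABQ"}
        for hit in hits
    ]


def test_individual_to_sabq_uses_mocked_tool() -> None:
    fixture = [{"url": "https://sabq.org/fixture-1", "title": "Fixture article"}]
    # Replace the tool method for the duration of the test only.
    with patch.object(SabqTool, "search", return_value=fixture) as mocked:
        nodes = individual_to_sabq("Faisal Aldeghaither")
    mocked.assert_called_once_with("Faisal Aldeghaither")
    assert nodes == [
        {
            "url": "https://sabq.org/fixture-1",
            "title": "Fixture article",
            "label": "MENTIONED_IN_SABQ",
        }
    ]
```

Patching at the tool boundary means the enricher's own mapping logic still runs against fixture data, so a markup change on the source site would break the scraper without breaking these tests.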