-
Notifications
You must be signed in to change notification settings - Fork 29
feat(heuristics): add Fake Email analyzer to validate maintainer email domain #1106
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
c6f35f7
to
8d29103
Compare
f945882
to
a7103e4
Compare
src/macaron/malware_analyzer/pypi_heuristics/metadata/fake_email.py
Outdated
Show resolved
Hide resolved
…l domains Signed-off-by: Amine <[email protected]>
Signed-off-by: Amine <[email protected]>
Signed-off-by: Amine <[email protected]>
759ab97
to
d99495c
Compare
…mail domain validation Signed-off-by: Amine <[email protected]>
d99495c
to
a8d373b
Compare
""" | ||
emailinfo = None | ||
try: | ||
emailinfo = validate_email(email, check_deliverability=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you please make check_deliverability
configurable via defaults.ini
so that you can turn it off in unit tests? We try to avoid network calls in our unit tests.
analyzer: FakeEmailAnalyzer, pypi_package_json_asset_mock: MagicMock, mock_validate_email: MagicMock | ||
) -> None: | ||
"""Test analyzer passes when only maintainer_email is present and valid.""" | ||
email = "[email protected]" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is example.net
supposed to be a valid domain?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes and no. The domain is technically valid, but it’s reserved by IANA and not intended for real-world use. So to ensure the email is actually usable by a real user, Should I add a list of reserved domains and TLDs and check against them before proceeding with validation or acceptance, or should I keep it as it is?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see the email validator packages has a test environment flag?
test_environment=False
: IfTrue
, DNS-based deliverability checks are disabled andtest
and**.test
domain names are permitted (see below). You can also setemail_validator.TEST_ENVIRONMENT
toTrue
to turn it on for all calls by default.
Would that maybe be better in a unit test? If a user does use those IANA reserved domains shouldn't it fail if we use check_deliverability
as they aren't set up to receive emails?
validated_emails = info.get("validated_emails") | ||
assert isinstance(validated_emails, list) | ||
assert len(validated_emails) == 2 | ||
assert {"normalized": "[email protected]", "local_part": "author", "domain": "example.com"} in validated_emails |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same question, if example.com
is supposed to be a valid domain.
Signed-off-by: Amine <[email protected]>
@@ -38,6 +38,7 @@ dependencies = [ | |||
"problog >= 2.2.6,<3.0.0", | |||
"cryptography >=44.0.0,<45.0.0", | |||
"semgrep == 1.113.0", | |||
"email_validator >=2.2.0,<3.0.0", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this the package email-validator? I can find a package spelled with a hiphen, email-validator
, but I can't find one spelled with an underscore like you have here.
""" | ||
package_json = pypi_package_json.package_json | ||
if not package_json.get("info", {}): | ||
return HeuristicResult.SKIP, {"message": "No package info available."} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This may be better off raising a HeuristicAnalyzerValueError
, as the info
field should always be a part of the PyPI JSON data, so if it isn't there then we were given malformed JSON data.
if not package_json.get("info", {}): | ||
return HeuristicResult.SKIP, {"message": "No package info available."} | ||
|
||
author_email = json_extract(package_json, ["info", "author_email"], str) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see here we're extracting the email fields and then running the is_valid_email
function directly on what is present in that field. Have you tested if this works on fields where the email is not the only text present? I've got some PyPI JSON data from Django where author_email
looks like this:
"author_email": "Django Software Foundation <[email protected]>"
So the text includes more than just the email. Here's another example from the black package:
"author_email": "Łukasz Langa <[email protected]>"
It may also be a string of multiple emails as well, like the ultralytics package:
"author_email": "Glenn Jocher <[email protected]>, Jing Qiu <[email protected]>"
Summary
This PR adds a new heuristic analyzer called
FakeEmailAnalyzer
. It verifies the validity of maintainer email addresses listed in a PyPI package by checking both the format and the existence of MX records for their domains. This helps detect packages with fake or throwaway emails, which are often indicative of malicious intent.Description of changes
FakeEmailAnalyzer
that:detect_malicious_metadata_check.py
to include and invoke this new analyzer.Related issues
None
Checklist
verified
label should appear next to all of your commits on GitHub.