feat(scanner): coverage filter to cut vendored mis-attribution noise #174
Merged
A real C/C++ upload exposed the free-OSSKB mis-attribution at full force: an openssl 0.9.7c tree came back as ~10 scattered downstream forks (globus-toolkit, halite, …) with wrong versions, because the matcher took the first per-file match and deduped by PURL.

identify-vendored.sh now groups file matches by library NAME and promotes a component only when at least SCANOSS_MIN_FILES files (default 2) agree on it, so one-off single-file fork matches are dropped. The version and PURL are resolved from the consensus across the supporting files, which collapses a library split across forks into one component with the canonical identity (e.g. openssl 3.0.0, not a minority 0.9.1c fork). Each component records its file-support count (bomlens:scanoss:files). SCANOSS_MIN_FILES=1 disables the filter.

- fixtures updated to 2 files/component (the realistic, default-threshold case); the orthogonal adversarial cases pin SCANOSS_MIN_FILES=1.
- new adversarial section verifies noise-drop + consensus version/PURL + the disable escape hatch.
- guide + CLI reference document SCANOSS_MIN_FILES and the mitigation (and that it helps but does not fully fix KB mis-attribution).
Part (b) of the follow-up: reduce mis-attribution noise on real uploads.
The problem (found by a real upload)
Scanning an actual openssl 0.9.7c source tree returned ~10 scattered downstream forks (globus-toolkit, halite, megaglest, …) with wrong versions/names — because the matcher took the first per-file match and deduped by PURL, so widely-copied files each pointed at whichever fork the free OSSKB ranked first. My earlier "real OSSKB" test used a hand-picked clean 5-file sample, which hid this.
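To make the failure mode concrete, here is a minimal Python sketch of "first match per file, dedupe by PURL". It is not the script's actual code. The function name `first_match_dedupe`, the data model, the file paths, the versions and the PURLs are all made up for illustration; only the fork names come from the scan described above, and the idea that openssl appears as a lower-ranked candidate is an assumption.

```python
def first_match_dedupe(ranked_matches_by_file):
    """Take the top-ranked match for each file, then dedupe by PURL.

    ranked_matches_by_file: hypothetical mapping of file path to a ranked
    list of (name, version, purl) candidates, best-ranked first.
    """
    components = {}
    for path, ranked in ranked_matches_by_file.items():
        if not ranked:
            continue  # file had no match at all
        name, version, purl = ranked[0]  # only the first match is considered
        components.setdefault(purl, (name, version))
    return components


# Illustrative data: three widely copied files whose top-ranked match is a
# different downstream fork each time. Versions and PURLs are invented.
scan = {
    "crypto/bn/bn_add.c": [
        ("globus-toolkit", "1.0", "pkg:github/example/globus-toolkit"),
        ("openssl", "0.9.7c", "pkg:github/example/openssl"),
    ],
    "crypto/md5/md5_dgst.c": [
        ("halite", "1.0", "pkg:github/example/halite"),
        ("openssl", "0.9.7c", "pkg:github/example/openssl"),
    ],
    "crypto/rsa/rsa_lib.c": [
        ("megaglest", "1.0", "pkg:github/example/megaglest"),
        ("openssl", "0.9.7c", "pkg:github/example/openssl"),
    ],
}

components = first_match_dedupe(scan)
print(sorted(name for name, _ in components.values()))
```

Because every PURL is distinct, deduping by PURL removes nothing: three files yield three unrelated components, and no single-file hit is ever challenged by its neighbours.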
The fix — coverage / consensus filter
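identify-vendored.sh now:
- groups file matches by library name and promotes a component only when at least SCANOSS_MIN_FILES files (default 2) agree on it → one-off single-file fork matches are dropped;
- resolves the version and PURL from the consensus across the supporting files, collapsing a library split across forks into one component with the canonical identity (openssl 3.0.0, not a minority 0.9.1c fork);
- records each component's file-support count (bomlens:scanoss:files);
- SCANOSS_MIN_FILES=1 disables the filter (back-compat escape hatch).

On the synthetic model of the real failure (openssl across 4 forks + liblfds ×2 + two single-file fork hits), the result drops from 6 noisy names to openssl 3.0.0 + liblfds: the noise is gone and openssl is consolidated to the consensus.

Tests / docs
- Fixtures updated to 2 files/component (the realistic, default-threshold case); the orthogonal adversarial cases pin SCANOSS_MIN_FILES=1.
- SCANOSS_MIN_FILES documented in the guide + CLI reference, with honest framing that it helps but does not fully fix KB mis-attribution.
- bash tests/test-vendored-adversarial.sh → 32 passed.
- bash tests/test-postprocess.sh → 41 passed.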
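
For readers who want the coverage / consensus filter in executable form, the following Python sketch shows one plausible reading of it. It is not the code in identify-vendored.sh. The function name `consensus_filter`, the tuple layout, the sample paths, PURLs and the liblfds/fork versions are assumptions made for illustration, and "consensus" is interpreted here as an independent majority vote on version and on PURL, with ties going to the value seen first.

```python
import os
from collections import Counter, defaultdict


def consensus_filter(file_matches, min_files=None):
    """Group per-file matches by library name and keep well-supported ones.

    file_matches: iterable of (path, name, version, purl), one per matched file
    (a hypothetical layout). min_files defaults to SCANOSS_MIN_FILES, or 2.
    """
    if min_files is None:
        min_files = int(os.environ.get("SCANOSS_MIN_FILES", "2"))

    by_name = defaultdict(list)
    for path, name, version, purl in file_matches:
        by_name[name].append((version, purl))

    components = []
    for name, hits in by_name.items():
        if len(hits) < min_files:
            continue  # too few supporting files: treated as noise and dropped
        # Assumed consensus rule: most common value among supporting files.
        version = Counter(v for v, _ in hits).most_common(1)[0][0]
        purl = Counter(p for _, p in hits).most_common(1)[0][0]
        components.append({
            "name": name,
            "version": version,
            "purl": purl,
            "bomlens:scanoss:files": len(hits),  # file-support count
        })
    return components


# Illustrative data only, loosely modeled on the synthetic case above.
matches = [
    ("crypto/a.c", "openssl", "3.0.0", "pkg:github/example/openssl"),
    ("crypto/b.c", "openssl", "3.0.0", "pkg:github/example/openssl"),
    ("crypto/c.c", "openssl", "3.0.0", "pkg:github/example/openssl"),
    ("crypto/d.c", "openssl", "0.9.1c", "pkg:github/example-fork/openssl"),
    ("lfds/x.c", "liblfds", "1.0", "pkg:github/example/liblfds"),
    ("lfds/y.c", "liblfds", "1.0", "pkg:github/example/liblfds"),
    ("misc/e.c", "halite", "1.0", "pkg:github/example/halite"),
    ("misc/f.c", "megaglest", "1.0", "pkg:github/example/megaglest"),
]

for component in consensus_filter(matches, min_files=2):
    print(component["name"], component["version"], component["bomlens:scanoss:files"])
```

With `min_files=2` the two single-file fork hits are dropped and the minority 0.9.1c match is outvoted; with `min_files=1` every name survives, which mirrors the escape hatch. Voting on version and PURL separately keeps the sketch short, but it could in principle pair a version with a PURL that never co-occurred in any single match, so a real implementation may prefer to vote on the pair.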