
feat(scanner): coverage filter to cut vendored mis-attribution noise #174

Merged
haksungjang merged 1 commit into main from feat/vendored-coverage-filter
Jun 23, 2026

Conversation

@haksungjang

Member

Part (b) of the follow-up: reduce mis-attribution noise on real uploads.

The problem (found by a real upload)

Scanning an actual openssl 0.9.7c source tree returned ~10 scattered downstream forks (globus-toolkit, halite, megaglest, …) with wrong versions and names. The matcher took the first match for each file and deduplicated by PURL, so each widely copied file pointed at whichever fork the free OSSKB ranked first. My earlier "real OSSKB" test used a hand-picked, clean 5-file sample, which hid this.

The fix — coverage / consensus filter

identify-vendored.sh now:

  • groups file matches by library name and promotes a component only when at least SCANOSS_MIN_FILES files (default 2) agree on it → one-off single-file fork matches are dropped;
  • resolves the version and PURL from the consensus across the supporting files → a library split across forks collapses to one component with the canonical identity (e.g. openssl 3.0.0, not a minority 0.9.1c fork);
  • records the file-support count (bomlens:scanoss:files);
  • SCANOSS_MIN_FILES=1 disables the filter (back-compat escape hatch).

On the synthetic model of the real failure (openssl across 4 forks + liblfds ×2 + two single-file fork hits), the result drops from 6 noisy names to openssl 3.0.0 + liblfds: the noise is gone and openssl is consolidated to the consensus identity.
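For reviewers who want the grouping and consensus steps spelled out, here is a minimal Python sketch of the idea. It is not the code in identify-vendored.sh (which is a shell script). The match tuples, file paths, PURLs, the liblfds and single-hit versions, and the names `consensus_components` and `most_common` are all hypothetical. "Consensus" is assumed here to mean the most frequent value among the supporting files, with ties going to the value seen first; the actual script may resolve it differently.

```python
import os
from collections import Counter, defaultdict

# Hypothetical per-file matches: (file path, library name, version, PURL).
# Loosely modelled on the failure pattern described above; not real scan output.
MATCHES = [
    ("crypto/a.c", "openssl", "3.0.0", "pkg:github/openssl/openssl"),
    ("crypto/b.c", "openssl", "3.0.0", "pkg:github/openssl/openssl"),
    ("crypto/c.c", "openssl", "0.9.1c", "pkg:github/example-fork/openssl"),
    ("ssl/d.c", "openssl", "3.0.0", "pkg:github/openssl/openssl"),
    ("lfds/queue.c", "liblfds", "1.0.0", "pkg:github/example/liblfds"),
    ("lfds/stack.c", "liblfds", "1.0.0", "pkg:github/example/liblfds"),
    ("util/e.c", "halite", "1.0.0", "pkg:github/example/halite"),
    ("util/f.c", "megaglest", "1.0.0", "pkg:github/example/megaglest"),
]


def most_common(values):
    """Return the most frequent value (assumed tie-break: first seen wins)."""
    return Counter(values).most_common(1)[0][0]


def consensus_components(matches, min_files=2):
    """Group matches by library name and keep names backed by enough files."""
    by_name = defaultdict(list)
    for path, name, version, purl in matches:
        by_name[name].append((path, version, purl))

    components = []
    for name, hits in by_name.items():
        files = {path for path, _, _ in hits}
        if len(files) < min_files:
            continue  # one-off match below the threshold: treated as noise
        components.append({
            "name": name,
            "version": most_common(v for _, v, _ in hits),
            "purl": most_common(p for _, _, p in hits),
            # Plays the role of the bomlens:scanoss:files support count.
            "files": len(files),
        })
    return components


if __name__ == "__main__":
    # Same default as the PR describes: 2, with 1 disabling the filter.
    threshold = int(os.environ.get("SCANOSS_MIN_FILES", "2"))
    for component in consensus_components(MATCHES, threshold):
        print(component)
```

With the default threshold, only openssl and liblfds survive in this sample, and the minority 0.9.1c match is outvoted. Setting the threshold to 1 keeps all four names, which mirrors the back-compat escape hatch.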

Tests / docs

  • Fixtures updated to 2 files/component (realistic default-threshold case); orthogonal adversarial cases pin SCANOSS_MIN_FILES=1.
  • New adversarial section asserts noise-drop, consensus version/PURL, file count, and the disable switch.
  • SCANOSS_MIN_FILES documented in the guide + CLI reference, with honest framing that it helps but does not fully fix KB mis-attribution.

bash tests/test-vendored-adversarial.sh → 32 passed. bash tests/test-postprocess.sh → 41 passed.

Note: a full real-OSSKB re-scan to confirm the fix on the actual openssl tree is pending the free-OSSKB rate-limit reset; this change was verified against the synthetic model of the observed pattern.

A real C/C++ upload exposed the free-OSSKB mis-attribution at full force: an
openssl 0.9.7c tree came back as ~10 scattered downstream forks (globus-toolkit,
halite, …) with wrong versions, because the matcher took the first per-file
match and deduped by purl.

identify-vendored.sh now groups file matches by library NAME and promotes a
component only when at least SCANOSS_MIN_FILES files (default 2) agree on it, so
one-off single-file fork matches are dropped. The version and PURL are resolved
from the consensus across the supporting files, which collapses a library split
across forks into one component with the canonical identity (e.g. openssl 3.0.0,
not a minority 0.9.1c fork). Each component records its file-support count
(bomlens:scanoss:files). SCANOSS_MIN_FILES=1 disables the filter.

- fixtures updated to 2 files/component (the realistic, default-threshold case);
  the orthogonal adversarial cases pin SCANOSS_MIN_FILES=1.
- new adversarial section verifies noise-drop + consensus version/purl + the
  disable escape hatch.
- guide + CLI reference document SCANOSS_MIN_FILES and the mitigation (and that
  it helps but does not fully fix KB mis-attribution).
@haksungjang haksungjang merged commit 50b2f43 into main Jun 23, 2026
25 checks passed
@haksungjang haksungjang deleted the feat/vendored-coverage-filter branch June 23, 2026 05:26