@JonoYang

This PR adds a gibberish detector to textcode to avoid processing nonsense copyright strings detected from binaries.

@pombredanne changed the title from "2402 detect gibberish copyright" to "Detect gibberish copyright #2402" on Nov 20, 2025
    * Remove unnecessary tests

Signed-off-by: Jono Yang <[email protected]>
@JonoYang JonoYang requested a review from pombredanne November 20, 2025 21:11

JonoYang commented Nov 20, 2025

@pombredanne after removing the tests that I linked above, only these data-driven tests still fail:

https://github.com/aboutcode-org/scancode-toolkit/blob/2402-detect-gibberish-copyright/tests/cluecode/data/copyrights/scilab-Scilab#L67

  • an instance of "Scilab (c) INRIA-ENPC." was not detected
  • "c) INRIA-ENPC." is identified as gibberish

https://github.com/aboutcode-org/scancode-toolkit/blob/2402-detect-gibberish-copyright/tests/cluecode/data/copyrights/misco4/linux-copyrights/Documentation/networking/arcnet-hardware.txt#L32

  • "Copyright Waterloo Microsystems Inc. 1985" was not detected
  • "@Copyright" is identified as gibberish

https://github.com/aboutcode-org/scancode-toolkit/blob/2402-detect-gibberish-copyright/tests/cluecode/data/authors/trailing_date#L3C19-L3C59

  • "Alexander Kanavin <[email protected]>" was not detected
  • "* : commit 3debe362faa62e5b381b880e3ba23aee07c85f6e Author:" is detected as gibberish
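The three failures above could be kept as a small regression table, so that any change to the detector can be checked against them quickly. This is only a sketch: `Case`, `CASES`, `false_positives`, and `stub_detector` are hypothetical names for illustration, and `stub_detector` is a deliberately naive stand-in, not the detector added in this PR. The real check would pass in whatever callable textcode ends up exposing.

```python
from typing import Callable, List, NamedTuple


class Case(NamedTuple):
    expected: str  # detection the test data expects
    fragment: str  # fragment reported as gibberish in this thread


# The three cases reported in this comment, copied as-is.
CASES = [
    Case("Scilab (c) INRIA-ENPC.", "c) INRIA-ENPC."),
    Case("Copyright Waterloo Microsystems Inc. 1985", "@Copyright"),
    Case(
        "Alexander Kanavin <[email protected]>",
        "* : commit 3debe362faa62e5b381b880e3ba23aee07c85f6e Author:",
    ),
]


def false_positives(
    is_gibberish: Callable[[str], bool],
    cases: List[Case] = CASES,
) -> List[Case]:
    """Return the cases whose fragment the given detector flags as gibberish."""
    return [case for case in cases if is_gibberish(case.fragment)]


def stub_detector(text: str) -> bool:
    """Hypothetical stand-in: flag text that does not start with a letter."""
    return not text[:1].isalpha()


if __name__ == "__main__":
    for case in false_positives(stub_detector):
        print(f"flagged {case.fragment!r}, which hides {case.expected!r}")
```

An empty result from `false_positives` would mean the detector no longer hides any of the three expected detections.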
