Patentee duplicates #5

Open
cverluise opened this issue Mar 4, 2021 · 1 comment

Comments

@cverluise
Owner

Issue description

US patents before 1920 (format 1) list each patentee twice, typically once in the header and once in the body of the text. We annotated both occurrences, so we also collect both.

As a side effect, some metrics for these formats (e.g. team size) are strongly affected, since each duplicated patentee is counted twice, and could be misleading.
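To make the effect on team size concrete, here is a minimal sketch. The names are invented for illustration, and the exact-match deduplication after case folding is only a stand-in to show the size of the bias, not the method used in this project.

```python
def team_sizes(names):
    """Return (naive count, count of distinct names ignoring case)."""
    naive = len(names)
    distinct = len({name.casefold() for name in names})
    return naive, distinct


# Hypothetical patentees extracted from one pre-1920 US patent:
# each person appears once in the header and once in the body.
names = ["JOHN A. SMITH", "MARY JONES", "John A. Smith", "Mary Jones"]
print(team_sizes(names))  # (4, 2)
```

With every patentee listed twice, the naive team size is double the real one.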

More

This happens in two cases:

  • US format 1: ~100%
  • GB format 1: ~10%, because the provisional and final publications both appear in the same document

Details

Version: 1.0.0rc1

@cverluise
Owner Author

Deduplicate using the relative Levenshtein distance on name_text (iterating over all pairs of patentees).

  • US format 1: 97%+ accuracy with a threshold of 0.43 (see doc)
  • GB format 1: nothing yet

We created a new field, is_duplicate, which is True if we found a duplicate. Note that only one of the two, the one carrying less information, is marked as a duplicate.
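The approach can be sketched roughly as follows. This is an illustration under stated assumptions, not the project's actual code: the distance is normalised by the length of the longer string, names are case-folded before comparison, "less information" is approximated by the number of non-empty fields, and the helper names (`relative_levenshtein`, `information`, `flag_duplicates`) and the `loc_text` field are hypothetical. Only `name_text`, `is_duplicate` and the 0.43 threshold come from this thread.

```python
from itertools import combinations


def levenshtein(a: str, b: str) -> int:
    """Edit distance (insert, delete, substitute) via dynamic programming."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string on the inner loop
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution
            ))
        previous = current
    return previous[-1]


def relative_levenshtein(a: str, b: str) -> float:
    """Edit distance divided by the longer length (assumed normalisation)."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


def information(patentee: dict) -> int:
    """Hypothetical proxy for 'information': number of non-empty fields."""
    return sum(1 for key, value in patentee.items()
               if key != "is_duplicate" and value)


def flag_duplicates(patentees: list, threshold: float = 0.43) -> list:
    """Return copies of the records with an is_duplicate flag set."""
    flagged = [dict(p, is_duplicate=False) for p in patentees]
    for left, right in combinations(flagged, 2):
        if left["is_duplicate"] or right["is_duplicate"]:
            continue  # already matched with another record
        # Case folding is an assumption (header names are often upper case).
        distance = relative_levenshtein(
            left["name_text"].casefold(), right["name_text"].casefold()
        )
        if distance < threshold:
            # Flag the record with less information; on a tie, the later one.
            loser = left if information(left) < information(right) else right
            loser["is_duplicate"] = True
    return flagged


patentees = [
    {"name_text": "JOHN A. SMITH", "loc_text": None},
    {"name_text": "John A. Smith", "loc_text": "Chicago"},
    {"name_text": "Mary Jones", "loc_text": "Boston"},
]
flagged = flag_duplicates(patentees)
print([p["is_duplicate"] for p in flagged])  # [True, False, False]
```

Comparing all pairs is quadratic in the number of patentees per document, which should be acceptable since a single patent rarely lists more than a handful of names.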

Will be available as of v1.0.0rc2.

Leaving this open in case we want to do something similar for GB.

@cverluise added the beta (beta testing related issues) label on Jul 15, 2021
@cverluise added this to the v1 milestone on Jul 15, 2021