Prevent duplicate PIs on ingest #1564

Open
naglepuff opened this issue Mar 6, 2025 · 0 comments

Background

In the Postgres database, the principal_investigator table stores each PI as a row. Rows are populated during study ingest from the principal_investigator slot. In Mongo, by contrast, PI information is stored inline on each study, so a PI shared by multiple studies is duplicated there. On ingest, we do our best to represent each PI only once:

    def get_or_create_pi(db: Session, name: str, url: Optional[str], orcid: Optional[str]) -> str:
        pi = db.query(PrincipalInvestigator).filter_by(name=name).first()
        if pi:
            return pi.id

Problem

name is not a good equality check for PIs. If a PI's name is written differently in two different projects (e.g. "Mike" vs "Michael"), they will be represented twice in our database. The biggest consequence is that the PI facet may show two entries for the same person.

Potential Solution

On our end, we could prevent duplicate PIs by matching on ORCID iD instead of name. Each PI should only ever have one ORCID iD, so it is a more reliable key than a free-text name.
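
A minimal sketch of what that lookup might look like, offered as a starting point rather than the actual implementation. To keep it self-contained it uses the standard-library sqlite3 module in place of the SQLAlchemy session and Postgres table. The normalize_orcid and create_schema helpers, the column layout, and the fallback to name matching when no ORCID iD is supplied are all assumptions, not existing code.

```python
import re
import sqlite3
import uuid
from typing import Optional

# Matches a bare ORCID iD (e.g. 0000-0002-1825-0097) at the end of a string,
# so both bare iDs and https://orcid.org/... URLs are accepted.
ORCID_RE = re.compile(r"(\d{4}-\d{4}-\d{4}-\d{3}[\dX])$")


def normalize_orcid(orcid: Optional[str]) -> Optional[str]:
    """Hypothetical helper: reduce an ORCID URL or bare iD to the bare iD."""
    if not orcid:
        return None
    match = ORCID_RE.search(orcid.strip().rstrip("/").upper())
    return match.group(1) if match else None


def create_schema(conn: sqlite3.Connection) -> None:
    """Assumed table layout, standing in for the real principal_investigator table."""
    conn.execute(
        "CREATE TABLE principal_investigator ("
        "id TEXT PRIMARY KEY, name TEXT, url TEXT, orcid TEXT UNIQUE)"
    )


def get_or_create_pi(
    conn: sqlite3.Connection, name: str, url: Optional[str], orcid: Optional[str]
) -> str:
    key = normalize_orcid(orcid)
    if key is not None:
        # Preferred path: match on ORCID iD, ignoring how the name is spelled.
        row = conn.execute(
            "SELECT id FROM principal_investigator WHERE orcid = ?", (key,)
        ).fetchone()
    else:
        # Assumption: without a usable ORCID iD, fall back to name matching.
        row = conn.execute(
            "SELECT id FROM principal_investigator WHERE name = ?", (name,)
        ).fetchone()
    if row:
        return row[0]

    pi_id = str(uuid.uuid4())
    conn.execute(
        "INSERT INTO principal_investigator (id, name, url, orcid) VALUES (?, ?, ?, ?)",
        (pi_id, name, url, key),
    )
    return pi_id
```

With this version, ingesting "Mike" and "Michael" with the same ORCID iD would resolve to a single row. It does not merge duplicates that already exist, and a PI who appears once with an ORCID iD and once without, under a different name, would still be stored twice.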
