- Sponsor
-
Notifications
You must be signed in to change notification settings - Fork 16
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
New source: SILVA taxonomy #348
base: main
Are you sure you want to change the base?
Conversation
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #348 +/- ##
=======================================
Coverage ? 51.96%
=======================================
Files ? 187
Lines ? 12161
Branches ? 1857
=======================================
Hits ? 6319
Misses ? 5603
Partials ? 239 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
src/pyobo/sources/silva.py
Outdated
reference=Reference(prefix="ena.embl", identifier=accession, name=organism) | ||
) | ||
# Do NOT annotate the new term with a rank (leave it unranked). | ||
new_term.append_parent(Reference(prefix=PREFIX, identifier=species_taxon_id)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't ENA terms represent nucleotide sequences derived from experiments? Can they also represent projects?
From what I understand, they aren't actually themselves representing taxa. Therefore this parent/child relationship doesn't make sense.
The hard work of making a PyOBO source is really understanding what is the relationship SILVA means when it mentions its internal taxonomy and ENA sequences. I can't do this hard work for you in detail, but from a high level it seems like the sequence was derived from an individual of the taxonomy.
Then, there's two options:
- Find an existing RO relationship that is appropriate for this. Maybe http://purl.obolibrary.org/obo/RO_0001001, even though it's not a perfect ontological fit. Maybe OBI is a better place to look
- mint an ad-hoc one yourself within the scope of this file, e.g., like in
pyobo/src/pyobo/sources/clinicaltrials.py
Line 23 in ada760b
HAS_INTERVENTION = TypeDef(
If you go the second route, make sure that you do a good job describing what the relationship means (in a concise way)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for your detailed feedback. I completely understand where the hard work lies, and I truly appreciate the guidance you provided. Your suggestions—either reusing an existing RO relationship (like RO_0001001) or minting an ad-hoc one (as in clinicaltrials.py)—are exactly the direction I was hoping for.
I’ll explore those options further. Alternatively, I might start by representing only down to the genus level (as shown in the taxonomy files) until I fully understand the nuances of the lower levels.
Thanks again for steering this work in the right direction!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's good to surface decision like this higher up, and ideally all of pyobo would be biolink compliant. biolink:has_biological_sequence is the right KG relationship to use.
If you are going to use RO you need to use it consistently with how it's intended and not just pick a label that sounds right.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added an example ad-hoc typedef that you can fill in (or duplicate) for your purposes
src/pyobo/sources/silva.py
Outdated
logger.setLevel(logging.WARNING) | ||
|
||
TYPEDEF = TypeDef( | ||
reference=default_reference(PREFIX, "fixme", name="fixme"), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixme here
Summary of Changes
|
@cthoyt I apologize, I just noticed I did something wrong, I see the warning: "Merging is blocked 1 review requesting changes by reviewers with write access. I am trying to figure out how to push to the proper branch. |
@jplfaria are you unable to push to your own branch? The pull request dialog is to inform you that you can't merge the pull request, since it's my repository. That's not an issue for you to worry about |
@cthoyt thank you for the clarification. I thought that was the case, but my inexperience with git made me think I was doing something wrong. |
The organism is associated with the sequence/project identified by this accession, but it's not the name of the project
@jplfaria see FIXME comments in recent commits I made |
RELATION_NEEDS_NEW_NAME = TypeDef( | ||
reference=default_reference(PREFIX, "has_related_sequence", name="has related sequence"), | ||
# FIXME! | ||
definition="This relation represents a connection between a species and ENA records that are " |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we still need to understand the context in what it means for a SILVA taxon to be mapped to an ENA record for a sequence. I am starting to think that this is just as simple as "the sequence was derived from a sample taken from an individual of this species" but please do a deep dive to clarify further
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The fixes look good to me.
Would strongly consider reusing a relation from an existing ontology or standard (or working with an existing standard like RO to get a new term added) rather than minting a new one. I can help get a new term added for you, or you can use (as @cmungall states above) a term from Biolink, in particular |
thanks @sierra-moxon ! On second thought, I am not sure this is what we need, because the description is so vague |
@cthoyt, I apologize for the delay in responding. I have been buried in deadlines, and things only look to clear in mid-April. At that point, I can deep-dive into this, as I haven't looked into Biolink at all. |
@jplfaria please don't spend your time looking into biolink at the moment. There's no documentation for the suggested predicates for our use case, meaning that reusing them is not a good course of action. The best way forward is for us to mint our own predicates, but do a 100% job giving detailed context as to what the predicates are for |
In fact @jplfaria will likely be looking at Biolink anyway since the database he is building will be strongly aligned with it! |
Summary
This pull request implements the SILVA taxonomy as an ontology converter. The module converts SILVA small subunit (SSU) taxonomy data into OBO (and OWL) format and is organized in a style similar to the GTDB module.
Key Decisions and Implementation Details
Internal SILVA Taxonomy IDs:
SSU vs. LSU Taxonomy:
"""Convert SILVA small subunit (ssu) taxonomy to OBO format."""
Handling ENA Accession Numbers:
TAXRANK:0000006 (species)
to all ENA entries.Version Introduced Information:
Code Organization
Module-Level Structure:
PREFIX
is defined as"silva.taxon"
.Rank Mapping:
SILVA_RANK_TO_TAXRANK
covers all SILVA taxonomic ranks.Main Processing Steps:
Main Taxonomy File:
";"
(ignoring empty strings)."Bacteria"
or"Actinomycetota"
), and the parent is determined by joining all but the last element.Taxmap File:
ena.embl:
.I welcome any feedback or suggestions on URL handling, inclusion of the version introduced field, or any other aspect of the implementation.
Please let me know if further details are needed.