New source: SILVA taxonomy #348

jplfaria · 2025-02-12T06:35:47Z

Summary

This pull request implements the SILVA taxonomy as an ontology converter. The module converts SILVA small subunit (SSU) taxonomy data into OBO (and OWL) format and is organized in a style similar to the GTDB module.

Key Decisions and Implementation Details

Internal SILVA Taxonomy IDs:
- As discussed in #1306, the internal SILVA taxonomy ID URLs do not resolve externally.
- For example, I used URLs such as:
```
https://www.arb-silva.de/no_cache/download/archive/current/Exports/taxonomy/10081
```
  with the idea that navigating to:
```
https://www.arb-silva.de/no_cache/download/archive/current/Exports/taxonomy/
```
  will allow users to locate the files by internal IDs.
- Note: I am open to alternative suggestions (e.g. using URLs that resolve directly to bioregistry.io).
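
To make the two URL options concrete, here is a minimal sketch of both. The helper names `silva_archive_url` and `bioregistry_url` are hypothetical and not part of the module. The Bioregistry variant assumes the `silva.taxon` prefix is (or will be) registered there, which this PR does not confirm.

```python
# Base of the SILVA export directory quoted above.
SILVA_TAXONOMY_BASE = (
    "https://www.arb-silva.de/no_cache/download/archive/current/Exports/taxonomy/"
)
# Bioregistry resolves CURIEs of the form <prefix>:<identifier>.
BIOREGISTRY_BASE = "https://bioregistry.io/"


def silva_archive_url(taxon_id: str) -> str:
    """Return the (non-resolving) archive-style URL for an internal SILVA ID."""
    return f"{SILVA_TAXONOMY_BASE}{taxon_id}"


def bioregistry_url(taxon_id: str, prefix: str = "silva.taxon") -> str:
    """Return a Bioregistry URL, assuming the prefix is registered there."""
    return f"{BIOREGISTRY_BASE}{prefix}:{taxon_id}"


archive = silva_archive_url("10081")
resolver = bioregistry_url("10081")
```

The first form only points users at the export directory, while the second would depend on Bioregistry knowing how to redirect the prefix.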
SSU vs. LSU Taxonomy:
- This module is specific for SILVA small subunit (SSU) taxonomy.
- The large subunit (LSU) taxonomy is distinct and should be handled separately.
- The module’s docstring clearly states:
```
"""Convert SILVA small subunit (ssu) taxonomy to OBO format."""
```
Handling ENA Accession Numbers:
- At the lowest taxonomy level, SILVA uses accession numbers that resolve to ENA entries.
- Initial Approach:
  - I initially implemented these as cross-references (xrefs) at the genus level (e.g., each ENA accession was added as an xref within the genus term).
  - Example (OBO):
```
[Term]
id: silva.taxon:58060
name: Angustibacter
xref: ena.embl:AB234237 ! uncultured bacterium
xref: ena.embl:AB512285 ! Angustibacter luteus
is_a: silva.taxon:58059
property_value: TAXRANK:1000000 TAXRANK:0000005 ! has rank genus
```
- Revised Approach:
  - I then decided to represent each ENA accession as its own term, with a parent association determined using the taxmap file (see below).
  - Example (OWL snippet):
```
<owl:Class rdf:about="https://www.ebi.ac.uk/ena/browser/view/KP324679">
    <rdfs:subClassOf rdf:resource="https://www.arb-silva.de/fileadmin/silva_databases/current/Exports/taxonomy/47493"/>
    <oboInOwl:id>ena.embl:KP324679</oboInOwl:id>
    <rdfs:label>Fanellia korema</rdfs:label>
</owl:Class>
```
- Rank Annotation for ENA Terms:
  - Originally, I was assigning TAXRANK:0000006 (species) to all ENA entries.
  - However, because these ENA entries may represent either species or strain, I decided not to assign any rank to the ENA-derived terms.
Version Introduced Information:
- The main SILVA taxonomy file includes a column with the version in which a given taxon ID was introduced.
- Example in OBO Format:
```
[Term]
id: silva.taxon:58060
name: Angustibacter
property_value: version_introduced "138.2"
...
```
- I am open to discussion on whether to include this field by default as I can see it being useful but confusing.

Code Organization

Module-Level Structure:
- A global constant PREFIX is defined as "silva.taxon".
Rank Mapping:
- The dictionary SILVA_RANK_TO_TAXRANK covers all SILVA taxonomic ranks.
Main Processing Steps:
1. Main Taxonomy File:
  - Each row is split on ";" (ignoring empty strings).
  - The term’s name is set to the last element (e.g., "Bacteria" or "Actinomycetota"), and the parent is determined by joining all but the last element.
2. Taxmap File:
  - A new term is created for each ENA accession with prefix ena.embl:.
  - These terms are linked as children of the corresponding main taxonomy term, but no rank is assigned to them.
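
The path handling in step 1 can be sketched in plain Python, independent of PyOBO. This is an illustration under assumptions, not the code in `silva.py`: it assumes tab-separated rows whose first three columns are the ";"-delimited path, the taxon ID, and the rank, and the IDs in the sample rows are made up. `Taxon` and `parse_taxonomy` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Taxon:
    taxon_id: str
    name: str
    rank: str
    parent_id: Optional[str]


def parse_taxonomy(rows):
    """Turn (path, taxon ID, rank) rows into taxa with resolved parents."""
    parsed = []
    for row in rows:
        path, taxon_id, rank = row.rstrip("\n").split("\t")[:3]
        # Split on ";" and ignore the empty string left by the trailing ";".
        parts = tuple(part for part in path.split(";") if part)
        if parts:
            parsed.append((parts, taxon_id, rank))

    # First pass: index every full path by its taxon ID.
    path_to_id = {parts: taxon_id for parts, taxon_id, _ in parsed}

    # Second pass: the name is the last element, and the parent is whatever
    # taxon owns the path made of all but the last element.
    return [
        Taxon(taxon_id, parts[-1], rank, path_to_id.get(parts[:-1]))
        for parts, taxon_id, rank in parsed
    ]


# Illustrative rows only; the IDs are invented for this example.
SAMPLE_ROWS = [
    "Bacteria;\t1\tdomain",
    "Bacteria;Actinomycetota;\t2\tphylum",
]
taxa = parse_taxonomy(SAMPLE_ROWS)
```

Indexing all paths before resolving parents means the result does not depend on parents appearing before their children in the file.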

I welcome any feedback or suggestions on URL handling, inclusion of the version introduced field, or any other aspect of the implementation.

Please let me know if further details are needed.

codecov · 2025-02-12T06:38:48Z

Codecov Report

Attention: Patch coverage is 35.48387% with 40 lines in your changes missing coverage. Please review.

Please upload report for BASE (main@d75bbc6). Learn more about missing BASE report.

Files with missing lines	Patch %	Lines
src/pyobo/sources/silva.py	34.42%	40 Missing ⚠️

Additional details and impacted files

@@           Coverage Diff           @@
##             main     #348   +/-   ##
=======================================
  Coverage        ?   51.96%           
=======================================
  Files           ?      187           
  Lines           ?    12161           
  Branches        ?     1857           
=======================================
  Hits            ?     6319           
  Misses          ?     5603           
  Partials        ?      239

cthoyt · 2025-02-13T11:59:08Z

src/pyobo/sources/silva.py

+                reference=Reference(prefix="ena.embl", identifier=accession, name=organism)
+            )
+            # Do NOT annotate the new term with a rank (leave it unranked).
+            new_term.append_parent(Reference(prefix=PREFIX, identifier=species_taxon_id))


Don't ENA terms represent nucleotide sequences derived from experiments? Can they also represent projects?

From what I understand, they aren't actually themselves representing taxa. Therefore this parent/child relationship doesn't make sense.

The hard work of making a PyOBO source is really understanding what relationship SILVA means when it mentions its internal taxonomy and ENA sequences. I can't do this hard work for you in detail, but from a high level it seems like the sequence was derived from an individual of the taxon.

Then, there are two options:

1. Find an existing RO relationship that is appropriate for this. Maybe http://purl.obolibrary.org/obo/RO_0001001, even though it's not a perfect ontological fit. Maybe OBI is a better place to look.

2. Mint an ad-hoc one yourself within the scope of this file, e.g., like in

pyobo/src/pyobo/sources/clinicaltrials.py

Line 23 in ada760b

HAS_INTERVENTION = TypeDef(

If you go the second route, make sure that you do a good job describing what the relationship means (in a concise way)

jplfaria · 2025-02-13

Thank you for your detailed feedback. I completely understand where the hard work lies, and I truly appreciate the guidance you provided. Your suggestions (either reusing an existing RO relationship like RO_0001001, or minting an ad-hoc one as in clinicaltrials.py) are exactly the direction I was hoping for.

I’ll explore those options further. Alternatively, I might start by representing only down to the genus level (as shown in the taxonomy files) until I fully understand the nuances of the lower levels.

Thanks again for steering this work in the right direction!

cmungall · 2025-03-06

I think it's good to surface decisions like this higher up, and ideally all of pyobo would be biolink compliant. biolink:has_biological_sequence is the right KG relationship to use.

If you are going to use RO you need to use it consistently with how it's intended and not just pick a label that sounds right.

cthoyt

I added an example ad-hoc typedef that you can fill in (or duplicate) for your purposes

cthoyt · 2025-02-16T11:12:47Z

src/pyobo/sources/silva.py

+logger.setLevel(logging.WARNING)
+
+TYPEDEF = TypeDef(
+    reference=default_reference(PREFIX, "fixme", name="fixme"),


jplfaria · 2025-02-20T04:46:51Z

Summary of Changes

New TypeDef Added:
Introduced HAS_TAXONOMIC_CLASSIFICATION to capture the relationship between an ENA accession (representing a genome sequence) and the taxonomic classification assigned by SILVA.
Rationale:
Instead of using a parent/child relationship—which implies a fixed hierarchical level—this new typedef reflects that SILVA can classify sequences to varying levels (often down to genus, but sometimes only to higher ranks). This approach better represents the flexible nature of the taxonomic assignments provided by SILVA.

Implementation Details:
The new typedef is defined as follows:

HAS_TAXONOMIC_CLASSIFICATION = TypeDef(
    reference=default_reference(PREFIX, "has_taxonomic_classification", name="has taxonomic classification"),
    definition="Indicates that the genome sequence represented by an ENA accession is classified under this taxon by SILVA.",
    is_metadata_tag=True,
)
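
To illustrate what replacing the parent link with a typed relation means for the output, here is a small plain-Python sketch that renders an OBO stanza both ways. It does not use PyOBO, the function name `ena_stanza` is hypothetical, and the exact line PyOBO emits for this typedef (especially with `is_metadata_tag=True`) may differ; the `relationship:` form and the bare relation identifier are assumptions for illustration.

```python
def ena_stanza(accession: str, organism: str, taxon_id: str, *, as_parent: bool) -> str:
    """Render an OBO stanza for an ENA record linked to a SILVA taxon."""
    lines = [
        "[Term]",
        f"id: ena.embl:{accession}",
        f"name: {organism}",
    ]
    if as_parent:
        # Original approach: the ENA record is modeled as a child taxon.
        lines.append(f"is_a: silva.taxon:{taxon_id}")
    else:
        # Revised approach: a typed link that makes no claim about hierarchy.
        lines.append(
            f"relationship: has_taxonomic_classification silva.taxon:{taxon_id}"
        )
    return "\n".join(lines)


# Accession, label, and taxon ID are taken from the OWL snippet in the PR description.
before = ena_stanza("KP324679", "Fanellia korema", "47493", as_parent=True)
after = ena_stanza("KP324679", "Fanellia korema", "47493", as_parent=False)
```

Only the last line of the stanza changes, but it moves the ENA record out of the taxonomic hierarchy, which is the point raised in the review.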

jplfaria · 2025-03-03T19:04:51Z

@cthoyt I apologize. I just noticed I did something wrong; I see this warning:

"Merging is blocked 1 review requesting changes by reviewers with write access.
You're not authorized to push to this branch."

I am trying to figure out how to push to the proper branch.

cthoyt · 2025-03-04T10:32:59Z

@jplfaria are you unable to push to your own branch? The pull request dialog is to inform you that you can't merge the pull request, since it's my repository. That's not an issue for you to worry about

jplfaria · 2025-03-05T17:11:18Z

@cthoyt thank you for the clarification. I thought that was the case, but my inexperience with git made me think I was doing something wrong.

cthoyt · 2025-03-05T17:55:50Z

@jplfaria see FIXME comments in recent commits I made

cthoyt · 2025-03-05T18:21:28Z

src/pyobo/sources/silva.py

+RELATION_NEEDS_NEW_NAME = TypeDef(
+    reference=default_reference(PREFIX, "has_related_sequence", name="has related sequence"),
+    # FIXME!
+    definition="This relation represents a connection between a species and ENA records that are "


We still need to understand what it means for a SILVA taxon to be mapped to an ENA record for a sequence. I am starting to think that this is as simple as "the sequence was derived from a sample taken from an individual of this species", but please do a deep dive to clarify further.

jplfaria · 2025-03-20

The fixes look good to me.

sierra-moxon · 2025-03-06T18:36:09Z

Would strongly consider reusing a relation from an existing ontology or standard (or working with an existing standard like RO to get a new term added) rather than minting a new one. I can help get a new term added for you, or you can use (as @cmungall states above) a term from Biolink, in particular biolink:has_biological_sequence (https://biolink.github.io/biolink-model/has_biological_sequence/)

cthoyt · 2025-03-06T19:26:18Z

thanks @sierra-moxon ! ~~This is just what we need.~~ I wish it were easier to find what I needed in Biolink. I've sent biolink/biolink-model#1555 as a plea for helping me and other people who are potential users

On second thought, I am not sure this is what we need, because the description is so vague

jplfaria · 2025-03-20T16:16:43Z

@cthoyt, I apologize for the delay in responding. I have been buried in deadlines, and things only look set to clear up in mid-April. At that point, I can do a deep dive into this, as I haven't looked into Biolink at all.

cthoyt · 2025-03-20T16:23:51Z

@jplfaria please don't spend your time looking into biolink at the moment. There's no documentation for the suggested predicates for our use case, meaning that reusing them is not a good course of action. The best way forward is for us to mint our own predicates, but do a 100% job giving detailed context as to what the predicates are for

cmungall · 2025-03-20T18:12:25Z

In fact @jplfaria will likely be looking at Biolink anyway since the database he is building will be strongly aligned with it!

adding module for silva taxonomy

d267ea4

cthoyt added 3 commits February 13, 2025 12:45

Run ruff

205fad3

Update silva.py

b99644a

Merge branch 'main' into pr/348

9ff0766

cthoyt reviewed Feb 13, 2025

View reviewed changes

cthoyt changed the title ~~adding module for silva taxonomy~~ New source: SILVA taxonomy Feb 13, 2025

cthoyt added the Nomenclature Data label Feb 13, 2025

Add typedef example

a159f4a

cthoyt requested changes Feb 16, 2025

View reviewed changes

cthoyt added 2 commits February 16, 2025 10:22

Update silva.py

09a07e9

Update __init__.py

68ca239

cthoyt reviewed Feb 16, 2025

View reviewed changes

Replace parent relationship with HAS_TAXONOMIC_CLASSIFICATION typedef

b4fa6e3

cthoyt added 6 commits March 5, 2025 18:44

Update silva.py

8967240

Merge branch 'main' into pr/348

7926155

ID spaces get automatically constructed now

f4b2619

Fix nomenclature mistake

20acf22

The organism is associated with the sequence/project identified by this accession, but it's not the name of the project

Add fixmes

f46273a

Update silva.py

1fffd60

cthoyt added 3 commits March 5, 2025 19:04

Update silva.py

25e44d6

Remove duplicates

ae99167

Minor cleanup

151d147

cthoyt reviewed Mar 5, 2025

View reviewed changes

Update silva.py

5381ec3
