bug in get_proteins_by_id affecting pfam annotator

First of all, thanks for developing DeepBGC and making it available to the community.

I came across a bug in `HmmscanPfamRecordAnnotator` when it generates the `proteins_by_id` dictionary. The `util` function `get_proteins_by_id` currently loops through all the potential protein IDs of a feature (e.g. `unique_protein_id`, `protein_id` and `locus_tag`). As a result, a feature whose ID is based on the `protein_id` qualifier can be overwritten by another feature that shares the same `protein_id` but was deduplicated using `unique_protein_id`. This causes `PFAM_domain` features to be placed incorrectly in the genomic sequence: the `protein_id` used in the `hmmscan` output file matches a different feature, so the wrong feature location is picked up.
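To illustrate the collision, here is a minimal sketch in plain Python. It is not the DeepBGC code: the `Feature` class, both indexing functions, and the example IDs are hypothetical stand-ins, and qualifiers are simplified to plain strings. The first function registers every ID of every feature, as described above; the second shows one possible way to avoid the overwrite by registering only the first available ID.

```python
from dataclasses import dataclass, field


@dataclass
class Feature:
    # Hypothetical, simplified stand-in for a CDS feature (not a Biopython SeqFeature).
    start: int
    qualifiers: dict = field(default_factory=dict)


# Order of preference for the ID qualifiers, as listed in this issue.
ID_QUALIFIERS = ("unique_protein_id", "protein_id", "locus_tag")


def index_by_all_ids(features):
    """Register each feature under every ID qualifier it has (collision-prone)."""
    index = {}
    for feature in features:
        for key in ID_QUALIFIERS:
            if key in feature.qualifiers:
                # A later feature silently replaces an earlier one with the same ID.
                index[feature.qualifiers[key]] = feature
    return index


def index_by_primary_id(features):
    """Register each feature only under its first available ID qualifier."""
    index = {}
    for feature in features:
        for key in ID_QUALIFIERS:
            if key in feature.qualifiers:
                index[feature.qualifiers[key]] = feature
                break  # stop after the highest-priority ID
    return index


# Two features share a protein_id; the second was deduplicated with unique_protein_id.
first = Feature(start=100, qualifiers={"protein_id": "WP_1"})
second = Feature(
    start=5000,
    qualifiers={"unique_protein_id": "WP_1_2", "protein_id": "WP_1"},
)

buggy = index_by_all_ids([first, second])
fixed = index_by_primary_id([first, second])

print(buggy["WP_1"].start)  # 5000: the lookup lands on the wrong feature
print(fixed["WP_1"].start)  # 100: the lookup lands on the intended feature
```

With the first index, a `hmmscan` hit reported for `WP_1` would be mapped to the location of the second feature, which matches the misplaced `PFAM_domain` features described above.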


bug in get_proteins_by_id affecting pfam annotator #78
