Releases: InscriptaLabs/BioCantor
0.19.0
0.10.0
[0.10.0] 2021-09-17
Changed
- `CDSInterval` object now has methods to access the number of codons and codon locations in both chunk-relative and chromosome coordinates. Chromosome accessors will always return the full original CDS.
- If duplicate sequence identifiers are found when parsing GenBank/FASTA files, an exception is raised.
- The `scan_codon_locations` methods on `CDSInterval` now use one of two algorithms: a simpler algorithm for canonical transcripts (no programmed frameshifts and no offset frames), and the original, more robust algorithm otherwise.
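As a rough illustration of the simpler case (a canonical CDS with no frameshifts and no offset frame), codon locations can be derived by walking the spliced coding bases three at a time. This is a plain-Python sketch, not BioCantor's implementation: the function name `codon_blocks` and the tuple representation of CDS blocks are hypothetical, and only the plus strand is handled.

```python
from typing import Iterator, List, Tuple

Block = Tuple[int, int]  # half-open [start, end) in chromosome coordinates


def codon_blocks(cds_blocks: List[Block]) -> Iterator[List[Block]]:
    """Yield each complete codon as a list of chromosome blocks (plus strand only).

    A codon that spans a splice junction is yielded as more than one block.
    Trailing bases that do not form a full codon are dropped.
    """
    # Flatten the CDS into one chromosome position per coding base.
    positions = [pos for start, end in cds_blocks for pos in range(start, end)]
    usable = len(positions) - len(positions) % 3
    for i in range(0, usable, 3):
        blocks: List[Block] = []
        for pos in positions[i:i + 3]:
            if blocks and blocks[-1][1] == pos:
                # Extend the current block when positions are adjacent.
                blocks[-1] = (blocks[-1][0], pos + 1)
            else:
                blocks.append((pos, pos + 1))
        yield blocks
```

Flattening every coding base is wasteful for long transcripts; it is used here only to keep the sketch short.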
0.9.0
0.8.0
[0.8.0]
Fixed
- Do not trust feature type annotations to define coding vs. non-coding when parsing GenBank files; only rely on the presence/absence of CDS intervals associated with the transcript.
- Setup requirements, build tests, and sphinx config updated to allow building of documentation without installing the package.
- GenBank parser was not handling `exon` features as direct descendants of `gene` correctly.
Changed
- `Sorted` parser now sorts features by position, then gene/mRNA/CDS/other. This helps deal with GenBank files that are oddly ordered.
- Introduced new `Hybrid` GenBank parser mode that does both `LocusTag` and `Sorted` parsing at the same time.
- The `.chromosome_location` accessor of all `Interval` objects always returns the full-length `Location`, even if the `Interval` itself is chunk-relative such that the underlying `Location` object cannot represent the full length. As a result, the `.chromosome_location` of a chunk-relative location cannot have associated sequence information.
Added
- `TranscriptInterval` and `FeatureInterval` now have accessor methods to get `Location` objects for their introns/gaps and full span.
0.7.0
[0.7.0]
Changed
- GenBank position-sorted parser can now handle CDS records that are not directly following a gene record.
- Refactor `Location`, `Parent` and `Sequence` to have base classes `AbstractLocation`, `AbstractParent` and `AbstractSequence` that are in the base of the `inscripta.biocantor.location` module. This greatly helps with resolving circular imports.
- Optimized checking `sequence` and `location` members to explicitly check for `None`. This avoids a call to `__len__`.
- `CompoundInterval._single_intervals` is now lazily evaluated, because it is expensive to generate many `SingleInterval` objects.
- `CompoundInterval` now stores the positions as two sorted integer lists.
- `CompoundInterval` constructor accepts tuples in addition to lists of integer values to avoid list construction overhead.
- `CompoundInterval.is_overlapping` and `CompoundInterval.is_contiguous` are lazily evaluated.
- `CompoundInterval._combine_blocks` now always removes empty blocks. The new implementation also avoids producing a new interval if the result is identical to the start.
- `unique_value_or_none` was pulled out of `Parent` into its own separate function with an associated cache. This function was optimized to use sets.
- Added `__slots__` to all child classes of `AbstractLocation`, `AbstractSequence` and `AbstractParent`.
- Removed unnecessary call to `strip_location_info()` in `Sequence` constructor.
- Removed all unnecessary instances of constructing lists, replacing them with iterators and tuples.
- GenBank export now defaults to not updating `/translation` tag in order to save execution time. The original behavior can be restored by setting `update_translations=True`.
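Several of the optimizations in this release (`__slots__`, lazily evaluated properties, explicit `None` checks, sorted start/end storage) are general Python techniques. The toy class below sketches how they fit together. It is not BioCantor code: the class name `Blocks` and its attributes are hypothetical, and it stores tuples rather than lists.

```python
from typing import Optional, Sequence, Tuple


class Blocks:
    """Toy interval container: two sorted tuples plus a lazily computed flag."""

    # __slots__ removes the per-instance __dict__, saving memory when many objects exist.
    __slots__ = ("starts", "ends", "_is_overlapping")

    def __init__(self, starts: Sequence[int], ends: Sequence[int]):
        if len(starts) != len(ends):
            raise ValueError("starts and ends must have the same length")
        # Accept lists or tuples; sort the blocks together by (start, end).
        pairs = sorted(zip(starts, ends))
        self.starts: Tuple[int, ...] = tuple(s for s, _ in pairs)
        self.ends: Tuple[int, ...] = tuple(e for _, e in pairs)
        self._is_overlapping: Optional[bool] = None  # not computed yet

    @property
    def is_overlapping(self) -> bool:
        # Compare against None explicitly so a cached False is not recomputed.
        if self._is_overlapping is None:
            self._is_overlapping = any(
                next_start < end
                for end, next_start in zip(self.ends, self.starts[1:])
            )
        return self._is_overlapping
```

Note that `functools.cached_property` cannot be combined with `__slots__` without also declaring a `__dict__` slot, which is why the cache is managed by hand here.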
Fixed
- GenBank parser was not properly handling 0bp intervals, which are sometimes used to represent insertions.
- GenBank parser was not capturing CDS qualifiers when parsing eukaryotic-style GenBank files that have mRNA-level features.
0.6.0
[0.6.0]
Changed
- Added `raise_on_reserved_attributes` flag to GFF3 export that controls whether reserved attributes lead to warnings or exceptions.
- Added more top-level imports to simplify imports
- Try more common identifiers when parsing gene symbols from GFF3 files
- Attempt to infer frame from GFF3 files with null Phase columns on CDS records
- Update Tox tests to have a separate formatting case
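For context on frame inference: the GFF3 phase of a CDS record is the number of bases to skip at the start of that record to reach the next codon boundary, so it can be derived from the cumulative coding length of the preceding records. The sketch below shows that arithmetic under the assumption that the first CDS block starts in-frame. The function `infer_phases` is hypothetical and is not BioCantor's parser logic.

```python
from typing import List, Tuple


def infer_phases(cds_blocks: List[Tuple[int, int]]) -> List[int]:
    """Return a GFF3 phase (0, 1 or 2) for each CDS block, given in 5'-to-3' order.

    Blocks are half-open (start, end) pairs. Assumes the first block starts in-frame.
    """
    phases = []
    consumed = 0  # coding bases seen before the current block
    for start, end in cds_blocks:
        # Phase = bases to skip at the start of this block to reach the next codon.
        phases.append((3 - consumed % 3) % 3)
        consumed += end - start
    return phases
```

For example, a first block of 4 bases leaves one base of an unfinished codon, so the second block needs 2 bases to complete it and gets phase 2.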
0.5.0
[0.5.0]
Changed
- Added ability to parse non-transcribed features from GenBank records without a parent /gene record in the position-sorted parser.
- Added ability to export `SeqRecord` annotations when writing to GenBank.
- Added methods to `FeatureInterval` that mirror `TranscriptInterval`.
- Added support for translating with non-standard codon tables.
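To illustrate what translating with a non-standard codon table means, the sketch below builds the standard genetic code (NCBI translation table 1) and a variant that reads `TGA` as tryptophan, as in NCBI table 4. It is a standalone example: `translate`, `STANDARD_TABLE` and `TABLE_4` are hypothetical names and do not reflect BioCantor's API.

```python
from itertools import product
from typing import Dict, Optional

BASES = "TCAG"
# Amino acids of NCBI translation table 1 (standard code), codons in TCAG order.
STANDARD_AA = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
STANDARD_TABLE: Dict[str, str] = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), STANDARD_AA)
}
# NCBI table 4 differs from table 1 in amino acid assignment only at TGA (Trp, not stop).
TABLE_4: Dict[str, str] = {**STANDARD_TABLE, "TGA": "W"}


def translate(dna: str, table: Optional[Dict[str, str]] = None) -> str:
    """Translate a DNA string codon by codon; unknown codons become 'X'.

    Trailing bases that do not form a full codon are ignored.
    """
    table = STANDARD_TABLE if table is None else table
    dna = dna.upper()
    usable = len(dna) - len(dna) % 3
    return "".join(table.get(dna[i:i + 3], "X") for i in range(0, usable, 3))
```

With the standard table, `translate("ATGTGATAA")` reads `TGA` as a stop codon, while passing `TABLE_4` reads it as `W`.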
0.4.5
[0.4.5]
Changed
- Remove contributor license agreement, which is superseded by the MIT license.
[0.4.4]
Fixed
- Handle GenBank files with broken intervals gracefully.
- Fix interval parsing for negative strand features.
Changed
- The tag `Name` can now be used to identify a feature interval in a GFF3/GenBank file.
[0.4.3]
Fixed
- `AnnotationCollection._subset_parent()` now uses `seq_chunk_to_parent` and pulls out the chromosome ID from the chromosome record.
- `CDSInterval.from_dict()` now passes along the parent provided.
Added
- `strict_parent_compare` parameter for binary set theory operations.
- `AnnotationCollection.query_by_position()` has a new boolean flag `expand_location_to_children` that defaults to False, but if set to True will expand the interval to contain the transcripts. When False, it may be the case that transcripts will have their underlying location objects sliced down from their original coordinates. The original coordinates are still retained as integer members. If the query position is entirely intronic for an isoform, this isoform will have an `EmptyLocation` chunk-relative location, but will still retain a `chromosome_location`.
Changed
- Added a parent-level sequence identifier to the output of `biocantor.io.parser.seq_chunk_to_parent()`.
- Added a `strand` argument to `biocantor.io.parser.seq_chunk_to_parent()` that allows for the sequence chunk to be strand-referenced.
- `Location.parent_to_relative_location` and `Location.location_relative_to` now have an `optimize_blocks` flag that defaults to True. If this flag is False, then these operations will not collapse adjacent or overlapping blocks.
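The effect of collapsing adjacent or overlapping blocks can be sketched with plain tuples. The helper below is hypothetical (`normalize_blocks` is not a BioCantor function) and only borrows the flag name `optimize_blocks` to show the two behaviors side by side.

```python
from typing import List, Tuple

Block = Tuple[int, int]  # half-open [start, end)


def normalize_blocks(blocks: List[Block], optimize_blocks: bool = True) -> List[Block]:
    """Sort blocks; when optimize_blocks is True, merge adjacent or overlapping ones."""
    ordered = sorted(blocks)
    if not optimize_blocks:
        return ordered
    merged: List[Block] = []
    for start, end in ordered:
        if merged and start <= merged[-1][1]:
            # Touching or overlapping the previous block: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged
```

Leaving blocks uncollapsed matters when the block boundaries themselves carry meaning, for example when overlapping blocks must be preserved.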
[0.4.2]
Added
- `Location.union_preserve_overlaps()` function added. This function produces the union of intervals, while preserving all overlaps.
[0.4.1]
Added
- Improved docstrings on interval objects.
- Location objects now have a `full_span` optional flag on all `intersection`, `overlaps` and `contains` functions. This flag has compound intervals be treated as their full span, i.e. from start to end, regardless of compound structure. This flag defaults to `False` in all cases. When two `CompoundInterval` are compared, they are both always compared in their full spans when this flag is `True`.
- `Interval` and `IntervalCollection` objects now are capable of being lifted to arbitrary coordinate systems, returning a new copy. These operations rely on first lifting to a shared chromosomal coordinate system.
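The difference a full-span comparison makes is easiest to see with a small example. The function below is a simplified stand-in written for illustration, assuming non-empty lists of half-open blocks. It is not the `Location.overlaps` implementation, and only the flag name `full_span` comes from the notes above.

```python
from typing import List, Tuple

Block = Tuple[int, int]  # half-open [start, end)


def overlaps(a: List[Block], b: List[Block], full_span: bool = False) -> bool:
    """Return True if any block of a overlaps any block of b.

    With full_span=True, each input is first replaced by the single block
    running from its smallest start to its largest end.
    """
    if full_span:
        a = [(min(s for s, _ in a), max(e for _, e in a))]
        b = [(min(s for s, _ in b), max(e for _, e in b))]
    return any(s1 < e2 and s2 < e1 for s1, e1 in a for s2, e2 in b)
```

A feature that sits entirely inside an intron of a compound interval does not overlap it block-wise, but does overlap its full span.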
Changed
- New `SequenceType` enum stores whether interval sequences are `chromosome` or `chunk_relative`.
- All objects that accept `SequenceType` information accept either the `SequenceType` enum or raw strings.
- `AnnotationCollection` will look at the provided `parent_or_seq_chunk_parent` to see if the bounds of the object can be inferred from the parent object. This is only performed if no `start`/`end` are explicitly provided. If neither is provided, the bounds of the collection are the bounds of its children.
- Refactored `CDSInterval` to be based on `AbstractFeatureInterval`. Moved `CDSPhase` and `CDSFrame` to accommodate the circular import this introduced.
- All `Interval` objects are allowed to have chromosome parents without sequence information.
- Removed versioneer in favor of hard coded versions.
Fixed
- Some functions on Interval objects were not operating in chromosome coordinates
- `AnnotationCollection.query_by_position()` was not returning valid results if the parent was a sequence chunk.
- GFF3 parser was not inferring transcripts for a gene feature with no children.
- Fixed a bug with missing gene biotypes in GFF3 parsing.
[0.4.0]
Added
- All Interval objects now have the ability to be built from subsets of genome sequence (called `sequence_chunk`).
- Querying `AnnotationCollection` objects by coordinates produces new objects with sliced sequences with chunk-relative coordinates.
- Interval objects built from sequence subsets can be exported in chunk-relative coordinates to GFF3/GenBank.
- Interval objects have new coordinate translation methods that operate in chunk-relative space. Coordinate methods that operate in genomic coordinate space were retained.
- Non-transcribed feature identifier parsing looks in the `note` special field for identifiers.
Changed
- All Interval objects now must be built directly from coordinates, and do not accept Location objects.
- All Interval objects now hide their Location member. This is to avoid confusion about what coordinate system the Location may be on.
- All interval collections have `__iter__` functions that call `__iter_children()` functions.
- All Interval objects have their core `._location` object hidden, and offer two accessors: `.chromosome_location` and `.chunk_relative_location`. Note that `.chromosome_location` will not have sequence information attached to it if a sequence chunk was used. Generally, it is advised to not access `.location` objects directly.
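The relationship between chromosome and chunk-relative coordinates comes down to clipping and an offset. The two helpers below sketch that arithmetic for a single half-open interval on the plus strand. They are hypothetical functions written for illustration and are not part of BioCantor.

```python
from typing import Optional, Tuple


def chromosome_to_chunk(
    start: int, end: int, chunk_start: int, chunk_end: int
) -> Optional[Tuple[int, int]]:
    """Clip [start, end) to the chunk and shift it into chunk-relative coordinates.

    Returns None when the interval does not overlap the chunk at all.
    """
    clipped_start = max(start, chunk_start)
    clipped_end = min(end, chunk_end)
    if clipped_start >= clipped_end:
        return None
    return clipped_start - chunk_start, clipped_end - chunk_start


def chunk_to_chromosome(start: int, end: int, chunk_start: int) -> Tuple[int, int]:
    """Shift a chunk-relative interval back into chromosome coordinates."""
    return start + chunk_start, end + chunk_start
```

Clipping loses information: a feature at 1000-1200 seen through a chunk starting at 1100 round-trips to 1100-1200, which is one reason to keep the original chromosome coordinates alongside the chunk-relative location.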
[0.3.1]
Fixed
- Feature interval identifier regex should exactly match qualifier keys
Added
- Unified API for identifiers on all interval objects with new property methods `.id` and `.name`.
[0.3.0]
Fixed
- `Biotype` enum improperly mapped `protein_coding` and `protein-coding` to different values. Added `mRNA` as another synonym for this type.
- GFF3, BED and GenBank export from Interval objects now raise an exception when the sequence name field is null.
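One way to make several spellings resolve to the same enum member in Python is to give them the same value, which turns the later names into aliases. The sketch below uses the functional `Enum` API so that a name containing a hyphen can be used. The numeric values and the extra `lncRNA` member are made up for illustration, and this is not necessarily how BioCantor defines its `Biotype` enum.

```python
from enum import Enum

# Names sharing a value become aliases of the first name with that value.
Biotype = Enum(
    "Biotype",
    [
        ("protein_coding", 1),
        ("protein-coding", 1),  # alias of protein_coding
        ("mRNA", 1),            # alias of protein_coding
        ("lncRNA", 2),          # illustrative second member
    ],
)

# Lookup by name works for every spelling and returns the same member.
coding = Biotype["protein-coding"]
```

Aliases are skipped when iterating over the enum, so `list(Biotype)` contains only the two canonical members.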
Added
- Parse `FeatureInterval` and `FeatureIntervalCollection` from GFF3 or GenBank, and write back as well.
Changed
- `FeatureInterval` now has multiple types, stored as sets. `FeatureIntervalCollection` stores the union of these types, in addition to optionally having its own type.
[0.2.0] 2021-01-06
Fixed
- `CompoundInterval.relative_interval_to_parent_location()` in the case of overlapping blocks. Had previously been double counting overlap region.
- `CompoundInterval.gap_list()` in the case of overlapping blocks. Had been raising an error in that case.
Added
- `CDSInterval.scan_codon_locations()` method. Returns an iterator over codon locations.
- Implement `__hash__()` for `CompoundInterval` and `CDSInterval`
- Implemented data structures for `TranscriptInterval` and `FeatureInterval` that model transcribed and non-transcribed genomic features.
- Implemented data structures for `GeneInterval` and `FeatureIntervalCollection`, that model groups of intervals as genes or generic feature groups.
- Implemented a wrapper data structure `AnnotationCollection` that contains groups of genes and non-transcribed feature groups.
- Implemented the ability to build BioCantor gene models from GFF3 and GenBank files.
- Implemented the ability to export the above data structures as GFF3, GenBank, BED and NCBI TBL formats.
- Implemented Marshmallow dataclasses that allow for serialization and deserialization of the above data structures.
- Copied the bins implementation from gffutils to avoid needing the full dependency set in a minimal install.
- Added a Biotype enumeration that tracks known biotypes.
- Added caching of sequence retrieval to `Interval` objects.
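For readers unfamiliar with interval bins: the idea, popularized by the UCSC Genome Browser, is to assign each interval to the smallest fixed-size bin that fully contains it, so that range queries only need to inspect a handful of bins. The sketch below implements the classic five-level UCSC scheme (128 kb smallest bins, covering up to 512 Mb). The constants used by gffutils and by BioCantor's copy may differ, and the names here are hypothetical.

```python
from typing import Tuple

# Offset of the first bin at each level, smallest bins (128 kb) first.
BIN_OFFSETS: Tuple[int, ...] = (512 + 64 + 8 + 1, 64 + 8 + 1, 8 + 1, 1, 0)
BIN_FIRST_SHIFT = 17  # 2**17 = 128 kb, the size of the smallest bin
BIN_NEXT_SHIFT = 3    # each level up is 8 times larger


def bin_from_range(start: int, end: int) -> int:
    """Return the smallest bin that fully contains the half-open interval [start, end)."""
    start_bin = start >> BIN_FIRST_SHIFT
    end_bin = (end - 1) >> BIN_FIRST_SHIFT
    for offset in BIN_OFFSETS:
        if start_bin == end_bin:
            return offset + start_bin
        # Move up one level: bins become 8 times larger.
        start_bin >>= BIN_NEXT_SHIFT
        end_bin >>= BIN_NEXT_SHIFT
    raise ValueError(f"interval {start}-{end} is out of range for this binning scheme")
```

An interval inside the first 128 kb lands in bin 585, while one that straddles the first 128 kb boundary moves up a level to bin 73.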
Changed
- Migrated sphinx documentation from `automodapi` to `autoapi`.
- Performance upgrades to interval arithmetic operations.
Removed
- `CDSInterval.intersect()` method. Frame math was incorrect for complex CDSs and was deemed too difficult to implement correctly.
v0.1.1
Merged in hotfix/0.1.1 (pull request #1)
Hotfix/0.1.1
Approved-by: Michael DePalatis
Approved-by: Joshua Shorenstein
Approved-by: Ian Fiddes
v0.1.0
Initial release