Prototype for storing single-cell data #1020

Draft
wants to merge 479 commits into
base: development

arteymix
Member

@arteymix arteymix commented Feb 5, 2024

TODO

  • add a CLI for loading/reloading single cell data and cell type assignments
  • extend the GEO loader to detect MEX and other supported formats and apply the appropriate strategy for loading vectors
  • Add support for HDF5-based single-cell data formats #1039
  • support saving single cell data to disk (I already added a SingleCellExpresionDataMatrix, we need to finish the work and write it to file). I think MEX is a pretty decent output format for this.

REST API

  • review which fields should be exposed on the REST API for filtering purposes
  • add aliases in the REST API to refer to the preferred single cell dimension and cell assignment

@arteymix arteymix force-pushed the feature-single-cell branch 6 times, most recently from 9222a95 to 788a61b Compare February 7, 2024 20:13
@arteymix arteymix force-pushed the feature-single-cell branch 2 times, most recently from f04ae2f to d791771 Compare February 7, 2024 20:23
@arteymix arteymix force-pushed the feature-single-cell branch from b60a8de to 0ce142a Compare February 8, 2024 00:23
@arteymix arteymix force-pushed the feature-single-cell branch from 80c6409 to e804d92 Compare February 13, 2024 03:40
@arteymix arteymix force-pushed the feature-single-cell branch 4 times, most recently from b2c8a8b to 6a993b7 Compare February 19, 2024 20:47
@arteymix arteymix added the single cell Issues related to single-cell data support label Feb 20, 2024
@arteymix arteymix self-assigned this Feb 21, 2024
@arteymix arteymix force-pushed the feature-single-cell branch from b7d4810 to 7c29995 Compare February 21, 2024 23:26
@arteymix arteymix linked an issue Feb 25, 2024 that may be closed by this pull request
3 tasks
@arteymix
Member Author

I'm in the process of merging the dev branch to get this work up-to-date.

@@ -130,6 +130,13 @@
<!-- cannot be non-null because subsets and generic experiments don't have curation details -->
<column name="CURATION_DETAILS_FK" not-null="false" sql-type="BIGINT" unique="true"/>
</many-to-one>
<set name="singleCellExpressionDataVectors" lazy="true" fetch="select" inverse="true"
cascade="all-delete-orphan">
Member Author

We should drop the -delete-orphan part of the cascade and manage these vectors the same way we do for raw and processed ones.

Member Author

That includes bulk insertion, removal, etc.

Add basic support in SingleCellDescriptive and DataVectorDescriptive for
floats, ints and longs.

Add missing conversion logic for scale types.

Reuse those to implement aggregation of floats, ints and longs. No
matter the input type, we always aggregate into doubles, so we don't
have to support those types in raw or processed vectors.
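As a rough illustration of the "always aggregate into doubles" idea above, here is a minimal sketch. The class `AggregationSketch` and its `sumByGroup` methods are hypothetical names invented for this example and are not taken from the pull request; the sketch only assumes that each value belongs to a group identified by an index.

```java
import java.util.Arrays;

/** Hypothetical helper for illustration; not a class from this pull request. */
final class AggregationSketch {

    /** Sums int values per group; the accumulator is a double whatever the input type. */
    static double[] sumByGroup(int[] values, int[] groupOf, int numGroups) {
        double[] sums = new double[numGroups];
        for (int i = 0; i < values.length; i++) {
            sums[groupOf[i]] += values[i];
        }
        return sums;
    }

    /** Same aggregation for float values, widened to double on accumulation. */
    static double[] sumByGroup(float[] values, int[] groupOf, int numGroups) {
        double[] sums = new double[numGroups];
        for (int i = 0; i < values.length; i++) {
            sums[groupOf[i]] += values[i];
        }
        return sums;
    }

    /** Same aggregation for long values; longs beyond 2^53 may be rounded when widened. */
    static double[] sumByGroup(long[] values, int[] groupOf, int numGroups) {
        double[] sums = new double[numGroups];
        for (int i = 0; i < values.length; i++) {
            sums[groupOf[i]] += values[i];
        }
        return sums;
    }

    public static void main(String[] args) {
        int[] groupOf = {0, 1, 0, 1};
        System.out.println(Arrays.toString(sumByGroup(new int[]{1, 2, 3, 4}, groupOf, 2)));
        System.out.println(Arrays.toString(sumByGroup(new float[]{0.5f, 1.5f, 2f, 1f}, groupOf, 2)));
        System.out.println(Arrays.toString(sumByGroup(new long[]{10L, 20L, 30L, 40L}, groupOf, 2)));
    }
}
```

Because every overload returns `double[]`, downstream code that consumes aggregated vectors only has to deal with one representation.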

Support writing MEX and tabular format from those vectors.

Support loading integer data from MEX and all the supported types from
AnnData.

Add an option to prefer single-precision when loading data vectors,
which might imply losing some precision.
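The precision trade-off mentioned above can be seen with a short sketch. `PrecisionSketch`, `toSinglePrecision` and `losesPrecision` are hypothetical names used only for this example; the actual option in the pull request may work differently.

```java
/** Hypothetical illustration of narrowing doubles to floats; not code from this pull request. */
final class PrecisionSketch {

    /** Narrows each value to single precision, rounding to the nearest representable float. */
    static float[] toSinglePrecision(double[] values) {
        float[] result = new float[values.length];
        for (int i = 0; i < values.length; i++) {
            result[i] = (float) values[i];
        }
        return result;
    }

    /** True if the value does not survive a round trip through float (NaN is treated as lossless). */
    static boolean losesPrecision(double value) {
        return !Double.isNaN(value) && (double) (float) value != value;
    }

    public static void main(String[] args) {
        // 2^24 + 1 is the smallest positive integer that a float cannot represent exactly.
        System.out.println(losesPrecision(16777217.0));
        // 0.5 is a power of two and is exact in both precisions.
        System.out.println(losesPrecision(0.5));
    }
}
```

Small integer counts, which are typical for single-cell data, fit in a float without loss up to 2^24; larger counts or arbitrary decimal values may be rounded.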

Add an option to use double precision for MEX. The default is
single-precision now.
We only support one matrix format for single cell data, so vectors that
are not stored in double require conversion.
@arteymix arteymix force-pushed the feature-single-cell branch from 5840b71 to fc6cbe5 Compare February 26, 2025 21:05
…and data

The COUNT_FAST aggregation method does not even need the data to be
populated, so we can nearly double the throughput by omitting it.
Data in GEO is always retrieved as string arrays of known size, so
replace all the List<Object> with String[].

Move the logic for parsing arrays of strings to QuantitationTypeConversionUtils.
Those values should be handled gracefully and without producing a
warning.

Move conversion logic back in GeoConverterImpl since this is meant to be
tailored to data encountered in GEO.
Enforce dependency convergence now that it has been achieved.

Remove unused jboss-3jb3x dependency
…g elements()

A DAL can have more elements than actual values, so its length must
always be taken from size(), not elements().length.

This is only problematic if the array was created by calling add().
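The pitfall described above can be reproduced with a small stand-in class. This sketch assumes that "DAL" refers to a growable array list in the style of Colt's `DoubleArrayList`, whose `elements()` exposes the backing array; `GrowableDoubles` below is a hypothetical, simplified substitute written for this example so that it runs without external dependencies.

```java
import java.util.Arrays;

/** Hypothetical stand-in for a growable double list; capacity doubles when full. */
final class GrowableDoubles {
    private double[] elements = new double[4];
    private int size = 0;

    void add(double value) {
        if (size == elements.length) {
            elements = Arrays.copyOf(elements, elements.length * 2);
        }
        elements[size++] = value;
    }

    /** Number of values actually stored. */
    int size() {
        return size;
    }

    /** Backing array; may be longer than size() after calls to add(). */
    double[] elements() {
        return elements;
    }

    /** Copy trimmed to the values actually stored. */
    double[] toArray() {
        return Arrays.copyOf(elements, size);
    }
}

final class DalSizeSketch {
    public static void main(String[] args) {
        GrowableDoubles dal = new GrowableDoubles();
        for (int i = 1; i <= 5; i++) {
            dal.add(i);
        }
        // The backing array grew past the number of stored values.
        System.out.println("size() = " + dal.size());
        System.out.println("elements().length = " + dal.elements().length);
    }
}
```

A list built from an array of exactly the right length has a backing array that matches its size, which is consistent with the note that the problem only shows up after `add()`.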