Prototype for storing single-cell data #1020

arteymix · 2024-02-05T19:01:42Z

TODO

add a CLI for loading/reloading single cell data and cell type assignments
extend the GEO loader to detect MEX and other supported formats and apply the appropriate strategy for loading vectors
Add support for HDF5-based single-cell data formats #1039
support saving single-cell data to disk (I already added a SingleCellExpressionDataMatrix; we need to finish the work and write it to a file). I think MEX is a pretty decent output format for this.
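
For reference, the matrix part of a MEX export is a Matrix Market coordinate file: a header line, a `rows cols non-zeros` line, then one 1-based `row column value` triple per non-zero entry. Below is a minimal sketch of that layout. `MexMatrixSketch` and its `format` method are hypothetical names for illustration and are not part of Gemma; the gene-by-cell orientation follows the usual 10x convention and is an assumption here.

```java
/** Hypothetical helper (not Gemma code): renders a sparse matrix as Matrix Market coordinate text. */
final class MexMatrixSketch {

    /**
     * @param rows   number of rows (assumed to be genes)
     * @param cols   number of columns (assumed to be cells)
     * @param rowIdx 0-based row index of each non-zero entry
     * @param colIdx 0-based column index of each non-zero entry
     * @param values the non-zero values, parallel to rowIdx and colIdx
     */
    static String format(int rows, int cols, int[] rowIdx, int[] colIdx, double[] values) {
        if (rowIdx.length != values.length || colIdx.length != values.length) {
            throw new IllegalArgumentException("rowIdx, colIdx and values must have the same length");
        }
        StringBuilder sb = new StringBuilder();
        // Header: a general (non-symmetric) real-valued matrix in coordinate (sparse) form.
        sb.append("%%MatrixMarket matrix coordinate real general\n");
        // Size line: rows, columns, number of stored entries.
        sb.append(rows).append(' ').append(cols).append(' ').append(values.length).append('\n');
        for (int k = 0; k < values.length; k++) {
            // Matrix Market indices are 1-based.
            sb.append(rowIdx[k] + 1).append(' ')
              .append(colIdx[k] + 1).append(' ')
              .append(values[k]).append('\n');
        }
        return sb.toString();
    }
}
```

A complete MEX directory would also need the barcode and feature listings next to the matrix file; those are left out of this sketch.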

REST API

review which fields should be exposed on the REST API for filtering purposes
add aliases in the REST API to refer to the preferred single cell dimension and cell assignment

gemma-core/src/test/java/ubic/gemma/persistence/util/ListUtilsTest.java

...va/ubic/gemma/persistence/service/expression/experiment/ExpressionExperimentServiceImpl.java

.../main/java/ubic/gemma/persistence/service/expression/experiment/ExpressionExperimentDao.java

gemma-core/src/main/java/ubic/gemma/model/expression/bioAssayData/DataVector.java

...-core/src/main/resources/ubic/gemma/model/expression/designElement/CompositeSequence.hbm.xml

arteymix · 2024-06-11T23:36:41Z

I'm in the process of merging the dev branch to get this work up-to-date.

arteymix · 2024-06-12T21:02:38Z

gemma-core/src/main/resources/ubic/gemma/model/analysis/Investigation.hbm.xml

@@ -130,6 +130,13 @@
 					<!-- cannot be non-null because subsets and generic experiments don't have curation details -->
 					<column name="CURATION_DETAILS_FK" not-null="false" sql-type="BIGINT" unique="true"/>
 				</many-to-one>
+				<set name="singleCellExpressionDataVectors" lazy="true" fetch="select" inverse="true"
+					 cascade="all-delete-orphan">


We should remove the -delete-orphan and manage vectors the same way we do for raw and processed ones.

That includes bulk insertion, removal, etc.

Add basic support in SingleCellDescriptive and DataVectorDescriptive for floats, ints and longs. Add missing conversion logic for scale types. Reuse those to implement aggregation of floats, ints and longs. No matter the input type, we always aggregate into doubles, so we don't have to support those types in raw or processed vectors. Support writing MEX and tabular format from those vectors. Support loading integer data from MEX and all the supported types from AnnData. Add an option to prefer single-precision when loading data vectors, which might imply losing some precision. Add an option to use double precision for MEX. The default is single-precision now.
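
To illustrate the "always aggregate into doubles" rule, here is a minimal sketch with one overload per input type and a double accumulator. `VectorAggregationSketch` is a hypothetical name and this is not the actual SingleCellDescriptive or DataVectorDescriptive code; it only shows the shape of the idea.

```java
/** Hypothetical illustration (not Gemma code): every input type is summed into a double. */
final class VectorAggregationSketch {

    static double sum(float[] values) {
        double total = 0.0;
        for (float v : values) {
            total += v; // float widens to double without loss
        }
        return total;
    }

    static double sum(int[] values) {
        double total = 0.0;
        for (int v : values) {
            total += v; // every int is exactly representable as a double
        }
        return total;
    }

    static double sum(long[] values) {
        double total = 0.0;
        for (long v : values) {
            total += v; // longs beyond 2^53 in magnitude may be rounded here
        }
        return total;
    }
}
```

Because the result type is always double, downstream raw and processed vectors never see floats, ints or longs.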

We only support one matrix format for single cell data, so vectors that are not stored as doubles require conversion.
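
A rough sketch of what such a conversion could look like, assuming the vector payload is a big-endian byte array of fixed-width values (the byte layout is an assumption, and `ToDoubleSketch` is a hypothetical name, not a Gemma class):

```java
import java.nio.ByteBuffer;
import java.nio.FloatBuffer;
import java.nio.IntBuffer;

/** Hypothetical illustration (not Gemma code): decode non-double payloads into double[]. */
final class ToDoubleSketch {

    /** Interprets the bytes as 4-byte floats (ByteBuffer defaults to big-endian). */
    static double[] floatsToDoubles(byte[] data) {
        FloatBuffer buffer = ByteBuffer.wrap(data).asFloatBuffer();
        double[] result = new double[buffer.remaining()];
        for (int i = 0; i < result.length; i++) {
            result[i] = buffer.get(i); // widening float -> double
        }
        return result;
    }

    /** Interprets the bytes as 4-byte signed integers. */
    static double[] intsToDoubles(byte[] data) {
        IntBuffer buffer = ByteBuffer.wrap(data).asIntBuffer();
        double[] result = new double[buffer.remaining()];
        for (int i = 0; i < result.length; i++) {
            result[i] = buffer.get(i); // widening int -> double
        }
        return result;
    }
}
```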

…#1332)

…and data

The COUNT_FAST aggregation method does not even need the data to be populated, so we can nearly double the throughput by omitting it.

Data in GEO is always retrieved as string arrays of known size, so replace all the List<Object> with String[]. Move the logic for parsing arrays of strings to QuantitationTypeConversionUtils.
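
As a sketch of the String[]-based parsing described above: the helper below turns a fixed-size array of strings into a double[]. `StringArrayParsingSketch` is a hypothetical name, not the QuantitationTypeConversionUtils API, and the set of tokens treated as missing (null, blank, "NA", "null") is an assumption made for illustration only.

```java
/** Hypothetical illustration (not Gemma code): parse a fixed-size String[] into doubles. */
final class StringArrayParsingSketch {

    static double[] parseDoubles(String[] raw) {
        double[] result = new double[raw.length];
        for (int i = 0; i < raw.length; i++) {
            String token = raw[i] == null ? "" : raw[i].trim();
            if (token.isEmpty() || token.equalsIgnoreCase("NA") || token.equalsIgnoreCase("null")) {
                // Assumed missing-value markers map to NaN, silently.
                result[i] = Double.NaN;
            } else {
                // Anything else must be a valid number; otherwise NumberFormatException is thrown.
                result[i] = Double.parseDouble(token);
            }
        }
        return result;
    }
}
```

Working on String[] rather than List<Object> means the output array can be sized up front and no casts are needed per element.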

Those values should be handled gracefully and without producing a warning. Move the conversion logic back into GeoConverterImpl since it is meant to be tailored to data encountered in GEO.

Enforce dependency convergence now that it has been achieved. Remove the unused jboss-ejb3x dependency.

…g elements()

A DAL can have more elements than actual values, so its length must always be taken from size(), not elements().length. This is only problematic if the array was created by calling add().
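
The pitfall can be reproduced with any growable array list. The class below is a self-contained stand-in written for illustration; it is not the list class used in the code base, and its growth policy (doubling from a capacity of 4) is an assumption. It only mirrors the add()/size()/elements() shape discussed above.

```java
import java.util.Arrays;

/** Hypothetical stand-in for a growable double list (not the real library class). */
final class GrowableDoubleList {
    private double[] elements = new double[4]; // backing array, may be larger than size
    private int size = 0;

    void add(double value) {
        if (size == elements.length) {
            // Grow the backing array; the extra slots are zero-filled padding.
            elements = Arrays.copyOf(elements, elements.length * 2);
        }
        elements[size++] = value;
    }

    /** Number of values actually stored. */
    int size() {
        return size;
    }

    /** Exposes the backing array, including unused trailing slots. */
    double[] elements() {
        return elements;
    }

    /** Safe copy containing exactly size() values. */
    double[] toArray() {
        return Arrays.copyOf(elements, size);
    }
}
```

After five add() calls on this stand-in, size() is 5 while elements().length is 8, so any loop bounded by elements().length would read three padding zeros.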

…ompressedStringListType

…loop

arteymix force-pushed the feature-single-cell branch 6 times, most recently from 9222a95 to 788a61b on February 7, 2024 20:13