Convert Human Cell Atlas Tier 1 metadata, extracted from the anndata object of a published CELLxGENE dataset, into a spreadsheet that can be ingested under the HCA DCP metadata schema. The reverse conversion is done with https://github.com/ebi-ait/hca-dcp-to-tier1.
flowchart TD
%% Nodes: files (parallelograms), scripts (rectangles), outputs (hexagons)
%% Data sources
A[/CxG public database/]
B[/Tier 1 tracker - xlsx/]
C[/Tier 1 metadata - csv/]
D[/DCP spreadsheet/]
E[/Previously wrangled spreadsheet/]
F[/Tier 2 metadata - xlsx/]
G[/File manifest/]
H{{Compare report - json}}
%% Scripts
S1[collect_cellxgene_metadata.py]
S2[collect_spreadsheet_metadata.py]
S3[convert_to_dcp.py]
S4[compare_with_dcp.py]
S5[merge_tier2_metadata.py]
S6[merge_file_manifest.py]
%% Flows
A --> S1 --> C
B --> S2 --> C
C --> S3 --> D
F -->|direct conversion| S3
G -->|direct conversion| S3
E --> S4 --> H
D --> S4
F -->|merge to dcp| S5
G -->|merge to dcp| S6
D --> S5
D --> S6
%% Grouping
subgraph Collection
A
B
S1
S2
C
end
subgraph Conversion
S3
D
end
subgraph Comparison
E
S4
H
end
subgraph Merge
S5
S6
end
The process runs in the following steps. Users only call the wrapper or the individual scripts; the detailed steps are listed to explain what happens under the hood.
- Pull data from CxG `collect_cellxgene_metadata.py` or spreadsheet `collect_spreadsheet_metadata.py`
  - from CxG
    - Given a collection_id, select a dataset and download its h5ad
    - Pull the obs and uns layers into csv files in the `metadata` dir, with a `<collection_id>_<dataset_id>` or `<dataset_label>` prefix, in `_metadata.csv`, `_study_metadata.csv` and `_cell_obs.csv` filenames
    - Test if the DOI exists in ingest (ingest-token required)
  - from spreadsheet
    - Given a Tier 1 spreadsheet, pull the label from the filename
    - Flatten the Tier 1 metadata into a csv in the `metadata` dir with `<label>_metadata.csv` filename
- Convert to DCP spreadsheet `convert_to_dcp.py`
  - Given a spreadsheet path, pull the metadata and extract the filename label used
  - Using the mapping, convert to a flat DCP metadata file with DCP programmatic field names, based on `hca_template.xlsx`
  - Populate the DCP spreadsheet based on each field's programmatic name
  - If a Tier 2 spreadsheet and/or file manifest is given, produce the result via `dcp_flat`
  - Export into an xlsx file in the `metadata` dir with `<label>_dcp.csv` filename
- Compare previously wrangled spreadsheet vs Tier 1 `compare_with_dcp.py`
  - Open the converted and the previously wrangled DCP spreadsheets
  - Compare the number of tabs and continue with their intersection
  - On each common tab
    - Compare the number of entities per tab
    - Compare IDs per tab and continue with their intersection
    - Compare values of entities with the same IDs (except protocols)
  - Export all comparisons to a report json file in the `report_compare` dir with `<label>_compare.json` filename
- Merge Tier 2 metadata into pre-filled DCP spreadsheet `merge_tier2_metadata.py`
  - Open the Tier 2 spreadsheet and the wrangled DCP spreadsheet
  - Flatten the Tier 2 spreadsheet into a single denormalised tab
  - Rename columns using the Tier 2 mapping
  - Merge the Tier 2 metadata into the corresponding tabs/entities of the DCP spreadsheet
  - Export into an xlsx file in the `metadata` dir as `<label>_tier2.xlsx`
- Merge File metadata into pre-filled DCP spreadsheet `merge_file_manifest.py`
  - Open the File metadata tab, the Tier 1 metadata and the wrangled DCP spreadsheet
  - Merge the File metadata tab into the wrangled spreadsheet Sequence tab (remove existing & use `FILE_MANIFEST_MAPPING`)
  - Add standard FASTQ fields `FASTQ_STANDARD_FIELDS`
  - Use Tier 1 metadata to assign sequence and library prep protocols, and other `TIER_1_MAPPING` fields
  - Export into an xlsx file in the `metadata` dir as `<label>_fastqed.xlsx`
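
The comparison step above can be sketched with plain dictionaries. This is a minimal illustration and not the code in `compare_with_dcp.py`: the function name `compare_spreadsheets`, the `{tab: [rows]}` input shape, the `id` key and the example tab names are all assumptions made for the example, and skipping value comparison for tabs with "protocol" in their name is only one possible reading of "except protocols".

```python
import json


def compare_spreadsheets(converted, wrangled, id_field="id"):
    """Compare two spreadsheets given as {tab_name: [row_dict, ...]}."""
    report = {
        "tabs_only_in_converted": sorted(set(converted) - set(wrangled)),
        "tabs_only_in_wrangled": sorted(set(wrangled) - set(converted)),
        "tabs": {},
    }
    # Only tabs present in both spreadsheets are compared further.
    for tab in sorted(set(converted) & set(wrangled)):
        conv_rows = {row[id_field]: row for row in converted[tab]}
        wran_rows = {row[id_field]: row for row in wrangled[tab]}
        value_diffs = {}
        # Assumption: protocol tabs are recognised by name and skipped here.
        if "protocol" not in tab.lower():
            for entity_id in sorted(set(conv_rows) & set(wran_rows)):
                conv, wran = conv_rows[entity_id], wran_rows[entity_id]
                diffs = {
                    field: {"converted": conv[field], "wrangled": wran[field]}
                    for field in sorted(set(conv) & set(wran))
                    if conv[field] != wran[field]
                }
                if diffs:
                    value_diffs[entity_id] = diffs
        report["tabs"][tab] = {
            "n_converted": len(converted[tab]),
            "n_wrangled": len(wrangled[tab]),
            "ids_only_in_converted": sorted(set(conv_rows) - set(wran_rows)),
            "ids_only_in_wrangled": sorted(set(wran_rows) - set(conv_rows)),
            "value_diffs": value_diffs,
        }
    return report


# Hypothetical example data; real spreadsheets would be read from xlsx files.
converted = {
    "Donor organism": [
        {"id": "donor_1", "sex": "female"},
        {"id": "donor_2", "sex": "male"},
    ],
    "Specimen": [{"id": "spec_1", "organ": "lung"}],
}
wrangled = {
    "Donor organism": [
        {"id": "donor_1", "sex": "female"},
        {"id": "donor_2", "sex": "unknown"},
        {"id": "donor_3", "sex": "male"},
    ],
    "Project": [{"id": "proj_1"}],
}

report = compare_spreadsheets(converted, wrangled)
print(json.dumps(report, indent=2))
```

Keeping the report as a nested dictionary means it can be written out with `json.dump` unchanged, which matches the json report described in the step.
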
Tested with Python 3.9. To run the scripts:
python3 -m pip install -r requirements.txt
python3 collect_cellxgene_metadata.py -c <collection_id> -t <ingest-token>
python3 collect_spreadsheet_metadata.py -t1 <tier1_spreadsheet>
python3 convert_to_dcp.py -ft <flat_tier1_spreadsheet> (-t2 <tier2_metadata>) (-fm <file_manifest>)
python3 compare_with_dcp.py -dt <dcp_tier1_spreadsheet> -w <wrangled_spreadsheet>
python3 merge_tier2_metadata.py -t2 <tier2_metadata> -dt <dt_spreadsheet>
python3 merge_file_manifest.py -fm <file_manifest> -dt <dt_spreadsheet> -t1 <tier1_spreadsheet>

Alternatively, you can use the hca-tier1-to-dcp.py script to run all scripts at once (collect, convert, compare, merge Tier 2, merge file manifest). It can also run over multiple collections, using a separate csv file that lists the IDs and the wrangled spreadsheet paths.
python3 hca-tier1-to-dcp.py -l test -t1 tier1.xlsx
or
python3 hca-tier1-to-dcp.py -l test -t1 tier1.xlsx -fm file_manifest.xlsx -t2 tier2.xlsx -w pre-wrangled.xlsx

- `--collection_id` or `-c`: Collection id (uuid) of the collection to download the file from
- `--dataset_id` or `-d`: Dataset id (uuid) of the file to download
- `--dataset-label` or `-l`: Label to use instead of collection/dataset ids
- `--output_dir` or `-o`: Directory for the output files
- `--ingest_token` or `-t`: Ingest token to query for existing projects with the same DOI
- `--tier1_spreadsheet` or `-t1`: Submitted Tier 1 spreadsheet file path
- `--flat_tier1_path` or `-ft`: Flattened Tier 1 spreadsheet path
- `--local_template` or `-lt`: Local path of the hca_template.xlsx
- `--dcp-tier1-spreadsheet` or `-dt`: DCP formatted Tier 1 spreadsheet path
- `--wrangled_spreadsheet` or `-w`: Previously wrangled project spreadsheet path
- `--unequal_comparisson` or `-u`: Automatically continue comparing even if biomaterials are not equal
- `--file_manifest` or `-fm`: File manifest path
- `--tier2_metadata` or `-t2`: Tier 2 spreadsheet file path
R: required, o: optional
| args | collect CxG | collect excel | convert | compare | merge T2 | merge file manifest |
|---|---|---|---|---|---|---|
| --collection_id, -c | R | | | | | |
| --dataset_id, -d | o | | | | | |
| --dataset-label, -l | o | | | | | |
| --output_dir, -o | o | o | o | o | o | |
| --ingest_token, -t | o | | | | | |
| --tier1_spreadsheet, -t1 | | R | | | | R |
| --flat_tier1_path, -ft | | | R | | | |
| --local_template, -lt | | | o | | | |
| --dcp-tier1-spreadsheet, -dt | | | | R | R | R |
| --wrangled_spreadsheet, -w | | | | R | | |
| --unequal_comparisson, -u | | | | o | | |
| --file_manifest, -fm | | | o | | | R |
| --tier2_metadata, -t2 | | | o | | R | |
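
For readers who want to see how the R/o distinction might translate into a command-line parser, here is a minimal `argparse` sketch for the compare step. It only reuses the flags shown in the usage line for `compare_with_dcp.py` plus `-u`; the function name `build_compare_parser`, the `dest` name, the help strings, treating `-u` as a boolean switch and the file names are assumptions, and the real script may define its parser differently.

```python
import argparse


def build_compare_parser():
    """Hypothetical parser mirroring the compare column: -dt and -w required, -u optional."""
    parser = argparse.ArgumentParser(
        description="Compare a converted DCP spreadsheet with a wrangled one"
    )
    parser.add_argument(
        "-dt", "--dcp-tier1-spreadsheet",
        dest="dt_spreadsheet",
        required=True,
        help="DCP formatted Tier 1 spreadsheet path",
    )
    parser.add_argument(
        "-w", "--wrangled_spreadsheet",
        required=True,
        help="Previously wrangled project spreadsheet path",
    )
    # Assumption: -u is a switch that defaults to False when omitted.
    parser.add_argument(
        "-u", "--unequal_comparisson",
        action="store_true",
        help="Continue comparing even if biomaterials are not equal",
    )
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch runs anywhere.
args = build_compare_parser().parse_args(
    ["-dt", "test_dcp.xlsx", "-w", "pre-wrangled.xlsx"]
)
print(args.dt_spreadsheet, args.wrangled_spreadsheet, args.unequal_comparisson)
```

With `required=True`, omitting `-dt` or `-w` makes `argparse` exit with a usage error, which is the behaviour the R marker describes.
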
When more Tier 2 values are added, be sure to update the mapping dictionary with the Tier 2 programmatic name and the corresponding DCP programmatic value.
Programmatic values should match those in the hca_full_template.xlsx file from the geo_to_hca repository, which serves as the reference for the HCA templates.
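
As a rough illustration of that maintenance step, the sketch below renames columns through a mapping dictionary and flags mapped values that are missing from a set of template field names. Everything named here is hypothetical: `TIER2_TO_DCP`, `TEMPLATE_FIELDS`, the field names and both helper functions are placeholders for the example, not the repository's actual mapping or the contents of hca_full_template.xlsx.

```python
# Hypothetical mapping: Tier 2 programmatic name -> DCP programmatic name.
TIER2_TO_DCP = {
    "tier2_field_a": "donor_organism.field_a",
    "tier2_field_b": "specimen_from_organism.field_b",
}

# Hypothetical set of programmatic names read from the reference template.
TEMPLATE_FIELDS = {"donor_organism.field_a"}


def rename_columns(rows, mapping):
    """Rename row keys via the mapping; report keys that have no mapping."""
    renamed, unmapped = [], set()
    for row in rows:
        new_row = {}
        for key, value in row.items():
            if key in mapping:
                new_row[mapping[key]] = value
            else:
                # Unmapped columns are kept as-is and reported to the caller.
                unmapped.add(key)
                new_row[key] = value
        renamed.append(new_row)
    return renamed, sorted(unmapped)


def unknown_dcp_fields(mapping, template_fields):
    """Return mapped DCP names that do not appear in the template."""
    return sorted(set(mapping.values()) - set(template_fields))


rows = [{"tier2_field_a": "x", "other_column": "y"}]
renamed, unmapped = rename_columns(rows, TIER2_TO_DCP)
print(renamed)
print(unmapped)
print(unknown_dcp_fields(TIER2_TO_DCP, TEMPLATE_FIELDS))
```

A check like `unknown_dcp_fields` could be run whenever the mapping is extended, so that a mistyped DCP programmatic name is caught before conversion.
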