
To do #6

Open
11 of 12 tasks
jackbibby1 opened this issue Feb 23, 2022 · 5 comments · Fixed by #31
Labels
enhancement New feature or request

Comments

@jackbibby1
Owner

jackbibby1 commented Feb 23, 2022

@jackbibby1 jackbibby1 added the documentation Improvements or additions to documentation label Jun 30, 2022
@jackbibby1 jackbibby1 added enhancement New feature or request and removed documentation Improvements or additions to documentation labels Jul 13, 2022
@jackbibby1 jackbibby1 reopened this Jan 30, 2023
@jgarces02

  • Perform all calculations in sparse matrices (or hdf5) to avoid the computer collapsing because of RAM requirements.

@jackbibby1
Owner Author

jackbibby1 commented Feb 8, 2024

  • Perform all calculations in sparse matrices (or hdf5) to avoid the computer collapsing because of RAM requirements.

@jgarces02 I'll look at doing this when I get a minute. Can you let me know what you're running that exhausts your system RAM, though? I assume this mainly happens when you're using the parallel = x argument? FYI -- if you're using a version of SCPA that's <1.3, upgrading should improve things a lot
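In the meantime, a hypothetical sketch of two workarounds for the RAM pressure, using only arguments that already appear in this thread (parallel and downsample; the object and pathway names are taken from the example below and are assumptions here):

```r
# Hypothetical sketch: if RAM is the bottleneck, drop parallelisation
# (each worker holds its own copy of the data) and lower the downsample
# size. Both arguments already exist in compare_pathways().
scpa_out <- compare_pathways(samples = list(aux_BL, aux_ES),
                             pathways = genesets_hallmark,
                             parallel = FALSE,
                             downsample = 500)
```

Running serially trades wall-clock time for a much smaller peak memory footprint, which may matter more than speed on a shared HPC node.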

@jgarces02

jgarces02 commented Feb 15, 2024

I'm using version 1.5.4, and yes, with parallel = T, cores = 5... in an HPC environment, so I wouldn't expect problems with RAM (but when monitoring the process with htop, all nodes are at almost 100%).

aux_BL <- seurat_extract(all_merge_azimuth, meta1 = "patient", value_meta1 = "1001", meta2 = "timepoint", value_meta2 = "BL")
aux_ES <- seurat_extract(all_merge_azimuth, meta1 = "patient", value_meta1 = "1001", meta2 = "timepoint", value_meta2 = "ES")

scpa_out <- compare_pathways(samples = list(aux_BL, aux_ES),
                             pathways = genesets_hallmark,
                             parallel = TRUE,
                             cores = 5,
                             downsample = 5000)

@tuhulab

tuhulab commented Apr 9, 2024

⚠️ Be aware of reproducibility issues when cell numbers are larger than 500 in each group.

I encountered a reproducibility issue when comparing group A vs. group B, each with around 5000 cells. I ran with the default parameters 10 times, which downsamples to 500 cells each time. The rank of each gene set is inconsistent across runs. See below:
[Screenshot 2024-04-10: gene set rank vs. run number across 10 runs]

Ideally, it should be flat.

library(dplyr)
library(purrr)
library(tidyr)
library(ggplot2)

# By default SCPA downsamples to 500 cells per group.
repeat_10_times <-
  lapply(1:10, function(i) {
    compare_pathways(samples = list(vacc_IVneg, ctrl_IVneg),
                     pathways = pathways,
                     parallel = TRUE,
                     cores = 12) %>%
      as_tibble()
  })

# Collect the per-run rank of each pathway.
rank <-
  repeat_10_times %>%
  map(~ .x %>% mutate(rank = 1:nrow(.x))) %>%
  reduce(bind_rows) %>%
  select(Pathway, rank) %>%
  group_by(Pathway) %>%
  nest() %>%
  mutate(data = map(data, ~ .x %>%
                      mutate(run = factor(as.character(1:nrow(.x)),
                                          levels = as.character(1:10))))) %>%
  unnest(cols = data)

# One line per pathway across the 10 runs; flat lines = reproducible ranks.
rank %>%
  ggplot(aes(x = run, y = rank, group = Pathway)) +
  geom_point() +
  geom_line(color = "grey", alpha = .1)

It could be solved by (1) permutation (computationally intensive but robust), (2) using the same random seed when downsampling to 500, or (3) simply including all cells, even though that is computationally intensive.
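Option (2) could be sketched as follows. This is a hypothetical sketch that assumes SCPA's internal downsampling draws from R's global RNG stream (not confirmed anywhere in this thread); if it does, fixing the seed before each call would make the 500-cell subsample, and hence the ranks, reproducible across runs:

```r
# Assumption: compare_pathways() downsamples via R's global RNG,
# so fixing the seed fixes which 500 cells are drawn each time.
set.seed(42)
scpa_seeded <- compare_pathways(samples = list(vacc_IVneg, ctrl_IVneg),
                                pathways = pathways)
```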

@jackbibby1
Owner Author

jackbibby1 commented Jul 25, 2024

Hi @tuhulab

Sorry for the late reply.

Appreciate the comments. This is a tough one because, ideally, if the clustering is granular enough, downsampling the cells shouldn't make a meaningful difference to the most differentially regulated pathways. And I think this is what you see: the further down the ranked list you go, the more variable the rank is. It may also be more insightful to show the qval or pval here rather than rank, to show how stable the p-values are.

Plus, I'm not sure what these populations are that you're comparing -- are these a particular cell type (e.g. naive T cells), or a coarser population containing many finer subsets (e.g. all T cells)?

Some rationale for why I'm hesitant to change things:

  1. Permutation: I don't think the large increase in computational time would be worth the gain in biological insight.
  2. Seed: I would hope the identified signatures are robust enough to remain prominent when comparing the same populations multiple times at a given downsample value. If that isn't the case, a seed implementation would actually add error, and I think the issue is then cluster granularity or cell contamination (e.g. doublets) rather than downsampling.
  3. There's already an option to change the downsample size in the function, so people can do this.

If people are very concerned about reproducibility, my first recommendation would be to run SCPA a few times and take the average qval output. Secondly, if you're getting wildly different results between SCPA runs, I'd argue that this is likely due to a lack of clustering resolution, or contaminating cells in your cluster driving the variable output.
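Building on the repeat_10_times list from the snippet further up, the averaging approach could be sketched like this (assuming the SCPA output keeps Pathway and qval columns, as the ranking code earlier implies):

```r
library(dplyr)
library(purrr)

# Sketch: average qval per pathway across repeated SCPA runs.
# repeat_10_times is the list of result tibbles from the earlier snippet.
avg_qvals <- repeat_10_times %>%
  reduce(bind_rows) %>%
  group_by(Pathway) %>%
  summarise(mean_qval = mean(qval), .groups = "drop") %>%
  arrange(desc(mean_qval))
```

Sorting by the averaged qval gives a single stable ranking in place of ten noisy ones.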

Happy to hear more thoughts on this

Jack
