
To do #6

Open
11 of 12 tasks
jackbibby1 opened this issue Feb 23, 2022 · 5 comments · Fixed by #31
Labels
enhancement New feature or request

Comments

@jackbibby1
Owner

jackbibby1 commented Feb 23, 2022

@jackbibby1 jackbibby1 added the documentation Improvements or additions to documentation label Jun 30, 2022
@jackbibby1 jackbibby1 added enhancement New feature or request and removed documentation Improvements or additions to documentation labels Jul 13, 2022
@jackbibby1 jackbibby1 reopened this Jan 30, 2023
@jgarces02

  • Perform all calculations in sparse matrices (or hdf5) to avoid the computer collapsing because of RAM requirements.

@jackbibby1
Owner Author

jackbibby1 commented Feb 8, 2024

  • Perform all calculations in sparse matrices (or hdf5) to avoid the computer collapsing because of RAM requirements.

@jgarces02 I'll look at doing this when I get a minute. Can you let me know what you're running that exhausts your system RAM, though? I assume this mainly happens when you're using the parallel = x argument? FYI -- if you're using a version of SCPA that's <1.3, upgrading should improve things a lot
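In the meantime, a hypothetical sketch of two workarounds for the RAM pressure, using only arguments that already appear in this thread (parallel and downsample; the object and pathway names are taken from the example below and are assumptions here):

```r
# Hypothetical sketch: if RAM is the bottleneck, drop parallelisation
# (each worker holds its own copy of the data) and lower the downsample
# size. Both arguments already exist in compare_pathways().
scpa_out <- compare_pathways(samples = list(aux_BL, aux_ES),
                             pathways = genesets_hallmark,
                             parallel = FALSE,
                             downsample = 500)
```

Running serially trades wall-clock time for a much smaller peak memory footprint, which may matter more than speed on a shared HPC node.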

@jgarces02

jgarces02 commented Feb 15, 2024

I'm using version 1.5.4, and yes, with parallel = T, cores = 5... in an HPC environment, so I wouldn't expect problems with RAM (but when monitoring the process with htop, all nodes are at almost 100%).

aux_BL <- seurat_extract(all_merge_azimuth, meta1 = "patient", value_meta1 = "1001", meta2 = "timepoint", value_meta2 = "BL")
aux_ES <- seurat_extract(all_merge_azimuth, meta1 = "patient", value_meta1 = "1001", meta2 = "timepoint", value_meta2 = "ES")

scpa_out <- compare_pathways(samples = list(aux_BL, aux_ES),
                             pathways = genesets_hallmark,
                             parallel = TRUE,
                             cores = 5,
                             downsample = 5000)

@tuhulab

tuhulab commented Apr 9, 2024

⚠️ Be aware of reproducibility issues when cell numbers are larger than 500 in each group.

I encountered a reproducibility issue when comparing group A vs. group B, each with around 5000 cells. I ran with the default parameters 10 times, which downsamples to 500 cells each time. The rank of each gene set is inconsistent across runs. See below:
[Screenshot 2024-04-10: gene set rank vs. run number across 10 runs]

Ideally, it should be flat.

library(dplyr)
library(purrr)
library(tidyr)
library(ggplot2)

# By default SCPA downsamples to 500 cells per group.
repeat_10_times <-
  lapply(1:10, function(i) {
    compare_pathways(samples = list(vacc_IVneg, ctrl_IVneg),
                     pathways = pathways,
                     parallel = TRUE,
                     cores = 12) %>%
      as_tibble()
  })

# Collect the per-run rank of each pathway.
rank <-
  repeat_10_times %>%
  map(~ .x %>% mutate(rank = 1:nrow(.x))) %>%
  reduce(bind_rows) %>%
  select(Pathway, rank) %>%
  group_by(Pathway) %>%
  nest() %>%
  mutate(data = map(data, ~ .x %>%
                      mutate(run = factor(as.character(1:nrow(.x)),
                                          levels = as.character(1:10))))) %>%
  unnest(cols = data)

# One line per pathway across the 10 runs; flat lines = reproducible ranks.
rank %>%
  ggplot(aes(x = run, y = rank, group = Pathway)) +
  geom_point() +
  geom_line(color = "grey", alpha = .1)

It could be solved by (1) permutation (computationally intensive but robust), (2) using the same random seed when downsampling to 500, or (3) simply including all cells, even though that is computationally intensive.
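Option (2) could be sketched as follows. This is a hypothetical sketch that assumes SCPA's internal downsampling draws from R's global RNG stream (not confirmed anywhere in this thread); if it does, fixing the seed before each call would make the 500-cell subsample, and hence the ranks, reproducible across runs:

```r
# Assumption: compare_pathways() downsamples via R's global RNG,
# so fixing the seed fixes which 500 cells are drawn each time.
set.seed(42)
scpa_seeded <- compare_pathways(samples = list(vacc_IVneg, ctrl_IVneg),
                                pathways = pathways)
```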

@jackbibby1
Owner Author

jackbibby1 commented Jul 25, 2024

Hi @tuhulab

Sorry for the late reply.

Appreciate the comments. This is a tough one because, ideally, if the clustering is granular enough, downsampling the cells shouldn't make a meaningful difference to the most differentially regulated pathways. And I think this is what you see: the further down the ranked list you go, the more variable the rank is. It may also be more insightful to show the qval or pval here rather than rank, to show how stable the p-values are.

Plus, I'm not sure what these populations are that you're comparing -- are these a particular cell type (e.g. naive T cells), or a coarser population containing many finer subsets (e.g. all T cells)?

Some rationale for why I'm hesitant to change things:

  1. Permutation: I don't think the large increase in computational time would be worth the gain in biological insight.
  2. Seed: I would hope the identified signatures are robust enough to remain prominent when comparing the same populations multiple times at a given downsample value. If that isn't the case, a seed implementation would actually add error, and I think the issue is then cluster granularity or cell contamination (e.g. doublets) rather than downsampling.
  3. There's already an option to change the downsample size in the function, so people can do this.

If people are very concerned about reproducibility, my first recommendation would be to run SCPA a few times and take the average qval output. Secondly, if you're getting wildly different results between SCPA runs, I'd argue that this is likely due to a lack of clustering resolution, or contaminating cells in your cluster driving the variable output.
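Building on the repeat_10_times list from the snippet further up, the averaging approach could be sketched like this (assuming the SCPA output keeps Pathway and qval columns, as the ranking code earlier implies):

```r
library(dplyr)
library(purrr)

# Sketch: average qval per pathway across repeated SCPA runs.
# repeat_10_times is the list of result tibbles from the earlier snippet.
avg_qvals <- repeat_10_times %>%
  reduce(bind_rows) %>%
  group_by(Pathway) %>%
  summarise(mean_qval = mean(qval), .groups = "drop") %>%
  arrange(desc(mean_qval))
```

Sorting by the averaged qval gives a single stable ranking in place of ten noisy ones.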

Happy to hear more thoughts on this

Jack
