Outline of work needed for sweeps analysis #104

nspope · 2023-05-17T21:45:22Z

Meeting with @mufernando and @andrewkern to outline what needs doing for sweeps analysis

What we want to produce:

Multi-panel plot showing FPR and TPR as a function of coordinate, with rugs for exon density and recombination rate (like the ProbGen poster), split by population (e.g. small vs large) and sweep calling method (sweepfinder, diploshic, D)
One null model (BGS) in main figure; in supp figure could show comparison between BGS and completely neutral null model
Scatterplots showing FPR/TPR as a function of recombination rate/exon density, with some boxplot/quantile lines to show how the distribution of the test statistic changes with recombination rate
Scatterplots showing joint distribution of FPR/TPR (e.g. pair plots) for supplement
Use global critical value for simplicity of explanation
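A minimal sketch of how per-coordinate FPR/TPR could be computed against a single global critical value. It assumes the test statistics are already arranged as (replicate x position) arrays; the function name, the array layout, and the `alpha` level are all hypothetical and not taken from the existing workflow.

```python
import numpy as np

def rates_by_position(null_stats, sweep_stats, alpha=0.05):
    """Per-position FPR and TPR using one global cutoff.

    null_stats, sweep_stats: arrays of shape (replicates, positions)
    holding the test statistic (assumed layout, larger = more sweep-like).
    """
    null_stats = np.asarray(null_stats, dtype=float)
    sweep_stats = np.asarray(sweep_stats, dtype=float)
    # Global critical value: upper-alpha quantile of the null statistic,
    # pooled over all positions and all null replicates.
    crit = np.quantile(null_stats, 1 - alpha)
    # Fraction of replicates exceeding the cutoff at each position.
    fpr = (null_stats > crit).mean(axis=0)
    tpr = (sweep_stats > crit).mean(axis=0)
    return crit, fpr, tpr
```

With a pooled cutoff the FPR is only controlled on average, so it can drift above or below `alpha` at individual positions, which is the variation the coordinate and recombination-rate plots are meant to show.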

What has to be done:

1. diploshic training PR is reviewed, needs some minor cleanup to be merged [Andy] -- done
2. diploshic prediction workflow needs to be put together
   a. dump VCF per simulated window (the 5 Mb focal region, without simulated buffer) [Murillo] -- done
   b. apply diploshic, sliding across focal regions -- this'll output a score per window for soft-linked/hard-linked/neutral/soft/hard classification [Andy]
   c. pool soft+hard scores to get a binary "sweep vs not" score [Murillo/Nate]
   d. take max score across entire focal window to get test statistic for the window [Murillo/Nate]
   e. get critical value by calculating score for neutral/BGS simulations (as for CLR) [Murillo/Nate]
   f. keep training and prediction in separate workflows (e.g. the prediction step should go in the same workflow where CLR is calculated) [Murillo/Nate]
3. write rule to generate figures based off Murillo's probgen draft [Murillo]
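Steps c through e could look roughly like the sketch below. It assumes the per-subwindow class probabilities have already been read into an array; the column order in `CLASSES`, the function names, and the `alpha` level are hypothetical, and the actual diploshic output layout would need to be checked before using anything like this.

```python
import numpy as np

# Hypothetical column order for the per-subwindow class probabilities.
CLASSES = ["neutral", "linkedSoft", "linkedHard", "soft", "hard"]

def sweep_statistic(probs):
    """Test statistic for one focal region.

    probs: array of shape (subwindows, 5), one row of class
    probabilities per sliding subwindow (assumed layout).
    """
    probs = np.asarray(probs, dtype=float)
    # Step c: pool soft + hard into a single "sweep vs not" score.
    pooled = probs[:, CLASSES.index("soft")] + probs[:, CLASSES.index("hard")]
    # Step d: take the max pooled score across the focal region.
    return pooled.max()

def critical_value(null_stats, alpha=0.05):
    """Step e: cutoff from statistics computed on neutral/BGS simulations."""
    return np.quantile(np.asarray(null_stats, dtype=float), 1 - alpha)
```

A focal region would then be called a sweep when `sweep_statistic(probs)` exceeds `critical_value(null_stats)`, mirroring how the CLR threshold is obtained.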
