-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
To do list for the v1.1b panel #56
Comments
TRGT merge: 0aa95bb1-29e6-491b-a06f-530ee922b6bc |
Thanks, @hangsuUNC, can we perhaps populate some more details here? For example, 1) pointers to common inputs (joint SNP callset, integrated SV callsets, their HPRC chr1 subsets, etc.), 2) a table of the experiments we are running giving relevant submission IDs and pointers to generated results, and 3) eventually populate that table with precision/recall metrics and the corresponding submission IDs for evaluation runs (I can help with this part)? Basically, for each experiment (e.g., HPRC 1.1 w/ TRGT and HiPhase 1.3.0), can you give me the inputs I need to run ChromosomePhasedPanelCreationFromHiPhase (https://app.terra.bio/#workspaces/allofus-drc-wgs-lr-prod/AoU_DRC_WGS_LongReads_PacBio%20PAPER%20COPY/job_history/6751528b-ac58-4e25-87c0-bae436efb83f for that experiment), followed by ConcatAndEvaluate (https://app.terra.bio/#workspaces/allofus-drc-wgs-lr-prod/AoU_DRC_WGS_LongReads_PacBio%20PAPER%20COPY/job_history/97d231ad-c40d-4efb-b2eb-06ee763de8c8)? For example, see the Job History comments for those runs, which refer to the submission IDs that generated the joint SNP and integrated 1.1 HPRC chr1 callsets. It would be nice if the Job History comments were also self-contained, but there's only so much you can fit in there---better to keep full track of things here, and include actual pointers to inputs in buckets, etc. Finally, after checking off bullet points above, it might also be nice to put a quick blurb here about any findings, including any pointers if they might be helpful. |
The output files are here:
All the information is listed in the Table: HPRC-extra-callers-hg38_set in the AoU Paper copy workspace. |
We may also want to consider the effect of the catalog, see e.g. PacificBiosciences/HiPhase#58 (although at this stage, it seems like the decision would be to exclude TRGT if we feel our catalog exhibits similar behavior, not try to rerun TRGT with a different catalog). |
For some reason, a VCF appears to be unsorted in the HPRC 1.0 w/o TRGT, HiPhase 1.4.0 run by the time it gets to the PanGenie panel creation script. We haven't run into this before, but it might be due to a newly inserted normalization step (which was introduced to resolve some intermittent issues with adjacent unnormalized variants that KAGE was bugging on); perhaps we also need to sort after. For now, let's not worry about that run. Copying over some Slack discussion: Interestingly, it looks like HPRC 1.1 w/ TRGT and HiPhase 1.3.0 compares unfavorably against HPRC 1.1 w/o TRGT and HiPhase 1.3.0. See e.g. SV recall after HiPhase in TR regions of 38% vs. 75%, at comparable precisions ~80%. So it's possible something is going wrong right off the bat when including TRGT---perhaps bcftools merge is to blame somewhere? Interestingly, after Shapeit4 imputation, recall is ~75% for both.
EDIT: Turns out the runs in the table above above were accidentally done with 1.3.0, not 1.4.0; we've updated the text there but not here. We will do proceed to do runs with 1.4.5 instead. |
The TRGT merged joint callset before Hiphase is here: gs://fc-secure-8e5a6fd7-16ae-4796-80ed-8f0463af5ff1/submissions/0aa95bb1-29e6-491b-a06f-530ee922b6bc/TRGTMerge/47e34134-97e7-4b48-8c63-aad2ff057ab7/call-TRGTMerge/MergeTRGTbeforeHiphase.trgtmerged.vcf.gz Do we still want to explore TRGT merge before hiphase as callset integration, split, hiphase then bcftools merge? |
@hangsuUNC let me know where your thinking is at on how to proceed. I think you and @SHuang-Broad can make the call. IMO some quick HPRC testing of HiPhase 1.4.5 + 1.1 seems light enough that we should do it before proceeding to run AoU + HPRC, but whether we want to ramp down on TRGT or do a bit more digging while we're here (and maybe while I proceed with Shapeit4, etc.) is up to you both. In any case, let's try to continue to record findings here---thanks again! |
HPRC testing of Hiphase 1.4.5 + v1.1 is here: |
Thanks, @hangsuUNC, ran the downstream evaluations and added them to the top of the table above. Looks like there's marginal difference between 1.1 + 1.4.5 w/o TRGT and 1.1 + 1.3.0 w/o TRGT. So if we are OK to drop TRGT, I would be OK with moving forward. Unfortunately, 1.0 + 1.3.0 w/o TRGT failed---which integrated SV callset did you use here, kanpig or Sniffles? Regardless, we have some old numbers that are in the ballpark of those for the 1.1 runs, so I'm still OK with moving forward. That said, again, I think the decision about how to proceed with TRGT lies with you and @SHuang-Broad. I might suggest conferring with @fabio-cunial since I think he is looking at it for the Hapestry preprint. But given that TRGT can be directly consumed by HiPhase, I would suggest that for the purposes of the AoU workflows, we think of incorporating TRGT as part of the physical-phasing problem rather than as part of the integration problem. |
Note job_history links above are currently broken because of https://broadinstitute.slack.com/archives/C07RK8KNWMV/p1739888818003019 |
Note that I finally added subsetting to confident regions to our Vcfdist implementation (the Vcfdist command line only allows one BED file, which we allocated to account for context). See the top row in the table above. This appears to boost precision a bit for non-TR/HP, and somewhat more substantially for TR/HP. Will go back and regenerate some figures for the paper accordingly. |
After clearing up some confusion (i.e., turns out the actual TRGT calls in the w/ TRGT run never made it to the downstream pipeline---I incorrectly assumed they were incorporated in the merged HiPhase SV outputs---but we see they affect the accuracy of the integrated SVs in TR regions anyway, presumably by negatively affecting their phasing, see also https://github.com/PacificBiosciences/HiPhase/blob/main/docs/user_guide.md#why-are-some-smallstructural-variants-unphased-when-i-added-tandem-repeats), here's the plan going forward (copied from Slack): Hmm, let's be really clear about what needs to get run next:
All of the above on a single sample vs. dipcall in confident regions and perhaps in/out TRs. Then, with the completed 1.3.0 + 1.1 + short + SV + TRGT run:
Then, with the just completed 1.4.5 + 1.1 + short + TRGT run:
Also recall that these two HiPhase runs were done w/ Matt's catalog, so depending on what these experiments show we may need to repeats some with Ben's catalog. Please feel free to check and/or edit to link submissions. Thanks @hangsuUNC! |
@hangsuUNC A step using a GATK tool in the ChromosomePhasedPanelCreationFromHiPhase run on 1.4.5 + 1.1 + short + TRGT failed due to:
Maybe a bad allele in the merged TRGT file? I've got to run, but let me know if you get a chance to dig! |
Yes, I believe there are a bunch of bad alleles in the TRGT file, for example, just found the raw call evaluation wdl failed because there is a "M" in the reference allele in the vcf file... Maybe we should do another round of filtering of TRGT calls first? Please let me know what do you think @samuelklee |
Yes, I’d also see if the other catalog has this issue. Thanks! |
Yes, have checked with the other catelog, seems like a common issue, they replace reference Ns with some other letters. |
1. Can we run the same HPRC 1.1 w/o TRGT as a baseline? Less priority would be HPRC 1.0 w/o TRGT (if we don't already have that run somewhere, but I'd do a fresh run anyway).
2. Can one run HiPhase with short+TRGT only? Is there some distinction between how HiPhase handles the SV and TRGT inputs?
3. What is the overlap between the TRGT catalog and 1.1? What AFs does the TRGT catalog cover?
4. If the overlap is substantial, can we do some quick Vcfdist evals of the appropriate HiPhase SV outputs, stratified by TR/non-TR regions to get an idea of whether 1.1 or TRGT is more reliable in each region type? For example, if HPRC 1.1 w/o TRGT metrics in TR regions look not so great, maybe we should just prefer the TRGT genotypes in those regions. One might imagine that if we end up preferring TRGT genotypes in TR regions and end up with disjoint sets of 1.1 non-TR and TRGT TR, HiPhase and bcftools concat will have an easier time.
5. Have we made any progress looking at TRGT merge (on pure TRGT outputs, not HiPhase outputs)? Is this just a trivial operation on squared genotypes?
6. I think we are using HiPhase 1.3.0. Have we tried more recent versions? In particular, it seems like 1.4.0 might introduce some noteworthy changes.
7. See where the raw per-sample TRGT calls lie on Fabio's latest ROC plots. This will of course just be a point rather than a curve for each sample.
The text was updated successfully, but these errors were encountered: