To do list for the v1.1b panel #56

Open
3 of 7 tasks
hangsuUNC opened this issue Jan 24, 2025 · 17 comments

Comments

@hangsuUNC
Collaborator

hangsuUNC commented Jan 24, 2025

  • 1. Can we run the same HPRC 1.1 w/o TRGT as a baseline? Lower priority would be HPRC 1.0 w/o TRGT (we may already have that run somewhere, but I'd do a fresh run anyway).

  • 2. Can one run HiPhase with short+TRGT only? Is there some distinction between how HiPhase handles the SV and TRGT inputs?

  • 3. What is the overlap between the TRGT catalog and 1.1? What AFs does the TRGT catalog cover?

  • 4. If the overlap is substantial, can we do some quick Vcfdist evals of the appropriate HiPhase SV outputs, stratified by TR/non-TR regions to get an idea of whether 1.1 or TRGT is more reliable in each region type? For example, if HPRC 1.1 w/o TRGT metrics in TR regions look not so great, maybe we should just prefer the TRGT genotypes in those regions. One might imagine that if we end up preferring TRGT genotypes in TR regions and end up with disjoint sets of 1.1 non-TR and TRGT TR, HiPhase and bcftools concat will have an easier time.

  • 5. Have we made any progress looking at TRGT merge (on pure TRGT outputs, not HiPhase outputs)? Is this just a trivial operation on squared genotypes?

  • 6. I think we are using HiPhase 1.3.0. Have we tried more recent versions? In particular, it seems like 1.4.0 might introduce some noteworthy changes.

  • 7. See where the raw per-sample TRGT calls lie on Fabio's latest ROC plots. This will of course just be a point rather than a curve for each sample.
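
For item 3, a minimal sketch of one way to estimate the overlap: count the fraction of callset records whose POS falls inside a catalog interval. This is only an illustration. The function name `fraction_in_regions` and the in-memory inputs are hypothetical, and treating each variant as a single point ignores SV length, so a real comparison would likely need interval-vs-interval overlap.

```python
from bisect import bisect_right


def fraction_in_regions(variants, regions):
    """Fraction of variants whose position lies in any region.

    variants: list of (chrom, pos) with 0-based positions (assumption).
    regions:  list of (chrom, start, end), half-open, BED-style.
    """
    # Group intervals by contig, then merge overlapping/adjacent ones.
    by_chrom = {}
    for chrom, start, end in regions:
        by_chrom.setdefault(chrom, []).append((start, end))

    merged = {}
    for chrom, intervals in by_chrom.items():
        intervals.sort()
        out = []
        for start, end in intervals:
            if out and start <= out[-1][1]:
                out[-1][1] = max(out[-1][1], end)
            else:
                out.append([start, end])
        merged[chrom] = (out, [s for s, _ in out])

    # Binary-search each variant against the merged intervals of its contig.
    hits = 0
    for chrom, pos in variants:
        if chrom not in merged:
            continue
        intervals, starts = merged[chrom]
        i = bisect_right(starts, pos) - 1
        if i >= 0 and pos < intervals[i][1]:
            hits += 1
    return hits / len(variants) if variants else 0.0
```

Running this per AF bin (with AF taken from the callset's INFO field, if present) would give a rough answer to the second half of the question as well.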

@hangsuUNC
Collaborator Author

TRGT merge: 0aa95bb1-29e6-491b-a06f-530ee922b6bc

@samuelklee
Collaborator

Thanks, @hangsuUNC, can we perhaps populate some more details here? For example, 1) pointers to common inputs (joint SNP callset, integrated SV callsets, their HPRC chr1 subsets, etc.), 2) a table of the experiments we are running giving relevant submission IDs and pointers to generated results, and 3) eventually populate that table with precision/recall metrics and the corresponding submission IDs for evaluation runs (I can help with this part)?

Basically, for each experiment (e.g., HPRC 1.1 w/ TRGT and HiPhase 1.3.0), can you give me the inputs I need to run ChromosomePhasedPanelCreationFromHiPhase (https://app.terra.bio/#workspaces/allofus-drc-wgs-lr-prod/AoU_DRC_WGS_LongReads_PacBio%20PAPER%20COPY/job_history/6751528b-ac58-4e25-87c0-bae436efb83f for that experiment), followed by ConcatAndEvaluate (https://app.terra.bio/#workspaces/allofus-drc-wgs-lr-prod/AoU_DRC_WGS_LongReads_PacBio%20PAPER%20COPY/job_history/97d231ad-c40d-4efb-b2eb-06ee763de8c8)?

For example, see the Job History comments for those runs, which refer to the submission IDs that generated the joint SNP and integrated 1.1 HPRC chr1 callsets. It would be nice if the Job History comments were also self-contained, but there's only so much you can fit in there---better to keep full track of things here, and include actual pointers to inputs in buckets, etc.

Finally, after checking off bullet points above, it might also be nice to put a quick blurb here about any findings, including any pointers if they might be helpful.

@hangsuUNC
Collaborator Author

hangsuUNC commented Feb 4, 2025

The output files are here:

  1. HPRC_V10_woTRGT_SNP: gs://fc-secure-8e5a6fd7-16ae-4796-80ed-8f0463af5ff1/submissions/0c6101a5-42a5-493c-a5b8-9baf876c34d9/HierarchicallyMergeVcfs/e19d90c0-ed0c-41cd-acad-306642bed072/call-ConcatVcfs/HPRC_V10_woTRGT_SNP.vcf.gz
  2. HPRC_V10_woTRGT_SVs: gs://fc-secure-8e5a6fd7-16ae-4796-80ed-8f0463af5ff1/submissions/bd97d689-abf2-4f0a-bbe4-b28aae1ed794/HierarchicallyMergeVcfs/b9e64f4b-1508-45e9-bc2e-e3ee09f6ced2/call-ConcatVcfs/HPRC_V10_woTRGT_SVs.vcf.gz
  3. HPRC_V11_woTRGT_SNP: gs://fc-secure-8e5a6fd7-16ae-4796-80ed-8f0463af5ff1/submissions/fd9edac0-99e1-4907-89d0-2cb494797428/HierarchicallyMergeVcfs/1d1d5bb7-1ade-45b6-9337-d663408cf170/call-ConcatVcfs/attempt-2/HPRC_V11_woTRGT_SNP.vcf.gz
  4. HPRC_V11_woTRGT_SVs: gs://fc-secure-8e5a6fd7-16ae-4796-80ed-8f0463af5ff1/submissions/e72f3b25-a317-48c9-8b96-ed44142311f8/HierarchicallyMergeVcfs/cd9bacd3-1196-474d-bc8c-6ea2279486ac/call-ConcatVcfs/HPRC_V11_woTRGT_SVs.vcf.gz

All the information is listed in the Table: HPRC-extra-callers-hg38_set in the AoU Paper copy workspace.

@samuelklee
Collaborator

samuelklee commented Feb 4, 2025

| Experiment | ConcatAndEvaluate summary (not in TR) | ConcatAndEvaluate summary (in TR) | Merged HiPhase SNP output | Merged HiPhase SV output | ChromosomePhasedPanelCreationFromHiPhase submission | ConcatAndEvaluate submission (not in TR) | ConcatAndEvaluate submission (in TR) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| HPRC short w/ TRGT, HiPhase 1.4.5, w/ confident regions | x | x | gs://fc-secure-8e5a6fd7-16ae-4796-80ed-8f0463af5ff1/submissions/b9a18ef7-d764-451d-a13f-964caad117be/HierarchicallyMergeVcfs/d8da568a-097c-4498-87a5-13bc4e698d9f/call-ConcatVcfs/HPRC_V11_woSV_hiphase145_SNP.vcf.gz | gs://fc-secure-8e5a6fd7-16ae-4796-80ed-8f0463af5ff1/submissions/97511d9a-8906-402d-bce4-46270fbc4934/HierarchicallyMergeVcfs/0520d86a-8ffb-4493-9390-c0c7458d4d14/call-ConcatVcfs/HPRC_V11_woSV_hiphase145_TRGT.vcf.gz | https://app.terra.bio/#workspaces/allofus-drc-wgs-lr-prod/AoU_DRC_WGS_LongReads_PacBio%20PAPER%20COPY/submission_history/9989a299-a4b5-484d-a093-432992184ba1 | x | x |
| HPRC 1.1 w/o TRGT, HiPhase 1.4.5, w/ confident regions | https://storage.cloud.google.com/fc-secure-8e5a6fd7-16ae-4796-80ed-8f0463af5ff1/submissions/0277d9be-de58-4d39-babb-1cfd834011a6/PhasedPanelEvaluation/cffe41f7-e800-4fa3-91da-b618716dafef/call-SummarizeEvaluations/attempt-2/evaluation_summary.tsv?authuser=0 | https://storage.cloud.google.com/fc-secure-8e5a6fd7-16ae-4796-80ed-8f0463af5ff1/submissions/6fdbf976-16b5-468b-b456-df743068a179/PhasedPanelEvaluation/f4e0877b-050f-4da8-bd7a-258b1c24b883/call-SummarizeEvaluations/evaluation_summary.tsv?authuser=0 | gs://fc-secure-8e5a6fd7-16ae-4796-80ed-8f0463af5ff1/submissions/d57fbaba-9d9b-4bc4-8d32-bdc8e579b2a1/HierarchicallyMergeVcfs/40ba741a-4d81-4dc7-92b1-3177dc809b22/call-ConcatVcfs/HPRC_V11_woTRGT_hiphase145_SNP.vcf.gz | gs://fc-secure-8e5a6fd7-16ae-4796-80ed-8f0463af5ff1/submissions/db3a53a8-77db-4dbb-95c9-16d43d527f62/HierarchicallyMergeVcfs/44c49851-6829-4e73-ac15-0e516a2c034d/call-ConcatVcfs/HPRC_V11_woTRGT_hiphase145_SV.vcf.gz | https://app.terra.bio/#workspaces/allofus-drc-wgs-lr-prod/AoU_DRC_WGS_LongReads_PacBio%20PAPER%20COPY/job_history/6c84f37f-76f5-40ab-a6ec-81980fc2b3f7 | https://app.terra.bio/#workspaces/allofus-drc-wgs-lr-prod/AoU_DRC_WGS_LongReads_PacBio%20PAPER%20COPY/submission_history/0277d9be-de58-4d39-babb-1cfd834011a6 | https://app.terra.bio/#workspaces/allofus-drc-wgs-lr-prod/AoU_DRC_WGS_LongReads_PacBio%20PAPER%20COPY/submission_history/6fdbf976-16b5-468b-b456-df743068a179 |
| HPRC 1.1 w/o TRGT, HiPhase 1.4.5 | https://storage.cloud.google.com/fc-secure-8e5a6fd7-16ae-4796-80ed-8f0463af5ff1/submissions/656d0740-4c9d-4b85-bb64-3b5670f1dd04/PhasedPanelEvaluation/cd2b9543-4889-4801-b321-11176e3ecb1a/call-SummarizeEvaluations/evaluation_summary.tsv?authuser=0 | https://storage.cloud.google.com/fc-secure-8e5a6fd7-16ae-4796-80ed-8f0463af5ff1/submissions/a7fb3cc6-0ce6-4b45-9939-926e1094f2a4/PhasedPanelEvaluation/9f8f19aa-5f33-4a60-826a-277a80b6c67c/call-SummarizeEvaluations/evaluation_summary.tsv?authuser=0 | gs://fc-secure-8e5a6fd7-16ae-4796-80ed-8f0463af5ff1/submissions/d57fbaba-9d9b-4bc4-8d32-bdc8e579b2a1/HierarchicallyMergeVcfs/40ba741a-4d81-4dc7-92b1-3177dc809b22/call-ConcatVcfs/HPRC_V11_woTRGT_hiphase145_SNP.vcf.gz | gs://fc-secure-8e5a6fd7-16ae-4796-80ed-8f0463af5ff1/submissions/db3a53a8-77db-4dbb-95c9-16d43d527f62/HierarchicallyMergeVcfs/44c49851-6829-4e73-ac15-0e516a2c034d/call-ConcatVcfs/HPRC_V11_woTRGT_hiphase145_SV.vcf.gz | https://app.terra.bio/#workspaces/allofus-drc-wgs-lr-prod/AoU_DRC_WGS_LongReads_PacBio%20PAPER%20COPY/job_history/6c84f37f-76f5-40ab-a6ec-81980fc2b3f7 | https://app.terra.bio/#workspaces/allofus-drc-wgs-lr-prod/AoU_DRC_WGS_LongReads_PacBio%20PAPER%20COPY/job_history/656d0740-4c9d-4b85-bb64-3b5670f1dd04 | https://app.terra.bio/#workspaces/allofus-drc-wgs-lr-prod/AoU_DRC_WGS_LongReads_PacBio%20PAPER%20COPY/job_history/a7fb3cc6-0ce6-4b45-9939-926e1094f2a4 |
| HPRC 1.1 w/ TRGT, HiPhase 1.3.0 | https://storage.cloud.google.com/fc-secure-8e5a6fd7-16ae-4796-80ed-8f0463af5ff1/submissions/97d231ad-c40d-4efb-b2eb-06ee763de8c8/PhasedPanelEvaluation/602316f9-b3e3-4c03-b564-32085f63dc00/call-SummarizeEvaluations/evaluation_summary.tsv?authuser=0 | https://storage.cloud.google.com/fc-secure-8e5a6fd7-16ae-4796-80ed-8f0463af5ff1/submissions/6519095a-ad62-48f6-bd4f-695942d2af12/PhasedPanelEvaluation/f52d5a5b-d6fc-436a-b22a-9639dcb03e91/call-SummarizeEvaluations/evaluation_summary.tsv?authuser=0 | gs://fc-secure-8e5a6fd7-16ae-4796-80ed-8f0463af5ff1/submissions/494642a6-04d8-4afd-9a19-911daac9e6d6/HierarchicallyMergeVcfs/799d50c6-276f-4c90-8b40-24ce23e865ac/call-ConcatVcfs/HPRC.hiphase.short.merged.vcf.gz | gs://fc-secure-8e5a6fd7-16ae-4796-80ed-8f0463af5ff1/submissions/722a1d9f-2827-4b7d-a9e4-a1205671ea27/HierarchicallyMergeVcfs/bb6cc434-a96e-45da-81f3-e01b7a2bbf25/call-ConcatVcfs/HPRC.hiphase.SV.merged.vcf.gz | https://app.terra.bio/#workspaces/allofus-drc-wgs-lr-prod/AoU_DRC_WGS_LongReads_PacBio%20PAPER%20COPY/job_history/287c72b8-08c3-4ee4-a762-ed40c9dead54 | https://app.terra.bio/#workspaces/allofus-drc-wgs-lr-prod/AoU_DRC_WGS_LongReads_PacBio%20PAPER%20COPY/job_history/97d231ad-c40d-4efb-b2eb-06ee763de8c8 | https://app.terra.bio/#workspaces/allofus-drc-wgs-lr-prod/AoU_DRC_WGS_LongReads_PacBio%20PAPER%20COPY/job_history/6519095a-ad62-48f6-bd4f-695942d2af12 |
| HPRC 1.1 w/o TRGT, HiPhase 1.3.0 | https://storage.cloud.google.com/fc-secure-8e5a6fd7-16ae-4796-80ed-8f0463af5ff1/submissions/bc82695b-2e34-47cc-a3f6-a3c51a297a7b/PhasedPanelEvaluation/2038bd0e-e630-4892-9e0e-deecac5167d8/call-SummarizeEvaluations/evaluation_summary.tsv?authuser=0 | https://storage.cloud.google.com/fc-secure-8e5a6fd7-16ae-4796-80ed-8f0463af5ff1/submissions/d3f9ba00-21c0-4e31-b864-a6c1f7c1c5a3/PhasedPanelEvaluation/f8e2337e-74fd-4790-b996-4b2a71b92375/call-SummarizeEvaluations/evaluation_summary.tsv?authuser=0 | gs://fc-secure-8e5a6fd7-16ae-4796-80ed-8f0463af5ff1/submissions/fd9edac0-99e1-4907-89d0-2cb494797428/HierarchicallyMergeVcfs/1d1d5bb7-1ade-45b6-9337-d663408cf170/call-ConcatVcfs/attempt-2/HPRC_V11_woTRGT_SNP.vcf.gz | gs://fc-secure-8e5a6fd7-16ae-4796-80ed-8f0463af5ff1/submissions/e72f3b25-a317-48c9-8b96-ed44142311f8/HierarchicallyMergeVcfs/cd9bacd3-1196-474d-bc8c-6ea2279486ac/call-ConcatVcfs/HPRC_V11_woTRGT_SVs.vcf.gz | https://app.terra.bio/#workspaces/allofus-drc-wgs-lr-prod/AoU_DRC_WGS_LongReads_PacBio%20PAPER%20COPY/job_history/1a476796-969f-4bb4-8ca7-2f3bccc9b494 | https://app.terra.bio/#workspaces/allofus-drc-wgs-lr-prod/AoU_DRC_WGS_LongReads_PacBio%20PAPER%20COPY/job_history/bc82695b-2e34-47cc-a3f6-a3c51a297a7b | https://app.terra.bio/#workspaces/allofus-drc-wgs-lr-prod/AoU_DRC_WGS_LongReads_PacBio%20PAPER%20COPY/job_history/d3f9ba00-21c0-4e31-b864-a6c1f7c1c5a3 |
| HPRC 1.0 w/o TRGT, HiPhase 1.3.0 | x | x | gs://fc-secure-8e5a6fd7-16ae-4796-80ed-8f0463af5ff1/submissions/0c6101a5-42a5-493c-a5b8-9baf876c34d9/HierarchicallyMergeVcfs/e19d90c0-ed0c-41cd-acad-306642bed072/call-ConcatVcfs/HPRC_V10_woTRGT_SNP.vcf.gz | gs://fc-secure-8e5a6fd7-16ae-4796-80ed-8f0463af5ff1/submissions/bd97d689-abf2-4f0a-bbe4-b28aae1ed794/HierarchicallyMergeVcfs/b9e64f4b-1508-45e9-bc2e-e3ee09f6ced2/call-ConcatVcfs/HPRC_V10_woTRGT_SVs.vcf.gz | https://app.terra.bio/#workspaces/allofus-drc-wgs-lr-prod/AoU_DRC_WGS_LongReads_PacBio%20PAPER%20COPY/job_history/4d55d2fa-6c6a-4445-80f5-9ba16d10786c (FAILED, UNSORTED?) | x | x |

@samuelklee
Collaborator

samuelklee commented Feb 5, 2025

We may also want to consider the effect of the catalog, see e.g. PacificBiosciences/HiPhase#58 (although at this stage, it seems like the decision would be to exclude TRGT if we feel our catalog exhibits similar behavior, not try to rerun TRGT with a different catalog).

@samuelklee
Collaborator

samuelklee commented Feb 5, 2025

For some reason, a VCF appears to be unsorted in the HPRC 1.0 w/o TRGT, HiPhase 1.4.0 run by the time it gets to the PanGenie panel creation script. We haven't run into this before, but it might be due to a newly inserted normalization step (which was introduced to resolve some intermittent issues with adjacent unnormalized variants that KAGE was bugging on); perhaps we also need to sort after. For now, let's not worry about that run.
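
If the normalization step is indeed the cause, the likely mechanism is that left-alignment moves a record's POS, so records that were in order before can end up out of order after. A quick way to confirm where sorting breaks is to scan (contig, POS) pairs and report the first record that precedes its predecessor. This is a hedged sketch, not part of the workflow; `first_unsorted_record` and its inputs are hypothetical names, and the contig order is assumed to come from the VCF header.

```python
def first_unsorted_record(records, contig_order):
    """Return the index of the first out-of-order record, or None if sorted.

    records:      iterable of (chrom, pos) in file order.
    contig_order: list of contig names in header order (assumption).
    """
    rank = {contig: i for i, contig in enumerate(contig_order)}
    previous_key = None
    for index, (chrom, pos) in enumerate(records):
        key = (rank[chrom], pos)
        # Equal keys are allowed: several records may share a position.
        if previous_key is not None and key < previous_key:
            return index
        previous_key = key
    return None
```

If this points at records near a normalized indel, adding a sort after normalization would be the natural fix to try.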

Copying over some Slack discussion:

Interestingly, it looks like HPRC 1.1 w/ TRGT and HiPhase 1.3.0 compares unfavorably against HPRC 1.1 w/o TRGT and HiPhase 1.3.0. See e.g. SV recall after HiPhase in TR regions of 38% vs. 75%, at comparable precisions ~80%. So it's possible something is going wrong right off the bat when including TRGT---perhaps bcftools merge is to blame somewhere? Interestingly, after Shapeit4 imputation, recall is ~75% for both.

  1. Have we come to any conclusions about what TRGT merge does?
  2. Probably it's not the main culprit, but we should still investigate HiPhase 1.3.0 vs. HiPhase 1.4.0. At this point I'd suggest running HPRC 1.1 w/o TRGT and HiPhase 1.3.0 (previously I suggested w/ TRGT).
  3. Some understanding of the raw recall of the TRGT genotypes alone would be nice. Do we have this anywhere vs. HPRC dipcall?
  4. I suspect that TRGT can be included in an appropriate way, but perhaps not as naively as we are doing so far. For example, choice of catalog might be having an effect; see "The impact of repeat variants on phasing" (PacificBiosciences/HiPhase#58). More generally, with our naive approach, I think more care needs to be taken in understanding and resolving any overlaps between the short/integrated callsets and the TRGT genotypes. Probably this is as simple as a) using an appropriately sparse catalog, b) preferring TRGT in appropriate regions, and c) using TRGT merge and bcftools merge in the appropriate ways. But perhaps this is more effort than we want to expend at this point. What does everyone think? At least we can tell Matt Danzi we tried!

EDIT: Turns out the runs in the table above were accidentally done with 1.3.0, not 1.4.0; we've updated the text there but not here. We will proceed to do runs with 1.4.5 instead.

@hangsuUNC
Collaborator Author

The TRGT merged joint callset before Hiphase is here: gs://fc-secure-8e5a6fd7-16ae-4796-80ed-8f0463af5ff1/submissions/0aa95bb1-29e6-491b-a06f-530ee922b6bc/TRGTMerge/47e34134-97e7-4b48-8c63-aad2ff057ab7/call-TRGTMerge/MergeTRGTbeforeHiphase.trgtmerged.vcf.gz

Do we still want to explore running TRGT merge before HiPhase, i.e., as callset integration followed by split, HiPhase, and then bcftools merge?

@samuelklee
Collaborator

samuelklee commented Feb 6, 2025

@hangsuUNC let me know where your thinking is at on how to proceed. I think you and @SHuang-Broad can make the call. IMO some quick HPRC testing of HiPhase 1.4.5 + 1.1 seems light enough that we should do it before proceeding to run AoU + HPRC, but whether we want to ramp down on TRGT or do a bit more digging while we're here (and maybe while I proceed with Shapeit4, etc.) is up to you both.

In any case, let's try to continue to record findings here---thanks again!

@hangsuUNC
Collaborator Author

HPRC testing of HiPhase 1.4.5 + v1.1 is here:
Short: gs://fc-secure-8e5a6fd7-16ae-4796-80ed-8f0463af5ff1/submissions/d57fbaba-9d9b-4bc4-8d32-bdc8e579b2a1/HierarchicallyMergeVcfs/40ba741a-4d81-4dc7-92b1-3177dc809b22/call-ConcatVcfs/HPRC_V11_woTRGT_hiphase145_SNP.vcf.gz
SV: gs://fc-secure-8e5a6fd7-16ae-4796-80ed-8f0463af5ff1/submissions/db3a53a8-77db-4dbb-95c9-16d43d527f62/HierarchicallyMergeVcfs/44c49851-6829-4e73-ac15-0e516a2c034d/call-ConcatVcfs/HPRC_V11_woTRGT_hiphase145_SV.vcf.gz

@samuelklee
Collaborator

samuelklee commented Feb 19, 2025

Thanks, @hangsuUNC, ran the downstream evaluations and added them to the top of the table above.

Looks like there's only a marginal difference between 1.1 + 1.4.5 w/o TRGT and 1.1 + 1.3.0 w/o TRGT. So if we are OK to drop TRGT, I would be OK with moving forward.

Unfortunately, 1.0 + 1.3.0 w/o TRGT failed---which integrated SV callset did you use here, kanpig or Sniffles? Regardless, we have some old numbers that are in the ballpark of those for the 1.1 runs, so I'm still OK with moving forward.

That said, again, I think the decision about how to proceed with TRGT lies with you and @SHuang-Broad. I might suggest conferring with @fabio-cunial since I think he is looking at it for the Hapestry preprint. But given that TRGT can be directly consumed by HiPhase, I would suggest that for the purposes of the AoU workflows, we think of incorporating TRGT as part of the physical-phasing problem rather than as part of the integration problem.

@samuelklee
Collaborator

Note job_history links above are currently broken because of https://broadinstitute.slack.com/archives/C07RK8KNWMV/p1739888818003019

@samuelklee
Collaborator

samuelklee commented Feb 20, 2025

Note that I finally added subsetting to confident regions to our Vcfdist implementation (the Vcfdist command line only allows one BED file, which we allocated to account for context). See the top row in the table above. This appears to boost precision a bit for non-TR/HP, and somewhat more substantially for TR/HP.
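
For anyone reproducing this: since only one BED file can be passed, one generic way to apply two region sets at once is to intersect them beforehand and pass the result. The sketch below shows that intersection for a single contig. It is an illustration under that assumption and not necessarily how our implementation does the subsetting; `intersect_intervals` is a hypothetical helper.

```python
def intersect_intervals(a, b):
    """Intersect two interval lists on the same contig.

    a, b: sorted, non-overlapping lists of (start, end), half-open (BED-style).
    Returns the sorted list of overlapping segments.
    """
    result = []
    i = j = 0
    while i < len(a) and j < len(b):
        start = max(a[i][0], b[j][0])
        end = min(a[i][1], b[j][1])
        if start < end:
            result.append((start, end))
        # Advance whichever interval ends first; the other may still overlap
        # the next interval from the opposite list.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return result
```

With `a` as the confident regions and `b` as the TR (or non-TR) context intervals, the output is the stratified region set for one contig.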

Will go back and regenerate some figures for the paper accordingly.

@samuelklee
Collaborator

samuelklee commented Feb 21, 2025

We cleared up some confusion: it turns out the actual TRGT calls in the w/ TRGT run never made it to the downstream pipeline (I incorrectly assumed they were incorporated in the merged HiPhase SV outputs). We see they affect the accuracy of the integrated SVs in TR regions anyway, presumably by negatively affecting their phasing; see also https://github.com/PacificBiosciences/HiPhase/blob/main/docs/user_guide.md#why-are-some-smallstructural-variants-unphased-when-i-added-tandem-repeats. Here's the plan going forward (copied from Slack):

Hmm, let's be really clear about what needs to get run next:

  • truvari on raw TRGT w/ Matt's catalog
  • truvari on raw TRGT w/ Ben's catalog
  • truvari on integrated SV

All of the above on a single sample vs. dipcall in confident regions and perhaps in/out TRs.

Then, with the completed 1.3.0 + 1.1 + short + SV + TRGT run:

  • bcftools merge the TRGT HiPhase outputs
  • bcftools merge the merged SV and TRGT HiPhase outputs
  • ConcatAndEvaluate with short + (SV+TRGT)

Then, with the just completed 1.4.5 + 1.1 + short + TRGT run:

  • bcftools merge the TRGT HiPhase outputs (EDIT: added to table above)
  • ConcatAndEvaluate with short + TRGT (this can be directly compared to already evaluated 1.4.5 + 1.1 + short + SV runs) (EDIT: ChromosomePhasedPanelCreationFromHiPhase submitted and added to table above)

Also recall that these two HiPhase runs were done w/ Matt's catalog, so depending on what these experiments show we may need to repeat some with Ben's catalog. Please feel free to check and/or edit to link submissions. Thanks @hangsuUNC!

@samuelklee
Collaborator

@hangsuUNC A step using a GATK tool in the ChromosomePhasedPanelCreationFromHiPhase run on 1.4.5 + 1.1 + short + TRGT failed due to:

htsjdk.tribble.TribbleException: The provided VCF file is malformed at approximately line number 1194060: unparsable vcf record with allele TCATCTGGAAGCCCCTACTCCCACCTCACCACACACATGCACATCACCCCCCACACACACCAAACAMCCCACACAACACACACACACCACACCACACAAACACAAACACACCACATCATGAACACACACATCACACACACACCACACACCCCACACACCCCACACACATCACACACACACACCACACACCCCACACAACACACAACACACACCACACACACCACACCACACACACACCACACCCCACAACACACACACACCACACTCCCCACTGAACACACACACATCACACACACCACACACCCCACACACACCACACACCCCACACAACACACACCACACACACACCACACCCCACAACACACACACATCACATCACACACACCACACCCCACACACACACACCACACCACACACACACCACACACCCCACACAACACATACACACATCACACACACCCCACACACCAACACACCACATCACACACACACCACATCACACACACACCAAACACCCCACACAACACACACACAACACAACACACACACAAACACACCACATCATACACACACCACACACCCCACACACCACACACACATCACACACACACCCCACACACCCCACACACCCCACACCACACTACACACACACCACACACCACACACAACACACACAACACAACACACACACCCCACACACAAACACCCCACACACCACACACACCACATCACACACACACACCACATCACACCCCACACAACACACACACCACACCACACACACACACCACACACCCCACACAACACACACCGCATCACACACACACCACACAACACACACACACCACACACACCACACACCCCACACAACACACAACACATCACACACACCACACACACCACACAACACACACATCACATACACACCACACACCACACACAACACATACACATCACACACACACCACACACCACACACAACACACACAACACAACACACACAACACACACCCCACACAACACACCACATCACACACACATCACACAACACACAACACAACACACACATCACACACCCCACACAACACACCACATCACACAAAACACATCACACACACAACACACACCCCACACAACACACACAACACAACACACACACCACACACCCCACACAACACACACAACACACCACACATATACATCACACACCACACACCCCACGCAACACACACACCACATCACACACACACCACACACCCCACACACACACGCATCACACCACACACACCACAGCCCCCACACAACACATACACACCACATCACACACACACCACACATCACATGTCATACACAGCACATACACCACACACACCACATAACATCACATGTCACACACACATCACATGACACATACACCACACACCCCATGCATCACACACACACCACACATCACATGTCATACATACTACATACACACAACACACACAACACATAACATCACATGTCACACACATCACATGACACACACCACACACTCCACACATCACATACACACTGCACATACACTACATCACACACCACACACCACACATCACATGTCACAAACACCACACAGCACACCCCACACCACACACATACACCACATACACAAATACCACACCACACACCACACATACACCACACCACACACACTTCACACACACCACACATCACATGTCACACACATTAGGTACACACCGAACACACACAACACACATTAAATGCCACATACAACACACCACACACATTAAACACACACCCCACACATAAGTCACACACATCACACACACAACACACCATGCGCTAAATACATCACACATA
CTACACACATGCAAATCACATCACAGACACCACATACGCACCACACCACATACCACACACACGACACATCACATGCCACACATCACCTGTCACACACATCAAACACACACACAATACACCACACACCTCACACACGTCAAACAAACCCCACACACACCATATCACACACACATCACACACACCACACATGCACCATATGCCTCCACACACAGAGACACATACACATCACACACCCTCACACACACACACCCCACATGCCATTTATACCACATGCCACAAACATTACATGCA,

Maybe a bad allele in the merged TRGT file? I've got to run, but let me know if you get a chance to dig!

@hangsuUNC
Collaborator Author

hangsuUNC commented Feb 21, 2025

Yes, I believe there are a bunch of bad alleles in the TRGT file. For example, I just found that the raw call evaluation WDL failed because there is an "M" in the reference allele in the VCF file. Maybe we should do another round of filtering of TRGT calls first? Please let me know what you think, @samuelklee.

@samuelklee
Collaborator

Yes, I’d also see if the other catalog has this issue. Thanks!

@hangsuUNC
Collaborator Author

Yes, I have checked with the other catalog, and it seems to be a common issue: they replace reference Ns with some other letters.
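
If we do go with another round of filtering, a minimal sketch of the check could look like the following: flag any record whose REF or ALT contains a base other than A, C, G, T, or N (such as the IUPAC code "M" in the failing record). This is only an illustration, not an existing workflow step; `has_invalid_allele` is a hypothetical helper that operates on a single tab-delimited VCF data line, and it deliberately skips symbolic alleles, breakends, `*`, and `.`.

```python
import re

# Bases accepted here: A, C, G, T, N in either case (assumption: this matches
# what the downstream htsjdk-based parser accepts for sequence alleles).
_SEQUENCE_ALLELE = re.compile(r"[ACGTNacgtn]+")


def has_invalid_allele(vcf_line):
    """Return True if REF or any ALT contains a non-ACGTN base."""
    fields = vcf_line.rstrip("\n").split("\t")
    ref = fields[3]
    alts = fields[4].split(",")
    for allele in [ref] + alts:
        # Skip missing, spanning-deletion, symbolic, and breakend alleles.
        if allele in (".", "*") or allele.startswith("<"):
            continue
        if "[" in allele or "]" in allele:
            continue
        if _SEQUENCE_ALLELE.fullmatch(allele) is None:
            return True
    return False
```

Counting flagged records per catalog before dropping them would also tell us how much of each TRGT callset is affected.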
