Hi bcftools,
I'm part of a project building a pangenome, and we noticed some strange output from bcftools norm in how it handles structural variants. I thought you might have some good recommendations, or would want to be aware of the behavior.
Here is the command I ran:
bcftools norm --fasta-ref $REF_FASTA input.vcf -o output.vcf
We noticed two main problems. First, the POS of larger structural variants is in some cases shifted many more base pairs than we'd anticipate. For example, an 8887bp insertion at CHR1:41577 is shifted 140bp to CHR1:41437.
Second, we noticed a particularly difficult case. In the original VCF (prior to normalization/left alignment), there are 5 structural variants distributed across two sites. At CHR1:671683, the individuals Moly and Tany have a >200bp insertion, while the third individual, Pach, has the reference genotype. At CHR1:671691, all three individuals have a >150bp insertion. After normalization and left alignment, all of the variants at the second site (CHR1:671691) are reassigned to CHR1:671683. This makes it appear as if there are conflicting alleles at the same site in our VCF.
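To illustrate what I suspect is going on, here is a minimal Python sketch of left alignment in general. It is a toy model and not bcftools' actual implementation; the function name, the toy reference, and the 0-based coordinates are all made up for illustration. It shows how two insertions that start at different positions inside the same repeat can both end up at the same POS.

```python
def left_align_insertion(ref: str, pos: int, inserted: str) -> tuple[int, str]:
    """Left-align a pure insertion (toy model, hypothetical helper).

    ref      : reference sequence
    pos      : 0-based index of the reference base the insertion follows
    inserted : inserted bases, without the anchor base
    """
    # While the last inserted base equals the reference base directly left
    # of the insertion point, rotate the insertion one base to the left.
    # The resulting haplotype is unchanged; only the representation moves.
    while pos > 0 and inserted[-1] == ref[pos]:
        inserted = ref[pos] + inserted[:-1]
        pos -= 1
    return pos, inserted


# Toy reference: unique flank, then an (AC)x6 repeat, then unique flank.
ref = "GATTG" + "AC" * 6 + "GGT"

# Two insertions reported at different positions inside the repeat.
first = left_align_insertion(ref, 12, "ACAC")
second = left_align_insertion(ref, 16, "ACACAC")

print(first)
print(second)
```

In this toy case both insertions slide to the left edge of the repeat and share one position afterwards, even though they were distinct sites before alignment.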
I'm aware that there are options, for example, to remove duplicates or collapse variants. However, that would remove information we know is there based on the other outputs from the pangenome. For example, I would like to avoid representing the inserted sequence in Moly as a single 200bp insertion when we know that at least 350bp are inserted in this region. Any feedback is appreciated. Thank you!