GitHub - arvkevi/clinvar-kaggle: Scripts used to generate the ClinVar conflicting classifications dataset on Kaggle

Scripts and data used to prepare a Kaggle dataset.

Generate the dataset from the ClinVar .vcf with VEP annotations:
Running python process_clinvar.py generates a version of clinvar_conflicting.csv that includes the VEP annotations.

Check out the notebook (clinvar-conflicting-eda.ipynb) to see some exploratory data analysis.

Problem Statement

The objective is to predict whether a ClinVar variant will have conflicting classifications.

A variant has conflicting classifications when its submissions fall into at least two of the following three categories. Two submissions in the same category are not considered conflicting.

Likely Benign or Benign
VUS
Likely Pathogenic or Pathogenic

The CLASS feature in clinvar_conflicting.csv is a binary indicator of whether a variant has conflicting classifications: 0 represents consistent classifications and 1 represents conflicting classifications.
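The rule above can be sketched in a few lines of Python. This is only an illustration of the definition, not the logic in process_clinvar.py: the label strings, the CATEGORY mapping, and the function name conflict_class are all assumptions made for the example.

```python
# Illustrative only: the label strings and grouping are assumptions,
# not the actual code in process_clinvar.py.
CATEGORY = {
    "benign": "benign",
    "likely benign": "benign",
    "uncertain significance": "vus",
    "vus": "vus",
    "likely pathogenic": "pathogenic",
    "pathogenic": "pathogenic",
}


def conflict_class(submissions):
    """Return 1 if the submissions span two or more categories, else 0."""
    categories = {CATEGORY[s.strip().lower()] for s in submissions}
    return 1 if len(categories) >= 2 else 0


# Two submissions in the same category are consistent.
print(conflict_class(["Benign", "Likely benign"]))
# Submissions in two different categories are conflicting.
print(conflict_class(["Likely benign", "Uncertain significance"]))
```

Collapsing each submission to its category first and then counting distinct categories keeps the "same category is not a conflict" case explicit.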

Because this problem only concerns variants with multiple classifications, I removed all variants from the original ClinVar .vcf that had only one submission.

Background

ClinVar is a public resource containing annotations about human genetic variants. These variants are classified on a five-level spectrum: benign, likely benign, uncertain significance (VUS), likely pathogenic, and pathogenic. Variants that have conflicting classifications (defined above) can cause confusion when clinicians or researchers try to interpret whether the variant has an impact on a given patient's disease.

I'm exploring ideas for applying machine learning to genomics. I'm hoping this project will encourage others to think about the additional feature engineering that's probably necessary to address the objective with confidence. There could also be a benefit to identifying single-submission variants that have not yet been assigned a conflicting classification but may be in the future.

VEP annotations

Ensembl's Variant Effect Predictor (VEP) was used to annotate the original ClinVar .vcf. It provides additional information about variants that can serve as features for the dataset.
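As a rough sketch of how these annotations can be turned into features: VEP writes its annotations into the CSQ key of the VCF INFO column as pipe-delimited values, with the field names listed after "Format: " in the CSQ header line. The header and value below are shortened, made-up examples, and parse_csq is a hypothetical helper, not a function from this repository.

```python
def parse_csq(header_description, info_value):
    """Split a VEP CSQ INFO value into one dict per annotation entry."""
    # VEP lists the field names after "Format: " in the CSQ header line.
    fields = header_description.split("Format: ")[1].strip('">').split("|")
    records = []
    # Entries are comma-separated, one per allele/transcript combination.
    for entry in info_value.split(","):
        values = entry.split("|")
        records.append(dict(zip(fields, values)))
    return records


# Hypothetical, shortened header and value; a real file has many more fields.
header = (
    "Consequence annotations from Ensembl VEP. "
    "Format: Allele|Consequence|IMPACT|SYMBOL"
)
value = "T|missense_variant|MODERATE|BRCA2,T|intron_variant|MODIFIER|BRCA2"

for rec in parse_csq(header, value):
    print(rec["SYMBOL"], rec["Consequence"], rec["IMPACT"])
```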

Step 1:

Download the annotated .vcf and rename it to clinvar.annotated.vcf

Step 2:

Create the new dataset with VEP annotations:
python process_clinvar.py
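Once the CSV exists, a quick sanity check is to count how many variants fall into each CLASS value. The sketch below uses only the standard library and a tiny made-up sample; CHROM and POS are placeholder column names for the example, and class_counts is a hypothetical helper. Only the CLASS column is taken from the description above.

```python
import csv
import io
from collections import Counter


def class_counts(handle):
    """Count rows per CLASS value in a CSV that has a CLASS column."""
    reader = csv.DictReader(handle)
    return Counter(row["CLASS"] for row in reader)


# Tiny made-up sample standing in for clinvar_conflicting.csv.
sample = io.StringIO("CHROM,POS,CLASS\n1,1000,0\n1,2000,1\n2,3000,0\n")
counts = class_counts(sample)
print("consistent:", counts["0"], "conflicting:", counts["1"])

# With the real file, assuming it is in the working directory:
# with open("clinvar_conflicting.csv", newline="") as fh:
#     print(class_counts(fh))
```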

