Scripts used to generate the ClinVar conflicting classifications dataset on Kaggle

arvkevi/clinvar-kaggle

Scripts and data used to prepare a Kaggle dataset.

Generate the dataset from a ClinVar .vcf with VEP annotations:
Running python process_clinvar.py generates a version of clinvar_conflicting.csv that includes the VEP annotations.

Check out the notebook to see some exploratory data analysis.

Problem Statement

The objective is to predict whether a ClinVar variant will have conflicting classifications.

A variant has conflicting classifications when its submissions fall into at least two of the following three categories. Two submissions within the same category are not considered conflicting.

  1. Likely Benign or Benign
  2. VUS
  3. Likely Pathogenic or Pathogenic

The CLASS feature in clinvar_conflicting.csv is a binary representation of whether or not a variant has conflicting classifications where 0 represents consistent classifications and 1 represents conflicting classifications.
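The rule above can be sketched in a few lines of Python. This is only an illustration of the definition, not the code in process_clinvar.py: the label strings in CATEGORY and the function name conflict_class are assumptions made for the example.

```python
# Hypothetical mapping from classification labels to the three categories above.
# The exact label strings used in the ClinVar .vcf are an assumption here.
CATEGORY = {
    "benign": "benign",
    "likely benign": "benign",
    "uncertain significance": "vus",
    "likely pathogenic": "pathogenic",
    "pathogenic": "pathogenic",
}


def conflict_class(submissions):
    """Return 1 if the submissions span two or more categories, else 0."""
    categories = {CATEGORY[label.strip().lower()] for label in submissions}
    return int(len(categories) >= 2)


# Benign and Likely benign share a category, so they do not conflict.
print(conflict_class(["Benign", "Likely benign"]))
# Benign and Uncertain significance fall into different categories.
print(conflict_class(["Benign", "Uncertain significance"]))
```

Collapsing the labels into a set of categories first means repeated submissions in one category never count as a conflict.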

Since this problem only relates to variants with multiple classifications, I removed all variants from the original ClinVar vcf that had only one submission.

Background

ClinVar is a public resource containing annotations about human genetic variants. These variants are classified on a five-level spectrum: benign, likely benign, uncertain significance, likely pathogenic, and pathogenic. Variants that have conflicting classifications (defined above) can cause confusion when clinicians or researchers try to interpret whether the variant has an impact on the disease of a given patient.

I'm exploring ideas for applying machine learning to genomics. I'm hoping this project will encourage others to think about the additional feature engineering that's probably necessary to confidently assess the objective. There could be benefit to identifying single-submission variants that may yet be assigned a conflicting classification.

VEP annotations

Ensembl's Variant Effect Predictor (VEP) was used to annotate the original ClinVar .vcf. It provides additional information about variants that can serve as features for the dataset.
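By default, VEP writes its annotations into the CSQ key of the VCF INFO column: one comma-separated entry per consequence, with pipe-separated fields whose names are listed in the CSQ header line. Here is a minimal sketch of turning such a value into feature dictionaries. The field list and the sample string are made up for illustration, and process_clinvar.py may parse the file differently.

```python
def parse_csq(csq_value, field_names):
    """Split a CSQ value into one dict per annotated consequence."""
    records = []
    for entry in csq_value.split(","):
        values = entry.split("|")
        records.append(dict(zip(field_names, values)))
    return records


# Illustrative field names; the real order comes from the CSQ header line.
fields = ["Allele", "Consequence", "IMPACT", "SYMBOL"]
# Made-up CSQ value with two entries, not taken from the real file.
csq = "A|missense_variant|MODERATE|GENE1,A|intron_variant|MODIFIER|GENE1"

annotations = parse_csq(csq, fields)
print(annotations[0]["Consequence"])
```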

Step 1:

Download the annotated .vcf and rename it to clinvar.annotated.vcf.

Step 2:

Create the new dataset with VEP annotations by running python process_clinvar.py.
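Once clinvar_conflicting.csv exists, a quick sanity check is to count how many variants fall into each CLASS value. The sketch below uses only the standard library. The function name class_balance is hypothetical, and the in-memory rows are placeholders, not real ClinVar records. The only column taken from this README is CLASS.

```python
import csv
import io
from collections import Counter


def class_balance(handle):
    """Count rows per CLASS value in a CSV file-like object."""
    reader = csv.DictReader(handle)
    return Counter(row["CLASS"] for row in reader)


# Placeholder rows for illustration only; CHROM and POS are assumed columns.
sample = io.StringIO("CHROM,POS,CLASS\n1,100,0\n1,200,1\n2,300,0\n")
print(class_balance(sample))
```

To run it against the generated file, pass an open handle instead, for example open("clinvar_conflicting.csv", newline="").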
