A machine learning project for classifying genetic variants based on ClinVar data.
This project analyzes ClinVar variants that have conflicting interpretations of clinical significance and classifies them as Benign or Pathogenic/Likely Pathogenic using machine learning.
- Loading and preprocessing ClinVar data
- Classification of genetic variants (Benign vs Pathogenic)
- Feature importance analysis for predictions
- Results visualization
The project uses the clinvar_conflicting.csv dataset, which contains information about genetic variants with conflicting interpretations of clinical significance. Its columns include:
- Genomic coordinates: CHROM, POS, REF, ALT
- Allele frequencies: AF_ESP, AF_EXAC, AF_TGP
- Clinical information: CLNSIGINCL, CLNDN, CLNHGVS
- Functional predictions: SIFT, PolyPhen, CADD_PHRED, BLOSUM62
- Annotations: Consequence, IMPACT, SYMBOL, BIOTYPE
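As a minimal sketch of the loading step, the snippet below reads a few of the columns listed above with pandas. The inline CSV is a made-up stand-in for clinvar_conflicting.csv, and the `load_variants` helper and `COLUMNS` list are hypothetical names used only for illustration; the notebook may load the data differently.

```python
import io

import pandas as pd

# Tiny inline stand-in for clinvar_conflicting.csv. The values are invented
# for illustration and are not taken from ClinVar.
SAMPLE_CSV = """CHROM,POS,REF,ALT,AF_ESP,AF_EXAC,AF_TGP,CADD_PHRED,IMPACT
1,1000,G,C,0.07,0.10,0.11,1.05,MODERATE
X,2000,G,A,0.0,0.0,0.0,31.0,HIGH
"""

# Subset of the columns named in this README (hypothetical selection).
COLUMNS = ["CHROM", "POS", "REF", "ALT", "AF_ESP", "AF_EXAC", "AF_TGP",
           "CADD_PHRED", "IMPACT"]


def load_variants(source):
    """Read the selected columns from a CSV path or file-like object.

    CHROM is read as a string because it mixes numbers with labels such as X.
    """
    return pd.read_csv(source, usecols=COLUMNS, dtype={"CHROM": str})


df = load_variants(io.StringIO(SAMPLE_CSV))
print(df.head())
```

To run this against the real file, pass the path instead of the in-memory buffer, for example `load_variants("clinvar_conflicting.csv")`.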
Data Preprocessing:
- Simplifying classification labels from CLNSIGINCL
- Filtering ambiguous variants
- Binary classification: 0 (Benign) and 1 (Pathogenic/Likely Pathogenic)
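The preprocessing steps above could be sketched roughly as follows. This assumes CLNSIGINCL holds `id:Significance` entries separated by `|`, and that only an exact "Benign" term maps to 0; the `simplify_label` helper and the two term sets are hypothetical and may not match the notebook's actual rules.

```python
# Assumed term sets; adjust to the labelling rules actually used.
BENIGN_TERMS = {"benign"}
PATHOGENIC_TERMS = {
    "pathogenic",
    "likely_pathogenic",
    "pathogenic/likely_pathogenic",
}


def simplify_label(clnsigincl):
    """Map a CLNSIGINCL value to 0 (Benign), 1 (Pathogenic/Likely Pathogenic),
    or None when the value is missing or ambiguous."""
    if not isinstance(clnsigincl, str) or not clnsigincl.strip():
        return None  # missing value (e.g. NaN or empty string)

    # Keep only the significance part of each "id:Significance" entry.
    terms = {
        entry.split(":")[-1].strip().lower().replace(" ", "_")
        for entry in clnsigincl.split("|")
    }

    is_pathogenic = bool(terms & PATHOGENIC_TERMS)
    is_benign = bool(terms & BENIGN_TERMS)
    has_other = bool(terms - PATHOGENIC_TERMS - BENIGN_TERMS)

    # Ambiguous: unknown terms, both classes present, or neither present.
    if has_other or is_pathogenic == is_benign:
        return None
    return 1 if is_pathogenic else 0


print(simplify_label("424754:Likely_pathogenic"))
print(simplify_label("1:Benign|2:Pathogenic"))
```

Rows for which the helper returns `None` would then be dropped before training, which corresponds to the "filtering ambiguous variants" step.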
Machine Learning:
- Random Forest Classifier implementation
- Feature importance analysis for model interpretation
The chart in image.png shows the importance of each feature for genetic variant classification.
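For readers who want to see how such importances are obtained, here is a small sketch using scikit-learn's `RandomForestClassifier` and its `feature_importances_` attribute. The data is synthetic (not ClinVar), the three feature names are borrowed from the column list only as labels, and the hyperparameters are assumptions rather than the notebook's settings.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
feature_names = ["AF_EXAC", "CADD_PHRED", "BLOSUM62"]

# Synthetic stand-in data: the label depends only on the second column,
# so that column should dominate the importance ranking.
X = rng.normal(size=(400, 3))
y = (X[:, 1] > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Assumed hyperparameters, chosen only for the example.
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

accuracy = model.score(X_test, y_test)
print(f"held-out accuracy: {accuracy:.3f}")

# Impurity-based importances sum to 1; sort them from most to least important.
ranked = sorted(
    zip(feature_names, model.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, importance in ranked:
    print(f"{name}: {importance:.3f}")
```

The `ranked` list can be passed to a matplotlib bar chart to produce a figure similar in spirit to the one described above.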
- Python 3.12
- pandas - data processing
- scikit-learn - machine learning
- matplotlib - visualization
- numpy - numerical computations
- Clone the repository:
    git clone https://github.com/3x6dll9ff/Genetic_Variant__Classifications.git
    cd Genetic_Variant__Classifications
- Create a virtual environment:
    python3.12 -m venv .venv
    source .venv/bin/activate
- Install dependencies:
    pip install pandas scikit-learn matplotlib numpy seaborn jupyter
- Launch Jupyter Notebook:
    jupyter notebook "Genetic_Variant_ Classifications.ipynb"
├── Genetic_Variant_ Classifications.ipynb # Main analysis notebook
├── clinvar_conflicting.csv # ClinVar dataset
├── image.png # Feature importance visualization
├── .gitignore # Git ignore file
└── README.md # This file
- ClinVar Database - database of genetic variants and their clinical significance
Danila Kardashevskii
MIT License
