This project demonstrates the implementation of machine learning models to predict breast cancer risk using a publicly available dataset: the Breast Cancer Coimbra dataset. The models used in this study include Logistic Regression, Random Forest, and Support Vector Machines (SVM), all implemented using the scikit-learn library.
The goal of this project is to evaluate how effectively machine learning models can predict breast cancer risk. The models are trained on the Coimbra dataset to classify whether a subject has breast cancer. This repository contains the implementation of the models and a detailed comparison of their performance.
One dataset is used in this project:
- Breast Cancer Coimbra dataset: A dataset containing clinical features for breast cancer prediction (included in the repository). An updated copy is also available from the UCI Machine Learning Repository.
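
As a quick orientation, the snippet below is a minimal sketch of loading the dataset with pandas. The file name `dataR2.csv` and the target column name `Classification` follow the UCI distribution of the Coimbra dataset; the copy bundled in this repository may use a different path, so adjust accordingly.

```python
import pandas as pd

# Load the Breast Cancer Coimbra dataset.
# "dataR2.csv" is the file name used in the UCI distribution; adjust the
# path to match the copy included in this repository.
df = pd.read_csv("dataR2.csv")

# "Classification" is the binary target column in the UCI distribution;
# the remaining columns are the clinical features.
X = df.drop(columns=["Classification"])
y = df["Classification"]

print(X.shape)
print(y.value_counts())
```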
To set up the project locally, follow these steps:

- Clone the repository:

      git clone https://github.com/PortiaKwanele/Breast-Cancer-Risk-Prediction-using-Machine-Learning.git
      cd breast-cancer-prediction

- Create a virtual environment:

      python -m venv venv
      source venv/bin/activate  # On Windows use venv\Scripts\activate

- Install the required dependencies:

      pip install -r requirements.txt

Run the models by executing the `main.py` file:

    python main.py
The results are printed to the terminal, displaying the performance metrics (accuracy, precision, recall, F1-score) for each model.
The following machine learning models were implemented using the scikit-learn library:
- Logistic Regression
- Random Forest
- Support Vector Machines (SVM)
Each model was trained on the dataset and evaluated using cross-validation. Hyperparameters were tuned with grid search to improve performance.
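
The snippet below is a minimal sketch of this cross-validated grid search, continuing from the loading sketch above (it assumes `X` and `y` are already defined). The parameter grids shown here are illustrative assumptions; the grids actually used in `main.py` may differ.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Illustrative parameter grids -- the ranges used in main.py may differ.
models = {
    "Logistic Regression": (
        make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
        {"logisticregression__C": [0.01, 0.1, 1, 10]},
    ),
    "Random Forest": (
        RandomForestClassifier(random_state=42),
        {"n_estimators": [100, 300], "max_depth": [None, 5, 10]},
    ),
    "SVM": (
        make_pipeline(StandardScaler(), SVC(probability=True)),
        {"svc__C": [0.1, 1, 10], "svc__kernel": ["linear", "rbf"]},
    ),
}

# Hold out a test set, then run 5-fold cross-validated grid search per model.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

for name, (estimator, param_grid) in models.items():
    search = GridSearchCV(estimator, param_grid, cv=5, scoring="accuracy")
    search.fit(X_train, y_train)
    print(name, search.best_params_, f"CV accuracy: {search.best_score_:.3f}")
```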
The models are evaluated using the following performance metrics:
- Accuracy: The overall correctness of the model.
- Precision: The proportion of true positive results among the positive predictions.
- Recall: The proportion of true positive results out of all actual positives.
- F1-score: The harmonic mean of precision and recall.
- AUC-ROC: The area under the ROC curve, which measures the trade-off between true positive rate and false positive rate.
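
As an illustration of how these metrics can be computed with `sklearn.metrics`, the sketch below evaluates a single fitted model (a logistic regression pipeline) on the held-out split created in the grid-search snippet above. The reporting code in `main.py` may be structured differently.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score, roc_auc_score,
)
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Fit one model and evaluate it on the held-out test split from above.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
# Probability of the positive class (the larger label), needed for AUC-ROC.
y_prob = model.predict_proba(X_test)[:, 1]
pos = model.classes_[1]

print("Accuracy :", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred, pos_label=pos))
print("Recall   :", recall_score(y_test, y_pred, pos_label=pos))
print("F1-score :", f1_score(y_test, y_pred, pos_label=pos))
print("AUC-ROC  :", roc_auc_score(y_test, y_prob))
```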
The following results were obtained for each model (with default parameters):
| Model               | Accuracy | Precision | Recall | F1-score | AUC-ROC |
|---------------------|----------|-----------|--------|----------|---------|
| Logistic Regression | 87.5%    | 0.88      | 0.88   | 0.88     | 0.88    |
| Random Forest       | 83.33%   | 0.84      | 0.83   | 0.83     | 0.83    |
| SVM                 | 79.17%   | 0.81      | 0.79   | 0.79     | 0.79    |
- Fine-tuning of models: Further parameter tuning can be done to enhance model performance.
- Exploring more algorithms: Models such as deep neural networks can be explored.
- Feature Engineering: Additional feature selection techniques can be implemented to improve prediction accuracy (one possible approach is sketched below).
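
As a sketch of one such feature-selection option (not part of the current implementation), the snippet below keeps the k features with the highest mutual information with the target before fitting a classifier. The choice of `k=5` is an arbitrary assumption, and it reuses `X` and `y` from the loading sketch above.

```python
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# One possible feature-selection step: keep the k features with the highest
# mutual information with the target, then fit the classifier on those only.
pipeline = make_pipeline(
    StandardScaler(),
    SelectKBest(score_func=mutual_info_classif, k=5),
    LogisticRegression(max_iter=1000),
)

# 5-fold cross-validated accuracy with the reduced feature set.
scores = cross_val_score(pipeline, X, y, cv=5, scoring="accuracy")
print("Mean CV accuracy with top-5 features:", scores.mean().round(3))
```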