GitHub - zeyzul/HL_Prediction: Half-life prediction in mammals utilizing BERT embeddings.

Half-life predictions in mammals utilizing BERT embeddings

The goal of this research was to test whether BERT embeddings can improve the performance of Random Forest and XGBoost algorithms in predicting half-lives in mammals.

Packages

The packages used in this project are listed in requirements.txt. They can be installed with the following command (the leading ! is notebook syntax; omit it in a regular shell):

   !pip install -r requirements.txt

Data ¹

The datasets were taken from this research and can be found in this repository as all_HLs_human_featTable.txt and all_HLs_mouse_featTable.txt.
ORF sequences and half-life values were extracted and saved to Human_HL_ORF.csv and Mouse_HL_ORF.csv by running creating_tables.py.
For the BERT model, each nucleotide in a sequence is separated by whitespace, so that nucleotides play the role of words and sequences play the role of sentences.
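
The whitespace formatting described above can be sketched as follows. This is a minimal illustration, not the code from this repository; the function name to_bert_sentence and the upper-casing step are assumptions.

```python
def to_bert_sentence(sequence: str) -> str:
    """Return the sequence with one space between consecutive nucleotides.

    Hypothetical helper: each nucleotide becomes a separate "word", so a
    whitespace-based tokenizer sees the sequence as a sentence.
    """
    # Joining a string with " " puts a space between each pair of characters.
    return " ".join(sequence.strip().upper())


spaced = to_bert_sentence("ATGGCC")
print(spaced)  # A T G G C C
```
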
One-hot encoding of sequences is done in one_hot_conversion.py by assigning a vector to each nucleotide as follows:
- A : [1, 0, 0, 0]
- G : [0, 1, 0, 0]
- C : [0, 0, 1, 0]
- T : [0, 0, 0, 1]
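
The mapping above can be sketched with NumPy as shown below. This is an illustration under assumptions, not the contents of one_hot_conversion.py: the names ONE_HOT and one_hot_encode are hypothetical, and how sequences of different lengths are padded or truncated is not specified here, so the sketch encodes a single sequence only.

```python
import numpy as np

# Vector assigned to each nucleotide, in the order given in the list above.
ONE_HOT = {
    "A": [1, 0, 0, 0],
    "G": [0, 1, 0, 0],
    "C": [0, 0, 1, 0],
    "T": [0, 0, 0, 1],
}


def one_hot_encode(sequence: str) -> np.ndarray:
    """Encode one nucleotide sequence as an array of shape (len(sequence), 4).

    Hypothetical helper: any character outside A, G, C, T raises KeyError.
    """
    return np.array([ONE_HOT[n] for n in sequence.upper()], dtype=np.float32)


encoded = one_hot_encode("AGCT")
print(encoded.shape)  # (4, 4)
```

An array built this way can be written to disk with np.save and read back with np.load.
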

The encoded sequences are saved as Human_X.npy and the corresponding half-life values are saved as Human_Y.npy. To create encodings for the mouse data, the file names in one_hot_conversion.py must be changed manually.

Embeddings are created from the sequences by the BERT model in embedding.py. The embeddings are saved as Human_EMBEDDINGSX.npy and the corresponding half-life values are saved as Human_EMBEDDINGSY.npy.

Algorithms

Random Forest and XGBoost are used to predict half-lives, with either the one-hot encoded data or the BERT embeddings as input. The data source must be changed manually in random_forest.py and xgboost_model.py.
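
A minimal sketch of the Random Forest step is shown below, assuming scikit-learn is installed. It is not the code from random_forest.py: the function name fit_and_score, the 80/20 split, the hyperparameters, and the synthetic stand-in data are all assumptions made for illustration. In practice X and y would be loaded with np.load from the saved .npy files.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split


def fit_and_score(X, y, seed=0):
    """Fit a random forest and return (R^2, MAE) on a held-out split."""
    # Tree models expect 2-D input, so flatten e.g. (n, length, 4) one-hot arrays.
    X = np.asarray(X).reshape(len(X), -1)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=seed
    )
    model = RandomForestRegressor(n_estimators=50, random_state=seed)
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    return r2_score(y_test, pred), mean_absolute_error(y_test, pred)


# Synthetic stand-in data (assumption): 100 random one-hot-like inputs
# with random targets, used only to show the call.
rng = np.random.default_rng(0)
X_demo = rng.integers(0, 2, size=(100, 30, 4)).astype(float)
y_demo = rng.normal(size=100)
r2, mae = fit_and_score(X_demo, y_demo)
```

xgboost.XGBRegressor follows the same fit/predict interface, so it could be substituted for RandomForestRegressor in this sketch.
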

Results

The performance of the algorithms was evaluated using R^2 and mean absolute error (MAE). Using BERT embeddings instead of one-hot encodings did not significantly improve the predictions.
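
For reference, the two metrics can be computed from their standard definitions as sketched below. The function name r2_and_mae and the example values are illustrative, not taken from this project.

```python
def r2_and_mae(y_true, y_pred):
    """Return (R^2, MAE) for two equal-length sequences of numbers."""
    n = len(y_true)
    mean_true = sum(y_true) / n
    # Residual sum of squares: squared error of the predictions.
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total sum of squares: squared deviation of the targets from their mean.
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1.0 - ss_res / ss_tot
    # Mean absolute error: average absolute difference.
    mae = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n
    return r2, mae


r2_value, mae_value = r2_and_mae([1.0, 2.0, 3.0], [1.0, 2.0, 4.0])
print(r2_value)  # 0.5
```

An R^2 of 1 means perfect predictions, while 0 means the model does no better than always predicting the mean of the targets.
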

¹ Due to GitHub's file size limit, the .csv files in this repository are random samples of the original data, so the encodings and embeddings are also built from these samples. The project was run on both the original data and the samples, with similar results. The original data can be found here.

