MyLLM/README_draftLLM.md at main · Bobpick/MyLLM

MyLLM: A Customizable Language Model Trainer

Overview

MyLLM is a Python-based tool for training transformer-based language models on custom text data. It supports various transformer architectures (GPT-2, GPT-Neo, T5) and allows for experimentation with different hyperparameters. The model is trained using a multi-threaded approach for efficient data loading and preprocessing.

Features

Data Extraction: Supports .txt, .pdf, and .parquet files.
Parquet Storage: for smaller and faster accessibility to trained data.
Data Cleaning: Normalizes text, removes extraneous characters, and filters out overly short or long sentences.
Tokenization: Utilizes the Hugging Face AutoTokenizer for subword tokenization (BPE).
Transformer Models: Choose from GPT-2, GPT-Neo, or T5 architectures.
Hyperparameter Tuning: Easily adjust embedding size, attention heads, layers, etc.
Mixed Precision: Leverages torch.bfloat16 on supported GPUs for faster training.
Performance Evaluation: Tracks validation loss and perplexity.
Text Generation: Generates sample text periodically during training to assess progress.
PyTorch Profiler: Integrated for performance optimization.

Installation

Clone the Repository:
```
git clone <repository_url>
cd MyLLM
```
Install Dependencies:
```
pip install -r requirements.txt
```

Usage

Prepare Your Data:
- Place your .txt, .pdf, or .parquet data files in the same directory as the script or provide the full file path.
Train the Model:
```
python train_language_model.py <data_file> --model_type <gpt2|gpt_neo|t5>
```
- Replace <data_file> with the path to your training data.
- Optionally specify --model_type to choose the desired transformer architecture (default is GPT-2).
Output:
- The model will be trained for the specified number of iterations.
- Validation loss and perplexity will be printed periodically.
- Sample generated text will be shown every 100 iterations.
- The trained model will be saved in the "saved_models" directory.
- TensorBoard logs will be written to the "./logdir" directory, use tensorboard --logdir=./logdir to visualize the results in a web browser.

Customization

Modify the hyperparameters at the beginning of the train_language_model.py file to experiment with different model configurations.
Adjust data cleaning and preprocessing steps in the extract_text and clean_text functions to suit your specific data format.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

MyLLM: A Customizable Language Model Trainer

Overview

Features

Installation

Usage

Customization

FilesExpand file tree

README_draftLLM.md

Latest commit

History

README_draftLLM.md

File metadata and controls

MyLLM: A Customizable Language Model Trainer

Overview

Features

Installation

Usage

Customization