This repository shows a minimal yet well-structured example of using BERT for two tasks at once:
- Domain classification (history / geography / health / technology)
- Sentiment analysis (negative / neutral / positive)
It also exposes a tiny sentence-embedding utility so you can obtain the CLS-token embeddings for any sentence.
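The CLS-pooling step behind that utility can be sketched as follows. This is a hedged illustration, not the repo's actual `embedder.py`: it assumes the encoder returns a `last_hidden_state` tensor of shape `(batch, seq_len, hidden)` as HuggingFace BERT models do, and uses a random tensor in place of a real model output so the snippet runs without downloading weights.

```python
import torch

def cls_embedding(last_hidden_state: torch.Tensor) -> torch.Tensor:
    """Return the [CLS] (first-token) vector for every sentence in the batch.

    last_hidden_state: (batch, seq_len, hidden) tensor, as returned by a
    HuggingFace BERT forward pass.
    """
    return last_hidden_state[:, 0, :]

# Stand-in for a real encoder output: 2 sentences, 8 tokens, hidden size 4.
hidden = torch.randn(2, 8, 4)
print(cls_embedding(hidden).shape)  # torch.Size([2, 4])
```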
```
.
├── app                      # Python package with all source code
│   ├── __init__.py
│   ├── data
│   │   ├── __init__.py
│   │   └── dataset.py       # Helpers for loading csv → Torch datasets
│   └── models
│       ├── __init__.py
│       ├── embedder.py      # Thin wrapper around HuggingFace BERT
│       └── heads.py         # Generic classification head(s)
│
├── data                     # Tiny example datasets used in the demo
│   ├── domain.csv
│   └── sentiment.csv
│
├── embed.py                 # Example: produce embeddings for two sentences
├── train.py                 # Multi-task training script
├── requirements.txt         # Python dependencies
├── Dockerfile               # Optional container recipe
├── write_up
│   └── write_up.docx        # Detailed design write-up & assignment answers
└── README.md                # You are here 🙂
```
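The shared-encoder / two-head layout implied by the tree above can be sketched like this. The class and attribute names (`MultiTaskHeads`, `domain_head`, `sentiment_head`) are hypothetical stand-ins for whatever `app/models/heads.py` actually defines, and a random tensor replaces real BERT `[CLS]` embeddings so the snippet runs offline.

```python
import torch
from torch import nn

class MultiTaskHeads(nn.Module):
    """One linear classification head per task on top of a shared embedding."""

    def __init__(self, hidden_size: int = 768):
        super().__init__()
        self.domain_head = nn.Linear(hidden_size, 4)     # history / geography / health / technology
        self.sentiment_head = nn.Linear(hidden_size, 3)  # negative / neutral / positive

    def forward(self, cls_embedding: torch.Tensor):
        # Both heads read the same shared sentence representation.
        return self.domain_head(cls_embedding), self.sentiment_head(cls_embedding)

heads = MultiTaskHeads()
cls_vec = torch.randn(2, 768)  # stand-in for BERT [CLS] embeddings
domain_logits, sentiment_logits = heads(cls_vec)
print(domain_logits.shape, sentiment_logits.shape)  # torch.Size([2, 4]) torch.Size([2, 3])
```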
- Create and activate a virtual environment (recommended):

  ```bash
  python -m venv .venv
  source .venv/bin/activate
  ```

- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```

- Run the sentence embedder:

  ```bash
  python embed.py
  ```

You should see output in the following format (vector shortened for readability):
```
sentence: There is a stack of papers on the table.
embedding: [0.03, -0.12, ..., 0.45]
sentence: The largest mountain in the world is Mount Everest.
embedding: [-0.22, 0.07, ..., 0.11]
```
- Train the multi-task model (uses the tiny demo datasets under `data/`):

  ```bash
  python train.py --num_epochs 3 --output_dir outputs
  ```
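The core of a multi-task training loop like the one in `train.py` is a shared forward pass whose per-task cross-entropy losses are summed before a single backward step. The sketch below is an assumption about that structure, not the script itself: the modules are tiny `nn.Linear` stand-ins for the BERT encoder and heads, and the batch is synthetic rather than read from `data/domain.csv` and `data/sentiment.csv`.

```python
import torch
from torch import nn

hidden_size, n_domains, n_sentiments = 16, 4, 3
encoder = nn.Linear(hidden_size, hidden_size)        # stand-in for the shared BERT encoder
domain_head = nn.Linear(hidden_size, n_domains)
sentiment_head = nn.Linear(hidden_size, n_sentiments)

params = (list(encoder.parameters())
          + list(domain_head.parameters())
          + list(sentiment_head.parameters()))
optimizer = torch.optim.AdamW(params, lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

x = torch.randn(8, hidden_size)                      # fake batch of sentence features
domain_y = torch.randint(0, n_domains, (8,))
sentiment_y = torch.randint(0, n_sentiments, (8,))

for epoch in range(3):                               # mirrors --num_epochs 3
    features = encoder(x)
    # Sum the two task losses so one optimizer step updates encoder and both heads.
    loss = (loss_fn(domain_head(features), domain_y)
            + loss_fn(sentiment_head(features), sentiment_y))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    print(f"epoch {epoch}: loss {loss.item():.3f}")
```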
Build the image:

```bash
docker build -t bert-mtl .
```

Run an embedding example (default):

```bash
# prints embeddings
docker run --rm bert-mtl
```

Run training instead (override the default command):

```bash
# trains for 3 epochs and stores outputs inside the container
docker run --rm bert-mtl train.py --num_epochs 3

# to persist checkpoints to the host machine:
docker run --rm -v $(pwd)/outputs:/app/outputs bert-mtl train.py --num_epochs 3 --output_dir outputs
```

- The datasets provided are toy examples; they are only meant to demonstrate that the code runs. Replace them with real data for meaningful results.
- The HuggingFace model weights are downloaded on first run and cached under `~/.cache/huggingface/`.
- The training script now stores the fine-tuned BERT encoder and the task-specific heads in one checkpoint file for easy reuse.
- A detailed write-up of design choices and assignment Q&A lives at `write_up/write_up.docx`.
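A single-file checkpoint of the kind described above is commonly a dict of `state_dict`s saved with `torch.save`. The sketch below shows that pattern with hypothetical keys (`"encoder"`, `"domain_head"`, `"sentiment_head"`) and tiny `nn.Linear` stand-ins; the keys and module classes used by `train.py` may differ.

```python
import torch
from torch import nn

# Stand-ins for the fine-tuned encoder and the two task heads.
encoder = nn.Linear(16, 16)
domain_head = nn.Linear(16, 4)
sentiment_head = nn.Linear(16, 3)

# Save everything in one file: a dict mapping component names to state_dicts.
torch.save({
    "encoder": encoder.state_dict(),
    "domain_head": domain_head.state_dict(),
    "sentiment_head": sentiment_head.state_dict(),
}, "checkpoint.pt")

# Reuse later: rebuild each module with matching shapes and load its piece.
ckpt = torch.load("checkpoint.pt")
restored = nn.Linear(16, 4)
restored.load_state_dict(ckpt["domain_head"])
print(torch.equal(restored.weight, domain_head.weight))  # True
```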