Classifies protein sequences as soluble or membrane-bound.
Designed to showcase the Domino flow using a NetApp Volume Snapshot for data:
Workspace → NetApp Volume Snapshot → Training Job → App (and later Endpoint).
- Two predictors
- Rule-based (no training): thresholds on hydrophobicity.
- ML (logistic regression): trained on a tiny toy CSV.
- NetApp Volume–based I/O
Reads training data directly from/mnt/netapp-volumes/snapshots/ppp-volume/2/train.csv
and saves model artifacts to/mnt/artifacts. - Simple UI: Streamlit app publishable as a Domino App.
property-property-model/ ├─ app.sh # Domino App entry (MUST be at repo root) ├─ README.md └─ protein-property-predictor/ ├─ app.py # Streamlit UI ├─ model.py # Prediction (rule + ML loader) ├─ train.py # Training (logistic regression) ├─ data/ │ └─ train.csv # just for backup ├─ env/ │ └─ requirements.txt # numpy, pandas, scikit-learn, joblib, streamlit, requests └─ models/ └─ latest/ # Trained model is written here (or in Dataset)
📁 Training data is located in the mounted NetApp volume:
/mnt/netapp-volumes/snapshots/ppp-volume/2/train.csv
- Proteins are chains of 20 amino acids (letters A, C, D, …, Y).
- Water is hydrophilic; cell membranes are hydrophobic (oily).
- Hydrophobic letters A I L M F W V Y tend to favor membranes.
- Many membrane/secreted proteins begin with a hydrophobic N-terminus (~first 20 aa).
- We compute features:
- Overall hydrophobic fraction, 2) N-terminal hydrophobic fraction, 3) Length
→ Predict membrane-bound vs soluble.
- Overall hydrophobic fraction, 2) N-terminal hydrophobic fraction, 3) Length
Why care? Membrane proteins (receptors, channels, pumps) are major drug targets and behave differently in experiments than soluble proteins.
- Domino project connected to this GitHub repo.
- A Domino Compute Environment with Python 3 &
pip. - NetApp Volume Snapshot mounted at:
/mnt/netapp-volumes/snapshots/ppp-volume/2/train.csv - Python packages (already in
env/requirements.txt):
- Project → Code → add this Git repo (prefer “Use latest from external repo”).
- NetApp Volume Snapshot mounted at:
/mnt/netapp-volumes/snapshots/ppp-volume/2/train.csv
Open a terminal at the repo root:
# Install packages to user site
pip3 install --user -r protein-property-predictor/env/requirements.txt
export PATH="$HOME/.local/bin:$PATH"
# Protein Property Predictor — Quick Ops Runbook
## 3) **NetApp Volume Snapshot** mounted at: `/mnt/netapp-volumes/snapshots/ppp-volume/2/train.csv`
## 4) Train the model (writes `model.joblib`)
```bash
cd protein-property-predictor
python3 train.py
# → prints JSON with model_path, samples, etc.Force ML (uses model saved in /mnt/netapp-volumes/snapshots/ppp-volume/2/train.csv)
python3 model.py --seq "MKKLLLLLLLLLALALALAAAGAGA" --mode mlForce rule (no training needed)
python3 model.py --seq "MSTNPKPQRKTKRNTNRRPQDVK" --mode ruleAuto: try ML, fallback to rule
python3 model.py --seq ">p\nMAALALLLGVVVVALAAA" --mode autoapp.sh(repo root, Domino App entry script)protein-property-predictor/app.py(Streamlit UI)
- Deploy → Apps → New App
- Name:
ppp-app - Entry script:
app.sh(must be at repo root) - Compute environment: any Python env with pip
- Datasets: mount the same Dataset snapshot used for training
- Environment variables:
DATASET_DIR=/domino/datasets/local/protein-property-predictor # adjust to your path - Publish / Launch → open the App URL.
- Loads CSV from NetApp Volume Snapshot mounted at:
/mnt/netapp-volumes/snapshots/ppp-volume/2/train.csv - Featurizes sequences (overall & N-terminal hydrophobicity, length).
- Trains a
LogisticRegression; saves to${DATASET_DIR or ./}/models/latest/model.joblib.
- Rule mode: simple thresholds on hydrophobicity features.
- ML mode: loads
model.jobliband returns probability + label. - Auto mode: try ML; if no model available, fallback to rule.
- Text area for sequence / tiny FASTA.
- Mode selector (
auto|ml|rule). - Calls
predict()and renders JSON.
- Ensure
app.shexists at repo root (top level in Project → Code). - In Apps → Edit, set Entry script exactly to
app.sh.
- Check App logs in Domino.
- Ensure
streamlitis installed & on PATH (handled inapp.sh). - Workspace health check:
export DOMINO_APP_PORT=8501 streamlit run protein-property-predictor/app.py --server.port $DOMINO_APP_PORT --server.address 0.0.0.0 & sleep 3 curl -s http://127.0.0.1:$DOMINO_APP_PORT/_stcore/health
- Run
python3 train.pyfirst. - Confirm printed
model_pathexists under${DATASET_DIR}/models/latest/(or./models/latest/).
- Configure identity:
git config --global user.name "YOUR_GH_USERNAME" git config --global user.email "[email protected]"
- Use a GitHub PAT; complete SAML SSO if prompted for org repos.
- Domino Endpoint that wraps
predict()with request/response logging.(completed✅) - Jobs to schedule
train.pyand snapshot the Dataset.(completed✅) - MLflow logging of params/metrics/artifacts.(completed✅)
- Tests for parsing/featurization & golden predictions.
- Expand dataset; add metrics (confusion matrix, ROC/PR).