🧬 Protein Property Predictor (Toy Model) — Domino Demo by Luis Chan

Classifies protein sequences as soluble or membrane-bound.
Designed to showcase the Domino flow using a NetApp Volume Snapshot for data:
Workspace → NetApp Volume Snapshot → Training Job → App (and later Endpoint).

🚀 What’s inside

- Two predictors
  - Rule-based (no training): thresholds on hydrophobicity.
  - ML (logistic regression): trained on a tiny toy CSV.
- NetApp Volume-based I/O: reads training data directly from /mnt/netapp-volumes/snapshots/ppp-volume/2/train.csv and saves model artifacts to /mnt/artifacts.
- Simple UI: Streamlit app publishable as a Domino App.

🗂️ Repo structure

property-property-model/
├─ app.sh                      # Domino App entry (MUST be at repo root)
├─ README.md
└─ protein-property-predictor/
   ├─ app.py                   # Streamlit UI
   ├─ model.py                 # Prediction (rule + ML loader)
   ├─ train.py                 # Training (logistic regression)
   ├─ data/
   │  └─ train.csv             # just for backup
   ├─ env/
   │  └─ requirements.txt      # numpy, pandas, scikit-learn, joblib, streamlit, requests
   └─ models/
      └─ latest/               # Trained model is written here (or in Dataset)

📁 Training data is located in the mounted NetApp volume:
/mnt/netapp-volumes/snapshots/ppp-volume/2/train.csv

🧪 Biology (60-second version)

Proteins are chains built from 20 standard amino acids (one-letter codes A, C, D, …, Y).
Water is polar; the interior of a cell membrane is hydrophobic (oily).
Hydrophobic residues (A, I, L, M, F, W, V, Y) tend to favor membranes.
Many membrane/secreted proteins begin with a hydrophobic N-terminus (~first 20 aa).
We compute three features:
1. Overall hydrophobic fraction
2. N-terminal hydrophobic fraction
3. Length

These features are used to predict membrane-bound vs soluble.
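
The three features listed above can be sketched in a few lines of Python. This is a minimal illustration, not the repo's actual featurization code: the names `hydrophobic_fraction` and `featurize`, the feature keys, and the exact 20-residue N-terminal window are assumptions based on the description in this section.

```python
# Hydrophobic letters as listed in this section.
HYDROPHOBIC = frozenset("AILMFWVY")
# "~first 20 aa" from the text; the exact window size is an assumption.
N_TERM_WINDOW = 20


def hydrophobic_fraction(seq: str) -> float:
    """Fraction of residues in `seq` that belong to the hydrophobic set."""
    if not seq:
        return 0.0
    return sum(residue in HYDROPHOBIC for residue in seq) / len(seq)


def featurize(seq: str) -> dict:
    """Hypothetical featurizer: overall fraction, N-terminal fraction, length."""
    seq = seq.strip().upper()
    return {
        "hydrophobic_frac": hydrophobic_fraction(seq),
        "nterm_hydrophobic_frac": hydrophobic_fraction(seq[:N_TERM_WINDOW]),
        "length": len(seq),
    }


if __name__ == "__main__":
    # A fully hydrophobic N-terminus followed by a charged tail.
    print(featurize("A" * 20 + "K" * 20))
```

A sequence with a hydrophobic start and a charged tail scores high on the N-terminal feature even when its overall hydrophobic fraction is moderate, which is why the two fractions are kept as separate features.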

Why care? Membrane proteins (receptors, channels, pumps) are major drug targets and behave differently in experiments than soluble proteins.

📦 Requirements

Domino project connected to this GitHub repo.
A Domino Compute Environment with Python 3 & pip.
NetApp Volume Snapshot mounted at: /mnt/netapp-volumes/snapshots/ppp-volume/2/train.csv
Python packages (already in protein-property-predictor/env/requirements.txt): numpy, pandas, scikit-learn, joblib, streamlit, requests.

🧑‍💻 Quickstart on Domino

1) Connect code & mount a Dataset

Project → Code → add this Git repo (prefer “Use latest from external repo”).
Mount the NetApp Volume Snapshot so that the training data is available at: /mnt/netapp-volumes/snapshots/ppp-volume/2/train.csv

2) Workspace: set env & install deps

Open a terminal at the repo root:

# Install packages to user site
pip3 install --user -r protein-property-predictor/env/requirements.txt
export PATH="$HOME/.local/bin:$PATH"



3) Confirm the NetApp Volume Snapshot is mounted and the training data is readable at `/mnt/netapp-volumes/snapshots/ppp-volume/2/train.csv`

4) Train the model (writes `model.joblib`)

```bash
cd protein-property-predictor
python3 train.py
# → prints JSON with model_path, samples, etc.
```

5) Predict from CLI

Force ML (uses the model.joblib written by train.py, so run step 4 first)

python3 model.py --seq "MKKLLLLLLLLLALALALAAAGAGA" --mode ml

Force rule (no training needed)

python3 model.py --seq "MSTNPKPQRKTKRNTNRRPQDVK" --mode rule

Auto: try ML, fall back to rule

python3 model.py --seq ">p\nMAALALLLGVVVVALAAA" --mode auto

🖥️ Publish the Streamlit App (Domino Apps)

Files used

app.sh (repo root, Domino App entry script)
protein-property-predictor/app.py (Streamlit UI)

Steps

Deploy → Apps → New App
Name: ppp-app
Entry script: app.sh (must be at repo root)
Compute environment: any Python env with pip
Datasets: mount the same Dataset snapshot used for training

Environment variables:

DATASET_DIR=/domino/datasets/local/protein-property-predictor   # adjust to your path

Publish / Launch → open the App URL.

🧠 How it works

`train.py`

Loads CSV from NetApp Volume Snapshot mounted at: /mnt/netapp-volumes/snapshots/ppp-volume/2/train.csv
Featurizes sequences (overall & N-terminal hydrophobicity, length).
Trains a LogisticRegression; saves it to $DATASET_DIR/models/latest/model.joblib, or to ./models/latest/model.joblib when DATASET_DIR is not set.
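
The save location described above could be resolved roughly as follows. This is a sketch under assumptions, not the code in `train.py`: the helper name `resolve_model_path` and the optional `env` argument (used here only to make the logic easy to check without touching the real environment) are hypothetical.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def resolve_model_path(env: Optional[Mapping[str, str]] = None) -> Path:
    """Hypothetical helper: model path under DATASET_DIR, else the current directory."""
    env = os.environ if env is None else env
    # An unset or empty DATASET_DIR falls back to "./".
    base = Path(env.get("DATASET_DIR") or ".")
    return base / "models" / "latest" / "model.joblib"


if __name__ == "__main__":
    # Prints the path that would be used in the current environment.
    print(resolve_model_path())
```

Keeping the fallback in one place means training and prediction can agree on where `model.joblib` lives, whether or not `DATASET_DIR` is set for the run.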

`model.py`

Rule mode: simple thresholds on hydrophobicity features.
ML mode: loads model.joblib and returns probability + label.
Auto mode: try ML; if no model is available, fall back to rule.
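
The three modes can be pictured as a small dispatch function. The sketch below is illustrative only: the thresholds (`overall_cut`, `nterm_cut`), the returned dictionary keys, and the `ml_predict` callable that stands in for a loaded `model.joblib` are all assumptions, not the values or signatures used in `model.py`.

```python
from typing import Callable, Optional

HYDROPHOBIC = frozenset("AILMFWVY")
MODES = ("auto", "ml", "rule")


def _fraction(seq: str) -> float:
    """Fraction of hydrophobic residues; 0.0 for an empty sequence."""
    return sum(r in HYDROPHOBIC for r in seq) / len(seq) if seq else 0.0


def rule_predict(seq: str, overall_cut: float = 0.5, nterm_cut: float = 0.6) -> dict:
    """Threshold rule; both cut-offs are placeholder values, not the repo's."""
    seq = seq.strip().upper()
    is_membrane = _fraction(seq) >= overall_cut or _fraction(seq[:20]) >= nterm_cut
    return {"label": "membrane-bound" if is_membrane else "soluble", "mode": "rule"}


def predict(seq: str, mode: str = "auto",
            ml_predict: Optional[Callable[[str], dict]] = None) -> dict:
    """Dispatch on mode; `ml_predict` is None when no trained model is available."""
    if mode not in MODES:
        raise ValueError(f"mode must be one of {MODES}, got {mode!r}")
    if mode == "rule":
        return rule_predict(seq)
    if ml_predict is None:
        if mode == "ml":
            raise FileNotFoundError("model not found; run train.py first")
        return rule_predict(seq)  # auto: no model available, fall back to rule
    return ml_predict(seq)


if __name__ == "__main__":
    print(predict("LLLLLLLLLL"))               # auto mode, no model supplied
    print(predict("KKKKDDDDEE", mode="rule"))  # forced rule mode
```

Raising in forced ML mode while silently falling back in auto mode keeps the "model not found" case visible to anyone who explicitly asked for the trained model.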

`app.py` (Streamlit)

Text area for sequence / tiny FASTA.
Mode selector (auto | ml | rule).
Calls predict() and renders JSON.
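
Because the input may be either a bare sequence or a tiny FASTA record, some normalization has to happen before featurization. A minimal sketch of that step follows; the function name `parse_sequence` is hypothetical, and treating a literal backslash-n (as it arrives from a double-quoted shell argument) as a line break is an assumption about how the input is handled, not something confirmed by the repo.

```python
def parse_sequence(text: str) -> str:
    """Return the residue string from a raw sequence or a single-record FASTA."""
    # Assumption: a literal backslash followed by "n" is meant as a newline.
    text = text.replace("\\n", "\n")
    lines = [line.strip() for line in text.splitlines()]
    # Drop empty lines and FASTA header lines (those starting with ">").
    residues = [line for line in lines if line and not line.startswith(">")]
    return "".join(residues).replace(" ", "").upper()


if __name__ == "__main__":
    print(parse_sequence(">p\nMAALALLLGVVVVALAAA"))
```

This sketch concatenates everything after the header, so it only suits single-record input; a multi-record FASTA would need to be split per header first.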

🧭 Troubleshooting

App error: “entry script './app.sh' not found”

Ensure app.sh exists at repo root (top level in Project → Code).
In Apps → Edit, set Entry script exactly to app.sh.

Blank Streamlit page

Check App logs in Domino.
Ensure streamlit is installed & on PATH (handled in app.sh).

Workspace health check:

export DOMINO_APP_PORT=8501
streamlit run protein-property-predictor/app.py --server.port $DOMINO_APP_PORT --server.address 0.0.0.0 &
sleep 3
curl -s http://127.0.0.1:$DOMINO_APP_PORT/_stcore/health

ML mode: “model not found”

Run python3 train.py first.
Confirm printed model_path exists under ${DATASET_DIR}/models/latest/ (or ./models/latest/).

Git push from Workspace fails

Configure identity:

git config --global user.name "YOUR_GH_USERNAME"
git config --global user.email "YOUR_GH_EMAIL"

Use a GitHub PAT; complete SAML SSO if prompted for org repos.

🗺️ Roadmap (next steps)

Domino Endpoint that wraps predict() with request/response logging. (completed ✅)
Jobs to schedule train.py and snapshot the Dataset. (completed ✅)
MLflow logging of params/metrics/artifacts. (completed ✅)
Tests for parsing/featurization & golden predictions.
Expand dataset; add metrics (confusion matrix, ROC/PR).
