SPQR is a Python library designed for large-scale clustering of molecular data using Streaming Product Quantization (PQ). It allows you to process billions of molecules in a streaming fashion, transforming SMILES strings into compact PQ-codes for efficient clustering.
## Table of Contents

- Overview
- Features
- Installation
- Usage
- Project Structure
- Documentation
- Testing
- Contributing
- License
- Acknowledgments
## Overview

SPQR leverages the concept of Streaming Product Quantization to cluster high-dimensional molecular data without requiring the entire dataset to be in memory. Starting from SMILES strings, SPQR calculates molecular fingerprints, applies PQ encoding, and ultimately clusters the data efficiently.
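To make the PQ idea concrete, here is a minimal, self-contained sketch of product quantization itself (illustrative only; the names and parameters below are not SPQR's API or defaults): each vector is split into M subvectors, each subvector is assigned to the nearest centroid of its own k-means codebook, and only the M centroid indices are stored.

```python
import numpy as np
from sklearn.cluster import KMeans

# Illustrative parameters, not SPQR's defaults:
D = 2048    # fingerprint dimensionality
M = 8       # number of subvectors
K = 256     # centroids per sub-quantizer, so each code fits in one byte
d = D // M  # dimensionality of each subvector

rng = np.random.default_rng(0)
train = rng.random((10_000, D)).astype(np.float32)  # stand-in for real fingerprints

# Train one independent k-means codebook per subvector block.
codebooks = [
    KMeans(n_clusters=K, n_init=1, random_state=0).fit(train[:, m * d:(m + 1) * d])
    for m in range(M)
]

def pq_encode(x: np.ndarray) -> np.ndarray:
    """Compress a D-dimensional vector into M one-byte centroid indices."""
    return np.array(
        [codebooks[m].predict(x[None, m * d:(m + 1) * d])[0] for m in range(M)],
        dtype=np.uint8,
    )

code = pq_encode(train[0])  # 2048 floats -> 8 bytes
print(code)
```

With M = 8 and K = 256, a 2048-dimensional vector shrinks to 8 bytes, which is what makes billion-scale clustering tractable.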
## Features

- Scalability: Process data in chunks for datasets that don't fit in memory (see the sketch after this list).
- Efficiency: Drastically reduce memory usage by converting high-dimensional fingerprints to compact PQ-codes.
- Modular Design: Separate modules for encoding, clustering, and data streaming.
- Ease of Use: Simple API with well-documented functions and usage examples.
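As a rough illustration of the chunked processing mentioned above, a generator along these lines keeps memory bounded regardless of dataset size (a sketch of the idea only, not the actual `spiq` streamer API):

```python
from typing import Iterator

def stream_smiles(path: str, chunk_size: int = 100_000) -> Iterator[list]:
    """Yield SMILES strings from a text file in fixed-size chunks,
    so the full dataset never has to be loaded into memory."""
    chunk = []
    with open(path) as fh:
        for line in fh:
            chunk.append(line.strip())
            if len(chunk) == chunk_size:
                yield chunk
                chunk = []
    if chunk:  # flush the final, partially filled chunk
        yield chunk

# Each chunk can be fingerprinted, PQ-encoded, and then discarded:
# for smiles_chunk in stream_smiles("data/data_lite.txt"):
#     ...
```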
## Installation

- Clone the repository:

  ```bash
  git clone https://github.com/afloresep/spqr.git
  cd spqr
  ```
- Create a virtual environment and install dependencies:

  ```bash
  python -m venv venv
  source venv/bin/activate  # On Windows use `venv\Scripts\activate`
  pip install -r requirements.txt
  ```
- Optional: Install the package in editable mode:

  ```bash
  pip install -e .
  ```
## Usage

The entire pipeline from SMILES strings to clustering can be run via the main script:

```bash
python scripts/main.py
```
This script integrates all modules (from data streaming to fingerprint calculation and PQ encoding) for clustering molecular data.
For an interactive tutorial, check out `examples/tutorial.ipynb`.
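As background for the fingerprint step, SMILES strings are typically converted to fingerprints with RDKit; the snippet below is a generic sketch of that step (the settings SPQR actually uses live in `spiq/utils/fingerprints.py` and may differ):

```python
from typing import Optional

import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem

def smiles_to_fp(smiles: str, n_bits: int = 2048) -> Optional[np.ndarray]:
    """Turn a SMILES string into a binary Morgan fingerprint.
    Returns None if the SMILES cannot be parsed."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=n_bits)  # radius 2
    return np.array(fp, dtype=np.uint8)

print(smiles_to_fp("CCO")[:16])  # ethanol -> first 16 bits of the fingerprint
```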
## Project Structure

```
.
├── data
│   ├── data_lite.txt            # Data for the tutorial example
├── docs                         # Documentation (Sphinx configuration, guides, etc.)
├── examples
│   ├── tutorial.ipynb           # Notebook demonstrating the API usage
├── pyproject.toml
├── README.md
├── requirements.txt
├── scripts
│   └── main.py                  # Main pipeline: SMILES -> cluster
├── setup.py
├── spiq
│   ├── clustering               # Clustering modules and implementations
│   ├── encoder
│   │   ├── encoder_base.py
│   │   ├── encoder.py
│   │   └── __init__.py
│   ├── __init__.py
│   ├── streamer
│   │   ├── data_streamer.py
│   │   └── __init__.py
│   └── utils
│       ├── fingerprints.py
│       ├── helper_functions.py
│       └── __init__.py
└── tests                        # Unit tests for all modules
    ├── __init__.py
    ├── test_clustering.py
    ├── test_data_streamer.py
    ├── test_encoder.py
    ├── test_fingerprints.py
    ├── test_trainer.py
    └── test_utils.py
```
## Documentation

- API Reference: Documentation is auto-generated from the code's docstrings using Sphinx. See the `docs` folder for more details.
- Tutorials & Guides: Refer to the Jupyter notebook in `examples/tutorial.ipynb` for a hands-on introduction.
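To build the HTML docs locally, something along these lines should work, assuming a standard Sphinx layout in `docs/` (the exact build setup is not confirmed here; check `docs/` for a Makefile or `conf.py`):

```bash
pip install sphinx
sphinx-build -b html docs docs/_build/html
```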
## Testing

During development I've written tests to ensure the key functionality keeps working as the code changes. To run all the tests, use:

```bash
pytest tests/
```
If instead you want to run a single group of tests (for the `data_streamer` module, for instance), you can do:

```bash
pytest tests/test_data_streamer.py
```
## Contributing

Contributions are welcome! Please follow these guidelines:
- Fork the repository and create your branch:

  ```bash
  git checkout -b feature/my-feature
  ```
- Ensure your code has proper docstrings.
- (Ideally) Write tests for new features.
- Open a pull request describing your changes.
## License

This project is licensed under the MIT License. See the `LICENSE` file for details.