A production-grade research implementation investigating privacy-preserving mechanisms for semantic vector retrieval systems. The project implements and evaluates three Local Differential Privacy (LDP) mechanisms to protect document collections from Membership Inference Attacks while maintaining retrieval quality.
- System Overview
- Installation
- Repository Architecture
- Data Specifications
- Privacy Mechanisms
- Evaluation Framework
- Output Specifications
- Experiment Execution
- Performance Benchmarks
- Development Guidelines
The system implements three distinct privacy mechanisms for vector retrieval:
- Document-side Local Differential Privacy (Doc-LDP): Injects calibrated Gaussian noise into document embeddings before index construction
- Query-side Local Differential Privacy (Query-LDP): Applies noise to query vectors during inference
- Score-side Local Differential Privacy (Score-LDP): Perturbs similarity scores before top-k selection
Key features:
- Automated privacy budget calibration based on attack success thresholds
- Comprehensive privacy-utility tradeoff analysis
- Reproducible experimental pipeline with fixed random seeds
- Membership Inference Attack auditing with feature-based classifiers
System requirements:
- Python 3.11 or higher
- RAM: Minimum 16GB (32GB recommended for full experiments)
- Storage: 10GB free space for data and results
- Optional: CUDA 11.7+ compatible GPU for accelerated embedding generation
The project uses Conda for environment management with pinned dependencies to ensure reproducibility.
# Repository setup
git clone https://github.com/yourusername/mute-vector.git
cd mute-vector
# Environment creation
conda env create -f environment.yml
conda activate mute-vectors
# Development dependencies (optional)
pip install -r requirements-dev.txt
# Package installation
pip install -e .
# Data initialization
python scripts/download_data.py
python scripts/generate_queries.py
The source directory contains the core implementation organized into specialized modules:
Data Module (/src/data/): Handles data ingestion and preprocessing
- Loads 20 Newsgroups dataset with configurable category selection
- Implements stratified train/validation/test splitting (70/15/15)
- Performs text normalization and filtering
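The stratified 70/15/15 split described above can be sketched with NumPy alone. This is a minimal illustration, not the project's `splitter.py`: the function name `stratified_split` and the synthetic labels are assumptions, and the seed simply reuses the global seed 2025 mentioned elsewhere in this README.

```python
import numpy as np


def stratified_split(labels, fractions=(0.70, 0.15, 0.15), seed=2025):
    """Return (train, val, test) index arrays, splitting each class separately.

    Hypothetical helper: shuffles the indices of every class independently and
    cuts them by the requested fractions, so each split keeps roughly the
    original category distribution.
    """
    labels = np.asarray(labels)
    rng = np.random.default_rng(seed)
    parts = ([], [], [])
    for cls in np.unique(labels):
        idx = rng.permutation(np.flatnonzero(labels == cls))
        n_train = int(round(fractions[0] * len(idx)))
        n_val = int(round(fractions[1] * len(idx)))
        parts[0].append(idx[:n_train])
        parts[1].append(idx[n_train:n_train + n_val])
        parts[2].append(idx[n_train + n_val:])  # remainder goes to test
    return tuple(np.sort(np.concatenate(p)) for p in parts)


# Synthetic stand-in for category labels: 8 classes, 100 documents each.
labels = np.repeat(np.arange(8), 100)
train_idx, val_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(val_idx), len(test_idx))
```

Splitting per class, rather than shuffling the whole corpus once, is what keeps small categories from being under-represented in the validation and test sets.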
Embeddings Module (/src/embeddings/): Manages embedding generation
- Supports multiple sentence-transformer models
- Implements caching for computational efficiency
- Produces L2-normalized vectors for retrieval
Privacy Module (/src/privacy/): Core privacy mechanism implementations
- Calibrates noise parameters from privacy budgets (epsilon values)
- Implements Gaussian mechanism with sensitivity analysis
- Provides automated budget tuning for target attack thresholds
Retrieval Module (/src/retrieval/): Search infrastructure
- Builds and manages FAISS indices with inner product similarity
- Generates keyphrase queries using RAKE algorithm
- Implements top-k retrieval with configurable k values
Attacks Module (/src/attacks/): Privacy evaluation
- Implements feature-based Membership Inference Attacks
- Extracts statistical features from retrieval patterns
- Trains shadow models for attack calibration
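To make the feature extraction step concrete, here is a hedged sketch of how the seven feature names listed for `features.py` in the repository tree might be computed for one candidate document. The exact definitions are not given in this README, so everything below is an assumption: the input layout (top-k scores per query plus the 1-based rank at which the candidate appeared, 0 for a miss), the `k + 1` sentinel for documents that never appear, and the Gini formula.

```python
import numpy as np


def gini(values):
    """Gini coefficient of a set of values (0 = perfectly even)."""
    x = np.sort(np.asarray(values, dtype=float).ravel())
    if x.size == 0:
        return 0.0
    if x[0] < 0:            # assumption: shift so all values are non-negative
        x = x - x[0]
    total = x.sum()
    if total == 0:
        return 0.0
    n = x.size
    ranks = np.arange(1, n + 1)
    return float(2.0 * np.sum(ranks * x) / (n * total) - (n + 1) / n)


def mia_features(scores, hit_ranks, k):
    """Build the seven-feature vector for one candidate document.

    scores:    (n_queries, k) top-k similarity scores returned per query
    hit_ranks: 1-based rank of the candidate per query, 0 if not retrieved
    """
    scores = np.asarray(scores, dtype=float)
    hit_ranks = np.asarray(hit_ranks)
    hits = hit_ranks[hit_ranks > 0]
    miss = float(k + 1)     # assumption: sentinel rank when never retrieved
    return {
        "max_score": float(scores.max()),
        "mean_score": float(scores.mean()),
        "std_score": float(scores.std()),
        "hits_in_topk": int(hits.size),
        "mean_rank_when_hit": float(hits.mean()) if hits.size else miss,
        "max_rank_hit": float(hits.max()) if hits.size else miss,
        "score_gini": gini(scores),
    }


# Three queries derived from one candidate document, k = 3 (toy numbers).
features = mia_features(
    scores=[[0.9, 0.5, 0.4], [0.8, 0.6, 0.3], [0.7, 0.65, 0.2]],
    hit_ranks=[1, 0, 2],
    k=3,
)
print(features)
```

The resulting dictionaries can be stacked into a feature matrix and passed to a logistic regression classifier, with member and non-member documents as the two classes.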
Evaluation Module (/src/evaluation/): Metrics and analysis
- Computes retrieval metrics (Recall@k)
- Calculates privacy metrics (AUC, TPR@FPR)
- Generates statistical significance tests
Experiments (/experiments): Orchestration scripts for running systematic experiments
- Baseline evaluation: Establishes non-private performance bounds
- Grid search: Exhaustive evaluation across privacy parameters
- Ablation studies: Isolates impact of individual components
- Robustness testing: Validates results on held-out data
Configs (/configs): YAML-based configuration system with hierarchical overrides
- Default parameters in base configuration
- Mechanism-specific parameter sets
- Model specifications for embedding generation
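Hierarchical overrides of this kind are commonly implemented as a recursive dictionary merge applied after the YAML files are parsed. The sketch below shows that idea on plain dictionaries; the function name `merge_configs` and all configuration keys are hypothetical, and the project's actual loader in `config.py` may work differently.

```python
from copy import deepcopy


def merge_configs(base, *overrides):
    """Return a new dict where later overrides win, merging nested dicts key by key."""
    merged = deepcopy(base)
    for override in overrides:
        for key, value in override.items():
            if isinstance(value, dict) and isinstance(merged.get(key), dict):
                merged[key] = merge_configs(merged[key], value)
            else:
                merged[key] = deepcopy(value)
    return merged


# Stand-ins for parsed default.yaml and an experiment file (keys are made up).
default_cfg = {
    "seed": 2025,
    "retrieval": {"top_k": 10, "metric": "inner_product"},
    "privacy": {"mechanism": "none"},
}
experiment_cfg = {"privacy": {"mechanism": "doc_ldp", "epsilon": 10}}

config = merge_configs(default_cfg, experiment_cfg)
print(config)
```

Merging nested sections instead of replacing them wholesale lets an experiment file state only the parameters it changes.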
Models (/models): Persistent model storage
- Trained MIA classifiers
- Shadow models for attack calibration
- Embedding model checkpoints
Results (/results): Experimental outputs
- Individual run results with full metrics
- Aggregated summaries with statistical analysis
- Publication-ready figures and tables
Logs (/logs): Execution tracking
- Detailed experiment logs with timestamps
- Error logs for debugging
- Performance profiling data
mute-vector/
├── src/
│ ├── __init__.py
│ ├── data/
│ │ ├── __init__.py
│ │ ├── loader.py # 20newsgroups: 8 categories (comp.graphics, rec.autos, sci.med, talk.politics.guns,
│ │ │ # alt.atheism, sci.space, rec.sport.hockey, misc.forsale)
│ │ │ # Train: 70% (~3,500 docs), Val: 15% (~750 docs), Test: 15% (~750 docs)
│ │ ├── preprocessor.py # Lowercasing, stopword removal (NLTK), min_doc_length=50 tokens
│ │ └── splitter.py # Stratified splits maintaining category distributions
│ ├── embeddings/
│ │ ├── __init__.py
│ │ ├── encoder.py # all-MiniLM-L6-v2 (384 dims) as primary, all-distilroberta-v1 (768 dims) for robustness
│ │ ├── models.py # Model registry and lazy loading
│ │ └── cache.py # Embedding cache manager (HDF5 format)
│ ├── privacy/
│ │ ├── __init__.py
│ │ ├── mechanisms.py # Gaussian noise: N(0, σ²), σ = Δf/ε * sqrt(2ln(1.25/δ)), δ=1e-5
│ │ ├── doc_ldp.py # Document-side noise before indexing
│ │ ├── query_ldp.py # Query-side noise at inference
│ │ ├── score_ldp.py # Score-space noise before top-k selection
│ │ ├── calibration.py # ε ∈ {∞, 20, 10, 5, 2.5} → σ mapping
│ │ └── budget_tuner.py # Binary search for ε given TPR@FPR target
│ ├── retrieval/
│ │ ├── __init__.py
│ │ ├── indexer.py # FAISS IndexFlatIP (inner product), L2 normalized vectors
│ │ ├── searcher.py # Top-k ∈ {1, 5, 10} retrieval
│ │ ├── queries.py # RAKE: 3 keyphrases/doc, 3-5 tokens each
│ │ └── scorer.py # Cosine similarity scoring pipeline
│ ├── attacks/
│ │ ├── __init__.py
│ │ ├── mia.py # Logistic regression MIA, 80/20 train/test split
│ │ ├── features.py # 7 features: max_score, mean_score, std_score, hits_in_topk,
│ │ │ # mean_rank_when_hit, max_rank_hit, score_gini
│ │ └── shadow_models.py # Shadow model training for calibration
│ ├── evaluation/
│ │ ├── __init__.py
│ │ ├── metrics.py # Recall@{1,5,10}, MIA-AUC, TPR@{0.1%, 1%}FPR, latency_ms
│ │ ├── evaluator.py # Full evaluation pipeline orchestration
│ │ └── statistical_tests.py # Bootstrap confidence intervals (n=1000)
│ └── utils/
│ ├── __init__.py
│ ├── config.py # YAML parsing with schema validation
│ ├── logging.py # Structured logging (JSON format)
│ ├── reproducibility.py # Seed management (global seed: 2025)
│ └── io.py # Unified I/O for results
├── experiments/
│ ├── __init__.py
│ ├── run_baseline.py # No-DP baseline: full corpus, all queries
│ ├── run_grid_search.py # Full factorial: mechanism × ε × k
│ ├── run_ablations.py # Ablation A: k∈{1,5,10}, Ablation B: queries_per_doc∈{1,3,5}
│ ├── run_robustness.py # Test set evaluation, model swapping
│ └── run_combined_mechanisms.py # Two-layer combinations (Doc+Query LDP)
├── scripts/
│ ├── setup_environment.sh
│ ├── download_data.py # Fetches 20newsgroups via sklearn
│ ├── generate_queries.py # Pre-generates all query sets
│ ├── build_indices.py # Pre-builds FAISS indices for all ε values
│ └── visualize_results.py # Matplotlib/Seaborn plots
├── configs/
│ ├── default.yaml # Base configuration (overrideable)
│ ├── experiments/
│ │ ├── baseline.yaml # mechanism: none, ε: ∞
│ │ ├── doc_ldp.yaml # mechanism: doc, ε: [2.5, 5, 10, 20]
│ │ ├── query_ldp.yaml # mechanism: query, ε: [2.5, 5, 10, 20]
│ │ ├── score_ldp.yaml # mechanism: score, ε: [2.5, 5, 10, 20]
│ │ └── combined.yaml # Two-layer combinations
│ └── models/
│ ├── minilm.yaml # all-MiniLM-L6-v2 config
│ └── distilroberta.yaml # all-distilroberta-v1 config
├── tests/
│ ├── __init__.py
│ ├── unit/
│ │ ├── test_privacy_mechanisms.py # Noise distribution tests
│ │ ├── test_retrieval.py # Index consistency tests
│ │ └── test_metrics.py # Metric computation tests
│ └── integration/
│ ├── test_pipeline.py # End-to-end pipeline
│ └── test_reproducibility.py # Determinism checks
├── models/ # Trained models storage
│ ├── mia/ # MIA classifier checkpoints
│ ├── shadow/ # Shadow models for calibration
│ └── embedders/ # Fine-tuned embedders (if applicable)
├── results/
│ ├── runs/ # Individual run outputs (CSV)
│ │ └── archive/ # Historical runs
│ ├── aggregated/ # Aggregated results across runs
│ │ ├── summary.csv # Main results table
│ │ └── statistical_analysis.csv # With confidence intervals
│ ├── figures/
│ │ ├── privacy_utility/ # Frontier plots
│ │ ├── ablations/ # Ablation study plots
│ │ └── comparison/ # Mechanism comparison plots
│ ├── tables/
│ │ ├── latex/ # LaTeX-formatted tables
│ │ └── csv/ # CSV tables
│ └── checkpoints/ # Intermediate experiment states
├── data/
│ ├── raw/
│ │ └── 20newsgroups/ # Original sklearn fetch
│ ├── processed/
│ │ ├── train/ # 70% split (~3,500 docs)
│ │ ├── val/ # 15% split (~750 docs)
│ │ └── test/ # 15% split (~750 docs)
│ ├── queries/
│ │ ├── train_queries.csv # ~10,500 queries (3 per doc)
│ │ ├── val_queries.csv # ~2,250 queries
│ │ └── test_queries.csv # ~2,250 queries
│ └── indices/
│ ├── baseline/ # No-noise indices
│ └── private/ # Per-mechanism, per-ε indices
├── logs/
│ ├── experiments/ # Experiment execution logs
│ ├── errors/ # Error tracking
│ └── performance/ # Timing and resource usage
├── notebooks/
│ └── exploratory/ # Development notebooks (not for production)
├── docs/
│ ├── REPRODUCIBILITY.md
│ ├── EXPERIMENTS.md
│ ├── METRICS.md # Detailed metric definitions
│ └── OUTPUT_SCHEMA.md # Unified output structure specification
├── .github/
│ └── workflows/
│ └── tests.yml # CI/CD test runner
├── environment.yml
├── requirements.txt
├── requirements-dev.txt
├── pyproject.toml
├── setup.py
├── .gitignore
├── LICENSE
└── README.md
20 Newsgroups Corpus Selection:
- 8 diverse categories for balanced representation:
  - comp.graphics - Computer graphics discussions
  - rec.autos - Automotive topics
  - sci.med - Medical science
  - talk.politics.guns - Political discussions
  - alt.atheism - Religious debates
  - sci.space - Space exploration
  - rec.sport.hockey - Sports content
  - misc.forsale - Commerce/sales
Data Splits:
- Training set: 70% (~3,500 documents)
- Validation set: 15% (~750 documents)
- Test set: 15% (~750 documents)
- Stratified splitting maintains category distributions
Keyphrase Extraction Parameters:
- Algorithm: Rapid Automatic Keyword Extraction (RAKE)
- Queries per document: 3
- Keyphrase length: 3-5 tokens
- Total queries: ~15,000 across all splits
Preprocessing steps:
- HTML tag removal and text extraction
- Lowercasing and Unicode normalization
- Stopword removal using NLTK English stopwords
- Document filtering (minimum 50 tokens)
- Metadata preservation (category labels, document IDs)
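For readers unfamiliar with RAKE, the keyphrase extraction parameters above can be illustrated with a simplified pure-Python version of the algorithm. This is a sketch under stated assumptions, not the project's `queries.py`: the tiny stopword set stands in for the NLTK list, the tokenizer is a plain regular expression, and the function name `rake_keyphrases` is hypothetical.

```python
import re
from collections import defaultdict

# Small stand-in stopword list; the project uses NLTK's English stopwords.
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "for", "on",
             "with", "that", "are", "by", "from"}


def rake_keyphrases(text, stopwords=STOPWORDS, min_len=3, max_len=5, top_n=3):
    """Simplified RAKE: return up to top_n phrases of min_len..max_len tokens."""
    # 1. Candidate phrases: split at punctuation, then at stopwords.
    phrases = []
    for fragment in re.split(r"[.,;:!?()\n]+", text.lower()):
        current = []
        for word in re.findall(r"[a-z0-9]+", fragment):
            if word in stopwords:
                if current:
                    phrases.append(tuple(current))
                current = []
            else:
                current.append(word)
        if current:
            phrases.append(tuple(current))

    # 2. Word score = degree / frequency, where degree counts co-occurrences
    #    within candidate phrases (including the word itself).
    freq = defaultdict(int)
    degree = defaultdict(int)
    for phrase in phrases:
        for word in phrase:
            freq[word] += 1
            degree[word] += len(phrase)
    word_score = {w: degree[w] / freq[w] for w in freq}

    # 3. Phrase score = sum of its word scores; keep phrases in the length range.
    candidates = {p: sum(word_score[w] for w in p)
                  for p in phrases if min_len <= len(p) <= max_len}
    ranked = sorted(candidates.items(), key=lambda kv: (-kv[1], kv[0]))
    return [" ".join(p) for p, _ in ranked[:top_n]]


doc = ("Compression of digital satellite images is a hard problem. "
       "The hockey league uses fast image compression algorithms for the broadcast.")
queries = rake_keyphrases(doc)
print(queries)
```

In this toy document only one candidate phrase falls in the 3-5 token range, so fewer than three queries come back; a real pipeline needs a policy for such documents.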
Gaussian Mechanism Parameters:
- Noise distribution: N(0, σ²)
- Sensitivity calculation: Δf = 2 (for L2-normalized vectors)
- Standard deviation: σ = Δf/ε × √(2ln(1.25/δ))
- Fixed δ = 10⁻⁵ for all experiments
Privacy Budget Grid:
- ε ∈ {∞ (baseline), 20, 10, 5, 2.5}
- Lower ε values provide stronger privacy guarantees
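The mapping from the privacy budget grid to noise scales follows directly from the formula above. The sketch below evaluates it with Δf = 2 and δ = 10⁻⁵; the function name `gaussian_sigma` is hypothetical, and treating ε = ∞ as zero noise is an assumption consistent with the baseline. One caveat: the classical analysis behind this formula assumes ε < 1, so for the larger budgets in the grid it is safer to read σ as a noise-scale parameterization than as a tight (ε, δ) guarantee.

```python
import math


def gaussian_sigma(epsilon, sensitivity=2.0, delta=1e-5):
    """Noise standard deviation: sigma = sensitivity / epsilon * sqrt(2 ln(1.25 / delta))."""
    if math.isinf(epsilon):
        return 0.0  # assumption: infinite budget means no noise (baseline)
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    return sensitivity / epsilon * math.sqrt(2.0 * math.log(1.25 / delta))


for eps in (math.inf, 20, 10, 5, 2.5):
    print(f"epsilon={eps}: sigma={gaussian_sigma(eps):.3f}")
```

Halving ε doubles σ, so each step down the grid doubles the noise added to every coordinate.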
Doc-LDP: Adds noise to document embeddings before index construction
- Applied once during preprocessing
- Affects all subsequent retrievals
- Most computationally efficient
Query-LDP: Adds noise to query embeddings at search time
- Applied per-query during inference
- Allows dynamic privacy adjustment
- No index reconstruction required
Score-LDP: Adds noise to similarity scores before ranking
- Applied post-retrieval
- Finest granularity of control
- Can be combined with other mechanisms
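The three mechanisms differ only in where the noise enters the retrieval pipeline. The following NumPy sketch makes that placement explicit on synthetic data. It is illustrative only: brute-force inner products stand in for the FAISS index, the value of `sigma` is arbitrary rather than calibrated, all helper names are hypothetical, and whether vectors are re-normalized after noise is added is not stated in this README (the sketch does not re-normalize).

```python
import numpy as np


def l2_normalize(x):
    """Scale vectors along the last axis to unit L2 norm."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def add_gaussian_noise(x, sigma, rng):
    """Add i.i.d. N(0, sigma^2) noise to every entry of x."""
    return x + rng.normal(0.0, sigma, size=np.shape(x))


def top_k(scores, k):
    """Indices of the k highest scores, best first."""
    return np.argsort(-scores)[:k]


rng = np.random.default_rng(2025)
docs = l2_normalize(rng.normal(size=(100, 384)))             # synthetic corpus
query = l2_normalize(docs[7] + 0.02 * rng.normal(size=384))  # query near doc 7
sigma, k = 0.05, 5                                           # illustrative values

baseline_hits = top_k(docs @ query, k)

# Doc-LDP: perturb document embeddings once, before the index is built.
noisy_docs = add_gaussian_noise(docs, sigma, rng)
doc_ldp_hits = top_k(noisy_docs @ query, k)

# Query-LDP: perturb the query embedding at search time; the index is untouched.
noisy_query = add_gaussian_noise(query, sigma, rng)
query_ldp_hits = top_k(docs @ noisy_query, k)

# Score-LDP: compute exact similarities, then perturb them before top-k selection.
noisy_scores = add_gaussian_noise(docs @ query, sigma, rng)
score_ldp_hits = top_k(noisy_scores, k)

print(baseline_hits, doc_ldp_hits, query_ldp_hits, score_ldp_hits)
```

Because Doc-LDP noise is fixed at indexing time, changing ε requires rebuilding the index, whereas the query-side and score-side variants can change ε per request.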
Automated epsilon selection based on privacy requirements:
- Target metric: TPR@0.1%FPR ≤ threshold
- Binary search over epsilon grid
- Returns the configuration that meets the target with minimal utility loss
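One way to realize this selection is a binary search over the sorted ε grid for the largest budget whose measured attack rate stays at or below the threshold. The sketch below assumes the attack's TPR does not decrease as ε grows, which is what makes binary search valid. The function `tune_epsilon` and the `attack_tpr` callback are hypothetical, and the toy callback is a made-up placeholder, not a measured result.

```python
def tune_epsilon(eps_grid, attack_tpr, threshold):
    """Return the largest epsilon with attack_tpr(epsilon) <= threshold, or None.

    Assumes attack_tpr is non-decreasing in epsilon (less noise, stronger attack).
    """
    grid = sorted(eps_grid)
    lo, hi = 0, len(grid) - 1
    best = None
    while lo <= hi:
        mid = (lo + hi) // 2
        if attack_tpr(grid[mid]) <= threshold:
            best = grid[mid]      # feasible: try a larger (less noisy) budget
            lo = mid + 1
        else:
            hi = mid - 1          # attack too successful: need more noise
    return best


def toy_tpr(eps):
    """Placeholder attack curve for illustration only."""
    return eps / 1000.0


chosen = tune_epsilon([20, 10, 5, 2.5], toy_tpr, threshold=0.006)
print(chosen)
```

Returning the largest feasible ε corresponds to the least noise that still satisfies the privacy target, which is why it minimizes utility loss. A real `attack_tpr` would run the full attack evaluation for each probed ε, so caching its results is worthwhile.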
Membership Inference Attack (MIA):
- Attack model: Logistic regression classifier
- Features: Statistical patterns from retrieval scores
- Training: 50/50 member/non-member split
- Evaluation metrics:
- AUC (Area Under ROC Curve)
- TPR@0.1%FPR (True Positive Rate at 0.1% False Positive Rate)
- TPR@1%FPR (True Positive Rate at 1% False Positive Rate)
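The attack metrics above can all be derived from the ROC curve of the attack classifier's scores. Below is a minimal NumPy sketch; the function names are hypothetical, and taking the best TPR among operating points with FPR at or below the target, without interpolation, is one reasonable convention rather than necessarily the project's.

```python
import numpy as np


def roc_points(labels, scores):
    """Return (fpr, tpr) arrays for all distinct score thresholds."""
    labels = np.asarray(labels, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    order = np.argsort(-scores, kind="stable")
    labels, scores = labels[order], scores[order]
    tp = np.cumsum(labels)
    fp = np.cumsum(~labels)
    # Keep the last index of each run of tied scores so ties share one threshold.
    last = np.r_[np.flatnonzero(np.diff(scores)), labels.size - 1]
    tpr = np.r_[0.0, tp[last] / labels.sum()]
    fpr = np.r_[0.0, fp[last] / (~labels).sum()]
    return fpr, tpr


def auc(fpr, tpr):
    """Area under the ROC curve by the trapezoidal rule."""
    return float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0))


def tpr_at_fpr(fpr, tpr, target_fpr):
    """Highest TPR among operating points whose FPR does not exceed target_fpr."""
    return float(tpr[fpr <= target_fpr].max())


# Toy example: 1 = member, 0 = non-member (scores are made up).
y = [1, 1, 1, 1, 0, 0, 0, 0]
s = [0.9, 0.8, 0.7, 0.3, 0.6, 0.4, 0.2, 0.1]
fpr, tpr = roc_points(y, s)
print(auc(fpr, tpr), tpr_at_fpr(fpr, tpr, 0.25))
```

Resolving an FPR of 0.1% requires at least a thousand non-member examples; with fewer, the lowest non-zero FPR on the curve is already above the target.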
Retrieval Quality:
- Recall@1: Fraction of queries retrieving correct document at rank 1
- Recall@5: Fraction of queries retrieving correct document in top 5
- Recall@10: Fraction of queries retrieving correct document in top 10
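Since each query is generated from exactly one source document, Recall@k reduces to checking whether that document's ID appears among the first k results. A small sketch, with a hypothetical function name and toy IDs:

```python
import numpy as np


def recall_at_k(ranked_ids, target_ids, k):
    """Fraction of queries whose target document appears in the top k results.

    ranked_ids: (n_queries, >=k) retrieved document IDs, best first
    target_ids: (n_queries,) ID of the document each query was generated from
    """
    top = np.asarray(ranked_ids)[:, :k]
    targets = np.asarray(target_ids).reshape(-1, 1)
    return float((top == targets).any(axis=1).mean())


ranked = [[3, 1, 2], [0, 5, 4], [9, 8, 7]]
targets = [3, 4, 6]
print(recall_at_k(ranked, targets, 1), recall_at_k(ranked, targets, 3))
```

Here the first query hits at rank 1, the second at rank 3, and the third misses, giving Recall@1 of 1/3 and Recall@3 of 2/3.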
Performance Metrics:
- Query latency (milliseconds)
- Index construction time (seconds)
- Memory footprint (GB)
Statistical analysis:
- Bootstrap confidence intervals (n=1000 iterations)
- Paired significance tests for mechanism comparison
- Effect size calculation (Cohen's d)
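The bootstrap intervals and effect sizes listed above can be computed in a few lines of NumPy. This sketch uses the percentile bootstrap over per-query outcomes and the pooled-variance form of Cohen's d; both are common choices but assumptions here, as are the function names and the synthetic data.

```python
import numpy as np


def bootstrap_ci(values, n_boot=1000, alpha=0.05, seed=2025):
    """Percentile bootstrap confidence interval for the mean of values."""
    values = np.asarray(values, dtype=float)
    rng = np.random.default_rng(seed)
    idx = rng.integers(0, values.size, size=(n_boot, values.size))
    means = values[idx].mean(axis=1)          # one resampled mean per row
    lower, upper = np.quantile(means, [alpha / 2, 1 - alpha / 2])
    return float(lower), float(upper)


def cohens_d(a, b):
    """Cohen's d with pooled sample variance."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    pooled_var = ((a.size - 1) * a.var(ddof=1) + (b.size - 1) * b.var(ddof=1)) / (
        a.size + b.size - 2
    )
    return float((a.mean() - b.mean()) / np.sqrt(pooled_var))


# Synthetic per-query hit indicators (1 = target retrieved), for illustration only.
rng = np.random.default_rng(0)
hits = rng.binomial(1, 0.6, size=500)
ci_low, ci_high = bootstrap_ci(hits)
print(hits.mean(), ci_low, ci_high)
print(cohens_d([2, 3, 4], [1, 2, 3]))
```

Resampling per-query outcomes keeps the interval tied to the query set actually evaluated. For a paired comparison between mechanisms, the same query indices would be resampled for both.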
All experiments produce standardized CSV outputs with the following columns:
Experiment Metadata:
- run_id: Unique identifier (UUID)
- timestamp: ISO 8601 execution time
- git_commit: Repository version
- config_hash: Configuration fingerprint
Configuration Parameters:
- mechanism: {none, doc_ldp, query_ldp, score_ldp, combined}
- epsilon_doc: Document-side privacy budget
- epsilon_query: Query-side privacy budget
- epsilon_score: Score-side privacy budget
- embedding_model: Model identifier
- dataset_split: {train, val, test}
- num_documents: Corpus size
- num_queries: Total queries evaluated
Privacy Metrics:
- mia_auc: Attack AUC score [0,1]
- tpr_0.001_fpr: TPR at 0.1% FPR [0,1]
- tpr_0.01_fpr: TPR at 1% FPR [0,1]
- avg_membership_score: Mean attack confidence
Utility Metrics:
- recall_at_1: Recall@1 [0,1]
- recall_at_5: Recall@5 [0,1]
- recall_at_10: Recall@10 [0,1]
- mrr: Mean Reciprocal Rank [0,1]
Performance Metrics:
- latency_mean_ms: Average query time
- latency_std_ms: Query time standard deviation
- latency_p99_ms: 99th percentile latency
- index_build_time_s: Index construction duration
- peak_memory_gb: Maximum memory usage
Statistical Measures:
- recall_at_1_ci_lower: 95% CI lower bound
- recall_at_1_ci_upper: 95% CI upper bound
- mia_auc_ci_lower: 95% CI lower bound
- mia_auc_ci_upper: 95% CI upper bound
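As an illustration of the output format, the sketch below writes one run record with Python's standard `csv` module. It uses only a subset of the columns listed above, leaves metric cells empty rather than inventing values, and writes to an in-memory buffer. The helper name `write_rows` and the placeholder field values are assumptions.

```python
import csv
import io
import uuid
from datetime import datetime, timezone

# Subset of the documented columns, in the order they will appear in the file.
FIELDS = [
    "run_id", "timestamp", "git_commit", "config_hash",
    "mechanism", "epsilon_doc", "embedding_model", "dataset_split",
    "recall_at_1", "mia_auc",
]


def write_rows(rows, handle):
    """Write run records as CSV with a header row; missing metrics become empty cells."""
    writer = csv.DictWriter(handle, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)


row = {
    "run_id": str(uuid.uuid4()),
    "timestamp": datetime.now(timezone.utc).isoformat(),  # ISO 8601
    "git_commit": "unknown",        # placeholder
    "config_hash": "unknown",       # placeholder
    "mechanism": "doc_ldp",
    "epsilon_doc": 10,
    "embedding_model": "all-MiniLM-L6-v2",
    "dataset_split": "val",
    "recall_at_1": None,            # filled in by the evaluator
    "mia_auc": None,                # filled in by the evaluator
}

buffer = io.StringIO()
write_rows([row], buffer)
print(buffer.getvalue())
```

Fixing the column order in one place keeps individual run files trivially concatenable into the aggregated tables.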
Summary tables combining multiple runs:
- Privacy-utility frontiers per mechanism
- Statistical comparisons across mechanisms
- Best configurations per privacy target
- Environment Validation
  - Verify dependencies and data availability
  - Check reproducibility settings (seeds)
- Baseline Establishment
  - Run non-private configuration
  - Establish upper bounds for utility metrics
- Mechanism Evaluation
  - Execute grid search across epsilon values
  - Generate per-mechanism results
- Comparative Analysis
  - Produce privacy-utility frontiers
  - Identify Pareto-optimal configurations
- Robustness Validation
  - Test on held-out data split
  - Verify stability across random seeds
Experiments are executed via command-line scripts with YAML configurations:
- Single experiment: Specify configuration file
- Batch execution: Use experiment orchestrator
- Parallel runs: Configure worker processes
Reproducibility controls:
- Fixed global seed: 2025
- Deterministic operations enforced
- Configuration versioning via Git
- Complete environment specification
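A seed helper along the following lines is a common way to implement a fixed global seed. This is a sketch, not the contents of `reproducibility.py`: the function name `set_global_seed` is hypothetical, and it covers only the standard library and NumPy. Libraries such as PyTorch or FAISS have their own seeding and determinism settings that would need to be handled separately.

```python
import os
import random

import numpy as np


def set_global_seed(seed=2025):
    """Seed Python's and NumPy's global RNGs and return a dedicated Generator."""
    random.seed(seed)
    np.random.seed(seed)
    # Only affects hash randomization in Python processes started after this point.
    os.environ["PYTHONHASHSEED"] = str(seed)
    return np.random.default_rng(seed)


rng = set_global_seed()
first_draw = rng.normal(size=3)
second_draw = set_global_seed().normal(size=3)
print(np.array_equal(first_draw, second_draw))
```

Passing the returned `Generator` explicitly to noise-sampling code, instead of relying on global state, makes it easier to verify that two runs consumed randomness in the same order.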
Computation Time (per configuration):
- Embedding generation: ~5 minutes (CPU), ~1 minute (GPU)
- Index construction: ~30 seconds
- MIA evaluation: ~2 minutes
- Full grid search: ~4 hours
Memory Requirements:
- Embedding matrix: ~5GB for 5000 documents
- FAISS index: ~2GB
- Peak usage during evaluation: ~12GB
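As a rough cross-check on the memory figures above, the raw float32 storage for an embedding matrix can be computed directly. The helper name `embedding_bytes` is hypothetical. The raw vectors for 5,000 documents come to only a few MiB, so the gigabyte-scale figures presumably include model weights, caches, and evaluation state. That reading is an assumption, not something the measurements confirm.

```python
def embedding_bytes(num_vectors, dim, bytes_per_value=4):
    """Raw storage for a dense matrix of float32 embeddings, in bytes."""
    return num_vectors * dim * bytes_per_value


for name, dim in (("all-MiniLM-L6-v2", 384), ("all-distilroberta-v1", 768)):
    mib = embedding_bytes(5000, dim) / 2**20
    print(f"{name}: {mib:.1f} MiB for 5000 documents")
```

A flat inner-product index stores the vectors uncompressed, so its raw payload is of the same order as the embedding matrix itself.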
Scalability:
- Document corpus: Tested up to 10,000 documents
- Query load: Evaluated with 30,000 queries
- Embedding dimensions: 384 (MiniLM), 768 (DistilRoBERTa)
Code style:
- Type hints required for all function signatures
- Docstrings following NumPy style guide
- Maximum line length: 100 characters
- Import ordering: standard library, third-party, local
Testing:
- Unit test coverage minimum: 80%
- Integration tests for complete pipelines
- Reproducibility tests with fixed seeds
- Performance regression tests
Version control:
- Feature branches for development
- Semantic versioning for releases
- Comprehensive commit messages
- PR reviews required for main branch
Documentation:
- Module-level documentation required
- Inline comments for complex algorithms
- Update README for interface changes
- Maintain experiment logs
For technical questions, implementation details, or bug reports, please open an issue on the GitHub repository with appropriate labels.
MIT License - See LICENSE file for complete terms.
This research implementation follows privacy-preserving machine learning best practices and builds upon established differential privacy frameworks.