All notable changes to EasyMDM Advanced will be documented in this file.
The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.
- PostgreSQL: Full support with SQLAlchemy integration
- SQL Server: Native support with pyodbc and connection optimization
- DuckDB: High-performance analytical database support
- Enhanced SQLite and CSV support from original package
- Cosine Similarity: TF-IDF based similarity for longer text fields
- Jaccard Similarity: Set-based similarity with n-gram support
- FuzzyWuzzy Integration: Multiple fuzzy matching variants (ratio, token_sort)
- Vectorized Operations: NumPy/Pandas optimized similarity computations
- Numba JIT Compilation: Performance optimization for large datasets
- Exact Blocking: Optimized exact key matching
- Sorted Neighborhood: Sliding window approach for ordered data
- RecordLinkage Integration: Standard blocking methods from recordlinkage library
- Performance Optimization: Multi-threaded blocking for large datasets
- Hierarchical Clustering: Distance-based hierarchical clustering with scikit-learn
- DBSCAN Clustering: Density-based clustering with noise detection
- Network-based Clustering: Enhanced graph connectivity clustering (original method)
- Quality Metrics: Silhouette scores and clustering quality analysis
- 8+ Survivorship Strategies:
- Most Recent (datetime-based)
- Source Priority (trust-based ordering)
- Longest String (completeness-based)
- Highest/Lowest Value (numeric-based)
- Threshold-based (conditional selection)
- Most Frequent (frequency-based)
- First Value (fallback strategy)
- Priority Rules: Multi-condition priority-based record selection
- Confidence Scoring: Quality metrics for survivorship decisions
- Parallel Processing: Multi-core similarity computation and processing
- Memory Optimization: Chunked processing for large datasets
- Batch Processing: Configurable batch sizes for memory management
- Streaming Support: Process datasets larger than memory
- Connection Pooling: Database connection optimization
- Rich UI Integration: Beautiful console output with progress bars
- Comprehensive Commands:
easymdm process: Execute MDM with advanced optionseasymdm create-config: Generate configuration templateseasymdm test-connection: Database connection testingeasymdm validate-config: Configuration validationeasymdm benchmark: Performance benchmarkingeasymdm info: Package information and diagnostics
- Configuration Templates: Pre-built templates for all database types
- Type Validation: Cerberus-based configuration validation
- JSON Schema Support: Schema-based configuration validation
- Configuration Templates: Database-specific YAML templates
- Environment Variable Support: Secure credential management
- Hierarchical Config: Complex nested configuration support
- Golden Records: Enhanced golden record generation with metadata
- Review Pairs: Detailed similarity scores for manual review
- Processing Statistics: Comprehensive performance and quality metrics
- Multiple Formats: CSV, JSON, Excel output support
- Detailed Logging: Structured logging with multiple levels
- Data Profiling: Comprehensive input data analysis
- Quality Metrics: Data quality scoring and assessment
- Performance Monitoring: Real-time processing metrics
- Error Handling: Robust error handling and recovery
- Progress Tracking: Real-time progress indicators
- Performance: 10x+ faster similarity computation with vectorization
- Memory Efficiency: Reduced memory usage with streaming operations
- Accuracy: Improved matching accuracy with advanced algorithms
- Scalability: Support for millions of records
- Speed: Faster candidate pair generation with optimized algorithms
- Reduction: Better blocking efficiency with multiple strategies
- Flexibility: Configurable blocking methods and parameters
- Strategies: 8+ survivorship strategies vs 3 in original
- Logic: Enhanced resolution logic with confidence scoring
- Flexibility: Configurable priority conditions and rules
- Transparency: Detailed resolution logic tracking
- Validation: Schema-based configuration validation
- Templates: Pre-built configuration templates
- Documentation: Comprehensive configuration documentation
- Flexibility: Support for complex multi-database scenarios
- Type Hints: Full Python type hint coverage
- Documentation: Comprehensive docstring coverage
- Testing: Extensive unit and integration test suite
- Linting: Black, flake8, mypy compliance
- Modularity: Clean separation of concerns and responsibilities
- Plugin System: Extensible similarity and survivorship strategies
- Factory Pattern: Clean object creation and management
- Abstract Base Classes: Consistent interface definitions
- Dependency Injection: Flexible component configuration
- Profiling: Built-in performance profiling and monitoring
- Optimization: Memory and CPU usage optimization
- Scalability: Linear scaling with dataset size
- Caching: Intelligent caching for repeated operations
- Python 3.8+
- Full compatibility with Python 3.9, 3.10, 3.11
- PostgreSQL: 9.6+ (tested with 12, 13, 14, 15)
- SQL Server: 2016+ (tested with 2017, 2019, 2022)
- SQLite: 3.7+ (any version supported by Python)
- DuckDB: 0.8+ (latest version recommended)
- CSV Files: Any valid CSV format
- Linux: Ubuntu 18.04+, CentOS 7+, RHEL 7+
- Windows: Windows 10+, Windows Server 2016+
- macOS: macOS 10.14+ (Mojave and later)
- Configuration Format: Enhanced YAML structure (migration guide provided)
- API Changes: New class-based API (backward compatibility layer available)
- Output Format: Enhanced output structure with additional metadata
- Configuration Migration: Use
easymdm migrate-configcommand - API Migration: Update import statements and configuration
- Output Migration: Update downstream processes for new output format
- Legacy configuration format supported with deprecation warnings
- Original CLI commands available with
easymdm legacyprefix - Automatic migration utilities provided
- Similarity Computation: 10-50x faster depending on method and dataset size
- Blocking: 5-20x faster with optimized algorithms
- Overall Processing: 5-15x faster end-to-end processing
- Memory Usage: 50-80% reduction in peak memory usage
- Dataset Size: Tested up to 10M records (vs 100K in original)
- Concurrent Processing: Support for multi-core processing
- Streaming: Support for datasets larger than available memory
- SQL Server: Requires ODBC Driver 17+ for optimal performance
- PostgreSQL: Connection pooling may require additional configuration for very large datasets
- Memory: Very large similarity matrices (>1B pairs) may require additional memory optimization
- Large Datasets: Use chunked processing and exact blocking for datasets >5M records
- Memory Constraints: Adjust batch_size and chunk_size parameters
- Connection Limits: Configure database connection limits appropriately
- Apache Spark Integration: Distributed processing support
- Real-time Processing: Streaming MDM capabilities
- ML Integration: Machine learning-based matching
- Web UI: Browser-based configuration and monitoring
- Cloud Integration: Native AWS, Azure, GCP support
- Enterprise Features: Role-based access, audit trails
- Advanced Analytics: Statistical analysis and reporting
- API Server: RESTful API for integration
- Basic CSV and SQLite support
- Fuzzy blocking with Jaro-Winkler similarity
- Network-based clustering
- Basic survivorship rules
- Simple CLI interface
- YAML configuration support
- Record linkage and deduplication
- Similarity-based matching
- Golden record generation
- Basic performance optimization
Note: This changelog covers the major evolution from the original EasyMDM to EasyMDM Advanced v2.0.0. For detailed API documentation and migration guides, see the Documentation.