Summary
Add check_array / validate_data input validation at the public API boundary so errors surface early with clear messages.
Details
Much of the validation already exists deeper in the stack (validate_mask_compatibility in _sparsity.py, _validate_shapes in _rms.py, shape checks in _full_update.py). This issue surfaces those checks to the estimator boundary so users get clear errors instead of cryptic BLAS crashes.
VBPCA
- Add validate_data() or check_array() in fit(): enforce 2-D, numeric dtype, apply NaN policy via tags
- Validate n_components >= 1 and n_components <= min(n_features, n_samples) in fit()
- Guard against all-NaN or empty (0-row / 0-column) input — currently crashes deep in matrix ops
- Add n_features_in_ consistency check in transform() / inverse_transform()
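
A minimal sketch of the fit()-time checks listed above, written in plain NumPy so the intended behavior is explicit. The function name validate_fit_input and the error wording are assumptions for illustration, not existing code. In the real estimator most of this would likely be delegated to sklearn's validate_data() or check_array(). One point worth verifying: check_array(dtype="numeric") appears to raise ValueError rather than TypeError for string arrays, so the TypeError in the acceptance criteria may need an explicit dtype check like the one here.

```python
import numbers

import numpy as np


def validate_fit_input(X, n_components):
    """Hypothetical boundary check for fit(); returns X as a 2-D float array.

    NaN entries are allowed (treated as missing), but an all-NaN matrix is not.
    """
    X = np.asarray(X)
    if X.ndim != 2:
        raise ValueError(
            f"Expected a 2-D array, got {X.ndim}-D input with shape {X.shape}."
        )
    # dtype kinds accepted as numeric: b=bool, i=int, u=unsigned int, f=float
    if X.dtype.kind not in "biuf":
        raise TypeError(f"Expected numeric data, got dtype {X.dtype}.")
    n_samples, n_features = X.shape
    if n_samples == 0 or n_features == 0:
        raise ValueError(
            f"Empty input with shape {X.shape}: need at least one row and one column."
        )
    X = X.astype(float, copy=False)
    if np.isnan(X).all():
        raise ValueError("Input contains only NaN values; nothing to fit.")
    max_components = min(n_samples, n_features)
    if not isinstance(n_components, numbers.Integral) or not (
        1 <= n_components <= max_components
    ):
        raise ValueError(
            f"n_components must be an integer in [1, {max_components}], "
            f"got {n_components!r}."
        )
    return X
```

Calling this first in fit() would turn the failure modes above into immediate, readable errors; the deeper helpers (validate_mask_compatibility, _validate_shapes) would still run afterwards on input that passes.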
Preprocessing classes
- Add 2-D enforcement in all fit() / transform() methods
- Add dtype checks (numeric for scalers, any dtype for the one-hot encoder)
- Add n_features_in_ consistency check in MissingAwareOneHotEncoder.transform()
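
The n_features_in_ consistency check could look roughly like the sketch below. ToyScaler and _as_2d are hypothetical stand-ins, not classes from this package; once the estimators inherit from BaseEstimator, sklearn's own validation would normally record and compare n_features_in_ instead of hand-written code.

```python
import numpy as np


def _as_2d(X):
    """Hypothetical helper: coerce to a float array and require 2-D."""
    X = np.asarray(X, dtype=float)
    if X.ndim != 2:
        raise ValueError(f"Expected a 2-D array, got {X.ndim}-D input.")
    return X


class ToyScaler:
    """Illustrative mean-centering scaler that tolerates NaN entries."""

    def fit(self, X):
        X = _as_2d(X)
        # Remember the column count seen at fit time.
        self.n_features_in_ = X.shape[1]
        self.mean_ = np.nanmean(X, axis=0)
        return self

    def transform(self, X):
        X = _as_2d(X)
        # Fail early if the column count differs from what fit() saw.
        if X.shape[1] != self.n_features_in_:
            raise ValueError(
                f"X has {X.shape[1]} features, but {type(self).__name__} "
                f"was fitted with {self.n_features_in_} features."
            )
        return X - self.mean_
```

Without the check, a column-count mismatch would only surface as a broadcasting error in the subtraction, which is the kind of cryptic failure this issue aims to remove.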
What already exists (leverage, don't duplicate)
- _sparsity.validate_mask_compatibility(): mask/data shape and sparsity checking
- _rms._validate_shapes(): loadings/scores 2-D and dimension compatibility
- _full_update._safe_cholesky(): numerical safety
- _pca_full._validate_dense_mask_budget(): memory budget guard
- _remove_empty.remove_empty_entries(): strips empty rows/cols (but has no guard for all-empty input)
Depends on
- Issue: Inherit from sklearn BaseEstimator + TransformerMixin
Acceptance criteria
- VBPCA(n_components=5).fit(np.array([1,2,3])) raises a clear ValueError (not 2-D)
- VBPCA(n_components=5).fit(np.array([["a","b"],["c","d"]])) raises a clear TypeError (not numeric)
- VBPCA(n_components=0).fit(X) raises ValueError
- All-NaN input raises ValueError with message mentioning NaN
- Empty array raises ValueError