
Input validation at the public API boundary #38

@jc-macdonald


Summary

Add check_array / validate_data input validation at the public API boundary so errors surface early with clear messages.

Details

Much of the validation already exists deeper in the stack (validate_mask_compatibility in _sparsity.py, _validate_shapes in _rms.py, shape checks in _full_update.py). This issue surfaces those checks to the estimator boundary so users get clear errors instead of cryptic BLAS crashes.

VBPCA

  • Add validate_data() or check_array() in fit(): enforce 2-D input and a numeric dtype, and apply the NaN policy declared in the estimator tags
  • Validate n_components >= 1 and n_components <= min(n_features, n_samples) in fit()
  • Guard against all-NaN or empty (0-row / 0-column) input, which currently crashes deep in the matrix operations
  • Add n_features_in_ consistency check in transform() / inverse_transform()
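The fit()-time checks listed above could look roughly like the sketch below. It uses NumPy only so that each rule is explicit. The helper name validate_fit_input and the error wording are hypothetical, and a real implementation would probably delegate most of this to scikit-learn's validation utilities.

```python
import numpy as np


def validate_fit_input(X, n_components):
    """Hypothetical boundary check: return X as a float64 2-D array or raise."""
    X = np.asarray(X)

    # 2-D enforcement: a 1-D vector or a 3-D stack is rejected up front.
    if X.ndim != 2:
        raise ValueError(
            f"Expected a 2-D array, got {X.ndim}-D input with shape {X.shape}."
        )

    # Numeric dtype: bool, signed/unsigned int and float kinds are accepted.
    if X.dtype.kind not in "biuf":
        raise TypeError(f"Expected a numeric dtype, got {X.dtype}.")

    # Empty input: zero rows or zero columns.
    n_samples, n_features = X.shape
    if n_samples == 0 or n_features == 0:
        raise ValueError(f"Input is empty: shape {X.shape}.")

    # All-NaN input: partially missing data is allowed, fully missing is not.
    X = X.astype(np.float64, copy=False)
    if np.isnan(X).all():
        raise ValueError("Input contains only NaN values; nothing to fit.")

    # n_components must lie in [1, min(n_samples, n_features)].
    max_components = min(n_samples, n_features)
    if not 1 <= n_components <= max_components:
        raise ValueError(
            f"n_components={n_components} must be between 1 and {max_components}."
        )
    return X
```

One design point to check: as far as I know, check_array(dtype="numeric") raises ValueError rather than TypeError for string input, so an explicit dtype check like the one above may be needed before delegating if the TypeError criterion is kept.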

Preprocessing classes

  • Add 2-D enforcement in all fit() / transform() methods
  • Add dtype checks (numeric for scalers, any for OHE)
  • Add n_features_in_ consistency check in MissingAwareOneHotEncoder.transform()
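The n_features_in_ consistency check can be sketched as follows. ToyScaler is a stand-in written only for this illustration, not one of the project's preprocessing classes, and the helper names _as_2d and _check_n_features are hypothetical. With scikit-learn's base classes in place, the same check would more likely come from its own validation helpers.

```python
import numpy as np


def _as_2d(X):
    """Convert to an array and enforce 2-D input."""
    X = np.asarray(X)
    if X.ndim != 2:
        raise ValueError(f"Expected a 2-D array, got {X.ndim}-D input.")
    return X


def _check_n_features(estimator, X):
    """Compare X's column count with the one recorded during fit()."""
    if not hasattr(estimator, "n_features_in_"):
        raise RuntimeError("Estimator is not fitted; call fit() first.")
    if X.shape[1] != estimator.n_features_in_:
        raise ValueError(
            f"X has {X.shape[1]} features, but {type(estimator).__name__} "
            f"was fitted with {estimator.n_features_in_} features."
        )


class ToyScaler:
    """Stand-in transformer used only to illustrate the check."""

    def fit(self, X):
        X = _as_2d(X)
        self.n_features_in_ = X.shape[1]    # recorded once, at fit time
        self.mean_ = np.nanmean(X, axis=0)  # missing-aware column means
        return self

    def transform(self, X):
        X = _as_2d(X)
        _check_n_features(self, X)          # fail before any arithmetic
        return X - self.mean_
```

Running the check before any arithmetic means a column mismatch is reported in terms of features rather than as a broadcasting error.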

What already exists (leverage, don't duplicate)

  • _sparsity.validate_mask_compatibility() — mask/data shape and sparsity checking
  • _rms._validate_shapes() — loadings/scores 2-D and dimension compatibility
  • _full_update._safe_cholesky() — numerical safety
  • _pca_full._validate_dense_mask_budget() — memory budget guard
  • _remove_empty.remove_empty_entries() — strips empty rows/cols (but no guard for all-empty)

Depends on

  • Issue: Inherit from sklearn BaseEstimator + TransformerMixin

Acceptance criteria

  • VBPCA(n_components=5).fit(np.array([1,2,3])) raises a clear ValueError (input is not 2-D)
  • VBPCA(n_components=5).fit(np.array([["a","b"],["c","d"]])) raises a clear TypeError (input is not numeric)
  • VBPCA(n_components=0).fit(X) raises ValueError
  • All-NaN input raises ValueError with message mentioning NaN
  • Empty array raises ValueError
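The criteria above can also be written as a small table-driven check. The sketch takes a fit callable instead of importing VBPCA, since the import path is not given here. The names acceptance_cases and run_acceptance are hypothetical, as are the concrete inputs chosen where the criteria leave them open (the valid X, the 4x3 all-NaN array, and the 0x3 empty array).

```python
import numpy as np


def acceptance_cases():
    """Return (label, n_components, X, expected exception, message fragment)."""
    valid = np.arange(12, dtype=float).reshape(4, 3)  # assumed valid X
    return [
        ("1-D input", 5, np.array([1, 2, 3]), ValueError, ""),
        ("string input", 5, np.array([["a", "b"], ["c", "d"]]), TypeError, ""),
        ("n_components=0", 0, valid, ValueError, ""),
        ("all-NaN input", 2, np.full((4, 3), np.nan), ValueError, "nan"),
        ("empty input", 1, np.empty((0, 3)), ValueError, ""),
    ]


def run_acceptance(fit):
    """Call fit(n_components, X) for each case and return failure messages."""
    failures = []
    for label, k, X, exc_type, fragment in acceptance_cases():
        try:
            fit(k, X)
        except exc_type as err:
            # Right exception type; also check the message where required.
            if fragment not in str(err).lower():
                failures.append(f"{label}: message lacks {fragment!r}")
        except Exception as err:
            failures.append(
                f"{label}: raised {type(err).__name__}, "
                f"expected {exc_type.__name__}"
            )
        else:
            failures.append(f"{label}: no exception raised")
    return failures
```

Once the estimator is importable, it could be wired in with something like run_acceptance(lambda k, X: VBPCA(n_components=k).fit(X)); an empty list means every criterion holds.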

Labels

feature (New feature or capability), testing (Test coverage and test infrastructure)
