Bug in ColumnTransformer #962

@aparnakesarkar

Description

I have a straightforward use case: in a pandas DataFrame, label-encode some columns (with OrdinalEncoder), one-hot encode some columns, pass some columns through unchanged, and drop the remainder.

Code:

from dask_ml.compose import ColumnTransformer
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

df = pd.read_csv('path/to/csv')

ordinal_cols = [...]      # list of ordinal columns
nominal_cols = [...]      # list of nominal columns
passthrough_cols = [...]  # list of passthrough columns

transformers = [
    ("ordinal_encoding", OrdinalEncoder(), ordinal_cols),
    ("onehot_encoding", OneHotEncoder(), nominal_cols),
    ('select', 'passthrough', passthrough_cols)
]

preprocessor = ColumnTransformer(transformers=transformers)
df_t = preprocessor.fit_transform(df)

This failed with the following traceback:

Traceback (most recent call last):
  File ".../helpers/pydev/pydevd.py", line 1496, in _exec
    pydev_imports.execfile(file, globals, locals)  # execute the script
  File ".../python/helpers/pydev/_pydev_imps/_pydev_execfile.py", line 18, in execfile
    exec(compile(contents+"\n", file, 'exec'), glob, loc)
  File ".../dask_testing.py", line 80, in <module>
    df_t = preprocessor.fit_transform(df)
  File ".../lib/python3.8/site-packages/sklearn/utils/_set_output.py", line 142, in wrapped
    data_to_wrap = f(self, X, *args, **kwargs)
  File ".../lib/python3.8/site-packages/sklearn/utils/_set_output.py", line 142, in wrapped
    data_to_wrap = f(self, X, *args, **kwargs)
  File ".../lib/python3.8/site-packages/sklearn/compose/_column_transformer.py", line 750, in fit_transform
    return self._hstack(list(Xs))
  File ".../lib/python3.8/site-packages/dask_ml/compose/_column_transformer.py", line 198, in _hstack
    return pd.concat(Xs, axis="columns")
  File ".../lib/python3.8/site-packages/pandas/util/_decorators.py", line 331, in wrapper
    return func(*args, **kwargs)
  File ".../lib/python3.8/site-packages/pandas/core/reshape/concat.py", line 368, in concat
    op = _Concatenator(
  File ".../lib/python3.8/site-packages/pandas/core/reshape/concat.py", line 458, in __init__
    raise TypeError(msg)
TypeError: cannot concatenate object of type '<class 'numpy.ndarray'>'; only Series and DataFrame objs are valid

On further debugging, the three steps in the transformer produce three different output types:

  1. OrdinalEncoder() gives a 2D numpy.ndarray
  2. OneHotEncoder() gives a csr_matrix
  3. "passthrough" gives a DataFrame
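The mismatch can be reproduced without dask-ml. The following is a minimal sketch, assuming a small made-up DataFrame (the column names `size`, `color`, and `price` are hypothetical and stand in for the real CSV). It runs each step on its own with default settings and then attempts the same `pd.concat` call that appears in the traceback:

```python
import numpy as np
import pandas as pd
from scipy import sparse
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Hypothetical toy data standing in for the real CSV.
df = pd.DataFrame({
    "size": ["S", "M", "L", "M"],
    "color": ["red", "blue", "red", "green"],
    "price": [1.0, 2.5, 3.0, 2.0],
})

# Run each step separately, using the encoders' default settings.
ordinal_out = OrdinalEncoder().fit_transform(df[["size"]])
onehot_out = OneHotEncoder().fit_transform(df[["color"]])
passthrough_out = df[["price"]]

print(type(ordinal_out))            # dense NumPy array
print(sparse.issparse(onehot_out))  # True: a SciPy sparse matrix
print(type(passthrough_out))        # pandas DataFrame

# pd.concat only accepts Series and DataFrame objects, so mixing in
# an ndarray raises the TypeError seen in the traceback.
try:
    pd.concat([ordinal_out, passthrough_out], axis="columns")
    concat_error = None
except TypeError as exc:
    concat_error = str(exc)
print(concat_error)
```

Because one of the outputs is a DataFrame, the list of outputs contains mixed types, and a plain `pd.concat` over that list cannot succeed.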

The failure occurs in the dask-ml package at .../python3.8/site-packages/dask_ml/compose/_column_transformer.py, line 198, where it tries to concatenate the three different types into a single output DataFrame.

Code snippet:

elif self.preserve_dataframe and (pd.Series in types or pd.DataFrame in types):
    return pd.concat(Xs, axis="columns")
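For illustration only, here is a sketch of how mixed outputs could be normalized to DataFrames before concatenation. This is not dask-ml's implementation or a proposed patch; `to_frame` and `hstack_as_frame` are hypothetical helpers, and the generated column names (`ordinal_0`, `onehot_0`, ...) are an assumption made for the example:

```python
import numpy as np
import pandas as pd
from scipy import sparse


def to_frame(block, index, prefix):
    """Hypothetical helper: coerce one transformer output to pandas."""
    if isinstance(block, (pd.DataFrame, pd.Series)):
        return block
    if sparse.issparse(block):
        # Densify; this may be costly for wide one-hot blocks.
        block = block.toarray()
    block = np.asarray(block)
    columns = [f"{prefix}_{i}" for i in range(block.shape[1])]
    return pd.DataFrame(block, index=index, columns=columns)


def hstack_as_frame(blocks, index, prefixes):
    """Hypothetical helper: concatenate mixed outputs column-wise."""
    frames = [to_frame(b, index, p) for b, p in zip(blocks, prefixes)]
    return pd.concat(frames, axis="columns")


# Toy stand-ins for the three output types described in the issue.
index = pd.RangeIndex(3)
blocks = [
    np.array([[0.0], [1.0], [2.0]]),                        # ndarray
    sparse.csr_matrix(np.eye(3)),                           # sparse matrix
    pd.DataFrame({"price": [1.0, 2.5, 3.0]}, index=index),  # DataFrame
]
result = hstack_as_frame(blocks, index, ["ordinal", "onehot", "select"])
print(result.shape)
```

Reusing the input's index for every converted block keeps the rows aligned during `pd.concat`; without a shared index, pandas would align on mismatched labels and introduce NaN rows.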

Anything else we need to know?:
The shape of my data is (1000, 1076):

  • label encoding: 109 columns
  • one-hot encoding: 1 column
  • passthrough: the rest of the columns

I do not want to use the remainder="passthrough" parameter; I want to specify the passthrough columns in the transformers list.

Environment:

  • Dask version:
dask               2023.1.0
dask-glm           0.2.0
dask-ml            2022.5.27
  • Python version: 3.8
  • Operating System: MacOS
  • Install method (conda, pip, source): pip
