Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Learning when drop_zeros=False #1595

Open
ColdTeapot273K opened this issue Aug 23, 2024 · 1 comment
Open

Learning when drop_zeros=False #1595

ColdTeapot273K opened this issue Aug 23, 2024 · 1 comment
Labels

Comments

@ColdTeapot273K
Copy link
Contributor

Versions

river version: recent main (d606d7b)
Python version: Python 3.11.8
Operating system: Fedora Linux

Describe the bug

These 4 interesting lines effectively stop OneHotEncoder from learning when drop_zeros=False:

Steps/code to reproduce

Setup:

# Sample code to reproduce the problem
>>> from pprint import pprint
>>> import random
>>> import string

>>> random.seed(42)
>>> alphabet = list(string.ascii_lowercase)
>>> X = [
...     {
...         'c1': random.choice(alphabet),
...         'c2': random.choice(alphabet),
...     }
...     for _ in range(4)
... ]

>>> from river import preprocessing
>>> oh = preprocessing.OneHotEncoder(drop_zeros=True)
>>> for x in X:
...:     oh.learn_one(x)
...:     pprint(oh.transform_one(x))
...: 
{'c1_u': 1, 'c2_d': 1}
{'c1_a': 1, 'c2_x': 1}
{'c1_i': 1, 'c2_h': 1}
{'c1_h': 1, 'c2_e': 1}

Actual result:

>>> oh.values
defaultdict(set, {})

Expected result:

>>> oh.values
defaultdict(set, {'c1': {'a', 'h', 'i', 'u'}, 'c2': {'d', 'e', 'h', 'x'}})
@MaxHalford MaxHalford added the Bug label Aug 25, 2024
@MaxHalford
Copy link
Member

Hey there @ColdTeapot273K!

Would you say this is a bug? The idea here is that there is no need to keep track of values when drop_zeros=True. Indeed, if drop_zeros=False, then we need to keep track of values in order to add zeros. But if not, there is no need to keep track, which is why we return directly in learn_one and learn_many.

What am I missing? Maybe it would be preferable if oh.values was a private property?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants