Support Categorical Values directly #45

Open
schlichtanders opened this issue Feb 28, 2024 · 3 comments · May be fixed by #54

Comments

@schlichtanders

Motivation and description

In data science, CategoricalArrays.CategoricalValue, CategoricalArrays.CategoricalVector, and similar types appear often (RDatasets loads DataFrames with columns of that type by default).

It would be great if onehotbatch could be applied to them directly.

I have only just started using this package and am still figuring out how to transform such a categorical value or vector into a one-hot vector or matrix. It is quite possible that I missed something obvious.

Possible Implementation

No response

@mcabbott
Member

Attempting to construct the minimal object:

julia> using CategoricalArrays, OneHotArrays

julia> cv = CategoricalArrays.CategoricalValue('b', CategoricalArray('a':'z'))
CategoricalValue{Char, UInt32} 'b'

julia> dump(cv)
CategoricalValue{Char, UInt32}
  pool: CategoricalPool{Char, UInt32, CategoricalValue{Char, UInt32}}
    levels: Array{Char}((26,))
      1: Char 'a'
      2: Char 'b'
      3: Char 'c'
      4: Char 'd'
      5: Char 'e'
      ...
      22: Char 'v'
      23: Char 'w'
      24: Char 'x'
      25: Char 'y'
      26: Char 'z'
    invindex: Dict{Char, UInt32}
      slots: Memory{UInt8}
        length: Int64 64
        ptr: Ptr{Nothing} @0x0000000160607020
    ...

julia> cv.pool.levels
26-element Vector{Char}:
 'a': ASCII/Unicode U+0061 (category Ll: Letter, lowercase)
 'b': ASCII/Unicode U+0062 (category Ll: Letter, lowercase)
 'c': ASCII/Unicode U+0063 (category Ll: Letter, lowercase)
 'd': ASCII/Unicode U+0064 (category Ll: Letter, lowercase)
...

julia> Int(cv.ref), length(cv.pool.levels)
(2, 26)

julia> OneHotArrays.onehot(cv::CategoricalValue) = OneHotVector(cv.ref, length(cv.pool.levels))

julia> onehot(cv)
26-element OneHotVector(::UInt32) with eltype Bool:
 ⋅
 1
 ⋅
 ⋅
 ⋅
 ⋅
...

julia> dump(onehot(cv))
OneHotVector{UInt32}
  indices: UInt32 0x00000002
  nlabels: Int64 26

Are these two integers all that's required, or are there more complicated examples?
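
One way to write the same idea against exported accessors, rather than the cv.ref and cv.pool fields, is sketched below. It assumes that levelcode and levels from CategoricalArrays give the integer position and the level list. The names categorical_onehot and categorical_onehotbatch are hypothetical, used only for illustration and not part of either package, and missing values are not handled.

```julia
using CategoricalArrays, OneHotArrays

# Hypothetical helpers (assumptions for illustration, not package API).

# Scalar case: the level code is the hot index, the number of levels is the length.
categorical_onehot(cv::CategoricalValue, nlevels::Int) =
    OneHotVector(UInt32(levelcode(cv)), nlevels)

# Vector case: one column per element, one row per level.
# UInt32.(...) will throw if the vector contains missing values.
categorical_onehotbatch(v::CategoricalVector) =
    OneHotMatrix(UInt32.(levelcode.(v)), length(levels(v)))

# Usage
v = categorical(["b", "a", "c", "a"])      # levels default to the sorted unique values
x = categorical_onehot(v[1], length(levels(v)))
m = categorical_onehotbatch(v)

size(m)                  # (3, 4)
onecold(m, levels(v))    # maps the columns back to level labels
```

The vector version never looks up labels; it reuses the integer codes the categorical array already stores, which is the same shortcut as the two-integer construction above.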

@schlichtanders
Author

I think this is all, but I am not an expert on CategoricalArrays.

@mcabbott
Member

mcabbott commented May 2, 2025

See #54 for a start. Probably need someone to come up with a list of CategoricalArrays examples worth testing.
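
A possible starting list of such cases, written as a test sketch. The helper to_onehot is hypothetical and goes through the existing onehotbatch(values, labels) method via unwrap and levels. It is not the API proposed in #54, and the cases are suggestions rather than an agreed test plan.

```julia
using CategoricalArrays, OneHotArrays, Test

# Hypothetical stand-in for whatever conversion is eventually supported.
to_onehot(v) = onehotbatch(unwrap.(v), levels(v))

@testset "CategoricalArrays candidates" begin
    # 1. Plain vector: one row per level, one column per element.
    v = categorical(["b", "a", "c", "a"])
    @test size(to_onehot(v)) == (3, 4)
    @test onecold(to_onehot(v), levels(v)) == ["b", "a", "c", "a"]

    # 2. Unused level: "z" is declared but never occurs, and should still get a row.
    w = categorical(["a", "b"]; levels = ["a", "b", "z"])
    @test size(to_onehot(w)) == (3, 2)

    # 3. Custom level order: rows should follow the declared order, not sort order.
    o = categorical(["lo", "hi"]; levels = ["lo", "hi"], ordered = true)
    @test to_onehot(o)[1, 1]

    # 4. Missing values: unwrap.(m) contains missing, which is not among levels(m),
    #    so this case needs a decision (throw an error, or map to a default label).
    # m = categorical(["a", missing, "b"])
end
```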

@mcabbott linked a pull request May 2, 2025 that will close this issue