Enhancement: Implement VectorizedScalarExtractor #139

@simonmandlik

Implement a VectorizedScalarExtractor structure, with which we could group the following extractors into a single node:

julia> schema(JSON.parse.([
  """ { "a": 1, "b": 2, "c": 3 } """,
  """ { "b": 3, "c": 4 } """
  ])) |> suggestextractor
DictExtractor
  ├── a: StableExtractor(CategoricalExtractor(n=2))
  ├── b: CategoricalExtractor(n=3)
  ╰── c: CategoricalExtractor(n=3)

This is mainly an optimization.

The corresponding suggestextractor rule is already written, but commented out.

A possible starting point is the old implementation; we just need to port it to the new version. The old version doesn't implement normalization the way ScalarExtractor does, which would also be desirable in the new version.
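To make the normalization point concrete, here is a minimal sketch of what a vectorized extractor with per-key normalization could look like. Everything in it is an assumption for illustration: the names `VectorizedScalarSketch` and `extract_sketch` are hypothetical and not part of JsonGrinder.jl, normalization is assumed to mean "subtract a per-key center, then multiply by a per-key scale", and absent keys are assumed to map to `missing`.

```julia
# Hypothetical sketch: none of these names are JsonGrinder.jl API.
struct VectorizedScalarSketch
    keys::Vector{String}      # keys grouped into a single node
    center::Vector{Float32}   # per-key offset, subtracted before scaling (assumed)
    scale::Vector{Float32}    # per-key multiplicative factor (assumed)
end

# Extract one sample into a column of length(e.keys); absent keys become `missing`.
function extract_sketch(e::VectorizedScalarSketch, sample::AbstractDict)
    out = Vector{Union{Missing,Float32}}(undef, length(e.keys))
    for (i, k) in enumerate(e.keys)
        v = get(sample, k, missing)
        out[i] = ismissing(v) ? missing : e.scale[i] * (Float32(v) - e.center[i])
    end
    return out
end

# Extract a batch into a (keys x samples) matrix by concatenating the columns.
extract_sketch(e::VectorizedScalarSketch, samples::AbstractVector) =
    reduce(hcat, [extract_sketch(e, s) for s in samples])

# The two documents from the example above, already parsed into dictionaries.
e = VectorizedScalarSketch(["a", "b", "c"], Float32[0, 2, 3], Float32[1, 0.5, 1])
samples = [Dict("a" => 1, "b" => 2, "c" => 3), Dict("b" => 3, "c" => 4)]
X = extract_sketch(e, samples)
```

The point of the sketch is only that one node produces one matrix for all grouped keys, instead of one array per key. How missing keys interact with StableExtractor wrapping is left open here.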

As the example shows, we also need to make sure that StableExtractor wrapping works as expected.

One important design decision is how to build the extractor tree. If we replace several ScalarExtractor nodes with a single VectorizedScalarExtractor, the extractor will have fewer nodes than, e.g., the corresponding Schema, which would break things like HierarchicalUtils.jl traversal code. On the other hand, such an extractor would have the same structure as the resulting model. We should test what exactly the implications are here. One possible solution is to add "ghost nodes" as children of VectorizedScalarExtractor that would never be used, but would complete the tree.
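The node-count mismatch and the "ghost node" idea can be illustrated with a toy tree. This is only a sketch under stated assumptions: the types `Leaf`, `DictNode`, `Ghost`, `VectorizedNode` and the functions `kids` and `nnodes` are hypothetical stand-ins, not JsonGrinder.jl or HierarchicalUtils.jl API, and the vectorized node is assumed to stand in for the whole dictionary.

```julia
# Hypothetical sketch: illustrative types only, not JsonGrinder.jl or HierarchicalUtils.jl API.
abstract type Node end

struct Leaf <: Node end                 # stands in for a scalar leaf in the schema

struct DictNode <: Node
    children::Dict{String,Node}
end

struct Ghost <: Node end                # placeholder child, never used for extraction

struct VectorizedNode <: Node
    keys::Vector{String}
    with_ghosts::Bool                   # whether to expose ghost children
end

# Children as key => node pairs, in sorted key order for deterministic traversal.
kids(::Leaf) = Pair{String,Node}[]
kids(::Ghost) = Pair{String,Node}[]
kids(n::DictNode) = Pair{String,Node}[k => n.children[k] for k in sort(collect(keys(n.children)))]
kids(n::VectorizedNode) =
    n.with_ghosts ? Pair{String,Node}[k => Ghost() for k in sort(n.keys)] : Pair{String,Node}[]

# Total number of nodes in the tree rooted at n.
nnodes(n::Node) = 1 + sum(p -> nnodes(last(p)), kids(n); init=0)

schema  = DictNode(Dict{String,Node}("a" => Leaf(), "b" => Leaf(), "c" => Leaf()))
plain   = VectorizedNode(["a", "b", "c"], false)
ghosted = VectorizedNode(["a", "b", "c"], true)

println(nnodes(schema), " ", nnodes(plain), " ", nnodes(ghosted))
```

Without ghost children the vectorized tree has a single node where the schema has four, so any code that walks both trees in lockstep would go out of sync. With ghost children the shapes and child keys line up again, at the cost of nodes that carry no behavior.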


Labels: enhancement
