Enhancement: Implement `VectorizedScalarExtractor` #139

Implement a `VectorizedScalarExtractor` struct that would let us group the leaf extractors in the following example into a single node:

```julia
julia> schema(JSON.parse.([
  """ { "a": 1, "b": 2, "c": 3 } """,
  """ { "b": 3, "c": 4 } """
  ])) |> suggestextractor
DictExtractor
  ├── a: StableExtractor(CategoricalExtractor(n=2))
  ├── b: CategoricalExtractor(n=3)
  ╰── c: CategoricalExtractor(n=3)
```

This is mainly an optimization.

The corresponding [`suggestextractor` rule](https://github.com/CTUAvastLab/JsonGrinder.jl/blob/e8a42a4b8d3a0b2df320cc04c691cf3133e52a04/src/extractors/extractor.jl#L199-L201) is already written, but commented out.

A possible starting point is the [old implementation](https://github.com/CTUAvastLab/JsonGrinder.jl/blob/912147b56739bd5bfac0ed27f1530c91b7c9b276/src/extractors/extractvector.jl), which we just need to port to the new version. The old version doesn't implement normalization the way `ScalarExtractor` does, which would also be desirable in the new version.
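To make the idea concrete, here is a minimal sketch of what a grouped extractor with per-key normalization might look like. It uses only Base Julia and is not the JsonGrinder.jl API: the type `VectorizedScalarSketch`, the function `extract`, and the shift-and-scale form `s * (x - c)` are all assumptions made for illustration.

```julia
# Hypothetical sketch, not the JsonGrinder.jl API; all names are illustrative.
struct VectorizedScalarSketch
    keys::Vector{String}   # dictionary keys, one per output row
    c::Vector{Float32}     # per-key shift (assumed normalization parameter)
    s::Vector{Float32}     # per-key scale (assumed normalization parameter)
end

# Extract one sample (a Dict) into a single vector.
# Keys absent from the sample yield `missing`.
function extract(e::VectorizedScalarSketch, d::AbstractDict)
    map(e.keys, e.c, e.s) do k, c, s
        v = get(d, k, missing)
        ismissing(v) ? missing : s * (Float32(v) - c)
    end
end

# Identity normalization (shift 0, scale 1) over the keys from the example above.
e = VectorizedScalarSketch(["a", "b", "c"], zeros(Float32, 3), ones(Float32, 3))
x = extract(e, Dict("b" => 3, "c" => 4))   # [missing, 3.0f0, 4.0f0]
```

One node then produces a value for every grouped key at once, instead of one node per key. How missing keys should interact with `StableExtractor` is left open in this sketch.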

As the example shows, we also need to make sure that `StableExtractor` wrapping works as expected.

One important design decision is how to build the extractor tree. If we replace several `ScalarExtractor` nodes with a single `VectorizedScalarExtractor`, the extractor tree will have fewer nodes than, for example, the corresponding `Schema`, which could break code such as `HierarchicalUtils.jl` traversals. On the other hand, such an extractor would have the same structure as the resulting model. We should test what exactly the implications are. One possible solution is to add "ghost nodes" as children of `VectorizedScalarExtractor` that would never be used, but would complete the tree.
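The ghost-node idea can be sketched as follows. This is a toy model in Base Julia, not JsonGrinder.jl or `HierarchicalUtils.jl` code: `GhostLeaf`, `GroupedNode`, `children_of`, and `count_leaves` are hypothetical names used only to show how placeholder children could keep the leaf count aligned with the schema.

```julia
# Hypothetical types for illustration only.
struct GhostLeaf end            # placeholder leaf, never used during extraction

struct GroupedNode
    keys::Vector{String}
    ghosts::Vector{GhostLeaf}   # one placeholder per grouped key
end
GroupedNode(keys) = GroupedNode(keys, [GhostLeaf() for _ in keys])

# What a tree traversal would see: one child per original scalar leaf.
children_of(n::GroupedNode) = Dict(zip(n.keys, n.ghosts))
children_of(::GhostLeaf) = Dict{String,GhostLeaf}()

# Count leaves the way a generic traversal might.
count_leaves(n) = isempty(children_of(n)) ? 1 : sum(count_leaves, values(children_of(n)))

node = GroupedNode(["a", "b", "c"])
nleaves = count_leaves(node)    # 3, matching the three leaves in the schema
```

Extraction would ignore `ghosts` entirely; they exist only so that traversals pairing the extractor with the schema see the same shape.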
