Make better use of Spark's primitives for featurization

I'm pretty sure we don't do a good job of using Spark's primitives for training. For example, the data isn't distributed until after featurization.

Currently, model training works as follows:
1. prime the feature pipeline (not distributed) on raw flat file data
2. featurize the raw flat file data (not distributed)
3. create dataframes (data becomes distributed)
4. train the model

We could instead:
1. create dataframes of the raw data
2. prime the feature pipeline (distributed)
3. featurize the data (distributed)
4. train the model

This would let us handle larger datasets, but it will take some work to make feature-count aggregation, index creation, and similar priming steps run in a distributed fashion.
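
As a rough illustration of what distributed priming could look like, here is a minimal sketch in plain Python that simulates partitions with nested lists. It is not our pipeline's code: `seq_op`, `comb_op`, `prime`, `featurize`, and the `min_count` parameter are all hypothetical names, and records are assumed to be lists of string tokens. The point is that feature counting splits into a per-partition fold plus an associative merge, which is the shape that Spark's `RDD.aggregate(zeroValue, seqOp, combOp)` expects.

```python
from collections import Counter
from functools import reduce


def seq_op(counts, record):
    """Fold one raw record (a list of tokens) into a partition-local Counter."""
    counts.update(record)
    return counts


def comb_op(left, right):
    """Merge two partition-local Counters by summing their counts."""
    left.update(right)
    return left


def prime(partitions, min_count=1):
    """Build a feature -> index map from per-partition counts.

    `partitions` stands in for a distributed dataset: a list of partitions,
    each a list of records. On Spark, the two reduce calls below would
    roughly correspond to a single aggregate(Counter(), seq_op, comb_op).
    """
    partials = [reduce(seq_op, part, Counter()) for part in partitions]
    totals = reduce(comb_op, partials, Counter())
    # Sort so the index does not depend on partition order.
    vocab = sorted(f for f, c in totals.items() if c >= min_count)
    return {feature: i for i, feature in enumerate(vocab)}


def featurize(record, index):
    """Map one record to the sorted indices of its known features."""
    return sorted({index[t] for t in record if t in index})


# Two simulated partitions of raw records (made-up data).
partitions = [
    [["a", "b"], ["b", "c"]],
    [["c", "d"], ["b"]],
]
index = prime(partitions, min_count=2)
features = featurize(["a", "b", "c"], index)
```

Once the index is built, `featurize` only needs the (small) index and one record at a time, so it could run as a per-record map over the distributed data, with the index shipped to workers, for example as a broadcast variable.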

