Open
Description
I'm pretty sure we don't make good use of Spark's primitives for training. E.g., the data doesn't become distributed until after featurization.
Currently model training is as follows:
- feature pipeline priming (not distributed) with raw flat file data
- featurization (not distributed) on raw flat file data
- dataframe creation (data becomes distributed)
- model training
We could instead:
- create dataframes of raw data
- prime feature pipeline (distributed)
- featurize data (distributed)
- model training
This would let us handle larger datasets, but it will take some work to make feature-count aggregation, index creation, etc. run in a distributed fashion.
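As a rough illustration of what "distributed priming" could look like, here is a minimal sketch in plain Python, with lists standing in for Spark partitions. Every name in it (`count_partition`, `merge_counts`, `build_index`, `featurize`, and the sample data) is hypothetical and not taken from this codebase. The idea is that each partition counts its own features, the partial counts are merged with an associative operation, and the index is built once from the merged counts before featurizing each partition independently.

```python
from collections import Counter
from functools import reduce


def count_partition(rows):
    """Count feature values within one partition (would run on a single worker)."""
    counts = Counter()
    for row in rows:
        counts.update(row)
    return counts


def merge_counts(left, right):
    """Combine two partial counts; associative and commutative, so merge order is free."""
    return left + right


def build_index(counts, min_count=1):
    """Assign a deterministic integer index to each feature seen at least min_count times."""
    kept = sorted(f for f, c in counts.items() if c >= min_count)
    return {feature: i for i, feature in enumerate(kept)}


def featurize(row, index):
    """Map one raw row to sorted feature indices, dropping features not in the index."""
    return sorted(index[f] for f in row if f in index)


# Hypothetical raw data, already split into two partitions.
partitions = [
    [["red", "small"], ["blue", "small"]],
    [["red", "large"]],
]

# Priming: per-partition counts, then a merge of the partial results.
partial_counts = [count_partition(p) for p in partitions]
total_counts = reduce(merge_counts, partial_counts, Counter())
index = build_index(total_counts)

# Featurization: each partition only needs the (small) index, not the other partitions.
features = [[featurize(row, index) for row in p] for p in partitions]
```

Because the merge step is associative and commutative, the same shape should carry over to a per-partition aggregation followed by a reduce on an RDD, with the finished index shipped to workers for the featurization pass. How the existing feature pipeline would fit that shape is the open work this issue describes.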