Make better use of Spark's primitives for featurization #2

@skrawcz

I'm pretty sure we don't make good use of Spark's primitives for training. For example, the data does not become distributed until after featurization.

Currently, model training works as follows:

  1. feature pipeline priming (not distributed) with raw flat file data
  2. featurization (not distributed) on raw flat file data
  3. dataframe creation (data becomes distributed)
  4. model training

We could instead:

  1. create dataframes of raw data
  2. prime feature pipeline (distributed)
  3. featurize data (distributed)
  4. model training

This would let us handle larger datasets, but it will take some work to make the priming steps (aggregating feature counts, creating indexes, etc.) run in a distributed fashion.
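
To make the priming work concrete, here is a rough sketch of the shape it might take: count feature values per partition, merge the partial counts, then build the index from the merged result. It is plain Python standing in for Spark partitions, not the project's actual pipeline; all function names, the `min_count` parameter, and the sample data are hypothetical. In Spark, the same pattern would likely map onto something like an RDD `mapPartitions` followed by a `reduce`, but that mapping is an assumption.

```python
from collections import Counter
from functools import reduce


def count_partition(rows):
    """Count feature values within one partition; needs no data from other partitions."""
    counts = Counter()
    for row in rows:
        counts.update(row)
    return counts


def merge_counts(left, right):
    """Combine two partial counts. Associative and commutative, so merge order is irrelevant."""
    return left + right


def build_index(counts, min_count=1):
    """Assign a deterministic integer index to each value seen at least min_count times."""
    kept = sorted(value for value, n in counts.items() if n >= min_count)
    return {value: i for i, value in enumerate(kept)}


def featurize(row, index):
    """Map a raw row to the indices of its known values; unknown values are dropped."""
    return [index[value] for value in row if value in index]


# Hypothetical raw data, already split into partitions as a distributed dataframe would be.
partitions = [
    [["red", "small"], ["blue", "small"]],
    [["red", "large"]],
]

# Priming: per-partition counts, then a merge (the step that would run distributed).
total_counts = reduce(merge_counts, map(count_partition, partitions), Counter())
index = build_index(total_counts)

# Featurization: each partition only needs the (small) index, not the other partitions.
features = [[featurize(row, index) for row in part] for part in partitions]
```

The main design point is that the merge step is associative, so partial counts can be combined in any order, and featurization only needs the finished index shipped to each partition.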
