Using transform() to get train and test data #529
Might be a basic/silly question, but reading the documentation, it is my understanding that the …
Replies: 1 comment 2 replies
Hi @ArijitSinghEDA ,

Sorry for the late response. Just got back from the holidays.

Your understanding is right with regards to the transform() function. It is intended to run on a single worker of the cluster, while the operation of doing a train-test split needs to see the whole data. There are a few ways to do this. But I think it helps to think of Fugue as a mindset, and all of the solutions presented below will work on both small and big data.

ML with Big Data

If you are doing a train-test split, you must be doing machine learning afterwards. One way you can do the machine learning instead is by doing the train-test split on each partition of data, and running the machine learning for each partition. This gives you N scikit-learn models, where N is the number of partitions. Basically, you treat the big data problem as several small data problems. In the example below, the return value and schema are left as placeholders:

def train_model(df: pd.DataFrame) -> pd.DataFrame:
    train, test = train_test_split(df)
    model = SomeModel()
    model.fit(train)
    pred = model.predict(test.drop('y', axis=1))
    return ...
transform(big_df, train_model, schema="...", partition={"by": "group"}, engine=spark_session)

You can find more on this setup here. In some ways, this can be a more scalable approach to machine learning with big data. It also lets you use libraries made for small data, like scikit-learn. Otherwise, you have to use stuff like MLlib, which I have heard a lot of big data practitioners dislike because it's comparatively clunky. It's still an option, though.

Extending the approach done in the link above, you can have each partition yield its train and test splits:

def get_train_test_splits(df: pd.DataFrame) -> Iterable[Dict[str, Any]]:
    train, test = train_test_split(...)
    yield {"train": pickle.dumps(train), "test": pickle.dumps(test)}
transform(big_df, get_train_test_splits, schema="train:binary,test:binary", partition={"by": "group"}, engine=spark_session)

This will give you a train and test split per partition that you can then unpack and use a scikit-learn model on. One column will hold the train DataFrames and another will hold the test DataFrames. Note that pickle is not scalable for bigger DataFrames. More on this is in the link I mentioned. But nonetheless, if you want to use big data tooling for the ML model, you need to perform the train-test split on the full data.

Using FugueSQL

import fugue.api as fa
import pandas as pd
df = pd.DataFrame({"a": [1,2,3,4], "b": [1,2,3,4], "split": ["train", "train", "test", "test"]})
query = """
train = SELECT *
FROM df
WHERE split = 'train'
YIELD DATAFRAME
test = SELECT *
FROM df
WHERE split = 'test'
YIELD DATAFRAME
"""
engine = None  # None runs on pandas; pass spark_session to run on Spark
res = fa.fugue_sql_flow(query).run(engine=engine)
# dataframes
res["train"]
res["test"]

Fugue API

If you want Python, you can do:

from fugue.column import col, lit
import fugue.api as fa
with fa.engine_context(spark_session):
    train = fa.select(df, col("split"), where=(col("split") == lit("train")))
    fa.show(train)

Without train/test column

If you don't already have a column and just want to do a split, you can do something like:

import numpy as np
df = pd.DataFrame({"a": [1,2,3,4], "b": [1,2,3,4]})
# adds a random number between 0 and 1 to each row
def add_random(df: pd.DataFrame) -> pd.DataFrame:
    df['random'] = np.random.random(df.shape[0])
    return df
query = """
train = SELECT *
FROM rand
WHERE random > 0.8
YIELD DATAFRAME
test = SELECT *
FROM rand
WHERE random < 0.2
YIELD DATAFRAME
"""
with fa.engine_context(spark_session):
    rand = fa.transform(df, add_random, schema="*, random:float")
    res = fa.fugue_sql_flow(query).run()
    fa.show(res["train"])
    fa.show(res["test"])

I hope that gives you a starting point. There are a lot of ways to do this. If you need more clarification, please feel free to ask! I also didn't fully code some snippets above because it could get too long, but I'm happy to make one for your specific use case.
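For completeness, here is a self-contained sketch of the per-partition idea from the first section, using plain pandas and scikit-learn. The data, column names, and model are made up for illustration, and a list comprehension over groupby() stands in for Fugue's transform(..., partition={"by": "group"}); with Fugue you would pass the same function to transform() instead.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Hypothetical data: two partitions ("group"), one feature "x", one label "y"
df = pd.DataFrame({
    "group": ["a"] * 20 + ["b"] * 20,
    "x": np.arange(40, dtype=float),
})
df["y"] = 2 * df["x"] + 1

def train_model(pdf: pd.DataFrame) -> pd.DataFrame:
    # Split and fit one model per partition, return a one-row summary
    train, test = train_test_split(pdf, test_size=0.25, random_state=0)
    model = LinearRegression()
    model.fit(train[["x"]], train["y"])
    score = model.score(test[["x"]], test["y"])
    return pd.DataFrame({"group": [pdf["group"].iloc[0]], "score": [score]})

# Plain-pandas stand-in for transform(df, train_model, partition={"by": "group"})
scores = pd.concat(
    [train_model(pdf) for _, pdf in df.groupby("group")],
    ignore_index=True,
)
print(scores)
```

Each row of `scores` is one partition's held-out score, i.e. one small-data model per group.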
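And a small sketch of how the pickled splits from get_train_test_splits can be unpacked back into DataFrames downstream. The data here is made up, and the dict stands in for one output row of the transform() call with schema "train:binary,test:binary".

```python
import pickle

import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical data
df = pd.DataFrame({"x": range(8), "y": range(8)})

# Same shape as one yielded row of get_train_test_splits, without the Fugue wrapper
train, test = train_test_split(df, test_size=0.25, random_state=0)
row = {"train": pickle.dumps(train), "test": pickle.dumps(test)}

# Later, unpack the binary columns back into DataFrames
train_df = pickle.loads(row["train"])
test_df = pickle.loads(row["test"])
print(len(train_df), len(test_df))  # 6 2
```

As noted above, pickling whole DataFrames into columns does not scale to very large partitions.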