Using transform() to get train and test data #529
Might be a basic/silly question, but reading the documentation, it is my understanding that the …
Replies: 1 comment 2 replies
Hi @ArijitSinghEDA ,

Sorry for the late response. Just got back from the holidays.

Your understanding is right with regards to the transform() function. It is intended to run on a single worker of the cluster, while the operation of doing a train-test split needs to see the whole data. There are a few ways to do this. But I think it helps to think of Fugue as a mindset, and all of the solutions presented below will work on both small and big data.

ML with Big Data

If you are doing a train-test split, you must be doing machine learning afterwards. One way you can do the machine learning instead is by doing the train-test split on each partition of data, and running the machine learning for each partition. This gives you N scikit-learn models, where N is the number of partitions. Basically, you treat the big data problem as several small data problems. In the example below, the return value and schema are left as placeholders:

def train_model(df: pd.DataFrame) -> pd.DataFrame:
    train, test = train_test_split(df)
    model = SomeModel()
    model.fit(train)
    pred = model.predict(test.drop('y', axis=1))
    return ...
transform(big_df, train_model, schema="...", partition={"by": "group"}, engine=spark_session)

You can find more on this setup here. In some ways, this can be a more scalable approach to machine learning with big data. It also lets you use libraries made for small data, like scikit-learn. Otherwise, you have to use stuff like MLlib, which I have heard a lot of big data practitioners dislike because it's comparatively clunky. It's still an option, though.

Extending the approach done in the link above, you can have each partition yield its train and test splits:

def get_train_test_splits(df: pd.DataFrame) -> Iterable[Dict[str, Any]]:
    train, test = train_test_split(...)
    yield {"train": pickle.dumps(train), "test": pickle.dumps(test)}
transform(big_df, get_train_test_splits, schema="train:binary,test:binary", partition={"by": "group"}, engine=spark_session)

This will give you a train and test split per partition that you can then unpack and use a scikit-learn model on. One column will hold the train DataFrames and another will hold the test DataFrames. Note that pickle is not scalable for bigger DataFrames. More on this is in the link I mentioned. But nonetheless, if you want to use big data tooling for the ML model, you need to perform the train-test split on the full data.

Using FugueSQL

import fugue.api as fa
import pandas as pd
df = pd.DataFrame({"a": [1,2,3,4], "b": [1,2,3,4], "split": ["train", "train", "test", "test"]})
query = """
train = SELECT *
FROM df
WHERE split = 'train'
YIELD DATAFRAME
test = SELECT *
FROM df
WHERE split = 'test'
YIELD DATAFRAME
"""
engine = None  # None runs on pandas; pass spark_session to run on Spark
res = fa.fugue_sql_flow(query).run(engine=engine)
# dataframes
res["train"]
res["test"]

Fugue API

If you want Python, you can do:

from fugue.column import col, lit
import fugue.api as fa
with fa.engine_context(spark_session):
    train = fa.select(df, col("split"), where=(col("split") == lit("train")))
    fa.show(train)

Without train/test column

If you don't already have a column and just want to do a split, you can do something like:

import numpy as np
df = pd.DataFrame({"a": [1,2,3,4], "b": [1,2,3,4]})
# adds a random number between 0 and 1 to each row
def add_random(df: pd.DataFrame) -> pd.DataFrame:
    df['random'] = np.random.random(df.shape[0])
    return df
query = """
train = SELECT *
FROM rand
WHERE random > 0.8
YIELD DATAFRAME
test = SELECT *
FROM rand
WHERE random < 0.2
YIELD DATAFRAME
"""
with fa.engine_context(spark_session):
    rand = fa.transform(df, add_random, schema="*, random:float")
    res = fa.fugue_sql_flow(query).run()
    fa.show(res["train"])
    fa.show(res["test"])

I hope that gives you a starting point. There are a lot of ways to do this. If you need more clarification, please feel free to ask! I also didn't fully code some snippets above because it could get too long, but I'm happy to make one for your specific use case.
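For completeness, here is a self-contained sketch of the per-partition idea from the first section, using plain pandas and scikit-learn. The data, column names, and model are made up for illustration, and a list comprehension over groupby() stands in for Fugue's transform(..., partition={"by": "group"}); with Fugue you would pass the same function to transform() instead.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Hypothetical data: two partitions ("group"), one feature "x", one label "y"
df = pd.DataFrame({
    "group": ["a"] * 20 + ["b"] * 20,
    "x": np.arange(40, dtype=float),
})
df["y"] = 2 * df["x"] + 1

def train_model(pdf: pd.DataFrame) -> pd.DataFrame:
    # Split and fit one model per partition, return a one-row summary
    train, test = train_test_split(pdf, test_size=0.25, random_state=0)
    model = LinearRegression()
    model.fit(train[["x"]], train["y"])
    score = model.score(test[["x"]], test["y"])
    return pd.DataFrame({"group": [pdf["group"].iloc[0]], "score": [score]})

# Plain-pandas stand-in for transform(df, train_model, partition={"by": "group"})
scores = pd.concat(
    [train_model(pdf) for _, pdf in df.groupby("group")],
    ignore_index=True,
)
print(scores)
```

Each row of `scores` is one partition's held-out score, i.e. one small-data model per group.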
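And a small sketch of how the pickled splits from get_train_test_splits can be unpacked back into DataFrames downstream. The data here is made up, and the dict stands in for one output row of the transform() call with schema "train:binary,test:binary".

```python
import pickle

import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical data
df = pd.DataFrame({"x": range(8), "y": range(8)})

# Same shape as one yielded row of get_train_test_splits, without the Fugue wrapper
train, test = train_test_split(df, test_size=0.25, random_state=0)
row = {"train": pickle.dumps(train), "test": pickle.dumps(test)}

# Later, unpack the binary columns back into DataFrames
train_df = pickle.loads(row["train"])
test_df = pickle.loads(row["test"])
print(len(train_df), len(test_df))  # 6 2
```

As noted above, pickling whole DataFrames into columns does not scale to very large partitions.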