
Commit f2913c8

sarahyurick and ayushdg authored
Remove all Dask-ML uses (#886)
* initial pass
* remove all dask ml references
* remove use_dask
* fix failing test
* wrap in try/except
* add gpu test
* separate gpu test
* use gpu_client
* remove imports

Co-authored-by: Ayush Dattagupta <[email protected]>
1 parent 161e276 commit f2913c8

File tree

14 files changed: +60 −47 lines


.github/workflows/test-upstream.yml

Lines changed: 3 additions & 6 deletions

@@ -73,11 +73,10 @@ jobs:
           mamba install -c conda-forge "sasl>=0.3.1"
           docker pull bde2020/hive:2.3.2-postgresql-metastore
           docker pull bde2020/hive-metastore-postgresql:2.3.0
-      - name: Install upstream dev Dask / dask-ml
+      - name: Install upstream dev Dask
         if: env.which_upstream == 'Dask'
         run: |
           mamba update dask
-          python -m pip install --no-deps git+https://github.com/dask/dask-ml
       - name: Test with pytest
         run: |
           pytest --junitxml=junit/test-results.xml --cov-report=xml -n auto tests --dist loadfile
@@ -112,11 +111,10 @@ jobs:
           which python
           pip list
           mamba list
-      - name: Install upstream dev dask-ml
+      - name: Install upstream dev Dask
         if: env.which_upstream == 'Dask'
         run: |
           mamba update dask
-          python -m pip install --no-deps git+https://github.com/dask/dask-ml
       - name: run a dask cluster
         run: |
           if [[ $which_upstream == "Dask" ]]; then
@@ -161,12 +159,11 @@ jobs:
           which python
           pip list
           mamba list
-      - name: Install upstream dev Dask / dask-ml
+      - name: Install upstream dev Dask
         if: env.which_upstream == 'Dask'
         run: |
           python -m pip install --no-deps git+https://github.com/dask/dask
           python -m pip install --no-deps git+https://github.com/dask/distributed
-          python -m pip install --no-deps git+https://github.com/dask/dask-ml
       - name: Try to import dask-sql
         run: |
           python -c "import dask_sql; print('ok')"

.github/workflows/test.yml

Lines changed: 3 additions & 6 deletions

@@ -64,11 +64,10 @@ jobs:
           mamba install -c conda-forge "sasl>=0.3.1"
           docker pull bde2020/hive:2.3.2-postgresql-metastore
           docker pull bde2020/hive-metastore-postgresql:2.3.0
-      - name: Optionally install upstream dev Dask / dask-ml
+      - name: Optionally install upstream dev Dask
         if: needs.detect-ci-trigger.outputs.triggered == 'true'
         run: |
           mamba update dask
-          python -m pip install --no-deps git+https://github.com/dask/dask-ml
       - name: Test with pytest
         run: |
           pytest --junitxml=junit/test-results.xml --cov-report=xml -n auto tests --dist loadfile
@@ -108,11 +107,10 @@ jobs:
           which python
           pip list
           mamba list
-      - name: Optionally install upstream dev dask-ml
+      - name: Optionally install upstream dev Dask
         if: needs.detect-ci-trigger.outputs.triggered == 'true'
         run: |
           mamba update dask
-          python -m pip install --no-deps git+https://github.com/dask/dask-ml
       - name: run a dask cluster
         env:
           UPSTREAM: ${{ needs.detect-ci-trigger.outputs.triggered }}
@@ -153,12 +151,11 @@ jobs:
           which python
           pip list
           mamba list
-      - name: Optionally install upstream dev Dask / dask-ml
+      - name: Optionally install upstream dev Dask
         if: needs.detect-ci-trigger.outputs.triggered == 'true'
         run: |
           python -m pip install --no-deps git+https://github.com/dask/dask
           python -m pip install --no-deps git+https://github.com/dask/distributed
-          python -m pip install --no-deps git+https://github.com/dask/dask-ml
       - name: Try to import dask-sql
         run: |
           python -c "import dask_sql; print('ok')"

continuous_integration/environment-3.10-dev.yaml

Lines changed: 0 additions & 1 deletion

@@ -3,7 +3,6 @@ channels:
   - conda-forge
   - nodefaults
 dependencies:
-  - dask-ml>=2022.1.22
   - dask>=2022.3.0
   - fastapi>=0.69.0
   - fugue>=0.7.0

continuous_integration/environment-3.8-dev.yaml

Lines changed: 0 additions & 1 deletion

@@ -3,7 +3,6 @@ channels:
   - conda-forge
   - nodefaults
 dependencies:
-  - dask-ml=2022.1.22
   - dask=2022.3.0
   - fastapi=0.69.0
   - fugue=0.7.0

continuous_integration/environment-3.9-dev.yaml

Lines changed: 0 additions & 1 deletion

@@ -3,7 +3,6 @@ channels:
   - conda-forge
   - nodefaults
 dependencies:
-  - dask-ml>=2022.1.22
   - dask>=2022.3.0
   - fastapi>=0.69.0
   - fugue>=0.7.0

continuous_integration/gpuci/environment.yaml

Lines changed: 0 additions & 1 deletion

@@ -6,7 +6,6 @@ channels:
   - conda-forge
   - nodefaults
 dependencies:
-  - dask-ml>=2022.1.22
   - dask>=2022.3.0
   - fastapi>=0.69.0
   - fugue>=0.7.0

dask_sql/physical/rel/custom/predict.py

Lines changed: 19 additions & 4 deletions

@@ -2,6 +2,9 @@
 import uuid
 from typing import TYPE_CHECKING
 
+import dask.dataframe as dd
+import pandas as pd
+
 from dask_sql.datacontainer import ColumnContainer, DataContainer
 from dask_sql.physical.rel.base import BaseRelPlugin
@@ -30,8 +33,7 @@ class PredictModelPlugin(BaseRelPlugin):
     Please note however, that it will need to act on Dask dataframes. If you
     are using a model not optimized for this, it might be that you run out of memory if
     your data is larger than the RAM of a single machine.
-    To prevent this, have a look into the dask-ml package,
-    especially the [ParallelPostFit](https://ml.dask.org/meta-estimators.html)
+    To prevent this, have a look into the dask_sql.physical.rel.custom.wrappers.ParallelPostFit
     meta-estimator. If you are using a model trained with `CREATE MODEL`
     and the `wrap_predict` flag, this is done automatically.
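
(For context: the docstring now points readers at the vendored wrapper rather than dask-ml. A minimal usage sketch, assuming the vendored ParallelPostFit keeps dask-ml's estimator= constructor and blockwise-predict semantics; the toy data below is illustrative and not part of this commit:)

# Hypothetical sketch -- assumes the vendored ParallelPostFit mirrors
# dask-ml's interface; not taken verbatim from this commit.
import dask.dataframe as dd
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier

from dask_sql.physical.rel.custom.wrappers import ParallelPostFit

X = pd.DataFrame({"a": range(100), "b": range(100)})  # illustrative data
y = (X["a"] > 50).astype(int)

# Fit on in-memory data, then wrap the estimator so predict() runs
# partition-wise on Dask dataframes instead of pulling all the data
# onto a single machine
model = ParallelPostFit(estimator=GradientBoostingClassifier().fit(X, y))

ddf = dd.from_pandas(X, npartitions=4)
prediction = model.predict(ddf)  # lazy; evaluated block by block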
@@ -59,8 +61,21 @@ def convert(self, rel: "LogicalPlan", context: "dask_sql.Context") -> DataContai
5961

6062
model, training_columns = context.schema[schema_name].models[model_name]
6163
df = context.sql(sql_select)
62-
prediction = model.predict(df[training_columns])
63-
predicted_df = df.assign(target=prediction)
64+
try:
65+
prediction = model.predict(df[training_columns])
66+
predicted_df = df.assign(target=prediction)
67+
except TypeError:
68+
df = df.set_index(df.columns[0], drop=False)
69+
prediction = model.predict(df[training_columns])
70+
# Convert numpy.ndarray to Dask Series
71+
prediction = dd.from_pandas(
72+
pd.Series(prediction, index=df.index),
73+
npartitions=df.npartitions,
74+
)
75+
predicted_df = df.assign(target=prediction)
76+
# Need to drop first column to reset index
77+
# because the first column is equal to the index
78+
predicted_df = predicted_df.drop(columns=[df.columns[0]]).reset_index()
6479

6580
# Create a temporary context, which includes the
6681
# new "table" so that we can use the normal
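
(Why the try/except: a Dask-aware model returns a lazy collection that df.assign() accepts directly, while a plain scikit-learn model returns a numpy.ndarray that cannot be aligned with a Dask dataframe and raises a TypeError. A standalone sketch of the same fallback, with illustrative names; like the code above, it assumes the first column of df can serve as a unique index:)

import dask.dataframe as dd
import pandas as pd

def assign_prediction(df, model, training_columns):
    """Sketch of the fallback pattern used in convert() above."""
    try:
        # Dask-aware estimators return a lazy collection assign() can handle
        return df.assign(target=model.predict(df[training_columns]))
    except TypeError:
        # Plain sklearn estimators return a numpy.ndarray; set an explicit
        # index so the predictions can be aligned with the dataframe
        df = df.set_index(df.columns[0], drop=False)
        prediction = model.predict(df[training_columns])
        target = dd.from_pandas(
            pd.Series(prediction, index=df.index), npartitions=df.npartitions
        )
        # The first column now equals the index, so drop it and reset
        return df.assign(target=target).drop(columns=[df.columns[0]]).reset_index()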

docker/conda.txt

Lines changed: 0 additions & 1 deletion

@@ -16,7 +16,6 @@ uvicorn>=0.13.4
 pyarrow>=6.0.1
 prompt_toolkit>=3.0.8
 pygments>=2.7.1
-dask-ml>=2022.1.22
 scikit-learn>=1.0.0
 intake>=0.6.0
 pre-commit>=2.11.1

docker/main.dockerfile

Lines changed: 0 additions & 1 deletion

@@ -27,7 +27,6 @@ RUN mamba install -y \
     nest-asyncio \
     # additional dependencies
     "pyarrow>=6.0.1" \
-    "dask-ml>=2022.1.22" \
     "scikit-learn>=1.0.0" \
     "intake>=0.6.0" \
     && conda clean -ay

docs/source/machine_learning.rst

Lines changed: 3 additions & 5 deletions

@@ -125,8 +125,7 @@ following sql statements
 Want to increase the performance of your model by tuning the
 parameters? Use the hyperparameter tuning directly
 in SQL using below SQL syntax, choose different tuners
-from the dask_ml package based on memory and compute constraints and
-for more details refer to the `dask ml documentation <https://ml.dask.org/hyper-parameter-search.html#incremental-hyperparameter-optimization>`_
+based on memory and compute constraints.
 
 ..
     TODO - add a GPU section to these examples once we have working CREATE EXPERIMENT tests for GPU
@@ -135,7 +134,7 @@
 
     CREATE EXPERIMENT my_exp WITH (
         model_class = 'sklearn.ensemble.GradientBoostingClassifier',
-        experiment_class = 'dask_ml.model_selection.GridSearchCV',
+        experiment_class = 'sklearn.model_selection.GridSearchCV',
         tune_parameters = (n_estimators = ARRAY [16, 32, 2],
                            learning_rate = ARRAY [0.1,0.01,0.001],
                            max_depth = ARRAY [3,4,5,10]
@@ -258,7 +257,6 @@ and the boolean target ``label``.
     SELECT * FROM training_data
 
     -- We can now train a model from the sklearn package.
-    -- Make sure to install it together with dask-ml with conda or pip.
     CREATE OR REPLACE MODEL my_model WITH (
         model_class = 'sklearn.ensemble.GradientBoostingClassifier',
         wrap_predict = True,
@@ -282,7 +280,7 @@ and the boolean target ``label``.
     -- experiment to tune different hyperparameters
     CREATE EXPERIMENT my_exp WITH(
         model_class = 'sklearn.ensemble.GradientBoostingClassifier',
-        experiment_class = 'dask_ml.model_selection.GridSearchCV',
+        experiment_class = 'sklearn.model_selection.GridSearchCV',
         tune_parameters = (n_estimators = ARRAY [16, 32, 2],
                            learning_rate = ARRAY [0.1,0.01,0.001],
                            max_depth = ARRAY [3,4,5,10]
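
(For readers mapping the SQL to plain Python: the experiment above corresponds roughly to a scikit-learn grid search. A sketch, where X and y stand in for the in-memory training data and are not part of this commit:)

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Same search space as tune_parameters in the SQL above
search = GridSearchCV(
    GradientBoostingClassifier(),
    param_grid={
        "n_estimators": [16, 32, 2],
        "learning_rate": [0.1, 0.01, 0.001],
        "max_depth": [3, 4, 5, 10],
    },
)
search.fit(X, y)  # X, y: illustrative training data
print(search.best_params_)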
