FEAT-modin-project#6908: Remove the warning regarding engine initiali…

…zation Signed-off-by: Igoshev, Iaroslav <[email protected]>
YarShev · Feb 6, 2024 · 62e248e · 62e248e
1 parent 807298d
commit 62e248e
Show file tree

Hide file tree

Showing 20 changed files with 372 additions and 4,841 deletions.
diff --git a/docs/getting_started/quickstart.rst b/docs/getting_started/quickstart.rst
@@ -61,6 +61,7 @@ For the purpose of demonstration, we will load in modin as ``pd`` and pandas as
   #############################################
   import time
   import ray
+  # Look at the Ray documentation with respect to the Ray configuration suited to you most.
   ray.init()
   #############################################
 

diff --git a/docs/getting_started/troubleshooting.rst b/docs/getting_started/troubleshooting.rst
@@ -215,6 +215,7 @@ once Python interpreter is started in them so that to avoid a race condition in
   import modin.pandas as pd
   import modin.config as cfg
 
+  # Look at the Ray documentation with respect to the Ray configuration suited to you most.
   ray.init(runtime_env={'env_vars': {'__MODIN_AUTOIMPORT_PANDAS__': '1'}})
 
   pandas_df = pandas.DataFrame(
@@ -357,7 +358,9 @@ or
   cfg.Engine.put("dask")
 
   if __name__ == "__main__":
-    client = Client() # Explicit Dask Client creation.
+    # Explicit Dask Client creation.
+    # Look at the Dask Distributed documentation with respect to the Client configuration suited to you most.
+    client = Client()
     df = pd.DataFrame([0, 1, 2, 3])
     print(df)
 

diff --git a/docs/getting_started/using_modin/using_modin_locally.rst b/docs/getting_started/using_modin/using_modin_locally.rst
@@ -23,55 +23,6 @@ just like you would pandas, since the API is identical to pandas.
 
 **That's it. You're ready to use Modin on your previous pandas workflows!**
 
-Optional Configurations
------------------------
-
-When using Modin locally on a single machine or laptop (without a cluster), Modin will
-automatically create and manage a local Dask or Ray cluster for the executing your
-code. So when you run an operation for the first time with Modin, you will see a
-message like this, indicating that a Modin has automatically initialized a local
-cluster for you:
-
-.. code-block:: python
-
-  df = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4]})
-
-.. code-block:: text
-
-   UserWarning: Ray execution environment not yet initialized. Initializing...
-   To remove this warning, run the following python code before doing dataframe operations:
-
-    import ray
-    ray.init()
-
- If you prefer to use Dask over Ray as your execution backend, you can use the
- following code to modify the default configuration:
-
-.. code-block:: python
-
-   import modin
-   modin.config.Engine.put("Dask")
-
-.. code-block:: python
-
-   df = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4]})
-
-
-.. code-block:: text
-
-   UserWarning: Dask execution environment not yet initialized. Initializing...
-   To remove this warning, run the following python code before doing dataframe operations:
-
-      from distributed import Client
-
-      client = Client()
-
-Finally, if you already have an Ray or Dask engine initialized, Modin will
-automatically attach to whichever engine is available. If you are interested in using
-Modin with HDK engine, please refer to :doc:`these instructions </development/using_hdk>`.
-For additional information on other settings you can configure, see
-:doc:`Modin's config </flow/modin/config>` page for more details.
-
 Advanced: Configuring the resources Modin uses
 ----------------------------------------------
 

diff --git a/docs/usage_guide/advanced_usage/index.rst b/docs/usage_guide/advanced_usage/index.rst
@@ -12,6 +12,7 @@ Advanced Usage
    modin_xgboost
    modin_logging
    batch
+   modin_engines
 
 .. meta::
     :description lang=en:
@@ -22,6 +23,16 @@ integrated toolkit for data scientists. We are actively developing data science
 such as DataFrame spreadsheet integration, DataFrame algebra, progress bars, SQL queries
 on DataFrames, and more. Join us on `Slack`_ and `Discourse`_ for the latest updates!
 
+Modin engines
+-------------
+
+Modin supports a series of execution engines such as Ray_, Dask_, `MPI through unidist`_, `HDK`_,
+each of which might be a more beneficial choice for a specific scenario. When doing the first operation
+with Modin it automatically initializes one of the engines to further perform distributed/parallel computation.
+If you are familiar with a concrete execution engine, it is possible to initialize the engine on your own and
+Modin will automatically attach to it. Refer to :doc:`Modin engines </usage_guide/advanced_usage/modin_engines>` page
+for more details.
+
 Experimental APIs
 -----------------
 
@@ -118,3 +129,7 @@ downloaded as an artifact from the GitHub Actions tab for further inspection. Se
 .. _`tqdm`: https://github.com/tqdm/tqdm
 .. _`distributed XGBoost`: https://medium.com/intel-analytics-software/distributed-xgboost-with-modin-on-ray-fc17edef7720
 .. _`fuzzydata`: https://github.com/suhailrehman/fuzzydata
+.. _Ray: https://github.com/ray-project/ray
+.. _Dask: https://github.com/dask/distributed
+.. _`MPI through unidist`: https://github.com/modin-project/unidist
+.. _HDK: https://github.com/intel-ai/hdk
diff --git a/docs/usage_guide/advanced_usage/modin_engines.rst b/docs/usage_guide/advanced_usage/modin_engines.rst
@@ -0,0 +1,76 @@
+Modin engines
+=============
+
+As a rule, you don't have to worry about initialization of an execution engine as
+Modin itself automatically initializes one when performing the first operation.
+Also, Modin has a broad range of :doc:`configuration settings </flow/modin/config>`, which
+you can use to configure an execution engine. If there is a reason to initialize an execution engine
+on your own and you are sure what to do, Modin will automatically attach to whichever engine is available.
+Below, you can find some examples on how to initialize a specific execution engine on your own.
+
+Ray
+---
+
+You can initialize Ray engine with a specific number of CPUs (worker processes) to perform computation.
+
+.. code-block:: python
+
+  import ray
+  import modin.config as modin_cfg
+
+  ray.init(num_cpus=<N>)
+  modin_cfg.Engine.put("ray") # Modin will use Ray engine
+  modin_cfg.CpuCount.put(<N>)
+
+To get more details on all possible parameters for initialization refer to `Ray documentation`_.
+
+Dask
+----
+
+You can initialize Dask engine with a specific number of worker processes and threads per worker to perform computation.
+
+.. code-block:: python
+
+  from distributed import Client
+  import modin.config as modin_cfg
+
+  client = Client(n_workers=<N1>, threads_per_worker=<N2>)
+  modin_cfg.Engine.put("dask") # # Modin will use Dask engine
+  modin_cfg.CpuCount.put(<N1>)
+
+To get more details on all possible parameters for initialization refer to `Dask Distributed documentation`_.
+
+MPI through unidist
+-------------------
+
+You can initialize MPI thought unidist engine with a specific number of CPUs (worker processes) to perform computation.
+
+.. code-block:: python
+
+  import unidist
+  import unidist.config as unidist_cfg
+  import modin.config as modin_cfg
+
+  unidist_cfg.Backend.put("mpi")
+  unidist_cfg.CpuCount.put(<N>)
+  unidist.init()
+
+  modin_cfg.Engine.put("unidist") # # Modin will use MPI through unidist engine
+  modin_cfg.CpuCount.put(<N>)
+
+To get more details on all possible parameters for initialization refer to `unidist documentation`_.
+
+HDK
+---
+
+For now it is not possible to initialize HDK beforehand. Modin itself initializes it with the required configuration.
+
+.. code-block:: python
+
+  import modin.config as modin_cfg
+
+  modin_cfg.StorageFormat.put("hdk") # # Modin will use HDK engine
+
+.. _`Ray documentation`: https://docs.ray.io/en/latest
+.. _Dask Distributed documentation: https://distributed.dask.org/en/latest
+.. _`unidist documentation`: https://unidist.readthedocs.io/en/latest
diff --git a/docs/usage_guide/advanced_usage/modin_xgboost.rst b/docs/usage_guide/advanced_usage/modin_xgboost.rst
@@ -55,6 +55,7 @@ To start the Ray runtime on a single node:
 .. code-block:: python
 
   import ray
+  # Look at the Ray documentation with respect to the Ray configuration suited to you most.
   ray.init()
 
 If you already had the Ray cluster you can connect to it by next way:
@@ -78,6 +79,7 @@ All processing will be in a `single node` mode.
   from sklearn import datasets
 
   import ray
+  # Look at the Ray documentation with respect to the Ray configuration suited to you most.
   ray.init() # Start the Ray runtime for single-node
 
   import modin.pandas as pd

diff --git a/docs/usage_guide/benchmarking.rst b/docs/usage_guide/benchmarking.rst
@@ -35,6 +35,7 @@ Consider the following ipython script:
     import time
     import ray
 
+    # Look at the Ray documentation with respect to the Ray configuration suited to you most.
     ray.init()
     df = pd.DataFrame(list(range(MinPartitionSize.get() * 2)))
     %time result = df.map(lambda x: time.sleep(0.1) or x)
@@ -146,6 +147,7 @@ That will typically block on any asynchronous computation:
         time.sleep(10)
       return x + 1
 
+    # Look at the Ray documentation with respect to the Ray configuration suited to you most.
     ray.init()
     df1 = pd.DataFrame(list(range(10_000)), columns=['A'])
     result = df1.map(slow_add_one)