Skip to content

Commit

Permalink
FEAT-modin-project#6735: Make Modin on MPI through unidist component …
Browse files Browse the repository at this point in the history
…more obvious

Signed-off-by: Igoshev, Iaroslav <[email protected]>
  • Loading branch information
YarShev committed Nov 13, 2023
1 parent 8a332c1 commit 89c0473
Show file tree
Hide file tree
Showing 18 changed files with 106 additions and 52 deletions.
24 changes: 15 additions & 9 deletions .github/workflows/ci-notebooks.yml
Original file line number Diff line number Diff line change
Expand Up @@ -8,6 +8,7 @@ on:
- setup.cfg
- setup.py
- requirements/env_hdk.yml
- requirements/env_unidist.yml
concurrency:
# Cancel other jobs in the same branch. We don't care whether CI passes
# on old commits.
Expand All @@ -28,12 +29,17 @@ jobs:
steps:
- uses: actions/checkout@v3
- uses: ./.github/actions/python-only
if: matrix.execution != 'hdk_on_native'
if: matrix.execution != 'hdk_on_native' || matrix.execution != 'pandas_on_unidist'
- uses: ./.github/actions/mamba-env
with:
environment-file: requirements/env_hdk.yml
activate-environment: modin_on_hdk
if: matrix.execution == 'hdk_on_native'
- uses: ./.github/actions/mamba-env
with:
environment-file: requirements/env_unidist.yml
activate-environment: modin_on_unidist
if: matrix.execution == 'pandas_on_unidist'
- name: Cache datasets
uses: actions/cache@v2
with:
Expand All @@ -43,29 +49,29 @@ jobs:
# replace modin with . in the tutorial requirements file for `pandas_on_ray` and
# `pandas_on_dask` since we need Modin built from sources
- run: sed -i 's/modin/./g' examples/tutorial/jupyter/execution/${{ matrix.execution }}/requirements.txt
if: matrix.execution != 'hdk_on_native'
if: matrix.execution != 'hdk_on_native' || matrix.execution != 'pandas_on_unidist'
# install dependencies required for notebooks execution for `pandas_on_ray` and `pandas_on_dask`
# Override modin-spreadsheet install for now
- run: |
pip install -r examples/tutorial/jupyter/execution/${{ matrix.execution }}/requirements.txt
pip install git+https://github.com/modin-project/modin-spreadsheet.git@49ffd89f683f54c311867d602c55443fb11bf2a5
if: matrix.execution != 'hdk_on_native'
# Build Modin from sources for `hdk_on_native`
if: matrix.execution != 'hdk_on_native' || matrix.execution != 'pandas_on_unidist'
# Build Modin from sources for `hdk_on_native` and `pandas_on_unidist`
- run: pip install -e .
if: matrix.execution == 'hdk_on_native'
if: matrix.execution == 'hdk_on_native' || matrix.execution == 'pandas_on_unidist'
# install test dependencies
# NOTE: If you are changing the set of packages installed here, make sure that
# the dev requirements match them.
- run: pip install pytest pytest-cov black flake8 flake8-print flake8-no-implicit-concat
if: matrix.execution != 'hdk_on_native'
if: matrix.execution != 'hdk_on_native' || matrix.execution != 'pandas_on_unidist'
- run: pip install flake8-print jupyter nbformat nbconvert
if: matrix.execution == 'hdk_on_native'
if: matrix.execution == 'hdk_on_native' || matrix.execution == 'pandas_on_unidist'
- run: pip list
if: matrix.execution != 'hdk_on_native'
if: matrix.execution != 'hdk_on_native' || matrix.execution != 'pandas_on_unidist'
- run: |
conda info
conda list
if: matrix.execution == 'hdk_on_native'
if: matrix.execution == 'hdk_on_native' || matrix.execution == 'pandas_on_unidist'
# setup kernel configuration for `pandas_on_unidist` execution with mpi backend
- run: python examples/tutorial/jupyter/execution/${{ matrix.execution }}/setup_kernel.py
if: matrix.execution == 'pandas_on_unidist'
Expand Down
1 change: 0 additions & 1 deletion .github/workflows/ci.yml
Original file line number Diff line number Diff line change
Expand Up @@ -97,7 +97,6 @@ jobs:
run: |
MODIN_ENGINE=dask python -c "import modin.pandas as pd; print(pd.DataFrame([1,2,3]))"
MODIN_ENGINE=ray python -c "import modin.pandas as pd; print(pd.DataFrame([1,2,3]))"
MODIN_ENGINE=unidist UNIDIST_BACKEND=mpi mpiexec -n 1 python -c "import modin.pandas as pd; print(pd.DataFrame([1,2,3]))"
test-internals:
needs: [lint-flake8, lint-black-isort]
Expand Down
22 changes: 16 additions & 6 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -57,24 +57,30 @@ The charts below show the speedup you get by replacing pandas with Modin based o
Modin can be installed with `pip` on Linux, Windows and MacOS:

```bash
pip install "modin[all]" # (Recommended) Install Modin with all of Modin's currently supported engines.
pip install "modin[all]" # (Recommended) Install Modin with Ray and Dask engines.
```

If you want to install Modin with a specific engine, we recommend:

```bash
pip install "modin[ray]" # Install Modin dependencies and Ray.
pip install "modin[dask]" # Install Modin dependencies and Dask.
pip install "modin[unidist]" # Install Modin dependencies and Unidist.
pip install "modin[mpi]" # Install Modin dependencies and MPI through unidist.
```

To get Modin on MPI through unidist (as of unidist 0.5.0) fully working
it is required to have a working MPI implementation installed beforehand.
Otherwise, installation of `modin[mpi]` may fail. Refer to
[Installing with pip](https://unidist.readthedocs.io/en/latest/installation.html#installing-with-pip)
section of the unidist documentation for more details about installation.

Modin automatically detects which engine(s) you have installed and uses that for scheduling computation.

#### From conda-forge

Installing from [conda forge](https://github.com/conda-forge/modin-feedstock) using `modin-all`
will install Modin and four engines: [Ray](https://github.com/ray-project/ray), [Dask](https://github.com/dask/dask),
[Unidist](https://github.com/modin-project/unidist) and [HDK](https://github.com/intel-ai/hdk).
[MPI through unidist](https://github.com/modin-project/unidist) and [HDK](https://github.com/intel-ai/hdk).

```bash
conda install -c conda-forge modin-all
Expand All @@ -85,10 +91,14 @@ Each engine can also be installed individually (and also as a combination of sev
```bash
conda install -c conda-forge modin-ray # Install Modin dependencies and Ray.
conda install -c conda-forge modin-dask # Install Modin dependencies and Dask.
conda install -c conda-forge modin-unidist # Install Modin dependencies and Unidist.
conda install -c conda-forge modin-mpi # Install Modin dependencies and MPI through unidist.
conda install -c conda-forge modin-hdk # Install Modin dependencies and HDK.
```

Refer to
[Installing with conda](https://unidist.readthedocs.io/en/latest/installation.html#installing-with-conda)
section of the unidist documentation for more details on how to install a specific MPI implementation to run on.

To speed up conda installation we recommend using libmamba solver. To do this install it in a base environment:

```bash
Expand Down Expand Up @@ -119,7 +129,7 @@ export MODIN_ENGINE=unidist # Modin will use Unidist
```

If you want to choose the Unidist engine, you should set the additional environment
variable ``UNIDIST_BACKEND``, because currently Modin only supports Unidist on MPI:
variable ``UNIDIST_BACKEND``. Currently, Modin only supports MPI through unidist:

```bash
export UNIDIST_BACKEND=mpi # Unidist will use MPI backend
Expand All @@ -144,7 +154,7 @@ _Note: You should not change the engine after your first operation with Modin as

#### Which engine should I use?

On Linux, MacOS, and Windows you can install and use either Ray, Dask or Unidist. There is no knowledge required
On Linux, MacOS, and Windows you can install and use either Ray, Dask or MPI through unidist. There is no knowledge required
to use either of these engines as Modin abstracts away all of the complexity, so feel
free to pick either!

Expand Down
9 changes: 5 additions & 4 deletions docs/development/architecture.rst
Original file line number Diff line number Diff line change
Expand Up @@ -216,8 +216,8 @@ documentation page on :doc:`contributing </development/contributing>`.
- Uses the `Dask Futures`_ execution framework.
- The storage format is `pandas` and the in-memory partition type is a pandas DataFrame.
- For more information on the execution path, see the :doc:`pandas on Dask </flow/modin/core/execution/dask/implementations/pandas_on_dask/index>` page.
- :doc:`pandas on Unidist </development/using_pandas_on_unidist>`
- Uses the Unidist_ execution framework.
- :doc:`pandas on MPI </development/using_pandas_on_mpi>`
- Uses MPI_ through the Unidist_ execution framework.
- The storage format is `pandas` and the in-memory partition type is a pandas DataFrame.
- For more information on the execution path, see the :doc:`pandas on Unidist </flow/modin/core/execution/unidist/implementations/pandas_on_unidist/index>` page.
- :doc:`pandas on Python </development/using_pandas_on_python>`
Expand All @@ -228,8 +228,8 @@ documentation page on :doc:`contributing </development/contributing>`.
- Uses the Ray_ execution framework.
- The storage format is `pandas` and the in-memory partition type is a pandas DataFrame.
- For more information on the execution path, see the :doc:`experimental pandas on Ray </flow/modin/experimental/core/execution/ray/implementations/pandas_on_ray/index>` page.
- pandas on Unidist (experimental)
- Uses the Unidist_ execution framework.
- pandas on MPI (experimental)
- Uses MPI_ through the Unidist_ execution framework.
- The storage format is `pandas` and the in-memory partition type is a pandas DataFrame.
- For more information on the execution path, see the :doc:`experimental pandas on Unidist </flow/modin/experimental/core/execution/unidist/implementations/pandas_on_unidist/index>` page.
- pandas on Dask (experimental)
Expand Down Expand Up @@ -375,6 +375,7 @@ details. The documentation covers most modules, with more docs being added every
.. _Arrow tables: https://arrow.apache.org/docs/python/generated/pyarrow.Table.html
.. _Ray: https://github.com/ray-project/ray
.. _Unidist: https://github.com/modin-project/unidist
.. _MPI: https://www.mpi-forum.org/
.. _code: https://github.com/modin-project/modin/blob/master/modin/core/dataframe
.. _Dask: https://github.com/dask/dask
.. _Dask Futures: https://docs.dask.org/en/latest/futures.html
Expand Down
2 changes: 1 addition & 1 deletion docs/development/index.rst
Original file line number Diff line number Diff line change
Expand Up @@ -10,7 +10,7 @@ Development
using_pandas_on_ray
using_pandas_on_dask
using_pandas_on_python
using_pandas_on_unidist
using_pandas_on_mpi
using_hdk
using_pyarrow_on_ray

Expand Down
2 changes: 1 addition & 1 deletion docs/development/partition_api.rst
Original file line number Diff line number Diff line change
Expand Up @@ -36,7 +36,7 @@ in the worker process that processes a function (please, refer to `Dask document

Unidist engine
--------------
Currently, Modin only supports unidist on MPI backend. There is no mentioned above issue for
Currently, Modin only supports MPI through unidist. There is no mentioned above issue for
Modin on ``Unidist`` engine using ``MPI`` backend with ``pandas`` in-memory format
because ``Unidist`` saves any objects in the MPI worker process that processes a function
(please, refer to `Unidist documentation`_ for more information).
Expand Down
Original file line number Diff line number Diff line change
@@ -1,14 +1,14 @@
pandas on Unidist
=================
pandas on MPI through unidist
=============================

This section describes usage related documents for the pandas on Unidist component of Modin.
This section describes usage related documents for the pandas on MPI through unidist component of Modin.

Modin uses pandas as a primary memory format of the underlying partitions and optimizes queries
ingested from the API layer in a specific way to this format. Thus, there is no need to care of choosing it
but you can explicitly specify it anyway as shown below.

One of the execution engines that Modin uses is Unidist. Currently, Modin only supports Unidist on MPI backend.
To enable the pandas on Unidist execution using MPI backend you should set the following environment variables:
One of the execution engines that Modin uses is MPI through unidist.
To enable the pandas on MPI through unidist execution you should set the following environment variables:

.. code-block:: bash
Expand Down Expand Up @@ -36,4 +36,21 @@ To run a python application you should use ``mpiexec -n 1 python <script.py>`` c
For more information on how to run a python application with unidist on MPI backend
please refer to `Unidist on MPI`_ section of the unidist documentation.

As of unidist 0.5.0 there is support for a shared object store for MPI backend.
The feature allows to improve performance in the workloads,
where workers use same data multiple times by reducing data copies.
You can enable the feature by setting the following environment variable:

.. code-block:: bash
export UNIDIST_MPI_SHARED_OBJECT_STORE=True
or turn it on in source code:

.. code-block:: python
import unidist.config as unidist_cfg
unidist_cfg.MpiSharedObjectStore.put(True)
.. _`Unidist on MPI`: https://unidist.readthedocs.io/en/latest/using_unidist/unidist_on_mpi.html
Original file line number Diff line number Diff line change
Expand Up @@ -6,7 +6,8 @@ PandasOnUnidist Execution
Queries that perform data transformation, data ingress or data egress using the `pandas on Unidist` execution
pass through the Modin components detailed below.

To enable `pandas on Unidist` execution, please refer to the usage section in :doc:`pandas on Unidist </development/using_pandas_on_unidist>`.
To enable `pandas on MPI through unidist` execution,
please refer to the usage section in :doc:`pandas on MPI through unidist </development/using_pandas_on_mpi>`.

Data Transformation
'''''''''''''''''''
Expand Down
2 changes: 1 addition & 1 deletion docs/getting_started/faq.rst
Original file line number Diff line number Diff line change
Expand Up @@ -119,7 +119,7 @@ and Modin will do computation with that engine:
pip install "modin[dask]" # Install Modin dependencies and Dask to run on Dask
export MODIN_ENGINE=dask # Modin will use Dask
pip install "modin[unidist]" # Install Modin dependencies and Unidist to run on Unidist.
pip install "modin[mpi]" # Install Modin dependencies and MPI to run on MPI through unidist.
export MODIN_ENGINE=unidist # Modin will use Unidist
export UNIDIST_BACKEND=mpi # Unidist will use MPI backend.
Expand Down
27 changes: 19 additions & 8 deletions docs/getting_started/installation.rst
Original file line number Diff line number Diff line change
Expand Up @@ -30,8 +30,13 @@ Modin can be used with :doc:`Ray</development/using_pandas_on_ray>`, :doc:`Dask<
pip install "modin[ray]" # Install Modin dependencies and Ray to run on Ray
pip install "modin[dask]" # Install Modin dependencies and Dask to run on Dask
pip install "modin[unidist]" # Install Modin dependencies and Unidist to run on Unidist
pip install "modin[all]" # Install all of the above
pip install "modin[mpi]" # Install Modin dependencies and MPI to run on MPI through unidist
pip install "modin[all]" # Install Ray and Dask
To get Modin on MPI through unidist (as of unidist 0.5.0) fully working
it is required to have a working MPI implementation installed beforehand.
Otherwise, installation of ``modin[mpi]`` may fail. Refer to
`Installing with pip`_ section of the unidist documentation for more details about installation.

Modin will automatically detect which engine you have installed and use that for
scheduling computation! See below for HDK engine installation.
Expand Down Expand Up @@ -65,7 +70,7 @@ storage formats or for different functionalities of Modin. Here is a list of dep
.. code-block:: bash
pip install "modin[unidist]" # If you want to use the Unidist execution engine
pip install "modin[mpi]" # If you want to use MPI through unidist execution engine
Installing on Google Colab
"""""""""""""""""""""""""""
Expand Down Expand Up @@ -106,18 +111,18 @@ it is possible to install modin with chosen engine(s) alongside. Current options
+---------------------------------+---------------------------+-----------------------------+
| modin-ray | Ray_ | Linux, Windows |
+---------------------------------+---------------------------+-----------------------------+
| modin-unidist | Unidist_ | Linux, Windows, MacOS |
| modin-mpi | MPI_ through unidist_ | Linux, Windows, MacOS |
+---------------------------------+---------------------------+-----------------------------+
| modin-hdk | HDK_ | Linux |
+---------------------------------+---------------------------+-----------------------------+
| modin-all | Dask, Ray, Unidist, HDK | Linux |
+---------------------------------+---------------------------+-----------------------------+

For installing Dask, Ray and Unidist engines into conda environment following command should be used:
For installing Dask, Ray and MPI through unidist engines into conda environment following command should be used:

.. code-block:: bash
conda install -c conda-forge modin-ray modin-dask modin-unidist
conda install -c conda-forge modin-ray modin-dask modin-mpi
All set of engines could be available in conda environment by specifying:

Expand All @@ -129,7 +134,10 @@ or explicitly:

.. code-block:: bash
conda install -c conda-forge modin-ray modin-dask modin-unidist modin-hdk
conda install -c conda-forge modin-ray modin-dask modin-mpi modin-hdk
Refer to `Installing with conda`_ section of the unidist documentation
for more details on how to install a specific MPI implementation to run on.

``conda`` may be slow installing ``modin-all`` or combitations of execution engines so we currently recommend using libmamba solver for the installation process.
To do this install it in a base environment:
Expand Down Expand Up @@ -171,7 +179,7 @@ also use ``pip``.
This will install directly from the repo without you having to manually clone it! Please be aware
that these changes have not made it into a release and may not be completely stable.

If you would like to install Modin with a specific engine, you can use ``modin[ray]`` or ``modin[dask]`` or ``modin[unidist]`` instead of ``modin[all]`` in the command above.
If you would like to install Modin with a specific engine, you can use ``modin[ray]`` or ``modin[dask]`` or ``modin[mpi]`` instead of ``modin[all]`` in the command above.

Windows
-------
Expand Down Expand Up @@ -214,7 +222,10 @@ Once cloned, ``cd`` into the ``modin`` directory and use ``pip`` to install:
.. _WSL: https://docs.microsoft.com/en-us/windows/wsl/install-win10
.. _Ray: http://ray.readthedocs.io
.. _Dask: https://github.com/dask/dask
.. _MPI: https://www.mpi-forum.org/
.. _Unidist: https://github.com/modin-project/unidist
.. _`Installing with pip`: https://unidist.readthedocs.io/en/latest/installation.html#installing-with-pip
.. _`Installing with conda`: https://unidist.readthedocs.io/en/latest/installation.html#installing-with-conda
.. _HDK: https://github.com/intel-ai/hdk
.. _`Intel Distribution of Modin`: https://software.intel.com/content/www/us/en/develop/tools/oneapi/components/distribution-of-modin.html#gs.86stqv
.. _`Intel Distribution of Modin Getting Started`: https://www.intel.com/content/www/us/en/developer/articles/technical/intel-distribution-of-modin-getting-started-guide.html
Expand Down
4 changes: 2 additions & 2 deletions docs/index.rst
Original file line number Diff line number Diff line change
Expand Up @@ -61,7 +61,7 @@ of the targets:
pip install "modin[ray]" # Install Modin dependencies and Ray to run on Ray
pip install "modin[dask]" # Install Modin dependencies and Dask to run on Dask
pip install "modin[unidist]" # Install Modin dependencies and Unidist to run on Unidist
pip install "modin[mpi]" # Install Modin dependencies and MPI to run on MPI through unidist
pip install "modin[all]" # Install all of the above
Modin will automatically detect which engine you have installed and use that for
Expand All @@ -77,7 +77,7 @@ variable ``MODIN_ENGINE`` and Modin will do computation with that engine:
export MODIN_ENGINE=unidist # Modin will use Unidist
If you want to choose the Unidist engine, you should set the additional environment
variable ``UNIDIST_BACKEND``, because currently Modin only supports Unidist on MPI:
variable ``UNIDIST_BACKEND``, because currently Modin only supports MPI through unidist:

.. code-block:: bash
Expand Down
2 changes: 1 addition & 1 deletion examples/tutorial/jupyter/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -4,7 +4,7 @@ Currently we provide tutorial notebooks for the following execution backends:

- [PandasOnRay](https://modin.readthedocs.io/en/latest/development/using_pandas_on_ray.html)
- [PandasOnDask](https://modin.readthedocs.io/en/latest/development/using_pandas_on_dask.html)
- [PandasOnUnidist](https://modin.readthedocs.io/en/latest/development/using_pandas_on_unidist.html)
- [PandasOnMPI through unidist](https://modin.readthedocs.io/en/latest/development/using_pandas_on_mpi.html)
- [HdkOnNative](https://modin.readthedocs.io/en/latest/development/using_hdk.html)

## Creating a development environment
Expand Down
Original file line number Diff line number Diff line change
@@ -1,5 +1,4 @@
FROM continuumio/miniconda3

RUN conda env create -f jupyter_unidist_env.yml
RUN conda install -c conda-forge psutil setproctitle
RUN pip install -r requirements-dev.txt

Loading

0 comments on commit 89c0473

Please sign in to comment.