Domain decomposition and halo construction#540

Open
halungge wants to merge 327 commits into main from halo_construction

@halungge halungge commented Sep 6, 2024

Decompose (global) grid file:

  • uses pymetis to decompose the global grid (cells) into n patches
  • after decomposition, halos for all dimensions (cell, edge, vertex) are constructed. Halo construction is done in an ICON-like fashion: the halos consist of 2 cell levels (one upward-pointing and one downward-pointing line) and the corresponding vertices and edges on these lines.

Omissions:

  • LAM grids need to be investigated further:

    • tests comparing decomposed vs. single_node computation are only run on the global grid.
    • for LAM grids, ICON reorders arrays so that halo points on the first boundary layers are stored together with the boundary layers; it should be investigated whether that is essential in the model.
    • This PR takes this into account only in the computation of start_index and end_index, not in the halo construction.
  • the number of halo lines (in terms of cells) is hardcoded to 2; that could be made a parameter.

  • Not sure it all runs correctly on GPU... most probably there are some numpy/cupy issues to fix.
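
The two halo lines described above can be illustrated with a small sketch. It is a simplification and not the PR's implementation: it assumes a cell-to-cell neighbor table and an ownership list (such as the membership a graph partitioner returns) and grows the halo breadth-first, which is not necessarily how ICON or this PR selects halo cells. The names `HaloFlag` and `build_cell_halo` are hypothetical.

```python
from enum import IntEnum


class HaloFlag(IntEnum):
    """Hypothetical markers; not the PR's actual DecompositionFlag values."""

    OWNED = 0
    FIRST_HALO_LINE = 1
    SECOND_HALO_LINE = 2


def build_cell_halo(cell_neighbors, owner, rank, halo_lines=2):
    """Map every local cell (owned or halo) of `rank` to a HaloFlag.

    cell_neighbors: dict mapping a global cell index to its neighbor cells.
    owner: list where owner[c] is the rank that owns global cell c.
    halo_lines: number of halo levels; HaloFlag only defines two, so <= 2.
    """
    flags = {c: HaloFlag.OWNED for c, r in enumerate(owner) if r == rank}
    frontier = set(flags)
    for level in range(1, halo_lines + 1):
        # Cells adjacent to the previous level that are not yet local.
        next_frontier = {
            neighbor
            for cell in frontier
            for neighbor in cell_neighbors[cell]
            if neighbor not in flags
        }
        for cell in next_frontier:
            flags[cell] = HaloFlag(level)
        frontier = next_frontier
    return flags


# Toy grid: 8 cells in a row, split into two patches of 4 cells each.
n_cells = 8
cell_neighbors = {
    c: [n for n in (c - 1, c + 1) if 0 <= n < n_cells] for c in range(n_cells)
}
owner = [0, 0, 0, 0, 1, 1, 1, 1]
halo_rank0 = build_cell_halo(cell_neighbors, owner, rank=0)
```

In this toy setup rank 0 owns cells 0 to 3, cell 4 lands on the first halo line and cell 5 on the second. Vertex and edge halos would then be derived from the flagged cells, which the sketch leaves out.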

@halungge halungge force-pushed the halo_construction branch from df9c2ef to 72d4d4b Compare July 25, 2025 16:30
Magdalena Luz added 28 commits August 8, 2025 09:16
# Conflicts:
#	model/common/src/icon4py/model/common/grid/grid_manager.py
#	model/common/src/icon4py/model/common/grid/horizontal.py
# Conflicts:
#	model/common/src/icon4py/model/common/decomposition/definitions.py
#	model/common/src/icon4py/model/common/grid/base.py
#	model/common/src/icon4py/model/common/grid/grid_manager.py
#	model/common/src/icon4py/model/common/grid/gridfile.py
#	model/common/src/icon4py/model/common/grid/icon.py
#	model/common/tests/common/grid/fixtures.py
#	model/testing/src/icon4py/model/testing/grid_utils.py
#	model/testing/src/icon4py/model/testing/parallel_helpers.py
@msimberg

cscs-ci run default

@msimberg

cscs-ci run default

@msimberg

cscs-ci run distributed

    self._edge_domain(h_grid.Zone.LOCAL),
    self._edge_domain(
        h_grid.Zone.HALO
    ),  # TODO(msimberg): END too much, invalid neighbor access. LOCAL too little?

@msimberg msimberg Feb 24, 2026

This is weird. Why does this need to compute all the way to the first halo line to get correct results in the halo when a halo exchange is done later? Bug in indices or expected?

Same further down for other fields that need connectivities for computation.

Comment on lines +81 to +87
def array_ns_from_array(array: NDArray) -> ModuleType:
    if isinstance(array, np.ndarray):
        import numpy as xp
    else:
        import cupy as xp  # type: ignore[no-redef]

    return xp
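
For context, a helper of this kind is usually used to write code that works on either NumPy or CuPy arrays. The following is a minimal sketch, not code from this PR: `array_namespace` and `normalize` are hypothetical names, and the fallback assumes that anything that is not a NumPy array is a CuPy array.

```python
import importlib
from types import ModuleType

import numpy as np


def array_namespace(array) -> ModuleType:
    """Return the array module (numpy or cupy) that matches `array`."""
    if isinstance(array, np.ndarray):
        return np
    # Assumption: every non-NumPy array is a CuPy array. CuPy is imported
    # lazily so that CPU-only environments never need it installed.
    return importlib.import_module("cupy")


def normalize(values):
    """Scale a vector to unit length using the array's own namespace."""
    xp = array_namespace(values)
    return values / xp.sqrt(xp.sum(values * values))


unit = normalize(np.array([3.0, 4.0]))
```

Because `normalize` only calls functions through `xp`, the result stays on the device where the input lives.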

@jcanton jcanton mentioned this pull request Feb 26, 2026
return SingleNodeReductions()


class DecompositionFlag(int, Enum):

Should we move the sketch from halo.py here?

Indifferent. I think it can be in either place equally well, but won't oppose moving it here if you prefer.


    def __init__(
        self,
        run_properties: defs.ProcessProperties,

This is called processor_props (mostly) everywhere else (it's actually misspelled as processor_procs here and there).

Suggested change
- run_properties: defs.ProcessProperties,
+ processor_props: defs.ProcessProperties,

allocator: GT4Py buffer allocator
"""
self._xp = data_alloc.import_array_ns(allocator)
self._props = run_properties

Suggested change
- self._props = run_properties
+ self._processor_props = processor_props

"""
self._xp = data_alloc.import_array_ns(allocator)
self._props = run_properties
self._connectivities = {self._value(k): v for k, v in connectivities.items()}

Why do we redefine this here instead of using it as is? Can we force it to be dict[gtx.FieldOffset, data_alloc.NDArray] and avoid the str conversion? (It's not passed in from many places as far as I can see.)

Good question. I think it'd be nice if we can avoid the str. I can try removing that.
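
The two options discussed in this thread can be compared in a small sketch. It is illustrative only: `Offset` is a hypothetical stand-in for an offset object such as gtx.FieldOffset, and it assumes the real type is hashable so that it can serve as a dictionary key.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Offset:
    """Hypothetical stand-in for an offset object such as gtx.FieldOffset."""

    value: str


C2E = Offset("C2E")
E2C = Offset("E2C")

# Toy neighbor tables; real ones would be 2D index arrays.
connectivities = {C2E: [[0, 1, 2], [2, 3, 4]], E2C: [[0], [0], [0, 1], [1], [1]]}

# Pattern in the snippet under review: re-key the tables by a string.
by_name = {offset.value: table for offset, table in connectivities.items()}


# Alternative raised in the review: keep the offset objects as keys.
# This works for any hashable key type (a frozen dataclass is hashable).
def neighbors_of(tables, offset, index):
    return tables[offset][index]


same_row = neighbors_of(connectivities, C2E, 1) == neighbors_of(by_name, "C2E", 1)
```

Both lookups return the same row, so dropping the string conversion would mainly change the declared key type and remove one extra dictionary.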

f"The distribution assumes more nodes than the current run is scheduled on {self._props} ",
)

def _assert_all_neighbor_tables(self) -> None:

Should we choose either neighbor_table or connectivity, but not both?

I think we use connectivity in most places, so I would be in favour of that.

@github-actions

Mandatory Tests

Please make sure you run these tests via comment before you merge!

  • cscs-ci run default
  • cscs-ci run distributed

Optional Tests

To run benchmarks you can use:

  • cscs-ci run benchmark-bencher

To run tests and benchmarks with the DaCe backend you can use:

  • cscs-ci run dace

To run test levels ignored by the default test suite (mostly simple datatest for static fields computations) you can use:

  • cscs-ci run extra

For more detailed information please look at CI in the EXCLAIM universe.

5 participants