Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add BlockIdManagerSelector #3560

Merged
merged 20 commits into from
Oct 15, 2024

Conversation

matthewc2003
Copy link
Contributor

@matthewc2003 matthewc2003 commented Jul 31, 2024

Description

Adds the BlockIdManagerSelector to the list of available manager selectors. This selector returns a sorted list of managers by their block id, from greatest (newest) to least (oldest). Will be used to more effectively distribute tasks. Eventually, this selector will be used in conjunction with some method of sorting the task queue based on runtime to a) ensure tasks won't get assigned to managers that have a shorter remaining wall time than that of the task's execution time and b) put tasks with similar execution times on the same block, thereby resulting in a higher likelihood that the block completely empties of tasks, allowing the scale down logic to kick in.

Changed Behavior

Users can now choose the BlockIdManagerSelector() in the HTEX config.

Fixes

Work as part of #3323

Type of change

  • New feature

@benclifford
Copy link
Collaborator

you might elaborate on "Will be used to more effectively distribute tasks."

@benclifford benclifford marked this pull request as draft August 6, 2024 10:10
@benclifford
Copy link
Collaborator

switching this to draft status because @matthewc2003 is working on gathering evidence on when this does and does not help.

@matthewc2003
Copy link
Contributor Author

As a result of some experimentation I've found the following:

  1. BlockID manager selector helps with workloads that see a varying amount of tasks over time. Example: Cholesky factorization workload run through the TaPS benchmarking suite
    Random Manager Selector:
    random-lg
    Block Id Manager Selector:
    seesaw-lg
    We see new blocks are prioritized with the blockID manager selector, when used with 'htex_auto_scaling', results in compute cost savings.
  2. Doesn't really work with bag-of-tasks workloads. When all the tasks are put into the queue upfront, all blocks operate at near full utilization for the majority of the workload, which task goes where doesn't really matter.

@matthewc2003 matthewc2003 marked this pull request as ready for review August 21, 2024 20:07
@benclifford
Copy link
Collaborator

some documentation ideas:

the same summary you posted above, put into a docstring for the block id manager selector, and then add to docs/reference.rst (perhaps in a new section)

add a description of manager_selector to the HighThroughputExecutor selector, with a link to the docstring on block id manager selector.

@@ -19,6 +19,14 @@ def sort_managers(self, ready_managers: Dict[bytes, ManagerRecord], manager_list

class RandomManagerSelector(ManagerSelector):

"""Returns a shuffled list of interesting_managers

Maintains the behavior of the original interchange. Works well
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the docs should primarily be about the version now, rather than about changes since some other version the user isn't using, so you could get rid of this first sentence.

a related thing to express would be that this is (I think) the default - see how some other docstrings on HighThroughputExecutor talk about defaults

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

... but the commit message that this gets merged under (which hopefully will come from the description text of this PR) is the right place to describe how the new code is different/similar to the old code - so describing how the default behaviour now is the same or different to the previous behaviour before this is merged is relevant there: the audience for that is people who want to know what has changed in parsl, vs audience for the user guide is what is stuff like now

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reworded to identify the random manager selector as the default

@@ -78,6 +78,16 @@ Executors
parsl.executors.FluxExecutor
parsl.executors.radical.RadicalPilotExecutor

Executors Miscellaneous
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you could call this Manager selectors

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

changed

@benclifford
Copy link
Collaborator

It would be good to get some testing in here for the newly introduced code.

There are a few things you could do:

make a test that invokces BlockIdManagerSelector.sort_managers with some example inputs and check that it sorts them the right way

launch a DFK with the BlockIdManagerSelector configured and check that a simple no-op task runs.

level 2 of that: run a configuration with 2 blocks, no scaling, run a few tasks and check that they all end up in the 2nd block and never in the first block: you should see that in the PARSL_WORKER_BLOCK_ID environment variable inside a task

@khk-globus might have some other ideas too around this space

@pytest.mark.local
def test_block_id_selection(try_assert):
blockids = []
try_assert(lambda: (time.sleep(1) or get_worker_pid().result() == "1"), timeout_ms=20000)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

whats the sleep(1) for? in test cases that's usually something that should be waited for explicitly, or is something broken that the author hopes goes away after a while.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think maybe you should be checking that both blocks are available when this test runs - eg lambda: len(htex.connected_managers()) == 2 - so that there is the opportunity for buggy behaviour to place blocks onto block 0, and then the opportunity for the test to detect that.

manager_selector: ManagerSelector
Determines what strategy the interchange uses to select managers during task distribution.
See API reference under "Manager Selectors" regarding the various manager selectors.
Default: RandomManagerSelector()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do this and you'll get a hyperlink to the documentation for RandomManagerSelector:

$ git diff
diff --git a/parsl/executors/high_throughput/executor.py b/parsl/executors/high_throughput/executor.py
index 09e634f41d..92a256ae54 100644
--- a/parsl/executors/high_throughput/executor.py
+++ b/parsl/executors/high_throughput/executor.py
@@ -149,7 +149,7 @@ GENERAL_HTEX_PARAM_DOCS = """provider : :class:`~parsl.providers.base.ExecutionP
     manager_selector: ManagerSelector
         Determines what strategy the interchange uses to select managers during task distribution.
         See API reference under "Manager Selectors" regarding the various manager selectors.
-        Default: RandomManagerSelector()
+        Default: `RandomManagerSelector`
 """  # Documentation for params used by both HTEx and MPIEx
 
 

in distributing workloads equally across all availble compute
resources. Is not effective in conjunction with elastic scaling
behavior as it leads to wasted resource consumption.
"""
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe put subjects in these sentences such as it - this isn't charged by the word.


Observations:
1. BlockID manager selector helps with workloads that see a varying
amount of tasks over time. We see new blocks are prioritized with the
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

avoid academic-we in user facing documentation: just write "New blocks are prioritizied"

for i in range(10):
future = get_worker_pid()
blockids.append(future.result())
print(future.result())
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

don't print in tests - the output is either passing or failing.


class BlockIdManagerSelector(ManagerSelector):

"""Returns a interesting_managers list sorted by block ID
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/a/an/

@benclifford benclifford merged commit ac62e04 into Parsl:master Oct 15, 2024
7 checks passed
@benclifford
Copy link
Collaborator

I removed most of your PR description from the commit message, rather than harrassing @matthewc2003 to go rewrite it - it was mostly about stuff the isn't implemented, rather than stuff that this PR implements.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants