Closed
Description
When creating a placement group, I found that sorted_bundle_infos = sorted(bundle_infos, key=sort_key) works correctly on a machine with 8 GPUs but produces the wrong bundle order on a machine with 16 GPUs.
The issue is that gpu_id is returned as a string, which leads to lexicographic sorting instead of numeric sorting. As shown in the debug output below, the sorted order becomes 0, 1, 10, 11, 12, 13, 14, 15, 2, 3, 4, 5, 6, 7, 8, 9 instead of the expected 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15.
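The string-versus-number behavior can be reproduced in isolation. The following is a minimal sketch that uses a made-up list of GPU ID strings (not the actual bundle data) to show both orderings:

```python
# Hypothetical stand-in for gpu_id values that arrive as strings.
gpu_ids = [str(i) for i in range(16)]

# Strings compare character by character, so "10" sorts before "2".
lexicographic = sorted(gpu_ids)

# Converting each ID to int in the sort key restores numeric order.
numeric = sorted(gpu_ids, key=int)

print(lexicographic)
print(numeric)
```

With fewer than 11 GPUs every ID is a single digit, so both orderings coincide, which is why the 8-GPU machine is unaffected.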
==============================gpu_ids: [('90.90.97.74', '0'), ('90.90.97.74', '1'), ('90.90.97.74', '2'), ('90.90.97.74', '3'), ('90.90.97.74', '4'), ('90.90.97.74', '5'), ('90.90.97.74', '6'), ('90.90.97.74', '7'), ('90.90.97.74', '8'), ('90.90.97.74', '9'), ('90.90.97.74', '10'), ('90.90.97.74', '11'), ('90.90.97.74', '12'), ('90.90.97.74', '13'), ('90.90.97.74', '14'), ('90.90.97.74', '15')]
==============================bundle_infos: [(0, '90.90.97.74', '0'), (1, '90.90.97.74', '1'), (2, '90.90.97.74', '2'), (3, '90.90.97.74', '3'), (4, '90.90.97.74', '4'), (5, '90.90.97.74', '5'), (6, '90.90.97.74', '6'), (7, '90.90.97.74', '7'), (8, '90.90.97.74', '8'), (9, '90.90.97.74', '9'), (10, '90.90.97.74', '10'), (11, '90.90.97.74', '11'), (12, '90.90.97.74', '12'), (13, '90.90.97.74', '13'), (14, '90.90.97.74', '14'), (15, '90.90.97.74', '15')]
==============================sorted_bundle_infos: [(0, '90.90.97.74', '0'), (1, '90.90.97.74', '1'), (10, '90.90.97.74', '10'), (11, '90.90.97.74', '11'), (12, '90.90.97.74', '12'), (13, '90.90.97.74', '13'), (14, '90.90.97.74', '14'), (15, '90.90.97.74', '15'), (2, '90.90.97.74', '2'), (3, '90.90.97.74', '3'), (4, '90.90.97.74', '4'), (5, '90.90.97.74', '5'), (6, '90.90.97.74', '6'), (7, '90.90.97.74', '7'), (8, '90.90.97.74', '8'), (9, '90.90.97.74', '9')]
==============================pg_reordered_bundle_indices: [0, 1, 10, 11, 12, 13, 14, 15, 2, 3, 4, 5, 6, 7, 8, 9]
==============================pg_reordered_gpu_ids: ['0', '1', '10', '11', '12', '13', '14', '15', '2', '3', '4', '5', '6', '7', '8', '9']
[2026-01-23 03:31:26] placement_group.py:81 - bundle 0, actual_bundle_index: 0, node: 90.90.97.74, gpu: 0
[2026-01-23 03:31:26] placement_group.py:81 - bundle 1, actual_bundle_index: 1, node: 90.90.97.74, gpu: 1
[2026-01-23 03:31:26] placement_group.py:81 - bundle 2, actual_bundle_index: 10, node: 90.90.97.74, gpu: 10
[2026-01-23 03:31:26] placement_group.py:81 - bundle 3, actual_bundle_index: 11, node: 90.90.97.74, gpu: 11
[2026-01-23 03:31:26] placement_group.py:81 - bundle 4, actual_bundle_index: 12, node: 90.90.97.74, gpu: 12
[2026-01-23 03:31:26] placement_group.py:81 - bundle 5, actual_bundle_index: 13, node: 90.90.97.74, gpu: 13
[2026-01-23 03:31:26] placement_group.py:81 - bundle 6, actual_bundle_index: 14, node: 90.90.97.74, gpu: 14
[2026-01-23 03:31:26] placement_group.py:81 - bundle 7, actual_bundle_index: 15, node: 90.90.97.74, gpu: 15
[2026-01-23 03:31:26] placement_group.py:81 - bundle 8, actual_bundle_index: 2, node: 90.90.97.74, gpu: 2
[2026-01-23 03:31:26] placement_group.py:81 - bundle 9, actual_bundle_index: 3, node: 90.90.97.74, gpu: 3
[2026-01-23 03:31:26] placement_group.py:81 - bundle 10, actual_bundle_index: 4, node: 90.90.97.74, gpu: 4
[2026-01-23 03:31:26] placement_group.py:81 - bundle 11, actual_bundle_index: 5, node: 90.90.97.74, gpu: 5
[2026-01-23 03:31:26] placement_group.py:81 - bundle 12, actual_bundle_index: 6, node: 90.90.97.74, gpu: 6
[2026-01-23 03:31:26] placement_group.py:81 - bundle 13, actual_bundle_index: 7, node: 90.90.97.74, gpu: 7
[2026-01-23 03:31:26] placement_group.py:81 - bundle 14, actual_bundle_index: 8, node: 90.90.97.74, gpu: 8
[2026-01-23 03:31:26] placement_group.py:81 - bundle 15, actual_bundle_index: 9, node: 90.90.97.74, gpu: 9
INFO 01-23 03:31:26 [importing.py:53] Triton module has been replaced with a placeholder.

I solved this by converting gpu_id from a string to an integer in the sort_key function before returning the key:
def sort_key(x):
    index, node_identifier, gpu_id = x
    # Sort by node IP number and then by GPU ID.
    try:
        # Try to parse it as an IP address.
        ip_address = node_identifier
        node_ip_parts = list(map(int, ip_address.split(".")))
    except ValueError:
        # Try to resolve the hostname to an IP address.
        try:
            ip_address = socket.gethostbyname(node_identifier)
            node_ip_parts = list(map(int, ip_address.split(".")))
        except (socket.gaierror, TypeError):
            # Instead, we convert each character of the original identifier string
            # to its ASCII value. This provides a stable and consistent numerical
            # representation that allows for sorting.
            node_ip_parts = [ord(c) for c in node_identifier]
    # Convert string GPU IDs to int so they sort numerically, not lexicographically.
    if isinstance(gpu_id, str):
        gpu_id = int(gpu_id)
    return (node_ip_parts, gpu_id)

This fix ensures proper numeric sorting on machines with more than 10 GPUs. Could you please apply this fix?
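As a quick sanity check, a numeric key can be run against a bundle list shaped like the one in the debug output. This is only a sketch: simple_sort_key is a hypothetical, simplified stand-in that handles plain IPv4 addresses and skips the hostname-resolution fallback of the real sort_key.

```python
def simple_sort_key(x):
    # Hypothetical simplified key: IPv4 addresses only, no hostname lookup.
    index, node_ip, gpu_id = x
    node_ip_parts = [int(part) for part in node_ip.split(".")]
    # int() makes GPU IDs compare numerically even when they arrive as strings.
    return (node_ip_parts, int(gpu_id))


# Same shape as bundle_infos in the debug output: (bundle index, node IP, GPU ID string).
bundle_infos = [(i, "90.90.97.74", str(i)) for i in range(16)]

# Sort with the numeric key and read back the bundle indices.
reordered = [info[0] for info in sorted(bundle_infos, key=simple_sort_key)]
print(reordered)
```

With the integer conversion in place, the reordered bundle indices come back as 0 through 15 in order rather than 0, 1, 10, 11, and so on.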
lilei199908