[bug] sort_key function causes incorrect GPU ordering with 10+ GPUs due to string comparison #1481

@chenjunyi-dev

Description

When creating a placement group, I found that sorted_bundle_infos = sorted(bundle_infos, key=sort_key) works correctly on a machine with 8 GPUs but produces the wrong order on a machine with 16 GPUs.

The issue is that gpu_id is returned as a string, which leads to lexicographic sorting instead of numeric sorting. As shown in the debug output below, the sorted order becomes 0, 1, 10, 11, 12, 13, 14, 15, 2, 3, 4, 5, 6, 7, 8, 9 instead of the expected 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15.

==============================gpu_ids:  [('90.90.97.74', '0'), ('90.90.97.74', '1'), ('90.90.97.74', '2'), ('90.90.97.74', '3'), ('90.90.97.74', '4'), ('90.90.97.74', '5'), ('90.90.97.74', '6'), ('90.90.97.74', '7'), ('90.90.97.74', '8'), ('90.90.97.74', '9'), ('90.90.97.74', '10'), ('90.90.97.74', '11'), ('90.90.97.74', '12'), ('90.90.97.74', '13'), ('90.90.97.74', '14'), ('90.90.97.74', '15')]
==============================bundle_infos:  [(0, '90.90.97.74', '0'), (1, '90.90.97.74', '1'), (2, '90.90.97.74', '2'), (3, '90.90.97.74', '3'), (4, '90.90.97.74', '4'), (5, '90.90.97.74', '5'), (6, '90.90.97.74', '6'), (7, '90.90.97.74', '7'), (8, '90.90.97.74', '8'), (9, '90.90.97.74', '9'), (10, '90.90.97.74', '10'), (11, '90.90.97.74', '11'), (12, '90.90.97.74', '12'), (13, '90.90.97.74', '13'), (14, '90.90.97.74', '14'), (15, '90.90.97.74', '15')]
==============================sorted_bundle_infos:  [(0, '90.90.97.74', '0'), (1, '90.90.97.74', '1'), (10, '90.90.97.74', '10'), (11, '90.90.97.74', '11'), (12, '90.90.97.74', '12'), (13, '90.90.97.74', '13'), (14, '90.90.97.74', '14'), (15, '90.90.97.74', '15'), (2, '90.90.97.74', '2'), (3, '90.90.97.74', '3'), (4, '90.90.97.74', '4'), (5, '90.90.97.74', '5'), (6, '90.90.97.74', '6'), (7, '90.90.97.74', '7'), (8, '90.90.97.74', '8'), (9, '90.90.97.74', '9')]
==============================pg_reordered_bundle_indices:  [0, 1, 10, 11, 12, 13, 14, 15, 2, 3, 4, 5, 6, 7, 8, 9]
==============================pg_reordered_gpu_ids:  ['0', '1', '10', '11', '12', '13', '14', '15', '2', '3', '4', '5', '6', '7', '8', '9']
[2026-01-23 03:31:26] placement_group.py:81 -   bundle    0, actual_bundle_index:    0, node: 90.90.97.74, gpu: 0
[2026-01-23 03:31:26] placement_group.py:81 -   bundle    1, actual_bundle_index:    1, node: 90.90.97.74, gpu: 1
[2026-01-23 03:31:26] placement_group.py:81 -   bundle    2, actual_bundle_index:   10, node: 90.90.97.74, gpu: 10
[2026-01-23 03:31:26] placement_group.py:81 -   bundle    3, actual_bundle_index:   11, node: 90.90.97.74, gpu: 11
[2026-01-23 03:31:26] placement_group.py:81 -   bundle    4, actual_bundle_index:   12, node: 90.90.97.74, gpu: 12
[2026-01-23 03:31:26] placement_group.py:81 -   bundle    5, actual_bundle_index:   13, node: 90.90.97.74, gpu: 13
[2026-01-23 03:31:26] placement_group.py:81 -   bundle    6, actual_bundle_index:   14, node: 90.90.97.74, gpu: 14
[2026-01-23 03:31:26] placement_group.py:81 -   bundle    7, actual_bundle_index:   15, node: 90.90.97.74, gpu: 15
[2026-01-23 03:31:26] placement_group.py:81 -   bundle    8, actual_bundle_index:    2, node: 90.90.97.74, gpu: 2
[2026-01-23 03:31:26] placement_group.py:81 -   bundle    9, actual_bundle_index:    3, node: 90.90.97.74, gpu: 3
[2026-01-23 03:31:26] placement_group.py:81 -   bundle   10, actual_bundle_index:    4, node: 90.90.97.74, gpu: 4
[2026-01-23 03:31:26] placement_group.py:81 -   bundle   11, actual_bundle_index:    5, node: 90.90.97.74, gpu: 5
[2026-01-23 03:31:26] placement_group.py:81 -   bundle   12, actual_bundle_index:    6, node: 90.90.97.74, gpu: 6
[2026-01-23 03:31:26] placement_group.py:81 -   bundle   13, actual_bundle_index:    7, node: 90.90.97.74, gpu: 7
[2026-01-23 03:31:26] placement_group.py:81 -   bundle   14, actual_bundle_index:    8, node: 90.90.97.74, gpu: 8
[2026-01-23 03:31:26] placement_group.py:81 -   bundle   15, actual_bundle_index:    9, node: 90.90.97.74, gpu: 9
INFO 01-23 03:31:26 [importing.py:53] Triton module has been replaced with a placeholder.
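
The ordering in the log above can be reproduced in isolation, without a placement group. This is only a minimal sketch using plain Python lists; the variable names are illustrative and are not taken from placement_group.py.

```python
# Minimal reproduction: GPU IDs held as strings, as in the debug output.
gpu_ids = [str(i) for i in range(16)]

# Default string comparison is character by character, so '10' < '2'.
lexicographic = sorted(gpu_ids)

# Converting each ID to int for the comparison restores numeric order.
numeric = sorted(gpu_ids, key=int)

print(lexicographic)
print(numeric)
```

With fewer than 10 GPUs every ID is a single character, so both orderings coincide, which would explain why the 8-GPU machine is unaffected.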

I fixed this by converting gpu_id from a string to an integer before sort_key returns:

import socket

def sort_key(x):
    index, node_identifier, gpu_id = x
    # Sort by node IP address first, then by GPU ID.
    try:
        # Try to parse the identifier as a dotted IPv4 address.
        ip_address = node_identifier
        node_ip_parts = list(map(int, ip_address.split(".")))
    except ValueError:
        # Not an IP address: try to resolve the hostname to one.
        try:
            ip_address = socket.gethostbyname(node_identifier)
            node_ip_parts = list(map(int, ip_address.split(".")))
        except (socket.gaierror, TypeError):
            # Resolution failed: fall back to the ASCII value of each character
            # of the original identifier. This provides a stable and consistent
            # numerical representation that allows for sorting.
            node_ip_parts = [ord(c) for c in node_identifier]

    # gpu_id arrives as a string; compare it numerically so that '10' sorts after '9'.
    if isinstance(gpu_id, str):
        gpu_id = int(gpu_id)

    return (node_ip_parts, gpu_id)

This fix ensures proper numeric sorting for systems with 10 or more GPUs. Could you please apply this fix?
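
For a quick sanity check of the change, here is a sketch that uses a simplified stand-in for sort_key. The name simple_sort_key is hypothetical, and it assumes the node identifier is always a dotted IPv4 address, so it skips the hostname resolution and ASCII fallback. The input mirrors the bundle_infos from the log above.

```python
def simple_sort_key(bundle_info):
    # Hypothetical, simplified stand-in for sort_key: IPv4 identifiers only.
    index, node_identifier, gpu_id = bundle_info
    node_ip_parts = [int(part) for part in node_identifier.split(".")]
    # Compare the GPU ID numerically rather than as a string.
    return (node_ip_parts, int(gpu_id))

# Same shape as the bundle_infos in the debug output: (index, node, gpu_id).
bundle_infos = [(i, "90.90.97.74", str(i)) for i in range(16)]

sorted_bundle_infos = sorted(bundle_infos, key=simple_sort_key)
pg_reordered_bundle_indices = [info[0] for info in sorted_bundle_infos]
print(pg_reordered_bundle_indices)
```

One caveat with this approach: int(gpu_id) raises ValueError if an ID is ever non-numeric, so the conversion assumes IDs are always decimal strings.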
