@TroyGarden TroyGarden commented Dec 10, 2025

Summary:

context

  • modify the kvzch benchmark configs to better represent the real use case
  • add config pass-in to the test models
  • fix small bugs and do minor refactoring

changes

  • the previous kv-zch embedding table was too small, so the prefetch process was too short; after this change (increased table size) the prefetch process is longer

benchmark

| short name | GPU Runtime (P90) | CPU Runtime (P90) | GPU Peak Mem alloc (P90) | GPU Peak Mem reserved (P90) | GPU Mem used (P90) | Malloc retries (P50/P90/P100) | CPU Peak RSS (P90) |
|---|---|---|---|---|---|---|---|
| regular-base | 9864.51 ms | 9403.68 ms | 33.77 GB | 49.66 GB | 50.71 GB | 0.0 / 0.0 / 0.0 | 30.65 GB |
| kvzch-base | 18804.26 ms | 44245.82 ms | 25.28 GB | 36.33 GB | 37.38 GB | 0.0 / 0.0 / 0.0 | 31.18 GB |
| base-inplace | 20141.71 ms | 46805.58 ms | 25.28 GB | 34.39 GB | 35.44 GB | 0.0 / 0.0 / 0.0 | 31.19 GB |
| kvzch-sdd | 20382.59 ms | 45647.02 ms | 33.42 GB | 47.52 GB | 48.56 GB | 0.0 / 0.0 / 0.0 | 31.13 GB |
| kvzch-prefetch | 17951.19 ms | 38598.57 ms | 33.45 GB | 47.16 GB | 48.21 GB | 0.0 / 0.0 / 0.0 | 30.83 GB |
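
To make the runtime rows easier to compare, here is a minimal sketch that expresses each configuration's P90 GPU and CPU runtime as a percent change against one baseline row. The numbers are copied from the benchmark results above. The `RUNTIMES_MS` dict and the `relative_change` helper are hypothetical names used only for illustration and are not part of the benchmark code.

```python
# P90 runtimes in ms as (GPU, CPU), copied from the benchmark results above.
RUNTIMES_MS = {
    "regular-base": (9864.51, 9403.68),
    "kvzch-base": (18804.26, 44245.82),
    "base-inplace": (20141.71, 46805.58),
    "kvzch-sdd": (20382.59, 45647.02),
    "kvzch-prefetch": (17951.19, 38598.57),
}


def relative_change(baseline: str = "kvzch-base") -> dict[str, tuple[float, float]]:
    """Return the percent change in (GPU, CPU) P90 runtime versus the baseline row.

    Negative values mean the configuration is faster than the baseline.
    """
    base_gpu, base_cpu = RUNTIMES_MS[baseline]
    return {
        name: (100.0 * (gpu / base_gpu - 1.0), 100.0 * (cpu / base_cpu - 1.0))
        for name, (gpu, cpu) in RUNTIMES_MS.items()
    }


if __name__ == "__main__":
    for name, (gpu_pct, cpu_pct) in relative_change().items():
        print(f"{name:15s} GPU {gpu_pct:+6.1f}%  CPU {cpu_pct:+6.1f}%")
```

Choosing `kvzch-base` as the default baseline is an assumption; pass `baseline="regular-base"` to compare against the non-kvzch run instead.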
  • planner stats
########################################################################################################################################################################################################################################################################################
INFO:torchrec.distributed.planner.stats:#                                                                                                                                                                     --- Planner Statistics ---                                                                                                                                                                      #
INFO:torchrec.distributed.planner.stats:#                                                                                                                                              --- Evaluated 1 proposal(s), found 1 possible plan(s), ran for 0.03s ---                                                                                                                                               #
INFO:torchrec.distributed.planner.stats:# ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- #
INFO:torchrec.distributed.planner.stats:#      Rank      HBM (GB)     DDR (GB)                    Perf (ms)     Input (MB)     Output (MB)               Shards                                                                                                                                                                                                                                               #
INFO:torchrec.distributed.planner.stats:#    ------    ----------   ----------                  -----------   ------------   -------------             --------                                                                                                                                                                                                                                               #
INFO:torchrec.distributed.planner.stats:#         0   8.846 (33%)     0.0 (0%)   0.807 (0.2,0.09,0.4,0.2,0)           0.46            55.0   CW: 8 RW: 1 TW: 50                                                                                                                                                                                                                                               #
INFO:torchrec.distributed.planner.stats:#         1   8.846 (33%)     0.0 (0%)   0.807 (0.2,0.09,0.4,0.2,0)           0.46            55.0   CW: 8 RW: 1 TW: 50                                                                                                                                                                                                                                               #
INFO:torchrec.distributed.planner.stats:#                                                                                                                                                                                                                                                                                                                                                                     #
INFO:torchrec.distributed.planner.stats:# Perf: Total perf (Forward compute, Forward comms, Backward compute, Backward comms, Prefetch compute)                                                                                                                                                                                                                                                               #
INFO:torchrec.distributed.planner.stats:# Input: MB/iteration, Output: MB/iteration, Shards: number of tables                                                                                                                                                                                                                                                                                                 #
INFO:torchrec.distributed.planner.stats:# HBM: estimated peak memory usage for shards, dense tensors, and features (KJT)                                                                                                                                                                                                                                                                                      #
INFO:torchrec.distributed.planner.stats:#                                                                                                                                                                                                                                                                                                                                                                     #
INFO:torchrec.distributed.planner.stats:# Parameter Info:                                                                                                                                                                                                                                                                                                                                                     #
INFO:torchrec.distributed.planner.stats:#                                      FQN     Sharding       Compute Kernel                           Perf (ms)     Storage (HBM, DDR)     Cache Load Factor     Sum Pooling Factor     Sum Num Poolings     Num Indices     Output     Weighted                         Sharder     Features     Emb Dim (CW Dim)     Hash Size                             Ranks   #
INFO:torchrec.distributed.planner.stats:#                                    -----   ----------     ----------------                         -----------   --------------------   -------------------   --------------------   ------------------   -------------   --------   ----------                       ---------   ----------   ------------------   -----------                           -------   #
INFO:torchrec.distributed.planner.stats:#     sparse.weighted_ebc.weighted_table_0           TW                fused   0.012 (0.002,0.002,0.007,0.002,0)     (0.096 GB, 0.0 GB)                  None                    1.0                  1.0             1.0     pooled     weighted   EmbeddingBagCollectionSharder            1                  256        100000                                 0   #
INFO:torchrec.distributed.planner.stats:#     sparse.weighted_ebc.weighted_table_1           TW                fused   0.012 (0.002,0.002,0.007,0.002,0)     (0.096 GB, 0.0 GB)                  None                    1.0                  1.0             1.0     pooled     weighted   EmbeddingBagCollectionSharder            1                  256        100000                                 1   #
INFO:torchrec.distributed.planner.stats:#     sparse.weighted_ebc.weighted_table_2           TW                fused   0.012 (0.002,0.002,0.007,0.002,0)     (0.096 GB, 0.0 GB)                  None                    1.0                  1.0             1.0     pooled     weighted   EmbeddingBagCollectionSharder            1                  256        100000                                 0   #
INFO:torchrec.distributed.planner.stats:#     sparse.weighted_ebc.weighted_table_3           TW                fused   0.012 (0.002,0.002,0.007,0.002,0)     (0.096 GB, 0.0 GB)                  None                    1.0                  1.0             1.0     pooled     weighted   EmbeddingBagCollectionSharder            1                  256        100000                                 1   #
INFO:torchrec.distributed.planner.stats:#     sparse.weighted_ebc.weighted_table_4           TW                fused   0.012 (0.002,0.002,0.007,0.002,0)     (0.096 GB, 0.0 GB)                  None                    1.0                  1.0             1.0     pooled     weighted   EmbeddingBagCollectionSharder            1                  256        100000                                 0   #
INFO:torchrec.distributed.planner.stats:#     sparse.weighted_ebc.weighted_table_5           TW                fused   0.012 (0.002,0.002,0.007,0.002,0)     (0.096 GB, 0.0 GB)                  None                    1.0                  1.0             1.0     pooled     weighted   EmbeddingBagCollectionSharder            1                  256        100000                                 1   #
INFO:torchrec.distributed.planner.stats:#     sparse.weighted_ebc.weighted_table_6           TW                fused   0.012 (0.002,0.002,0.007,0.002,0)     (0.096 GB, 0.0 GB)                  None                    1.0                  1.0             1.0     pooled     weighted   EmbeddingBagCollectionSharder            1                  256        100000                                 0   #
INFO:torchrec.distributed.planner.stats:#     sparse.weighted_ebc.weighted_table_7           TW                fused   0.012 (0.002,0.002,0.007,0.002,0)     (0.096 GB, 0.0 GB)                  None                    1.0                  1.0             1.0     pooled     weighted   EmbeddingBagCollectionSharder            1                  256        100000                                 1   #
INFO:torchrec.distributed.planner.stats:#     sparse.weighted_ebc.weighted_table_8           TW                fused   0.012 (0.002,0.002,0.007,0.002,0)     (0.096 GB, 0.0 GB)                  None                    1.0                  1.0             1.0     pooled     weighted   EmbeddingBagCollectionSharder            1                  256        100000                                 0   #
INFO:torchrec.distributed.planner.stats:#     sparse.weighted_ebc.weighted_table_9           TW                fused   0.012 (0.002,0.002,0.007,0.002,0)     (0.096 GB, 0.0 GB)                  None                    1.0                  1.0             1.0     pooled     weighted   EmbeddingBagCollectionSharder            1                  256        100000                                 1   #
INFO:torchrec.distributed.planner.stats:#    sparse.weighted_ebc.weighted_table_10           TW                fused   0.012 (0.002,0.002,0.007,0.002,0)     (0.096 GB, 0.0 GB)                  None                    1.0                  1.0             1.0     pooled     weighted   EmbeddingBagCollectionSharder            1                  256        100000                                 0   #
INFO:torchrec.distributed.planner.stats:#    sparse.weighted_ebc.weighted_table_11           TW                fused   0.012 (0.002,0.002,0.007,0.002,0)     (0.096 GB, 0.0 GB)                  None                    1.0                  1.0             1.0     pooled     weighted   EmbeddingBagCollectionSharder            1                  256        100000                                 1   #
INFO:torchrec.distributed.planner.stats:#    sparse.weighted_ebc.weighted_table_12           TW                fused   0.012 (0.002,0.002,0.007,0.002,0)     (0.096 GB, 0.0 GB)                  None                    1.0                  1.0             1.0     pooled     weighted   EmbeddingBagCollectionSharder            1                  256        100000                                 0   #
INFO:torchrec.distributed.planner.stats:#    sparse.weighted_ebc.weighted_table_13           TW                fused   0.012 (0.002,0.002,0.007,0.002,0)     (0.096 GB, 0.0 GB)                  None                    1.0                  1.0             1.0     pooled     weighted   EmbeddingBagCollectionSharder            1                  256        100000                                 1   #
INFO:torchrec.distributed.planner.stats:#    sparse.weighted_ebc.weighted_table_14           TW                fused   0.012 (0.002,0.002,0.007,0.002,0)     (0.096 GB, 0.0 GB)                  None                    1.0                  1.0             1.0     pooled     weighted   EmbeddingBagCollectionSharder            1                  256        100000                                 0   #
INFO:torchrec.distributed.planner.stats:#    sparse.weighted_ebc.weighted_table_15           TW                fused   0.012 (0.002,0.002,0.007,0.002,0)     (0.096 GB, 0.0 GB)                  None                    1.0                  1.0             1.0     pooled     weighted   EmbeddingBagCollectionSharder            1                  256        100000                                 1   #
INFO:torchrec.distributed.planner.stats:#    sparse.weighted_ebc.weighted_table_16           TW                fused   0.012 (0.002,0.002,0.007,0.002,0)     (0.096 GB, 0.0 GB)                  None                    1.0                  1.0             1.0     pooled     weighted   EmbeddingBagCollectionSharder            1                  256        100000                                 0   #
INFO:torchrec.distributed.planner.stats:#    sparse.weighted_ebc.weighted_table_17           TW                fused   0.012 (0.002,0.002,0.007,0.002,0)     (0.096 GB, 0.0 GB)                  None                    1.0                  1.0             1.0     pooled     weighted   EmbeddingBagCollectionSharder            1                  256        100000                                 1   #
INFO:torchrec.distributed.planner.stats:#    sparse.weighted_ebc.weighted_table_18           TW                fused   0.012 (0.002,0.002,0.007,0.002,0)     (0.096 GB, 0.0 GB)                  None                    1.0                  1.0             1.0     pooled     weighted   EmbeddingBagCollectionSharder            1                  256        100000                                 0   #
INFO:torchrec.distributed.planner.stats:#    sparse.weighted_ebc.weighted_table_19           TW                fused   0.012 (0.002,0.002,0.007,0.002,0)     (0.096 GB, 0.0 GB)                  None                    1.0                  1.0             1.0     pooled     weighted   EmbeddingBagCollectionSharder            1                  256        100000                                 1   #
INFO:torchrec.distributed.planner.stats:#    sparse.weighted_ebc.weighted_table_20           TW                fused   0.012 (0.002,0.002,0.007,0.002,0)     (0.096 GB, 0.0 GB)                  None                    1.0                  1.0             1.0     pooled     weighted   EmbeddingBagCollectionSharder            1                  256        100000                                 0   #
INFO:torchrec.distributed.planner.stats:#    sparse.weighted_ebc.weighted_table_21           TW                fused   0.012 (0.002,0.002,0.007,0.002,0)     (0.096 GB, 0.0 GB)                  None                    1.0                  1.0             1.0     pooled     weighted   EmbeddingBagCollectionSharder            1                  256        100000                                 1   #
INFO:torchrec.distributed.planner.stats:#    sparse.weighted_ebc.weighted_table_22           TW                fused   0.012 (0.002,0.002,0.007,0.002,0)     (0.096 GB, 0.0 GB)                  None                    1.0                  1.0             1.0     pooled     weighted   EmbeddingBagCollectionSharder            1                  256        100000                                 0   #
INFO:torchrec.distributed.planner.stats:#    sparse.weighted_ebc.weighted_table_23           TW                fused   0.012 (0.002,0.002,0.007,0.002,0)     (0.096 GB, 0.0 GB)                  None                    1.0                  1.0             1.0     pooled     weighted   EmbeddingBagCollectionSharder            1                  256        100000                                 1   #
INFO:torchrec.distributed.planner.stats:#    sparse.weighted_ebc.weighted_table_24           TW                fused   0.012 (0.002,0.002,0.007,0.002,0)     (0.096 GB, 0.0 GB)                  None                    1.0                  1.0             1.0     pooled     weighted   EmbeddingBagCollectionSharder            1                  256        100000                                 0   #
INFO:torchrec.distributed.planner.stats:#    sparse.weighted_ebc.weighted_table_25           TW                fused   0.012 (0.002,0.002,0.007,0.002,0)     (0.096 GB, 0.0 GB)                  None                    1.0                  1.0             1.0     pooled     weighted   EmbeddingBagCollectionSharder            1                  256        100000                                 1   #
INFO:torchrec.distributed.planner.stats:#    sparse.weighted_ebc.weighted_table_26           TW                fused   0.012 (0.002,0.002,0.007,0.002,0)     (0.096 GB, 0.0 GB)                  None                    1.0                  1.0             1.0     pooled     weighted   EmbeddingBagCollectionSharder            1                  256        100000                                 0   #
INFO:torchrec.distributed.planner.stats:#    sparse.weighted_ebc.weighted_table_27           TW                fused   0.012 (0.002,0.002,0.007,0.002,0)     (0.096 GB, 0.0 GB)                  None                    1.0                  1.0             1.0     pooled     weighted   EmbeddingBagCollectionSharder            1                  256        100000                                 1   #
INFO:torchrec.distributed.planner.stats:#    sparse.weighted_ebc.weighted_table_28           TW                fused   0.012 (0.002,0.002,0.007,0.002,0)     (0.096 GB, 0.0 GB)                  None                    1.0                  1.0             1.0     pooled     weighted   EmbeddingBagCollectionSharder            1                  256        100000                                 0   #
INFO:torchrec.distributed.planner.stats:#    sparse.weighted_ebc.weighted_table_29           TW                fused   0.012 (0.002,0.002,0.007,0.002,0)     (0.096 GB, 0.0 GB)                  None                    1.0                  1.0             1.0     pooled     weighted   EmbeddingBagCollectionSharder            1                  256        100000                                 1   #
INFO:torchrec.distributed.planner.stats:#    sparse.weighted_ebc.weighted_table_30           TW                fused   0.012 (0.002,0.002,0.007,0.002,0)     (0.096 GB, 0.0 GB)                  None                    1.0                  1.0             1.0     pooled     weighted   EmbeddingBagCollectionSharder            1                  256        100000                                 0   #
INFO:torchrec.distributed.planner.stats:#    sparse.weighted_ebc.weighted_table_31           TW                fused   0.012 (0.002,0.002,0.007,0.002,0)     (0.096 GB, 0.0 GB)                  None                    1.0                  1.0             1.0     pooled     weighted   EmbeddingBagCollectionSharder            1                  256        100000                                 1   #
INFO:torchrec.distributed.planner.stats:#    sparse.weighted_ebc.weighted_table_32           TW                fused   0.012 (0.002,0.002,0.007,0.002,0)     (0.096 GB, 0.0 GB)                  None                    1.0                  1.0             1.0     pooled     weighted   EmbeddingBagCollectionSharder            1                  256        100000                                 0   #
INFO:torchrec.distributed.planner.stats:#    sparse.weighted_ebc.weighted_table_33           TW                fused   0.012 (0.002,0.002,0.007,0.002,0)     (0.096 GB, 0.0 GB)                  None                    1.0                  1.0             1.0     pooled     weighted   EmbeddingBagCollectionSharder            1                  256        100000                                 1   #
INFO:torchrec.distributed.planner.stats:#    sparse.weighted_ebc.weighted_table_34           TW                fused   0.012 (0.002,0.002,0.007,0.002,0)     (0.096 GB, 0.0 GB)                  None                    1.0                  1.0             1.0     pooled     weighted   EmbeddingBagCollectionSharder            1                  256        100000                                 0   #
INFO:torchrec.distributed.planner.stats:#    sparse.weighted_ebc.weighted_table_35           TW                fused   0.012 (0.002,0.002,0.007,0.002,0)     (0.096 GB, 0.0 GB)                  None                    1.0                  1.0             1.0     pooled     weighted   EmbeddingBagCollectionSharder            1                  256        100000                                 1   #
INFO:torchrec.distributed.planner.stats:#    sparse.weighted_ebc.weighted_table_36           TW                fused   0.012 (0.002,0.002,0.007,0.002,0)     (0.096 GB, 0.0 GB)                  None                    1.0                  1.0             1.0     pooled     weighted   EmbeddingBagCollectionSharder            1                  256        100000                                 0   #
INFO:torchrec.distributed.planner.stats:#    sparse.weighted_ebc.weighted_table_37           TW                fused   0.012 (0.002,0.002,0.007,0.002,0)     (0.096 GB, 0.0 GB)                  None                    1.0                  1.0             1.0     pooled     weighted   EmbeddingBagCollectionSharder            1                  256        100000                                 1   #
INFO:torchrec.distributed.planner.stats:#    sparse.weighted_ebc.weighted_table_38           TW                fused   0.012 (0.002,0.002,0.007,0.002,0)     (0.096 GB, 0.0 GB)                  None                    1.0                  1.0             1.0     pooled     weighted   EmbeddingBagCollectionSharder            1                  256        100000                                 0   #
INFO:torchrec.distributed.planner.stats:#    sparse.weighted_ebc.weighted_table_39           TW                fused   0.012 (0.002,0.002,0.007,0.002,0)     (0.096 GB, 0.0 GB)                  None                    1.0                  1.0             1.0     pooled     weighted   EmbeddingBagCollectionSharder            1                  256        100000                                 1   #
INFO:torchrec.distributed.planner.stats:#    sparse.weighted_ebc.weighted_table_40           TW                fused   0.012 (0.002,0.002,0.007,0.002,0)     (0.096 GB, 0.0 GB)                  None                    1.0                  1.0             1.0     pooled     weighted   EmbeddingBagCollectionSharder            1                  256        100000                                 0   #
INFO:torchrec.distributed.planner.stats:#    sparse.weighted_ebc.weighted_table_41           TW                fused   0.012 (0.002,0.002,0.007,0.002,0)     (0.096 GB, 0.0 GB)                  None                    1.0                  1.0             1.0     pooled     weighted   EmbeddingBagCollectionSharder            1                  256        100000                                 1   #
INFO:torchrec.distributed.planner.stats:#    sparse.weighted_ebc.weighted_table_42           TW                fused   0.012 (0.002,0.002,0.007,0.002,0)     (0.096 GB, 0.0 GB)                  None                    1.0                  1.0             1.0     pooled     weighted   EmbeddingBagCollectionSharder            1                  256        100000                                 0   #
INFO:torchrec.distributed.planner.stats:#    sparse.weighted_ebc.weighted_table_43           TW                fused   0.012 (0.002,0.002,0.007,0.002,0)     (0.096 GB, 0.0 GB)                  None                    1.0                  1.0             1.0     pooled     weighted   EmbeddingBagCollectionSharder            1                  256        100000                                 1   #
INFO:torchrec.distributed.planner.stats:#    sparse.weighted_ebc.weighted_table_44           TW                fused   0.012 (0.002,0.002,0.007,0.002,0)     (0.096 GB, 0.0 GB)                  None                    1.0                  1.0             1.0     pooled     weighted   EmbeddingBagCollectionSharder            1                  256        100000                                 0   #
INFO:torchrec.distributed.planner.stats:#    sparse.weighted_ebc.weighted_table_45           TW                fused   0.012 (0.002,0.002,0.007,0.002,0)     (0.096 GB, 0.0 GB)                  None                    1.0                  1.0             1.0     pooled     weighted   EmbeddingBagCollectionSharder            1                  256        100000                                 1   #
INFO:torchrec.distributed.planner.stats:#    sparse.weighted_ebc.weighted_table_46           TW                fused   0.012 (0.002,0.002,0.007,0.002,0)     (0.096 GB, 0.0 GB)                  None                    1.0                  1.0             1.0     pooled     weighted   EmbeddingBagCollectionSharder            1                  256        100000                                 0   #
INFO:torchrec.distributed.planner.stats:#    sparse.weighted_ebc.weighted_table_47           TW                fused   0.012 (0.002,0.002,0.007,0.002,0)     (0.096 GB, 0.0 GB)                  None                    1.0                  1.0             1.0     pooled     weighted   EmbeddingBagCollectionSharder            1                  256        100000                                 1   #
INFO:torchrec.distributed.planner.stats:#    sparse.weighted_ebc.weighted_table_48           TW                fused   0.012 (0.002,0.002,0.007,0.002,0)     (0.096 GB, 0.0 GB)                  None                    1.0                  1.0             1.0     pooled     weighted   EmbeddingBagCollectionSharder            1                  256        100000                                 0   #
INFO:torchrec.distributed.planner.stats:#    sparse.weighted_ebc.weighted_table_49           TW                fused   0.012 (0.002,0.002,0.007,0.002,0)     (0.096 GB, 0.0 GB)                  None                    1.0                  1.0             1.0     pooled     weighted   EmbeddingBagCollectionSharder            1                  256        100000                                 1   #
INFO:torchrec.distributed.planner.stats:#                       sparse.ebc.table_0           TW                fused    0.01 (0.002,0.002,0.004,0.002,0)     (0.096 GB, 0.0 GB)                  None                    1.0                  1.0             1.0     pooled   unweighted   EmbeddingBagCollectionSharder            1                  256        100000                                 0   #
INFO:torchrec.distributed.planner.stats:#                       sparse.ebc.table_1           TW                fused    0.01 (0.002,0.002,0.004,0.002,0)     (0.096 GB, 0.0 GB)                  None                    1.0                  1.0             1.0     pooled   unweighted   EmbeddingBagCollectionSharder            1                  256        100000                                 1   #
INFO:torchrec.distributed.planner.stats:#                       sparse.ebc.table_2           TW                fused    0.01 (0.002,0.002,0.004,0.002,0)     (0.096 GB, 0.0 GB)                  None                    1.0                  1.0             1.0     pooled   unweighted   EmbeddingBagCollectionSharder            1                  256        100000                                 0   #
INFO:torchrec.distributed.planner.stats:#                       sparse.ebc.table_3           TW                fused    0.01 (0.002,0.002,0.004,0.002,0)     (0.096 GB, 0.0 GB)                  None                    1.0                  1.0             1.0     pooled   unweighted   EmbeddingBagCollectionSharder            1                  256        100000                                 1   #
INFO:torchrec.distributed.planner.stats:#                       sparse.ebc.table_4           TW                fused    0.01 (0.002,0.002,0.004,0.002,0)     (0.096 GB, 0.0 GB)                  None                    1.0                  1.0             1.0     pooled   unweighted   EmbeddingBagCollectionSharder            1                  256        100000                                 0   #
INFO:torchrec.distributed.planner.stats:#                       sparse.ebc.table_5           TW                fused    0.01 (0.002,0.002,0.004,0.002,0)     (0.096 GB, 0.0 GB)                  None                    1.0                  1.0             1.0     pooled   unweighted   EmbeddingBagCollectionSharder            1                  256        100000                                 1   #
INFO:torchrec.distributed.planner.stats:#                       sparse.ebc.table_6           TW                fused    0.01 (0.002,0.002,0.004,0.002,0)     (0.096 GB, 0.0 GB)                  None                    1.0                  1.0             1.0     pooled   unweighted   EmbeddingBagCollectionSharder            1                  256        100000                                 0   #
INFO:torchrec.distributed.planner.stats:#                       sparse.ebc.table_7           TW                fused    0.01 (0.002,0.002,0.004,0.002,0)     (0.096 GB, 0.0 GB)                  None                    1.0                  1.0             1.0     pooled   unweighted   EmbeddingBagCollectionSharder            1                  256        100000                                 1   #
INFO:torchrec.distributed.planner.stats:#                       sparse.ebc.table_8           TW                fused    0.01 (0.002,0.002,0.004,0.002,0)     (0.096 GB, 0.0 GB)                  None                    1.0                  1.0             1.0     pooled   unweighted   EmbeddingBagCollectionSharder            1                  256        100000                                 0   #
INFO:torchrec.distributed.planner.stats:#                       sparse.ebc.table_9           TW                fused    0.01 (0.002,0.002,0.004,0.002,0)     (0.096 GB, 0.0 GB)                  None                    1.0                  1.0             1.0     pooled   unweighted   EmbeddingBagCollectionSharder            1                  256        100000                                 1   #
INFO:torchrec.distributed.planner.stats:#                      sparse.ebc.table_10           TW                fused    0.01 (0.002,0.002,0.004,0.002,0)     (0.096 GB, 0.0 GB)                  None                    1.0                  1.0             1.0     pooled   unweighted   EmbeddingBagCollectionSharder            1                  256        100000                                 0   #
INFO:torchrec.distributed.planner.stats:#                      sparse.ebc.table_11           TW                fused    0.01 (0.002,0.002,0.004,0.002,0)     (0.096 GB, 0.0 GB)                  None                    1.0                  1.0             1.0     pooled   unweighted   EmbeddingBagCollectionSharder            1                  256        100000                                 1   #
INFO:torchrec.distributed.planner.stats:#                      sparse.ebc.table_12           TW                fused    0.01 (0.002,0.002,0.004,0.002,0)     (0.096 GB, 0.0 GB)                  None                    1.0                  1.0             1.0     pooled   unweighted   EmbeddingBagCollectionSharder            1                  256        100000                                 0   #
INFO:torchrec.distributed.planner.stats:#                      sparse.ebc.table_13           TW                fused    0.01 (0.002,0.002,0.004,0.002,0)     (0.096 GB, 0.0 GB)                  None                    1.0                  1.0             1.0     pooled   unweighted   EmbeddingBagCollectionSharder            1                  256        100000                                 1   #
INFO:torchrec.distributed.planner.stats:#                      sparse.ebc.table_14           TW                fused    0.01 (0.002,0.002,0.004,0.002,0)     (0.096 GB, 0.0 GB)                  None                    1.0                  1.0             1.0     pooled   unweighted   EmbeddingBagCollectionSharder            1                  256        100000                                 0   #
INFO:torchrec.distributed.planner.stats:#                      sparse.ebc.table_15           TW                fused    0.01 (0.002,0.002,0.004,0.002,0)     (0.096 GB, 0.0 GB)                  None                    1.0                  1.0             1.0     pooled   unweighted   EmbeddingBagCollectionSharder            1                  256        100000                                 1   #
INFO:torchrec.distributed.planner.stats:#                      sparse.ebc.table_16           TW                fused    0.01 (0.002,0.002,0.004,0.002,0)     (0.096 GB, 0.0 GB)                  None                    1.0                  1.0             1.0     pooled   unweighted   EmbeddingBagCollectionSharder            1                  256        100000                                 0   #
INFO:torchrec.distributed.planner.stats:#                      sparse.ebc.table_17           TW                fused    0.01 (0.002,0.002,0.004,0.002,0)     (0.096 GB, 0.0 GB)                  None                    1.0                  1.0             1.0     pooled   unweighted   EmbeddingBagCollectionSharder            1                  256        100000                                 1   #
INFO:torchrec.distributed.planner.stats:#                      sparse.ebc.table_18           TW                fused    0.01 (0.002,0.002,0.004,0.002,0)     (0.096 GB, 0.0 GB)                  None                    1.0                  1.0             1.0     pooled   unweighted   EmbeddingBagCollectionSharder            1                  256        100000                                 0   #
INFO:torchrec.distributed.planner.stats:#                      sparse.ebc.table_19           TW                fused    0.01 (0.002,0.002,0.004,0.002,0)     (0.096 GB, 0.0 GB)                  None                    1.0                  1.0             1.0     pooled   unweighted   EmbeddingBagCollectionSharder            1                  256        100000                                 1   #
INFO:torchrec.distributed.planner.stats:#                      sparse.ebc.table_20           TW                fused    0.01 (0.002,0.002,0.004,0.002,0)     (0.096 GB, 0.0 GB)                  None                    1.0                  1.0             1.0     pooled   unweighted   EmbeddingBagCollectionSharder            1                  256        100000                                 0   #
INFO:torchrec.distributed.planner.stats:#                      sparse.ebc.table_21           TW                fused    0.01 (0.002,0.002,0.004,0.002,0)     (0.096 GB, 0.0 GB)                  None                    1.0                  1.0             1.0     pooled   unweighted   EmbeddingBagCollectionSharder            1                  256        100000                                 1   #
INFO:torchrec.distributed.planner.stats:#                      sparse.ebc.table_22           TW                fused    0.01 (0.002,0.002,0.004,0.002,0)     (0.096 GB, 0.0 GB)                  None                    1.0                  1.0             1.0     pooled   unweighted   EmbeddingBagCollectionSharder            1                  256        100000                                 0   #
INFO:torchrec.distributed.planner.stats:#                      sparse.ebc.table_23           TW                fused    0.01 (0.002,0.002,0.004,0.002,0)     (0.096 GB, 0.0 GB)                  None                    1.0                  1.0             1.0     pooled   unweighted   EmbeddingBagCollectionSharder            1                  256        100000                                 1   #
INFO:torchrec.distributed.planner.stats:#                      sparse.ebc.table_24           TW                fused    0.01 (0.002,0.002,0.004,0.002,0)     (0.096 GB, 0.0 GB)                  None                    1.0                  1.0             1.0     pooled   unweighted   EmbeddingBagCollectionSharder            1                  256        100000                                 0   #
INFO:torchrec.distributed.planner.stats:#                      sparse.ebc.table_25           TW                fused    0.01 (0.002,0.002,0.004,0.002,0)     (0.096 GB, 0.0 GB)                  None                    1.0                  1.0             1.0     pooled   unweighted   EmbeddingBagCollectionSharder            1                  256        100000                                 1   #
INFO:torchrec.distributed.planner.stats:#                      sparse.ebc.table_26           TW                fused    0.01 (0.002,0.002,0.004,0.002,0)     (0.096 GB, 0.0 GB)                  None                    1.0                  1.0             1.0     pooled   unweighted   EmbeddingBagCollectionSharder            1                  256        100000                                 0   #
INFO:torchrec.distributed.planner.stats:#                      sparse.ebc.table_27           TW                fused    0.01 (0.002,0.002,0.004,0.002,0)     (0.096 GB, 0.0 GB)                  None                    1.0                  1.0             1.0     pooled   unweighted   EmbeddingBagCollectionSharder            1                  256        100000                                 1   #
INFO:torchrec.distributed.planner.stats:#                      sparse.ebc.table_28           TW                fused    0.01 (0.002,0.002,0.004,0.002,0)     (0.096 GB, 0.0 GB)                  None                    1.0                  1.0             1.0     pooled   unweighted   EmbeddingBagCollectionSharder            1                  256        100000                                 0   #
INFO:torchrec.distributed.planner.stats:#                      sparse.ebc.table_29           TW                fused    0.01 (0.002,0.002,0.004,0.002,0)     (0.096 GB, 0.0 GB)                  None                    1.0                  1.0             1.0     pooled   unweighted   EmbeddingBagCollectionSharder            1                  256        100000                                 1   #
INFO:torchrec.distributed.planner.stats:#                      sparse.ebc.table_30           TW                fused    0.01 (0.002,0.002,0.004,0.002,0)     (0.096 GB, 0.0 GB)                  None                    1.0                  1.0             1.0     pooled   unweighted   EmbeddingBagCollectionSharder            1                  256        100000                                 0   #
INFO:torchrec.distributed.planner.stats:#                      sparse.ebc.table_31           TW                fused    0.01 (0.002,0.002,0.004,0.002,0)     (0.096 GB, 0.0 GB)                  None                    1.0                  1.0             1.0     pooled   unweighted   EmbeddingBagCollectionSharder            1                  256        100000                                 1   #
INFO:torchrec.distributed.planner.stats:#                      sparse.ebc.table_32           TW                fused    0.01 (0.002,0.002,0.004,0.002,0)     (0.096 GB, 0.0 GB)                  None                    1.0                  1.0             1.0     pooled   unweighted   EmbeddingBagCollectionSharder            1                  256        100000                                 0   #
INFO:torchrec.distributed.planner.stats:#                      sparse.ebc.table_33           TW                fused    0.01 (0.002,0.002,0.004,0.002,0)     (0.096 GB, 0.0 GB)                  None                    1.0                  1.0             1.0     pooled   unweighted   EmbeddingBagCollectionSharder            1                  256        100000                                 1   #
INFO:torchrec.distributed.planner.stats:#                      sparse.ebc.table_34           TW                fused    0.01 (0.002,0.002,0.004,0.002,0)     (0.096 GB, 0.0 GB)                  None                    1.0                  1.0             1.0     pooled   unweighted   EmbeddingBagCollectionSharder            1                  256        100000                                 0   #
INFO:torchrec.distributed.planner.stats:#                      sparse.ebc.table_35           TW                fused    0.01 (0.002,0.002,0.004,0.002,0)     (0.096 GB, 0.0 GB)                  None                    1.0                  1.0             1.0     pooled   unweighted   EmbeddingBagCollectionSharder            1                  256        100000                                 1   #
INFO:torchrec.distributed.planner.stats:#                      sparse.ebc.table_36           TW                fused    0.01 (0.002,0.002,0.004,0.002,0)     (0.096 GB, 0.0 GB)                  None                    1.0                  1.0             1.0     pooled   unweighted   EmbeddingBagCollectionSharder            1                  256        100000                                 0   #
INFO:torchrec.distributed.planner.stats:#                      sparse.ebc.table_37           TW                fused    0.01 (0.002,0.002,0.004,0.002,0)     (0.096 GB, 0.0 GB)                  None                    1.0                  1.0             1.0     pooled   unweighted   EmbeddingBagCollectionSharder            1                  256        100000                                 1   #
INFO:torchrec.distributed.planner.stats:#                      sparse.ebc.table_38           TW                fused    0.01 (0.002,0.002,0.004,0.002,0)     (0.096 GB, 0.0 GB)                  None                    1.0                  1.0             1.0     pooled   unweighted   EmbeddingBagCollectionSharder            1                  256        100000                                 0   #
INFO:torchrec.distributed.planner.stats:#                      sparse.ebc.table_39           TW                fused    0.01 (0.002,0.002,0.004,0.002,0)     (0.096 GB, 0.0 GB)                  None                    1.0                  1.0             1.0     pooled   unweighted   EmbeddingBagCollectionSharder            1                  256        100000                                 1   #
INFO:torchrec.distributed.planner.stats:#                      sparse.ebc.table_40           TW                fused    0.01 (0.002,0.002,0.004,0.002,0)     (0.096 GB, 0.0 GB)                  None                    1.0                  1.0             1.0     pooled   unweighted   EmbeddingBagCollectionSharder            1                  256        100000                                 0   #
INFO:torchrec.distributed.planner.stats:#                      sparse.ebc.table_41           TW                fused    0.01 (0.002,0.002,0.004,0.002,0)     (0.096 GB, 0.0 GB)                  None                    1.0                  1.0             1.0     pooled   unweighted   EmbeddingBagCollectionSharder            1                  256        100000                                 1   #
INFO:torchrec.distributed.planner.stats:#                      sparse.ebc.table_42           TW                fused    0.01 (0.002,0.002,0.004,0.002,0)     (0.096 GB, 0.0 GB)                  None                    1.0                  1.0             1.0     pooled   unweighted   EmbeddingBagCollectionSharder            1                  256        100000                                 0   #
INFO:torchrec.distributed.planner.stats:#                      sparse.ebc.table_43           TW                fused    0.01 (0.002,0.002,0.004,0.002,0)     (0.096 GB, 0.0 GB)                  None                    1.0                  1.0             1.0     pooled   unweighted   EmbeddingBagCollectionSharder            1                  256        100000                                 1   #
INFO:torchrec.distributed.planner.stats:#                      sparse.ebc.table_44           TW                fused    0.01 (0.002,0.002,0.004,0.002,0)     (0.096 GB, 0.0 GB)                  None                    1.0                  1.0             1.0     pooled   unweighted   EmbeddingBagCollectionSharder            1                  256        100000                                 0   #
INFO:torchrec.distributed.planner.stats:#                      sparse.ebc.table_45           TW                fused    0.01 (0.002,0.002,0.004,0.002,0)     (0.096 GB, 0.0 GB)                  None                    1.0                  1.0             1.0     pooled   unweighted   EmbeddingBagCollectionSharder            1                  256        100000                                 1   #
INFO:torchrec.distributed.planner.stats:#                      sparse.ebc.table_46           TW                fused    0.01 (0.002,0.002,0.004,0.002,0)     (0.096 GB, 0.0 GB)                  None                    1.0                  1.0             1.0     pooled   unweighted   EmbeddingBagCollectionSharder            1                  256        100000                                 0   #
INFO:torchrec.distributed.planner.stats:#                      sparse.ebc.table_47           TW                fused    0.01 (0.002,0.002,0.004,0.002,0)     (0.096 GB, 0.0 GB)                  None                    1.0                  1.0             1.0     pooled   unweighted   EmbeddingBagCollectionSharder            1                  256        100000                                 1   #
INFO:torchrec.distributed.planner.stats:#                      sparse.ebc.table_48           TW                fused    0.01 (0.002,0.002,0.004,0.002,0)     (0.096 GB, 0.0 GB)                  None                    1.0                  1.0             1.0     pooled   unweighted   EmbeddingBagCollectionSharder            1                  256        100000                                 0   #
INFO:torchrec.distributed.planner.stats:#                      sparse.ebc.table_49           TW                fused    0.01 (0.002,0.002,0.004,0.002,0)     (0.096 GB, 0.0 GB)                  None                    1.0                  1.0             1.0     pooled   unweighted   EmbeddingBagCollectionSharder            1                  256        100000                                 1   #
INFO:torchrec.distributed.planner.stats:#                    sparse.ebc.FP16_table           RW   dram_virtual_table        0.432 (0.09,0.003,0.2,0.2,0)     (0.383 GB, 0.0 GB)                  None                    1.0                  1.0             1.0     pooled   unweighted   EmbeddingBagCollectionSharder            1                  512       1000000                               0-1   #
INFO:torchrec.distributed.planner.stats:#                   sparse.ebc.large_table           CW                fused       0.079 (0.02,0.01,0.04,0.01,0)     (7.637 GB, 0.0 GB)                  None                    1.0                  1.0             1.0     pooled   unweighted   EmbeddingBagCollectionSharder            1           2048 (128)       1000000   0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1   #
INFO:torchrec.distributed.planner.stats:#                                                                                                                                                                                                                                                                                                                                                                     #
INFO:torchrec.distributed.planner.stats:# Batch Size: 512                                                                                                                                                                                                                                                                                                                                                     #
INFO:torchrec.distributed.planner.stats:#                                                                                                                                                                                                                                                                                                                                                                     #
INFO:torchrec.distributed.planner.stats:# Compute Kernels Count:                                                                                                                                                                                                                                                                                                                                              #
INFO:torchrec.distributed.planner.stats:#    dram_virtual_table: 1                                                                                                                                                                                                                                                                                                                                            #
INFO:torchrec.distributed.planner.stats:#    fused: 101                                                                                                                                                                                                                                                                                                                                                       #
INFO:torchrec.distributed.planner.stats:#                                                                                                                                                                                                                                                                                                                                                                     #
INFO:torchrec.distributed.planner.stats:# Compute Kernels Storage:                                                                                                                                                                                                                                                                                                                                            #
INFO:torchrec.distributed.planner.stats:#    dram_virtual_table: HBM: 0.383 GB, DDR: 0.0 GB                                                                                                                                                                                                                                                                                                                   #
INFO:torchrec.distributed.planner.stats:#    fused: HBM: 17.272 GB, DDR: 0.0 GB                                                                                                                                                                                                                                                                                                                               #
INFO:torchrec.distributed.planner.stats:#                                                                                                                                                                                                                                                                                                                                                                     #
INFO:torchrec.distributed.planner.stats:# Total Perf Imbalance Statistics                                                                                                                                                                                                                                                                                                                                     #
INFO:torchrec.distributed.planner.stats:# Total Variation: 0.000                                                                                                                                                                                                                                                                                                                                              #
INFO:torchrec.distributed.planner.stats:# Total Distance: 0.000                                                                                                                                                                                                                                                                                                                                               #
INFO:torchrec.distributed.planner.stats:# Chi Divergence: 0.000                                                                                                                                                                                                                                                                                                                                               #
INFO:torchrec.distributed.planner.stats:# KL Divergence: 0.000                                                                                                                                                                                                                                                                                                                                                #
INFO:torchrec.distributed.planner.stats:#                                                                                                                                                                                                                                                                                                                                                                     #
INFO:torchrec.distributed.planner.stats:# HBM Imbalance Statistics                                                                                                                                                                                                                                                                                                                                            #
INFO:torchrec.distributed.planner.stats:# Total Variation: 0.000                                                                                                                                                                                                                                                                                                                                              #
INFO:torchrec.distributed.planner.stats:# Total Distance: 0.000                                                                                                                                                                                                                                                                                                                                               #
INFO:torchrec.distributed.planner.stats:# Chi Divergence: 0.000                                                                                                                                                                                                                                                                                                                                               #
INFO:torchrec.distributed.planner.stats:# KL Divergence: 0.000                                                                                                                                                                                                                                                                                                                                                #
INFO:torchrec.distributed.planner.stats:#                                                                                                                                                                                                                                                                                                                                                                     #
INFO:torchrec.distributed.planner.stats:# Imbalance stats range 0-1, higher means more imbalanced                                                                                                                                                                                                                                                                                                             #
INFO:torchrec.distributed.planner.stats:#                                                                                                                                                                                                                                                                                                                                                                     #
INFO:torchrec.distributed.planner.stats:# Maximum of Total Perf: 0.807 ms on ranks 0-1                                                                                                                                                                                                                                                                                                                        #
INFO:torchrec.distributed.planner.stats:# Mean Total Perf: 0.807 ms                                                                                                                                                                                                                                                                                                                                           #
INFO:torchrec.distributed.planner.stats:# Max Total Perf is 0% greater than the mean                                                                                                                                                                                                                                                                                                                          #
INFO:torchrec.distributed.planner.stats:# Maximum of Forward Compute: 0.164 ms on ranks 0-1                                                                                                                                                                                                                                                                                                                   #
INFO:torchrec.distributed.planner.stats:# Maximum of Forward Comms: 0.09 ms on ranks 0-1                                                                                                                                                                                                                                                                                                                      #
INFO:torchrec.distributed.planner.stats:# Maximum of Backward Compute: 0.389 ms on ranks 0-1                                                                                                                                                                                                                                                                                                                  #
INFO:torchrec.distributed.planner.stats:# Maximum of Backward Comms: 0.164 ms on ranks 0-1                                                                                                                                                                                                                                                                                                                    #
INFO:torchrec.distributed.planner.stats:# Maximum of Prefetch Compute: 0.0 ms on ranks 0-1                                                                                                                                                                                                                                                                                                                    #
INFO:torchrec.distributed.planner.stats:# Sum of Maxima: 0.807 ms                                                                                                                                                                                                                                                                                                                                             #
INFO:torchrec.distributed.planner.stats:#                                                                                                                                                                                                                                                                                                                                                                     #
INFO:torchrec.distributed.planner.stats:# Estimated Sharding Distribution                                                                                                                                                                                                                                                                                                                                      #
INFO:torchrec.distributed.planner.stats:# Sparse only Max HBM: 8.828 GB on ranks [0, 1]                                                                                                                                                                                                                                                                                                                       #
INFO:torchrec.distributed.planner.stats:# Sparse only Min HBM: 8.828 GB on ranks [0, 1]                                                                                                                                                                                                                                                                                                                       #
INFO:torchrec.distributed.planner.stats:# Max HBM: 8.846 GB on ranks [0, 1]                                                                                                                                                                                                                                                                                                                                   #
INFO:torchrec.distributed.planner.stats:# Min HBM: 8.846 GB on ranks [0, 1]                                                                                                                                                                                                                                                                                                                                   #
INFO:torchrec.distributed.planner.stats:# Mean HBM: 8.846 GB on ranks [0, 1]                                                                                                                                                                                                                                                                                                                                  #
INFO:torchrec.distributed.planner.stats:# Low Median HBM: 8.846 GB on ranks [0, 1]                                                                                                                                                                                                                                                                                                                            #
INFO:torchrec.distributed.planner.stats:# High Median HBM: 8.846 GB on ranks [0, 1]                                                                                                                                                                                                                                                                                                                           #
INFO:torchrec.distributed.planner.stats:# Critical Path (comms): 0.254                                                                                                                                                                                                                                                                                                                                        #
INFO:torchrec.distributed.planner.stats:# Critical Path (compute): 0.553                                                                                                                                                                                                                                                                                                                                      #
INFO:torchrec.distributed.planner.stats:# Critical Path (comms + compute): 0.807                                                                                                                                                                                                                                                                                                                              #
INFO:torchrec.distributed.planner.stats:# Max HBM is 0% greater than the mean                                                                                                                                                                                                                                                                                                                                 #
INFO:torchrec.distributed.planner.stats:#                                                                                                                                                                                                                                                                                                                                                                     #
INFO:torchrec.distributed.planner.stats:#                                                                                                                                                                                                                                                                                                                                                                     #
INFO:torchrec.distributed.planner.stats:# Top HBM Memory Usage Estimation: 8.846 GB                                                                                                                                                                                                                                                                                                                           #
INFO:torchrec.distributed.planner.stats:# Top Tier #1 Estimated Peak HBM Pressure: 8.846 GB on ranks 0-1                                                                                                                                                                                                                                                                                                      #
INFO:torchrec.distributed.planner.stats:#                                                                                                                                                                                                                                                                                                                                                                     #
INFO:torchrec.distributed.planner.stats:# Reserved Memory:                                                                                                                                                                                                                                                                                                                                                    #
INFO:torchrec.distributed.planner.stats:#    HBM: 4.8 GB                                                                                                                                                                                                                                                                                                                                                      #
INFO:torchrec.distributed.planner.stats:#    Percent of Total HBM: 15%                                                                                                                                                                                                                                                                                                                                        #
INFO:torchrec.distributed.planner.stats:# Planning Memory:                                                                                                                                                                                                                                                                                                                                                    #
INFO:torchrec.distributed.planner.stats:#    HBM: 27.2 GB, DDR: 128.0 GB                                                                                                                                                                                                                                                                                                                                      #
INFO:torchrec.distributed.planner.stats:#    Percent of Total HBM: 85%                                                                                                                                                                                                                                                                                                                                        #
INFO:torchrec.distributed.planner.stats:#                                                                                                                                                                                                                                                                                                                                                                     #
INFO:torchrec.distributed.planner.stats:# Dense Storage (per rank):                                                                                                                                                                                                                                                                                                                                           #
INFO:torchrec.distributed.planner.stats:#    HBM: 0.01 GB, DDR: 0.0 GB                                                                                                                                                                                                                                                                                                                                        #
INFO:torchrec.distributed.planner.stats:#                                                                                                                                                                                                                                                                                                                                                                     #
INFO:torchrec.distributed.planner.stats:# KJT Storage (per rank):                                                                                                                                                                                                                                                                                                                                             #
INFO:torchrec.distributed.planner.stats:#    HBM: 0.008 GB, DDR: 0.0 GB                                                                                                                                                                                                                                                                                                                                       #
INFO:torchrec.distributed.planner.stats:#                                                                                                                                                                                                                                                                                                                                                                     #
INFO:torchrec.distributed.planner.stats:# Top 5 Tables Causing Max Perf:                                                                                                                                                                                                                                                                                                                                      #
INFO:torchrec.distributed.planner.stats:#    large_table                                                                                                                                                                                                                                                                                                                                                      #
INFO:torchrec.distributed.planner.stats:# Top 5 Tables Causing Max HBM:                                                                                                                                                                                                                                                                                                                                       #
INFO:torchrec.distributed.planner.stats:#    large_table: 0.477 GB on ranks [0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1]                                                                                                                                                                                                                                                                                  #
INFO:torchrec.distributed.planner.stats:#######################################################################################################################################################################################################################################################################################################################################################################
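As a quick sanity check on the planner log above, the reported percentages and the combined critical path can be recomputed from the printed values. This is a minimal sketch, not planner code: the variable names are made up for illustration, and the total HBM figure is derived here as reserved plus planning memory because the log does not print it directly.

```python
# Values copied from the planner statistics log above.
reserved_hbm_gb = 4.8        # "Reserved Memory: HBM"
planning_hbm_gb = 27.2       # "Planning Memory: HBM"
critical_path_comms = 0.254
critical_path_compute = 0.553

# Assumption: total HBM is the sum of the reserved and planning budgets.
total_hbm_gb = reserved_hbm_gb + planning_hbm_gb

reserved_pct = round(100 * reserved_hbm_gb / total_hbm_gb)
planning_pct = round(100 * planning_hbm_gb / total_hbm_gb)
critical_path_total = round(critical_path_comms + critical_path_compute, 3)

print(f"total HBM (derived): {total_hbm_gb:.1f} GB")
print(f"reserved: {reserved_pct}%, planning: {planning_pct}%")
print(f"critical path (comms + compute): {critical_path_total}")
```

The recomputed values can be compared against the "Percent of Total HBM" and "Critical Path (comms + compute)" lines in the log.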

Differential Revision: D84268361


meta-codesync bot commented Dec 10, 2025

@TroyGarden has exported this pull request. If you are a Meta employee, you can view the originating Diff in D84268361.

@meta-cla meta-cla bot added the CLA Signed label Dec 10, 2025
Summary:

# context
* modify the kvzch benchmark configs to better represent the real use case
* add config pass-in to the test models
* fix small bugs and do minor refactoring


# changes
* the previous kv-zch embedding table was too small, so the prefetch process was too short; after this change (increased table size) the prefetch process runs longer
 {F1983784711}  {F1983784733}

# benchmark
|short name                         |GPU Runtime (P90)|CPU Runtime (P90)|GPU Peak Mem alloc (P90)|GPU Peak Mem reserved (P90)|GPU Mem used (P90)|Malloc retries (P50/P90/P100)|CPU Peak RSS (P90)|
|--|--|--|--|--|--|--|--|
|regular-base                       |9864.51 ms       |9403.68 ms       |33.77 GB                |49.66 GB                   |50.71 GB          |0.0 / 0.0 / 0.0              |30.65 GB          |
|kvzch-base                         |18804.26 ms      |44245.82 ms      |25.28 GB                |36.33 GB                   |37.38 GB          |0.0 / 0.0 / 0.0              |31.18 GB          |
|base-inplace                       |20141.71 ms      |46805.58 ms      |25.28 GB                |34.39 GB                   |35.44 GB          |0.0 / 0.0 / 0.0              |31.19 GB          |
|kvzch-sdd                          |20382.59 ms      |45647.02 ms      |33.42 GB                |47.52 GB                   |48.56 GB          |0.0 / 0.0 / 0.0              |31.13 GB          |
|kvzch-prefetch                     |17951.19 ms      |38598.57 ms      |33.45 GB                |47.16 GB                   |48.21 GB          |0.0 / 0.0 / 0.0              |30.83 GB          |
|regular-base                       |49710.51 ms      |74880.50 ms      |43.14 GB                |50.63 GB                   |51.68 GB          |0.0 / 0.0 / 0.0              |33.57 GB          |
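
To make the table easier to compare, the relative change of each kvzch variant against `kvzch-base` can be computed from the P90 runtimes. This is a small sketch using only the numbers in the table; the helper name `relative_change` is illustrative, and the two `regular-base` rows are left out because they share a short name and it is unclear how they relate.

```python
# P90 runtimes in milliseconds (GPU, CPU), copied from the benchmark table.
runtimes_ms = {
    "kvzch-base": (18804.26, 44245.82),
    "base-inplace": (20141.71, 46805.58),
    "kvzch-sdd": (20382.59, 45647.02),
    "kvzch-prefetch": (17951.19, 38598.57),
}


def relative_change(value: float, baseline: float) -> float:
    """Return (value - baseline) / baseline; negative means faster."""
    return (value - baseline) / baseline


base_gpu, base_cpu = runtimes_ms["kvzch-base"]
changes = {
    name: (relative_change(gpu, base_gpu), relative_change(cpu, base_cpu))
    for name, (gpu, cpu) in runtimes_ms.items()
}

for name, (gpu_delta, cpu_delta) in changes.items():
    print(f"{name:<16} GPU {gpu_delta:+.1%}  CPU {cpu_delta:+.1%}")
```

These are single P90 figures per configuration, so the percentages describe this run only and say nothing about run-to-run variance.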

Reviewed By: spmex

Differential Revision: D84268361
@meta-codesync meta-codesync bot closed this in 223db0d Dec 11, 2025
@TroyGarden TroyGarden deleted the export-D84268361 branch December 11, 2025 04:05

Labels

CLA Signed, fb-exported, meta-exported
