[MLPerf] Add DLRM-DCNv2 #144

abheesht17 · 2025-08-18T08:48:56Z

TODOs:

Make it work for multi-host. See #143 ("Multi-host training does not work for distributed embeddings").
Currently uses dummy data; this needs to switch to the actual dataset. Code for loading the actual data is already added, but the path we read from contains already "batched" data. We can go ahead with this, or load "unbatched" data and batch it ourselves.
Add an eval dataset.
Check that the random seed is being passed correctly. We need to make sure that the initialiser returns different values for different layers (since we are passing SeedGenerator, I think we are good, but it is worth verifying manually).
~~Instead of separate files for configs, maybe have separate functions for configs.~~
Is this a worry?

 08:44:12.046114: W jax_tpu_embedding/sparsecore/lib/core/input_preprocessing_util.cc:251] No Coo Buffer Size provided for table cat_14_table_cat_15_table_cat_23_table_cat_24_table_cat_25_table_cat_33_table_cat_34_table_cat_35_table_cat_36_table, the default value (6144) may be too large and can cause OOM. Utilize the stats returned from the sparse dense matmul preprocessing API.

Set up metric logging and checkpointing.

gemini-code-assist

Summary of Changes

Hello @abheesht17, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces the DLRM-DCNv2 model as an MLPerf example, designed for efficient training on TPUs. It provides the model architecture, configuration files for dataset, model, and training hyperparameters, a dummy data loader, and a comprehensive shell script to set up the TPU environment and execute the training process. The current implementation uses dummy data and has known areas for future improvement, such as multi-host support and actual dataset integration.

Highlights

New Model Implementation: Adds the DLRM-DCNv2 model, a deep learning architecture combining deep neural networks with a cross-network for learning explicit feature interactions, commonly used in recommendation systems.
TPU Integration: Leverages Keras's distributed training capabilities with JAX backend and keras_rs.layers.DistributedEmbedding for efficient large-scale embedding table handling on TPUs, including SparseCore.
Modular Configuration: Introduces a structured configuration system using keras.utils.Config for defining dataset features, model parameters (embedding dimensions, MLP layers, DCN layers), and training hyperparameters.
Automated Setup Script: Provides a run.sh script to automate the entire setup process on Google Cloud TPU VMs, from VM creation and environment setup to dependency installation and model execution.
Dummy Data Support: Includes a dataloader.py to generate dummy data, enabling initial testing and development of the DLRM-DCNv2 model before integration with real datasets.

Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point in your pull request via creating an issue comment (i.e. comment on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.

| Feature | Command | Description |
| --- | --- | --- |
| Code Review | `/gemini review` | Performs a code review for the current pull request in its current state. |
| Pull Request Summary | `/gemini summary` | Provides a summary of the current pull request in its current state. |
| Comment | `@gemini-code-assist` | Responds in comments when explicitly tagged, both in issue comments and review comments. |
| Help | `/gemini help` | Displays a list of available commands. |

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

¹ Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

gemini-code-assist

Code Review

This pull request introduces a DLRM-DCNv2 model implementation for MLPerf benchmarks, including configurations, a dummy dataloader, the model definition, and a run script. The overall structure is good, but there are several critical and high-severity issues that need to be addressed. Specifically, there are bugs related to weight initialization due to seed reuse in loops, which will lead to layers having identical weights. There's also a potential runtime error in the model's forward pass due to improper handling of an empty list during tensor concatenation. Additionally, the main training script has a hardcoded number of epochs, ignoring the value from the configuration. I've also included some medium-severity suggestions to improve code maintainability and script robustness. Please review the detailed comments.
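The seed-reuse problem described in the review can be illustrated outside Keras. The following is a minimal sketch using only Python's standard `random` module, not the PR's actual initializer code; the function names and the `units`/`num_layers` parameters are hypothetical. It contrasts re-creating a generator from the same seed on every loop iteration (every layer gets identical weights) with creating one generator up front and letting its state advance across layers.

```python
import random


def init_with_reused_seed(num_layers, units, seed):
    """Bug pattern: a new generator with the same seed for every layer."""
    layers = []
    for _ in range(num_layers):
        rng = random.Random(seed)  # identical starting state each iteration
        layers.append([rng.gauss(0.0, 0.05) for _ in range(units)])
    return layers


def init_with_shared_generator(num_layers, units, seed):
    """Fix pattern: one generator whose state advances across layers."""
    rng = random.Random(seed)  # created once, outside the loop
    return [
        [rng.gauss(0.0, 0.05) for _ in range(units)]
        for _ in range(num_layers)
    ]


if __name__ == "__main__":
    reused = init_with_reused_seed(num_layers=3, units=4, seed=42)
    shared = init_with_shared_generator(num_layers=3, units=4, seed=42)
    print("reused seed, layers identical:", reused[0] == reused[1])
    print("shared generator, layers identical:", shared[0] == shared[1])
```

The shared-generator variant is still reproducible for a fixed seed, which is the property a stateful seed source is meant to provide.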

hertschuh · 2025-08-19T00:30:03Z

examples/ml_perf/main.py

+            ),
+            combiner="sum",
+            placement="sparsecore",
+            # TODO: These two args are not getting passed down to


You could try these XLA flags:

xla_sparse_core_max_ids_per_partition_per_sample

xla_sparse_core_max_unique_ids_per_partition_per_sample
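One way these flags might be supplied is sketched below. It assumes the TPU runtime reads extra flags from the `LIBTPU_INIT_ARGS` environment variable and that the variable is set before JAX initializes the TPU backend; whether these particular flags are picked up that way in this setup is an assumption to verify. The helper name and the value `256` are hypothetical placeholders, not tuned settings.

```python
import os

# Placeholder values only; real limits should come from the preprocessing
# stats for the actual embedding tables.
SPARSECORE_FLAGS = {
    "xla_sparse_core_max_ids_per_partition_per_sample": 256,
    "xla_sparse_core_max_unique_ids_per_partition_per_sample": 256,
}


def append_libtpu_flags(flags, env=None):
    """Append --name=value flags to LIBTPU_INIT_ARGS, keeping existing ones.

    `env` defaults to os.environ; a plain dict can be passed for testing.
    Returns the combined flag string.
    """
    env = os.environ if env is None else env
    existing = env.get("LIBTPU_INIT_ARGS", "")
    new_flags = " ".join(f"--{name}={value}" for name, value in flags.items())
    combined = f"{existing} {new_flags}".strip()
    env["LIBTPU_INIT_ARGS"] = combined
    return combined


if __name__ == "__main__":
    # Call this before importing or initializing JAX (assumption: flags are
    # read once at backend start-up).
    print(append_libtpu_flags(SPARSECORE_FLAGS))
```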

hertschuh · 2025-08-19T00:31:35Z

examples/ml_perf/main.py

+            table=table_config,
+            # TODO: Verify whether it should be `(bsz, 1)` or
+            # `(bsz, multi_hot_size)`.
+            input_shape=(per_host_batch_size, multi_hot_size),


multi_hot as an input to embedding tables is not supported, it should be turned into a ragged or padded sequence of indices before being fed to the DistributedEmbedding.

Hmmm, I don't see that happening here though: https://github.com/AI-Hypercomputer/RecML/blob/1821350b346b66479baaa0ab624aa67929305dea/examples/dlrm/dlrm_main.py#L245-L249. They use (bsz, 1). Not sure why; will check with them.

> multi_hot as an input to embedding tables is not supported, it should be turned into a ragged or padded sequence of indices before being fed to the DistributedEmbedding.

Oh, I think our understanding of multi-hot is different. I think what you mean by multi-hot is [[1, 0, 1, 0, 0], [0, 0, 1, 1, 1]]. I am passing indices here, though. I should rename it to something else; keeping multi_hot is confusing.

    np.random.randint(
        low=0,
        high=vocabulary_size,
        size=(self.batch_size, multi_hot_size),
        dtype=np.int64,
    )
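To make the two readings of "multi-hot" in this thread concrete, here is a small plain-Python sketch (no Keras or NumPy) that converts a binary multi-hot matrix into a padded sequence of indices. The function name and the `pad_value` of `-1` are hypothetical; which padding convention `DistributedEmbedding` expects is not stated in this thread and should be checked.

```python
def multi_hot_to_padded_indices(multi_hot, pad_value=-1):
    """Convert rows of 0/1 flags into rows of active indices, right-padded.

    Each output row lists the positions that were set to 1 in the input row.
    Rows are padded with `pad_value` to the length of the longest row.
    """
    rows = [[i for i, flag in enumerate(row) if flag] for row in multi_hot]
    width = max((len(r) for r in rows), default=0)
    return [r + [pad_value] * (width - len(r)) for r in rows]


if __name__ == "__main__":
    binary = [[1, 0, 1, 0, 0], [0, 0, 1, 1, 1]]
    print(multi_hot_to_padded_indices(binary))
```

The dummy data generated with `np.random.randint` above is already in the index form (one row of vocabulary indices per sample), not the binary form.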

hertschuh

Some comments about the bash script

hertschuh · 2025-09-23T19:35:08Z

examples/ml_perf/run.sh

+# ==============================================================================
+# Environment Variables
+# ==============================================================================
+export TPU_NAME="abheesht-mlperf-${ACCELERATOR_TYPE}"


export TPU_NAME="${USER}-mlperf-${ACCELERATOR_TYPE}"

hertschuh · 2025-09-23T19:35:27Z

examples/ml_perf/run.sh

+
+    if [ ! -d 'keras-rs' ]; then
+      echo '>>> Cloning keras-rs repository...'
+      git clone https://github.com/abheesht17/keras-rs.git


This needs to be changed before submitting.

hertschuh · 2025-09-23T19:36:08Z

examples/ml_perf/run.sh

@@ -0,0 +1,171 @@
+#!/bin/bash


I had to chmod this file to make it runnable. Can you fix that?

hertschuh · 2025-09-23T19:45:47Z

examples/ml_perf/run.sh

+    echo '>>> Installing/updating dependencies...'
+    pip install -e .
+    pip uninstall -y tensorflow keras
+    pip install git+https://github.com/keras-team/keras.git


Why did you need to install from source?

Oh, I think Antonio had some changes for distributing the dataset across hosts, which weren't a part of any release at that point in time (IIRC)

abheesht17 requested a review from hertschuh August 18, 2025 08:48

gemini-code-assist bot reviewed Aug 18, 2025

hertschuh reviewed Aug 19, 2025

abheesht17 requested review from cantonios, hertschuh and silkyarora August 20, 2025 13:51

hertschuh reviewed Sep 24, 2025

abheesht17 and others added 9 commits October 18, 2025 23:02

Address some comments + move to step-based trainer

5118325

Merge branch 'keras-team:main' into ml-perf

f097378

Fix

59c9d08

Refactor configs to one file

af9ba92

Follow the original example in specifying input/output shape

8b6e300

Add debugging statements

760639a

Add debugging statements

deaf35a

Add debugging statements

3d0b640

Add debugging statements

5782407

abheesht17 force-pushed the ml-perf branch from 6bb8ae0 to 5782407 Compare October 20, 2025 17:40

abheesht17 added 12 commits October 20, 2025 23:50

Temp comment out stats update code

94ffb7e

Change input size to dist emb

c578ff8

Bsz

b3425f9

Bsz

f2a849a

Bsz

d08479e

Restore stats stuff

604c07b

Comment out stats update for now

1d5e983

Try alternate way of sharding dataset

832605e

Try alternate way of sharding dataset

e621af7

Try alternate way of sharding dataset

0b11227

Debug

136a57f

Debug

2ae73ce

abheesht17 and others added 30 commits October 29, 2025 19:53

Some dataloader options

4e950f0

Debug

657e8d5

Debug

0ed121b

Debug

0de1f0b

Debug

6cfc3fc

Debug

a1c65c4

Debug

7b4b18d

Debug

4e5259f

Debug

18b97f3

Debug

5c56e1b

Debug

37345dd

Remove auto stack kwargs

06a0a79

Merge branch 'keras-team:main' into ml-perf

19c1dd8

Comment out profiling

4bfbf95

Comment out stat update for now

a0578cc

Debug

05f8905

Debug

ff0625f

Debug

36535df

Add model.summary()

27f9fd6

Add model.summary()

7c3c3e6

Workaround

e1187f0

Workaround

260e5b1

Change bsz to power of 2

df89a6c

Comment out predict

0d3b0ef

Small fix

cabac68

Add xprof stuff

4122ff3

Reduce num_steps to 10

35e6c6d

Enable Python tracer level

6061c87

Enable Python tracer level

fefd549

Enable Python tracer level

2bca758
