DP w/ TinyLlama is broken #269

@nina-xu

Priority Level

Medium (Annoying but has workaround)

Describe the bug

I ran three datasets with TinyLlama + DP and all three runs failed. SmolLM3 + DP still works; I have not tested Mistral. Training completes, but generation fails to start: the error is raised inside vLLM while the engine is being initialized (see the traceback below).

wandb project: https://wandb.ai/nemo-llm-service/nss_dp_epsilon_0317/table?nw=nwuserninaxunvidia

Steps/Code to reproduce bug

command:

bash submit_slurm_jobs.sh --runs 5 --partition polar4 --sleep-sec 30 --pipeline-mode end_to_end --exp-name nss_dp_epsilon_0317 --dataset-urls amazon_reviews_25k,car_accident,clinc_oos

config:

generation:
  num_records: 1000
  use_structured_generation: true
privacy:
  dp_enabled: true
training:
  batch_size: 8
  pretrained_model: TinyLlama/TinyLlama-1.1B-Chat-v1.0

error:

2026-03-17 15:30:18
{"extra": {"error_type": "AssertionError"}, "level": "error", "qual_name": "SafeSynthesizer.run", "lineno": 830, "filename": "observability.py", "category": "runtime", "timestamp": "2026-03-17T19:30:18.574415Z", "message": "Error in SafeSynthesizer.generate: wrong number of dimensions2"}
2026-03-17 15:30:18
{"extra": {"error_type": "AssertionError"}, "level": "error", "qual_name": "run", "lineno": 858, "filename": "observability.py", "category": "user", "timestamp": "2026-03-17T19:30:18.574941Z", "message": "Error in SafeSynthesizer: wrong number of dimensions2"}
2026-03-17 15:30:18
{"level": "info", "lineno": 631, "filename": "observability.py", "_ray_timestamp_ns": 1773775818575636567, "category": "runtime", "timestamp": "2026-03-17T19:30:18.575664Z", "message": "Logged failure to wandb for end_to_end phase"}
2026-03-17 15:30:18

2026-03-17 15:30:18
[rank0]: Traceback (most recent call last):
2026-03-17 15:30:18
[rank0]:   File "/lustre/fs11/portfolios/llmservice/projects/llmservice_sdg_research/users/ninaxu/Safe-Synthesizer/.venv/bin/safe-synthesizer", line 10, in <module>
2026-03-17 15:30:18
[rank0]:     sys.exit(cli())
2026-03-17 15:30:18
[rank0]:              ^^^^^
2026-03-17 15:30:18
[rank0]:   File "/lustre/fs11/portfolios/llmservice/projects/llmservice_sdg_research/users/ninaxu/Safe-Synthesizer/.venv/lib/python3.11/site-packages/click/core.py", line 1485, in __call__
2026-03-17 15:30:18
[rank0]:     return self.main(*args, **kwargs)
2026-03-17 15:30:18
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^
2026-03-17 15:30:18
[rank0]:   File "/lustre/fs11/portfolios/llmservice/projects/llmservice_sdg_research/users/ninaxu/Safe-Synthesizer/.venv/lib/python3.11/site-packages/click/core.py", line 1406, in main
2026-03-17 15:30:18
[rank0]:     rv = self.invoke(ctx)
2026-03-17 15:30:18
[rank0]:          ^^^^^^^^^^^^^^^^
2026-03-17 15:30:18
[rank0]:   File "/lustre/fs11/portfolios/llmservice/projects/llmservice_sdg_research/users/ninaxu/Safe-Synthesizer/.venv/lib/python3.11/site-packages/click/core.py", line 1873, in invoke
2026-03-17 15:30:18
[rank0]:     return _process_result(sub_ctx.command.invoke(sub_ctx))
2026-03-17 15:30:18
[rank0]:                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2026-03-17 15:30:18
[rank0]:   File "/lustre/fs11/portfolios/llmservice/projects/llmservice_sdg_research/users/ninaxu/Safe-Synthesizer/.venv/lib/python3.11/site-packages/click/core.py", line 1851, in invoke
2026-03-17 15:30:18
[rank0]:     rv = super().invoke(ctx)
2026-03-17 15:30:18
[rank0]:          ^^^^^^^^^^^^^^^^^^^
2026-03-17 15:30:18
[rank0]:   File "/lustre/fs11/portfolios/llmservice/projects/llmservice_sdg_research/users/ninaxu/Safe-Synthesizer/.venv/lib/python3.11/site-packages/click/core.py", line 1269, in invoke
2026-03-17 15:30:18
[rank0]:     return ctx.invoke(self.callback, **ctx.params)
2026-03-17 15:30:18
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2026-03-17 15:30:18
[rank0]:   File "/lustre/fs11/portfolios/llmservice/projects/llmservice_sdg_research/users/ninaxu/Safe-Synthesizer/.venv/lib/python3.11/site-packages/click/core.py", line 824, in invoke
2026-03-17 15:30:18
[rank0]:     return callback(*args, **kwargs)
2026-03-17 15:30:18
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^
2026-03-17 15:30:18
[rank0]:   File "/lustre/fs11/portfolios/llmservice/projects/llmservice_sdg_research/users/ninaxu/Safe-Synthesizer/.venv/lib/python3.11/site-packages/click/decorators.py", line 34, in new_func
2026-03-17 15:30:18
[rank0]:     return f(get_current_context(), *args, **kwargs)
2026-03-17 15:30:18
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2026-03-17 15:30:18
[rank0]:   File "/lustre/fs11/portfolios/llmservice/projects/llmservice_sdg_research/users/ninaxu/Safe-Synthesizer/src/nemo_safe_synthesizer/cli/run.py", line 222, in run
2026-03-17 15:30:18
[rank0]:     ss.run()
2026-03-17 15:30:18
[rank0]:   File "/lustre/fs11/portfolios/llmservice/projects/llmservice_sdg_research/users/ninaxu/Safe-Synthesizer/src/nemo_safe_synthesizer/sdk/library_builder.py", line 434, in run
2026-03-17 15:30:18
[rank0]:     self.process_data().train().generate().evaluate()
2026-03-17 15:30:18
[rank0]:     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2026-03-17 15:30:18
[rank0]:   File "/lustre/fs11/portfolios/llmservice/projects/llmservice_sdg_research/users/ninaxu/Safe-Synthesizer/src/nemo_safe_synthesizer/observability.py", line 820, in wrapper
2026-03-17 15:30:18
[rank0]:     result = func(*args, **kwargs)
2026-03-17 15:30:18
[rank0]:              ^^^^^^^^^^^^^^^^^^^^^
2026-03-17 15:30:18
[rank0]:   File "/lustre/fs11/portfolios/llmservice/projects/llmservice_sdg_research/users/ninaxu/Safe-Synthesizer/src/nemo_safe_synthesizer/sdk/library_builder.py", line 361, in generate
2026-03-17 15:30:18
[rank0]:     self.generator.initialize()
2026-03-17 15:30:18
[rank0]:   File "/lustre/fs11/portfolios/llmservice/projects/llmservice_sdg_research/users/ninaxu/Safe-Synthesizer/src/nemo_safe_synthesizer/generation/vllm_backend.py", line 190, in initialize
2026-03-17 15:30:18
[rank0]:     self.llm = vLLM(
2026-03-17 15:30:18
[rank0]:                ^^^^^
2026-03-17 15:30:18
[rank0]:   File "/lustre/fs11/portfolios/llmservice/projects/llmservice_sdg_research/users/ninaxu/Safe-Synthesizer/.venv/lib/python3.11/site-packages/vllm/entrypoints/llm.py", line 334, in __init__
2026-03-17 15:30:18
[rank0]:     self.llm_engine = LLMEngine.from_engine_args(
2026-03-17 15:30:18
[rank0]:                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^
2026-03-17 15:30:18
[rank0]:   File "/lustre/fs11/portfolios/llmservice/projects/llmservice_sdg_research/users/ninaxu/Safe-Synthesizer/.venv/lib/python3.11/site-packages/vllm/v1/engine/llm_engine.py", line 172, in from_engine_args
2026-03-17 15:30:18
[rank0]:     return cls(
2026-03-17 15:30:18
[rank0]:            ^^^^
2026-03-17 15:30:18
[rank0]:   File "/lustre/fs11/portfolios/llmservice/projects/llmservice_sdg_research/users/ninaxu/Safe-Synthesizer/.venv/lib/python3.11/site-packages/vllm/v1/engine/llm_engine.py", line 106, in __init__
2026-03-17 15:30:18
[rank0]:     self.engine_core = EngineCoreClient.make_client(
2026-03-17 15:30:18
[rank0]:                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2026-03-17 15:30:18
[rank0]:   File "/lustre/fs11/portfolios/llmservice/projects/llmservice_sdg_research/users/ninaxu/Safe-Synthesizer/.venv/lib/python3.11/site-packages/vllm/v1/engine/core_client.py", line 96, in make_client
2026-03-17 15:30:18
[rank0]:     return InprocClient(vllm_config, executor_class, log_stats)
2026-03-17 15:30:18
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2026-03-17 15:30:18
[rank0]:   File "/lustre/fs11/portfolios/llmservice/projects/llmservice_sdg_research/users/ninaxu/Safe-Synthesizer/.venv/lib/python3.11/site-packages/vllm/v1/engine/core_client.py", line 269, in __init__
2026-03-17 15:30:18
[rank0]:     self.engine_core = EngineCore(*args, **kwargs)
2026-03-17 15:30:18
[rank0]:                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^
2026-03-17 15:30:18
[rank0]:   File "/lustre/fs11/portfolios/llmservice/projects/llmservice_sdg_research/users/ninaxu/Safe-Synthesizer/.venv/lib/python3.11/site-packages/vllm/v1/engine/core.py", line 112, in __init__
2026-03-17 15:30:18
[rank0]:     num_gpu_blocks, num_cpu_blocks, kv_cache_config = self._initialize_kv_caches(
2026-03-17 15:30:18
[rank0]:                                                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^
2026-03-17 15:30:18
[rank0]:   File "/lustre/fs11/portfolios/llmservice/projects/llmservice_sdg_research/users/ninaxu/Safe-Synthesizer/.venv/lib/python3.11/site-packages/vllm/v1/engine/core.py", line 242, in _initialize_kv_caches
2026-03-17 15:30:18
[rank0]:     available_gpu_memory = self.model_executor.determine_available_memory()
2026-03-17 15:30:18
[rank0]:                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2026-03-17 15:30:18
[rank0]:   File "/lustre/fs11/portfolios/llmservice/projects/llmservice_sdg_research/users/ninaxu/Safe-Synthesizer/.venv/lib/python3.11/site-packages/vllm/v1/executor/abstract.py", line 126, in determine_available_memory
2026-03-17 15:30:18
[rank0]:     return self.collective_rpc("determine_available_memory")
2026-03-17 15:30:18
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2026-03-17 15:30:18
[rank0]:   File "/lustre/fs11/portfolios/llmservice/projects/llmservice_sdg_research/users/ninaxu/Safe-Synthesizer/.venv/lib/python3.11/site-packages/vllm/v1/executor/uniproc_executor.py", line 75, in collective_rpc
2026-03-17 15:30:18
[rank0]:     result = run_method(self.driver_worker, method, args, kwargs)
2026-03-17 15:30:18
[rank0]:              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2026-03-17 15:30:18
[rank0]:   File "/lustre/fs11/portfolios/llmservice/projects/llmservice_sdg_research/users/ninaxu/Safe-Synthesizer/.venv/lib/python3.11/site-packages/vllm/v1/serial_utils.py", line 461, in run_method
2026-03-17 15:30:18
[rank0]:     return func(*args, **kwargs)
2026-03-17 15:30:18
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^
2026-03-17 15:30:18
[rank0]:   File "/lustre/fs11/portfolios/llmservice/projects/llmservice_sdg_research/users/ninaxu/Safe-Synthesizer/.venv/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
2026-03-17 15:30:18
[rank0]:     return func(*args, **kwargs)
2026-03-17 15:30:18
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^
2026-03-17 15:30:18
[rank0]:   File "/lustre/fs11/portfolios/llmservice/projects/llmservice_sdg_research/users/ninaxu/Safe-Synthesizer/.venv/lib/python3.11/site-packages/vllm/v1/worker/gpu_worker.py", line 322, in determine_available_memory
2026-03-17 15:30:18
[rank0]:     self.model_runner.profile_run()
2026-03-17 15:30:18
[rank0]:   File "/lustre/fs11/portfolios/llmservice/projects/llmservice_sdg_research/users/ninaxu/Safe-Synthesizer/.venv/lib/python3.11/site-packages/vllm/v1/worker/gpu_model_runner.py", line 4974, in profile_run
2026-03-17 15:30:18
[rank0]:     hidden_states, last_hidden_states = self._dummy_run(
2026-03-17 15:30:18
[rank0]:                                         ^^^^^^^^^^^^^^^^
2026-03-17 15:30:18
[rank0]:   File "/lustre/fs11/portfolios/llmservice/projects/llmservice_sdg_research/users/ninaxu/Safe-Synthesizer/.venv/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
2026-03-17 15:30:18
[rank0]:     return func(*args, **kwargs)
2026-03-17 15:30:18
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^
2026-03-17 15:30:18
[rank0]:   File "/lustre/fs11/portfolios/llmservice/projects/llmservice_sdg_research/users/ninaxu/Safe-Synthesizer/.venv/lib/python3.11/site-packages/vllm/v1/worker/gpu_model_runner.py", line 4685, in _dummy_run
2026-03-17 15:30:18
[rank0]:     outputs = self.model(
2026-03-17 15:30:18
[rank0]:               ^^^^^^^^^^^
2026-03-17 15:30:18
[rank0]:   File "/lustre/fs11/portfolios/llmservice/projects/llmservice_sdg_research/users/ninaxu/Safe-Synthesizer/.venv/lib/python3.11/site-packages/vllm/compilation/cuda_graph.py", line 222, in __call__
2026-03-17 15:30:18
[rank0]:     return self.runnable(*args, **kwargs)
2026-03-17 15:30:18
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2026-03-17 15:30:18
[rank0]:   File "/lustre/fs11/portfolios/llmservice/projects/llmservice_sdg_research/users/ninaxu/Safe-Synthesizer/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
2026-03-17 15:30:18
[rank0]:     return self._call_impl(*args, **kwargs)
2026-03-17 15:30:18
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2026-03-17 15:30:18
[rank0]:   File "/lustre/fs11/portfolios/llmservice/projects/llmservice_sdg_research/users/ninaxu/Safe-Synthesizer/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1786, in _call_impl
2026-03-17 15:30:18
[rank0]:     return forward_call(*args, **kwargs)
2026-03-17 15:30:18
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2026-03-17 15:30:18
[rank0]:   File "/lustre/fs11/portfolios/llmservice/projects/llmservice_sdg_research/users/ninaxu/Safe-Synthesizer/.venv/lib/python3.11/site-packages/vllm/model_executor/models/llama.py", line 589, in forward
2026-03-17 15:30:18
[rank0]:     model_output = self.model(
2026-03-17 15:30:18
[rank0]:                    ^^^^^^^^^^^
2026-03-17 15:30:18
[rank0]:   File "/lustre/fs11/portfolios/llmservice/projects/llmservice_sdg_research/users/ninaxu/Safe-Synthesizer/.venv/lib/python3.11/site-packages/vllm/compilation/decorators.py", line 561, in __call__
2026-03-17 15:30:18
[rank0]:     output = TorchCompileWithNoGuardsWrapper.__call__(self, *args, **kwargs)  # type: ignore[arg-type]
2026-03-17 15:30:18
[rank0]:              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2026-03-17 15:30:18
[rank0]:   File "/lustre/fs11/portfolios/llmservice/projects/llmservice_sdg_research/users/ninaxu/Safe-Synthesizer/.venv/lib/python3.11/site-packages/vllm/compilation/wrapper.py", line 228, in __call__
2026-03-17 15:30:18
[rank0]:     return self._call_with_optional_nvtx_range(
2026-03-17 15:30:18
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2026-03-17 15:30:18
[rank0]:   File "/lustre/fs11/portfolios/llmservice/projects/llmservice_sdg_research/users/ninaxu/Safe-Synthesizer/.venv/lib/python3.11/site-packages/vllm/compilation/wrapper.py", line 119, in _call_with_optional_nvtx_range
2026-03-17 15:30:18
[rank0]:     return callable_fn(*args, **kwargs)
2026-03-17 15:30:18
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2026-03-17 15:30:18
[rank0]:   File "/lustre/fs11/portfolios/llmservice/projects/llmservice_sdg_research/users/ninaxu/Safe-Synthesizer/.venv/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py", line 832, in compile_wrapper
2026-03-17 15:30:18
[rank0]:     return fn(*args, **kwargs)
2026-03-17 15:30:18
[rank0]:            ^^^^^^^^^^^^^^^^^^^
2026-03-17 15:30:18
[rank0]:   File "/lustre/fs11/portfolios/llmservice/projects/llmservice_sdg_research/users/ninaxu/Safe-Synthesizer/.venv/lib/python3.11/site-packages/vllm/model_executor/models/llama.py", line 401, in forward
2026-03-17 15:30:18
[rank0]:     def forward(
2026-03-17 15:30:18
[rank0]:   File "/lustre/fs11/portfolios/llmservice/projects/llmservice_sdg_research/users/ninaxu/Safe-Synthesizer/.venv/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py", line 1044, in _fn
2026-03-17 15:30:18
[rank0]:     return fn(*args, **kwargs)
2026-03-17 15:30:18
[rank0]:            ^^^^^^^^^^^^^^^^^^^
2026-03-17 15:30:18
[rank0]:   File "/lustre/fs11/portfolios/llmservice/projects/llmservice_sdg_research/users/ninaxu/Safe-Synthesizer/.venv/lib/python3.11/site-packages/vllm/compilation/caching.py", line 185, in __call__
2026-03-17 15:30:18
[rank0]:     return self.optimized_call(*args, **kwargs)
2026-03-17 15:30:18
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2026-03-17 15:30:18
[rank0]:   File "/lustre/fs11/portfolios/llmservice/projects/llmservice_sdg_research/users/ninaxu/Safe-Synthesizer/.venv/lib/python3.11/site-packages/torch/fx/graph_module.py", line 837, in call_wrapped
2026-03-17 15:30:18
[rank0]:     return self._wrapped_call(self, *args, **kwargs)
2026-03-17 15:30:18
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2026-03-17 15:30:18
[rank0]:   File "/lustre/fs11/portfolios/llmservice/projects/llmservice_sdg_research/users/ninaxu/Safe-Synthesizer/.venv/lib/python3.11/site-packages/torch/fx/graph_module.py", line 413, in __call__
2026-03-17 15:30:18
[rank0]:     raise e
2026-03-17 15:30:18
[rank0]:   File "/lustre/fs11/portfolios/llmservice/projects/llmservice_sdg_research/users/ninaxu/Safe-Synthesizer/.venv/lib/python3.11/site-packages/torch/fx/graph_module.py", line 400, in __call__
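
For triage, the log above is easier to read once the standalone timestamp lines and the `[rank0]:` prefixes are stripped and the structured JSON records are separated from the traceback. The following is a minimal sketch of that cleanup, not part of Safe-Synthesizer; the function name `split_log` and the shortened `sample` lines are illustrative assumptions, and it assumes each JSON record sits on a single line as it does in this log.

```python
import json
import re

# A line holding only "YYYY-MM-DD HH:MM:SS", as interleaved in the log above.
TIMESTAMP_LINE = re.compile(r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}$")
# The per-rank prefix that precedes every traceback line.
RANK_PREFIX = re.compile(r"^\[rank\d+\]: ?")


def split_log(lines):
    """Return (error_records, traceback_lines) from raw log lines (hypothetical helper)."""
    errors, tb_lines = [], []
    for raw in lines:
        line = raw.rstrip("\n")
        stripped = line.strip()
        if not stripped or TIMESTAMP_LINE.match(stripped):
            continue  # skip blank lines and standalone timestamps
        if stripped.startswith("{"):
            try:
                record = json.loads(stripped)
            except json.JSONDecodeError:
                continue  # not a complete single-line JSON record
            if record.get("level") == "error":
                errors.append(record)
        elif RANK_PREFIX.match(line):
            # Drop the rank prefix but keep the traceback's own indentation.
            tb_lines.append(RANK_PREFIX.sub("", line, count=1))
    return errors, tb_lines


# Shortened copies of lines from the log above, for illustration only.
sample = [
    "2026-03-17 15:30:18",
    '{"extra": {"error_type": "AssertionError"}, "level": "error", '
    '"qual_name": "SafeSynthesizer.run", '
    '"message": "Error in SafeSynthesizer.generate: wrong number of dimensions2"}',
    "2026-03-17 15:30:18",
    '{"level": "info", "message": "Logged failure to wandb for end_to_end phase"}',
    "2026-03-17 15:30:18",
    "[rank0]: Traceback (most recent call last):",
    "2026-03-17 15:30:18",
    "[rank0]:     self.generator.initialize()",
]

errors, tb_lines = split_log(sample)
for record in errors:
    print(record["extra"]["error_type"], "-", record["message"])
print("\n".join(tb_lines))
```

Running the same function over the full job log would leave only the error records and a contiguous traceback, which makes it easier to compare the failing TinyLlama runs against a working SmolLM3 run.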

Expected behavior

Generation should start and complete successfully with TinyLlama + DP, as it does with SmolLM3 + DP.

Additional context

No response

Labels

bug: Something isn't working
