bash submit_slurm_jobs.sh --runs 5 --partition polar4 --sleep-sec 30 --pipeline-mode end_to_end --exp-name nss_dp_epsilon_0317 --dataset-urls amazon_reviews_25k,car_accident,clinc_oos
generation:
  num_records: 1000
  use_structured_generation: true
privacy:
  dp_enabled: true
training:
  batch_size: 8
  pretrained_model: TinyLlama/TinyLlama-1.1B-Chat-v1.0
2026-03-17 15:30:18
{"extra": {"error_type": "AssertionError"}, "level": "error", "qual_name": "SafeSynthesizer.run", "lineno": 830, "filename": "observability.py", "category": "runtime", "timestamp": "2026-03-17T19:30:18.574415Z", "message": "Error in SafeSynthesizer.generate: wrong number of dimensions2"}
2026-03-17 15:30:18
{"extra": {"error_type": "AssertionError"}, "level": "error", "qual_name": "run", "lineno": 858, "filename": "observability.py", "category": "user", "timestamp": "2026-03-17T19:30:18.574941Z", "message": "Error in SafeSynthesizer: wrong number of dimensions2"}
2026-03-17 15:30:18
{"level": "info", "lineno": 631, "filename": "observability.py", "_ray_timestamp_ns": 1773775818575636567, "category": "runtime", "timestamp": "2026-03-17T19:30:18.575664Z", "message": "Logged failure to wandb for end_to_end phase"}
2026-03-17 15:30:18
2026-03-17 15:30:18
[rank0]: Traceback (most recent call last):
2026-03-17 15:30:18
[rank0]: File "/lustre/fs11/portfolios/llmservice/projects/llmservice_sdg_research/users/ninaxu/Safe-Synthesizer/.venv/bin/safe-synthesizer", line 10, in <module>
2026-03-17 15:30:18
[rank0]: sys.exit(cli())
2026-03-17 15:30:18
[rank0]: ^^^^^
2026-03-17 15:30:18
[rank0]: File "/lustre/fs11/portfolios/llmservice/projects/llmservice_sdg_research/users/ninaxu/Safe-Synthesizer/.venv/lib/python3.11/site-packages/click/core.py", line 1485, in __call__
2026-03-17 15:30:18
[rank0]: return self.main(*args, **kwargs)
2026-03-17 15:30:18
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^
2026-03-17 15:30:18
[rank0]: File "/lustre/fs11/portfolios/llmservice/projects/llmservice_sdg_research/users/ninaxu/Safe-Synthesizer/.venv/lib/python3.11/site-packages/click/core.py", line 1406, in main
2026-03-17 15:30:18
[rank0]: rv = self.invoke(ctx)
2026-03-17 15:30:18
[rank0]: ^^^^^^^^^^^^^^^^
2026-03-17 15:30:18
[rank0]: File "/lustre/fs11/portfolios/llmservice/projects/llmservice_sdg_research/users/ninaxu/Safe-Synthesizer/.venv/lib/python3.11/site-packages/click/core.py", line 1873, in invoke
2026-03-17 15:30:18
[rank0]: return _process_result(sub_ctx.command.invoke(sub_ctx))
2026-03-17 15:30:18
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2026-03-17 15:30:18
[rank0]: File "/lustre/fs11/portfolios/llmservice/projects/llmservice_sdg_research/users/ninaxu/Safe-Synthesizer/.venv/lib/python3.11/site-packages/click/core.py", line 1851, in invoke
2026-03-17 15:30:18
[rank0]: rv = super().invoke(ctx)
2026-03-17 15:30:18
[rank0]: ^^^^^^^^^^^^^^^^^^^
2026-03-17 15:30:18
[rank0]: File "/lustre/fs11/portfolios/llmservice/projects/llmservice_sdg_research/users/ninaxu/Safe-Synthesizer/.venv/lib/python3.11/site-packages/click/core.py", line 1269, in invoke
2026-03-17 15:30:18
[rank0]: return ctx.invoke(self.callback, **ctx.params)
2026-03-17 15:30:18
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2026-03-17 15:30:18
[rank0]: File "/lustre/fs11/portfolios/llmservice/projects/llmservice_sdg_research/users/ninaxu/Safe-Synthesizer/.venv/lib/python3.11/site-packages/click/core.py", line 824, in invoke
2026-03-17 15:30:18
[rank0]: return callback(*args, **kwargs)
2026-03-17 15:30:18
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^
2026-03-17 15:30:18
[rank0]: File "/lustre/fs11/portfolios/llmservice/projects/llmservice_sdg_research/users/ninaxu/Safe-Synthesizer/.venv/lib/python3.11/site-packages/click/decorators.py", line 34, in new_func
2026-03-17 15:30:18
[rank0]: return f(get_current_context(), *args, **kwargs)
2026-03-17 15:30:18
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2026-03-17 15:30:18
[rank0]: File "/lustre/fs11/portfolios/llmservice/projects/llmservice_sdg_research/users/ninaxu/Safe-Synthesizer/src/nemo_safe_synthesizer/cli/run.py", line 222, in run
2026-03-17 15:30:18
[rank0]: ss.run()
2026-03-17 15:30:18
[rank0]: File "/lustre/fs11/portfolios/llmservice/projects/llmservice_sdg_research/users/ninaxu/Safe-Synthesizer/src/nemo_safe_synthesizer/sdk/library_builder.py", line 434, in run
2026-03-17 15:30:18
[rank0]: self.process_data().train().generate().evaluate()
2026-03-17 15:30:18
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2026-03-17 15:30:18
[rank0]: File "/lustre/fs11/portfolios/llmservice/projects/llmservice_sdg_research/users/ninaxu/Safe-Synthesizer/src/nemo_safe_synthesizer/observability.py", line 820, in wrapper
2026-03-17 15:30:18
[rank0]: result = func(*args, **kwargs)
2026-03-17 15:30:18
[rank0]: ^^^^^^^^^^^^^^^^^^^^^
2026-03-17 15:30:18
[rank0]: File "/lustre/fs11/portfolios/llmservice/projects/llmservice_sdg_research/users/ninaxu/Safe-Synthesizer/src/nemo_safe_synthesizer/sdk/library_builder.py", line 361, in generate
2026-03-17 15:30:18
[rank0]: self.generator.initialize()
2026-03-17 15:30:18
[rank0]: File "/lustre/fs11/portfolios/llmservice/projects/llmservice_sdg_research/users/ninaxu/Safe-Synthesizer/src/nemo_safe_synthesizer/generation/vllm_backend.py", line 190, in initialize
2026-03-17 15:30:18
[rank0]: self.llm = vLLM(
2026-03-17 15:30:18
[rank0]: ^^^^^
2026-03-17 15:30:18
[rank0]: File "/lustre/fs11/portfolios/llmservice/projects/llmservice_sdg_research/users/ninaxu/Safe-Synthesizer/.venv/lib/python3.11/site-packages/vllm/entrypoints/llm.py", line 334, in __init__
2026-03-17 15:30:18
[rank0]: self.llm_engine = LLMEngine.from_engine_args(
2026-03-17 15:30:18
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^
2026-03-17 15:30:18
[rank0]: File "/lustre/fs11/portfolios/llmservice/projects/llmservice_sdg_research/users/ninaxu/Safe-Synthesizer/.venv/lib/python3.11/site-packages/vllm/v1/engine/llm_engine.py", line 172, in from_engine_args
2026-03-17 15:30:18
[rank0]: return cls(
2026-03-17 15:30:18
[rank0]: ^^^^
2026-03-17 15:30:18
[rank0]: File "/lustre/fs11/portfolios/llmservice/projects/llmservice_sdg_research/users/ninaxu/Safe-Synthesizer/.venv/lib/python3.11/site-packages/vllm/v1/engine/llm_engine.py", line 106, in __init__
2026-03-17 15:30:18
[rank0]: self.engine_core = EngineCoreClient.make_client(
2026-03-17 15:30:18
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2026-03-17 15:30:18
[rank0]: File "/lustre/fs11/portfolios/llmservice/projects/llmservice_sdg_research/users/ninaxu/Safe-Synthesizer/.venv/lib/python3.11/site-packages/vllm/v1/engine/core_client.py", line 96, in make_client
2026-03-17 15:30:18
[rank0]: return InprocClient(vllm_config, executor_class, log_stats)
2026-03-17 15:30:18
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2026-03-17 15:30:18
[rank0]: File "/lustre/fs11/portfolios/llmservice/projects/llmservice_sdg_research/users/ninaxu/Safe-Synthesizer/.venv/lib/python3.11/site-packages/vllm/v1/engine/core_client.py", line 269, in __init__
2026-03-17 15:30:18
[rank0]: self.engine_core = EngineCore(*args, **kwargs)
2026-03-17 15:30:18
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^
2026-03-17 15:30:18
[rank0]: File "/lustre/fs11/portfolios/llmservice/projects/llmservice_sdg_research/users/ninaxu/Safe-Synthesizer/.venv/lib/python3.11/site-packages/vllm/v1/engine/core.py", line 112, in __init__
2026-03-17 15:30:18
[rank0]: num_gpu_blocks, num_cpu_blocks, kv_cache_config = self._initialize_kv_caches(
2026-03-17 15:30:18
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^
2026-03-17 15:30:18
[rank0]: File "/lustre/fs11/portfolios/llmservice/projects/llmservice_sdg_research/users/ninaxu/Safe-Synthesizer/.venv/lib/python3.11/site-packages/vllm/v1/engine/core.py", line 242, in _initialize_kv_caches
2026-03-17 15:30:18
[rank0]: available_gpu_memory = self.model_executor.determine_available_memory()
2026-03-17 15:30:18
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2026-03-17 15:30:18
[rank0]: File "/lustre/fs11/portfolios/llmservice/projects/llmservice_sdg_research/users/ninaxu/Safe-Synthesizer/.venv/lib/python3.11/site-packages/vllm/v1/executor/abstract.py", line 126, in determine_available_memory
2026-03-17 15:30:18
[rank0]: return self.collective_rpc("determine_available_memory")
2026-03-17 15:30:18
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2026-03-17 15:30:18
[rank0]: File "/lustre/fs11/portfolios/llmservice/projects/llmservice_sdg_research/users/ninaxu/Safe-Synthesizer/.venv/lib/python3.11/site-packages/vllm/v1/executor/uniproc_executor.py", line 75, in collective_rpc
2026-03-17 15:30:18
[rank0]: result = run_method(self.driver_worker, method, args, kwargs)
2026-03-17 15:30:18
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2026-03-17 15:30:18
[rank0]: File "/lustre/fs11/portfolios/llmservice/projects/llmservice_sdg_research/users/ninaxu/Safe-Synthesizer/.venv/lib/python3.11/site-packages/vllm/v1/serial_utils.py", line 461, in run_method
2026-03-17 15:30:18
[rank0]: return func(*args, **kwargs)
2026-03-17 15:30:18
[rank0]: ^^^^^^^^^^^^^^^^^^^^^
2026-03-17 15:30:18
[rank0]: File "/lustre/fs11/portfolios/llmservice/projects/llmservice_sdg_research/users/ninaxu/Safe-Synthesizer/.venv/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
2026-03-17 15:30:18
[rank0]: return func(*args, **kwargs)
2026-03-17 15:30:18
[rank0]: ^^^^^^^^^^^^^^^^^^^^^
2026-03-17 15:30:18
[rank0]: File "/lustre/fs11/portfolios/llmservice/projects/llmservice_sdg_research/users/ninaxu/Safe-Synthesizer/.venv/lib/python3.11/site-packages/vllm/v1/worker/gpu_worker.py", line 322, in determine_available_memory
2026-03-17 15:30:18
[rank0]: self.model_runner.profile_run()
2026-03-17 15:30:18
[rank0]: File "/lustre/fs11/portfolios/llmservice/projects/llmservice_sdg_research/users/ninaxu/Safe-Synthesizer/.venv/lib/python3.11/site-packages/vllm/v1/worker/gpu_model_runner.py", line 4974, in profile_run
2026-03-17 15:30:18
[rank0]: hidden_states, last_hidden_states = self._dummy_run(
2026-03-17 15:30:18
[rank0]: ^^^^^^^^^^^^^^^^
2026-03-17 15:30:18
[rank0]: File "/lustre/fs11/portfolios/llmservice/projects/llmservice_sdg_research/users/ninaxu/Safe-Synthesizer/.venv/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
2026-03-17 15:30:18
[rank0]: return func(*args, **kwargs)
2026-03-17 15:30:18
[rank0]: ^^^^^^^^^^^^^^^^^^^^^
2026-03-17 15:30:18
[rank0]: File "/lustre/fs11/portfolios/llmservice/projects/llmservice_sdg_research/users/ninaxu/Safe-Synthesizer/.venv/lib/python3.11/site-packages/vllm/v1/worker/gpu_model_runner.py", line 4685, in _dummy_run
2026-03-17 15:30:18
[rank0]: outputs = self.model(
2026-03-17 15:30:18
[rank0]: ^^^^^^^^^^^
2026-03-17 15:30:18
[rank0]: File "/lustre/fs11/portfolios/llmservice/projects/llmservice_sdg_research/users/ninaxu/Safe-Synthesizer/.venv/lib/python3.11/site-packages/vllm/compilation/cuda_graph.py", line 222, in __call__
2026-03-17 15:30:18
[rank0]: return self.runnable(*args, **kwargs)
2026-03-17 15:30:18
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2026-03-17 15:30:18
[rank0]: File "/lustre/fs11/portfolios/llmservice/projects/llmservice_sdg_research/users/ninaxu/Safe-Synthesizer/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
2026-03-17 15:30:18
[rank0]: return self._call_impl(*args, **kwargs)
2026-03-17 15:30:18
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2026-03-17 15:30:18
[rank0]: File "/lustre/fs11/portfolios/llmservice/projects/llmservice_sdg_research/users/ninaxu/Safe-Synthesizer/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1786, in _call_impl
2026-03-17 15:30:18
[rank0]: return forward_call(*args, **kwargs)
2026-03-17 15:30:18
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2026-03-17 15:30:18
[rank0]: File "/lustre/fs11/portfolios/llmservice/projects/llmservice_sdg_research/users/ninaxu/Safe-Synthesizer/.venv/lib/python3.11/site-packages/vllm/model_executor/models/llama.py", line 589, in forward
2026-03-17 15:30:18
[rank0]: model_output = self.model(
2026-03-17 15:30:18
[rank0]: ^^^^^^^^^^^
2026-03-17 15:30:18
[rank0]: File "/lustre/fs11/portfolios/llmservice/projects/llmservice_sdg_research/users/ninaxu/Safe-Synthesizer/.venv/lib/python3.11/site-packages/vllm/compilation/decorators.py", line 561, in __call__
2026-03-17 15:30:18
[rank0]: output = TorchCompileWithNoGuardsWrapper.__call__(self, *args, **kwargs) # type: ignore[arg-type]
2026-03-17 15:30:18
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2026-03-17 15:30:18
[rank0]: File "/lustre/fs11/portfolios/llmservice/projects/llmservice_sdg_research/users/ninaxu/Safe-Synthesizer/.venv/lib/python3.11/site-packages/vllm/compilation/wrapper.py", line 228, in __call__
2026-03-17 15:30:18
[rank0]: return self._call_with_optional_nvtx_range(
2026-03-17 15:30:18
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2026-03-17 15:30:18
[rank0]: File "/lustre/fs11/portfolios/llmservice/projects/llmservice_sdg_research/users/ninaxu/Safe-Synthesizer/.venv/lib/python3.11/site-packages/vllm/compilation/wrapper.py", line 119, in _call_with_optional_nvtx_range
2026-03-17 15:30:18
[rank0]: return callable_fn(*args, **kwargs)
2026-03-17 15:30:18
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2026-03-17 15:30:18
[rank0]: File "/lustre/fs11/portfolios/llmservice/projects/llmservice_sdg_research/users/ninaxu/Safe-Synthesizer/.venv/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py", line 832, in compile_wrapper
2026-03-17 15:30:18
[rank0]: return fn(*args, **kwargs)
2026-03-17 15:30:18
[rank0]: ^^^^^^^^^^^^^^^^^^^
2026-03-17 15:30:18
[rank0]: File "/lustre/fs11/portfolios/llmservice/projects/llmservice_sdg_research/users/ninaxu/Safe-Synthesizer/.venv/lib/python3.11/site-packages/vllm/model_executor/models/llama.py", line 401, in forward
2026-03-17 15:30:18
[rank0]: def forward(
2026-03-17 15:30:18
[rank0]: File "/lustre/fs11/portfolios/llmservice/projects/llmservice_sdg_research/users/ninaxu/Safe-Synthesizer/.venv/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py", line 1044, in _fn
2026-03-17 15:30:18
[rank0]: return fn(*args, **kwargs)
2026-03-17 15:30:18
[rank0]: ^^^^^^^^^^^^^^^^^^^
2026-03-17 15:30:18
[rank0]: File "/lustre/fs11/portfolios/llmservice/projects/llmservice_sdg_research/users/ninaxu/Safe-Synthesizer/.venv/lib/python3.11/site-packages/vllm/compilation/caching.py", line 185, in __call__
2026-03-17 15:30:18
[rank0]: return self.optimized_call(*args, **kwargs)
2026-03-17 15:30:18
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2026-03-17 15:30:18
[rank0]: File "/lustre/fs11/portfolios/llmservice/projects/llmservice_sdg_research/users/ninaxu/Safe-Synthesizer/.venv/lib/python3.11/site-packages/torch/fx/graph_module.py", line 837, in call_wrapped
2026-03-17 15:30:18
[rank0]: return self._wrapped_call(self, *args, **kwargs)
2026-03-17 15:30:18
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2026-03-17 15:30:18
[rank0]: File "/lustre/fs11/portfolios/llmservice/projects/llmservice_sdg_research/users/ninaxu/Safe-Synthesizer/.venv/lib/python3.11/site-packages/torch/fx/graph_module.py", line 413, in __call__
2026-03-17 15:30:18
[rank0]: raise e
2026-03-17 15:30:18
[rank0]: File "/lustre/fs11/portfolios/llmservice/projects/llmservice_sdg_research/users/ninaxu/Safe-Synthesizer/.venv/lib/python3.11/site-packages/torch/fx/graph_module.py", line 400, in __call__
Priority Level
Medium (Annoying but has workaround)
Describe the bug
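I ran three datasets (amazon_reviews_25k, car_accident, clinc_oos) with TinyLlama + DP, and all three failed. SmolLM3 + DP still works; I have not tested Mistral. Training completed, but generation failed to start: vLLM raises an AssertionError ("wrong number of dimensions2") while initializing the engine, during its memory-profiling run.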
wandb project: https://wandb.ai/nemo-llm-service/nss_dp_epsilon_0317/table?nw=nwuserninaxunvidia
Steps/Code to reproduce bug
command:
config:
error:
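The error log for this run interleaves plain timestamp lines, single-line JSON records, and `[rank0]:` traceback lines. As a triage aid when comparing the failed runs, here is a minimal sketch that keeps only the JSON records with `"level": "error"`. It is not part of Safe-Synthesizer: the function name `extract_error_records` is hypothetical, and it assumes each structured record sits on its own line, as in the log in this report.

```python
import json


def extract_error_records(log_text):
    """Return the parsed JSON log records whose level is "error".

    Hypothetical helper, not part of Safe-Synthesizer. Assumes each
    structured record is a single-line JSON object.
    """
    records = []
    for line in log_text.splitlines():
        line = line.strip()
        # Timestamp lines and "[rank0]:" traceback lines do not start with "{".
        if not line.startswith("{"):
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            # Skip lines that look like JSON but are truncated or malformed.
            continue
        if isinstance(record, dict) and record.get("level") == "error":
            records.append(record)
    return records


# Sample lines copied from the log in this report.
sample_log = """\
2026-03-17 15:30:18
{"extra": {"error_type": "AssertionError"}, "level": "error", "qual_name": "SafeSynthesizer.run", "lineno": 830, "filename": "observability.py", "category": "runtime", "timestamp": "2026-03-17T19:30:18.574415Z", "message": "Error in SafeSynthesizer.generate: wrong number of dimensions2"}
2026-03-17 15:30:18
{"level": "info", "lineno": 631, "filename": "observability.py", "_ray_timestamp_ns": 1773775818575636567, "category": "runtime", "timestamp": "2026-03-17T19:30:18.575664Z", "message": "Logged failure to wandb for end_to_end phase"}
2026-03-17 15:30:18
[rank0]: Traceback (most recent call last):
"""

for rec in extract_error_records(sample_log):
    error_type = rec.get("extra", {}).get("error_type")
    print(error_type, "|", rec.get("qual_name"), "|", rec.get("message"))
```

Running the same helper over the logs of each failed dataset run would show whether all three hit the same assertion in `SafeSynthesizer.generate`.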
Expected behavior
Generation should start and complete after training, as it does with SmolLM3 + DP.
Additional context
No response