Conversation
binaryaaron
left a comment
There was a problem hiding this comment.
claude was a liar about unsloth and aarch64. also, pull in this change and test again
59d13ec to
83fff62
Compare
binaryaaron
left a comment
There was a problem hiding this comment.
if we want to support dgx spark quickly, the different install path in the container is fine but it's not proper support. let's make a follow up for this - we'll need to do a proper spike on cross-platform support + cuda version support.
| structured_outputs_config=structured_outputs_config, | ||
| enforce_eager=enforce_eager, | ||
| ) | ||
| # attention_config was added in vLLM 0.12+ but removed in some builds. |
There was a problem hiding this comment.
NGC container nvcr.io/nvidia/vllm:26.02-py3 doesn't have attention_config. PyPI vLLM 0.15.0 does.
containers/Dockerfile.spark
Outdated
| # Torch-dependent packages — install with --no-deps to preserve container's torch/CUDA | ||
| RUN pip install --no-deps \ | ||
| peft accelerate bitsandbytes datasets==4.3.0 trl==0.26.1 \ | ||
| hf_transfer unsloth unsloth_zoo \ | ||
| opacus sentence-transformers gliner kernels | ||
|
|
||
| # Safe Synthesizer + remaining deps | ||
| RUN pip install --no-deps -e . && \ | ||
| pip install \ | ||
| faker 'pydantic[email]>=2.12.5' pydantic-settings pyyaml jsonschema rich structlog \ | ||
| colorama 'huggingface-hub>=0.34.4,<1' anyascii pycountry betterproto flashtext \ | ||
| cached-property category-encoders dython dateparser langchain-core json-repair \ | ||
| matplotlib 'outlines>=1.0.0' plotly prv-accountant 'smart-open==7.0.5' python-stdnum \ | ||
| 'pandas>=2.1.3' ratelimit 'sqlfluff==3.2.0' 'range_regex>=0.1.0' 'tenacity==9.1.2' \ | ||
| 'tiktoken>=0.7.0' tldextract 'wandb==0.23.1' python-dotenv patsy \ | ||
| pyarrow multiprocess onnxruntime opt_einsum dill==0.3.8 faiss-cpu |
There was a problem hiding this comment.
these should move to pyproject.toml and follow patterns from the cuda dockerfile. can call this one dockerfile.cuda-spark or cuda-aarch64.
There was a problem hiding this comment.
Makes sense. I can move the deps into a spark or aarch64 extra in pyproject.toml and have the Dockerfile just do pip install -e ".[cuda-aarch64]".
| "vllm==0.15.0", | ||
| "xformers==v0.0.33.post2; sys_platform == 'linux'", | ||
| "xformers==v0.0.33.post2; sys_platform == 'linux' and platform_machine == 'x86_64'", | ||
| ] |
There was a problem hiding this comment.
this is probably going to be a whole thing but we should make a new dep group or do better platform resolution for cross-platform deps.
There was a problem hiding this comment.
Agree this needs a proper spike. For now the platform markers are minimal and don't break x86_64 right?
There was a problem hiding this comment.
hmmm do we want to keep the sys_platform markers? locking is borked in CI right now and I've been fighting it by putting linux markers in
There was a problem hiding this comment.
We really will need a spark and/or station as part of our github runners so we make sure this continues to work.
binaryaaron
left a comment
There was a problem hiding this comment.
approving conditionally so i don't block while i'm out
|
We've also got to publish the container somewhere in the future, same with the cuda one. |
a113a39 to
655ca6c
Compare
8408cca to
819bdb8
Compare
- containers/Dockerfile.spark: container-based install using nvcr.io/nvidia/vllm:26.02-py3 - docs/DGX_SPARK.md: quick start guide (build + run in 2 steps) - pyproject.toml: platform markers for aarch64-incompatible packages (faiss-gpu-cu12, torchvision+cu128, torchao, xformers) - config/training.py: auto-fallback Flash Attention 3 to sdpa on aarch64 - vllm_backend.py: handle vllm versions without attention_config kwarg Signed-off-by: mvansegbroeck <mvansegbroeck@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: mvansegbroeck <mvansegbroeck@gmail.com>
onnxruntime is required by gliner but not always resolved transitively on aarch64. opt_einsum is directly imported in dp_transformers/linear.py but was never declared. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: mvansegbroeck <mvansegbroeck@gmail.com>
Signed-off-by: mvansegbroeck <mvansegbroeck@gmail.com>
f3b9988 to
c7f1cf7
Compare
…lockfile Signed-off-by: mvansegbroeck <mvansegbroeck@gmail.com>
|
|
||
| # 2. Torch-dependent packages — --no-deps preserves the container's torch/CUDA | ||
| RUN pip install --no-deps \ | ||
| peft accelerate bitsandbytes datasets==4.3.0 trl==0.26.1 \ |
There was a problem hiding this comment.
🤔 should we add an extras group for them? mostly to not have these versions be different than what's in the lockfile
|
|
||
| # NeMo Safe Synthesizer on DGX Spark | ||
|
|
||
| Generate synthetic tabular data with quality and privacy guarantees — train, generate, and evaluate in one command. |
There was a problem hiding this comment.
should this comment be specific to the spark setup?
| if platform.machine() == "aarch64": | ||
| if self.attn_implementation == "kernels-community/vllm-flash-attn3": |
There was a problem hiding this comment.
| if platform.machine() == "aarch64": | |
| if self.attn_implementation == "kernels-community/vllm-flash-attn3": | |
| if platform.machine() == "aarch64" and self.attn_implementation == "kernels-community/vllm-flash-attn3": |
| "vllm==0.15.0", | ||
| "xformers==v0.0.33.post2; sys_platform == 'linux'", | ||
| "xformers==v0.0.33.post2; sys_platform == 'linux' and platform_machine == 'x86_64'", | ||
| ] |
There was a problem hiding this comment.
hmmm do we want to keep the sys_platform markers? locking is borked in CI right now and I've been fighting it by putting linux markers in
There was a problem hiding this comment.
Is there a reason not to put this under the user or developer guide and be a full part of our docs? A top level file in docs/ that isn't rendered to github pages isn't the layout I expect.
If we don't want it in the github pages yet, then should move to top level of the repo.
| ``` | ||
|
|
||
| > If HTTPS clone fails with authentication errors, use SSH: | ||
| > `git clone git@github.com:NVIDIA-NeMo/Safe-Synthesizer.git` |
There was a problem hiding this comment.
suggestion: Now that the repo is public, this should not be an issue.
|
|
||
| **Why a container?** DGX Spark's CUDA 13 + aarch64 requires specific Triton, vLLM, and PyTorch versions. The container (`nvcr.io/nvidia/vllm:26.02-py3`) provides a tested stack where Unsloth training and vLLM generation work natively. | ||
|
|
||
| **Full documentation:** [Safe Synthesizer User Guide](https://github.com/NVIDIA-NeMo/Safe-Synthesizer/blob/main/docs/user-guide/getting-started.md) |
There was a problem hiding this comment.
suggestion: We should link to the github pages location at https://nvidia-nemo.github.io/Safe-Synthesizer/user-guide/getting-started/ (now that it's public).
Or if we move this fully into github pages, use a relative link so mkdocs will render it correctly.
| "vllm==0.15.0", | ||
| "xformers==v0.0.33.post2; sys_platform == 'linux'", | ||
| "xformers==v0.0.33.post2; sys_platform == 'linux' and platform_machine == 'x86_64'", | ||
| ] |
There was a problem hiding this comment.
We really will need a spark and/or station as part of our github runners so we make sure this continues to work.
Summary
Add DGX Spark (aarch64) support for Safe Synthesizer:
containers/Dockerfile.cuda-aarch64for building on aarch64/Sparkdocs/DGX_SPARK.md— quickstart guide for running NSS on DGX Sparkonnxruntimeandopt_einsumtocu128extrapragma: allowlist secretfor API key placeholder in docsPre-Review Checklist
make format && make checkor via prek validation.make testpasses locallymake test-e2epasses locallymake test-ci-containerpasses locally (recommended)/syncon this PR to trigger a run (auto-triggers on ready-for-review)Pre-Merge Checklist