neml2 submodule update#32871
Conversation
49fa72c to
7475d35
Compare
|
@dschwen @roystgnr @loganharbour @permcody Can you review my changes in framework/moose.mk and framework/app.mk? I have good reasons for making those changes (see pytorch/pytorch#182263). But let me know if they could lead to unintended consequences. |
|
Changing flags and flag order? Oh boy, I think this is still a black art, even in the era of AI. We'll just have to see how the testing does! |
7475d35 to
45c4fcc
Compare
|
Job Test, step Results summary on 2c316aa wanted to post the following: Framework test summaryCompared against 5b391df in job civet.inl.gov/job/3787211. No added testsRun time changes
Modules test summaryCompared against 5b391df in job civet.inl.gov/job/3787211. Removed tests
Added tests
Run time changes
|
If we're doing things correctly, they wouldn't lead to unintended consequences, because they wouldn't lead to any consequences. Get all your LDFLAGS before all your LIBS flags, and because they don't conflict with each other and there aren't any distinct versions of identical symbols it doesn't matter what order the LDFLAGS are in amongst themselves, likewise for the LIBS. If it does matter, then this isn't a fix for that problem, it's a workaround. Might it be the best possible workaround? Maybe. From your comment it sounds like we have multiple different BLAS getting linked to, and PyTorch needs to definitely resolve its own BLAS calls to its own BLAS rather than to what PETSc brings in? The fix would be to make sure we only link one BLAS into one final program, pointing PETSc (or its dependencies) to PyTorch's when building PETSc, but I'm not sure if there are other incompatibilities between them that might make that impossible. |
|
Yeah, what really bothers me is that libtorch doesn't directly link to blas as far as I can tell, but it does have symbols with jump tables. So I guess libtorch is expected to be linked to some final program who either define those symbols or is linked to blas somehow, AND that blas needs to have compatible ABI with whatever symbol libtorch used (for its own compilation). I honestly think that's a bad decision on the libtorch end. I agree with you on all accounts. This is a workaround. This is not the best fix. I am open to suggestions on what we could do on our end before the torch team fixes this problem (assuming they will eventually). |
Oh and to be fair, that is indeed what we do when we build containers. PyTorch is built from source using the same blas from petsc. Our containers will work just fine without this workaround. However, that's not our general instruction for users -- we suggest people to use conda, and ask users to download their own PyTorch (libtorch). That creates problems. |
|
Ah I was going to say I spent a lot of time on this very issue which I handled by making sure we're using the same blas everywhere! (For our container builds) |
|
Perhaps... perhaps We should discourage users from using their own PyTorch/libtorch and instead provide a |
|
Or, as a middle ground, we can provide a conftest script to check if the user's libtorch is compatible with our petsc? Then the user can use the script to decide whether it's necessary to build torch from source or download a compatible torch. |
|
Job Documentation, step Docs: sync website on 2c316aa wanted to post the following: View the site here This comment will be updated on new commits. |
Will it ever be? We create the blas library that PETSc uses in all cases: container, conda, user builds PETSc via
I would support this |
|
Job Coverage, step Generate coverage on 2c316aa wanted to post the following: Framework coverage
Modules coverageSolid mechanics
Full coverage reportsReports
This comment will be updated on new commits. |
||||||||||||||||||||||||||||||||||||||||||||||||||||
|
I'll wait for @loganharbour to come back and chime in. I can work on a build script for torch if he also approves. I suspect he already has something like that. |
|
If you guys are both confident and comfortable with a build script for libtorch, that's fine with me. @hugary1995, this was the minimal config I was able to get: Best to start with that and enable what you need for neml2? |
… update_and_rebuild_neml2.sh accordingly
|
Should be ready now. I added scripts to help build torch from source and added pytorch as a update=none submodule. Let me also start a discussion on slack to gather feedback. |
|
@loganharbour would you help me test out the configure_libtorch.sh script? I modified (mostly addition) some flags from yours. |
|
Right now we kind-of support linking to conda-based torch libraries on macs. Would this also mean that we should distribute the self-built torch within our conda environment? Or are those avenues still doable with the distribution they have through conda? |
|
I don't think it will be possible to be definitively compatible with a torch distribution that's not our own unless we stop building our own blas |
|
That could have been an alternative solution, i.e., download torch and let petsc use torch's blas and lapack. However, since torch does NOT distribute those libraries, that's not an option... |
|
Sounds painful from the conda side, but I think it would actually be cool to finally officially have pytorch in our conda environment. |
So you're proposing we conda package torch? I'm not going to auto signup our team for more work, but if someone wants to take it on, they may. In the short-term I think what we've moved towards here is better than what we had before. We don't conda package conduit/MFEM at the moment; instead a Mac user (or Linux user not using the containers) must build them using the scripts. I think this is a good parallel |
I see, yeah that sounds good to me too. |
|
I want to point out that the conda users can still download pytorch and hope that it is compatible with petsc. It's just that we don't/can't officially support that any more. |
| to be altered. | ||
|
|
||
| !alert-end! | ||
| A working compiler stack and CMake are also required. The [MOOSE conda environment](installation/conda.md) satisfies both. |
There was a problem hiding this comment.
I think it used to have python package dependencies too (even if you only wanted to compile the C++ parts) like numpy at some point, maybe they fixed it. But we should have those too in the moose conda package.
There was a problem hiding this comment.
They have an option now to disable numpy. Though I haven't tried to run the configure script in an environment without python, guess I should try that.
|
|
||
| The way one enables LibTorch [!cite](paszke2019pytorch) capabilities in MOOSE depends on | ||
| the operating system (Linux or Mac) and if we use HPC or just a local workstation. | ||
| MOOSE can be linked against [libtorch](https://pytorch.org/cppdocs/) [!cite](paszke2019pytorch) to enable hardware acceleration. The libtorch source is provided as a git submodule under `framework/contrib/pytorch`, and a setup script is provided to build and install it with the configuration MOOSE expects. |
There was a problem hiding this comment.
I think it is not just for hardware acceleration, but also for high level easy-to-use interfaces to parts of ML workflows.
|
Job Precheck, step Versioner verify on aba6984 wanted to post the following: Versioner templatesFound 11 templates, 0 failed Versioner influential filesFound 54 influential files, 5 changed, 2 added, 1 removed
Versioner versionsFound 10 packages, 1 changed, 0 failed
|
|
Job Precheck, step Python: black format on aba6984 wanted to post the following: Python black formattingYour code requires style changes. A patch was generated and copied here. You can directly apply the patch by running the following at the top level of your repository: Alternatively, you can run the following at the top level of your repository: |
loganharbour
left a comment
There was a problem hiding this comment.
Wow the input diffs are so beautiful.
| @@ -0,0 +1,125 @@ | |||
| #!/usr/bin/env bash | |||
There was a problem hiding this comment.
We need to use these scripts in apptainer/moose-dev.def as well
| sub_block_params.applyParameters(common_action.parameters()); | ||
|
|
||
| // Variables to skip | ||
| _skip_input_variables = getParam<std::vector<std::string>>("skip_input_variables"); |
There was a problem hiding this comment.
Is these a reason this isn't in the initializer list?
| paramError("moose_input_types", | ||
| "Unsupported type corresponding to the moose input ", | ||
| input.moose.name); | ||
| auto obj_name = obscureObjectName( |
There was a problem hiding this comment.
This isn't required, but I think there is a lot of common setting that you should be able to do in a lambda for adding all of these objects (type, from_moose, to_neml2, etc)
| bool has_scalar = _problem->hasScalarVariable(name); | ||
| bool has_func = _problem->hasFunction(name); | ||
| bool has_var = _problem->hasVariable(name); | ||
| if (int(is_time) + int(has_scalar) + int(has_func) + int(has_var) > 1) |
There was a problem hiding this comment.
This is obscure but can you verify that hitting this error provides the context of the top level block as an input file and position? Good sanity check on much of the error handling work I've done
| @@ -2,10 +2,19 @@ | |||
| # The original issue is idaholab/blackbear#333 | |||
| issues = '#26450 #26920' | |||
| design = 'NEML2/index.md NEML2Action.md' | |||
| [custom_model] | |||
| requirement = 'The framework shall be capable of running custom NEML2 model implemented in MOOSE (optionally with automatic differentiation).' | |||
There was a problem hiding this comment.
| requirement = 'The framework shall be capable of running custom NEML2 model implemented in MOOSE (optionally with automatic differentiation).' | |
| requirement = 'The framework shall be capable of running custom NEML2 models implemented in MOOSE (optionally with automatic differentiation).' |
@lindsayad @grmnptr: Extending the conda piece is pretty significant. Long term, I don't think growing the number of conda packages is particularly sustainable |
This pull request introduces significant improvements to the integration between MOOSE and NEML2, focusing on simplifying and generalizing the mapping of MOOSE data structures to NEML2 input variables and parameters. The changes unify the handling of different MOOSE data sources, update documentation to reflect the new, more flexible approach, and refactor related code for clarity and maintainability.
Also trying to move torch link flags to the front in the build system to work around a torch bug.