Skip to content

Conversation

@lindsayad
Copy link
Member

@lindsayad lindsayad commented Oct 8, 2025

No description provided.

Copy link
Member

@roystgnr roystgnr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if I missed anything. In the future when part of a changeset is a huge diff but is a subset of something as simple as sed s/inline/METAPHYSICL_INLINE, put each such part in its own commit to avoid them being chaff obscuring other changes?

@lindsayad
Copy link
Member Author

Not sure if I missed anything. In the future when part of a changeset is a huge diff but is a subset of something as simple as sed s/inline/METAPHYSICL_INLINE, put each such part in its own commit to avoid them being chaff obscuring other changes?

That would require planning ahead, not playing whack-a-mole with "calling a host function from a host-device function" warnings from nvcc

@roystgnr
Copy link
Member

Whack-a-mole work is why I finally gave in and learned to love git's oxymoronic ability to rewrite history; while I'm playing whack-a-mole my commit messages look like "Checkpoint - DO NOT MERGE", but when I'm done I interactive-rebase and git add --patch one feature at a time.

(I sound more smug about this than I should - a proper history rewrite would be tested by a shell loop over the final git rev-list which tests that each step passes make check, and the next time I do that will be the first)

@lindsayad
Copy link
Member Author

Kokkos metaphysicl

(pprof) list sparsity_union
Total: 313.83s
ROUTINE ======================== MetaPhysicL::DynamicSparseNumberBase::sparsity_union in /data/lindad/projects/no-gpu/libmesh/installed/include/metaphysicl/dynamicsparsenumberbase.h
    12.54s     13.04s (flat, cum)  4.16% of Total
         .          .    311:  // increase it as needed to support e.g. operator+=
         .          .    312:template <typename Data, typename Indices, template <class...> class SubType, class... SubTypeArgs>
         .          .    313:template <typename Indices2>
         .          .    314:METAPHYSICL_INLINE
         .          .    315:void
     340ms      340ms    316:DynamicSparseNumberBase<Data, Indices, SubType, SubTypeArgs...>::sparsity_union (const Indices2 & new_indices)
         .          .    317:{
         .          .    318:#ifndef NDEBUG
         .          .    319:  // Assert sorted and no duplicates
         .          .    320:  if (_indices.size() > 1)
         .          .    321:    for (std::size_t i = 0; i < _indices.size() - 1; ++i)
         .          .    322:      metaphysicl_assert(_indices[i] < _indices[i + 1]);
         .          .    323:  if (new_indices.size() > 1)
         .          .    324:    for (std::size_t i = 0; i < new_indices.size() - 1; ++i)
         .          .    325:      metaphysicl_assert(new_indices[i] < new_indices[i + 1]);
         .          .    326:#endif
         .          .    327:
         .          .    328:  //
         .          .    329:  // First phase: count how many new indices will be inserted
         .          .    330:  //
         .          .    331:
         .          .    332:  std::size_t index_it = 0;
      10ms       10ms    333:  std::size_t index2_it = 0;
         .          .    334:
         .          .    335:  typedef typename Indices2::value_type I2;
         .          .    336:  typedef typename CompareTypes<I,I2>::supertype max_index_type;
         .          .    337:  // number of indices in new_indices which aren't in _indices
      30ms       30ms    338:  max_index_type unseen_indices = 0;
         .          .    339:
         .          .    340:  const I maxI = numeric_limits<I>::max();
         .          .    341:
     220ms      350ms    342:  while (index2_it != new_indices.size()) {
     120ms      350ms    343:    I idx1 = (index_it == _indices.size()) ? maxI : _indices[index_it];
         .          .    344:    I2 idx2 = new_indices[index2_it];
         .          .    345:
         .          .    346:    // Advance our index while we are behind new_indices index
     760ms      760ms    347:    while (idx1 < idx2) {
      20ms       20ms    348:      ++index_it;
      30ms       30ms    349:      idx1 = (index_it == _indices.size()) ? maxI : _indices[index_it];
         .          .    350:    }
         .          .    351:
         .          .    352:    // advance the two indexes in lock-step while they are equal and we aren't at the end
         .          .    353:    // of _indices
     1.22s      1.22s    354:    while ((idx1 == idx2) &&
     840ms      840ms    355:           (idx1 != maxI)) {
         .          .    356:      ++index_it;
     210ms      210ms    357:      idx1 = (index_it == _indices.size()) ? maxI : _indices[index_it];
      90ms       90ms    358:      ++index2_it;
     1.01s      1.01s    359:      idx2 = (index2_it == new_indices.size()) ? maxI : new_indices[index2_it];
         .          .    360:    }
         .          .    361:
         .          .    362:    // Advance the new_indices index while we it's behind our index, all the while adding to unseen_indices
     930ms      930ms    363:    while (idx2 < idx1) {
      80ms       80ms    364:      ++unseen_indices;
     160ms      160ms    365:      ++index2_it;
     400ms      400ms    366:      if (index2_it == new_indices.size())
         .          .    367:        // If we've hit the end of new_indices we break from the outer loop
         .          .    368:        break;
         .          .    369:      idx2 = new_indices[index2_it];
         .          .    370:    }
         .          .    371:  }
         .          .    372:
         .          .    373:  // The common case is cheap
      90ms       90ms    374:  if (!unseen_indices)
         .          .    375:    return;
         .          .    376:
         .          .    377:  //
         .          .    378:  // Phase 2: reverse in-place merge
         .          .    379:  //
         .          .    380:
         .          .    381:  const std::size_t old_size = this->size();
      20ms       20ms    382:  const std::size_t new_size = old_size + unseen_indices;
         .          .    383:
         .          .    384:  // Calls resize on both _data and _indices
         .      140ms    385:  this->resize(new_size);
         .          .    386:
         .          .    387:  // Write iterator starting at the tail of our resized containers
      40ms       40ms    388:  std::ptrdiff_t write_it = new_size - 1;
         .          .    389:
         .          .    390:  // Read iterator starting at our old tail
      20ms       20ms    391:  std::ptrdiff_t read_it = old_size - 1;
         .          .    392:  // Read iterator for new_indices starting at its tail
      30ms       30ms    393:  std::ptrdiff_t new_indices_read_it = new_indices.size() - 1;
         .          .    394:
     470ms      470ms    395:  for (; write_it != -1; --write_it) {
         .          .    396:    // First operand: we've reached the end of our original indices
         .          .    397:    // Second operand: we haven't reached the end of our original indices *but* the new_indices
         .          .    398:    //   read iterator is pointing to an index that is greater than our read iterator
         .          .    399:    //   is pointing to
         .          .    400:    // Either way, we should write the new_indices index. Since we are writing from the
         .          .    401:    // new_indices container, for which we have not yet imported the corresponding data,
         .          .    402:    // we write 0 to the corresponding _data location
     760ms      760ms    403:    if ((read_it == -1) ||
     1.40s      1.40s    404:        ((new_indices_read_it != -1) &&
     550ms      550ms    405:         (new_indices[new_indices_read_it] > _indices[read_it]))) {
     370ms      370ms    406:      _data[write_it] = 0;
     180ms      180ms    407:      _indices[write_it] = new_indices[new_indices_read_it];
         .          .    408:      // Finished reading from the new indices, thus decrement its iterator
     900ms      900ms    409:      --new_indices_read_it;
         .          .    410:    } else {
         .          .    411:      //
         .          .    412:      // If we didn't satisfy the if branch, that means we will be writing from our _indices/_data
         .          .    413:      // instead of from new_indices
         .          .    414:      //
         .          .    415:
         .          .    416:      // Handle the case where our _indices iterator and the new_indices iterator point to the same
         .          .    417:      // index value. E.g. we must make sure that we decrement both _indices and new_indices
         .          .    418:      // iterators
         .          .    419:      if ((new_indices_read_it != -1) &&
         .          .    420:          (new_indices[new_indices_read_it] == _indices[read_it]))
     460ms      460ms    421:        --new_indices_read_it;
         .          .    422:      metaphysicl_assert(read_it > -1);
         .          .    423:      metaphysicl_assert(write_it > -1);
     230ms      230ms    424:      _data[write_it] = _data[read_it];
      30ms       30ms    425:      _indices[write_it] = _indices[read_it];
         .          .    426:      // Finished reading from our _indices/_data, thus decrement their iterator
      80ms       80ms    427:      --read_it;
         .          .    428:    }
         .          .    429:  }
         .          .    430:
         .          .    431:  metaphysicl_assert(read_it  == -1);
         .          .    432:  metaphysicl_assert(write_it  == -1);
         .          .    433:  metaphysicl_assert(new_indices_read_it == -1);
     440ms      440ms    434:}
         .          .    435:
         .          .    436:
         .          .    437:  // Since this is a dynamically allocated sparsity pattern, we can
         .          .    438:  // decrease it when possible for efficiency
         .          .    439:template <typename Data, typename Indices, template <class...> class SubType, class... SubTypeArgs>

@lindsayad
Copy link
Member Author

Current MetaPhysicL

(pprof) list sparsity_union
Total: 309.45s
ROUTINE ======================== MetaPhysicL::DynamicSparseNumberBase::sparsity_union in /data/lindad/projects/no-gpu/libmesh/installed/include/metaphysicl/dynamicsparsenumberbase.h
    10.78s     13.65s (flat, cum)  4.41% of Total
         .          .    277:  // increase it as needed to support e.g. operator+=
         .          .    278:template <typename Data, typename Indices, template <class...> class SubType, class... SubTypeArgs>
         .          .    279:template <typename Indices2>
         .          .    280:inline
         .          .    281:void
     210ms      210ms    282:DynamicSparseNumberBase<Data, Indices, SubType, SubTypeArgs...>::sparsity_union (const Indices2 & new_indices)
         .          .    283:{
         .          .    284:  metaphysicl_assert
         .          .    285:    (std::adjacent_find(_indices.begin(), _indices.end()) ==
         .          .    286:     _indices.end());
         .          .    287:  metaphysicl_assert
         .          .    288:    (std::adjacent_find(new_indices.begin(), new_indices.end()) ==
         .          .    289:     new_indices.end());
         .          .    290:#ifdef METAPHYSICL_HAVE_CXX11
         .          .    291:  metaphysicl_assert(std::is_sorted(_indices.begin(), _indices.end()));
         .          .    292:  metaphysicl_assert(std::is_sorted(new_indices.begin(), new_indices.end()));
         .          .    293:#endif
         .          .    294:
      60ms       60ms    295:  auto index_it = _indices.begin();
     210ms      210ms    296:  auto index2_it = new_indices.begin();
         .          .    297:
         .          .    298:  typedef typename Indices2::value_type I2;
         .          .    299:  typedef typename CompareTypes<I,I2>::supertype max_index_type;
      30ms       30ms    300:  max_index_type unseen_indices = 0;
         .          .    301:
         .          .    302:  const I maxI = std::numeric_limits<I>::max();
         .          .    303:
      90ms      260ms    304:  while (index2_it != new_indices.end()) {
     120ms      240ms    305:    I idx1 = (index_it == _indices.end()) ? maxI : *index_it;
         .          .    306:    I2 idx2 = *index2_it;
         .          .    307:
     480ms      480ms    308:    while (idx1 < idx2) {
     130ms      130ms    309:      ++index_it;
      60ms       60ms    310:      idx1 = (index_it == _indices.end()) ? maxI : *index_it;
         .          .    311:    }
         .          .    312:
     1.53s      1.53s    313:    while ((idx1 == idx2) &&
     210ms      210ms    314:           (idx1 != maxI)) {
      90ms       90ms    315:      ++index_it;
     150ms      150ms    316:      idx1 = (index_it == _indices.end()) ? maxI : *index_it;
     110ms      110ms    317:      ++index2_it;
     1.14s      1.14s    318:      idx2 = (index2_it == new_indices.end()) ? maxI : *index2_it;
         .          .    319:    }
         .          .    320:
     950ms      950ms    321:    while (idx2 < idx1) {
     110ms      110ms    322:      ++unseen_indices;
     160ms      160ms    323:        ++index2_it;
     750ms      750ms    324:      if (index2_it == new_indices.end())
         .          .    325:        break;
         .          .    326:      idx2 = *index2_it;
         .          .    327:    }
         .          .    328:  }
         .          .    329:
         .          .    330:  // The common case is cheap
      80ms       80ms    331:  if (!unseen_indices)
         .          .    332:    return;
         .          .    333:
         .          .    334:  std::size_t old_size = this->size();
         .          .    335:
      20ms      110ms    336:  this->resize(old_size + unseen_indices);
         .          .    337:
         .       20ms    338:  auto md_it = _data.rbegin();
         .       10ms    339:  auto mi_it = _indices.rbegin();
         .          .    340:
         .      140ms    341:  auto d_it =
         .          .    342:    _data.rbegin() + unseen_indices;
         .          .    343:  auto i_it =
         .          .    344:    _indices.rbegin() + unseen_indices;
         .       40ms    345:  auto i2_it = new_indices.rbegin();
         .          .    346:
         .          .    347:  // Duplicate copies of rend() to work around
         .          .    348:  // http://www.open-std.org/jtc1/sc22/wg21/docs/lwg-defects.html#179
         .          .    349:  auto      mirend  = _indices.rend();
         .          .    350:  auto  rend  = mirend;
         .          .    351:  auto rend2 = new_indices.rend();
         .          .    352:#ifndef NDEBUG
         .          .    353:  auto      mdrend = _data.rend();
         .          .    354:  auto drend = mdrend;
         .          .    355:#endif
         .          .    356:
     1.06s      1.85s    357:  for (; mi_it != mirend; ++md_it, ++mi_it) {
     730ms      730ms    358:    if ((i_it == rend) ||
     960ms      960ms    359:        ((i2_it != rend2) &&
         .          .    360:         (*i2_it > *i_it))) {
     100ms      100ms    361:      *md_it = 0;
     590ms      590ms    362:      *mi_it = *i2_it;
         .      1.38s    363:      ++i2_it;
         .          .    364:    } else {
         .          .    365:      if ((i2_it != rend2) &&
         .          .    366:          (*i2_it == *i_it))
         .       20ms    367:        ++i2_it;
         .          .    368:      metaphysicl_assert(d_it < drend);
         .          .    369:      metaphysicl_assert(md_it < mdrend);
     270ms      270ms    370:      *md_it = *d_it;
         .          .    371:      *mi_it = *i_it;
         .       30ms    372:      ++d_it;
         .       60ms    373:      ++i_it;
         .          .    374:    }
         .          .    375:  }
         .          .    376:
         .          .    377:  metaphysicl_assert(i_it  == rend);
         .          .    378:  metaphysicl_assert(i2_it == rend2);
         .          .    379:  metaphysicl_assert(d_it  == drend);
         .          .    380:  metaphysicl_assert(md_it == mdrend);
     380ms      380ms    381:}
         .          .    382:
         .          .    383:
         .          .    384:  // Since this is a dynamically allocated sparsity pattern, we can
         .          .    385:  // decrease it when possible for efficiency
         .          .    386:template <typename Data, typename Indices, template <class...> class SubType, class... SubTypeArgs>

@roystgnr
Copy link
Member

Am I reading that right? 15% slower?

@lindsayad
Copy link
Member Author

You must be looking at the flat number? Cumulatively the timing is 4.5% faster which I rendered in my brain as noise

@NamjaeChoi
Copy link

How are you linking Kokkos with MetaPhysicL btw?

@NamjaeChoi
Copy link

I think we need to be careful about Kokkos symbols. We don't want to expose it to every source file in MOOSE.

@NamjaeChoi
Copy link

NamjaeChoi commented Oct 15, 2025

In MOOSE, they are protected by MOOSE_KOKKOS_SCOPE. I wonder if defining METAPHYSICL_HAVE_KOKKOS conditionally based on MOOSE_KOKKOS_SCOPE would be a simple solution. Since MetaPhysicL is a pure header library, it may not be that difficult.

@lindsayad
Copy link
Member Author

I don't know how we'd make an upstream library config depend on a downstream one

@NamjaeChoi
Copy link

Just let MOOSE define METAPHYSICL_HAVE_KOKKOS?

@lindsayad
Copy link
Member Author

lindsayad commented Oct 15, 2025

I think we should be able to split headers to only expose host code to "host" MetaPhysicL types. Like I should remove the using in semidynamicsparsenumberarray_decl.h that names the Kokkos-based array. I would hope that you can compile regular host sources with a host compiler including Kokkos_Macros.hpp at least if the only use is KOKKOS_INLINE_FUNCTION. I should hope that I don't have to link in any kokkos symbols for that

@NamjaeChoi
Copy link

Kokkos_Macros.hpp will have some backend symbols that will mandate compiling the whole stack with the backend compiler, for instance nvcc. I think we just define METAPHYSICL_HAVE_KOKKOS in KokkosHeader.h in MOOSE and just include MetaPhysicL after. KokkosHeader.h is guaranteed to be only included by Kokkos source files.

@NamjaeChoi
Copy link

NamjaeChoi commented Oct 21, 2025

What's the ETA of this change at MOOSE?

@lindsayad
Copy link
Member Author

Maybe a month would be my guess if I stop lollygagging

@lindsayad
Copy link
Member Author

wow beautiful Kokkos recipe pass. Such wow

Copy link
Member

@roystgnr roystgnr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I want to take a crack at restoring the std:: option this weekend, see how hard it is and whether we should just merge without it and reintroduce it in a later PR; in the meantime a few questions

test/Makefile.am Outdated
BUILT_SOURCES = .license.stamp

.license.stamp: $(top_srcdir)/LICENSE
.license.stamp:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why lose the dependency here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are warnings from autotools, something about a stamp not supporting a dependency

Copy link
Member

@roystgnr roystgnr Oct 28, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not getting anything from the libMesh standard autotools (autoconf 2.72, automake 1.18.1, libtool 2.5.4, still the latest on ftp.gnu.org). Could you be more specific? And is nothing complaining about the same line in src/Makefile.am?

In the meantime I'm just going to restore this, as long as I'm pushing more commits anyway; that command does legitimately depend on that file, even though it's hardly a dependency we change often.

lindsayad and others added 10 commits October 24, 2025 14:56
This is necessary when we're adding to that namespace in the
backwards-compatibility build
If we're going to use a new ENABLE_FOO for backwards compatibility we
need to be sure nobody's missing having it defined.
It's just plain easier to ensure that all definitions come after all the
declarations they might need if the declarations are in a separate file.
I haven't gotten any warnings about this, and the one in src/Makefile.am
seems to have been fine too.
There's some weird poorly-documented AC_CONFIG_HEADERS magic that lets
us get away with omitting the path here, but it's unnecessarily
confusing to rely on it.
Copy link
Member

@roystgnr roystgnr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I'm ready to merge this (once CI is happy, knock on wood; 2 recipes have passed so far), but I'll wait for confirmation that I haven't broken anything in your eyes.

@roystgnr
Copy link
Member

And for the record, I've verified that https://github.com/manufactured-solutions/MASA still builds and passes tests against a --enable-std-violation build of this branch.

@roystgnr
Copy link
Member

And with a few changes, the MASA master branch now passes both with and without --enable-std-violation.

Copy link
Member Author

@lindsayad lindsayad left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this LGTM now

@lindsayad lindsayad merged commit 57a9c79 into libMesh:master Oct 29, 2025
8 checks passed
@lindsayad lindsayad deleted the kokkos branch October 29, 2025 17:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants