v1.10.0 #1067
shi-eric announced in Announcements
Warp v1.10.0
Warp v1.10 expands JAX integration with automatic differentiation support and multi-device jax.pmap() compatibility. The tile programming model has been enhanced with axis-specific reductions, component-level indexing, and convenience functions for creating tiles.
Performance has been significantly improved in several areas: BVH operations now support in-place rebuilding for CUDA graphs and configurable leaf sizes, built-in function calls from Python are up to 70× faster, and additional sparse matrix and FEM operations can now be captured in CUDA graphs.
Additional usability improvements include negative indexing and slicing for arrays, atomic bitwise operations, and new built-in functions including error functions and type casting.
Important: This release removes the warp.sim module (deprecated since v1.8), which has been superseded by the Newton physics engine. See the Announcements section below for migration guidance and other upcoming changes.
For a complete list of changes, see the full changelog.
New features
JAX automatic differentiation (experimental)
Warp now supports experimental automatic differentiation with JAX, allowing kernels to participate in JAX automatic differentiation workflows. This feature is contributed by @mehdiataei and builds on earlier work by @jaro-sevcik. It enables computing gradients through Warp kernels using jax.grad() by passing enable_backward=True to jax_kernel().
Key capabilities include:
- wp.vec2 or wp.mat22 are fully supported
- jax.pmap() for distributed forward and backward passes across multiple GPUs
This feature is experimental and has some current limitations. See the JAX Automatic Differentiation documentation for complete examples, usage details, and limitations.
Multi-device JAX support with jax.pmap()
Warp now properly supports jax.pmap() and jax.shard_map() for multi-device parallel execution, thanks to fixes contributed by @chaserileyroberts. Previously, device targeting issues prevented Warp callables from working correctly within these JAX primitives: JAX would invoke callbacks from multiple threads targeting different devices, but Warp would always execute on the default device. The fix ensures proper device coordination by extracting device ordinals from XLA FFI and adding thread synchronization for concurrent callbacks, enabling efficient data-parallel workflows across multiple GPUs.
In-place BVH rebuilding with CUDA graph support
A new wp.Bvh.rebuild() method enables rebuilding BVH hierarchies in-place without allocating new memory. This complements the existing refit() method and is particularly useful when primitive distributions change significantly.
CUDA graph capture: Unlike creating a new BVH, rebuild() reuses existing buffers, making it safe to capture in CUDA graphs. Previously captured graphs that include queries on the BVH remain valid after rebuilding, enabling high-performance repeated updates without graph re-capture overhead.
Construction algorithms: On CUDA devices, in-place rebuild supports "lbvh" only. On CPU, "sah" and "median" are supported. Defaults are chosen automatically based on the device.
Tile programming enhancements
The tile programming model has been enhanced with new capabilities to make tile-based computations more expressive and convenient:
Axis-specific reductions
The tile-reduction functions wp.tile_reduce() and wp.tile_sum() now support an optional axis parameter, enabling reductions along a specific dimension of a tile rather than reducing the entire tile to a single value. This enhancement brings NumPy-like axis semantics to tile operations.
Component-level indexing
Tiles of composite types (vectors, matrices, quaternions) now support component-level indexing and assignment. You can directly index into individual components using extended indexing syntax:
- tile[i][1] extracts the second component of a vector at position i
- tile[i][1, 1] accesses the element at row 1, column 1 of a matrix at position i
This provides more convenient and expressive syntax for working with structured data in tiles.
Creating tiles filled with a constant value
The new wp.tile_full() function provides a convenient way to create tiles initialized with a constant value, similar to NumPy's np.full().
New example
The new example_tile_mcgp.py example demonstrates tile-based Monte Carlo methods by implementing a walk-on-spheres algorithm for solving Laplace's equation on volumetric domains.
Performance improvements
Built-in function calls from Python
Calling Warp built-in functions from Python scope (e.g., wp.normalize(), wp.transform_identity(), matrix arithmetic like mat * mat) is now significantly faster thanks to optimizations in overload resolution. Previously, each function call would iterate through all overloads, attempt argument binding, and pack parameters into C types until finding a match. Now, Warp caches the resolved overload and parameter packing strategy based on argument types using @functools.lru_cache, eliminating redundant resolution overhead on subsequent calls.
In microbenchmarks, repeated wp.mat44 multiplication at Python scope is up to 70× faster (~570 μs → ~8 μs), while operations like wp.transform_identity() see 3-4× speedups (~100 μs → ~30 μs). The magnitude of improvement varies by operation complexity, with greater gains for operations requiring more expensive overload resolution.
Breaking change: As part of this optimization, support for passing lists, tuples, and other non-Warp array arguments to built-in functions has been removed. Calls like wp.normalize([1.0, 2.0, 3.0]) must now be written as wp.normalize(wp.vec3(1.0, 2.0, 3.0)). This simplifies the function call path and removes expensive sequence-flattening logic that was incompatible with efficient caching.
Configurable BVH leaf size
wp.Bvh and wp.Mesh now expose tunable leaf_size and bvh_leaf_size parameters, respectively, allowing users to control the number of primitives stored in each leaf node for performance optimization. The optimal leaf size depends on the query workload.
Behavior change: The default leaf_size for wp.Bvh has changed from 4 (hardcoded) to 1, optimizing for intersection queries, which are more common. wp.Mesh retains a default bvh_leaf_size of 4 as a compromise between intersection and closest-point query performance. Users performing primarily closest-point queries may benefit from explicitly setting larger leaf sizes.
Sparse matrix operations with CUDA graphs
Sparse matrix operations in warp.sparse can now be captured in CUDA graphs for allocation-free execution. Operations like bsr_axpy(), bsr_assign(), and bsr_set_transpose() preserve matrix topology when using masked=True, while bsr_mm() adds a new max_new_nnz parameter that allows specifying an upper bound on new non-zero blocks for flexible graph capture when sparsity patterns vary within known bounds.
FEM operations with CUDA graphs
Building warp.fem geometry and function space partitions can now be captured in CUDA graphs by specifying upper bounds on partition sizes: max_cell_count and max_side_count for ExplicitGeometryPartition, and max_node_count for make_space_partition(). Additionally, building fields and restrictions is now synchronization-free by default.
Language enhancements
Array indexing and slicing improvements
Warp arrays now support negative indexing and improved slicing behavior, making array manipulation more intuitive and consistent with NumPy conventions.
Negative indexing: Access elements from the end of an array using negative indices.
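As a rough illustration of the convention, written in plain Python rather than Warp kernel code: a negative index i on an axis of length n refers to position n + i, which is the rule Python lists and NumPy arrays follow. The helper name resolve_index is hypothetical and not part of the Warp API, and whether Warp reports out-of-range indices the same way is an assumption.

```python
def resolve_index(i: int, n: int) -> int:
    """Map a possibly negative index onto the range [0, n).

    Hypothetical helper mirroring the Python/NumPy convention;
    it is not part of the Warp API.
    """
    if not -n <= i < n:
        raise IndexError(f"index {i} is out of bounds for length {n}")
    return i + n if i < 0 else i


values = [10.0, 20.0, 30.0, 40.0]
n = len(values)

last = values[resolve_index(-1, n)]         # same element as values[-1]
second_last = values[resolve_index(-2, n)]  # same element as values[-2]
```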
Enhanced array slicing: Arrays now support more flexible slicing operations within kernels, including stride-based access patterns. This works with both regular arrays and tile operations.
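To see which elements a strided slice selects, the sketch below expands start:stop:step into explicit positions using Python's built-in slice.indices(). It is plain Python that follows the Python/NumPy slicing rules; the helper name selected_indices is hypothetical, and the exact set of slice forms Warp accepts inside kernels is not covered here.

```python
def selected_indices(n: int, s: slice) -> list:
    """Return the positions that slice `s` selects on an axis of length n.

    Hypothetical helper for illustration; not part of the Warp API.
    """
    # slice.indices() clamps the bounds to [0, n] and resolves negative values.
    start, stop, step = s.indices(n)
    return list(range(start, stop, step))


data = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]

# Equivalent to data[::2]: every other element, starting at position 0.
every_other = [data[i] for i in selected_indices(len(data), slice(None, None, 2))]

# Equivalent to data[-3:]: the last three elements.
tail = [data[i] for i in selected_indices(len(data), slice(-3, None))]
```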
New built-in functions
- wp.erf(), wp.erfc(), wp.erfinv(), and wp.erfcinv() for error function computations
- wp.cast() to reinterpret values as different types while preserving bit patterns (e.g., reinterpreting float bits as int)
- wp.atomic_and(), wp.atomic_or(), and wp.atomic_xor() for thread-safe bitwise operations on integers, contributed by @j3soon
- wp.sparse.bsr_row_index() and wp.sparse.bsr_block_index() as kernel-level functions to efficiently determine which row a given block belongs to without manually searching through the compressed offset array
Bug fixes
AArch64 CPU execution with tiles
Fixed segmentation faults when running tile-based kernels on AArch64 CPUs, affecting platforms including NVIDIA Jetson (Thor, Orin), DGX Spark, Grace Hopper, and Grace Blackwell systems. The fix uses stack memory allocation instead of static memory to work around limitations in LLVM's JIT compiler.
This change is enabled by default on all CPU architectures and can be disabled if needed via wp.config.enable_tiles_in_stack_memory = False. If you encounter issues that are resolved by disabling this setting, please report them on our GitHub Issues page.
Note: This primarily affects CPU execution of tile operations, which is less common in Warp workflows but useful for debugging or scenarios in which GPU memory transfer overhead outweighs compute benefits.
Native library version verification
Warp now performs runtime version checking to detect mismatches between the Python package and native libraries (e.g., warp.dll, warp.so). This helps diagnose issues in which multiple Warp installations on the same system may cause the wrong native libraries to be loaded. When a mismatch is detected, a warning is issued but execution continues. If you see such warnings, ensure you're loading Warp from the expected installation location and that your environment doesn't have conflicting Warp versions.
Announcements
Removal of warp.sim module
The warp.sim module has been removed in this release. This module was formally deprecated in Warp v1.8 (July 2025) and has been superseded by the Newton physics engine, an independent package managed as a Linux Foundation project with a redesigned API focused on robotics and robot learning.
Migration: Users relying on warp.sim should migrate to Newton. For guidance on transitioning from warp.sim to Newton, please consult the Newton migration guide. The original deprecation announcement and community discussion can be found in GitHub Discussion #735.
Questions and discussions about Newton should be directed to the Newton Discussions section. Existing issues in the Warp repository concerning warp.sim will be closed.
JAX FFI is now the default
The default implementation of jax_kernel() is now based on JAX's Foreign Function Interface (FFI), which is required for JAX version 0.8 and newer. Most users should not need to change their code, as the FFI-based version has been available since Warp 1.7 and provides better performance through CUDA graph capture. The previous custom call implementation is still available as wp.jax_experimental.custom_call.jax_kernel() for users on older JAX versions, but it is deprecated and will not work with JAX version 0.8 or later.
Internal code reorganization: _src folder
As part of ongoing efforts to clarify Warp's public API surface, internal implementation code has been reorganized into a warp._src subpackage. This change helps distinguish between public APIs that users should rely on versus internal implementation details that may change without notice.
What this means for users:
- warp.context, warp.types, and warp.fem remain accessible at their current paths through compatibility shims.
- warp._src paths in error messages and stack traces (e.g., warp._src.context instead of warp.context).
- warp._src.* (acknowledging the use of private APIs).
This reorganization is the first step in a multi-phase effort to establish a stable public API. If you encounter any issues introduced by this reorganization, please report them on our GitHub Issues page.
Upcoming removals
The following features will be removed in v1.11 (planned for January 2026):
- wp.mat22(wp.vec2(1, 2), wp.vec2(3, 4))). Use wp.matrix_from_rows() or wp.matrix_from_cols() instead. This deprecation was originally announced in v1.9 with a planned removal in v1.10, but has been extended one release cycle. While kernel-scope usage had been emitting deprecation warnings since v1.9, it was discovered that Python-scope usage lacked proper warnings. Starting in v1.10, both contexts now emit deprecation warnings.
- graph_compatible parameter in jax_callable(): The boolean graph_compatible parameter has been deprecated in favor of the new graph_mode parameter which accepts GraphMode enum values (GraphMode.JAX, GraphMode.WARP, or GraphMode.NONE).
Platform support
Acknowledgments
We also thank the following contributors from outside the core Warp development team:
- struct() and overload() decorators
This discussion was created from the release v1.10.0.