# PyTorch 2.9.0 Release Notes
- [Highlights](#highlights)
- [Backwards Incompatible Changes](#backwards-incompatible-changes)
- [Deprecations](#deprecations)
- [New Features](#new-features)
- [Improvements](#improvements)
- [Bug Fixes](#bug-fixes)
- [Performance](#performance)
- [Documentation](#documentation)
- [Developers](#developers)
- [Security](#security)


# Highlights
TODO

For more details about these highlighted features, see the release blog post.
Below are the full release notes for this release.


# Backwards Incompatible Changes

## Min supported Python version is now 3.10 ([#162310](https://github.com/pytorch/pytorch/pull/162310))

The minimum version of Python required for PyTorch 2.9.0 is 3.10.

## Build Frontend

### Remove `/d2implyavx512upperregs` flag that slows build ([#159431](https://github.com/pytorch/pytorch/pull/159431))

### Add `ScalarType` to shim conversion and `stable::Tensor.scalar_type` ([#160557](https://github.com/pytorch/pytorch/pull/160557))

Before, user extensions could only pass around obfuscated dtypes that appeared as plain `int32_t` values. Now, users can confidently use `torch::headeronly::ScalarType` in their extensions for the major scalar types. This PR enables ABI stability by adding a translation layer through the shim, so that even if the `ScalarType` enum values change in the future, user extensions do not need to change.

This is narrowly BC-breaking for unpopular dtypes: `quint*`s, `qint*`s, `Bits*`, `dummy_uint*`s, `dummy_int*`s, `Float8_e8m0fnu`, and `Float4_e2m1fn_x2`, in the case where an extension retrieves a Tensor dtype of one of the above and passes it into `aoti_torch_call_dispatcher`.

## Export
### Switch off runtime asserts by default in favor of a shape guards function ([#160111](https://github.com/pytorch/pytorch/pull/160111), [#161178](https://github.com/pytorch/pytorch/pull/161178), [#161794](https://github.com/pytorch/pytorch/pull/161794))

To enable runtime asserts, use `export(..., prefer_deferred_runtime_asserts_over_guards=True)`. This change also removes the `allow_complex_guards_as_runtime_asserts` flag, merging its behavior into the option above.

Additionally, `exported_program.module()` will generate a call to a `_guards_fn` submodule that runs additional checks on inputs. Users who do not want this behavior can either remove this call from the graph or call `exported_program.module(check_guards=False)` to avoid generating it.

## MPS
### Build Metal kernels for MacOS-14+ and remove all pre-MacOS-14 specific logic; MacOS-14+ is required going forward ([#159733](https://github.com/pytorch/pytorch/pull/159733), [#159912](https://github.com/pytorch/pytorch/pull/159912))

PyTorch MPS is only supported on MacOS-14 or later.
If you need to use MPS on MacOS Ventura, please avoid updating to PyTorch 2.9 or above.

## ONNX
### Default to `dynamo=True` for ONNX exporter ([#159646](https://github.com/pytorch/pytorch/pull/159646), [#162726](https://github.com/pytorch/pytorch/pull/162726))

Previously `torch.onnx.export(...)` used the legacy TorchScript exporter if no arguments were provided. The ONNX exporter now uses the newer `torch.export.export` pipeline by default (`dynamo=True`). This change improves graph fidelity and future-proofs exports, but may surface graph capture errors that were previously masked or handled differently.

Previously in torch 2.8.0:

```python
# API calls the legacy exporter with dynamo=False
torch.onnx.export(...)
```

Now in torch 2.9.0:

```python
# To preserve the original behavior
torch.onnx.export(..., dynamo=False)

# Export onnx model through torch.export.export
torch.onnx.export(...)
```

Recommendation: first try the new default; only fall back if you hit blocking issues, and report them upstream.
Long-term solution: fix the root cause instead of relying on the fallback or the TorchScript exporter.

### Set default opset to 20 ([#158802](https://github.com/pytorch/pytorch/pull/158802))

Opset 20 enables newer operator definitions. If your tooling or downstream runtime only supports opset 18, pin it explicitly. For the latest ONNX operators, you can experiment with opset 23.

Previously in torch 2.8.0:

```python
# opset_version=18
torch.onnx.export(...)
```

Now in torch 2.9.0:

```python
# To preserve the original behavior
torch.onnx.export(..., opset_version=18)

# New: opset_version=20
torch.onnx.export(...)

# Use the latest supported opset: opset_version=23
torch.onnx.export(..., opset_version=23)
```

### Drop `draft_export` in exporter API ([#161454](https://github.com/pytorch/pytorch/pull/161454), [#162225](https://github.com/pytorch/pytorch/pull/162225))

Implicit draft tracing is removed from the default exporter path, giving clearer behavior and faster failures.
The expensive `torch.export.draft_export` diagnostic path is no longer auto-invoked (which could take hours on large models). You can still opt in for deep diagnostics:

Previously in torch 2.8.0:

```bash
# If both torch.export.export(..., strict=False) and
# torch.export.export(..., strict=True) fail to capture
# the model graph, torch.export.draft_export(...) will be triggered,
# and uses real tensors to trace/export the model.
#
# Inside export_to_onnx.py:
# ... torch.onnx.export(..., dynamo=True)
python export_to_onnx.py
```

Now in torch 2.9.0:

```bash
# To trigger torch.export.draft_export once
# torch.export.export strict=False/True both
# fail:

TORCH_ONNX_ENABLE_DRAFT_EXPORT=True python export_to_onnx.py
```

### Remove `torch.onnx.dynamo_export` and the `onnxrt` torch compile backend ([#158130](https://github.com/pytorch/pytorch/pull/158130), [#158258](https://github.com/pytorch/pytorch/pull/158258))

`torch.onnx.dynamo_export` is removed. Please use `torch.onnx.export` instead.
The experimental ONNX Runtime compile backend (`torch.compile(backend="onnxrt")`) is no longer supported.

### Remove `torch.onnx.enable_fake_mode` ([#161222](https://github.com/pytorch/pytorch/pull/161222))

The `dynamo=True` mode uses `FakeTensor`s by default, which is memory efficient, so a separate fake-mode context manager is no longer needed.
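
As a quick illustration, here is a minimal sketch of the replacement workflow with a toy model (the model, shapes, and file name are illustrative, not taken from the PRs above):

```python
import torch

class TinyModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.proj = torch.nn.Linear(8, 4)

    def forward(self, x):
        return torch.nn.functional.relu(self.proj(x))

model = TinyModel()
example_inputs = (torch.randn(2, 8),)

# The dynamo=True path traces through FakeTensors internally, so no
# separate enable_fake_mode() context manager is needed.
onnx_program = torch.onnx.export(model, example_inputs, dynamo=True)
onnx_program.save("tiny_model.onnx")
```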

### Some public facing utility APIs for the TorchScript based exporter are now private ([#161323](https://github.com/pytorch/pytorch/pull/161323))
### Remove `torch.onnx.symbolic_caffe2` ([#157102](https://github.com/pytorch/pytorch/pull/157102))

## Python Frontend
### Upgrade to DLPack 1.0. ([#145000](https://github.com/pytorch/pytorch/pull/145000))

This upgrade makes the same BC-breaking changes as the DLPack 1.0 release.
Objects in `torch.utils.dlpack` have been updated to reflect these changes, such as `DLDeviceType`.
See the PR for details on the exact changes and how to update your code.

### Raise appropriate errors in `torch.cat` ([#158249](https://github.com/pytorch/pytorch/pull/158249))

`torch.cat` now raises `ValueError`, `IndexError`, or `TypeError` where appropriate instead of a generic `RuntimeError`.
If your code was catching these errors, update it to catch the new error types.

# Deprecations
## Dataloader Frontend
### Deprecate `pin_memory_device` param in `torch.utils.data.DataLoader` ([#158323](https://github.com/pytorch/pytorch/pull/158323))

Enabling `pin_memory` has moved back inside `BaseDataLoaderIter`. This is required for `StatefulDataLoader`, which uses `BaseDataLoaderIter` directly rather than the `DataLoader` class init.

## Export
### Deprecation for `export_for_training` API, in favor of equivalent `export` API ([#158203](https://github.com/pytorch/pytorch/pull/158203))

`export_for_training` exists because we couldn't migrate internal usages of export to the final IR. Now that the migration is complete, this API is deprecated and will be deleted.

## Release Engineering
### Remove Python 3.9 support in CD builds. Move CI to Python 3.10. ([#161427](https://github.com/pytorch/pytorch/pull/161427)) ([#162265](https://github.com/pytorch/pytorch/pull/162265)) ([#162297](https://github.com/pytorch/pytorch/pull/162297)) ([#160852](https://github.com/pytorch/pytorch/pull/160852))

### Remove CUDA 12.9 support in CD builds ([#161916](https://github.com/pytorch/pytorch/pull/161916))

# New Features
## AOTDispatcher
- Add AOTDispatcher config to set backward autocast behavior ([#156356](https://github.com/pytorch/pytorch/pull/156356))

## Build Frontend
- Add transpose to `torch/csrc/stable` ([#158160](https://github.com/pytorch/pytorch/pull/158160))
- Add `zero_()` and `empty_like(t)` to `torch/csrc/stable/ops.h` ([#158866](https://github.com/pytorch/pytorch/pull/158866))

## C++ Extensions
- Add pad and narrow to `torch/csrc/stable/ops.h` ([#159328](https://github.com/pytorch/pytorch/pull/159328))
- Add `getCurrentDeviceIndex` to `torch::stable::accelerator` ([#160453](https://github.com/pytorch/pytorch/pull/160453))
- Add `new_zeros` dtype variant to the shim and as a stable op ([#161597](https://github.com/pytorch/pytorch/pull/161597))
- Update `torch::stable::Tensor()` default constructor ([#159507](https://github.com/pytorch/pytorch/pull/159507))
- Add beginnings of `torch::stable::accelerator` ([#159679](https://github.com/pytorch/pytorch/pull/159679))
- Port `amax` to stable ABI ([#160214](https://github.com/pytorch/pytorch/pull/160214))
- Add `new_empty` (with dtype argument only) to `torch::stable` ([#159508](https://github.com/pytorch/pytorch/pull/159508))
- Enable generating generic `c_shim` that doesn't bypass dispatcher ([#158974](https://github.com/pytorch/pytorch/pull/158974))
- Cut a version of `TORCH_ERROR_CODE_CHECK` in `headeronly` from AOTI ([#159604](https://github.com/pytorch/pytorch/pull/159604))
- Check
F2C BLAS for OpenBLAS and other vendors ([#143846](https://github.com/pytorch/pytorch/pull/143846)) +- Add an ovrsource target for `torch/headeronly` ([#157912](https://github.com/pytorch/pytorch/pull/157912)) +- Migrate `c10/macros/cmake_macros.h.in` to `torch/headeronly` ([#158035](https://github.com/pytorch/pytorch/pull/158035)) +- Move `c10/macros/Macros.h` to `headeronly` ([#158365](https://github.com/pytorch/pytorch/pull/158365)) +- Add `STD_TORCH_CHECK` to `headeronly` ([#158377](https://github.com/pytorch/pytorch/pull/158377)) +- Migrate easy q(u)int/bits stuff to `torch/headeronly` ([#159302](https://github.com/pytorch/pytorch/pull/159302)) +- Move `Float4` to `headeronly` ([#159414](https://github.com/pytorch/pytorch/pull/159414)) +- Move `BFloat16.h` to `headeronly` ([#159412](https://github.com/pytorch/pytorch/pull/159412)) +- Move `Float8` variations to `headeronly` ([#159415](https://github.com/pytorch/pytorch/pull/159415)) +- Move complex to `headeronly` ([#159411](https://github.com/pytorch/pytorch/pull/159411)) +- Migrate `ScalarType` to `headeronly` ([#159911](https://github.com/pytorch/pytorch/pull/159911)) +- Add stable Tensor `get_device_index`, use more stable `DeviceIndex` ([#160143](https://github.com/pytorch/pytorch/pull/160143)) +- Add `is_cpu` method to stable tensor type ([#160212](https://github.com/pytorch/pytorch/pull/160212)) +- Remove cmake cache and reconfigure again if it is invalid ([#156958](https://github.com/pytorch/pytorch/pull/156958)) +- Remove `wheel` from build requirements ([#158027](https://github.com/pytorch/pytorch/pull/158027)) +- Error when `TORCH_STABLE_ONLY` is defined in `TensorBase.h` ([#161658](https://github.com/pytorch/pytorch/pull/161658)) + +## CPU +- Support GQA for flash attention ([#157893](https://github.com/pytorch/pytorch/pull/157893)) + +## CUDA +- MXFP8 grouped GEMM support for `torch._scaled_grouped_mm` + submodule bump ([#162209](https://github.com/pytorch/pytorch/pull/162209)) +- Add getter for CUDA graph exec to allow mutation of captured kernel params ([#161294](https://github.com/pytorch/pytorch/pull/161294)) +- Implement support for `cudnn_batch_norm_out` kernel to replace the autogen approach ([#123020](https://github.com/pytorch/pytorch/pull/123020)) + +## Distributed +### Symmetric Memory +- NVSHMEM support for Triton 3.5 ([#163152](https://github.com/pytorch/pytorch/pull/163152)) + +## Dynamo +- Experimental API for ahead-of-time compiling models in fullgraph mode ([#161383](https://github.com/pytorch/pytorch/pull/161383)) +- Toggle erroring/resume on graph break with `torch._dynamo.error_on_graph_break` ([#161739](https://github.com/pytorch/pytorch/pull/161739), [#161747](https://github.com/pytorch/pytorch/pull/161747)) +- Add a hook for recompilations ([#157961](https://github.com/pytorch/pytorch/pull/157961)) + +## Export +- Add support for param mutation under inference mode ([#159661](https://github.com/pytorch/pytorch/pull/159661)) + +## FX +- Extend torch function support to ALL arguments instead of just scalar type (but not inside of list) ([#145089](https://github.com/pytorch/pytorch/pull/145089)) +- Add `is_fx_symbolic_tracing` flag ([#161385](https://github.com/pytorch/pytorch/pull/161385)) + +## Inductor +- Allow user to pass in custom partitioner function ([#157580](https://github.com/pytorch/pytorch/pull/157580)) + +## JIT +- Add `torch._check` compatibility support ([#159988](https://github.com/pytorch/pytorch/pull/159988)) + +## MPS +- Partial sparse support for MPS backend 
([\#159729](https://github.com/pytorch/pytorch/pull/159729), [\#160254](https://github.com/pytorch/pytorch/pull/160254), [\#160223](https://github.com/pytorch/pytorch/pull/160223), [\#161846](https://github.com/pytorch/pytorch/pull/161846), [\#162007](https://github.com/pytorch/pytorch/pull/162007), [#157238](https://github.com/pytorch/pytorch/pull/157238)) +- Add `avg_pool3d`, `max_unpool1d/2d/3d`, `max_pool3d`, `max_pool3d` bwd pass, and `avg_pool3d` bwd pass for MPS ([#158877](https://github.com/pytorch/pytorch/pull/158877),[#159789](https://github.com/pytorch/pytorch/pull/159789), [#156467](https://github.com/pytorch/pytorch/pull/156467), [#157498](https://github.com/pytorch/pytorch/pull/157498), [#159089](https://github.com/pytorch/pytorch/pull/159089)) + +## ONNX +- RMS Norm support in opset 23 ([#159377](https://github.com/pytorch/pytorch/pull/159377)) + +## Optimizer +- Introduce Muon optimizer to PyTorch ([#160213](https://github.com/pytorch/pytorch/pull/160213)) + +## Profiler +- Add GC Events to Python Stack Tracer ([#161209](https://github.com/pytorch/pytorch/pull/161209)) +- Add a custom profiler configuration option ([#151656](https://github.com/pytorch/pytorch/pull/151656)) + +## Python Frontend +- Add utility to get the kernel currently registered on the dispatcher ([#158393](https://github.com/pytorch/pytorch/pull/158393)) +- Extend `__torch_function__` handler to be triggered by elements within a list ([#160256](https://github.com/pytorch/pytorch/pull/160256)) +- Add `torch.hash_tensor` reduction function ([#154149](https://github.com/pytorch/pytorch/pull/154149)) + +## Quantization +- Enable cpu fp8 qlinear ([#155678](https://github.com/pytorch/pytorch/pull/155678)) +- Enable cpu fp8 qconv ([#157076](https://github.com/pytorch/pytorch/pull/157076)) + +## Release Engineering +- Add support for CUDA 13.0 in CI/CD builds. 
Enable CUDA compression mode for binary size reduction for CUDA 13.0 builds ([#160956](https://github.com/pytorch/pytorch/pull/160956)) ([#161073](https://github.com/pytorch/pytorch/pull/161073)) ([#161257](https://github.com/pytorch/pytorch/pull/161257)) ([#161663](https://github.com/pytorch/pytorch/pull/161663)) ([#161316](https://github.com/pytorch/pytorch/pull/161316)) ([#160201](https://github.com/pytorch/pytorch/pull/160201)) ([#160770](https://github.com/pytorch/pytorch/pull/160770)) ([#161013](https://github.com/pytorch/pytorch/pull/161013)) ([#161916](https://github.com/pytorch/pytorch/pull/161916)) ([#162268](https://github.com/pytorch/pytorch/pull/162268)) ([#162322](https://github.com/pytorch/pytorch/pull/162322)) ([#162383](https://github.com/pytorch/pytorch/pull/162383)) ([#161833](https://github.com/pytorch/pytorch/pull/161833)) + +- Enable CUDA 12.6, 12.8 and 13.0 support for Linux ARM64 CD builds ([#162364](https://github.com/pytorch/pytorch/pull/162364)) ([#160720](https://github.com/pytorch/pytorch/pull/160720)) ([#159481](https://github.com/pytorch/pytorch/pull/159481)) + +- Add support for Python 3.14 in CI/CD builds ([#156889](https://github.com/pytorch/pytorch/pull/156889)) ([#157559](https://github.com/pytorch/pytorch/pull/157559)) ([#159261](https://github.com/pytorch/pytorch/pull/159261)) ([#159869](https://github.com/pytorch/pytorch/pull/159869)) ([#160593](https://github.com/pytorch/pytorch/pull/160593)) ([#160788](https://github.com/pytorch/pytorch/pull/160788)) ([#161255](https://github.com/pytorch/pytorch/pull/161255)) ([#159725](https://github.com/pytorch/pytorch/pull/159725)) + +- Enable NVSHMEM integration ([#151261](https://github.com/pytorch/pytorch/pull/151261)) ([#153010](https://github.com/pytorch/pytorch/pull/153010)) ([#154538](https://github.com/pytorch/pytorch/pull/154538)) ([#155506](https://github.com/pytorch/pytorch/pull/155506)) ([#156685](https://github.com/pytorch/pytorch/pull/156685)) ([#158938](https://github.com/pytorch/pytorch/pull/158938)) ([#161321](https://github.com/pytorch/pytorch/pull/161321)) ([#160778](https://github.com/pytorch/pytorch/pull/160778)) ([#159907](https://github.com/pytorch/pytorch/pull/159907)) ([#160465](https://github.com/pytorch/pytorch/pull/160465)) + +## ROCm +- OCP Micro-scaling Format (mx-fp8/mx-fp4) Support ([#151360](https://github.com/pytorch/pytorch/pull/151360)) +- Support experimental CU carveout `torch._C._set_sm_carveout_experimental()` ([#149466](https://github.com/pytorch/pytorch/pull/149466)) +- Add FP8 rowwise support to `_scaled_grouped_mm` ([#159075](https://github.com/pytorch/pytorch/pull/159075)) + +## XPU +- Enable `FlexAttention` on Intel GPU ([#143553](https://github.com/pytorch/pytorch/pull/143553)) +- Enable `_int_mm` on Intel GPU ([#157769](https://github.com/pytorch/pytorch/pull/157769)) + +# Improvements +## AOTDispatcher +- Skip logging in fp8 activation quantization if there are no nodes to be quantized ([#158129](https://github.com/pytorch/pytorch/pull/158129)) +- Add `aot_export_joint_with_descriptors` and `aot_compile_joint_with_descriptors` ([#158715](https://github.com/pytorch/pytorch/pull/158715)) +- Allow keeping input mutations in the graph for `_aot_export_function` ([#157730](https://github.com/pytorch/pytorch/pull/157730)) +- Extract out `prepare_aot_module_simplified` for use in next PR ([#158319](https://github.com/pytorch/pytorch/pull/158319)) +- Rename modules in AOTAutograd ([#158449](https://github.com/pytorch/pytorch/pull/158449)) +- Track descriptors for all 
inputs/outputs of AOTAutograd traced graph ([#158624](https://github.com/pytorch/pytorch/pull/158624)) +- Improve graph output alias with subclass error message ([#159619](https://github.com/pytorch/pytorch/pull/159619)) +- Pass fw/bw compilers to `aot_export_joint_with_descriptors` ([#159814](https://github.com/pytorch/pytorch/pull/159814)) + +## Autograd +- Support deterministic `torch.nn.Upsample` `mode="trilinear"` backward ([#154239](https://github.com/pytorch/pytorch/pull/154239)) + +## Build Frontend +- Fix dev warning in `Dependencies.cmake` ([#159702](https://github.com/pytorch/pytorch/pull/159702)) +- Fix building system gloo with CUDA/HIP ([#146637](https://github.com/pytorch/pytorch/pull/146637)) +- Build `libtorch` without NVSHMEM ([#160910](https://github.com/pytorch/pytorch/pull/160910)) + +## Composability +- Set `enable_gqa` for `aten._scaled_dot_product_attention_math decomp`([#158604](https://github.com/pytorch/pytorch/pull/158604)) +- Meta implementation for `aten._scaled_dot_product_attention_math_for_mps` ([#159695](https://github.com/pytorch/pytorch/pull/159695)) +- Meta implementation for `aten.add.Scalar` ([#161332](https://github.com/pytorch/pytorch/pull/161332)) +- `aten.expand_copy` decomp ([#161688](https://github.com/pytorch/pytorch/pull/161688)) +- Fix result dtype cast in decomp for `aten.linalg_vector_norm` ([#155111](https://github.com/pytorch/pytorch/pull/155111)) +- Add dtype checks in meta implementation for several ordering ops ([#159556](https://github.com/pytorch/pytorch/pull/159556)) +- Fix meta function for `aten.complex` ([#160894](https://github.com/pytorch/pytorch/pull/160894)) +- Improve shape checks for `aten._grouped_mm` ([#159666](https://github.com/pytorch/pytorch/pull/159666)) +- Improve unbacked symint (dynamic shape) support for several decompositions ([#148815](https://github.com/pytorch/pytorch/pull/148815), [#156902](https://github.com/pytorch/pytorch/pull/156902), [#157008](https://github.com/pytorch/pytorch/pull/157008), [#158894](https://github.com/pytorch/pytorch/pull/158894), [#159184](https://github.com/pytorch/pytorch/pull/159184), [#160683](https://github.com/pytorch/pytorch/pull/160683), [#160253](https://github.com/pytorch/pytorch/pull/160253), [#162084](https://github.com/pytorch/pytorch/pull/162084), [#162099](https://github.com/pytorch/pytorch/pull/162099), [#162109](https://github.com/pytorch/pytorch/pull/162109), [#160462](https://github.com/pytorch/pytorch/pull/160462)) + +## C++ Frontend +- Generalized `AllocatorConfig` to be device-agnostic via new `AcceleratorAllocatorConfig` ([#149601](https://github.com/pytorch/pytorch/pull/149601), [#150312](https://github.com/pytorch/pytorch/pull/150312)) +- Added `Scalar::isUnsigned()` method ([#159877](https://github.com/pytorch/pytorch/pull/159877)) +- Exposed `ModelRunner` from nativert as public ([#159989](https://github.com/pytorch/pytorch/pull/159989)) +- Improve error message for `torch.binomial` enforcing float inputs ([#157658](https://github.com/pytorch/pytorch/pull/157658)) + +## CPU (AArch64) +- Made PyTorch compilable with gcc-14 on ARM ([#157867](https://github.com/pytorch/pytorch/pull/157867)) + +## CUDA +- Make cublaslt/hipblaslt workspaces persistent ([#156495](https://github.com/pytorch/pytorch/pull/156495)) +- Remove unnecessary warnings during the ATen compilation process ([#157703](https://github.com/pytorch/pytorch/pull/157703)) +- Slightly improve error message from `repeat_interleave` kernel ([#157996](https://github.com/pytorch/pytorch/pull/157996)) +- 
Add framework for explanations for common CUDA errors ([#158395](https://github.com/pytorch/pytorch/pull/158395)) +- Upgrade KernelLauncher `kernelLaunchCheck` to print help string ([#158896](https://github.com/pytorch/pytorch/pull/158896)) +- Prep for cutlass upgrade by ignoring `Wunused-but-set-variable` ([#159276](https://github.com/pytorch/pytorch/pull/159276)) +- Workaround ATen SFINAE under `libc++` ([#161101](https://github.com/pytorch/pytorch/pull/161101)) +- Implement changes to CCCL (CUB/Thrust/LibCUDACXX) usage in ATen ([#153373](https://github.com/pytorch/pytorch/pull/153373)) +- Add maybe unused flag to remove warning ([#157655](https://github.com/pytorch/pytorch/pull/157655)) +- Use new CCCL API in v2.8 ([#160554](https://github.com/pytorch/pytorch/pull/160554)) +- Improve cupy device placement when device is provided with explicit index ([#158529](https://github.com/pytorch/pytorch/pull/158529)) + +## Distributed +### c10d + - Add improvements to eager init of `ProcessGroupNCCL` ([#156748](https://github.com/pytorch/pytorch/pull/156748)) + - Simplify unique hash management of `ProcessGroupNCCL` ([#156790](https://github.com/pytorch/pytorch/pull/156790)) + - Support per operation timeouts in `ProcessGroupGloo` ([#158128](https://github.com/pytorch/pytorch/pull/158128)) + - Allow ping to be retried in `TCPStore` ([#159165](https://github.com/pytorch/pytorch/pull/159165)) + - Support scalar tensor for functional `all_gather` ([#149913](https://github.com/pytorch/pytorch/pull/149913)) + - Expos `unsafe_get_ptr` for dist.ProcessGroupNCCL.NCCLConfig ([#161136](https://github.com/pytorch/pytorch/pull/161136)) + - Add batch option for `send/recv_object_list` ([#160342](https://github.com/pytorch/pytorch/pull/160342)) + - Make FakeStore optional to be passed into fake backend ([#162164](https://github.com/pytorch/pytorch/pull/162164)) + - Enable complex datatype support in `ProcessGroupGloo` ([#156633](https://github.com/pytorch/pytorch/pull/156633)) + - Move thread-local capture mode guard to include `work.isStarted` ([#160398](https://github.com/pytorch/pytorch/pull/160398)) +### Device Mesh + - Enable the use of user set backend and pg option even for the global mesh ([#157501](https://github.com/pytorch/pytorch/pull/157501)) + - Enable slicing a submesh with warnings ([#158899](https://github.com/pytorch/pytorch/pull/158899)) + - Allow controlling PG backend and options via `init_device_mesh` ([#159371](https://github.com/pytorch/pytorch/pull/159371)) +### DistributedDataParallel (DDP) + - Support ddp zero hook XCCL path ([#159240](https://github.com/pytorch/pytorch/pull/159240)) +### DTensor + - Relax `device_mesh` argument constraint in `local_map` ([#157049](https://github.com/pytorch/pytorch/pull/157049)) + - Support complex numbers in DTensor redistribute ([#157329](https://github.com/pytorch/pytorch/pull/157329)) + - Rework partial propagation in point-wise op and support mul ([#157340](https://github.com/pytorch/pytorch/pull/157340)) + - Allow dynamic shapes for `DTensor` slice ([#157953](https://github.com/pytorch/pytorch/pull/157953)) + - Implement `histc` op ([#158298](https://github.com/pytorch/pytorch/pull/158298)) + - Made dispatch to sharding prop over decomps ([#159324](https://github.com/pytorch/pytorch/pull/159324)) + - Support user-supplied Generator for random ops ([#159933](https://github.com/pytorch/pytorch/pull/159933)) + - Add `propagate_tensor_meta` function that skips cache if `_are_we_tracing` ([#161334](https://github.com/pytorch/pytorch/pull/161334)) + 
- Support `local_map` as a decorator ([#161353](https://github.com/pytorch/pytorch/pull/161353)) +### FullyShardedDataParallel2 (FSDP2) + - Support custom `all_gather` and `reduce_scatter` comms ([#155189](https://github.com/pytorch/pytorch/pull/155189)) + - Made it fail `set_allocate_memory_from_process_group` if used together with custom comm hooks ([#157487](https://github.com/pytorch/pytorch/pull/157487)) + - Use `reduceOpSum` when world size is 1 ([#157529](https://github.com/pytorch/pytorch/pull/157529)) + - Skipp `allgather` when world size is 1 ([#160135](https://github.com/pytorch/pytorch/pull/160135)) + - Use `post_reduce_stream.record_event()` on hsdp+cpuoffload ([#160481](https://github.com/pytorch/pytorch/pull/160481)) +### Pipeline Parallelism (PP) + - Add `eval()` API to schedule ([#157795](https://github.com/pytorch/pytorch/pull/157795)) + - Allow intermediate nodes in zero bubble to have multiple grads ([#159084](https://github.com/pytorch/pytorch/pull/159084)) + - Support `OVERLAP_F_B` computation type ([#158978](https://github.com/pytorch/pytorch/pull/158978)) + - Initializ P2P communicators on first step ([#160210](https://github.com/pytorch/pytorch/pull/160210)) + - Add `DualPipeV` schedule ([#159591](https://github.com/pytorch/pytorch/pull/159591)) +### TorchElastic + - Enable NUMA binding integration with elastic agent and `torchrun` ([#149334](https://github.com/pytorch/pytorch/pull/149334)) + - Support NUMA Binding for Callable Entrypoints ([#160163](https://github.com/pytorch/pytorch/pull/160163), [#161183](https://github.com/pytorch/pytorch/pull/161183)) +### Tensor Parallel (TP) + - Improve `parallelize_module` API to support more cases ([#157182](https://github.com/pytorch/pytorch/pull/157182)) +### TensorPipe + - Update TensorPipe pinned dependency version ([#159834](https://github.com/pytorch/pytorch/pull/159834)) + +## Dynamo +- Improve tracing support for various Python builtin data structures/modules: + - `list`s (e.g. [#153969](https://github.com/pytorch/pytorch/pull/153969)) + - `set`s (e.g. [#153150](https://github.com/pytorch/pytorch/pull/153150)) + - `dict`s (e.g. [#154794](https://github.com/pytorch/pytorch/pull/154794)) + - `iter` (e.g. [#156371](https://github.com/pytorch/pytorch/pull/156371)) + - `itertools` (e.g. [#159693](https://github.com/pytorch/pytorch/pull/159693)) + - `collections` (e.g. 
[#159365](https://github.com/pytorch/pytorch/pull/159365)) + - `collections.NamedTuple` ([#159367](https://github.com/pytorch/pytorch/pull/159367)) + - frozen `dataclasses.dataclass` ([#159529](https://github.com/pytorch/pytorch/pull/159529)) +- Graph break error messages link to a website with more information ([#159011](https://github.com/pytorch/pytorch/pull/159011)) +- Add option for `TorchDispatchMode` to ignore `torch.compile` internals ([#161648](https://github.com/pytorch/pytorch/pull/161648)) + +## Export +- Add `_compile_and_package` method for ExportPackage ([#156638](https://github.com/pytorch/pytorch/pull/156638)) +- Handle `None` & ellipsis slicing/select in non-strict ([#157821](https://github.com/pytorch/pytorch/pull/157821)) +- Extend FP8 types in serialization ([#158430](https://github.com/pytorch/pytorch/pull/158430)) +- Improve error messages for deserialization ([#159881](https://github.com/pytorch/pytorch/pull/159881)) +- Support serialization for `triton_kernel_wrapper_functional` HOP ([#161314](https://github.com/pytorch/pytorch/pull/161314)) +- Support serialization for complex constants ([#161517](https://github.com/pytorch/pytorch/pull/161517)) +- Add runtime asserts to `while_loop` HOP subgraphs ([#158467](https://github.com/pytorch/pytorch/pull/158467)) +- Warn on side-effectful code in strict mode ([#160060](https://github.com/pytorch/pytorch/pull/160060)) +- Support for vmap in pre-dispatch export ([#154650](https://github.com/pytorch/pytorch/pull/154650)) +- Support vmap and custom autograd function/improve DTensor constructor inefficiency ([#162240](https://github.com/pytorch/pytorch/pull/162240)) + +## Foreach +- Invoke `vector.reserve()` consistently for non-inplace foreach operations ([#161128](https://github.com/pytorch/pytorch/pull/161128)) +- Faster and safer lambda expression capture in `has_integral_tensor()` ([#161042](https://github.com/pytorch/pytorch/pull/161042)) + +## FX +- Fix DCE eliminating random operations by improving `is_impure()` (#151524) ([#157981](https://github.com/pytorch/pytorch/pull/157981)) +- Support converting a float32 tensor to a scalar in FX trace. 
([#158216](https://github.com/pytorch/pytorch/pull/158216)) +- Correctly copy `self.module_stack` in ModuleStackTracer ([#159956](https://github.com/pytorch/pytorch/pull/159956)) +- Add tool to track events in graph split ([#159795](https://github.com/pytorch/pytorch/pull/159795)) +- Add `node_name_match` to subgraph rewriter ([#157574](https://github.com/pytorch/pytorch/pull/157574)) + +## Inductor +- Add Inductor support for MTIA backend ([#159211](https://github.com/pytorch/pytorch/pull/159211)) +- Share default device context when all graph partitions and cudagraph-unsafe ops are on the same device([#162873](https://github.com/pytorch/pytorch/pull/162873)) + +## Ahead-Of-Time Inductor (AOTI) +- Enable AOTI for CPU on Windows ([#158915](https://github.com/pytorch/pytorch/pull/158915)) +- Re-enable TMA templates w/ AOTI ([#157819](https://github.com/pytorch/pytorch/pull/157819)) +- Don't allow int32 indices if `{non-inf, > int32_max}` upper bound is provided ([#159433](https://github.com/pytorch/pytorch/pull/159433)) +- Add RecordFunction to C shim so that profiling works with AOTI ([#159842](https://github.com/pytorch/pytorch/pull/159842)) +- Add AOTI C shim functions for collective ops ([#154492](https://github.com/pytorch/pytorch/pull/154492)) +- Add missing ops to set of C-shim ops which can have nullptr returns ([#158073](https://github.com/pytorch/pytorch/pull/158073)) + +## Linear Algebra Frontend +- Use rocSOLVER for Cholesky inversion on AMD. ([#157154](https://github.com/pytorch/pytorch/pull/157154)) +- Add option for using TF32 as fp32 internal precision for matmul/linear/conv on MKLDNN ([#157520](https://github.com/pytorch/pytorch/pull/157520)) +- Make einsum produce contiguous outputs in more cases ([#161755](https://github.com/pytorch/pytorch/pull/161755)) + +## MPS +- Add `shifted_chebyshev_polynomial_[tuvw]`, `igamma/igammac,grid_sampler_3d, native_dropout`/`native_dropout_backward` ([\#157488](https://github.com/pytorch/pytorch/pull/157488), [\#161927](https://github.com/pytorch/pytorch/pull/161927), [\#160541](https://github.com/pytorch/pytorch/pull/160541), [\#162108](https://github.com/pytorch/pytorch/pull/162108)) +- Extend atomic operations to all int types ([\#158179](https://github.com/pytorch/pytorch/pull/158179)) +- Extend `index_put` to complex types ([\#160159](https://github.com/pytorch/pytorch/pull/160159)) +- Extend `addmm` to integral types ([\#160270](https://github.com/pytorch/pytorch/pull/160270)) +- Add support for unsigned types ([\#159094](https://github.com/pytorch/pytorch/pull/159094)) +- Add API to query GPU core count ([\#160414](https://github.com/pytorch/pytorch/pull/160414)) +- Add `kthvalue` ([\#161817](https://github.com/pytorch/pytorch/pull/161817)) +- Type-promote tensor-iterator common dtype ([\#160334](https://github.com/pytorch/pytorch/pull/160334)) +- Implement `logcumsumexp` metal kernel ([\#156858](https://github.com/pytorch/pytorch/pull/156858)) +- Enable `dlpack` integration ([\#158888](https://github.com/pytorch/pytorch/pull/158888)) +- Dynamic reductions ([\#159355](https://github.com/pytorch/pytorch/pull/159355)) +- Update `avg_pool2d` to use Metal kernel when `ceil_mode=True` ([\#161011](https://github.com/pytorch/pytorch/pull/161011)) + +## Nested Tensor (NJT) +- Added initial `log_softmax()` support ([#159662](https://github.com/pytorch/pytorch/pull/159662)) + +## torch.nn +- Allow `register_buffer` with `Tensor`-like objects ([#159455](https://github.com/pytorch/pytorch/pull/159455)) +- Improve error message for unsupported 
padding configurations ([#160866](https://github.com/pytorch/pytorch/pull/160866)) +- Validate target is 0D when input is 1D in `NLLLoss` ([#161412](https://github.com/pytorch/pytorch/pull/161412)) + +## ONNX +- Support symbolic arguments in ONNX exporter ([#157734](https://github.com/pytorch/pytorch/pull/157734)) +- Fix `torch.tensor` warning in ONNX `symbolic_opset10` export ([#158835](https://github.com/pytorch/pytorch/pull/158835)) + +## Optimizer +- Resolve warning in LBFGS when converting a tensor with `requires_grad=True` to a scalar ([#160389](https://github.com/pytorch/pytorch/pull/160389)) +- Resolve `SequentialLR` deprecation warning about invoking `step(epoch)` ([#149392](https://github.com/pytorch/pytorch/pull/149392)) + +## Profiler +- Add more CUDA API for kernel launcher ([#156016](https://github.com/pytorch/pytorch/pull/156016)) +- Allow Custom Time Unit When Printing Profiler Table ([#157913](https://github.com/pytorch/pytorch/pull/157913)) +- Update CUDA runtime kernel identification logic ([#157890](https://github.com/pytorch/pytorch/pull/157890)) + +## Python Frontend +- Speed up `torch.load` under `FakeTensorMode` by reducing random reads ([#157931](https://github.com/pytorch/pytorch/pull/157931)) +- Make `torch.utils.benchmark.utils.timer` accelerator agnostic ([#157131](https://github.com/pytorch/pytorch/pull/157131)) +- Improve error message for weight-only load errors ([#159935](https://github.com/pytorch/pytorch/pull/159935)) + +## Quantization +- Avoid getting model device once per node for pt2e quantization flow ([#159901](https://github.com/pytorch/pytorch/pull/159901)) +- Fixes bug in implementation of `HistogramObserver` ([#156457](https://github.com/pytorch/pytorch/pull/156457)) +- Support `bias=None` for `fbgemm_linear_fp16_weight` CPU op ([#158535](https://github.com/pytorch/pytorch/pull/158535)) +- Add Static Dispatch Kernel for `wrapped_fbgemm_linear_fp16_weight` for Sigmoid ([#160451](https://github.com/pytorch/pytorch/pull/160451)) + +## Release Engineering +- Enable vLLM testing workflow ([#160583](https://github.com/pytorch/pytorch/pull/160583)) ([#161565](https://github.com/pytorch/pytorch/pull/161565)) ([#162292](https://github.com/pytorch/pytorch/pull/162292)) ([#162000](https://github.com/pytorch/pytorch/pull/162000)) ([#161797](https://github.com/pytorch/pytorch/pull/161797)) +- Enable Windows ARM64 CI testing ([#148753](https://github.com/pytorch/pytorch/pull/148753)) ([#161504](https://github.com/pytorch/pytorch/pull/161504)) +- Enable PyTorch ROCm CI for MI355X testing. ([#158889](https://github.com/pytorch/pytorch/pull/158889)) + +## ROCm +- Additional hipify mappings ([#158056](https://github.com/pytorch/pytorch/pull/158056), [#158352](https://github.com/pytorch/pytorch/pull/158352), [#161992](https://github.com/pytorch/pytorch/pull/161992)) +- Refactor `composable_kernel` (CK) backend user interface to improve user experience ([#152951](https://github.com/pytorch/pytorch/pull/152951)) +- Allow use of `rocSOLVER` for Cholesky inversion. 
([#157154](https://github.com/pytorch/pytorch/pull/157154)) +- AOT Inductor enable gfx950 for max autotune using CK ([#159195](https://github.com/pytorch/pytorch/pull/159195)) +- Add flag `torch.backends.miopen.immediate` to toggle MIOpen Immediate Mode instead of relying on `deterministic=True` and `benchmark=False` ([#158951](https://github.com/pytorch/pytorch/pull/158951)) +- MIOpen convolutions no longer call `reshape_` or unexpectedly change memory formats ([#161687](https://github.com/pytorch/pytorch/pull/161687)) + +## XPU +- Support Intel GPU quantization ops in AOTInductor ([#156572](https://github.com/pytorch/pytorch/pull/156572)) +- Add `device_id` to Intel GPU properties to distinguish iGPUs with identical names ([#156481](https://github.com/pytorch/pytorch/pull/156481)) + +# Bug Fixes +## Autograd +- Fix `torch.autograd.Function` memory leak due to `torch.utils.checkpiont` early stopping ([#161171](https://github.com/pytorch/pytorch/pull/161171)) +- Fix `torch.autograd.graph.GradientEdge` for `torch.autograd.Function` ([#160098](https://github.com/pytorch/pytorch/pull/160098)) +- Match 0-dim gradients device type regardless of subclass-ness ([#160165](https://github.com/pytorch/pytorch/pull/160165)) + +## C++ Frontend +- Fix `torch.utils.cpp_extension` parser for clang version 20.1.7+libcxx ([#157666](https://github.com/pytorch/pytorch/pull/157666)) +- Fix `MakeTensor::computeStorageSize()` calculation ([#158690](https://github.com/pytorch/pytorch/pull/158690)) +- Fix static initialization order issue with `AllocatorConfig` ([#159629](https://github.com/pytorch/pytorch/pull/159629)) + +## CPU +- Add check so non-aarch64 platforms can hit `MKLDNN` path ([#162168](https://github.com/pytorch/pytorch/pull/162168)) + +## CUDA +- Handle uninitialized `torch.backends.cuda.matmul.fp32_precision` ([#161102](https://github.com/pytorch/pytorch/pull/161102)) +- Fix nansum in non-JIT build ([#158633](https://github.com/pytorch/pytorch/pull/158633)) +- Decrease launch bounds of CTCLoss backward for blackwell to avoid crash ([#159522](https://github.com/pytorch/pytorch/pull/159522)) +- Implement workaround for `cudaErrorNotSupported` ([#162412](https://github.com/pytorch/pytorch/pull/162412)) +- Fix missing `__syncthreads` in MultiMarginLoss backward ([#158994](https://github.com/pytorch/pytorch/pull/158994)) +- Roll-back cuDNN frontend upgrade and update Meta registration due to compile issues ([#163104](https://github.com/pytorch/pytorch/pull/163104)) + +## Distributed +### c10d + - Fix slow init due to repeated dns resolution failure in socket ([#159596](https://github.com/pytorch/pytorch/pull/159596)) + - Fix `setGroupName` and `setGroupDesc` in `group_split` and `merge_remote_group` ([#159429](https://github.com/pytorch/pytorch/pull/159429)) + - Fix a bug of distributed 'gather' with noncontiguous tensors on the Gloo backend ([#158903](https://github.com/pytorch/pytorch/pull/158903)) + - Fix a bug of distributed 'gather' with noncontiguous tensors on the NCCL backend ([#159549](https://github.com/pytorch/pytorch/pull/159549)) +### Device Mesh + - Fix the not incorrectly chained each of the strings as iterables ([#160709](https://github.com/pytorch/pytorch/pull/160709)) +### DistributedDataParallel (DDP) + - Fix incorrect interaction between `DDPOptimizer` and donated buffers ([#160745](https://github.com/pytorch/pytorch/pull/160745)) +### DTensor + - Fix DTensor handling of conjugate bit ([#158030](https://github.com/pytorch/pytorch/pull/158030)) + - Fix `OpSchema` equality check 
([#161231](https://github.com/pytorch/pytorch/pull/161231)) + - Fix `grouped_mm` strategy for invalid stride cases ([#158245](https://github.com/pytorch/pytorch/pull/158245)) + - Fix `F.one_hot` in DTensor ([#162307](https://github.com/pytorch/pytorch/pull/162307)) + - Always disabled `ShardingPropagation` cache if compiling ([#156868](https://github.com/pytorch/pytorch/pull/156868)) +### FullyShardedDataParallel (FSDP) + - Fix the bug in FSDP offload `pin_memory` ([#157147](https://github.com/pytorch/pytorch/pull/157147)) + - Fix to ensure writeback handles `NO_SHARD` correctly by flattening tensors before copying ([#154369](https://github.com/pytorch/pytorch/pull/154369)) +### FullyShardedDataParallel2 (FSDP2) + - Fix error message for `fsdp_pre_all_gather` ([#160817](https://github.com/pytorch/pytorch/pull/160817)) + - Fix the issue with `set_reduce_scatter_divide_factor` errors and `MixedPrecisionPolicy` ([#155964](https://github.com/pytorch/pytorch/pull/155964)) +### Pipeline Parallelism (PP) + - Fix eval step under `no_grad()` ([#159293](https://github.com/pytorch/pytorch/pull/159293)) + - Fix zero bubble schedules for `eval()` ([#159475](https://github.com/pytorch/pytorch/pull/159475)) +### Symmetric Memory (SymmMem) +- Fix `put_signal` + `wait_until` hang ([#163194](https://github.com/pytorch/pytorch/pull/163194)) +### TorchElastic + - Fix wrong log file name in the docs of `torch.distributed.elastic.multiprocessing.start_processes()` ([#160396](https://github.com/pytorch/pytorch/pull/160396)) +### TensorPipe + - Fix `import torch` if compiled without `TensorPipe` ([#159461](https://github.com/pytorch/pytorch/pull/159461)) + +## Dynamo +- Fix segfault due to interaction between Dynamo backends and `torch.compiler.reset()` ([#156527](https://github.com/pytorch/pytorch/pull/156527)) +- Fix crash due to bad interaction with recompilations and with blocks in Python 3.11+ ([#162318](https://github.com/pytorch/pytorch/pull/162318)) + +## Export +- Fix bug in constants lifting pass ([#157719](https://github.com/pytorch/pytorch/pull/157719)) +- Fix `from_node` provenance in unlift pass ([#157943](https://github.com/pytorch/pytorch/pull/157943)) +- Fix `NaN` serialization ([#155359](https://github.com/pytorch/pytorch/pull/155359)) +- Fix deserialization for unbacked symbol ranges ([#158681](https://github.com/pytorch/pytorch/pull/158681)) +- Fix runtime assert handling in deserialization ([#159060](https://github.com/pytorch/pytorch/pull/159060)) +- Fix for FQN handling in unflattener ([#159418](https://github.com/pytorch/pytorch/pull/159418)) +- Add `_ccode` method for `PythonMod` ([#158851](https://github.com/pytorch/pytorch/pull/158851)) +- Fix `nn_module_stack` for `assert_tensor_metadata` nodes ([#159625](https://github.com/pytorch/pytorch/pull/159625)) +- Fix usage for `move_to_device_pass` ([#159992](https://github.com/pytorch/pytorch/pull/159992), [#160528](https://github.com/pytorch/pytorch/pull/160528), [#162301](https://github.com/pytorch/pytorch/pull/162301)) +- Avoid name overwrites for aliased exported module parameters ([#160600](https://github.com/pytorch/pytorch/pull/160600)) +- Avoid inling `dynamo.disables` in unflattening ([#161306](https://github.com/pytorch/pytorch/pull/161306)) +- Fix deserialization issue for storage offset ([#162172](https://github.com/pytorch/pytorch/pull/162172)) +- Remove `.contiguous()` when saving weights to raw bytes to preserve original storage size of tensor ([#163587](https://github.com/pytorch/pytorch/pull/163587)) + +## Foreach +- 
`chunk_size` should always be `int64_t` for Foreach functors ([#156872](https://github.com/pytorch/pytorch/pull/156872)) + +## FX +- Fix `split_module` with symint ([#160093](https://github.com/pytorch/pytorch/pull/160093)) +- Fix `getattr_recursive` with ModuleList ([#161204](https://github.com/pytorch/pytorch/pull/161204)) +- Skip const folding with symbolic expression ([#161437](https://github.com/pytorch/pytorch/pull/161437)) +- Fix qualified name for methods of `torch.Tensor` ([#162224](https://github.com/pytorch/pytorch/pull/162224)) + +## Inductor +- Fix wrong meta function for `constant_pad_nd` ([#159878](https://github.com/pytorch/pytorch/pull/159878)) +- Fix learnable bias assertion error in Inductor ([#161170](https://github.com/pytorch/pytorch/pull/161170)) +- Fix int64 from `MutationOutput` Buffer ([#162020](https://github.com/pytorch/pytorch/pull/162020)) +- Fix Inductor CUDA sort `NaN` behavior ([#159308](https://github.com/pytorch/pytorch/pull/159308)) +- Fix layout for local buf in outer loop fusion ([#160857](https://github.com/pytorch/pytorch/pull/160857)) +- Fix slice scatter `dtype` consistency ([#160851](https://github.com/pytorch/pytorch/pull/160851)) +- Fix 3d tiled online softmax ([#162341](https://github.com/pytorch/pytorch/pull/162341)) +- Fix unsafe collective reorder past wait in Inductor ([#157489](https://github.com/pytorch/pytorch/pull/157489)) +- Fix `FallbackKernel` alias function to avoid incorrect aliasing for custom ops ([#163227](https://github.com/pytorch/pytorch/pull/163227)) + +## Ahead-Of-Time Inductor (AOTI) +- Fix a bug from `load_constants` ([#161887](https://github.com/pytorch/pytorch/pull/161887)) +- Fix wrong propagation of fallback_ops_dict in `gen_aoti_c_shim` ([#159904](https://github.com/pytorch/pytorch/pull/159904)) +- Fix unbacked symint and memory leak in Inductor memory planning ([#159839](https://github.com/pytorch/pytorch/pull/159839)) +- Fix memory leak in AOTI when calling `aoti_torch_as_strided` ([#162118](https://github.com/pytorch/pytorch/pull/162118)) +- Explicitly delete `wait_tensor` returned tensor ([#159502](https://github.com/pytorch/pytorch/pull/159502)) +- Fix memory leak from `all_reduce` ([#159818](https://github.com/pytorch/pytorch/pull/159818)) + +## JIT +- Make `ErrorReport::CallStack` thread-safe ([#160386](https://github.com/pytorch/pytorch/pull/160386)) +- Fix `RemoveProfileNodesAndSpecializeTypes` handling for `Tensor?` that is resolved to `None` ([#161538](https://github.com/pytorch/pytorch/pull/161538)) + +## Linear Algebra Frontend +- Avoid downcasts for fp16 matmul on the BLAS backend ([#161999](https://github.com/pytorch/pytorch/pull/161999)) + +## MPS +- Fix batch norm incorrect gradient ([#156867](https://github.com/pytorch/pytorch/pull/156867)) +- Do not crash if `tensor dim > INT_MAX` ([#158824](https://github.com/pytorch/pytorch/pull/158824)) +- Avoid outputing zeros from `exponential_` for MPS ([#159386](https://github.com/pytorch/pytorch/pull/159386)) +- Fix MPS autocast for `ConvTranspose3d` ([#160345](https://github.com/pytorch/pytorch/pull/160345)) +- Fix MPS `conv3d` autocast bias dtype mismatch ([#160423](https://github.com/pytorch/pytorch/pull/160423)) +- Fix error check for `torch.var` on scalar ([#160889](https://github.com/pytorch/pytorch/pull/160889)) +- Fix `index_add` for complex + int64, int64 input + zerodim index ([#160926](https://github.com/pytorch/pytorch/pull/160926), [#161511](https://github.com/pytorch/pytorch/pull/161511)) +- Fix `constant_pad_nd_mps` bug when pad is empty 
([#161149](https://github.com/pytorch/pytorch/pull/161149)) +- Fix `index_select` for `scalar_types` ([#161206](https://github.com/pytorch/pytorch/pull/161206)) +- Fix `index_copy` for scalars and `index_copy` for strided indices ([#161267](https://github.com/pytorch/pytorch/pull/161267), [#161333](https://github.com/pytorch/pytorch/pull/161333)) +- Ensure that tensors are contiguous before using MPS linear kernel ([#161641](https://github.com/pytorch/pytorch/pull/161641)) +- Address `NaN`s if SDPA is called with all values masked from query ([#157727](https://github.com/pytorch/pytorch/pull/157727)) +- Fix invalid formatting ([#158436](https://github.com/pytorch/pytorch/pull/158436)) +- Fix empty input in posneg functions ([#161824](https://github.com/pytorch/pytorch/pull/161824)) +- Migrate round unary op to Metal ([#161712](https://github.com/pytorch/pytorch/pull/161712)) +- Type-promote tensor-iterator common dtype ([#160334](https://github.com/pytorch/pytorch/pull/160334)) + +## ONNX +- Make onnx export SDPA match ATen behavior ([#159973](https://github.com/pytorch/pytorch/pull/159973)) +- Fix `rotary_embedding_23` implementation ([#162865](https://github.com/pytorch/pytorch/pull/162865)) +- Fix export behavior when model has `None` as output ([#160200](https://github.com/pytorch/pytorch/pull/160200)) +- Fix lower opset version support in `dynamo=True` ([#161056](https://github.com/pytorch/pytorch/pull/161056)) +- Fix `index_put_` usage ([#161263](https://github.com/pytorch/pytorch/pull/161263)) + +## Profiler +- Fix Linter for Global Annotations flag in Snapshot ([#157858](https://github.com/pytorch/pytorch/pull/157858)) + +## Python Frontend +- Add option in `torch.utils.cpp_extension.load_inline` to override gencode ([#156850](https://github.com/pytorch/pytorch/pull/156850)) +- Fix `max_width` computation in Tensor printing ([#126859](https://github.com/pytorch/pytorch/pull/126859)) +- Improve `pin_memory` error message on CPU-only systems ([#159994](https://github.com/pytorch/pytorch/pull/159994)) +- Making batching rule for `F.embedding` DTensor-aware ([#162117](https://github.com/pytorch/pytorch/pull/162117)) + +## Quantization +- Avoid `NaN` in fp8 output of CPU `qlinear` and `qconv` ops ([#160957](https://github.com/pytorch/pytorch/pull/160957)) +- Fix segmentation fault when `choose_qparams_optimized` ([#161966](https://github.com/pytorch/pytorch/pull/161966)) + +## ROCm +- Fix Inductor with cudagraph trees `hip:0` device error ([#161221](https://github.com/pytorch/pytorch/pull/161221)) +- Fix some build failures and support some BLAS calls on Windows ([#161981](https://github.com/pytorch/pytorch/pull/161981)) +- Fix undefined symbol linker error after exposing MIOpen symbols on Windows ([#156479](https://github.com/pytorch/pytorch/pull/156479)) +- Fix finding ROCm/HIP version on Windows ([#156486](https://github.com/pytorch/pytorch/pull/156486)) +- Fix LoadHIP handling of environment variable paths on Windows ([#159080](https://github.com/pytorch/pytorch/pull/159080)) +- Add hipcc compatibility flags to `cpp_extension.py` on Windows ([#159790](https://github.com/pytorch/pytorch/pull/159790)) +- Symmetric memory set handle type for ROCm ([#161741](https://github.com/pytorch/pytorch/pull/161741)) +- In SDPA via AOTriton, `logsumexp` needs scaling back to natural base ([#156903](https://github.com/pytorch/pytorch/pull/156903)) +- Check stream graph capture status in `memcpy_and_sync` inline function ([#158165](https://github.com/pytorch/pytorch/pull/158165)) + +## XPU +- Fix 
`cpp_extension` compatibility with `intel-deep-learning-essentials-2025.2` ([#161012](https://github.com/pytorch/pytorch/pull/161012)) + +# Performance +## Autograd +- Fix SVD forward-mode AD multiplication priority ([#161027](https://github.com/pytorch/pytorch/pull/161027)) + +## CUDA +- Use a nonblocking copy to avoid stream synchronization for GPU tensor indexing with CPU mask ([#156384](https://github.com/pytorch/pytorch/pull/156384)) +- Disable cudagraph GCs by default to improve capture performance ([#158649](https://github.com/pytorch/pytorch/pull/158649)) + +## Dynamo +- Recursive `dict` tag optimization for faster guard evaluation ([#159183](https://github.com/pytorch/pytorch/pull/159183)) + +## Export +- Caching optimizations for placeholder naming pass ([#158594](https://github.com/pytorch/pytorch/pull/158594)) +- Add Static Dispatch Kernel for `fmod.Scalar` and `scale_gradient` ([#160654](https://github.com/pytorch/pytorch/pull/160654), [#160454](https://github.com/pytorch/pytorch/pull/160454)) + +## Inductor +- Improve performance of A16W4 and A16W8 `GEMM` template ([#159127](https://github.com/pytorch/pytorch/pull/159127)) ([#161148](https://github.com/pytorch/pytorch/pull/161148)) +- More aggressive persistent reduction ([#161055](https://github.com/pytorch/pytorch/pull/161055)) +- Add a few outer dimension reduction cases for LOAF ([#162028](https://github.com/pytorch/pytorch/pull/162028)) +- Fuse two RoPE kernels into a single kernel and improving runtime efficiency ([#161420](https://github.com/pytorch/pytorch/pull/161420)) + +## MPS +- Optimize cummin/cummax metal kernels ([\#156794](https://github.com/pytorch/pytorch/pull/156794)) +- Speedup `torch.full` for 1-byte types ([\#158874](https://github.com/pytorch/pytorch/pull/158874)) +- Speedup `argmax`/`argmin` ([\#159524](https://github.com/pytorch/pytorch/pull/159524)) +- Improve performance of `max_pool3d` ([\#157875](https://github.com/pytorch/pytorch/pull/157875)) +- Avoid calling tensor ops in `max_pool3d` impl ([\#157874](https://github.com/pytorch/pytorch/pull/157874)) +- Move `max_pool2d` to Metal for `stride != 1` ([\#157876](https://github.com/pytorch/pytorch/pull/157876)) + +## Optimizer +- Use `addmm` to improve Newton–Schulz orthogonalization in Muon ([#161379](https://github.com/pytorch/pytorch/pull/161379)) +- Avoid stream sync in SWA `AveragedModel.update_parameters()` ([#157705](https://github.com/pytorch/pytorch/pull/157705)) + +## Release Engineering +- Upgrade to ROCm 6.4.1 and 6.4.2 patch releases ([#156636](https://github.com/pytorch/pytorch/pull/156636)) ([#158887](https://github.com/pytorch/pytorch/pull/158887)) ([#158886](https://github.com/pytorch/pytorch/pull/158886)) ([#158651](https://github.com/pytorch/pytorch/pull/158651)) ([#159001](https://github.com/pytorch/pytorch/pull/159001)) +- Migrate RPyTorch ROCm CI to MI325 capacity ([#159059](https://github.com/pytorch/pytorch/pull/159059)) ([#159649](https://github.com/pytorch/pytorch/pull/159649)) ([#161184](https://github.com/pytorch/pytorch/pull/161184)) +- Enable B200 PyTorch benchmark testing ([#158011](https://github.com/pytorch/pytorch/pull/158011)) ([#157341](https://github.com/pytorch/pytorch/pull/157341)) + +## ROCm +- SDPA now uses AOTriton to 0.11b ([#161754](https://github.com/pytorch/pytorch/pull/161754)) +- `hipblaslt` is used by default on gfx908 for ROCm >= 6.3 ([#159092](https://github.com/pytorch/pytorch/pull/159092)) +- Enable miopen channels last 3d for conv and batchnorm 
([#160529](https://github.com/pytorch/pytorch/pull/160529)) +- Remove extra transposes in NHWC convolutions on MIOpen ([#160435](https://github.com/pytorch/pytorch/pull/160435)) +- Remove extra sync in `tensor.item()` ([#158486](https://github.com/pytorch/pytorch/pull/158486)) +- Elementwise and reduction kernel perf improvements ([#159430](https://github.com/pytorch/pytorch/pull/159430), [#159652](https://github.com/pytorch/pytorch/pull/159652), [#160444](https://github.com/pytorch/pytorch/pull/160444), [#160466](https://github.com/pytorch/pytorch/pull/160466), [#161054](https://github.com/pytorch/pytorch/pull/161054), [#161180](https://github.com/pytorch/pytorch/pull/161180), [#161181](https://github.com/pytorch/pytorch/pull/161181)) +- Symmetric Memory Performance improvements for two-shot allreduce ([#156746](https://github.com/pytorch/pytorch/pull/156746)) +- Enable build of `fbgemm_gpu genai` sources for grouped GEMM support ([#160676](https://github.com/pytorch/pytorch/pull/160676)) + +## XPU +- Enable tensor memory descriptor Triton template for Intel GPU ([#161600](https://github.com/pytorch/pytorch/pull/161600)) + +# Documentation +## Autograd +- Improve `torch.inference_mode` docs and error message ([#161164](https://github.com/pytorch/pytorch/pull/161164)) + +## Distributed +### c10d + - Documented barrier collective's interaction with `device_id` ([#159389](https://github.com/pytorch/pytorch/pull/159389)) + - Fix comment to match logic in `distributed_c10d.py` ([#162158](https://github.com/pytorch/pytorch/pull/162158)) +### DTensor + - Rewrote doc of `TupleStrategy` ([#158132](https://github.com/pytorch/pytorch/pull/158132)) + - Documented `redistribute_costs` ([#158495](https://github.com/pytorch/pytorch/pull/158495)) +### FullyShardedDataParallel (FSDP) + - Removed FSDP1 developer note ([#158991](https://github.com/pytorch/pytorch/pull/158991)) + +## Export +- Update docs around draft export, dynamism, and PT2 Archive ([#157750](https://github.com/pytorch/pytorch/pull/157750)) + +## FX +- Fix typos in `torch/` (`torch/fx/`) ([#156604](https://github.com/pytorch/pytorch/pull/156604)) +- Add typing ([#158450](https://github.com/pytorch/pytorch/pull/158450)) +- Fix typo in FX interpreter class docs ([#162055](https://github.com/pytorch/pytorch/pull/162055)) +- Remove allow-untyped-defs from `torch/fx/experimental/migrate_gradual_types/util.py` ([#157236](https://github.com/pytorch/pytorch/pull/157236)) + +## Inductor +- Add documentation for CUDAGraph partition ([#159450](https://github.com/pytorch/pytorch/pull/159450)) + +## torch.nn +- Improve description of `padding` for `avg_poolnd` ([#159142](https://github.com/pytorch/pytorch/pull/159142)) +- Improve `CrossEntropyLoss` docs with example of incorrect target specification ([#155649](https://github.com/pytorch/pytorch/pull/155649)) +- Remove redundant dtype conversion in `scaled_dot_product_attention` example ([#161613](https://github.com/pytorch/pytorch/pull/161613)) + +## ONNX +- Update export docstring ([#162622](https://github.com/pytorch/pytorch/pull/162622)) +- Delete deprecated tutorial page link ([#157310](https://github.com/pytorch/pytorch/pull/157310)) +- Filter out torchscript sentences ([#158850](https://github.com/pytorch/pytorch/pull/158850)) +- Fix doc typo for `symbolic_multi_out` ([#160702](https://github.com/pytorch/pytorch/pull/160702)) +- `onnx.md` to simplify deprecated entities ([#159312](https://github.com/pytorch/pytorch/pull/159312)) +- Update export docstring and set `fallback=False` by default 
([#162622](https://github.com/pytorch/pytorch/pull/162622), [#162726](https://github.com/pytorch/pytorch/pull/162726))
- Fix typo in error message: summit -> submit ([#162587](https://github.com/pytorch/pytorch/pull/162587))


## Optimizer
- Properly document specific optimizer module APIs, e.g. `torch.optim.adam.Adam` ([#158483](https://github.com/pytorch/pytorch/pull/158483), [#158669](https://github.com/pytorch/pytorch/pull/158669), [#160194](https://github.com/pytorch/pytorch/pull/160194))
- Add note for clarity in Adafactor doc #154862 ([#155248](https://github.com/pytorch/pytorch/pull/155248))
- Slightly improve `zero_grad` description ([#161239](https://github.com/pytorch/pytorch/pull/161239))

## Profiler
- Update PT2 Profiler Torch-Compiled Region Image ([#158066](https://github.com/pytorch/pytorch/pull/158066))
- Fix Experimental Config Documentation ([#156586](https://github.com/pytorch/pytorch/pull/156586))
- Update README ([#159816](https://github.com/pytorch/pytorch/pull/159816))

## Python Frontend
- Improve documentation for `torch.lobpcg`, `torch.clone`, `torch.matmul`, `torch.max`, `torch.gather`, `torch.Tensor.scatter_`, `torch.empty_like`, `torch.randint`, `torch.mul`, `torch.min`, `torch.max`, `torch.sort`, `torch.full_like`, `torch.histogramdd`, `torch.hamming_window` ([#156139](https://github.com/pytorch/pytorch/pull/156139), [#157007](https://github.com/pytorch/pytorch/pull/157007), [#161424](https://github.com/pytorch/pytorch/pull/161424), [#156153](https://github.com/pytorch/pytorch/pull/156153), [#157929](https://github.com/pytorch/pytorch/pull/157929), [#157920](https://github.com/pytorch/pytorch/pull/157920), [#158050](https://github.com/pytorch/pytorch/pull/158050), [#158731](https://github.com/pytorch/pytorch/pull/158731), [#160312](https://github.com/pytorch/pytorch/pull/160312), [#161539](https://github.com/pytorch/pytorch/pull/161539), [#162051](https://github.com/pytorch/pytorch/pull/162051), [#158275](https://github.com/pytorch/pytorch/pull/158275), [#152682](https://github.com/pytorch/pytorch/pull/152682))
- Remove TorchScript-related sections in serialization docs ([#156648](https://github.com/pytorch/pytorch/pull/156648))
- Fix typo in `torch.set_float32_matmul_precision` docs ([#158191](https://github.com/pytorch/pytorch/pull/158191))
- Fix docstring for `torch.nn.utils.clip_grads_with_norm_` to reflect clamping behavior ([#158200](https://github.com/pytorch/pytorch/pull/158200))
- Fix the description of `edge_order` in the `torch.gradient` docs ([#159130](https://github.com/pytorch/pytorch/pull/159130))
- Add `torch.segment_reduce` docs ([#154352](https://github.com/pytorch/pytorch/pull/154352))
- Add examples to `torch.is_floating_point` and `torch.is_complex` docs ([#161951](https://github.com/pytorch/pytorch/pull/161951))

## Release Engineering
- Add decorator to create deprecation warnings ([#155127](https://github.com/pytorch/pytorch/pull/155127))
- Add runnable code examples to export documentation ([#158506](https://github.com/pytorch/pytorch/pull/158506))
- Add developer notes for integrating new backends into PyTorch ([#158644](https://github.com/pytorch/pytorch/pull/158644))

## XPU
- Update supported OS to Windows 11 & Ubuntu 24.04/25.04 for Intel client GPU ([#161699](https://github.com/pytorch/pytorch/pull/161699))

# Security
## Python Frontend
- Don't store flamegraph to tmp folder ([#157374](https://github.com/pytorch/pytorch/pull/157374))

# Developers
## Composability
- Stop suggesting
to use `guard_size_oblivious` on data dependent errors ([#160510](https://github.com/pytorch/pytorch/pull/160510)) +- Avoid unnecessary slices resulting in data-dependent errors ([#157528](https://github.com/pytorch/pytorch/pull/157528)) + +## Dataloader Frontend +- Add `torch.utils.data` samplers benchmark script ([#156974](https://github.com/pytorch/pytorch/pull/156974)) +- Add `torch.utils.data.Dataloader` benchmark script ([#159432](https://github.com/pytorch/pytorch/pull/159432)) + +## Distributed +### c10d + - Add `waitcounter` for watchdog and heartbeat monitoring thread ([#157480](https://github.com/pytorch/pytorch/pull/157480)) + - Made `torch.distributed.breakpoint` set a long timeout ([#158481](https://github.com/pytorch/pytorch/pull/158481)) + - Add `check_rng_sync` util ([#160283](https://github.com/pytorch/pytorch/pull/160283)) + - Add `FlightRecorder` support for `ProcessGroupXCCL` ([#158568](https://github.com/pytorch/pytorch/pull/158568)) + - Add `early_stop` kwarg to `torch.utils.checkpoint` ([#160781](https://github.com/pytorch/pytorch/pull/160781)) +### Device Mesh + - Add error when users try to slice non contiguous flattened dim submesh ([#157523](https://github.com/pytorch/pytorch/pull/157523)) + - Make the repr shorter when debug ENV not set ([#158822](https://github.com/pytorch/pytorch/pull/158822)) +### DTensor + - Wrap sharding prop error with contextual exception ([#161574](https://github.com/pytorch/pytorch/pull/161574)) + - Add check if tracing for sharding propagation to handle un-hashable keys in DTensor ([#160798](https://github.com/pytorch/pytorch/pull/160798)) +### ShardedTensor + - Make error message descriptive in ShardedTensor creation (#150627) ([#159423](https://github.com/pytorch/pytorch/pull/159423)) +### Pipeline Parallelism (PP) + - Add profiling to schedule execution ([#160753](https://github.com/pytorch/pytorch/pull/160753)) + +## FX +- Consolidate stack trace in Tracer ([#156257](https://github.com/pytorch/pytorch/pull/156257), [#157302](https://github.com/pytorch/pytorch/pull/157302), [#158266](https://github.com/pytorch/pytorch/pull/158266)) +- Separate provenance tracking to different levels ([#160383](https://github.com/pytorch/pytorch/pull/160383), [#158399](https://github.com/pytorch/pytorch/pull/158399), [#158796](https://github.com/pytorch/pytorch/pull/158796), [#159484](https://github.com/pytorch/pytorch/pull/159484)) +- Fix `register_foward_pre_hook not supported on ScriptModule` error ([#156904](https://github.com/pytorch/pytorch/pull/156904)) +- Add `__eq__` function to NodeSource ([#158170](https://github.com/pytorch/pytorch/pull/158170)) +- Add `__hash__` function to NodeSource ([#158322](https://github.com/pytorch/pytorch/pull/158322)) +- Cache dict and string rep for better perf in NodeSource ([#158372](https://github.com/pytorch/pytorch/pull/158372)) +- Recover node source from dict (#158373) ([#158473](https://github.com/pytorch/pytorch/pull/158473)) +- Include error stacktrace and graph module in `tlparse` error ([#158469](https://github.com/pytorch/pytorch/pull/158469)) +- Add `expanded_def` option for FX printing, render descriptor, update tests ([#158708](https://github.com/pytorch/pytorch/pull/158708)) +- Remove `co_lnotab` in favor of `co_linetable` ([#159227](https://github.com/pytorch/pytorch/pull/159227)) +- Remove duplicate imports ([#161685](https://github.com/pytorch/pytorch/pull/161685)) +- Include Output tensor metadata for `CompiledFxGraph` ([#159311](https://github.com/pytorch/pytorch/pull/159311)) + +## 
Inductor +- Deprecate `allow_tf32` in `tl.dot(..., allow_tf32=...)`, use `tl.dot(..., input_precision=...)` ([#160711](https://github.com/pytorch/pytorch/pull/160711)) +- Log autotune choices and benchmark result to scuba/chrome trace ([#159496](https://github.com/pytorch/pytorch/pull/159496)) +- Add TLParse artifact for logging runtime of collective and compute ops ([#159730](https://github.com/pytorch/pytorch/pull/159730)) +- Call `jit_post_compile_hook` within Inductor Triton Kernel compile path ([#161443](https://github.com/pytorch/pytorch/pull/161443)) +- Prune configs that require more shared memory than the hardware limit ([#161996](https://github.com/pytorch/pytorch/pull/161996)) +- Runtime estimations using nccl estimator on mm only benchmark mode ([#161405](https://github.com/pytorch/pytorch/pull/161405)) +- Don't use `torch.backends.cuda.matmul.allow_tf32` in Inductor cache key ([#159480](https://github.com/pytorch/pytorch/pull/159480)) + +## Ahead-Of-Time Inductor (AOTI) +- Better error message when no .so/cpp files are found ([#156863](https://github.com/pytorch/pytorch/pull/156863)) +- Clean up old APIs in AOTI c shim ([#158400](https://github.com/pytorch/pytorch/pull/158400)) +- Add Inductor provenance mapping for cpp extern kernel (#161656) ([#162069](https://github.com/pytorch/pytorch/pull/162069)) +- Print out error msg when nvcc compiler fails ([#157203](https://github.com/pytorch/pytorch/pull/157203)) +- Add kernel information JSON generation for AOTI packages ([#160540](https://github.com/pytorch/pytorch/pull/160540)) + +## Python Frontend +- Better sample inputs for addmm OpInfo ([#160234](https://github.com/pytorch/pytorch/pull/160234)) + +## Quantization +- Revamp dtype documentation ([#156087](https://github.com/pytorch/pytorch/pull/156087)) +- Use new type statement to fix public API of types ([#158487](https://github.com/pytorch/pytorch/pull/158487)) + +## Release Engineering +- Replace `setup.py develop` with `pip install -e` for development builds ([#155998](https://github.com/pytorch/pytorch/pull/155998)) ([#156027](https://github.com/pytorch/pytorch/pull/156027)) ([#156710](https://github.com/pytorch/pytorch/pull/156710)) ([#156709](https://github.com/pytorch/pytorch/pull/156709)) + +## XPU +- Upgrade Intel GPU software stack package to intel-deep-learning-essentials-2025.2 ([#158733](https://github.com/pytorch/pytorch/pull/158733)) diff --git a/2.9.0/miscategorized.md b/2.9.0/miscategorized.md index 460c0b4..ba3572c 100644 --- a/2.9.0/miscategorized.md +++ b/2.9.0/miscategorized.md @@ -6,17 +6,4 @@ Handle any commits that actually do belong to your domain and remove them from t ## Untopiced - -StableABI: -- Add pad and narrow to torch/csrc/stable/ops.h ([#159328](https://github.com/pytorch/pytorch/pull/159328)) -- Add getCurrentDeviceIndex to torch::stable::accelerator ([#160453](https://github.com/pytorch/pytorch/pull/160453)) -- Add new_zeros dtype variant to the shim and as a stable op ([#161597](https://github.com/pytorch/pytorch/pull/161597)) -- Update torch::stable::Tensor() default constructor ([#159507](https://github.com/pytorch/pytorch/pull/159507)) -- Add beginnings of torch::stable::accelerator ([#159679](https://github.com/pytorch/pytorch/pull/159679)) -- Port amax to stable ABI ([#160214](https://github.com/pytorch/pytorch/pull/160214)) -- Add new_empty (with dtype argument only) to torch::stable ([#159508](https://github.com/pytorch/pytorch/pull/159508)) -- Enable generating generic c_shim that doesn't bypass dispatcher 
([#158974](https://github.com/pytorch/pytorch/pull/158974)) -- Cut a version of TORCH_ERROR_CODE_CHECK in headeronly from AOTI ([#159604](https://github.com/pytorch/pytorch/pull/159604)) - - ## not user facing From ea086e88e1b1e923b564dc746b1ac7b469ffe528 Mon Sep 17 00:00:00 2001 From: Angel Li Date: Mon, 29 Sep 2025 09:49:15 -0700 Subject: [PATCH 2/6] remove private apis --- 2.9.0/final.md | 11 ----------- 1 file changed, 11 deletions(-) diff --git a/2.9.0/final.md b/2.9.0/final.md index d713728..5e92cba 100644 --- a/2.9.0/final.md +++ b/2.9.0/final.md @@ -204,7 +204,6 @@ We move enabling `pin_memory` back inside `BaseDataLoaderIter`. This is required - Support GQA for flash attention ([#157893](https://github.com/pytorch/pytorch/pull/157893)) ## CUDA -- MXFP8 grouped GEMM support for `torch._scaled_grouped_mm` + submodule bump ([#162209](https://github.com/pytorch/pytorch/pull/162209)) - Add getter for CUDA graph exec to allow mutation of captured kernel params ([#161294](https://github.com/pytorch/pytorch/pull/161294)) - Implement support for `cudnn_batch_norm_out` kernel to replace the autogen approach ([#123020](https://github.com/pytorch/pytorch/pull/123020)) @@ -214,7 +213,6 @@ We move enabling `pin_memory` back inside `BaseDataLoaderIter`. This is required ## Dynamo - Experimental API for ahead-of-time compiling models in fullgraph mode ([#161383](https://github.com/pytorch/pytorch/pull/161383)) -- Toggle erroring/resume on graph break with `torch._dynamo.error_on_graph_break` ([#161739](https://github.com/pytorch/pytorch/pull/161739), [#161747](https://github.com/pytorch/pytorch/pull/161747)) - Add a hook for recompilations ([#157961](https://github.com/pytorch/pytorch/pull/157961)) ## Export @@ -264,18 +262,14 @@ We move enabling `pin_memory` back inside `BaseDataLoaderIter`. This is required ## ROCm - OCP Micro-scaling Format (mx-fp8/mx-fp4) Support ([#151360](https://github.com/pytorch/pytorch/pull/151360)) -- Support experimental CU carveout `torch._C._set_sm_carveout_experimental()` ([#149466](https://github.com/pytorch/pytorch/pull/149466)) -- Add FP8 rowwise support to `_scaled_grouped_mm` ([#159075](https://github.com/pytorch/pytorch/pull/159075)) ## XPU - Enable `FlexAttention` on Intel GPU ([#143553](https://github.com/pytorch/pytorch/pull/143553)) -- Enable `_int_mm` on Intel GPU ([#157769](https://github.com/pytorch/pytorch/pull/157769)) # Improvements ## AOTDispatcher - Skip logging in fp8 activation quantization if there are no nodes to be quantized ([#158129](https://github.com/pytorch/pytorch/pull/158129)) - Add `aot_export_joint_with_descriptors` and `aot_compile_joint_with_descriptors` ([#158715](https://github.com/pytorch/pytorch/pull/158715)) -- Allow keeping input mutations in the graph for `_aot_export_function` ([#157730](https://github.com/pytorch/pytorch/pull/157730)) - Extract out `prepare_aot_module_simplified` for use in next PR ([#158319](https://github.com/pytorch/pytorch/pull/158319)) - Rename modules in AOTAutograd ([#158449](https://github.com/pytorch/pytorch/pull/158449)) - Track descriptors for all inputs/outputs of AOTAutograd traced graph ([#158624](https://github.com/pytorch/pytorch/pull/158624)) @@ -291,14 +285,11 @@ We move enabling `pin_memory` back inside `BaseDataLoaderIter`. 
This is required - Build `libtorch` without NVSHMEM ([#160910](https://github.com/pytorch/pytorch/pull/160910)) ## Composability -- Set `enable_gqa` for `aten._scaled_dot_product_attention_math decomp`([#158604](https://github.com/pytorch/pytorch/pull/158604)) -- Meta implementation for `aten._scaled_dot_product_attention_math_for_mps` ([#159695](https://github.com/pytorch/pytorch/pull/159695)) - Meta implementation for `aten.add.Scalar` ([#161332](https://github.com/pytorch/pytorch/pull/161332)) - `aten.expand_copy` decomp ([#161688](https://github.com/pytorch/pytorch/pull/161688)) - Fix result dtype cast in decomp for `aten.linalg_vector_norm` ([#155111](https://github.com/pytorch/pytorch/pull/155111)) - Add dtype checks in meta implementation for several ordering ops ([#159556](https://github.com/pytorch/pytorch/pull/159556)) - Fix meta function for `aten.complex` ([#160894](https://github.com/pytorch/pytorch/pull/160894)) -- Improve shape checks for `aten._grouped_mm` ([#159666](https://github.com/pytorch/pytorch/pull/159666)) - Improve unbacked symint (dynamic shape) support for several decompositions ([#148815](https://github.com/pytorch/pytorch/pull/148815), [#156902](https://github.com/pytorch/pytorch/pull/156902), [#157008](https://github.com/pytorch/pytorch/pull/157008), [#158894](https://github.com/pytorch/pytorch/pull/158894), [#159184](https://github.com/pytorch/pytorch/pull/159184), [#160683](https://github.com/pytorch/pytorch/pull/160683), [#160253](https://github.com/pytorch/pytorch/pull/160253), [#162084](https://github.com/pytorch/pytorch/pull/162084), [#162099](https://github.com/pytorch/pytorch/pull/162099), [#162109](https://github.com/pytorch/pytorch/pull/162109), [#160462](https://github.com/pytorch/pytorch/pull/160462)) ## C++ Frontend @@ -385,7 +376,6 @@ We move enabling `pin_memory` back inside `BaseDataLoaderIter`. This is required - Add option for `TorchDispatchMode` to ignore `torch.compile` internals ([#161648](https://github.com/pytorch/pytorch/pull/161648)) ## Export -- Add `_compile_and_package` method for ExportPackage ([#156638](https://github.com/pytorch/pytorch/pull/156638)) - Handle `None` & ellipsis slicing/select in non-strict ([#157821](https://github.com/pytorch/pytorch/pull/157821)) - Extend FP8 types in serialization ([#158430](https://github.com/pytorch/pytorch/pull/158430)) - Improve error messages for deserialization ([#159881](https://github.com/pytorch/pytorch/pull/159881)) @@ -552,7 +542,6 @@ We move enabling `pin_memory` back inside `BaseDataLoaderIter`. 
This is required - Fix deserialization for unbacked symbol ranges ([#158681](https://github.com/pytorch/pytorch/pull/158681)) - Fix runtime assert handling in deserialization ([#159060](https://github.com/pytorch/pytorch/pull/159060)) - Fix for FQN handling in unflattener ([#159418](https://github.com/pytorch/pytorch/pull/159418)) -- Add `_ccode` method for `PythonMod` ([#158851](https://github.com/pytorch/pytorch/pull/158851)) - Fix `nn_module_stack` for `assert_tensor_metadata` nodes ([#159625](https://github.com/pytorch/pytorch/pull/159625)) - Fix usage for `move_to_device_pass` ([#159992](https://github.com/pytorch/pytorch/pull/159992), [#160528](https://github.com/pytorch/pytorch/pull/160528), [#162301](https://github.com/pytorch/pytorch/pull/162301)) - Avoid name overwrites for aliased exported module parameters ([#160600](https://github.com/pytorch/pytorch/pull/160600)) From 5ed27e9bb825b780189128389460ca860f14c2b2 Mon Sep 17 00:00:00 2001 From: Angel Li Date: Mon, 29 Sep 2025 09:51:13 -0700 Subject: [PATCH 3/6] updates --- 2.9.0/done/result_distributed.md | 4 + 2.9.0/final.md | 127 +++++++++++++------------------ 2 files changed, 56 insertions(+), 75 deletions(-) diff --git a/2.9.0/done/result_distributed.md b/2.9.0/done/result_distributed.md index e875c05..cd44fff 100644 --- a/2.9.0/done/result_distributed.md +++ b/2.9.0/done/result_distributed.md @@ -332,3 +332,7 @@ The categories below are as follows: - Work: block_current_stream API ([#156883](https://github.com/pytorch/pytorch/pull/156883)) - [c10d] block_current_stream: correctness fixes ([#158757](https://github.com/pytorch/pytorch/pull/158757)) - Add pg transport and tests ([#154653](https://github.com/pytorch/pytorch/pull/154653)) +- Symmetric memory set handle type for ROCm ([#161741](https://github.com/pytorch/pytorch/pull/161741)) +- Symmetric Memory Performance improvements for two-shot allreduce ([#156746](https://github.com/pytorch/pytorch/pull/156746)) +- NVSHMEM support for Triton 3.5 ([#163152](https://github.com/pytorch/pytorch/pull/163152)) +- Fix `put_signal` + `wait_until` hang ([#163194](https://github.com/pytorch/pytorch/pull/163194)) diff --git a/2.9.0/final.md b/2.9.0/final.md index 5e92cba..f969bd5 100644 --- a/2.9.0/final.md +++ b/2.9.0/final.md @@ -24,32 +24,23 @@ Below are the full release notes for this release. The minimum version of Python required for PyTorch 2.9.0 is 3.10. -## Build Frontend - -### Remove `/d2implyavx512upperregs` flag that slows build ([#159431](https://github.com/pytorch/pytorch/pull/159431)) +## Build metal kernels of MacOS-14+ and remove all pre-MacOS-14 specific logic, requires MacOS-14+ going forward ([\#159733](https://github.com/pytorch/pytorch/pull/159733), [\#159912](https://github.com/pytorch/pytorch/pull/159912)) -### Add `ScalarType` to shim conversion and `stable::Tensor.scalar_type` ([#160557](https://github.com/pytorch/pytorch/pull/160557)) - -Before, user extensions could only in abstract pass around obfuscated dtypes appearing as `int32_ts`. Now, users can confidently use `torch::headeronly::ScalarType` in their extensions for major scalar types. This PR enables ABI stability by adding a translation layer through the shim, so that even if the `ScalarType` enum values change in the future, user extensions need not fear. 
- -This is narrowly BC breaking for unpopular dtypes: `quint*`s, `qint*`s, `Bits*`, `dummy_uint*`s, `dummy_int*`s, `Float8_e8m0fnu`, and `Float4_e2m1fn_x2` in the use case where an extension retrieves a Tensor dtype of the above and passes it into `aoti_torch_call_dispatcher`. - -## Export -### Switch off runtime asserts by default in favor of a shape guards function ([#160111](https://github.com/pytorch/pytorch/pull/160111), [#161178](https://github.com/pytorch/pytorch/pull/161178), [#161794](https://github.com/pytorch/pytorch/pull/161794)) +PyTorch MPS is only supported on MacOS-14 or later. If you need to use MPS on MacOS Ventura, please avoid updating to Python-3.9 or above +## Upgrade to DLPack 1.0 ([#145000](https://github.com/pytorch/pytorch/pull/145000)) -To enable runtime asserts, use `export(..., prefer_deferred_runtime_asserts_over_guards=True)`. Also kills the `allow_complex_guards_as_runtime_asserts` flag, merging it into the former option. +This upgrade is doing the same BC-breaking changes as the DLPack release. +Objects in `torch.utils.dlpack` have been updated to reflect these changes, such as `DLDeviceType`. +See the PR for details on the exact changes and how to update your code. +## Raise appropriate errors in `torch.cat` ([#158249](https://github.com/pytorch/pytorch/pull/158249)) -Additionally, `exported_program.module()` will generate a call to a `_guards_fn` submodule that will run additional checks on inputs. Users who do not want this behavior can either remove this call in the graph, or do `exported_program.module(check_guards=False)` to avoid the generation. - -## MPS -### Build metal kernels of MacOS-14+ and remove all pre-MacOS-14 specific logic, requires MacOS-14+ going forward ([\#159733](https://github.com/pytorch/pytorch/pull/159733), [\#159912](https://github.com/pytorch/pytorch/pull/159912)) +Raising `ValueError`, `IndexError` or `TypeError` where appropriate instead of the generic `RuntimeError`. +If you code was catching these error, you can update to catch the new error type. -PyTorch MPS is only supported on MacOS-14 or later. If you need to use MPS on MacOS Ventura, please avoid updating to Python-3.9 or above -## ONNX -### Default to `dynamo=True` for ONNX exporter ([#159646](https://github.com/pytorch/pytorch/pull/159646), [#162726](https://github.com/pytorch/pytorch/pull/162726)) +## Default to `dynamo=True` for ONNX exporter ([#159646](https://github.com/pytorch/pytorch/pull/159646), [#162726](https://github.com/pytorch/pytorch/pull/162726)) Previously `torch.onnx.export(...)` used the legacy TorchScript exporter if no arguments were provied. The ONNX exporter now uses the newer `torch.export.export` pipeline by default (`dynamo=True`). This change improves graph fidelity and future-proofs exports, but may surface graph capture errors that were previously masked or handled differently. @@ -73,7 +64,15 @@ torch.onnx.export(...) Recommendation: first try the new default; only fall back if you hit blocking issues and report them upstream. Long term solution: fix the root cause instead of relying on fallback or TorchScript exporter. 
-### Set default opset to 20 ([#158802](https://github.com/pytorch/pytorch/pull/158802)) +## Switch off runtime asserts by default in favor of a shape guards function ([#160111](https://github.com/pytorch/pytorch/pull/160111), [#161178](https://github.com/pytorch/pytorch/pull/161178), [#161794](https://github.com/pytorch/pytorch/pull/161794)) + + +To enable runtime asserts, use `export(..., prefer_deferred_runtime_asserts_over_guards=True)`. Also kills the `allow_complex_guards_as_runtime_asserts` flag, merging it into the former option. + + +Additionally, `exported_program.module()` will generate a call to a `_guards_fn` submodule that will run additional checks on inputs. Users who do not want this behavior can either remove this call in the graph, or do `exported_program.module(check_guards=False)` to avoid the generation. + +## Set default opset to 20 ([#158802](https://github.com/pytorch/pytorch/pull/158802)) Opset 20 enables newer operator definitions. If your tooling or downstream runtime only supports opset 18, pin it explicitly. For the latest ONNX operators, you can experiment with opset 23. @@ -97,7 +96,7 @@ torch.onnx.export(...) torch.onnx.export(..., opset_version=23) ``` -### Drop `draft_export` in exporter API ([#161454](https://github.com/pytorch/pytorch/pull/161454), [#162225](https://github.com/pytorch/pytorch/pull/162225)) +## Drop `draft_export` in exporter API ([#161454](https://github.com/pytorch/pytorch/pull/161454), [#162225](https://github.com/pytorch/pytorch/pull/162225)) Remove implicit draft tracing from the default exporter path, achieving clearer behaviour and faster failures. The expensive `torch.export.draft_export` diagnostic path is no longer auto-invoked (which could take hours on large models). You can still opt in for deep diagnostics: @@ -125,45 +124,41 @@ Now in torch 2.9.0: TORCH_ONNX_ENABLE_DRAFT_EXPORT=True python export_to_onnx.py ``` -### Remove `torch.onnx.dynamo_export` and the `onnxrt` torch compile backend ([#158130](https://github.com/pytorch/pytorch/pull/158130), [#158258](https://github.com/pytorch/pytorch/pull/158258)) +## Remove `torch.onnx.dynamo_export` and the `onnxrt` torch compile backend ([#158130](https://github.com/pytorch/pytorch/pull/158130), [#158258](https://github.com/pytorch/pytorch/pull/158258)) `torch.onnx.dynamo_export` is removed. Please use `torch.onnx.export` instead. The experimental ONNX Runtime compile backend (`torch.compile(backend="onnxrt")`) is no longer supported. -### Remove `torch.onnx.enable_fake_mode` ([#161222](https://github.com/pytorch/pytorch/pull/161222)) +## Remove `torch.onnx.enable_fake_mode` ([#161222](https://github.com/pytorch/pytorch/pull/161222)) The `dynamo=True` mode uses `FakeTensor`s by default which is memory efficient. -### Some public facing utility APIs for the TorchScript based exporter are now private ([#161323](https://github.com/pytorch/pytorch/pull/161323)) -### Remove `torch.onnx.symbolic_caffe2` ([#157102](https://github.com/pytorch/pytorch/pull/157102)) +## Some public facing utility APIs for the TorchScript based exporter are now private ([#161323](https://github.com/pytorch/pytorch/pull/161323)) -## Python Frontend -### Upgrade to DLPack 1.0. ([#145000](https://github.com/pytorch/pytorch/pull/145000)) +Deprecated members in `torch.onnx.verification` are removed. Previously private `torch.onnx.symbolic_opsets*` functions will no longer be accessible. 
Consider making a copy of the source code if you need to access any private functions for compatibility with the TorchScript based exporter. -This upgrade is doing the same BC-breaking changes as the DLPack release. -Objects in `torch.utils.dlpack` have been updated to reflect these changes, such as `DLDeviceType`. -See the PR for details on the exact changes and how to update your code. +## Remove `torch.onnx.symbolic_caffe2` ([#157102](https://github.com/pytorch/pytorch/pull/157102)) -### Raise appropriate errors in `torch.cat` ([#158249](https://github.com/pytorch/pytorch/pull/158249)) +Support for `caffe2` in the ONNX exporter has ended and is removed. -Raising `ValueError`, `IndexError` or `TypeError` where appropriate instead of the generic `RuntimeError`. -If you code was catching these error, you can update to catch the new error type. +## Remove `/d2implyavx512upperregs` flag that slows build ([#159431](https://github.com/pytorch/pytorch/pull/159431)) -# Deprecations -## Dataloader Frontend -### Deprecate `pin_memory_device` param in `torch.utils.data.DataLoader` ([#158323](https://github.com/pytorch/pytorch/pull/158323)) +Re-introduced AVX512 optimizations for Windows VS2022 builds, may cause issues with specific versions of VS2022, see [#145702](https://github.com/pytorch/pytorch/issues/145702) -We move enabling `pin_memory` back inside `BaseDataLoaderIter`. This is required for `StatefulDataloader` which leveraged `BaseDataLoaderIter` direclty rather than the `Dataloader` class init +## Add `ScalarType` to shim conversion and `stable::Tensor.scalar_type` ([#160557](https://github.com/pytorch/pytorch/pull/160557)) -## Export -### Deprecation for `export_for_training` API, in favor of equivalent `export` API ([#158203](https://github.com/pytorch/pytorch/pull/158203)) +Before, user extensions could only in abstract pass around obfuscated dtypes appearing as `int32_ts`. Now, users can confidently use `torch::headeronly::ScalarType` in their extensions for major scalar types. This PR enables ABI stability by adding a translation layer through the shim, so that even if the `ScalarType` enum values change in the future, user extensions need not fear. -`export_for_training` exists because we couldn't migrate internal usages of export to the final IR. Now that we have completed the migration, we deprecated and deleted this API. +This change adds ScalarType support for user extensions and is only narrowly BC breaking for unpopular dtypes: `quint*`s, `qint*`s, `Bits*`, `dummy_uint*`s, `dummy_int*`s, `Float8_e8m0fnu`, and `Float4_e2m1fn_x2` in the use case where an extension retrieves a Tensor dtype of the above and passes it into `aoti_torch_call_dispatcher`. -## Release Engineering -### Remove Python 3.9 support in CD builds. Move CI to Python 3.10.([#161427](https://github.com/pytorch/pytorch/pull/161427)) ([#162265](https://github.com/pytorch/pytorch/pull/162265)) ([#162297](https://github.com/pytorch/pytorch/pull/162297)) ([#160852](https://github.com/pytorch/pytorch/pull/160852)) +# Deprecations +## Deprecate `pin_memory_device` param in `torch.utils.data.DataLoader` ([#158323](https://github.com/pytorch/pytorch/pull/158323)) + +We move enabling `pin_memory` back inside `BaseDataLoaderIter`. 
This is required for `StatefulDataloader` which leveraged `BaseDataLoaderIter` direclty rather than the `Dataloader` class init + +## Deprecate `torch.export.export_for_training` API in favor of equivalent `torch.export.export` API ([#158203](https://github.com/pytorch/pytorch/pull/158203)) -### Remove CUDA 12.9 support in CD builds ([#161916](https://github.com/pytorch/pytorch/pull/161916)) +`torch.export.export_for_training` exists because we couldn't migrate internal usages of export to the final IR. Now that we have completed the migration, we deprecated and deleted this API. # New Features ## AOTDispatcher @@ -174,29 +169,12 @@ We move enabling `pin_memory` back inside `BaseDataLoaderIter`. This is required - Add `zero_()` and `empty_like(t)` to `torch/csrc/stable/ops.h` ([#158866](https://github.com/pytorch/pytorch/pull/158866)) ## C++ Extensions -- Add pad and narrow to `torch/csrc/stable/ops.h` ([#159328](https://github.com/pytorch/pytorch/pull/159328)) -- Add `getCurrentDeviceIndex` to `torch::stable::accelerator` ([#160453](https://github.com/pytorch/pytorch/pull/160453)) -- Add `new_zeros` dtype variant to the shim and as a stable op ([#161597](https://github.com/pytorch/pytorch/pull/161597)) -- Update `torch::stable::Tensor()` default constructor ([#159507](https://github.com/pytorch/pytorch/pull/159507)) -- Add beginnings of `torch::stable::accelerator` ([#159679](https://github.com/pytorch/pytorch/pull/159679)) -- Port `amax` to stable ABI ([#160214](https://github.com/pytorch/pytorch/pull/160214)) -- Add `new_empty` (with dtype argument only) to `torch::stable` ([#159508](https://github.com/pytorch/pytorch/pull/159508)) -- Enable generating generic `c_shim` that doesn't bypass dispatcher ([#158974](https://github.com/pytorch/pytorch/pull/158974)) -- Cut a version of `TORCH_ERROR_CODE_CHECK` in `headeronly` from AOTI ([#159604](https://github.com/pytorch/pytorch/pull/159604)) -- Check F2C BLAS for OpenBLAS and other vendors ([#143846](https://github.com/pytorch/pytorch/pull/143846)) -- Add an ovrsource target for `torch/headeronly` ([#157912](https://github.com/pytorch/pytorch/pull/157912)) -- Migrate `c10/macros/cmake_macros.h.in` to `torch/headeronly` ([#158035](https://github.com/pytorch/pytorch/pull/158035)) -- Move `c10/macros/Macros.h` to `headeronly` ([#158365](https://github.com/pytorch/pytorch/pull/158365)) -- Add `STD_TORCH_CHECK` to `headeronly` ([#158377](https://github.com/pytorch/pytorch/pull/158377)) -- Migrate easy q(u)int/bits stuff to `torch/headeronly` ([#159302](https://github.com/pytorch/pytorch/pull/159302)) -- Move `Float4` to `headeronly` ([#159414](https://github.com/pytorch/pytorch/pull/159414)) -- Move `BFloat16.h` to `headeronly` ([#159412](https://github.com/pytorch/pytorch/pull/159412)) -- Move `Float8` variations to `headeronly` ([#159415](https://github.com/pytorch/pytorch/pull/159415)) -- Move complex to `headeronly` ([#159411](https://github.com/pytorch/pytorch/pull/159411)) -- Migrate `ScalarType` to `headeronly` ([#159911](https://github.com/pytorch/pytorch/pull/159911)) -- Add stable Tensor `get_device_index`, use more stable `DeviceIndex` ([#160143](https://github.com/pytorch/pytorch/pull/160143)) -- Add `is_cpu` method to stable tensor type ([#160212](https://github.com/pytorch/pytorch/pull/160212)) +- Build out a stable set of ATen ops in `torch/csrc/stable/ops.h`: `amax`, `narrow`, `new_empty` + `new_zeros` dtype variant, `pad`, ([#159328](https://github.com/pytorch/pytorch/pull/159328), 
[#158974](https://github.com/pytorch/pytorch/pull/158974), [#159508](https://github.com/pytorch/pytorch/pull/159508), [#161597](https://github.com/pytorch/pytorch/pull/161597), [#160214](https://github.com/pytorch/pytorch/pull/160214), ) +- Add `torch::stable::Tensor()` default constructor, `is_cpu`, and `get_device_index`([#159507](https://github.com/pytorch/pytorch/pull/159507), [#160212](https://github.com/pytorch/pytorch/pull/160212), [#160143](https://github.com/pytorch/pytorch/pull/160143)) +- Add beginnings of `torch::stable::accelerator` with support for DeviceGuard and Stream ([#159679](https://github.com/pytorch/pytorch/pull/159679), [#160453](https://github.com/pytorch/pytorch/pull/160453)) +- Start building out `torch/headeronly`: c10 Macros, STD_TORCH_CHECK, ScalarTypes (like BFloat16 and Half) ([#158035](https://github.com/pytorch/pytorch/pull/158035), [#158365](https://github.com/pytorch/pytorch/pull/158365), [#157912](https://github.com/pytorch/pytorch/pull/157912), [#158377](https://github.com/pytorch/pytorch/pull/158377), [#159302](https://github.com/pytorch/pytorch/pull/159302), [#159414](https://github.com/pytorch/pytorch/pull/159414), [#159412](https://github.com/pytorch/pytorch/pull/159412), [#159415](https://github.com/pytorch/pytorch/pull/159415), [#159411](https://github.com/pytorch/pytorch/pull/159411), [#159911](https://github.com/pytorch/pytorch/pull/159911)) - Remove cmake cache and reconfigure again if it is invalid ([#156958](https://github.com/pytorch/pytorch/pull/156958)) +- Cut a version of `TORCH_ERROR_CODE_CHECK` in `headeronly` from AOTI ([#159604](https://github.com/pytorch/pytorch/pull/159604)) - Remove `wheel` from build requirements ([#158027](https://github.com/pytorch/pytorch/pull/158027)) - Error when `TORCH_STABLE_ONLY` is defined in `TensorBase.h` ([#161658](https://github.com/pytorch/pytorch/pull/161658)) @@ -207,10 +185,6 @@ We move enabling `pin_memory` back inside `BaseDataLoaderIter`. This is required - Add getter for CUDA graph exec to allow mutation of captured kernel params ([#161294](https://github.com/pytorch/pytorch/pull/161294)) - Implement support for `cudnn_batch_norm_out` kernel to replace the autogen approach ([#123020](https://github.com/pytorch/pytorch/pull/123020)) -## Distributed -### Symmetric Memory -- NVSHMEM support for Triton 3.5 ([#163152](https://github.com/pytorch/pytorch/pull/163152)) - ## Dynamo - Experimental API for ahead-of-time compiling models in fullgraph mode ([#161383](https://github.com/pytorch/pytorch/pull/161383)) - Add a hook for recompilations ([#157961](https://github.com/pytorch/pytorch/pull/157961)) @@ -248,8 +222,7 @@ We move enabling `pin_memory` back inside `BaseDataLoaderIter`. This is required - Add `torch.hash_tensor` reduction function ([#154149](https://github.com/pytorch/pytorch/pull/154149)) ## Quantization -- Enable cpu fp8 qlinear ([#155678](https://github.com/pytorch/pytorch/pull/155678)) -- Enable cpu fp8 qconv ([#157076](https://github.com/pytorch/pytorch/pull/157076)) +- Enable cpu fp8 qlinear and cpu fp8 qconv ([#155678](https://github.com/pytorch/pytorch/pull/155678), [#157076](https://github.com/pytorch/pytorch/pull/157076)) ## Release Engineering - Add support for CUDA 13.0 in CI/CD builds. 
Enable CUDA compression mode for binary size reduction for CUDA 13.0 builds ([#160956](https://github.com/pytorch/pytorch/pull/160956)) ([#161073](https://github.com/pytorch/pytorch/pull/161073)) ([#161257](https://github.com/pytorch/pytorch/pull/161257)) ([#161663](https://github.com/pytorch/pytorch/pull/161663)) ([#161316](https://github.com/pytorch/pytorch/pull/161316)) ([#160201](https://github.com/pytorch/pytorch/pull/160201)) ([#160770](https://github.com/pytorch/pytorch/pull/160770)) ([#161013](https://github.com/pytorch/pytorch/pull/161013)) ([#161916](https://github.com/pytorch/pytorch/pull/161916)) ([#162268](https://github.com/pytorch/pytorch/pull/162268)) ([#162322](https://github.com/pytorch/pytorch/pull/162322)) ([#162383](https://github.com/pytorch/pytorch/pull/162383)) ([#161833](https://github.com/pytorch/pytorch/pull/161833)) @@ -283,6 +256,8 @@ We move enabling `pin_memory` back inside `BaseDataLoaderIter`. This is required - Fix dev warning in `Dependencies.cmake` ([#159702](https://github.com/pytorch/pytorch/pull/159702)) - Fix building system gloo with CUDA/HIP ([#146637](https://github.com/pytorch/pytorch/pull/146637)) - Build `libtorch` without NVSHMEM ([#160910](https://github.com/pytorch/pytorch/pull/160910)) +- Improve BLAS feature detection ([#143846](https://github.com/pytorch/pytorch/pull/143846)) + ## Composability - Meta implementation for `aten.add.Scalar` ([#161332](https://github.com/pytorch/pytorch/pull/161332)) @@ -483,6 +458,9 @@ We move enabling `pin_memory` back inside `BaseDataLoaderIter`. This is required - Fix `torch.autograd.graph.GradientEdge` for `torch.autograd.Function` ([#160098](https://github.com/pytorch/pytorch/pull/160098)) - Match 0-dim gradients device type regardless of subclass-ness ([#160165](https://github.com/pytorch/pytorch/pull/160165)) +## Build Frontend +- Turn on `BUILD_BUNDLEPTXAS=1` to allow compile on newer GPUs([#163988](https://github.com/pytorch/pytorch/pull/163988)) + ## C++ Frontend - Fix `torch.utils.cpp_extension` parser for clang version 20.1.7+libcxx ([#157666](https://github.com/pytorch/pytorch/pull/157666)) - Fix `MakeTensor::computeStorageSize()` calculation ([#158690](https://github.com/pytorch/pytorch/pull/158690)) @@ -498,6 +476,7 @@ We move enabling `pin_memory` back inside `BaseDataLoaderIter`. This is required - Implement workaround for `cudaErrorNotSupported` ([#162412](https://github.com/pytorch/pytorch/pull/162412)) - Fix missing `__syncthreads` in MultiMarginLoss backward ([#158994](https://github.com/pytorch/pytorch/pull/158994)) - Roll-back cuDNN frontend upgrade and update Meta registration due to compile issues ([#163104](https://github.com/pytorch/pytorch/pull/163104)) +- Disable cuDNN for 3D convolutions with `kernel size != 1` for cuDNN 9.8+ ([#163581](https://github.com/pytorch/pytorch/pull/163581)) ## Distributed ### c10d @@ -505,6 +484,8 @@ We move enabling `pin_memory` back inside `BaseDataLoaderIter`. 
This is required - Fix `setGroupName` and `setGroupDesc` in `group_split` and `merge_remote_group` ([#159429](https://github.com/pytorch/pytorch/pull/159429)) - Fix a bug of distributed 'gather' with noncontiguous tensors on the Gloo backend ([#158903](https://github.com/pytorch/pytorch/pull/158903)) - Fix a bug of distributed 'gather' with noncontiguous tensors on the NCCL backend ([#159549](https://github.com/pytorch/pytorch/pull/159549)) + - Fix data inconsistencies when using `batch_isend_irecv` with 2D tensor views by making P2P tensors dense ([#163719](https://github.com/pytorch/pytorch/pull/163719)) + - Handle discontiguous `allgather`/`reducescatter` inputs ([#163712](https://github.com/pytorch/pytorch/pull/163712)) ### Device Mesh - Fix the not incorrectly chained each of the strings as iterables ([#160709](https://github.com/pytorch/pytorch/pull/160709)) ### DistributedDataParallel (DDP) @@ -524,8 +505,6 @@ We move enabling `pin_memory` back inside `BaseDataLoaderIter`. This is required ### Pipeline Parallelism (PP) - Fix eval step under `no_grad()` ([#159293](https://github.com/pytorch/pytorch/pull/159293)) - Fix zero bubble schedules for `eval()` ([#159475](https://github.com/pytorch/pytorch/pull/159475)) -### Symmetric Memory (SymmMem) -- Fix `put_signal` + `wait_until` hang ([#163194](https://github.com/pytorch/pytorch/pull/163194)) ### TorchElastic - Fix wrong log file name in the docs of `torch.distributed.elastic.multiprocessing.start_processes()` ([#160396](https://github.com/pytorch/pytorch/pull/160396)) ### TensorPipe @@ -629,7 +608,6 @@ We move enabling `pin_memory` back inside `BaseDataLoaderIter`. This is required - Fix finding ROCm/HIP version on Windows ([#156486](https://github.com/pytorch/pytorch/pull/156486)) - Fix LoadHIP handling of environment variable paths on Windows ([#159080](https://github.com/pytorch/pytorch/pull/159080)) - Add hipcc compatibility flags to `cpp_extension.py` on Windows ([#159790](https://github.com/pytorch/pytorch/pull/159790)) -- Symmetric memory set handle type for ROCm ([#161741](https://github.com/pytorch/pytorch/pull/161741)) - In SDPA via AOTriton, `logsumexp` needs scaling back to natural base ([#156903](https://github.com/pytorch/pytorch/pull/156903)) - Check stream graph capture status in `memcpy_and_sync` inline function ([#158165](https://github.com/pytorch/pytorch/pull/158165)) @@ -681,7 +659,6 @@ We move enabling `pin_memory` back inside `BaseDataLoaderIter`. 
This is required - Remove extra transposes in NHWC convolutions on MIOpen ([#160435](https://github.com/pytorch/pytorch/pull/160435)) - Remove extra sync in `tensor.item()` ([#158486](https://github.com/pytorch/pytorch/pull/158486)) - Elementwise and reduction kernel perf improvements ([#159430](https://github.com/pytorch/pytorch/pull/159430), [#159652](https://github.com/pytorch/pytorch/pull/159652), [#160444](https://github.com/pytorch/pytorch/pull/160444), [#160466](https://github.com/pytorch/pytorch/pull/160466), [#161054](https://github.com/pytorch/pytorch/pull/161054), [#161180](https://github.com/pytorch/pytorch/pull/161180), [#161181](https://github.com/pytorch/pytorch/pull/161181)) -- Symmetric Memory Performance improvements for two-shot allreduce ([#156746](https://github.com/pytorch/pytorch/pull/156746)) - Enable build of `fbgemm_gpu genai` sources for grouped GEMM support ([#160676](https://github.com/pytorch/pytorch/pull/160676)) ## XPU From 0c8c65de8175390737e9a57c739a02fdd8f6cd56 Mon Sep 17 00:00:00 2001 From: Angel Li Date: Wed, 1 Oct 2025 08:12:59 -0700 Subject: [PATCH 4/6] reordering --- 2.9.0/final.md | 565 ++++++++++++++++++++++++------------------------- 1 file changed, 279 insertions(+), 286 deletions(-) diff --git a/2.9.0/final.md b/2.9.0/final.md index f969bd5..963c5d8 100644 --- a/2.9.0/final.md +++ b/2.9.0/final.md @@ -22,7 +22,7 @@ Below are the full release notes for this release. ## Min supported Python version is now 3.10 ([#162310](https://github.com/pytorch/pytorch/pull/162310)) -The minimum version of Python required for PyTorch 2.9.0 is 3.10. +The minimum version of Python required for PyTorch 2.9.0 is 3.10. We also have 3.14 and 3.14t available as preview with this release. ## Build metal kernels of MacOS-14+ and remove all pre-MacOS-14 specific logic, requires MacOS-14+ going forward ([\#159733](https://github.com/pytorch/pytorch/pull/159733), [\#159912](https://github.com/pytorch/pytorch/pull/159912)) @@ -36,8 +36,7 @@ See the PR for details on the exact changes and how to update your code. ## Raise appropriate errors in `torch.cat` ([#158249](https://github.com/pytorch/pytorch/pull/158249)) -Raising `ValueError`, `IndexError` or `TypeError` where appropriate instead of the generic `RuntimeError`. -If you code was catching these error, you can update to catch the new error type. +`torch.cat` now raises `ValueError`, `IndexError` or `TypeError` where appropriate instead of the generic `RuntimeError`. If you code was catching these error, you can update to catch the new error type. ## Default to `dynamo=True` for ONNX exporter ([#159646](https://github.com/pytorch/pytorch/pull/159646), [#162726](https://github.com/pytorch/pytorch/pull/162726)) @@ -161,68 +160,54 @@ We move enabling `pin_memory` back inside `BaseDataLoaderIter`. This is required `torch.export.export_for_training` exists because we couldn't migrate internal usages of export to the final IR. Now that we have completed the migration, we deprecated and deleted this API. 
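As a migration sketch (the toy module below is illustrative only, not code from the release), a former `export_for_training` call maps directly onto `torch.export.export`:

```python
import torch


class ToyModel(torch.nn.Module):
    def forward(self, x):
        return torch.nn.functional.relu(x) + 1


model = ToyModel()
example_args = (torch.randn(4, 8),)

# Previously (deprecated and now removed):
# exported = torch.export.export_for_training(model, example_args)

# Equivalent call going forward:
exported = torch.export.export(model, example_args)
print(exported)
```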
# New Features -## AOTDispatcher -- Add AOTDispatcher config to set backward autocast behavior ([#156356](https://github.com/pytorch/pytorch/pull/156356)) - -## Build Frontend -- Add transpose to `torch/csrc/stable` ([#158160](https://github.com/pytorch/pytorch/pull/158160)) -- Add `zero_()` and `empty_like(t)` to `torch/csrc/stable/ops.h` ([#158866](https://github.com/pytorch/pytorch/pull/158866)) - -## C++ Extensions -- Build out a stable set of ATen ops in `torch/csrc/stable/ops.h`: `amax`, `narrow`, `new_empty` + `new_zeros` dtype variant, `pad`, ([#159328](https://github.com/pytorch/pytorch/pull/159328), [#158974](https://github.com/pytorch/pytorch/pull/158974), [#159508](https://github.com/pytorch/pytorch/pull/159508), [#161597](https://github.com/pytorch/pytorch/pull/161597), [#160214](https://github.com/pytorch/pytorch/pull/160214), ) -- Add `torch::stable::Tensor()` default constructor, `is_cpu`, and `get_device_index`([#159507](https://github.com/pytorch/pytorch/pull/159507), [#160212](https://github.com/pytorch/pytorch/pull/160212), [#160143](https://github.com/pytorch/pytorch/pull/160143)) -- Add beginnings of `torch::stable::accelerator` with support for DeviceGuard and Stream ([#159679](https://github.com/pytorch/pytorch/pull/159679), [#160453](https://github.com/pytorch/pytorch/pull/160453)) -- Start building out `torch/headeronly`: c10 Macros, STD_TORCH_CHECK, ScalarTypes (like BFloat16 and Half) ([#158035](https://github.com/pytorch/pytorch/pull/158035), [#158365](https://github.com/pytorch/pytorch/pull/158365), [#157912](https://github.com/pytorch/pytorch/pull/157912), [#158377](https://github.com/pytorch/pytorch/pull/158377), [#159302](https://github.com/pytorch/pytorch/pull/159302), [#159414](https://github.com/pytorch/pytorch/pull/159414), [#159412](https://github.com/pytorch/pytorch/pull/159412), [#159415](https://github.com/pytorch/pytorch/pull/159415), [#159411](https://github.com/pytorch/pytorch/pull/159411), [#159911](https://github.com/pytorch/pytorch/pull/159911)) -- Remove cmake cache and reconfigure again if it is invalid ([#156958](https://github.com/pytorch/pytorch/pull/156958)) -- Cut a version of `TORCH_ERROR_CODE_CHECK` in `headeronly` from AOTI ([#159604](https://github.com/pytorch/pytorch/pull/159604)) -- Remove `wheel` from build requirements ([#158027](https://github.com/pytorch/pytorch/pull/158027)) -- Error when `TORCH_STABLE_ONLY` is defined in `TensorBase.h` ([#161658](https://github.com/pytorch/pytorch/pull/161658)) - -## CPU -- Support GQA for flash attention ([#157893](https://github.com/pytorch/pytorch/pull/157893)) +## Python Frontend +- Add utility to get the kernel currently registered on the dispatcher ([#158393](https://github.com/pytorch/pytorch/pull/158393)) +- Extend `__torch_function__` handler to be triggered by elements within a list ([#160256](https://github.com/pytorch/pytorch/pull/160256)) +- Add `torch.hash_tensor` reduction function ([#154149](https://github.com/pytorch/pytorch/pull/154149)) -## CUDA -- Add getter for CUDA graph exec to allow mutation of captured kernel params ([#161294](https://github.com/pytorch/pytorch/pull/161294)) -- Implement support for `cudnn_batch_norm_out` kernel to replace the autogen approach ([#123020](https://github.com/pytorch/pytorch/pull/123020)) +## FX +- Extend torch function support to ALL arguments instead of just scalar type (but not inside of list) ([#145089](https://github.com/pytorch/pytorch/pull/145089)) +- Add `is_fx_symbolic_tracing` flag 
([#161385](https://github.com/pytorch/pytorch/pull/161385)) ## Dynamo - Experimental API for ahead-of-time compiling models in fullgraph mode ([#161383](https://github.com/pytorch/pytorch/pull/161383)) - Add a hook for recompilations ([#157961](https://github.com/pytorch/pytorch/pull/157961)) -## Export -- Add support for param mutation under inference mode ([#159661](https://github.com/pytorch/pytorch/pull/159661)) +## Optimizer +- Introduce Muon optimizer to PyTorch ([#160213](https://github.com/pytorch/pytorch/pull/160213)) -## FX -- Extend torch function support to ALL arguments instead of just scalar type (but not inside of list) ([#145089](https://github.com/pytorch/pytorch/pull/145089)) -- Add `is_fx_symbolic_tracing` flag ([#161385](https://github.com/pytorch/pytorch/pull/161385)) +## Profiler +- Add GC Events to Python Stack Tracer ([#161209](https://github.com/pytorch/pytorch/pull/161209)) +- Add a custom profiler configuration option ([#151656](https://github.com/pytorch/pytorch/pull/151656)) ## Inductor - Allow user to pass in custom partitioner function ([#157580](https://github.com/pytorch/pytorch/pull/157580)) -## JIT -- Add `torch._check` compatibility support ([#159988](https://github.com/pytorch/pytorch/pull/159988)) +## Export +- Add support for param mutation under inference mode ([#159661](https://github.com/pytorch/pytorch/pull/159661)) -## MPS -- Partial sparse support for MPS backend ([\#159729](https://github.com/pytorch/pytorch/pull/159729), [\#160254](https://github.com/pytorch/pytorch/pull/160254), [\#160223](https://github.com/pytorch/pytorch/pull/160223), [\#161846](https://github.com/pytorch/pytorch/pull/161846), [\#162007](https://github.com/pytorch/pytorch/pull/162007), [#157238](https://github.com/pytorch/pytorch/pull/157238)) -- Add `avg_pool3d`, `max_unpool1d/2d/3d`, `max_pool3d`, `max_pool3d` bwd pass, and `avg_pool3d` bwd pass for MPS ([#158877](https://github.com/pytorch/pytorch/pull/158877),[#159789](https://github.com/pytorch/pytorch/pull/159789), [#156467](https://github.com/pytorch/pytorch/pull/156467), [#157498](https://github.com/pytorch/pytorch/pull/157498), [#159089](https://github.com/pytorch/pytorch/pull/159089)) +## AOTDispatcher +- Add AOTDispatcher config to set backward autocast behavior ([#156356](https://github.com/pytorch/pytorch/pull/156356)) + +## Quantization +- Enable cpu fp8 qlinear and cpu fp8 qconv ([#155678](https://github.com/pytorch/pytorch/pull/155678), [#157076](https://github.com/pytorch/pytorch/pull/157076)) ## ONNX - RMS Norm support in opset 23 ([#159377](https://github.com/pytorch/pytorch/pull/159377)) -## Optimizer -- Introduce Muon optimizer to PyTorch ([#160213](https://github.com/pytorch/pytorch/pull/160213)) - -## Profiler -- Add GC Events to Python Stack Tracer ([#161209](https://github.com/pytorch/pytorch/pull/161209)) -- Add a custom profiler configuration option ([#151656](https://github.com/pytorch/pytorch/pull/151656)) - -## Python Frontend -- Add utility to get the kernel currently registered on the dispatcher ([#158393](https://github.com/pytorch/pytorch/pull/158393)) -- Extend `__torch_function__` handler to be triggered by elements within a list ([#160256](https://github.com/pytorch/pytorch/pull/160256)) -- Add `torch.hash_tensor` reduction function ([#154149](https://github.com/pytorch/pytorch/pull/154149)) +## C++ Extensions +- Build out a stable set of ATen ops in `torch/csrc/stable/ops.h`: `amax`, `narrow`, `new_empty` + `new_zeros` dtype variant, `pad`, 
([#159328](https://github.com/pytorch/pytorch/pull/159328), [#158974](https://github.com/pytorch/pytorch/pull/158974), [#159508](https://github.com/pytorch/pytorch/pull/159508), [#161597](https://github.com/pytorch/pytorch/pull/161597), [#160214](https://github.com/pytorch/pytorch/pull/160214), ) +- Add `torch::stable::Tensor()` default constructor, `is_cpu`, and `get_device_index`([#159507](https://github.com/pytorch/pytorch/pull/159507), [#160212](https://github.com/pytorch/pytorch/pull/160212), [#160143](https://github.com/pytorch/pytorch/pull/160143)) +- Add beginnings of `torch::stable::accelerator` with support for DeviceGuard and Stream ([#159679](https://github.com/pytorch/pytorch/pull/159679), [#160453](https://github.com/pytorch/pytorch/pull/160453)) +- Start building out `torch/headeronly`: c10 Macros, STD_TORCH_CHECK, ScalarTypes (like BFloat16 and Half) ([#158035](https://github.com/pytorch/pytorch/pull/158035), [#158365](https://github.com/pytorch/pytorch/pull/158365), [#157912](https://github.com/pytorch/pytorch/pull/157912), [#158377](https://github.com/pytorch/pytorch/pull/158377), [#159302](https://github.com/pytorch/pytorch/pull/159302), [#159414](https://github.com/pytorch/pytorch/pull/159414), [#159412](https://github.com/pytorch/pytorch/pull/159412), [#159415](https://github.com/pytorch/pytorch/pull/159415), [#159411](https://github.com/pytorch/pytorch/pull/159411), [#159911](https://github.com/pytorch/pytorch/pull/159911)) +- Remove cmake cache and reconfigure again if it is invalid ([#156958](https://github.com/pytorch/pytorch/pull/156958)) +- Cut a version of `TORCH_ERROR_CODE_CHECK` in `headeronly` from AOTI ([#159604](https://github.com/pytorch/pytorch/pull/159604)) +- Remove `wheel` from build requirements ([#158027](https://github.com/pytorch/pytorch/pull/158027)) +- Error when `TORCH_STABLE_ONLY` is defined in `TensorBase.h` ([#161658](https://github.com/pytorch/pytorch/pull/161658)) -## Quantization -- Enable cpu fp8 qlinear and cpu fp8 qconv ([#155678](https://github.com/pytorch/pytorch/pull/155678), [#157076](https://github.com/pytorch/pytorch/pull/157076)) +## Build Frontend +- Add transpose to `torch/csrc/stable` ([#158160](https://github.com/pytorch/pytorch/pull/158160)) +- Add `zero_()` and `empty_like(t)` to `torch/csrc/stable/ops.h` ([#158866](https://github.com/pytorch/pytorch/pull/158866)) ## Release Engineering - Add support for CUDA 13.0 in CI/CD builds. Enable CUDA compression mode for binary size reduction for CUDA 13.0 builds ([#160956](https://github.com/pytorch/pytorch/pull/160956)) ([#161073](https://github.com/pytorch/pytorch/pull/161073)) ([#161257](https://github.com/pytorch/pytorch/pull/161257)) ([#161663](https://github.com/pytorch/pytorch/pull/161663)) ([#161316](https://github.com/pytorch/pytorch/pull/161316)) ([#160201](https://github.com/pytorch/pytorch/pull/160201)) ([#160770](https://github.com/pytorch/pytorch/pull/160770)) ([#161013](https://github.com/pytorch/pytorch/pull/161013)) ([#161916](https://github.com/pytorch/pytorch/pull/161916)) ([#162268](https://github.com/pytorch/pytorch/pull/162268)) ([#162322](https://github.com/pytorch/pytorch/pull/162322)) ([#162383](https://github.com/pytorch/pytorch/pull/162383)) ([#161833](https://github.com/pytorch/pytorch/pull/161833)) @@ -233,6 +218,17 @@ We move enabling `pin_memory` back inside `BaseDataLoaderIter`. 
This is required - Enable NVSHMEM integration ([#151261](https://github.com/pytorch/pytorch/pull/151261)) ([#153010](https://github.com/pytorch/pytorch/pull/153010)) ([#154538](https://github.com/pytorch/pytorch/pull/154538)) ([#155506](https://github.com/pytorch/pytorch/pull/155506)) ([#156685](https://github.com/pytorch/pytorch/pull/156685)) ([#158938](https://github.com/pytorch/pytorch/pull/158938)) ([#161321](https://github.com/pytorch/pytorch/pull/161321)) ([#160778](https://github.com/pytorch/pytorch/pull/160778)) ([#159907](https://github.com/pytorch/pytorch/pull/159907)) ([#160465](https://github.com/pytorch/pytorch/pull/160465)) +## CUDA +- Add getter for CUDA graph exec to allow mutation of captured kernel params ([#161294](https://github.com/pytorch/pytorch/pull/161294)) +- Implement support for `cudnn_batch_norm_out` kernel to replace the autogen approach ([#123020](https://github.com/pytorch/pytorch/pull/123020)) + +## CPU +- Support GQA for flash attention ([#157893](https://github.com/pytorch/pytorch/pull/157893)) + +## MPS +- Partial sparse support for MPS backend ([\#159729](https://github.com/pytorch/pytorch/pull/159729), [\#160254](https://github.com/pytorch/pytorch/pull/160254), [\#160223](https://github.com/pytorch/pytorch/pull/160223), [\#161846](https://github.com/pytorch/pytorch/pull/161846), [\#162007](https://github.com/pytorch/pytorch/pull/162007), [#157238](https://github.com/pytorch/pytorch/pull/157238)) +- Add `avg_pool3d`, `max_unpool1d/2d/3d`, `max_pool3d`, `max_pool3d` bwd pass, and `avg_pool3d` bwd pass for MPS ([#158877](https://github.com/pytorch/pytorch/pull/158877),[#159789](https://github.com/pytorch/pytorch/pull/159789), [#156467](https://github.com/pytorch/pytorch/pull/156467), [#157498](https://github.com/pytorch/pytorch/pull/157498), [#159089](https://github.com/pytorch/pytorch/pull/159089)) + ## ROCm - OCP Micro-scaling Format (mx-fp8/mx-fp4) Support ([#151360](https://github.com/pytorch/pytorch/pull/151360)) @@ -240,54 +236,22 @@ We move enabling `pin_memory` back inside `BaseDataLoaderIter`. 
This is required - Enable `FlexAttention` on Intel GPU ([#143553](https://github.com/pytorch/pytorch/pull/143553)) # Improvements -## AOTDispatcher -- Skip logging in fp8 activation quantization if there are no nodes to be quantized ([#158129](https://github.com/pytorch/pytorch/pull/158129)) -- Add `aot_export_joint_with_descriptors` and `aot_compile_joint_with_descriptors` ([#158715](https://github.com/pytorch/pytorch/pull/158715)) -- Extract out `prepare_aot_module_simplified` for use in next PR ([#158319](https://github.com/pytorch/pytorch/pull/158319)) -- Rename modules in AOTAutograd ([#158449](https://github.com/pytorch/pytorch/pull/158449)) -- Track descriptors for all inputs/outputs of AOTAutograd traced graph ([#158624](https://github.com/pytorch/pytorch/pull/158624)) -- Improve graph output alias with subclass error message ([#159619](https://github.com/pytorch/pytorch/pull/159619)) -- Pass fw/bw compilers to `aot_export_joint_with_descriptors` ([#159814](https://github.com/pytorch/pytorch/pull/159814)) - -## Autograd -- Support deterministic `torch.nn.Upsample` `mode="trilinear"` backward ([#154239](https://github.com/pytorch/pytorch/pull/154239)) - -## Build Frontend -- Fix dev warning in `Dependencies.cmake` ([#159702](https://github.com/pytorch/pytorch/pull/159702)) -- Fix building system gloo with CUDA/HIP ([#146637](https://github.com/pytorch/pytorch/pull/146637)) -- Build `libtorch` without NVSHMEM ([#160910](https://github.com/pytorch/pytorch/pull/160910)) -- Improve BLAS feature detection ([#143846](https://github.com/pytorch/pytorch/pull/143846)) - - -## Composability -- Meta implementation for `aten.add.Scalar` ([#161332](https://github.com/pytorch/pytorch/pull/161332)) -- `aten.expand_copy` decomp ([#161688](https://github.com/pytorch/pytorch/pull/161688)) -- Fix result dtype cast in decomp for `aten.linalg_vector_norm` ([#155111](https://github.com/pytorch/pytorch/pull/155111)) -- Add dtype checks in meta implementation for several ordering ops ([#159556](https://github.com/pytorch/pytorch/pull/159556)) -- Fix meta function for `aten.complex` ([#160894](https://github.com/pytorch/pytorch/pull/160894)) -- Improve unbacked symint (dynamic shape) support for several decompositions ([#148815](https://github.com/pytorch/pytorch/pull/148815), [#156902](https://github.com/pytorch/pytorch/pull/156902), [#157008](https://github.com/pytorch/pytorch/pull/157008), [#158894](https://github.com/pytorch/pytorch/pull/158894), [#159184](https://github.com/pytorch/pytorch/pull/159184), [#160683](https://github.com/pytorch/pytorch/pull/160683), [#160253](https://github.com/pytorch/pytorch/pull/160253), [#162084](https://github.com/pytorch/pytorch/pull/162084), [#162099](https://github.com/pytorch/pytorch/pull/162099), [#162109](https://github.com/pytorch/pytorch/pull/162109), [#160462](https://github.com/pytorch/pytorch/pull/160462)) +## Python Frontend +- Speed up `torch.load` under `FakeTensorMode` by reducing random reads ([#157931](https://github.com/pytorch/pytorch/pull/157931)) +- Make `torch.utils.benchmark.utils.timer` accelerator agnostic ([#157131](https://github.com/pytorch/pytorch/pull/157131)) +- Improve error message for weight-only load errors ([#159935](https://github.com/pytorch/pytorch/pull/159935)) -## C++ Frontend -- Generalized `AllocatorConfig` to be device-agnostic via new `AcceleratorAllocatorConfig` ([#149601](https://github.com/pytorch/pytorch/pull/149601), [#150312](https://github.com/pytorch/pytorch/pull/150312)) -- Added `Scalar::isUnsigned()` method 
([#159877](https://github.com/pytorch/pytorch/pull/159877)) -- Exposed `ModelRunner` from nativert as public ([#159989](https://github.com/pytorch/pytorch/pull/159989)) -- Improve error message for `torch.binomial` enforcing float inputs ([#157658](https://github.com/pytorch/pytorch/pull/157658)) +## torch.nn +- Allow `register_buffer` with `Tensor`-like objects ([#159455](https://github.com/pytorch/pytorch/pull/159455)) +- Improve error message for unsupported padding configurations ([#160866](https://github.com/pytorch/pytorch/pull/160866)) +- Validate target is 0D when input is 1D in `NLLLoss` ([#161412](https://github.com/pytorch/pytorch/pull/161412)) -## CPU (AArch64) -- Made PyTorch compilable with gcc-14 on ARM ([#157867](https://github.com/pytorch/pytorch/pull/157867)) +## Optimizer +- Resolve warning in LBFGS when converting a tensor with `requires_grad=True` to a scalar ([#160389](https://github.com/pytorch/pytorch/pull/160389)) +- Resolve `SequentialLR` deprecation warning about invoking `step(epoch)` ([#149392](https://github.com/pytorch/pytorch/pull/149392)) -## CUDA -- Make cublaslt/hipblaslt workspaces persistent ([#156495](https://github.com/pytorch/pytorch/pull/156495)) -- Remove unnecessary warnings during the ATen compilation process ([#157703](https://github.com/pytorch/pytorch/pull/157703)) -- Slightly improve error message from `repeat_interleave` kernel ([#157996](https://github.com/pytorch/pytorch/pull/157996)) -- Add framework for explanations for common CUDA errors ([#158395](https://github.com/pytorch/pytorch/pull/158395)) -- Upgrade KernelLauncher `kernelLaunchCheck` to print help string ([#158896](https://github.com/pytorch/pytorch/pull/158896)) -- Prep for cutlass upgrade by ignoring `Wunused-but-set-variable` ([#159276](https://github.com/pytorch/pytorch/pull/159276)) -- Workaround ATen SFINAE under `libc++` ([#161101](https://github.com/pytorch/pytorch/pull/161101)) -- Implement changes to CCCL (CUB/Thrust/LibCUDACXX) usage in ATen ([#153373](https://github.com/pytorch/pytorch/pull/153373)) -- Add maybe unused flag to remove warning ([#157655](https://github.com/pytorch/pytorch/pull/157655)) -- Use new CCCL API in v2.8 ([#160554](https://github.com/pytorch/pytorch/pull/160554)) -- Improve cupy device placement when device is provided with explicit index ([#158529](https://github.com/pytorch/pytorch/pull/158529)) +## Autograd +- Support deterministic `torch.nn.Upsample` `mode="trilinear"` backward ([#154239](https://github.com/pytorch/pytorch/pull/154239)) ## Distributed ### c10d @@ -301,10 +265,6 @@ We move enabling `pin_memory` back inside `BaseDataLoaderIter`. 
This is required
- Make FakeStore optional to be passed into fake backend ([#162164](https://github.com/pytorch/pytorch/pull/162164))
- Enable complex datatype support in `ProcessGroupGloo` ([#156633](https://github.com/pytorch/pytorch/pull/156633))
- Move thread-local capture mode guard to include `work.isStarted` ([#160398](https://github.com/pytorch/pytorch/pull/160398))
-### Device Mesh
- - Enable the use of user set backend and pg option even for the global mesh ([#157501](https://github.com/pytorch/pytorch/pull/157501))
- - Enable slicing a submesh with warnings ([#158899](https://github.com/pytorch/pytorch/pull/158899))
- - Allow controlling PG backend and options via `init_device_mesh` ([#159371](https://github.com/pytorch/pytorch/pull/159371))
### DistributedDataParallel (DDP)
- Support ddp zero hook XCCL path ([#159240](https://github.com/pytorch/pytorch/pull/159240))
### DTensor
@@ -317,25 +277,46 @@ We move enabling `pin_memory` back inside `BaseDataLoaderIter`. This is required
- Support user-supplied Generator for random ops ([#159933](https://github.com/pytorch/pytorch/pull/159933))
- Add `propagate_tensor_meta` function that skips cache if `_are_we_tracing` ([#161334](https://github.com/pytorch/pytorch/pull/161334))
- Support `local_map` as a decorator ([#161353](https://github.com/pytorch/pytorch/pull/161353))
+### Device Mesh
+ - Enable the use of user set backend and pg option even for the global mesh ([#157501](https://github.com/pytorch/pytorch/pull/157501))
+ - Enable slicing a submesh with warnings ([#158899](https://github.com/pytorch/pytorch/pull/158899))
+ - Allow controlling PG backend and options via `init_device_mesh` ([#159371](https://github.com/pytorch/pytorch/pull/159371))
### FullyShardedDataParallel2 (FSDP2)
- Support custom `all_gather` and `reduce_scatter` comms ([#155189](https://github.com/pytorch/pytorch/pull/155189))
- Made it fail `set_allocate_memory_from_process_group` if used together with custom comm hooks ([#157487](https://github.com/pytorch/pytorch/pull/157487))
- Use `reduceOpSum` when world size is 1 ([#157529](https://github.com/pytorch/pytorch/pull/157529))
- Skip `allgather` when world size is 1 ([#160135](https://github.com/pytorch/pytorch/pull/160135))
- Use `post_reduce_stream.record_event()` on hsdp+cpuoffload ([#160481](https://github.com/pytorch/pytorch/pull/160481))
+### Tensor Parallel (TP)
+ - Improve `parallelize_module` API to support more cases ([#157182](https://github.com/pytorch/pytorch/pull/157182))
+### TensorPipe
+ - Update TensorPipe pinned dependency version ([#159834](https://github.com/pytorch/pytorch/pull/159834))
+### TorchElastic
+ - Enable NUMA binding integration with elastic agent and `torchrun` ([#149334](https://github.com/pytorch/pytorch/pull/149334))
+ - Support NUMA Binding for Callable Entrypoints ([#160163](https://github.com/pytorch/pytorch/pull/160163), [#161183](https://github.com/pytorch/pytorch/pull/161183))
### Pipeline Parallelism (PP)
- Add `eval()` API to schedule ([#157795](https://github.com/pytorch/pytorch/pull/157795))
- Allow intermediate nodes in zero bubble to have multiple grads ([#159084](https://github.com/pytorch/pytorch/pull/159084))
- Support `OVERLAP_F_B` computation type ([#158978](https://github.com/pytorch/pytorch/pull/158978))
- Initialize P2P communicators on first step ([#160210](https://github.com/pytorch/pytorch/pull/160210))
- Add `DualPipeV` schedule ([#159591](https://github.com/pytorch/pytorch/pull/159591))
-### TorchElastic
- - Enable NUMA binding integration with 
elastic agent and `torchrun` ([#149334](https://github.com/pytorch/pytorch/pull/149334)) - - Support NUMA Binding for Callable Entrypoints ([#160163](https://github.com/pytorch/pytorch/pull/160163), [#161183](https://github.com/pytorch/pytorch/pull/161183)) -### Tensor Parallel (TP) - - Improve `parallelize_module` API to support more cases ([#157182](https://github.com/pytorch/pytorch/pull/157182)) -### TensorPipe - - Update TensorPipe pinned dependency version ([#159834](https://github.com/pytorch/pytorch/pull/159834)) + +## Linear Algebra Frontend +- Use rocSOLVER for Cholesky inversion on AMD. ([#157154](https://github.com/pytorch/pytorch/pull/157154)) +- Add option for using TF32 as fp32 internal precision for matmul/linear/conv on MKLDNN ([#157520](https://github.com/pytorch/pytorch/pull/157520)) +- Make einsum produce contiguous outputs in more cases ([#161755](https://github.com/pytorch/pytorch/pull/161755)) + +## Profiler +- Add more CUDA API for kernel launcher ([#156016](https://github.com/pytorch/pytorch/pull/156016)) +- Allow Custom Time Unit When Printing Profiler Table ([#157913](https://github.com/pytorch/pytorch/pull/157913)) +- Update CUDA runtime kernel identification logic ([#157890](https://github.com/pytorch/pytorch/pull/157890)) + +## FX +- Fix DCE eliminating random operations by improving `is_impure()` (#151524) ([#157981](https://github.com/pytorch/pytorch/pull/157981)) +- Support converting a float32 tensor to a scalar in FX trace. ([#158216](https://github.com/pytorch/pytorch/pull/158216)) +- Correctly copy `self.module_stack` in ModuleStackTracer ([#159956](https://github.com/pytorch/pytorch/pull/159956)) +- Add tool to track events in graph split ([#159795](https://github.com/pytorch/pytorch/pull/159795)) +- Add `node_name_match` to subgraph rewriter ([#157574](https://github.com/pytorch/pytorch/pull/157574)) ## Dynamo - Improve tracing support for various Python builtin data structures/modules: @@ -350,28 +331,6 @@ We move enabling `pin_memory` back inside `BaseDataLoaderIter`. 
This is required - Graph break error messages link to a website with more information ([#159011](https://github.com/pytorch/pytorch/pull/159011)) - Add option for `TorchDispatchMode` to ignore `torch.compile` internals ([#161648](https://github.com/pytorch/pytorch/pull/161648)) -## Export -- Handle `None` & ellipsis slicing/select in non-strict ([#157821](https://github.com/pytorch/pytorch/pull/157821)) -- Extend FP8 types in serialization ([#158430](https://github.com/pytorch/pytorch/pull/158430)) -- Improve error messages for deserialization ([#159881](https://github.com/pytorch/pytorch/pull/159881)) -- Support serialization for `triton_kernel_wrapper_functional` HOP ([#161314](https://github.com/pytorch/pytorch/pull/161314)) -- Support serialization for complex constants ([#161517](https://github.com/pytorch/pytorch/pull/161517)) -- Add runtime asserts to `while_loop` HOP subgraphs ([#158467](https://github.com/pytorch/pytorch/pull/158467)) -- Warn on side-effectful code in strict mode ([#160060](https://github.com/pytorch/pytorch/pull/160060)) -- Support for vmap in pre-dispatch export ([#154650](https://github.com/pytorch/pytorch/pull/154650)) -- Support vmap and custom autograd function/improve DTensor constructor inefficiency ([#162240](https://github.com/pytorch/pytorch/pull/162240)) - -## Foreach -- Invoke `vector.reserve()` consistently for non-inplace foreach operations ([#161128](https://github.com/pytorch/pytorch/pull/161128)) -- Faster and safer lambda expression capture in `has_integral_tensor()` ([#161042](https://github.com/pytorch/pytorch/pull/161042)) - -## FX -- Fix DCE eliminating random operations by improving `is_impure()` (#151524) ([#157981](https://github.com/pytorch/pytorch/pull/157981)) -- Support converting a float32 tensor to a scalar in FX trace. ([#158216](https://github.com/pytorch/pytorch/pull/158216)) -- Correctly copy `self.module_stack` in ModuleStackTracer ([#159956](https://github.com/pytorch/pytorch/pull/159956)) -- Add tool to track events in graph split ([#159795](https://github.com/pytorch/pytorch/pull/159795)) -- Add `node_name_match` to subgraph rewriter ([#157574](https://github.com/pytorch/pytorch/pull/157574)) - ## Inductor - Add Inductor support for MTIA backend ([#159211](https://github.com/pytorch/pytorch/pull/159211)) - Share default device context when all graph partitions and cudagraph-unsafe ops are on the same device([#162873](https://github.com/pytorch/pytorch/pull/162873)) @@ -384,62 +343,98 @@ We move enabling `pin_memory` back inside `BaseDataLoaderIter`. This is required - Add AOTI C shim functions for collective ops ([#154492](https://github.com/pytorch/pytorch/pull/154492)) - Add missing ops to set of C-shim ops which can have nullptr returns ([#158073](https://github.com/pytorch/pytorch/pull/158073)) -## Linear Algebra Frontend -- Use rocSOLVER for Cholesky inversion on AMD. 
([#157154](https://github.com/pytorch/pytorch/pull/157154)) -- Add option for using TF32 as fp32 internal precision for matmul/linear/conv on MKLDNN ([#157520](https://github.com/pytorch/pytorch/pull/157520)) -- Make einsum produce contiguous outputs in more cases ([#161755](https://github.com/pytorch/pytorch/pull/161755)) +## Export +- Handle `None` & ellipsis slicing/select in non-strict ([#157821](https://github.com/pytorch/pytorch/pull/157821)) +- Extend FP8 types in serialization ([#158430](https://github.com/pytorch/pytorch/pull/158430)) +- Improve error messages for deserialization ([#159881](https://github.com/pytorch/pytorch/pull/159881)) +- Support serialization for `triton_kernel_wrapper_functional` HOP ([#161314](https://github.com/pytorch/pytorch/pull/161314)) +- Support serialization for complex constants ([#161517](https://github.com/pytorch/pytorch/pull/161517)) +- Add runtime asserts to `while_loop` HOP subgraphs ([#158467](https://github.com/pytorch/pytorch/pull/158467)) +- Warn on side-effectful code in strict mode ([#160060](https://github.com/pytorch/pytorch/pull/160060)) +- Support for vmap in pre-dispatch export ([#154650](https://github.com/pytorch/pytorch/pull/154650)) +- Support vmap and custom autograd function/improve DTensor constructor inefficiency ([#162240](https://github.com/pytorch/pytorch/pull/162240)) -## MPS -- Add `shifted_chebyshev_polynomial_[tuvw]`, `igamma/igammac,grid_sampler_3d, native_dropout`/`native_dropout_backward` ([\#157488](https://github.com/pytorch/pytorch/pull/157488), [\#161927](https://github.com/pytorch/pytorch/pull/161927), [\#160541](https://github.com/pytorch/pytorch/pull/160541), [\#162108](https://github.com/pytorch/pytorch/pull/162108)) -- Extend atomic operations to all int types ([\#158179](https://github.com/pytorch/pytorch/pull/158179)) -- Extend `index_put` to complex types ([\#160159](https://github.com/pytorch/pytorch/pull/160159)) -- Extend `addmm` to integral types ([\#160270](https://github.com/pytorch/pytorch/pull/160270)) -- Add support for unsigned types ([\#159094](https://github.com/pytorch/pytorch/pull/159094)) -- Add API to query GPU core count ([\#160414](https://github.com/pytorch/pytorch/pull/160414)) -- Add `kthvalue` ([\#161817](https://github.com/pytorch/pytorch/pull/161817)) -- Type-promote tensor-iterator common dtype ([\#160334](https://github.com/pytorch/pytorch/pull/160334)) -- Implement `logcumsumexp` metal kernel ([\#156858](https://github.com/pytorch/pytorch/pull/156858)) -- Enable `dlpack` integration ([\#158888](https://github.com/pytorch/pytorch/pull/158888)) -- Dynamic reductions ([\#159355](https://github.com/pytorch/pytorch/pull/159355)) -- Update `avg_pool2d` to use Metal kernel when `ceil_mode=True` ([\#161011](https://github.com/pytorch/pytorch/pull/161011)) +## AOTDispatcher +- Skip logging in fp8 activation quantization if there are no nodes to be quantized ([#158129](https://github.com/pytorch/pytorch/pull/158129)) +- Add `aot_export_joint_with_descriptors` and `aot_compile_joint_with_descriptors` ([#158715](https://github.com/pytorch/pytorch/pull/158715)) +- Extract out `prepare_aot_module_simplified` for use in next PR ([#158319](https://github.com/pytorch/pytorch/pull/158319)) +- Rename modules in AOTAutograd ([#158449](https://github.com/pytorch/pytorch/pull/158449)) +- Track descriptors for all inputs/outputs of AOTAutograd traced graph ([#158624](https://github.com/pytorch/pytorch/pull/158624)) +- Improve graph output alias with subclass error message 
([#159619](https://github.com/pytorch/pytorch/pull/159619)) +- Pass fw/bw compilers to `aot_export_joint_with_descriptors` ([#159814](https://github.com/pytorch/pytorch/pull/159814)) + +## Composability +- Meta implementation for `aten.add.Scalar` ([#161332](https://github.com/pytorch/pytorch/pull/161332)) +- `aten.expand_copy` decomp ([#161688](https://github.com/pytorch/pytorch/pull/161688)) +- Fix result dtype cast in decomp for `aten.linalg_vector_norm` ([#155111](https://github.com/pytorch/pytorch/pull/155111)) +- Add dtype checks in meta implementation for several ordering ops ([#159556](https://github.com/pytorch/pytorch/pull/159556)) +- Fix meta function for `aten.complex` ([#160894](https://github.com/pytorch/pytorch/pull/160894)) +- Improve unbacked symint (dynamic shape) support for several decompositions ([#148815](https://github.com/pytorch/pytorch/pull/148815), [#156902](https://github.com/pytorch/pytorch/pull/156902), [#157008](https://github.com/pytorch/pytorch/pull/157008), [#158894](https://github.com/pytorch/pytorch/pull/158894), [#159184](https://github.com/pytorch/pytorch/pull/159184), [#160683](https://github.com/pytorch/pytorch/pull/160683), [#160253](https://github.com/pytorch/pytorch/pull/160253), [#162084](https://github.com/pytorch/pytorch/pull/162084), [#162099](https://github.com/pytorch/pytorch/pull/162099), [#162109](https://github.com/pytorch/pytorch/pull/162109), [#160462](https://github.com/pytorch/pytorch/pull/160462)) + +## Quantization +- Avoid getting model device once per node for pt2e quantization flow ([#159901](https://github.com/pytorch/pytorch/pull/159901)) +- Fixes bug in implementation of `HistogramObserver` ([#156457](https://github.com/pytorch/pytorch/pull/156457)) +- Support `bias=None` for `fbgemm_linear_fp16_weight` CPU op ([#158535](https://github.com/pytorch/pytorch/pull/158535)) +- Add Static Dispatch Kernel for `wrapped_fbgemm_linear_fp16_weight` for Sigmoid ([#160451](https://github.com/pytorch/pytorch/pull/160451)) ## Nested Tensor (NJT) - Added initial `log_softmax()` support ([#159662](https://github.com/pytorch/pytorch/pull/159662)) -## torch.nn -- Allow `register_buffer` with `Tensor`-like objects ([#159455](https://github.com/pytorch/pytorch/pull/159455)) -- Improve error message for unsupported padding configurations ([#160866](https://github.com/pytorch/pytorch/pull/160866)) -- Validate target is 0D when input is 1D in `NLLLoss` ([#161412](https://github.com/pytorch/pytorch/pull/161412)) +## Foreach +- Invoke `vector.reserve()` consistently for non-inplace foreach operations ([#161128](https://github.com/pytorch/pytorch/pull/161128)) +- Faster and safer lambda expression capture in `has_integral_tensor()` ([#161042](https://github.com/pytorch/pytorch/pull/161042)) ## ONNX - Support symbolic arguments in ONNX exporter ([#157734](https://github.com/pytorch/pytorch/pull/157734)) - Fix `torch.tensor` warning in ONNX `symbolic_opset10` export ([#158835](https://github.com/pytorch/pytorch/pull/158835)) -## Optimizer -- Resolve warning in LBFGS when converting a tensor with `requires_grad=True` to a scalar ([#160389](https://github.com/pytorch/pytorch/pull/160389)) -- Resolve `SequentialLR` deprecation warning about invoking `step(epoch)` ([#149392](https://github.com/pytorch/pytorch/pull/149392)) - -## Profiler -- Add more CUDA API for kernel launcher ([#156016](https://github.com/pytorch/pytorch/pull/156016)) -- Allow Custom Time Unit When Printing Profiler Table ([#157913](https://github.com/pytorch/pytorch/pull/157913)) -- Update 
CUDA runtime kernel identification logic ([#157890](https://github.com/pytorch/pytorch/pull/157890)) - -## Python Frontend -- Speed up `torch.load` under `FakeTensorMode` by reducing random reads ([#157931](https://github.com/pytorch/pytorch/pull/157931)) -- Make `torch.utils.benchmark.utils.timer` accelerator agnostic ([#157131](https://github.com/pytorch/pytorch/pull/157131)) -- Improve error message for weight-only load errors ([#159935](https://github.com/pytorch/pytorch/pull/159935)) +## C++ Frontend +- Generalized `AllocatorConfig` to be device-agnostic via new `AcceleratorAllocatorConfig` ([#149601](https://github.com/pytorch/pytorch/pull/149601), [#150312](https://github.com/pytorch/pytorch/pull/150312)) +- Added `Scalar::isUnsigned()` method ([#159877](https://github.com/pytorch/pytorch/pull/159877)) +- Exposed `ModelRunner` from nativert as public ([#159989](https://github.com/pytorch/pytorch/pull/159989)) +- Improve error message for `torch.binomial` enforcing float inputs ([#157658](https://github.com/pytorch/pytorch/pull/157658)) -## Quantization -- Avoid getting model device once per node for pt2e quantization flow ([#159901](https://github.com/pytorch/pytorch/pull/159901)) -- Fixes bug in implementation of `HistogramObserver` ([#156457](https://github.com/pytorch/pytorch/pull/156457)) -- Support `bias=None` for `fbgemm_linear_fp16_weight` CPU op ([#158535](https://github.com/pytorch/pytorch/pull/158535)) -- Add Static Dispatch Kernel for `wrapped_fbgemm_linear_fp16_weight` for Sigmoid ([#160451](https://github.com/pytorch/pytorch/pull/160451)) +## Build Frontend +- Fix dev warning in `Dependencies.cmake` ([#159702](https://github.com/pytorch/pytorch/pull/159702)) +- Fix building system gloo with CUDA/HIP ([#146637](https://github.com/pytorch/pytorch/pull/146637)) +- Build `libtorch` without NVSHMEM ([#160910](https://github.com/pytorch/pytorch/pull/160910)) +- Improve BLAS feature detection ([#143846](https://github.com/pytorch/pytorch/pull/143846)) ## Release Engineering - Enable vLLM testing workflow ([#160583](https://github.com/pytorch/pytorch/pull/160583)) ([#161565](https://github.com/pytorch/pytorch/pull/161565)) ([#162292](https://github.com/pytorch/pytorch/pull/162292)) ([#162000](https://github.com/pytorch/pytorch/pull/162000)) ([#161797](https://github.com/pytorch/pytorch/pull/161797)) - Enable Windows ARM64 CI testing ([#148753](https://github.com/pytorch/pytorch/pull/148753)) ([#161504](https://github.com/pytorch/pytorch/pull/161504)) - Enable PyTorch ROCm CI for MI355X testing. 
([#158889](https://github.com/pytorch/pytorch/pull/158889)) +## CUDA +- Make cublaslt/hipblaslt workspaces persistent ([#156495](https://github.com/pytorch/pytorch/pull/156495)) +- Remove unnecessary warnings during the ATen compilation process ([#157703](https://github.com/pytorch/pytorch/pull/157703)) +- Slightly improve error message from `repeat_interleave` kernel ([#157996](https://github.com/pytorch/pytorch/pull/157996)) +- Add framework for explanations for common CUDA errors ([#158395](https://github.com/pytorch/pytorch/pull/158395)) +- Upgrade KernelLauncher `kernelLaunchCheck` to print help string ([#158896](https://github.com/pytorch/pytorch/pull/158896)) +- Prep for cutlass upgrade by ignoring `Wunused-but-set-variable` ([#159276](https://github.com/pytorch/pytorch/pull/159276)) +- Workaround ATen SFINAE under `libc++` ([#161101](https://github.com/pytorch/pytorch/pull/161101)) +- Implement changes to CCCL (CUB/Thrust/LibCUDACXX) usage in ATen ([#153373](https://github.com/pytorch/pytorch/pull/153373)) +- Add maybe unused flag to remove warning ([#157655](https://github.com/pytorch/pytorch/pull/157655)) +- Use new CCCL API in v2.8 ([#160554](https://github.com/pytorch/pytorch/pull/160554)) +- Improve cupy device placement when device is provided with explicit index ([#158529](https://github.com/pytorch/pytorch/pull/158529)) + +## CPU (AArch64) +- Made PyTorch compilable with gcc-14 on ARM ([#157867](https://github.com/pytorch/pytorch/pull/157867)) + +## MPS +- Add `shifted_chebyshev_polynomial_[tuvw]`, `igamma/igammac,grid_sampler_3d, native_dropout`/`native_dropout_backward` ([\#157488](https://github.com/pytorch/pytorch/pull/157488), [\#161927](https://github.com/pytorch/pytorch/pull/161927), [\#160541](https://github.com/pytorch/pytorch/pull/160541), [\#162108](https://github.com/pytorch/pytorch/pull/162108)) +- Extend atomic operations to all int types ([\#158179](https://github.com/pytorch/pytorch/pull/158179)) +- Extend `index_put` to complex types ([\#160159](https://github.com/pytorch/pytorch/pull/160159)) +- Extend `addmm` to integral types ([\#160270](https://github.com/pytorch/pytorch/pull/160270)) +- Add support for unsigned types ([\#159094](https://github.com/pytorch/pytorch/pull/159094)) +- Add API to query GPU core count ([\#160414](https://github.com/pytorch/pytorch/pull/160414)) +- Add `kthvalue` ([\#161817](https://github.com/pytorch/pytorch/pull/161817)) +- Type-promote tensor-iterator common dtype ([\#160334](https://github.com/pytorch/pytorch/pull/160334)) +- Implement `logcumsumexp` metal kernel ([\#156858](https://github.com/pytorch/pytorch/pull/156858)) +- Enable `dlpack` integration ([\#158888](https://github.com/pytorch/pytorch/pull/158888)) +- Dynamic reductions ([\#159355](https://github.com/pytorch/pytorch/pull/159355)) +- Update `avg_pool2d` to use Metal kernel when `ceil_mode=True` ([\#161011](https://github.com/pytorch/pytorch/pull/161011)) + ## ROCm - Additional hipify mappings ([#158056](https://github.com/pytorch/pytorch/pull/158056), [#158352](https://github.com/pytorch/pytorch/pull/158352), [#161992](https://github.com/pytorch/pytorch/pull/161992)) - Refactor `composable_kernel` (CK) backend user interface to improve user experience ([#152951](https://github.com/pytorch/pytorch/pull/152951)) @@ -453,31 +448,17 @@ We move enabling `pin_memory` back inside `BaseDataLoaderIter`. 
This is required - Add `device_id` to Intel GPU properties to distinguish iGPUs with identical names ([#156481](https://github.com/pytorch/pytorch/pull/156481)) # Bug Fixes +## Python Frontend +- Add option in `torch.utils.cpp_extension.load_inline` to override gencode ([#156850](https://github.com/pytorch/pytorch/pull/156850)) +- Fix `max_width` computation in Tensor printing ([#126859](https://github.com/pytorch/pytorch/pull/126859)) +- Improve `pin_memory` error message on CPU-only systems ([#159994](https://github.com/pytorch/pytorch/pull/159994)) +- Making batching rule for `F.embedding` DTensor-aware ([#162117](https://github.com/pytorch/pytorch/pull/162117)) + ## Autograd - Fix `torch.autograd.Function` memory leak due to `torch.utils.checkpiont` early stopping ([#161171](https://github.com/pytorch/pytorch/pull/161171)) - Fix `torch.autograd.graph.GradientEdge` for `torch.autograd.Function` ([#160098](https://github.com/pytorch/pytorch/pull/160098)) - Match 0-dim gradients device type regardless of subclass-ness ([#160165](https://github.com/pytorch/pytorch/pull/160165)) -## Build Frontend -- Turn on `BUILD_BUNDLEPTXAS=1` to allow compile on newer GPUs([#163988](https://github.com/pytorch/pytorch/pull/163988)) - -## C++ Frontend -- Fix `torch.utils.cpp_extension` parser for clang version 20.1.7+libcxx ([#157666](https://github.com/pytorch/pytorch/pull/157666)) -- Fix `MakeTensor::computeStorageSize()` calculation ([#158690](https://github.com/pytorch/pytorch/pull/158690)) -- Fix static initialization order issue with `AllocatorConfig` ([#159629](https://github.com/pytorch/pytorch/pull/159629)) - -## CPU -- Add check so non-aarch64 platforms can hit `MKLDNN` path ([#162168](https://github.com/pytorch/pytorch/pull/162168)) - -## CUDA -- Handle uninitialized `torch.backends.cuda.matmul.fp32_precision` ([#161102](https://github.com/pytorch/pytorch/pull/161102)) -- Fix nansum in non-JIT build ([#158633](https://github.com/pytorch/pytorch/pull/158633)) -- Decrease launch bounds of CTCLoss backward for blackwell to avoid crash ([#159522](https://github.com/pytorch/pytorch/pull/159522)) -- Implement workaround for `cudaErrorNotSupported` ([#162412](https://github.com/pytorch/pytorch/pull/162412)) -- Fix missing `__syncthreads` in MultiMarginLoss backward ([#158994](https://github.com/pytorch/pytorch/pull/158994)) -- Roll-back cuDNN frontend upgrade and update Meta registration due to compile issues ([#163104](https://github.com/pytorch/pytorch/pull/163104)) -- Disable cuDNN for 3D convolutions with `kernel size != 1` for cuDNN 9.8+ ([#163581](https://github.com/pytorch/pytorch/pull/163581)) - ## Distributed ### c10d - Fix slow init due to repeated dns resolution failure in socket ([#159596](https://github.com/pytorch/pytorch/pull/159596)) @@ -505,31 +486,16 @@ We move enabling `pin_memory` back inside `BaseDataLoaderIter`. 
This is required ### Pipeline Parallelism (PP) - Fix eval step under `no_grad()` ([#159293](https://github.com/pytorch/pytorch/pull/159293)) - Fix zero bubble schedules for `eval()` ([#159475](https://github.com/pytorch/pytorch/pull/159475)) -### TorchElastic - - Fix wrong log file name in the docs of `torch.distributed.elastic.multiprocessing.start_processes()` ([#160396](https://github.com/pytorch/pytorch/pull/160396)) ### TensorPipe - Fix `import torch` if compiled without `TensorPipe` ([#159461](https://github.com/pytorch/pytorch/pull/159461)) +### TorchElastic + - Fix wrong log file name in the docs of `torch.distributed.elastic.multiprocessing.start_processes()` ([#160396](https://github.com/pytorch/pytorch/pull/160396)) -## Dynamo -- Fix segfault due to interaction between Dynamo backends and `torch.compiler.reset()` ([#156527](https://github.com/pytorch/pytorch/pull/156527)) -- Fix crash due to bad interaction with recompilations and with blocks in Python 3.11+ ([#162318](https://github.com/pytorch/pytorch/pull/162318)) - -## Export -- Fix bug in constants lifting pass ([#157719](https://github.com/pytorch/pytorch/pull/157719)) -- Fix `from_node` provenance in unlift pass ([#157943](https://github.com/pytorch/pytorch/pull/157943)) -- Fix `NaN` serialization ([#155359](https://github.com/pytorch/pytorch/pull/155359)) -- Fix deserialization for unbacked symbol ranges ([#158681](https://github.com/pytorch/pytorch/pull/158681)) -- Fix runtime assert handling in deserialization ([#159060](https://github.com/pytorch/pytorch/pull/159060)) -- Fix for FQN handling in unflattener ([#159418](https://github.com/pytorch/pytorch/pull/159418)) -- Fix `nn_module_stack` for `assert_tensor_metadata` nodes ([#159625](https://github.com/pytorch/pytorch/pull/159625)) -- Fix usage for `move_to_device_pass` ([#159992](https://github.com/pytorch/pytorch/pull/159992), [#160528](https://github.com/pytorch/pytorch/pull/160528), [#162301](https://github.com/pytorch/pytorch/pull/162301)) -- Avoid name overwrites for aliased exported module parameters ([#160600](https://github.com/pytorch/pytorch/pull/160600)) -- Avoid inling `dynamo.disables` in unflattening ([#161306](https://github.com/pytorch/pytorch/pull/161306)) -- Fix deserialization issue for storage offset ([#162172](https://github.com/pytorch/pytorch/pull/162172)) -- Remove `.contiguous()` when saving weights to raw bytes to preserve original storage size of tensor ([#163587](https://github.com/pytorch/pytorch/pull/163587)) +## Linear Algebra Frontend +- Avoid downcasts for fp16 matmul on the BLAS backend ([#161999](https://github.com/pytorch/pytorch/pull/161999)) -## Foreach -- `chunk_size` should always be `int64_t` for Foreach functors ([#156872](https://github.com/pytorch/pytorch/pull/156872)) +## Profiler +- Fix Linter for Global Annotations flag in Snapshot ([#157858](https://github.com/pytorch/pytorch/pull/157858)) ## FX - Fix `split_module` with symint ([#160093](https://github.com/pytorch/pytorch/pull/160093)) @@ -537,6 +503,10 @@ We move enabling `pin_memory` back inside `BaseDataLoaderIter`. 
This is required - Skip const folding with symbolic expression ([#161437](https://github.com/pytorch/pytorch/pull/161437)) - Fix qualified name for methods of `torch.Tensor` ([#162224](https://github.com/pytorch/pytorch/pull/162224)) +## Dynamo +- Fix segfault due to interaction between Dynamo backends and `torch.compiler.reset()` ([#156527](https://github.com/pytorch/pytorch/pull/156527)) +- Fix crash due to bad interaction with recompilations and with blocks in Python 3.11+ ([#162318](https://github.com/pytorch/pytorch/pull/162318)) + ## Inductor - Fix wrong meta function for `constant_pad_nd` ([#159878](https://github.com/pytorch/pytorch/pull/159878)) - Fix learnable bias assertion error in Inductor ([#161170](https://github.com/pytorch/pytorch/pull/161170)) @@ -556,12 +526,53 @@ We move enabling `pin_memory` back inside `BaseDataLoaderIter`. This is required - Explicitly delete `wait_tensor` returned tensor ([#159502](https://github.com/pytorch/pytorch/pull/159502)) - Fix memory leak from `all_reduce` ([#159818](https://github.com/pytorch/pytorch/pull/159818)) -## JIT -- Make `ErrorReport::CallStack` thread-safe ([#160386](https://github.com/pytorch/pytorch/pull/160386)) -- Fix `RemoveProfileNodesAndSpecializeTypes` handling for `Tensor?` that is resolved to `None` ([#161538](https://github.com/pytorch/pytorch/pull/161538)) +## Export +- Fix bug in constants lifting pass ([#157719](https://github.com/pytorch/pytorch/pull/157719)) +- Fix `from_node` provenance in unlift pass ([#157943](https://github.com/pytorch/pytorch/pull/157943)) +- Fix `NaN` serialization ([#155359](https://github.com/pytorch/pytorch/pull/155359)) +- Fix deserialization for unbacked symbol ranges ([#158681](https://github.com/pytorch/pytorch/pull/158681)) +- Fix runtime assert handling in deserialization ([#159060](https://github.com/pytorch/pytorch/pull/159060)) +- Fix for FQN handling in unflattener ([#159418](https://github.com/pytorch/pytorch/pull/159418)) +- Fix `nn_module_stack` for `assert_tensor_metadata` nodes ([#159625](https://github.com/pytorch/pytorch/pull/159625)) +- Fix usage for `move_to_device_pass` ([#159992](https://github.com/pytorch/pytorch/pull/159992), [#160528](https://github.com/pytorch/pytorch/pull/160528), [#162301](https://github.com/pytorch/pytorch/pull/162301)) +- Avoid name overwrites for aliased exported module parameters ([#160600](https://github.com/pytorch/pytorch/pull/160600)) +- Avoid inling `dynamo.disables` in unflattening ([#161306](https://github.com/pytorch/pytorch/pull/161306)) +- Fix deserialization issue for storage offset ([#162172](https://github.com/pytorch/pytorch/pull/162172)) +- Remove `.contiguous()` when saving weights to raw bytes to preserve original storage size of tensor ([#163587](https://github.com/pytorch/pytorch/pull/163587)) -## Linear Algebra Frontend -- Avoid downcasts for fp16 matmul on the BLAS backend ([#161999](https://github.com/pytorch/pytorch/pull/161999)) +## Quantization +- Avoid `NaN` in fp8 output of CPU `qlinear` and `qconv` ops ([#160957](https://github.com/pytorch/pytorch/pull/160957)) +- Fix segmentation fault when `choose_qparams_optimized` ([#161966](https://github.com/pytorch/pytorch/pull/161966)) + +## Foreach +- `chunk_size` should always be `int64_t` for Foreach functors ([#156872](https://github.com/pytorch/pytorch/pull/156872)) + +## ONNX +- Make onnx export SDPA match ATen behavior ([#159973](https://github.com/pytorch/pytorch/pull/159973)) +- Fix `rotary_embedding_23` implementation 
([#162865](https://github.com/pytorch/pytorch/pull/162865)) +- Fix export behavior when model has `None` as output ([#160200](https://github.com/pytorch/pytorch/pull/160200)) +- Fix lower opset version support in `dynamo=True` ([#161056](https://github.com/pytorch/pytorch/pull/161056)) +- Fix `index_put_` usage ([#161263](https://github.com/pytorch/pytorch/pull/161263)) + +## C++ Frontend +- Fix `torch.utils.cpp_extension` parser for clang version 20.1.7+libcxx ([#157666](https://github.com/pytorch/pytorch/pull/157666)) +- Fix `MakeTensor::computeStorageSize()` calculation ([#158690](https://github.com/pytorch/pytorch/pull/158690)) +- Fix static initialization order issue with `AllocatorConfig` ([#159629](https://github.com/pytorch/pytorch/pull/159629)) + +## Build Frontend +- Turn on `BUILD_BUNDLEPTXAS=1` to allow compile on newer GPUs([#163988](https://github.com/pytorch/pytorch/pull/163988)) + +## CUDA +- Handle uninitialized `torch.backends.cuda.matmul.fp32_precision` ([#161102](https://github.com/pytorch/pytorch/pull/161102)) +- Fix nansum in non-JIT build ([#158633](https://github.com/pytorch/pytorch/pull/158633)) +- Decrease launch bounds of CTCLoss backward for blackwell to avoid crash ([#159522](https://github.com/pytorch/pytorch/pull/159522)) +- Implement workaround for `cudaErrorNotSupported` ([#162412](https://github.com/pytorch/pytorch/pull/162412)) +- Fix missing `__syncthreads` in MultiMarginLoss backward ([#158994](https://github.com/pytorch/pytorch/pull/158994)) +- Roll-back cuDNN frontend upgrade and update Meta registration due to compile issues ([#163104](https://github.com/pytorch/pytorch/pull/163104)) +- Disable cuDNN for 3D convolutions with `kernel size != 1` for cuDNN 9.8+ ([#163581](https://github.com/pytorch/pytorch/pull/163581)) + +## CPU +- Add check so non-aarch64 platforms can hit `MKLDNN` path ([#162168](https://github.com/pytorch/pytorch/pull/162168)) ## MPS - Fix batch norm incorrect gradient ([#156867](https://github.com/pytorch/pytorch/pull/156867)) @@ -581,26 +592,6 @@ We move enabling `pin_memory` back inside `BaseDataLoaderIter`. 
This is required - Migrate round unary op to Metal ([#161712](https://github.com/pytorch/pytorch/pull/161712)) - Type-promote tensor-iterator common dtype ([#160334](https://github.com/pytorch/pytorch/pull/160334)) -## ONNX -- Make onnx export SDPA match ATen behavior ([#159973](https://github.com/pytorch/pytorch/pull/159973)) -- Fix `rotary_embedding_23` implementation ([#162865](https://github.com/pytorch/pytorch/pull/162865)) -- Fix export behavior when model has `None` as output ([#160200](https://github.com/pytorch/pytorch/pull/160200)) -- Fix lower opset version support in `dynamo=True` ([#161056](https://github.com/pytorch/pytorch/pull/161056)) -- Fix `index_put_` usage ([#161263](https://github.com/pytorch/pytorch/pull/161263)) - -## Profiler -- Fix Linter for Global Annotations flag in Snapshot ([#157858](https://github.com/pytorch/pytorch/pull/157858)) - -## Python Frontend -- Add option in `torch.utils.cpp_extension.load_inline` to override gencode ([#156850](https://github.com/pytorch/pytorch/pull/156850)) -- Fix `max_width` computation in Tensor printing ([#126859](https://github.com/pytorch/pytorch/pull/126859)) -- Improve `pin_memory` error message on CPU-only systems ([#159994](https://github.com/pytorch/pytorch/pull/159994)) -- Making batching rule for `F.embedding` DTensor-aware ([#162117](https://github.com/pytorch/pytorch/pull/162117)) - -## Quantization -- Avoid `NaN` in fp8 output of CPU `qlinear` and `qconv` ops ([#160957](https://github.com/pytorch/pytorch/pull/160957)) -- Fix segmentation fault when `choose_qparams_optimized` ([#161966](https://github.com/pytorch/pytorch/pull/161966)) - ## ROCm - Fix Inductor with cudagraph trees `hip:0` device error ([#161221](https://github.com/pytorch/pytorch/pull/161221)) - Fix some build failures and support some BLAS calls on Windows ([#161981](https://github.com/pytorch/pytorch/pull/161981)) @@ -614,27 +605,40 @@ We move enabling `pin_memory` back inside `BaseDataLoaderIter`. 
This is required ## XPU - Fix `cpp_extension` compatibility with `intel-deep-learning-essentials-2025.2` ([#161012](https://github.com/pytorch/pytorch/pull/161012)) +## JIT +- Make `ErrorReport::CallStack` thread-safe ([#160386](https://github.com/pytorch/pytorch/pull/160386)) +- Fix `RemoveProfileNodesAndSpecializeTypes` handling for `Tensor?` that is resolved to `None` ([#161538](https://github.com/pytorch/pytorch/pull/161538)) + # Performance +## Optimizer +- Use `addmm` to improve Newton–Schulz orthogonalization in Muon ([#161379](https://github.com/pytorch/pytorch/pull/161379)) +- Avoid stream sync in SWA `AveragedModel.update_parameters()` ([#157705](https://github.com/pytorch/pytorch/pull/157705)) + ## Autograd - Fix SVD forward-mode AD multiplication priority ([#161027](https://github.com/pytorch/pytorch/pull/161027)) -## CUDA -- Use a nonblocking copy to avoid stream synchronization for GPU tensor indexing with CPU mask ([#156384](https://github.com/pytorch/pytorch/pull/156384)) -- Disable cudagraph GCs by default to improve capture performance ([#158649](https://github.com/pytorch/pytorch/pull/158649)) - ## Dynamo - Recursive `dict` tag optimization for faster guard evaluation ([#159183](https://github.com/pytorch/pytorch/pull/159183)) -## Export -- Caching optimizations for placeholder naming pass ([#158594](https://github.com/pytorch/pytorch/pull/158594)) -- Add Static Dispatch Kernel for `fmod.Scalar` and `scale_gradient` ([#160654](https://github.com/pytorch/pytorch/pull/160654), [#160454](https://github.com/pytorch/pytorch/pull/160454)) - ## Inductor - Improve performance of A16W4 and A16W8 `GEMM` template ([#159127](https://github.com/pytorch/pytorch/pull/159127)) ([#161148](https://github.com/pytorch/pytorch/pull/161148)) - More aggressive persistent reduction ([#161055](https://github.com/pytorch/pytorch/pull/161055)) - Add a few outer dimension reduction cases for LOAF ([#162028](https://github.com/pytorch/pytorch/pull/162028)) - Fuse two RoPE kernels into a single kernel and improving runtime efficiency ([#161420](https://github.com/pytorch/pytorch/pull/161420)) +## Export +- Caching optimizations for placeholder naming pass ([#158594](https://github.com/pytorch/pytorch/pull/158594)) +- Add Static Dispatch Kernel for `fmod.Scalar` and `scale_gradient` ([#160654](https://github.com/pytorch/pytorch/pull/160654), [#160454](https://github.com/pytorch/pytorch/pull/160454)) + +## CUDA +- Use a nonblocking copy to avoid stream synchronization for GPU tensor indexing with CPU mask ([#156384](https://github.com/pytorch/pytorch/pull/156384)) +- Disable cudagraph GCs by default to improve capture performance ([#158649](https://github.com/pytorch/pytorch/pull/158649)) + +## Release Engineering +- Upgrade to ROCm 6.4.1 and 6.4.2 patch releases ([#156636](https://github.com/pytorch/pytorch/pull/156636)) ([#158887](https://github.com/pytorch/pytorch/pull/158887)) ([#158886](https://github.com/pytorch/pytorch/pull/158886)) ([#158651](https://github.com/pytorch/pytorch/pull/158651)) ([#159001](https://github.com/pytorch/pytorch/pull/159001)) +- Migrate RPyTorch ROCm CI to MI325 capacity ([#159059](https://github.com/pytorch/pytorch/pull/159059)) ([#159649](https://github.com/pytorch/pytorch/pull/159649)) ([#161184](https://github.com/pytorch/pytorch/pull/161184)) +- Enable B200 PyTorch benchmark testing ([#158011](https://github.com/pytorch/pytorch/pull/158011)) ([#157341](https://github.com/pytorch/pytorch/pull/157341)) + ## MPS - Optimize cummin/cummax metal kernels 
([\#156794](https://github.com/pytorch/pytorch/pull/156794)) - Speedup `torch.full` for 1-byte types ([\#158874](https://github.com/pytorch/pytorch/pull/158874)) @@ -643,15 +647,6 @@ We move enabling `pin_memory` back inside `BaseDataLoaderIter`. This is required - Avoid calling tensor ops in `max_pool3d` impl ([\#157874](https://github.com/pytorch/pytorch/pull/157874)) - Move `max_pool2d` to Metal for `stride != 1` ([\#157876](https://github.com/pytorch/pytorch/pull/157876)) -## Optimizer -- Use `addmm` to improve Newton–Schulz orthogonalization in Muon ([#161379](https://github.com/pytorch/pytorch/pull/161379)) -- Avoid stream sync in SWA `AveragedModel.update_parameters()` ([#157705](https://github.com/pytorch/pytorch/pull/157705)) - -## Release Engineering -- Upgrade to ROCm 6.4.1 and 6.4.2 patch releases ([#156636](https://github.com/pytorch/pytorch/pull/156636)) ([#158887](https://github.com/pytorch/pytorch/pull/158887)) ([#158886](https://github.com/pytorch/pytorch/pull/158886)) ([#158651](https://github.com/pytorch/pytorch/pull/158651)) ([#159001](https://github.com/pytorch/pytorch/pull/159001)) -- Migrate RPyTorch ROCm CI to MI325 capacity ([#159059](https://github.com/pytorch/pytorch/pull/159059)) ([#159649](https://github.com/pytorch/pytorch/pull/159649)) ([#161184](https://github.com/pytorch/pytorch/pull/161184)) -- Enable B200 PyTorch benchmark testing ([#158011](https://github.com/pytorch/pytorch/pull/158011)) ([#157341](https://github.com/pytorch/pytorch/pull/157341)) - ## ROCm - SDPA now uses AOTriton to 0.11b ([#161754](https://github.com/pytorch/pytorch/pull/161754)) - `hipblaslt` is used by default on gfx908 for ROCm >= 6.3 ([#159092](https://github.com/pytorch/pytorch/pull/159092)) @@ -665,6 +660,24 @@ We move enabling `pin_memory` back inside `BaseDataLoaderIter`. This is required - Enable tensor memory descriptor Triton template for Intel GPU ([#161600](https://github.com/pytorch/pytorch/pull/161600)) # Documentation +## Python Frontend +- Improve documentation for `torch.lobpcg`, `torch.clone`, `torch.matmul`, `torch.max`, `torch.gather`, `torch.Tensor.scatter_`, `torch.empty_like`, `torch.randint`, `torch.mul`, `torch.min`, `torch.max`. 
`torch.sort`, `torch.full_like`, `torch.histogramdd`, `torch.hamming_window` ([#156139](https://github.com/pytorch/pytorch/pull/156139), [#157007](https://github.com/pytorch/pytorch/pull/157007), [#161424](https://github.com/pytorch/pytorch/pull/161424), [#156153](https://github.com/pytorch/pytorch/pull/156153), [#157929](https://github.com/pytorch/pytorch/pull/157929), [#157920](https://github.com/pytorch/pytorch/pull/157920), [#158050](https://github.com/pytorch/pytorch/pull/158050), [#158731](https://github.com/pytorch/pytorch/pull/158731), [#160312](https://github.com/pytorch/pytorch/pull/160312), [#161539](https://github.com/pytorch/pytorch/pull/161539), [#162051](https://github.com/pytorch/pytorch/pull/162051), [#158275](https://github.com/pytorch/pytorch/pull/158275), [#152682](https://github.com/pytorch/pytorch/pull/152682)) +- Remove torchscript related sections in serialization docs ([#156648](https://github.com/pytorch/pytorch/pull/156648)) +- Fix typo in `torch.set_float32_matmul_precision` docs ([#158191](https://github.com/pytorch/pytorch/pull/158191)) +- Fix docstring for `torch.nn.utils.clip_grads_with_norm_` to reflect clamping behavior ([#158200](https://github.com/pytorch/pytorch/pull/158200)) +- Fix the Doc issue on the description of edge_order in `torch.gradient` ([#159130](https://github.com/pytorch/pytorch/pull/159130)) +- Add `torch.segment_reduce` docs ([#154352](https://github.com/pytorch/pytorch/pull/154352)) +- Add examples to `torch.is_floating_point` and `torch.is_complex` docs ([#161951](https://github.com/pytorch/pytorch/pull/161951)) +## torch.nn +- Improve description of `padding` for `avg_poolnd` ([#159142](https://github.com/pytorch/pytorch/pull/159142)) +- Improve `CrossEntropyLoss` docs with example of incorrect target specification ([#155649](https://github.com/pytorch/pytorch/pull/155649)) +- Remove redundant dtype conversion in `scaled_dot_product_attention` example ([#161613](https://github.com/pytorch/pytorch/pull/161613)) + +## Optimizer +- Document specific optimizer modules APIs e.g., `torch.optim.adam.Adam`, properly ([#158483](https://github.com/pytorch/pytorch/pull/158483), [#158669](https://github.com/pytorch/pytorch/pull/158669), [#160194](https://github.com/pytorch/pytorch/pull/160194)) +- Add note for clarity in Adafactor doc #154862 ([#155248](https://github.com/pytorch/pytorch/pull/155248)) +- Minorly improve `zero_grad` description ([#161239](https://github.com/pytorch/pytorch/pull/161239)) + ## Autograd - Improve `torch.inference_mode` docs and error message ([#161164](https://github.com/pytorch/pytorch/pull/161164)) @@ -678,8 +691,10 @@ We move enabling `pin_memory` back inside `BaseDataLoaderIter`. This is required ### FullyShardedDataParallel (FSDP) - Removed FSDP1 developer note ([#158991](https://github.com/pytorch/pytorch/pull/158991)) -## Export -- Update docs around draft export, dynamism, and PT2 Archive ([#157750](https://github.com/pytorch/pytorch/pull/157750)) +## Profiler +- Update PT2 Profiler Torch-Compiled Region Image ([#158066](https://github.com/pytorch/pytorch/pull/158066)) +- Fix Experimental Config Documentatation([#156586](https://github.com/pytorch/pytorch/pull/156586)) +- Update README ([#159816](https://github.com/pytorch/pytorch/pull/159816)) ## FX - Fix typos in `torch/` (`torch/fx/`) ([#156604](https://github.com/pytorch/pytorch/pull/156604)) @@ -690,10 +705,8 @@ We move enabling `pin_memory` back inside `BaseDataLoaderIter`. 
This is required ## Inductor - Add documentation for CUDAGraph partition ([#159450](https://github.com/pytorch/pytorch/pull/159450)) -## torch.nn -- Improve description of `padding` for `avg_poolnd` ([#159142](https://github.com/pytorch/pytorch/pull/159142)) -- Improve `CrossEntropyLoss` docs with example of incorrect target specification ([#155649](https://github.com/pytorch/pytorch/pull/155649)) -- Remove redundant dtype conversion in `scaled_dot_product_attention` example ([#161613](https://github.com/pytorch/pytorch/pull/161613)) +## Export +- Update docs around draft export, dynamism, and PT2 Archive ([#157750](https://github.com/pytorch/pytorch/pull/157750)) ## ONNX - Update export docstring ([#162622](https://github.com/pytorch/pytorch/pull/162622)) @@ -704,26 +717,6 @@ We move enabling `pin_memory` back inside `BaseDataLoaderIter`. This is required - Update export docstring and set `fallback=False` by default ([#162622](https://github.com/pytorch/pytorch/pull/162622), [#162726](https://github.com/pytorch/pytorch/pull/162726)) - Fix typo in error message: summit -> submit ([#162587](https://github.com/pytorch/pytorch/pull/162587)) - -## Optimizer -- Document specific optimizer modules APIs e.g., `torch.optim.adam.Adam`, properly ([#158483](https://github.com/pytorch/pytorch/pull/158483), [#158669](https://github.com/pytorch/pytorch/pull/158669), [#160194](https://github.com/pytorch/pytorch/pull/160194)) -- Add note for clarity in Adafactor doc #154862 ([#155248](https://github.com/pytorch/pytorch/pull/155248)) -- Minorly improve `zero_grad` description ([#161239](https://github.com/pytorch/pytorch/pull/161239)) - -## Profiler -- Update PT2 Profiler Torch-Compiled Region Image ([#158066](https://github.com/pytorch/pytorch/pull/158066)) -- Fix Experimental Config Documentatation([#156586](https://github.com/pytorch/pytorch/pull/156586)) -- Update README ([#159816](https://github.com/pytorch/pytorch/pull/159816)) - -## Python Frontend -- Improve documentation for `torch.lobpcg`, `torch.clone`, `torch.matmul`, `torch.max`, `torch.gather`, `torch.Tensor.scatter_`, `torch.empty_like`, `torch.randint`, `torch.mul`, `torch.min`, `torch.max`. 
`torch.sort`, `torch.full_like`, `torch.histogramdd`, `torch.hamming_window` ([#156139](https://github.com/pytorch/pytorch/pull/156139), [#157007](https://github.com/pytorch/pytorch/pull/157007), [#161424](https://github.com/pytorch/pytorch/pull/161424), [#156153](https://github.com/pytorch/pytorch/pull/156153), [#157929](https://github.com/pytorch/pytorch/pull/157929), [#157920](https://github.com/pytorch/pytorch/pull/157920), [#158050](https://github.com/pytorch/pytorch/pull/158050), [#158731](https://github.com/pytorch/pytorch/pull/158731), [#160312](https://github.com/pytorch/pytorch/pull/160312), [#161539](https://github.com/pytorch/pytorch/pull/161539), [#162051](https://github.com/pytorch/pytorch/pull/162051), [#158275](https://github.com/pytorch/pytorch/pull/158275), [#152682](https://github.com/pytorch/pytorch/pull/152682)) -- Remove torchscript related sections in serialization docs ([#156648](https://github.com/pytorch/pytorch/pull/156648)) -- Fix typo in `torch.set_float32_matmul_precision` docs ([#158191](https://github.com/pytorch/pytorch/pull/158191)) -- Fix docstring for `torch.nn.utils.clip_grads_with_norm_` to reflect clamping behavior ([#158200](https://github.com/pytorch/pytorch/pull/158200)) -- Fix the Doc issue on the description of edge_order in `torch.gradient` ([#159130](https://github.com/pytorch/pytorch/pull/159130)) -- Add `torch.segment_reduce` docs ([#154352](https://github.com/pytorch/pytorch/pull/154352)) -- Add examples to `torch.is_floating_point` and `torch.is_complex` docs ([#161951](https://github.com/pytorch/pytorch/pull/161951)) - ## Release Engineering - Add decorator to create deprecation warnings ([#155127](https://github.com/pytorch/pytorch/pull/155127)) - Add runnable code examples to export documentation ([#158506](https://github.com/pytorch/pytorch/pull/158506)) @@ -737,13 +730,8 @@ We move enabling `pin_memory` back inside `BaseDataLoaderIter`. This is required - Don't store flamegraph to tmp folder ([#157374](https://github.com/pytorch/pytorch/pull/157374)) # Developers -## Composability -- Stop suggesting to use `guard_size_oblivious` on data dependent errors ([#160510](https://github.com/pytorch/pytorch/pull/160510)) -- Avoid unnecessary slices resulting in data-dependent errors ([#157528](https://github.com/pytorch/pytorch/pull/157528)) - -## Dataloader Frontend -- Add `torch.utils.data` samplers benchmark script ([#156974](https://github.com/pytorch/pytorch/pull/156974)) -- Add `torch.utils.data.Dataloader` benchmark script ([#159432](https://github.com/pytorch/pytorch/pull/159432)) +## Python Frontend +- Better sample inputs for addmm OpInfo ([#160234](https://github.com/pytorch/pytorch/pull/160234)) ## Distributed ### c10d @@ -752,12 +740,12 @@ We move enabling `pin_memory` back inside `BaseDataLoaderIter`. 
This is required - Add `check_rng_sync` util ([#160283](https://github.com/pytorch/pytorch/pull/160283)) - Add `FlightRecorder` support for `ProcessGroupXCCL` ([#158568](https://github.com/pytorch/pytorch/pull/158568)) - Add `early_stop` kwarg to `torch.utils.checkpoint` ([#160781](https://github.com/pytorch/pytorch/pull/160781)) -### Device Mesh - - Add error when users try to slice non contiguous flattened dim submesh ([#157523](https://github.com/pytorch/pytorch/pull/157523)) - - Make the repr shorter when debug ENV not set ([#158822](https://github.com/pytorch/pytorch/pull/158822)) ### DTensor - Wrap sharding prop error with contextual exception ([#161574](https://github.com/pytorch/pytorch/pull/161574)) - Add check if tracing for sharding propagation to handle un-hashable keys in DTensor ([#160798](https://github.com/pytorch/pytorch/pull/160798)) +### Device Mesh + - Add error when users try to slice non contiguous flattened dim submesh ([#157523](https://github.com/pytorch/pytorch/pull/157523)) + - Make the repr shorter when debug ENV not set ([#158822](https://github.com/pytorch/pytorch/pull/158822)) ### ShardedTensor - Make error message descriptive in ShardedTensor creation (#150627) ([#159423](https://github.com/pytorch/pytorch/pull/159423)) ### Pipeline Parallelism (PP) @@ -793,13 +781,18 @@ We move enabling `pin_memory` back inside `BaseDataLoaderIter`. This is required - Print out error msg when nvcc compiler fails ([#157203](https://github.com/pytorch/pytorch/pull/157203)) - Add kernel information JSON generation for AOTI packages ([#160540](https://github.com/pytorch/pytorch/pull/160540)) -## Python Frontend -- Better sample inputs for addmm OpInfo ([#160234](https://github.com/pytorch/pytorch/pull/160234)) +## Composability +- Stop suggesting to use `guard_size_oblivious` on data dependent errors ([#160510](https://github.com/pytorch/pytorch/pull/160510)) +- Avoid unnecessary slices resulting in data-dependent errors ([#157528](https://github.com/pytorch/pytorch/pull/157528)) ## Quantization - Revamp dtype documentation ([#156087](https://github.com/pytorch/pytorch/pull/156087)) - Use new type statement to fix public API of types ([#158487](https://github.com/pytorch/pytorch/pull/158487)) +## Dataloader Frontend +- Add `torch.utils.data` samplers benchmark script ([#156974](https://github.com/pytorch/pytorch/pull/156974)) +- Add `torch.utils.data.Dataloader` benchmark script ([#159432](https://github.com/pytorch/pytorch/pull/159432)) + ## Release Engineering - Replace `setup.py develop` with `pip install -e` for development builds ([#155998](https://github.com/pytorch/pytorch/pull/155998)) ([#156027](https://github.com/pytorch/pytorch/pull/156027)) ([#156710](https://github.com/pytorch/pytorch/pull/156710)) ([#156709](https://github.com/pytorch/pytorch/pull/156709)) From ad2363f386c478d0122312224758d72ad8dd88a3 Mon Sep 17 00:00:00 2001 From: Angel Li Date: Tue, 7 Oct 2025 14:48:35 -0700 Subject: [PATCH 5/6] adding cherry picks --- 2.9.0/final.md | 37 +++++++++++++++++++++++++++++++++---- 1 file changed, 33 insertions(+), 4 deletions(-) diff --git a/2.9.0/final.md b/2.9.0/final.md index 963c5d8..2f709ac 100644 --- a/2.9.0/final.md +++ b/2.9.0/final.md @@ -36,7 +36,7 @@ See the PR for details on the exact changes and how to update your code. ## Raise appropriate errors in `torch.cat` ([#158249](https://github.com/pytorch/pytorch/pull/158249)) -`torch.cat` now raises `ValueError`, `IndexError` or `TypeError` where appropriate instead of the generic `RuntimeError`. 
If you code was catching these error, you can update to catch the new error type.
+`torch.cat` now raises `ValueError`, `IndexError` or `TypeError` where appropriate instead of the generic `RuntimeError`. If your code was catching these errors, you can update to catch the new error type.
## Default to `dynamo=True` for ONNX exporter ([#159646](https://github.com/pytorch/pytorch/pull/159646), [#162726](https://github.com/pytorch/pytorch/pull/162726))
@@ -63,7 +63,7 @@ torch.onnx.export(...)
Recommendation: first try the new default; only fall back if you hit blocking issues and report them upstream.
Long term solution: fix the root cause instead of relying on fallback or TorchScript exporter.
-## Switch off runtime asserts by default in favor of a shape guards function ([#160111](https://github.com/pytorch/pytorch/pull/160111), [#161178](https://github.com/pytorch/pytorch/pull/161178), [#161794](https://github.com/pytorch/pytorch/pull/161794))
+## In Export, switch off runtime asserts by default in favor of a shape guards function ([#160111](https://github.com/pytorch/pytorch/pull/160111), [#161178](https://github.com/pytorch/pytorch/pull/161178), [#161794](https://github.com/pytorch/pytorch/pull/161794))
@@ -71,7 +71,7 @@ To enable runtime asserts, use `export(..., prefer_deferred_runtime_asserts_over
Additionally, `exported_program.module()` will generate a call to a `_guards_fn` submodule that will run additional checks on inputs. Users who do not want this behavior can either remove this call in the graph, or do `exported_program.module(check_guards=False)` to avoid the generation.
-## Set default opset to 20 ([#158802](https://github.com/pytorch/pytorch/pull/158802))
+## Set default opset to 20 in ONNX ([#158802](https://github.com/pytorch/pytorch/pull/158802))
@@ -132,7 +132,7 @@ The experimental ONNX Runtime compile backend (`torch.compile(backend="onnxrt")`
The `dynamo=True` mode uses `FakeTensor`s by default which is memory efficient.
-## Some public facing utility APIs for the TorchScript based exporter are now private ([#161323](https://github.com/pytorch/pytorch/pull/161323))
+## In ONNX, some public facing utility APIs for the TorchScript based exporter are now private ([#161323](https://github.com/pytorch/pytorch/pull/161323))
Deprecated members in `torch.onnx.verification` are removed. Previously private `torch.onnx.symbolic_opsets*` functions will no longer be accessible. Consider making a copy of the source code if you need to access any private functions for compatibility with the TorchScript based exporter.
@@ -172,6 +172,21 @@ We move enabling `pin_memory` back inside `BaseDataLoaderIter`. This is required
## Dynamo
- Experimental API for ahead-of-time compiling models in fullgraph mode ([#161383](https://github.com/pytorch/pytorch/pull/161383))
- Add a hook for recompilations ([#157961](https://github.com/pytorch/pytorch/pull/157961))
+- DynamicInts prototype ([#162194](https://github.com/pytorch/pytorch/pull/162194))
+
+Introduces an API for annotating dynamic integer inputs & attributes for `torch.compile`, by wrapping plain ints with `DynamicInt()`.
+DynamicInt objects also work in eager mode, acting as their underlying values when passed as scalar inputs. + +```python +a = DynamicInt(4) +y = a + 2 # DynamicInt(6) +z = torch.ones(a) # torch.ones(4) + +fn = torch.compile(torch.ones) +fn(a) # compiled fn takes a dynamic integer input +fn(2) # returns torch.ones(2) without recompiling +``` + ## Optimizer - Introduce Muon optimizer to PyTorch ([#160213](https://github.com/pytorch/pytorch/pull/160213)) @@ -507,6 +522,11 @@ We move enabling `pin_memory` back inside `BaseDataLoaderIter`. This is required - Fix segfault due to interaction between Dynamo backends and `torch.compiler.reset()` ([#156527](https://github.com/pytorch/pytorch/pull/156527)) - Fix crash due to bad interaction with recompilations and with blocks in Python 3.11+ ([#162318](https://github.com/pytorch/pytorch/pull/162318)) +## torch.nn +- Fix silent correctness w/ backpropping grads for `FlexAttention` ([#163677](https://github.com/pytorch/pytorch/pull/163677)) +- Fix `return_lse` warning message in `FlexAttention` ([#163578](https://github.com/pytorch/pytorch/pull/163578)) +- Fix `FlexAttention` head broadcast ([#163426](https://github.com/pytorch/pytorch/pull/163426)) + ## Inductor - Fix wrong meta function for `constant_pad_nd` ([#159878](https://github.com/pytorch/pytorch/pull/159878)) - Fix learnable bias assertion error in Inductor ([#161170](https://github.com/pytorch/pytorch/pull/161170)) @@ -526,6 +546,9 @@ We move enabling `pin_memory` back inside `BaseDataLoaderIter`. This is required - Explicitly delete `wait_tensor` returned tensor ([#159502](https://github.com/pytorch/pytorch/pull/159502)) - Fix memory leak from `all_reduce` ([#159818](https://github.com/pytorch/pytorch/pull/159818)) +## Composability +- Make functionalization ViewMeta serializable with pickle ([#163769](https://github.com/pytorch/pytorch/pull/163769)) + ## Export - Fix bug in constants lifting pass ([#157719](https://github.com/pytorch/pytorch/pull/157719)) - Fix `from_node` provenance in unlift pass ([#157943](https://github.com/pytorch/pytorch/pull/157943)) @@ -554,6 +577,9 @@ We move enabling `pin_memory` back inside `BaseDataLoaderIter`. This is required - Fix lower opset version support in `dynamo=True` ([#161056](https://github.com/pytorch/pytorch/pull/161056)) - Fix `index_put_` usage ([#161263](https://github.com/pytorch/pytorch/pull/161263)) +## C++ Extensions +- Fix CPP extension distributed warning for `TORCH_CUDA_ARCH_LIST` to only log when running on non-distributed or on rank 0 ([#162764](https://github.com/pytorch/pytorch/pull/162764)) + ## C++ Frontend - Fix `torch.utils.cpp_extension` parser for clang version 20.1.7+libcxx ([#157666](https://github.com/pytorch/pytorch/pull/157666)) - Fix `MakeTensor::computeStorageSize()` calculation ([#158690](https://github.com/pytorch/pytorch/pull/158690)) @@ -591,6 +617,9 @@ We move enabling `pin_memory` back inside `BaseDataLoaderIter`. 
This is required - Fix empty input in posneg functions ([#161824](https://github.com/pytorch/pytorch/pull/161824)) - Migrate round unary op to Metal ([#161712](https://github.com/pytorch/pytorch/pull/161712)) - Type-promote tensor-iterator common dtype ([#160334](https://github.com/pytorch/pytorch/pull/160334)) +- Fix regression in 2.8.0 for `scaled_dot_product_attention` using MPS ([#163598](https://github.com/pytorch/pytorch/pull/163598)) +- Chunk `fillBuffer` into 4Gb slices to avoid regression on MacOS 26 ([#164108](https://github.com/pytorch/pytorch/pull/164108)) +- Fix latent bug that can result in segfault in CPP extensions ([#164093](https://github.com/pytorch/pytorch/pull/164093)) ## ROCm - Fix Inductor with cudagraph trees `hip:0` device error ([#161221](https://github.com/pytorch/pytorch/pull/161221)) From cfe8eebbbaa4ec9d1866857f2c96ff967bcd4f32 Mon Sep 17 00:00:00 2001 From: Angel Li Date: Mon, 13 Oct 2025 07:19:53 -0700 Subject: [PATCH 6/6] apply seds --- 2.9.0/final.md | 868 ++++++++++++++++++++++++------------------------- 1 file changed, 434 insertions(+), 434 deletions(-) diff --git a/2.9.0/final.md b/2.9.0/final.md index 2f709ac..e465ceb 100644 --- a/2.9.0/final.md +++ b/2.9.0/final.md @@ -20,26 +20,26 @@ Below are the full release notes for this release. # Backwards Incompatible Changes -## Min supported Python version is now 3.10 ([#162310](https://github.com/pytorch/pytorch/pull/162310)) +## Min supported Python version is now 3.10 (#162310) The minimum version of Python required for PyTorch 2.9.0 is 3.10. We also have 3.14 and 3.14t available as preview with this release. -## Build metal kernels of MacOS-14+ and remove all pre-MacOS-14 specific logic, requires MacOS-14+ going forward ([\#159733](https://github.com/pytorch/pytorch/pull/159733), [\#159912](https://github.com/pytorch/pytorch/pull/159912)) +## Build metal kernels of MacOS-14+ and remove all pre-MacOS-14 specific logic, requires MacOS-14+ going forward (#159733, #159912) PyTorch MPS is only supported on MacOS-14 or later. If you need to use MPS on MacOS Ventura, please avoid updating to Python-3.9 or above -## Upgrade to DLPack 1.0 ([#145000](https://github.com/pytorch/pytorch/pull/145000)) +## Upgrade to DLPack 1.0 (#145000) This upgrade is doing the same BC-breaking changes as the DLPack release. Objects in `torch.utils.dlpack` have been updated to reflect these changes, such as `DLDeviceType`. See the PR for details on the exact changes and how to update your code. -## Raise appropriate errors in `torch.cat` ([#158249](https://github.com/pytorch/pytorch/pull/158249)) +## Raise appropriate errors in `torch.cat` (#158249) `torch.cat` now raises `ValueError`, `IndexError` or `TypeError` where appropriate instead of the generic `RuntimeError`. If you code was catching these errors, you can update to catch the new error type. -## Default to `dynamo=True` for ONNX exporter ([#159646](https://github.com/pytorch/pytorch/pull/159646), [#162726](https://github.com/pytorch/pytorch/pull/162726)) +## Default to `dynamo=True` for ONNX exporter (#159646, #162726) Previously `torch.onnx.export(...)` used the legacy TorchScript exporter if no arguments were provied. The ONNX exporter now uses the newer `torch.export.export` pipeline by default (`dynamo=True`). This change improves graph fidelity and future-proofs exports, but may surface graph capture errors that were previously masked or handled differently. @@ -63,7 +63,7 @@ torch.onnx.export(...) 
Recommendation: first try the new default; only fall back if you hit blocking issues and report them upstream. Long term solution: fix the root cause instead of relying on fallback or TorchScript exporter. -## In Export, switch off runtime asserts by default in favor of a shape guards function ([#160111](https://github.com/pytorch/pytorch/pull/160111), [#161178](https://github.com/pytorch/pytorch/pull/161178), [#161794](https://github.com/pytorch/pytorch/pull/161794)) +## Switch off runtime asserts by default in Export in favor of a shape guards function (#160111, #161178, #161794) To enable runtime asserts, use `export(..., prefer_deferred_runtime_asserts_over_guards=True)`. Also kills the `allow_complex_guards_as_runtime_asserts` flag, merging it into the former option. @@ -71,7 +71,7 @@ To enable runtime asserts, use `export(..., prefer_deferred_runtime_asserts_over Additionally, `exported_program.module()` will generate a call to a `_guards_fn` submodule that will run additional checks on inputs. Users who do not want this behavior can either remove this call in the graph, or do `exported_program.module(check_guards=False)` to avoid the generation. -## Set default opset to 20 in ONNX ([#158802](https://github.com/pytorch/pytorch/pull/158802)) +## Set default opset to 20 in ONNX (#158802) Opset 20 enables newer operator definitions. If your tooling or downstream runtime only supports opset 18, pin it explicitly. For the latest ONNX operators, you can experiment with opset 23. @@ -95,7 +95,7 @@ torch.onnx.export(...) torch.onnx.export(..., opset_version=23) ``` -## Drop `draft_export` in exporter API ([#161454](https://github.com/pytorch/pytorch/pull/161454), [#162225](https://github.com/pytorch/pytorch/pull/162225)) +## Drop `draft_export` in exporter API (#161454, #162225) Remove implicit draft tracing from the default exporter path, achieving clearer behaviour and faster failures. The expensive `torch.export.draft_export` diagnostic path is no longer auto-invoked (which could take hours on large models). You can still opt in for deep diagnostics: @@ -123,56 +123,56 @@ Now in torch 2.9.0: TORCH_ONNX_ENABLE_DRAFT_EXPORT=True python export_to_onnx.py ``` -## Remove `torch.onnx.dynamo_export` and the `onnxrt` torch compile backend ([#158130](https://github.com/pytorch/pytorch/pull/158130), [#158258](https://github.com/pytorch/pytorch/pull/158258)) +## Remove `torch.onnx.dynamo_export` and the `onnxrt` torch compile backend (#158130, #158258) `torch.onnx.dynamo_export` is removed. Please use `torch.onnx.export` instead. The experimental ONNX Runtime compile backend (`torch.compile(backend="onnxrt")`) is no longer supported. -## Remove `torch.onnx.enable_fake_mode` ([#161222](https://github.com/pytorch/pytorch/pull/161222)) +## Remove `torch.onnx.enable_fake_mode` (#161222) The `dynamo=True` mode uses `FakeTensor`s by default which is memory efficient. -## In ONNX, some public facing utility APIs for the TorchScript based exporter are now private ([#161323](https://github.com/pytorch/pytorch/pull/161323)) +## Some public facing ONNX utility APIs for the TorchScript based exporter are now private (#161323) Deprecated members in `torch.onnx.verification` are removed. Previously private `torch.onnx.symbolic_opsets*` functions will no longer be accessible. Consider making a copy of the source code if you need to access any private functions for compatibility with the TorchScript based exporter. 
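For the `_guards_fn` behavior described above under the Export runtime-asserts change, here is a minimal sketch of opting out of the generated input checks. The tiny `nn.Module` is our own placeholder; only the `check_guards=False` argument comes from the note above.

```python
import torch

class M(torch.nn.Module):
    def forward(self, x):
        return x + 1

ep = torch.export.export(M(), (torch.randn(4),))

# Default: the unlifted module calls the generated _guards_fn submodule,
# which runs additional checks on the inputs.
m = ep.module()

# Opt out of generating that call, as described in the Export section above.
m_no_guards = ep.module(check_guards=False)
```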
-## Remove `torch.onnx.symbolic_caffe2` ([#157102](https://github.com/pytorch/pytorch/pull/157102))
+## Remove `torch.onnx.symbolic_caffe2` (#157102)
Support for `caffe2` in the ONNX exporter has ended and is removed.
-## Remove `/d2implyavx512upperregs` flag that slows build ([#159431](https://github.com/pytorch/pytorch/pull/159431))
+## Remove `/d2implyavx512upperregs` flag that slows build (#159431)
-Re-introduced AVX512 optimizations for Windows VS2022 builds, may cause issues with specific versions of VS2022, see [#145702](https://github.com/pytorch/pytorch/issues/145702)
+Re-introduced AVX512 optimizations for Windows VS2022 builds; this may cause issues with specific versions of VS2022, see #145702
-## Add `ScalarType` to shim conversion and `stable::Tensor.scalar_type` ([#160557](https://github.com/pytorch/pytorch/pull/160557))
+## Add `ScalarType` to shim conversion and `stable::Tensor.scalar_type` (#160557)
Before, user extensions could only pass around obfuscated dtypes in the abstract, appearing as `int32_t`s. Now, users can confidently use `torch::headeronly::ScalarType` in their extensions for major scalar types. This PR enables ABI stability by adding a translation layer through the shim, so that even if the `ScalarType` enum values change in the future, user extensions need not fear.
This change adds ScalarType support for user extensions and is only narrowly BC breaking for unpopular dtypes: `quint*`s, `qint*`s, `Bits*`, `dummy_uint*`s, `dummy_int*`s, `Float8_e8m0fnu`, and `Float4_e2m1fn_x2` in the use case where an extension retrieves a Tensor dtype of the above and passes it into `aoti_torch_call_dispatcher`.
# Deprecations
-## Deprecate `pin_memory_device` param in `torch.utils.data.DataLoader` ([#158323](https://github.com/pytorch/pytorch/pull/158323))
+## Deprecate `pin_memory_device` param in `torch.utils.data.DataLoader` (#158323)
We move enabling `pin_memory` back inside `BaseDataLoaderIter`. This is required for `StatefulDataloader`, which leveraged `BaseDataLoaderIter` directly rather than the `Dataloader` class init.
-## Deprecate `torch.export.export_for_training` API in favor of equivalent `torch.export.export` API ([#158203](https://github.com/pytorch/pytorch/pull/158203))
+## Deprecate `torch.export.export_for_training` API in favor of equivalent `torch.export.export` API (#158203)
`torch.export.export_for_training` exists because we couldn't migrate internal usages of export to the final IR. Now that we have completed the migration, we deprecated and deleted this API.
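As a rough migration sketch for this deprecation (assuming a small `nn.Module` of our own; only the two API names come from the note above), the previous `torch.export.export_for_training` call maps directly onto `torch.export.export`:

```python
import torch

class M(torch.nn.Module):
    def forward(self, x):
        return x.sin()

example_inputs = (torch.randn(3),)

# Before (deprecated): ep = torch.export.export_for_training(M(), example_inputs)
# Now the equivalent call is:
ep = torch.export.export(M(), example_inputs)
```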
# New Features ## Python Frontend -- Add utility to get the kernel currently registered on the dispatcher ([#158393](https://github.com/pytorch/pytorch/pull/158393)) -- Extend `__torch_function__` handler to be triggered by elements within a list ([#160256](https://github.com/pytorch/pytorch/pull/160256)) -- Add `torch.hash_tensor` reduction function ([#154149](https://github.com/pytorch/pytorch/pull/154149)) +- Add utility to get the kernel currently registered on the dispatcher (#158393) +- Extend `__torch_function__` handler to be triggered by elements within a list (#160256) +- Add `torch.hash_tensor` reduction function (#154149) ## FX -- Extend torch function support to ALL arguments instead of just scalar type (but not inside of list) ([#145089](https://github.com/pytorch/pytorch/pull/145089)) -- Add `is_fx_symbolic_tracing` flag ([#161385](https://github.com/pytorch/pytorch/pull/161385)) +- Extend torch function support to ALL arguments instead of just scalar type (but not inside of list, #145089) +- Add `is_fx_symbolic_tracing` flag (#161385) ## Dynamo -- Experimental API for ahead-of-time compiling models in fullgraph mode ([#161383](https://github.com/pytorch/pytorch/pull/161383)) -- Add a hook for recompilations ([#157961](https://github.com/pytorch/pytorch/pull/157961)) -- DynamicInts prototype ([#162194](https://github.com/pytorch/pytorch/pull/162194)) +- Experimental API for ahead-of-time compiling models in fullgraph mode (#161383) +- Add a hook for recompilations (#157961) +- DynamicInts prototype (#162194) Introduces an API for annotating dynamic integer inputs & attributes for `torch.compile`, by wrapping plain ints with `DynamicInt()`. DynamicInt objects also work in eager mode, acting as their underlying values when passed as scalar inputs. 
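For the new `torch.hash_tensor` reduction listed above under Python Frontend, a minimal usage sketch (the exact signature and hashing semantics are defined by #154149; the call below assumes the default reduction):

```python
import torch

t = torch.arange(8)

# Reduces the whole tensor to a single hash value; see #154149 for details.
h = torch.hash_tensor(t)
print(h)
```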
@@ -189,641 +189,641 @@ fn(2) # returns torch.ones(2) without recompiling ## Optimizer -- Introduce Muon optimizer to PyTorch ([#160213](https://github.com/pytorch/pytorch/pull/160213)) +- Introduce Muon optimizer to PyTorch (#160213) ## Profiler -- Add GC Events to Python Stack Tracer ([#161209](https://github.com/pytorch/pytorch/pull/161209)) -- Add a custom profiler configuration option ([#151656](https://github.com/pytorch/pytorch/pull/151656)) +- Add GC Events to Python Stack Tracer (#161209) +- Add a custom profiler configuration option (#151656) ## Inductor -- Allow user to pass in custom partitioner function ([#157580](https://github.com/pytorch/pytorch/pull/157580)) +- Allow user to pass in custom partitioner function (#157580) ## Export -- Add support for param mutation under inference mode ([#159661](https://github.com/pytorch/pytorch/pull/159661)) +- Add support for param mutation under inference mode (#159661) ## AOTDispatcher -- Add AOTDispatcher config to set backward autocast behavior ([#156356](https://github.com/pytorch/pytorch/pull/156356)) +- Add AOTDispatcher config to set backward autocast behavior (#156356) ## Quantization -- Enable cpu fp8 qlinear and cpu fp8 qconv ([#155678](https://github.com/pytorch/pytorch/pull/155678), [#157076](https://github.com/pytorch/pytorch/pull/157076)) +- Enable cpu fp8 qlinear and cpu fp8 qconv (#155678, #157076) ## ONNX -- RMS Norm support in opset 23 ([#159377](https://github.com/pytorch/pytorch/pull/159377)) +- RMS Norm support in opset 23 (#159377) ## C++ Extensions -- Build out a stable set of ATen ops in `torch/csrc/stable/ops.h`: `amax`, `narrow`, `new_empty` + `new_zeros` dtype variant, `pad`, ([#159328](https://github.com/pytorch/pytorch/pull/159328), [#158974](https://github.com/pytorch/pytorch/pull/158974), [#159508](https://github.com/pytorch/pytorch/pull/159508), [#161597](https://github.com/pytorch/pytorch/pull/161597), [#160214](https://github.com/pytorch/pytorch/pull/160214), ) -- Add `torch::stable::Tensor()` default constructor, `is_cpu`, and `get_device_index`([#159507](https://github.com/pytorch/pytorch/pull/159507), [#160212](https://github.com/pytorch/pytorch/pull/160212), [#160143](https://github.com/pytorch/pytorch/pull/160143)) -- Add beginnings of `torch::stable::accelerator` with support for DeviceGuard and Stream ([#159679](https://github.com/pytorch/pytorch/pull/159679), [#160453](https://github.com/pytorch/pytorch/pull/160453)) -- Start building out `torch/headeronly`: c10 Macros, STD_TORCH_CHECK, ScalarTypes (like BFloat16 and Half) ([#158035](https://github.com/pytorch/pytorch/pull/158035), [#158365](https://github.com/pytorch/pytorch/pull/158365), [#157912](https://github.com/pytorch/pytorch/pull/157912), [#158377](https://github.com/pytorch/pytorch/pull/158377), [#159302](https://github.com/pytorch/pytorch/pull/159302), [#159414](https://github.com/pytorch/pytorch/pull/159414), [#159412](https://github.com/pytorch/pytorch/pull/159412), [#159415](https://github.com/pytorch/pytorch/pull/159415), [#159411](https://github.com/pytorch/pytorch/pull/159411), [#159911](https://github.com/pytorch/pytorch/pull/159911)) -- Remove cmake cache and reconfigure again if it is invalid ([#156958](https://github.com/pytorch/pytorch/pull/156958)) -- Cut a version of `TORCH_ERROR_CODE_CHECK` in `headeronly` from AOTI ([#159604](https://github.com/pytorch/pytorch/pull/159604)) -- Remove `wheel` from build requirements ([#158027](https://github.com/pytorch/pytorch/pull/158027)) -- Error when `TORCH_STABLE_ONLY` is defined 
in `TensorBase.h` ([#161658](https://github.com/pytorch/pytorch/pull/161658)) +- Build out a stable set of ATen ops in `torch/csrc/stable/ops.h`: `amax`, `narrow`, `new_empty` + `new_zeros` dtype variant, `pad`, (#159328, #158974, #159508, #161597, #160214) +- Add `torch::stable::Tensor()` default constructor, `is_cpu`, and `get_device_index`(#159507, #160212, #160143) +- Add beginnings of `torch::stable::accelerator` with support for DeviceGuard and Stream (#159679, #160453) +- Start building out `torch/headeronly`: c10 Macros, STD_TORCH_CHECK, ScalarTypes (like BFloat16 and Half, #158035, #158365, #157912, #158377, #159302, #159414, #159412, #159415, #159411, #159911) +- Remove cmake cache and reconfigure again if it is invalid (#156958) +- Cut a version of `TORCH_ERROR_CODE_CHECK` in `headeronly` from AOTI (#159604) +- Remove `wheel` from build requirements (#158027) +- Error when `TORCH_STABLE_ONLY` is defined in `TensorBase.h` (#161658) ## Build Frontend -- Add transpose to `torch/csrc/stable` ([#158160](https://github.com/pytorch/pytorch/pull/158160)) -- Add `zero_()` and `empty_like(t)` to `torch/csrc/stable/ops.h` ([#158866](https://github.com/pytorch/pytorch/pull/158866)) +- Add transpose to `torch/csrc/stable` (#158160) +- Add `zero_()` and `empty_like(t)` to `torch/csrc/stable/ops.h` (#158866) ## Release Engineering -- Add support for CUDA 13.0 in CI/CD builds. Enable CUDA compression mode for binary size reduction for CUDA 13.0 builds ([#160956](https://github.com/pytorch/pytorch/pull/160956)) ([#161073](https://github.com/pytorch/pytorch/pull/161073)) ([#161257](https://github.com/pytorch/pytorch/pull/161257)) ([#161663](https://github.com/pytorch/pytorch/pull/161663)) ([#161316](https://github.com/pytorch/pytorch/pull/161316)) ([#160201](https://github.com/pytorch/pytorch/pull/160201)) ([#160770](https://github.com/pytorch/pytorch/pull/160770)) ([#161013](https://github.com/pytorch/pytorch/pull/161013)) ([#161916](https://github.com/pytorch/pytorch/pull/161916)) ([#162268](https://github.com/pytorch/pytorch/pull/162268)) ([#162322](https://github.com/pytorch/pytorch/pull/162322)) ([#162383](https://github.com/pytorch/pytorch/pull/162383)) ([#161833](https://github.com/pytorch/pytorch/pull/161833)) +- Add support for CUDA 13.0 in CI/CD builds. 
Enable CUDA compression mode for binary size reduction for CUDA 13.0 builds (#160956, #161073, #161257, #161663, #161316, #160201, #160770, #161013, #161916, #162268, #162322, #162383, #161833) -- Enable CUDA 12.6, 12.8 and 13.0 support for Linux ARM64 CD builds ([#162364](https://github.com/pytorch/pytorch/pull/162364)) ([#160720](https://github.com/pytorch/pytorch/pull/160720)) ([#159481](https://github.com/pytorch/pytorch/pull/159481)) +- Enable CUDA 12.6, 12.8 and 13.0 support for Linux ARM64 CD builds (#162364, #160720, #159481) -- Add support for Python 3.14 in CI/CD builds ([#156889](https://github.com/pytorch/pytorch/pull/156889)) ([#157559](https://github.com/pytorch/pytorch/pull/157559)) ([#159261](https://github.com/pytorch/pytorch/pull/159261)) ([#159869](https://github.com/pytorch/pytorch/pull/159869)) ([#160593](https://github.com/pytorch/pytorch/pull/160593)) ([#160788](https://github.com/pytorch/pytorch/pull/160788)) ([#161255](https://github.com/pytorch/pytorch/pull/161255)) ([#159725](https://github.com/pytorch/pytorch/pull/159725)) +- Add support for Python 3.14 in CI/CD builds (#156889, #157559, #159261, #159869, #160593, #160788, #161255, #159725) -- Enable NVSHMEM integration ([#151261](https://github.com/pytorch/pytorch/pull/151261)) ([#153010](https://github.com/pytorch/pytorch/pull/153010)) ([#154538](https://github.com/pytorch/pytorch/pull/154538)) ([#155506](https://github.com/pytorch/pytorch/pull/155506)) ([#156685](https://github.com/pytorch/pytorch/pull/156685)) ([#158938](https://github.com/pytorch/pytorch/pull/158938)) ([#161321](https://github.com/pytorch/pytorch/pull/161321)) ([#160778](https://github.com/pytorch/pytorch/pull/160778)) ([#159907](https://github.com/pytorch/pytorch/pull/159907)) ([#160465](https://github.com/pytorch/pytorch/pull/160465)) +- Enable NVSHMEM integration (#151261, #153010, #154538, #155506, #156685, #158938, #161321, #160778, #159907, #160465) ## CUDA -- Add getter for CUDA graph exec to allow mutation of captured kernel params ([#161294](https://github.com/pytorch/pytorch/pull/161294)) -- Implement support for `cudnn_batch_norm_out` kernel to replace the autogen approach ([#123020](https://github.com/pytorch/pytorch/pull/123020)) +- Add getter for CUDA graph exec to allow mutation of captured kernel params (#161294) +- Implement support for `cudnn_batch_norm_out` kernel to replace the autogen approach (#123020) ## CPU -- Support GQA for flash attention ([#157893](https://github.com/pytorch/pytorch/pull/157893)) +- Support GQA for flash attention (#157893) ## MPS -- Partial sparse support for MPS backend ([\#159729](https://github.com/pytorch/pytorch/pull/159729), [\#160254](https://github.com/pytorch/pytorch/pull/160254), [\#160223](https://github.com/pytorch/pytorch/pull/160223), [\#161846](https://github.com/pytorch/pytorch/pull/161846), [\#162007](https://github.com/pytorch/pytorch/pull/162007), [#157238](https://github.com/pytorch/pytorch/pull/157238)) -- Add `avg_pool3d`, `max_unpool1d/2d/3d`, `max_pool3d`, `max_pool3d` bwd pass, and `avg_pool3d` bwd pass for MPS ([#158877](https://github.com/pytorch/pytorch/pull/158877),[#159789](https://github.com/pytorch/pytorch/pull/159789), [#156467](https://github.com/pytorch/pytorch/pull/156467), [#157498](https://github.com/pytorch/pytorch/pull/157498), [#159089](https://github.com/pytorch/pytorch/pull/159089)) +- Partial sparse support for MPS backend (#159729, #160254, #160223, #161846, #162007, #157238) +- Add `avg_pool3d`, `max_unpool1d/2d/3d`, `max_pool3d`, `max_pool3d` bwd 
pass, and `avg_pool3d` bwd pass for MPS (#158877,#159789, #156467, #157498, #159089) ## ROCm -- OCP Micro-scaling Format (mx-fp8/mx-fp4) Support ([#151360](https://github.com/pytorch/pytorch/pull/151360)) +- OCP Micro-scaling Format (mx-fp8/mx-fp4) Support (#151360) ## XPU -- Enable `FlexAttention` on Intel GPU ([#143553](https://github.com/pytorch/pytorch/pull/143553)) +- Enable `FlexAttention` on Intel GPU (#143553) # Improvements ## Python Frontend -- Speed up `torch.load` under `FakeTensorMode` by reducing random reads ([#157931](https://github.com/pytorch/pytorch/pull/157931)) -- Make `torch.utils.benchmark.utils.timer` accelerator agnostic ([#157131](https://github.com/pytorch/pytorch/pull/157131)) -- Improve error message for weight-only load errors ([#159935](https://github.com/pytorch/pytorch/pull/159935)) +- Speed up `torch.load` under `FakeTensorMode` by reducing random reads (#157931) +- Make `torch.utils.benchmark.utils.timer` accelerator agnostic (#157131) +- Improve error message for weight-only load errors (#159935) ## torch.nn -- Allow `register_buffer` with `Tensor`-like objects ([#159455](https://github.com/pytorch/pytorch/pull/159455)) -- Improve error message for unsupported padding configurations ([#160866](https://github.com/pytorch/pytorch/pull/160866)) -- Validate target is 0D when input is 1D in `NLLLoss` ([#161412](https://github.com/pytorch/pytorch/pull/161412)) +- Allow `register_buffer` with `Tensor`-like objects (#159455) +- Improve error message for unsupported padding configurations (#160866) +- Validate target is 0D when input is 1D in `NLLLoss` (#161412) ## Optimizer -- Resolve warning in LBFGS when converting a tensor with `requires_grad=True` to a scalar ([#160389](https://github.com/pytorch/pytorch/pull/160389)) -- Resolve `SequentialLR` deprecation warning about invoking `step(epoch)` ([#149392](https://github.com/pytorch/pytorch/pull/149392)) +- Resolve warning in LBFGS when converting a tensor with `requires_grad=True` to a scalar (#160389) +- Resolve `SequentialLR` deprecation warning about invoking `step(epoch)` (#149392) ## Autograd -- Support deterministic `torch.nn.Upsample` `mode="trilinear"` backward ([#154239](https://github.com/pytorch/pytorch/pull/154239)) +- Support deterministic `torch.nn.Upsample` `mode="trilinear"` backward (#154239) ## Distributed ### c10d - - Add improvements to eager init of `ProcessGroupNCCL` ([#156748](https://github.com/pytorch/pytorch/pull/156748)) - - Simplify unique hash management of `ProcessGroupNCCL` ([#156790](https://github.com/pytorch/pytorch/pull/156790)) - - Support per operation timeouts in `ProcessGroupGloo` ([#158128](https://github.com/pytorch/pytorch/pull/158128)) - - Allow ping to be retried in `TCPStore` ([#159165](https://github.com/pytorch/pytorch/pull/159165)) - - Support scalar tensor for functional `all_gather` ([#149913](https://github.com/pytorch/pytorch/pull/149913)) - - Expos `unsafe_get_ptr` for dist.ProcessGroupNCCL.NCCLConfig ([#161136](https://github.com/pytorch/pytorch/pull/161136)) - - Add batch option for `send/recv_object_list` ([#160342](https://github.com/pytorch/pytorch/pull/160342)) - - Make FakeStore optional to be passed into fake backend ([#162164](https://github.com/pytorch/pytorch/pull/162164)) - - Enable complex datatype support in `ProcessGroupGloo` ([#156633](https://github.com/pytorch/pytorch/pull/156633)) - - Move thread-local capture mode guard to include `work.isStarted` ([#160398](https://github.com/pytorch/pytorch/pull/160398)) + - Add improvements to eager 
init of `ProcessGroupNCCL` (#156748) + - Simplify unique hash management of `ProcessGroupNCCL` (#156790) + - Support per operation timeouts in `ProcessGroupGloo` (#158128) + - Allow ping to be retried in `TCPStore` (#159165) + - Support scalar tensor for functional `all_gather` (#149913) + - Expos `unsafe_get_ptr` for dist.ProcessGroupNCCL.NCCLConfig (#161136) + - Add batch option for `send/recv_object_list` (#160342) + - Make FakeStore optional to be passed into fake backend (#162164) + - Enable complex datatype support in `ProcessGroupGloo` (#156633) + - Move thread-local capture mode guard to include `work.isStarted` (#160398) ### DistributedDataParallel (DDP) - - Support ddp zero hook XCCL path ([#159240](https://github.com/pytorch/pytorch/pull/159240)) + - Support ddp zero hook XCCL path (#159240) ### DTensor - - Relax `device_mesh` argument constraint in `local_map` ([#157049](https://github.com/pytorch/pytorch/pull/157049)) - - Support complex numbers in DTensor redistribute ([#157329](https://github.com/pytorch/pytorch/pull/157329)) - - Rework partial propagation in point-wise op and support mul ([#157340](https://github.com/pytorch/pytorch/pull/157340)) - - Allow dynamic shapes for `DTensor` slice ([#157953](https://github.com/pytorch/pytorch/pull/157953)) - - Implement `histc` op ([#158298](https://github.com/pytorch/pytorch/pull/158298)) - - Made dispatch to sharding prop over decomps ([#159324](https://github.com/pytorch/pytorch/pull/159324)) - - Support user-supplied Generator for random ops ([#159933](https://github.com/pytorch/pytorch/pull/159933)) - - Add `propagate_tensor_meta` function that skips cache if `_are_we_tracing` ([#161334](https://github.com/pytorch/pytorch/pull/161334)) - - Support `local_map` as a decorator ([#161353](https://github.com/pytorch/pytorch/pull/161353)) + - Relax `device_mesh` argument constraint in `local_map` (#157049) + - Support complex numbers in DTensor redistribute (#157329) + - Rework partial propagation in point-wise op and support mul (#157340) + - Allow dynamic shapes for `DTensor` slice (#157953) + - Implement `histc` op (#158298) + - Made dispatch to sharding prop over decomps (#159324) + - Support user-supplied Generator for random ops (#159933) + - Add `propagate_tensor_meta` function that skips cache if `_are_we_tracing` (#161334) + - Support `local_map` as a decorator (#161353) ### Device Mesh - - Enable the use of user set backend and pg option even for the global mesh ([#157501](https://github.com/pytorch/pytorch/pull/157501)) - - Enable slicing a submesh with warnings ([#158899](https://github.com/pytorch/pytorch/pull/158899)) - - Allow controlling PG backend and options via `init_device_mesh` ([#159371](https://github.com/pytorch/pytorch/pull/159371)) + - Enable the use of user set backend and pg option even for the global mesh (#157501) + - Enable slicing a submesh with warnings (#158899) + - Allow controlling PG backend and options via `init_device_mesh` (#159371) ### FullyShardedDataParallel2 (FSDP2) - - Support custom `all_gather` and `reduce_scatter` comms ([#155189](https://github.com/pytorch/pytorch/pull/155189)) - - Made it fail `set_allocate_memory_from_process_group` if used together with custom comm hooks ([#157487](https://github.com/pytorch/pytorch/pull/157487)) - - Use `reduceOpSum` when world size is 1 ([#157529](https://github.com/pytorch/pytorch/pull/157529)) - - Skipp `allgather` when world size is 1 ([#160135](https://github.com/pytorch/pytorch/pull/160135)) - - Use `post_reduce_stream.record_event()` on 
hsdp+cpuoffload ([#160481](https://github.com/pytorch/pytorch/pull/160481)) + - Support custom `all_gather` and `reduce_scatter` comms (#155189) + - Made it fail `set_allocate_memory_from_process_group` if used together with custom comm hooks (#157487) + - Use `reduceOpSum` when world size is 1 (#157529) + - Skipp `allgather` when world size is 1 (#160135) + - Use `post_reduce_stream.record_event()` on hsdp+cpuoffload (#160481) ### Tensor Parallel (TP) - - Improve `parallelize_module` API to support more cases ([#157182](https://github.com/pytorch/pytorch/pull/157182)) + - Improve `parallelize_module` API to support more cases (#157182) ### TensorPipe - - Update TensorPipe pinned dependency version ([#159834](https://github.com/pytorch/pytorch/pull/159834)) + - Update TensorPipe pinned dependency version (#159834) ### TorchElastic - - Enable NUMA binding integration with elastic agent and `torchrun` ([#149334](https://github.com/pytorch/pytorch/pull/149334)) - - Support NUMA Binding for Callable Entrypoints ([#160163](https://github.com/pytorch/pytorch/pull/160163), [#161183](https://github.com/pytorch/pytorch/pull/161183)) + - Enable NUMA binding integration with elastic agent and `torchrun` (#149334) + - Support NUMA Binding for Callable Entrypoints (#160163, #161183) ### Pipeline Parallelism (PP) - - Add `eval()` API to schedule ([#157795](https://github.com/pytorch/pytorch/pull/157795)) - - Allow intermediate nodes in zero bubble to have multiple grads ([#159084](https://github.com/pytorch/pytorch/pull/159084)) - - Support `OVERLAP_F_B` computation type ([#158978](https://github.com/pytorch/pytorch/pull/158978)) - - Initializ P2P communicators on first step ([#160210](https://github.com/pytorch/pytorch/pull/160210)) - - Add `DualPipeV` schedule ([#159591](https://github.com/pytorch/pytorch/pull/159591)) + - Add `eval()` API to schedule (#157795) + - Allow intermediate nodes in zero bubble to have multiple grads (#159084) + - Support `OVERLAP_F_B` computation type (#158978) + - Initializ P2P communicators on first step (#160210) + - Add `DualPipeV` schedule (#159591) ## Linear Algebra Frontend -- Use rocSOLVER for Cholesky inversion on AMD. ([#157154](https://github.com/pytorch/pytorch/pull/157154)) -- Add option for using TF32 as fp32 internal precision for matmul/linear/conv on MKLDNN ([#157520](https://github.com/pytorch/pytorch/pull/157520)) -- Make einsum produce contiguous outputs in more cases ([#161755](https://github.com/pytorch/pytorch/pull/161755)) +- Use rocSOLVER for Cholesky inversion on AMD. (#157154) +- Add option for using TF32 as fp32 internal precision for matmul/linear/conv on MKLDNN (#157520) +- Make einsum produce contiguous outputs in more cases (#161755) ## Profiler -- Add more CUDA API for kernel launcher ([#156016](https://github.com/pytorch/pytorch/pull/156016)) -- Allow Custom Time Unit When Printing Profiler Table ([#157913](https://github.com/pytorch/pytorch/pull/157913)) -- Update CUDA runtime kernel identification logic ([#157890](https://github.com/pytorch/pytorch/pull/157890)) +- Add more CUDA API for kernel launcher (#156016) +- Allow Custom Time Unit When Printing Profiler Table (#157913) +- Update CUDA runtime kernel identification logic (#157890) ## FX -- Fix DCE eliminating random operations by improving `is_impure()` (#151524) ([#157981](https://github.com/pytorch/pytorch/pull/157981)) -- Support converting a float32 tensor to a scalar in FX trace. 
([#158216](https://github.com/pytorch/pytorch/pull/158216)) -- Correctly copy `self.module_stack` in ModuleStackTracer ([#159956](https://github.com/pytorch/pytorch/pull/159956)) -- Add tool to track events in graph split ([#159795](https://github.com/pytorch/pytorch/pull/159795)) -- Add `node_name_match` to subgraph rewriter ([#157574](https://github.com/pytorch/pytorch/pull/157574)) +- Fix DCE eliminating random operations by improving `is_impure()` (#151524, #157981) +- Support converting a float32 tensor to a scalar in FX trace. (#158216) +- Correctly copy `self.module_stack` in ModuleStackTracer (#159956) +- Add tool to track events in graph split (#159795) +- Add `node_name_match` to subgraph rewriter (#157574) ## Dynamo - Improve tracing support for various Python builtin data structures/modules: - - `list`s (e.g. [#153969](https://github.com/pytorch/pytorch/pull/153969)) - - `set`s (e.g. [#153150](https://github.com/pytorch/pytorch/pull/153150)) - - `dict`s (e.g. [#154794](https://github.com/pytorch/pytorch/pull/154794)) - - `iter` (e.g. [#156371](https://github.com/pytorch/pytorch/pull/156371)) - - `itertools` (e.g. [#159693](https://github.com/pytorch/pytorch/pull/159693)) - - `collections` (e.g. [#159365](https://github.com/pytorch/pytorch/pull/159365)) - - `collections.NamedTuple` ([#159367](https://github.com/pytorch/pytorch/pull/159367)) - - frozen `dataclasses.dataclass` ([#159529](https://github.com/pytorch/pytorch/pull/159529)) -- Graph break error messages link to a website with more information ([#159011](https://github.com/pytorch/pytorch/pull/159011)) -- Add option for `TorchDispatchMode` to ignore `torch.compile` internals ([#161648](https://github.com/pytorch/pytorch/pull/161648)) + - `list`s (e.g. #153969) + - `set`s (e.g. #153150) + - `dict`s (e.g. #154794) + - `iter` (e.g. #156371) + - `itertools` (e.g. #159693) + - `collections` (e.g. 
#159365) + - `collections.NamedTuple` (#159367) + - frozen `dataclasses.dataclass` (#159529) +- Graph break error messages link to a website with more information (#159011) +- Add option for `TorchDispatchMode` to ignore `torch.compile` internals (#161648) ## Inductor -- Add Inductor support for MTIA backend ([#159211](https://github.com/pytorch/pytorch/pull/159211)) -- Share default device context when all graph partitions and cudagraph-unsafe ops are on the same device([#162873](https://github.com/pytorch/pytorch/pull/162873)) +- Add Inductor support for MTIA backend (#159211) +- Share default device context when all graph partitions and cudagraph-unsafe ops are on the same device(#162873) ## Ahead-Of-Time Inductor (AOTI) -- Enable AOTI for CPU on Windows ([#158915](https://github.com/pytorch/pytorch/pull/158915)) -- Re-enable TMA templates w/ AOTI ([#157819](https://github.com/pytorch/pytorch/pull/157819)) -- Don't allow int32 indices if `{non-inf, > int32_max}` upper bound is provided ([#159433](https://github.com/pytorch/pytorch/pull/159433)) -- Add RecordFunction to C shim so that profiling works with AOTI ([#159842](https://github.com/pytorch/pytorch/pull/159842)) -- Add AOTI C shim functions for collective ops ([#154492](https://github.com/pytorch/pytorch/pull/154492)) -- Add missing ops to set of C-shim ops which can have nullptr returns ([#158073](https://github.com/pytorch/pytorch/pull/158073)) +- Enable AOTI for CPU on Windows (#158915) +- Re-enable TMA templates w/ AOTI (#157819) +- Don't allow int32 indices if `{non-inf, > int32_max}` upper bound is provided (#159433) +- Add RecordFunction to C shim so that profiling works with AOTI (#159842) +- Add AOTI C shim functions for collective ops (#154492) +- Add missing ops to set of C-shim ops which can have nullptr returns (#158073) ## Export -- Handle `None` & ellipsis slicing/select in non-strict ([#157821](https://github.com/pytorch/pytorch/pull/157821)) -- Extend FP8 types in serialization ([#158430](https://github.com/pytorch/pytorch/pull/158430)) -- Improve error messages for deserialization ([#159881](https://github.com/pytorch/pytorch/pull/159881)) -- Support serialization for `triton_kernel_wrapper_functional` HOP ([#161314](https://github.com/pytorch/pytorch/pull/161314)) -- Support serialization for complex constants ([#161517](https://github.com/pytorch/pytorch/pull/161517)) -- Add runtime asserts to `while_loop` HOP subgraphs ([#158467](https://github.com/pytorch/pytorch/pull/158467)) -- Warn on side-effectful code in strict mode ([#160060](https://github.com/pytorch/pytorch/pull/160060)) -- Support for vmap in pre-dispatch export ([#154650](https://github.com/pytorch/pytorch/pull/154650)) -- Support vmap and custom autograd function/improve DTensor constructor inefficiency ([#162240](https://github.com/pytorch/pytorch/pull/162240)) +- Handle `None` & ellipsis slicing/select in non-strict (#157821) +- Extend FP8 types in serialization (#158430) +- Improve error messages for deserialization (#159881) +- Support serialization for `triton_kernel_wrapper_functional` HOP (#161314) +- Support serialization for complex constants (#161517) +- Add runtime asserts to `while_loop` HOP subgraphs (#158467) +- Warn on side-effectful code in strict mode (#160060) +- Support for vmap in pre-dispatch export (#154650) +- Support vmap and custom autograd function/improve DTensor constructor inefficiency (#162240) ## AOTDispatcher -- Skip logging in fp8 activation quantization if there are no nodes to be quantized 
([#158129](https://github.com/pytorch/pytorch/pull/158129)) -- Add `aot_export_joint_with_descriptors` and `aot_compile_joint_with_descriptors` ([#158715](https://github.com/pytorch/pytorch/pull/158715)) -- Extract out `prepare_aot_module_simplified` for use in next PR ([#158319](https://github.com/pytorch/pytorch/pull/158319)) -- Rename modules in AOTAutograd ([#158449](https://github.com/pytorch/pytorch/pull/158449)) -- Track descriptors for all inputs/outputs of AOTAutograd traced graph ([#158624](https://github.com/pytorch/pytorch/pull/158624)) -- Improve graph output alias with subclass error message ([#159619](https://github.com/pytorch/pytorch/pull/159619)) -- Pass fw/bw compilers to `aot_export_joint_with_descriptors` ([#159814](https://github.com/pytorch/pytorch/pull/159814)) +- Skip logging in fp8 activation quantization if there are no nodes to be quantized (#158129) +- Add `aot_export_joint_with_descriptors` and `aot_compile_joint_with_descriptors` (#158715) +- Extract out `prepare_aot_module_simplified` for use in next PR (#158319) +- Rename modules in AOTAutograd (#158449) +- Track descriptors for all inputs/outputs of AOTAutograd traced graph (#158624) +- Improve graph output alias with subclass error message (#159619) +- Pass fw/bw compilers to `aot_export_joint_with_descriptors` (#159814) ## Composability -- Meta implementation for `aten.add.Scalar` ([#161332](https://github.com/pytorch/pytorch/pull/161332)) -- `aten.expand_copy` decomp ([#161688](https://github.com/pytorch/pytorch/pull/161688)) -- Fix result dtype cast in decomp for `aten.linalg_vector_norm` ([#155111](https://github.com/pytorch/pytorch/pull/155111)) -- Add dtype checks in meta implementation for several ordering ops ([#159556](https://github.com/pytorch/pytorch/pull/159556)) -- Fix meta function for `aten.complex` ([#160894](https://github.com/pytorch/pytorch/pull/160894)) -- Improve unbacked symint (dynamic shape) support for several decompositions ([#148815](https://github.com/pytorch/pytorch/pull/148815), [#156902](https://github.com/pytorch/pytorch/pull/156902), [#157008](https://github.com/pytorch/pytorch/pull/157008), [#158894](https://github.com/pytorch/pytorch/pull/158894), [#159184](https://github.com/pytorch/pytorch/pull/159184), [#160683](https://github.com/pytorch/pytorch/pull/160683), [#160253](https://github.com/pytorch/pytorch/pull/160253), [#162084](https://github.com/pytorch/pytorch/pull/162084), [#162099](https://github.com/pytorch/pytorch/pull/162099), [#162109](https://github.com/pytorch/pytorch/pull/162109), [#160462](https://github.com/pytorch/pytorch/pull/160462)) +- Meta implementation for `aten.add.Scalar` (#161332) +- `aten.expand_copy` decomp (#161688) +- Fix result dtype cast in decomp for `aten.linalg_vector_norm` (#155111) +- Add dtype checks in meta implementation for several ordering ops (#159556) +- Fix meta function for `aten.complex` (#160894) +- Improve unbacked symint (dynamic shape) support for several decompositions (#148815, #156902, #157008, #158894, #159184, #160683, #160253, #162084, #162099, #162109, #160462) ## Quantization -- Avoid getting model device once per node for pt2e quantization flow ([#159901](https://github.com/pytorch/pytorch/pull/159901)) -- Fixes bug in implementation of `HistogramObserver` ([#156457](https://github.com/pytorch/pytorch/pull/156457)) -- Support `bias=None` for `fbgemm_linear_fp16_weight` CPU op ([#158535](https://github.com/pytorch/pytorch/pull/158535)) -- Add Static Dispatch Kernel for `wrapped_fbgemm_linear_fp16_weight` for 
Sigmoid ([#160451](https://github.com/pytorch/pytorch/pull/160451)) +- Avoid getting model device once per node for pt2e quantization flow (#159901) +- Fixes bug in implementation of `HistogramObserver` (#156457) +- Support `bias=None` for `fbgemm_linear_fp16_weight` CPU op (#158535) +- Add Static Dispatch Kernel for `wrapped_fbgemm_linear_fp16_weight` for Sigmoid (#160451) ## Nested Tensor (NJT) -- Added initial `log_softmax()` support ([#159662](https://github.com/pytorch/pytorch/pull/159662)) +- Added initial `log_softmax()` support (#159662) ## Foreach -- Invoke `vector.reserve()` consistently for non-inplace foreach operations ([#161128](https://github.com/pytorch/pytorch/pull/161128)) -- Faster and safer lambda expression capture in `has_integral_tensor()` ([#161042](https://github.com/pytorch/pytorch/pull/161042)) +- Invoke `vector.reserve()` consistently for non-inplace foreach operations (#161128) +- Faster and safer lambda expression capture in `has_integral_tensor()` (#161042) ## ONNX -- Support symbolic arguments in ONNX exporter ([#157734](https://github.com/pytorch/pytorch/pull/157734)) -- Fix `torch.tensor` warning in ONNX `symbolic_opset10` export ([#158835](https://github.com/pytorch/pytorch/pull/158835)) +- Support symbolic arguments in ONNX exporter (#157734) +- Fix `torch.tensor` warning in ONNX `symbolic_opset10` export (#158835) ## C++ Frontend -- Generalized `AllocatorConfig` to be device-agnostic via new `AcceleratorAllocatorConfig` ([#149601](https://github.com/pytorch/pytorch/pull/149601), [#150312](https://github.com/pytorch/pytorch/pull/150312)) -- Added `Scalar::isUnsigned()` method ([#159877](https://github.com/pytorch/pytorch/pull/159877)) -- Exposed `ModelRunner` from nativert as public ([#159989](https://github.com/pytorch/pytorch/pull/159989)) -- Improve error message for `torch.binomial` enforcing float inputs ([#157658](https://github.com/pytorch/pytorch/pull/157658)) +- Generalized `AllocatorConfig` to be device-agnostic via new `AcceleratorAllocatorConfig` (#149601, #150312) +- Added `Scalar::isUnsigned()` method (#159877) +- Exposed `ModelRunner` from nativert as public (#159989) +- Improve error message for `torch.binomial` enforcing float inputs (#157658) ## Build Frontend -- Fix dev warning in `Dependencies.cmake` ([#159702](https://github.com/pytorch/pytorch/pull/159702)) -- Fix building system gloo with CUDA/HIP ([#146637](https://github.com/pytorch/pytorch/pull/146637)) -- Build `libtorch` without NVSHMEM ([#160910](https://github.com/pytorch/pytorch/pull/160910)) -- Improve BLAS feature detection ([#143846](https://github.com/pytorch/pytorch/pull/143846)) +- Fix dev warning in `Dependencies.cmake` (#159702) +- Fix building system gloo with CUDA/HIP (#146637) +- Build `libtorch` without NVSHMEM (#160910) +- Improve BLAS feature detection (#143846) ## Release Engineering -- Enable vLLM testing workflow ([#160583](https://github.com/pytorch/pytorch/pull/160583)) ([#161565](https://github.com/pytorch/pytorch/pull/161565)) ([#162292](https://github.com/pytorch/pytorch/pull/162292)) ([#162000](https://github.com/pytorch/pytorch/pull/162000)) ([#161797](https://github.com/pytorch/pytorch/pull/161797)) -- Enable Windows ARM64 CI testing ([#148753](https://github.com/pytorch/pytorch/pull/148753)) ([#161504](https://github.com/pytorch/pytorch/pull/161504)) -- Enable PyTorch ROCm CI for MI355X testing. 
([#158889](https://github.com/pytorch/pytorch/pull/158889)) +- Enable vLLM testing workflow (#160583, #161565, #162292, #162000, #161797) +- Enable Windows ARM64 CI testing (#148753, #161504) +- Enable PyTorch ROCm CI for MI355X testing. (#158889) ## CUDA -- Make cublaslt/hipblaslt workspaces persistent ([#156495](https://github.com/pytorch/pytorch/pull/156495)) -- Remove unnecessary warnings during the ATen compilation process ([#157703](https://github.com/pytorch/pytorch/pull/157703)) -- Slightly improve error message from `repeat_interleave` kernel ([#157996](https://github.com/pytorch/pytorch/pull/157996)) -- Add framework for explanations for common CUDA errors ([#158395](https://github.com/pytorch/pytorch/pull/158395)) -- Upgrade KernelLauncher `kernelLaunchCheck` to print help string ([#158896](https://github.com/pytorch/pytorch/pull/158896)) -- Prep for cutlass upgrade by ignoring `Wunused-but-set-variable` ([#159276](https://github.com/pytorch/pytorch/pull/159276)) -- Workaround ATen SFINAE under `libc++` ([#161101](https://github.com/pytorch/pytorch/pull/161101)) -- Implement changes to CCCL (CUB/Thrust/LibCUDACXX) usage in ATen ([#153373](https://github.com/pytorch/pytorch/pull/153373)) -- Add maybe unused flag to remove warning ([#157655](https://github.com/pytorch/pytorch/pull/157655)) -- Use new CCCL API in v2.8 ([#160554](https://github.com/pytorch/pytorch/pull/160554)) -- Improve cupy device placement when device is provided with explicit index ([#158529](https://github.com/pytorch/pytorch/pull/158529)) +- Make cublaslt/hipblaslt workspaces persistent (#156495) +- Remove unnecessary warnings during the ATen compilation process (#157703) +- Slightly improve error message from `repeat_interleave` kernel (#157996) +- Add framework for explanations for common CUDA errors (#158395) +- Upgrade KernelLauncher `kernelLaunchCheck` to print help string (#158896) +- Prep for cutlass upgrade by ignoring `Wunused-but-set-variable` (#159276) +- Workaround ATen SFINAE under `libc++` (#161101) +- Implement changes to CCCL (CUB/Thrust/LibCUDACXX) usage in ATen (#153373) +- Add maybe unused flag to remove warning (#157655) +- Use new CCCL API in v2.8 (#160554) +- Improve cupy device placement when device is provided with explicit index (#158529) ## CPU (AArch64) -- Made PyTorch compilable with gcc-14 on ARM ([#157867](https://github.com/pytorch/pytorch/pull/157867)) +- Made PyTorch compilable with gcc-14 on ARM (#157867) ## MPS -- Add `shifted_chebyshev_polynomial_[tuvw]`, `igamma/igammac,grid_sampler_3d, native_dropout`/`native_dropout_backward` ([\#157488](https://github.com/pytorch/pytorch/pull/157488), [\#161927](https://github.com/pytorch/pytorch/pull/161927), [\#160541](https://github.com/pytorch/pytorch/pull/160541), [\#162108](https://github.com/pytorch/pytorch/pull/162108)) -- Extend atomic operations to all int types ([\#158179](https://github.com/pytorch/pytorch/pull/158179)) -- Extend `index_put` to complex types ([\#160159](https://github.com/pytorch/pytorch/pull/160159)) -- Extend `addmm` to integral types ([\#160270](https://github.com/pytorch/pytorch/pull/160270)) -- Add support for unsigned types ([\#159094](https://github.com/pytorch/pytorch/pull/159094)) -- Add API to query GPU core count ([\#160414](https://github.com/pytorch/pytorch/pull/160414)) -- Add `kthvalue` ([\#161817](https://github.com/pytorch/pytorch/pull/161817)) -- Type-promote tensor-iterator common dtype ([\#160334](https://github.com/pytorch/pytorch/pull/160334)) -- Implement `logcumsumexp` metal kernel 
([\#156858](https://github.com/pytorch/pytorch/pull/156858)) -- Enable `dlpack` integration ([\#158888](https://github.com/pytorch/pytorch/pull/158888)) -- Dynamic reductions ([\#159355](https://github.com/pytorch/pytorch/pull/159355)) -- Update `avg_pool2d` to use Metal kernel when `ceil_mode=True` ([\#161011](https://github.com/pytorch/pytorch/pull/161011)) +- Add `shifted_chebyshev_polynomial_[tuvw]`, `igamma/igammac,grid_sampler_3d, native_dropout`/`native_dropout_backward` (#157488, #161927, #160541, #162108) +- Extend atomic operations to all int types (#158179) +- Extend `index_put` to complex types (#160159) +- Extend `addmm` to integral types (#160270) +- Add support for unsigned types (#159094) +- Add API to query GPU core count (#160414) +- Add `kthvalue` (#161817) +- Type-promote tensor-iterator common dtype (#160334) +- Implement `logcumsumexp` metal kernel (#156858) +- Enable `dlpack` integration (#158888) +- Dynamic reductions (#159355) +- Update `avg_pool2d` to use Metal kernel when `ceil_mode=True` (#161011) ## ROCm -- Additional hipify mappings ([#158056](https://github.com/pytorch/pytorch/pull/158056), [#158352](https://github.com/pytorch/pytorch/pull/158352), [#161992](https://github.com/pytorch/pytorch/pull/161992)) -- Refactor `composable_kernel` (CK) backend user interface to improve user experience ([#152951](https://github.com/pytorch/pytorch/pull/152951)) -- Allow use of `rocSOLVER` for Cholesky inversion. ([#157154](https://github.com/pytorch/pytorch/pull/157154)) -- AOT Inductor enable gfx950 for max autotune using CK ([#159195](https://github.com/pytorch/pytorch/pull/159195)) -- Add flag `torch.backends.miopen.immediate` to toggle MIOpen Immediate Mode instead of relying on `deterministic=True` and `benchmark=False` ([#158951](https://github.com/pytorch/pytorch/pull/158951)) -- MIOpen convolutions no longer call `reshape_` or unexpectedly change memory formats ([#161687](https://github.com/pytorch/pytorch/pull/161687)) +- Additional hipify mappings (#158056, #158352, #161992) +- Refactor `composable_kernel` (CK) backend user interface to improve user experience (#152951) +- Allow use of `rocSOLVER` for Cholesky inversion. 
(#157154) +- AOT Inductor enable gfx950 for max autotune using CK (#159195) +- Add flag `torch.backends.miopen.immediate` to toggle MIOpen Immediate Mode instead of relying on `deterministic=True` and `benchmark=False` (#158951) +- MIOpen convolutions no longer call `reshape_` or unexpectedly change memory formats (#161687) ## XPU -- Support Intel GPU quantization ops in AOTInductor ([#156572](https://github.com/pytorch/pytorch/pull/156572)) -- Add `device_id` to Intel GPU properties to distinguish iGPUs with identical names ([#156481](https://github.com/pytorch/pytorch/pull/156481)) +- Support Intel GPU quantization ops in AOTInductor (#156572) +- Add `device_id` to Intel GPU properties to distinguish iGPUs with identical names (#156481) # Bug Fixes ## Python Frontend -- Add option in `torch.utils.cpp_extension.load_inline` to override gencode ([#156850](https://github.com/pytorch/pytorch/pull/156850)) -- Fix `max_width` computation in Tensor printing ([#126859](https://github.com/pytorch/pytorch/pull/126859)) -- Improve `pin_memory` error message on CPU-only systems ([#159994](https://github.com/pytorch/pytorch/pull/159994)) -- Making batching rule for `F.embedding` DTensor-aware ([#162117](https://github.com/pytorch/pytorch/pull/162117)) +- Add option in `torch.utils.cpp_extension.load_inline` to override gencode (#156850) +- Fix `max_width` computation in Tensor printing (#126859) +- Improve `pin_memory` error message on CPU-only systems (#159994) +- Making batching rule for `F.embedding` DTensor-aware (#162117) ## Autograd -- Fix `torch.autograd.Function` memory leak due to `torch.utils.checkpiont` early stopping ([#161171](https://github.com/pytorch/pytorch/pull/161171)) -- Fix `torch.autograd.graph.GradientEdge` for `torch.autograd.Function` ([#160098](https://github.com/pytorch/pytorch/pull/160098)) -- Match 0-dim gradients device type regardless of subclass-ness ([#160165](https://github.com/pytorch/pytorch/pull/160165)) +- Fix `torch.autograd.Function` memory leak due to `torch.utils.checkpiont` early stopping (#161171) +- Fix `torch.autograd.graph.GradientEdge` for `torch.autograd.Function` (#160098) +- Match 0-dim gradients device type regardless of subclass-ness (#160165) ## Distributed ### c10d - - Fix slow init due to repeated dns resolution failure in socket ([#159596](https://github.com/pytorch/pytorch/pull/159596)) - - Fix `setGroupName` and `setGroupDesc` in `group_split` and `merge_remote_group` ([#159429](https://github.com/pytorch/pytorch/pull/159429)) - - Fix a bug of distributed 'gather' with noncontiguous tensors on the Gloo backend ([#158903](https://github.com/pytorch/pytorch/pull/158903)) - - Fix a bug of distributed 'gather' with noncontiguous tensors on the NCCL backend ([#159549](https://github.com/pytorch/pytorch/pull/159549)) - - Fix data inconsistencies when using `batch_isend_irecv` with 2D tensor views by making P2P tensors dense ([#163719](https://github.com/pytorch/pytorch/pull/163719)) - - Handle discontiguous `allgather`/`reducescatter` inputs ([#163712](https://github.com/pytorch/pytorch/pull/163712)) + - Fix slow init due to repeated dns resolution failure in socket (#159596) + - Fix `setGroupName` and `setGroupDesc` in `group_split` and `merge_remote_group` (#159429) + - Fix a bug of distributed 'gather' with noncontiguous tensors on the Gloo backend (#158903) + - Fix a bug of distributed 'gather' with noncontiguous tensors on the NCCL backend (#159549) + - Fix data inconsistencies when using `batch_isend_irecv` with 2D tensor views by making P2P 
### Device Mesh
  - Fix strings being incorrectly chained as iterables (#160709)

### DistributedDataParallel (DDP)
  - Fix incorrect interaction between `DDPOptimizer` and donated buffers (#160745)

### DTensor
  - Fix DTensor handling of the conjugate bit (#158030)
  - Fix `OpSchema` equality check (#161231)
  - Fix `grouped_mm` strategy for invalid stride cases (#158245)
  - Fix `F.one_hot` in DTensor (#162307), sketched below
  - Always disable the `ShardingPropagation` cache when compiling (#156868)
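As a rough illustration of the `F.one_hot` fix, a hypothetical two-rank DTensor sketch (mesh shape and labels are made up):

```python
import torch
import torch.nn.functional as F
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor import distribute_tensor, Shard

# Hypothetical two-rank run: torchrun --nproc_per_node=2 one_hot_dtensor.py
mesh = init_device_mesh("cuda", (2,))
labels = distribute_tensor(torch.arange(8), mesh, [Shard(0)])
one_hot = F.one_hot(labels, num_classes=8)  # sharding now propagates correctly (#162307)
print(one_hot.placements)
```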
### FullyShardedDataParallel (FSDP)
  - Fix a bug in FSDP offload `pin_memory` (#157147)
  - Fix to ensure writeback handles `NO_SHARD` correctly by flattening tensors before copying (#154369)

### FullyShardedDataParallel2 (FSDP2)
  - Fix error message for `fsdp_pre_all_gather` (#160817)
  - Fix `set_reduce_scatter_divide_factor` errors with `MixedPrecisionPolicy` (#155964)

### Pipeline Parallelism (PP)
  - Fix eval step under `no_grad()` (#159293)
  - Fix zero bubble schedules for `eval()` (#159475)

### TensorPipe
  - Fix `import torch` if compiled without `TensorPipe` (#159461)

### TorchElastic
  - Fix wrong log file name in the docs of `torch.distributed.elastic.multiprocessing.start_processes()` (#160396)

## Linear Algebra Frontend
- Avoid downcasts for fp16 matmul on the BLAS backend (#161999)

## Profiler
- Fix linter for the Global Annotations flag in Snapshot (#157858)

## FX
- Fix `split_module` with symint (#160093)
- Fix `getattr_recursive` with ModuleList (#161204)
- Skip const folding with symbolic expression (#161437)
- Fix qualified name for methods of `torch.Tensor` (#162224)

## Dynamo
- Fix segfault due to interaction between Dynamo backends and `torch.compiler.reset()` (#156527)
- Fix crash due to bad interaction between recompilations and `with` blocks in Python 3.11+ (#162318)

## torch.nn
- Fix silent correctness issue when backpropagating gradients through `FlexAttention` (#163677), see the example below
- Fix `return_lse` warning message in `FlexAttention` (#163578)
- Fix `FlexAttention` head broadcast (#163426)
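To make the `FlexAttention` fixes concrete, a small self-contained sketch (shapes and the `score_mod` are arbitrary; assumes a CUDA device):

```python
import torch
from torch.nn.attention.flex_attention import flex_attention

# toy shapes: (batch, heads, seq_len, head_dim)
q, k, v = (torch.randn(2, 4, 128, 64, device="cuda", requires_grad=True) for _ in range(3))

def rel_bias(score, b, h, q_idx, kv_idx):
    # simple relative-position bias used as the score_mod
    return score + 0.01 * (q_idx - kv_idx)

out, lse = flex_attention(q, k, v, score_mod=rel_bias, return_lse=True)
out.sum().backward()  # backward pass covered by the silent-correctness fix (#163677)
```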
## Inductor
- Fix wrong meta function for `constant_pad_nd` (#159878)
- Fix learnable bias assertion error in Inductor (#161170)
- Fix int64 from `MutationOutput` buffer (#162020)
- Fix Inductor CUDA sort `NaN` behavior (#159308)
- Fix layout for local buffer in outer loop fusion (#160857)
- Fix slice scatter `dtype` consistency (#160851)
- Fix 3D tiled online softmax (#162341)
- Fix unsafe collective reorder past wait in Inductor (#157489)
- Fix `FallbackKernel` alias function to avoid incorrect aliasing for custom ops (#163227)

## Ahead-Of-Time Inductor (AOTI)
- Fix a bug in `load_constants` (#161887)
- Fix wrong propagation of `fallback_ops_dict` in `gen_aoti_c_shim` (#159904)
- Fix unbacked symint and memory leak in Inductor memory planning (#159839)
- Fix memory leak in AOTI when calling `aoti_torch_as_strided` (#162118)
- Explicitly delete the tensor returned by `wait_tensor` (#159502)
- Fix memory leak from `all_reduce` (#159818)

## Composability
- Make functionalization `ViewMeta` serializable with pickle (#163769)

## Export
- Fix bug in constants lifting pass (#157719)
- Fix `from_node` provenance in unlift pass (#157943)
- Fix `NaN` serialization (#155359)
- Fix deserialization for unbacked symbol ranges (#158681)
- Fix runtime assert handling in deserialization (#159060)
- Fix FQN handling in the unflattener (#159418)
- Fix `nn_module_stack` for `assert_tensor_metadata` nodes (#159625)
- Fix usage for `move_to_device_pass` (#159992, #160528, #162301)
- Avoid name overwrites for aliased exported module parameters (#160600)
- Avoid inlining `dynamo.disable`s in unflattening (#161306)
- Fix deserialization issue for storage offset (#162172)
- Remove `.contiguous()` when saving weights to raw bytes to preserve the original storage size of the tensor (#163587); a save/load round-trip is sketched below
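Several of the Export fixes above touch (de)serialization. A minimal round-trip sketch, not taken from any of the PRs:

```python
import torch
from torch.export import export, save, load

class MLP(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = torch.nn.Linear(8, 4)

    def forward(self, x):
        return torch.relu(self.fc(x))

ep = export(MLP(), (torch.randn(2, 8),))
save(ep, "mlp.pt2")           # exercises the serialization paths fixed above
reloaded = load("mlp.pt2")
print(reloaded.module()(torch.randn(2, 8)).shape)
```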
## Quantization
- Avoid `NaN` in fp8 output of CPU `qlinear` and `qconv` ops (#160957)
- Fix segmentation fault in `choose_qparams_optimized` (#161966)

## Foreach
- `chunk_size` should always be `int64_t` for Foreach functors (#156872)

## ONNX
- Make ONNX export of SDPA match ATen behavior (#159973)
- Fix `rotary_embedding_23` implementation (#162865)
- Fix export behavior when the model has `None` as output (#160200)
- Fix lower opset version support in `dynamo=True` (#161056)
- Fix `index_put_` usage (#161263)

## C++ Extensions
- Fix CPP extension distributed warning for `TORCH_CUDA_ARCH_LIST` to only log when running non-distributed or on rank 0 (#162764)

## C++ Frontend
- Fix `torch.utils.cpp_extension` parser for clang version 20.1.7+libcxx (#157666)
- Fix `MakeTensor::computeStorageSize()` calculation (#158690)
- Fix static initialization order issue with `AllocatorConfig` (#159629)

## Build Frontend
- Turn on `BUILD_BUNDLEPTXAS=1` to allow compilation on newer GPUs (#163988)

## CUDA
- Handle uninitialized `torch.backends.cuda.matmul.fp32_precision` (#161102)
- Fix `nansum` in non-JIT build (#158633)
- Decrease launch bounds of CTCLoss backward for Blackwell to avoid a crash (#159522)
- Implement workaround for `cudaErrorNotSupported` (#162412)
- Fix missing `__syncthreads` in MultiMarginLoss backward (#158994)
- Roll back cuDNN frontend upgrade and update Meta registration due to compile issues (#163104)
- Disable cuDNN for 3D convolutions with `kernel size != 1` for cuDNN 9.8+ (#163581)
## CPU
- Add check so non-aarch64 platforms can hit the `MKLDNN` path (#162168)

## MPS
- Fix batch norm incorrect gradient (#156867)
- Do not crash if `tensor dim > INT_MAX` (#158824)
- Avoid outputting zeros from `exponential_` for MPS (#159386)
- Fix MPS autocast for `ConvTranspose3d` (#160345)
- Fix MPS `conv3d` autocast bias dtype mismatch (#160423)
- Fix error check for `torch.var` on scalar (#160889)
- Fix `index_add` for complex + int64 and for int64 input + zero-dim index (#160926, #161511)
- Fix `constant_pad_nd_mps` bug when pad is empty (#161149)
- Fix `index_select` for `scalar_types` (#161206)
- Fix `index_copy` for scalars and `index_copy` for strided indices (#161267, #161333)
- Ensure that tensors are contiguous before using the MPS linear kernel (#161641)
- Address `NaN`s if SDPA is called with all values masked from query (#157727)
- Fix invalid formatting (#158436)
- Fix empty input in posneg functions (#161824)
- Migrate round unary op to Metal (#161712)
- Type-promote tensor-iterator common dtype (#160334)
- Fix regression in 2.8.0 for `scaled_dot_product_attention` using MPS (#163598), see the example below
- Chunk `fillBuffer` into 4GB slices to avoid a regression on MacOS 26 (#164108)
- Fix latent bug that can result in a segfault in CPP extensions (#164093)
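For reference, the SDPA-on-MPS call path affected by #163598; shapes are arbitrary and the snippet only runs on Apple-silicon machines:

```python
import torch
import torch.nn.functional as F

if torch.backends.mps.is_available():
    # toy shapes: (batch, heads, seq_len, head_dim)
    q, k, v = (torch.randn(1, 8, 128, 64, device="mps") for _ in range(3))
    out = F.scaled_dot_product_attention(q, k, v)
    print(out.shape)
```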
## ROCm
- Fix Inductor with cudagraph trees `hip:0` device error (#161221)
- Fix some build failures and support some BLAS calls on Windows (#161981)
- Fix undefined symbol linker error after exposing MIOpen symbols on Windows (#156479)
- Fix finding ROCm/HIP version on Windows (#156486)
- Fix LoadHIP handling of environment variable paths on Windows (#159080)
- Add hipcc compatibility flags to `cpp_extension.py` on Windows (#159790)
- In SDPA via AOTriton, `logsumexp` needs scaling back to natural base (#156903)
- Check stream graph capture status in the `memcpy_and_sync` inline function (#158165)

## XPU
- Fix `cpp_extension` compatibility with `intel-deep-learning-essentials-2025.2` (#161012)

## JIT
- Make `ErrorReport::CallStack` thread-safe (#160386)
- Fix `RemoveProfileNodesAndSpecializeTypes` handling for `Tensor?` that is resolved to `None` (#161538)

# Performance
## Optimizer
- Use `addmm` to improve Newton–Schulz orthogonalization in Muon (#161379)
- Avoid stream sync in SWA `AveragedModel.update_parameters()` (#157705), usage sketched below
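A minimal SWA usage sketch for context on the `update_parameters()` change (model, optimizer, and schedule are placeholders):

```python
import torch
from torch.optim.swa_utils import AveragedModel

model = torch.nn.Linear(10, 2)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
swa_model = AveragedModel(model)

for step in range(100):
    loss = model(torch.randn(32, 10)).pow(2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
    if step >= 50:
        # now avoids a host/device stream sync on every call (#157705)
        swa_model.update_parameters(model)
```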
## Autograd
- Fix SVD forward-mode AD multiplication priority (#161027)

## Dynamo
- Recursive `dict` tag optimization for faster guard evaluation (#159183)

## Inductor
- Improve performance of A16W4 and A16W8 `GEMM` templates (#159127, #161148)
- More aggressive persistent reduction (#161055)
- Add a few outer dimension reduction cases for LOAF (#162028)
- Fuse two RoPE kernels into a single kernel to improve runtime efficiency (#161420)

## Export
- Caching optimizations for the placeholder naming pass (#158594)
- Add static dispatch kernels for `fmod.Scalar` and `scale_gradient` (#160654, #160454)

## CUDA
- Use a nonblocking copy to avoid stream synchronization for GPU tensor indexing with a CPU mask (#156384)
- Disable cudagraph GCs by default to improve capture performance (#158649)

## Release Engineering
- Upgrade to ROCm 6.4.1 and 6.4.2 patch releases (#156636, #158887, #158886, #158651, #159001)
- Migrate PyTorch ROCm CI to MI325 capacity (#159059, #159649, #161184)
- Enable B200 PyTorch benchmark testing (#158011, #157341)

## MPS
- Optimize cummin/cummax Metal kernels (#156794)
- Speed up `torch.full` for 1-byte types (#158874)
- Speed up `argmax`/`argmin` (#159524)
- Improve performance of `max_pool3d` (#157875)
- Avoid calling tensor ops in the `max_pool3d` impl (#157874)
- Move `max_pool2d` to Metal for `stride != 1` (#157876)
## ROCm
- SDPA now uses AOTriton 0.11b (#161754)
- `hipblaslt` is used by default on gfx908 for ROCm >= 6.3 (#159092)
- Enable MIOpen channels-last 3D for conv and batchnorm (#160529)
- Remove extra transposes in NHWC convolutions on MIOpen (#160435)
- Remove extra sync in `tensor.item()` (#158486)
- Elementwise and reduction kernel perf improvements (#159430, #159652, #160444, #160466, #161054, #161180, #161181)
- Enable build of `fbgemm_gpu` genai sources for grouped GEMM support (#160676)

## XPU
- Enable tensor memory descriptor Triton template for Intel GPU (#161600)
# Documentation
## Python Frontend
- Improve documentation for `torch.lobpcg`, `torch.clone`, `torch.matmul`, `torch.max`, `torch.gather`, `torch.Tensor.scatter_`, `torch.empty_like`, `torch.randint`, `torch.mul`, `torch.min`, `torch.sort`, `torch.full_like`, `torch.histogramdd`, and `torch.hamming_window` (#156139, #157007, #161424, #156153, #157929, #157920, #158050, #158731, #160312, #161539, #162051, #158275, #152682)
- Remove TorchScript-related sections in serialization docs (#156648)
- Fix typo in `torch.set_float32_matmul_precision` docs (#158191); see the usage example below
- Fix docstring for `torch.nn.utils.clip_grads_with_norm_` to reflect clamping behavior (#158200)
- Fix the description of `edge_order` in the `torch.gradient` docs (#159130)
- Add `torch.segment_reduce` docs (#154352)
- Add examples to `torch.is_floating_point` and `torch.is_complex` docs (#161951)
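As a quick reminder of the API touched by the docs fix (device and sizes are arbitrary):

```python
import torch

# "highest" keeps full fp32; "high"/"medium" allow TF32-style fast paths on supported GPUs
torch.set_float32_matmul_precision("high")

a = torch.randn(1024, 1024, device="cuda")
b = torch.randn(1024, 1024, device="cuda")
c = a @ b
print(torch.get_float32_matmul_precision())
```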
## torch.nn
- Improve description of `padding` for `avg_poolnd` (#159142)
- Improve `CrossEntropyLoss` docs with example of incorrect target specification (#155649); see the example below
- Remove redundant dtype conversion in the `scaled_dot_product_attention` example (#161613)
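For readers skimming the notes, the target conventions the `CrossEntropyLoss` doc change clarifies (a generic sketch, not the exact example added to the docs):

```python
import torch
import torch.nn as nn

loss_fn = nn.CrossEntropyLoss()
logits = torch.randn(4, 10)            # (batch, num_classes), unnormalized scores

# Correct: integer class indices of shape (batch,)
target_idx = torch.tensor([1, 0, 9, 3])
loss_a = loss_fn(logits, target_idx)

# Also valid: per-class probabilities of shape (batch, num_classes)
target_probs = torch.softmax(torch.randn(4, 10), dim=-1)
loss_b = loss_fn(logits, target_probs)

# Common mistake: float class indices, or targets with a trailing singleton dim like (batch, 1)
```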
## Optimizer
- Properly document per-module optimizer APIs, e.g. `torch.optim.adam.Adam` (#158483, #158669, #160194)
- Add note for clarity in Adafactor doc #154862 (#155248)
- Improve `zero_grad` description (#161239)

## Autograd
- Improve `torch.inference_mode` docs and error message (#161164)

## Distributed
### c10d
  - Documented the barrier collective's interaction with `device_id` (#159389)
  - Fix comment to match logic in `distributed_c10d.py` (#162158)

### DTensor
  - Rewrote doc of `TupleStrategy` (#158132)
  - Documented `redistribute_costs` (#158495)

### FullyShardedDataParallel (FSDP)
  - Removed FSDP1 developer note (#158991)

## Profiler
- Update PT2 Profiler Torch-Compiled Region image (#158066)
- Fix experimental config documentation (#156586)
- Update README (#159816)

## FX
- Fix typos in `torch/` (`torch/fx/`) (#156604)
- Add typing (#158450)
- Fix typo in FX interpreter class docs (#162055)
- Remove allow-untyped-defs from `torch/fx/experimental/migrate_gradual_types/util.py` (#157236)

## Inductor
- Add documentation for CUDAGraph partition (#159450)

## Export
- Update docs around draft export, dynamism, and PT2 Archive (#157750)

## ONNX
- Update export docstring (#162622)
- Delete deprecated tutorial page link (#157310)
- Filter out TorchScript sentences (#158850)
- Fix doc typo for `symbolic_multi_out` (#160702)
- Simplify deprecated entities in `onnx.md` (#159312)
- Update export docstring and set `fallback=False` by default (#162622, #162726)
- Fix typo in error message: summit -> submit (#162587)

## Release Engineering
- Add decorator to create deprecation warnings (#155127)
- Add runnable code examples to export documentation (#158506)
- Add developer notes for integrating new backends into PyTorch (#158644)

## XPU
- Update supported OS to Windows 11 & Ubuntu 24.04/25.04 for Intel client GPU (#161699)

# Security
## Python Frontend
- Don't store flamegraph to tmp folder (#157374)

# Developers
## Python Frontend
- Better sample inputs for `addmm` OpInfo (#160234)
## Distributed
### c10d
  - Add `waitcounter` for watchdog and heartbeat monitoring thread (#157480)
  - Made `torch.distributed.breakpoint` set a long timeout (#158481)
  - Add `check_rng_sync` util (#160283)
  - Add `FlightRecorder` support for `ProcessGroupXCCL` (#158568)
  - Add `early_stop` kwarg to `torch.utils.checkpoint` (#160781); a usage sketch follows this section

### DTensor
  - Wrap sharding prop error with contextual exception (#161574)
  - Add check if tracing for sharding propagation to handle un-hashable keys in DTensor (#160798)

### Device Mesh
  - Add error when users try to slice a non-contiguous flattened-dim submesh (#157523)
  - Make the repr shorter when the debug env var is not set (#158822)

### ShardedTensor
  - Make error message descriptive in ShardedTensor creation (#150627, #159423)

### Pipeline Parallelism (PP)
  - Add profiling to schedule execution (#160753)
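Assuming the new keyword is spelled `early_stop` as the c10d note above states, a minimal activation-checkpointing sketch:

```python
import torch
from torch.utils.checkpoint import checkpoint

def block(x):
    return torch.relu(x @ x.t()).sum()

x = torch.randn(64, 64, requires_grad=True)
# early_stop halts recomputation once the needed activations are rematerialized;
# previously this behavior was toggled via a context manager rather than a kwarg.
out = checkpoint(block, x, use_reentrant=False, early_stop=True)
out.backward()
```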
## FX
- Consolidate stack trace in Tracer (#156257, #157302, #158266)
- Separate provenance tracking to different levels (#160383, #158399, #158796, #159484)
- Fix `register_forward_pre_hook not supported on ScriptModule` error (#156904)
- Add `__eq__` function to NodeSource (#158170)
- Add `__hash__` function to NodeSource (#158322)
- Cache dict and string rep for better perf in NodeSource (#158372)
- Recover node source from dict (#158373, #158473)
- Include error stacktrace and graph module in `tlparse` error (#158469)
- Add `expanded_def` option for FX printing, render descriptor, update tests (#158708)
- Remove `co_lnotab` in favor of `co_linetable` (#159227)
- Remove duplicate imports (#161685)
- Include output tensor metadata for `CompiledFxGraph` (#159311)

## Inductor
- Deprecate `allow_tf32` in `tl.dot(..., allow_tf32=...)` in favor of `tl.dot(..., input_precision=...)` (#160711); a migration sketch follows
- Log autotune choices and benchmark results to scuba/chrome trace (#159496)
- Add TLParse artifact for logging runtime of collective and compute ops (#159730)
- Call `jit_post_compile_hook` within the Inductor Triton kernel compile path (#161443)
- Prune configs that require more shared memory than the hardware limit (#161996)
- Runtime estimations using the NCCL estimator in mm-only benchmark mode (#161405)
- Don't use `torch.backends.cuda.matmul.allow_tf32` in the Inductor cache key (#159480)
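A minimal Triton sketch of the new spelling (assumes a recent Triton with `input_precision`; tile sizes and strides are arbitrary):

```python
import torch
import triton
import triton.language as tl

@triton.jit
def tiny_matmul(a_ptr, b_ptr, c_ptr, K: tl.constexpr):
    offs = tl.arange(0, 16)
    a = tl.load(a_ptr + offs[:, None] * K + offs[None, :])
    b = tl.load(b_ptr + offs[:, None] * 16 + offs[None, :])
    # input_precision="ieee" | "tf32" | "tf32x3" replaces the deprecated allow_tf32 flag
    c = tl.dot(a, b, input_precision="ieee")
    tl.store(c_ptr + offs[:, None] * 16 + offs[None, :], c)

a = torch.randn(16, 16, device="cuda")
b = torch.randn(16, 16, device="cuda")
c = torch.empty(16, 16, device="cuda")
tiny_matmul[(1,)](a, b, c, K=16)
torch.testing.assert_close(c, a @ b, rtol=1e-4, atol=1e-4)
```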
## Ahead-Of-Time Inductor (AOTI)
- Better error message when no .so/cpp files are found (#156863)
- Clean up old APIs in the AOTI C shim (#158400)
- Add Inductor provenance mapping for cpp extern kernel (#161656, #162069)
- Print out error message when the nvcc compiler fails (#157203)
- Add kernel information JSON generation for AOTI packages (#160540)

## Composability
- Stop suggesting to use `guard_size_oblivious` on data-dependent errors (#160510)
- Avoid unnecessary slices resulting in data-dependent errors (#157528)

## Quantization
- Revamp dtype documentation (#156087)
- Use new type statement to fix public API of types (#158487)

## Dataloader Frontend
- Add `torch.utils.data` samplers benchmark script (#156974)
- Add `torch.utils.data.DataLoader` benchmark script (#159432)

## Release Engineering
- Replace `setup.py develop` with `pip install -e` for development builds (#155998, #156027, #156710, #156709)

## XPU
- Upgrade Intel GPU software stack package to intel-deep-learning-essentials-2025.2 (#158733)