adding cherry picks

liangel-02 · liangel-02 · commit b79d2c0fe2cb · 2025-10-07T14:48:35.000-07:00
diff --git a/2.9.0/final.md b/2.9.0/final.md
@@ -517,6 +517,9 @@ We move enabling `pin_memory` back inside `BaseDataLoaderIter`. This is required
 - Fix 3d tiled online softmax ([#162341](https://github.com/pytorch/pytorch/pull/162341))
 - Fix unsafe collective reorder past wait in Inductor ([#157489](https://github.com/pytorch/pytorch/pull/157489))
 - Fix `FallbackKernel` alias function to avoid incorrect aliasing for custom ops ([#163227](https://github.com/pytorch/pytorch/pull/163227))
+- Fix silent correctness issue when backpropagating gradients in `FlexAttention` ([#163677](https://github.com/pytorch/pytorch/pull/163677))
+- Fix `return_lse` warning message in `FlexAttention` ([#163578](https://github.com/pytorch/pytorch/pull/163578))
+- Fix `FlexAttention` head broadcast ([#163426](https://github.com/pytorch/pytorch/pull/163426))
 
 ## Ahead-Of-Time Inductor (AOTI)
 - Fix a bug from `load_constants` ([#161887](https://github.com/pytorch/pytorch/pull/161887))
@@ -554,6 +557,9 @@ We move enabling `pin_memory` back inside `BaseDataLoaderIter`. This is required
 - Fix lower opset version support in `dynamo=True` ([#161056](https://github.com/pytorch/pytorch/pull/161056))
 - Fix `index_put_` usage ([#161263](https://github.com/pytorch/pytorch/pull/161263))
 
+## C++ Extensions
+- Fix the C++ extension warning for `TORCH_CUDA_ARCH_LIST` so that it is logged only in non-distributed runs or on rank 0 ([#162764](https://github.com/pytorch/pytorch/pull/162764))
+
 ## C++ Frontend
 - Fix `torch.utils.cpp_extension` parser for clang version 20.1.7+libcxx ([#157666](https://github.com/pytorch/pytorch/pull/157666))
 - Fix `MakeTensor::computeStorageSize()` calculation ([#158690](https://github.com/pytorch/pytorch/pull/158690))
@@ -591,6 +597,9 @@ We move enabling `pin_memory` back inside `BaseDataLoaderIter`. This is required
 - Fix empty input in posneg functions ([#161824](https://github.com/pytorch/pytorch/pull/161824))
 - Migrate round unary op to Metal ([#161712](https://github.com/pytorch/pytorch/pull/161712))
 - Type-promote tensor-iterator common dtype ([#160334](https://github.com/pytorch/pytorch/pull/160334))
+- Fix regression in 2.8.0 for `scaled_dot_product_attention` using MPS ([#163598](https://github.com/pytorch/pytorch/pull/163598))
+- Chunk `fillBuffer` into 4 GB slices to avoid regression on macOS 26 ([#164108](https://github.com/pytorch/pytorch/pull/164108))
+- Fix latent bug that can result in a segfault in C++ extensions ([#164093](https://github.com/pytorch/pytorch/pull/164093))
 
 ## ROCm
 - Fix Inductor with cudagraph trees `hip:0` device error ([#161221](https://github.com/pytorch/pytorch/pull/161221))