First, we need to pin down the lane semantics of shfl_xor_sync under a 1-indexed laneid(), for use in the merge steps. Second, can we mix shfl_up_sync and shfl_down_sync for the CAS steps? Finally, does this actually provide any benefit?
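As a starting point for the first question, here is a minimal host-side sketch of the index mapping, not device code. It assumes that laneid() is a local helper returning 1..32, that the shuffle uses a full mask and the default width of 32, and that "CAS" means a compare-and-swap step of a sorting network in ascending order. The hardware lane ID is 0-indexed, so the XOR has to be applied to laneid() - 1. The names xor_partner_1indexed, model_shfl_xor and model_cas_step are hypothetical and exist only for illustration.

```cpp
#include <algorithm>
#include <array>

constexpr int kWarpSize = 32;

// Hypothetical helper: partner of a 1-indexed lane under an XOR shuffle.
// Convert to the 0-indexed hardware lane, XOR, then convert back.
// Assumes 1 <= lane1 <= 32 and 0 <= lane_mask < 32.
constexpr int xor_partner_1indexed(int lane1, int lane_mask) {
    return ((lane1 - 1) ^ lane_mask) + 1;
}

// Host-side model of an XOR shuffle with a full mask and width 32:
// 0-indexed lane i receives the value held by lane (i ^ lane_mask).
std::array<int, kWarpSize> model_shfl_xor(const std::array<int, kWarpSize>& v,
                                          int lane_mask) {
    std::array<int, kWarpSize> out{};
    for (int lane = 0; lane < kWarpSize; ++lane) {
        out[lane] = v[lane ^ lane_mask];
    }
    return out;
}

// Model of one ascending compare-and-swap step built on the XOR shuffle:
// the lower lane of each pair keeps the minimum, the upper lane the maximum.
std::array<int, kWarpSize> model_cas_step(const std::array<int, kWarpSize>& v,
                                          int lane_mask) {
    const std::array<int, kWarpSize> other = model_shfl_xor(v, lane_mask);
    std::array<int, kWarpSize> out{};
    for (int lane = 0; lane < kWarpSize; ++lane) {
        const bool is_lower = lane < (lane ^ lane_mask);
        out[lane] = is_lower ? std::min(v[lane], other[lane])
                             : std::max(v[lane], other[lane]);
    }
    return out;
}
```

Note that XOR-ing the 1-indexed value directly gives the wrong partner: for laneid() == 4 and a mask of 1, the partner is lane 3, whereas 4 ^ 1 is 5.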