InfiniTensor · wawahejun · Aug 24, 2025
diff --git a/infiniop/ops/README.md b/infiniop/ops/README.md
@@ -3,12 +3,19 @@
 - [`Add`](/infiniop/ops/add/README.md)
 - [`Causal Softmax`](/infiniop/ops/causal_softmax/README.md)
 - [`Clip`](/infiniop/ops/clip/README.md)
+- [`Gather`](/infiniop/ops/gather/README.md)
 - [`GEMM`](/infiniop/ops/gemm/README.md)
+- [`IndexCopyInplace`](/infiniop/ops/index_copy_inplace/README.md)
+- [`Linear`](/infiniop/ops/linear/README.md)
+- [`Linear Backward`](/infiniop/ops/linear_backward/README.md)
 - [`Mul`](/infiniop/ops/mul/README.md)
 - [`Random Sample`](/infiniop/ops/random_sample/README.md)
 - [`Rearrange`](/infiniop/ops/rearrange/README.md)
 - [`RMS Norm`](/infiniop/ops/rms_norm/README.md)
 - [`RoPE`](/infiniop/ops/rope/README.md)
+- [`Scatter`](/infiniop/ops/scatter/README.md)
 - [`Softmax`](/infiniop/ops/softmax/README.md)
 - [`Sub`](/infiniop/ops/sub/README.md)
 - [`SwiGLU`](/infiniop/ops/swiglu/README.md)
+- [`Tril`](/infiniop/ops/tril/README.md)
+- [`Triu`](/infiniop/ops/triu/README.md)
diff --git a/infiniop/ops/gather/README.md b/infiniop/ops/gather/README.md
@@ -0,0 +1,142 @@
+# `Gather`
+
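+`Gather` is the **gather** operator. It collects values from the input tensor along a specified dimension, at the positions given by the index tensor. The computation can be expressed as: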
+
+$$ output[i][j][k] = input[index[i][j][k]][j][k] $$
+
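+(example for `dim = 0`)
+
+where `input` is the input tensor, `output` is the output tensor, `index` is the index tensor, and `dim` is the dimension to gather along.
+
+The implementation follows `torch.gather`; `sparse_grad` does not need to be considered.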
+
+## Interface
+
+### Compute
+
+```c
+infiniStatus_t infiniopGather(
+    infiniopGatherDescriptor_t desc,
+    void *workspace,
+    size_t workspace_size,
+    void *output,
+    const void *input,
+    const void *index,
+    void *stream
+);
+```
+
+<div style="background-color: lightblue; padding: 1px;"> Parameters: </div>
+
+- `desc`:
+  Operator descriptor initialized with `infiniopCreateGatherDescriptor()`.
+- `workspace`:
+  Pointer to the extra workspace required by the computation.
+- `workspace_size`:
+  Size of `workspace`, in bytes.
+- `output`:
+  Output tensor. See [Create Operator Descriptor](#create-operator-descriptor) for tensor constraints.
+- `input`:
+  Input tensor. See [Create Operator Descriptor](#create-operator-descriptor) for tensor constraints.
+- `index`:
+  Index tensor. See [Create Operator Descriptor](#create-operator-descriptor) for tensor constraints.
+- `stream`:
+  Compute stream/queue.
+
+<div style="background-color: lightblue; padding: 1px;"> Return values: </div>
+
+- [`INFINI_STATUS_SUCCESS`], [`INFINI_STATUS_BAD_PARAM`], [`INFINI_STATUS_INSUFFICIENT_WORKSPACE`], [`INFINI_STATUS_DEVICE_TYPE_NOT_SUPPORTED`], [`INFINI_STATUS_INTERNAL_ERROR`], [`INFINI_STATUS_BAD_TENSOR_DTYPE`].
+
+### Create Operator Descriptor
+
+```c
+infiniStatus_t infiniopCreateGatherDescriptor(
+    infiniopHandle_t handle,
+    infiniopGatherDescriptor_t *desc_ptr,
+    infiniopTensorDescriptor_t output_desc,
+    infiniopTensorDescriptor_t input_desc,
+    infiniopTensorDescriptor_t index_desc,
+    int dim
+);
+```
+
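+<div style="background-color: lightblue; padding: 1px;"> Parameters: </div>
+
+- `handle`:
+  Hardware handle of type `infiniopHandle_t`. See [`InfiniopHandle_t`] for details.
+- `desc_ptr`:
+  Pointer to the `infiniopGatherDescriptor_t` that will be initialized.
+- `output_desc` - { dT | (d1,...,dn) | (...) }:
+  Tensor descriptor for the `output` argument of the compute call. In-place computation is supported.
+- `input_desc` - { dT | (d1,...,dn) | (...) }:
+  Tensor descriptor for the `input` argument of the compute call.
+- `index_desc` - { int32/int64 | (d1,...,dn) | (...) }:
+  Tensor descriptor for the `index` argument of the compute call.
+- `dim`:
+  The dimension to gather along.
+
+Parameter constraints:
+
+- `dT`: any valid data type.
+- In-place computation is supported: `output` may point to the same address as `input` during computation.
+- The `index` tensor's data type must be `int32` or `int64`.
+- The `output` and `index` tensors must have the same shape.
+- The `input` tensor must match the `output` tensor in every dimension except `dim`.
+
+<div style="background-color: lightblue; padding: 1px;"> Return values: </div>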
+
+- [`INFINI_STATUS_SUCCESS`], [`INFINI_STATUS_BAD_PARAM`], [`INFINI_STATUS_BAD_TENSOR_SHAPE`], [`INFINI_STATUS_BAD_TENSOR_DTYPE`], [`INFINI_STATUS_BAD_TENSOR_STRIDES`], [`INFINI_STATUS_DEVICE_TYPE_NOT_SUPPORTED`].
+
+### Get Workspace Size
+
+```c
+infiniStatus_t infiniopGetGatherWorkspaceSize(
+    infiniopGatherDescriptor_t desc,
+    size_t *size
+);
+```
+
+<div style="background-color: lightblue; padding: 1px;"> Parameters: </div>
+
+- `desc`:
+  Operator descriptor initialized with `infiniopCreateGatherDescriptor()`.
+- `size`:
+  Address to which the computed workspace size is written.
+
+<div style="background-color: lightblue; padding: 1px;"> Return values: </div>
+
+- [`INFINI_STATUS_SUCCESS`], [`INFINI_STATUS_NULL_POINTER`], [`INFINI_STATUS_DEVICE_TYPE_NOT_SUPPORTED`].
+
+### Destroy Operator Descriptor
+
+```c
+infiniStatus_t infiniopDestroyGatherDescriptor(
+    infiniopGatherDescriptor_t desc
+);
+```
+
+<div style="background-color: lightblue; padding: 1px;"> Parameters: </div>
+
+- `desc`:
+  Input. The operator descriptor to destroy.
+
+<div style="background-color: lightblue; padding: 1px;"> Return values: </div>
+
+- [`INFINI_STATUS_SUCCESS`], [`INFINI_STATUS_DEVICE_TYPE_NOT_SUPPORTED`].
+
+## Known Issues
+
+None
+
+<!-- Links -->
+[`InfiniopHandle_t`]: /infiniop/handle/README.md
+
+[`INFINI_STATUS_SUCCESS`]: /common/status/README.md#INFINI_STATUS_SUCCESS
+[`INFINI_STATUS_BAD_PARAM`]: /common/status/README.md#INFINI_STATUS_BAD_PARAM
+[`INFINI_STATUS_INSUFFICIENT_WORKSPACE`]: /common/status/README.md#INFINI_STATUS_INSUFFICIENT_WORKSPACE
+[`INFINI_STATUS_DEVICE_TYPE_NOT_SUPPORTED`]: /common/status/README.md#INFINI_STATUS_DEVICE_TYPE_NOT_SUPPORTED
+[`INFINI_STATUS_INTERNAL_ERROR`]: /common/status/README.md#INFINI_STATUS_INTERNAL_ERROR
+[`INFINI_STATUS_NULL_POINTER`]: /common/status/README.md#INFINI_STATUS_NULL_POINTER
+[`INFINI_STATUS_BAD_TENSOR_SHAPE`]: /common/status/README.md#INFINI_STATUS_BAD_TENSOR_SHAPE
+[`INFINI_STATUS_BAD_TENSOR_DTYPE`]: /common/status/README.md#INFINI_STATUS_BAD_TENSOR_DTYPE
+[`INFINI_STATUS_BAD_TENSOR_STRIDES`]: /common/status/README.md#INFINI_STATUS_BAD_TENSOR_STRIDES
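
As a reading aid for the `Gather` README above, the `dim = 0` formula can be sketched as a plain C reference loop. This is a minimal sketch under stated assumptions, not the InfiniOP implementation: the function name `gather_dim0_f32`, the restriction to contiguous 3-D `float` tensors with `int64` indices, and the bounds check are hypothetical choices made for illustration, and `output` is assumed not to alias `input`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reference for gather along dim 0 (not part of InfiniOP).
 * input has shape (n_in, d1, d2); index and output have shape (n_out, d1, d2).
 * All tensors are assumed contiguous in row-major order. */
void gather_dim0_f32(float *output, const float *input, const int64_t *index,
                     size_t n_in, size_t n_out, size_t d1, size_t d2) {
    for (size_t i = 0; i < n_out; ++i) {
        for (size_t j = 0; j < d1; ++j) {
            for (size_t k = 0; k < d2; ++k) {
                size_t pos = (i * d1 + j) * d2 + k; /* offset of [i][j][k] */
                int64_t src = index[pos];
                /* Assumed policy: out-of-range indices are a caller error. */
                assert(src >= 0 && (size_t)src < n_in);
                /* output[i][j][k] = input[index[i][j][k]][j][k] */
                output[pos] = input[((size_t)src * d1 + j) * d2 + k];
            }
        }
    }
}
```

A loop like this may be useful as a CPU baseline when checking a device kernel's results on small shapes.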
diff --git a/infiniop/ops/index_copy_inplace/README.md b/infiniop/ops/index_copy_inplace/README.md
@@ -0,0 +1,139 @@
+# `IndexCopyInplace`
+
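+`IndexCopyInplace` is the **in-place index copy** operator. It copies the elements of the source tensor into the destination tensor at the positions given by the index tensor, and supports in-place operation. The computation can be expressed as: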
+
+$$ output[index[i]] = input[i] $$
+
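+where `input` is the input tensor, `output` is the output tensor, `index` is the index tensor, and `dim` is the dimension along which the copy is indexed (the formula above shows `dim = 0`).
+
+The implementation follows `torch.Tensor.index_copy_`.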
+
+## Interface
+
+### Compute
+
+```c
+infiniStatus_t infiniopIndexCopyInplace(
+    infiniopIndexCopyInplaceDescriptor_t desc,
+    void *workspace,
+    size_t workspace_size,
+    void *output,
+    const void *input,
+    const void *index,
+    void *stream
+);
+```
+
+<div style="background-color: lightblue; padding: 1px;"> Parameters: </div>
+
+- `desc`:
+  Operator descriptor initialized with `infiniopCreateIndexCopyInplaceDescriptor()`.
+- `workspace`:
+  Pointer to the extra workspace required by the computation.
+- `workspace_size`:
+  Size of `workspace`, in bytes.
+- `output`:
+  Output tensor. See [Create Operator Descriptor](#create-operator-descriptor) for tensor constraints.
+- `input`:
+  Input tensor. See [Create Operator Descriptor](#create-operator-descriptor) for tensor constraints.
+- `index`:
+  Index tensor. See [Create Operator Descriptor](#create-operator-descriptor) for tensor constraints.
+- `stream`:
+  Compute stream/queue.
+
+<div style="background-color: lightblue; padding: 1px;"> Return values: </div>
+
+- [`INFINI_STATUS_SUCCESS`], [`INFINI_STATUS_BAD_PARAM`], [`INFINI_STATUS_INSUFFICIENT_WORKSPACE`], [`INFINI_STATUS_DEVICE_TYPE_NOT_SUPPORTED`], [`INFINI_STATUS_INTERNAL_ERROR`], [`INFINI_STATUS_BAD_TENSOR_DTYPE`].
+
+### Create Operator Descriptor
+
+```c
+infiniStatus_t infiniopCreateIndexCopyInplaceDescriptor(
+    infiniopHandle_t handle,
+    infiniopIndexCopyInplaceDescriptor_t *desc_ptr,
+    infiniopTensorDescriptor_t output_desc,
+    infiniopTensorDescriptor_t input_desc,
+    infiniopTensorDescriptor_t index_desc,
+    int dim
+);
+```
+
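+<div style="background-color: lightblue; padding: 1px;"> Parameters: </div>
+
+- `handle`:
+  Hardware handle of type `infiniopHandle_t`. See [`InfiniopHandle_t`] for details.
+- `desc_ptr`:
+  Pointer to the `infiniopIndexCopyInplaceDescriptor_t` that will be initialized.
+- `output_desc` - { dT | (d1,...,dn) | (...) }:
+  Tensor descriptor for the `output` argument of the compute call. In-place computation is supported.
+- `input_desc` - { dT | (d1,...,dn) | (...) }:
+  Tensor descriptor for the `input` argument of the compute call.
+- `index_desc` - { int32/int64 | (d1,...,dn) | (...) }:
+  Tensor descriptor for the `index` argument of the compute call.
+- `dim`:
+  The dimension to operate along.
+
+Parameter constraints:
+
+- `dT`: any valid data type.
+- Arbitrary strides are supported; the implementation can reuse code from the `Rearrange` operator.
+- In-place computation is supported: `output` may point to the same address as `input` during computation.
+- The `index` tensor's data type must be `int32` or `int64`.
+
+<div style="background-color: lightblue; padding: 1px;"> Return values: </div>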
+
+- [`INFINI_STATUS_SUCCESS`], [`INFINI_STATUS_BAD_PARAM`], [`INFINI_STATUS_BAD_TENSOR_SHAPE`], [`INFINI_STATUS_BAD_TENSOR_DTYPE`], [`INFINI_STATUS_BAD_TENSOR_STRIDES`], [`INFINI_STATUS_DEVICE_TYPE_NOT_SUPPORTED`].
+
+### Get Workspace Size
+
+```c
+infiniStatus_t infiniopGetIndexCopyInplaceWorkspaceSize(
+    infiniopIndexCopyInplaceDescriptor_t desc,
+    size_t *size
+);
+```
+
+<div style="background-color: lightblue; padding: 1px;"> Parameters: </div>
+
+- `desc`:
+  Operator descriptor initialized with `infiniopCreateIndexCopyInplaceDescriptor()`.
+- `size`:
+  Address to which the computed workspace size is written.
+
+<div style="background-color: lightblue; padding: 1px;"> Return values: </div>
+
+- [`INFINI_STATUS_SUCCESS`], [`INFINI_STATUS_NULL_POINTER`], [`INFINI_STATUS_DEVICE_TYPE_NOT_SUPPORTED`].
+
+### Destroy Operator Descriptor
+
+```c
+infiniStatus_t infiniopDestroyIndexCopyInplaceDescriptor(
+    infiniopIndexCopyInplaceDescriptor_t desc
+);
+```
+
+<div style="background-color: lightblue; padding: 1px;"> Parameters: </div>
+
+- `desc`:
+  Input. The operator descriptor to destroy.
+
+<div style="background-color: lightblue; padding: 1px;"> Return values: </div>
+
+- [`INFINI_STATUS_SUCCESS`], [`INFINI_STATUS_DEVICE_TYPE_NOT_SUPPORTED`].
+
+## Known Issues
+
+None
+
+<!-- Links -->
+[`InfiniopHandle_t`]: /infiniop/handle/README.md
+
+[`INFINI_STATUS_SUCCESS`]: /common/status/README.md#INFINI_STATUS_SUCCESS
+[`INFINI_STATUS_BAD_PARAM`]: /common/status/README.md#INFINI_STATUS_BAD_PARAM
+[`INFINI_STATUS_INSUFFICIENT_WORKSPACE`]: /common/status/README.md#INFINI_STATUS_INSUFFICIENT_WORKSPACE
+[`INFINI_STATUS_DEVICE_TYPE_NOT_SUPPORTED`]: /common/status/README.md#INFINI_STATUS_DEVICE_TYPE_NOT_SUPPORTED
+[`INFINI_STATUS_INTERNAL_ERROR`]: /common/status/README.md#INFINI_STATUS_INTERNAL_ERROR
+[`INFINI_STATUS_NULL_POINTER`]: /common/status/README.md#INFINI_STATUS_NULL_POINTER
+[`INFINI_STATUS_BAD_TENSOR_SHAPE`]: /common/status/README.md#INFINI_STATUS_BAD_TENSOR_SHAPE
+[`INFINI_STATUS_BAD_TENSOR_DTYPE`]: /common/status/README.md#INFINI_STATUS_BAD_TENSOR_DTYPE
+[`INFINI_STATUS_BAD_TENSOR_STRIDES`]: /common/status/README.md#INFINI_STATUS_BAD_TENSOR_STRIDES
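
As a reading aid for the `IndexCopyInplace` README above, the formula `output[index[i]] = input[i]` can be sketched as a plain C reference loop over rows. This is a minimal sketch under stated assumptions, not the InfiniOP implementation: the function name `index_copy_dim0_f32`, the restriction to `dim = 0` on contiguous `float` tensors with `int64` indices, and the bounds check are hypothetical, and `output` and `input` are assumed to be distinct, non-overlapping buffers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical reference for index copy along dim 0 (not part of InfiniOP).
 * output has shape (n_out, row_size); input has shape (n_idx, row_size);
 * index has n_idx entries. Rows of output not named in index are left as-is. */
void index_copy_dim0_f32(float *output, const float *input, const int64_t *index,
                         size_t n_out, size_t n_idx, size_t row_size) {
    for (size_t i = 0; i < n_idx; ++i) {
        int64_t dst = index[i];
        /* Assumed policy: out-of-range indices are a caller error. */
        assert(dst >= 0 && (size_t)dst < n_out);
        /* output[index[i]] = input[i], one contiguous row at a time */
        memcpy(output + (size_t)dst * row_size,
               input + i * row_size,
               row_size * sizeof(float));
    }
}
```

If `index` contains repeated entries, this sketch lets the last matching row win; whether the real operator guarantees any ordering is not stated in the README.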