@yingxudeng
Collaborator

No description provided.

#elif defined(USE_CUDA)
cuda::act_and_mul(params.output, params.input, params.act_mode);
#else
LOG(FATAL) << "active not implemented";
Collaborator

Remove torch::Tensor active_tensor(ActivationParams& params) and add params.output = npu::active(params.input, params.act_mode) here for the NPU device.

Collaborator Author

auto output = torch::empty(
    {batch_size,
     intermediate_size_ / parallel_args_.tp_group_->world_size()},
    gate_up.options());

This is a good modification. However, as the snippet above shows, the current calling code still allocates the output buffer up front, whereas NPU operators typically allocate their own output and return it. Because of this difference, the external calling code would still need an #if block to skip the allocation in the NPU case.

To standardize the external calling code, I personally recommend aligning with the NPU's behavior: allocate the space within the operator wrapper/layer and then return it. This approach allows for a unified code structure for all external calls.
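To illustrate the proposal, here is a minimal sketch, not code from this PR. Tensor is a hypothetical stand-in for torch::Tensor (a flat float buffer), act_and_mul_into is a hypothetical CUDA/MLU-style kernel that writes into a caller-provided buffer, and active is a hypothetical wrapper that allocates the output itself and returns it. SiLU is assumed as the activation purely for illustration.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for torch::Tensor: a flat buffer of floats.
using Tensor = std::vector<float>;

// Hypothetical CUDA/MLU-style kernel: writes into a caller-provided buffer.
// The first half of gate_up is the gate, the second half is the up projection.
void act_and_mul_into(Tensor& out, const Tensor& gate_up) {
  const std::size_t half = gate_up.size() / 2;
  for (std::size_t i = 0; i < half; ++i) {
    const float g = gate_up[i];
    const float silu = g / (1.0f + std::exp(-g));  // SiLU assumed for illustration
    out[i] = silu * gate_up[half + i];
  }
}

// Hypothetical wrapper in the NPU style: allocates the output internally and
// returns it, so the call site needs no allocation and no #if block.
Tensor active(const Tensor& gate_up) {
  Tensor out(gate_up.size() / 2);
  act_and_mul_into(out, gate_up);
  return out;
}
```

With a wrapper of this shape, the call site reduces to a single line such as auto output = active(gate_up); on every backend.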

Collaborator

So don't add the two functions active_tensor and fused_layernorm_tensor to ops_api.h, because no other platform will use such an API.
Put them in npu_ops_api.h and call them directly in the NPU layer.

Collaborator Author

[image]

Regarding the code snippet above: if we implement the changes as suggested, we would need to introduce #if directives here to skip memory allocation, since the NPU operator handles this internally.

Could we instead consider moving the memory allocation logic for MLU and CUDA into their respective kernel wrappers? This would make the behavior more similar to PyTorch and allow us to unify the calling code here.

(PS: I haven't modified the CUDA or MLU code yet.)

#endif
}

torch::Tensor fused_layernorm_tensor(FusedLayerNormParams& params) {
Collaborator

Same as above.

Collaborator Author

@yingxudeng yingxudeng Dec 2, 2025

Similar to the previous comment.

// Must be less than or equal to rope_seqlen if not using discrete
// position_ids.
int64_t max_query_len;
torch::Tensor positions;
Collaborator

std::optional<torch::Tensor> position_ids already exists.

Collaborator Author

During the implementation, I noticed that position_ids is empty during the prefill stage, so I initially added positions. However, I see that the latest CUDA code addresses the same issue using a different approach. To ensure consistency, I plan to align my implementation with the CUDA method.

@yingxudeng force-pushed the feat/npu_backend_torch_2_kernels branch 7 times, most recently from 277c5fb to 1da759f, on December 5, 2025 13:16
@XuZhang99
Collaborator

For activation ops on NPU, revert this commit (refactor: standardize interface for active kernel execution.); this is all you need to do:

#elif defined(USE_NPU)
  // Reset params.output to an undefined (null) tensor.
  params.output = torch::Tensor();
  params.output = npu::active(params.input, params.act_mode);

@yingxudeng
Collaborator Author

yingxudeng commented Dec 5, 2025

#474 (comment) Regarding GPU/MLU code, this has no performance impact, as it simply utilizes pre-allocated memory space. However, for NPU code, repeated calls to operations such as RMS normalization may lead to frequent memory allocations and immediate deallocations within the external framework. Could this pattern of repeatedly allocating and promptly discarding memory potentially affect performance on NPU architectures?

@yingxudeng force-pushed the feat/npu_backend_torch_2_kernels branch from 28e6e79 to a0382bb on December 5, 2025 16:15
params.output = params.input;
params.residual_out = params.residual;
} else {
params.output = torch::empty(
Collaborator

auto output = torch::empty_like(input);
The output is already created, so why create it again?

@XuZhang99
Collaborator

XuZhang99 commented Dec 6, 2025

#474 (comment) Regarding GPU/MLU code, this has no performance impact, as it simply utilizes pre-allocated memory space. However, for NPU code, repeated calls to operations such as RMS normalization may lead to frequent memory allocations and immediate deallocations within the external framework. Could this pattern of repeatedly allocating and promptly discarding memory potentially affect performance on NPU architectures?

in dense_mlp.cpp:

torch::Tensor output;
if(Device::type!="npu"){
    output = torch::empty(
        {batch_size,
         intermediate_size_ / parallel_args_.tp_group_->world_size()},
        gate_up.options());
}

btw, you need to learn more about memory management in torch.

#if defined(USE_NPU)
return npu::fused_layernorm(
#elif defined(USE_NPU)
params.output = npu::fused_layernorm(
Collaborator

Norm ops for NPU need to support both fused and non-fused modes.

@yingxudeng
Collaborator Author

#474 (comment) Regarding GPU/MLU code, this has no performance impact, as it simply utilizes pre-allocated memory space. However, for NPU code, repeated calls to operations such as RMS normalization may lead to frequent memory allocations and immediate deallocations within the external framework. Could this pattern of repeatedly allocating and promptly discarding memory potentially affect performance on NPU architectures?

in dense_mlp.cpp:

torch::Tensor output;
if(Device::type!="npu"){
    output = torch::empty(
        {batch_size,
         intermediate_size_ / parallel_args_.tp_group_->world_size()},
        gate_up.options());
}

btw, you need to learn more about memory management in torch.

Thank you for your review. Going forward, I will replace the unavoidable #if defined macros with runtime checks based on Device::type. As for my description of memory allocation and deallocation, please disregard it; it was translated by an LLM and is not entirely accurate. In short, for NPU models, I want to avoid having an external output = torch::empty.
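A minimal sketch of such a runtime check, under stated assumptions: Tensor is a hypothetical stand-in for torch::Tensor, and maybe_preallocate_output is a hypothetical helper that takes the device type as a plain string rather than reading it from Device::type, whose real interface is not shown in this thread.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for torch::Tensor: a flat buffer of floats.
using Tensor = std::vector<float>;

// Hypothetical helper: pre-allocate the output only for backends whose
// kernels write into a caller-provided buffer. For "npu" it returns an empty
// tensor (analogous to a default-constructed torch::Tensor), because the NPU
// operator allocates and returns its own output.
Tensor maybe_preallocate_output(const std::string& device_type,
                                std::size_t numel) {
  if (device_type == "npu") {
    return Tensor();
  }
  return Tensor(numel);
}
```

The branch is taken at run time, so the same call site compiles for every backend without an #if block.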
