`fix`: improve the GPU utilization with better `Tensor` and `Var` handling #18

The `Matrix`, `VarMatrix`, and `Vector` wrappers around `candle_core::Tensor` are currently handled in ways that lead to poor GPU utilization and low cache locality. The main issues are:

 1. Frequent `Device` Transfers: The code creates many small tensors individually, which can cause many small host-to-device copies and inefficient GPU memory allocation patterns.
 2. Redundant `Device`/`DType` Storage: Each `Matrix`, `VarMatrix`, and `Vector` stores its own `Device` and `DType`, duplicating information that the underlying `Tensor` already carries.
 3. Inefficient Small Operations: Operations such as building identity matrices element by element are not GPU-optimized.
 4. Sequential Processing: The sheaf operations process cells one at a time rather than in batches.
