Open
Labels
optimization (this issue optimizes some aspect of the library), priority: high
Description
The way the Matrix, VarMatrix, and Vector wrappers around candle_core::Tensor are currently handled leads to poor GPU utilization and low cache locality. The main problems are:
- Frequent `Device` Transfers: The code creates many small tensors individually, which can cause inefficient GPU memory allocation patterns.
- Redundant `Device`/`DType` Storage: Each `Matrix`, `VarMatrix`, and `Vector` stores its own `Device` and `DType`, which is redundant since the underlying `Tensor` already has this information.
- Inefficient Small Operations: Operations like creating identity matrices element-by-element are not GPU-optimized.
- Sequential Processing: The sheaf operations process cells one at a time rather than in batches.
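To make two of the points above concrete, here is a minimal sketch of what delegating `Device`/`DType` to the underlying tensor and building an identity matrix in one allocation could look like. It uses only the standard library: `Tensor`, `Device`, `DType`, and `from_vec` below are simplified stand-ins defined in the snippet itself, not the real `candle_core` types, and the `Matrix` layout is an assumption rather than the library's current definition.

```rust
// Stand-in types for illustration only; they imitate the general shape of a
// tensor API but are NOT candle_core's actual definitions.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Device {
    Cpu,
    Gpu,
}

#[derive(Debug, Clone, Copy, PartialEq)]
enum DType {
    F32,
}

#[derive(Debug, Clone)]
struct Tensor {
    data: Vec<f32>,        // row-major buffer
    shape: (usize, usize), // (rows, cols)
    device: Device,
    dtype: DType,
}

impl Tensor {
    // Hypothetical constructor: takes the whole buffer at once, so there is
    // a single allocation (and, on a real GPU backend, a single upload).
    fn from_vec(data: Vec<f32>, shape: (usize, usize), device: Device) -> Tensor {
        assert_eq!(data.len(), shape.0 * shape.1, "buffer does not match shape");
        Tensor { data, shape, device, dtype: DType::F32 }
    }

    fn device(&self) -> Device {
        self.device
    }

    fn dtype(&self) -> DType {
        self.dtype
    }
}

// The wrapper stores only the tensor; it keeps no Device or DType of its own.
struct Matrix {
    tensor: Tensor,
}

impl Matrix {
    // Delegate to the tensor instead of duplicating the fields.
    fn device(&self) -> Device {
        self.tensor.device()
    }

    fn dtype(&self) -> DType {
        self.tensor.dtype()
    }

    // Fill the n x n identity in one host buffer and hand it over in a single
    // call, rather than creating one tiny tensor per element.
    fn identity(n: usize, device: Device) -> Matrix {
        let mut data = vec![0.0f32; n * n];
        for i in 0..n {
            data[i * n + i] = 1.0;
        }
        Matrix { tensor: Tensor::from_vec(data, (n, n), device) }
    }
}
```

With this layout, `Matrix::identity(3, Device::Gpu).device()` reads the device straight from the wrapped tensor, so the two can never disagree. The real `candle_core::Tensor` exposes `device()` and `dtype()` accessors, so the actual wrappers could plausibly delegate in the same way; whether that fits the rest of the library is for the maintainers to judge.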