[sharktank] Add ops.clone and implement for DefaultPrimitiveTensor #1222
base: main
Conversation
Change sharded tensors to clone their data, but not the name, to avoid name aliasing. Cloning should be thought of as equivalent to `b = a + 0`.
This PR depends on #1227.
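For illustration, here is a minimal, self-contained sketch of the semantics described above; `NamedTensor` is a hypothetical stand-in for an inference tensor, not an actual sharktank type.

```python
import torch
from dataclasses import dataclass
from itertools import count

_clone_counter = count()

@dataclass
class NamedTensor:
    # Hypothetical stand-in for an inference tensor: a name plus torch data.
    name: str
    data: torch.Tensor

    def clone(self) -> "NamedTensor":
        # Semantically like `b = a + 0`: a new value in the compute graph
        # that starts out numerically equal to the original...
        new_data = self.data + 0
        # ...but with a fresh name, so the clone never aliases the original's name.
        return NamedTensor(name=f"{self.name}.clone.{next(_clone_counter)}", data=new_data)

a = NamedTensor("weight", torch.ones(2, 2))
b = a.clone()
assert torch.equal(a.data, b.data)      # same values
assert a.name != b.name                 # no name aliasing
b.data.add_(1)                          # in-place edit of the clone...
assert not torch.equal(a.data, b.data)  # ...does not affect the original
```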
@@ -1075,7 +1086,7 @@ def __init__(
        )

    def clone(self, **kwargs) -> "SplitPrimitiveTensor":
        kwargs["name"] = kwargs.get("name", self.name)
        # We don't copy the name to not introduce name aliasing.
Is this a problem? I purposefully copied the name in order to keep richer metadata.
I assume that names should be unique. If the data is not the same, it does not make sense for the name to be the same. We are in the same compute graph.
Ops that insert computation in the graph should not create tensors with the same name.
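A hypothetical illustration of that aliasing concern: if a clone kept the original's name, flattening both tensors into a name-keyed mapping (as a theta-like structure would) silently drops one of two distinct values. The dict-based tensors here are stand-ins, not sharktank types.

```python
import torch

original = {"name": "blk.0.attn_q.weight", "data": torch.ones(4)}
clone = {"name": original["name"], "data": original["data"] + 0}  # same name, new data
clone["data"].mul_(2)  # the two values now differ

flat = {}
for t in (original, clone):
    flat[t["name"]] = t["data"]  # the second write silently overwrites the first

assert len(flat) == 1  # only one entry survives, even though the values differ
```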
sharktank/sharktank/types/tensors.py (outdated diff)
    def clone(self) -> "InferenceTensor":
        from .. import ops

        # We don't clone due the name to not introduce name aliasing.
The "due" looks out of place
Removed it.
        # Clone introduces a tensor copy operation in the graph. It does not make sense to
        # clone also the ExternalTensorTrait. The expectation is that when a user clones a
        # tensor if they modify in-place the result this should not affect the original
        # tensor. If we are to clone also the ExternalTensorTrait then we would have
        # external_name aliasing, but the expectation is that there is no data aliasing.
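A self-contained illustration of the external_name aliasing mentioned in the comment above. `ExternalTraitLike` is a hypothetical stand-in for a trait that maps an in-memory tensor to a named external parameter; it is not the real ExternalTensorTrait API.

```python
import torch
from dataclasses import dataclass

@dataclass(frozen=True)
class ExternalTraitLike:
    # Hypothetical trait: records which external parameter backs a tensor.
    external_name: str

weight = torch.ones(4)
weight_trait = ExternalTraitLike(external_name="blk.0.ffn_up.weight")

clone = weight + 0          # clone inserts a copy op; the data no longer aliases
clone_trait = weight_trait  # copying the trait too would reuse the same external_name

# Both tensors now claim to be backed by the same external parameter, even though
# mutating one does not affect the other. That is the aliasing the comment warns about.
assert clone_trait.external_name == weight_trait.external_name
clone.add_(1)
assert not torch.equal(weight, clone)
```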
For pipeline parallelism I need the external tensor trait behaviour to copy. We don't want to bake the pipeline parallelism decisions into the weights, so they are applied to the tensors when loaded from file inside an export script, e.g. export_paged_llama_v1.py (FYI, the changes aren't up yet).
`.clone` was added for the following behaviour (a rough sketch follows the list):
- Load tensor `t`.
- Change `.devices` or move `.shards` to different devices with `ops.transfer_to_logical_device`.
- `t_new = t.clone(ts=moved_shards, devices=new_devices)`.
- Store back into theta.
- Continue exporting the model.
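A rough sketch of the flow in the list above. The `sharktank.ops` import, the `.shards`/`.devices` attributes, the `transfer_to_logical_device` argument order, and the `clone(ts=..., devices=...)` signature are taken from this discussion or assumed, so treat this as illustrative pseudocode for an export script rather than working code.

```python
# Rough sketch of the shard re-placement flow from the list above (export-script side).
# Assumptions: `t` is a sharded tensor exposing `.shards` and `.devices`,
# `ops.transfer_to_logical_device(shard, device_ordinal)` moves a shard, and
# `t.clone(ts=..., devices=...)` rebuilds the tensor with new shards/placement.
from sharktank import ops  # assumed import path


def move_sharded_tensor(t, new_devices):
    """Return a copy of `t` whose shards live on `new_devices` (hypothetical helper)."""
    assert len(t.shards) == len(new_devices)
    moved_shards = [
        ops.transfer_to_logical_device(shard, device)
        for shard, device in zip(t.shards, new_devices)
    ]
    # clone keeps the non-numeric metadata but swaps in the moved shards and
    # the new device assignment; the caller stores the result back into theta.
    return t.clone(ts=moved_shards, devices=new_devices)


def reassign_theta(theta_flat, placement):
    """placement maps a flattened tensor name to its new device ordinals (assumed)."""
    return {
        name: move_sharded_tensor(tensor, placement[name])
        for name, tensor in theta_flat.items()
    }
```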
If pipeline stages are not going to be part of the same compute graph, then you could do a deepcopy.
I think that there is a discrepancy of expectations.
- Usually a `clone` method does a deep copy.
- In torch this term is loaded with more meaning, as it creates a tensor that is part of the compute graph (see the example below). I think we should avoid deviating from that meaning as it will introduce confusion down the line. If operations under `sharktank.ops` or methods of tensors have the same name as in torch, then they should do the same.
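For reference, the torch behaviour being referred to: `torch.Tensor.clone` is differentiable and remains part of the autograd graph, unlike a detached copy.

```python
import torch

a = torch.tensor([2.0, 3.0], requires_grad=True)
b = a.clone()            # clone is recorded in the autograd graph
loss = (b * b).sum()
loss.backward()
print(a.grad)            # tensor([4., 6.]) -- gradients flow back through the clone

c = a.detach().clone()   # by contrast, detaching first breaks the graph link
assert not c.requires_grad
```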
We should avoid cloning the underlying torch tensor. It is mildly incorrect, but cloning it could drastically increase memory use when exporting. We could consider renaming `clone` to something else, but it seemed good enough for the modelling library.
The primary motivation behind `clone` was to provide a mechanism to change non-numeric tensor information, e.g. device placement. It's mainly so that we can better handle different tensor types (Replicated, Sharded, Quantized) without fighting with the underlying tensor data.
Perhaps the name is the problem then. I did not introduce this to be equivalent to torch.clone.
Yes, another name would then be desirable, as we may want something like `torch.clone` at some point.
        cloned_tensor = sharded_tensor.clone()
        assert sharded_tensor.is_deep_equal(cloned_tensor)
        assert cloned_tensor.name != sharded_tensor.name
        assert sharded_tensor.is_deep_equal(cloned_tensor, compare_name=False)
        assert iterables_equal(sharded_tensor.devices, cloned_tensor.devices)

    def testCloneTensorTraits(self):
How is this test passing if ExternalTensorTrait is not being copied?
`is_deep_equal` does not compare this. It probably should.
@sogartar, could I get a run-through of the motivation behind the PR? Most of the intention for `clone` was being able to create a new meta tensor and override non-numeric information, so cloning a
@rsuderman, the main motivation behind this PR is that