[sharktank] Add ops.clone and implement for DefaultPrimitiveTensor #1222
base: main
Conversation
Change sharded tensors to clone their data, but not the name, to avoid name aliasing. Cloning should be thought of as equivalent to `b = a + 0`.
This PR depends on #1227.
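For illustration, here is a minimal, self-contained sketch of the semantics described above; `NamedTensor` is a hypothetical stand-in for an inference tensor, not an actual sharktank type.

```python
import torch
from dataclasses import dataclass
from itertools import count

_clone_counter = count()

@dataclass
class NamedTensor:
    # Hypothetical stand-in for an inference tensor: a name plus torch data.
    name: str
    data: torch.Tensor

    def clone(self) -> "NamedTensor":
        # Semantically like `b = a + 0`: a new value in the compute graph
        # that starts out numerically equal to the original...
        new_data = self.data + 0
        # ...but with a fresh name, so the clone never aliases the original's name.
        return NamedTensor(name=f"{self.name}.clone.{next(_clone_counter)}", data=new_data)

a = NamedTensor("weight", torch.ones(2, 2))
b = a.clone()
assert torch.equal(a.data, b.data)      # same values
assert a.name != b.name                 # no name aliasing
b.data.add_(1)                          # in-place edit of the clone...
assert not torch.equal(a.data, b.data)  # ...does not affect the original
```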
@@ -1075,7 +1086,7 @@ def __init__(
        )

    def clone(self, **kwargs) -> "SplitPrimitiveTensor":
        kwargs["name"] = kwargs.get("name", self.name)
        # We don't copy the name to not introduce name aliasing.
Is this a problem? I purposefully copied the name in order to keep richer metadata.
I assume that names should be unique. If the data is not the same, it does not make sense for the name to be the same. We are in the same compute graph.
Ops that insert computation in the graph should not create tensors with the same name.
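A hypothetical illustration of that aliasing concern: if a clone kept the original's name, flattening both tensors into a name-keyed mapping (as a theta-like structure would) silently drops one of two distinct values. The dict-based tensors here are stand-ins, not sharktank types.

```python
import torch

original = {"name": "blk.0.attn_q.weight", "data": torch.ones(4)}
clone = {"name": original["name"], "data": original["data"] + 0}  # same name, new data
clone["data"].mul_(2)  # the two values now differ

flat = {}
for t in (original, clone):
    flat[t["name"]] = t["data"]  # the second write silently overwrites the first

assert len(flat) == 1  # only one entry survives, even though the values differ
```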
sharktank/sharktank/types/tensors.py (outdated diff)
    def clone(self) -> "InferenceTensor":
        from .. import ops

        # We don't clone due the name to not introduce name aliasing.
The "due" looks out of place
Removed it.
        # Clone introduces a tensor copy operation in the graph. It does not make sense to
        # clone also the ExternalTensorTrait. The expectation is that when a user clones a
        # tensor if they modify in-place the result this should not affect the original
        # tensor. If we are to clone also the ExternalTensorTrait then we would have
        # external_name aliasing, but the expectation is that there is no data aliasing.
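A self-contained illustration of the external_name aliasing mentioned in the comment above. `ExternalTraitLike` is a hypothetical stand-in for a trait that maps an in-memory tensor to a named external parameter; it is not the real ExternalTensorTrait API.

```python
import torch
from dataclasses import dataclass

@dataclass(frozen=True)
class ExternalTraitLike:
    # Hypothetical trait: records which external parameter backs a tensor.
    external_name: str

weight = torch.ones(4)
weight_trait = ExternalTraitLike(external_name="blk.0.ffn_up.weight")

clone = weight + 0          # clone inserts a copy op; the data no longer aliases
clone_trait = weight_trait  # copying the trait too would reuse the same external_name

# Both tensors now claim to be backed by the same external parameter, even though
# mutating one does not affect the other. That is the aliasing the comment warns about.
assert clone_trait.external_name == weight_trait.external_name
clone.add_(1)
assert not torch.equal(weight, clone)
```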
For pipeline parallelism I need the external tensor trait behaviour to copy. We don't want to bake the pipeline parallelism decisions into the weights, so they are applied to the tensors when loaded from file inside an export script, e.g. export_paged_llama_v1.py (FYI, the changes aren't up yet).
`.clone` was added for the following behaviour (a rough sketch follows the list):
- Load tensor `t`.
- Change `.devices` or move `.shards` to different devices with `ops.transfer_to_logical_device`.
- `t_new = t.clone(ts=moved_shards, devices=new_devices)`.
- Store back into theta.
- Continue exporting the model.
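A rough sketch of the flow in the list above. The `sharktank.ops` import, the `.shards`/`.devices` attributes, the `transfer_to_logical_device` argument order, and the `clone(ts=..., devices=...)` signature are taken from this discussion or assumed, so treat this as illustrative pseudocode for an export script rather than working code.

```python
# Rough sketch of the shard re-placement flow from the list above (export-script side).
# Assumptions: `t` is a sharded tensor exposing `.shards` and `.devices`,
# `ops.transfer_to_logical_device(shard, device_ordinal)` moves a shard, and
# `t.clone(ts=..., devices=...)` rebuilds the tensor with new shards/placement.
from sharktank import ops  # assumed import path


def move_sharded_tensor(t, new_devices):
    """Return a copy of `t` whose shards live on `new_devices` (hypothetical helper)."""
    assert len(t.shards) == len(new_devices)
    moved_shards = [
        ops.transfer_to_logical_device(shard, device)
        for shard, device in zip(t.shards, new_devices)
    ]
    # clone keeps the non-numeric metadata but swaps in the moved shards and
    # the new device assignment; the caller stores the result back into theta.
    return t.clone(ts=moved_shards, devices=new_devices)


def reassign_theta(theta_flat, placement):
    """placement maps a flattened tensor name to its new device ordinals (assumed)."""
    return {
        name: move_sharded_tensor(tensor, placement[name])
        for name, tensor in theta_flat.items()
    }
```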
If pipeline stages are not going to be part of the same compute graph, then you could do a deepcopy.
I think that there is a discrepancy of expectations.
- Usually a `clone` method does a deep copy.
- In torch this term is loaded with more meaning, as it creates a tensor that is part of the compute graph (see the example below). I think we should avoid deviating from that meaning as it will introduce confusion down the line. If operations under `sharktank.ops` or methods of tensors have the same name as in torch, then they should do the same.
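For reference, the torch behaviour being referred to: `torch.Tensor.clone` is differentiable and remains part of the autograd graph, unlike a detached copy.

```python
import torch

a = torch.tensor([2.0, 3.0], requires_grad=True)
b = a.clone()            # clone is recorded in the autograd graph
loss = (b * b).sum()
loss.backward()
print(a.grad)            # tensor([4., 6.]) -- gradients flow back through the clone

c = a.detach().clone()   # by contrast, detaching first breaks the graph link
assert not c.requires_grad
```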
We should avoid cloning the underlying torch tensor. It is mildly incorrect, but cloning it could drastically increase memory use when exporting. We could consider renaming `clone` to something else, but it seemed good enough for the modelling library.
The primary motivation behind `clone` was to provide a mechanism to change non-numeric tensor information, e.g. device placement. It's mainly so that we can better handle different tensor types (Replicated, Sharded, Quantized) without fighting with the underlying tensor data.
Perhaps the name is the problem then. I did not introduce this to be equivalent to torch.clone.
Yes, another name would then be desirable, as we may want something like `torch.clone` at some point.
        cloned_tensor = sharded_tensor.clone()
        assert sharded_tensor.is_deep_equal(cloned_tensor)
        assert cloned_tensor.name != sharded_tensor.name
        assert sharded_tensor.is_deep_equal(cloned_tensor, compare_name=False)
        assert iterables_equal(sharded_tensor.devices, cloned_tensor.devices)

    def testCloneTensorTraits(self):
How is this test passing if ExternalTensorTrait is not being copied?
`is_deep_equal` does not compare this. It probably should.
@sogartar, could I get a run-through of the motivation behind the PR? Most of the intention for `clone` was being able to create a new meta tensor and override non-numeric information, so cloning a
@rsuderman, the main motivation behind this PR is that