dotmesh pull retry logic gets confused by partial errors when creating a filesystem #753

@alaric-dotmesh

A clone operation failed with a strange error:

time="2019-09-16T09:17:46Z" level=info msg="Downloading workspace & data from hub - downloaded 477.45/5171.85 MiB at 78.62 MiB/s (1/1)"
time="2019-09-16T09:17:46Z" level=info msg="Still pulling..." dots_pulling=1
time="2019-09-16T09:17:47Z" level=info msg="Transfer status polled" elapsed_ns=6776212170 index=1 message="Attempting to pull d671c4be-fb95-4835-a501-33c707fb66c2 got <Event zfs-recv-failed: err: \"exit status 1\", filesystemId: \"d671c4be-fb95-4835-a501-33c707fb66c2\", stderr: \"cannot receive incremental stream: checksum mismatch or incomplete stream\\n\">" sent_bytes=551458107 size_bytes=5423081024 status="retry 1" total=1 transfer_id=6edae9a4-7620-4c7a-acd2-a15566221b69
time="2019-09-16T09:17:47Z" level=info msg="Downloading workspace & data from hub - downloaded 525.91/5171.85 MiB at 77.61 MiB/s (1/1)"

The retry loop then tried again. The original failure had already created some snapshots, but the retry kept trying to create the filesystem from scratch and failed:

time="2019-09-16T09:17:47Z" level=info msg="Still pulling..." dots_pulling=1
time="2019-09-16T09:17:48Z" level=info msg="Transfer status polled" elapsed_ns=33781652 index=1 message="Attempting to pull d671c4be-fb95-4835-a501-33c707fb66c2 got <Event zfs-recv-failed: err: \"exit status 1\", filesystemId: \"d671c4be-fb95-4835-a501-33c707fb66c2\", stderr: \"cannot receive new filesystem stream: destination 'pool/dmfs/d671c4be-fb95-4835-a501-33c707fb66c2' exists\\nmust specify -F to overwrite it\\n\">" sent_bytes=51 size_bytes=5423081024 status="retry 2" total=1 transfer_id=6edae9a4-7620-4c7a-acd2-a15566221b69

I haven't dug into the code, but I suspect the "calculation of what we need to pull" step isn't re-done inside the retry loop, so a failure that pulls in some snapshots causes all subsequent retries to fail as they try to pull in snapshots we've already got.
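
To illustrate the suspected fix, here is a minimal sketch of a retry loop that recomputes the set of snapshots to pull on every attempt. It is not dotmesh's actual code: `snapshots_to_pull`, `pull_with_retry`, `list_local` and `receive` are hypothetical names, and it assumes a linear snapshot history in which the local snapshots are always a prefix of the remote ones.

```python
from typing import Callable, List, Optional, Tuple


def snapshots_to_pull(
    remote: List[str], local: List[str]
) -> Tuple[Optional[str], List[str]]:
    """Return (base, missing) for a linear snapshot history.

    base is the newest snapshot already present locally (None when the
    filesystem does not exist yet); missing is what still has to be received.
    Assumption: local is a prefix of remote.
    """
    if remote[: len(local)] != local:
        raise ValueError("local snapshots diverge from remote")
    base = local[-1] if local else None
    return base, remote[len(local):]


def pull_with_retry(
    remote: List[str],
    list_local: Callable[[], List[str]],
    receive: Callable[[Optional[str], List[str]], None],
    max_attempts: int = 3,
) -> int:
    """Pull missing snapshots, recomputing the delta before every attempt.

    Returns the number of attempts used. list_local and receive are
    hypothetical hooks standing in for the real ZFS calls.
    """
    last_error: Optional[Exception] = None
    for attempt in range(1, max_attempts + 1):
        # Key point: re-read local state here, inside the loop, so that
        # snapshots left behind by a failed attempt are not requested again.
        base, missing = snapshots_to_pull(remote, list_local())
        if not missing:
            return attempt
        try:
            receive(base, missing)
            return attempt
        except OSError as err:
            last_error = err
    raise RuntimeError(f"pull failed after {max_attempts} attempts") from last_error
```

With this shape, an attempt that fails after receiving some snapshots leads to an incremental request from the newest local snapshot on the next attempt, rather than a second attempt to create the filesystem from scratch.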
