Skip to content

Conversation

pcd1193182
Copy link
Contributor

This change is built on top of #17073, because the changes to add the gang_copies property interact tightly with this change, in addition to the gang blocks test group.

Motivation and Context

See #17070. tl;dr: If you have ganging and multiple different values of the copies property while you use dedup, you can sometimes end up with BPs with a mix of gang and non-gang DVAs. That is illegal in ZFS: in debug mode, it'll trip an assertion when you're syncing the write out. In non-debug mode, it will make it to disk, but cause silent and confusing errors later when BP_IS_GANG is only partially correct.

Note that this change doesn't address cases 2 and 4 from #17070; those are trickier to deal with, and are also less problematic for users. Occasionally not getting as many copies as you asked for is a much less severe bug than silently invalid blkptrs. It also doesn't address the nopwrite case.

Description

This change forces updates of existing FDT entries to only add more DVAs of the same type; if there are already gang DVAs in the FDT, we require that the new allocations also be gangs. if there are non-gang DVAs in the FDT, then this allocation may not be gangs. If it would gang, we have to redo the whole write as a non-dedup write.

This change also contains a new test that covers the relevant cases. I verified that without the patch, it kernel panics when running the test.

How Has This Been Tested?

The added test, plus manual tests/

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • Documentation (a change to man pages or other documentation)

Checklist:

Copy link
Member

@amotin amotin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not deeply familiar with ganging code, but I have doubts that zp_must_gang is viable. As I understand, to provide full redundancy we must not just provide some ganged pointer, but allocate all the gang children of exactly the same sizes and extend all existing gang header copies with those DVAs, which we can not do without rewriting them also. I think the only thing we can realistically do is prevent addition of copies writing and so dedup in general in that case.

@pcd1193182
Copy link
Contributor Author

As we discussed in the issue, the goal here isn't really to provide 100% of the redundancy the user asked for. That's challenging when you need to read in the entire gang tree to even determine what redundancy level is currently provided. The goal of this PR is just to prevent the creation of mixed gang/non-gang BPs, which are illegal in ZFS. That can be done even if the two gang trees have different depths/layouts.

@amotin
Copy link
Member

amotin commented Mar 10, 2025

That can be done even if the two gang trees have different depths/layouts.

No, I think it can not. You can not have in one block pointer 3 DVAs with non-identical content. zio_gang_tree_assemble() will read only one of them, that in your case won't include all the children. So some remaining will never be read or freed.

@pcd1193182
Copy link
Contributor Author

Hm, I see what you mean. It partially works for reads; if first level of gang headers is corrupt on the earlier DVAs, it'll read in the ones for the later DVAs, but not if anything else is wrong. And frees will free all the gang headers at the first level (because it has to free all the DVAs of each BP it knows about) but it won't know about the child BPs in any gang headers but the first one.

I think we could probably write additional copies of the gang header we already have in the BP? But I don't know if that's actually better than just not deduplicating the write.

@amotin
Copy link
Member

amotin commented Mar 10, 2025

I think we could probably write additional copies of the gang header we already have in the BP?

I don't think anything not overwriting the existing gang headers can be correct. And I don't think we can safely overwrite them. Not mentioning we'd have to allocate new block in exactly the same chunks, which would be quite a pain.

But I don't know if that's actually better than just not deduplicating the write.

That is why I proposed not deduplicating it. Not perfect, but a simple choice for a pretty rare scenario.

@amotin amotin added the Status: Revision Needed Changes are required for the PR to be accepted label Mar 11, 2025
@github-actions github-actions bot removed the Status: Revision Needed Changes are required for the PR to be accepted label Mar 11, 2025
@pcd1193182
Copy link
Contributor Author

I ran into an exciting one with this last commit. The bug presents as follows:

  • One write comes in with copies = 1, we start an IO for it (ZIO#1)
  • Another write comes in with copies = 2, we start another IO and set that to be the dde_lead_zio (ZIO#2) (See zio_ddt_write)
  • For whatever reason (in this case, ZIO#1 gangs, which means we can't safely add to the DVAs), ZIO#2 fails
  • ZIO#1 finishes, copies its dde_phys into dde_orig_phys (See zio_ddt_child_write_done)
  • ZIO#1 adds references to dde_phys (zio_ddt_child_write_done)
  • ZIO#2 calls its done func now that its child is done, it rolls the dde_phys back to the dde_orig_phys , without the references (zio_ddt_child_write_done)
  • Now the refcount is zero, so we free it immediately, even though ZIO#1 finished successfully and the block should stick around

I'm pretty sure this bug dates back to the start of FDT, and there may be more refcount-loss bugs like this lurking; I'm going to look into that next. The fix in this case is to add the references to dde_phys before we copy it into the orig_phys.

@pcd1193182 pcd1193182 added the Status: Code Review Needed Ready for review and testing label Mar 19, 2025
@pcd1193182
Copy link
Contributor Author

pcd1193182 commented Mar 29, 2025

I'm pretty sure this bug dates back to the start of FDT, and there may be more refcount-loss bugs like this lurking; I'm going to look into that next. The fix in this case is to add the references to dde_phys before we copy it into the orig_phys.

I actually don't think this is possible. There are four places where we call ddt_phys_addref: ddt_addref, zio_ddt_child_write_done, and the no-contention cases in zio_ddt_write. The first one happens before we start the syncing passes in spa_sync, so I think there also can't be any outstanding writes here. The second one we have the fix for in this PR. The last one can't race with this, because there's no outstanding IO in the no-contention cases.

ddt_phys_decref is only called from zio_ddt_free, and that's only called from the free path. And, because of the changes made by @robn in 46c4f2ce, frees that affect the DDT go on the deadlist, and aren't processed until after the normal ZIO tree is done in dsl_pool_sync. So there can't be any outstanding writes there, either.

So there may be no other refcount bugs lurking, but only by some pretty extreme luck. I think there are bugs with ddp_class_start getting overwritten by the rollback, which could cause something to get improperly pruned, but that's a much less severe issue. Nevertheless it might be worth fixing this anyway, just so this bug doesn't appear in the future if these timing windows start to overlap with each other again.

I think the fix is to implement something like ddt_phys_unextend, which rolls back just the ddp_phys_birth and ddp_dva. Then we use that for the rollback case, rather than ddt_phys_copy.

@amotin
Copy link
Member

amotin commented Apr 16, 2025

One of CI runners failed the test in unexpected way:

   ZTS run /usr/local/share/zfs/zfs-tests/tests/functional/gang_blocks/gang_blocks_ddt_copies
  panic: VERIFY0(zfs_refcount_count(&mga->mga_queue_depth)) failed (0 == 65536)
  
  cpuid = 0
  time = 1743096409
  KDB: stack backtrace:
  #0 0xffffffff80c3dd05 at kdb_backtrace+0x65
  #1 0xffffffff80bf1882 at vpanic+0x152
  #2 0xffffffff82a419aa at spl_panic+0x3a
  #3 0xffffffff82b3a53d at metaslab_class_validate+0x1dd
  #4 0xffffffff82b3a663 at metaslab_class_balance+0x53
  #5 0xffffffff82b68e93 at spa_sync+0x333
  #6 0xffffffff82b84c41 at txg_sync_thread+0x4c1
  #7 0xffffffff80bad28d at fork_exit+0x7d
  #8 0xffffffff8108610e at fork_trampoline+0xe
  Uptime: 41m54s

@pcd1193182 pcd1193182 force-pushed the gang_ddt_copies branch 2 times, most recently from 483d5cc to 954010f Compare April 16, 2025 16:54
Copy link
Member

@amotin amotin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me. Just needs squashing and commit message update.

With the advent of fast dedup, there are no longer separate dedup tables
for different copies values. There is now logic that will add DVAs to
the dedup table entry if more copies are needed for new writes. However,
this interacts poorly with ganging. There are two different cases that
can result in mixed gang/non-gang BPs, which are illegal in ZFS.

This change modifies updates of existing FDT; if there are already gang
DVAs in the FDT, we prevent the new write from extending the DDT
entry. We cannot safely mix different gang trees in one block
pointer. if there are non-gang DVAs in the FDT, then this allocation may
not be gangs. If it would gang, we have to redo the whole write as a
non-dedup write.

This change also fixes a refcount leak that could occur if the lead DDT
write failed.

Sponsored by: iXsystems, Inc.
Sponsored-by: Klara, Inc.
Signed-off-by: Paul Dagnelie <[email protected]>
@tonyhutter tonyhutter added Status: Accepted Ready to integrate (reviewed, tested) and removed Status: Code Review Needed Ready for review and testing labels Apr 18, 2025
@amotin amotin merged commit 5f5321e into openzfs:master Apr 21, 2025
20 of 24 checks passed
snajpa pushed a commit to vpsfreecz/zfs that referenced this pull request May 2, 2025
With the advent of fast dedup, there are no longer separate dedup tables
for different copies values. There is now logic that will add DVAs to
the dedup table entry if more copies are needed for new writes. However,
this interacts poorly with ganging. There are two different cases that
can result in mixed gang/non-gang BPs, which are illegal in ZFS.

This change modifies updates of existing FDT; if there are already gang
DVAs in the FDT, we prevent the new write from extending the DDT
entry. We cannot safely mix different gang trees in one block
pointer. if there are non-gang DVAs in the FDT, then this allocation may
not be gangs. If it would gang, we have to redo the whole write as a
non-dedup write.

This change also fixes a refcount leak that could occur if the lead DDT
write failed.

Sponsored by: iXsystems, Inc.
Sponsored-by: Klara, Inc.
Reviewed-by: Alexander Motin <[email protected]>
Reviewed-by: Tony Hutter <[email protected]>
Signed-off-by: Paul Dagnelie <[email protected]>
Closes: openzfs#17123
ixhamza pushed a commit to truenas/zfs that referenced this pull request May 14, 2025
With the advent of fast dedup, there are no longer separate dedup tables
for different copies values. There is now logic that will add DVAs to
the dedup table entry if more copies are needed for new writes. However,
this interacts poorly with ganging. There are two different cases that
can result in mixed gang/non-gang BPs, which are illegal in ZFS.

This change modifies updates of existing FDT; if there are already gang
DVAs in the FDT, we prevent the new write from extending the DDT
entry. We cannot safely mix different gang trees in one block
pointer. if there are non-gang DVAs in the FDT, then this allocation may
not be gangs. If it would gang, we have to redo the whole write as a
non-dedup write.

This change also fixes a refcount leak that could occur if the lead DDT
write failed.

Sponsored by: iXsystems, Inc.
Sponsored-by: Klara, Inc.
Reviewed-by: Alexander Motin <[email protected]>
Reviewed-by: Tony Hutter <[email protected]>
Signed-off-by: Paul Dagnelie <[email protected]>
Closes: openzfs#17123
ixhamza pushed a commit to truenas/zfs that referenced this pull request May 14, 2025
With the advent of fast dedup, there are no longer separate dedup tables
for different copies values. There is now logic that will add DVAs to
the dedup table entry if more copies are needed for new writes. However,
this interacts poorly with ganging. There are two different cases that
can result in mixed gang/non-gang BPs, which are illegal in ZFS.

This change modifies updates of existing FDT; if there are already gang
DVAs in the FDT, we prevent the new write from extending the DDT
entry. We cannot safely mix different gang trees in one block
pointer. if there are non-gang DVAs in the FDT, then this allocation may
not be gangs. If it would gang, we have to redo the whole write as a
non-dedup write.

This change also fixes a refcount leak that could occur if the lead DDT
write failed.

Sponsored by: iXsystems, Inc.
Sponsored-by: Klara, Inc.
Reviewed-by: Alexander Motin <[email protected]>
Reviewed-by: Tony Hutter <[email protected]>
Signed-off-by: Paul Dagnelie <[email protected]>
Closes: openzfs#17123
robn pushed a commit to robn/zfs that referenced this pull request May 23, 2025
With the advent of fast dedup, there are no longer separate dedup tables
for different copies values. There is now logic that will add DVAs to
the dedup table entry if more copies are needed for new writes. However,
this interacts poorly with ganging. There are two different cases that
can result in mixed gang/non-gang BPs, which are illegal in ZFS.

This change modifies updates of existing FDT; if there are already gang
DVAs in the FDT, we prevent the new write from extending the DDT
entry. We cannot safely mix different gang trees in one block
pointer. if there are non-gang DVAs in the FDT, then this allocation may
not be gangs. If it would gang, we have to redo the whole write as a
non-dedup write.

This change also fixes a refcount leak that could occur if the lead DDT
write failed.

Sponsored by: iXsystems, Inc.
Sponsored-by: Klara, Inc.
Reviewed-by: Alexander Motin <[email protected]>
Reviewed-by: Tony Hutter <[email protected]>
Signed-off-by: Paul Dagnelie <[email protected]>
Closes: openzfs#17123
(cherry picked from commit 5f5321e)
(cherry picked from commit 299d918b78163218a68c1ec453f254d2479ed29f)
@robn robn mentioned this pull request May 23, 2025
14 tasks
behlendorf pushed a commit to behlendorf/zfs that referenced this pull request Jun 13, 2025
With the advent of fast dedup, there are no longer separate dedup tables
for different copies values. There is now logic that will add DVAs to
the dedup table entry if more copies are needed for new writes. However,
this interacts poorly with ganging. There are two different cases that
can result in mixed gang/non-gang BPs, which are illegal in ZFS.

This change modifies updates of existing FDT; if there are already gang
DVAs in the FDT, we prevent the new write from extending the DDT
entry. We cannot safely mix different gang trees in one block
pointer. if there are non-gang DVAs in the FDT, then this allocation may
not be gangs. If it would gang, we have to redo the whole write as a
non-dedup write.

This change also fixes a refcount leak that could occur if the lead DDT
write failed.

Sponsored by: iXsystems, Inc.
Sponsored-by: Klara, Inc.
Reviewed-by: Alexander Motin <[email protected]>
Reviewed-by: Tony Hutter <[email protected]>
Signed-off-by: Paul Dagnelie <[email protected]>
Closes: openzfs#17123
@robn robn mentioned this pull request Jun 14, 2025
14 tasks
behlendorf pushed a commit to behlendorf/zfs that referenced this pull request Jun 16, 2025
With the advent of fast dedup, there are no longer separate dedup tables
for different copies values. There is now logic that will add DVAs to
the dedup table entry if more copies are needed for new writes. However,
this interacts poorly with ganging. There are two different cases that
can result in mixed gang/non-gang BPs, which are illegal in ZFS.

This change modifies updates of existing FDT; if there are already gang
DVAs in the FDT, we prevent the new write from extending the DDT
entry. We cannot safely mix different gang trees in one block
pointer. if there are non-gang DVAs in the FDT, then this allocation may
not be gangs. If it would gang, we have to redo the whole write as a
non-dedup write.

This change also fixes a refcount leak that could occur if the lead DDT
write failed.

Sponsored by: iXsystems, Inc.
Sponsored-by: Klara, Inc.
Reviewed-by: Alexander Motin <[email protected]>
Reviewed-by: Tony Hutter <[email protected]>
Signed-off-by: Paul Dagnelie <[email protected]>
Closes: openzfs#17123
gamanakis pushed a commit to gamanakis/zfs that referenced this pull request Jul 20, 2025
With the advent of fast dedup, there are no longer separate dedup tables
for different copies values. There is now logic that will add DVAs to
the dedup table entry if more copies are needed for new writes. However,
this interacts poorly with ganging. There are two different cases that
can result in mixed gang/non-gang BPs, which are illegal in ZFS.

This change modifies updates of existing FDT; if there are already gang
DVAs in the FDT, we prevent the new write from extending the DDT
entry. We cannot safely mix different gang trees in one block
pointer. if there are non-gang DVAs in the FDT, then this allocation may
not be gangs. If it would gang, we have to redo the whole write as a
non-dedup write.

This change also fixes a refcount leak that could occur if the lead DDT
write failed.

Sponsored by: iXsystems, Inc.
Sponsored-by: Klara, Inc.
Reviewed-by: Alexander Motin <[email protected]>
Reviewed-by: Tony Hutter <[email protected]>
Signed-off-by: Paul Dagnelie <[email protected]>
Closes: openzfs#17123
spauka pushed a commit to spauka/zfs that referenced this pull request Aug 30, 2025
With the advent of fast dedup, there are no longer separate dedup tables
for different copies values. There is now logic that will add DVAs to
the dedup table entry if more copies are needed for new writes. However,
this interacts poorly with ganging. There are two different cases that
can result in mixed gang/non-gang BPs, which are illegal in ZFS.

This change modifies updates of existing FDT; if there are already gang
DVAs in the FDT, we prevent the new write from extending the DDT
entry. We cannot safely mix different gang trees in one block
pointer. if there are non-gang DVAs in the FDT, then this allocation may
not be gangs. If it would gang, we have to redo the whole write as a
non-dedup write.

This change also fixes a refcount leak that could occur if the lead DDT
write failed.

Sponsored by: iXsystems, Inc.
Sponsored-by: Klara, Inc.
Reviewed-by: Alexander Motin <[email protected]>
Reviewed-by: Tony Hutter <[email protected]>
Signed-off-by: Paul Dagnelie <[email protected]>
Closes: openzfs#17123
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Status: Accepted Ready to integrate (reviewed, tested)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants