Commit a60214e

rrevans authored and tonyhutter committed
dnode_next_offset: backtrack if lower level does not match
This changes the basic search algorithm from a single search up and down
the tree to a full depth-first traversal to handle conditions where the
tree matches at a higher level but not a lower level.

Normally higher level blocks always point to matching blocks, but there
are cases where this does not happen:

1. Racing block pointer updates from dbuf_write_ready.

   Before f664f1e (openzfs#8946), both dbuf_write_ready and
   dnode_next_offset held dn_struct_rwlock which protected against
   pointer writes from concurrent syncs.

   This no longer applies, so sync context can, for example, clear or
   fill all L1->L0 BPs before the L2->L1 BP and higher BPs are updated.

   dnode_free_range in particular can reach this case and skip over L1
   blocks that need to be dirtied. Later, sync will panic in
   free_children when trying to clear a non-dirty indirect block.

   This case was found with ztest.

2. txg > 0, non-hole case. This is openzfs#11196.

   Freeing blocks/dnodes breaks the assumption that a match at a higher
   level implies a match at a lower level when filtering txg > 0.

   Whenever some but not all L0 blocks are freed, the parent L1 block is
   rewritten. Its updated L2->L1 BP reflects a newer birth txg.

   Later when searching by txg, if the L1 block matches since the txg is
   newer, it is possible that none of the remaining L1->L0 BPs match if
   none have been updated.

   The same behavior is possible with dnode search at L0.

   This is reachable from dsl_destroy_head for synchronous freeing.
   When this happens, open context fails to free objects, leaving sync
   context stuck freeing potentially many objects.

   This is also reachable from traverse_pool for extreme rewind, where
   it is theoretically possible that datasets not dirtied after txg are
   skipped if the MOS has high enough indirection to trigger this case.

In both of these cases, without backtracking the search ends prematurely
because an ESRCH result implies no more matches in the entire object.
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Akash B <[email protected]>
Signed-off-by: Robert Evans <[email protected]>
Closes openzfs#16025
Closes openzfs#11196
1 parent 2a5349b commit a60214e
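
The premature ESRCH described in the commit message can be reproduced on a toy two-level tree. The sketch below is not ZFS code: it works in block ids rather than byte offsets, searches forward only, and every name in it (`toy_search_level`, `toy_next_block`, `TOY_ESRCH`, the table sizes) is a hypothetical stand-in. The match tables are arranged so that the top-level pointer for L1 block 1 passes the filter while none of its children do, which is the shape both cases above produce.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_ESRCH 3	/* stand-in for ESRCH; the value is arbitrary here */
#define EPB 4		/* L0 pointers per L1 block (hypothetical) */
#define NL1 4		/* L1 pointers held at the top level */
#define MINLVL 1
#define MAXLVL 2

/* l1_match[i]: the top-level BP for L1 block i passes the filter. */
static const int l1_match[NL1] = { 0, 1, 1, 0 };
/* l0_match[b]: the L1->L0 BP for data block b passes the filter. */
static const int l0_match[NL1 * EPB] = {
	0, 0, 0, 0,	/* L1 block 0 */
	0, 0, 0, 0,	/* L1 block 1: parent matches, no child does */
	0, 1, 0, 0,	/* L1 block 2: data block 9 matches */
	0, 0, 0, 0,	/* L1 block 3 */
};

/* Forward-only analogue of a single-level search, in block ids. */
static int
toy_search_level(int lvl, uint64_t *blk)
{
	assert(lvl == 1 || lvl == 2);
	if (lvl == 1) {
		/* Scan only the L1 block that contains *blk. */
		uint64_t end = (*blk / EPB + 1) * EPB;
		if (*blk >= NL1 * EPB)
			return (TOY_ESRCH);
		for (; *blk < end; (*blk)++)
			if (l0_match[*blk])
				return (0);
		return (TOY_ESRCH);
	}
	for (uint64_t i = *blk / EPB; i < NL1; i++) {
		if (l1_match[i]) {
			if (i * EPB > *blk)
				*blk = i * EPB;
			return (0);
		}
	}
	return (TOY_ESRCH);
}

/* Move *blk to the start of the next level-lvl block, if there is one. */
static int
toy_next_block(uint64_t *blk, int lvl)
{
	uint64_t span = 1, next;
	for (int i = 0; i < lvl; i++)
		span *= EPB;
	next = (*blk / span + 1) * span;
	if (next >= NL1 * EPB)
		return (0);
	*blk = next;
	return (1);
}

/* Pre-commit shape: one pass up, then one pass down. */
static int
toy_search_old(uint64_t *blk)
{
	int lvl, error = 0;
	for (lvl = MINLVL; lvl <= MAXLVL; lvl++) {
		error = toy_search_level(lvl, blk);
		if (error != TOY_ESRCH)
			break;
	}
	while (error == 0 && --lvl >= MINLVL)
		error = toy_search_level(lvl, blk);
	return (error);
}

/* Post-commit shape: go back up and retry when a lower level misses. */
static int
toy_search_new(uint64_t *blk)
{
	uint64_t matched = *blk;
	int lvl, error = 0;
	for (lvl = MINLVL; lvl <= MAXLVL; ) {
		error = toy_search_level(lvl, blk);
		if (error == 0 && lvl > MINLVL) {
			--lvl;
			matched = *blk;
		} else if (error == TOY_ESRCH && lvl < MAXLVL &&
		    toy_next_block(&matched, lvl)) {
			++lvl;
			*blk = matched;
		} else {
			break;
		}
	}
	return (error);
}
```

Starting both searches at block 0, `toy_search_old` descends into L1 block 1, finds nothing, and returns `TOY_ESRCH` even though block 9 matches. `toy_search_new` steps `matched` to the next L1 block, retries at the top level, and returns 0 with `*blk == 9`.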

module/zfs/dnode.c

Lines changed: 54 additions & 11 deletions
@@ -2694,6 +2694,32 @@ dnode_next_offset_level(dnode_t *dn, int flags, uint64_t *offset,
 	return (error);
 }

+/*
+ * Adjust *offset to the next (or previous) block byte offset at lvl.
+ * Returns FALSE if *offset would overflow or underflow.
+ */
+static boolean_t
+dnode_next_block(dnode_t *dn, int flags, uint64_t *offset, int lvl)
+{
+	int epbs = dn->dn_indblkshift - SPA_BLKPTRSHIFT;
+	int span = lvl * epbs + dn->dn_datablkshift;
+	uint64_t blkid, maxblkid;
+
+	if (span >= 8 * sizeof (uint64_t))
+		return (B_FALSE);
+
+	blkid = *offset >> span;
+	maxblkid = 1ULL << (8 * sizeof (*offset) - span);
+	if (!(flags & DNODE_FIND_BACKWARDS) && blkid + 1 < maxblkid)
+		*offset = (blkid + 1) << span;
+	else if ((flags & DNODE_FIND_BACKWARDS) && blkid > 0)
+		*offset = (blkid << span) - 1;
+	else
+		return (B_FALSE);
+
+	return (B_TRUE);
+}
+
 /*
  * Find the next hole, data, or sparse region at or after *offset.
  * The value 'blkfill' tells us how many items we expect to find
@@ -2721,7 +2747,7 @@ int
 dnode_next_offset(dnode_t *dn, int flags, uint64_t *offset,
     int minlvl, uint64_t blkfill, uint64_t txg)
 {
-	uint64_t initial_offset = *offset;
+	uint64_t matched = *offset;
 	int lvl, maxlvl;
 	int error = 0;

@@ -2745,16 +2771,36 @@ dnode_next_offset(dnode_t *dn, int flags, uint64_t *offset,

 	maxlvl = dn->dn_phys->dn_nlevels;

-	for (lvl = minlvl; lvl <= maxlvl; lvl++) {
+	for (lvl = minlvl; lvl <= maxlvl; ) {
 		error = dnode_next_offset_level(dn,
 		    flags, offset, lvl, blkfill, txg);
-		if (error != ESRCH)
+		if (error == 0 && lvl > minlvl) {
+			--lvl;
+			matched = *offset;
+		} else if (error == ESRCH && lvl < maxlvl &&
+		    dnode_next_block(dn, flags, &matched, lvl)) {
+			/*
+			 * Continue search at next/prev offset in lvl+1 block.
+			 *
+			 * Usually we only search upwards at the start of the
+			 * search as higher level blocks point at a matching
+			 * minlvl block in most cases, but we backtrack if not.
+			 *
+			 * This can happen for txg > 0 searches if the block
+			 * contains only BPs/dnodes freed at that txg. It also
+			 * happens if we are still syncing out the tree, and
+			 * some BP's at higher levels are not updated yet.
+			 *
+			 * We must adjust offset to avoid coming back to the
+			 * same offset and getting stuck looping forever. This
+			 * also deals with the case where offset is already at
+			 * the beginning or end of the object.
+			 */
+			++lvl;
+			*offset = matched;
+		} else {
 			break;
-	}
-
-	while (error == 0 && --lvl >= minlvl) {
-		error = dnode_next_offset_level(dn,
-		    flags, offset, lvl, blkfill, txg);
+		}
 	}

 	/*
@@ -2766,9 +2812,6 @@ dnode_next_offset(dnode_t *dn, int flags, uint64_t *offset,
 		error = 0;
 	}

-	if (error == 0 && (flags & DNODE_FIND_BACKWARDS ?
-	    initial_offset < *offset : initial_offset > *offset))
-		error = SET_ERROR(ESRCH);
 out:
 	if (!(flags & DNODE_FIND_HAVELOCK))
 		rw_exit(&dn->dn_struct_rwlock);
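
The offset arithmetic in dnode_next_block can be checked in isolation. The following is a minimal standalone sketch, not the committed function: it takes the two shift values as plain parameters instead of a `dnode_t`, uses a `bool` in place of the `flags` bit, and hardcodes the block pointer shift as 7 (a 128-byte `blkptr_t`). The name `sketch_next_block` is hypothetical. It also rejects `span == 0`, a guard the committed helper does not have, because shifting a 64-bit value by 64 is undefined in C.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_BLKPTRSHIFT 7	/* assumed: log2 of a 128-byte block pointer */

/*
 * Move *offset to the first byte of the next level-lvl block (forward)
 * or the last byte of the previous one (backwards). Returns false and
 * leaves *offset untouched if there is no such block.
 */
static bool
sketch_next_block(int indblkshift, int datablkshift, bool backwards,
    uint64_t *offset, int lvl)
{
	int epbs = indblkshift - SKETCH_BLKPTRSHIFT;
	int span = lvl * epbs + datablkshift;	/* log2 of bytes per block */
	uint64_t blkid, maxblkid;

	/* Extra guard in this sketch: span must be in 1..63. */
	if (span <= 0 || span >= 64)
		return (false);

	blkid = *offset >> span;
	maxblkid = 1ULL << (64 - span);	/* count of level-lvl blocks */
	if (!backwards && blkid + 1 < maxblkid)
		*offset = (blkid + 1) << span;
	else if (backwards && blkid > 0)
		*offset = (blkid << span) - 1;
	else
		return (false);

	return (true);
}
```

With both shifts set to 17 (128 KiB data and indirect blocks, chosen here only as an illustration), a level 1 block spans 2^27 bytes. A forward step from any offset inside the first such block lands on 2^27, a backward step from just past 2^27 lands on 2^27 - 1, and a backward step from inside the first block fails, which is what lets the caller stop instead of looping at the start of the object.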
