Investigate unexpectedly high volume of "became visible" logs #9058

jcsp · 2024-09-19T10:22:12Z

There are a lot of:

Layer ... became visible as a result of access

jcsp · 2024-09-19T11:36:15Z

Spot checked two cases which were straightforward and not bugs in visibility calculation:

In find_gc_cutoffs, where we're doing timestamp->LSN lookups
In calculate_synthetic_size, where we read from layers at a non-user-readable LSN (GC cutoff) to calculate the size of the base synthetic size imaginary snapshot.

☝ those ones just need thinking about how to quieten the log (we could use the GC cutoff as a read point in visibility calculation to avoid outputting so much? Or selectively suppress the message if we are in a synthetic size calculation)

Some anomalies:

I see the same layer output twice from find_gc_cutoffs and calculate_synthetic_size, so it seems like the visibilty change might not be sticking.
There are also cases of the "became visible" message from page_service handlers, which need investigation.

jcsp · 2024-09-19T17:24:06Z

In #9064 I suppress the GC cutoff & timestamp->LSN related cases, and reduce the frequency of other cases by calling the layer visibility calculation less frequently.

I investigated one of the getpage cases and wrote a unit test that checked the behavior of the visibility calculation using the index of a timeline that hit the log message. It showed a proper calculation: when I did a LayerMap::search at the same key as was read in the log line, it hit just the most recent delta+image layers, and not the layer that had been mentioned in the log line. Of course, the index was captured a few minutes after the log message appeared -- there is some level of raciness here. It is expected that a getpage request can occasionally race with compaction in a way that hits this path, but I'm still a bit 🤨 about how often it's happening (although this isn't severe in terms of impact, which I would measure by the number of on-demand downloads we see).

After #9064 is deployed, we can look again at the volume of logs and pick another getpage example to analyze. For my convenience, noting the branch name (jcsp/issue-9058-hacks) where I have unit test stuff for checking the behavior with real indices: not pushing that branch because it contains non-public metadata.

jcsp · 2024-09-27T12:00:42Z

Analyzed another getpage case today. The index shows that the relevant layer is indeed covered, by an image layer written about 17 seconds before the getpage request hit it. The getpage request was to lsn FFFF, so it should have hit the most recent image layer.

I think there might be a bug in getpage where image layers at an LSN equal to last_record_lsn are getting skipped.

jcsp · 2024-10-01T13:49:21Z

In #9224, I further refine the condition for when to log, to cover the case of reads during logical size sub-tasks in synthetic size calculation.

I suspect that the remaining unexpected getpage@LSN cases are the result of a behavior (#9185) where reads sometimes skip using image layers and end up reading covered deltas anyway - more detail in that ticket.

## Problem Recent change to avoid the "became visible" log messages from certain tasks missed a task: the logical size calculation that happens as a child of synthetic size calculation. Related: #9058 ## Summary of changes - Add OnDemandLogicalSize to the list of permitted tasks for reads making a covered layer visible - Tweak the log message to use layer name instead of key: this is more terse, and easier to use when debugging, as one can search for it elsewhere to see when the layer was written/downloaded etc.

jcsp · 2024-10-17T09:35:23Z

Investigation #9185 has gone deep enough to convince me that the getpage handler really is just reading deltas that should already be covered by image layers -- that's enough to conclude this investigation ticket.

In #9400 I'm toying with the idea of making the visibility calculation match the (buggy) getpage behavior, but might not merge it because of the risk of a negative effect on disk utilization.

jcsp self-assigned this Sep 19, 2024

This was referenced Sep 27, 2024

pageserver: getpage requests sometimes skip reading recently written image layers #9185

Open

pageserver: tweak logging of "became visible" for layers #9224

Merged

jcsp mentioned this issue Oct 15, 2024

pageserver: work around #9185 in layer visibility calculation #9400

Draft

5 tasks

jcsp closed this as completed Oct 17, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Investigate unexpectedly high volume of "became visible" logs #9058

Investigate unexpectedly high volume of "became visible" logs #9058

jcsp commented Sep 19, 2024 •

edited

Loading

jcsp commented Sep 19, 2024

jcsp commented Sep 19, 2024

jcsp commented Sep 27, 2024

jcsp commented Oct 1, 2024

jcsp commented Oct 17, 2024

Investigate unexpectedly high volume of "became visible" logs #9058

Investigate unexpectedly high volume of "became visible" logs #9058

Comments

jcsp commented Sep 19, 2024 • edited Loading

jcsp commented Sep 19, 2024

jcsp commented Sep 19, 2024

jcsp commented Sep 27, 2024

jcsp commented Oct 1, 2024

jcsp commented Oct 17, 2024

jcsp commented Sep 19, 2024 •

edited

Loading