Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Investigate unexpectedly high volume of "became visible" logs #9058

Closed
jcsp opened this issue Sep 19, 2024 · 5 comments · May be fixed by #9400
Closed

Investigate unexpectedly high volume of "became visible" logs #9058

jcsp opened this issue Sep 19, 2024 · 5 comments · May be fixed by #9400
Assignees

Comments

@jcsp
Copy link
Collaborator

jcsp commented Sep 19, 2024

There are a lot of:

Layer ... became visible as a result of access

@jcsp jcsp self-assigned this Sep 19, 2024
@jcsp
Copy link
Collaborator Author

jcsp commented Sep 19, 2024

Spot checked two cases which were straightforward and not bugs in visibility calculation:

  • In find_gc_cutoffs, where we're doing timestamp->LSN lookups
  • In calculate_synthetic_size, where we read from layers at a non-user-readable LSN (GC cutoff) to calculate the size of the base synthetic size imaginary snapshot.

☝ those ones just need thinking about how to quieten the log (we could use the GC cutoff as a read point in visibility calculation to avoid outputting so much? Or selectively suppress the message if we are in a synthetic size calculation)

Some anomalies:

  • I see the same layer output twice from find_gc_cutoffs and calculate_synthetic_size, so it seems like the visibilty change might not be sticking.
  • There are also cases of the "became visible" message from page_service handlers, which need investigation.

@jcsp
Copy link
Collaborator Author

jcsp commented Sep 19, 2024

In #9064 I suppress the GC cutoff & timestamp->LSN related cases, and reduce the frequency of other cases by calling the layer visibility calculation less frequently.

I investigated one of the getpage cases and wrote a unit test that checked the behavior of the visibility calculation using the index of a timeline that hit the log message. It showed a proper calculation: when I did a LayerMap::search at the same key as was read in the log line, it hit just the most recent delta+image layers, and not the layer that had been mentioned in the log line. Of course, the index was captured a few minutes after the log message appeared -- there is some level of raciness here. It is expected that a getpage request can occasionally race with compaction in a way that hits this path, but I'm still a bit 🤨 about how often it's happening (although this isn't severe in terms of impact, which I would measure by the number of on-demand downloads we see).

After #9064 is deployed, we can look again at the volume of logs and pick another getpage example to analyze. For my convenience, noting the branch name (jcsp/issue-9058-hacks) where I have unit test stuff for checking the behavior with real indices: not pushing that branch because it contains non-public metadata.

@jcsp
Copy link
Collaborator Author

jcsp commented Sep 27, 2024

Analyzed another getpage case today. The index shows that the relevant layer is indeed covered, by an image layer written about 17 seconds before the getpage request hit it. The getpage request was to lsn FFFF, so it should have hit the most recent image layer.

I think there might be a bug in getpage where image layers at an LSN equal to last_record_lsn are getting skipped.

@jcsp
Copy link
Collaborator Author

jcsp commented Oct 1, 2024

In #9224, I further refine the condition for when to log, to cover the case of reads during logical size sub-tasks in synthetic size calculation.

I suspect that the remaining unexpected getpage@LSN cases are the result of a behavior (#9185) where reads sometimes skip using image layers and end up reading covered deltas anyway - more detail in that ticket.

jcsp added a commit that referenced this issue Oct 2, 2024
## Problem

Recent change to avoid the "became visible" log messages from certain
tasks missed a task: the logical size calculation that happens as a
child of synthetic size calculation.

Related: #9058

## Summary of changes

- Add OnDemandLogicalSize to the list of permitted tasks for reads
making a covered layer visible
- Tweak the log message to use layer name instead of key: this is more
terse, and easier to use when debugging, as one can search for it
elsewhere to see when the layer was written/downloaded etc.
@jcsp
Copy link
Collaborator Author

jcsp commented Oct 17, 2024

Investigation #9185 has gone deep enough to convince me that the getpage handler really is just reading deltas that should already be covered by image layers -- that's enough to conclude this investigation ticket.

In #9400 I'm toying with the idea of making the visibility calculation match the (buggy) getpage behavior, but might not merge it because of the risk of a negative effect on disk utilization.

@jcsp jcsp closed this as completed Oct 17, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant