Investigate unexpectedly high volume of "became visible" logs #9058
Spot checked two cases which were straightforward and not bugs in visibility calculation:
☝ Those ones just need some thought about how to quieten the log (we could use the GC cutoff as a read point in the visibility calculation to avoid emitting so many messages, or selectively suppress the message if we are in a synthetic size calculation).
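A minimal sketch of what that filtering could look like (the GC cutoff idea is simplified here into a per-read check; the names are illustrative, not the pageserver's actual types):

```rust
// Illustrative only: decide whether a read that flips a covered layer back to
// visible is worth logging. `Lsn` and the parameters are stand-ins, not the
// real pageserver types.
type Lsn = u64;

fn should_log_became_visible(
    read_lsn: Lsn,
    gc_cutoff: Lsn,
    in_synthetic_size_calculation: bool,
) -> bool {
    // Idea 1 (simplified to a per-read check): treat the GC cutoff as an
    // expected read point, so reads at or below it are not reported.
    if read_lsn <= gc_cutoff {
        return false;
    }
    // Idea 2: synthetic size calculation deliberately reads at historical
    // LSNs, so suppress the message for reads issued on its behalf.
    if in_synthetic_size_calculation {
        return false;
    }
    true
}
```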
Some anomalies:
In #9064 I suppress the GC cutoff & timestamp->LSN related cases, and reduce the frequency of other cases by calling the layer visibility calculation less frequently.

I investigated one of the getpage cases and wrote a unit test that checked the behavior of the visibility calculation using the index of a timeline that hit the log message. It showed a proper calculation: when I did a LayerMap::search at the same key as was read in the log line, it hit just the most recent delta+image layers, and not the layer that had been mentioned in the log line. Of course, the index was captured a few minutes after the log message appeared -- there is some level of raciness here.

It is expected that a getpage request can occasionally race with compaction in a way that hits this path, but I'm still a bit 🤨 about how often it's happening (although this isn't severe in terms of impact, which I would measure by the number of on-demand downloads we see). After #9064 is deployed, we can look again at the volume of logs and pick another getpage example to analyze.

For my convenience, noting the branch name (jcsp/issue-9058-hacks) where I have unit test stuff for checking the behavior with real indices: not pushing that branch because it contains non-public metadata.
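The shape of that check, sketched here with stand-in types rather than the real LayerMap::search API, was roughly:

```rust
// Stand-in types: the real check uses LayerMap::search against a timeline's
// index, but the idea is the same -- a read at a recent LSN should resolve to
// the newest covering layer, not the one the log line flagged.
#[derive(Debug, Clone, PartialEq)]
struct LayerDesc {
    key_start: u64,
    key_end: u64, // exclusive
    lsn_end: u64, // highest LSN the layer covers
}

/// Return the layer with the highest LSN whose key range contains `key`.
fn search_latest(layers: &[LayerDesc], key: u64) -> Option<&LayerDesc> {
    layers
        .iter()
        .filter(|l| l.key_start <= key && key < l.key_end)
        .max_by_key(|l| l.lsn_end)
}

#[test]
fn covered_layer_is_not_hit_by_recent_read() {
    let old_delta = LayerDesc { key_start: 0, key_end: 100, lsn_end: 10 }; // flagged by the log
    let recent_image = LayerDesc { key_start: 0, key_end: 100, lsn_end: 20 };
    let layers = vec![old_delta, recent_image.clone()];

    // The key from the offending log line resolves to the recent image layer.
    assert_eq!(search_latest(&layers, 42), Some(&recent_image));
}
```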
Analyzed another getpage case today. The index shows that the relevant layer is indeed covered, by an image layer written about 17 seconds before the getpage request hit it. The getpage request was to lsn FFFF, so it should have hit the most recent image layer. I think there might be a bug in getpage where image layers at an LSN equal to last_record_lsn are getting skipped.
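To make that suspicion concrete, the kind of off-by-one it would imply looks like this (purely illustrative; not the actual getpage read path):

```rust
// Purely illustrative: if layer selection compared image LSNs against the
// request LSN with a strict `<`, an image layer written exactly at
// last_record_lsn would be skipped even though it is valid for that read.
fn image_usable_strict(image_lsn: u64, request_lsn: u64) -> bool {
    image_lsn < request_lsn // suspected buggy form: skips an image at request_lsn
}

fn image_usable_inclusive(image_lsn: u64, request_lsn: u64) -> bool {
    image_lsn <= request_lsn // an image at exactly the read LSN is usable
}

fn main() {
    // A getpage at "LSN FFFF..." is resolved to last_record_lsn before reading.
    let last_record_lsn = 0x1234_5678;
    assert!(!image_usable_strict(last_record_lsn, last_record_lsn));
    assert!(image_usable_inclusive(last_record_lsn, last_record_lsn));
}
```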
In #9224, I further refine the condition for when to log, to cover the case of reads during logical size sub-tasks in synthetic size calculation. I suspect that the remaining unexpected getpage@LSN cases are the result of a behavior (#9185) where reads sometimes skip using image layers and end up reading covered deltas anyway - more detail in that ticket.
## Problem

A recent change to avoid the "became visible" log messages from certain tasks missed a task: the logical size calculation that happens as a child of synthetic size calculation.

Related: #9058

## Summary of changes

- Add OnDemandLogicalSize to the list of permitted tasks for reads making a covered layer visible
- Tweak the log message to use the layer name instead of the key: this is more terse, and easier to use when debugging, as one can search for it elsewhere to see when the layer was written/downloaded etc.
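A rough sketch of that allowlist idea (OnDemandLogicalSize is the name from the PR summary; the other variants and the function are illustrative stand-ins):

```rust
// Illustrative allowlist: reads issued by these tasks are expected to visit
// covered layers, so flipping a layer visible on their behalf is not logged.
enum TaskKind {
    GarbageCollector,
    CalculateSyntheticSize,
    OnDemandLogicalSize, // newly permitted: logical size as a child of synthetic size
    PageRequestHandler,
}

fn read_may_touch_covered_layers(task: &TaskKind) -> bool {
    matches!(
        task,
        TaskKind::GarbageCollector
            | TaskKind::CalculateSyntheticSize
            | TaskKind::OnDemandLogicalSize
    )
}

fn main() {
    assert!(read_may_touch_covered_layers(&TaskKind::GarbageCollector));
    assert!(read_may_touch_covered_layers(&TaskKind::CalculateSyntheticSize));
    assert!(read_may_touch_covered_layers(&TaskKind::OnDemandLogicalSize));
    assert!(!read_may_touch_covered_layers(&TaskKind::PageRequestHandler));
}
```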
Investigation #9185 has gone deep enough to convince me that the getpage handler really is just reading deltas that should already be covered by image layers -- that's enough to conclude this investigation ticket. In #9400 I'm toying with the idea of making the visibility calculation match the (buggy) getpage behavior, but might not merge it because of the risk of a negative effect on disk utilization.
There are a lot of: