-
Notifications
You must be signed in to change notification settings - Fork 545
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Per Script Invocation Lua Memory Limits #903
Conversation
…ust to prove it's possible; squashing to clean up _a lot_ of experimentation commits
… for LuaScripts, fixes that
It will probably be unusual to use this allocator, but it shouldn't be _bad_ either.
This would be the most concerning for the PR. What is causing this drop, and if it is the trampoline, then is there a way to enable an unsafe mode that avoids this overhead? |
1) Added a check for NA in results which is an indication that the BDN test failed at run time 2) Added 'Lua.LuaScriptCacheOperations','Lua.LuaRunnerOperations' to BDN Github Action 3) Updated Expected values for the new Lua BDN tests
…into luaMemoryLimits
Doing some light profiling, it's the extra pcall layer. I'll look at clawing some of this back. |
* Added LCS command * Format fix * Reverted CommandDocsUpdater.cs * Fix cluster test * Fixed wrong change * Moved to constant * Review command fixes * Fixed review comment * Fixed test issue --------- Co-authored-by: Vasileios Zois <[email protected]> Co-authored-by: Tal Zaccai <[email protected]>
* Configure min and max IO completion threads separately from min and max threads (in the ThreadPool). This is needed as some scenarios may limit number of thread pool threads but require a larger number of IO completion threads. * nit
…enerally in Lua to get some perf back.
…into luaMemoryLimits
@badrishc I've pushed up some changes that claw most of the perf loss back, turning a few cases into improvements even. |
…re, which would cause SendAndReset() to fail thinking the message was too large
Good to merge @darrenge ? |
…n run it a second time and see 1312 without any code changes. Just a small variance between runs so decided to just make sure doesn't go above 1312 for any that were set to 1024. The charts and other things will show the small nuances if needed.
Not quite ready to merge. I want to make sure BDN are all passing. Also, I need to update the charting data from "None" to "Native,None". Hopefully by top of hour. |
I am done ... if / when the BDN pipeline and all the CIs pass, we can merge. |
Another decent sized one, though hopefully this is the last "big" Lua PR - the rest I can foresee should be smaller.
TODOs:
Are memory pressure updated necessary? Have a thread with .NET GC folks for this.Got our answer, they are correct to have here.Behavior when scripts aborted? Redis is weird here.It's reasonable for writes that happened pre-abort to still happen. We can explore rollback if there's a pressing need, but it's non-trivial.This introduces the ability to specify maximum memory limits for Lua scripts, currently this a single config (
--lua-script-memory-limit
). To enable this we also have to introduce custom allocators (--lua-memory-management-mode
) for Lua, there are 3 in this PR:Native
(the current behavior, where Lua provides the allocator),Tracked
(where memory is acquired withNativeMemory
and GC pressure is updated), andManaged
(where a POH array is pre-allocated and memory is obtained from a freelist punned over that allocation).In order to gracefully handle Lua OOMs more of the operation of
LuaRunner
(things like compilation and the preamble) is hidden behind Lua PCalls. This is a necessary change, as the default behavior of Lua is to abort the process in the face of OOMs - PCalls prevent that.To make the PCall changes less expensive (and just generally less awful), I introduced some (Strong, not Pinned) GCHandles, function pointers, and trampolines. At the end of this, we're basically just using KeraLua to package Lua and define some constants - none of the .NET code is really running anymore. If we really wanted, we could build Lua ourselves (maybe even drop down to 5.1 to match Redis) and exploit that tight coupling - but I have no intention of doing so at this time.
When improving the Lua OOM RESP error, I also found a bug in previous PR around buffer management - it is fixed in this commit.
The Allocators
Native
This is the default.
This just uses the built-in Lua allocator, which is a thin shim over
malloc
. It should perform bit better thanTracked
simply because there isn't any .NET code in the way.Native does not support memory limits.
Tracked
A thin wrapper over
NativeMemory
. It supports memory limits, and will fail once total requested bytes exceeds the configure limit. Since it cannot see the overhead ofNativeMemory
the limit is only softly enforced.This also calls GC.(Add|Remove)MemoryPressure to inform the GC of these (potentially fairly large) allocations.
Managed (w/ and w/o Limits)
A really basic free-list based allocator over a POH array. It pre-allocates the total limit, and (if one is configured) it strictly limits allocations since the overhead can be seen.
If a limit is not configured, 2MB (or larger, if the requested size exceeds 2MB) arrays are allocated as needed.
We could certainly do a lot better here (I imagine there's something existing in Garnet I could steal or repurpose), but this is mostly a proof we could get Lua 100% onto the managed heap. That said, I couldn't help put profile a little bit, so it shouldn't be awful given Lua's allocation patterns.
Open Questions
Is GC pressure actually needed in the
Tracked
case?Docs say:Which makes very little sense to me, as in a container (like a job) with memory limits the presence or lack of a finalizer seems irrelevant to whether the GC needs to be informed of native allocations?
Ultimately the .NET GC folks will just have to answer this one, I've opened a thread with them.Docs are (somewhat) incorrect here, and will be updated. It is correct, but not strictly necessary, to have these calls in the Tracked case. I'm leaving them in so the GC can respond more promptly to memory pressure.
What is expected behavior when a script is aborted?
This change introduces a case where a script might be aborted, and I expect future changes (timeouts, and potentially
SCRIPT|KILL
) to add more.Redis doesn't allow this - you are expected to let Redis crash, or force a shutdown, if a script goes out of control. That's kind of nuts, IMO, especially for any HA service.
However, by deviating from Redis (with this opt-in switch), we do need to define expected behavior.
Right now, the behavior is "any commands that executed in the script, executed". Commands cannot half-execute, but scripts can, basically.
Is this acceptable, or do we need some (presumably configurable) rollback behavior?
With transactions enabled we already know the scope of "needs to be rolled back", but the implementation would be non-trivial.
Decision: No rollbacks
Summarizing some discussion:
Benchmarks
I changed
ScriptOperations
to useLuaParams
instead ofOperationParams
as we were already ignoring most of the operation variants there. Now all Lua-related benchmarks run for with different allocators enabled: Native (the old behavior, and current default), Tracked w/ 2M limit, Tracked w/o a limit, Managed w/ 2M limit, and Managed w/o limit.main
results are as ofce21c248f084744e45bbff08d0ecce0a51326cca
.luaMemoryLimits
are as of4c5a2e9893f25895c4ab33bc4e60425a431e7015
.Broadly speaking, we're giving up a bit of perf for the ability to recover from OOMs (and other runtime errors, technically). There's some work that could be done to claw bits of this back, in theory, but we are actually doing more with this change.Having now done a fair amount of profiling and cleanup, most benchmark results are either a wash or a slight improvement after this PR.
LuaRunnerOperations
Comparing the baseline and the Native,None case, we're giving up some perf (~17%) in the CompileForSessionSmall case. We're improving everywhere else, and since the small case is so small (literally no Garnet ops, no calculations, just a
return nil
) I think that's fine.main
luaMemoryLimits
LuaScriptCacheOperations
Cases where we construct a new
LuaRunner
are a bit slower, though most of these are in the error bounds. Observationally, these seem to vary a fair amount run-to-run.main
luaMemoryLimits
LuaScripts
Again comparing baseline to
Native,None
we're giving up a bit of perf in Script1, but improving everywhere else. Loss is ~3%, best improvement is ~20% (for Script2). Given that Script1 is again very very minimal (just areturn
) I think that loss is fine, especially as it's offset by gains for more complicated scripts.main
luaMemoryLimits
ScriptOperations
Comparing to
Native,None
most of these are in the margin of error. Large and SmallScript are slightly improved (~5%).main
(eliding Params != None)luaMemoryLimits