This test is failing on main; specifically, the `count` variable is getting set beyond `1`. Also, looking into the DI types, they use `ConcurrentDictionary`, likely in an attempt to be "thread safe". However, some very simple benchmarks show that those types are costly. And the results speak for themselves (the tl;dr is that `ConcurrentDictionary` uses almost 10x the memory that a regular `Dictionary` does!).

Indeed, when I ran a benchmark test that only created a `DependencyBuilder` and then ran the `.Build()` method to produce an empty `DependencyProvider`, I'd allocated `4.43 KB` worth of memory. Clearly, there are some problems.
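The `ConcurrentDictionary` cost claim is easy to sanity-check. Here is a small, self-contained harness (my own sketch, not the article's benchmark) that compares the allocation of an empty `Dictionary` against an empty `ConcurrentDictionary` using `GC.GetAllocatedBytesForCurrentThread`:

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;

public static class AllocationDemo
{
    // Bytes allocated on the current thread while running `action` once
    // (with one warm-up call so JIT and static-init costs don't pollute the number).
    public static long Measure(Action action)
    {
        action();
        long before = GC.GetAllocatedBytesForCurrentThread();
        action();
        return GC.GetAllocatedBytesForCurrentThread() - before;
    }

    static object _keep; // keeps the allocations observable so they can't be elided

    public static void Main()
    {
        long plain = Measure(() => _keep = new Dictionary<Type, object>());
        long concurrent = Measure(() => _keep = new ConcurrentDictionary<Type, object>());
        Console.WriteLine($"Dictionary: {plain} B, ConcurrentDictionary: {concurrent} B");
    }
}
```

An empty `Dictionary` is just one small object, while `ConcurrentDictionary`'s constructor eagerly allocates its bucket table and an array of lock objects, which is where the order-of-magnitude difference comes from.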
This caused me to go on an adventure!
I ran the above unit test, but as a benchmark instead.
The goal now is to fix the two problems... I guess most importantly the code should pass the unit test, performance be damned! But ideally, while I'm sneaking through the code, it would be nice if it allocated less memory and ran faster. I added a simple `lock` keyword around the entire `GetService` method, and it raised the mean time a bit, but changed no other performance numbers... And it fixed the unit test.

With this in mind, it seems like perhaps the goal of using the concurrent data structures isn't worth the memory allocation cost. I ripped them out (without running all the unit tests yet, so perhaps I broke something), and re-ran my base benchmark where I created an empty `DependencyProvider`. The allocation goes from `4.4 KB` to `.9 KB`. And it follows that as less memory is allocated, less time is spent in the GC, so the new method runtime is nearly a third of the old runtime. This feels like a massive improvement. So maybe it's worth it to dig in some more and investigate whether those `ConcurrentDictionary` types were really saving us from anything.

On a separate note, the internals of the `DependencyProvider` keep two caches: one for instantiated singleton objects, and one for instantiated scoped objects. This is wasteful, because you cannot add the same `Type` as both a singleton and a scoped service anyway, so those dictionary caches would never overlap in key-space. Since they never overlap, we can just use a single cache object. This saves a further `80 B` of memory.

Actually, the same principle can be applied to the `DependencyProvider`'s use of 3 separate dictionaries to hold the `ServiceDescriptors` for transients, scoped services, and singletons. It is invalid to have any type overlap anyway, so those 3 dictionaries can be collapsed into a single dictionary. This saves us a few hundred bytes in the base case.

For reference, if all you do is instantiate the `DependencyBuilder`, but don't call `.Build()` on it, that costs `136 B`. That means there are roughly `450 B` involved in the basic spin-up of a `DependencyProvider`. A lot of that has to do with a feature I'd love to remove from our DI scheme, called hydration... Indeed, if I run a benchmark where all I do is instantiate a `DependencyProvider`, it costs the same rough `450 B`. I don't want to think about the hydration stuff for now, but the same stunt of collapsing the various dictionaries can be applied to the builder as well. It was holding 3 separate dictionaries, one per lifetime type. I collapsed them into one dictionary, and the resulting memory went down to `56 B`. Re-running the base case, where the empty provider is created through the `DependencyBuilder.Build()` method, the returns appear to be diminishing.
I figured it would be a good time to re-run all the unit tests across cli and microservice to see if I had broken anything. (Side note: we should combine our solutions at this point.) And lo and behold, I had broken something!! I broke 81 tests in microservice, all roughly with the same call stack.
That call stack is coming from this code in the constructor of the `DependencyProvider`. The exception is obvious: the `desc.Interface` value already exists as a key in the `Descriptors` dictionary. Where before there were 3 dictionaries for the separate lifetimes, now there is only 1. I guess my assumption that it was invalid to register the same interface under multiple lifetimes wasn't quite accurate. To trace backwards, this is happening in the `Fork` operation from the Microservice base code, where the sub-scope is adding in a scoped version of the `_socketRequesterContext`. But if I look backwards at the `conf.ServiceScope`, there is a singleton version already in existence.

The way that dependencies used to get resolved was in a specific ordering of the lifetime types: first transient, then scoped, then singleton. So the shorter-lived a service was registered, the higher order it had in resolution. In this case, the addition of the scoped service would override the resolution away from the original singleton. To fix this, I allowed the constructor to override the service if it is a "higher order" lifetime.
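A minimal sketch of that fix (illustrative names only, not the actual Beamable source): when the combined dictionary already holds a descriptor for an interface, the shorter-lived, "higher order" registration wins instead of throwing on the duplicate key.

```csharp
using System;
using System.Collections.Generic;

// Shorter-lived services resolve with higher priority: transient > scoped > singleton.
enum Lifetime { Transient = 0, Scoped = 1, Singleton = 2 }

class ServiceDescriptor
{
    public Type Interface;
    public Lifetime Lifetime;
}

static class DescriptorTable
{
    // Adds a descriptor to the single combined dictionary. A shorter-lived
    // ("higher order") registration overrides an existing longer-lived one,
    // instead of throwing an ArgumentException on the duplicate key.
    public static void Add(Dictionary<Type, ServiceDescriptor> descriptors, ServiceDescriptor desc)
    {
        if (descriptors.TryGetValue(desc.Interface, out var existing) && existing.Lifetime <= desc.Lifetime)
            return; // existing registration already has the same or higher priority; keep it
        descriptors[desc.Interface] = desc;
    }
}
```

In the `Fork` scenario above, adding the scoped `_socketRequesterContext` on top of the existing singleton now quietly overrides it, matching the old transient-then-scoped-then-singleton resolution order.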
And now all the tests are passing again.
To recap so far, take this benchmark function.
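The original benchmark function isn't preserved in this capture, so here is a self-contained stand-in that exercises the same shape of work: build a provider and resolve a service through the coarse `GetService` lock. The `DependencyBuilderStub`/`DependencyProviderStub` types are illustrative stubs, not the real Beamable code.

```csharp
using System;
using System.Collections.Generic;

// Stub provider: plain dictionaries plus one coarse lock around GetService,
// mirroring the fix described earlier (NOT the actual Beamable source).
class DependencyProviderStub
{
    private readonly object _lock = new object();
    private readonly Dictionary<Type, Func<object>> _factories;
    private readonly Dictionary<Type, object> _instances = new Dictionary<Type, object>();

    public DependencyProviderStub(Dictionary<Type, Func<object>> factories) => _factories = factories;

    public object GetService(Type type)
    {
        // the lock serializes the check-then-create sequence, which a
        // ConcurrentDictionary alone never guaranteed
        lock (_lock)
        {
            if (!_instances.TryGetValue(type, out var instance))
                _instances[type] = instance = _factories[type]();
            return instance;
        }
    }
}

class DependencyBuilderStub
{
    private readonly Dictionary<Type, Func<object>> _factories = new Dictionary<Type, Func<object>>();
    public void AddSingleton<T>(Func<T> factory) where T : class => _factories[typeof(T)] = () => factory();
    public DependencyProviderStub Build() => new DependencyProviderStub(_factories);
}

public static class RecapBenchmark
{
    // The measured path: build a provider and resolve one service.
    public static object BuildAndResolve()
    {
        var builder = new DependencyBuilderStub();
        builder.AddSingleton(() => new object());
        var provider = builder.Build();
        return provider.GetService(typeof(object)); // returned so the work can't be optimized away
    }
}
```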
Here are the results... About twice as fast on the runtime, and less than a quarter of the allocation. This test is a pretty vanilla base case; it really isn't doing much at all. It's tragic that we are allocating an entire kilobyte to instantiate a service, but I still believe this cost is worth it as the complexity of the dependency scope scales.
So there are more problems. After a dependency scope is created, eventually it needs to get shut down. This happens in our Microservice framework all the time, because there is a dependency scope per request. The scope is disposed at the end of the request.
So here is another benchmark to see how the disposal method performs... Hint: it's not good.
That `Dispose` method is doing a lot:
- for every service implementing `IBeamableDisposable`, call the `OnDispose` method and wait for it to finish via a `Promise`,
- for services implementing `IBeamableDisposableOrder`, order them,
- call `Dispose` on all child scopes.

But it feels absurd that it should cost us a kilobyte to call the dispose method.
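The steps above can be sketched roughly like this, using `Task` in place of the Beamable `Promise` and illustrative type names (this is a shape, not the real implementation):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

// Illustrative stand-ins for the interfaces named in the article.
interface IBeamableDisposable { Task OnDispose(); }
interface IBeamableDisposableOrder { int DisposeOrder { get; } }

class ScopeSketch
{
    public List<ScopeSketch> Children { get; } = new List<ScopeSketch>();
    public List<object> Services { get; } = new List<object>();

    public async Task Dispose()
    {
        // 1. dispose child scopes first, so services die from the bottom of the tree up
        await Task.WhenAll(Children.Select(c => c.Dispose()));

        // 2. order the cached services by their declared disposal order...
        var ordered = Services
            .OfType<IBeamableDisposable>()
            .OrderBy(s => (s as IBeamableDisposableOrder)?.DisposeOrder ?? 0);

        // 3. ...and wait for each OnDispose to finish
        foreach (var service in ordered)
            await service.OnDispose();
    }
}
```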
One of the first things that happens in the method is that all child scopes are disposed first, so services die from the bottom of the tree up towards the root. The `Dispose` method is `async`, so we need to wait for all the sub-disposals to finish. When I comment this line out, it shaves `200 B` off the allocation (and tests break).

One thing to note is that the order doesn't actually matter for these sub-processes to finish, but the implementation of `Promise.Sequence` is doing more work than we need just to maintain order. I wrote a `Promise.WhenAll` util method that returns a `Promise` that completes when the input list of `Promise<T>` completes. It is a bit better than `Promise.Sequence`. And when applied to the `Dispose` method, it brings the allocation down to `1296 B` (down almost `200 B`), but it's an unfair optimization, because in the test there are no child scopes, and the implementation for `WhenAll` has a zero-input base case. We'll come back to this when we actually have child scopes to think about...

Anyway, the next thing that happens in the `Dispose` method is that all cached service instances in the scope are disposed, with respect to the order defined by the `IBeamableDisposableOrder` interface. If I take this bit out, then the memory allocation goes down to `728 B`, which is only a few hundred more bytes than the non-disposal case. So there are lots of potential gains to be made by looking at this disposal.

Actually, time for a detour, because the act of calling a `Promise` function, or using an `await` on a `Promise`, causes allocation. A `Promise` instantiation takes `104 B` in a benchmark, and every time we use `async Promise`, that implicitly generates a `Promise` behind the scenes. I removed a tiny bit of unused cruft, and now it's `96 B`, but still. That feels like a lot. We use `Promise` so much across the codebase that I feel like even the tiniest wins inside this library will pay dividends later. In fact, take the following benchmark,
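The benchmark itself isn't preserved in this capture, but the effect reproduces with a stand-in class shaped like the `Promise` under discussion, measured via `GC.GetAllocatedBytesForCurrentThread`. The stub's layout is an assumption; the real `Promise` carries more state, hence its larger `96 B`.

```csharp
using System;

// A stand-in shaped like the article's Promise: a class whose constructor also
// allocates a dedicated lock object. Illustrative only, not the real type.
class PromiseStub
{
    private readonly object _lock = new object();
    private bool _done;

    public void Complete() { lock (_lock) { _done = true; } }
    public bool IsDone { get { lock (_lock) { return _done; } } }
}

public static class PromiseAllocationDemo
{
    static PromiseStub _keep;

    public static void Main()
    {
        _keep = new PromiseStub(); // warm-up so JIT costs don't pollute the number
        long before = GC.GetAllocatedBytesForCurrentThread();
        _keep = new PromiseStub();
        long allocated = GC.GetAllocatedBytesForCurrentThread() - before;
        // two heap objects per promise on this layout: the stub itself plus its
        // dedicated lock object (a few dozen bytes each on 64-bit .NET)
        Console.WriteLine($"{allocated} B per instantiation");
    }
}
```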
When this runs, it allocates `96 B`... Why? Because inside the `Promise`, we allocate a `new object()` to be used as the reference value for the `lock` statements used throughout the `Then`/`Error` methods. According to Microsoft, you should never use the `this` keyword as the reference value for a `lock` statement, because someone else might use the same value in another `lock` statement, leading to a deadlock. This is a tricky scenario, because if I change our implementation to use `this`, all of that allocation goes away (for the instantiation; allocation will re-appear during usage). I am of the opinion at the moment that this is a worthy trade, given how often `Promise` types are instantiated and how unlikely it is that anyone would `lock` on a `Promise`. I think we can curtail this in documentation.

Moving on, our `async`/`await` code around `Promise` costs memory. This benchmark shows that simply adding `async` to the method signature will cost memory. The `async` version costs `96 B` more memory. That number should look familiar, because it was the same amount of memory the `new object()` took for the previous `lock` statement. This is a snippet from the `async` support code from `Promise`. I took a lot of inspiration (aka, stole) from this article about UniTask.
The `PromiseAsyncMethodBuilder` is getting instantiated to build the state machine for the `async` handling. However, it doesn't need to be a class; it can be converted into a struct, which means its allocation goes away.

The remaining `64 B` aren't actually the `async` keyword's fault, but come from something that happens during the `async` state machine: the `Promise` is completed. Consider this benchmark, which compares a promise instantiation to an instantiation plus a `CompleteSuccess` call. The `64 B` looks familiar, eh? Where is that `64 B` coming from? There are 2 main causes:
- the `CompleteSuccess` method takes a generic `<T>`, which leads to boxing, and
- the `Promise` type is secretly a `Promise<Unit>` under the hood, so passing `PromiseBase.Unit` to `CompleteSuccess` passes the struct by value. We need to pass it by reference to avoid the allocation.

It was at this point I realized I'd been doing my benchmarks wrong this whole time. When returning `void` from a benchmark, it may be that you've optimized the code so much that the compiler thinks nothing actually needs to happen, because there are no side effects. So I went back and changed my `Promise` allocation benchmark to actually return the generated `Promise`, and I saw that the allocation itself was responsible for the `64 B`, not the `CompleteSuccess`. This also means that the `async Promise` method was "correct" in that it returned something, whereas the `void` method was "incorrect". After I fixed my benchmarks, I saw no difference between simply returning a new `Promise` vs doing it implicitly with an `async Promise`. I would still like to remove more allocation here, but I think the way to do it would be with `Promise` pooling. However, pooling at this level would introduce some new constraints on the behaviour of the `Promise` library that I don't feel comfortable adding blind.

So it is time to return to the `Dispose` method's internal call to dispose all of its internal services... Perhaps an easier target to go after are some LINQ statements inside the method.
I tore out a lot of code in the method (so it doesn't work), to isolate one specific line,
Perhaps not surprisingly, there is some allocation here (`88 B`), and ultimately, I don't understand why we need to do this call to `Distinct` at all. I remember writing this code, and I remember noticing that it was possible for the same instance to appear in the cache more than once. That was bad, because if we call the `Dispose` methods on an instance more than once, all the logic gets broken. However, I am sad with Past-Me for not drilling into how the multiple instances appeared in the list in the first place, because that seems like the real problem. I am going to boldly ignore the root cause and remove the `Distinct` call, which brings the simple access of `InstanceCache.Values` down to `680 B` from `744 B`, saving `64 B`.

Moving on, inside the helper method that disposes all the services, the first piece of business is to sort the services into groups based on their possible ordering.
Two things jump out at me:
- `clonedList` allocates, but it doesn't need to exist, since the LINQ `GroupBy` is going to allocate a new structure anyway. The `clonedList` existed from a time when I wanted to avoid possible collection-modification errors.
- The LINQ statements are cute for doing the logic I need, but I'm curious if there is a less memory-intensive way to write this.

Calling this method with everything else commented out, so that it only runs the code above, the allocation for the entire `Dispose()` method jumps to `1304 B`.

I may just be shifting complexity around, but I decided to move the sorting of the services out of the `Dispose` method, and track it as additional state as the services are resolved. This raises the floor on the base case that never calls `Dispose`, but removes the need to do any sorting within `Dispose`, thus removing the LINQ and dropping the allocation. We should always be calling `Dispose`, so I'm valuing that case over the non-disposal case. This benchmark is still using broken code, but it illustrates using a precached, sorted data structure instead of the code above.
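A sketch of what that precached structure could look like (illustrative, not the actual fix): a `SortedDictionary` keyed by disposal order, appended to once per resolved service, so `Dispose` just walks it with no LINQ and no sort pass.

```csharp
using System;
using System.Collections.Generic;

// Illustrative stand-in for the interface named in the article.
interface IBeamableDisposableOrder { int DisposeOrder { get; } }

class SortedServiceTracker
{
    // SortedDictionary keeps keys ordered, so enumeration yields the
    // groups in disposal order with no work at teardown time.
    private readonly SortedDictionary<int, List<object>> _groups = new SortedDictionary<int, List<object>>();

    // Called once per resolved service, amortizing the "sort" over resolution time.
    public void Track(object service)
    {
        int order = service is IBeamableDisposableOrder o ? o.DisposeOrder : 0;
        if (!_groups.TryGetValue(order, out var list))
            _groups[order] = list = new List<object>();
        list.Add(service);
    }

    public IEnumerable<KeyValuePair<int, List<object>>> Groups => _groups;
}
```

The trade, as noted above, is a small, fixed cost per resolution in exchange for a cheaper `Dispose`.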
--more to do