Serious slowdown when filter calls GetFrame(0) in its constructor #476

@pinterf

Description

We came across this problem during the updated ConvertToPlanarRGB tests.
Theoretically, the ConvertBits call in this script is a no-op, since no bit-depth conversion is performed.

However, when it calls GetFrame(0) in its constructor for frame-property detection, a serious slowdown occurs. When I removed the constructor-initiated GetFrame(0), the speed went back to normal. As a side note: ConvertToPlanarRGB calls a similar GetFrame(0), but that one has no slowing effect.

Blankclip(60000,width=1600, height=1600, pixel_type="YUV444P8")
ConvertToPlanarRGB(bits=8)
ConvertBits(8) # tested with and without this line.

The bit depth was kept consistent through the whole chain: 8, 16 and 32 bits (float) were tested.

Benchmarks suggested that memory-cache size might also be involved in the problem. (Differences under 1% are just measurement noise; AVSMeter64 was used to collect the fps data.)

# X,Y: without ConvertBits(n), with ConvertBits(n) [fps]. ConvertBits constructor calls GetFrame(0)
# size:      400x400       600x600       800x800      1200x1200    1600x1600    2000x2000
# 8 bits     27376,27000   13200,13100   7600,7600    3351,3280    1720,1509    837,795
# 16 bits    24000,24000   11235,11200   6600,6100    1854,1420     671,599     380,353
# float      13900,14000    6300,5900    2600,1800     626,545      285,270     178,171

The blank clip is a single frame, generated once; it then goes into the cache and stays there. Verified: we always get back the very same memory address for that precalculated frame (BlankClip and ColorBars behave the same).

An in-constructor GetFrame(0) call is the usual way of obtaining clip-wide properties such as the color matrix, the full/limited range setting, etc., assuming that such properties found in frame #0 won't change across the whole clip. The precalculations and the dispatched functions can then be chosen once, during filter instance creation.
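The pattern looks roughly like this. A minimal, self-contained sketch: `Clip`, `Frame`, `MatrixAwareFilter` and the default value are hypothetical stand-ins for AviSynth's real PClip/PVideoFrame API, chosen only to illustrate the constructor-time probe.

```cpp
#include <map>
#include <memory>
#include <string>

// Hypothetical stand-ins for AviSynth's PVideoFrame / PClip.
struct Frame {
    std::map<std::string, int> props;  // frame properties, e.g. "_Matrix"
};

struct Clip {
    virtual ~Clip() = default;
    virtual std::shared_ptr<Frame> GetFrame(int n) = 0;
};

// Probes frame #0 once in the constructor and caches the clip-wide
// property, so the per-frame path never has to look it up again.
class MatrixAwareFilter : public Clip {
public:
    explicit MatrixAwareFilter(std::shared_ptr<Clip> child)
        : child_(std::move(child)) {
        // Assumes frame #0 is representative of the whole clip.
        std::shared_ptr<Frame> f0 = child_->GetFrame(0);
        auto it = f0->props.find("_Matrix");
        matrix_ = (it != f0->props.end()) ? it->second : 6;  // 6: illustrative default
    }
    std::shared_ptr<Frame> GetFrame(int n) override {
        return child_->GetFrame(n);  // a real filter would convert using matrix_
    }
    int matrix() const { return matrix_; }
private:
    std::shared_ptr<Clip> child_;
    int matrix_;
};
```

It is exactly this one-time probe of the child that, as described below, turned out to have surprising side effects on runtime performance.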

Who is the culprit? The Avisynth frame-caching system? An unfortunate memory layout? Mutexes waiting on each other?

Bad frame physical addresses? I had read that cache lines and address lookup/translation can depend on magic alignment boundaries in processors. I logged the read and write pointers of ConvertToPlanarRGB across the different runs, but some thousand runs (automated, logged and analyzed) did not support this explanation.

At this point my free AI access ran out :) I quickly subscribed to my first AI month :)

I was advised to try GetFrame(100) instead. And yes, the slowdown immediately disappeared.

I reverted to GetFrame(0) and, following another piece of advice, logged the first couple of GetReadPtr and GetWritePtr pointers in ConvertToPlanarRGB, this time within the same session. In the quick case (without ConvertBits) all addresses were identical. In the slow cases the GetWritePtr pointers alternated between two addresses. When ConvertBits's GetFrame(0) was removed, the addresses became identical again.

Those addresses came from env->NewVideoFrame, which in turn obtains them from the buffer-reusing helper, the so-called Frame Registry.
Frame Registry "allocations" are fast; mostly they are not allocations at all. The mechanism avoids OS reallocations by handing back an unused (reference count = 0) frame or video frame buffer. Unused frames and buffers are not freed immediately, only when memory runs short.
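The reuse mechanism can be modelled like this. A toy sketch, not the actual AviSynth internals: a request first scans for a registered buffer whose reference count has dropped to zero and hands it back; only when none qualifies does it fall through to a real allocation.

```cpp
#include <cstddef>
#include <memory>
#include <vector>

// Toy model of the buffer-reusing "Frame Registry": buffers stay
// registered after release and are handed out again once unreferenced.
struct Buffer {
    std::vector<unsigned char> data;
    int refcount = 0;
};

class FrameRegistry {
public:
    Buffer* Acquire(std::size_t size) {
        for (auto& b : pool_) {
            if (b->refcount == 0 && b->data.size() >= size) {
                b->refcount = 1;  // reuse: no OS allocation happens
                return b.get();
            }
        }
        pool_.push_back(std::make_unique<Buffer>());  // real allocation
        pool_.back()->data.resize(size);
        pool_.back()->refcount = 1;
        return pool_.back().get();
    }
    void Release(Buffer* b) { --b->refcount; }  // buffer stays registered
    std::size_t PoolSize() const { return pool_.size(); }
private:
    std::vector<std::unique_ptr<Buffer>> pool_;
};
```

Acquiring, releasing and acquiring again returns the very same buffer address, which is why the single cached blank frame normally keeps one stable address.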

So in the slow case our write pointer flip-flopped between two addresses on even/odd frames. I analyzed the timings and the reference counts, and even implemented a last-released-first-reserved video-frame-buffer reuse logic, but it still alternated between two addresses. (By the way: why is it slower to write into address X on even frames and address Y on odd frames, instead of always writing into X? That is subject to further investigation. Cache eviction?)
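The alternation itself is easy to reproduce in a toy model. The sketch below (illustrative names, simplified logic, not AviSynth code) shows that if the previous output frame is still referenced, e.g. held by a cache, while the next one is being written, a refcount-zero reuse pool can never hand back the same buffer twice in a row, so the writes necessarily ping-pong between two addresses.

```cpp
#include <memory>
#include <vector>

struct Buf { int refcount = 0; };

// Minimal refcount==0 reuse pool, as in the Frame Registry model.
class Pool {
public:
    Buf* Acquire() {
        for (auto& b : bufs_)
            if (b->refcount == 0) { b->refcount = 1; return b.get(); }
        bufs_.push_back(std::make_unique<Buf>());
        bufs_.back()->refcount = 1;
        return bufs_.back().get();
    }
    void Release(Buf* b) { --b->refcount; }
private:
    std::vector<std::unique_ptr<Buf>> bufs_;
};

// Produce `frames` frames while a cache keeps the previous frame alive
// until the next one exists; returns the write addresses observed.
std::vector<Buf*> Run(int frames) {
    Pool pool;
    std::vector<Buf*> addrs;
    Buf* held = nullptr;                // the frame the cache still holds
    for (int n = 0; n < frames; ++n) {
        Buf* cur = pool.Acquire();      // previous frame not yet released
        addrs.push_back(cur);
        if (held) pool.Release(held);   // cache now drops the older frame
        held = cur;
    }
    return addrs;
}
```

Under this model the observed even/odd pattern falls out directly: frame 0 gets buffer X, frame 1 must get a fresh buffer Y, frame 2 gets X back, and so on.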

So far our good AI had mainly been helpful in establishing what was not causing the problem.

The codebase is too large and complex to hand over to an AI in full.

For the final solution I eliminated the premature creation of per-filter AvsCache instances during the filter instantiation phase. Calling child->GetFrame(0) constructs a Cache object for the child (in our case, for ConvertToPlanarRGB) and fills it with the content of frame #0, which somehow interacts with everything that happens later :)
I made another change: a no-op filter now simply returns its child/parameter clip unaltered. (E.g. for clip1.ConvertBits(16) we detect that clip1 is already 16 bits and return clip1 itself.) This prevents clip1 from being wrapped in yet another MTGuard and CacheGuard object.
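The shape of that shortcut, sketched with illustrative stand-in types (the real AviSynth Create functions work on AVSValue/PClip, but the idea is the same): the factory inspects its input and, when there is nothing to do, returns the input clip itself instead of constructing a new filter.

```cpp
#include <memory>

// Illustrative stand-ins for the clip hierarchy.
struct Clip {
    int bits = 8;
    virtual ~Clip() = default;
};

struct ConvertBitsFilter : Clip {
    std::shared_ptr<Clip> child;
    ConvertBitsFilter(std::shared_ptr<Clip> c, int target_bits)
        : child(std::move(c)) { bits = target_bits; }
};

// Returns the input unaltered when the requested bit depth already
// matches: no new filter object, and in AviSynth therefore no extra
// MTGuard/CacheGuard wrappers around the input either.
std::shared_ptr<Clip> CreateConvertBits(std::shared_ptr<Clip> clip,
                                        int target_bits) {
    if (clip->bits == target_bits)
        return clip;  // no-op: hand the parameter clip back unaltered
    return std::make_shared<ConvertBitsFilter>(std::move(clip), target_bits);
}
```

The design point is that the decision happens at creation time, once, so the no-op case costs nothing per frame.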
