Serious slowdown when filter calls GetFrame(0) in its constructor #476

@pinterf

Description

We came across this problem during the updated ConvertToPlanarRGB tests.
Theoretically, the ConvertBits call in this script is a no-op, since no bit-depth conversion is performed.

However, when it calls GetFrame(0) in its constructor for frame-property detection, a serious slowdown occurs. When I removed the constructor-initiated GetFrame(0), the speed went back to normal. As a side note: ConvertToPlanarRGB calls a similar GetFrame(0), but that one has no slowing effect.

Blankclip(60000,width=1600, height=1600, pixel_type="YUV444P8")
ConvertToPlanarRGB(bits=8)
ConvertBits(8) # tested with and without this line.

The bit depth was kept consistent through the whole chain: 8, 16 and 32 bits (float) were tested.

Benchmarks suggested that memory-cache size might also be involved in the problem. (Differences under 1% are just measurement noise; AVSMeter64 was used to collect the fps data.)

# X,Y: without ConvertBits(n), with ConvertBits(n) [fps]. ConvertBits constructor calls GetFrame(0)
# size:      400x400       600x600       800x800      1200x1200    1600x1600    2000x2000
# 8 bits     27376,27000   13200,13100   7600,7600    3351,3280    1720,1509    837,795
# 16 bits    24000,24000   11235,11200   6600,6100    1854,1420     671,599     380,353
# float      13900,14000    6300,5900    2600,1800     626,545      285,270     178,171

The blank clip is a single frame, generated once; it then goes into the cache and stays there. Verified: we always get back the very same memory address for that precalculated frame (BlankClip and ColorBars behave the same).

An in-constructor GetFrame(0) call is the usual way of obtaining clip-wide properties such as the color matrix, the full/limited range setting, etc., assuming that such properties found in frame #0 won't change across the whole clip. The precalculations and the dispatched functions can then be chosen once, during filter instance creation.
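The pattern looks roughly like this. A minimal, self-contained sketch: `Clip`, `Frame`, `MatrixAwareFilter` and the default value are hypothetical stand-ins for AviSynth's real PClip/PVideoFrame API, chosen only to illustrate the constructor-time probe.

```cpp
#include <map>
#include <memory>
#include <string>

// Hypothetical stand-ins for AviSynth's PVideoFrame / PClip.
struct Frame {
    std::map<std::string, int> props;  // frame properties, e.g. "_Matrix"
};

struct Clip {
    virtual ~Clip() = default;
    virtual std::shared_ptr<Frame> GetFrame(int n) = 0;
};

// Probes frame #0 once in the constructor and caches the clip-wide
// property, so the per-frame path never has to look it up again.
class MatrixAwareFilter : public Clip {
public:
    explicit MatrixAwareFilter(std::shared_ptr<Clip> child)
        : child_(std::move(child)) {
        // Assumes frame #0 is representative of the whole clip.
        std::shared_ptr<Frame> f0 = child_->GetFrame(0);
        auto it = f0->props.find("_Matrix");
        matrix_ = (it != f0->props.end()) ? it->second : 6;  // 6: illustrative default
    }
    std::shared_ptr<Frame> GetFrame(int n) override {
        return child_->GetFrame(n);  // a real filter would convert using matrix_
    }
    int matrix() const { return matrix_; }
private:
    std::shared_ptr<Clip> child_;
    int matrix_;
};
```

It is exactly this one-time probe of the child that, as described below, turned out to have surprising side effects on runtime performance.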

Who is the culprit? The Avisynth frame-caching system? An unfortunate memory layout? Mutexes waiting on each other?

Bad frame physical addresses? I had read that cache lines and address lookup/translation can depend on magic alignment boundaries in processors. I logged the read and write pointers of ConvertToPlanarRGB across the different runs, but some thousand runs (automated, logged and analyzed) did not support this explanation.

At this point my free AI access ran out :) I quickly subscribed to my first AI month :)

I was advised to try GetFrame(100) instead. And yes, the slowdown immediately disappeared.

I reverted to GetFrame(0) and, following another piece of advice, logged the first couple of GetReadPtr and GetWritePtr pointers in ConvertToPlanarRGB, this time within the same session. In the quick case (without ConvertBits) all addresses were identical. In the slow cases the GetWritePtr pointers alternated between two addresses. When ConvertBits's GetFrame(0) was removed, the addresses became identical again.

Those addresses came from env->NewVideoFrame, which in turn obtains them from the buffer-reusing helper, the so-called Frame Registry.
Frame Registry "allocations" are fast; mostly they are not allocations at all. The mechanism avoids OS reallocations by handing back an unused (reference count = 0) frame or video frame buffer. Unused frames and buffers are not freed immediately, only when memory runs short.
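The reuse mechanism can be modelled like this. A toy sketch, not the actual AviSynth internals: a request first scans for a registered buffer whose reference count has dropped to zero and hands it back; only when none qualifies does it fall through to a real allocation.

```cpp
#include <cstddef>
#include <memory>
#include <vector>

// Toy model of the buffer-reusing "Frame Registry": buffers stay
// registered after release and are handed out again once unreferenced.
struct Buffer {
    std::vector<unsigned char> data;
    int refcount = 0;
};

class FrameRegistry {
public:
    Buffer* Acquire(std::size_t size) {
        for (auto& b : pool_) {
            if (b->refcount == 0 && b->data.size() >= size) {
                b->refcount = 1;  // reuse: no OS allocation happens
                return b.get();
            }
        }
        pool_.push_back(std::make_unique<Buffer>());  // real allocation
        pool_.back()->data.resize(size);
        pool_.back()->refcount = 1;
        return pool_.back().get();
    }
    void Release(Buffer* b) { --b->refcount; }  // buffer stays registered
    std::size_t PoolSize() const { return pool_.size(); }
private:
    std::vector<std::unique_ptr<Buffer>> pool_;
};
```

Acquiring, releasing and acquiring again returns the very same buffer address, which is why the single cached blank frame normally keeps one stable address.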

So in the slow case our write pointer flip-flopped between two addresses on even/odd frames. I analyzed the timings and the reference counts, and even implemented a last-released-first-reserved video-frame-buffer reuse logic, but it still alternated between two addresses. (By the way: why is it slower to write into address X on even frames and address Y on odd frames, instead of always writing into X? That is subject to further investigation. Cache eviction?)
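The alternation itself is easy to reproduce in a toy model. The sketch below (illustrative names, simplified logic, not AviSynth code) shows that if the previous output frame is still referenced, e.g. held by a cache, while the next one is being written, a refcount-zero reuse pool can never hand back the same buffer twice in a row, so the writes necessarily ping-pong between two addresses.

```cpp
#include <memory>
#include <vector>

struct Buf { int refcount = 0; };

// Minimal refcount==0 reuse pool, as in the Frame Registry model.
class Pool {
public:
    Buf* Acquire() {
        for (auto& b : bufs_)
            if (b->refcount == 0) { b->refcount = 1; return b.get(); }
        bufs_.push_back(std::make_unique<Buf>());
        bufs_.back()->refcount = 1;
        return bufs_.back().get();
    }
    void Release(Buf* b) { --b->refcount; }
private:
    std::vector<std::unique_ptr<Buf>> bufs_;
};

// Produce `frames` frames while a cache keeps the previous frame alive
// until the next one exists; returns the write addresses observed.
std::vector<Buf*> Run(int frames) {
    Pool pool;
    std::vector<Buf*> addrs;
    Buf* held = nullptr;                // the frame the cache still holds
    for (int n = 0; n < frames; ++n) {
        Buf* cur = pool.Acquire();      // previous frame not yet released
        addrs.push_back(cur);
        if (held) pool.Release(held);   // cache now drops the older frame
        held = cur;
    }
    return addrs;
}
```

Under this model the observed even/odd pattern falls out directly: frame 0 gets buffer X, frame 1 must get a fresh buffer Y, frame 2 gets X back, and so on.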

So far our good AI had mainly been helpful in establishing what was not causing the problem.

The codebase is too large and complex to hand over to an AI in full.

For the final solution I eliminated the premature creation of per-filter AvsCache instances during the filter instantiation phase. Calling child->GetFrame(0) constructs a Cache object for the child (in our case, for ConvertToPlanarRGB) and fills it with the content of frame #0, which somehow interacts with everything that happens later :)
I made another change: a no-op filter now simply returns its child/parameter clip unaltered. (E.g. for clip1.ConvertBits(16) we detect that clip1 is already 16 bits and return clip1 itself.) This prevents clip1 from being wrapped in yet another MTGuard and CacheGuard object.
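The shape of that shortcut, sketched with illustrative stand-in types (the real AviSynth Create functions work on AVSValue/PClip, but the idea is the same): the factory inspects its input and, when there is nothing to do, returns the input clip itself instead of constructing a new filter.

```cpp
#include <memory>

// Illustrative stand-ins for the clip hierarchy.
struct Clip {
    int bits = 8;
    virtual ~Clip() = default;
};

struct ConvertBitsFilter : Clip {
    std::shared_ptr<Clip> child;
    ConvertBitsFilter(std::shared_ptr<Clip> c, int target_bits)
        : child(std::move(c)) { bits = target_bits; }
};

// Returns the input unaltered when the requested bit depth already
// matches: no new filter object, and in AviSynth therefore no extra
// MTGuard/CacheGuard wrappers around the input either.
std::shared_ptr<Clip> CreateConvertBits(std::shared_ptr<Clip> clip,
                                        int target_bits) {
    if (clip->bits == target_bits)
        return clip;  // no-op: hand the parameter clip back unaltered
    return std::make_shared<ConvertBitsFilter>(std::move(clip), target_bits);
}
```

The design point is that the decision happens at creation time, once, so the no-op case costs nothing per frame.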
