
Optimize maximum filter #203

Open · wants to merge 1 commit into master
Conversation


@vincent-163 vincent-163 commented Aug 15, 2019

The line:

local_max = maximum_filter(arr2D, footprint=neighborhood) == arr2D
is taking up around 80% of the processing time (a rough estimate), which makes it the main bottleneck.

The maximum filter loops over all positions and, for each one, takes the maximum of all values in a diamond-like area centered at that position. The time complexity of the naive implementation is O(N*M*W^2), where N is the number of time windows, M is the number of frequency bins, and W is the neighborhood size (PEAK_NEIGHBORHOOD_SIZE = 20).
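For context, here is a minimal sketch of how I understand the current footprint is built and applied (assuming the generate_binary_structure / iterate_structure pattern from the scipy docs; the spectrogram shape is made up for illustration):

```python
import numpy as np
from scipy.ndimage import generate_binary_structure, iterate_structure, maximum_filter

PEAK_NEIGHBORHOOD_SIZE = 20  # existing constant in the fingerprinting code

# connectivity-1 base element is a plus sign; iterating it grows a diamond-like footprint
struct = generate_binary_structure(2, 1)
neighborhood = iterate_structure(struct, PEAK_NEIGHBORHOOD_SIZE)

arr2D = np.random.rand(1025, 2048)  # stand-in for a spectrogram (freq bins x time windows)

# True wherever a point equals the maximum over its diamond neighborhood
local_max = maximum_filter(arr2D, footprint=neighborhood) == arr2D
```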

Using a data structure known as a monotonic queue, it should be possible to reduce the time complexity to linear time, i.e. O(N*M), dividing the filtering time by up to W^2 = 400.
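To illustrate the idea, here is a 1-D sliding-window maximum sketch using a monotonic deque (just the concept, not dejavu code): each element enters and leaves the deque at most once, so the whole pass is linear regardless of the window size.

```python
from collections import deque

def sliding_window_max(xs, w):
    """Maximum of every length-w window of xs, in O(len(xs)) total time."""
    dq = deque()  # holds indices; the values at those indices are kept in decreasing order
    out = []
    for i, x in enumerate(xs):
        # drop smaller elements from the back; they can never be a window maximum again
        while dq and xs[dq[-1]] <= x:
            dq.pop()
        dq.append(i)
        # drop the front index once it slides out of the window
        if dq[0] <= i - w:
            dq.popleft()
        if i >= w - 1:
            out.append(xs[dq[0]])
    return out

# sliding_window_max([1, 3, 2, 5, 4, 1], 3) -> [3, 5, 5, 5]
```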

The algorithm is already implemented in scipy as scipy.ndimage.filters.maximum_filter1d. By applying it twice, first along the X axis and then along the Y axis, we get the maximum over a rectangular window. The drawback is that this filter is similar to, but not identical with, the original one (a square footprint instead of the diamond), and the impact on accuracy is unknown. It is technically possible to compute the original diamond-footprint maximum filter in linear time as well, but that is harder to implement.
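A minimal sketch of the separable version I have in mind (the function name and the square radius are my own; it assumes the same PEAK_NEIGHBORHOOD_SIZE constant):

```python
import numpy as np
from scipy.ndimage import maximum_filter1d

PEAK_NEIGHBORHOOD_SIZE = 20  # same constant as in the existing code

def square_local_max(arr2D, radius=PEAK_NEIGHBORHOOD_SIZE):
    size = 2 * radius + 1
    # running maximum along each axis in turn; O(N*M) total, independent of window size
    m = maximum_filter1d(arr2D, size=size, axis=0)
    m = maximum_filter1d(m, size=size, axis=1)
    # True wherever a point equals the maximum of its (size x size) square neighborhood
    return m == arr2D
```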

A simple benchmark shows that the time to scan all 5 files in the mp3 folder dropped from 2m55s to only 40s, roughly a 4.5x improvement. Note that the number of fingerprints dropped from 494217 to 353834 because of the different maximum filter; the impact on matching accuracy is unknown. I expect some further improvement from removing the binary erosion, but on my machine the bottleneck has already moved from Python to MySQL. For indexing purposes it might be better to use an embedded database such as RocksDB to fully squeeze out performance.

@omertoptas

but on my machine it seems that the bottleneck has already moved from Python to MySQL

I am pretty sure that the real bottleneck is the part where the hashes are written to the MySQL database.

@vincent-163
Author

but on my machine it seems that the bottleneck has already moved from Python to MySQL

I am pretty sure that the real bottleneck is the part where the hashes are written to the MySQL database.

Tests on my machine show that about 80% of the time is spent on this particular step. If that is not the case for you, feel free to benchmark it and share your results.

If your results differ, the only explanation I can think of is that you're running the database on a magnetic disk; you should use an SSD instead. While an SSD is only about 4x faster in terms of sequential bandwidth, it can be up to 100x faster in terms of random I/O.

@mauricio-repetto
Contributor

mauricio-repetto commented Oct 10, 2019

I think it is enough to change the filter shape: using struct = generate_binary_structure(2, 2) instead of struct = generate_binary_structure(2, 1), my times were comparable to the ones you are presenting, @vincent-163, with your changes. And in my case I believe the same filter shape is preserved for both the maximum filter and the erosion operation, which your current approach does not do.
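A minimal sketch of the change I mean, assuming the structuring element is built the same way as in the current fingerprinting code:

```python
from scipy.ndimage import generate_binary_structure, iterate_structure

PEAK_NEIGHBORHOOD_SIZE = 20  # existing constant

# connectivity 2 makes the base element a full 3x3 square, so the iterated
# footprint stays square (closer to the separable rectangular maximum filter)
# and the same footprint can be reused for the binary erosion step
struct = generate_binary_structure(2, 2)
neighborhood = iterate_structure(struct, PEAK_NEIGHBORHOOD_SIZE)
```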

@mauricio-repetto
Contributor

@worldveil is there any reason for using, in the maximum_filter call, the kernel from the scipy docs example for iterate_structure?
And although it does not take much time, I'm still not sure what the purpose of the erosion operation is, or how its result is meant to combine with the maximum filter's.

@mauricio-repetto
Contributor

@vincent-163 I've added the changes I mentioned to my PR here: #205, which is a PR I created to make dejavu available for Python 3.6.6 and to support both PostgreSQL and MySQL. I've added a couple more features as well; feel free to try it and let me know your thoughts.

Thanks,
Mauricio.
