hfst-ospell leaks memory #27

Open
snomos opened this issue Nov 22, 2016 · 3 comments
snomos commented Nov 22, 2016

It seems that hfst-ospell is leaking memory. When run against a list of 12.5k typos, it starts out at about 130 MB of memory usage and ends up at almost 800 MB before it is done. The memory use increases slowly throughout the processing. See the attached screenshots of the memory consumption at different intervals. The zhfst file used is attached (rename *.zip to *.zhfst).

[Screenshot: memory-leak-early]

[Screenshot: memory-leak-late]

Tested against latest hfst-ospell code on macOS Sierra 10.12.1.

se.zip

snomos commented Nov 22, 2016

The typos file used as input can be found here.

Traubert self-assigned this Nov 29, 2016
Traubert commented

For some reason I can't get Valgrind to report memory leaks usefully for ospell, but the leaks it does report are small overall. I think this isn't exactly a leak, but an effect of the cache slowly building up. The error model and lexicon are rich enough in this case that a cache entry averages tens of megabytes, and as the correction process encounters new first symbols, the cache gets bigger and bigger. E.g. you should see different behaviour if you first sort the typos vs. if you only have one type per first character.

Perhaps the system should have some sort of cache priority order, bumping off older cache entries once the cache grows beyond some size?
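A minimal sketch of what such a priority-ordered cache could look like, assuming the cache is keyed by the first input symbol and stores the computed correction data for that symbol. The names `SymbolNumber`, `CacheEntry`, and `FirstSymbolCache` are hypothetical stand-ins, not hfst-ospell's actual types:

```cpp
// Hypothetical size-capped, LRU-evicting cache keyed by the first symbol.
#include <cstddef>
#include <cstdint>
#include <list>
#include <unordered_map>
#include <utility>
#include <vector>

using SymbolNumber = uint16_t;          // assumed symbol type
using CacheEntry   = std::vector<int>;  // stand-in for the cached correction data

class FirstSymbolCache {
public:
    explicit FirstSymbolCache(std::size_t max_entries) : max_entries_(max_entries) {}

    // Returns a pointer to the cached entry, or nullptr on a miss.
    CacheEntry* find(SymbolNumber first) {
        auto it = index_.find(first);
        if (it == index_.end()) {
            return nullptr;
        }
        // Move the key to the front of the recency list (most recently used).
        order_.splice(order_.begin(), order_, it->second.second);
        return &it->second.first;
    }

    // Inserts (or overwrites) an entry, evicting the least recently used
    // one once the cap is exceeded.
    void insert(SymbolNumber first, CacheEntry entry) {
        auto it = index_.find(first);
        if (it != index_.end()) {
            it->second.first = std::move(entry);
            order_.splice(order_.begin(), order_, it->second.second);
            return;
        }
        order_.push_front(first);
        index_.emplace(first, std::make_pair(std::move(entry), order_.begin()));
        if (index_.size() > max_entries_) {
            SymbolNumber victim = order_.back();  // least recently used key
            index_.erase(victim);
            order_.pop_back();
        }
    }

private:
    std::size_t max_entries_;
    std::list<SymbolNumber> order_;  // front = most recently used
    std::unordered_map<SymbolNumber,
                       std::pair<CacheEntry, std::list<SymbolNumber>::iterator>> index_;
};
```

The list-plus-map combination keeps lookup and eviction O(1); the cap bounds the cache footprint at the cost of recomputing corrections for evicted first symbols.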

unhammer commented Oct 27, 2017

Yeah, if I grep for only the words starting with a, memory usage stays consistent at ~126000, at least up to 100k words.

On regular input, memory usage flattens out as input grows, which is also consistent with the only "leak" being the cache on first symbols.

I tried e.g. clearing some random entries (whether .empty or not) on every correction, which gives slower memory growth at the cost of some speed loss (50% slower on 10k words when removing 3 random entries). Adding a use counter and strictly capping the cache at 10 first-symbol entries with plain LRU, the runtime is 118 s vs. 50 s, with 486 MB max RES vs. 712 MB max RES on a run of 10k words. With zero cache, we still use 414 MB on 10k words, so compared to that, capping it at 10 first symbols doesn't seem too bad. (I also tried an LRU of 2 random entries, but there was basically no speed difference, probably because the list of cached entries is so short.)

However, I'm pretty sure .clear() can't be clearing everything, since sending 10k words-starting-with-a has a much lower memory usage – anyone have an idea why? https://github.com/hfst/hfst-ospell/compare/master...unhammer:LRU-cache?expand=1
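One possible, unverified explanation, sketched under the assumption that the cache entries hold std::vector buffers: std::vector::clear() destroys the elements but keeps the allocated capacity, so resident memory does not shrink until shrink_to_fit() or the swap-with-a-temporary idiom is used. A minimal illustration:

```cpp
// Demonstrates that clear() alone does not release a vector's buffer.
#include <cstdio>
#include <vector>

int main() {
    std::vector<int> cache_entry(10000000, 42);   // ~40 MB of data
    cache_entry.clear();                          // size is 0, capacity unchanged
    std::printf("after clear: size=%zu capacity=%zu\n",
                cache_entry.size(), cache_entry.capacity());

    std::vector<int>().swap(cache_entry);         // swap idiom actually frees the buffer
    std::printf("after swap:  size=%zu capacity=%zu\n",
                cache_entry.size(), cache_entry.capacity());
    return 0;
}
```

Another possibility is simply that the allocator does not return freed pages to the OS, so RES stays high even after the memory is logically freed.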
