hfst-ospell leaks memory #27

Open
snomos opened this issue Nov 22, 2016 · 3 comments
snomos commented Nov 22, 2016

It seems that hfst-ospell is leaking memory. When run against a list of 12.5k typos, it starts out at about 130 MB of memory usage and ends up at almost 800 MB before it is done. The memory use increases slowly throughout the processing. See the attached screenshots of the memory consumption at different intervals. The zhfst file used is attached (rename *.zip to *.zhfst).

[Screenshot: memory-leak-early]

[Screenshot: memory-leak-late]

Tested against latest hfst-ospell code on macOS Sierra 10.12.1.

se.zip

snomos commented Nov 22, 2016

The typos file used as input can be found here.

Traubert self-assigned this Nov 29, 2016
Traubert commented

For some reason I can't get Valgrind to report memory leaks usefully for ospell, but the leaks it does report are small overall. I think this isn't exactly a leak, but an effect of the cache slowly building up. The error model and lexicon are rich enough in this case that a cache entry averages tens of megabytes, and as the correction process encounters new first symbols, the cache gets bigger and bigger. E.g. you should see different behaviour if you first sort the typos vs. if you only have one type per first character.

Perhaps the system should have some sort of cache priority order, bumping off older cache entries once the cache grows beyond some size?
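A minimal sketch of what such a priority-ordered cache could look like, assuming the cache is keyed by the first input symbol and stores the computed correction data for that symbol. The names `SymbolNumber`, `CacheEntry`, and `FirstSymbolCache` are hypothetical stand-ins, not hfst-ospell's actual types:

```cpp
// Hypothetical size-capped, LRU-evicting cache keyed by the first symbol.
#include <cstddef>
#include <cstdint>
#include <list>
#include <unordered_map>
#include <utility>
#include <vector>

using SymbolNumber = uint16_t;          // assumed symbol type
using CacheEntry   = std::vector<int>;  // stand-in for the cached correction data

class FirstSymbolCache {
public:
    explicit FirstSymbolCache(std::size_t max_entries) : max_entries_(max_entries) {}

    // Returns a pointer to the cached entry, or nullptr on a miss.
    CacheEntry* find(SymbolNumber first) {
        auto it = index_.find(first);
        if (it == index_.end()) {
            return nullptr;
        }
        // Move the key to the front of the recency list (most recently used).
        order_.splice(order_.begin(), order_, it->second.second);
        return &it->second.first;
    }

    // Inserts (or overwrites) an entry, evicting the least recently used
    // one once the cap is exceeded.
    void insert(SymbolNumber first, CacheEntry entry) {
        auto it = index_.find(first);
        if (it != index_.end()) {
            it->second.first = std::move(entry);
            order_.splice(order_.begin(), order_, it->second.second);
            return;
        }
        order_.push_front(first);
        index_.emplace(first, std::make_pair(std::move(entry), order_.begin()));
        if (index_.size() > max_entries_) {
            SymbolNumber victim = order_.back();  // least recently used key
            index_.erase(victim);
            order_.pop_back();
        }
    }

private:
    std::size_t max_entries_;
    std::list<SymbolNumber> order_;  // front = most recently used
    std::unordered_map<SymbolNumber,
                       std::pair<CacheEntry, std::list<SymbolNumber>::iterator>> index_;
};
```

The list-plus-map combination keeps lookup and eviction O(1); the cap bounds the cache footprint at the cost of recomputing corrections for evicted first symbols.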

unhammer commented Oct 27, 2017

Yeah, if I grep for only the words starting with a, memory usage stays consistent at ~126000, at least up to 100k words.

On regular input, memory usage flattens out as input grows, which is also consistent with the only "leak" being the cache on first symbols.

I tried e.g. clearing some random entries (whether .empty or not) on every correction, which gives slower memory growth at the cost of some speed loss (50% slower on 10k words when removing 3 random entries). Adding a use counter and strictly capping the cache at 10 first-symbol entries with plain LRU, the runtime is 118 s vs. 50 s, with 486 MB max RES vs. 712 MB max RES on a run of 10k words. With zero cache, we still use 414 MB on 10k words, so compared to that, capping it at 10 first symbols doesn't seem too bad. (I also tried an LRU of 2 random entries, but there was basically no speed difference, probably because the list of cached entries is so short.)

However, I'm pretty sure .clear() can't be clearing everything, since sending 10k words-starting-with-a has a much lower memory usage – anyone have an idea why? https://github.com/hfst/hfst-ospell/compare/master...unhammer:LRU-cache?expand=1
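One possible, unverified explanation, sketched under the assumption that the cache entries hold std::vector buffers: std::vector::clear() destroys the elements but keeps the allocated capacity, so resident memory does not shrink until shrink_to_fit() or the swap-with-a-temporary idiom is used. A minimal illustration:

```cpp
// Demonstrates that clear() alone does not release a vector's buffer.
#include <cstdio>
#include <vector>

int main() {
    std::vector<int> cache_entry(10000000, 42);   // ~40 MB of data
    cache_entry.clear();                          // size is 0, capacity unchanged
    std::printf("after clear: size=%zu capacity=%zu\n",
                cache_entry.size(), cache_entry.capacity());

    std::vector<int>().swap(cache_entry);         // swap idiom actually frees the buffer
    std::printf("after swap:  size=%zu capacity=%zu\n",
                cache_entry.size(), cache_entry.capacity());
    return 0;
}
```

Another possibility is simply that the allocator does not return freed pages to the OS, so RES stays high even after the memory is logically freed.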
