Storing ngrams #259

vsraptor · 2022-07-26T23:58:01Z

In the following case:


kv.Set('[1,2]','kwy-str-list')
kv.Get('[1,2]').GetValue()
'kwy-str-list'

How is the KEY stored internally: as 2 int32 values? As a stringified list?

Why am I asking?

When storing ngrams I can use word-idxs instead of the words themselves ... to save space


kv.Set('word1',1)
kv.Set('word2',2)
kv.Set('word3',3)

#2-gram
kv.Set('[1,3]',253)
# ... instead of 
kv.Set('word1:word3',253)

hendrikmuhs (Maintainer) · 2022-07-27T06:34:13Z

Keys are stored as an FSA (finite state automaton); this is explained here.

That means for your example the structure would look like this:

w->o->r->d->1     # 1
          \->2     # 2
           \->3    # 3

(the first 1, 2, 3 belong to the keys; the # 1, # 2, # 3 are the values)

In other words: the structure uses prefix compression, which works better the more shared prefixes you have. That's why FSAs/FSTs work well for natural language, but not so well for binary/random keys.
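
To make the prefix sharing concrete, here is a minimal sketch in plain Python. It is not keyvi code: `count_trie_nodes` is a hypothetical helper that builds an uncompressed character trie and counts its nodes. A real FSA also merges shared suffixes, so this only illustrates the prefix part.

```python
def count_trie_nodes(keys):
    """Build a character trie from keys and return its node count (root excluded)."""
    root = {}
    nodes = 0
    for key in keys:
        node = root
        for ch in key:
            if ch not in node:
                node[ch] = {}
                nodes += 1  # a new node is only created where the prefix diverges
            node = node[ch]
    return nodes


# Keys with a common prefix share the path w->o->r->d.
shared_keys = ["word1", "word2", "word3"]
# Hand-picked "random-looking" keys of the same length with no common prefix.
random_keys = ["qzk7p", "m3xva", "b9wtr"]

shared_nodes = count_trie_nodes(shared_keys)
random_nodes = count_trie_nodes(random_keys)

print(shared_nodes, "nodes for", sum(len(k) for k in shared_keys), "characters")
print(random_nodes, "nodes for", sum(len(k) for k in random_keys), "characters")
```

Both key sets contain 15 characters, but the keys with the shared prefix need 7 nodes while the keys without any shared prefix need all 15.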

> When storing ngrams I can use word-idxs instead of the words themselves ... to save space

Whether this really saves space is highly data dependent. On one side, your ngrams probably contain a large number of repetitions, which the FSA can use to compress better. On the other side, your indices will for sure compress the data, but you also need a 2nd structure to resolve the indices back to an ngram, which will also cost some runtime performance. I am not even sure how you want to match your ngrams in the 2nd case; this seems quite expensive.
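
The trade-off can be sketched in plain Python, with dicts standing in for the dictionaries. All names here (`word_to_idx`, `idx_to_word`, `encode_ngram`, `decode_ngram`) are hypothetical and not part of the keyvi API. The point is the extra structure and the extra lookups that the index approach needs.

```python
# Plain dicts stand in for the dictionaries; this is not keyvi code.
word_to_idx = {"word1": 1, "word2": 2, "word3": 3}
# Second structure, needed only by the index approach, to resolve indices back to words.
idx_to_word = {idx: word for word, idx in word_to_idx.items()}


def encode_ngram(words):
    """Turn a list of words into an index key such as '[1,3]'."""
    return "[" + ",".join(str(word_to_idx[w]) for w in words) + "]"


def decode_ngram(key):
    """Resolve an index key such as '[1,3]' back to its words."""
    return [idx_to_word[int(i)] for i in key.strip("[]").split(",")]


# Approach 1: words in the key, one lookup per ngram.
by_words = {"word1:word3": 253}
# Approach 2: indices in the key, one lookup per word plus one per ngram.
by_index = {encode_ngram(["word1", "word3"]): 253}

print(by_words["word1:word3"])
key = encode_ngram(["word1", "word3"])
print(key, by_index[key], decode_ngram(key))
```

Matching an ngram in the second approach first requires translating every word to its index, which is where the extra runtime cost comes from.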

I suggest trying it out. Note:

- Use realistic datasets. Comparing just e.g. the top-10k keys won't give you good results, because compression works better the more data you have.
- If you compare it against the indexed ngrams approach, ensure that this index uses a comparable data structure that is not using the heap but memory mapping, like keyvi.
- keyvi uses memory mapping. This is super effective, and you can create dictionaries much larger than the RAM you have available. That also means size isn't that important, because the size of a dictionary does not translate directly into RAM usage.
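
As a rough illustration of what memory mapping means, the sketch below uses Python's standard `mmap` module on a small throwaway file. It does not show keyvi internals; the file name and contents are made up for the example.

```python
import mmap
import os
import tempfile

# Write a small file that stands in for a compiled dictionary.
path = os.path.join(tempfile.mkdtemp(), "example.bin")
with open(path, "wb") as f:
    f.write(b"word1word2word3")

with open(path, "rb") as f:
    # Map the file read-only. The OS pages data in on access instead of
    # the process reading the whole file onto its heap.
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
        chunk = mapped[5:10]  # slicing copies just these bytes out of the mapping

print(chunk)
```

With a large file, only the pages that are actually touched need to be resident, and the OS can evict them again under memory pressure.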
