Keys are stored as an FSA (finite state automaton); this is explained here. That means for your example the structure would look like this:
(The first 1, 2, 3 belong to the key; #1, #2, #3 are the values.) In other words, the structure uses prefix compression, which works better the more shared prefixes your keys have. That is why FSAs/FSTs work well for natural language, but not so well for binary or random keys.
Whether this really saves space is highly data dependent. On the one hand, your ngrams probably contain a large number of repetitions, which the FSA can exploit to compress better. On the other hand, your indices will certainly compress the data, but you also need a second structure to resolve the indices back to an ngram, which will cost some runtime performance. I am not even sure how you want to match your ngrams in the second case; this seems quite expensive. I suggest trying it out. Note:
-
In the following case:
how is the KEY stored internally: as 2 int32 values? As a stringified list?
Why am I asking?
When storing ngrams I can use word indices instead of the words themselves, to save space.
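For concreteness, a minimal sketch of the word-index idea. All helper names (`build_vocab`, `encode`, `decode`) are hypothetical, and writing the ids as a space-separated string is just one assumed way of turning them into a key, not something the library prescribes. Note that it needs a second table to map ids back to words.

```python
def build_vocab(ngrams):
    """Assign each distinct word an integer id in order of first appearance."""
    word_to_id = {}
    for ngram in ngrams:
        for word in ngram.split():
            word_to_id.setdefault(word, len(word_to_id))
    # dicts keep insertion order, so position in this list equals the id.
    id_to_word = list(word_to_id)
    return word_to_id, id_to_word


def encode(ngram, word_to_id):
    """Turn an n-gram of words into a key made of word ids."""
    return " ".join(str(word_to_id[w]) for w in ngram.split())


def decode(key, id_to_word):
    """Resolve a key of word ids back to the original n-gram."""
    return " ".join(id_to_word[int(i)] for i in key.split())


ngrams = ["new york city", "new york times", "york city hall"]
word_to_id, id_to_word = build_vocab(ngrams)

encoded = [encode(n, word_to_id) for n in ngrams]
decoded = [decode(k, id_to_word) for k in encoded]

print(encoded)
print(decoded == ngrams)
```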