CU-8698f8fgc: Fix negative sampling including indices for words without a vector #524
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
MedCAT 1.15.0 (more specifically PR #503) removed the unigram table in favour of an approach that doesn't need the massive array that was used before.
However, during the implementation, some details were overlooked. Namely, the
Vocab
keeps track of twodict
s. One forindex2word
and another forvec_index2word
. Theindex2word
map just maps the firstN
(the number of words in the vocab) postive integers to the corresponding word. Thevec_index2word
map does the same, however it omits words that don't have a corresponding vector. E.g it could have keys[0, 1, 2, 4, 10, 15]
- missing indices that don't have a corresponding word vector.Now, when the cumulative frequencies are created, the original PR made the assumption that
vec_index2word
also maps all consecutive integers. And as such, the index at which the cumulative probabilities were found to match was expected to be the index for each word. However, in reality it was the slot for said index invec_index2word
.In order to fix this, we'll need to remap the slot indices to word indices.
So what this PR does is exactly that. It creates a mapping from the "slot indices" in the to
vec_index2word
to the actual word indices. And then subsequently uses the mapping to map back to the actual word indices at sampling time.The PR also adds a new test for this as well. Something that makes sure that negative sampling doesn't return indices corresponding the words with no vectors.
PS:
The GHA workflow for the first commit will fail because it just adds the test (that will now fail).
The second commit provides the actual fix.