CU-8698f8fgc: Fix negative sampling including indices for words without a vector #524

mart-r · 2025-03-25T15:51:58Z

MedCAT 1.15.0 (more specifically PR #503) removed the unigram table in favour of an approach that doesn't need the massive array that was used before.

However, some details were overlooked during the implementation. Namely, the Vocab keeps track of two dicts: index2word and vec_index2word. The index2word map simply maps the first N (the number of words in the vocab) non-negative integers to the corresponding words. The vec_index2word map does the same, but omits words that don't have a corresponding vector. E.g. it could have keys [0, 1, 2, 4, 10, 15], where the missing indices belong to words without a vector.
Now, when the cumulative frequencies are created, the original PR assumed that vec_index2word also maps all consecutive integers. As such, the position at which a sampled value matched the cumulative probabilities was expected to be the index of the word itself. In reality, it was only the slot (the position) of that index within vec_index2word.
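
To make the mismatch concrete, here is a minimal sketch using the example keys above. This is not the MedCAT code: the counts are made up, and `slot_for` is a hypothetical helper that uses the standard library instead of whatever the real implementation does.

```python
from bisect import bisect_right
from itertools import accumulate

# Toy stand-in for Vocab.vec_index2word: word indices 3, 5-9 and 11-14
# have no vector, so they are absent from the keys.
vec_index2word = {0: "a", 1: "b", 2: "c", 4: "d", 10: "e", 15: "f"}
# Made-up word counts, keyed by word index (an assumption for illustration).
counts = {0: 5, 1: 3, 2: 2, 4: 4, 10: 1, 15: 5}

# Cumulative counts, one entry per *slot* (position in vec_index2word).
cum = list(accumulate(counts[i] for i in vec_index2word))


def slot_for(u):
    """Return the slot whose cumulative-count interval contains u (0 <= u < cum[-1])."""
    return bisect_right(cum, u)


slot = slot_for(10)                       # lands in slot 3
buggy_index = slot                        # treating the slot as a word index
actual_index = list(vec_index2word)[slot]  # the word index stored in that slot

print(slot, buggy_index in vec_index2word, actual_index)
```

Here the slot is 3, but word index 3 has no vector; the word actually stored in that slot has index 4.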

In order to fix this, we'll need to remap the slot indices to word indices.

This PR does exactly that. It creates a mapping from the "slot indices" in vec_index2word to the actual word indices, and then uses that mapping to translate back to the actual word indices at sampling time.
The PR also adds a new test that makes sure negative sampling doesn't return indices corresponding to words with no vectors.
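
The remapping and the accompanying check can be sketched roughly as follows. Again, this is only an illustration under assumptions: `slot2word_index`, `word_index_for` and `get_negative_samples` are hypothetical names, the counts are made up, and the real code may differ in how it builds and draws from the cumulative frequencies.

```python
import random
from bisect import bisect_right
from collections import Counter
from itertools import accumulate

# Same toy data as in the description: only these word indices have vectors.
vec_index2word = {0: "a", 1: "b", 2: "c", 4: "d", 10: "e", 15: "f"}
counts = {0: 5, 1: 3, 2: 2, 4: 4, 10: 1, 15: 5}  # made-up frequencies

# Slot -> actual word index (dicts keep insertion order).
slot2word_index = list(vec_index2word)
cum = list(accumulate(counts[i] for i in slot2word_index))
total = cum[-1]


def word_index_for(u):
    """Map a draw u in [0, total) to a word index via its slot."""
    return slot2word_index[bisect_right(cum, u)]


def get_negative_samples(n, rng):
    """Draw n word indices, weighted by count, only from words with vectors."""
    return [word_index_for(rng.randrange(total)) for _ in range(n)]


samples = get_negative_samples(1000, random.Random(0))
# No sampled index may belong to a word without a vector.
assert set(samples) <= set(vec_index2word)

# Frequencies are respected: every possible draw, taken once, reproduces counts.
tally = Counter(word_index_for(u) for u in range(total))
assert tally == counts
# sum(...) rather than Counter.total(), which only exists from Python 3.10.
assert sum(tally.values()) == total
```

Using integer draws here keeps the frequency check exact; a float-based draw against normalised probabilities would need a tolerance instead.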

PS:
The GHA workflow for the first commit will fail because that commit only adds the test, which fails without the fix.
The second commit provides the actual fix.


tomolopolis · 2025-03-25T15:52:03Z

Task linked: CU-8698f8fgc Fix negative sampling in medcat 1.15.0+


CU-8698f8fgc: Add new test to check that the negative sampling indices do not include non-vectored indices

a837279

CU-8698f8fgc: Add fix for negative sampling including indices for words without a vector

d8b167e

mart-r changed the title ~~CU-8698f8fgc: Add new test to check that the negative sampling indices do not include non-vectored indices~~ CU-8698f8fgc: Fix negative sampling including indices for words without a vector Mar 25, 2025

mart-r added 2 commits March 26, 2025 10:00

CU-8698f8fgc: Update tests to make sure index frequencies are respected

c42d7b2

CU-8698f8fgc: Add 3.9-friendly counter totalling method

5a6c2e5
