CU-8698f8fgc: Fix negative sampling including indices for words without a vector #524

Open
wants to merge 4 commits into
base: master

Conversation

mart-r
Collaborator

@mart-r mart-r commented Mar 25, 2025

MedCAT 1.15.0 (more specifically PR #503) removed the unigram table in favour of an approach that doesn't need the massive array that was used before.

However, some details were overlooked during the implementation. Namely, the Vocab keeps track of two dicts: index2word and vec_index2word. The index2word map simply maps the first N non-negative integers (where N is the number of words in the vocab) to the corresponding words. The vec_index2word map does the same, but omits words that don't have a corresponding vector. E.g. it could have the keys [0, 1, 2, 4, 10, 15], skipping the indices of words that have no vector.
Now, when the cumulative frequencies were created, the original PR assumed that vec_index2word also maps all consecutive integers. As such, the position at which a sampled value matched the cumulative probabilities was treated as the word index. In reality, it was only the slot (the position) of that word's index within vec_index2word.
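
To make the mismatch concrete, here is a minimal standalone sketch. It does not use MedCAT's actual code: the words, the weights and the `sample_slot` helper are all made up for illustration, and only the key set [0, 1, 2, 4, 10, 15] comes from the example above.

```python
from bisect import bisect_left
from itertools import accumulate

# Hypothetical toy data: word index -> word, only for words that have a vector.
vec_index2word = {0: "the", 1: "of", 2: "and", 4: "heart", 10: "renal", 15: "liver"}
# Hypothetical sampling weights, one per entry above (in the same order).
weights = [5.0, 4.0, 3.0, 2.0, 1.0, 1.0]

total = sum(weights)
cum_probs = [c / total for c in accumulate(weights)]


def sample_slot(u: float) -> int:
    """Return the position of the first cumulative probability >= u."""
    return bisect_left(cum_probs, u)


# A draw near the top of the distribution lands in the last slot.
slot = sample_slot(0.95)

# The slot is a position in vec_index2word, not a word index.
word_indices = list(vec_index2word)
print(slot, slot in vec_index2word)  # the slot itself is not a vectored index
print(word_indices[slot])            # the word index that was actually sampled
```

With these toy numbers the slot is 5, which is not a key of vec_index2word, while the word that was actually drawn has index 15. Treating the slot as the word index therefore points at a word without a vector.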

In order to fix this, we'll need to remap the slot indices to word indices.

So this PR does exactly that. It creates a mapping from the "slot indices" in vec_index2word to the actual word indices, and then uses that mapping at sampling time to map back to the actual word indices.
The PR also adds a new test that makes sure negative sampling doesn't return indices corresponding to words with no vectors.
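
The remapping idea and the kind of check the new test performs can be sketched roughly as follows. This is an illustration under assumptions, not the code in this PR: `build_sampler`, `get_negative_samples`, `slot2word_index` and the toy weights are hypothetical names and values.

```python
import random
from bisect import bisect_left
from itertools import accumulate


def build_sampler(vec_index2word, weights):
    """Build cumulative probabilities plus a slot -> word index mapping."""
    # Position in this list is the slot; the value is the real word index.
    slot2word_index = list(vec_index2word)
    total = sum(weights)
    cum_probs = [c / total for c in accumulate(weights)]
    return cum_probs, slot2word_index


def get_negative_samples(cum_probs, slot2word_index, n, rng):
    """Draw n word indices, mapping each sampled slot back to a word index."""
    samples = []
    for _ in range(n):
        slot = bisect_left(cum_probs, rng.random())
        # Guard against the last cumulative value rounding to just below 1.0.
        slot = min(slot, len(slot2word_index) - 1)
        samples.append(slot2word_index[slot])
    return samples


# Hypothetical toy vocab: only these word indices have vectors.
vec_index2word = {0: "the", 1: "of", 2: "and", 4: "heart", 10: "renal", 15: "liver"}
weights = [5.0, 4.0, 3.0, 2.0, 1.0, 1.0]  # made-up sampling weights

cum_probs, slot2word_index = build_sampler(vec_index2word, weights)
samples = get_negative_samples(cum_probs, slot2word_index, 1000, random.Random(0))

# The property the new test is after: no sampled index lacks a vector.
assert all(index in vec_index2word for index in samples)
```

Building the mapping once alongside the cumulative probabilities keeps the per-sample cost at a single list lookup on top of the binary search.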

PS:
The GHA workflow for the first commit will fail because it just adds the test (that will now fail).
The second commit provides the actual fix.

@tomolopolis
Member

@mart-r mart-r changed the title CU-8698f8fgc: Add new test to check that the negative sampling indices do not include non-vectored indices CU-8698f8fgc: Fix negative sampling including indices for words without a vector Mar 25, 2025
2 participants