In the previous implementation of `medcat.utils.data_utils.make_mc_train_test`, a few issues could leave the train set empty for small datasets. For example, if one set (via `data_utils.set_all_seeds`) a seed of 73607120 and ran on https://github.com/CogStack/cogstack-model-gateway/blob/main/tests/integration/assets/trainer_export.json (2 documents, 30 annotations), an empty train set was guaranteed. This was the case for around 40.5% of random seeds.

The underlying issue was in the previous logic. The previous implementation guaranteed that any document containing only rare concepts (i.e. ones with fewer than 10 examples across the entire dataset) would get a chance to be included in the test set, as long as the target test size wasn't met. What that meant was that for small datasets with no concepts that had more than 10 examples, every document had a 90% chance of ending up in the test set (as per `test_prob = 0.9`). The logic behind this seems to have been to populate the test set first, but for small datasets it left the train set empty. And when the train set is empty, `CAT.train_supervised_raw` fails while getting the training start, because there are no documents in the train set.

What this PR does is the following:
- `min_test_count` (defaults to 10)
- `min_test_count` (defaults to 0.3)
- On the `master` branch, the `tests.utils.test_data_utils.TestTrainSplitFilteredTestsBase.test_nonempty_train` test would fail with the specified seed

PS:
These changes mean that it is now far more likely that the test set will come back empty for small datasets. That is because small datasets (such as the one I've linked to above) do not have any concepts with more than 10 examples, and thus their documents don't get added to the test set.
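The filtering behaviour described here might look roughly like this. This is a hypothetical sketch — the function name and signature are my assumptions, not MedCAT's actual API; only the threshold-of-10 idea comes from this PR:

```python
# Hypothetical sketch of the new eligibility check (names are assumptions,
# not MedCAT's actual API): a document only qualifies for the test set if
# at least one of its concepts has min_test_count or more examples across
# the whole dataset.
def eligible_for_test(doc_concepts, concept_counts, min_test_count=10):
    return any(concept_counts[c] >= min_test_count for c in doc_concepts)

# Every concept rare -> no document qualifies, so the test set stays empty.
rare_counts = {"C1": 5, "C2": 5}
print(eligible_for_test(["C1", "C2"], rare_counts))  # False
print(eligible_for_test(["C1"], {"C1": 12}))         # True
```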
PPS:
This issue will probably not have had much effect on most real-world applications, since any real project would have a lot more documents and at least some concepts with 10+ examples across the dataset.
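For reference, the flawed old selection step can be illustrated with a toy reimplementation. This is a loose, hypothetical sketch — not MedCAT's actual code; the function name, the size cap, and the `test_size` handling are simplified assumptions, and only `test_prob = 0.9` and the rare-concept threshold of 10 come from the description above:

```python
import random

# Simplified, hypothetical sketch of the old selection step (not the actual
# MedCAT implementation): every document whose concepts are all "rare"
# (fewer than 10 examples across the dataset) is offered to the test set
# with probability test_prob = 0.9, until the test set is full.
# test_size=1.0 is chosen here purely to expose the mechanism.
def old_style_split(docs, concept_counts, test_size=1.0, test_prob=0.9, rng=random):
    total = sum(len(anns) for anns in docs.values())
    train, test, test_count = [], [], 0
    for name, anns in docs.items():
        only_rare = all(concept_counts[c] < 10 for c in anns)
        if (only_rare
                and test_count + len(anns) <= test_size * total
                and rng.random() < test_prob):
            test.append(name)
            test_count += len(anns)
        else:
            train.append(name)
    return train, test

# Two documents, 30 annotations, every concept rare: each document lands in
# the test set with probability ~0.9, so the train set is frequently empty.
docs = {"doc1": ["C1"] * 5 + ["C2"] * 5 + ["C3"] * 5,
        "doc2": ["C4"] * 5 + ["C5"] * 5 + ["C6"] * 5}
counts = {c: 5 for c in ("C1", "C2", "C3", "C4", "C5", "C6")}
empty = sum(not old_style_split(docs, counts, rng=random.Random(s))[0]
            for s in range(10_000))
print(f"seeds with an empty train set: {empty / 10_000:.0%}")
```

This toy version empties the train set for roughly 0.9² ≈ 81% of seeds rather than the observed 40.5%, presumably because the real size cap and counting details differ, but the failure mode — documents with only rare concepts flooding the test set — is the same.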