CU-86983ruw9 Fix test train split #521

Open · wants to merge 3 commits into master
Conversation

mart-r
Collaborator

@mart-r mart-r commented Feb 26, 2025

In the previous implementation of medcat.utils.data_utils.make_mc_train_test, there were a few issues that could leave the train set empty for small datasets.

For example, setting a seed of 73607120 (via data_utils.set_all_seeds) and running on https://github.com/CogStack/cogstack-model-gateway/blob/main/tests/integration/assets/trainer_export.json (2 documents, 30 annotations) would guarantee an empty train set. This happened for around 40.5% of random seeds.

The underlying issue was in the previous logic. The previous implementation guaranteed that any document containing only rare concepts (i.e. ones with fewer than 10 examples across the entire dataset) would get a chance to be included in the test set (as long as the test size wasn't met). As a result, for small datasets with no concepts that had more than 10 examples, every document had a 90% chance of ending up in the test set (per test_prob = 0.9). The logic behind this seems to have been to populate the test set first, but for small datasets it left the train set empty. And when the train set is empty, CAT.train_supervised_raw fails when determining the training start because there are no documents in the train set.
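To illustrate, here is a simplified sketch of the described bug, not the actual medcat code (function and field names are hypothetical): documents containing only rare concepts were still given a test_prob chance of landing in the test set, so with an unlucky seed a small dataset could send every document there.

```python
import random

def old_style_split(docs, concept_counts, test_prob=0.9):
    """Sketch of the pre-fix behavior (simplified, NOT the actual medcat code)."""
    train, test = [], []
    for doc in docs:
        only_rare = all(concept_counts[c] < 10 for c in doc["concepts"])
        # Bug sketch: a document with only rare concepts still had a
        # test_prob chance of landing in the test set.
        if only_rare and random.random() < test_prob:
            test.append(doc)
        else:
            train.append(doc)
    return train, test

# Two documents whose concepts all have fewer than 10 examples overall.
random.seed(0)
docs = [{"id": 0, "concepts": ["C1"]}, {"id": 1, "concepts": ["C2"]}]
train, test = old_style_split(docs, {"C1": 3, "C2": 4})
print(len(train), len(test))  # with this seed: 0 2 -> empty train set
```

With two documents and test_prob = 0.9, both land in the test set with probability 0.81, which is why so large a share of seeds produced an empty train set.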

What this PR does is the following:

  • Fixes the logic so that it now aligns with the in-code comments
    • I.e. only concepts with 10 or more examples across the dataset are ever considered for the test set
  • Adds the ability to change some thresholds for train-test splitting
    • Allows setting the minimum number of examples a concept needs to be eligible for the test set via min_test_count (defaults to 10)
    • Allows setting the maximum fraction of a concept's examples that can end up in the test set (defaults to 0.3)
  • Adds a few tests to make sure everything works as expected
    • The tests were checked against the master branch
    • The tests.utils.test_data_utils.TestTrainSplitFilteredTestsBase.test_nonempty_train test fails on master with the specified seed
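
The corrected behavior can be sketched roughly as follows. This is a simplified illustration, not the actual implementation; max_test_fraction is a placeholder name, since the PR description gives the same name for both thresholds.

```python
import random

def fixed_style_split(docs, concept_counts, test_prob=0.9,
                      min_test_count=10, max_test_fraction=0.3):
    """Sketch of the corrected behavior (simplified, names hypothetical)."""
    train, test = [], []
    test_counts = {c: 0 for c in concept_counts}
    for doc in docs:
        # Only documents whose concepts all meet the frequency threshold,
        # and whose test-set share is still below the cap, are eligible.
        eligible = all(
            concept_counts[c] >= min_test_count
            and test_counts[c] < max_test_fraction * concept_counts[c]
            for c in doc["concepts"]
        )
        if eligible and random.random() < test_prob:
            test.append(doc)
            for c in doc["concepts"]:
                test_counts[c] += 1
        else:
            train.append(doc)
    return train, test

# Same rare-concept dataset as before: no concept reaches min_test_count,
# so every document stays in the train set regardless of the seed.
random.seed(73607120)
docs = [{"id": 0, "concepts": ["C1"]}, {"id": 1, "concepts": ["C2"]}]
train, test = fixed_style_split(docs, {"C1": 3, "C2": 4})
print(len(train), len(test))  # 2 0 -> train set is never empty here
```

Under this logic the train set can no longer end up empty when all concepts are rare; instead the test set does, which is the trade-off discussed in the PS below.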

PS:
These changes mean that it's now far more likely that the test set will end up empty for small datasets. That is because small datasets (such as the one linked above) do not have any concepts with 10 or more examples, so no documents are ever added to the test set.

PPS:
This issue will probably not have had much effect on most real-world applications, since any real project would have a lot more documents and at least some concepts with 10+ examples across the dataset.

@tomolopolis
Member

Task linked: CU-86983ruw9 Fix test/train splitting

@mart-r mart-r changed the title from "CU-86983ruw9 Fix test trian split" to "CU-86983ruw9 Fix test train split" on Mar 31, 2025
Member

@tomolopolis tomolopolis left a comment


lgtm

@tomolopolis
Member

Does the metrics side of things error or warn when a given concept appears in the train set but not in the test set?

@mart-r
Collaborator Author

mart-r commented Mar 31, 2025

Does the metrics side of things error or warn when a given concept appears in the train set but not in the test set?

I don't think so.
As I understand it, the original intention was to only include concepts in the test set that have a high enough number of occurrences in the entire dataset, so I don't think a warning would have been emitted when something was excluded from the test set. With that said, I haven't explicitly tested this.
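
If such a check were needed, it could be done on the caller's side. A minimal sketch (the helper name and document shape are hypothetical, not part of medcat):

```python
def missing_from_test(train_docs, test_docs):
    """Return concepts seen in the train set but never in the test set.

    Hypothetical caller-side check: per the discussion above, the
    splitter itself does not warn about this.
    """
    train_concepts = {c for d in train_docs for c in d["concepts"]}
    test_concepts = {c for d in test_docs for c in d["concepts"]}
    return sorted(train_concepts - test_concepts)

print(missing_from_test(
    [{"concepts": ["C1", "C2"]}, {"concepts": ["C2"]}],
    [{"concepts": ["C2"]}],
))  # ['C1']
```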
