CU-86983ruw9 Fix test train split #521

Open · wants to merge 3 commits into master
Conversation

mart-r
Collaborator

@mart-r mart-r commented Feb 26, 2025

In the previous implementation of medcat.utils.data_utils.make_mc_train_test, there were a few issues that could leave the train set empty for small datasets.

For example, setting a seed of 73607120 (via data_utils.set_all_seeds) and running on https://github.com/CogStack/cogstack-model-gateway/blob/main/tests/integration/assets/trainer_export.json (2 documents, 30 annotations) would guarantee an empty train set. This happened for around 40.5% of random seeds.

The underlying issue was in the previous logic. The previous implementation guaranteed that any document containing only rare concepts (i.e. ones with fewer than 10 examples across the entire dataset) would get a chance to be included in the test set (as long as the test size wasn't met). As a result, for small datasets with no concepts that had more than 10 examples, every document had a 90% chance of ending up in the test set (per test_prob = 0.9). The logic behind this seems to have been to populate the test set first, but for small datasets it left the train set empty. And when the train set is empty, CAT.train_supervised_raw fails when determining the training start because there are no documents in the train set.
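To illustrate, here is a simplified sketch of the described bug, not the actual medcat code (function and field names are hypothetical): documents containing only rare concepts were still given a test_prob chance of landing in the test set, so with an unlucky seed a small dataset could send every document there.

```python
import random

def old_style_split(docs, concept_counts, test_prob=0.9):
    """Sketch of the pre-fix behavior (simplified, NOT the actual medcat code)."""
    train, test = [], []
    for doc in docs:
        only_rare = all(concept_counts[c] < 10 for c in doc["concepts"])
        # Bug sketch: a document with only rare concepts still had a
        # test_prob chance of landing in the test set.
        if only_rare and random.random() < test_prob:
            test.append(doc)
        else:
            train.append(doc)
    return train, test

# Two documents whose concepts all have fewer than 10 examples overall.
random.seed(0)
docs = [{"id": 0, "concepts": ["C1"]}, {"id": 1, "concepts": ["C2"]}]
train, test = old_style_split(docs, {"C1": 3, "C2": 4})
print(len(train), len(test))  # with this seed: 0 2 -> empty train set
```

With two documents and test_prob = 0.9, both land in the test set with probability 0.81, which is why so large a share of seeds produced an empty train set.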

What this PR does is the following:

  • Fixes the logic so that it now aligns with the in-code comments
    • I.e. only concepts with 10 or more examples across the dataset are ever considered for the test set
  • Adds the ability to change some thresholds for train-test splitting
    • Allows setting the minimum number of examples a concept needs to be eligible for the test set via min_test_count (defaults to 10)
    • Allows setting the maximum fraction of a concept's examples that can end up in the test set (defaults to 0.3)
  • Adds a few tests to make sure everything works as expected
    • The tests were checked against the master branch
    • The tests.utils.test_data_utils.TestTrainSplitFilteredTestsBase.test_nonempty_train test fails on master with the specified seed
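
The corrected behavior can be sketched roughly as follows. This is a simplified illustration, not the actual implementation; max_test_fraction is a placeholder name, since the PR description gives the same name for both thresholds.

```python
import random

def fixed_style_split(docs, concept_counts, test_prob=0.9,
                      min_test_count=10, max_test_fraction=0.3):
    """Sketch of the corrected behavior (simplified, names hypothetical)."""
    train, test = [], []
    test_counts = {c: 0 for c in concept_counts}
    for doc in docs:
        # Only documents whose concepts all meet the frequency threshold,
        # and whose test-set share is still below the cap, are eligible.
        eligible = all(
            concept_counts[c] >= min_test_count
            and test_counts[c] < max_test_fraction * concept_counts[c]
            for c in doc["concepts"]
        )
        if eligible and random.random() < test_prob:
            test.append(doc)
            for c in doc["concepts"]:
                test_counts[c] += 1
        else:
            train.append(doc)
    return train, test

# Same rare-concept dataset as before: no concept reaches min_test_count,
# so every document stays in the train set regardless of the seed.
random.seed(73607120)
docs = [{"id": 0, "concepts": ["C1"]}, {"id": 1, "concepts": ["C2"]}]
train, test = fixed_style_split(docs, {"C1": 3, "C2": 4})
print(len(train), len(test))  # 2 0 -> train set is never empty here
```

Under this logic the train set can no longer end up empty when all concepts are rare; instead the test set does, which is the trade-off discussed in the PS below.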

PS:
These changes mean that it's now far more likely that the test set will end up empty for small datasets. That is because small datasets (such as the one linked above) do not have any concepts with 10 or more examples, so no documents are ever added to the test set.

PPS:
This issue will probably not have had much effect on most real-world applications, since any real project would have a lot more documents and at least some concepts with 10+ examples across the dataset.

@tomolopolis
Member

Task linked: CU-86983ruw9 Fix test/train splitting

@mart-r mart-r changed the title from "CU-86983ruw9 Fix test trian split" to "CU-86983ruw9 Fix test train split" on Mar 31, 2025
Member

@tomolopolis tomolopolis left a comment


lgtm

@tomolopolis
Member

Does the metrics side of things error or warn when a given concept appears in the train set but not in the test set?

@mart-r
Collaborator Author

mart-r commented Mar 31, 2025

Does the metrics side of things error or warn when a given concept appears in the train set but not in the test set?

I don't think so.
As I understand it, the original intention was to only include concepts in the test set that have a high enough number of occurrences in the entire dataset, so I don't think a warning would have been emitted when something was excluded from the test set. With that said, I haven't explicitly tested this.
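
If such a check were needed, it could be done on the caller's side. A minimal sketch (the helper name and document shape are hypothetical, not part of medcat):

```python
def missing_from_test(train_docs, test_docs):
    """Return concepts seen in the train set but never in the test set.

    Hypothetical caller-side check: per the discussion above, the
    splitter itself does not warn about this.
    """
    train_concepts = {c for d in train_docs for c in d["concepts"]}
    test_concepts = {c for d in test_docs for c in d["concepts"]}
    return sorted(train_concepts - test_concepts)

print(missing_from_test(
    [{"concepts": ["C1", "C2"]}, {"concepts": ["C2"]}],
    [{"concepts": ["C2"]}],
))  # ['C1']
```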
