Skip to content
This repository was archived by the owner on Jul 28, 2025. It is now read-only.

Conversation

mart-r
Copy link
Collaborator

@mart-r mart-r commented Apr 2, 2025

Improve abstraction for RelCAT tokenizer and models:

  • Rely only on abstractions within rel_cat
  • Move implementations to new packages and modules for each type
  • Import implementations dynamically to avoid when necessary

mart-r and others added 15 commits March 24, 2025 11:36
* CU-8698ek477: Add TODO to MetaCAT ML utils regarding AdamW import

* CU-8698ek477: Fix AdamW import (trf->torch)
…ut a vector (#524)

* CU-8698f8fgc: Add new test to check that the negative sampling indices do not include non-vectored indices

* CU-8698f8fgc: Add fix for negative sampling including indices for words without a vector

* CU-8698f8fgc: Update tests to make sure index frequencies are respected

* CU-8698f8fgc: Add 3.9-friendly counter totalling method
* CU-8698gqumv: Add tests for Vocab upon regression testing

* CU-8698gqumv: Fix regression time vocab data
* CU-86983ruw9: Fix train-test splitter leaving train set empty for smaller datasets

* CU-86983ruw9: Add additional optional arguments to test-train splitting for minimum concept count and maximum test fraction

* CU-86983ruw9: Add a few tests for test-train splitting
* CU-8698hfkch: Add eval method to deid model

* CU-8698hfkch: lint checks

---------

Co-authored-by: Tom Searle <[email protected]>
@mart-r
Copy link
Collaborator Author

mart-r commented Apr 2, 2025

Ran it through GHA workflow on my fork as well, just in case:
mart-r#52

Copy link
Member

@tomolopolis tomolopolis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Possible to merge the Bert and ModernBERT model implementations?

from medcat.utils.relation_extraction.models import Base_RelationExtraction


class LlamaModel_RelationExtraction(Base_RelationExtraction):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why the underscore in the name, just BaseRelationExtraction and LlamaRelationExtraction look fine?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just moved the classes around - didn't change the names. That's what the convention has been since #173

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

modernBERT should be pretty much a drop in replacement for BERT? if using AutoTokenizer and AutoModel, AutoConfig, to pull the relevant pretrained_model_name_or_path = "answerdotai/ModernBERT-base"

There's some differences in positional embeddings representations etc. but shouldn't need the 2 modules and 200 or so extra lines.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's quite a bit of stuff that's really similar, but some things that aren't exactly the same.
But the we discussed this with Vlad the other week was that I'd make some abstraction changes and he'd try and address some of the code duplication.

@mart-r
Copy link
Collaborator Author

mart-r commented Apr 10, 2025

Just an FYI as well, I should have probably just updated this PR, but the next one is probably more relevant:
#530

@vladd-bit vladd-bit merged commit 255334d into relation_extraction_llama Apr 16, 2025
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants