
[Feature]: Allow sentences longer than the token limit for sequence tagger training #3519

Open
MattGPT-ai opened this issue Aug 3, 2024 · 7 comments · May be fixed by #3520
Labels
feature A new feature

Comments

@MattGPT-ai
Contributor

Problem statement

Currently, we are not able to train SequenceTagger models on tagged Sentence objects that exceed the transformer token limit (typically 512). There is some support for long sentences in embeddings via the allow_long_sentences option, but that does not appear to extend to sequence tagging, where labels still need to be applied at the token level.

We have tried training on such sentences, but if we don't truncate them to the token limit, we get an out-of-memory error. We're not sure whether this is a bug specifically, or just a lack of support for this feature.

Solution

There may be a more elegant approach, but one solution for training is to split a sentence into "chunks" of 512 tokens or fewer and apply the labels to these chunks. It is important that a chunk boundary never falls inside a labeled entity; a rough sketch is shown below.
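
For illustration, here is a sketch of the kind of chunking we have in mind. This is not the code from the PR; the helper name is hypothetical, and it assumes Flair's Sentence can be constructed from a list of token strings and sliced into Span objects, as in recent releases.

```python
from flair.data import Sentence


def chunk_tagged_sentence(sentence: Sentence, max_tokens: int = 512, label_type: str = "ner"):
    """Split a tagged Sentence into chunks of at most max_tokens tokens,
    never cutting through a labeled span (hypothetical helper)."""
    # collect labeled spans as (start, end, label) with 0-based, end-exclusive token indices
    spans = [
        (span.tokens[0].idx - 1, span.tokens[-1].idx, span.get_label(label_type).value)
        for span in sentence.get_spans(label_type)
    ]

    chunks, start, n = [], 0, len(sentence)
    while start < n:
        end = min(start + max_tokens, n)
        # if the cut would fall inside a labeled span, move it back to the span start
        for span_start, span_end, _ in spans:
            if span_start < end < span_end:
                end = span_start
                break
        if end <= start:  # a single span longer than max_tokens: fall back to a hard cut
            end = min(start + max_tokens, n)
        chunk = Sentence([token.text for token in sentence.tokens[start:end]])
        # re-attach labels that fall entirely inside this chunk, shifted to chunk-local indices
        for span_start, span_end, value in spans:
            if start <= span_start and span_end <= end:
                chunk[span_start - start:span_end - start].add_label(label_type, value)
        chunks.append(chunk)
        start = end
    return chunks
```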

Additional Context

We have used this approach successfully in training, so I will be introducing our specific solution in a PR.

@helpmefindaname
Collaborator

Hi @MattGPT-ai,
can you elaborate on what token limit you are referring to, or file a bug report?
The only token limit that should exist is in the TransformerEmbeddings, and you can work around it with the allow_long_sentences option.

If that's not the case, I'd like a reproducible example of that limitation.
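
For reference, allow_long_sentences is a constructor argument of the transformer embeddings; a minimal example (the model name is only a placeholder):

```python
from flair.embeddings import TransformerWordEmbeddings

# allow_long_sentences makes the embedding split over-long inputs into
# overlapping windows internally instead of truncating at the model limit
embeddings = TransformerWordEmbeddings(
    "xlm-roberta-base",
    allow_long_sentences=True,
)
```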

@MattGPT-ai
Contributor Author

I did just confirm that I can successfully use allow_long_sentences=True to train a simple model and pick up entity classes that only occur beyond the transformer token limit of 512. However, our training still eventually fails with an OutOfMemoryError: CUDA out of memory error, and unfortunately I can't share the dataset that triggers it since it contains PII.

I will try a few more things to see if I can reproduce it or narrow down whether there is a particular issue, perhaps a memory leak, or maybe it is simply failing on a particularly large batch.

@MattGPT-ai
Contributor Author

https://gist.github.com/MattGPT-ai/80327ab5854cb0d978d23f205eeae882

Linking to a gist with notebooks that demonstrate success using allow_long_sentences, and an OOM failure that results from increasing the sentence size a bit. So I think that while this utility isn't always necessary, it can be helpful.

@MattGPT-ai
Contributor Author

Would it be possible to refactor the training script so that batching is based on the chunked inputs? It seems like it currently may not get a consistent batch size after chunking. Could you offer any insight here? I also see there is a mini_batch_chunk_size parameter; I'm not sure whether that could help, but I didn't quite understand from the docs what it does. I'm trying to dig further into the source code.

@helpmefindaname
Collaborator

The problem with longer sentences is that they will inevitably require more memory, since the gradient information of all tokens must be kept.
That said, a sentence in your batch that contains 10k (sub-)tokens means more than 20 transformer passes in your TransformerEmbeddings, and you need to fit the corresponding memory requirement.

With sentences that long, you will want to compute only one sentence at a time. For this you can use mini_batch_chunk_size, which allows you to compute fewer sentences in parallel while keeping the batch size as high (also known as gradient accumulation); see the example at the end of this comment.

Notice that you can split the batch by sentences without loss of quality, but you cannot split sentences, as the token embeddings/gradients depend on each other.

In general, I would recommend setting mini_batch_chunk_size to 1 when working with long sentences. If that doesn't work, either your GPU has very little memory and you should consider upgrading, or you have very long texts, in which case you can use a SentenceSplitter to split the long text into real sentences.
Notice that with the latter you need to adjust both the training and the inference code, as the model will only be capable of predicting shorter sentences.
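
As a concrete illustration of the mini_batch_chunk_size setting (assuming a tagger and corpus are already set up; the output path is just a placeholder):

```python
from flair.trainers import ModelTrainer

trainer = ModelTrainer(tagger, corpus)
trainer.fine_tune(
    "resources/taggers/long-sentence-tagger",
    mini_batch_size=32,       # effective batch size used for each optimizer step
    mini_batch_chunk_size=1,  # sentences embedded per forward/backward pass
)
```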

@MattGPT-ai
Contributor Author

I am giving mini_batch_chunk_size a test, and it does look like it resolves the memory issue in our case.

I think that, at the very least, if the chunking function isn't useful, this could be reduced to a function that creates a labeled sentence from a text with character-indexed entities.

As for sentence chunking, I'm still a little unclear on whether there really is no use case, perhaps because I'm confused by the multiple uses of the word "sentence." Say we have very long texts, such as resumes, that each contain many actual sentences. If some of the full resumes do not fit into memory, when would it be invalid to split one into multiple Sentence objects, essentially turning it into multiple samples, so that we don't lose any of the annotated data? Are you saying that the splitting would need to happen at actual sentence boundaries to be valid?

@helpmefindaname
Collaborator

I agree that Sentence is confusing; that naming is simply a leftover from the initial design.

I hope I can clarify what I meant. I will refer to literal sentences as "linguistic sentences" and to the Flair class "Sentence" as "Sentence objects":

You could split your resumes into linguistic sentences using the SentenceSplitter. That means you won't have one Sentence object per resume but multiple smaller ones. For those, the SentenceSplitter adds the next & previous objects as context, so you can use a FLERT-style transformer, TransformerWordEmbeddings(..., use_context=True); see the sketch below.
That way, you require less memory for training but still get predictions at the linguistic-sentence level, with the surrounding linguistic sentences as context.
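
Roughly like this (long_resume_text is a placeholder, and the model choice is only an example):

```python
from flair.splitter import SegtokSentenceSplitter
from flair.embeddings import TransformerWordEmbeddings

# one Sentence object per linguistic sentence instead of one per resume
splitter = SegtokSentenceSplitter()
sentences = splitter.split(long_resume_text)

# FLERT-style embeddings take the surrounding sentences into account as context
embeddings = TransformerWordEmbeddings("xlm-roberta-large", use_context=True)
```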

And yes, this way the splits always fall on actual sentence boundaries, presumably making it easier for the model to learn.

About the labels from char indices:
I had to double-check, as I thought this already existed. It kind of does in the JsonlDataset, but it is not accessible as a general utils function. So here I agree with you that this would be a useful contribution; a rough sketch of what such a helper could look like is below.
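
Something along these lines (the helper and its signature are hypothetical, and it assumes tokens expose character offsets via start_position/end_position as in recent Flair releases):

```python
from flair.data import Sentence


def sentence_from_char_spans(text, entities, label_type="ner"):
    """Build a labeled Sentence from (char_start, char_end, label) triples."""
    sentence = Sentence(text)
    for char_start, char_end, label in entities:
        # tokens fully covered by the character span
        covered = [
            i for i, token in enumerate(sentence.tokens)
            if token.start_position >= char_start and token.end_position <= char_end
        ]
        if covered:
            sentence[covered[0]:covered[-1] + 1].add_label(label_type, label)
    return sentence


# e.g. sentence_from_char_spans("Jane moved to New York in 2019.", [(0, 4, "PER"), (14, 22, "LOC")])
```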
