Conditioning guide


AlexGW edited this page Apr 16, 2025 · 5 revisions

How conditioning works.

Autoregressive inference lets us directly modify the model's outputs before they are passed back to the decoder to predict the next token.
We can take advantage of this by identifying the patterns in the numbering that precede the point where the model makes a "mistake" or simply deviates from a specific numbering scheme (for example, one defined by the IMGT germlines).
By using regex matches to identify the beginning of the antibody sequence, the CDR2 gap, or the start and end of the CDR3, we can modify the next token at the moment of the incorrect prediction and pass the corrected sequence to the decoder.
This process was used to number shark VNAR sequences according to an IMGT definition that consistently places the large gap, or deletion, in the CDR2 region.
Once the key rules of conditioning are worked out, the modified inference step can rapidly apply them to many sequences and generate large datasets of "corrected" data that can train new models or fine-tune existing models.
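
The idea above can be illustrated with a minimal sketch. This is not the ANARCII API: `conditioned_decode`, `toy_model` and the rule format are hypothetical names invented for illustration, and the toy "model" simply counts upward. It only shows the general mechanism of checking the output so far against regex rules and overriding the predicted token before it is fed back to the decoder. See the notebook linked below for the actual workflow.

```python
import re
from typing import Callable, List, Sequence, Tuple

# Hypothetical rule format: (regex tested against the output so far, token to force next).
Rule = Tuple["re.Pattern[str]", str]


def conditioned_decode(
    predict_next: Callable[[List[str]], str],
    rules: Sequence[Rule],
    max_len: int,
    end_token: str = "<EOS>",
) -> List[str]:
    """Greedy autoregressive loop with regex-triggered token overrides."""
    output: List[str] = []
    while len(output) < max_len:
        # 1. Let the model propose the next token from what has been emitted so far.
        token = predict_next(output)
        # 2. If the emitted tokens match a rule, replace the proposal with the forced token.
        history = " ".join(output)
        for pattern, forced in rules:
            if pattern.search(history):
                token = forced
                break
        if token == end_token:
            break
        # 3. The (possibly corrected) token becomes part of the decoder input for the next step.
        output.append(token)
    return output


def toy_model(prefix: List[str]) -> str:
    """Stand-in for a real decoder: counts upward from 1 and stops after 10."""
    nxt = int(prefix[-1]) + 1 if prefix else 1
    return str(nxt) if nxt <= 10 else "<EOS>"


# Assumed example rule: once the output ends in "3 4", force a jump to "7",
# mimicking a numbering scheme that always places a gap at a fixed position.
rules = [(re.compile(r"(?:^| )3 4$"), "7")]

unconditioned = conditioned_decode(toy_model, [], max_len=20)
conditioned = conditioned_decode(toy_model, rules, max_len=20)
print(unconditioned)
print(conditioned)
```

Because the override happens inside the decoding loop, the model continues from the corrected token rather than from its own prediction, which is what allows a small set of rules to be applied consistently across many sequences.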

Follow the steps outlined in the notebook provided: https://github.com/oxpig/ANARCII/blob/main/notebook/conditioning.ipynb