This library allows sampling from a language model constrained by a context-free grammar.
The library is guaranteed to work with Python 3.11; it has not been tested with other versions. It also requires several Python packages:
```
pip install torch transformers gpustat numpy accelerate llguidance xgrammar scipy matplotlib
```
To run the sampling task, use the following command:
```
python run_task.py grammar_file prompt_file sample_style model
```
where:
- `grammar_file` is a file containing a context-free grammar in a format understood by the `llguidance` library (both `ebnf` and `lark` formats are supported),
- `prompt_file` is a text file containing the prompt,
- `sample_style` is one of the following methods:
  - `rs` (Rejection Sampling),
  - `ars` (Adaptive Rejection Sampling),
  - `rsft` (Rejection Sampling with constrained First Token),
  - `cars` (Constrained Adaptive Rejection Sampling),
  - `prefix` (MCMC-Prefix),
  - `priority` (MCMC-Priority),
  - `restart` (MCMC-Restart),
- `model` is a number between 0 and 3, specifying the model to be used.
The MCMC sampling methods are described in the paper: Constrained Sampling for Language Models Should Be Easy: An MCMC Perspective.
- 0: `hsultanbey/codegen350multi_finetuned` (a small model, suitable for local testing on machines without a GPU)
- 1: `meta-llama/Llama-3.1-8B-Instruct`
- 2: `Qwen/Qwen2.5-7B-Instruct`
- 3: `Qwen/Qwen2.5-14B-Instruct`
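As an illustration, the following snippet creates a minimal grammar and prompt file and shows how the command could then be invoked. The file names and grammar contents are hypothetical, not required by the library:

```shell
# A tiny lark grammar that only admits "yes" or "no".
cat > grammar.lark <<'EOF'
start: "yes" | "no"
EOF

# A one-line prompt file.
cat > prompt.txt <<'EOF'
Answer with yes or no: is the sky blue?
EOF

# Constrained adaptive rejection sampling with the small model (0):
# python run_task.py grammar.lark prompt.txt cars 0
```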
The program outputs the generated sequences to the standard output, along with basic information.
In a more structured form, the generated sequences are also saved to a JSON file located in the `runs_log` folder.
The subfolder name consists of three parts:
- a part derived from the grammar and prompt file locations,
- a hash of the grammar and prompt,
- the model number.
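The exact naming scheme lives in the library's code; the following is only a hypothetical sketch of how the three parts listed above could be assembled into a subfolder name. The function name, hash choice, and separators are assumptions:

```python
import hashlib
import os

def log_subfolder(grammar_path: str, prompt_path: str, model: int) -> str:
    """Illustrative only: the library's actual naming scheme may differ."""
    # Part derived from the grammar and prompt file locations.
    base = "_".join(
        os.path.splitext(os.path.basename(p))[0]
        for p in (grammar_path, prompt_path)
    )
    # Hash of the grammar and prompt contents.
    with open(grammar_path, "rb") as g, open(prompt_path, "rb") as p:
        digest = hashlib.sha256(g.read() + p.read()).hexdigest()[:8]
    # Model number as the final component.
    return f"{base}_{digest}_{model}"
```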
The following environment variables can be used to configure the program:
- `TCFG_LOG_LEVEL`: set to `INFO` or `DEBUG` for more detailed output.
- `HF_HOME`: specifies the path to the folder where language models will be stored.
- `CUDA_VISIBLE_DEVICES`: specifies the GPU number on which the calculations will be run.
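For example, a run with verbose logging, a custom model cache, and a specific GPU might be set up as follows (the cache path is hypothetical):

```shell
# Store downloaded models on a large data disk (path is illustrative).
export HF_HOME=/data/hf_cache
# Enable detailed logging.
export TCFG_LOG_LEVEL=DEBUG
# Run on GPU 1.
export CUDA_VISIBLE_DEVICES=1
# python run_task.py grammar_file prompt_file cars 1
```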
Several parameters can be set within the run_task.py file, including:
- `max_new_tokens`: the maximum number of tokens to generate in a sampled sequence.
- `n_samples`: the number of sequences to generate.
- `n_steps`:
  - for the `rs`, `ars`, `rsft`, and `cars` styles, the program stops after `n_steps` calls to the LLM, even if `n_samples` sequences have not been produced;
  - for the MCMC styles, `n_steps` represents the number of steps `k` (as described in the paper).
- `torch_dtype`: the floating-point data type used for the LLM computations.

It is also easy to add more models from Hugging Face. However, for the MCMC styles, they must also be listed in the `mcmc/lib.py` file.
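In `run_task.py`, these parameters might be set roughly as follows. The values shown are placeholders for illustration, not the library's defaults:

```python
import torch

# Placeholder values: the actual defaults in run_task.py may differ.
max_new_tokens = 128          # cap on tokens generated per sampled sequence
n_samples = 10                # number of sequences to generate
n_steps = 1000                # LLM calls (rs/ars/rsft/cars) or MCMC steps k
torch_dtype = torch.bfloat16  # floating-point dtype for LLM computations
```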