Evaluating Program Semantics Reasoning with Type Inference in System F
If you find this work useful, please cite us as:
@inproceedings{he2025tfbench,
author = {He, Yifeng and Yang, Luning and Gonzalo, Christopher and Chen, Hao},
title = {Evaluating Program Semantics Reasoning with Type Inference in System F},
booktitle = {Neural Information Processing Systems (NeurIPS)},
date = {2025-11-30/2025-12-07},
address = {San Diego, CA, USA},
}
We use Python 3.12. We recommend using uv to manage your Python dependencies.
cd TF-Bench
uv sync # create a virtual environment, and install dependencies
To run evaluation, you need GHC (the Glasgow Haskell Compiler) installed. We recommend using ghcup to install.
curl --proto '=https' --tlsv1.2 -sSf https://get-ghcup.haskell.org | sh
Due to the GHC dependency, the evaluation module currently supports only Linux and macOS. Our evaluation requires the Haskell language extensions for type operators and impredicative polymorphism, so GHC version >= 9.2.1 is required. Our evaluation used GHC 9.6.7.
This script will build the benchmark (Prelude with NL) from the raw data.
uv run scripts/preprocess_benchmark.py -o tfb.json
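Optionally, you can sanity-check the generated file before rewriting it. This is a minimal sketch; it only assumes that `tfb.json` is a single JSON document at the top level.
# Optional sanity check on the generated benchmark file.
import json

with open("tfb.json") as f:
    tasks = json.load(f)

# Works whether the top level is a list of tasks or a mapping of task ids.
print(f"Loaded {len(tasks)} entries from tfb.json")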
git clone https://github.com/SecurityLab-UCD/alpharewrite.git
cd alpharewrite
stack build
stack exec alpharewrite-exe 1 ../tfb.json > ../tfb.pure.json
cd ..
For details, please check out the README of alpharewrite.
You can also use TF-Bench via HuggingFace datasets.
from datasets import load_dataset
split = "pure" # or "base"
dataset = load_dataset("SecLabUCD/TF-Bench", split=split)
Or through our provided package.
from tfbench import load_tfb_from_hf
dataset = load_tfb_from_hf(split)
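Either way, you can inspect the loaded split before running any model. This is a small sketch, assuming the loaded object behaves like a HuggingFace `Dataset` (supports `len` and integer indexing); the exact field names depend on the released dataset schema.
from tfbench import load_tfb_from_hf

dataset = load_tfb_from_hf("pure")
print(len(dataset))   # number of benchmark tasks
print(dataset[0])     # inspect the fields of the first task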
git clone https://github.com/SecurityLab-UCD/TF-Bench.git
cd TF-Bench
uv sync
Please have your API key ready in `.env`.
We use each provider's official SDK to access their models.
You can check our pre-supported models in the `tfbench.lm` module.
from tfbench.lm import supported_models
print(supported_models)
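As a quick check, you can test whether a given model name is pre-supported before launching a run. This is a small sketch, assuming only that `supported_models` supports membership tests.
from tfbench.lm import supported_models

model = "gpt-5-2025-08-07"
if model in supported_models:
    print(f"{model} will use a pre-supported SDK client")
else:
    print(f"{model} will fall back to the HuggingFace route")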
To run a single model, which evaluates both the `base` and `pure` splits:
uv run src/main.py -m gpt-5-2025-08-07
We originally used Ollama to manage and run the OSS models reported in the Appendix, but have since switched to vLLM for better performance and SDK design. The Ollama option is still available but is no longer maintained; we recommend using vLLM instead.
curl -fsSL https://ollama.com/install.sh | sh # install ollama, you need sudo for this
ollama serve # start your own instance instead of a system service
NOTE: Ollama version 0.9.0 or later is required to enable the thinking parsers. We used 0.11.7 for our experiments.
> ollama --version
ollama version is 0.11.7
Run the benchmark.
uv run src/main.py -m llama3:8b
We also support running any model on the HuggingFace Hub out of the box. Below is an example using Qwen3.
uv run src/main.py Qwen/Qwen3-4B-Instruct-2507 # or other models
Note that our `main.py` uses a pre-defined model router, which routes all unrecognized model names to HuggingFace.
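Conceptually, the routing works as sketched below; this is an illustrative outline under our own naming, not the actual `main.py` code.
def route(model_name: str, supported: set[str]) -> str:
    """Illustrative name-based router: pre-supported names go to their
    provider SDK, everything else is treated as a HuggingFace checkpoint."""
    if model_name in supported:
        return "provider-sdk"
    return "huggingface"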
We use the `</think>` token to parse the thinking process; if your model does it differently, please see [Supporting other customized models].
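For reference, a minimal sketch of this kind of parsing is shown below, assuming the model emits its reasoning before a `</think>` delimiter; the helper name is ours, not part of the package.
def split_thinking(text: str) -> tuple[str, str]:
    """Split a raw completion into (reasoning, answer) on the </think> token."""
    if "</think>" in text:
        thinking, answer = text.split("</think>", 1)
        return thinking.strip(), answer.strip()
    return "", text.strip()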
To support your customized model, you can pass the path to your HuggingFace-compatible checkpoint to our `main.py`.
uv run src/main.py <path to your checkpoint>
Our package is also available on PyPI.
uv add tfbench
Or install it directly with pip:
pip install tfbench
Our supported model list is used to route the model name to the correct SDK. Even if a newly released model is not in our supported model list, you can still use it by specifying the SDK client directly. We take OpenAI GPT-4.1 as an example here.
from tfbench.lm import OpenAIResponses
from tfbench import run_one_model
model = "gpt-4.1"
pure = True
client = OpenAIResponses(model_name=model, pure=pure, effort=None)
eval_result = run_one_model(client, pure=pure)
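The same pattern extends to evaluating several models in one script; the model names below are placeholders for illustration.
from tfbench import run_one_model
from tfbench.lm import OpenAIResponses

for model in ["gpt-4.1", "gpt-4.1-mini"]:
    client = OpenAIResponses(model_name=model, pure=True, effort=None)
    result = run_one_model(client, pure=True)
    print(model, result)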
Implementing an `LM` instance is all you need.
from tfbench import run_one_model
from tfbench.lm._types import LM, LMAnswer

class YourLM(LM):
    def __init__(self, model_name: str, pure: bool = False):
        """Initialize your model."""
        super().__init__(model_name=model_name, pure=pure)
        ...

    def _gen(self, prompt: str) -> LMAnswer:
        """Your generation logic here: produce the final answer and,
        optionally, the reasoning trace."""
        return LMAnswer(answer=content, reasoning_steps=thinking_content)

client = YourLM("xxx")
eval_result = run_one_model(client)
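As a more concrete (hypothetical) example, the sketch below wraps a local transformers pipeline and splits its output on `</think>`; the pipeline call, generation settings, and model name are assumptions for illustration, not part of tfbench.
from transformers import pipeline
from tfbench import run_one_model
from tfbench.lm._types import LM, LMAnswer

class LocalPipelineLM(LM):
    def __init__(self, model_name: str, pure: bool = False):
        super().__init__(model_name=model_name, pure=pure)
        # Load a local text-generation pipeline for the given checkpoint.
        self._pipe = pipeline("text-generation", model=model_name)

    def _gen(self, prompt: str) -> LMAnswer:
        text = self._pipe(prompt, max_new_tokens=512)[0]["generated_text"]
        # Separate the optional reasoning trace from the final answer.
        thinking, sep, answer = text.partition("</think>")
        if not sep:
            thinking, answer = "", text
        return LMAnswer(answer=answer.strip(), reasoning_steps=thinking.strip())

client = LocalPipelineLM("Qwen/Qwen3-4B-Instruct-2507")
eval_result = run_one_model(client)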