# The Inherent Limits of Pretrained LLMs: The Unexpected Convergence of Instruction Tuning and In-Context Learning Capabilities
[![Arxiv](https://img.shields.io/badge/Arxiv-YYMM.NNNNN-red?style=flat-square&logo=arxiv&logoColor=white)](https://put-here-your-paper.com)
[![License](https://img.shields.io/github/license/UKPLab/arxiv2025-inherent-limits-plms)](https://github.com/UKPLab/arxiv2025-inherent-limits-plms/blob/main/LICENSE)
[![Python Versions](https://img.shields.io/badge/Python-3.9-blue.svg?style=flat&logo=python&logoColor=white)](https://www.python.org/)
[![CI](https://github.com/UKPLab/arxiv2025-inherent-limits-plms/actions/workflows/main.yml/badge.svg)](https://github.com/UKPLab/arxiv2025-inherent-limits-plms/actions/workflows/main.yml)

This repository contains the code for generating the datasets and reproducing the experiments for our paper: [The Inherent Limits of Pretrained LLMs: The Unexpected Convergence of Instruction Tuning and In-Context Learning Capabilities](https://github.com/rochacbruno/python-project-template/).

> **Abstract:** Large Language Models (LLMs), trained on extensive web-scale corpora, have demonstrated remarkable abilities across diverse tasks, especially as they are scaled up. Nevertheless, even state-of-the-art models struggle in certain cases, sometimes failing at problems solvable by young children, indicating that traditional notions of task complexity are insufficient for explaining LLM capabilities. However, exploring LLM capabilities is complicated by the fact that most widely-used models are also "instruction-tuned" to respond appropriately to prompts. With the goal of disentangling the factors influencing LLM performance, we investigate whether instruction-tuned models possess fundamentally different capabilities from base models that are prompted using in-context examples. Through extensive experiments across various model families, scales and task types, which included instruction tuning 90 different LLMs, we demonstrate that the performance of instruction-tuned models is significantly correlated with the in-context performance of their base counterparts. By clarifying what instruction-tuning contributes, we extend prior research into in-context learning, which suggests that base models use priors from pretraining data to solve tasks. Specifically, we extend this understanding to instruction-tuned models, suggesting that their pretraining data similarly sets a limiting boundary on the tasks they can solve, with the added influence of the instruction-tuning dataset.

Contact person: [Irina Bigoulaeva](mailto:ibigoula@gmail.com)

[UKP Lab](https://www.ukp.tu-darmstadt.de/) | [TU Darmstadt](https://www.tu-darmstadt.de/)


## Getting Started

Prepare a new virtual environment:

```bash
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```
We recommend creating a new environment, since we use [vLLM](https://docs.vllm.ai/en/stable/) to speed up model inference; installing vLLM into a preexisting environment may cause incompatibilities.
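
As a quick sanity check (assuming vLLM is pulled in via `requirements.txt`), the following should print the installed vLLM version:

```bash
# Confirm that vLLM imports inside the freshly created environment
# (assumes vLLM is installed by requirements.txt)
python -c "import vllm; print(vllm.__version__)"
```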

The current version of the code is designed to run as individual modules called by Bash scripts. Sample scripts can be found in the `scripts` folder.
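
As an illustration only (the file name below is hypothetical; see the `scripts` folder for the actual versions), such a wrapper script boils down to activating the environment and calling one module with its arguments:

```bash
#!/bin/bash
# Hypothetical wrapper script; the real sample scripts live in the `scripts` folder.
source .venv/bin/activate

# Call one module of the pipeline with its command-line arguments
python3 create_prompts.py --num_datapoints 100 --create_custom_prompts "False" --dataset_type "train+val"
```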


### Running on SLURM (Default)

Refer to the scripts in `scripts/without_slurm` for some sample scripts that can be used. Optionally, other values for `$SLURM_JOB_ID` may be set, as long as the value is unique to the run.
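
For example, when a script is run outside of SLURM, a unique run ID can be supplied by hand (the timestamp scheme and the use of `export` below are just one option; the sample scripts may set the variable differently):

```bash
# Outside SLURM, set the run ID manually; any value that is unique to the run works
export SLURM_JOB_ID="manual_run_$(date +%Y%m%d_%H%M%S)"
python3 create_prompts.py --num_datapoints 100 --create_custom_prompts "False" --dataset_type "train+val"
```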


## Reproducing the Experiments
Our experiments can be reproduced by following the steps in the sections below.

## Dataset Creation

We reproduce the [FLAN dataset](https://arxiv.org/pdf/2109.01652) based on the [code of the authors](https://github.com/google-research/FLAN/tree/main/flan), which was distributed under the Apache 2.0 License. Our implementation is based on [HuggingFace Datasets](https://huggingface.co/docs/hub/en/datasets). For each task in the original FLAN, we found the equivalent in HuggingFace Datasets and reimplemented the preprocessing in the original code to the best of our ability. However, due to the differing data sources, the contents of our version may differ slightly. In all modified files, we designate the areas that were changed from the original.

The data loading is handled by `data_utils.py`. This loads all datasets mentioned in [the FLAN paper](https://arxiv.org/pdf/2109.01652), although we use only a subset of these for our experiments. Please see our paper for more details.

The preprocessing and prompt formatting is handled by `create_prompts.py`. This calls `data_utils.py` to load the data, preprocesses it, and finally formats it into one of two prompt types: *Regular Prompt* or *SampleGen Prompt*.
* *Regular Prompt* corresponds to the prompt formats of the original FLAN, which are included in `prompts/flan_orig.py`.
* *SampleGen Prompt* corresponds to the prompt format that we develop for training our SampleGen models.

The resulting dataset is saved to disk.

Since we instruction-tune our models with many prompt variations, we must create separate datasets for each variation. In all cases, the procedure is the same:

1. Generate individual task datasets
2. Create a training mixture from the individual tasks

For our experiments, the following dataset types must be created:

1. Regular Prompt
2. SampleGen Prompt
3. Gold Pipeline Test

#### 1: Regular Prompt
This is our baseline dataset, following the original FLAN dataset in prompt formatting. Configure `task_list` in `config.py` to choose the subset of tasks to include, then run `create_prompts.py`:

```bash

python3 create_prompts.py --num_datapoints 100 --create_custom_prompts "False" --dataset_type "train+val"

```

The output will be a folder titled `flan_prompts` containing the individual task datasets.

**NOTE:** To generate a dataset as close to the original FLAN dataset as possible, use the full `task_list` in `config.py`.

#### 2: SampleGen Prompt
To create our SampleGen Prompt dataset, call `create_prompts.py` with the following arguments. Importantly, we set `create_custom_prompts=True`, which ensures that the samples will be formatted in the necessary way.

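A sketch of the invocation follows; only `--create_custom_prompts "True"` is prescribed above, and the remaining flags are assumed to mirror the Regular Prompt command:

```bash
# SampleGen Prompt dataset; flags other than --create_custom_prompts are assumed
# to mirror the Regular Prompt example above
python3 create_prompts.py --num_datapoints 100 --create_custom_prompts "True" --dataset_type "train+val"
```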

## Cite

If you find this repository helpful, please cite our paper:

```
@InProceedings{smith:20xx:CONFERENCE_TITLE,
```
