Commit e69cd61

Add llm-blender
1 parent df34523 commit e69cd61


99 files changed: +10365 -1 lines changed


.github/dependabot.yml

Lines changed: 6 additions & 0 deletions

@@ -0,0 +1,6 @@

```yaml
version: 2
updates:
  - package-ecosystem: 'github-actions'
    directory: '/'
    schedule:
      interval: 'daily'
```

.github/workflows/release.yml

Lines changed: 26 additions & 0 deletions

@@ -0,0 +1,26 @@

```yaml
name: Release

on:
  push:
    tags:
      - "v[0-9].[0-9]+.[0-9]+*"

jobs:
  release-on-pypi:
    runs-on: ubuntu-latest

    steps:
      - name: Checkout
        uses: actions/checkout@v4

      - name: Install Hatch
        run: pip install hatch

      - name: Build
        run: hatch build

      - name: Publish on PyPi
        env:
          HATCH_INDEX_USER: __token__
          HATCH_INDEX_AUTH: ${{ secrets.PYPI_API_TOKEN }}
        run: hatch publish -y
```

.github/workflows/test.yml

Lines changed: 45 additions & 0 deletions

@@ -0,0 +1,45 @@

```yaml
name: Test

on:
  push:
    branches:
      - main
  pull_request:

concurrency:
  group: test-${{ github.head_ref }}
  cancel-in-progress: true

env:
  PYTHONUNBUFFERED: "1"
  FORCE_COLOR: "1"
  HF_API_TOKEN: ${{ secrets.HF_API_TOKEN }}

jobs:
  run:
    name: Python ${{ matrix.python-version }} on ${{ startsWith(matrix.os, 'macos-') && 'macOS' || startsWith(matrix.os, 'windows-') && 'Windows' || 'Linux' }}
    runs-on: ${{ matrix.os }}
    strategy:
      fail-fast: false
      matrix:
        os: [ubuntu-latest, windows-latest, macos-12]
        python-version: ["3.8", "3.9", "3.10", "3.11", "3.12"]

    steps:
      - name: Support longpaths
        if: matrix.os == 'windows-latest'
        run: git config --system core.longpaths true

      - uses: actions/checkout@v4

      - name: Set up Python ${{ matrix.python-version }}
        uses: actions/setup-python@v5
        with:
          python-version: ${{ matrix.python-version }}

      - name: Install Hatch
        run: pip install --upgrade hatch

      - name: Lint
        if: matrix.python-version == '3.9' && runner.os == 'Linux'
        run: hatch run lint:all
```

README.md

Lines changed: 105 additions & 1 deletion

@@ -1 +1,105 @@

(The previous one-line README, `# llm-blender`, is replaced by the content below.)

# LLM-Blender

- LLM-Blender is an ensembling framework designed to achieve consistently superior performance by combining the outputs of multiple large language models (LLMs). This work focuses on integrating LLM-Blender with Retrieval-Augmented Generation (RAG) pipelines to significantly improve the quality of generated text.

- LLM-Blender is a two-stage ensemble learning framework. In the first stage (ranking), pairwise comparisons of the candidates are performed and the candidates are ranked. In the second stage (fusing), the top K candidates are merged to produce the final output.

- LLM-Blender consists of two modules: the PairRanker and the GenFuser. The PairRanker module compares the outputs from multiple LLMs to provide the top-ranked outputs. It compares each candidate with the input in a pairwise manner, making it robust to subtle differences in the generated text. The GenFuser module uses the top-ranked outputs from the PairRanker module to generate an improved output. It fuses the top K of the N ranked candidates from the PairRanker, conditioned on the input instruction, to generate an enhanced output.

- A custom Haystack component, `LLMBlenderRanker`, has been implemented to integrate LLM-Blender with Haystack pipelines. The component utilizes the `PairRanker` module from the LLM-Blender framework, which compares each candidate with the input in a pairwise manner. Different LLMs can generate subtly different texts, since they are trained on different datasets and tasks. By comparing each text in a pairwise manner, the component ranks and ensembles the texts so that it is robust to these subtle differences.

- Haystack RAG pipelines using the LLM-Blender component to ensemble LLMs were evaluated on the BillSum and MixInstruct datasets using three metrics: BARTScore, BLEURT, and BERTScore. The `llama-3`, `phi-3`, `mistral-7b`, `openchat-3.5`, `starling-lm-7b-alpha` and `openhermes-2.5` LLMs were used in the ensemble.

## PairRanker

- The PairRanker module is responsible for comparing and ranking the outputs from the LLMs. During the ranking stage, a specific input prompt (x) is passed to N different LLMs, and their outputs are compiled as candidates ($y_1$, …, $y_N$).

- The PairRanker then analyzes and ranks these candidates. For each input x, the candidates obtained from the N different LLMs, together with the input sequence, are passed to a cross-attention text encoder, such as RoBERTa. The text encoder learns to determine the superior candidate for the given input x.

- All the candidates are paired ($y_i$ and $y_j$), producing a matrix of pairwise comparison results. Each pair is evaluated on the condition: given the input prompt, which candidate's output is better? By aggregating the results in the matrix, the PairRanker ranks all candidates and takes the top K of them for generative fusion (see the sketch below).

<img src="plots/blender.png" alt="LLM-Blender framework" align="middle" height=250>
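
The aggregation described above can be illustrated with a short sketch. The `pairwise_better` callable below is a hypothetical stand-in for the trained PairRanker encoder; only the win-counting aggregation over the comparison matrix is shown, not the actual LLM-Blender implementation.

```python
# Illustrative sketch of the pairwise aggregation, not the actual PairRanker code.
# `pairwise_better(x, y_i, y_j)` is assumed to return True when candidate y_i is
# judged better than y_j for the input prompt x.
from itertools import permutations
from typing import Callable, List


def rank_candidates(
    x: str,
    candidates: List[str],
    pairwise_better: Callable[[str, str, str], bool],
    top_k: int = 3,
) -> List[str]:
    # wins[i] counts how many pairwise comparisons candidate i wins.
    wins = [0] * len(candidates)
    for i, j in permutations(range(len(candidates)), 2):
        if pairwise_better(x, candidates[i], candidates[j]):
            wins[i] += 1
    # Rank candidates by aggregated wins and keep the top K for generative fusion.
    order = sorted(range(len(candidates)), key=lambda i: wins[i], reverse=True)
    return [candidates[i] for i in order[:top_k]]
```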

## GenFuser

- The primary goal of the GenFuser module is to capitalize on the strengths of the top K candidates selected by the PairRanker.

- After the PairRanker module ranks the candidates, the GenFuser module fuses the top K of the N ranked candidates to generate an improved final output. It takes a seq2seq approach, fusing the set of top candidates while conditioning on the input prompt, to generate an enhanced output (see the sketch below).
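
A minimal sketch of this fusion step, assuming a generic Hugging Face seq2seq checkpoint (the model name below is a placeholder, not necessarily the released GenFuser weights):

```python
# Illustrative fusion sketch: concatenate the instruction with the top-K ranked
# candidates and let a seq2seq model generate a fused answer.
from typing import List

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "google/flan-t5-base"  # placeholder checkpoint, not the official GenFuser
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)


def fuse(instruction: str, top_candidates: List[str], max_new_tokens: int = 128) -> str:
    # Condition on the input instruction and the ranked candidates.
    prompt = instruction + "\n" + "\n".join(
        f"Candidate {i + 1}: {c}" for i, c in enumerate(top_candidates)
    )
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True)
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```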

## RAG Pipeline with the LLM-Blender component

The outputs from the different LLMs on the MixInstruct dataset are ranked and combined using the LLM-Blender framework, as sketched below.

<br>
<img src="plots/ranker_pipeline_single_llm.png" alt="RAG pipeline with the LLMBlenderRanker component" align="middle" height=100>
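
As a rough sketch of this flow (not the exact pipeline code in this repository), candidate answers from several generators can be wrapped as `GeneratedAnswer` objects and ranked with the `LLMBlenderRanker`. The model names, the `task` argument, and the `LLMBlenderRanker` import path below are assumptions:

```python
# Illustrative sketch: generate candidates with two Haystack generators, wrap them
# as GeneratedAnswer objects, and rank them with the LLMBlenderRanker component.
from haystack.components.generators import HuggingFaceLocalGenerator
from haystack.dataclasses import GeneratedAnswer

from llm_blender import LLMBlenderRanker  # import path assumed

query = "Summarize the key provisions of the bill."

generators = [
    HuggingFaceLocalGenerator(model="microsoft/Phi-3-mini-4k-instruct", task="text-generation"),
    HuggingFaceLocalGenerator(model="mistralai/Mistral-7B-Instruct-v0.2", task="text-generation"),
]

answers = []
for generator in generators:
    generator.warm_up()
    reply = generator.run(prompt=query)["replies"][0]
    answers.append(GeneratedAnswer(data=reply, query=query, documents=[]))

ranker = LLMBlenderRanker(model="llm-blender/PairRM")
ranked_answers = ranker.run(answers=answers)["answers"]
print(ranked_answers[0].data)  # best candidate according to the PairRanker
```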

## Usage

To run the pipelines, you will need to clone this repository and install the required libraries.
Install the llm-blender package:

```bash
git clone https://github.com/avnlp/llm_blender
cd llm_blender
pip install -e .
```

## LLM-Blender using Mistral, Llama-3 and Phi-3 models on the MixInstruct Dataset

```bash
cd src/llm_blender/mix_instruct/
python llm_blender_ranker_all_llms.py
```

## LLMBlenderRanker Component Usage

```python
from haystack.dataclasses import GeneratedAnswer

from llm_blender import LLMBlenderRanker  # import path assumed; adjust to this package's layout

llm_ranker = LLMBlenderRanker(model="llm-blender/PairRM")

answers = [
    GeneratedAnswer(data="Paris is the capital of France.", query="What makes Paris unique?", documents=[]),
    GeneratedAnswer(
        data="The Eiffel Tower is an iconic landmark in Paris.", query="What makes Paris unique?", documents=[]
    ),
    GeneratedAnswer(data="Berlin is a beautiful city.", query="What makes Paris unique?", documents=[]),
]

output = llm_ranker.run(answers=answers)
ranked_answers = output["answers"]
print(ranked_answers)

# [
#     GeneratedAnswer(
#         data="The Eiffel Tower is an iconic landmark in Paris.",
#         query="What makes Paris unique?",
#         documents=[],
#         meta={},
#     ),
#     GeneratedAnswer(
#         data="Paris is the capital of France.", query="What makes Paris unique?", documents=[], meta={}
#     ),
#     GeneratedAnswer(data="Berlin is a beautiful city.", query="What makes Paris unique?", documents=[], meta={}),
# ]
```

The API documentation can be found [here](src/llm_blender/README.md).

## Results

- A custom component, `LLMBlenderRanker`, was developed to integrate the LLM-Blender framework with Haystack pipelines. Haystack RAG pipelines using the LLM-Blender component to ensemble LLMs were evaluated on the BillSum and MixInstruct datasets using three metrics: BARTScore, BLEURT, and BERTScore (a scoring sketch is shown at the end of this section).

- We successfully replicated the previously reported results for LLM-Blender. Moreover, significantly improved performance was observed when utilizing newer LLMs, such as Llama-3-8B, Phi-3-mini and Mistral-7B. These findings demonstrate the potential of ensembling state-of-the-art LLMs to enhance the performance of RAG pipelines on question-answering, summarization and instruction-following tasks.

- The authors of LLM-Blender obtained BERTScore values in the range of 62.26 to 74.68 on the MixInstruct dataset, with a value of 72.97 for the PairRanker. We obtained BERTScore values in the range of 72.62 to 76.86 using the newer LLMs, and a value of 75.83 with the PairRanker ensembling the results from Llama-3-8B, Phi-3-mini and Mistral-7B.

- The authors of LLM-Blender obtained BARTScore values in the range of -4.57 to -3.14 on the MixInstruct dataset, with a value of -3.14 for the PairRanker. We obtained BARTScore values in the range of -3.17 to -2.87 using the newer LLMs, and a value of -2.87 with the PairRanker ensembling the results from Llama-3-8B, Phi-3-mini and Mistral-7B.

- The authors of LLM-Blender obtained BLEURT values in the range of -1.23 to -0.37 on the MixInstruct dataset, with a value of -0.37 for the PairRanker. We obtained BLEURT values in the range of -0.41 to -0.23 using the newer LLMs, and a value of -0.26 with the PairRanker ensembling the results from Llama-3-8B, Phi-3-mini and Mistral-7B.

- The newer models, Llama-3-8B, Phi-3-mini, and Mistral-7B, significantly outperformed all the models used by the LLM-Blender authors on all three metrics (BERTScore, BARTScore and BLEURT) on the MixInstruct dataset.

- On the BillSum dataset, we obtained BERTScore values from 73.91 to 75.43, BARTScore values from -3.49 to -3.19, and BLEURT values from -0.39 to -0.20 across the different LLMs. The PairRanker model, ensembling the outputs from Llama-3-8B, Phi-3-mini, and Mistral-7B, achieved the highest scores of 75.83 for BERTScore, -3.19 for BARTScore, and -0.20 for BLEURT.

- For both the BillSum and MixInstruct datasets, the PairRanker model achieved the best performance when ensembling the outputs from Llama-3-8B, Phi-3-mini, and Mistral-7B. This combination of LLMs, ensembled using LLM-Blender, significantly outperformed each individual model's performance on all the evaluation metrics.
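
For reference, a minimal sketch of how one of these metrics (BERTScore) can be computed with the Hugging Face `evaluate` library; the actual evaluation scripts in this repository may differ:

```python
# Illustrative only: BERTScore via the `evaluate` library (assumed dependency).
import evaluate

bertscore = evaluate.load("bertscore")
predictions = ["The Eiffel Tower is an iconic landmark in Paris."]
references = ["Paris is famous for landmarks such as the Eiffel Tower."]
scores = bertscore.compute(predictions=predictions, references=references, lang="en")
print(scores["f1"])  # one F1 value per prediction/reference pair
```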

## License

The source files are distributed under the [MIT License](https://github.com/avnlp/llm-blender/blob/main/LICENSE).

paper/llm_blender.pdf

525 KB
Binary file not shown.

plots/billsum_3_llms.png

20.9 KB

plots/blender.png

22.9 KB

plots/blender_without_fuser.png

42.1 KB

plots/fuser.png

19 KB

plots/mixinstruct_3_llms.png

21.8 KB
