Commit d2a9a73

Update HyDE
1 parent e033d9d commit d2a9a73

File tree

4 files changed (+120, −0 lines)


hyde/image.png

111 KB

hyde/readme.md

Lines changed: 79 additions & 0 deletions
# HyDE

This is a reproduction of the paper:

- [Precise Zero-Shot Dense Retrieval without Relevance Labels](https://aclanthology.org/2023.acl-long.99/)

## Introduction

HyDE is a method that generates a hypothetical document using an instruction-following language model (e.g., InstructGPT) to capture relevance patterns. The generated document is then encoded into an embedding vector by an unsupervised contrastively learned encoder (e.g., Contriever). This vector identifies a neighborhood in the corpus embedding space, from which similar real documents are retrieved based on vector similarity. This second step grounds the generated document in the actual corpus, with the encoder's dense bottleneck filtering out hallucinations.
<center>
<img src="./image.png" alt="HyDE" width="50%"/>
</center>
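The pipeline above (generate, encode, retrieve by similarity) can be sketched end to end. Note that the generator and encoder below are toy stand-ins — a canned "hypothetical passage" and a vocabulary-indexed bag-of-words encoder — not the actual LM or Contriever used in this reproduction:

```python
import re

import numpy as np


def bow_encode(text: str, vocab: dict[str, int]) -> np.ndarray:
    """Toy stand-in for a Contriever-style encoder: L2-normalized bag-of-words."""
    vec = np.zeros(len(vocab))
    for token in re.findall(r"[a-z]+", text.lower()):
        if token in vocab:
            vec[vocab[token]] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec


def toy_generate(query: str, n: int = 2) -> list[str]:
    """Toy stand-in for the instruction-following LM: echoes the query as a 'passage'."""
    return [f"A passage answering the question: {query}"] * n


def hyde_retrieve(query: str, corpus: list[str], top_k: int = 2) -> list[str]:
    # Build a vocabulary over the corpus for the toy encoder.
    words = sorted({w for doc in corpus for w in re.findall(r"[a-z]+", doc.lower())})
    vocab = {w: i for i, w in enumerate(words)}
    # 1. Generate hypothetical documents for the query.
    hypo_docs = toy_generate(query)
    # 2. Encode them and average into a single query vector
    #    (the paper also averages in the encoded query itself).
    vectors = [bow_encode(d, vocab) for d in hypo_docs] + [bow_encode(query, vocab)]
    query_vec = np.mean(vectors, axis=0)
    # 3. Retrieve real documents by inner-product similarity.
    doc_vecs = np.stack([bow_encode(d, vocab) for d in corpus])
    scores = doc_vecs @ query_vec
    return [corpus[i] for i in np.argsort(-scores)[:top_k]]


corpus = [
    "Paris is the capital of France.",
    "The mitochondria is the powerhouse of the cell.",
    "France is a country in Europe whose capital is Paris.",
]
print(hyde_retrieve("What is the capital of France?", corpus))
```

With real models, step 1 is a prompted LLM call and step 2/3 use dense encoder vectors and a FAISS index, but the control flow is the same.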
## Running the Method

Before running the experiment, you need to prepare the generator. In this example, we use vLLM to deploy the generator; you can skip this step if you wish to use a generator hosted by OpenAI.

```bash
bash ./run_generator.sh
```

This script starts a `Qwen2-7B-Instruct` model server on port 8000. You can change the `MODEL_NAME` in the script if you want to use a different model.
You also need to prepare the retriever. In this example, we use the `FlexRAG/wiki2021_atlas_contriever` retriever.

```bash
git lfs install
git clone https://huggingface.co/FlexRAG/wiki2021_atlas_contriever
```

This will download the retriever to your local machine.
Then, run the following command to evaluate HyDE on the test set of `Natural Questions`:

```bash
bash ./run.sh
```

This script runs the HyDE method on the test set of `Natural Questions` and saves the results in the `results` directory. You can change the `DATASET_NAME` and `SPLIT` variables in the script to evaluate on different datasets.
## Experiments

### Experiment Settings

- **Model**: We use the `Qwen2-7B-Instruct` model.
- **Retriever**: We use the `FlexRAG/wiki2021_atlas_contriever` retriever.
- **max_iterations**: We set the maximum number of iterations to 2.
- **top_k**: We set the number of top-k retrieved documents to 5.
- **temperature**: We set the generation temperature to 0 for deterministic generation.
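One common reading of the `retrieval_success_rate` metric configured in `run.sh`: a query counts as a success if any of the top-k retrieved passages contains a gold answer string. A minimal sketch of that computation — the exact matching and normalization rules in FlexRAG may differ:

```python
def retrieval_success_rate(retrieved: list[list[str]], answers: list[list[str]]) -> float:
    """Fraction of queries where at least one retrieved passage contains
    at least one gold answer (case-insensitive substring match)."""
    hits = 0
    for passages, golds in zip(retrieved, answers):
        if any(gold.lower() in passage.lower() for passage in passages for gold in golds):
            hits += 1
    return hits / len(retrieved)


retrieved = [
    ["Paris is the capital of France.", "France borders Spain."],  # query 1: hit
    ["The cell nucleus stores DNA."],                              # query 2: miss
]
answers = [["Paris"], ["mitochondria"]]
print(retrieval_success_rate(retrieved, answers))  # → 0.5
```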
### Experimental Results

TODO
## Citation

If you use this code in your research, please cite the following papers:

```bibtex
@software{Zhang_FlexRAG_2025,
  author = {Zhang, Zhuocheng and Feng, Yang and Zhang, Min},
  doi = {10.5281/zenodo.14593327},
  month = jan,
  title = {{FlexRAG}},
  url = {https://github.com/ictnlp/FlexRAG},
  year = {2025}
}
```

```bibtex
@inproceedings{gao-etal-2023-precise,
  title = "Precise Zero-Shot Dense Retrieval without Relevance Labels",
  author = "Gao, Luyu and Ma, Xueguang and Lin, Jimmy and Callan, Jamie",
  editor = "Rogers, Anna and Boyd-Graber, Jordan and Okazaki, Naoaki",
  booktitle = "Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
  month = jul,
  year = "2023",
  address = "Toronto, Canada",
  publisher = "Association for Computational Linguistics",
  url = "https://aclanthology.org/2023.acl-long.99/",
  doi = "10.18653/v1/2023.acl-long.99",
  pages = "1762--1777",
  abstract = "While dense retrieval has been shown to be effective and efficient across tasks and languages, it remains difficult to create effective fully zero-shot dense retrieval systems when no relevance labels are available. In this paper, we recognize the difficulty of zero-shot learning and encoding relevance. Instead, we propose to pivot through Hypothetical Document Embeddings (HyDE). Given a query, HyDE first zero-shot prompts an instruction-following language model (e.g., InstructGPT) to generate a hypothetical document. The document captures relevance patterns but is {\textquotedblleft}fake{\textquotedblright} and may contain hallucinations. Then, an unsupervised contrastively learned encoder (e.g., Contriever) encodes the document into an embedding vector. This vector identifies a neighborhood in the corpus embedding space, from which similar real documents are retrieved based on vector similarity. This second step grounds the generated document to the actual corpus, with the encoder's dense bottleneck filtering out the hallucinations. Our experiments show that HyDE significantly outperforms the state-of-the-art unsupervised dense retriever Contriever and shows strong performance comparable to fine-tuned retrievers across various tasks (e.g. web search, QA, fact verification) and in non-English languages (e.g., sw, ko, ja, bn)."
}
```

hyde/run.sh

Lines changed: 29 additions & 0 deletions
```bash
#!/bin/bash

MODEL_NAME=Qwen2-7B-Instruct
BASE_URL=http://127.0.0.1:8000/v1
DATASET_NAME=nq
SPLIT=test

python -m flexrag.entrypoints.run_assistant \
    name=$DATASET_NAME \
    split=$SPLIT \
    assistant_type=modular \
    modular_config.generator_type=openai \
    modular_config.openai_config.model_name=$MODEL_NAME \
    modular_config.openai_config.base_url=$BASE_URL \
    modular_config.do_sample=False \
    modular_config.retriever_type=hyde \
    modular_config.hyde_config.generator_type=openai \
    modular_config.hyde_config.openai_config.model_name=$MODEL_NAME \
    modular_config.hyde_config.openai_config.base_url=$BASE_URL \
    modular_config.hyde_config.database_path=wiki2021_atlas_contriever \
    modular_config.hyde_config.index_type=faiss \
    modular_config.hyde_config.query_encoder_config.encoder_type=hf \
    modular_config.hyde_config.query_encoder_config.hf_config.model_path=facebook/contriever-msmarco \
    modular_config.hyde_config.query_encoder_config.hf_config.device_id=[0] \
    eval_config.metrics_type=[retrieval_success_rate,generation_f1,generation_em] \
    eval_config.retrieval_success_rate_config.eval_field=text \
    eval_config.response_preprocess.processor_type=[simplify_answer] \
    log_interval=10
```

hyde/run_generator.sh

Lines changed: 12 additions & 0 deletions
```bash
#!/bin/bash

MODEL_NAME=Qwen2-7B-Instruct

python -m vllm.entrypoints.openai.api_server \
    --model $MODEL_NAME \
    --gpu-memory-utilization 0.95 \
    --tensor-parallel-size 2 \
    --port 8000 \
    --host 0.0.0.0 \
    --trust-remote-code
```
