This is the PyTerrier plugin for the monoQA reranking and answer extraction approaches.
This repository can be installed using pip:
pip install --upgrade git+https://github.com/terrierteam/monoQA.git
You can then import the package in Python after importing pyterrier:
import pyterrier as pt
pt.init()
import pyterrier_monoQA
For instance, to load a monoQA model:
from pyterrier_monoQA import MonoQA
monoqa = MonoQA(top_k=10) # loads 9meo/monoQA by default
To use monoQA to rerank documents and extract an answer:
import pandas as pd
monoqa(pd.DataFrame([
{"qid":"1", "docno":"8172" ,"query":"measurement of dielectric constant of liquids by the use of microwave techniques","text":"microwave spectroscopy includes chapters on spectroscope technique\\nand design on measurements on gases liquids and solids on nuclear\\nproperties on molecular structure and on further possible applications\\nof microwaves\\n"},
{"qid":"2", "docno":"4330","query":"mathematical analysis and design details of waveguide fed microwave radiations","text":"a step by step method for designing waveguides and oscillatory systems\\n","score":-0.4900564551,"rank":0,"answer":"step by step method for designing waveguides and oscillatory systems"}
]))
# qid docno query text answer score rank
# 1 8172 measurement of dielectric constant of liquids ... microwave spectroscopy includes chapters on s... microwave spectroscopy includes chapters on sp... -0.289365 0
# 2 4330 mathematical analysis and design details of wa... a step by step method for designing waveguides... a step by step method for designing waveguides... -0.585294 0
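As a minimal sketch of how the output above might be consumed, the following picks the answer of the highest-scoring row for each query. It works on plain dictionaries (for example, the result of `df.to_dict("records")` on the output dataframe) so that it runs without loading the model; the helper name `best_answers` is hypothetical and not part of the plugin.

```python
def best_answers(rows):
    """Return {qid: answer} using the highest-scoring row of each query.

    `rows` is a list of dicts with at least the columns shown in the
    output above: qid, answer and score.
    """
    best = {}
    for row in rows:
        qid = row["qid"]
        # Keep the row with the largest score seen so far for this query.
        if qid not in best or row["score"] > best[qid]["score"]:
            best[qid] = row
    return {qid: row["answer"] for qid, row in best.items()}


# Illustrative rows only; the values are made up for this sketch.
rows = [
    {"qid": "1", "docno": "a", "answer": "first answer", "score": -0.9},
    {"qid": "1", "docno": "b", "answer": "second answer", "score": -0.3},
    {"qid": "2", "docno": "c", "answer": "third answer", "score": -0.6},
]
answers = best_answers(rows)
```

With a real result frame, the same selection could be done by filtering on `rank == 0`, assuming rank 0 denotes the top-scored document as in the output shown above.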
To use monoQA as the final stage of a retrieval pipeline, for example after BM25:
import pyterrier as pt
from pyterrier_monoQA import MonoQA
monoqa = MonoQA(top_k=10) # loads 9meo/monoQA by default
dataset = pt.get_dataset("irds:vaswani")
bm25 = pt.BatchRetrieve(pt.get_dataset("vaswani").get_index(), wmodel="BM25")
mono_pipeline = bm25 >> pt.text.get_text(dataset, "text") >> monoqa
Note that both reranking and answer extraction require the document text to be included in the dataframe (see pt.text.get_text).
monoQA has the following options:
- model (default: '9meo/monoQA'). HGF model name. Defaults to a version trained on OR-QuAC.
- tok_model (default: '9meo/monoQA'). HGF tokenizer name.
- batch_size (default: 4). How many documents to process at the same time.
- text_field (default: text). The dataframe attribute in which the document text is stored.
- verbose (default: True). Show progress bar.
- top_k (default: 5). Top k documents for the answer generator.
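The options listed above can be collected in one place and overridden selectively. The sketch below only builds a dictionary of settings; the helper `monoqa_options` is hypothetical, and passing the result as `MonoQA(**opts)` assumes the constructor accepts these option names as keyword arguments, which the `top_k=10` example suggests but which is not verified here.

```python
# Defaults exactly as listed in the options above.
MONOQA_DEFAULTS = {
    "model": "9meo/monoQA",
    "tok_model": "9meo/monoQA",
    "batch_size": 4,
    "text_field": "text",
    "verbose": True,
    "top_k": 5,
}


def monoqa_options(**overrides):
    """Merge user overrides into the documented defaults (hypothetical helper)."""
    unknown = set(overrides) - set(MONOQA_DEFAULTS)
    if unknown:
        # Fail early on misspelled option names instead of passing them on.
        raise ValueError(f"unknown monoQA option(s): {sorted(unknown)}")
    return {**MONOQA_DEFAULTS, **overrides}


opts = monoqa_options(batch_size=16, top_k=10, verbose=False)

# Assumed usage, not executed here because it downloads the model:
# from pyterrier_monoQA import MonoQA
# monoqa = MonoQA(**opts)
```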
Check out the notebooks, even on Colab:
- Vaswani [Github] [Colab]
We use a PyTerrier transformer to score documents using a T5 model.
Sequences longer than the model's maximum of 512 tokens are silently truncated. Consider splitting long texts into passages and aggregating the results (examples).
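The split-and-aggregate idea can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the plugin's implementation: it counts whitespace-separated words rather than model tokens (so window sizes should stay well below 512), the window length and stride are arbitrary choices, and the function names are hypothetical. PyTerrier provides `pt.text.sliding` and `pt.text.max_passage` for the same purpose inside a pipeline.

```python
def sliding_passages(text, length=150, stride=75):
    """Split `text` into overlapping windows of `length` whitespace tokens."""
    tokens = text.split()
    if len(tokens) <= length:
        return [" ".join(tokens)]
    passages = []
    for start in range(0, len(tokens), stride):
        passages.append(" ".join(tokens[start:start + length]))
        # Stop once a window has reached the end of the text.
        if start + length >= len(tokens):
            break
    return passages


def max_passage_score(passage_scores):
    """Aggregate (docno, score) pairs by keeping the best score per document."""
    best = {}
    for docno, score in passage_scores:
        if docno not in best or score > best[docno]:
            best[docno] = score
    return best


# Ten "words" split into windows of four with a stride of two.
demo_passages = sliding_passages(" ".join(str(i) for i in range(10)), length=4, stride=2)

# Scores here are made up; in practice they would come from the reranker.
demo_scores = max_passage_score([("d1", -0.5), ("d1", -0.2), ("d2", -0.9)])
```

Taking the maximum passage score is only one aggregation choice; using the first passage or the mean score are common alternatives.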
This repo contains code and data for the EMNLP 2022 paper "monoQA: Multi-Task Learning of Reranking and Answer Extraction for Open-Retrieval Conversational Question Answering".
Install dependencies:
git clone https://github.com/terrierteam/monoQA.git
cd monoQA
pip install -r requirements.txt
By default, we expect raw data to be stored in ./datasets/raw and processed data to be stored in ./datasets:
mkdir datasets
Download the necessary OR-QuAC files and store them in ./datasets/or-quac:
mkdir datasets/or-quac
cd datasets/or-quac
wget https://ciir.cs.umass.edu/downloads/ORConvQA/all_blocks.txt.gz
wget https://ciir.cs.umass.edu/downloads/ORConvQA/qrels.txt.gz
gzip -d *.txt.gz
mkdir preprocessed
cd preprocessed
wget https://ciir.cs.umass.edu/downloads/ORConvQA/preprocessed/train.txt
wget https://ciir.cs.umass.edu/downloads/ORConvQA/preprocessed/test.txt
wget https://ciir.cs.umass.edu/downloads/ORConvQA/preprocessed/dev.txt
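After the downloads finish, a quick check that the expected files are in place can save a failed run later. The sketch below assumes the layout produced by the commands above (decompressed `.txt` files under `datasets/or-quac`); the function name `missing_orquac_files` is hypothetical and not part of this repository.

```python
from pathlib import Path

# File names taken from the download commands above, after `gzip -d`.
EXPECTED_FILES = [
    "all_blocks.txt",
    "qrels.txt",
    "preprocessed/train.txt",
    "preprocessed/test.txt",
    "preprocessed/dev.txt",
]


def missing_orquac_files(root="datasets/or-quac"):
    """Return the expected OR-QuAC files that are not present under `root`."""
    root = Path(root)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]


if __name__ == "__main__":
    missing = missing_orquac_files()
    if missing:
        print("Missing files:", ", ".join(missing))
    else:
        print("All OR-QuAC files found.")
```

Run it from the repository root so that the relative path `datasets/or-quac` resolves correctly.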
- Craig Macdonald and Nicola Tonellotto. Declarative Experimentation in Information Retrieval using PyTerrier. In Proceedings of ICTIR 2020. https://arxiv.org/abs/2007.14271
- Sarawoot Kongyoung, University of Glasgow