A project by Anthony Liu and Alex Salman
The primary task of this program is to retrieve the names of all datasets mentioned in the given documents.
Last modified 06/08/2021
Program output guide and sample
- Exact Match
- spaCy NER
- Fuzzy Match (disabled due to slow performance)
- Custom Hyperparameters
- Optional Training During Each Run
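As a rough illustration of the Exact Match feature listed above, here is a minimal sketch that assumes a list of known dataset names is already available. The function name `find_exact_matches` and the sample names are hypothetical and are not taken from main.ipynb.

```python
import re

def find_exact_matches(text, known_names):
    """Return the known dataset names that appear verbatim in text."""
    found = []
    for name in known_names:
        # Lookarounds act as word boundaries, so a short name does not
        # match inside a longer word.
        pattern = r"(?<!\w)" + re.escape(name) + r"(?!\w)"
        if re.search(pattern, text, flags=re.IGNORECASE):
            found.append(name)
    return found

# Hypothetical dataset names, for illustration only.
names = ["Example Housing Survey", "Sample Climate Archive", "EHS"]
sample = "We combined the Example Housing Survey with the sample climate archive."
print(find_exact_matches(sample, names))
```

Matching here is case-insensitive, which is one possible design choice; the project's own matcher may normalize text differently.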
Installing required packages
pip3 install -r requirements.txt
Store train data at location:
dataset/train/
Store test data at location:
dataset/test/
Running the program: use Jupyter Notebook to run
main.ipynb
Q: Why is this an IR project instead of a hodgepodge of algorithms?
A: An Information Retrieval system has four components: "acquisition", "representation", "file organization", and "query". Although we work primarily on string matching, this process is essential to the "query" component, which receives queries such as "how many times was dataset XXX mentioned?". We must therefore devise a robust platform in which documents are processed and stored efficiently, so that such queries receive accurate answers.
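To make the "query" component concrete, the sketch below shows one way a query like the one quoted above could be answered. It assumes the documents are already loaded into a dictionary of id to text. The names `build_mention_index` and `count_mentions`, and the sample data, are hypothetical and do not describe the project's actual implementation.

```python
import re
from collections import Counter

def build_mention_index(documents, known_names):
    """Map each dataset name to a Counter of {doc_id: mention count}."""
    index = {}
    for doc_id, text in documents.items():
        for name in known_names:
            pattern = r"(?<!\w)" + re.escape(name) + r"(?!\w)"
            count = len(re.findall(pattern, text, flags=re.IGNORECASE))
            if count:
                index.setdefault(name, Counter())[doc_id] = count
    return index

def count_mentions(index, name):
    """Answer: how many times was this dataset mentioned overall?"""
    return sum(index.get(name, Counter()).values())

# Hypothetical documents and dataset names, for illustration only.
documents = {
    "doc1": "The Example Housing Survey was used. Example Housing Survey data is public.",
    "doc2": "We rely on the example housing survey.",
}
index = build_mention_index(documents, ["Example Housing Survey", "Sample Climate Archive"])
print(count_mentions(index, "Example Housing Survey"))
print(count_mentions(index, "Sample Climate Archive"))
```

Building the index once and answering queries from it separates the "representation" and "file organization" steps from the "query" step, so repeated queries do not rescan every document.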