ArthLeu/title-matching

Title Matcher - CSE 272 Final Project

A project by Anthony Liu and Alex Salman

The primary task of this program is to retrieve the names of all datasets mentioned in the given documents.

Last modified 06/08/2021

Competition Descriptions

Kaggle competition main page

Dataset download page

Program output guide and sample

Features as of 06/09/2021

  1. Exact Match
  2. spaCy NER
  3. Fuzzy Match (disabled due to slow performance)
  4. Custom Hyperparameters
  5. Optional Training During Each Run
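As an illustration of the Exact Match feature, here is a minimal sketch, not the project's actual code. It assumes a list of known dataset titles is available and checks which of them occur in a document after both sides are normalized. The function names `normalize` and `exact_match` and the sample titles are hypothetical.

```python
import re
from typing import List


def normalize(text: str) -> str:
    """Lowercase the text and collapse every run of non-alphanumerics to one space."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def exact_match(document: str, known_titles: List[str]) -> List[str]:
    """Return the known titles that appear in the document after normalization.

    Padding both sides with spaces makes the substring test respect word
    boundaries, so a title only matches whole words.
    """
    padded_doc = f" {normalize(document)} "
    return [
        title for title in known_titles
        if f" {normalize(title)} " in padded_doc
    ]


# Hypothetical sample input, for illustration only.
sample_doc = "We used the National Education Longitudinal Study (NELS) data."
sample_titles = ["National Education Longitudinal Study", "Census of Agriculture"]
matches = exact_match(sample_doc, sample_titles)
```

Normalizing before comparison keeps the match strict on wording while tolerating differences in case and punctuation.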

Our Design Thoughts (Brainstorming Canvas)

Link to Google Docs

Useful Shell Commands

Installing required packages

pip3 install -r requirements.txt

Store train data at location:

dataset/train/

Store test data at location:

dataset/test/

Running the program: use Jupyter Notebook to run

main.ipynb

FAQ

Q: Why is this an IR project instead of a hodgepodge of algorithms?

A: An Information Retrieval system has four components: "acquisition", "representation", "file organization", and "query". Although we work primarily on string matching, this process is essential to the "query" component, where a query such as "how many times was the XXX dataset mentioned" is passed in. We must therefore devise a robust platform in which documents are processed and stored efficiently and such queries receive accurate answers.
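To make the "query" component concrete, the following is a rough sketch of how a mention-count query might be answered over stored documents. It is an assumption for illustration, not the notebook's implementation: the function `count_mentions`, the document ids, and the sample texts are all hypothetical.

```python
import re
from collections import Counter
from typing import Dict


def count_mentions(documents: Dict[str, str], title: str) -> Counter:
    """Count case-insensitive, whole-word mentions of `title` per document.

    Assumes the title starts and ends with a word character, so the
    `\\b` anchors behave as word boundaries. Documents with zero
    mentions are left out of the result.
    """
    pattern = re.compile(r"\b" + re.escape(title) + r"\b", re.IGNORECASE)
    counts: Counter = Counter()
    for doc_id, text in documents.items():
        n = len(pattern.findall(text))
        if n:
            counts[doc_id] = n
    return counts


# Hypothetical stored documents, for illustration only.
docs = {
    "doc_a": "The ADNI dataset was used. Results on ADNI are reported.",
    "doc_b": "No dataset is named here.",
}
per_doc = count_mentions(docs, "ADNI")
total = sum(per_doc.values())
```

Here `per_doc` gives the count for each document that mentions the title, and `total` answers the query across the whole collection.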
