https://github.com/4OH4/doc-similarity
Find and rank relevant content in Python using NLP, TF-IDF and GloVe.
This repository includes two methods of ranking text content by similarity:
- Term frequency - inverse document frequency (TF-IDF)
- Semantic similarity, using GloVe word embeddings
Given a search query (text string) and a document corpus, these methods calculate a similarity metric for each document vs. the query. Both methods exist as standalone modules, with explanation and demonstration code inside `examples.ipynb`.
There is an associated blog post that explains the contents of this repository in more detail.
The code in this repository utilises, is derived from, and extends the excellent Scikit-Learn, Gensim and NLTK packages.
Requires Python 3 (v3.7 tested) and the following packages (all available via `pip`):

```
pip install scikit-learn~=0.22
pip install gensim~=3.8
pip install nltk~=3.4
```

Or install via the `requirements.txt` file:

```
pip install -r requirements.txt
```
After installing the requirements (if necessary), open and run `examples.ipynb` using Jupyter Lab.
The `tfidf` module is a wrapper around the Scikit-Learn `TfidfVectorizer`, with some additional functionality from `nltk` to handle stopwords, lemmatization and cosine similarity calculation. To run:

```python
from tfidf import rank_documents

document_scores = rank_documents(search_terms, documents)
```
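For example, a small corpus can be ranked end-to-end (a minimal sketch: the sample documents are made up, and it assumes `rank_documents` returns one score per document, in corpus order):

```python
from tfidf import rank_documents

# Hypothetical sample corpus and query, for illustration only
documents = [
    "The quick brown fox jumps over the lazy dog",
    "Machine learning models require training data",
    "Natural language processing ranks text by relevance",
]
search_terms = "ranking text documents by relevance"

# Assumed to return one similarity score per document, in corpus order
document_scores = rank_documents(search_terms, documents)

# Pair each document with its score and sort, highest-scoring first
for score, doc in sorted(zip(document_scores, documents), reverse=True):
    print(f"{score:.3f}  {doc}")
```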
There is a self-contained class, `DocSim`, for running semantic similarity queries. It can be imported as a module and used without additional code:

```python
from docsim import DocSim

docsim = DocSim(verbose=True)
similarities = docsim.similarity_query(query_string, documents)
```
By default, a GloVe word embedding model (`glove-wiki-gigaword-50`) is loaded, although a custom model can also be used.
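For instance, an alternative pre-trained embedding could be loaded through `gensim.downloader` and passed in at construction (a sketch, assuming the `DocSim` constructor accepts a pre-loaded model via a `model` keyword argument):

```python
import gensim.downloader as api

from docsim import DocSim

# Load a different pre-trained GloVe model via gensim-data
# ("glove-wiki-gigaword-100" is a published gensim-data model name)
custom_model = api.load("glove-wiki-gigaword-100")

# Assumption: DocSim accepts a pre-loaded embedding model via a
# `model` keyword argument in place of the default loader
docsim = DocSim(model=custom_model, verbose=True)

similarities = docsim.similarity_query(
    "natural language processing",       # query string
    ["NLP ranks text by relevance",      # document corpus
     "The cat sat on the mat"],
)
```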
The word embedding models can be quite large and slow to load, although subsequent operations are faster. The multi-threaded version of the class, `DocSim_threaded`, loads the model in the background to avoid locking the main thread for a significant period. It is used in the same way, although it will raise an exception if the model is still loading, so the `model_ready` property should be checked first. The only difference is the import:
```python
from docsim import DocSim_threaded
```
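A minimal usage sketch, polling `model_ready` before querying (the sample data and the polling loop are illustrative, not part of the library):

```python
import time

from docsim import DocSim_threaded

docsim = DocSim_threaded(verbose=True)

# Hypothetical sample data for illustration
query_string = "natural language processing"
documents = ["NLP ranks text by relevance", "The cat sat on the mat"]

# The model loads in a background thread; similarity_query raises an
# exception while loading, so wait for model_ready before querying
while not docsim.model_ready:
    time.sleep(1)

similarities = docsim.similarity_query(query_string, documents)
print(similarities)
```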
To install the package requirements to run the unit tests:
```
pip install -r requirements_unit_test.txt
```
To run all test cases, from the repository root:
```
pytest
```
Comments and feedback welcome! Please raise an issue if you find any errors or omissions.