Lh4cKg/ocr-toolkit


About - OCR Toolkit

Search for predefined keywords in scanned PDF files using the Levenshtein algorithm.

Prerequisites


Python
Tesseract

Install dependencies for Linux


Requires libtesseract (>=3.04) and libleptonica (>=1.71).

On Debian/Ubuntu:

$ sudo apt-get install tesseract-ocr libtesseract-dev libleptonica-dev pkg-config

On RedHat/Fedora:

$ sudo dnf install tesseract tesseract-devel leptonica-devel leptonica

Install dependencies for Windows


  1. Tesseract Docs
  2. Tesseract
  3. Leptonica

Setup Project


$ git clone <project_repo>
$ cd <project_directory>/

Install source dependencies from the requirements file


$ pip install -r requirements/dev.txt

Package Build and Install


$ python -m build

For Windows

$ pip install dist/ocrmatcher-<version>-py3-none-any.whl

For Linux

$ pip install dist/ocrmatcher-<version>.tar.gz

Usage


  1. Add a dataset folder to the current directory
  2. Add scanned PDF files to the dataset directory
  3. Add a keywords.txt file to the dataset directory
  4. Add search keywords to the keywords.txt file (one keyword per line, without numbering)
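
The layout described in the steps above can also be prepared from Python. This is only a sketch: it assumes the folder is literally named `dataset`, and `prepare_dataset` is a hypothetical helper, not part of ocrmatcher.

```python
from pathlib import Path


def prepare_dataset(root: Path, keywords: list[str]) -> Path:
    """Create <root>/dataset and write keywords.txt with one keyword per line.

    The folder name "dataset" is an assumption taken from the steps above;
    this helper is illustrative and not part of ocrmatcher.
    """
    dataset = root / "dataset"
    dataset.mkdir(parents=True, exist_ok=True)

    # Drop blank entries and surrounding whitespace; no numbering is added.
    cleaned = [k.strip() for k in keywords if k.strip()]

    keywords_file = dataset / "keywords.txt"
    keywords_file.write_text("".join(f"{k}\n" for k in cleaned), encoding="utf-8")
    return keywords_file


if __name__ == "__main__":
    path = prepare_dataset(Path.cwd(), ["invoice", "contract number"])
    print(f"Wrote {path}")
```

Scanned PDF files still need to be copied into the same directory by hand.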

Commands


List of available commands

$ ocrmatcher --help

Or

$ python -m ocrmatcher --help

Add new keywords with the add-keywords command

$ ocrmatcher add-keywords --k my-search-keyword1 my-search-keyword2 etc.

Search Keywords

$ ocrmatcher search 

Run with a specific language

Search Keywords

$ ocrmatcher search --lang Occupant-Pigs

Run with a specific similarity threshold between two strings (default: 95)

Search Keywords

$ ocrmatcher search --threshold 75
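
To give a feel for what a threshold of 95 versus 75 means, here is a minimal sketch of a Levenshtein-based similarity score. It assumes the score is normalised to 0-100 by the length of the longer string, which this README does not confirm. The names `levenshtein`, `similarity` and `is_match` are illustrative, not ocrmatcher's API.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum number of insertions, deletions and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Assumed 0-100 score: 100 * (1 - distance / length of the longer string)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 100.0
    return 100.0 * (1 - levenshtein(a, b) / longest)


def is_match(keyword: str, candidate: str, threshold: float = 95.0) -> bool:
    """Case-insensitive comparison against the threshold."""
    return similarity(keyword.lower(), candidate.lower()) >= threshold


if __name__ == "__main__":
    # OCR often confuses "o" with "0"; one wrong character in seven
    # scores about 85.7 under this normalisation.
    print(similarity("invoice", "inv0ice"))
    print(is_match("invoice", "inv0ice"))                # default threshold 95
    print(is_match("invoice", "inv0ice", threshold=75))  # looser threshold
```

Under this assumption, a lower threshold tolerates more OCR noise but produces more false matches.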

Convert PDF files to images

$ ocrmatcher pdf2img 
