Newspaper-OCR-Project

Welcome to my project! I built it for The Michigan Daily, to reformat old newspapers from the archives into the more modern format we use for online media. To accomplish this, I used Pytesseract for OCR, along with OpenCV and Pillow to preprocess and morph the images of the archived newspapers so that they are more readable for the OCR algorithm.

Image Preprocessing

The first step was to preprocess and morph the images using OpenCV and Pillow to make them more readable for the OCR algorithm. Newspapers are messy, especially old ones, and are filled with random noise, such as black bars, that would significantly hurt the efficiency and accuracy of Pytesseract. To counteract this, I used OpenCV to morph each image so that the computer could detect white contours and draw bounding boxes around the blocks of text in each newspaper. This way, only the text is read in at first, separate from the margins, titles, and everything else. This step will be tuned further and used in the future to extract the headline and first paragraph from each paper to display with a "Read More" option on the main website.

The image preprocessing process is displayed below:

Original Image

Greyscale and Slight Blur

Image Binarization by Applying Threshold

Dilation

Detecting Contours and Creating Visible Bounding Boxes
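Once bounding boxes are available, each one can be cropped out of the page and passed to the OCR engine on its own. The sketch below assumes boxes are `(x, y, w, h)` tuples, the format returned by OpenCV's `cv2.boundingRect`, and uses `pytesseract.image_to_string` by default. The helper name `ocr_blocks` and its `ocr` parameter are hypothetical, added here so that a different OCR function can be swapped in.

```python
import numpy as np


def ocr_blocks(image, boxes, ocr=None):
    """Crop each (x, y, w, h) box out of `image` and return (box, text) pairs.

    `ocr` is any callable that maps an image array to a string. This
    parameter is a hypothetical convenience, not part of the project.
    """
    if ocr is None:
        # Imported lazily so the cropping logic can be tried without
        # Tesseract installed.
        import pytesseract
        ocr = pytesseract.image_to_string

    image = np.asarray(image)
    results = []
    for x, y, w, h in boxes:
        # NumPy indexing is rows (y) first, then columns (x).
        crop = image[y:y + h, x:x + w]
        if crop.size == 0:
            # Skip boxes that fall entirely outside the image.
            continue
        results.append(((x, y, w, h), ocr(crop).strip()))
    return results
```

Reading block by block keeps margins and stray marks out of the OCR input, and it keeps each block's position alongside its text, which is what a later step would need to pick out the headline and first paragraph.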

