Arabic OCR Dataset Generation Pipeline

This repository provides a pipeline for generating an Arabic OCR dataset. The pipeline processes textual data, converts it into various font styles, generates PDF and image representations, and stores the output in a structured format suitable for OCR training.

Features

Text Preprocessing: Splits Arabic text into manageable chunks.
Font Variations: Supports multiple Arabic fonts (Sakkal Majalla, Amiri, Arial, Calibri, Scheherazade New).
DOCX Generation: Creates formatted Microsoft Word documents.
PDF Conversion: Converts DOCX files to PDFs using LibreOffice.
Image Extraction: Converts PDFs to high-resolution images.
Base64 Encoding: Stores images in Base64 format for easy integration.
Dataset Management: Uploads processed files to Hugging Face datasets.
State Persistence: Saves processing state to allow resumption from the last processed record.
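
As an illustration of the text preprocessing step, chunking could be sketched as below. This is a minimal sketch, not the code in text2image.py: the function name split_into_chunks and the word-based max_words limit are assumptions, and the actual script may split by characters, sentences, or lines instead.

```python
def split_into_chunks(text: str, max_words: int = 100) -> list[str]:
    """Split text into chunks of at most max_words whitespace-separated words.

    Hypothetical helper: the name and the word-based limit are assumptions.
    """
    words = text.split()
    return [
        " ".join(words[i:i + max_words])
        for i in range(0, len(words), max_words)
    ]


# A short Arabic sample of eight words, split into chunks of up to three words.
sample = "هذا نص عربي قصير يستخدم لتوضيح عملية التقسيم"
chunks = split_into_chunks(sample, max_words=3)
for chunk in chunks:
    print(chunk)
```

Splitting on whitespace keeps each Arabic word intact, so no chunk begins or ends in the middle of a word.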

Requirements

Python Dependencies

Install the required Python libraries using:

pip install datasets python-docx pdf2image Pillow requests huggingface_hub

Technical Specifications & Environment

The SARD dataset was generated with the software environment below. Matching it helps reproduce the results and ensures proper handling of Arabic Complex Text Layout (CTL):

Python: 3.11
python-docx: 1.2.0
LibreOffice: 6.4.7.2
pdf2image: 1.17.0
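
To compare a local environment against the versions listed above, a small check along these lines may help. It is a sketch only: the EXPECTED mapping and the helper names are assumptions, and it covers just the Python packages (LibreOffice would need a separate check).

```python
import sys
from importlib.metadata import PackageNotFoundError, version

# Package versions taken from the list above.
EXPECTED = {"python-docx": "1.2.0", "pdf2image": "1.17.0"}


def installed_version(package: str):
    """Return the installed version string, or None if the package is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


def check_environment(expected=EXPECTED):
    """Return {package: (expected, installed)} for every package that differs."""
    mismatches = {}
    for package, wanted in expected.items():
        found = installed_version(package)
        if found != wanted:
            mismatches[package] = (wanted, found)
    return mismatches


if __name__ == "__main__":
    print("Python", sys.version.split()[0])
    for package, (wanted, found) in check_environment().items():
        print(f"{package}: expected {wanted}, found {found}")
```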

Usage

1. Prepare Your Hugging Face Dataset

Ensure your dataset is available on Hugging Face and update the script with:

DATASET_NAME: Your dataset name.
repo_id: Your Hugging Face dataset repository ID.
YOUR_API_TOKEN: Your Hugging Face API token.
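
One way to fill in these values without committing the token to the repository is to read it from an environment variable. The sketch below assumes that approach. The dataset names are placeholders, and the get_api_token helper and the choice of the HF_TOKEN variable are assumptions rather than part of text2image.py.

```python
import os

# Placeholder values: replace them with your own dataset and repository IDs.
DATASET_NAME = "your-username/your-source-dataset"
repo_id = "your-username/your-output-dataset"


def get_api_token(env_var: str = "HF_TOKEN") -> str:
    """Read the Hugging Face API token from an environment variable.

    Hypothetical helper: the script itself may expect the token inline.
    """
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(
            f"Set the {env_var} environment variable to your Hugging Face API token."
        )
    return token
```

The result can then be assigned wherever the script expects YOUR_API_TOKEN.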

2. Run the Script

Execute the script to process text and generate the dataset:

python text2image.py
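
For orientation, the DOCX to PDF to image to Base64 stages that the script performs could be sketched as follows. This is an assumption-laden sketch, not the script's actual code: the function names, the libreoffice binary name (some systems use soffice), and the 300 DPI setting are all illustrative choices.

```python
import base64
import io
import subprocess
from pathlib import Path


def build_convert_command(docx_path, out_dir, binary="libreoffice"):
    """Build the headless LibreOffice command that converts a DOCX file to PDF."""
    return [
        binary, "--headless", "--convert-to", "pdf",
        "--outdir", str(out_dir), str(docx_path),
    ]


def docx_to_pdf(docx_path, out_dir):
    """Run LibreOffice and return the path of the PDF it writes."""
    subprocess.run(build_convert_command(docx_path, out_dir), check=True)
    return Path(out_dir) / (Path(docx_path).stem + ".pdf")


def encode_bytes(data: bytes) -> str:
    """Base64-encode raw bytes into an ASCII string."""
    return base64.b64encode(data).decode("ascii")


def pdf_to_base64_pngs(pdf_path, dpi=300):
    """Render each PDF page to PNG and return the pages as Base64 strings."""
    # Imported here so the rest of the sketch works without pdf2image installed.
    from pdf2image import convert_from_path

    encoded = []
    for page in convert_from_path(str(pdf_path), dpi=dpi):
        buffer = io.BytesIO()
        page.save(buffer, format="PNG")
        encoded.append(encode_bytes(buffer.getvalue()))
    return encoded
```

Calling docx_to_pdf("sample.docx", "out") followed by pdf_to_base64_pngs on the returned path would give one Base64 string per page, provided LibreOffice and pdf2image are installed.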

3. Resume Processing

If the script stops, run it again. It reads processing_state.json and resumes from the last processed index.
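
The resume mechanism could look roughly like the sketch below. Only the file name processing_state.json comes from this README. The next_index key, the helper names, and the choice to store the index of the next record to process are assumptions.

```python
import json
from pathlib import Path

STATE_FILE = Path("processing_state.json")


def load_start_index(state_file: Path = STATE_FILE) -> int:
    """Return the index to resume from, or 0 when no state file exists."""
    if not state_file.exists():
        return 0
    with state_file.open(encoding="utf-8") as f:
        return int(json.load(f).get("next_index", 0))


def save_start_index(index: int, state_file: Path = STATE_FILE) -> None:
    """Write the state to a temporary file, then swap it into place."""
    tmp = state_file.with_suffix(".tmp")
    tmp.write_text(json.dumps({"next_index": index}), encoding="utf-8")
    tmp.replace(state_file)


def process_all(records, handle_record, state_file: Path = STATE_FILE) -> None:
    """Process records in order, saving progress after each one."""
    start = load_start_index(state_file)
    for index in range(start, len(records)):
        handle_record(records[index])
        save_start_index(index + 1, state_file)
```

Writing to a temporary file and renaming it means an interruption during the save is unlikely to leave a half-written state file behind.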

Output Format

Each batch of processed data is stored as a CSV file with the following columns:

image_name: Unique identifier for each image
chunk: The text content associated with the image
font_name: The font used in text rendering
image_base64: Base64-encoded image representation
sample_id: Unique ID
article_link: Link of the source article
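
A batch file with these columns can be read back using only the standard library, as in the sketch below. The column names come from the list above. The helper names and the sample row are made up for illustration, and the round trip uses an in-memory buffer instead of a real batch file.

```python
import base64
import csv
import io

COLUMNS = [
    "image_name", "chunk", "font_name",
    "image_base64", "sample_id", "article_link",
]

# Base64 images can exceed the csv module's default field size limit.
csv.field_size_limit(2**31 - 1)


def write_batch(rows, handle) -> None:
    """Write rows (dicts keyed by COLUMNS) as CSV to an open text handle."""
    writer = csv.DictWriter(handle, fieldnames=COLUMNS)
    writer.writeheader()
    writer.writerows(rows)


def read_batch(handle):
    """Yield (row, image_bytes) pairs from an open CSV text handle."""
    for row in csv.DictReader(handle):
        yield row, base64.b64decode(row["image_base64"])


# Round trip with a made-up row; the bytes stand in for real image data.
buffer = io.StringIO()
write_batch([{
    "image_name": "example_0001",
    "chunk": "نص عربي",
    "font_name": "Amiri",
    "image_base64": base64.b64encode(b"fake image bytes").decode("ascii"),
    "sample_id": "0",
    "article_link": "https://example.com/article",
}], buffer)
buffer.seek(0)
rows = list(read_batch(buffer))
```

For a real batch file, open it with open(path, newline="", encoding="utf-8") and pass the handle to read_batch.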

Books Links

books_links.pkl contains links to the books used in the statistical study that guided font selection for the SARD dataset. It can be loaded as follows:

import pickle

# Load the pickled book links (only unpickle files you trust).
with open("books_links.pkl", "rb") as f:
    data = pickle.load(f)

print(data)
