riotu-lab/sard

Arabic OCR Dataset Generation Pipeline

This repository provides a pipeline for generating an Arabic OCR dataset. The pipeline processes textual data, converts it into various font styles, generates PDF and image representations, and stores the output in a structured format suitable for OCR training.

Features

  • Text Preprocessing: Splits Arabic text into manageable chunks.
  • Font Variations: Supports multiple Arabic fonts (Sakkal Majalla, Amiri, Arial, Calibri, Scheherazade New).
  • DOCX Generation: Creates formatted Microsoft Word documents.
  • PDF Conversion: Converts DOCX files to PDFs using LibreOffice.
  • Image Extraction: Converts PDFs to high-resolution images.
  • Base64 Encoding: Stores images in Base64 format for easy integration.
  • Dataset Management: Uploads processed files to Hugging Face datasets.
  • State Persistence: Saves processing state to allow resumption from the last processed record.
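
As a rough illustration of the text preprocessing and Base64 encoding features, the sketch below splits text into word-bounded chunks and encodes raw image bytes. The function names `split_into_chunks` and `encode_image_bytes` and the default of 50 words per chunk are assumptions made for this example; the chunking rule actually used by text2image.py may differ.

```python
import base64


def split_into_chunks(text: str, max_words: int = 50) -> list[str]:
    """Split text into chunks of at most max_words whitespace-separated words."""
    if max_words < 1:
        raise ValueError("max_words must be at least 1")
    words = text.split()
    return [
        " ".join(words[i:i + max_words])
        for i in range(0, len(words), max_words)
    ]


def encode_image_bytes(image_bytes: bytes) -> str:
    """Return the Base64 text representation of raw image bytes."""
    return base64.b64encode(image_bytes).decode("ascii")


if __name__ == "__main__":
    sample = "هذا نص عربي قصير يستخدم لتوضيح عملية التقسيم"
    for chunk in split_into_chunks(sample, max_words=3):
        print(chunk)
```

Splitting on whitespace keeps Arabic words intact, which matters because the rendered image and its ground-truth text must match word for word.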

Requirements

Python Dependencies

Install the required Python libraries using:

pip install datasets python-docx pdf2image Pillow requests huggingface_hub

Technical Specifications & Environment

To ensure full reproducibility of the SARD dataset generation and proper handling of Arabic Complex Text Layout (CTL), the following software environment was utilized:

  • Python: 3.11
  • python-docx: 1.2.0
  • LibreOffice: 6.4.7.2
  • pdf2image: 1.17.0
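
The LibreOffice version matters mainly at the DOCX-to-PDF step, where the Arabic text is laid out. One possible way to drive that step is LibreOffice's headless mode, sketched below. The helper names and the `soffice` binary name are assumptions (some installations expose `libreoffice` instead), and this is not necessarily how text2image.py invokes the conversion.

```python
import subprocess
from pathlib import Path


def build_convert_command(docx_path: Path, out_dir: Path, binary: str = "soffice") -> list[str]:
    """Build the LibreOffice command line for a headless DOCX-to-PDF conversion."""
    return [
        binary,
        "--headless",
        "--convert-to", "pdf",
        "--outdir", str(out_dir),
        str(docx_path),
    ]


def convert_docx_to_pdf(docx_path: Path, out_dir: Path, binary: str = "soffice") -> Path:
    """Run the conversion and return the path where the PDF is expected."""
    out_dir.mkdir(parents=True, exist_ok=True)
    # check=True raises CalledProcessError if LibreOffice exits with an error.
    subprocess.run(build_convert_command(docx_path, out_dir, binary), check=True)
    return out_dir / (docx_path.stem + ".pdf")
```

Keeping command construction separate from execution makes the call easy to inspect or log without LibreOffice installed.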

Usage

1. Prepare Your Hugging Face Dataset

Ensure your dataset is available on Hugging Face and update the script with:

  • DATASET_NAME: Your dataset name.
  • repo_id: Your Hugging Face dataset repository ID.
  • YOUR_API_TOKEN: Your Hugging Face API token.
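
If you would rather not hard-code these three values in the script, one option is to read them from environment variables. The sketch below assumes hypothetical variable names (`SARD_DATASET_NAME`, `SARD_REPO_ID`, `HF_TOKEN`) and a hypothetical `PipelineConfig` container; the script itself does not define them.

```python
import os
from dataclasses import dataclass

# Hypothetical environment variable names, chosen for this example only.
REQUIRED_KEYS = ("SARD_DATASET_NAME", "SARD_REPO_ID", "HF_TOKEN")


@dataclass(frozen=True)
class PipelineConfig:
    dataset_name: str
    repo_id: str
    api_token: str


def load_config(env=None) -> PipelineConfig:
    """Read the three settings from a mapping (defaults to os.environ)."""
    env = os.environ if env is None else env
    missing = [key for key in REQUIRED_KEYS if not env.get(key)]
    if missing:
        raise KeyError("missing settings: " + ", ".join(missing))
    return PipelineConfig(
        dataset_name=env["SARD_DATASET_NAME"],
        repo_id=env["SARD_REPO_ID"],
        api_token=env["HF_TOKEN"],
    )
```

Keeping the token out of the source file avoids committing it to the repository by accident.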

2. Run the Script

Execute the script to process text and generate the dataset:

python text2image.py

3. Resume Processing

If the script stops, rerunning it resumes from the last processed index recorded in processing_state.json.
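
A minimal sketch of how such a state file could be read and written is shown below. The JSON key `last_processed_index` and the helper names are assumptions; the layout the script actually uses for processing_state.json may differ.

```python
import json
import os
from pathlib import Path

STATE_FILE = Path("processing_state.json")


def load_last_index(state_file: Path = STATE_FILE) -> int:
    """Return the last processed index, or -1 when no state has been saved."""
    if not state_file.exists():
        return -1
    with state_file.open("r", encoding="utf-8") as f:
        return int(json.load(f).get("last_processed_index", -1))


def save_last_index(index: int, state_file: Path = STATE_FILE) -> None:
    """Write to a temporary file first, then replace the state file in one step."""
    tmp = state_file.with_suffix(".tmp")
    with tmp.open("w", encoding="utf-8") as f:
        json.dump({"last_processed_index": index}, f)
    os.replace(tmp, state_file)
```

A processing loop would then start at `load_last_index() + 1` and call `save_last_index(i)` after each record or batch. Writing through a temporary file means an interruption during the save does not leave a half-written state file.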

Output Format

Each batch of processed data is stored as a CSV file with the following columns:

image_name: Unique identifier for each image
chunk: The text content associated with the image
font_name: The font used in text rendering
image_base64: Base64-encoded image representation
sample_id: Unique ID
article_link: Link of the source article
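
To show how a batch with these columns might be written and read back, here is a short sketch using Python's standard csv module. The helper names are assumptions, and the pipeline may well write its CSV files with a different library. Note that Base64-encoded images can exceed the csv module's default field size limit, so the reader raises it first.

```python
import base64
import csv
import io

COLUMNS = ["image_name", "chunk", "font_name", "image_base64", "sample_id", "article_link"]


def write_batch(rows, stream) -> None:
    """Write rows (dicts keyed by COLUMNS) to a text stream as CSV."""
    writer = csv.DictWriter(stream, fieldnames=COLUMNS)
    writer.writeheader()
    writer.writerows(rows)


def read_batch(stream):
    """Yield (row, image_bytes) pairs, decoding the image_base64 column."""
    # Raise the per-field limit so large encoded images can be parsed.
    csv.field_size_limit(2**31 - 1)
    for row in csv.DictReader(stream):
        yield row, base64.b64decode(row["image_base64"])


if __name__ == "__main__":
    buffer = io.StringIO()
    write_batch(
        [{
            "image_name": "img_0",
            "chunk": "نص تجريبي",
            "font_name": "Amiri",
            "image_base64": base64.b64encode(b"fake image bytes").decode("ascii"),
            "sample_id": "0",
            "article_link": "https://example.com/article",
        }],
        buffer,
    )
    buffer.seek(0)
    for row, image_bytes in read_batch(buffer):
        print(row["image_name"], row["font_name"], len(image_bytes))
```

When reading a real file, open it with `encoding="utf-8"` and `newline=""` so that the Arabic text and any embedded line breaks are handled correctly.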

Books Links

The file books_links.pkl contains the links to the books used in the statistical study that guided the choice of fonts for the SARD dataset. It can be loaded as follows:

import pickle

# Load the book links; only unpickle files from a source you trust.
with open("books_links.pkl", "rb") as f:
    data = pickle.load(f)

print(data[0:])
