This repository provides a pipeline for generating an Arabic OCR dataset. The pipeline processes textual data, converts it into various font styles, generates PDF and image representations, and stores the output in a structured format suitable for OCR training.
- Text Preprocessing: Splits Arabic text into manageable chunks.
- Font Variations: Supports multiple Arabic fonts (Sakkal Majalla, Amiri, Arial, Calibri, Scheherazade New).
- DOCX Generation: Creates formatted Microsoft Word documents.
- PDF Conversion: Converts DOCX files to PDFs using LibreOffice.
- Image Extraction: Converts PDFs to high-resolution images.
- Base64 Encoding: Stores images in Base64 format for easy integration.
- Dataset Management: Uploads processed files to Hugging Face datasets.
- State Persistence: Saves processing state to allow resumption from the last processed record.
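The text preprocessing step listed above can be sketched as follows. This is a minimal illustration, not the code in text2image.py: the function name `split_into_chunks`, the `max_words` parameter, and the choice to split on whitespace by word count are all assumptions, since the actual chunking rule is not described here.

```python
def split_into_chunks(text: str, max_words: int = 50) -> list[str]:
    """Split text into consecutive chunks of at most max_words words.

    Hypothetical helper: the real pipeline may chunk by sentences,
    characters, or lines instead of whitespace-separated words.
    """
    if max_words < 1:
        raise ValueError("max_words must be at least 1")
    words = text.split()  # str.split() also handles Arabic text and mixed whitespace
    return [
        " ".join(words[i:i + max_words])
        for i in range(0, len(words), max_words)
    ]
```

Each returned chunk would then be rendered once per font, so smaller values of `max_words` yield more, shorter images.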
Install the required Python libraries using:
pip install datasets python-docx pdf2image Pillow requests huggingface_hub
To ensure full reproducibility of the SARD dataset generation and proper handling of Arabic Complex Text Layout (CTL), the following software environment was used:
- Python: 3.11
- python-docx: 1.2.0
- LibreOffice: 6.4.7.2
- pdf2image: 1.17.0
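To compare a local setup against the versions listed above, a small check along these lines may help. It is a sketch under assumptions: the helper name `check_environment` is hypothetical, and it covers only Python and the two PyPI packages (the LibreOffice version has to be checked separately, for example from its command-line version output).

```python
import sys
from importlib import metadata

# Versions taken from the list above; the keys are PyPI distribution names.
EXPECTED_PACKAGES = {"python-docx": "1.2.0", "pdf2image": "1.17.0"}
EXPECTED_PYTHON = (3, 11)


def check_environment(expected=None):
    """Return a list of mismatch messages (empty when everything matches)."""
    expected = EXPECTED_PACKAGES if expected is None else expected
    problems = []
    if sys.version_info[:2] != EXPECTED_PYTHON:
        problems.append(
            f"Python {sys.version_info.major}.{sys.version_info.minor} found, "
            f"{EXPECTED_PYTHON[0]}.{EXPECTED_PYTHON[1]} expected"
        )
    for name, wanted in expected.items():
        try:
            found = metadata.version(name)
        except metadata.PackageNotFoundError:
            problems.append(f"{name} is not installed")
            continue
        if found != wanted:
            problems.append(f"{name} {found} found, {wanted} expected")
    return problems
```

Calling `check_environment()` and printing each returned message gives a quick report before a long generation run.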
Ensure your dataset is available on Hugging Face and update the script with:
- DATASET_NAME: Your dataset name.
- repo_id: Your Hugging Face dataset repository ID.
- YOUR_API_TOKEN: Your Hugging Face API token.
Execute the script to process text and generate the dataset:
python text2image.py
If the script stops, it will resume from the last processed index using processing_state.json.
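The resume behavior can be pictured with the sketch below. Only the file name processing_state.json comes from this README; the JSON key `next_index` and the helpers `load_state`, `save_state`, and `process_all` are hypothetical and may not match what text2image.py actually stores.

```python
import json
import os

STATE_FILE = "processing_state.json"


def load_state(path=STATE_FILE):
    """Return the index of the next record to process (0 on a first run)."""
    if not os.path.exists(path):
        return 0
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f).get("next_index", 0)


def save_state(next_index, path=STATE_FILE):
    """Write the state to a temporary file, then swap it into place."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "w", encoding="utf-8") as f:
        json.dump({"next_index": next_index}, f)
    os.replace(tmp_path, path)  # avoids leaving a half-written state file


def process_all(records, handle, path=STATE_FILE):
    """Process records in order, skipping those finished in earlier runs."""
    for i in range(load_state(path), len(records)):
        handle(records[i])
        save_state(i + 1, path)  # record i is done; resume at i + 1
```

Saving the state after every record keeps the amount of repeated work after a crash to at most one record, at the cost of one small file write per record.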
Each batch of processed data is stored as a CSV file with the following columns:
- image_name: Unique identifier for each image
- chunk: The text content associated with the image
- font_name: The font used in text rendering
- image_base64: Base64-encoded image representation
- sample_id: Unique ID
- article_link: Link of the source article
The file books_links.pkl contains the links to the books used in the statistical study that guided font selection for the SARD dataset. It can be loaded as follows:
import pickle

# Load the book links used in the font-selection study.
with open("books_links.pkl", "rb") as f:
    data = pickle.load(f)
# Print all loaded entries.
print(data[0:])
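
The batch CSV files described earlier can be read back with the standard library, as in this sketch. The column names come from the list above; the helper name `decode_rows` is hypothetical. Because Base64-encoded page images can exceed the `csv` module's default field size limit, the limit is raised first.

```python
import base64
import csv
import io

# Base64 image fields can be far larger than the csv module's default limit.
csv.field_size_limit(2**31 - 1)


def decode_rows(csv_text):
    """Yield (image_name, chunk, font_name, image_bytes) for each CSV row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    for row in reader:
        image_bytes = base64.b64decode(row["image_base64"])
        yield row["image_name"], row["chunk"], row["font_name"], image_bytes
```

To use it on a batch file, read the file as UTF-8 text and pass the contents to `decode_rows`; the returned bytes can then be written to disk or opened with an image library such as Pillow.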