eellak/glossAPI

GlossAPI

A library for processing academic texts in Greek and other languages, developed by ΕΕΛΛΑΚ.

Features

  • PDF Processing: Extract text content from academic PDFs with structure preservation
  • Quality Control: Filter and cluster documents based on extraction quality
  • Section Extraction: Identify and extract academic sections from documents
  • Section Classification: Classify sections using machine learning models
  • Greek Language Support: Specialized processing for Greek academic texts
  • Metadata Handling: Process academic texts with accompanying metadata
  • Customizable Annotation: Map section titles to standardized categories
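To make the last feature concrete, here is a minimal sketch of mapping section titles to standardized categories with a plain dictionary. It only illustrates the idea and is not GlossAPI's internal logic: the helper `map_section_title` and the fallback category `"other"` are hypothetical names introduced for this example.

```python
# Illustrative only: a plain-dict lookup, not GlossAPI's internal logic.
def map_section_title(title, mapping, default="other"):
    """Return the standardized category for a section title.

    `default` is a hypothetical fallback for titles missing from the mapping.
    """
    return mapping.get(title.strip(), default)


# A mapping from original section titles to standardized categories.
annotation_mapping = {"Κεφάλαιο": "chapter"}

print(map_section_title("Κεφάλαιο", annotation_mapping))   # chapter
print(map_section_title("Παράρτημα", annotation_mapping))  # other
```

Keeping the mapping as data rather than code means new titles can be added without touching the lookup itself.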

Installation

pip install glossapi==0.0.7
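After installing, one way to confirm which version is present is to query the package metadata from the standard library. This is a small sketch using `importlib.metadata`. The helper name `installed_version` is hypothetical, and the distribution name `glossapi` is taken from the install command above.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of `package`, or None if absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


found = installed_version("glossapi")
print(f"glossapi version: {found}" if found else "glossapi is not installed")
```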

Usage

The recommended way to use GlossAPI is through the Corpus class, which provides a complete pipeline for processing academic documents:

from glossapi import Corpus
import logging

# Configure logging (optional)
logging.basicConfig(level=logging.INFO)

# Initialize Corpus with input and output directories
corpus = Corpus(
    input_dir="/path/to/documents",
    output_dir="/path/to/output",
    metadata_path="/path/to/metadata.parquet",  # Optional
    annotation_mapping={
        'Κεφάλαιο': 'chapter',
        # Add more mappings as needed
    }
)

# Step 1: Extract text from the documents and filter them by extraction quality
corpus.extract()

# Step 2: Extract sections from filtered documents
corpus.section()

# Step 3: Classify and annotate sections
corpus.annotate()
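Because each step works on the output of the one before it, it can help to run the steps through a small wrapper that stops at the first failure. The following is a sketch, not part of GlossAPI. `run_pipeline` and the logger name are hypothetical, and the wrapper assumes only that the object exposes the `extract`, `section`, and `annotate` methods shown above.

```python
import logging

logger = logging.getLogger("glossapi_pipeline")  # hypothetical logger name

# Step names as they appear in the usage example above.
PIPELINE_STEPS = ("extract", "section", "annotate")


def run_pipeline(corpus, steps=PIPELINE_STEPS):
    """Call each step on `corpus` in order and return the completed step names.

    Stops at the first step that raises, on the assumption that later
    steps depend on the output of earlier ones.
    """
    completed = []
    for name in steps:
        logger.info("Running step: %s", name)
        try:
            getattr(corpus, name)()
        except Exception:
            logger.exception("Step %s failed; skipping remaining steps", name)
            break
        completed.append(name)
    return completed
```

With the `corpus` object from the example above, `run_pipeline(corpus)` would return `["extract", "section", "annotate"]` when all three steps succeed, or a shorter list showing how far processing got.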

License

This project is licensed under the European Union Public Licence 1.2 (EUPL 1.2).