Spellchecker

A high-precision, rule- and dictionary-based spellchecker designed for Russian-language texts and briefly described in the article [Generation of Russian Poetry of Different Genres and Styles Using Neural Networks with Character-Level Tokenization](link to be inserted here after camera-ready paper submission). This tool emphasizes precision over recall, making it ideal for preparing clean data for pretraining and fine-tuning language models (LMs).

Key Features

Precision-First Approach:

The spellchecker prioritizes minimizing false positives over catching every possible error. This ensures that corrections are accurate and reliable, which is critical for preparing high-quality training data for language models.
Incorrect fixes can introduce anomalies that degrade the performance of generative LMs on downstream tasks. This tool avoids such issues by correcting only unambiguous errors.

CPU-Only, No ML Components:

The spellchecker operates entirely on the CPU and does not rely on machine learning models. This makes it lightweight, fast, and suitable for processing large text corpora (tens of GBs) in data preparation pipelines.

Interpretability and Determinism:

Every correction is traceable and deterministic. You can use a debugger to identify which rule or dictionary entry caused a specific correction, ensuring full transparency and control.

Extensible Dictionary and Rules:

The spellchecker is designed for easy expansion. You can add new words to the dictionary or define custom replacement rules to adapt the tool to specific domains or use cases.
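
To make the ideas of traceability and extensibility more concrete, here is a minimal sketch of a rule table in which every correction records the rule that produced it. The `Rule` class, the `apply_rules` function, and the two sample rules are hypothetical illustrations; they do not reflect the actual rule format or API of this repository.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    """A hypothetical replacement rule: a name, a regex pattern, and a replacement."""
    name: str
    pattern: str
    replacement: str


def apply_rules(text, rules):
    """Apply rules in order; return the new text and a trace of the rules that fired."""
    fired = []
    for rule in rules:
        new_text, count = re.subn(rule.pattern, rule.replacement, text)
        if count:
            fired.append((rule.name, count))
            text = new_text
    return text, fired


# Illustrative rules only; extending the tool would mean appending to a list like this.
RULES = [
    Rule("collapse-spaces", r" {2,}", " "),
    Rule("split-v-obshchem", r"\bвобщем\b", "в общем"),
]

fixed, trace = apply_rules("Ну вобщем  всё хорошо", RULES)
print(fixed)
print(trace)
```

Because the rules are applied in a fixed order and each match is logged by name, the same input always yields the same output and the same trace, which is the property the determinism claim relies on.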

Restoring Cyrillic Characters in Russian Text:

Some Latin characters and digits look identical or nearly identical to Cyrillic characters. When they appear in Russian text, they are hard to spot by eye, yet they can noticeably degrade language model training, and the same defects then tend to show up in generated text. In our experience, such defects occur often enough to hurt model performance. To address this, we provide a simple yet effective tool that restores Cyrillic characters in Russian text: see restore_cyrillic.py.
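
The idea can be sketched as follows. This is not the code from restore_cyrillic.py; it is a simplified illustration that assumes a small hand-picked lookalike table and a conservative heuristic: a token is rewritten only if it already contains Cyrillic letters and every other character in it is a known lookalike. The names `LOOKALIKES`, `restore_token`, and `restore_cyrillic_sketch` are hypothetical.

```python
import re

# A small, illustrative subset of Latin letters and digits that resemble Cyrillic ones.
LOOKALIKES = {
    "a": "а", "c": "с", "e": "е", "o": "о", "p": "р", "x": "х", "y": "у",
    "A": "А", "B": "В", "C": "С", "E": "Е", "H": "Н", "K": "К", "M": "М",
    "O": "О", "P": "Р", "T": "Т", "X": "Х",
    "0": "о", "3": "з",
}

CYRILLIC = re.compile(r"[а-яёА-ЯЁ]")
TOKEN = re.compile(r"\w+")


def restore_token(token):
    """Replace lookalikes only in tokens that mix Cyrillic with lookalike characters."""
    if not CYRILLIC.search(token):
        return token  # purely Latin or numeric token: leave untouched
    foreign = [ch for ch in token if not CYRILLIC.match(ch)]
    if not foreign or any(ch not in LOOKALIKES for ch in foreign):
        return token  # nothing to fix, or the token has characters we cannot map safely
    return "".join(LOOKALIKES.get(ch, ch) for ch in token)


def restore_cyrillic_sketch(text):
    """Apply restore_token to every word-like token and keep everything else as is."""
    return TOKEN.sub(lambda m: restore_token(m.group(0)), text)


# In the sample, "C" in the first word and "p", "p", "o" in the second word are Latin.
print(restore_cyrillic_sketch("Cтихи пpиpoды и Python"))
```

The heuristic deliberately leaves purely Latin tokens such as "Python" alone, in keeping with the precision-first approach. It is still naive about digits (for example, an ordinal written as a digit plus a Cyrillic ending), so a production version would need additional checks.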

Dictionary files

Unfortunately, due to LFS quota problems, I can't upload the dictionary's binary files to this repository :(

Use the link to download the archive, then unpack it into the root of your local copy of the repository.

Usage

Here’s a quick example of how to use the spellchecker:

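from spellcheck import PoeticSpellchecker
from udpipe_parser import UdpipeParser


# Load the UDPipe parser from the ./models directory.
parser = UdpipeParser()
parser.load('./models')

# Create the spellchecker on top of the parser and load its data from ./data.
schecker = PoeticSpellchecker(parser)
schecker.load('./data')

# fix() returns the corrected text together with the list of fixups.
new_text, fixups = schecker.fix("Вмести в себя все от кровенья мира")
print(new_text)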
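
In a data preparation pipeline, the same call is typically applied line by line. The sketch below assumes only what the example above shows, namely that `fix` takes a string and returns a `(new_text, fixups)` pair. The `fix_lines` helper and the `StubChecker` stand-in are hypothetical; the stub's replacement is hard-coded so the sketch runs without the dictionary files, and it is not the output of the real spellchecker. The structure of `fixups` is not documented here, so the code only checks whether it is empty.

```python
def fix_lines(checker, lines):
    """Yield (original, corrected, changed) for each non-empty line.

    `checker` is any object with a fix(text) -> (new_text, fixups) method.
    """
    for line in lines:
        text = line.rstrip("\n")
        if not text.strip():
            continue  # skip blank lines
        new_text, fixups = checker.fix(text)
        yield text, new_text, bool(fixups)


class StubChecker:
    """Hypothetical stand-in for PoeticSpellchecker with one hard-coded replacement."""

    def fix(self, text):
        new_text = text.replace("от кровенья", "откровенья")
        fixups = [("от кровенья", "откровенья")] if new_text != text else []
        return new_text, fixups


results = list(fix_lines(StubChecker(), ["Вмести в себя все от кровенья мира\n", "\n"]))
for original, corrected, changed in results:
    print(changed, corrected)
```

Since `fix_lines` is a generator, it can be fed an open file object directly, so a large corpus never has to be held in memory at once.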

Evaluation

The spellchecker is built on the principle of absolute minimization of false positives. It corrects only those errors where the intended correction is unambiguous. While it’s impossible to eliminate false positives entirely (e.g., in cases of intentionally distorted or stylized language), the system prioritizes accuracy and reliability above all else.
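
One way to picture the "unambiguous corrections only" principle is a corrector that proposes a fix only when exactly one dictionary word lies within a single edit of the unknown token. This is a simplified illustration under that assumption, not the algorithm implemented in this repository; the toy `VOCAB` set and the function names are hypothetical.

```python
ALPHABET = "абвгдеёжзийклмнопрстуфхцчшщъыьэюя"

# A toy vocabulary; a real dictionary would hold a very large number of word forms.
VOCAB = {"корова", "корона", "молоко", "солнце"}


def edits1(word):
    """All strings one deletion, transposition, substitution, or insertion away."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = {a + b[1:] for a, b in splits if b}
    swaps = {a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1}
    subs = {a + c + b[1:] for a, b in splits if b for c in ALPHABET}
    inserts = {a + c + b for a, b in splits for c in ALPHABET}
    return (deletes | swaps | subs | inserts) - {word}


def correct_if_unambiguous(word, vocab=VOCAB):
    """Return a correction only when exactly one candidate exists; otherwise keep the word."""
    if word in vocab:
        return word
    candidates = edits1(word) & vocab
    return next(iter(candidates)) if len(candidates) == 1 else word


print(correct_if_unambiguous("малоко"))  # single candidate "молоко", so it is corrected
print(correct_if_unambiguous("корора"))  # two candidates, so the word is left unchanged
```

Refusing to choose between "корова" and "корона" lowers recall but avoids a wrong fix, which is the trade-off described above.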

Performance on the RUPOR Dataset

The spellchecker has been evaluated on the RUPOR dataset. Because the tool is tuned for precision, the evaluation uses the F_0.5 metric, which weights precision more heavily than recall, instead of the traditional F_1 score.

Domain	F_0.5	Precision	Recall
RUPOR poetry	0.75	0.98	0.39
RUPOR prose	0.82	1.0	0.47
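
For reference, the F_0.5 values follow from the standard F-beta definition, F_beta = (1 + beta^2) * P * R / (beta^2 * P + R), with beta = 0.5. The short sketch below recomputes them from the precision and recall columns of the table; the `f_beta` helper is written here for illustration and is not part of the repository.

```python
def f_beta(precision, recall, beta=0.5):
    """F-beta score; beta < 1 weights precision more heavily than recall."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Precision and recall values taken from the table above.
for domain, p, r in [("RUPOR poetry", 0.98, 0.39), ("RUPOR prose", 1.0, 0.47)]:
    print(domain, round(f_beta(p, r), 2))
```

Rounded to two decimals, this gives 0.75 for poetry and 0.82 for prose, matching the table.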

A more detailed description is coming soon.

License

This project is licensed under the MIT License. See the LICENSE file for details.

Contributing

We welcome contributions to improve the spellchecker! Here’s how you can help:

Expand the dictionary: Add new words or domain-specific terms.
Add new rules: Define custom replacement rules for common errors.
Report issues: If you encounter any false positives or false negatives, please open an issue on GitHub.

To contribute, fork the repository, make your changes, and submit a pull request.

Contact

For questions, suggestions, or collaborations, feel free to reach out:

Email: [[email protected]]

GitHub Issues: Open an issue
