A high-precision, rule- and dictionary-based spellchecker designed for Russian-language texts and briefly described in the article [Generation of Russian Poetry of Different Genres and Styles Using Neural Networks with Character-Level Tokenization](link to be inserted here after camera-ready paper submission). This tool emphasizes precision over recall, making it ideal for preparing clean data for pretraining and fine-tuning language models (LMs).
- Precision-First Approach:
  - The spellchecker prioritizes minimizing false positives over catching every possible error. This ensures that corrections are accurate and reliable, which is critical for preparing high-quality training data for language models.
  - Incorrect fixes can introduce anomalies that degrade the performance of generative LMs on downstream tasks. This tool avoids such issues by correcting only unambiguous errors.
- CPU-Only, No ML Components:
The spellchecker operates entirely on the CPU and does not rely on machine learning models. This makes it lightweight, fast, and suitable for processing large text corpora (tens of GBs) in data preparation pipelines.
- Interpretability and Determinism:
Every correction is traceable and deterministic. You can use a debugger to identify which rule or dictionary entry caused a specific correction, ensuring full transparency and control.
- Extensible Dictionary and Rules:
The spellchecker is designed for easy expansion. You can add new words to the dictionary or define custom replacement rules to adapt the tool to specific domains or use cases.
- Restoring Cyrillic Characters in Russian Text:
Some Latin characters and digits are visually identical or very similar to Cyrillic characters. When they appear in Russian text, they are hard to spot by eye, yet they can noticeably degrade language model training, and the same defects tend to resurface in generated text. In our experience, such defects are frequent enough to hurt model performance. To address this, we provide a simple yet effective script that restores Cyrillic characters in Russian text; see restore_cyrillic.py.
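The idea behind restoring Cyrillic characters can be illustrated with a minimal sketch. This is not the code from restore_cyrillic.py: the function name `restore_cyrillic_sketch`, the partial look-alike table, and the "only touch words that already contain Cyrillic" heuristic are all assumptions made for illustration, and look-alike digits are not handled here.

```python
import re

# Assumed, partial table of Latin letters that look like Cyrillic ones.
LATIN_LOOKALIKES = "ABCEHKMOPTXaceopxy"
CYRILLIC_TWINS = "АВСЕНКМОРТХасеорху"
LATIN_TO_CYRILLIC = str.maketrans(LATIN_LOOKALIKES, CYRILLIC_TWINS)

CYRILLIC = re.compile(r"[А-Яа-яЁё]")
TOKEN = re.compile(r"[^\W\d_]+")  # runs of letters in any script


def restore_cyrillic_sketch(text: str) -> str:
    """Replace Latin look-alikes inside words that already contain Cyrillic."""

    def fix(match: re.Match) -> str:
        word = match.group(0)
        # Purely Latin words (e.g. "Python") are left untouched.
        if not CYRILLIC.search(word):
            return word
        candidate = word.translate(LATIN_TO_CYRILLIC)
        # Accept the fix only if the result is entirely Cyrillic;
        # otherwise the word is genuinely mixed and is left as is.
        if all(CYRILLIC.match(ch) for ch in candidate):
            return candidate
        return word

    return TOKEN.sub(fix, text)


# The second letter of the sample is a Latin "o".
sample = "м" + "o" + "локо"
print(restore_cyrillic_sketch(sample) == "молоко")  # True
```

Restricting the replacement to words that become fully Cyrillic keeps the sketch in line with the precision-first approach: a word that still contains other Latin letters after translation is left unchanged.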
Because of Git LFS quota limits, the binary dictionary files are not included in this repository.
Use the link to download the archive, then unpack it into the root of your local copy of the repository.
Here’s a quick example of how to use the spellchecker:
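```python
from spellcheck import PoeticSpellchecker
from udpipe_parser import UdpipeParser

# Load the UDPipe models used by the parser.
parser = UdpipeParser()
parser.load('./models')

# Create the spellchecker and load its data files.
schecker = PoeticSpellchecker(parser)
schecker.load('./data')

# fix() returns the corrected text together with the applied fixups.
new_text, fixups = schecker.fix("Вмести в себя все от кровенья мира")
print(new_text)
```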
The spellchecker is built on the principle of absolute minimization of false positives. It corrects only those errors where the intended correction is unambiguous. While it’s impossible to eliminate false positives entirely (e.g., in cases of intentionally distorted or stylized language), the system prioritizes accuracy and reliability above all else.
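The "correct only when unambiguous" principle can be sketched on a toy scale. The snippet below is not the repository's implementation: the helper names, the single-edit candidate generation, and the tiny dictionary are assumptions chosen only to show how an ambiguous case is left untouched.

```python
RUSSIAN_ALPHABET = "абвгдеёжзийклмнопрстуфхцчшщъыьэюя"


def single_edit_variants(word: str) -> set[str]:
    """All strings one deletion, substitution, or insertion away from word."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = {a + b[1:] for a, b in splits if b}
    substitutions = {a + c + b[1:] for a, b in splits if b for c in RUSSIAN_ALPHABET}
    inserts = {a + c + b for a, b in splits for c in RUSSIAN_ALPHABET}
    return (deletes | substitutions | inserts) - {word}


def fix_if_unambiguous(word: str, dictionary: set[str]) -> str:
    """Correct a word only if exactly one dictionary entry is one edit away."""
    if word in dictionary:
        return word
    candidates = single_edit_variants(word) & dictionary
    if len(candidates) == 1:
        return next(iter(candidates))
    # Zero or several candidates: leave the word as is rather than guess.
    return word


toy_dictionary = {"корова", "молоко", "кот", "кит", "код"}
print(fix_if_unambiguous("карова", toy_dictionary))  # корова
print(fix_if_unambiguous("кат", toy_dictionary))     # кат (both "кот" and "кит" fit)
```

Declining to guess in the second case lowers recall but avoids a possible wrong fix, which is the trade-off described above.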
The spellchecker has been evaluated on the RUPOR dataset. Given the focus on precision, the evaluation uses the F0.5 metric (which emphasizes precision over recall) instead of the traditional F1 score.
| Domain | F0.5 | Precision | Recall |
|---|---|---|---|
| RUPOR poetry | 0.75 | 0.98 | 0.39 |
| RUPOR prose | 0.82 | 1.0 | 0.47 |
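For reference, F0.5 is the F-beta score with beta = 0.5, which weights precision more heavily than recall. The short sketch below recomputes it from the rounded precision and recall values in the table using the standard F-beta formula; the function name `f_beta` is ours, and the published scores were presumably computed from unrounded counts, so agreement to two decimals is all that should be expected.

```python
def f_beta(precision: float, recall: float, beta: float = 0.5) -> float:
    """Standard F-beta score: (1 + b^2) * P * R / (b^2 * P + R)."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


for domain, p, r in [("RUPOR poetry", 0.98, 0.39), ("RUPOR prose", 1.0, 0.47)]:
    print(f"{domain}: F0.5 = {f_beta(p, r):.2f}")
# RUPOR poetry: F0.5 = 0.75
# RUPOR prose: F0.5 = 0.82
```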
A more detailed description is coming soon.
This project is licensed under the MIT License. See the LICENSE file for details.
We welcome contributions to improve the spellchecker! Here’s how you can help:
- Expand the dictionary: Add new words or domain-specific terms.
- Add new rules: Define custom replacement rules for common errors.
- Report issues: If you encounter any false positives or false negatives, please open an issue on GitHub.
To contribute, fork the repository, make your changes, and submit a pull request.
For questions, suggestions, or collaborations, feel free to reach out:
Email: [[email protected]]
GitHub Issues: Open an issue