Model auditability statement
bezoku models are designed for Indigenous and Low-Resource languages. They are built from first principles on localized training data annotated in the CoNLL-U format. This approach keeps technical debt low, keeps the data pipeline clean, and avoids inheriting bias from third-party tools.
All in-model word and character embeddings are derived from the localized orthography of each CoNLL-U-annotated corpus. This ensures:
- Full auditability: unbroken audit trail from corpus annotation to model weights
- No inherited biases from opaque upstream pretraining data (Wikipedia, Common Crawl, etc.)
- Tokenization based on each language's own UTF-8 orthography, avoiding conflicts with external tooling
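The approach above can be illustrated with a minimal sketch (this is illustrative, not bezoku's actual pipeline): the character vocabulary is read straight from the FORM column of a CoNLL-U corpus, so every symbol in the model is traceable to an annotated token. The sample sentence is a hypothetical stand-in for a real corpus.

```python
# Tiny CoNLL-U fragment standing in for a real annotated corpus.
SAMPLE_CONLLU = """\
# sent_id = 1
# text = Kia ora
1\tKia\tkia\tINTJ\t_\t_\t0\troot\t_\t_
2\tora\tora\tNOUN\t_\t_\t1\tobj\t_\t_
"""

def char_vocab_from_conllu(conllu_text):
    """Collect every character appearing in the FORM column (column 2)."""
    chars = set()
    for line in conllu_text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip comment lines and blank sentence separators
        cols = line.split("\t")
        if len(cols) >= 2:
            chars.update(cols[1])  # FORM column holds the surface token
    return sorted(chars)

vocab = char_vocab_from_conllu(SAMPLE_CONLLU)
```

Because the vocabulary is a pure function of the annotated corpus, the audit trail from annotation to embedding table is a single, inspectable step.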
bezoku does not use BPE, WordPiece, SentencePiece or similar subword tokenizers that:
- Prioritize English, whose characters fit 1-byte UTF-8 representations
- Perpetuate vocabulary biases from English-centric multilingual pretraining
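The byte asymmetry behind the first point can be shown directly: in UTF-8, Basic Latin (English) characters encode in 1 byte each, while many characters used in Indigenous-language orthographies need 2 or more, so byte-level subword vocabularies are structurally cheaper for English. The word pair below is an illustrative example, not drawn from a bezoku corpus.

```python
# UTF-8 byte cost per word: ASCII English vs. a Māori word with macron
# vowels (ā = U+0101, which UTF-8 encodes in 2 bytes).
samples = {
    "english": "cat",   # 3 characters, all ASCII
    "maori": "kākā",    # 4 characters, two of them 2-byte
}
byte_lengths = {name: len(word.encode("utf-8")) for name, word in samples.items()}
```

Equal-length words thus consume unequal byte budgets, which is one way English ends up over-represented in byte-driven subword merges.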
Each language model is trained exclusively on a single-language CoNLL-U corpus, ensuring:
- No cross-lingual transfer from high-resource languages, which may prioritise a different (dominant) orthography
- No multilingual model contamination
- Indigenous and Low-Resource languages are prioritized
The model architecture uses clean, linear data pipelines, particularly for syntactic model development, and excludes:
- External pretrained tooling (embeddings, tokenizers, encoders)
- Third-party preprocessing pipelines with opaque processing
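A linear pipeline of this kind can be sketched as a chain of plain functions over the corpus text, with no pretrained inputs entering at any stage (a hypothetical sketch, not bezoku's real code; the function names are invented for illustration):

```python
# Hypothetical linear pipeline: each stage is a pure function of the
# CoNLL-U corpus, so the whole chain is auditable end to end.
def load_corpus(conllu_text):
    """Stage 1: parse sentences (lists of surface tokens) from CoNLL-U."""
    sentences, current = [], []
    for line in conllu_text.splitlines():
        if not line.strip():
            if current:
                sentences.append(current)
                current = []
        elif not line.startswith("#"):
            current.append(line.split("\t")[1])  # FORM column
    if current:
        sentences.append(current)
    return sentences

def build_vocab(sentences):
    """Stage 2: word vocabulary drawn straight from the corpus tokens."""
    return sorted({tok for sent in sentences for tok in sent})

sentences = load_corpus("1\tKia\n2\tora\n")
vocab = build_vocab(sentences)
```

Every stage's output is determined only by the annotated corpus, which is what makes the pipeline "linear" in the auditability sense: no side channels from external tooling.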
Fixes to the training data, for example when a new UD treebank release is published, feed a clean, repeatable model-maintenance process:
- Training data is transparent and open source
- Model weights are published
- No data-pipeline issues from third-party pretrained embeddings
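One hedged way to make this maintenance loop concrete (an assumption about how it could be done, not a description of bezoku's process) is to record a content hash of the exact training corpus alongside the published weights; a new treebank release changes the hash and cleanly triggers a retrain:

```python
# Sketch: fingerprint the training corpus so each published model is tied
# to the exact data it was trained on. corpus_fingerprint is hypothetical.
import hashlib

def corpus_fingerprint(conllu_text):
    """SHA-256 of the corpus text, recorded next to the model weights."""
    return hashlib.sha256(conllu_text.encode("utf-8")).hexdigest()

recorded = corpus_fingerprint("1\tKia\n")            # hash stored at release
current = corpus_fingerprint("1\tKia\n2\tora\n")     # hash of the new treebank
needs_retrain = recorded != current  # treebank changed: retrain and re-audit
```

Because the only moving part is the open-source corpus itself, the retrain is repeatable and the provenance of every released model stays auditable.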