Model auditability statement
bezoku models are designed for Indigenous and Low-Resource languages. They are built from first principles on localized training data annotated in the CoNLL-U format. This approach keeps technical debt low, keeps the data pipeline clean, and avoids inheriting bias from third-party tools.
All in-model word and character embeddings are derived from the localized orthography of each CoNLL-U-annotated corpus. This ensures:
- Full auditability: unbroken audit trail from corpus annotation to model weights
- No inherited biases from opaque upstream pretraining data (Wikipedia, Common Crawl, etc.)
- Tokenization based on each language's own UTF-8 orthography, avoiding conflicts with external tooling
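The approach above can be illustrated with a minimal sketch (this is illustrative, not bezoku's actual pipeline): the character vocabulary is read straight from the FORM column of a CoNLL-U corpus, so every symbol in the model is traceable to an annotated token. The sample sentence is a hypothetical stand-in for a real corpus.

```python
# Tiny CoNLL-U fragment standing in for a real annotated corpus.
SAMPLE_CONLLU = """\
# sent_id = 1
# text = Kia ora
1\tKia\tkia\tINTJ\t_\t_\t0\troot\t_\t_
2\tora\tora\tNOUN\t_\t_\t1\tobj\t_\t_
"""

def char_vocab_from_conllu(conllu_text):
    """Collect every character appearing in the FORM column (column 2)."""
    chars = set()
    for line in conllu_text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip comment lines and blank sentence separators
        cols = line.split("\t")
        if len(cols) >= 2:
            chars.update(cols[1])  # FORM column holds the surface token
    return sorted(chars)

vocab = char_vocab_from_conllu(SAMPLE_CONLLU)
```

Because the vocabulary is a pure function of the annotated corpus, the audit trail from annotation to embedding table is a single, inspectable step.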
bezoku does not use BPE, WordPiece, SentencePiece or similar subword tokenizers that:
- Prioritize English, whose characters fit 1-byte UTF-8 representations
- Perpetuate vocabulary biases from English-centric multilingual pretraining
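The byte asymmetry behind the first point can be shown directly: in UTF-8, Basic Latin (English) characters encode in 1 byte each, while many characters used in Indigenous-language orthographies need 2 or more, so byte-level subword vocabularies are structurally cheaper for English. The word pair below is an illustrative example, not drawn from a bezoku corpus.

```python
# UTF-8 byte cost per word: ASCII English vs. a Māori word with macron
# vowels (ā = U+0101, which UTF-8 encodes in 2 bytes).
samples = {
    "english": "cat",   # 3 characters, all ASCII
    "maori": "kākā",    # 4 characters, two of them 2-byte
}
byte_lengths = {name: len(word.encode("utf-8")) for name, word in samples.items()}
```

Equal-length words thus consume unequal byte budgets, which is one way English ends up over-represented in byte-driven subword merges.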
Each language model is trained exclusively on a single-language CoNLL-U corpus, ensuring:
- No cross-lingual transfer from high-resource languages, which may prioritise a different (dominant) orthography
- No multilingual model contamination
- Indigenous and Low-Resource languages are prioritized
The model architecture uses clean, linear data pipelines, particularly for syntactic model development, and excludes:
- External pretrained tooling (embeddings, tokenizers, encoders)
- Third-party preprocessing pipelines with opaque processing
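A linear pipeline of this kind can be sketched as a chain of plain functions over the corpus text, with no pretrained inputs entering at any stage (a hypothetical sketch, not bezoku's real code; the function names are invented for illustration):

```python
# Hypothetical linear pipeline: each stage is a pure function of the
# CoNLL-U corpus, so the whole chain is auditable end to end.
def load_corpus(conllu_text):
    """Stage 1: parse sentences (lists of surface tokens) from CoNLL-U."""
    sentences, current = [], []
    for line in conllu_text.splitlines():
        if not line.strip():
            if current:
                sentences.append(current)
                current = []
        elif not line.startswith("#"):
            current.append(line.split("\t")[1])  # FORM column
    if current:
        sentences.append(current)
    return sentences

def build_vocab(sentences):
    """Stage 2: word vocabulary drawn straight from the corpus tokens."""
    return sorted({tok for sent in sentences for tok in sent})

sentences = load_corpus("1\tKia\n2\tora\n")
vocab = build_vocab(sentences)
```

Every stage's output is determined only by the annotated corpus, which is what makes the pipeline "linear" in the auditability sense: no side channels from external tooling.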
Fixes to the training data, for example when a new UD treebank release is published, feed a clean, repeatable model-maintenance process:
- Training data is transparent and open source
- Model weights are published
- No data-pipeline issues from third-party pretrained embeddings
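One hedged way to make this maintenance loop concrete (an assumption about how it could be done, not a description of bezoku's process) is to record a content hash of the exact training corpus alongside the published weights; a new treebank release changes the hash and cleanly triggers a retrain:

```python
# Sketch: fingerprint the training corpus so each published model is tied
# to the exact data it was trained on. corpus_fingerprint is hypothetical.
import hashlib

def corpus_fingerprint(conllu_text):
    """SHA-256 of the corpus text, recorded next to the model weights."""
    return hashlib.sha256(conllu_text.encode("utf-8")).hexdigest()

recorded = corpus_fingerprint("1\tKia\n")            # hash stored at release
current = corpus_fingerprint("1\tKia\n2\tora\n")     # hash of the new treebank
needs_retrain = recorded != current  # treebank changed: retrain and re-audit
```

Because the only moving part is the open-source corpus itself, the retrain is repeatable and the provenance of every released model stays auditable.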