Texo is pronounced /ˈtɛːkoʊ/.
A minimalist, free and open-source SOTA LaTeX OCR model with only 20M parameters.
- Free and open-source.
- Fast and lightweight inference.
- Trainable on a consumer-level GPU.
- Well-organized code that doubles as a tutorial.
- Runs in the browser! (coming soon)
Despite the growing number of STEM and AI learners with note-taking needs today, a free, fast, accessible yet precise LaTeX OCR tool is still missing. With its closed vocabulary and limited need for generalization, this classical pattern recognition task lies squarely in the comfort zone of machine learning and can be considered solved thanks to recent deep learning progress (TrOCR, GOT-2.0, UniMERNet, PPFormulaNet). So here comes Texo, which tackles the problem within the scope of a personal project.
It is also a comprehensive exercise in combining the knowledge and experience I have gained so far from school and online, as well as a tentative contribution to my beloved open-source community.
Texo is a distilled version of PPFormulaNet-S finetuned on UniMERNet-1M, so it should preserve most of PPFormulaNet-S's performance. Here are the evaluation results on the UniMERNet-Test dataset.
| Model | Params | Metric | SPE | CPE | SCE | HWE |
|---|---|---|---|---|---|---|
| UniMERNet-T† | 107M | BLEU | 0.909 | 0.902 | 0.566 | 0.883 |
| | | Edit distance | 0.066 | 0.075 | 0.239 | 0.078 |
| PPFormulaNet-S† | 57M | BLEU | 0.8694 | 0.8071 | - | - |
| | | Edit distance | - | - | - | - |
| Texo-distill* | 20M | BLEU | 0.9014 | 0.8909 | 0.7034 | 0.8606 |
| | | Edit distance | 0.0780 | 0.1042 | 0.1941 | 0.0995 |
| Texo-transfer* | 20M** | BLEU | 0.8597 | 0.8334 | 0.5549 | 0.7973 |
| | | Edit distance | 0.0980 | 0.1306 | 0.2187 | 0.0999 |
| Texo-transfer-onnx | as above | BLEU | 0.8395 | 0.8136 | 0.5153 | 0.7787 |
| | | Edit distance | 0.0980 | 0.1288 | 0.2050 | 0.0976 |
We only list the lightweight versions of the SOTA models. In terms of sequence-level metrics such as BLEU and edit distance, our model achieves comparable performance with far fewer parameters.
-: not reported in the paper.
†: copied from the paper.
*: Texo-distill uses the same tokenizer as UniMERNet and PPFormulaNet, so the sequence metrics are strictly comparable. Texo-transfer uses a customized tokenizer, so its metrics are not directly comparable (we obtain shorter sequences; see the notes for details). A fairer evaluation metric for LaTeX OCR would be CDM, but I was too lazy to run it given its implementation complexity.
**: Slightly fewer than Texo-distill, since the tokenizer's vocabulary is smaller.
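For reference, here is a minimal sketch of how such sequence-level metrics can be computed. This is not the project's exact evaluation script, and it assumes nltk and rapidfuzz are available.

```python
# Minimal sketch of the sequence-level metrics above (not the exact evaluation script).
# Predictions and references are LaTeX strings; tokenization here is a naive whitespace split.
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction
from rapidfuzz.distance import Levenshtein

def evaluate(predictions: list[str], references: list[str]) -> dict[str, float]:
    pred_tokens = [p.split() for p in predictions]
    ref_tokens = [[r.split()] for r in references]  # corpus_bleu expects a list of reference lists
    bleu = corpus_bleu(ref_tokens, pred_tokens, smoothing_function=SmoothingFunction().method4)
    # Normalized edit distance: 0 = identical strings, 1 = completely different.
    edit = sum(Levenshtein.normalized_distance(p, r) for p, r in zip(predictions, references)) / len(predictions)
    return {"bleu": bleu, "edit_distance": edit}

print(evaluate([r"\frac { a } { b }"], [r"\frac { a } { b }"]))  # perfect match: bleu 1.0, edit 0.0
```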
git clone https://github.com/alephpi/Texo
uv sync
For those who don't use uv, it is worth a try. For those who insist on not using it, I guess you know how to adapt the commands.
# model only
python scripts/python/hf_hub.py pull
# for those who want to train from useful checkpoints
python scripts/python/hf_hub.py pull --with_useful_ckpts
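If you prefer to fetch the files manually, the same can be done with huggingface_hub. The repo id below is an assumption; check scripts/python/hf_hub.py for the exact one the script uses.

```python
# Sketch of pulling the model files manually with huggingface_hub.
# "alephpi/Texo" is an assumed repo id -- check scripts/python/hf_hub.py for the real one.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(repo_id="alephpi/Texo", local_dir="checkpoints")
print(f"Model files downloaded to {local_dir}")
```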
Check demo.ipynb
- Mine: 50G CPU memory, one A40/L40S (46G GPU memory).
- Recommended: 50G CPU memory, 40G GPU memory.
- Minimal: 20G CPU memory (with streaming dataloading) and 16G GPU memory (with gradient accumulation, sketched below).
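Here "gradient accumulation" is the usual trick of backpropagating several micro-batches before a single optimizer step, trading wall-clock time for GPU memory. A generic PyTorch sketch, not the project's actual training loop:

```python
# Generic gradient accumulation sketch (not the project's actual training loop):
# run several micro-batches through backward() before one optimizer step.
def train_one_epoch(model, dataloader, optimizer, criterion, accumulation_steps=4, device="cuda"):
    model.train()
    optimizer.zero_grad()
    for step, (images, targets) in enumerate(dataloader):
        images, targets = images.to(device), targets.to(device)
        loss = criterion(model(images), targets) / accumulation_steps  # scale so the summed gradient matches a full batch
        loss.backward()                                                # gradients accumulate across micro-batches
        if (step + 1) % accumulation_steps == 0:
            optimizer.step()
            optimizer.zero_grad()
```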
Follow https://huggingface.co/datasets/wanderkid/UniMER_Dataset as I did.
If you are lazy, use the copies that I arranged and normalized (a loading sketch follows the links):
- https://huggingface.co/datasets/alephpi/UniMER-Train
- https://huggingface.co/datasets/alephpi/UniMER-Test
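Both can be loaded directly with the datasets library; streaming mode is what the minimal hardware setup above relies on. A minimal sketch (the split name is an assumption, check the dataset card):

```python
# Minimal sketch: load the pre-arranged dataset straight from the Hub.
# streaming=True avoids materializing the whole set in CPU memory.
# The split name "train" is an assumption -- check the dataset card.
from datasets import load_dataset

train_ds = load_dataset("alephpi/UniMER-Train", split="train", streaming=True)
sample = next(iter(train_ds))
print(sample.keys())  # inspect the available columns (image / LaTeX fields)
```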
If you are interested in all the preprocessing steps, check here and here, where I collected and sorted all the useful KaTeX commands.
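To give a flavour of what such preprocessing involves, here is a toy normalization step. It is illustrative only, not the project's actual pipeline: split a formula into KaTeX-style commands and single symbols so the tokenizer always sees consistently separated tokens.

```python
# Toy illustration only, not the project's preprocessing pipeline:
# split a formula into KaTeX-style commands and single symbols, then re-join with spaces.
import re

TOKEN_RE = re.compile(r"\\[a-zA-Z]+|\\.|\S")  # \command, escaped symbol, or any single non-space character

def normalize(formula: str) -> str:
    return " ".join(TOKEN_RE.findall(formula))

print(normalize(r"\frac{a+b}{2}"))  # \frac { a + b } { 2 }
```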
We use hydra to manage training configurations and experiments.
# train
python src/train.py
# resume from a checkpoint
python src/train.py training.resume_from_ckpt="<ckpt_path>"
# debug
python src/train.py --config-dir="./config" --config-name="train_debug.yaml"
# train on a slurm cluster
python src/train.py --multirun --config-dir="./config" --config-name="train_slurm.yaml"
See other training configurations in the config directory.
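For readers new to hydra, a training entrypoint typically looks like the sketch below. This is a minimal illustration rather than the actual src/train.py, and the config path/name are assumptions; every field of the composed YAML can then be overridden from the command line, as in the commands above.

```python
# Minimal hydra entrypoint sketch (not the actual src/train.py).
# The config path and name below are assumptions; hydra composes the YAML config,
# applies any command-line overrides, and passes the result in as a DictConfig.
import hydra
from omegaconf import DictConfig, OmegaConf

@hydra.main(config_path="../config", config_name="train", version_base=None)
def main(cfg: DictConfig) -> None:
    print(OmegaConf.to_yaml(cfg))  # the fully resolved configuration
    # build the dataset, model and trainer from cfg here,
    # e.g. resume if cfg.training.resume_from_ckpt is set

if __name__ == "__main__":
    main()
```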
The training results are stored in the outputs directory. To visualize them, run
tensorboard --logdir outputs
Some beautiful loss curves to give you an impression of the loss scale and the convergence process.
- transformers: framework, model decoder, tokenizer
- UniMERNet: dataset, image processor
- Im2Markup: latex preprocessing
- KaTeX: latex vocabulary for training tokenizer and latex parser for preprocessing
- my-unimernet: image processor (plus a nice codebase to demystify UniMERNet)
- PaddleOCR: model architecture, pretraining weights
- PaddleOCR2Pytorch and D-FINE: model encoder implementation
- Im2Markup, LaTeX-OCR and TrOCR: pioneers
- MixTeX and TexTeller: motivation
- Telecom Paris for providing the GPU cluster.
Copyright (C) 2025-present Sicheng Mao [email protected]