Description
Problems:
The model often hallucinates, producing sections of looping (repeated) tokens.
The model often returns empty responses.
Reproduction:
Scanned texts in Russian and Kazakh.
Possible solutions:
- Provide more detailed guidance on how to use the model, including recommended settings and preprocessing steps.
- Fine-tune the model on a synthetic dataset, as I am not aware of any readily available annotated Kazakh/Russian datasets for OCR.
I've integrated the chandra repo code into my pipeline with the exact vLLM settings, request settings, prompts and preprocessing, but I still see a high error and hallucination rate that makes this model barely usable in production for my language set. My test set has ~60 documents, and on average 5-6 of them consistently show token-loop hallucination. 1 or 2 documents consistently have missing pages.
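A minimal sketch of how the two failure modes above (token loops and empty responses) could be flagged automatically per page, assuming the model output is available as a plain string. The function names and the `ngram` / `min_repeats` thresholds are hypothetical and are not part of chandra or vLLM; they would need tuning on real outputs.

```python
from collections import Counter


def has_token_loop(text: str, ngram: int = 8, min_repeats: int = 5) -> bool:
    """Return True if any run of `ngram` whitespace-separated tokens
    occurs at least `min_repeats` times in `text`.

    Assumption: a loop shows up as the same token sequence repeated many
    times. Loops without whitespace (e.g. one repeated character) are
    not caught by this check.
    """
    tokens = text.split()
    if len(tokens) < ngram:
        return False
    # Count every sliding window of `ngram` consecutive tokens.
    counts = Counter(
        tuple(tokens[i:i + ngram]) for i in range(len(tokens) - ngram + 1)
    )
    return max(counts.values()) >= min_repeats


def classify_page(text: str) -> str:
    """Label one page of OCR output as 'empty', 'loop', or 'ok'."""
    if not text.strip():
        return "empty"
    if has_token_loop(text):
        return "loop"
    return "ok"
```

Running `classify_page` over every page of a test set gives per-document counts of empty and looping pages, which makes failure rates easier to compare across settings.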
The scans I use are of decent quality with clearly legible text, but the model still fails at a relatively high rate.
I also tried sending the PDF page images directly, without intermediate conversion steps, hoping that a lossless pipeline would help, but with no success.
I can't share the exact documents because they are private.
Any help would be appreciated.