nickjwhite

This ensures that transformations like unicode normalisation are done on
the truth output as well as the OCR output, so that you can compare
the two properly.

Before this a perfect OCR result could show different lines for Truth and
OCR if the OCR output included characters that were normalised.
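
The idea described above can be sketched in a few lines of Python. This is only an illustration, not the code changed by this pull request: the helper names `normalise` and `lines_match` are hypothetical, and the choice of NFKC is an assumption, since the description only says "transformations like unicode normalisation".

```python
import unicodedata

def normalise(text: str, form: str = "NFKC") -> str:
    # Hypothetical helper: apply one Unicode normalisation form to a line.
    return unicodedata.normalize(form, text)

def lines_match(truth: str, ocr: str, form: str = "NFKC") -> bool:
    # Compare truth and OCR text only after both went through the same transformation.
    return normalise(truth, form) == normalise(ocr, form)

truth = "\ufb01ne"                 # "ﬁne", written with the U+FB01 ligature
ocr_normalised = normalise(truth)  # simulates an evaluator that normalises only the OCR side

raw_equal = truth == ocr_normalised              # one side normalised, the other not
both_equal = lines_match(truth, ocr_normalised)  # both sides normalised

print(raw_equal, both_equal)
```

With only one side normalised, a perfect recognition result still compares as different. Normalising both sides removes that false mismatch.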

@Shreeshrii
Collaborator

@nickjwhite Please provide a sample demonstrating this.

> Before this a perfect OCR result could show different lines for Truth and
> OCR if the OCR output included characters that were normalised.

I had noticed this in the past but do not have any ready example to test and verify.

@Shreeshrii
Collaborator

Is this issue related?

@wollmers

wollmers commented Sep 7, 2021

> Is this issue related?

No, that issue looks more like a case of the wrong normalisation form, one that normalises the long s (ſ) to s:

$ perl -e 'use utf8; use Unicode::Normalize; print NFC("ſ"),"\n";'
ſ
$ perl -e 'use utf8; use Unicode::Normalize; print NFKC("ſ"),"\n";'
s
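
For readers without Perl at hand, the same comparison can be reproduced with Python's standard `unicodedata` module. This is just an equivalent sketch of the two one-liners above. It says nothing about which form any particular tool applies.

```python
import unicodedata

LONG_S = "\u017f"  # ſ, LATIN SMALL LETTER LONG S

# NFC applies canonical mappings only, so the long s is left alone.
nfc = unicodedata.normalize("NFC", LONG_S)

# NFKC also applies compatibility mappings, which fold the long s to a plain "s".
nfkc = unicodedata.normalize("NFKC", LONG_S)

print(nfc, nfkc)
```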

@Shreeshrii
Collaborator

Shreeshrii commented Dec 6, 2021

Ok, I have a sample now.

eng Praja exp0_159

Ground Truth: aṇṇi- aṇṇi- , 11 v. 904)² (p. 142) alakkaḻi- ... in the Coimbatore
OCR via CLI using custom IAST traineddata: aṇṇi- aṇṇi- , 11 v. 904)² (p. 142) alakkaḻi- ... in the Coimbatore
OCR via lstmeval using same custom IAST traineddata: aṇṇi- aṇṇi- , 11 v. 904)2 (p. 142) alaḵkaḻi- ... in the Coimbatore

The superscript 2 (²) is getting normalized to the digit 2 by lstmeval.
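
A small Python check shows why a compatibility normalisation would change the superscript but leave the IAST letters in this sample alone. That lstmeval applies such a form is an assumption based on the observed output, not something confirmed in this thread.

```python
import unicodedata

SUPERSCRIPT_TWO = "\u00b2"  # ²
N_DOT_BELOW = "\u1e47"      # ṇ, as used in the IAST sample

# A tagged decomposition such as "<super> ..." is a compatibility mapping.
# NFKC applies it and does not undo it afterwards.
super_decomp = unicodedata.decomposition(SUPERSCRIPT_TWO)

# An untagged decomposition is canonical: NFKC decomposes and then recomposes it.
n_decomp = unicodedata.decomposition(N_DOT_BELOW)

super_nfkc = unicodedata.normalize("NFKC", SUPERSCRIPT_TWO)
n_nfkc = unicodedata.normalize("NFKC", N_DOT_BELOW)

print(repr(super_decomp), repr(n_decomp), super_nfkc, n_nfkc)
```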

@Shreeshrii
Collaborator

Similarly for the trademark symbol (™):

san Guru_Italic 0000203 exp0_0

GT: TOPOGRAPHIC FASHIONABLE WETTER Core™2 problem ALLOWED) *Call YOU, Kanpur coach
CLI OCR: TOPOGRAPHIC FASHIONABLE WETTER Core™2 problem ALLOWED) *Call YOU, Kanpur coach
lstmeval OCR: TOPOGRAPHIC FASHIONABLE WETTER CoreTM?2 problem ALLOWED) *Call YOU, Kanpur coach
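
To find out in advance which characters of a ground truth line are affected, one could list every character that a compatibility normalisation rewrites. The sketch below assumes NFKC, and the helper name `compat_changes` is hypothetical. It accounts for the `TM` in the lstmeval line but not for the extra `?`.

```python
import unicodedata

def compat_changes(text: str) -> list[tuple[str, str]]:
    # Hypothetical helper: report (original, normalised) pairs for every
    # character that NFKC rewrites. Checking one code point at a time is a
    # simplification; it ignores interactions between adjacent characters.
    changes = []
    for ch in text:
        mapped = unicodedata.normalize("NFKC", ch)
        if mapped != ch:
            changes.append((ch, mapped))
    return changes

gt = "WETTER Core\u21222 problem"  # fragment of the ground truth line with ™
changes = compat_changes(gt)

print(changes)
```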

@stweil I have attached a zip file with the custom IAST traineddata.

IAST_0.267000_136760_880600.zip
