Hyphenation problems in Portuguese #2001

Closed
jodros opened this issue Feb 20, 2024 · 22 comments
Labels
documentation Documentation bug or improvement issue question Ask for advice or investigate solutions

Comments

@jodros
Contributor

jodros commented Feb 20, 2024

[image]
[image] [image] [image]

One word I noticed that also has trouble being hyphenated is quando.

Yes, I know the first example isn't the best in terms of readability, but it's what I have right now, since I'm trying out the parallel package. I could give more examples for Russian soon...

@jodros
Contributor Author

jodros commented Feb 20, 2024

[image]

Another layout of the first example. This time I couldn't see the frames, because -d frames isn't working well when I use parallel...

@jodros
Contributor Author

jodros commented Feb 20, 2024

[image]

@alerque
Member

alerque commented Feb 20, 2024

Let's split PT/RU into different issues, because language-specific problems don't always get resolved at the same time or via the same PR. Let's make this issue the PT one, please.

For hyphenation issues, the first thing to check is if we even have break points to work with. Evidently not:

$ ./sile
SILE v0.14.17.r373-g72965ad (LuaJIT 2.1.1700206165) [Rust]
> SILE.showHyphenationPoints("quando", "pt")
quando
> SILE.showHyphenationPoints("apaziguam", "pt")
apa-zi-guam

So at least for "quando", for some reason the patterns are not allowing any hyphenation there. According to PT language rules, where should the points be?

The screenshots are kind of hard to work with for this, because I can't tell whether other metrics (like not having any stretch available) might be contributing to poor break choices. In many cases I also can't be sure I'm typing the same text as you are entering. Can you post the actual XML/SIL input files you're testing too?

@Omikhleia
Member

Omikhleia commented Feb 20, 2024

Unless I misunderstood the screenshot, it doesn't look like a hyphenation issue, but rather a justification issue (overfull lines).

These examples have fairly short columns: have you tried loosening the justification constraints?

As in TeX, by default, overfull lines are preferred over underfull lines when the constraints (on space stretching/shrinking, etc.) cannot be respected.

You can try tweaking, in order:

  • linebreak.emergencyStretch (e.g. set it around 1em, it's a delicate setting)
  • linebreak.tolerance (defaults to 500, something around 2000 might be necessary when the width is constrained, or even up to 5000 in very short columns)

There are other settings (pretolerance, and even the space stretchability) that might be changed too, but they are more difficult (IMHO) to tweak "correctly".
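
To give a feel for what those tolerance values mean, here is a rough Python sketch. It assumes SILE's line breaker keeps TeX's usual notion of badness (roughly 100 times the cube of the stretch ratio, capped at 10000), which this thread implies but does not spell out. The function names are made up for illustration and are not SILE APIs.

```python
def badness(ratio: float) -> float:
    """Approximate TeX-style badness for a line stretched by `ratio`.

    `ratio` is (extra width needed) / (total stretch available on the line).
    Assumption: SILE follows TeX's definition here.
    """
    return min(10000.0, 100.0 * abs(ratio) ** 3)


def max_ratio(tolerance: float) -> float:
    """Largest stretch ratio whose badness still fits under `tolerance`."""
    return (tolerance / 100.0) ** (1.0 / 3.0)


# How far spaces may stretch beyond their nominal stretchability
# before a line is rejected, for the values mentioned above.
for tolerance in (500, 2000, 5000):
    print(tolerance, round(max_ratio(tolerance), 2))
```

Under that assumption, raising the tolerance from 500 to 5000 roughly doubles the stretch a line may absorb, which is why short columns with few spaces per line tend to need the larger values.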

If this is indeed the issue at stake, then it pops up quite regularly, e.g. see #620 (comment)

I know the documentation mentions that we use TeX's paragraph shaping and also briefly explains the settings...
But perhaps we could make it clearer for casual readers (that's quite a FAQ, even in the TeX world...) -- especially when most office solutions nowadays prefer underfull lines (at the risk of bad paragraphing in most cases).

Note that making these settings dynamically adaptable (e.g. depending on font size and target line width) could be an interesting exercise for an experimental package, as a possible helper to minimize the occurrence of these situations. We can easily modify the typesetter to account for such dynamic approaches, which was harder in old TeX (i.e. at least before LuaTeX added hooks in many places, though I don't know how much "hackability" it would now have here).

@Omikhleia
Member

(BTW, regarding quando, Typst doesn't hyphenate it either at this point (see https://typst.app/tools/hyphenate/). That's quite logical, as it uses the same TeX hyphenation patterns as SILE -- but at least it shows the problem comes from these original patterns, and is not a SILE-specific issue.)

@Omikhleia Omikhleia added question Ask for advice or investigate solutions documentation Documentation bug or improvement issue labels Feb 26, 2024
@jodros
Contributor Author

jodros commented Mar 3, 2024

I ran showHyphenationPoints on some other words with the same issue, and noticed that some of them are indeed missing rules, e.g.

  • pri-meiro
  • re-cordo
  • to-mado
  • vai-dade
  • mal-dito

@jodros
Contributor Author

jodros commented Mar 3, 2024

You can try tweaking, in order:

linebreak.emergencyStretch (e.g. set it around 1em, it's a delicate setting)
linebreak.tolerance

I've tested this and can confirm that it sometimes solves the problem, thanks.

@Omikhleia
Member

I ran showHyphenationPoints in some other words with the same issue, and noticed that some of them are indeed missing the rules, e.g.

  • pri-meiro
  • re-cordo
  • to-mado
  • vai-dade
  • mal-dito

But what should they be? SILE and Typst both use the TeX patterns, and both programs show the same hyphenation points here, don't they?

@jodros
Contributor Author

jodros commented Mar 4, 2024

But what should they be?

I forgot to say: they should be:

  • pri-mei-ro
  • re-cor-do
  • to-ma-do
  • vai-da-de
  • mal-di-to

@Omikhleia
Member

SILE is using (a Lua port of) https://github.com/hyphenation/tex-hyphen/blob/master/hyph-utf8/tex/generic/hyph-utf8/patterns/tex/hyph-pt.tex

So this is likely an issue for https://github.com/hyphenation/tex-hyphen (though it would be easier then if SILE was able to use TeX patterns directly rather than having its own error-prone re-implementation as a Lua table, or to ship with a conversion script).

@Omikhleia
Member

(This being said, one can also register exceptions manually, with \hyphenator:add-exceptions)

@Omikhleia
Member

Unless there's something clear to do here, I am going to suggest closing/rejecting this issue, which has been inactive for 2+ months:

  • Part of it is merely due to tuning configuration options for small columns, which is doable in the existing code base (via emergencyStretch, tolerance, etc.)
  • Part of it is due to existing TeX patterns as-they-are = Not Our Bug

@jodros jodros closed this as completed May 7, 2024
@alerque
Member

alerque commented Jun 3, 2024

Just throwing this out there, we are in no way limited to using the hyphenation rules from tex-hyphen as is. We can correct them locally in our vendored copy when appropriate, submit fixes upstream if needed, and even use different hyphenation code for different languages. Particularly with the Rust wrapper there are several libraries we could surface.

If something is still wrong here (@jodros, do you have any references to official grammar guides and/or other discussions of implementations anywhere that help confirm this is a bug?), I'd like to actually look into what it is. There may always be exceptions not covered by a codifiable rule, but even in that case we can add exceptions by default if they are well known and agreed on.

@alerque alerque reopened this Jun 3, 2024
@jodros
Contributor Author

jodros commented Jun 4, 2024

We can correct them locally in our vendored copy when appropriate, submit fixes upstream if needed

I'm glad to read this.

Well, I've just taken a look at languages/pt.lua right

\begin{document}
\language[main=pt]

\script{
  local words = { "quando", "econômico", "recordo", "tomado", "vaidade", "maldito", "fonética", "aproveitado" }

  for _, word in ipairs(words) do
    SILE.typesetter:typeset(SILE.showHyphenationPoints(word, "pt"))
    SILE.call("par")
  end
}

\end{document}

Which gave me:

[image]

The only rule I found missing in the file is 1nô, and after adding it I got eco-nô-mico.

Now, regarding the remaining syllables such as -do, -co, -ca, -to, it seems to me that we do indeed have a bug, because they are all declared in the list of patterns...
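
For readers unfamiliar with how a pattern like 1nô creates a break point, here is a minimal Python sketch of Liang-style pattern matching. The pattern list is a toy set chosen for illustration, not the contents of languages/pt.lua, and the helper names are hypothetical. The sketch deliberately skips the leftmin/rightmin step, so it shows raw break points only.

```python
import re


def parse_pattern(pat):
    """Split a Liang pattern such as '1nô' into its letters and gap weights."""
    letters = re.sub(r"\d", "", pat)
    weights = [0] * (len(letters) + 1)
    pos = 0
    for ch in pat:
        if ch.isdigit():
            weights[pos] = int(ch)   # weight sits in the gap before letter `pos`
        else:
            pos += 1
    return letters, weights


def hyphenation_weights(word, patterns):
    """Return one weight per gap: before each letter, plus one after the last."""
    padded = "." + word + "."        # dots mark the word boundaries
    scores = [0] * (len(padded) + 1)
    for pat in patterns:
        letters, weights = parse_pattern(pat)
        start = padded.find(letters)
        while start != -1:
            for offset, w in enumerate(weights):
                # The highest weight wins at each gap.
                scores[start + offset] = max(scores[start + offset], w)
            start = padded.find(letters, start + 1)
    return scores[1:len(word) + 2]


def render(word, points):
    """Insert a hyphen wherever the gap before a letter has an odd weight."""
    out = []
    for k, ch in enumerate(word, start=1):
        if points[k - 1] % 2 == 1:
            out.append("-")
        out.append(ch)
    return "".join(out)


toy_patterns = ["1co", "1mi"]        # illustrative only
print(render("econômico", hyphenation_weights("econômico", toy_patterns)))
print(render("econômico", hyphenation_weights("econômico", toy_patterns + ["1nô"])))
```

With the toy set, the first call has no pattern covering "nô" and leaves that syllable unbroken. Adding 1nô supplies the odd weight that allows a break before the n.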

@Omikhleia
Copy link
Member

Omikhleia commented Jun 6, 2024

@jodros

it seems to me that we've indeed a bug,

Yes, and I guess I quite understand it now.
It partly relates to #2017 with possibly an additional error in our implementation

Anyway, since we are using the default hard-coded (2, 2), why indeed don't we get "eco-no-mi-co"?

Ho ho, weird indeed .... but maybe there's an issue somewhere with Lua lists being 1-based and not 0-based?

See:

SILE v0.14.17 (Lua 5.2)
> SILE.showHyphenationPoints("economico", "pt")
eco-no-mico
> 
> SILE._hyphenators["pt"].rightmin = 1
> 
> SILE.showHyphenationPoints("economico", "pt")
eco-no-mi-co
> 

I think the issue is here:

-- Still inside the no-exceptions case
for i = 1, self.leftmin do
  points[i] = 0
end
for i = #points - self.rightmin, #points do
  points[i] = 0
end

Before applying the constraints, we have

points:  0 1 0 1 0 1 0 1 0 0 
word:     e c o n o m i c o    --> e-co-no-mi-co

After applying the leftmin

points:  0 0 0 1 0 1 0 1 0 0 
word:     e c o n o m i c o    --> eco-no-mi-co

And after applying the rightmin

points:  0 0 0 1 0 1 0 0 0 0 
word:     e c o n o m i c o    --> eco-no-mico

So we think we are using (2, 2), but we actually behave as (2, 3)... Which might be why #2017 failed to be noticed (English also being recommended at (2, 3) for standard typography...): A bug was hiding another.

I think the code should be:

for i = #points-self.rightmin+1, #points do points[i] = 0 end
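
The off-by-one can be checked outside SILE with a small Python sketch that mirrors the two loops on the points array shown above. The function names and the `fixed` flag are made up for illustration. The list is padded so that indices line up with Lua's 1-based tables.

```python
def apply_minima(points, leftmin, rightmin, fixed):
    """Zero out break points too close to either edge of the word.

    `points` follows the layout used in this thread: entry k (1-based) is the
    weight of the gap before letter k, with one extra trailing entry.
    """
    pts = [None] + list(points)      # pad so pts[1] is Lua's points[1]
    n = len(points)
    for i in range(1, leftmin + 1):
        pts[i] = 0
    # Current code starts at n - rightmin; the proposed fix starts one later.
    start = n - rightmin + (1 if fixed else 0)
    for i in range(start, n + 1):
        pts[i] = 0
    return pts[1:]


def render(word, points):
    """Insert a hyphen wherever the gap before a letter has an odd weight."""
    out = []
    for k, ch in enumerate(word, start=1):
        if points[k - 1] % 2 == 1:
            out.append("-")
        out.append(ch)
    return "".join(out)


points = [0, 1, 0, 1, 0, 1, 0, 1, 0, 0]   # raw points for "economico"
print(render("economico", apply_minima(points, 2, 2, fixed=False)))  # eco-no-mico
print(render("economico", apply_minima(points, 2, 2, fixed=True)))   # eco-no-mi-co
print(render("economico", apply_minima(points, 2, 1, fixed=False)))  # eco-no-mi-co
```

The last line reproduces the REPL experiment above: with the current loop bounds, setting rightmin to 1 gives the result one would expect from rightmin = 2.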

But then I no longer understand the root problem that triggered me to open #2017. I'll re-investigate it... There might be more than meets the eye here...

Any thoughts and insights? [1]

Footnotes

  1. Besides the fact that not so long ago, SILE didn't know how to properly justify lines. It does know, it seems, how to properly perform hyphenation. It doesn't know how to properly break pages. Erm. 🐷

@alerque alerque added this to the v0.15.1 milestone Jun 6, 2024
@Omikhleia
Member

Omikhleia commented Jun 6, 2024

@jodros

The only rule I found missing in the file is "1nô"

By the way, it's not missing, unless I am mistaken: it's just that our current hyphenation patterns (coming from TeX) were likely crafted based on Portuguese from Portugal, and all dictionaries seem to have "económico"...

But according to some online resources, "econômico" is from Brazil (grafia no Brasil, i.e. the spelling used in Brazil). It could be interesting to confirm. And if so, I still think it would be a good question for https://github.com/hyphenation/tex-hyphen ... Because even if "we are in no way limited to using the hyphenation rules from tex-hyphen as is", the general solution here would be to support BCP47 and possibly have different hyphenation patterns for different language variants. Admittedly, it is quite possible that introducing this "1nô" into standard Portuguese wouldn't harm it much (I don't know!), but the general picture is that some specificities might need different patterns. [1]

Footnotes

  1. And BCP47 was discussed long ago 🐷 Until it happens, SILE doesn't know well how to handle language codes and scripts... TeX has different patterns for German 1901 orthography and 1996 revised orthography (de-1901, de-1996), patterns for Serbian in Latin or Cyrillic, etc.

@jodros jodros changed the title from "Hyphenation problems in Portuguese and Russian" to "Hyphenation problems in Portuguese" Jun 6, 2024
@Omikhleia
Member

Omikhleia commented Jun 6, 2024

So let's recap, as the issue got long and covers several things:

  • Part of it is merely due to tuning configuration options for small columns, which is doable in the existing code base (via emergencyStretch, tolerance, etc.)
  • Part of it is due to existing TeX patterns as-they-are = Not Our Bug
  • Part of it is due to bugs in our Liang hyphenation implementation, relating to "Hyphenation minimun left/right constraints should be language-specific" (#2017) but overshadowing another likely bug in the handling of the hyphenation rightmin 🐞
  • Part of it is due to differences between "pt" (canonical Portuguese from Portugal) and "pt-BR" (Portuguese from Brazil)

@jodros
Contributor Author

jodros commented Jun 6, 2024

Ho ho, weird indeed .... but maybe there's an issue somewhere with Lua lists being 1-based and not 0-based?

Interesting note.

according from some online resources, "econômico" is from Brazil

Yes, that's the Brazilian spelling. Maybe there are even other minor differences to be found...

Part of it is dues to bugs in our Liang hyphenation implementation, relating to #2017 but overshadowing another likely bug in the handling of the hyphenation rightmin 🐞

Since most of the issues I had were solved by changing `linebreak.emergencyStretch`, this is the only remaining point to take care of now.

@alerque alerque modified the milestones: v0.15.1, v0.15.2, v0.15.3 Jun 7, 2024
@Omikhleia
Member

Yes, that's the Brazilian spelling. Maybe there are even other minor differences to be found...

Noted: hyphenation/tex-hyphen#61

@alerque alerque modified the milestones: v0.15.3, v0.15.4 Jun 10, 2024
@Omikhleia
Member

Maybe there are even other minor differences to be found...

Likely: I came across "antónimo" vs. "antônimo" in a translation file.

@jodros
Contributor Author

jodros commented Jun 11, 2024

Likely: I came accross "antónimo" vs. "antônimo" in a translation file.

I'm gonna make a list with all major differences soon...

@alerque
Member

alerque commented Jun 23, 2024

As I understand it, everything this issue needs to track is taken care of, except perhaps documentation on all the things that can be done to cope with narrow text widths as gracefully as possible. Let's open an issue specific to that.
