Option to transliterate paths #9134
Comments
This issue has been automatically marked as stale because it has not had recent activity. The resources of the Hugo team are limited, and so we are asking for your help. |
I have read this issue. Arrest me if I'm wrong, but with how Hugo treats content paths, this issue is a cosmetic issue about end URLs. The example above:
Should now work fine as
I don't think the first example would work at all. I understand that people would want pretty URLs (that's what I use
I have not seen a (relatively) complete language-aware transliteration library; the one used in the referenced PR failed to transliterate my name in my language (which is also the language of this year's Nobel Prize winner in literature). I think this would be easier to fix if we drop the second point above. Then we could also possibly get away with using the existing setting. |
That's great! With v0.123.0-DEV we can now make the round trip when If we take language-specific behavior out of the equation, all we need is something that does the equivalent of:
And that could certainly live under the existing setting. |
Yes, this is why I suggested a different approach some time ago (let the user choose the mapping and make it a config entry). The proposal was based on this blog post with an idea of how it could be implemented. The requirements for transliteration can vary widely from use case to use case, so I think it would still be best not to rely on a (hardcoded) library, but to provide a versatile configurable mapping and maybe provide some sensible internal defaults or maybe even only in the documentation of the feature. |
This is a bit more than that. I think this issue, and all the other related issues in the description, are specifically about transliterating/mangling the final URL with a mapping function, either using a hard-coded common transliteration or using a generic mapping. So the focus of all these questions is on the second point above (having a working mapping). Once you have that, you can easily work around the first problem (a very specific mapping function does not cover all expected cases). A generic filtering/mapping function that could be hooked into the final stage of URL generation would be sufficient to handle all these use cases. Although it might be technically easier to fix only half of them, it would not solve the problem or address the intended use cases. So, in my opinion the most important part of the equation would be
which renders correctly with iconv, given you use the locale that contains the target language:
(note that iconv creates Ä -> AE, Ö -> OE ... in that case).
(note that it has the expected output as per #11246 (comment), both for Nynorsk and Bokmål, but I did not expect any differences between the two, to be honest) |
Are you sure the search engines care about perfect transliteration? I suspect Google happily reads this:
Which I guess is what we have today. |
Perhaps (hopefully) the use case for transliterating path segments or URLs is obsolete, or will be soon. However, the number of comments on all these issues and the activity on the forum around this topic suggests otherwise. At least as a human, I can (sort of) read However, it might also be an option to actively discourage users from using transliteration and point them to full UTF-8 support. |
OK, I have 2 concerns here:
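The package we currently use to remove accents has an API like below:

```go
package main

import (
	"fmt"

	"golang.org/x/text/runes"
	"golang.org/x/text/transform"
	"golang.org/x/text/unicode/norm"
)

func main() {
	chain := transform.Chain(
		norm.NFD,
		// Map individual runes to ASCII; unlisted runes pass through unchanged.
		// Note: norm.NFD runs first, so precomposed letters such as 'ą' arrive
		// here as a base letter plus a combining mark. Only runes without a
		// decomposition ('ł', 'ø') match their case as written.
		runes.Map(func(r rune) rune {
			switch r {
			case 'ą':
				return 'a'
			case 'ć':
				return 'c'
			case 'ę':
				return 'e'
			case 'ł':
				return 'l'
			case 'ń':
				return 'n'
			case 'ó':
				return 'o'
			case 'ś':
				return 's'
			case 'ż':
				return 'z'
			case 'ź':
				return 'z'
			case 'ø':
				return 'o'
			}
			return r
		}),
		norm.NFC,
	)
	s, _, _ := transform.String(chain, "Bjørn Erik Pedersen")
	fmt.Println(s) // Works for me.
}
```

If we accept that the transliteration is a simple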
This is me thinking out loud. |
I agree, which is why I would go with a more generic option, see my other comments.
I agree as well. This is one of Hugo's biggest USPs, so it is better not to sacrifice it for features.
From my point of view, it would be sufficient to simply expose the mapping in this function to config. Everything else could be left to the user, so they clearly know that it is up to them to provide the mapping they need. Personally, I would bet a lot on the claim that a simple per-language configurable EDIT: note, however, that the target of the mapping would need to be a multi-character string to support the ä - ae (German) and ж - zh (Cyrillic) use cases, and that the source would need to be a multi-character string to support both Unicode forms of accented characters (precomposed and decomposed). |
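A minimal sketch of what such a user-supplied, per-language mapping could look like, using only the Go standard library. The `translitTables` variable, the `transliterate` function, and the table contents are hypothetical and chosen for illustration; nothing here is an existing Hugo setting or API. Both sides of each pair are strings, so a source can be a decomposed sequence and a target can be several characters.

```go
package main

import (
	"fmt"
	"strings"
)

// translitTables is a hypothetical stand-in for a per-language config entry.
// Each slice holds old/new pairs in the order strings.NewReplacer expects.
var translitTables = map[string][]string{
	"de": {
		// Precomposed forms.
		"ä", "ae", "ö", "oe", "ü", "ue", "ß", "ss",
		"Ä", "Ae", "Ö", "Oe", "Ü", "Ue",
		// Decomposed forms: base letter followed by U+0308 (combining diaeresis).
		"a\u0308", "ae", "o\u0308", "oe", "u\u0308", "ue",
	},
	"ru": {"ж", "zh", "Ж", "Zh"},
}

// transliterate applies the table for lang, or returns s unchanged
// when no table is configured for that language.
func transliterate(lang, s string) string {
	pairs, ok := translitTables[lang]
	if !ok {
		return s
	}
	return strings.NewReplacer(pairs...).Replace(s)
}

func main() {
	fmt.Println(transliterate("de", "Straußwirtschaft")) // Strausswirtschaft
	fmt.Println(transliterate("de", "Ka\u0308se"))       // Kaese
	fmt.Println(transliterate("ru", "ж"))                // zh
}
```

With this shape, the mapping is entirely the user's responsibility, and a language without a table simply falls through untouched.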
Are you sure this would make a performance difference compared with the hard-coded switch statement? |
OK, I have searched a little more around, and my current take on this is:
|
TLDR: I recommend deferring this indefinitely pending demand. This started with the addition of the But then you couldn't get to the term page with any of these:
And that generated some noise, in the Academic/Wowchemy/HugoBlox world in particular, despite the introduction of The inability to get back to the term page was the primary driver for creating this issue, and it is irrelevant as of v0.123.0. And then came the desire to have "accents" removed from non-composite characters, which is impossible, because they are not composite characters. I'm not sure if this desire was driven by compatibility requirements, aesthetic preference, or just a lack of understanding (e.g., "It's broken. It's not removing my accents."). So that means transliteration. But as soon as you open that box, it needs to be language-specific. In my view there is insufficient "compatibility" or "aesthetic" demand to pursue this at the moment. The changes in v0.123.0 solved the initial problem, and actually solved another one in this area as well... all three work great:
|
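The composite versus non-composite distinction made above can be illustrated with a small standard-library sketch. It is not Hugo's implementation: the `stripMarks` helper is hypothetical, and because the standard library has no Unicode normalizer, the inputs are assumed to be already decomposed (NFD) and are written with explicit escapes.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// stripMarks drops every non-spacing mark (Unicode category Mn).
// It assumes the input is already in decomposed form.
func stripMarks(s string) string {
	return strings.Map(func(r rune) rune {
		if unicode.Is(unicode.Mn, r) {
			return -1 // a negative value tells strings.Map to drop the rune
		}
		return r
	}, s)
}

func main() {
	// "é" decomposed is 'e' + U+0301, so the mark can be removed.
	fmt.Println(stripMarks("cafe\u0301")) // cafe

	// U+0141 (Ł) is a single letter with no combining mark, so it survives,
	// while the decomposed ó and ź lose their accents.
	fmt.Println(stripMarks("\u0141o\u0301dz\u0301")) // Łodz
}
```

The second line of output is the crux: there is no mark to strip from Ł, so anything beyond mark removal has to be a mapping.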
As @jmooring mentioned: " deferring this indefinitely pending demand." My case was #7542 but I am not crying about it. I learned to live with it. There are other, more demanding things, that I think are worth spending more time on than this. Unless something simple is found out, I agree that deferring this will be the best approach. There are some good ideas in this issue, but it's all about how much time is allowed to be spent on that compared to the needs of users (me included). |
As a data point related to aesthetically pleasing URLs, Wikipedia doesn't feel this important. In the browser's address bar you see this:
When you copy the URL and paste it into an email (for example):
Hugo's current behavior is identical. As a data point, Wikipedia does not appear to consider this important either, so I'm inclined to remove "aesthetically pleasing URLs" as a reason to pursue this, leaving only compatibility with other systems that transliterate (e.g., Drupal, where transliteration is disabled by default). |
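The difference between what the address bar shows and what gets pasted is percent-encoding of the UTF-8 bytes. A short sketch with Go's `net/url` package, using "Straußwirtschaft" as a sample title chosen for illustration; the `encodeSegment` helper is hypothetical:

```go
package main

import (
	"fmt"
	"net/url"
)

// encodeSegment percent-encodes a single path segment, which is the form
// a URL takes when it leaves the address bar as plain text.
func encodeSegment(s string) string {
	return url.PathEscape(s)
}

func main() {
	const title = "Straußwirtschaft" // sample title, an assumption for this sketch

	encoded := encodeSegment(title)
	fmt.Println(encoded) // Strau%C3%9Fwirtschaft

	// Decoding restores the original, so both forms address the same page.
	decoded, err := url.PathUnescape(encoded)
	if err != nil {
		panic(err)
	}
	fmt.Println(decoded == title) // true
}
```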
@jmooring Nice Wikipedia entry, I love to go to a Straußwirtschaft (aka Besenwirtschaft) in late summer. I don't follow your argument here though. "We don't have a use case because {fill in any big Internet player here} doesn't care" is not a plausible argument. In fact, it is a fallacy (ad populum). Using the same fallacy, I could argue the opposite: transliteration is standardised, so we have a use case. See https://en.wikipedia.org/wiki/List_of_ISO_romanizations. Or that even the Serbian government provides a transliteration of its website (https://www.srbija.gov.rs/, select "Latinica"). I would consider both arguments invalid because they ignore the context. We have a use case for transliteration only because many Hugo users (including myself), for various reasons and repeatedly over a long period of time, seem to have a use case that is expressed in several GitHub issues and forum posts. One could argue that "aesthetically pleasing URLs" is not a valid use case to begin with. But there are many other valid use cases, such as the (common) Cyrillic romanisation use case mentioned above, which was raised by a real Hugo user in a forum post. |
Not an argument, just a...
|
How do you like this solution? bfa25d5#diff-4469ddf2eec1abbb8007789a7478453a0f26ff01f5ac615214a6e95079a95eb3R314

```yaml
cascade:
  url: /:sectionstranslit/:slugorfilename
```
|
Background
You can configure Hugo to remove non-spacing marks from composite characters in content paths by enabling removePathAccents in the site configuration.

Removing the non-spacing marks has the desired effect in the first example, but it:
- does not handle non-composite characters (ß, ł, Ł)
- is not language-aware (in German, ä should become ae)

This issue has been raised a few times on the forum, and stale bot has closed three related issues that continue to receive comments:
Also:
Proposal
Provide an option to convert path characters from Unicode to ASCII, commonly called "Transliteration."
For a site with English (en) as the default content language:
For a site with German (de) as the default content language:
Include a related template function so that you can access term pages: