Use correct encoding when processing SVG files #400

Open
kvid opened this issue Jul 5, 2024 · 4 comments


@kvid
Collaborator

kvid commented Jul 5, 2024

Here, the SVG file is specified to be utf-8 encoded:
https://github.com/wireviz/WireViz/blob/close_files/src/wireviz/wv_html.py#L45

It's not specified at:
https://github.com/wireviz/WireViz/blob/close_files/src/wireviz/svgembed.py#L62-L64

What is correct? Should we specify utf-8 at both places, or does it really depend on what encoding is specified in the leading part of the SVG file itself?

We probably should assume utf-8 and specify that encoding at both places. I also found a third place where that is already done: https://github.com/wireviz/WireViz/blob/close_files/src/wireviz/Harness.py#L662
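The explicit argument matters because, when `encoding` is omitted, Python's `open()` and `Path.read_text()` fall back to a locale-dependent default that is not necessarily UTF-8 (for instance on many Windows setups). A minimal sketch of reading SVG consistently; the helper name `read_svg_text` is hypothetical and not part of WireViz:

```python
from pathlib import Path
import tempfile


def read_svg_text(path):
    """Read an SVG file as text, always decoding as UTF-8 (hypothetical helper)."""
    return Path(path).read_text(encoding="utf-8")


# Usage: round-trip a small SVG containing non-ASCII characters.
with tempfile.TemporaryDirectory() as tmp:
    svg_path = Path(tmp) / "demo.svg"
    svg_path.write_text("<svg><text>AæøåA</text></svg>", encoding="utf-8")
    svg_text = read_svg_text(svg_path)
```

Routing every SVG read through one such helper would keep the three call sites from drifting apart again.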

Then we should probably also verify (at a minimum where we embed SVG into HTML, ideally everywhere we read SVG) that the encoding attribute of the leading XML declaration is either absent (utf-8 is the default, I believe) or equal to one of the legal spellings of utf-8. If we detect a discrepancy, should we raise an exception, or just print a warning stating e.g. that some characters might be rendered incorrectly due to an unexpected encoding in the SVG file?
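Such a check could be sketched roughly as follows. The function names `declared_encoding` and `check_svg_encoding` are hypothetical, the regex only handles ASCII-compatible encodings (not UTF-16), and using Python's codec registry to resolve spelling variations of utf-8 is an assumption about what counts as "legal":

```python
import codecs
import re
import warnings

# Matches the encoding pseudo-attribute of a leading XML declaration, if any.
_XML_DECL_ENCODING = re.compile(
    rb'^<\?xml[^>]*?\bencoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)


def declared_encoding(svg_bytes):
    """Return the encoding named in the XML declaration, or None if absent."""
    if svg_bytes.startswith(b"\xef\xbb\xbf"):  # skip a UTF-8 byte order mark
        svg_bytes = svg_bytes[3:]
    match = _XML_DECL_ENCODING.match(svg_bytes)
    return match.group(1).decode("ascii") if match else None


def check_svg_encoding(svg_bytes, source="SVG"):
    """Return True if the declared encoding is absent or an alias of UTF-8.

    Otherwise emit a warning (a stricter variant could raise instead).
    """
    enc = declared_encoding(svg_bytes)
    if enc is None:
        return True
    try:
        is_utf8 = codecs.lookup(enc).name == "utf-8"
    except LookupError:  # encoding name unknown to Python
        is_utf8 = False
    if not is_utf8:
        warnings.warn(
            f"{source} declares encoding {enc!r}, but is processed as UTF-8; "
            "some characters might be rendered incorrectly."
        )
    return is_utf8
```

Warning rather than raising keeps pure-ASCII files working, since they decode identically under UTF-8 and ISO-8859-1.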

If we can get or create some SVG with e.g. encoding="ISO-8859-1" containing some known characters outside the common ASCII range, we could test to see the effect of assuming the wrong encoding at the different parts of our code. Then it'll be easier to describe possible consequences in a warning message.
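Even without a real SVG file, the two possible mismatches can be illustrated on plain byte strings. This is only a sketch of Python's decoding behaviour, not of any WireViz code path:

```python
text = "AæøåA"

latin1_bytes = text.encode("iso-8859-1")
utf8_bytes = text.encode("utf-8")

# Case 1: ISO-8859-1 bytes wrongly decoded as UTF-8 fail loudly.
try:
    latin1_bytes.decode("utf-8")
    case1 = "decoded without error"
except UnicodeDecodeError as err:
    case1 = f"UnicodeDecodeError at byte offset {err.start}"

# Case 2: UTF-8 bytes wrongly decoded as ISO-8859-1 never fail,
# but each non-ASCII character silently turns into two wrong ones.
case2 = utf8_bytes.decode("iso-8859-1")

print(case1)
print(repr(case2))
```

The asymmetry is relevant to the warning text: assuming utf-8 for an ISO-8859-1 file tends to raise an exception, while the opposite mistake only produces garbled characters.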

Originally posted by @kvid in #395 (comment)

@kvid
Collaborator Author

kvid commented Jul 6, 2024

Comment by @formatc1702 in the original thread:

Good question. Since the HTML template contains <meta charset="UTF-8">, it would be best if the embedded SVG also was UTF-8. Not sure if we know or have control over how Graphviz chooses to encode its output. Are we just lucky that it's already UTF-8?

Originally posted by @formatc1702 in #395 (comment)

@martinrieder
Contributor

martinrieder commented Jul 6, 2024

Citing the Graphviz FAQ - NonAscii:

Because we cannot always guess the encoding, you should set the graph attribute charset to UTF-8, Latin1 (alias ISO-8859-1 or ISO-IR-100) or Big-5 for Traditional Chinese. This can be done in the graph file or on the command line. For example charset=Latin1.

Output: It is essential that a font which has the glyphs for your specified characters is available at final rendering time. The choice of this font depends on the target code generator. [...]

For PostScript, the input must be either the ASCII subset of UTF-8 or Latin-1. (We have looked for more general solutions, but it appears that UTF-8 and Unicode are handled differently for every kind of font type in PostScript, and we don't have time to hack this case-by-case.

For SVG output, we just pass the raw UTF-8 (or other encoding) straight through to the generated code.

So, I would interpret this to be an issue about font selection and not UTF8. Just ensure that we provide a font to GV that is available for rendering (into raster images).

Please have a look at Sketchviz, which delivers two embedded CSS fonts along with the SVG. This works nicely because the font name is passed through.

@kvid
Collaborator Author

kvid commented Jul 6, 2024

@martinrieder wrote:

Citing the Graphviz FAQ - NonAscii:

I took the liberty of appending the #FaqNonAscii bookmark to your link so readers can more easily find the section you quote.

Because we cannot always guess the encoding, you should set the graph attribute charset to UTF-8, Latin1 (alias ISO-8859-1 or ISO-IR-100) or Big-5 for Traditional Chinese. This can be done in the graph file or on the command line. For example charset=Latin1.

We can do that, but it shouldn't be critical because utf-8 is default when this attribute is absent: https://graphviz.org/docs/attrs/charset/

Output: It is essential that a font which has the glyphs for your specified characters is available at final rendering time. The choice of this font depends on the target code generator. [...]

WireViz currently uses arial as the default fontname unless the user specifies another in options.fontname. For the target code generators we have used up to now (SVG and PNG generators), the arial font seems to be well supported also when using non-ASCII characters, but we have not tested this rigorously. I welcome a YAML input test case with a rich set of non-ASCII characters to be included e.g. in #63.

A user who specifies a different font is also responsible for ensuring that the font is available at final rendering time.

For PostScript, the input must be either the ASCII subset of UTF-8 or Latin-1. (We have looked for more general solutions, but it appears that UTF-8 and Unicode are handled differently for every kind of font type in PostScript, and we don't have time to hack this case-by-case.

PostScript target code is currently not used by WireViz, so these restrictions don't apply, unless this will be the first step to support PDF output in the future.

For SVG output, we just pass the raw UTF-8 (or other encoding) straight through to the generated code.

If the GV file has e.g. graph [charset="iso-8859-1"], then my experience is that non-ASCII characters in the same GV file are decoded according to this specified charset before rendering to PNG output.

However, the SVG output contains xml encoding="UTF-8" and nodes with non-ASCII characters are correctly UTF-8 encoded at some places, but at other places in the same SVG, the same node names are passed through with the input encoding, so my browser reports Encoding error when trying to render it.

This bug might be due to my old version: dot - graphviz version 2.44.1 (20200629.0846)

To generate such an input file, execute e.g.:

from pathlib import Path
Path('ch.gv').write_text('graph {\n  graph [charset="iso-8859-1"]\n  AæøåA -- B\n}\n', encoding="iso-8859-1")

When I input UTF-8 encoded GV files, it seems to work.
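To confirm that a generated SVG really mixes encodings as described above, one option is to locate the first byte sequence that is not valid UTF-8. The sketch below uses simulated bytes, not actual Graphviz output, and the function name `first_invalid_utf8_offset` is hypothetical:

```python
def first_invalid_utf8_offset(data):
    """Return the byte offset of the first invalid UTF-8 sequence, or None."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        return err.start
    return None


# Simulated mixed output: one element correctly UTF-8 encoded,
# one passed through with the ISO-8859-1 input encoding.
good = "<title>AæøåA</title>".encode("utf-8")
bad = "<text>AæøåA</text>".encode("iso-8859-1")
mixed = good + bad

offset = first_invalid_utf8_offset(mixed)
if offset is not None:
    # Show a little context around the offending byte.
    print(offset, mixed[max(0, offset - 10):offset + 10])
```

Running the same function on the bytes of a real SVG file (read with `Path.read_bytes()`) would show where in the file the pass-through occurred.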

So, I would interpret this to be an issue about font selection and not UTF8. Just ensure that we provide a font to GV that is available for rendering (into raster images).

I don't agree the issue is only about font selection. See my charset experience description above.

Please have a look at Sketchviz, which delivers two embedded CSS fonts along with the SVG. This works nicely because the font name is passed through.

I tried this, but was only allowed to download PNG unless I authorized SketchViz to log in to my github account on my behalf. I don't trust a random 3rd party software to access my github resources, and I don't have time to create a dummy account just now.

@martinrieder
Contributor

martinrieder commented Jul 6, 2024

@kvid you do not need to link Sketchviz to GitHub. Just have a look at the sources and search for the font "Handlee" that is given from the example. There is a link to Google fonts CSS, which is referenced from the SVG that is embedded in the website.
Now take a look at the following example that has this font embedded directly in the SVG file: https://github.com/gpotter2/sketchviz/blob/master/examples/clusters.svg

PS: The latter actually contains a bug which resets the font to "Helvetica,Arial,sans-serif", but the embedded CSS font is available. I corrected this and uploaded the file here:
clusters

PPS: Looking at the logo that I uploaded to #373 (comment), I notice that this SVG file generated by svg2roughjs contains two font definitions. My observation is that the CSS definition style="font-family: ..." overrides the font-family="..." attribute.
