PrettyPrinter strips newlines from text in nodes, even pcdata #4303

Closed
@scabug

Description

@scabug

=== What steps will reproduce the problem ===

scala> <foo>{"hi\nthere"}</foo>
res6: scala.xml.Elem =
<foo>hi
there</foo>

scala> new PrettyPrinter(9999,2).format(<foo>{"hi\nthere"}</foo>)
res7: String = <foo>hi there</foo>

scala> new PrettyPrinter(9999,2).format(<foo>{PCData("hi\nthere")}</foo>)
res8: String = <foo><![CDATA[hi there]]></foo>

Activity

scabug commented on Feb 28, 2011

Imported From: https://issues.scala-lang.org/browse/SI-4303?orig=1
Reporter: Ittay Dror (ittayd)

scabug commented on Mar 1, 2011

@axel22 said:
Someone needs to check the correct behaviour against the XML specification. Contributions are, of course, always welcome.

scabug commented on Feb 18, 2014

Francois Armand (fanf) said:
For people with this problem, it seems that simply changing the "doPreserve" method of PrettyPrinter to always return true does what we want. I don't have the least knowledge of what the XML spec or a DTD expects here.

Too bad that the doPreserve method is private...

scabug commented on Dec 22, 2014

Michael Beckerle (mbeckerle.dfdl) said:
I would like to comment on what the XML specification says about this issue, and what the right behavior is.

The XML 1.1 spec is very clear that if you insert a CR into text via an "entity value literal", then that character must be preserved. This suggests to me that the only reasonable implementation would not do any whitespace normalization on output, as all the various Unicode line-ending characters can be inserted by this same mechanism.

This is from the XML 1.1 spec (this clarification is not in the original XML 1.0 spec, but I suggest it is the "right thing" to do for XML 1.0 implementations anyway):

2.3 Common Syntactic Constructs

This section defines some symbols used widely in the grammar.

S (white space) consists of one or more space (#x20) characters, carriage returns, line feeds, or tabs.
White Space
[3] S ::= (#x20 | #x9 | #xD | #xA)+

Note: The presence of #xD in the above production is maintained purely for backward compatibility with the First Edition. As explained in 2.11 End-of-Line Handling, all #xD characters literally present in an XML document are either removed or replaced by #xA characters before any other processing is done. The only way to get a #xD character to match this production is to use a character reference in an entity value literal.
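
To make the quoted note concrete, here is a minimal sketch of the input-side end-of-line rule it refers to, assuming only the XML 1.0 subset of section 2.11 (the sequence #xD #xA and any lone #xD become a single #xA). XML 1.1 additionally covers #x85 and #x2028, which this sketch deliberately leaves out. The object name `EolNormalize` and its method are hypothetical, written for illustration; they are not part of scala-xml.

```scala
object EolNormalize {
  // XML 1.0, section 2.11 (subset): translate the two-character sequence
  // #xD #xA, and any #xD not followed by #xA, to a single #xA.
  // Hypothetical helper for illustration only.
  def normalize(raw: String): String =
    raw.replace("\r\n", "\n").replace('\r', '\n')
}
```

This step only touches literal characters in the raw document. A character reference such as `&#xD;` is still plain text at this point and is expanded later, which is why it is the one way left to get a real #xD into the content. Note that this is a rule for reading input; it does not by itself say a printer may collapse newlines on output.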

scabug commented on Dec 22, 2014

@som-snytt said:

scala> import xml._
import xml._

scala> val n = new PCData("hi there.")
n: scala.xml.PCData = <![CDATA[hi there.]]>

scala> val p = new PrettyPrinter(80,5)
p: scala.xml.PrettyPrinter = scala.xml.PrettyPrinter@c86b9e3

scala> p format n
res0: String = <![CDATA[hi there.]]>

scala> val n = new PCData("""hi there,
     |   is there any way to fix this?""")
n: scala.xml.PCData =
<![CDATA[hi there,
  is there any way to fix this?]]>

scala> p format n
res1: String =
<![CDATA[hi there,
  is there any way to fix this?]]>

scala> p format <a>{n}</a>
res2: String = <a><![CDATA[hi there, is there any way to fix this?]]></a>

Footnote, you don't get incomplete parses from embedded Scala blocks:

scala> <a>{ PCData("""
<console>:1: error: in XML literal:  expected end of Scala block
       <a>{ PCData("""
                      ^

scabug commented on Dec 23, 2014

@som-snytt said (edited on Dec 23, 2014 9:46:38 PM UTC):
Took a quick look. First, Utility.serialize is the non-formatting option. Second, the PrettyPrinter is pretty ugly. It's not obvious whether it's trying to minimize verticality. When is GSOC again?

scala> val xx = <a>{ PCData("Here is some very long text\nto split.") }</a>
xx: scala.xml.Elem =
<a><![CDATA[Here is some very long text
to split.]]></a>

scala> val pp = new PrettyPrinter(1000,2)
pp: scala.xml.PrettyPrinter = scala.xml.PrettyPrinter@13275d8

scala> pp format xx
res7: String = <a><![CDATA[Here is some very long text to split.]]></a>

scala> val pp = new PrettyPrinter(10,2)
pp: scala.xml.PrettyPrinter = scala.xml.PrettyPrinter@673919a7

scala> pp format xx
res8: String =
<a>
  <![CDATA[Here is some very long text
to split.]]>
</a>

scala> val pp = new PrettyPrinter(2,2)
pp: scala.xml.PrettyPrinter = scala.xml.PrettyPrinter@41853299

scala> pp format xx
res9: String =
"<a><![CDATA[Here is some very long text to split.]]></a>
"
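
For anyone who just needs the newlines kept, the non-formatting route mentioned above can be sketched roughly like this. It assumes scala-xml is on the classpath and relies on the default arguments of `Utility.serialize`; the object name `SerializeDemo` is hypothetical, and `xx` is the same element as in the transcript.

```scala
import scala.xml.{PCData, PrettyPrinter, Utility}

object SerializeDemo {
  // Same element as in the transcript above: a CDATA section with an embedded newline.
  val xx = <a>{ PCData("Here is some very long text\nto split.") }</a>

  // Formatting route. In the transcript above, a wide PrettyPrinter
  // collapsed the embedded newline into a space.
  val pretty: String = new PrettyPrinter(1000, 2).format(xx)

  // Non-formatting route: serialize the node as-is, with no re-indentation.
  val raw: String = Utility.serialize(xx).toString
}
```

The trade-off is that `Utility.serialize` does no indentation at all, so this is a way around the problem rather than a fix for PrettyPrinter itself.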

scabug commented on Dec 23, 2014

Michael Beckerle (mbeckerle.dfdl) said:
Sorry, what does GSOC mean?

scabug commented on Dec 23, 2014

@som-snytt said:
I was hoping a Google Summer of Code intern wanted to do a project with XML.

Maybe a student co-majoring in History. The "digital humanities" are huge these days.

scabug commented on Jul 17, 2015

@SethTisue said:
The scala-xml library is now community-maintained. Issues with it are now tracked at https://github.com/scala/scala-xml/issues instead of here in the Scala JIRA.

Interested community members: if you consider this issue significant, feel free to open a new issue for it on GitHub, with links in both directions.

scabug commented on Jul 29, 2015

Michael Beckerle (mbeckerle.dfdl) said:
Issue migrated to scala/scala-xml#76

PrettyPrinter strips newlines from text in nodes, even pcdata · Issue #4303 · scala/bug