xml2 read_html removes closing tags from JSON-LD when using a single option #373

@sbha

With default options, xml2::read_html(x) returns the HTML embedded in a JSON-LD (linked data) object as expected:

library(xml2)
library(magrittr)
library(rvest)

test_ld <- '<script type="application/ld+json">{"@context":"http://schema.org","@type":"ReproducibleExample", "description":"<p><strong>text within tags</strong>text after closing tag</p>"'

# tags preserved
test_ld %>% 
  read_html() %>% 
  html_node('script[type="application/ld+json"]') %>% 
  as.character()

[1] "<script type=\"application/ld+json\">{\"@context\":\"http://schema.org\",\"@type\":\"ReproducibleExample\", \"description\":\"<p><strong>text within tags</strong>text after closing tag</p>\"</script>"

Here, description contains the HTML <p><strong>text within tags</strong>text after closing tag</p>, with its closing tags intact.

But with xml2::read_html(x, options = 'HUGE'), or with any other single option (I've tested five or six), the closing tags are removed from the HTML text inside the JSON-LD object.

# tags removed
test_ld %>% 
  read_html(options = 'HUGE') %>% 
  html_node('script[type="application/ld+json"]') %>% 
  as.character()

# removed
test_ld %>% 
  read_html(options = "NOBLANKS") %>% 
  html_node('script[type="application/ld+json"]') %>% 
  as.character()

# removed
test_ld %>% 
  read_html(options = '') %>% 
  html_node('script[type="application/ld+json"]') %>% 
  as.character()

# all return:
[1] "<script type=\"application/ld+json\">{\"@context\":\"http://schema.org\",\"@type\":\"ReproducibleExample\", \"description\":\"<p><strong>text within tagstext after closing tag\"</script>"

description now becomes <p><strong>text within tagstext after closing tag, with the closing </strong> and </p> tags stripped.

Setting options is necessary for some of the HTML I'm parsing. Is it possible to use options and preserve properly formatted HTML from a linked data object?
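
A possible workaround, offered as a sketch rather than a confirmed fix: the documented default for read_html() is options = c("RECOVER", "NOERROR", "NOBLANKS"), so passing a single option such as 'HUGE' replaces that whole set instead of adding to it. Assuming that one of the dropped defaults (most likely RECOVER) is what keeps the closing tags in place, appending the extra option to the defaults may preserve them. The names default_opts, doc, node and out are only illustrative, and xml_find_first() from xml2 stands in for html_node() so the example does not need rvest.

```r
library(xml2)

test_ld <- '<script type="application/ld+json">{"@context":"http://schema.org","@type":"ReproducibleExample", "description":"<p><strong>text within tags</strong>text after closing tag</p>"'

# Documented defaults of read_html(); a single value passed via `options`
# replaces this vector instead of extending it.
default_opts <- c("RECOVER", "NOERROR", "NOBLANKS")

# Assumption: keeping the defaults and appending HUGE preserves the closing tags.
doc <- read_html(test_ld, options = c(default_opts, "HUGE"))
node <- xml_find_first(doc, "//script[@type='application/ld+json']")
out <- as.character(node)

# TRUE if the closing tags inside the JSON-LD string survived parsing
grepl("</strong>", out, fixed = TRUE)
```

If the check comes back TRUE, the same pattern should apply to any other extra option: extend the default vector rather than replacing it.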

Metadata

Labels: bug (an unexpected problem or unintended behavior)
