Missing text content #2248

IWriteCode01 · 2024-12-16T19:42:33Z

A lot of the text content is missing when fetching the text with the document.text() method. Here is an example showing the discrepancy between the raw HTML and the extracted text content.

Example web page: https://developer.atlassian.com/server/confluence/rest/v920/Intro
Extracted content using jsoup:

The Confluence Data Center REST API Support for Server products ended Feb. 15, 2024. Learn what this means for you. Confluence Data Center Guides Reference Resources Changelog Search Support Log in REST API Modules Java API Switch to classic view REST API About Confluence Data Center REST API Advanced Searching using CQL Confluence REST API examples Content properties in the REST API Custom actions with the blueprint API Expansions in the REST API Pagination in the REST API Access Mode Admin Group Admin User Attachments Backup and Restore Category Child Content Content Blueprint Content Body Content Descendant Content Labels Content Property Content Resource Content Restrictions Content Version Content Watchers GlobalColorScheme Group Instance Metrics Label Long Task Search Server Information Space Space Label Space Permissions Space Property Space Watchers SpaceColorScheme User User Group User Watch Webhooks Other operations Rate this page: Unusable Poor Okay Good Excellent Changelog System status Privacy Notice at Collection Developer Terms Trademark Cookie preferences © 2024 Atlassian

Answered by jhy

jhy · 2024-12-16T22:42:36Z

jhy
Dec 16, 2024
Maintainer

Hi,

Well, take a look at the source code (View Source + Line wrap). You'll see all the page content is in a giant JavaScript blob¹:

 <script nonce="+B0Jpd/rO1dvg+rwHgeYiFk9FLKplTkeKjLnUkEAJPk=" type="text/javascript">
 window.__DATA__ = {"assets":
 {"-----------------------.js":"https://dac-static.atlassian.com/_static/
-----------------------.80866af279244d8a0ff3.bundle.js","-
...snip
 |\n|------------|---------------------------------|\n| Applicable |
   Confluence Server 5.5 - 8.5 \u003cbr> Confluence Data Center 5.6 and
   later|\n\nThe Confluence Server and Data Center REST API is for admins who
   want to script interactions with Confluence Server or Confluence Data Center
   and developers who want to integrate with or build on top of the Confluence
   platform.\n\n\u003cbr>\n\u003cdiv style=\"color: green;
   background-color: #f0f0f0; padding: 10px;\">\nFor REST API documentation,
   see \u003ca href=\"/server/confluence/rest/v900/intro\">Confluence Server
   and Data Center REST API reference\u003c/a>.\n\u003c/div>\n\nUsing Cloud?
   Find out about the [Confluence Cloud REST API]
   (/cloud/confluence/rest).\n\n\n## CRUD Operations\n\nConfluence's REST APIs
   provide access to resources (data entities) via URI paths. To use a REST
   API, your application will make an HTTP request and parse the response. By
   default, the response format is JSON. Your methods will be the standard HTTP
   methods: GET, PUT, POST and DELETE. 

</script>

<title>The Confluence Data Center REST API</title>

It looks like they store the content as a form of Markdown inside a script and then render it client-side, which also gives the page appreciable rendering jank when you visit it.

Recall that jsoup is an HTML parser, not a JavaScript executor.

One approach to get the specific content in this case using jsoup would be to select() the appropriate script and then feed its contents to a JSON parser. Because it's JavaScript and not JSON, you may need to scrub off the window.__DATA__ = part, depending on the JSON parser.
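
The scrubbing step could look roughly like the sketch below. It uses only the Java standard library so it runs on its own; the class name ScriptJson and the helper extractJson are hypothetical names made up for illustration, and the sample script body is invented rather than taken from the real page. With jsoup, the input string would presumably come from the matching script element's data() (for example, looping over doc.select("script")), and the returned string would then go to whichever JSON parser you prefer.

```java
import java.util.Optional;

/** Sketch only: isolates the object literal from a "window.__DATA__ = {...};" script body. */
final class ScriptJson {
    // Assignment target seen in the page source quoted above.
    static final String MARKER = "window.__DATA__";

    /**
     * Returns the text after "window.__DATA__ =", minus a trailing semicolon,
     * or an empty Optional if the script body does not contain the assignment.
     */
    static Optional<String> extractJson(String scriptData) {
        int marker = scriptData.indexOf(MARKER);
        if (marker < 0) return Optional.empty();
        int eq = scriptData.indexOf('=', marker + MARKER.length());
        if (eq < 0) return Optional.empty();
        String json = scriptData.substring(eq + 1).trim();
        if (json.endsWith(";")) {
            json = json.substring(0, json.length() - 1).trim();
        }
        return Optional.of(json);
    }

    public static void main(String[] args) {
        // Invented stand-in for what a script element's data() might return.
        String scriptData =
                "\n window.__DATA__ = {\"title\":\"Intro\",\"body\":\"## CRUD Operations\"};\n";
        System.out.println(extractJson(scriptData).orElse("(no data blob found)"));
    }
}
```

This assumes the script contains a single assignment whose right-hand side is valid JSON. If the blob uses JavaScript-only syntax, or more statements follow the object literal, a lenient parser or extra trimming would be needed.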

Or use a full headless browser like Playwright, which will be more general but will necessarily have a higher resource overhead.

1: Well technically a clob, I guess.
