Missing text content #2248
A lot of the text content is missing when fetching the text using the document.text() method. I'm including an example to show the discrepancy between the raw HTML and the extracted text content.
Example web page: https://developer.atlassian.com/server/confluence/rest/v920/Intro
Replies: 1 comment
Hi,
Well, take a look at the source code (View Source + Line wrap). You'll see all the page content is in a giant JavaScript blob:
<script nonce="+B0Jpd/rO1dvg+rwHgeYiFk9FLKplTkeKjLnUkEAJPk=" type="text/javascript">
window.__DATA__ = {"assets":
{"-----------------------.js":"https://dac-static.atlassian.com/_static/
-----------------------.80866af279244d8a0ff3.bundle.js","-
...snip
|\n|------------|---------------------------------|\n| Applicable |
Confluence Server 5.5 - 8.5 \u003cbr> Confluence Data Center 5.6 and
later|\n\nThe Confluence Server and Data Center REST API is for admins who
want to script interactions with Confluence Server or Confluence Data Center
and developers who want to integrate with or build on top of the Confluence
platform.\n\n\u003cbr>\n\u003cdiv style=\"color: green;
background-color: #f0f0f0; padding: 10px;\">\nFor REST API documentation,
see \u003ca href=\"/server/confluence/rest/v900/intro\">Confluence Server
and Data Center REST API reference\u003c/a>.\n\u003c/div>\n\nUsing Cloud?
Find out about the [Confluence Cloud REST API]
(/cloud/confluence/rest).\n\n\n## CRUD Operations\n\nConfluence's REST APIs
provide access to resources (data entities) via URI paths. To use a REST
API, your application will make an HTTP request and parse the response. By
default, the response format is JSON. Your methods will be the standard HTTP
methods: GET, PUT, POST and DELETE.
</script>
<title>The Confluence Data Center REST API</title>
It looks like they have a form of Markdown as the content, which they place into a script and then render client-side. That also gives the page appreciable rendering jank when you visit it.
Recall that jsoup is an HTML parser, not a JavaScript executor, so document.text() only sees the markup the server sent, not what the script renders later.
One approach to get the specific content in this case using jsoup would be to read the data out of that script element and parse it yourself.
Or, use a full headless browser like Playwright, which will be more general, but will necessarily have a higher resource overhead.
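A rough sketch of the jsoup-side approach, under some assumptions: the script body has already been read as a string (with jsoup, calling `data()` on the selected `script` element returns it), and the payload is assigned as `window.__DATA__ = {...}` as in the excerpt above. The class and method names are made up for illustration and are not part of jsoup. The hand-rolled unescaping only covers a few escapes, so a real JSON library would probably be the more robust choice for the decoding step.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of jsoup: works on the script body as a plain string.
final class ScriptDataExtractor {

    // Assumes the page assigns one object literal: window.__DATA__ = { ... }
    private static final Pattern DATA_ASSIGNMENT =
            Pattern.compile("window\\.__DATA__\\s*=\\s*(\\{.*\\})", Pattern.DOTALL);

    /** Returns the JSON text assigned to window.__DATA__, or null if it is not present. */
    static String extractJson(String scriptBody) {
        Matcher m = DATA_ASSIGNMENT.matcher(scriptBody);
        return m.find() ? m.group(1) : null;
    }

    /** Decodes a few JSON string escapes (newline, tab, quote, backslash, 4-digit unicode). */
    static String unescape(String s) {
        StringBuilder out = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c != '\\' || i + 1 >= s.length()) {
                out.append(c);
                continue;
            }
            char next = s.charAt(++i);
            if (next == 'n') {
                out.append('\n');
            } else if (next == 't') {
                out.append('\t');
            } else if (next == 'u' && i + 4 < s.length()) {
                // Four hex digits follow the 'u'; malformed digits throw NumberFormatException.
                out.append((char) Integer.parseInt(s.substring(i + 1, i + 5), 16));
                i += 4;
            } else {
                out.append(next); // covers escaped quote, backslash and slash
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String scriptBody = "window.__DATA__ = {\"body\":\"Using Cloud?\\n\\u003cbr>\"};";
        System.out.println(extractJson(scriptBody));
        System.out.println(unescape("Using Cloud?\\n\\u003cbr>"));
    }
}
```

The decoded value is still the Markdown-with-inline-HTML source shown in the excerpt, so it would need a further rendering or stripping step before it resembles what the browser displays.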