2 changes: 1 addition & 1 deletion docling_core/transforms/serializer/html.py
@@ -787,7 +787,7 @@ def serialize(
)

# Join all parts without separators
- inline_html = " ".join([p.text for p in parts if p.text])
+ inline_html = "".join([p.text for p in parts if p.text])

# Wrap in span if needed
if inline_html:
2 changes: 1 addition & 1 deletion docling_core/transforms/serializer/markdown.py
@@ -670,7 +670,7 @@ def serialize(
visited=my_visited,
**kwargs,
)
- text_res = " ".join([p.text for p in parts if p.text])
+ text_res = "".join([p.text for p in parts if p.text])
return create_ser_result(text=text_res, span_source=parts)
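The effect of the separator change in both serializers can be reproduced in isolation. The sketch below is not the library code: `Part` and `join_parts` are hypothetical stand-ins, and the way the "Getting Started" sentence is split into parts is an assumption made for illustration. The first input mirrors the constructed-document fixtures changed in this PR.

```python
from dataclasses import dataclass


@dataclass
class Part:
    # Hypothetical stand-in for a serialized inline part; only `.text` matters here.
    text: str


def join_parts(parts, sep):
    # Same expression as the changed line, with the separator made explicit.
    return sep.join([p.text for p in parts if p.text])


# Parts as they appear in the constructed-document fixtures: no part carries
# its own leading or trailing whitespace.
snippet = [
    Part("Here a code snippet:"),
    Part('<code>print("Hello world")</code>'),
    Part("(to be displayed inline)"),
]

# Assumed split for the "Getting Started" sentence: the text before the link
# keeps its trailing space, and the text after it starts with the period.
sentence = [
    Part("our GitHub repository at "),
    Part('<a href="https://github.com/DS4SD/docling">github.com/DS4SD/docling</a>'),
    Part(". All required model assets"),
]

old_snippet = join_parts(snippet, " ")   # behavior before this change
new_snippet = join_parts(snippet, "")    # behavior after this change
old_sentence = join_parts(sentence, " ")
new_sentence = join_parts(sentence, "")

for line in (old_snippet, new_snippet, old_sentence, new_sentence):
    print(line)
```

With the empty separator, the stray space before the period disappears when the parts already carry their own whitespace, but parts without any whitespace are fused together, which is what the updated constructed-document fixtures record.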


2 changes: 1 addition & 1 deletion test/data/doc/2408.09869v3_enriched_p2_p3_p5.gt.html
@@ -105,7 +105,7 @@
<li style="list-style-type: '· ';">Can leverage different accelerators (GPU, MPS, etc).</li>
</ul>
<h2>2 Getting Started</h2>
<span class='inline-group'>To use Docling, you can simply install the docling package from PyPI. Documentation and examples are available in our GitHub repository at <a href="https://github.com/DS4SD/docling">github.com/DS4SD/docling</a> . All required model assets 1 are downloaded to a local huggingface datasets cache on first use, unless you choose to pre-install the model assets in advance.</span>
<span class='inline-group'>To use Docling, you can simply install the docling package from PyPI. Documentation and examples are available in our GitHub repository at <a href="https://github.com/DS4SD/docling">github.com/DS4SD/docling</a>. All required model assets 1 are downloaded to a local huggingface datasets cache on first use, unless you choose to pre-install the model assets in advance.</span>
<p>Docling provides an easy code interface to convert PDF documents from file system, URLs or binary streams, and retrieve the output in either JSON or Markdown format. For convenience, separate methods are offered to convert single documents or batches of documents. A basic usage example is illustrated below. Further examples are available in the Doclign code repository.</p>
<pre><code>from docling.document_converter import DocumentConverter Large</code></pre>
<pre><code>source = "https://arxiv.org/pdf/2206.01062" # PDF path or URL converter = DocumentConverter() result = converter.convert_single(source) print(result.render_as_markdown()) # output: "## DocLayNet: A Human -Annotated Dataset for Document -Layout Analysis [...]"</code></pre>
2 changes: 1 addition & 1 deletion test/data/doc/2408.09869v3_enriched_split.gt.html
@@ -127,7 +127,7 @@ <h2>1 Introduction</h2>
<li style="list-style-type: '· ';">Can leverage different accelerators (GPU, MPS, etc).</li>
</ul>
<h2>2 Getting Started</h2>
<span class='inline-group'>To use Docling, you can simply install the docling package from PyPI. Documentation and examples are available in our GitHub repository at <a href="https://github.com/DS4SD/docling">github.com/DS4SD/docling</a> . All required model assets 1 are downloaded to a local huggingface datasets cache on first use, unless you choose to pre-install the model assets in advance.</span>
<span class='inline-group'>To use Docling, you can simply install the docling package from PyPI. Documentation and examples are available in our GitHub repository at <a href="https://github.com/DS4SD/docling">github.com/DS4SD/docling</a>. All required model assets 1 are downloaded to a local huggingface datasets cache on first use, unless you choose to pre-install the model assets in advance.</span>
<p>Docling provides an easy code interface to convert PDF documents from file system, URLs or binary streams, and retrieve the output in either JSON or Markdown format. For convenience, separate methods are offered to convert single documents or batches of documents. A basic usage example is illustrated below. Further examples are available in the Doclign code repository.</p>
<pre><code>from docling.document_converter import DocumentConverter Large</code></pre>
<pre><code>source = "https://arxiv.org/pdf/2206.01062" # PDF path or URL converter = DocumentConverter() result = converter.convert_single(source) print(result.render_as_markdown()) # output: "## DocLayNet: A Human -Annotated Dataset for Document -Layout Analysis [...]"</code></pre>
2 changes: 1 addition & 1 deletion test/data/doc/2408.09869v3_enriched_split_p2.gt.html
@@ -106,7 +106,7 @@
<li style="list-style-type: '· ';">Can leverage different accelerators (GPU, MPS, etc).</li>
</ul>
<h2>2 Getting Started</h2>
<span class='inline-group'>To use Docling, you can simply install the docling package from PyPI. Documentation and examples are available in our GitHub repository at <a href="https://github.com/DS4SD/docling">github.com/DS4SD/docling</a> . All required model assets 1 are downloaded to a local huggingface datasets cache on first use, unless you choose to pre-install the model assets in advance.</span>
<span class='inline-group'>To use Docling, you can simply install the docling package from PyPI. Documentation and examples are available in our GitHub repository at <a href="https://github.com/DS4SD/docling">github.com/DS4SD/docling</a>. All required model assets 1 are downloaded to a local huggingface datasets cache on first use, unless you choose to pre-install the model assets in advance.</span>
<p>Docling provides an easy code interface to convert PDF documents from file system, URLs or binary streams, and retrieve the output in either JSON or Markdown format. For convenience, separate methods are offered to convert single documents or batches of documents. A basic usage example is illustrated below. Further examples are available in the Doclign code repository.</p>
<pre><code>from docling.document_converter import DocumentConverter Large</code></pre>
<pre><code>source = "https://arxiv.org/pdf/2206.01062" # PDF path or URL converter = DocumentConverter() result = converter.convert_single(source) print(result.render_as_markdown()) # output: "## DocLayNet: A Human -Annotated Dataset for Document -Layout Analysis [...]"</code></pre>
6 changes: 3 additions & 3 deletions test/data/doc/concatenated.html
@@ -363,10 +363,10 @@ <h2>1. Introduction</h2>
<ul>
<li style="list-style-type: '□ ';">item 1 of sub list</li>
<li style="list-style-type: '□ ';">
<span class='inline-group'>Here a code snippet: <code>print("Hello world")</code> (to be displayed inline)</span>
<span class='inline-group'>Here a code snippet:<code>print("Hello world")</code>(to be displayed inline)</span>
</li>
<li style="list-style-type: '□ ';">
<span class='inline-group'>Here a formula: <math xmlns="http://www.w3.org/1998/Math/MathML" display="inline"><mrow><mi>E</mi><mo>&#x0003D;</mo><mi>m</mi><msup><mi>c</mi><mn>2</mn></msup></mrow><annotation encoding="TeX">E=mc^2</annotation></math> (to be displayed inline)</span>
<span class='inline-group'>Here a formula:<math xmlns="http://www.w3.org/1998/Math/MathML" display="inline"><mrow><mi>E</mi><mo>&#x0003D;</mo><mi>m</mi><msup><mi>c</mi><mn>2</mn></msup></mrow><annotation encoding="TeX">E=mc^2</annotation></math>(to be displayed inline)</span>
</li>
</ul>
</li>
@@ -387,7 +387,7 @@ <h2>1. Introduction</h2>

</ul>
</div>
<span class='inline-group'>Some formatting chops: <strong>bold</strong> <em>italic</em> <u>underline</u> <del>strikethrough</del> <sub>subscript</sub> <sup>superscript</sup> <a href=".">hyperlink</a> &amp; <a href="https://github.com/DS4SD/docling"><del><u><em><strong>everything at the same time.</strong></em></u></del></a></span>
<span class='inline-group'>Some formatting chops:<strong>bold</strong><em>italic</em><u>underline</u><del>strikethrough</del><sub>subscript</sub><sup>superscript</sup><a href=".">hyperlink</a>&amp;<a href="https://github.com/DS4SD/docling"><del><u><em><strong>everything at the same time.</strong></em></u></del></a></span>
<ol>
<li style="list-style-type: '(i) ';">Item 1 in A</li>
<li style="list-style-type: '(ii) ';">Item 2 in A</li>
6 changes: 3 additions & 3 deletions test/data/doc/constructed_doc.embedded.html.gt
@@ -167,10 +167,10 @@ item 2 of neighboring list
<ul>
<li style="list-style-type: '□ ';">item 1 of sub list</li>
<li style="list-style-type: '□ ';">
<span class='inline-group'>Here a code snippet: <code>print("Hello world")</code> (to be displayed inline)</span>
<span class='inline-group'>Here a code snippet:<code>print("Hello world")</code>(to be displayed inline)</span>
</li>
<li style="list-style-type: '□ ';">
<span class='inline-group'>Here a formula: <math xmlns="http://www.w3.org/1998/Math/MathML" display="inline"><mrow><mi>E</mi><mo>&#x0003D;</mo><mi>m</mi><msup><mi>c</mi><mn>2</mn></msup></mrow><annotation encoding="TeX">E=mc^2</annotation></math> (to be displayed inline)</span>
<span class='inline-group'>Here a formula:<math xmlns="http://www.w3.org/1998/Math/MathML" display="inline"><mrow><mi>E</mi><mo>&#x0003D;</mo><mi>m</mi><msup><mi>c</mi><mn>2</mn></msup></mrow><annotation encoding="TeX">E=mc^2</annotation></math>(to be displayed inline)</span>
</li>
</ul>
</li>
@@ -191,7 +191,7 @@ item 2 of neighboring list

</ul>
</div>
<span class='inline-group'>Some formatting chops: <strong>bold</strong> <em>italic</em> <u>underline</u> <del>strikethrough</del> <sub>subscript</sub> <sup>superscript</sup> <a href=".">hyperlink</a> &amp; <a href="https://github.com/DS4SD/docling"><del><u><em><strong>everything at the same time.</strong></em></u></del></a></span>
<span class='inline-group'>Some formatting chops:<strong>bold</strong><em>italic</em><u>underline</u><del>strikethrough</del><sub>subscript</sub><sup>superscript</sup><a href=".">hyperlink</a>&amp;<a href="https://github.com/DS4SD/docling"><del><u><em><strong>everything at the same time.</strong></em></u></del></a></span>
<ol>
<li style="list-style-type: '(i) ';">Item 1 in A</li>
<li style="list-style-type: '(ii) ';">Item 2 in A</li>
6 changes: 3 additions & 3 deletions test/data/doc/constructed_doc.embedded.md.gt
@@ -44,8 +44,8 @@ This is the caption of figure 2.
- item 1 of neighboring list
- item 2 of neighboring list
- item 1 of sub list
- Here a code snippet: `print("Hello world")` (to be displayed inline)
- Here a formula: $E=mc^2$ (to be displayed inline)
- Here a code snippet:`print("Hello world")`(to be displayed inline)
- Here a formula:$E=mc^2$(to be displayed inline)

Here a code block:

@@ -61,7 +61,7 @@ $$E=mc^2$$

<!-- missing-form-item -->

Some formatting chops: **bold** *italic* underline ~~strikethrough~~ subscript superscript [hyperlink](.) &amp; [~~***everything at the same time.***~~](https://github.com/DS4SD/docling)
Some formatting chops:**bold***italic*underline~~strikethrough~~subscriptsuperscript[hyperlink](.)&amp;[~~***everything at the same time.***~~](https://github.com/DS4SD/docling)

- (i) Item 1 in A
- (ii) Item 2 in A
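The fused Markdown in this hunk (`**bold***italic*`, `subscriptsuperscript`) shows the cost of an empty separator when adjacent parts carry no whitespace of their own. One possible alternative, which is not what this PR implements, is a join that adds a space only where neither neighbor supplies one and the next part does not start with closing punctuation. The sketch below is purely illustrative: `smart_join` and the `NO_SPACE_BEFORE` set are hypothetical names and choices, not part of docling-core.

```python
# Characters that normally attach to the preceding token without a space.
# This set is an assumption for illustration, not taken from the PR.
NO_SPACE_BEFORE = set(".,;:!?)]}")


def smart_join(texts):
    """Join inline texts, adding one space only where neither side supplies it."""
    out = ""
    for text in texts:
        if not text:
            continue  # skip empty parts, as the serializers do
        needs_space = bool(
            out
            and not out[-1].isspace()
            and not text[0].isspace()
            and text[0] not in NO_SPACE_BEFORE
        )
        out += (" " if needs_space else "") + text
    return out


# Parts without their own whitespace stay separated ...
print(smart_join(["Some formatting chops:", "**bold**", "*italic*"]))
# ... while a part that starts with a period attaches to the previous one.
print(smart_join(["repository at", "[docling](https://github.com/DS4SD/docling)", ". All required"]))
```

This heuristic has its own gaps: for example, it would still insert a space after an opening parenthesis that ends the previous part. It is meant only to make the trade-off between the two separators concrete.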
6 changes: 3 additions & 3 deletions test/data/doc/constructed_doc.placeholder.html.gt
@@ -167,10 +167,10 @@ item 2 of neighboring list
<ul>
<li style="list-style-type: '□ ';">item 1 of sub list</li>
<li style="list-style-type: '□ ';">
<span class='inline-group'>Here a code snippet: <code>print("Hello world")</code> (to be displayed inline)</span>
<span class='inline-group'>Here a code snippet:<code>print("Hello world")</code>(to be displayed inline)</span>
</li>
<li style="list-style-type: '□ ';">
<span class='inline-group'>Here a formula: <math xmlns="http://www.w3.org/1998/Math/MathML" display="inline"><mrow><mi>E</mi><mo>&#x0003D;</mo><mi>m</mi><msup><mi>c</mi><mn>2</mn></msup></mrow><annotation encoding="TeX">E=mc^2</annotation></math> (to be displayed inline)</span>
<span class='inline-group'>Here a formula:<math xmlns="http://www.w3.org/1998/Math/MathML" display="inline"><mrow><mi>E</mi><mo>&#x0003D;</mo><mi>m</mi><msup><mi>c</mi><mn>2</mn></msup></mrow><annotation encoding="TeX">E=mc^2</annotation></math>(to be displayed inline)</span>
</li>
</ul>
</li>
@@ -191,7 +191,7 @@ item 2 of neighboring list

</ul>
</div>
<span class='inline-group'>Some formatting chops: <strong>bold</strong> <em>italic</em> <u>underline</u> <del>strikethrough</del> <sub>subscript</sub> <sup>superscript</sup> <a href=".">hyperlink</a> &amp; <a href="https://github.com/DS4SD/docling"><del><u><em><strong>everything at the same time.</strong></em></u></del></a></span>
<span class='inline-group'>Some formatting chops:<strong>bold</strong><em>italic</em><u>underline</u><del>strikethrough</del><sub>subscript</sub><sup>superscript</sup><a href=".">hyperlink</a>&amp;<a href="https://github.com/DS4SD/docling"><del><u><em><strong>everything at the same time.</strong></em></u></del></a></span>
<ol>
<li style="list-style-type: '(i) ';">Item 1 in A</li>
<li style="list-style-type: '(ii) ';">Item 2 in A</li>
6 changes: 3 additions & 3 deletions test/data/doc/constructed_doc.placeholder.md.gt
@@ -44,8 +44,8 @@ This is the caption of figure 2.
- item 1 of neighboring list
- item 2 of neighboring list
- item 1 of sub list
- Here a code snippet: `print("Hello world")` (to be displayed inline)
- Here a formula: $E=mc^2$ (to be displayed inline)
- Here a code snippet:`print("Hello world")`(to be displayed inline)
- Here a formula:$E=mc^2$(to be displayed inline)

Here a code block:

@@ -61,7 +61,7 @@ $$E=mc^2$$

<!-- missing-form-item -->

Some formatting chops: **bold** *italic* underline ~~strikethrough~~ subscript superscript [hyperlink](.) &amp; [~~***everything at the same time.***~~](https://github.com/DS4SD/docling)
Some formatting chops:**bold***italic*underline~~strikethrough~~subscriptsuperscript[hyperlink](.)&amp;[~~***everything at the same time.***~~](https://github.com/DS4SD/docling)

- (i) Item 1 in A
- (ii) Item 2 in A
6 changes: 3 additions & 3 deletions test/data/doc/constructed_doc.referenced.html.gt
@@ -167,10 +167,10 @@ item 2 of neighboring list
<ul>
<li style="list-style-type: '□ ';">item 1 of sub list</li>
<li style="list-style-type: '□ ';">
<span class='inline-group'>Here a code snippet: <code>print("Hello world")</code> (to be displayed inline)</span>
<span class='inline-group'>Here a code snippet:<code>print("Hello world")</code>(to be displayed inline)</span>
</li>
<li style="list-style-type: '□ ';">
<span class='inline-group'>Here a formula: <math xmlns="http://www.w3.org/1998/Math/MathML" display="inline"><mrow><mi>E</mi><mo>&#x0003D;</mo><mi>m</mi><msup><mi>c</mi><mn>2</mn></msup></mrow><annotation encoding="TeX">E=mc^2</annotation></math> (to be displayed inline)</span>
<span class='inline-group'>Here a formula:<math xmlns="http://www.w3.org/1998/Math/MathML" display="inline"><mrow><mi>E</mi><mo>&#x0003D;</mo><mi>m</mi><msup><mi>c</mi><mn>2</mn></msup></mrow><annotation encoding="TeX">E=mc^2</annotation></math>(to be displayed inline)</span>
</li>
</ul>
</li>
@@ -191,7 +191,7 @@ item 2 of neighboring list

</ul>
</div>
<span class='inline-group'>Some formatting chops: <strong>bold</strong> <em>italic</em> <u>underline</u> <del>strikethrough</del> <sub>subscript</sub> <sup>superscript</sup> <a href=".">hyperlink</a> &amp; <a href="https://github.com/DS4SD/docling"><del><u><em><strong>everything at the same time.</strong></em></u></del></a></span>
<span class='inline-group'>Some formatting chops:<strong>bold</strong><em>italic</em><u>underline</u><del>strikethrough</del><sub>subscript</sub><sup>superscript</sup><a href=".">hyperlink</a>&amp;<a href="https://github.com/DS4SD/docling"><del><u><em><strong>everything at the same time.</strong></em></u></del></a></span>
<ol>
<li style="list-style-type: '(i) ';">Item 1 in A</li>
<li style="list-style-type: '(ii) ';">Item 2 in A</li>
6 changes: 3 additions & 3 deletions test/data/doc/constructed_doc.referenced.md.gt
@@ -44,8 +44,8 @@ This is the caption of figure 2.
- item 1 of neighboring list
- item 2 of neighboring list
- item 1 of sub list
- Here a code snippet: `print("Hello world")` (to be displayed inline)
- Here a formula: $E=mc^2$ (to be displayed inline)
- Here a code snippet:`print("Hello world")`(to be displayed inline)
- Here a formula:$E=mc^2$(to be displayed inline)

Here a code block:

@@ -61,7 +61,7 @@ $$E=mc^2$$

<!-- missing-form-item -->

Some formatting chops: **bold** *italic* underline ~~strikethrough~~ subscript superscript [hyperlink](.) &amp; [~~***everything at the same time.***~~](https://github.com/DS4SD/docling)
Some formatting chops:**bold***italic*underline~~strikethrough~~subscriptsuperscript[hyperlink](.)&amp;[~~***everything at the same time.***~~](https://github.com/DS4SD/docling)

- (i) Item 1 in A
- (ii) Item 2 in A
6 changes: 3 additions & 3 deletions test/data/doc/constructed_document.yaml.html
@@ -167,10 +167,10 @@ <h2>1. Introduction</h2>
<ul>
<li style="list-style-type: '□ ';">item 1 of sub list</li>
<li style="list-style-type: '□ ';">
<span class='inline-group'>Here a code snippet: <code>print("Hello world")</code> (to be displayed inline)</span>
<span class='inline-group'>Here a code snippet:<code>print("Hello world")</code>(to be displayed inline)</span>
</li>
<li style="list-style-type: '□ ';">
<span class='inline-group'>Here a formula: <math xmlns="http://www.w3.org/1998/Math/MathML" display="inline"><mrow><mi>E</mi><mo>&#x0003D;</mo><mi>m</mi><msup><mi>c</mi><mn>2</mn></msup></mrow><annotation encoding="TeX">E=mc^2</annotation></math> (to be displayed inline)</span>
<span class='inline-group'>Here a formula:<math xmlns="http://www.w3.org/1998/Math/MathML" display="inline"><mrow><mi>E</mi><mo>&#x0003D;</mo><mi>m</mi><msup><mi>c</mi><mn>2</mn></msup></mrow><annotation encoding="TeX">E=mc^2</annotation></math>(to be displayed inline)</span>
</li>
</ul>
</li>
@@ -191,7 +191,7 @@ <h2>1. Introduction</h2>

</ul>
</div>
<span class='inline-group'>Some formatting chops: <strong>bold</strong> <em>italic</em> <u>underline</u> <del>strikethrough</del> <sub>subscript</sub> <sup>superscript</sup> <a href=".">hyperlink</a> &amp; <a href="https://github.com/DS4SD/docling"><del><u><em><strong>everything at the same time.</strong></em></u></del></a></span>
<span class='inline-group'>Some formatting chops:<strong>bold</strong><em>italic</em><u>underline</u><del>strikethrough</del><sub>subscript</sub><sup>superscript</sup><a href=".">hyperlink</a>&amp;<a href="https://github.com/DS4SD/docling"><del><u><em><strong>everything at the same time.</strong></em></u></del></a></span>
<ol>
<li style="list-style-type: '(i) ';">Item 1 in A</li>
<li style="list-style-type: '(ii) ';">Item 2 in A</li>
6 changes: 3 additions & 3 deletions test/data/doc/constructed_document.yaml.md
@@ -44,8 +44,8 @@ This is the caption of figure 2.
- item 1 of neighboring list
- item 2 of neighboring list
- item 1 of sub list
- Here a code snippet: `print("Hello world")` (to be displayed inline)
- Here a formula: $E=mc^2$ (to be displayed inline)
- Here a code snippet:`print("Hello world")`(to be displayed inline)
- Here a formula:$E=mc^2$(to be displayed inline)

Here a code block:

@@ -61,7 +61,7 @@ $$E=mc^2$$

<!-- missing-form-item -->

Some formatting chops: **bold** *italic* underline ~~strikethrough~~ subscript superscript [hyperlink](.) &amp; [~~***everything at the same time.***~~](https://github.com/DS4SD/docling)
Some formatting chops:**bold***italic*underline~~strikethrough~~subscriptsuperscript[hyperlink](.)&amp;[~~***everything at the same time.***~~](https://github.com/DS4SD/docling)

- (i) Item 1 in A
- (ii) Item 2 in A
6 changes: 3 additions & 3 deletions test/data/doc/constructed_legacy_annot_mark_false.gt.md
@@ -46,8 +46,8 @@ This is the caption of figure 2.
- item 1 of neighboring list
- item 2 of neighboring list
- item 1 of sub list
- Here a code snippet: `print("Hello world")` (to be displayed inline)
- Here a formula: $E=mc^2$ (to be displayed inline)
- Here a code snippet:`print("Hello world")`(to be displayed inline)
- Here a formula:$E=mc^2$(to be displayed inline)

Here a code block:

@@ -63,7 +63,7 @@ $$E=mc^2$$

<!-- missing-form-item -->

Some formatting chops: **bold** *italic* underline ~~strikethrough~~ subscript superscript [hyperlink](.) &amp; [~~***everything at the same time.***~~](https://github.com/DS4SD/docling)
Some formatting chops:**bold***italic*underline~~strikethrough~~subscriptsuperscript[hyperlink](.)&amp;[~~***everything at the same time.***~~](https://github.com/DS4SD/docling)

- (i) Item 1 in A
- (ii) Item 2 in A
6 changes: 3 additions & 3 deletions test/data/doc/constructed_legacy_annot_mark_true.gt.md
@@ -46,8 +46,8 @@ This is the caption of figure 2.
- item 1 of neighboring list
- item 2 of neighboring list
- item 1 of sub list
- Here a code snippet: `print("Hello world")` (to be displayed inline)
- Here a formula: $E=mc^2$ (to be displayed inline)
- Here a code snippet:`print("Hello world")`(to be displayed inline)
- Here a formula:$E=mc^2$(to be displayed inline)

Here a code block:

@@ -63,7 +63,7 @@ $$E=mc^2$$

<!-- missing-form-item -->

Some formatting chops: **bold** *italic* underline ~~strikethrough~~ subscript superscript [hyperlink](.) &amp; [~~***everything at the same time.***~~](https://github.com/DS4SD/docling)
Some formatting chops:**bold***italic*underline~~strikethrough~~subscriptsuperscript[hyperlink](.)&amp;[~~***everything at the same time.***~~](https://github.com/DS4SD/docling)

- (i) Item 1 in A
- (ii) Item 2 in A