Can't Parse HTML from HTMLDocument #2780

@liamh101

Description

Describe the bug and add attachments

When creating a document from HTML via the static call Html::addHtml(), the call fails if the content contains an img tag inside a p tag. It throws the DOMDocument exception: DOMDocument::loadXML(): Opening and ending tag mismatch.

The HTML was generated via PHP 8.4's HTMLDocument class.

Expected behavior

The HTML is accepted as valid HTML.

Is there an easy way to mitigate this?

Steps to reproduce

<?php

$dom = Dom\HTMLDocument::createFromString(<<<'HTML'
<!DOCTYPE html>
<html>
<body>
   <p><img style="aspect-ratio:12/13;" src="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAAwAAAANCAYAAACdKY9CAAAAAXNSR0IArs4c6QAAAARnQU1BAACxjwv8YQUAAAAJcEhZcwAAEnQAABJ0Ad5mH3gAAAJuaVRYdFNuaXBNZXRhZGF0YQAAAAAAeyJjbGlwUG9pbnRzIjpbeyJ4IjoxLCJ5IjozfSx7IngiOjIsInkiOjR9LHsieCI6MiwieSI6NX0seyJ4IjozLCJ5Ijo2fSx7IngiOjQsInkiOjd9LHsieCI6NCwieSI6OH0seyJ4Ijo0LCJ5Ijo5fSx7IngiOjUsInkiOjEwfSx7IngiOjUsInkiOjExfSx7IngiOjYsInkiOjEyfSx7IngiOjYsInkiOjEzfSx7IngiOjcsInkiOjEzfSx7IngiOjcsInkiOjE0fSx7IngiOjgsInkiOjE0fSx7IngiOjksInkiOjE0fSx7IngiOjEwLCJ5IjoxNH0seyJ4IjoxMSwieSI6MTR9LHsieCI6MTEsInkiOjEzfSx7IngiOjEyLCJ5IjoxMn0seyJ4IjoxMiwieSI6MTF9LHsieCI6MTIsInkiOjEwfSx7IngiOjEyLCJ5Ijo5fSx7IngiOjEyLCJ5Ijo4fSx7IngiOjEyLCJ5Ijo3fSx7IngiOjEyLCJ5Ijo1fSx7IngiOjEyLCJ5Ijo0fSx7IngiOjEyLCJ5IjozfSx7IngiOjEyLCJ5IjoyfSx7IngiOjExLCJ5IjoxfSx7IngiOjEwLCJ5IjowfSx7IngiOjksInkiOjB9LHsieCI6OCwieSI6MH0seyJ4Ijo3LCJ5IjowfSx7IngiOjYsInkiOjB9LHsieCI6NSwieSI6MH0seyJ4Ijo0LCJ5IjowfSx7IngiOjMsInkiOjB9LHsieCI6MiwieSI6MH0seyJ4IjoxLCJ5IjowfSx7IngiOjAsInkiOjB9XX0Gg0zKAAAAg0lEQVQoU42PwQ2AIAxFW3ULEoi6iGzkJo6EI7gCB&#43;8OAEEbGxIjobwLtMnPf0Wtxw0QVxCIIdrz9DsaMy8JkuN9FQp1AOHgWaQfetd57y9K8k4E&#43;QVtpsTfKo/SS2tLbiCUMgt58ljkEyAktazUyi8g3fJTImpaRaVaS7GBKLXEEO0NwE0ruorm1rsAAAAASUVORK5CYII&#61;" width="12" height="13" /></p>
</body>
</html>
HTML);

$doc = new PhpOffice\PhpWord\PhpWord();
$section = $doc->addSection([
    'headerHeight' => PhpOffice\PhpWord\Shared\Converter::cmToTwip(1.54),
]);

$html = $dom->saveHtml();

PhpOffice\PhpWord\Shared\Html::addHtml($section, $html);

PHPWord version(s) where the bug happened

1.3.0

PHP version(s) where the bug happened

8.4

Priority

  • I want to crowdfund the bug fix (with @algora-io) and fund a community developer.
    I want to pay the bug fix and fund a maintainer for that. (Contact @Progi1984)

Activity

added this to the 1.4.0 milestone on May 28, 2025
michalschroeder (Contributor) commented on May 30, 2025

Hey @liamh101

The issue isn't related to the img tag being inside a p element. The actual problem is the unclosed img tag in the HTML that's passed to Html::addHtml(). $dom->saveHtml() outputs void elements such as <img> without a closing slash, and the XML parser behind Html::addHtml() (the DOMDocument::loadXML() call named in the error message) rejects that as a tag mismatch.

To ensure proper formatting and valid output, especially for tags like img, br, and hr, please use $dom->saveXml() instead. This will generate properly closed tags in the output.
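
A minimal sketch of the difference between the two serializers, assuming PHP 8.4 with the DOM extension. The shortened markup and the file name a.png are placeholders rather than the data from the report:

```php
<?php
// Placeholder markup: same shape as the report (<img> inside <p>), minus the data URI.
$dom = Dom\HTMLDocument::createFromString(
    '<!DOCTYPE html><html><body><p><img src="a.png" width="12" height="13"></p></body></html>'
);

// HTML serialization: <img> is a void element and is written without a closing slash.
$html = $dom->saveHtml();

// XML serialization: every element is explicitly closed, so an XML parser can read it back.
$xml = $dom->saveXml();

// Check each string for a self-closing marker.
var_dump(str_contains($html, '/>'));
var_dump(str_contains($xml, '/>'));
```

Only the string handed to Html::addHtml() changes; the rest of the reproduction script can stay as it is.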

liamh101 (Author) commented on Jun 1, 2025

Hey @michalschroeder,

This is exactly what we've done to mitigate this. Note for anyone else: make sure to pass the LIBXML_NOXMLDECL flag as an option to saveXml(), so the output does not start with an XML declaration.
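
The workaround described here might look like the sketch below. The helper name toXmlString is hypothetical and the markup is a shortened placeholder. It has not been checked against PHPWord 1.3.0; it only shows how the flag is passed.

```php
<?php
// Hypothetical helper: serialize a parsed HTML document as XML, leaving out the XML declaration.
function toXmlString(Dom\HTMLDocument $dom): string
{
    $xml = $dom->saveXml(null, LIBXML_NOXMLDECL);
    if ($xml === false) {
        throw new RuntimeException('XML serialization failed');
    }

    return $xml;
}

// Placeholder document standing in for the one in the reproduction script.
$dom = Dom\HTMLDocument::createFromString(
    '<!DOCTYPE html><html><body><p><img src="a.png" width="12" height="13"></p></body></html>'
);

$xhtml = toXmlString($dom);

// $xhtml would then replace $html in the reproduction script:
// PhpOffice\PhpWord\Shared\Html::addHtml($section, $xhtml);
```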

modified the milestones: 1.4.0, 2.0.0 on Jun 5, 2025

          Can't Parse HTML from HTMLDocument · Issue #2780 · PHPOffice/PHPWord