Fix hasSingleTagInsideElement method
It would fail for e.g. `<div> <p>foo</p> </div>`.

mozilla/readability uses `children` for the tag lookup, which returns only elements.
PHP's DOM does not have a `children` property, so b580cf2
mistakenly used `childNodes` instead, but that can return any node type.

Let’s filter the children ourselves.

Also add comments from mozilla/readability’s `_hasSingleTagInsideElement`.
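
To illustrate the difference described above, here is a minimal sketch that is not part of the commit. It parses the failing example with PHP's built-in DOM extension and compares the full `childNodes` list with an element-only list. The variable names are made up for the example, and `array_values()` is added here only because `array_filter()` preserves the original keys.

```php
<?php
// Illustrative only: shows why childNodes is not a drop-in for the DOM `children` property.
$doc = new DOMDocument();
$doc->loadXML('<div> <p>foo</p> </div>');
$div = $doc->documentElement;

// childNodes also contains the whitespace text nodes around <p>.
$allNodes = iterator_to_array($div->childNodes);

// Keep only element nodes. array_filter() preserves keys,
// so array_values() reindexes the result from 0.
$elements = array_values(array_filter($allNodes, function ($n) {
    return $n instanceof DOMElement;
}));

printf(
    "%d child nodes, %d element(s), first element: %s\n",
    count($allNodes),
    count($elements),
    $elements[0]->nodeName
);
// prints "3 child nodes, 1 element(s), first element: p"
```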
jtojnar committed Mar 16, 2024
1 parent 38870cd commit 8687acf
Showing 1 changed file with 12 additions and 2 deletions.
14 changes: 12 additions & 2 deletions src/Readability.php
@@ -1469,13 +1469,23 @@ private function isPhrasingContent($node): bool
         }, iterator_to_array($node->childNodes)), true));
     }
 
+    /**
+     * Checks if `$node` has only whitespace and a single element with `$tag` for the tag name.
+     * Returns false if `$node` contains non-empty text nodes
+     * or if it contains no element with given tag or more than 1 element.
+     */
     private function hasSingleTagInsideElement(\DOMElement $node, string $tag): bool
     {
-        if (1 !== $node->childNodes->length || $node->childNodes->item(0)->nodeName !== $tag) {
+        $childNodes = iterator_to_array($node->childNodes);
+        $children = array_filter($childNodes, fn ($childNode) => $childNode instanceof \DOMElement);
+
+        // There should be exactly 1 element child with given tag
+        if (1 !== \count($children) || $children[0]->nodeName !== $tag) {
             return false;
         }
 
-        $a = array_filter(iterator_to_array($node->childNodes), function ($childNode) {
+        // And there should be no text nodes with real content
+        $a = array_filter($childNodes, function ($childNode) {
             return $childNode instanceof \DOMText &&
                 preg_match($this->regexps['hasContent'], $this->getInnerText($childNode));
         });

Check failure on line 1480 in src/Readability.php (GitHub Actions / CS Fixer & PHPStan (7.2)):
Syntax error, unexpected ')' on line 1480
Syntax error, unexpected T_DOUBLE_ARROW, expecting ')' on line 1480
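
The `fn` arrow-function syntax was introduced in PHP 7.4, which fits the syntax errors reported by the PHP 7.2 check. As a hedged sketch only, and not the committed method, the same check can be written with a classic closure. The function name `hasSingleTagInsideElementSketch` is hypothetical. The sketch replaces `$this->regexps['hasContent']` and `getInnerText()` with a plain `trim()` test, on the assumption that "real content" means any non-whitespace character. It also reindexes the filtered array with `array_values()`, since `array_filter()` preserves keys and the single element may not sit at index 0.

```php
<?php
// Hypothetical standalone variant for illustration; not the library's method.
function hasSingleTagInsideElementSketch(DOMElement $node, string $tag): bool
{
    $childNodes = iterator_to_array($node->childNodes);

    // Classic closure instead of `fn`, so this also parses on PHP 7.2.
    // array_values() reindexes because array_filter() preserves keys.
    $children = array_values(array_filter($childNodes, function ($childNode) {
        return $childNode instanceof DOMElement;
    }));

    // There should be exactly 1 element child with the given tag.
    if (1 !== count($children) || $children[0]->nodeName !== $tag) {
        return false;
    }

    // Assumption: a text node has "real content" if it is not whitespace-only.
    foreach ($childNodes as $childNode) {
        if ($childNode instanceof DOMText && '' !== trim($childNode->textContent)) {
            return false;
        }
    }

    return true;
}

$samples = [
    '<div> <p>foo</p> </div>',      // whitespace around a single <p>
    '<div>bar<p>foo</p></div>',     // text with content next to the <p>
    '<div><p>a</p><p>b</p></div>',  // two element children
];

$results = [];
foreach ($samples as $xml) {
    $doc = new DOMDocument();
    $doc->loadXML($xml);
    $results[] = hasSingleTagInsideElementSketch($doc->documentElement, 'p');
}

var_dump($results); // true, false, false
```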