feat: implement a subset of SerpApi to extract knowledge card carousel images #350

binoverfl0w · 2025-07-30T00:57:30Z

The GoogleSerp gem aims to replicate SerpApi's ability to extract data from search results; its scope, however, is limited to extracting images from knowledge card carousels.

As shown in the image below:

A search may return multiple knowledge cards (About, Artworks, etc.) with different layouts. This PR targets the layout used by Artworks: a carousel of images in which each element has the following structure:

<a href="...">
  <img src="..." />
  <div class="...">
    <div class="..."></div>
    <div class="..."></div>
  </div>
</a>

Note: This structure may differ on Firefox. For this implementation, I relied on Chrome to work out how I would extract the data.

Although CSS class selectors could locate all the required elements, I took a different approach: navigating the document hierarchy, so that the extraction does not depend on obfuscated class names.

A quick way to find the knowledge card we need is to navigate the document with doc.at_xpath('//div[h1[text()="Search Results"]]'). This finds the div element that contains an <h1> whose text is Search Results. The tabs shown on the results page are refined queries (e.g., searching for Van Gogh and clicking the Artworks tab redirects to the results for 'vincent van gogh artwork'), so we know the carousel will be inside this main div.
The next selector of interest is the CSS selector div[data-attrid^="kc:/"]. It is not obfuscated, so it is easier to read, and it returns the div that acts as the container for all the carousel images.

From here we can iterate over every <a> element under the knowledge card and, for each anchor, get its <img> child node and all the text nodes under it.

Tested it with other similar layouts:

Handling image lazy-loading

During testing I noticed that the src of an image element was sometimes set to a placeholder GIF, with the actual image loaded later. From what I observed, if an <img> element has a data-src attribute, the implementation simply uses that value; otherwise, a <script> tag may call a function that changes the image source, identifying the image by its id.
The script has the following format:

(function(){var s="<image data>";var ii=['image_id'];...;_setImagesSrc(ii, s, ...);})();

I encountered several overloads of the _setImagesSrc function, but the important part is that the first argument is the image id and the second argument is the new image source. We can use the second argument to find the variable holding the data and associate it with our image element.
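As a rough sketch of that association, the script text can be scanned with regular expressions. The helper name lazy_sources and the exact patterns are assumptions for illustration; they only cover the format shown above, not every overload, and any JavaScript escape sequences inside the string are left as-is.

```ruby
# Hypothetical helper: maps each image id named in a script to the source
# string passed to _setImagesSrc. Returns {} when the script does not match.
def lazy_sources(script)
  # Capture the names of the first two arguments of the call.
  call = script.match(/_setImagesSrc\(\s*(\w+)\s*,\s*(\w+)/)
  return {} unless call

  ids_var, src_var = call.captures

  # Look up the array literal assigned to the ids variable...
  ids_literal = script[/var\s+#{Regexp.escape(ids_var)}\s*=\s*\[(.*?)\]/m, 1]
  # ...and the quoted string assigned to the source variable.
  source = script[/var\s+#{Regexp.escape(src_var)}\s*=\s*(["'])(.*?)\1/m, 2]
  return {} unless ids_literal && source

  ids = ids_literal.scan(/["']([^"']+)["']/).flatten
  ids.to_h { |id| [id, source] }
end

# Made-up script body in the format described above.
script = %q{(function(){var s="data:image/jpeg;base64,AAAA";var ii=['img_1'];_setImagesSrc(ii, s);})();}
sources = lazy_sources(script)
puts sources.keys.inspect
```

Keying the result by image id means the lookup for a given <img> element is a single hash access on its id attribute.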

If neither a data-src attribute nor a matching script is found, the implementation falls back to the original src attribute of the image.
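The resulting order of precedence could look something like the sketch below. The method name resolve_image_source and the script_sources hash (image id to source) are hypothetical; attrs is any object that returns attribute values via [], such as a plain Hash or a Nokogiri element.

```ruby
# Hypothetical helper: picks the best available source for one image.
def resolve_image_source(attrs, script_sources = {})
  # 1. Prefer data-src when it is present and non-empty.
  data_src = attrs["data-src"]
  return data_src if data_src && !data_src.empty?

  # 2. Otherwise use a source that a script assigned to this image id.
  from_script = script_sources[attrs["id"]]
  return from_script if from_script

  # 3. Fall back to the original src attribute.
  attrs["src"]
end

# Made-up attribute sets for illustration.
placeholder = { "id" => "img_1", "src" => "placeholder.gif" }
puts resolve_image_source(placeholder, { "img_1" => "data:image/jpeg;base64,AAAA" })
```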

Testing

bundle exec rspec

binoverfl0w added 2 commits July 30, 2025 01:41

feat: implement a subset of SerpApi to extract knowledge card carousel images

79a4e05

test: add tests for additional similar layouts

12940d2
