feat: implement a subset of SerpApi to extract knowledge card carousel images #350
The `GoogleSerp` gem aims to simulate SerpApi's ability to extract data from search results; however, its scope is limited to extracting images from knowledge card carousels, as shown in the image below:

A search may return multiple knowledge cards (About, Artworks, etc.), which have different layouts. The layout this PR attempts to extract is the one shown by Artworks. It resembles a carousel of images where each element has the following structure:
Note: This structure may differ on Firefox. For this implementation, I relied on Chrome to work out how I would extract the data.
Although one could use the CSS class selectors to locate all the elements needed, I took a different, hierarchical approach so that the extraction does not depend on obfuscated class names.
A quick way to find the knowledge card we need is to navigate the document using `doc.at_xpath('//div[h1[text()="Search Results"]]')`. This finds the `div` element that contains an `<h1>` with the content `Search Results`. The tabs shown on the search results page are refined queries (e.g., after searching for Van Gogh and clicking the Artworks tab, you are redirected to the 'vincent van gogh artwork' search results), so we know the carousel will be inside this main `div`.

The next CSS selector of interest is `div[data-attrid^="kc:/"]`. It is not obfuscated, so it is easier to read, and it gives us the `div` that acts as a container for all the carousel images.

From here we can iterate over every `<a>` element under the knowledge card and, for each anchor element, get its `<img>` child node and all the text nodes under it.

Tested it with other similar layouts:



Handling image lazy-loading
During testing I noticed that the `src` of image elements was sometimes set to a placeholder GIF and only later replaced with the actual image. From what I observed, if an `<img>` element has a `data-src` attribute, that value is simply used; otherwise a `<script>` tag may call a function that changes the image source, looking the image up by its id. The script has the following format:
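As a rough illustration only: the sketch below assumes a script of the shape `var s='...';var ii=['...'];_setImagesSrc(ii,s);`, inferred from the description of `_setImagesSrc` in this PR rather than copied from a real page. The names `SET_SRC_CALL`, `deferred_sources`, and `resolve_src` are hypothetical, and image attributes are modelled as a plain Hash instead of a parsed node.

```ruby
# Assumed script shape (hypothetical, other overloads are not handled):
#   var s='<new source>';var ii=['<image id>', ...];_setImagesSrc(ii,s);
# \3 and \1 require the call to use the same two variable names.
SET_SRC_CALL = /var\s+(\w+)\s*=\s*'([^']*)'\s*;\s*var\s+(\w+)\s*=\s*\[([^\]]*)\]\s*;\s*_setImagesSrc\(\s*\3\s*,\s*\1\s*\)/

# Builds an { image id => new source } map from the text of all scripts.
def deferred_sources(script_text)
  sources = {}
  script_text.scan(SET_SRC_CALL) do |_src_var, source, _ids_var, ids|
    ids.scan(/'([^']+)'/).flatten.each { |id| sources[id] = source }
  end
  sources
end

# Resolution order: data-src, then a script-provided source for this id,
# then the original src attribute.
def resolve_src(attrs, sources)
  attrs['data-src'] || sources[attrs['id']] || attrs['src']
end

script = "(function(){var s='data:image/jpeg;base64,AAAA';" \
         "var ii=['dimg_1','dimg_2'];_setImagesSrc(ii,s);})();"
sources = deferred_sources(script)

puts resolve_src({ 'id' => 'dimg_1', 'src' => 'placeholder.gif' }, sources)
puts resolve_src({ 'id' => 'other', 'data-src' => 'real.jpg', 'src' => 'placeholder.gif' }, sources)
puts resolve_src({ 'id' => 'none', 'src' => 'plain.jpg' }, sources)
```

The sketch does not decode JavaScript string escapes inside the source value; a real page may need that step before the value is usable.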
I encountered some overloads of the `_setImagesSrc` function, but the important part is that the first argument is the image id and the second argument is the new image source. We can use the second argument to find the variable holding the data and associate it with our image element.

If neither a `data-src` attribute nor a matching script function is found, the original `src` attribute of the image is used.

Testing