
feat: implement a subset of SerpApi to extract knowledge card carousel images #350


Open: wants to merge 2 commits into master


binoverfl0w

The GoogleSerp gem aims to simulate SerpApi's ability to extract data from search results. Its scope, however, is limited to extracting images from knowledge card carousels.

As shown in the image below:
[Image: Van Gogh Paintings]

A search may return multiple knowledge cards (About, Artworks, etc.), each with a different layout. This PR targets the layout used by Artworks: a carousel of images where each element has the following structure:

<a href="...">
  <img src="..." />
  <div class="...">
    <div class="..."></div>
    <div class="..."></div>
  </div>
</a>

Note: This structure may differ on Firefox. For this implementation, I relied on Chrome to work out how I would extract the data.

Although one could use CSS class selectors to locate all the elements needed, I took a different approach: navigating the document hierarchy, so that the extraction does not depend on obfuscated class names.

A quick way to find the knowledge card we need is to navigate the document using doc.at_xpath('//div[h1[text()="Search Results"]]'). This finds the div element that contains an <h1> whose text is Search Results. The tabs shown on the results page are refined queries (e.g., after searching for Van Gogh and clicking the Artworks tab, you are redirected to the 'vincent van gogh artwork' results), so we know the carousel will be inside this main div.
The next CSS selector of interest is div[data-attrid^="kc:/"]. It is not obfuscated, so it is easier to read, and it gives us the div that acts as the container for all the carousel images.

From here we can iterate over every <a> element under the knowledge card and, for each anchor, collect its <img> child node and all the text nodes beneath it.

Tested it with other similar layouts:
Adele songs
JK Rowling books
Leo Di Caprio movies

Handling image lazy-loading

During testing I noticed that the src of an image element was sometimes set to a placeholder GIF, with the actual image loaded later. From what I observed, if an <img> element has a data-src attribute, that value is used; otherwise a <script> tag may call a function that changes the image source, identifying the image by its id.
The script has the following format:

(function(){var s="<image data>";var ii=['image_id'];...;_setImagesSrc(ii, s, ...);})();

I encountered several overloads of the _setImagesSrc function, but the important part is that the first argument holds the image id and the second argument is the new image source. We can use the name passed as the second argument to find the variable holding the data, and then associate that data with our image element.
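One way to pull the id-to-source mapping out of such a script is sketched below. It assumes the script follows the format quoted above (string and array literals assigned with var, then passed by name to _setImagesSrc). The `image_sources_from_script` name and the regular expressions are illustrative, not the PR's implementation, and any JavaScript escape sequences inside the string are left as-is.

```ruby
# Hypothetical helper: maps each image id to the source string that the
# script passes to _setImagesSrc. Returns {} if the format is not recognised.
def image_sources_from_script(js)
  # Names of the first two arguments: the ids variable and the source variable.
  call = js.match(/_setImagesSrc\(\s*(\w+)\s*,\s*(\w+)/)
  return {} unless call

  ids_var = call[1]
  src_var = call[2]

  # Look up the quoted string assigned to the source variable.
  src = js[/\bvar\s+#{Regexp.escape(src_var)}\s*=\s*(["'])(.*?)\1/, 2]
  # Look up the array literal assigned to the ids variable.
  ids_literal = js[/\bvar\s+#{Regexp.escape(ids_var)}\s*=\s*\[(.*?)\]/, 1]
  return {} unless src && ids_literal

  ids = ids_literal.scan(/["']([^"']+)["']/).flatten
  ids.to_h { |id| [id, src] }
end
```

The result can be merged across all <script> tags on the page and then consulted by image id when an <img> has no data-src attribute.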

If neither a data-src attribute nor a matching script is found, the original src attribute of the image is used.
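The resulting order of precedence can be written as a small function. This is a sketch with hypothetical names (`resolve_image_src`, `script_sources`): it accepts any object that answers `[]` with attribute names, which a Nokogiri element does, so plain hashes stand in for elements here.

```ruby
# Hypothetical helper: picks the image source in the order described above.
#   img            - object responding to [] with attribute names
#   script_sources - hash of image id => source recovered from <script> tags
def resolve_image_src(img, script_sources = {})
  # 1. Prefer data-src when present and non-empty.
  data_src = img['data-src']
  return data_src if data_src && !data_src.empty?

  # 2. Otherwise use a source that a script would assign to this image id.
  # 3. Otherwise fall back to the original src attribute.
  script_sources[img['id']] || img['src']
end

lazy     = { 'id' => 'img_1', 'src' => 'placeholder.gif', 'data-src' => 'real.jpg' }
scripted = { 'id' => 'img_2', 'src' => 'placeholder.gif' }
plain    = { 'id' => 'img_3', 'src' => 'direct.jpg' }
sources  = { 'img_2' => 'data:image/jpeg;base64,AAAA' }

resolved = [lazy, scripted, plain].map { |img| resolve_image_src(img, sources) }
```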

Testing

bundle exec rspec
