Implements parser for artwork list in google results page #353

vaocode · 2025-08-14T23:14:16Z

This PR implements a parser for the artworks shown on Google's search results page.

Cases covered

I've considered two cases: the Artworks page (the large carousel from the example) and a default search page (the small carousel).

Architecture

We have the following classes:

GoogleSearchPageCrawler - Responsible for receiving a file path, fetching the page HTML and formatting the expected result as JSON.
GoogleSearchPageCrawler::Parser - Knows how to parse the page DOM. Its parse method returns a GoogleSearchPageCrawler::Parser::Result: a data/value object built with dry-struct. This makes our data structure more explicit and prevents the mistyping errors that happen when we just use a hash.
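
To illustrate why a value object helps here, below is a minimal sketch of the idea. It is not the PR's code: the PR uses dry-struct, while this sketch swaps in Ruby's built-in Struct so it runs without gems. The top-level names `Artwork` and `Result`, the `to_json_string` helper and the exact JSON shape are assumptions made for the example; the attribute names (name, extensions, link, image) follow the fields mentioned in this PR.

```ruby
require "json"

# Stand-in for the dry-struct value objects described above, written with
# plain Struct so the sketch runs without extra gems.
Artwork = Struct.new(:name, :extensions, :link, :image, keyword_init: true) do
  # Required keywords make a mistyped or missing attribute fail loudly
  # (ArgumentError) instead of silently producing a nil, as a hash would.
  def initialize(name:, link:, extensions: [], image: nil)
    super(name: name, extensions: extensions, link: link, image: image)
  end
end

Result = Struct.new(:artworks, keyword_init: true) do
  # Assumed output shape: { "artworks": [ { name, extensions, link, image } ] }
  def to_json_string
    JSON.pretty_generate({ "artworks" => artworks.map(&:to_h) })
  end
end

result = Result.new(artworks: [
  Artwork.new(
    name: "The Potato Eaters",
    extensions: ["1885"],
    link: "https://www.google.com/search?q=The+Potato+Eaters" # illustrative link
  )
])
puts result.to_json_string
```

With a plain hash, `artwork[:title]` would quietly return nil; here `Artwork.new(title: "...", link: "...")` raises an ArgumentError at construction time.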

Parsing logic

I've tried to make the scraper less error prone by using non-obfuscated selectors such as [data-attrid="kc:/visual_art/visual_artist:works"] and by looking for text nodes, instead of classes or DOM hierarchy, to find the name/extensions.

The parse_small_carrousel_artwork and parse_big_carrousel_artwork methods are intentionally kept separate, even though their logic is similar. Both parse the same concept (Result::Artwork), but from different DOM structures. Keeping them distinct ensures that each case retains its own logic and execution strategy.

For now, we're using methods but - if needed in the future - one idea is to split the parsing logic into multiple "sub-classes" instead of methods.

Example: GoogleSearchPageCrawler::Parser::ListResult, GoogleSearchPageCrawler::Parser::Artworks, etc.

Each class could parse a specific part of the page. While not strictly necessary, this approach can reduce cognitive load as the amount of data we want to extract grows.
It keeps the code more organized and cohesive, and makes it easier to locate and fix a broken parsing rule.
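
As a rough sketch of what that split could look like (not something this PR implements), the main parser could delegate to one object per page section. `SectionedParser`, the `#call(document)` contract and the lambdas below are all hypothetical; the lambdas use regexes over a string only to keep the example free of dependencies.

```ruby
# Hypothetical composition: each section parser is any object responding to
# #call(document) and returning the data for one part of the page.
class SectionedParser
  def initialize(sections)
    @sections = sections
  end

  # Runs every section parser and collects the results under its key.
  def parse(document)
    @sections.to_h { |key, parser| [key, parser.call(document)] }
  end
end

parser = SectionedParser.new({
  artworks: ->(doc) { doc.scan(/alt="([^"]+)"/).flatten },
  title: ->(doc) { doc[%r{<title>(.*?)</title>}m, 1] }
})

p parser.parse('<title>Van Gogh</title><img alt="Irises">')
```

A broken rule for one section then lives in exactly one object, which is the main benefit the paragraph above describes.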

Image parsing

The readme highlights that we have to keep the image attribute for both cases:

the base64 encoded image
the image link (those that require a click on the "show more" button)

When I ran a test against the expected-array.json file, I noticed that the <img> tag has a placeholder GIF as its src. There are two cases:

img with id attribute

<img class="taFZJe" alt="The Potato Eaters" id="_L_FkZ4qlAtyDwbkP49Pj0QU_79" src="data:image/gif;base64,R0lGODlhAQABAIAAAP///////yH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" data-deferred="1">

The same ID can be found inside a script tag, together with the base64-encoded image.

<script nonce="xmO6un4J9murPFDygFfaMA">(function(){var s='data:image/webp;base64,UklGRjQMAABXRUJQVlA4ICgMAAAQRACdASrhAJsAPxGAt1QsKCU1KDV7MqAiCWcHDtAkSjkn/r/Xf+ydgBeraVdn/9Px1UGv5GYPyffrugwzIfw/Rc/+vnb/kP/hwO2JbSYKG1VQIN78tct7QVKKyA/XDj2TQ174tLSeF8ejv+SZJ2zx....';var ii=['_L_FkZ4qlAtyDwbkP49Pj0QU_79'];var r='';_setImagesSrc(ii,s,r);})();</script>

So, we have to find the script tag with the same ID and extract the base64 encoded image from there.
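
The lookup described above could be sketched roughly as follows. This is an illustration, not the PR's implementation: the method name `inline_image_for` is hypothetical, it works on the raw HTML string with regexes instead of a DOM parser, and it assumes the hex-encoded values mentioned in the commit list look like `\x3d` escapes inside the JavaScript string.

```ruby
# Returns the data URI that an inline script assigns to the <img> with the
# given id, or nil when no script references that id.
def inline_image_for(html, image_id)
  html.scan(%r{<script[^>]*>(.*?)</script>}m).flatten.each do |body|
    # The script lists the ids it targets, e.g. var ii=['<id>'];
    next unless body.include?("'#{image_id}'")

    match = body.match(/var s\s*=\s*'(data:image\/[^']+)'/)
    next unless match

    # Assumption: JS hex escapes such as \x3d ("=") may appear in the payload.
    return match[1].gsub(/\\x(\h{2})/) { Regexp.last_match(1).hex.chr }
  end
  nil
end
```

Matching on the quoted id avoids picking up a script whose id merely starts with the same characters.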

img with `data-src` attribute. Additional request is needed

<img class="taFZJe" alt="Self-Portrait with Bandaged Ear" data-src="https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcQ8juuefle5MyKZKBLRgPjsGSJon7vkt91SM7WTRuZOOyAyUI1v" src="data:image/gif;base64,R0lGODlhAQABAIAAAP///////yH5BAEKAAEALAAAAAABAAEAAAICTAEAOw=="/>

We just use the data-src and return it.
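
Putting the two image cases side by side, the decision could look like the sketch below. The method name `image_for` and both argument shapes are assumptions made for the example: the `<img>` attributes are passed as a plain hash, and the script lookup is represented by a precomputed hash of ids to data URIs.

```ruby
# img_attrs: attributes of one <img>, e.g. {"id" => "...", "data-src" => "..."}
# inline_images: Hash mapping <img> ids to data URIs recovered from scripts
def image_for(img_attrs, inline_images)
  # Deferred thumbnails carry a real URL in data-src; return it as-is.
  return img_attrs["data-src"] if img_attrs["data-src"]

  # Otherwise fall back to the script lookup by id; nil when nothing matches.
  inline_images[img_attrs["id"]]
end
```

Note that the `src` attribute is never returned, since in both cases it only holds the placeholder GIF.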

Usage

Running tests

Run `bundle exec rspec` to execute the specs.

Scraping a search page

Execute `bundle exec ruby scrape_files.rb FILENAME.HTML` to crawl the page with GoogleSearchPageCrawler, parse the artworks and save the result inside the files folder.

The script looks for the file in the files folder and defaults to van-gogh-paintings.html.
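
The argument handling described above could be sketched like this. The helper name `input_and_output_paths` is hypothetical, and the output file name (same base name with a .json extension) is an assumption, since the description only says the result is saved inside the files folder.

```ruby
FILES_DIR = "files"
DEFAULT_FILE = "van-gogh-paintings.html"

# Returns [input_path, output_path] for the given command-line arguments.
def input_and_output_paths(argv, files_dir: FILES_DIR)
  filename = argv.first || DEFAULT_FILE
  input = File.join(files_dir, filename)
  # Assumed naming: files/<name>.html is saved as files/<name>.json
  output = File.join(files_dir, "#{File.basename(filename, ".*")}.json")
  [input, output]
end

input, output = input_and_output_paths(ARGV)
puts "Reading #{input}, writing #{output}"
```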

Notes

Interesting case - Following artwork link

I found another case: when we click on an artwork, the list appears as a horizontal list at the top of the page.

https://www.google.com/search?sca_esv=e77f9c08ad4d25ad&sxsrf=AE3TifPtmZe03gj5RYheiEGzNrtk6qieag:1755204957217&q=Bullfinch+and+weeping+cherry+blossoms&stick=H4sIAAAAAAAAAONgFuLQz9U3SCpPM1Hi1U_XNzRMNi5JrzKozNFSyk620i_LLC5NzIlPLCpBYmYWl1iV5xdlFy9iVXUqzclJy8xLzlBIzEtRKE9NLcjMS1dIzkgtKqpUSMrJLy7Ozy0GAHIW_uhnAAAA&sa=X&ved=2ahUKEwj_qO7_l4uPAxWrIrkGHbNdIp4QgOQBegQIMhAS

I didn't cover this case because it only happens when we follow an artwork link: the page loads with the selected artwork highlighted, and its anchor contains a placeholder href="#".

If we ever need to cover this case, we can use the div[data-attrid="kc:/visual_art/visual_artist:works"] [role="group"] a selector and maybe change the URL logic to carry "current page" information, so that we return the correct link for the selected artwork.
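
A possible shape for that "current page" logic is sketched below. Nothing here is part of the PR: `artwork_link` and `BASE_URL` are hypothetical names, and the sketch assumes that regular artwork hrefs are relative to https://www.google.com.

```ruby
require "uri"

BASE_URL = "https://www.google.com" # assumed base for relative hrefs

# Returns an absolute link for an artwork anchor. The selected artwork has a
# placeholder href ("#"), so it resolves to the URL of the page being parsed.
def artwork_link(href, current_page_url)
  return current_page_url if href.nil? || href.empty? || href == "#"

  URI.join(BASE_URL, href).to_s
end
```

The crawler would need to know the URL of the page it is parsing, which a local HTML file alone does not provide, so that value would have to be passed in by the caller.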

RAW HTML analysis

Artwork specific page (Example from README)

The raw HTML (from 'view source code') lists all the artworks

Normal search page ('small carousel')

The raw HTML (from 'view source code') lists only 6 artworks. The others seem to be embedded in JavaScript.

Testing in the playground (https://serpapi.com/playground?q=monet&location=Austin%2C+Texas%2C+United+States&gl=us&hl=en), it seems that SerpAPI does handle this case.

I didn't try to parse them because I believe this is outside the scope of this exercise, but I see other options:

Instead of manually parsing scripts, we could use a real browser to run the JavaScript and get the rendered HTML. This is less performant but, if other parts of the page also require this method, it can be an alternative.
We can follow the "Artworks" link and scrape everything from there using the "Artwork specific page" implementation (I believe this is the way to go).

vaocode added 26 commits August 14, 2025 19:23

Project setup with Gemfile and Rspec

93b80ac

Adds basic project and classes structure with solution documentation

590d506

Implements parser parse_artwork for titles

850e8e1

Adds extensions to parsing result

6e6da78

Adds load_fixture_file spec helper method

6a650ba

Adds link to parser

d379900

Adds image data to parser

c5086af

Implements .parse method for all artworks + specs

1eb9f30

Changes parsing title attribute to name

7ac84e3

Implements image url from data-src

f54550a

Implements image base64 script parser

e68157f

Split tests into integration / unit folders

2818a46

Add more tests for artworks page scraping

ab73772

Fixes cases when the base64 has hex encoded values

d3cdeac

Adds a scrape_files.rb with an execution example

4a13a96

Adds dry types/struct and parser Result object

5393ad0

Formats crawler output to json

624d206

Updates scrape_file.rb script to save the output json

a5d5cea

Implements conditional for big/small carrousel

22b4426

ref: renames fixture files

abbdb09

Implements parsing for small carrousel

baf919d

Implements image parsing for smal carrousel

4e172bf

Implements default search page small carrousel parsing

e599374

Adds one more test example for small carrousell

5425712

Updates solution docs

f7ed4d9

Adds more parsing example files

4c8724c

vaocode force-pushed the master branch from 3362f62 to 4c8724c Compare August 14, 2025 23:15

andypple83 closed this Aug 22, 2025
