Closed
26 commits
93b80ac
Project setup with Gemfile and Rspec
vaocode Aug 14, 2025
590d506
Adds basic project and classes structure with solution documentation
vaocode Aug 14, 2025
850e8e1
Implements parser parse_artwork for titles
vaocode Aug 14, 2025
6e6da78
Adds extensions to parsing result
vaocode Aug 14, 2025
6a650ba
Adds load_fixture_file spec helper method
vaocode Aug 14, 2025
d379900
Adds link to parser
vaocode Aug 14, 2025
c5086af
Adds image data to parser
vaocode Aug 14, 2025
1eb9f30
Implements .parse method for all artworks + specs
vaocode Aug 14, 2025
7ac84e3
Changes parsing title attribute to name
vaocode Aug 14, 2025
f54550a
Implements image url from data-src
vaocode Aug 14, 2025
e68157f
Implements image base64 script parser
vaocode Aug 14, 2025
2818a46
Split tests into integration / unit folders
vaocode Aug 14, 2025
ab73772
Add more tests for artworks page scraping
vaocode Aug 14, 2025
d3cdeac
Fixes cases when the base64 has hex encoded values
vaocode Aug 14, 2025
4a13a96
Adds a scrape_files.rb with an execution example
vaocode Aug 14, 2025
5393ad0
Adds dry types/struct and parser Result object
vaocode Aug 14, 2025
624d206
Formats crawler output to json
vaocode Aug 14, 2025
a5d5cea
Updates scrape_file.rb script to save the output json
vaocode Aug 14, 2025
22b4426
Implements conditional for big/small carrousel
vaocode Aug 14, 2025
abbdb09
ref: renames fixture files
vaocode Aug 14, 2025
baf919d
Implements parsing for small carrousel
vaocode Aug 14, 2025
4e172bf
Implements image parsing for smal carrousel
vaocode Aug 14, 2025
e599374
Implements default search page small carrousel parsing
vaocode Aug 14, 2025
5425712
Adds one more test example for small carrousell
vaocode Aug 14, 2025
f7ed4d9
Updates solution docs
vaocode Aug 14, 2025
4c8724c
Adds more parsing example files
vaocode Aug 14, 2025
2 changes: 2 additions & 0 deletions .rspec
@@ -0,0 +1,2 @@
--require spec_helper
--format doc
1 change: 1 addition & 0 deletions .tool-versions
@@ -0,0 +1 @@
ruby 3.3.1
16 changes: 16 additions & 0 deletions Gemfile
@@ -0,0 +1,16 @@
# frozen_string_literal: true

source "https://rubygems.org"

# gem "rails"
gem 'nokogiri'
gem 'dry-types'
gem 'dry-struct'

group :development do
gem 'debug'
end

group :test do
gem 'rspec'
end
103 changes: 103 additions & 0 deletions Gemfile.lock
@@ -0,0 +1,103 @@
GEM
remote: https://rubygems.org/
specs:
bigdecimal (3.2.2)
concurrent-ruby (1.3.5)
date (3.4.1)
debug (1.11.0)
irb (~> 1.10)
reline (>= 0.3.8)
diff-lcs (1.6.2)
dry-core (1.1.0)
concurrent-ruby (~> 1.0)
logger
zeitwerk (~> 2.6)
dry-inflector (1.2.0)
dry-logic (1.6.0)
bigdecimal
concurrent-ruby (~> 1.0)
dry-core (~> 1.1)
zeitwerk (~> 2.6)
dry-struct (1.8.0)
dry-core (~> 1.1)
dry-types (~> 1.8, >= 1.8.2)
ice_nine (~> 0.11)
zeitwerk (~> 2.6)
dry-types (1.8.3)
bigdecimal (~> 3.0)
concurrent-ruby (~> 1.0)
dry-core (~> 1.0)
dry-inflector (~> 1.0)
dry-logic (~> 1.4)
zeitwerk (~> 2.6)
erb (5.0.2)
ice_nine (0.11.2)
io-console (0.8.1)
irb (1.15.2)
pp (>= 0.6.0)
rdoc (>= 4.0.0)
reline (>= 0.4.2)
logger (1.7.0)
nokogiri (1.18.9-aarch64-linux-gnu)
racc (~> 1.4)
nokogiri (1.18.9-aarch64-linux-musl)
racc (~> 1.4)
nokogiri (1.18.9-arm-linux-gnu)
racc (~> 1.4)
nokogiri (1.18.9-arm-linux-musl)
racc (~> 1.4)
nokogiri (1.18.9-arm64-darwin)
racc (~> 1.4)
nokogiri (1.18.9-x86_64-darwin)
racc (~> 1.4)
nokogiri (1.18.9-x86_64-linux-gnu)
racc (~> 1.4)
nokogiri (1.18.9-x86_64-linux-musl)
racc (~> 1.4)
pp (0.6.2)
prettyprint
prettyprint (0.2.0)
psych (5.2.6)
date
stringio
racc (1.8.1)
rdoc (6.14.2)
erb
psych (>= 4.0.0)
reline (0.6.2)
io-console (~> 0.5)
rspec (3.13.1)
rspec-core (~> 3.13.0)
rspec-expectations (~> 3.13.0)
rspec-mocks (~> 3.13.0)
rspec-core (3.13.5)
rspec-support (~> 3.13.0)
rspec-expectations (3.13.5)
diff-lcs (>= 1.2.0, < 2.0)
rspec-support (~> 3.13.0)
rspec-mocks (3.13.5)
diff-lcs (>= 1.2.0, < 2.0)
rspec-support (~> 3.13.0)
rspec-support (3.13.4)
stringio (3.1.7)
zeitwerk (2.7.3)

PLATFORMS
aarch64-linux-gnu
aarch64-linux-musl
arm-linux-gnu
arm-linux-musl
arm64-darwin
x86_64-darwin
x86_64-linux-gnu
x86_64-linux-musl

DEPENDENCIES
debug
dry-struct
dry-types
nokogiri
rspec

BUNDLED WITH
2.5.9
2 changes: 1 addition & 1 deletion README.md
@@ -21,7 +21,7 @@ Parse directly the HTML result page ([html file]) in this repository. No extra H
[html file]: https://raw.githubusercontent.com/serpapi/code-challenge/master/files/van-gogh-paintings.html
[expected array]: https://raw.githubusercontent.com/serpapi/code-challenge/master/files/expected-array.json

Add also to your array the painting thumbnails present in the result page file (not the ones where extra requests are needed).
Add also to your array the painting thumbnails present in the result page file (not the ones where extra requests are needed).

Test against 2 other similar result pages to make sure it works against different layouts. (Pages that contain the same kind of carrousel. Don't necessarily have to be paintings.)

94 changes: 94 additions & 0 deletions SOLUTION.md
@@ -0,0 +1,94 @@
This PR implements a solution to parse artworks from Google's search results.

# Cases covered
I've considered 2 cases: the artworks page (the large carousel from the example) and a default search page (the small carousel).

![](files/van-gogh-paintings.png)
![](files/default-page-search.png)



# Architecture
We have 2 main classes:

- `GoogleSearchPageCrawler` - Receives a file path, loads the page HTML and formats the expected result as JSON.
- `GoogleSearchPageCrawler::Parser` - Knows how to parse the page DOM. Its `parse` method returns a `GoogleSearchPageCrawler::Parser::Result`: a data/value object built with dry-struct. This makes our data structure more explicit and prevents the mistyping errors that happen when we just use a hash.
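
To illustrate why a typed result object helps, here is a minimal sketch of the same idea. It is not the PR's code: it swaps dry-struct for the standard library's `Struct` so it runs without gems, and the field names (`name`, `extensions`, `link`, `image`) and the top-level `artworks` key are assumptions based on the commit messages.

```ruby
require 'json'

# Hypothetical stand-ins for Parser::Result and Result::Artwork.
# keyword_init makes a misspelled attribute raise instead of passing silently.
Artwork = Struct.new(:name, :extensions, :link, :image, keyword_init: true)

Result = Struct.new(:artworks, keyword_init: true) do
  # Serializes the result as { "artworks": [ {...}, ... ] }.
  def to_json(*args)
    { artworks: artworks.map(&:to_h) }.to_json(*args)
  end
end

artwork = Artwork.new(
  name: 'The Starry Night',
  extensions: ['1889'],
  link: 'https://www.google.com/search?q=The+Starry+Night',
  image: nil
)
puts Result.new(artworks: [artwork]).to_json
```

With a plain hash, `{ nam: 'The Starry Night' }` would be accepted and only fail later; here `Artwork.new(nam: '...')` raises an `ArgumentError` immediately.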

# Parsing logic

I've implemented everything in a single class but - if needed in the future - one idea is to split the parsing logic into multiple "sub-classes" instead of methods.

Example: `GoogleSearchPageCrawler::Parser::ListResult`, `GoogleSearchPageCrawler::Parser::Artworks`, etc.

Each class could parse a specific part of the page. While not strictly necessary, this approach can reduce cognitive load as the amount of data to extract grows.
It keeps the code more organized and cohesive, and makes it easier to locate and fix a broken parsing rule.

I've tried to make the scraper less error prone by using non-obfuscated selectors such as `data-attrid="kc:/visual_art/visual_artist:works"` and by looking for text nodes, instead of classes or DOM hierarchy, to find the name/extensions.

The `parse_small_carrousel_artwork` and `parse_big_carrousel_artwork` methods are intentionally kept separate, even though their logic is similar. Both parse the same concept (Result::Artwork), but from different DOM structures. Keeping them distinct ensures that each case retains its own logic and execution strategy.

## Image parsing
The README requires us to keep the image attribute in both cases:

- the base64 encoded image
- the image link (those that require a click on the "show more" button)

When I ran a test against the `expected-array.json` file, I noticed that the `<img>` tags have a placeholder GIF as their `src`. There are 2 cases:

### img with id attribute
```html
<img class="taFZJe" alt="The Potato Eaters" id="_L_FkZ4qlAtyDwbkP49Pj0QU_79" src="data:image/gif;base64,R0lGODlhAQABAIAAAP///////yH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" data-deferred="1">
```

The same ID can be found inside a script tag, together with the base64 encoded image.

```html
<script nonce="xmO6un4J9murPFDygFfaMA">(function(){var s='data:image/webp;base64,UklGRjQMAABXRUJQVlA4ICgMAAAQRACdASrhAJsAPxGAt1QsKCU1KDV7MqAiCWcHDtAkSjkn/r/Xf+ydgBeraVdn/9Px1UGv5GYPyffrugwzIfw/Rc/+vnb/kP/hwO2JbSYKG1VQIN78tct7QVKKyA/XDj2TQ174tLSeF8ejv+SZJ2zx....';var ii=['_L_FkZ4qlAtyDwbkP49Pj0QU_79'];var r='';_setImagesSrc(ii,s,r);})();</script>
```

So, we have to find the script tag with the same ID and extract the base64 encoded image from there.
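
The lookup described above could be sketched roughly as follows. This is an illustration, not the PR's implementation: it uses regular expressions on the raw HTML to stay dependency-free (the parser itself works on the DOM), the method name `inline_image_for` is hypothetical, and it assumes hex escapes of the form `\x3d` (the case mentioned in the "hex encoded values" commit).

```ruby
# Returns the data URI that an inline script assigns to `image_id`, or nil.
def inline_image_for(html, image_id)
  html.scan(%r{<script[^>]*>(.*?)</script>}m) do |(body)|
    # Only consider the script that lists this image id.
    next unless body.include?("'#{image_id}'")

    data_uri = body[/var s='(data:image[^']+)'/, 1]
    next unless data_uri

    # Decode JavaScript hex escapes such as \x3d back to "=".
    return data_uri.gsub(/\\x(\h{2})/) { Regexp.last_match(1).hex.chr }
  end
  nil
end
```

Calling `inline_image_for(html, '_L_FkZ4qlAtyDwbkP49Pj0QU_79')` on the page would return the `data:image/webp;base64,...` string from the matching script, or `nil` when no script references that id.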

### img with `data-src` attribute. Additional request is needed
```html
<img class="taFZJe" alt="Self-Portrait with Bandaged Ear" data-src="https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcQ8juuefle5MyKZKBLRgPjsGSJon7vkt91SM7WTRuZOOyAyUI1v" src="data:image/gif;base64,R0lGODlhAQABAIAAAP///////yH5BAEKAAEALAAAAAABAAEAAAICTAEAOw=="/>
```

We just use the `data-src` and return it.
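
Putting the two cases together, the choice of image could look like the sketch below. The method name `image_for` and the `inline_images` hash (image id mapped to its base64 data URI) are hypothetical; the attributes are passed as a plain hash so the example runs without Nokogiri.

```ruby
# Picks the image for an <img> tag given its attributes.
# inline_images maps an img id to the data URI found in the page scripts.
def image_for(img_attrs, inline_images)
  inline = inline_images[img_attrs['id']]
  return inline if inline

  # No inline image: fall back to the lazy-load URL (nil when absent).
  img_attrs['data-src']
end
```

The 1x1 placeholder GIF in `src` is never returned, since it carries no information about the artwork.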

# Usage

## Running tests
`bundle exec rspec` to run the specs

## Scraping a search page

Execute
`bundle exec ruby scrape_files.rb FILENAME.HTML` to use the `GoogleSearchPageCrawler` to crawl the page, parse the artworks and save the result inside the `files` folder.

It searches for the file in the `files` folder. Defaults to `van-gogh-paintings.html`
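
As a rough sketch of that file resolution (not the actual `scrape_files.rb`): the helper names are hypothetical, and the output file name, the input name with a `.json` extension, is an assumption since the source only says the result is saved in the `files` folder.

```ruby
FILES_DIR = 'files'
DEFAULT_FILE = 'van-gogh-paintings.html'

# Resolves the HTML file to scrape from the command-line arguments.
def input_path(argv)
  File.join(FILES_DIR, argv.first || DEFAULT_FILE)
end

# Assumed naming scheme for the saved JSON output.
def output_path(input)
  File.join(FILES_DIR, "#{File.basename(input, '.html')}.json")
end
```

For example, `input_path([])` resolves to `files/van-gogh-paintings.html`.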

# Notes
## Interesting case - Following artwork link
There's another case that I found: when we click on an artwork and the list appears as a horizontal top list.

https://www.google.com/search?sca_esv=e77f9c08ad4d25ad&sxsrf=AE3TifPtmZe03gj5RYheiEGzNrtk6qieag:1755204957217&q=Bullfinch+and+weeping+cherry+blossoms&stick=H4sIAAAAAAAAAONgFuLQz9U3SCpPM1Hi1U_XNzRMNi5JrzKozNFSyk620i_LLC5NzIlPLCpBYmYWl1iV5xdlFy9iVXUqzclJy8xLzlBIzEtRKE9NLcjMS1dIzkgtKqpUSMrJLy7Ozy0GAHIW_uhnAAAA&sa=X&ved=2ahUKEwj_qO7_l4uPAxWrIrkGHbNdIp4QgOQBegQIMhAS

I didn't cover this case because I've noticed that it happens only when we follow an artwork link: the page loads with the artwork highlighted, and its link contains an empty `href="#"`.

If we ever need to cover this case we can use the
`div[data-attrid="kc:/visual_art/visual_artist:works"] [role="group"] a` selector and maybe change the URL logic to have a "current page" information in order to return the correct link for the selected artwork.
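
One way the "current page" idea could work is sketched below. Nothing here exists in the PR: the `artwork_link` helper and its `current_page_url` argument are hypothetical, and it assumes the other links are relative to `https://www.google.com`.

```ruby
require 'uri'

GOOGLE_BASE = 'https://www.google.com'

# Builds the absolute link for an artwork.
# The selected artwork has href="#", so its link is the current page itself.
def artwork_link(href, current_page_url)
  return current_page_url if href.nil? || href == '#'

  URI.join(GOOGLE_BASE, href).to_s
end
```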

## RAW HTML analysis

### Artwork specific page (Example from README)
The raw HTML (from 'view source code') lists all the artworks

### Normal search page ('small carousel')
The raw HTML (from 'view source code') lists only 6 artworks. The other ones seem to be inside a JavaScript block.

Testing in the playground: https://serpapi.com/playground?q=monet&location=Austin%2C+Texas%2C+United+States&gl=us&hl=en it seems that SerpAPI does consider this case.

I didn't try to parse them because I believe this is outside the scope of this exercise, but I see other (and probably better) options:

- Instead of manually parsing scripts, we could consider using a real browser to evaluate the HTML before parsing the data. This is less performant but, if other parts of the page also require this method, it can be an alternative.
- We can follow the "Artworks" link and scrape everything from there using the "Artwork specific page" implementation (I believe that this is the way to go...)
Binary file added files/default-page-search.png