serpapi · vaocode · Aug 14, 2025
diff --git a/.rspec b/.rspec
@@ -0,0 +1,2 @@
+--require spec_helper
+--format doc
diff --git a/.tool-versions b/.tool-versions
@@ -0,0 +1 @@
+ruby 3.3.1
diff --git a/Gemfile b/Gemfile
@@ -0,0 +1,16 @@
+# frozen_string_literal: true
+
+source "https://rubygems.org"
+
+# gem "rails"
+gem 'nokogiri'
+gem 'dry-types'
+gem 'dry-struct'
+
+group :development do
+  gem 'debug'
+end
+
+group :test do
+  gem 'rspec'
+end
diff --git a/Gemfile.lock b/Gemfile.lock
@@ -0,0 +1,103 @@
+GEM
+  remote: https://rubygems.org/
+  specs:
+    bigdecimal (3.2.2)
+    concurrent-ruby (1.3.5)
+    date (3.4.1)
+    debug (1.11.0)
+      irb (~> 1.10)
+      reline (>= 0.3.8)
+    diff-lcs (1.6.2)
+    dry-core (1.1.0)
+      concurrent-ruby (~> 1.0)
+      logger
+      zeitwerk (~> 2.6)
+    dry-inflector (1.2.0)
+    dry-logic (1.6.0)
+      bigdecimal
+      concurrent-ruby (~> 1.0)
+      dry-core (~> 1.1)
+      zeitwerk (~> 2.6)
+    dry-struct (1.8.0)
+      dry-core (~> 1.1)
+      dry-types (~> 1.8, >= 1.8.2)
+      ice_nine (~> 0.11)
+      zeitwerk (~> 2.6)
+    dry-types (1.8.3)
+      bigdecimal (~> 3.0)
+      concurrent-ruby (~> 1.0)
+      dry-core (~> 1.0)
+      dry-inflector (~> 1.0)
+      dry-logic (~> 1.4)
+      zeitwerk (~> 2.6)
+    erb (5.0.2)
+    ice_nine (0.11.2)
+    io-console (0.8.1)
+    irb (1.15.2)
+      pp (>= 0.6.0)
+      rdoc (>= 4.0.0)
+      reline (>= 0.4.2)
+    logger (1.7.0)
+    nokogiri (1.18.9-aarch64-linux-gnu)
+      racc (~> 1.4)
+    nokogiri (1.18.9-aarch64-linux-musl)
+      racc (~> 1.4)
+    nokogiri (1.18.9-arm-linux-gnu)
+      racc (~> 1.4)
+    nokogiri (1.18.9-arm-linux-musl)
+      racc (~> 1.4)
+    nokogiri (1.18.9-arm64-darwin)
+      racc (~> 1.4)
+    nokogiri (1.18.9-x86_64-darwin)
+      racc (~> 1.4)
+    nokogiri (1.18.9-x86_64-linux-gnu)
+      racc (~> 1.4)
+    nokogiri (1.18.9-x86_64-linux-musl)
+      racc (~> 1.4)
+    pp (0.6.2)
+      prettyprint
+    prettyprint (0.2.0)
+    psych (5.2.6)
+      date
+      stringio
+    racc (1.8.1)
+    rdoc (6.14.2)
+      erb
+      psych (>= 4.0.0)
+    reline (0.6.2)
+      io-console (~> 0.5)
+    rspec (3.13.1)
+      rspec-core (~> 3.13.0)
+      rspec-expectations (~> 3.13.0)
+      rspec-mocks (~> 3.13.0)
+    rspec-core (3.13.5)
+      rspec-support (~> 3.13.0)
+    rspec-expectations (3.13.5)
+      diff-lcs (>= 1.2.0, < 2.0)
+      rspec-support (~> 3.13.0)
+    rspec-mocks (3.13.5)
+      diff-lcs (>= 1.2.0, < 2.0)
+      rspec-support (~> 3.13.0)
+    rspec-support (3.13.4)
+    stringio (3.1.7)
+    zeitwerk (2.7.3)
+
+PLATFORMS
+  aarch64-linux-gnu
+  aarch64-linux-musl
+  arm-linux-gnu
+  arm-linux-musl
+  arm64-darwin
+  x86_64-darwin
+  x86_64-linux-gnu
+  x86_64-linux-musl
+
+DEPENDENCIES
+  debug
+  dry-struct
+  dry-types
+  nokogiri
+  rspec
+
+BUNDLED WITH
+   2.5.9
diff --git a/README.md b/README.md
@@ -21,7 +21,7 @@ Parse directly the HTML result page ([html file]) in this repository. No extra H
 [html file]: https://raw.githubusercontent.com/serpapi/code-challenge/master/files/van-gogh-paintings.html
 [expected array]: https://raw.githubusercontent.com/serpapi/code-challenge/master/files/expected-array.json
 
-Add also to your array the painting thumbnails present in the result page file (not the ones where extra requests are needed). 
+Add also to your array the painting thumbnails present in the result page file (not the ones where extra requests are needed).
 
 Test against 2 other similar result pages to make sure it works against different layouts. (Pages that contain the same kind of carrousel. Don't necessarily have to be paintings.)
 

diff --git a/SOLUTION.md b/SOLUTION.md
@@ -0,0 +1,94 @@
+This PR implements a solution that parses artworks from a Google search results page.
+
+# Cases covered
+I've considered 2 cases: the Artworks page (the large carousel from the example) and a default search page (small carousel).
+
+![](files/van-gogh-paintings.png)
+![](files/default-page-search.png)
+
+
+
+# Architecture
+We have 2 main classes:
+
+- `GoogleSearchPageCrawler` - Responsible for receiving a file path, loading the page HTML and formatting the expected result as JSON.
+- `GoogleSearchPageCrawler::Parser` - Knows how to parse the page DOM. Its `parse` method returns a `GoogleSearchPageCrawler::Parser::Result`: a data/value object that uses dry-struct. This makes our data structure more explicit and prevents the mistyped-key errors that can happen when we just use a hash.
+
+# Parsing logic
+
+I've implemented everything in a single class but - if needed in the future - one idea is to split the parsing logic into multiple "sub-classes" instead of methods.
+
+Example: `GoogleSearchPageCrawler::Parser::ListResult`, `GoogleSearchPageCrawler::Parser::Artworks`, etc.
+
+Each class could parse a specific part of the page. While not strictly necessary, this approach can reduce cognitive load as the amount of data to extract grows.
+It keeps the code more organized and cohesive, and makes it easier to locate and fix a broken parsing rule.
+
+I've tried to make the scraper less error prone by using non-obfuscated selectors such as `data-attrid="kc:/visual_art/visual_artist:works"` and by looking for text nodes, instead of classes or DOM hierarchy, to find the name/extensions.
+
+The `parse_small_carrousel_artwork` and `parse_big_carrousel_artwork` methods are intentionally kept separate, even though their logic is similar. Both parse the same concept (Result::Artwork), but from different DOM structures. Keeping them distinct ensures that each case retains its own logic and execution strategy.
+
+## Image parsing
+The readme highlights that we have to keep the image attribute for both cases:
+
+- the base64 encoded image
+- the image link (those that require a click on the "show more" button)
+
+When I ran a test against the `expected-array.json` file, I noticed that the `<img>` tag has a GIF data URI as its `src`. There are 2 cases:
+
+### img with id attribute
+```html
+<img class="taFZJe" alt="The Potato Eaters" id="_L_FkZ4qlAtyDwbkP49Pj0QU_79" src="data:image/gif;base64,R0lGODlhAQABAIAAAP///////yH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" data-deferred="1">
+```
+
+The same ID can be found inside a script tag, together with its base64-encoded image.
+
+```html
+<script nonce="xmO6un4J9murPFDygFfaMA">(function(){var s='data:image/webp;base64,UklGRjQMAABXRUJQVlA4ICgMAAAQRACdASrhAJsAPxGAt1QsKCU1KDV7MqAiCWcHDtAkSjkn/r/Xf+ydgBeraVdn/9Px1UGv5GYPyffrugwzIfw/Rc/+vnb/kP/hwO2JbSYKG1VQIN78tct7QVKKyA/XDj2TQ174tLSeF8ejv+SZJ2zx....';var ii=['_L_FkZ4qlAtyDwbkP49Pj0QU_79'];var r='';_setImagesSrc(ii,s,r);})();</script>
+```
+
+So, we have to find the script tag that references the same ID and extract the base64-encoded image from there.
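+
+A minimal sketch of this lookup, using only a regular expression over the raw HTML. The `deferred_images` helper, the `SCRIPT_IMAGE_PATTERN` constant and the shortened sample markup are hypothetical and only illustrate the idea; they are not the parser's actual code.
+
+```ruby
+# Hypothetical pattern: captures the data URI assigned to `s` and the id list
+# assigned to `ii` in the inline scripts that call _setImagesSrc.
+SCRIPT_IMAGE_PATTERN = /var s='(data:image\/[^']+)';\s*var ii=\[([^\]]*)\]/
+
+# Hypothetical helper: builds a Hash of element id => base64 data URI.
+def deferred_images(html)
+  html.scan(SCRIPT_IMAGE_PATTERN).each_with_object({}) do |(data_uri, ids), images|
+    # One script can list several ids that share the same image.
+    ids.scan(/'([^']+)'/).flatten.each { |id| images[id] = data_uri }
+  end
+end
+
+# Shortened, made-up sample that mirrors the structure shown above.
+html = <<~HTML
+  <img id="_abc_1" src="data:image/gif;base64,R0lGODlhAQABAIAAAP" data-deferred="1">
+  <script>(function(){var s='data:image/webp;base64,UklGRjQMAABXRUJQ';var ii=['_abc_1'];var r='';_setImagesSrc(ii,s,r);})();</script>
+HTML
+
+puts deferred_images(html).fetch("_abc_1")
+```
+
+The real page may escape characters inside the script string (for example `=` as `\x3d`), in which case the captured value would need unescaping before use.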
+
+### img with `data-src` attribute (additional request needed)
+```html
+<img class="taFZJe" alt="Self-Portrait with Bandaged Ear" data-src="https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcQ8juuefle5MyKZKBLRgPjsGSJon7vkt91SM7WTRuZOOyAyUI1v" src="data:image/gif;base64,R0lGODlhAQABAIAAAP///////yH5BAEKAAEALAAAAAABAAEAAAICTAEAOw=="/>
+```
+
+We just return the `data-src` value.
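+
+The two cases can be combined in one small decision, sketched below. The `thumbnail_for` helper and the `images_by_id` hash are hypothetical names, and checking `data-src` first is an assumption rather than a rule taken from the page.
+
+```ruby
+# Hypothetical helper: picks the thumbnail for an <img>, given its attributes
+# as a Hash and a Hash of base64 images keyed by element id.
+def thumbnail_for(img_attributes, images_by_id)
+  # A data-src URL means the image would be fetched later; return the link.
+  return img_attributes["data-src"] if img_attributes["data-src"]
+
+  # Otherwise the id points at a base64 image injected by an inline script.
+  # Returns nil when the id is missing or unknown.
+  images_by_id[img_attributes["id"]]
+end
+
+images_by_id = { "_abc_1" => "data:image/webp;base64,UklGRjQMAABXRUJQ" }
+
+puts thumbnail_for({ "id" => "_abc_1" }, images_by_id)
+puts thumbnail_for({ "data-src" => "https://example.com/thumb.jpg" }, images_by_id)
+p thumbnail_for({ "alt" => "No image" }, images_by_id)
+```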
+
+# Usage
+
+## Running tests
+`bundle exec rspec` to run the specs
+
+## Scraping a search page
+
+Execute
+`bundle exec ruby scrape_files.rb FILENAME.HTML` to use the `GoogleSearchPageCrawler` to crawl the page, parse the artworks and save the result inside the `files` folder.
+
+The script looks for the file in the `files` folder and defaults to `van-gogh-paintings.html`.
+
+# Notes
+## Interesting case - Following artwork link
+I found another case: when we click on an artwork, the list appears as a horizontal list at the top of the page.
+
+https://www.google.com/search?sca_esv=e77f9c08ad4d25ad&sxsrf=AE3TifPtmZe03gj5RYheiEGzNrtk6qieag:1755204957217&q=Bullfinch+and+weeping+cherry+blossoms&stick=H4sIAAAAAAAAAONgFuLQz9U3SCpPM1Hi1U_XNzRMNi5JrzKozNFSyk620i_LLC5NzIlPLCpBYmYWl1iV5xdlFy9iVXUqzclJy8xLzlBIzEtRKE9NLcjMS1dIzkgtKqpUSMrJLy7Ozy0GAHIW_uhnAAAA&sa=X&ved=2ahUKEwj_qO7_l4uPAxWrIrkGHbNdIp4QgOQBegQIMhAS
+
+I didn't cover this case because I noticed that it happens only when we follow an artwork link: the page loads with the selected artwork highlighted, and its link contains an empty `href="#"`.
+
+If we ever need to cover this case we can use the
+`div[data-attrid="kc:/visual_art/visual_artist:works"] [role="group"] a` selector and maybe change the URL logic to have a "current page" information in order to return the correct link for the selected artwork.
+
+## RAW HTML analysis
+
+### Artwork specific page (Example from README)
+The raw HTML (from 'view source code') lists all the artworks.
+
+### Normal search page ('small carousel')
+The raw HTML (from 'view source code') lists only 6 artworks. The others seem to be embedded in JavaScript.
+
+Testing in the playground (https://serpapi.com/playground?q=monet&location=Austin%2C+Texas%2C+United+States&gl=us&hl=en), it seems that SerpAPI does cover this case.
+
+I didn't try to parse them because I believe this is outside the scope of this exercise, but I see other (and probably better) options:
+
+- Instead of manually parsing scripts, we could consider using a real browser to evaluate the HTML before parsing the data. This is less performant but - if other parts of the page also require this method - it can be an alternative.
+- We can follow the "Artworks" link and scrape everything from there using the "Artwork specific page" implementation (I believe that this is the way to go...)
diff --git a/files/default-page-search.png b/files/default-page-search.png