Implements parser for artwork list in google results page #353
Closed
This PR implements a solution to parse artworks from Google's search results page.
Cases covered
I've considered 2 cases: the Artworks page (large carousel from the example) and a default search page (small carousel).
Architecture
We have the following classes:
GoogleSearchPageCrawler - Responsible for receiving a file path, fetching the page HTML and formatting the expected result as JSON.
GoogleSearchPageCrawler::Parser - Knows how to parse the page DOM. Its parse method returns a GoogleSearchPageCrawler::Parser::Result: a data/value object that uses dry-struct. This makes our data structure more explicit and prevents the mistyping errors that happen when we just use a hash.
Parsing logic
I've tried to make the scraper less error-prone by using non-obfuscated selectors such as [data-attrid="kc:/visual_art/visual_artist:works"] and by looking for text nodes, instead of classes or DOM hierarchy, to find the name/extensions.
The parse_small_carrousel_artwork and parse_big_carrousel_artwork methods are intentionally kept separate, even though their logic is similar. Both parse the same concept (Result::Artwork), but from different DOM structures. Keeping them distinct ensures that each case retains its own logic and execution strategy.
For now, we're using methods but, if needed in the future, one idea is to split the parsing logic into multiple "sub-classes" instead of methods.
Example: GoogleSearchPageCrawler::Parser::ListResult, GoogleSearchPageCrawler::Parser::Artworks, etc.
Each class could parse a specific part of the page. While not strictly necessary, this approach can reduce cognitive load as the amount of desired data grows. It keeps the code more organized and cohesive, and makes it easier to locate and fix a broken parsing rule.
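To make the "sub-classes" idea more concrete, here is a minimal sketch of how it might look. Everything in it is hypothetical: the CrawlerSketch module, the SECTIONS table and the sample markup (data-section attributes, h3 titles) are placeholders for illustration, not the code in this PR or Google's real DOM, and plain regular expressions stand in for a real HTML parser so the snippet has no dependencies.

```ruby
# Hypothetical sketch: module name, SECTIONS table and sample markup are
# placeholders, not the PR's actual code or Google's real DOM.
module CrawlerSketch
  class Parser
    # Parses only the artwork carousel.
    class Artworks
      def initialize(html)
        @html = html
      end

      def parse
        @html.scan(/<a data-section="artwork">([^<]+)<\/a>/).flatten
      end
    end

    # Parses only the regular result list.
    class ListResult
      def initialize(html)
        @html = html
      end

      def parse
        @html.scan(/<h3>([^<]+)<\/h3>/).flatten
      end
    end

    # One entry per page section; adding a section means adding one class.
    SECTIONS = { artworks: Artworks, list_results: ListResult }.freeze

    def initialize(html)
      @html = html
    end

    # Each section is parsed independently, so a broken rule is easy to locate.
    def parse
      SECTIONS.transform_values { |section| section.new(@html).parse }
    end
  end
end

html = '<a data-section="artwork">The Starry Night</a><h3>Vincent van Gogh</h3>'
CrawlerSketch::Parser.new(html).parse
# => {artworks: ["The Starry Night"], list_results: ["Vincent van Gogh"]}
```

With this layout, a failing artworks spec would point directly at the Artworks class, without touching the other sections.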
Image parsing
The readme highlights that we have to keep the image attribute for both cases:
When I ran a test against the expected-array.json file, I noticed that the <img> tag has a GIF as its src. We have 2 cases:
img with id attribute
The same ID can be found inside a script tag together with the base64-encoded image. So, we have to find the script tag with the same ID and extract the base64-encoded image from there.
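The lookup described here could be sketched roughly as follows. This is only an illustration: inline_image_for is a hypothetical helper, and the script layout in the sample (a data URI string next to the quoted image id) is an assumed, simplified shape rather than a guaranteed format. It uses regular expressions on the raw HTML so the snippet runs without any gems; the real parser may well go through the DOM instead.

```ruby
# Hypothetical helper: find the inline script that mentions the <img> id and
# pull the data URI out of it. The script layout in the sample below is an
# assumed, simplified shape, not a guaranteed format.
DATA_URI = %r{data:image/[a-z]+;base64,[A-Za-z0-9+/=]+}

def inline_image_for(html, img_id)
  # Collect the body of every <script> tag.
  scripts = html.scan(%r{<script[^>]*>(.*?)</script>}m).flatten
  # Keep the first script that references the quoted image id.
  script = scripts.find { |body| body.include?("'#{img_id}'") }
  # Return the data URI found in that script, or nil when there is none.
  script && script[DATA_URI]
end

html = <<~HTML
  <img id="dimg_1" src="data:image/gif;base64,R0lGODlhAQABAIAAAP">
  <script>var s='data:image/jpeg;base64,/9j/4AAQSkZJRg==';var ii=['dimg_1'];</script>
  <script>var s='data:image/jpeg;base64,AAAA';var ii=['dimg_2'];</script>
HTML

inline_image_for(html, "dimg_1")  # => "data:image/jpeg;base64,/9j/4AAQSkZJRg=="
inline_image_for(html, "missing") # => nil
```

Matching on the quoted id keeps the placeholder GIF in the img tag itself out of the result, since only script bodies are searched.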
img with data-src attribute
An additional request is needed to load the image. We just use the data-src value and return it.
Usage
Running tests
Run bundle exec rspec to run the specs.
Scraping a search page
Execute bundle exec ruby scrape_files.rb FILENAME.HTML to use the GoogleSearchPageCrawler to crawl the page, parse the artworks and save the result inside the files folder.
It searches for the file in the files folder. FILENAME.HTML defaults to van-gogh-paintings.html.
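As a rough illustration of the lookup rule above (files folder plus a default file name), the argument handling might look something like this. The input_path helper and the constant names are hypothetical and monet.html is just an example file name; the real scrape_files.rb may be organized differently.

```ruby
# Hypothetical sketch of the input lookup; the real scrape_files.rb may differ.
DEFAULT_FILE = "van-gogh-paintings.html"
FILES_DIR = "files"

# Resolves the first CLI argument (or the default) inside the files folder.
def input_path(argv)
  File.join(FILES_DIR, argv.fetch(0, DEFAULT_FILE))
end

input_path([])             # => "files/van-gogh-paintings.html"
input_path(["monet.html"]) # => "files/monet.html"
```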
Notes
Interesting case - Following artwork link
There's another case that I found: when we click on an artwork and the list appears as a horizontal top list.
https://www.google.com/search?sca_esv=e77f9c08ad4d25ad&sxsrf=AE3TifPtmZe03gj5RYheiEGzNrtk6qieag:1755204957217&q=Bullfinch+and+weeping+cherry+blossoms&stick=H4sIAAAAAAAAAONgFuLQz9U3SCpPM1Hi1U_XNzRMNi5JrzKozNFSyk620i_LLC5NzIlPLCpBYmYWl1iV5xdlFy9iVXUqzclJy8xLzlBIzEtRKE9NLcjMS1dIzkgtKqpUSMrJLy7Ozy0GAHIW_uhnAAAA&sa=X&ved=2ahUKEwj_qO7_l4uPAxWrIrkGHbNdIp4QgOQBegQIMhAS
I didn't cover this case because I've noticed that it happens only when we follow an artwork link: the page loads with the artwork highlighted containing an empty href="#".
If we ever need to cover this case, we can use the div[data-attrid="kc:/visual_art/visual_artist:works"] [role="group"] a selector and maybe change the URL logic to carry "current page" information in order to return the correct link for the selected artwork.
RAW HTML analysis
Artwork specific page (Example from README)
The raw HTML (from 'view source code') lists all the artworks
Normal search page ('small carousel')
The raw HTML (from 'view source code') lists only 6 artworks. The other ones seem to be inside JavaScript.
Testing in the playground: https://serpapi.com/playground?q=monet&location=Austin%2C+Texas%2C+United+States&gl=us&hl=en it seems that SerpAPI does consider this case.
I didn't try to parse them because I believe this is outside the scope of this exercise, but I see other options: