Skip to content

Create google search parser #349

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 2 commits into from

Conversation

pmonfort
Copy link

@pmonfort pmonfort commented Jul 29, 2025

Google Search Parser - Challenge Solution

Solution

Created a GoogleSearchParser service with specs to validate the expected behavior.

The GoogleSearchParser gathers image information as long as the basic image structure exists on the page, regardless of the rest of the HTML structure. You can see this demonstrated in the test case: when HTML has malformed structure but with valid data.

Supported Image Structures

Minimal structure:

<a href="XXXXXX">
  <img class="taFZJe" data-src="XXXXXXXXXX" alt="XXXXXXX">
</a>

Maximum structure:

<a href="XXXXXX">
  <img class="taFZJe" data-src="XXXXXXXXXX" alt="XXXXXXX">
  <div class="KHK6lb">
    <div class="pgNMRc">XXXXXXX</div>
    <div class="cxzHyb">1889</div>
  </div>
</a>

Image Source Handling

I noticed that Google populates the carousel image sources dynamically using JavaScript. To address this, I implemented a function extract_image_base64_by_id() that extracts the source directly from the script tags.

Name Extraction Logic

For artwork names, the parser follows a fallback strategy:

  1. First, it attempts to extract the name from a div with class pgNMRc
  2. If that element doesn't exist or is empty, it falls back to using the image's alt attribute

This approach ensures robust name extraction even when the HTML structure varies or is incomplete.

Technical Implementation

  • Language: Ruby 3.3.5
  • Parser: Nokogiri for HTML parsing
  • Testing: RSpec for comprehensive test coverage
  • Test Pages: Added pablo_picasso, claude_monet, and leonardo_da_vinci for cross-validation

Setup

bundle install

Usage

Initialize GoogleSearchParser with the HTML. The parse_images method returns the JSON result.

parser = GoogleSearchParser.new(html)
parser.parse_images

Testing

The solution includes tests for various scenarios including malformed HTML structures to ensure robust parsing capabilities.

bundle exec rspec

@andypple83 andypple83 closed this Aug 22, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants