Create google search parser #349

Closed
wants to merge 2 commits into from
1 change: 1 addition & 0 deletions .ruby-version
@@ -0,0 +1 @@
3.3.5
10 changes: 10 additions & 0 deletions Gemfile
@@ -0,0 +1,10 @@
source 'https://rubygems.org'

ruby '3.3.5'

gem 'nokogiri', '~> 1.15'
gem 'rspec', '~> 3.12'

group :development do
  gem 'debug', '>= 1.0.0'
end
77 changes: 77 additions & 0 deletions Gemfile.lock
@@ -0,0 +1,77 @@
GEM
  remote: https://rubygems.org/
  specs:
    date (3.4.1)
    debug (1.11.0)
      irb (~> 1.10)
      reline (>= 0.3.8)
    diff-lcs (1.6.2)
    erb (5.0.2)
    io-console (0.8.1)
    irb (1.15.2)
      pp (>= 0.6.0)
      rdoc (>= 4.0.0)
      reline (>= 0.4.2)
    nokogiri (1.18.9-aarch64-linux-gnu)
      racc (~> 1.4)
    nokogiri (1.18.9-aarch64-linux-musl)
      racc (~> 1.4)
    nokogiri (1.18.9-arm-linux-gnu)
      racc (~> 1.4)
    nokogiri (1.18.9-arm-linux-musl)
      racc (~> 1.4)
    nokogiri (1.18.9-arm64-darwin)
      racc (~> 1.4)
    nokogiri (1.18.9-x86_64-darwin)
      racc (~> 1.4)
    nokogiri (1.18.9-x86_64-linux-gnu)
      racc (~> 1.4)
    nokogiri (1.18.9-x86_64-linux-musl)
      racc (~> 1.4)
    pp (0.6.2)
      prettyprint
    prettyprint (0.2.0)
    psych (5.2.6)
      date
      stringio
    racc (1.8.1)
    rdoc (6.14.2)
      erb
      psych (>= 4.0.0)
    reline (0.6.2)
      io-console (~> 0.5)
    rspec (3.13.1)
      rspec-core (~> 3.13.0)
      rspec-expectations (~> 3.13.0)
      rspec-mocks (~> 3.13.0)
    rspec-core (3.13.5)
      rspec-support (~> 3.13.0)
    rspec-expectations (3.13.5)
      diff-lcs (>= 1.2.0, < 2.0)
      rspec-support (~> 3.13.0)
    rspec-mocks (3.13.5)
      diff-lcs (>= 1.2.0, < 2.0)
      rspec-support (~> 3.13.0)
    rspec-support (3.13.4)
    stringio (3.1.7)

PLATFORMS
  aarch64-linux-gnu
  aarch64-linux-musl
  arm-linux-gnu
  arm-linux-musl
  arm64-darwin
  x86_64-darwin
  x86_64-linux-gnu
  x86_64-linux-musl

DEPENDENCIES
  debug (>= 1.0.0)
  nokogiri (~> 1.15)
  rspec (~> 3.12)

RUBY VERSION
   ruby 3.3.5p100

BUNDLED WITH
   2.7.1
17 changes: 17 additions & 0 deletions README.md
@@ -26,3 +26,20 @@ Add also to your array the painting thumbnails present in the result page file (
Test against 2 other similar result pages to make sure it works against different layouts. (Pages that contain the same kind of carousel; they don't necessarily have to be paintings.)

The suggested time for this challenge is 4 hours. But, you can take your time and work more on it if you want.

## Solution

This solution adds a `GoogleSearchParser` service, with specs that validate its expected behavior. It is built with Ruby 3.3.5, using Nokogiri for HTML parsing and RSpec for testing.
Three additional test pages (`pablo_picasso`, `claude_monet`, and `leonardo_da_vinci`) are included to check that the parser works across different layouts.

### Setup

```
bundle install
```

### Run tests

```
bundle exec rspec
```
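
The fixtures in this change follow a `files/<name>/search_result.html` plus `files/<name>/expected_result.json` layout, which lends itself to a data-driven check. The following is a minimal sketch of how such pairs could be discovered and loaded with the standard library only. The helper names `fixture_pairs` and `each_fixture` are hypothetical and are not taken from the specs in this change.

```ruby
require 'json'

# Hypothetical helper: lists every fixture directory under `root` that
# contains both a search_result.html and an expected_result.json.
def fixture_pairs(root = 'files')
  Dir.glob(File.join(root, '*', 'search_result.html')).sort.filter_map do |html_path|
    json_path = File.join(File.dirname(html_path), 'expected_result.json')
    [html_path, json_path] if File.exist?(json_path)
  end
end

# Yields each fixture's raw HTML together with its parsed expected JSON,
# so a caller can feed the HTML to a parser and compare the results.
def each_fixture(root = 'files')
  fixture_pairs(root).each do |html_path, json_path|
    yield File.read(html_path), JSON.parse(File.read(json_path))
  end
end
```

A spec could call `each_fixture` and compare the parser's output for each HTML string with the expected data. Directories that lack an `expected_result.json` are skipped.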
380 changes: 380 additions & 0 deletions files/claude_monet/expected_result.json

Large diffs are not rendered by default.

34 changes: 34 additions & 0 deletions files/claude_monet/search_result.html

Large diffs are not rendered by default.

12 changes: 12 additions & 0 deletions files/empty_page.html
@@ -0,0 +1,12 @@
<!DOCTYPE html>
<html>
<head>
<title>Empty Page</title>
</head>
<body>
<div class="empty-page">
<h1>No Artworks Found</h1>
<p>This page contains no artwork information.</p>
</div>
</body>
</html>
332 changes: 332 additions & 0 deletions files/leonardo_da_vinci/expected_result.json

Large diffs are not rendered by default.

36 changes: 36 additions & 0 deletions files/leonardo_da_vinci/search_result.html

Large diffs are not rendered by default.

34 changes: 34 additions & 0 deletions files/malformed.html
@@ -0,0 +1,34 @@
<!DOCTYPE html>
<html>
<head>
<title>Malformed HTML</title>
</head>
<body>
<div class="search-results">
<div class="knowledge-panel">
<div class="kno-fv">
<div class="kno-fv__tab-content" data-attrid="artworks">
<div class="kno-fv__tab-panel">
<h3>Artworks</h3>
<div class="artworks-grid">
<div class="artwork-item">
<img src="https://example.com/starry-night.jpg" alt="The Starry Night" class="kno-fv__img">
<div class="artwork-info">
<span class="title">The Starry Night
<span class="year">1889
</div>
</div>
<div class="artwork-item">
<img src="https://example.com/self-portrait.jpg" alt="Van Gogh self-portrait" class="kno-fv__img">
<div class="artwork-info">
<span class="title">Van Gogh self-portrait</span>
<span class="year">1889</span>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</body>
</html>
33 changes: 33 additions & 0 deletions files/malformed_with_valid_data.html
@@ -0,0 +1,33 @@
<!DOCTYPE html>
<html>
<head>
<title>Malformed HTML</title>
</head>
<body>
<div class="search-results">
<div class="knowledge-panel">
<div class="kno-fv">
<h3>Artworks</h3>
<div class="artworks-grid">
<a href="/search?sca_esv=c2e426814f4d07e9&amp;gl=us&amp;hl=en&amp;q=Sunflowers&amp;stick=H4sIAAAAAAAAAONgFuLQz9U3MI_PNVLiArFMUszTjcu1lLKTrfTLMotLE3PiE4tKkJiZxSVW5flF2cWLWLmCS_PScvLLU4uKARitY11JAAAA&amp;sa=X&amp;ved=2ahUKEwjK-K-JwLWKAxXcQTABHePpOFoQtq8DegQIMxAX">
<img class="taFZJe" alt="Sunflowers" data-src="https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcS1s9TeQSMp52s4RilDMm5lMGHK26HjE3T6D-88O1l6Xf3pDCvv" src="data:image/gif;base64,R0lGODlhAQABAIAAAP///////yH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" data-csiid="L_FkZ4qlAtyDwbkP49Pj0QU_13" data-lzy_="1">
</a>

<div class="iELo6" style="display:none;width:0px;top:0px;left:0px" jsdata="JI96Wc;unsupported;BCVKGw">
<a href="/search?sca_esv=c2e426814f4d07e9&amp;gl=us&amp;hl=en&amp;q=Wheat+Field+with+Cypresses&amp;stick=H4sIAAAAAAAAAONgFuLQz9U3MI_PNVLiArFMSnJMTeK1lLKTrfTLMotLE3PiE4tKkJiZxSVW5flF2cWLWKXCM1ITSxTcMlNzUhTKM0syFJwrC4pSi4tTiwEkoHZyWQAAAA&amp;sa=X&amp;ved=2ahUKEwjK-K-JwLWKAxXcQTABHePpOFoQtq8DegQIMxAv">
<img class="taFZJe" alt="Wheat Field with Cypresses" data-src="https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcQ-YYo-yqMf-K5i2GTIoT8OmNzoTdfxd55p4TbIcmtxLbyYvKzO" src="data:image/gif;base64,R0lGODlhAQABAIAAAP///////yH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" data-csiid="L_FkZ4qlAtyDwbkP49Pj0QU_25" data-lzy_="1">
<div class="KHK6lb">
<div class="pgNMRc"> </div>
<div class="cxzHyb">1889</div>
</div>
</a>
</div>

</div>
</div>
</div>
</div>
</div>
</div>
</body>
</html>
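
Review note on `files/malformed_with_valid_data.html`: the first link contains only an `<img>`, with no title or year container. In the second link, `div.pgNMRc` (presumably the title slot) is blank while `div.cxzHyb` holds `1889`. In both, the real thumbnail sits in `data-src` and `src` is an inline GIF placeholder. The sketch below shows one way such fallbacks could be expressed, using the standard library only and working on already-extracted attribute values rather than on Nokogiri nodes. The helper names and the fallback order are assumptions for illustration, not the behavior of the `GoogleSearchParser` in this change.

```ruby
require 'uri'

# Hypothetical helpers; plain strings and hashes stand in for parsed nodes.

# Returns the first non-blank candidate: the visible title text, then the
# image's alt text, then the q= parameter of the result link.
def artwork_name(title_text:, img_alt:, href:)
  candidates = [title_text, img_alt, query_param(href, 'q')]
  candidates.map { |c| c.to_s.strip }.find { |c| !c.empty? }
end

# Extracts a single query parameter from a relative or absolute URL whose
# HTML entities (such as &amp;) have already been decoded.
def query_param(href, key)
  query = href.to_s.split('?', 2)[1]
  return nil if query.nil?

  URI.decode_www_form(query).assoc(key)&.last
end

# Prefers the lazy-load attribute and ignores inline data: URI placeholders.
def thumbnail_url(attrs)
  [attrs['data-src'], attrs['src']].compact.find { |u| !u.start_with?('data:') }
end
```

With a blank title slot, `artwork_name` falls back to the `alt` text, and to the link's `q` value if that is missing too. `thumbnail_url` returns `nil` when only a placeholder is available.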
352 changes: 352 additions & 0 deletions files/pablo_picasso/expected_result.json

Large diffs are not rendered by default.
