Skip to content

Commit 4c8724c

Browse files
committed
Adds more parsing example files
1 parent f7ed4d9 commit 4c8724c

File tree

8 files changed

+233
-16
lines changed

8 files changed

+233
-16
lines changed

SOLUTION.md

Lines changed: 13 additions & 7 deletions
Original file line numberDiff line numberDiff line change
@@ -1,7 +1,7 @@
11
This PR implements a solution to parse artworks from google's search result.
22

33
# Cases covered
4-
I've considered 2 cases: Artworks page (large carrousel from the example) and a default search page (small carrousel) and the
4+
I've considered 2 cases: Artworks page (large carrousel from the example) and a default search page (small carrousel).
55

66
![](files/van-gogh-paintings.png)
77
![](files/default-page-search.png)
@@ -20,17 +20,20 @@ I've implemented everything in a single class but - if needed in the future - on
2020

2121
Example: `GoogleSearchPageCrawler::Parser::ListResult`, `GoogleSearchPageCrawler::Parser::Artworks`, etc.
2222

23-
Each class could parse a specific part of the page. It's not strictly necessary but may help to lower the cognitive load if it gets too big, keeping the code more organized and cohesive by facilitate to know where to look for fixing a specific broken parsing rule.
23+
Each class could parse a specific part of the page. While not strictly necessary, this approach can reduce cognitive load as the number of desired data grows.
24+
It keeps the code more organized, cohesive and makes it easier to locate and fix a broken parsing rule.
2425

2526
I've tried to make the scraper more error prone by using a non obfuscated selectors such as `data-attrid="kc:/visual_art/visual_artist:works` and looking for text nodes instead of classes or dom hierarchy to search for the name/extensions.
2627

28+
The `parse_small_carrousel_artwork` and `parse_big_carrousel_artwork` methods are intentionally kept separate, even though their logic is similar. Both parse the same concept (Result::Artwork), but from different DOM structures. Keeping them distinct ensures that each case retains its own logic and execution strategy.
29+
2730
## Image parsing
2831
The readme highlights that we have to keep the image attribute for both cases:
2932

3033
- the base64 encoded image
3134
- the image link (those that require a click on the "show more" button)
3235

33-
When I've executed a test against the `expected-array.json` file, I've noticed that the `<img>` tag with a gif and not the correct src.
36+
When I've executed a test against the `expected-array.json` file, I've noticed that the `<img>` tag has a gif as SRC. And we have 2 cases:
3437

3538
### img with id attribute
3639
```html
@@ -55,12 +58,12 @@ We just use the `data-src` and return it.
5558
# Usage
5659

5760
## Running tests
58-
`bundle exec rspec` to run feature specs (uses fixtures) or more unit tests from the `lib` folder.
61+
`bundle exec rspec` to run the specs
5962

6063
## Scraping a search page
6164

6265
Execute
63-
`bundle exec ruby scrape_files.rb FILENAME.HTML` to use the `GoogleSearchPageCrawler` to crawl the page and parse the artworks.
66+
`bundle exec ruby scrape_files.rb FILENAME.HTML` to use the `GoogleSearchPageCrawler` to crawl the page, parse the artworks and save the result inside the `files` folder.
6467

6568
It searches for the file in the `files` folder. Defaults to `van-gogh-paintings.html`
6669

@@ -81,8 +84,11 @@ If we ever need to cover this case we can use the
8184
The raw HTML (from 'view source code') lists all the artworks
8285

8386
### Normal search page ('small carousell')
84-
The raw HTMl (from 'view source code') lists only 6 artworks. The other ones seems to be inside a javascript.
87+
The raw HTML (from 'view source code') lists only 6 artworks. The other ones seems to be inside a javascript.
8588

8689
Testing in the playground: https://serpapi.com/playground?q=monet&location=Austin%2C+Texas%2C+United+States&gl=us&hl=en it seems that SerpAPI does consider this case.
8790

88-
I didn't try to parse them because I believe that this is outside of scope of this exercise. Instead of manually parsing the script (like we did with the image) we could consider using a real browser to evaluate the HTML before parsing the data. This is less performant but, depending on how possible is to manually parse this case, can be an option.
91+
I didn't try to parse them because I believe that this is outside of scope of this exercise but I see other (and probably better) options:
92+
93+
- Instead of manually parsing scripts, we could consider using a real browser to evaluate the HTML before parsing the data. This is less performant but - if other parts of the page also requires this method - can be an alternative.
94+
- We can follow the "Artworks" link and scrape everything from there using the "Artwork specific page" implementation (I believe that this is the way to go...)

files/rommero-brito-artworks-results.json

Lines changed: 93 additions & 0 deletions
Large diffs are not rendered by default.

files/rommero-brito-artworks.html

Lines changed: 36 additions & 0 deletions
Large diffs are not rendered by default.

files/rommero-brito-search-results.json

Lines changed: 46 additions & 0 deletions
Large diffs are not rendered by default.

files/rommero-brito-search.html

Lines changed: 39 additions & 0 deletions
Large diffs are not rendered by default.

lib/google_search_page_crawler.rb

Lines changed: 0 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -2,8 +2,6 @@
22
require_relative 'google_search_page_crawler/parser'
33

44
class GoogleSearchPageCrawler
5-
attr_reader :agent
6-
75
def crawl(file_path)
86
html = open(file_path).read
97
parser = GoogleSearchPageCrawler::Parser.new(html)

lib/google_search_page_crawler/parser.rb

Lines changed: 5 additions & 6 deletions
Original file line numberDiff line numberDiff line change
@@ -9,13 +9,13 @@ module Types
99
class GoogleSearchPageCrawler
1010
class Result < Dry::Struct
1111
class Artwork < Dry::Struct
12-
attribute :name, Types::String.default("")
13-
attribute :extensions, Types::Array.of(Types::String).default([])
14-
attribute :link, Types::String.default("")
15-
attribute :image, Types::String.default("")
12+
attribute :name, Types::String.default("".freeze)
13+
attribute :extensions, Types::Array.of(Types::String).default([].freeze)
14+
attribute :link, Types::String.default("".freeze)
15+
attribute :image, Types::String.default("".freeze)
1616
end
1717

18-
attribute :artworks, Types::Array.of(Result::Artwork).default([])
18+
attribute :artworks, Types::Array.of(Result::Artwork).default([].freeze)
1919
end
2020

2121
class Parser
@@ -52,7 +52,6 @@ def parse_small_carrousel_artwork(artwork_node)
5252
image_node.attr("src")
5353
end
5454

55-
5655
Result::Artwork.new({
5756
name: text_nodes.first,
5857
extensions: text_nodes.drop(1),

scrape_files.rb

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -15,6 +15,6 @@ def write_to_file(name, content)
1515

1616
results = crawler.crawl(file_path(file_name))
1717

18-
puts results.to_json
18+
# puts results.to_json
1919

2020
write_to_file(file_name.to_s.gsub(".html", "-results.json"), results)

0 commit comments

Comments
 (0)