SOLUTION.md: 13 additions & 7 deletions
@@ -1,7 +1,7 @@
 This PR implements a solution to parse artworks from Google's search results.

 # Cases covered
-I've considered 2 cases: Artworks page (large carrousel from the example) and a default search page (small carrousel) and the
+I've considered 2 cases: the Artworks page (large carousel from the example) and a default search page (small carousel).

@@ -20,17 +20,20 @@ I've implemented everything in a single class but - if needed in the future - on

 Example: `GoogleSearchPageCrawler::Parser::ListResult`, `GoogleSearchPageCrawler::Parser::Artworks`, etc.

-Each class could parse a specific part of the page. It's not strictly necessary but may help to lower the cognitive load if it gets too big, keeping the code more organized and cohesive by facilitate to know where to look for fixing a specific broken parsing rule.
+Each class could parse a specific part of the page. While not strictly necessary, this approach can reduce cognitive load as the amount of data to parse grows.
+It keeps the code more organized and cohesive, and makes it easier to locate and fix a broken parsing rule.

 I've tried to make the scraper less error-prone by using non-obfuscated selectors such as `data-attrid="kc:/visual_art/visual_artist:works"` and by looking for text nodes, instead of classes or DOM hierarchy, to find the name/extensions.
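
As a rough illustration of the selector strategy described above, the sketch below anchors on the `data-attrid` attribute and reads text nodes instead of relying on class names. The markup is a hypothetical, well-formed stand-in (real result pages are not valid XML), and Ruby's bundled REXML is used only so the example runs without extra gems. It is an assumption for illustration, not the parser used in this PR.

```ruby
require "rexml/document"

# Hypothetical, simplified stand-in for the carousel markup.
# The class names are deliberately meaningless, as obfuscated classes would be.
SAMPLE = <<~XML
  <div>
    <div class="xYz12" data-attrid="kc:/visual_art/visual_artist:works">
      <a href="/search?q=The+Starry+Night">
        <div class="aB3"><span>The Starry Night</span></div>
        <div class="cD4">1889</div>
      </a>
      <a href="/search?q=Sunflowers">
        <div class="aB3"><span>Sunflowers</span></div>
      </a>
    </div>
  </div>
XML

doc = REXML::Document.new(SAMPLE)

# Anchor on the semantic data-attrid attribute, not on generated class names.
carousel = REXML::XPath.first(doc, "//*[@data-attrid='kc:/visual_art/visual_artist:works']")

artworks = carousel.get_elements(".//a").map do |link|
  # Collect the non-blank text nodes of all descendants in document order,
  # regardless of which wrapper elements or classes surround them.
  texts = []
  link.each_recursive do |element|
    texts.concat(element.texts.map { |t| t.value.strip }.reject(&:empty?))
  end

  # First text is treated as the name, anything after it as extensions.
  name, *extensions = texts
  { name: name, extensions: extensions, link: link.attributes["href"] }
end
```

Because the lookup never mentions `xYz12`, `aB3` or `cD4`, renaming those classes in the sample leaves the result unchanged.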

+The `parse_small_carrousel_artwork` and `parse_big_carrousel_artwork` methods are intentionally kept separate, even though their logic is similar. Both parse the same concept (`Result::Artwork`), but from different DOM structures. Keeping them distinct ensures that each case retains its own logic and execution strategy.
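
A minimal sketch of that idea, under stated assumptions: the two DOM shapes, the `Artwork` struct fields and the method bodies below are hypothetical and only reuse the method names mentioned above. REXML stands in for whatever HTML parser the project uses.

```ruby
require "rexml/document"

# Stand-in for the Result::Artwork concept (the fields are assumptions).
Artwork = Struct.new(:name, :extensions, :link, keyword_init: true)

# Hypothetical small-carousel item: the name sits in the link's title
# attribute and the year in a nested div.
def parse_small_carrousel_artwork(node)
  year = node.elements["div"]&.text
  Artwork.new(name: node.attributes["title"],
              extensions: [year].compact,
              link: node.attributes["href"])
end

# Hypothetical big-carousel item: name and year are sibling divs in the link.
def parse_big_carrousel_artwork(node)
  name, *extensions = node.get_elements("div").map(&:text)
  Artwork.new(name: name, extensions: extensions, link: node.attributes["href"])
end

small = REXML::Document.new('<a href="/s?q=Irises" title="Irises"><div>1889</div></a>').root
big   = REXML::Document.new('<a href="/s?q=Irises"><div>Irises</div><div>1889</div></a>').root

small_result = parse_small_carrousel_artwork(small)
big_result   = parse_big_carrousel_artwork(big)
```

Both calls produce an equal `Artwork`, yet each method can change independently if only one of the two layouts changes.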
+
 ## Image parsing
 The readme highlights that we have to keep the image attribute for both cases:

 - the base64 encoded image
 - the image link (those that require a click on the "show more" button)
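
The two image cases listed above could be handled along these lines. This is a sketch under assumptions that this diff does not confirm: an inline image is assumed to be delivered by a script that pairs a base64 data URI with the `id` of its `<img>` element, and a linked image is assumed to expose its URL in `data-src`. The script shape, the helper names and the sample values are all hypothetical.

```ruby
# Hypothetical inline script: a base64 payload followed by the id of the
# <img> element it belongs to. "=" is assumed to be escaped as \x3d.
SCRIPT = <<~'JS'
  (function(){var s='data:image/jpeg;base64,/9j/4AAQSkZJRg\x3d\x3d';var ii=['img_1'];_setImagesSrc(ii,s);})();
JS

# Build an id => data URI map from the script text.
def inline_images(script)
  script.scan(/var s='(data:image[^']+)';var ii=\['([^']+)'\]/).to_h do |data_uri, id|
    # Undo JavaScript \xNN escapes such as \x3d.
    [id, data_uri.gsub(/\\x(\h{2})/) { $1.hex.chr }]
  end
end

# img is a plain Hash of attributes here, standing in for a parsed <img> node.
# Prefer the inline payload matched by id, otherwise fall back to data-src.
def image_for(img, inline)
  inline[img["id"]] || img["data-src"]
end

inline = inline_images(SCRIPT)

inline_src = image_for({ "id" => "img_1", "src" => "placeholder.gif" }, inline)
# => "data:image/jpeg;base64,/9j/4AAQSkZJRg=="

linked_src = image_for({ "data-src" => "https://example.com/full.jpg", "src" => "placeholder.gif" }, inline)
# => "https://example.com/full.jpg"
```

In both cases the `src` attribute itself is ignored.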

-When I've executed a test against the `expected-array.json` file, I've noticed that the `<img>` tag with a gif and not the correct src.
+When I ran a test against the `expected-array.json` file, I noticed that the `<img>` tag has a gif as its `src` instead of the actual image. There are 2 cases:

 ### img with id attribute
```html
@@ -55,12 +58,12 @@ We just use the `data-src` and return it.
 # Usage

 ## Running tests
-`bundle exec rspec` to run feature specs (uses fixtures) or more unit tests from the `lib` folder.
+`bundle exec rspec` to run the specs.

 ## Scraping a search page

 Execute
-`bundle exec ruby scrape_files.rb FILENAME.HTML` to use the `GoogleSearchPageCrawler` to crawl the page and parse the artworks.
+`bundle exec ruby scrape_files.rb FILENAME.HTML` to use the `GoogleSearchPageCrawler` to crawl the page, parse the artworks, and save the result inside the `files` folder.

 It searches for the file in the `files` folder and defaults to `van-gogh-paintings.html`.

@@ -81,8 +84,11 @@ If we ever need to cover this case we can use the
 The raw HTML (from 'view source code') lists all the artworks.

 ### Normal search page ('small carousel')
-The raw HTMl (from 'view source code') lists only 6 artworks. The other ones seems to be inside a javascript.
+The raw HTML (from 'view source code') lists only 6 artworks. The others seem to be embedded in JavaScript.

 Testing in the playground (https://serpapi.com/playground?q=monet&location=Austin%2C+Texas%2C+United+States&gl=us&hl=en), it seems that SerpAPI does cover this case.

-I didn't try to parse them because I believe that this is outside of scope of this exercise. Instead of manually parsing the script (like we did with the image) we could consider using a real browser to evaluate the HTML before parsing the data. This is less performant but, depending on how possible is to manually parse this case, can be an option.
+I didn't try to parse them because I believe this is outside the scope of this exercise, but I see other (and probably better) options:
+
+- Instead of manually parsing scripts, we could use a real browser to evaluate the HTML before parsing the data. This is less performant but, if other parts of the page also require this method, it can be an alternative.
+- We can follow the "Artworks" link and scrape everything from there using the "Artwork specific page" implementation (I believe this is the way to go).