77 changes: 60 additions & 17 deletions README.md
@@ -1,28 +1,71 @@
# Extract Van Gogh Paintings Code Challenge
# GoogleKCParser

Goal is to extract a list of Van Gogh paintings from the attached Google search results page.
## Overview
`GoogleKCParser` is a Ruby module that parses Google Knowledge Carousel (KC) HTML files. It extracts carousel items such as artworks, animal breeds, artist albums, and movie cast lists by analyzing the HTML structure and embedded data.

![Van Gogh paintings](https://github.com/serpapi/code-challenge/blob/master/files/van-gogh-paintings.png?raw=true "Van Gogh paintings")
## Project Structure
```
├── bin
│   └── run_parser.rb # Script for running the parser from the command line
├── files # Contains input HTML files and expected output JSON files
├── lib
│   └── google_kc_parser.rb # Parser module source code
└── test
    └── test_parser.rb # File for testing the parser using multiple carousel formats
```

## Instructions
## Usage

This is already fully supported on SerpApi. ([relevant test], [html file], [sample json], and [expected array].)
Try to come up with your own solution and your own test.
Extract the painting `name`, `extensions` array (date), and Google `link` in an array.
### Running the parser from the command line

Fork this repository and make a PR when ready.
To parse an HTML file and output the extracted carousel data as JSON:

Programming language wise, Ruby (with RSpec tests) is strongly suggested but feel free to use whatever you feel like.
```bash
ruby bin/run_parser.rb <html_file>
```
Example:

Parse directly the HTML result page ([html file]) in this repository. No extra HTTP requests should be needed for anything.
```bash
ruby bin/run_parser.rb van-gogh-paintings.html
```
This parses the file located in the `files/` directory, prints the JSON results to stdout, and also writes them to a `van-gogh-paintings-actual.json` file.
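
As a rough illustration of the "write to a file and to stdout" behavior described above, the sketch below derives an `-actual.json` filename from the input HTML filename and serializes a results array. The helper names `actual_output_path` and `write_results`, and the `dir:` keyword, are hypothetical and not taken from the parser's source; the real module may organize this differently.

```ruby
require 'json'

# Hypothetical helper (not from the parser's source):
# "van-gogh-paintings.html" -> "files/van-gogh-paintings-actual.json"
def actual_output_path(html_file, dir: 'files')
  base = File.basename(html_file, '.html')
  File.join(dir, "#{base}-actual.json")
end

# Hypothetical helper: serializes the results once, writes them next to the
# input fixtures, and returns the JSON string so the caller can print it.
def write_results(results, html_file, dir: 'files')
  json = JSON.pretty_generate(results)
  File.write(actual_output_path(html_file, dir: dir), json)
  json
end
```

A caller could then run `puts write_results(results, 'van-gogh-paintings.html')` to get both outputs from a single serialization step.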

[relevant test]: https://github.com/serpapi/test-knowledge-graph-desktop/blob/master/spec/knowledge_graph_claude_monet_paintings_spec.rb
[sample json]: https://raw.githubusercontent.com/serpapi/code-challenge/master/files/van-gogh-paintings.json
[html file]: https://raw.githubusercontent.com/serpapi/code-challenge/master/files/van-gogh-paintings.html
[expected array]: https://raw.githubusercontent.com/serpapi/code-challenge/master/files/expected-array.json

Add also to your array the painting thumbnails present in the result page file (not the ones where extra requests are needed).
### Running the parser from IRB

Test against 2 other similar result pages to make sure it works against different layouts. (Pages that contain the same kind of carrousel. Don't necessarily have to be paintings.)
1. Start `irb` in the project root directory:

The suggested time for this challenge is 4 hours. But, you can take your time and work more on it if you want.
```bash
irb
```
2. Require the parser module:
```ruby
require_relative './lib/google_kc_parser'
```
3. Call the parser method with the HTML filename (located in files/):
```ruby
results = GoogleKCParser.parse('van-gogh-paintings.html')
puts results
```
## Running Tests
Tests are implemented using Ruby’s built-in Test::Unit framework.

To run all tests:
```bash
ruby test/test_parser.rb
```
The following Google queries are currently tested:
1. van gogh paintings
2. dog breeds
3. michael jackson albums
4. stranger things cast

The test suite verifies that the actual results match the expected results exactly, comparing them result by result and field by field. The HTML files and the expected JSON results for the above queries are already present in the `files/` directory.
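
The result-by-result, field-by-field comparison could look roughly like the sketch below. It is not the code in `test/test_parser.rb`; `diff_results` and `FIELDS` are hypothetical names, and the field list is taken from the expected-JSON keys named in this README.

```ruby
# Fields compared for every carousel item (keys listed in this README).
FIELDS = %w[name link image extensions].freeze

# Hypothetical helper: returns human-readable mismatch descriptions.
# An empty array means the two result lists match exactly, in order.
def diff_results(expected, actual)
  mismatches = []
  if expected.length != actual.length
    mismatches << "result count: expected #{expected.length}, got #{actual.length}"
  end

  # Pair items by position, so a reordered list is reported as a mismatch.
  expected.zip(actual).each_with_index do |(exp, act), index|
    next if act.nil? # already covered by the count mismatch above

    FIELDS.each do |field|
      next if exp[field] == act[field]

      mismatches << "result #{index}, #{field}: " \
                    "expected #{exp[field].inspect}, got #{act[field].inspect}"
    end
  end
  mismatches
end
```

Reporting every mismatch, instead of stopping at the first one, makes it easier to see whether a layout change broke one field or the whole item.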

Feel free to test more queries. To do so, add the HTML page together with its expected output as JSON to the `files/` directory. The files must follow this naming format, e.g. for 'dog breeds': `dog-breeds.html` and `dog-breeds-expected.json`. The expected JSON should be an array of hashes with the keys `name`, `link`, `image`, and `extensions`. The order of results **does matter**.
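
Before adding a new query, the naming convention and the expected-JSON shape can be sanity-checked with a small sketch like the one below. `fixture_names` and `valid_expected_json?` are hypothetical helpers, not part of the parser, and the check assumes every item carries all four keys; relax it if some keys are optional in your fixtures.

```ruby
require 'json'

REQUIRED_KEYS = %w[name link image extensions].freeze

# Hypothetical helper: "dog breeds" -> ["dog-breeds.html", "dog-breeds-expected.json"]
def fixture_names(query)
  slug = query.strip.downcase.split(/\s+/).join('-')
  ["#{slug}.html", "#{slug}-expected.json"]
end

# Hypothetical helper: true when the text is a JSON array of hashes that
# each contain all REQUIRED_KEYS (assumption: no key is optional).
def valid_expected_json?(json_text)
  data = JSON.parse(json_text)
  data.is_a?(Array) &&
    data.all? { |item| item.is_a?(Hash) && REQUIRED_KEYS.all? { |key| item.key?(key) } }
rescue JSON::ParserError
  false
end
```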
## Notes
This parser focuses on Google Knowledge Carousel results by targeting specific HTML data attributes and embedded JavaScript.
Image data that is embedded as base64 or dynamically injected by scripts is properly decoded and extracted.
The parser is designed to be extensible for different carousel formats.
If you have any questions or need further assistance, feel free to reach out!
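
To illustrate the note on base64 image data, here is a minimal sketch of decoding an inline `data:` URI. It is not the parser's implementation: `decode_data_uri` and `unescape_js_hex` are hypothetical names, and the sketch assumes that payloads lifted from embedded scripts may carry JavaScript-style `\xNN` escapes (such as `\x3d` for `=`) that must be undone first.

```ruby
# Matches "data:<mime>;base64,<payload>" (sketch; other data URI forms are ignored).
DATA_URI = %r{\Adata:(?<mime>[\w/+.-]+);base64,(?<payload>.+)\z}m

# Assumption: escaped bytes are ASCII, e.g. "\x3d" -> "=".
def unescape_js_hex(text)
  text.gsub(/\\x(\h{2})/) { Regexp.last_match(1).hex.chr }
end

# Hypothetical helper: returns the MIME type and raw image bytes,
# or nil when the input is not an inline base64 image.
def decode_data_uri(uri)
  match = DATA_URI.match(unescape_js_hex(uri))
  return nil unless match

  # String#unpack1('m') decodes base64 without requiring any library.
  { mime: match[:mime], bytes: match[:payload].unpack1('m') }
end
```

Returning `nil` for ordinary URLs lets a caller skip thumbnails that would need an extra HTTP request.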
20 changes: 20 additions & 0 deletions bin/run_parser.rb
@@ -0,0 +1,20 @@
#!/usr/bin/env ruby
# Script to parse a Google Knowledge Carousel HTML file and output extracted results as JSON.
# Usage: ./run_parser.rb <html_file>

require 'json'
require_relative '../lib/google_kc_parser'

if ARGV.empty?
  puts "Usage: #{$0} <html_file>"
  exit 1
end

html_file = ARGV[0]
begin
  results = GoogleKCParser.parse(html_file)
  puts JSON.pretty_generate(results)
rescue => e
  warn "Error parsing file #{html_file}: #{e.message}"
  exit 1
end
62 changes: 62 additions & 0 deletions files/dog-breeds-expected.json

Large diffs are not rendered by default.