feat: implement a subset of SerpApi to extract knowledge card carousel images #350

Open: wants to merge 2 commits into base: master

4 changes: 4 additions & 0 deletions .gitignore
@@ -49,3 +49,7 @@ build-iPhoneSimulator/
# unless supporting rvm < 1.11.0 or doing something fancy, ignore this:
.rvmrc
.DS_Store

.rspec_status

.idea/
3 changes: 3 additions & 0 deletions .rspec
@@ -0,0 +1,3 @@
--format documentation
--color
--require spec_helper
3 changes: 3 additions & 0 deletions .standard.yml
@@ -0,0 +1,3 @@
# For available configuration options, see:
# https://github.com/standardrb/standard
ruby_version: 3.1
13 changes: 13 additions & 0 deletions Gemfile
@@ -0,0 +1,13 @@
# frozen_string_literal: true

source "https://rubygems.org"

# Specify your gem's dependencies in google_serp.gemspec
gemspec

gem "irb"
gem "rake", "~> 13.0"

gem "rspec", "~> 3.0"

gem "standard", "~> 1.3"
125 changes: 125 additions & 0 deletions Gemfile.lock
@@ -0,0 +1,125 @@
PATH
  remote: .
  specs:
    google_serp (0.1.0)
      nokogiri (~> 1.18.9)

GEM
  remote: https://rubygems.org/
  specs:
    ast (2.4.3)
    date (3.4.1)
    diff-lcs (1.6.2)
    erb (5.0.2)
    io-console (0.8.1)
    irb (1.15.2)
      pp (>= 0.6.0)
      rdoc (>= 4.0.0)
      reline (>= 0.4.2)
    json (2.13.2)
    language_server-protocol (3.17.0.5)
    lint_roller (1.1.0)
    nokogiri (1.18.9-aarch64-linux-gnu)
      racc (~> 1.4)
    nokogiri (1.18.9-aarch64-linux-musl)
      racc (~> 1.4)
    nokogiri (1.18.9-arm-linux-gnu)
      racc (~> 1.4)
    nokogiri (1.18.9-arm-linux-musl)
      racc (~> 1.4)
    nokogiri (1.18.9-arm64-darwin)
      racc (~> 1.4)
    nokogiri (1.18.9-x86_64-darwin)
      racc (~> 1.4)
    nokogiri (1.18.9-x86_64-linux-gnu)
      racc (~> 1.4)
    nokogiri (1.18.9-x86_64-linux-musl)
      racc (~> 1.4)
    parallel (1.27.0)
    parser (3.3.9.0)
      ast (~> 2.4.1)
      racc
    pp (0.6.2)
      prettyprint
    prettyprint (0.2.0)
    prism (1.4.0)
    psych (5.2.6)
      date
      stringio
    racc (1.8.1)
    rainbow (3.1.1)
    rake (13.3.0)
    rdoc (6.14.2)
      erb
      psych (>= 4.0.0)
    regexp_parser (2.10.0)
    reline (0.6.2)
      io-console (~> 0.5)
    rspec (3.13.1)
      rspec-core (~> 3.13.0)
      rspec-expectations (~> 3.13.0)
      rspec-mocks (~> 3.13.0)
    rspec-core (3.13.5)
      rspec-support (~> 3.13.0)
    rspec-expectations (3.13.5)
      diff-lcs (>= 1.2.0, < 2.0)
      rspec-support (~> 3.13.0)
    rspec-mocks (3.13.5)
      diff-lcs (>= 1.2.0, < 2.0)
      rspec-support (~> 3.13.0)
    rspec-support (3.13.4)
    rubocop (1.75.8)
      json (~> 2.3)
      language_server-protocol (~> 3.17.0.2)
      lint_roller (~> 1.1.0)
      parallel (~> 1.10)
      parser (>= 3.3.0.2)
      rainbow (>= 2.2.2, < 4.0)
      regexp_parser (>= 2.9.3, < 3.0)
      rubocop-ast (>= 1.44.0, < 2.0)
      ruby-progressbar (~> 1.7)
      unicode-display_width (>= 2.4.0, < 4.0)
    rubocop-ast (1.46.0)
      parser (>= 3.3.7.2)
      prism (~> 1.4)
    rubocop-performance (1.25.0)
      lint_roller (~> 1.1)
      rubocop (>= 1.75.0, < 2.0)
      rubocop-ast (>= 1.38.0, < 2.0)
    ruby-progressbar (1.13.0)
    standard (1.50.0)
      language_server-protocol (~> 3.17.0.2)
      lint_roller (~> 1.0)
      rubocop (~> 1.75.5)
      standard-custom (~> 1.0.0)
      standard-performance (~> 1.8)
    standard-custom (1.0.2)
      lint_roller (~> 1.0)
      rubocop (~> 1.50)
    standard-performance (1.8.0)
      lint_roller (~> 1.1)
      rubocop-performance (~> 1.25.0)
    stringio (3.1.7)
    unicode-display_width (3.1.4)
      unicode-emoji (~> 4.0, >= 4.0.4)
    unicode-emoji (4.0.4)

PLATFORMS
  aarch64-linux-gnu
  aarch64-linux-musl
  arm-linux-gnu
  arm-linux-musl
  arm64-darwin
  x86_64-darwin
  x86_64-linux-gnu
  x86_64-linux-musl

DEPENDENCIES
  google_serp!
  irb
  rake (~> 13.0)
  rspec (~> 3.0)
  standard (~> 1.3)

BUNDLED WITH
   2.6.9
10 changes: 10 additions & 0 deletions Rakefile
@@ -0,0 +1,10 @@
# frozen_string_literal: true

require "bundler/gem_tasks"
require "rspec/core/rake_task"

RSpec::Core::RakeTask.new(:spec)

require "standard/rake"

task default: %i[spec standard]
11 changes: 11 additions & 0 deletions bin/console
@@ -0,0 +1,11 @@
#!/usr/bin/env ruby
# frozen_string_literal: true

require "bundler/setup"
require "google_serp"

# You can add fixtures and/or initialization code here to make experimenting
# with your gem easier. You can also use a different console, if you like.

require "irb"
IRB.start(__FILE__)
8 changes: 8 additions & 0 deletions bin/setup
@@ -0,0 +1,8 @@
#!/usr/bin/env bash
set -euo pipefail
IFS=$'\n\t'
set -vx

bundle install

# Do any other automated setup that you need to do here
32 changes: 32 additions & 0 deletions files/extra_results/adele_songs.html

Large diffs are not rendered by default.

92 changes: 92 additions & 0 deletions files/extra_results/adele_songs_expected_array.json

Large diffs are not rendered by default.

32 changes: 32 additions & 0 deletions files/extra_results/jk_rowling_books.html

Large diffs are not rendered by default.

98 changes: 98 additions & 0 deletions files/extra_results/jk_rowling_books_expected_array.json

Large diffs are not rendered by default.

32 changes: 32 additions & 0 deletions files/extra_results/leo_di_caprio_movies.html

Large diffs are not rendered by default.

98 changes: 98 additions & 0 deletions files/extra_results/leo_di_caprio_movies_expected_array.json

Large diffs are not rendered by default.

24 changes: 24 additions & 0 deletions google_serp.gemspec
@@ -0,0 +1,24 @@
# frozen_string_literal: true

require_relative 'lib/google_serp/version'

Gem::Specification.new do |spec|
  spec.name = 'google_serp'
  spec.version = GoogleSerp::VERSION
  spec.authors = ['binoverfl0w']
  spec.email = ['[email protected]']

  spec.description = 'Google SERP Scraper is a Ruby gem that allows you to scrape a subset of data from Google Search Engine Results Pages (SERPs) like knowledge cards. It provides a simple and efficient way to extract structured information from Google search results.'
  spec.summary = spec.description
  spec.homepage = 'https://github.com/binoverfl0w/code-challenge'
  spec.required_ruby_version = '>= 3.1.0'

  spec.metadata['homepage_uri'] = spec.homepage
  spec.metadata['source_code_uri'] = spec.homepage

  spec.bindir = 'exe'
  spec.executables = spec.files.grep(%r{\Aexe/}) { |f| File.basename(f) }
  spec.require_paths = ['lib']

  spec.add_dependency 'nokogiri', '~> 1.18.9'
end
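Review note: `spec.executables` is derived from `spec.files`, which this gemspec never assigns. The sketch below applies the same `grep` expression to plain arrays to show what it yields; the file names are hypothetical, and it assumes `spec.files` is empty when left unassigned.

```ruby
# Same expression as the gemspec uses for spec.executables, applied to a plain array.
def executables_from(files)
  files.grep(%r{\Aexe/}) { |f| File.basename(f) }
end

# Hypothetical file list: only entries under exe/ become executables.
with_files = executables_from(%w[exe/google_serp lib/google_serp.rb])

# With no files listed, nothing matches and the result is an empty array.
without_files = executables_from([])

puts with_files.inspect
puts without_files.inspect
```

If the gem is ever meant to ship a command-line entry point, populating `spec.files` would probably be needed for this line to pick it up.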
26 changes: 26 additions & 0 deletions lib/google_serp.rb
@@ -0,0 +1,26 @@
# frozen_string_literal: true

require_relative 'google_serp/knowledge_card'
require_relative 'google_serp/knowledge_card/image_carousel'
require 'nokogiri'
require 'open-uri'

# GoogleSerp is a module that provides functionality to parse Google SERP pages
module GoogleSerp
  # Fetches the page at the given URI and extracts the knowledge card image carousel.
  # @param uri [String] The URI of the Google SERP page to parse.
  # @return [GoogleSerp::KnowledgeCard::ImageCarousel] The image carousel extracted from the knowledge card.
  # @raise [StandardError] If the page cannot be fetched, or if no search results or carousel container is found.
  # @example
  #   GoogleSerp.parse('https://www.google.com/search?q=van+gogh+paintings')
  #   # => Returns a GoogleSerp::KnowledgeCard::ImageCarousel
  def parse(uri)
    doc = Nokogiri::HTML(URI.open(uri))
    search_results = doc.at_xpath('//div[h1[text()="Search Results"]]')
    raise 'No search results found' if search_results.nil?

    KnowledgeCard.build_image_carousel(search_results)
  end

  module_function :parse
end
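Review note: `module_function :parse` is what makes `GoogleSerp.parse(...)` callable directly on the module. A minimal sketch of that behaviour, using a hypothetical `Demo` module rather than the gem itself:

```ruby
# Hypothetical module that mirrors the module_function pattern used above.
module Demo
  def parse(input)
    "parsed: #{input}"
  end

  # Copies parse onto the module itself and makes the instance version private.
  module_function :parse
end

# Hypothetical class that mixes the module in.
class Includer
  include Demo
end

puts Demo.parse("page")                   # callable on the module
puts Includer.new.respond_to?(:parse)     # the mixed-in copy is private
puts Includer.new.send(:parse, "page")    # still reachable internally
```

The same pattern applies to the three builder methods in `KnowledgeCard`, which are also exposed through `module_function`.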
79 changes: 79 additions & 0 deletions lib/google_serp/knowledge_card.rb
@@ -0,0 +1,79 @@
# frozen_string_literal: true

require_relative 'knowledge_card/image_carousel'
require_relative 'knowledge_card/image_carousel/image'
require 'nokogiri'

module GoogleSerp
  # The KnowledgeCard module provides methods to extract and build knowledge card elements
  module KnowledgeCard
    class ElementNotFoundError < StandardError; end

    # Builds an ImageCarousel from the given Nokogiri node.
    # @param element [Nokogiri::XML::Node] The Nokogiri node containing the knowledge card.
    # @return [ImageCarousel] The constructed ImageCarousel object.
    # @raise [ElementNotFoundError] If no carousel container is found in the document
    def build_image_carousel(element)
      # CSS selector to find the knowledge card container
      carousel_container = element.css('div[data-attrid^="kc:/"]').first
      raise ElementNotFoundError, 'No carousel container found in the document' if carousel_container.nil?

      images = []
      carousel_container.css('a').each do |anchor_element|
        images << build_image(anchor_element, element.css('script'))
      rescue ElementNotFoundError => _e
        # Skip anchors that do not wrap an image
      end
      ImageCarousel.new(images: images)
    end

    # Builds an Image object from the given Nokogiri node.
    # @param element [Nokogiri::XML::Node] The anchor node containing the image information.
    # @param scripts [Nokogiri::XML::NodeSet] The script nodes searched for inline image data.
    # @return [ImageCarousel::Image] The constructed Image object.
    # @raise [ElementNotFoundError] If no image is found in the given node
    def build_image(element, scripts)
      image_element = element.css('img').first
      raise ElementNotFoundError, 'No image found in the given node' if image_element.nil?

      text_nodes = element.xpath('.//text()').map { |node| node.text.strip }.reject(&:empty?)
      name = text_nodes.shift
      extensions = text_nodes.empty? ? nil : text_nodes
      href = element['href']
      link = href.nil? || href.start_with?('http') ? href : "https://www.google.com#{href}"
      # Prefer data-src, then the value resolved from the inline scripts, then the src attribute.
      src = image_element['data-src'] || resolve_image_src_from_script(image_element, scripts) || image_element['src']
      ImageCarousel::Image.new(name: name, extensions: extensions, link: link, image: src)
    end

    # Resolves the image source from the script content if the image src is not directly available.
    # @param img [Nokogiri::XML::Node] The image node containing the id.
    # @param scripts [Nokogiri::XML::NodeSet] The set of script nodes to search for the image source.
    # @return [String, nil] The resolved image source, or nil if it cannot be found in the scripts.
    def resolve_image_src_from_script(img, scripts)
      # The script has the following format:
      # (function(){var s="<image data>";var ii=['image_id'];...;_setImagesSrc(ii, s, ...);})();
      scripts.each do |script|
        next unless img['id'] && script.content.include?(img['id'])

        # _setImagesSrc may be called with extra arguments, so match only up to the
        # second argument, which names the variable holding the image source.
        match = script.content.match(/_setImagesSrc\([^,]+,\s*([^,)]+)/)
        next unless match

        source_var_name = match[1].strip
        # The source variable is defined in the script, e.g., var s="<image data>";
        regex = Regexp.new("\\s+#{Regexp.escape(source_var_name)}\\s*=\\s*['\"]([^'\"]*)['\"]")
        source_match = script.content.match(regex)
        next unless source_match && !source_match[1].empty?

        # The source may contain hex escape sequences (e.g. \x3d), so undump it.
        return "\"#{source_match[1]}\"".undump
      end

      nil
    end

    module_function :build_image_carousel, :build_image, :resolve_image_src_from_script
  end
end
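Review note: the two-step lookup in `resolve_image_src_from_script` (find the variable passed to `_setImagesSrc`, then find that variable's string literal and undump it) can be exercised without Nokogiri. The sketch below is a standalone approximation; `sample_script` and `resolve_src` are hypothetical names, and the script text is invented to match the shape described in the comments above.

```ruby
# Hypothetical script body in the shape the PR's comments describe.
def sample_script
  <<~'JS'.strip
    (function(){var s="data:image/jpeg;base64,AAAA\x3d\x3d";var ii=['img_1'];_setImagesSrc(ii, s);})();
  JS
end

# Standalone approximation of the lookup, operating on a plain string.
def resolve_src(script_text, image_id)
  return nil unless image_id && script_text.include?(image_id)

  # Step 1: capture the name of the second argument passed to _setImagesSrc.
  call = script_text.match(/_setImagesSrc\([^,]+,\s*([^,)]+)/)
  return nil unless call

  # Step 2: find the string literal assigned to that variable.
  var_name = Regexp.escape(call[1].strip)
  assignment = script_text.match(/\s#{var_name}\s*=\s*['"]([^'"]*)['"]/)
  return nil if assignment.nil? || assignment[1].empty?

  # Step 3: decode escape sequences such as \x3d by undumping the quoted literal.
  %("#{assignment[1]}").undump
end

puts resolve_src(sample_script, "img_1")
```

Note that `String#undump` raises on input that is not a valid dumped string, so a production version may want to rescue that case and fall back to the `src` attribute.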
24 changes: 24 additions & 0 deletions lib/google_serp/knowledge_card/image_carousel.rb
@@ -0,0 +1,24 @@
# frozen_string_literal: true

require_relative 'image_carousel/image'

module GoogleSerp
  module KnowledgeCard
    # Represents an image carousel tab in the knowledge card of a Google SERP.
    class ImageCarousel
      attr_reader :images

      def initialize(images:)
        @images = images
      end

      def to_json(*_args)
        JSON.generate(@images)
      end

      def to_s
        to_json
      end
    end
  end
end
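Review note: `ImageCarousel#to_json` relies on `JSON.generate` calling each element's own `to_json`. A minimal sketch of that delegation, with a hypothetical `SketchImage` class standing in for `ImageCarousel::Image`:

```ruby
require "json"

# Hypothetical stand-in for ImageCarousel::Image with a single field.
class SketchImage
  def initialize(name)
    @name = name
  end

  # JSON.generate invokes this for each array element it does not know how to encode.
  def to_json(*args)
    { name: @name }.to_json(*args)
  end
end

carousel_json = JSON.generate([SketchImage.new("The Starry Night"), SketchImage.new("Irises")])
puts carousel_json
```

Because the carousel serializes to a bare array, callers that expect a keyed object would need to wrap the result themselves.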
34 changes: 34 additions & 0 deletions lib/google_serp/knowledge_card/image_carousel/image.rb
@@ -0,0 +1,34 @@
# frozen_string_literal: true

require 'json'

module GoogleSerp
  module KnowledgeCard
    class ImageCarousel
      # Represents an image in the image carousel of a knowledge card.
      class Image
        attr_reader :name, :extensions, :link, :image

        def initialize(name:, extensions:, link:, image:)
          @name = name
          @extensions = extensions
          @link = link
          @image = image
        end

        def to_json(*_args)
          {
            name: @name,
            extensions: @extensions,
            link: @link,
            image: @image
          }.to_json
        end

        def to_s
          to_json
        end
      end
    end
  end
end
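Review note: `Image#to_json` always emits the `extensions` key, so an image without extensions serializes as `"extensions":null`. If the expected fixtures omit the key instead (an assumption, since the large JSON diffs are not rendered here), one option is to add the key only when a value exists. `image_payload` below is a hypothetical helper, not part of the PR:

```ruby
require "json"

# Hypothetical helper: builds the hash Image#to_json would emit,
# but leaves out :extensions when there are none.
def image_payload(name:, link:, image:, extensions: nil)
  payload = { name: name }
  payload[:extensions] = extensions unless extensions.nil?
  payload[:link] = link
  payload[:image] = image
  payload
end

with_ext = image_payload(name: "Irises", extensions: ["1889"],
                         link: "https://www.google.com/search?q=irises", image: nil)
without_ext = image_payload(name: "Irises",
                            link: "https://www.google.com/search?q=irises", image: nil)

puts JSON.generate(with_ext)
puts JSON.generate(without_ext)
```

Only `extensions` is dropped here; `image` is kept even when nil so that a missing thumbnail still shows up as `null`.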
5 changes: 5 additions & 0 deletions lib/google_serp/version.rb
@@ -0,0 +1,5 @@
# frozen_string_literal: true

module GoogleSerp
  VERSION = "0.1.0"
end
5 changes: 5 additions & 0 deletions sig/google_serp.rbs
@@ -0,0 +1,5 @@
module GoogleSerp
  VERSION: String

  def self?.parse: (String uri) -> KnowledgeCard::ImageCarousel
end