This repository was archived by the owner on Aug 26, 2020. It is now read-only.

Commit 22d4901

Christoph Lupprich committed
Ported scrAPI to Ruby 1.9.3 (unfortunately 1.9.2 will not work out because of a bug in Ruby itself).
1 parent becbe6e commit 22d4901

9 files changed: 43 additions, 39 deletions

README.rdoc

Lines changed: 10 additions & 5 deletions
@@ -40,13 +40,16 @@ To get the latest source code with regular updates:
 
   svn co http://labnotes.org/svn/public/ruby/scrapi
 
+== Version of Ruby
+
+Currently ScrAPI does not run with Ruby 1.9.2, but with the dev versions of Ruby 1.9.3. This is due to a bug in Ruby's visibility context handling (see changelog #29578 and bug #3406 on the official Ruby page). Using the most recent dev version of Ruby is easy with RVM (http://rvm.beginrescueend.com/).
 
 == Using TIDY
 
-By default scrAPI uses Tidy to cleanup the HTML.
+By default scrAPI uses Tidy (actually Tidy-FFI) to cleanup the HTML.
 
 You need to install the Tidy Gem for Ruby:
-  gem install tidy
+  gem install tidy_ffi
 
 And the Tidy binary libraries, available here:
 
@@ -56,15 +59,15 @@ By default scrAPI looks for the Tidy DLL (Windows) or shared library (Linux) in
 
 Alternatively, just point Tidy to the library with:
 
-  Tidy.path = "...."
+  TidyFFI.library_path = "...."
 
 On Linux this would probably be:
 
-  Tidy.path = "/usr/local/lib/libtidy.so"
+  TidyFFI.library_path = "/usr/local/lib/libtidy.so"
 
 On OS/X this would probably be:
 
-  Tidy.path = “/usr/lib/libtidy.dylib”
+  TidyFFI.library_path = “/usr/lib/libtidy.dylib”
 
 For testing purposes, you can also use the built in HTML parser. It's useful for testing and getting up to grabs with scrAPI, but it doesn't deal well with broken HTML. So for testing only:
 
@@ -86,3 +89,5 @@ HTML DOM extracted from Rails, Copyright (c) 2004 David Heinemeier Hansson. Unde
 
 HTML parser by Takahiro Maebashi and Katsuyuki Komatsu, Ruby license.
 http://www.jin.gr.jp/~nahi/Ruby/html-parser/README.html
+
+Porting to Ruby 1.9.x by Christoph Lupprich, http://lupprich.info
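
The README change above replaces `Tidy.path` with `TidyFFI.library_path`. As a rough sketch of how calling code might use the new setting, the helper below relies only on the two TidyFFI calls that appear in this commit (`TidyFFI.library_path=` and `TidyFFI::Tidy.with_options(...).clean`). The helper name `tidy_clean`, its parameters, and the fallback of returning the input unchanged when Tidy is unavailable are assumptions made for illustration, not part of scrAPI.

```ruby
begin
  require "tidy_ffi"
rescue LoadError
  # The tidy_ffi gem is not installed; tidy_clean falls back below.
end

# Hypothetical helper (not part of scrAPI): clean HTML with Tidy when it is
# available, otherwise return the input untouched.
def tidy_clean(html, library_path = nil, options = {})
  return html unless defined?(TidyFFI)
  # Point TidyFFI at an explicit shared library only when the caller gives one.
  TidyFFI.library_path = library_path if library_path
  TidyFFI::Tidy.with_options(options).clean(html)
rescue LoadError, RuntimeError
  # Assumption: a missing or unloadable libtidy surfaces as one of these.
  html
end

puts tidy_clean("<p>unclosed paragraph")
```

On Linux the second argument would probably be "/usr/local/lib/libtidy.so", matching the README text.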

Rakefile

Lines changed: 0 additions & 1 deletion
@@ -1,6 +1,5 @@
 require "benchmark"
 require "rubygems"
-Gem::manage_gems
 require "rake"
 require "rake/testtask"
 require "rake/rdoctask"

lib/scraper/base.rb

Lines changed: 4 additions & 4 deletions
@@ -906,10 +906,10 @@ def request(url, options)
     #   end
     def skip(elements = nil)
       case elements
-      when Array: @skip.concat elements
-      when HTML::Node: @skip << elements
-      when nil: @skip << true
-      when true, false: @skip << elements
+      when Array then @skip.concat elements
+      when HTML::Node then @skip << elements
+      when nil then @skip << true
+      when true, false then @skip << elements
       end
       # Calling skip(element) as the last statement is
       # redundant by design.

lib/scraper/reader.rb

Lines changed: 9 additions & 9 deletions
@@ -10,7 +10,7 @@
 require "net/https"
 begin
   require "rubygems"
-  require "tidy"
+  require "tidy_ffi"
 rescue LoadError
 end
 
@@ -95,6 +95,7 @@ def to_s
     # * :redirect_limit -- Number of redirects allowed (default is 3).
     # * :user_agent -- The User-Agent header to send.
     # * :timeout -- HTTP open connection/read timeouts (in second).
+    # * :ssl_verify_mode -- SSL verification mode, defaults to OpenSSL::SSL::VERIFY_NONE
     #
     # It returns a hash with the following information:
     # * :url -- The URL of the requested page (may change by permanent redirect)
@@ -123,6 +124,7 @@ def read_page(url, options = nil)
       begin
         http = Net::HTTP.new(uri.host, uri.port)
         http.use_ssl = (uri.scheme == "https")
+        http.verify_mode = options[:ssl_verify_mode] || OpenSSL::SSL::VERIFY_NONE
         http.close_on_empty_response = true
         http.open_timeout = http.read_timeout = options[:http_timeout] || DEFAULT_TIMEOUT
         path = uri.path.dup # required so we don't modify path
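
The new `:ssl_verify_mode` option is passed straight through to `Net::HTTP#verify_mode`, falling back to `OpenSSL::SSL::VERIFY_NONE` when absent. A minimal sketch of that defaulting logic, isolated from the rest of `read_page`, might look like the following. The helper name `build_http` and the example URL are illustrative assumptions; no connection is opened.

```ruby
require "net/http"
require "openssl"
require "uri"

# Hypothetical helper mirroring the lines added to read_page above.
def build_http(url, options = {})
  uri = URI.parse(url)
  http = Net::HTTP.new(uri.host, uri.port)
  http.use_ssl = (uri.scheme == "https")
  # VERIFY_NONE is the integer 0, which is truthy in Ruby, so an explicit
  # :ssl_verify_mode => VERIFY_NONE is honoured by the || just like any other value.
  http.verify_mode = options[:ssl_verify_mode] || OpenSSL::SSL::VERIFY_NONE
  http
end

default_http = build_http("https://example.com/")
strict_http  = build_http("https://example.com/",
                          :ssl_verify_mode => OpenSSL::SSL::VERIFY_PEER)
```

Because the default skips certificate verification entirely, callers who need the server's certificate checked would have to pass `OpenSSL::SSL::VERIFY_PEER` explicitly.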
@@ -202,10 +204,8 @@ def parse_page(content, encoding = nil, options = nil, parser = :tidy)
           find_tidy
           options = (options || {}).update(TIDY_OPTIONS)
           options[:input_encoding] = encoding.gsub("-", "").downcase
-          document = Tidy.open(options) do |tidy|
-            html = tidy.clean(content)
-            HTML::Document.new(html).find(:tag=>"html")
-          end
+          html = TidyFFI::Tidy.with_options(options).clean(content)
+          document = HTML::Document.new(html).find(:tag=>"html")
         when :html_parser
           document = HTML::HTMLParser.parse(content).root
         else
@@ -223,14 +223,14 @@ def parse_page(content, encoding = nil, options = nil, parser = :tidy)
     module_function
 
     def find_tidy()
-      return if Tidy.path
+      return if TidyFFI.library_path
       begin
-        Tidy.path = File.join(File.dirname(__FILE__), "../tidy", "libtidy.so")
+        TidyFFI.library_path = File.join(File.dirname(__FILE__), "../tidy", "libtidy.so")
       rescue LoadError
         begin
-          Tidy.path = File.join(File.dirname(__FILE__), "../tidy", "libtidy.dll")
+          TidyFFI.library_path = File.join(File.dirname(__FILE__), "../tidy", "libtidy.dll")
         rescue LoadError
-          Tidy.path = File.join(File.dirname(__FILE__), "../tidy", "libtidy.dylib")
+          TidyFFI.library_path = File.join(File.dirname(__FILE__), "../tidy", "libtidy.dylib")
         end
       end
     end

scrapi.gemspec

Lines changed: 3 additions & 3 deletions
@@ -1,6 +1,6 @@
 Gem::Specification.new do |spec|
   spec.name = 'scrapi'
-  spec.version = '1.2.1'
+  spec.version = '1.2.2'
   spec.summary = "scrAPI toolkit for Ruby. Uses CSS selectors to write easy, maintainable HTML scraping rules."
   spec.description = <<-EOF
     scrAPI is an HTML scraping toolkit for Ruby. It uses CSS selectors to write easy, maintainable scraping rules to select, extract and store data from HTML content.
@@ -13,10 +13,10 @@ EOF
   spec.files = Dir['{test,lib}/**/*', 'README.rdoc', 'CHANGELOG', 'Rakefile', 'MIT-LICENSE']
   spec.require_path = 'lib'
   spec.autorequire = 'scrapi.rb'
-  spec.requirements << 'Tidy'
+  spec.requirements << 'Tidy_ffi'
   spec.has_rdoc = true
   spec.rdoc_options << '--main' << 'README.rdoc' << '--title' << "scrAPI toolkit for Ruby" << '--line-numbers'
   spec.extra_rdoc_files = ['README.rdoc']
 
-  spec.add_dependency 'tidy', '>=1.1.0'
+  spec.add_dependency 'tidy_ffy', '>=0.1.2'
 end
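
The dependency line in this hunk names `tidy_ffy`, whereas the README and `reader.rb` changes in the same commit use `tidy_ffi`. The sketch below shows how the declared dependency can be inspected on a specification object, assuming `tidy_ffi` is the intended gem name. The spec name `scrapi-example`, its version, and its summary are placeholders, not values from the real gemspec.

```ruby
require "rubygems"

# Placeholder specification used only to inspect the dependency declaration.
spec = Gem::Specification.new do |s|
  s.name    = "scrapi-example"   # hypothetical name
  s.version = "0.0.1"            # hypothetical version
  s.summary = "Dependency declaration sketch"
  # Assumption: the gem meant here is tidy_ffi, as installed in the README.
  s.add_dependency "tidy_ffi", ">= 0.1.2"
end

dependency = spec.dependencies.first
puts "#{dependency.name} (#{dependency.requirement})"
```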

test/node_ext_test.rb

Lines changed: 1 addition & 1 deletion
@@ -7,7 +7,7 @@
 
 require "rubygems"
 require "test/unit"
-require File.join(File.dirname(__FILE__), "../lib", "scrapi")
+require "./lib/scrapi"
 
 
 class NodeExtTest < Test::Unit::TestCase
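
A path that starts with `./` is resolved by `require` against the current working directory, so the new `require "./lib/scrapi"` form only works when the tests are started from the project root. The old form was anchored to the test file's own location. The sketch below contrasts the two ways of building the path; both helper names and the `/project` directory are hypothetical, and the results assume Unix-style paths.

```ruby
# Path as built by the removed line: relative to the test file, not normalised.
def file_relative_path(test_file)
  File.join(File.dirname(test_file), "../lib", "scrapi")
end

# The same target as an absolute, normalised path.
def expanded_path(test_file)
  File.expand_path("../lib/scrapi", File.dirname(test_file))
end

test_file = "/project/test/node_ext_test.rb" # hypothetical location
puts file_relative_path(test_file)
puts expanded_path(test_file)
```

Either helper yields a path independent of the working directory, which `require "./lib/scrapi"` does not.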

test/reader_test.rb

Lines changed: 8 additions & 8 deletions
@@ -12,8 +12,8 @@
 require "webrick/https"
 require "logger"
 require "stringio"
-require File.join(File.dirname(__FILE__), "mock_net_http")
-require File.join(File.dirname(__FILE__), "../lib", "scrapi")
+require "./test/mock_net_http"
+require "./lib/scrapi"
 
 
 class ReaderTest < Test::Unit::TestCase
@@ -239,38 +239,38 @@ def test_should_handle_encoding_correctly
     # Test content encoding returned from HTTP server.
     with_webrick do |server, params|
       server.mount_proc "/test.html" do |req,resp|
-        resp["Content-Type"] = "text/html; charset=my-encoding"
+        resp["Content-Type"] = "text/html; charset=ASCII"
         resp.body = "Content comes here"
       end
       page = Reader.read_page(WEBRICK_TEST_URL)
       page = Reader.parse_page(page.content, page.encoding)
-      assert_equal "my-encoding", page.encoding
+      assert_equal "ASCII", page.encoding
     end
     # Test content encoding in HTML http-equiv header
     # that overrides content encoding returned in HTTP.
     with_webrick do |server, params|
       server.mount_proc "/test.html" do |req,resp|
-        resp["Content-Type"] = "text/html; charset=my-encoding"
+        resp["Content-Type"] = "text/html; charset=ASCII"
         resp.body = %Q{
           <html>
           <head>
-          <meta http-equiv="content-type" value="text/html; charset=other-encoding">
+          <meta http-equiv="content-type" value="text/html; charset=UTF-8">
           </head>
           <body></body>
           </html>
         }
       end
       page = Reader.read_page(WEBRICK_TEST_URL)
       page = Reader.parse_page(page.content, page.encoding)
-      assert_equal "other-encoding", page.encoding
+      assert_equal "UTF-8", page.encoding
     end
   end
 
   def test_should_support_https
     begin
       options = WEBRICK_OPTIONS.dup.update(
         :SSLEnable=>true,
-        :SSLVerifyClient => ::OpenSSL::SSL::VERIFY_NONE,
+        :SSLVerifyClient => OpenSSL::SSL::VERIFY_NONE,
         :SSLCertName => [ ["C","JP"], ["O","WEBrick.Org"], ["CN", "WWW"] ]
       )
       server = WEBrick::HTTPServer.new(options)
test/scraper_test.rb

Lines changed: 7 additions & 7 deletions
@@ -8,8 +8,8 @@
 require "rubygems"
 require "time"
 require "test/unit"
-require File.join(File.dirname(__FILE__), "mock_net_http")
-require File.join(File.dirname(__FILE__), "../lib", "scrapi")
+require "./test/mock_net_http"
+require "./lib/scrapi"
 
 
 class ScraperTest < Test::Unit::TestCase
@@ -301,8 +301,8 @@ def test_skip_from_extractor
     assert_equal "this", scraper.this2
 
     scraper = new_scraper(html) do
-      process "#1", :this1=>:text, :skip=>true do
-        false
+      process "#1", :this1=>:text, :skip=>true do |element|
+        element
       end
       process "#1", :this2=>:text
     end
@@ -351,7 +351,7 @@ def test_accessors
       [response, <<-EOF
         <html>
         <head>
-        <meta http-equiv="content-type" value="text/html; charset=other-encoding">
+        <meta http-equiv="content-type" value="text/html; charset=ASCII">
         </head>
         <body>
         <div id="x"/>
@@ -371,7 +371,7 @@ def test_accessors
     assert_equal "http://localhost/redirect", scraper.page_info.url.to_s
     assert_equal time, scraper.page_info.last_modified
     assert_equal "etag", scraper.page_info.etag
-    assert_equal "other-encoding", scraper.page_info.encoding
+    assert_equal "ASCII", scraper.page_info.encoding
   end
 
 
@@ -721,7 +721,7 @@ def test_prepare_and_result
     # Extracting the attribute skips the second match.
     scraper = new_scraper(DIVS123) do
       process("div") { |element| @count +=1 }
-      define_method(:prepare) { @count = 1 }
+      define_method(:prepare) { |element| @count = 1 }
       define_method(:result) { @count }
     end
     result = scraper.scrape
test/selector_test.rb

Lines changed: 1 addition & 1 deletion
@@ -4,7 +4,7 @@
 # Developed for http://co.mments.com
 # Code and documention: http://labnotes.org
 
-require File.join(File.dirname(__FILE__), "../lib", "scrapi")
+require "./lib/scrapi"
 
 
 class SelectorTest < Test::Unit::TestCase