
Commit 143c793

First commit
0 parents  commit 143c793

14 files changed: +1302 -0 lines changed

.gitignore

Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
webcrawler

README.md

Lines changed: 125 additions & 0 deletions
@@ -0,0 +1,125 @@
Webcrawler
==========

A simple webcrawler that, starting from an initial URL, visits all the URLs it
finds on the domain, as long as they belong to the same subdomain.

**Features**

- Concurrent
- Deduplication: tries not to crawl the same URL more than once
- Respects robots.txt directives
- Tries to be polite: if no crawl delay is found in robots.txt, it generates a
  randomized delay based on the response time of the server and a fixed value
  passed in as configuration
- Allows specifying a list of exclusions, links to avoid based on their
  extension

**Dependencies**

- [goquery](https://github.com/PuerkitoBio/goquery) to easily parse HTML
  documents; it provides convenient jQuery-like filtering methods (see the
  link-extraction sketch below), the alternative would have been walking down
  the document tree with the Go standard library
- [rehttp](https://github.com/PuerkitoBio/rehttp) enables easy retries on HTTP
  errors, with a configurable number of retries and an exponential backoff
  between attempts
- [robotstxt](https://github.com/temoto/robotstxt) allows efficient parsing of
  the `robots.txt` file at the root of each domain

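As an illustration of the kind of jQuery-like filtering goquery provides, here
is a minimal, hypothetical sketch that extracts every `href` from a page; the
actual `fetcher` package may structure this differently.

```go
package main

import (
	"fmt"
	"net/http"

	"github.com/PuerkitoBio/goquery"
)

// extractLinks fetches a page and returns every href found in <a> tags.
// The function name and error handling are illustrative only.
func extractLinks(url string) ([]string, error) {
	resp, err := http.Get(url)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()

	doc, err := goquery.NewDocumentFromReader(resp.Body)
	if err != nil {
		return nil, err
	}

	var links []string
	doc.Find("a[href]").Each(func(_ int, s *goquery.Selection) {
		if href, ok := s.Attr("href"); ok {
			links = append(links, href)
		}
	})
	return links, nil
}

func main() {
	links, err := extractLinks("https://golang.org")
	if err != nil {
		panic(err)
	}
	fmt.Println(links)
}
```
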
The project can be built with

```sh
go build -o . ./...
```

There are tests for the most critical parts:

```sh
go test -v ./...
```

and it can be run with:

```sh
./webcrawler -target https://golang.org -concurrency 4 -depth 8
```

Most of the crawler settings can be set through environment variables (see the
loading sketch after the list):

- `USERAGENT`: the User-Agent header to present to crawled servers
- `CRAWLING_TIMEOUT`: the number of seconds to wait after the last link found
  before exiting the crawl of a page
- `CONCURRENCY`: the number of worker goroutines to run in parallel while
  fetching websites; 0 means unlimited
- `MAX_DEPTH`: the number of links to fetch for each level; 0 means unbounded
- `FETCHING_TIMEOUT`: the timeout to wait when a fetch isn't responding
- `POLITENESS_DELAY`: the fixed delay to wait between multiple calls under the
  same domain

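A minimal sketch of how these variables could be loaded at startup; the
variable names come from the list above, while the package name, struct
fields, defaults, and units are assumptions, not the actual implementation.

```go
package config // hypothetical package name

import (
	"os"
	"strconv"
	"time"
)

// Config mirrors the environment variables listed above.
type Config struct {
	UserAgent       string
	CrawlingTimeout time.Duration
	Concurrency     int
	MaxDepth        int
	FetchingTimeout time.Duration
	PolitenessDelay time.Duration
}

// envInt reads an integer variable, falling back when unset or invalid.
func envInt(key string, fallback int) int {
	if v, ok := os.LookupEnv(key); ok {
		if n, err := strconv.Atoi(v); err == nil {
			return n
		}
	}
	return fallback
}

// Load builds a Config from the environment; defaults are illustrative.
func Load() Config {
	ua := os.Getenv("USERAGENT")
	if ua == "" {
		ua = "webcrawler" // assumed default
	}
	return Config{
		UserAgent:       ua,
		CrawlingTimeout: time.Duration(envInt("CRAWLING_TIMEOUT", 30)) * time.Second,
		Concurrency:     envInt("CONCURRENCY", 0),
		MaxDepth:        envInt("MAX_DEPTH", 0),
		FetchingTimeout: time.Duration(envInt("FETCHING_TIMEOUT", 10)) * time.Second,
		PolitenessDelay: time.Duration(envInt("POLITENESS_DELAY", 1)) * time.Second,
	}
}
```
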
The crawler supports extension exclusion and some degree of politeness: it
checks for `/robots.txt` directives and, if none are found, it assumes all
subdomains are valid and adjusts a random delay for each call:

`delay = max(random(0.5 * fixedDelay < x < 1.5 * fixedDelay), robotsDelay, lastResponseTime ** 2)`

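A sketch of that rule, under the assumption that all quantities are
`time.Duration` values; the package and function names are illustrative, not
the actual implementation.

```go
package crawlingrules // hypothetical package name

import (
	"math/rand"
	"time"
)

// politeDelay applies the rule above: a random value between 0.5x and 1.5x
// the configured fixed delay, never lower than the robots.txt crawl-delay or
// the square of the last response time (interpreted in seconds).
func politeDelay(fixedDelay, robotsDelay, lastResponse time.Duration) time.Duration {
	delay := fixedDelay
	if fixedDelay > 0 {
		// random duration in [0.5*fixedDelay, 1.5*fixedDelay)
		delay = fixedDelay/2 + time.Duration(rand.Int63n(int64(fixedDelay)))
	}

	// square of the last response time, in seconds
	sec := lastResponse.Seconds()
	if squared := time.Duration(sec * sec * float64(time.Second)); squared > delay {
		delay = squared
	}

	if robotsDelay > delay {
		delay = robotsDelay
	}
	return delay
}
```
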
The main crawling function consumes, in a loop, all the links to crawl from a
channel, spawning goroutine workers to fetch new links on every page and
limiting the concurrency with a semaphore. Every worker respects a delay
between multiple calls to avoid flooding the target webserver.
There's no recursion involved, making it quite efficient and allowing for
high levels of depth.

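A simplified, self-contained approximation of that loop: a frontier of links,
a buffered channel used as a semaphore to bound concurrency, and worker
goroutines that report the links they find back to the loop. Names,
signatures, and the in-memory frontier are assumptions for illustration; the
real implementation also applies the politeness delay and depth limits.

```go
package crawler // hypothetical sketch

// crawl drains a frontier of links, dispatching one worker per link while a
// buffered channel bounds the number of in-flight fetches. fetch is assumed
// to return the links found on the fetched page.
func crawl(start string, concurrency int, fetch func(string) []string) {
	if concurrency <= 0 {
		concurrency = 64 // 0 means unlimited in the real crawler; bounded here for simplicity
	}
	sem := make(chan struct{}, concurrency) // semaphore limiting in-flight fetches
	results := make(chan []string)          // links found by finished workers
	queue := []string{start}                // frontier of links still to crawl
	seen := map[string]bool{start: true}    // deduplication set
	inFlight := 0

	for len(queue) > 0 || inFlight > 0 {
		// dispatch workers while there is work and a free semaphore slot
		dispatching := true
		for dispatching && len(queue) > 0 {
			select {
			case sem <- struct{}{}: // acquire a slot
				url := queue[0]
				queue = queue[1:]
				inFlight++
				go func(u string) {
					found := fetch(u)
					<-sem // release the slot before reporting
					results <- found
				}(url)
			default:
				dispatching = false // semaphore full, go wait for a worker
			}
		}

		// wait for one worker to finish and fold its links into the frontier
		for _, link := range <-results {
			if !seen[link] {
				seen[link] = true
				queue = append(queue, link)
			}
		}
		inFlight--
	}
}
```
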
## Decisions

Originally I thought about designing and implementing the crawler as a simple
microservices architecture, decoupling the fetching from the presentation
service and using queues to communicate asynchronously; `RabbitMQ` was
taken into consideration for that.

I eventually decided to produce a simpler PoC, explaining here its weak points
and the improvements that could be made.

The application is entirely ephemeral, which means that stopping it will lose
any progress on the crawling job. I implemented the core features trying to
decouple responsibilities as much as possible in order to make it easier to
plug in different components:

- A `crawler` package which contains the crawling logic
- `crawlingrules` defines a simple ruleset to follow while crawling, like
  robots.txt rules and delays to respect
- A `messaging` package which offers a communication interface, used to push
  crawling results to different consumers; currently the only consumer is a
  simple goroutine that prints the links found
- `fetcher` is a package dedicated to the HTTP communication and the parsing
  of HTML content; its `Parser` and `Fetcher` interfaces make it easy to plug
  in multiple implementations with different underlying libraries and
  behaviors (a sketch of what they could look like follows this list)

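For illustration, interfaces along these lines would support the decoupling
described above; the exact method signatures in the `fetcher` package are
assumptions.

```go
package fetcher // hypothetical sketch, not the actual definitions

import "net/url"

// Parser extracts the links contained in a fetched HTML document.
type Parser interface {
	Parse(body []byte) ([]*url.URL, error)
}

// Fetcher downloads a page and returns its raw HTML content.
type Fetcher interface {
	Fetch(target *url.URL) ([]byte, error)
}
```
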
### Known issues

- No checkpoint-like persistence to gracefully pause/restart the process
- Deduplication could be better: no `rel=canonical` handling, and it doesn't
  check whether the `http` and `https` versions of the site display the same
  contents
- The `Retry-After` header is not respected after a 503 response
- It's simple, no session handling/cookies
- Logging is pretty simple, no external libraries, errors are just printed
- No input sanitization beyond adding a missing scheme: if a domain requires
  `www` it cannot be omitted, otherwise the crawler will try to contact the
  server without success; in other words, it requires a correct URL as input

### Improvements

Given the freedom of the task, I stuck with the simplest solution, more akin
to a PoC showing the core features; there's plenty of room for improvement
towards a production-ready solution, mostly depending on the purpose of the
software:

- A REST interface to ingest jobs and query them, probably behind a
  load-balancer
- Crawling logic with persistent state and a persistent queue for the links
  to crawl
- Configurable logging
- Extensibility and customization of the crawling rules, for example pluggable
  delay functions for each domain
- Better definition of errors, and maybe a queue to notify/gather them from
  stderr through some kind of aggregation stack (e.g. ELK)
- **Depending on the final purpose**, separate the working logic even more
  from the business logic, probably adding a job-agnostic worker package
  implementing a shared-nothing actor model based on goroutines, which could
  be reused for different purposes as the project grows
