![http://www.cortext.net](http://www.cortext.net/IMG/siteon0.png)
Crawtext
===============================================
Crawtext is a project of the Cortext Lab. It is independent from the **Cortext manager** platform but designed to interact with it.
Get a free account and discover the tools you can use for your own research by registering at
[Cortext](http://manager.cortext.net/).
**Crawtext** is a tiny command-line crawler that lets you investigate and collect the web resources that match specific keywords. It is useful for archiving the web around a particular theme; the results can also be used with the Cortext manager to explore the relationships between websites on a given topic.
Basic Principle
---------
Crawtext is a tiny crawler that goes from page to page collecting relevant articles given a few keywords.
The crawler needs:
* a **query** to select pertinent pages
and
* **starting urls** to collect data
Given a list of URLs:
1. the robot collects the article for each URL;
2. it searches for the keywords inside the text extracted from the article:
=> if the keywords are present in the page, it stores the content of the page; and
3. the links inside the page are added to the next list to be processed (a minimal sketch of this loop follows).
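To make the principle concrete, here is a minimal, self-contained Python sketch of that fetch / match / enqueue loop. It is only an illustration, not crawtext's actual implementation (crawtext stores its data in MongoDB and does proper article extraction); it assumes the `requests` library is installed and uses crude regex-based text and link extraction.
```
# Illustrative sketch of the crawl loop, NOT crawtext's real code.
import re
import requests

def crawl(seed_urls, keywords, max_pages=50):
    queue = list(seed_urls)          # starting URLs
    seen, results = set(), []
    while queue and len(results) < max_pages:
        url = queue.pop(0)
        if url in seen:
            continue
        seen.add(url)
        try:
            page = requests.get(url, timeout=5)   # 1. collect the page
        except requests.RequestException:
            continue                               # a real crawler would log this
        text = re.sub(r"<[^>]+>", " ", page.text)  # crude tag stripping
        # 2. keep the page only if the keywords appear in its text
        if all(k.lower() in text.lower() for k in keywords):
            results.append({"url": url, "text": text})
            # 3. add the outgoing links to the next list to be processed
            queue.extend(re.findall(r'href="(http[^"]+)"', page.text))
    return results

pages = crawl(["http://fr.wikipedia.org/wiki/DDT"], ["pesticides", "ddt"])
print(len(pages), "matching pages collected")
```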
Installation
------------
- First, you *must* have MongoDB installed:
* On Debian-based distributions:
```
$ sudo apt-get install mongodb
```
* On OS X, install it with Homebrew:
```
$ brew install mongodb
```
- Then create a virtual environment (recommended):
```
$mkvirtualenv crawtext_env
(crawtext_env)$
```
- Clone the source files with git:
```
(crawtext_env)$ git clone https://github.com/cortext/crawtext
```
- Install the requirements through pip:
```
pip install git+git://github.com/cortext/crawtext/tarball/master#egg=crawtext
```
That's all, folks: you now have a complete crawler working!
Getting started
====
1. Enter the project directory:
```
$cd crawtext
```
2. Create a new project
A project needs to be configured with 3 basic requirements:
* a name: the name of your project
e.g: pesticides
* a query: the search query or expression that has to be found in the web pages.
The query expression supports simple logical operators (AND, OR, NOT) and semantic operators ("", *)
e.g: pesticides AND DDT
* one or multiple URLs to start the crawl.
You have three options to add starting URLs to your project:
  * specifying one URL
  * giving a text file where URLs are stored line by line (a sample file is shown just after this list)
e.g: examples/seeds.txt
  * providing the access key to the Bing Search API to collect the 50 URLs given by the search result.
See how to get your [Bing API key](https://datamarket.azure.com/dataset/bing/search)
e.g: XVDVYU53456FDZ
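For reference, a seeds file such as examples/seeds.txt is simply a plain-text list of starting URLs, one per line. The URLs below are only placeholders:
```
http://fr.wikipedia.org/wiki/Pesticide
http://www.example.org/reports/ddt.html
```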
As an example, here I create a new project called pesticides seeded with the 50 URLs given by the Bing search results:
```
$ python crawtext.py pesticides --query="pesticides AND ddt" --key=XVDVYU53456
```
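The two other seeding options presumably rely on the --url= and --file= parameters listed in the "Managing the project" section below; as a hedged example, seeding the same project from a text file instead of Bing would look like:
```
$ python crawtext.py pesticides --query="pesticides AND ddt" --file=examples/seeds.txt
```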
Once the script has checked the starting URLs, some information about the project will be displayed.
If everything is OK, launch the crawl:
```
$ python crawtext.py pesticides start
```
Monitoring the project
====
See how your crawl is going using the report option:
```
$python crawtext.py pesticides report
```
The report will be stored in the dedicated directory of your project:
```
$cd projects/pesticides/report
```
If you prefer to receive an email, add your email address to the project configuration:
```
$python crawtext.py pesticides add [email protected]
$python crawtext.py pesticides report
```
Exporting results
====
Export the results of the crawl:
```
$python crawtext.py pesticides export
```
Results, logs and sources will be stored in the dedicated directory of your project:
```
$cd projects/pesticides
```
The default export format is JSON.
If you want an export in CSV:
```
$python crawtext.py pesticides export --format=csv
```
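Once exported, the JSON file can be inspected with a few lines of Python. This is only a sketch: the exact file name and location produced by the export command may differ, so adjust the path below to what you actually find in your project directory.
```
# Hedged example: list the collected pages from an exported JSON file.
import json

# Assumed path; check your projects/pesticides directory for the real file name.
with open("projects/pesticides/results.json") as f:
    results = json.load(f)

# Each result entry has at least "url" and "title" (see "Anatomy of a result entry").
for page in results:
    print(page["url"], "-", page.get("title", ""))
```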
Managing the project (advanced)
====
Crawtext gives you some facilities to add, modify or delete parameters.
- **Add** or **modify** parameters:
You can add or change the following parameters:
- user (--user=)
- file (--file=)
- url (--url=)
- query (--query=)
- key (--key=)
- depth (--depth=)
- format (--format=)
using the following syntax:
```
$python crawtext.py add --user="[email protected]"
$python crawtext.py add --depth=10
```
- **Remove** parameters:
You can remove the following parameters:
- user (-u)
- file (-f)
- url (--url=http://example.com)
- query (-q)
- key (-k)
- depth (-d)
using the following syntax:
```
$python crawtext.py delete -u
$python crawtext.py delete -d
```
- **Delete**:
You can delete the entire project. Every single dataset will be destroyed, so be careful!
```
$python crawtext.py pesticides delete
```
Outputs
===
Datasets are stored in JSON and zip format, in 3 collections, in the dedicated directory of your project:
* results
* sources
* logs
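Because the crawl also keeps its data in MongoDB while it runs, you can browse these collections directly. The sketch below is an assumption, not a documented interface: it supposes a local MongoDB instance, a database named after the project, and a recent version of pymongo, any of which may differ in your setup.
```
# Hedged sketch: inspect the crawl collections with pymongo.
from pymongo import MongoClient

# Assumption: local MongoDB and a database named after the project ("pesticides").
db = MongoClient("mongodb://127.0.0.1:27017")["pesticides"]

print(db["sources"].count_documents({}), "sources so far")
# Show a few failed pages, using the fields shown in "Anatomy of a logs entry".
for log in db["logs"].find({"status": False}).limit(5):
    print(log["url"], "-", log["msg"])
```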
Anatomy of a source entry
====
The sources of your project correspond to a data table that can be exported in JSON or CSV.
The stored information is the following:
date, msg and status are updated for each run of the crawl with the corresponding status message.
```
{
"_id" : ObjectId("546dc48edabe6e52c2e54908"),
"date" : [
ISODate("2014-11-20T11:38:06.397Z"),
ISODate("2014-11-20T11:38:06.424Z")
],
"depth" : 2,
"extension" : "org",
"file_type" : null,
"msg" : [
"Ok",
"Ok"
],
"netloc" : "fr.wikipedia.org",
"origin" : "bing",
"path" : "/wiki/Algue_verte",
"relative" : true,
"scheme" : "http",
"source_url" : "http://fr.wikipedia.org",
"status" : [
true,
true
],
"tld" : "wikipedia",
"url" : "http://fr.wikipedia.org/wiki/Algue_verte"
}
```
Anatomy of a result entry
====
```
{
"url" : "http://fr.wikipedia.org/wiki/Algue_verte",
"url_info" : { "origin" : "",
"status" : [ true ],
"extension" : "org",
"url" : "http://en.wikipedia.org/wiki/Algue_verte",
"netloc" : "en.wikipedia.org",
"source_url" : "http://en.wikipedia.org",
"relative" : false,
"file_type" : null,
"depth" : 2,
"tld" : "wikipedia",
"date" : [ { "$date" : 1416293382397 } ],
"path" : "/wiki/Responsible_Research_and_Innovation",
"scheme" : "http",
"msg" : [ "Ok" ] },
"title" : "Algues vertes",
"text" : "",
"html" : "",
"links" : ["http://www.dmu.ac.uk/study/study.aspx", "http://www.dmu.ac.uk/research", "http://www.dmu.ac.uk/international/en/international.aspx", "http://www.dmu.ac.uk/business-services/business-services.aspx", "http://www.dmu.ac.uk/about-dmu/about-dmu.aspx", "#", "/study/undergraduate-study/undergraduate-study.aspx", "/study/postgraduate-study/postgraduate-study.aspx", "/information-for-parents/information-for-parents.aspx", "/information-for-teachers/information-for-teachers.aspx", "/dmu-students/dmu-students.aspx", "/alumni", "/dmu-staff/dmu-staff.aspx", "/international/en/before-you-apply-to-study-at-dmu/your-country/country-information.aspx", "/business-services/access-our-students-and-graduates/access-our-students-and-graduates.aspx", "/about-dmu/news/contact-details.aspx", "/study/undergraduate-study/student-support/advice-and-guidance-for-mature-students/advice-and-guidance-for-mature-students.aspx", "http://www.dmuglobal.com/", "/about-dmu/events/events.aspx"]
}
```
Anatomy of a logs entry
====
```
{
"_id" : ObjectId("546dc72cdabe6e53c004a603"),
"url" : "http://france3-regions.francetvinfo.fr/bretagne/algues-vertes",
"status" : false,
"code" : 500,
"msg" : "Requests Error: HTTPConnectionPool(host='france3-regions.francetvinfo.fr', port=80): Read timed out. (read timeout=5)"
}
```
Sources
====
You can see the code [here](https://github.com/cortext/crawtext).
Getting help
====
Crawtext is a simple command-line module to crawl the web given a query.
This interface offers you a full set of options to set up a project.
If you need any help interacting with the shell command, just type the following to see all the options:
```
python crawtext.py --help
```
You can also open an issue or a pull request at http://github.com/cortext/crawtext/;
we will be happy to help with any configuration problem or desired feature.
COMMON PROBLEMS
----
* Mongo Database:
Sometimes, if you force-quit the program, you may get an error connecting to the database such as:
```
couldn't connect to server 127.0.0.1:27017 at src/mongo/shell/mongo.js:145
```
The way to repair it is to remove the mongod lock file and restart the service:
```
sudo rm /var/lib/mongodb/mongod.lock
sudo service mongodb restart
```
If that doesn't work, it means the index is corrupted, so you have to repair it:
```
sudo mongod --repair
```