A job scheduler and analysis tool for web scraping (and other) tasks.
Currently the following datasources are implemented:
"
- facebook posts and reactions: scrapes Facebook posts, comments and reactions (like, heart, etc.)
- gab (nazi-twitter): crawls posts for a given user
- google dorking: finds interesting files and downloads them
- json to csv: converts a JSON array into CSV
- mail: sends mails and files; mostly useful in pipelines
- masscan: UDP-based port scanner (requires Docker)
- motiondetection: runs motion analysis on a directory of video files
- onionlist: downloads the Tor catalogue from onionlist.org
- onions.danwin1210.de: downloads the Tor catalogue from danwin1210.de and creates a screenshot of each website in the result
- tiktok: gets video metadata per hashtag, downloads the videos and analyses their text using EasyOCR
- url: generic HTTP scraper
- urlscreenshotter: scrapes a comma-separated list of URLs and creates a screenshot of each of them
To add a new datasource:
- copy the template directory in ./jobs
- define the fields needed to start the job in fields.js (see the sketch after this list)
- a job can output one or multiple files
- do not create output directories; use archives instead
- name output files job_id.ext (e.g. job_id.json)
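The exact shape of fields.js is not documented here, so the following is only a minimal sketch of what a field definition for a new datasource could look like; the property names (name, label, type, required) and the export shape are assumptions for illustration, not the project's actual schema — check the template directory for the real one.

```js
// jobs/my_datasource/fields.js -- hypothetical sketch, not the project's real schema.
// Each entry describes one input that must be provided before the job can start.
module.exports = [
  { name: 'url',   label: 'Start URL',   type: 'text',   required: true  },
  { name: 'depth', label: 'Crawl depth', type: 'number', required: false },
];
```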
Features:
- simple configuration of actions/datasources, also from 3rd-party modules/repos
- job monitoring and scheduling
- schedule jobs
- SQLite, CSV and JSON browser
- separation of datasets/artifacts (one archive per crawl)
- scalable number of workers (also on other machines)
- GUI to create and schedule jobs
- displays pending, running and done jobs
- displays CSV and SQLite datasets
- can be distributed (workers and C&C on different locations/servers)
- jobs are managed through JSON files and can be distributed with an adapter like PouchDB (a minimal sketch follows below)
- multithreaded
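As a rough illustration of the "jobs as JSON files" idea, the snippet below writes a hypothetical job document named after its job_id. Every field name in it is an assumption made for illustration, not the project's actual job format.

```js
// Writes a hypothetical job document -- field names are assumptions, not the real schema.
const fs = require('fs');

const job = {
  job_id: '2a9f3c1e',                      // also used for output filenames, e.g. 2a9f3c1e.json
  datasource: 'url',                       // which ./jobs/<datasource> module should run
  status: 'pending',                       // e.g. pending -> running -> done
  fields: { url: 'https://example.com' },  // values for the datasource's fields.js inputs
  created: new Date().toISOString(),
};

// Such files could live on a shared disk or be synced between workers and the
// C&C server through an adapter like PouchDB.
fs.writeFileSync(`${job.job_id}.json`, JSON.stringify(job, null, 2));
```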
To install and run:
- `npm i`
- `npm run all`