Tekir Crawler is a high-performance, concurrent web crawler written in Go. It fetches web pages, parses HTML for metadata, discovers links, downloads images and favicons, respects robots.txt rules (unless overridden), enforces per-host rate limits, tracks backlinks, computes site scores, and persists structured data to JSON.
Features:
- Concurrent page and image crawling with worker pools
- Robots.txt compliance (optional override)
- Per-host rate limiting
- HTML parsing for titles, descriptions, keywords, images, and favicon
- Backlink tracking and site scoring
- Atomic JSON persistence to avoid data corruption
- Colorized, live status panel in terminal
- Rescore-only mode for existing data
Prerequisites:
- Go 1.18 or higher installed

Steps:

```sh
git clone https://github.com/tekircik/crawl.git
cd crawl
go mod tidy
go build -o tcrawl

# Optionally install to $GOPATH/bin:
go install
```
Run the crawler with:

```sh
./tcrawl [options] <start-url> [additional-urls...]
```
Common flags:
- `-v`: enable verbose info logging
- `--disobey`: skip robots.txt checks
- `-w N`: number of page crawler workers (default 5)
- `--iw N`: number of image download workers (default 3)
- `--score`: rescore existing JSON and exit
Examples:

```sh
# Crawl example.com with 10 workers, verbose, ignoring robots.txt:
./tcrawl --disobey -v --w 10 --iw 5 https://example.com

# Rescore saved data only:
./tcrawl --score
```
Project layout:

```
├── main.go              # Entry point: flags, logging, orchestration
├── README.md            # This technical documentation
├── crawler/             # Core crawling logic
│   ├── crawler.go       # HTTP fetch, robots.txt, rate limiting, linkQueue
│   ├── parser.go        # HTML parsing: metadata, links, image tasks
│   ├── metadata.go      # Data structures: PageMetadata, ImageTask
│   ├── visited.go       # Thread-safe URL deduplication
│   └── score.go         # Two-pass site scoring algorithm
├── docs/README.md       # In-depth project docs and architecture
├── favicons/            # Downloaded favicon assets
├── images/              # Downloaded image assets
├── crawled_data.json    # Persisted crawl metadata
├── visited_sites.json   # Persistent visited URLs store
└── crawl.log            # Error and optional info log output
```
Data structures, in `crawler/metadata.go`:
```go
// PageMetadata holds data extracted per page.
type PageMetadata struct {
    SiteAddress   string   `json:"url"`
    Title         string   `json:"title"`
    Description   string   `json:"description"`
    Tags          []string `json:"tags"`
    Crawled       int64    `json:"timestamp"`
    Images        []string `json:"images"`
    FaviconPath   string   `json:"favicon"`
    Backlinks     []string `json:"backlinks"`
    BacklinkCount int      `json:"backlink_count"`
    SiteScore     float64  `json:"site_score,omitempty"`
}

// ImageTask represents an image download job.
type ImageTask struct {
    ImageURL     string
    BaseFilename string
    Verbose      bool
}
```
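The `json` tags determine the field names written to `crawled_data.json`. As an illustration only (sample values invented, assumes `PageMetadata` from above is in scope):

```go
package crawler

import (
    "encoding/json"
    "fmt"
)

// printSample marshals a sparsely filled record. Nil slices serialize
// as null, and site_score is dropped entirely when zero because of
// the omitempty option on that field.
func printSample() {
    page := PageMetadata{
        SiteAddress: "https://example.com", // sample values, not real data
        Title:       "Example Domain",
        Crawled:     1700000000,
    }
    out, _ := json.MarshalIndent(page, "", "  ")
    fmt.Println(string(out)) // keys appear as "url", "title", "timestamp", ...
}
```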
Concurrency model:
- Page Worker Pool: `--w N` goroutines consume URLs from `worklist`, invoke `crawler.Crawl`, and emit links, metadata, and image tasks (sketched below).
- Image Worker Pool: `--iw N` goroutines consume `ImageTask` values from `imageQueue` and download files concurrently.
- Metadata Handler: a single goroutine aggregates `PageMetadata`, calculates backlinks, updates counts, and writes JSON periodically.
- Status Reporter: live terminal panel refreshed each second.
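A minimal, self-contained sketch of the two pools; the `crawler.Crawl` call and the download step are stubbed with prints, and channel buffer sizes are illustrative:

```go
package main

import (
    "fmt"
    "sync"
)

// ImageTask mirrors the struct from crawler/metadata.go.
type ImageTask struct {
    ImageURL     string
    BaseFilename string
    Verbose      bool
}

func main() {
    worklist := make(chan string, 100)
    imageQueue := make(chan ImageTask, 100)
    var wg sync.WaitGroup

    // Page worker pool (--w N): each goroutine would call crawler.Crawl
    // and feed linkQueue, metadataQueue, and imageQueue.
    for i := 0; i < 5; i++ {
        wg.Add(1)
        go func(id int) {
            defer wg.Done()
            for url := range worklist {
                fmt.Printf("page worker %d crawling %s\n", id, url)
            }
        }(i)
    }

    // Image worker pool (--iw N): downloads ImageTask files concurrently.
    for i := 0; i < 3; i++ {
        wg.Add(1)
        go func(id int) {
            defer wg.Done()
            for task := range imageQueue {
                fmt.Printf("image worker %d fetching %s\n", id, task.ImageURL)
            }
        }(i)
    }

    worklist <- "https://example.com"
    close(worklist)
    close(imageQueue)
    wg.Wait()
}
```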
Synchronization:
- `sync.Mutex` for the `collectedMetadata` slice
- `sync.RWMutex` for the shared backlink map
- Atomic counters for page and error metrics
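A hypothetical layout of those guards; names beyond `collectedMetadata` are illustrative, and the `sync/atomic` function style keeps it compatible with Go 1.18:

```go
package crawler

import (
    "sync"
    "sync/atomic"
)

var (
    metadataMu        sync.Mutex // guards collectedMetadata
    collectedMetadata []PageMetadata

    backlinkMu sync.RWMutex // guards the shared backlink map
    backlinks  = map[string][]string{}

    pagesCrawled int64 // page and error metrics, bumped atomically
    errorCount   int64
)

// recordPage appends metadata and bumps the page counter.
func recordPage(m PageMetadata) {
    metadataMu.Lock()
    collectedMetadata = append(collectedMetadata, m)
    metadataMu.Unlock()
    atomic.AddInt64(&pagesCrawled, 1)
}
```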
Data flow:
- Seed start URLs into `worklist`
- Page workers fetch and parse pages
- Extracted links are sent to `linkQueue`; new URLs are re-seeded (see the dedup sketch after this list)
- Metadata is sent to `metadataQueue` for aggregation
- Image URLs are queued in `imageQueue` and downloaded by image workers
- Metadata handler periodically persists `crawled_data.json`
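The re-seeding step relies on the thread-safe dedup in `visited.go`. A hypothetical version of that interplay (the `Visited` type and method names here are illustrative, not the actual API):

```go
package crawler

import "sync"

// Visited is an illustrative thread-safe set of seen URLs.
type Visited struct {
    mu   sync.Mutex
    seen map[string]bool
}

func NewVisited() *Visited {
    return &Visited{seen: make(map[string]bool)}
}

// FirstVisit reports whether url is new, marking it seen either way.
func (v *Visited) FirstVisit(url string) bool {
    v.mu.Lock()
    defer v.mu.Unlock()
    if v.seen[url] {
        return false
    }
    v.seen[url] = true
    return true
}

// reseed forwards only unseen links from linkQueue back into worklist.
func reseed(linkQueue <-chan string, worklist chan<- string, v *Visited) {
    for link := range linkQueue {
        if v.FirstVisit(link) {
            worklist <- link
        }
    }
}
```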
Robots.txt and rate limiting:
- Per-host robots directives are cached using `github.com/temoto/robotstxt` (both behaviors sketched below)
- A default 1-second delay is enforced per host
- Override robots and limits with `--disobey`
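A sketch of the cached robots check and the per-host delay. `robotstxt.FromBytes` and `TestAgent` are the library's real entry points; the cache layout, the user-agent string, and the error handling are assumptions:

```go
package crawler

import (
    "io"
    "net/http"
    "net/url"
    "sync"
    "time"

    "github.com/temoto/robotstxt"
)

var (
    robotsMu    sync.RWMutex
    robotsCache = map[string]*robotstxt.RobotsData{}

    delayMu   sync.Mutex
    nextFetch = map[string]time.Time{}
)

// allowed checks the cached robots.txt directives for rawURL's host.
func allowed(rawURL string) bool {
    u, err := url.Parse(rawURL)
    if err != nil {
        return false
    }
    robotsMu.RLock()
    data, ok := robotsCache[u.Host]
    robotsMu.RUnlock()
    if !ok {
        resp, err := http.Get(u.Scheme + "://" + u.Host + "/robots.txt")
        if err != nil {
            return true // unreachable robots.txt: treat as allow
        }
        defer resp.Body.Close()
        body, _ := io.ReadAll(resp.Body)
        data, err = robotstxt.FromBytes(body)
        if err != nil {
            return true
        }
        robotsMu.Lock()
        robotsCache[u.Host] = data
        robotsMu.Unlock()
    }
    return data.TestAgent(u.Path, "TekirCrawler") // agent name assumed
}

// waitTurn reserves the host's next 1-second slot, then sleeps until it.
func waitTurn(host string) {
    delayMu.Lock()
    next := nextFetch[host]
    if now := time.Now(); next.Before(now) {
        next = now
    }
    nextFetch[host] = next.Add(time.Second)
    delayMu.Unlock()
    time.Sleep(time.Until(next))
}
```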
Site scoring, implemented in `crawler/score.go`:
- Base Score: up to 5 points for tags, plus up to 8 bonus points for title, description, favicon, and images (max 13).
- Backlink Influence: a second pass blends 10% of the average backlink score into each page's score (one reading is sketched below).
- A progress bar is displayed during scoring.

Rescore-only mode: `--score` reads the existing JSON, applies scoring, writes it back, then exits.
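One plausible reading of the two passes, using the `PageMetadata` struct from earlier; the per-signal point split and the 90/10 blend are assumptions consistent with the description, not the literal `score.go` code:

```go
package crawler

// scorePages is an illustrative two-pass scorer.
func scorePages(pages []PageMetadata) {
    base := make(map[string]float64, len(pages))

    // Pass 1: base score from on-page signals (max 13).
    for i := range pages {
        p := &pages[i]
        tagPts := float64(len(p.Tags))
        if tagPts > 5 {
            tagPts = 5 // up to 5 points for tags
        }
        s := tagPts
        if p.Title != "" {
            s += 2 // bonus points; 2 each is an assumed split
        }
        if p.Description != "" {
            s += 2
        }
        if p.FaviconPath != "" {
            s += 2
        }
        if len(p.Images) > 0 {
            s += 2
        }
        base[p.SiteAddress] = s
    }

    // Pass 2: blend 10% of the average backlink score into each page.
    for i := range pages {
        p := &pages[i]
        score := base[p.SiteAddress]
        if len(p.Backlinks) > 0 {
            var sum float64
            for _, b := range p.Backlinks {
                sum += base[b]
            }
            avg := sum / float64(len(p.Backlinks))
            score = 0.9*score + 0.1*avg
        }
        p.SiteScore = score
    }
}
```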
Persistence:
- Metadata is saved every 10s via `saveMetadataToFile`
- Data is written to a temporary file, then `os.Rename` makes the swap atomic (pattern sketched below)
- Visited URLs persist in `visited_sites.json`
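The write-then-rename pattern itself comes from the README; this helper's name and body are an assumed, minimal form of it:

```go
package crawler

import (
    "encoding/json"
    "os"
)

// saveJSONAtomically marshals v to a temp file and swaps it into
// place. Because os.Rename is atomic on the same filesystem, readers
// never observe a half-written crawled_data.json.
func saveJSONAtomically(path string, v any) error {
    data, err := json.MarshalIndent(v, "", "  ")
    if err != nil {
        return err
    }
    tmp := path + ".tmp"
    if err := os.WriteFile(tmp, data, 0o644); err != nil {
        return err
    }
    return os.Rename(tmp, path)
}
```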
Logging:
- Error Logger: always writes to `crawl.log`
- Info Logger: writes to `crawl.log` only if `-v` is specified (example below)
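One way to wire that up with the standard `log` package; the function, prefixes, and flags here are illustrative, not the actual `main.go` code:

```go
package main

import (
    "io"
    "log"
    "os"
)

// newLoggers opens crawl.log and returns an always-on error logger
// plus an info logger that is silent unless verbose is set.
func newLoggers(verbose bool) (errorLog, infoLog *log.Logger) {
    f, err := os.OpenFile("crawl.log",
        os.O_APPEND|os.O_CREATE|os.O_WRONLY, 0o644)
    if err != nil {
        log.Fatal(err)
    }
    errorLog = log.New(f, "ERROR ", log.LstdFlags)
    var infoOut io.Writer = io.Discard // default: discard info lines
    if verbose {
        infoOut = f // -v routes info logging to crawl.log too
    }
    infoLog = log.New(infoOut, "INFO ", log.LstdFlags)
    return errorLog, infoLog
}

func main() {
    errorLog, infoLog := newLoggers(true)
    infoLog.Println("starting")
    errorLog.Println("example error")
}
```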
Tuning and testing:
- Adjust `--w` and `--iw` to match I/O- versus CPU-bound workloads
- Monitor the log for rate-limit warnings and increase the delay if needed
- Unit tests can target `parser.go`, `visited.go`, and `crawler.go` (example after this list)
- Extend with custom metadata plugins or distributed work queues
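For instance, a small test of the dedup behavior; note it targets the illustrative `Visited` API sketched earlier, so adjust it to the real `visited.go` surface:

```go
package crawler

import "testing"

func TestFirstVisitDeduplicates(t *testing.T) {
    v := NewVisited()
    if !v.FirstVisit("https://example.com") {
        t.Fatal("first visit should be treated as unseen")
    }
    if v.FirstVisit("https://example.com") {
        t.Fatal("repeat visit should be deduplicated")
    }
    if !v.FirstVisit("https://example.com/about") {
        t.Fatal("distinct URL should be unseen")
    }
}
```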
Distributed under the MIT License. See `LICENSE` for details.