Tekir Crawler

Overview

Tekir Crawler is a high-performance, concurrent web crawler written in Go. It fetches web pages, parses HTML for metadata, discovers links, downloads images and favicons, respects robots.txt rules (unless overridden), rate-limits requests per host, tracks backlinks, computes site scores, and persists structured data to JSON.

Features

  • Concurrent page and image crawling with worker pools
  • Robots.txt compliance (optional override)
  • Per-host rate limiting
  • HTML parsing for titles, descriptions, keywords, images, and favicon
  • Backlink tracking and site scoring
  • Atomic JSON persistence to avoid data corruption
  • Colorized, live status panel in terminal
  • Rescore-only mode for existing data

Installation

Prerequisites:

  • Go 1.18 or higher installed

Steps:

git clone https://github.com/tekircik/crawl.git
cd crawl
go mod tidy
go build -o tcrawl
# Optionally install to $GOPATH/bin:
go install

Usage

Run the crawler with:

./tcrawl [options] <start-url> [additional-urls...]

Common flags:

  • -v : enable verbose info logging
  • --disobey : skip robots.txt checks
  • -w N : number of page crawler workers (default 5)
  • --iw N : number of image download workers (default 3)
  • --score : rescore existing JSON and exit

Examples:

# Crawl example.com with 10 workers, verbose, ignoring robots.txt:
./tcrawl --disobey -v --w 10 --iw 5 https://example.com

# Rescore saved data only:
./tcrawl --score

Project Structure

├── main.go            # Entry point: flags, logging, orchestration
├── README.md          # This technical documentation
├── crawler/           # Core crawling logic
│   ├── crawler.go     # HTTP fetch, robots.txt, rate limiting, linkQueue
│   ├── parser.go      # HTML parsing: metadata, links, image tasks
│   ├── metadata.go    # Data structures: PageMetadata, ImageTask
│   ├── visited.go     # Thread-safe URL deduplication
│   └── score.go       # Two-pass site scoring algorithm
├── docs/README.md     # In-depth project docs and architecture
├── favicons/          # Downloaded favicon assets
├── images/            # Downloaded image assets
├── crawled_data.json  # Persisted crawl metadata
├── visited_sites.json # Persistent visited URLs store
└── crawl.log          # Error and optional info log output

Data Models

In crawler/metadata.go:

// PageMetadata holds data extracted per page
type PageMetadata struct {
  SiteAddress   string   `json:"url"`
  Title         string   `json:"title"`
  Description   string   `json:"description"`
  Tags          []string `json:"tags"`
  Crawled       int64    `json:"timestamp"`
  Images        []string `json:"images"`
  FaviconPath   string   `json:"favicon"`
  Backlinks     []string `json:"backlinks"`
  BacklinkCount int      `json:"backlink_count"`
  SiteScore     float64  `json:"site_score,omitempty"`
}

// ImageTask represents an image download job
type ImageTask struct {
  ImageURL     string
  BaseFilename string
  Verbose      bool
}
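
For reference, a minimal, self-contained sketch of how one record marshals into crawled_data.json, reusing the struct above with made-up values:

package main

import (
  "encoding/json"
  "fmt"
  "time"
)

// PageMetadata mirrors the definition in crawler/metadata.go shown above.
type PageMetadata struct {
  SiteAddress   string   `json:"url"`
  Title         string   `json:"title"`
  Description   string   `json:"description"`
  Tags          []string `json:"tags"`
  Crawled       int64    `json:"timestamp"`
  Images        []string `json:"images"`
  FaviconPath   string   `json:"favicon"`
  Backlinks     []string `json:"backlinks"`
  BacklinkCount int      `json:"backlink_count"`
  SiteScore     float64  `json:"site_score,omitempty"`
}

func main() {
  page := PageMetadata{
    SiteAddress: "https://example.com",
    Title:       "Example Domain",
    Tags:        []string{"example"},
    Crawled:     time.Now().Unix(),
  }
  out, _ := json.MarshalIndent(page, "", "  ")
  fmt.Println(string(out)) // site_score is omitted while zero (omitempty)
}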

Concurrency Model

  • Page Worker Pool: --w N goroutines consume URLs from worklist, invoke crawler.Crawl, and emit links, metadata, and image tasks.
  • Image Worker Pool: --iw N goroutines consume ImageTask from imageQueue and download files concurrently.
  • Metadata Handler: single goroutine aggregates PageMetadata, calculates backlinks, updates counts, and writes JSON periodically.
  • Status Reporter: live terminal panel refreshed each second.

Synchronization:

  • sync.Mutex for collectedMetadata slice
  • sync.RWMutex for shared backlink map
  • Atomic counters for page and error metrics
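
A minimal sketch of the page worker pool pattern described above (identifiers such as worklist and numWorkers are illustrative, not the exact names used in main.go):

package main

import (
  "fmt"
  "sync"
)

func main() {
  worklist := make(chan string, 100) // URLs waiting to be crawled
  var collected []string             // stands in for collectedMetadata
  var mu sync.Mutex                  // guards collected, as described above
  var wg sync.WaitGroup

  numWorkers := 5 // corresponds to the -w flag
  for i := 0; i < numWorkers; i++ {
    wg.Add(1)
    go func() {
      defer wg.Done()
      for url := range worklist {
        // Fetching and parsing would happen here; this sketch just records the URL.
        mu.Lock()
        collected = append(collected, url)
        mu.Unlock()
      }
    }()
  }

  for _, seed := range []string{"https://example.com", "https://example.org"} {
    worklist <- seed
  }
  close(worklist)
  wg.Wait()
  fmt.Println("crawled:", collected)
}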

Pipeline Overview

  1. Seed start URLs into worklist
  2. Page workers fetch and parse pages
  3. Extracted links sent to linkQueue, new URLs re-seeded
  4. Metadata sent to metadataQueue for aggregation
  5. Image URLs queued in imageQueue, downloaded by image workers
  6. Metadata handler periodically persists crawled_data.json
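
To illustrate step 3, a stdlib-only sketch of re-seeding only unseen links, with a simplified stand-in for the thread-safe dedup kept in crawler/visited.go (names are illustrative):

package main

import (
  "fmt"
  "sync"
)

// visitedSet is a simplified stand-in for the dedup store in crawler/visited.go.
type visitedSet struct {
  mu   sync.Mutex
  seen map[string]bool
}

// Add marks the URL as seen and reports whether it was new.
func (v *visitedSet) Add(url string) bool {
  v.mu.Lock()
  defer v.mu.Unlock()
  if v.seen[url] {
    return false
  }
  v.seen[url] = true
  return true
}

func main() {
  visited := &visitedSet{seen: make(map[string]bool)}
  discovered := []string{"https://example.com/a", "https://example.com/a", "https://example.com/b"}

  var reseeded []string
  for _, link := range discovered {
    if visited.Add(link) { // only URLs not seen before go back onto the worklist
      reseeded = append(reseeded, link)
    }
  }
  fmt.Println(reseeded) // [https://example.com/a https://example.com/b]
}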

Rate Limiting & Robots.txt

  • Cached per-host robots directives using github.com/temoto/robotstxt
  • Default 1‑second delay enforced per host
  • Override robots and limits with --disobey
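
A stdlib-only sketch of the per-host delay idea (the actual implementation in crawler/crawler.go, including its use of github.com/temoto/robotstxt, may differ):

package main

import (
  "fmt"
  "net/url"
  "sync"
  "time"
)

// hostLimiter enforces a minimum delay between requests to the same host.
type hostLimiter struct {
  mu    sync.Mutex
  last  map[string]time.Time
  delay time.Duration
}

func newHostLimiter(delay time.Duration) *hostLimiter {
  return &hostLimiter{last: make(map[string]time.Time), delay: delay}
}

// Wait blocks until the configured delay has passed since the previous
// request to the URL's host, then records the new request time.
func (l *hostLimiter) Wait(rawURL string) error {
  u, err := url.Parse(rawURL)
  if err != nil {
    return err
  }
  l.mu.Lock()
  wait := l.delay - time.Since(l.last[u.Host])
  if wait < 0 {
    wait = 0
  }
  l.last[u.Host] = time.Now().Add(wait)
  l.mu.Unlock()
  time.Sleep(wait)
  return nil
}

func main() {
  limiter := newHostLimiter(1 * time.Second) // default 1-second per-host delay
  for _, u := range []string{"https://example.com/a", "https://example.com/b"} {
    if err := limiter.Wait(u); err != nil {
      fmt.Println("skipping bad URL:", err)
      continue
    }
    fmt.Println(time.Now().Format("15:04:05.000"), "fetching", u)
  }
}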

Site Scoring Algorithm

Implemented in crawler/score.go:

  1. Base Score: up to 5 points for tags + up to 8 bonus points for title, description, favicon, and images (max 13).
  2. Backlink Influence: second pass blends 10% of average backlink scores into each page’s score.
  3. Progress bar displayed during scoring.

Rescore-only mode: --score reads existing JSON, applies scoring, writes back, then exits.
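
An illustrative sketch of the two passes, assuming the backlink blend is a 90/10 weighted mix; the exact weights and field checks live in crawler/score.go:

package main

import "fmt"

// pageInfo is a simplified view of PageMetadata with just what scoring needs.
type pageInfo struct {
  Tags      []string
  HasTitle  bool
  HasDesc   bool
  HasIcon   bool
  HasImages bool
  Backlinks []string
  Score     float64
}

// baseScore: up to 5 points from tags plus bonus points for title,
// description, favicon, and images (13 max). Weights here are illustrative.
func baseScore(p *pageInfo) float64 {
  s := float64(len(p.Tags))
  if s > 5 {
    s = 5
  }
  if p.HasTitle {
    s += 2
  }
  if p.HasDesc {
    s += 2
  }
  if p.HasIcon {
    s += 2
  }
  if p.HasImages {
    s += 2
  }
  return s
}

func main() {
  pages := map[string]*pageInfo{
    "https://a.example": {Tags: []string{"go", "crawler"}, HasTitle: true, HasDesc: true},
    "https://b.example": {HasTitle: true, Backlinks: []string{"https://a.example"}},
  }

  // Pass 1: base scores.
  base := make(map[string]float64)
  for url, p := range pages {
    base[url] = baseScore(p)
    p.Score = base[url]
  }

  // Pass 2: blend 10% of the average base score of each page's backlinkers.
  for _, p := range pages {
    if len(p.Backlinks) == 0 {
      continue
    }
    var sum float64
    for _, b := range p.Backlinks {
      sum += base[b] // backlinkers not crawled contribute 0
    }
    avg := sum / float64(len(p.Backlinks))
    p.Score = 0.9*p.Score + 0.1*avg
  }

  for url, p := range pages {
    fmt.Printf("%s -> %.2f\n", url, p.Score)
  }
}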

Persistence & Atomic Writes

  • Metadata saved every 10s via saveMetadataToFile
  • Writes to temporary file then os.Rename for atomicity
  • Persistent visited URLs stored in visited_sites.json
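
A sketch of the write-then-rename pattern; the actual saveMetadataToFile in main.go may differ in detail:

package main

import (
  "encoding/json"
  "log"
  "os"
  "path/filepath"
)

// saveJSONAtomically writes data to a temporary file in the same directory
// and then renames it over the target, so readers never see a partial file.
func saveJSONAtomically(path string, v interface{}) error {
  data, err := json.MarshalIndent(v, "", "  ")
  if err != nil {
    return err
  }
  tmp, err := os.CreateTemp(filepath.Dir(path), ".crawl-*.tmp")
  if err != nil {
    return err
  }
  defer os.Remove(tmp.Name()) // no-op after a successful rename
  if _, err := tmp.Write(data); err != nil {
    tmp.Close()
    return err
  }
  if err := tmp.Close(); err != nil {
    return err
  }
  return os.Rename(tmp.Name(), path)
}

func main() {
  pages := []map[string]string{{"url": "https://example.com", "title": "Example"}}
  if err := saveJSONAtomically("crawled_data.json", pages); err != nil {
    log.Fatal(err)
  }
}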

Logging

  • Error Logger: always writes to crawl.log
  • Info Logger: writes to crawl.log only if -v specified
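
A minimal sketch of that setup (logger names and prefixes are illustrative):

package main

import (
  "flag"
  "io"
  "log"
  "os"
)

func main() {
  verbose := flag.Bool("v", false, "enable verbose info logging")
  flag.Parse()

  logFile, err := os.OpenFile("crawl.log", os.O_APPEND|os.O_CREATE|os.O_WRONLY, 0644)
  if err != nil {
    log.Fatal(err)
  }
  defer logFile.Close()

  // Errors always go to crawl.log; info messages only when -v is set.
  errorLog := log.New(logFile, "ERROR ", log.LstdFlags)
  infoOut := io.Discard
  if *verbose {
    infoOut = logFile
  }
  infoLog := log.New(infoOut, "INFO  ", log.LstdFlags)

  infoLog.Println("crawler starting") // dropped unless -v was passed
  errorLog.Println("example error entry")
}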

Performance Tuning

  • Adjust --w and --iw to match I/O vs CPU workload
  • Monitor log for rate-limit warnings and increase delay if needed

Testing & Extensibility

  • Unit tests can target parser.go, visited.go, and crawler.go
  • Extend with custom metadata plugins or distributed work queues
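
As an example, a table-driven test for a hypothetical link-resolution helper of the kind parser.go needs; the helper is included so the test is self-contained and is not the actual function name in the repo:

package crawler

import (
  "net/url"
  "testing"
)

// resolveLink turns a relative href into an absolute URL against the page's base.
func resolveLink(base, href string) (string, error) {
  b, err := url.Parse(base)
  if err != nil {
    return "", err
  }
  h, err := url.Parse(href)
  if err != nil {
    return "", err
  }
  return b.ResolveReference(h).String(), nil
}

func TestResolveLink(t *testing.T) {
  cases := []struct {
    base, href, want string
  }{
    {"https://example.com/docs/", "guide.html", "https://example.com/docs/guide.html"},
    {"https://example.com/docs/", "/about", "https://example.com/about"},
    {"https://example.com/", "https://other.org/x", "https://other.org/x"},
  }
  for _, c := range cases {
    got, err := resolveLink(c.base, c.href)
    if err != nil {
      t.Fatalf("resolveLink(%q, %q): %v", c.base, c.href, err)
    }
    if got != c.want {
      t.Errorf("resolveLink(%q, %q) = %q, want %q", c.base, c.href, got, c.want)
    }
  }
}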

License

Distributed under the MIT License. See LICENSE for details.
