Tekir Crawler

Overview

Tekir Crawler is a high-performance, concurrent web crawler written in Go. It fetches web pages, parses HTML for metadata, discovers links, downloads images and favicons, respects robots.txt rules (unless overridden), rate-limits requests per host, tracks backlinks, computes site scores, and persists structured data to JSON.

Features

  • Concurrent page and image crawling with worker pools
  • Robots.txt compliance (optional override)
  • Per-host rate limiting
  • HTML parsing for titles, descriptions, keywords, images, and favicon
  • Backlink tracking and site scoring
  • Atomic JSON persistence to avoid data corruption
  • Colorized, live status panel in terminal
  • Rescore-only mode for existing data

Installation

Prerequisites:

  • Go 1.18 or higher installed

Steps:

git clone https://github.com/tekircik/crawl.git
cd crawl
go mod tidy
go build -o tcrawl
# Optionally install to $GOPATH/bin:
go install

Usage

Run the crawler with:

./tcrawl [options] <start-url> [additional-urls...]

Common flags:

  • -v : enable verbose info logging
  • --disobey : skip robots.txt checks
  • -w N : number of page crawler workers (default 5)
  • --iw N : number of image download workers (default 3)
  • --score : rescore existing JSON and exit

Examples:

# Crawl example.com with 10 workers, verbose, ignoring robots.txt:
./tcrawl --disobey -v --w 10 --iw 5 https://example.com

# Rescore saved data only:
./tcrawl --score

Project Structure

├── main.go            # Entry point: flags, logging, orchestration
├── README.md          # This technical documentation
├── crawler/           # Core crawling logic
│   ├── crawler.go     # HTTP fetch, robots.txt, rate limiting, linkQueue
│   ├── parser.go      # HTML parsing: metadata, links, image tasks
│   ├── metadata.go    # Data structures: PageMetadata, ImageTask
│   ├── visited.go     # Thread-safe URL deduplication
│   └── score.go       # Two-pass site scoring algorithm
├── docs/README.md     # In-depth project docs and architecture
├── favicons/          # Downloaded favicon assets
├── images/            # Downloaded image assets
├── crawled_data.json  # Persisted crawl metadata
├── visited_sites.json # Persistent visited URLs store
└── crawl.log          # Error and optional info log output

Data Models

In crawler/metadata.go:

// PageMetadata holds data extracted per page
type PageMetadata struct {
  SiteAddress   string   `json:"url"`
  Title         string   `json:"title"`
  Description   string   `json:"description"`
  Tags          []string `json:"tags"`
  Crawled       int64    `json:"timestamp"`
  Images        []string `json:"images"`
  FaviconPath   string   `json:"favicon"`
  Backlinks     []string `json:"backlinks"`
  BacklinkCount int      `json:"backlink_count"`
  SiteScore     float64  `json:"site_score,omitempty"`
}

// ImageTask represents an image download job
type ImageTask struct {
  ImageURL     string
  BaseFilename string
  Verbose      bool
}
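
For reference, a minimal, self-contained sketch of how one record marshals into crawled_data.json, reusing the struct above with made-up values:

package main

import (
  "encoding/json"
  "fmt"
  "time"
)

// PageMetadata mirrors the definition in crawler/metadata.go shown above.
type PageMetadata struct {
  SiteAddress   string   `json:"url"`
  Title         string   `json:"title"`
  Description   string   `json:"description"`
  Tags          []string `json:"tags"`
  Crawled       int64    `json:"timestamp"`
  Images        []string `json:"images"`
  FaviconPath   string   `json:"favicon"`
  Backlinks     []string `json:"backlinks"`
  BacklinkCount int      `json:"backlink_count"`
  SiteScore     float64  `json:"site_score,omitempty"`
}

func main() {
  page := PageMetadata{
    SiteAddress: "https://example.com",
    Title:       "Example Domain",
    Tags:        []string{"example"},
    Crawled:     time.Now().Unix(),
  }
  out, _ := json.MarshalIndent(page, "", "  ")
  fmt.Println(string(out)) // site_score is omitted while zero (omitempty)
}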

Concurrency Model

  • Page Worker Pool: --w N goroutines consume URLs from worklist, invoke crawler.Crawl, and emit links, metadata, and image tasks.
  • Image Worker Pool: --iw N goroutines consume ImageTask from imageQueue and download files concurrently.
  • Metadata Handler: single goroutine aggregates PageMetadata, calculates backlinks, updates counts, and writes JSON periodically.
  • Status Reporter: live terminal panel refreshed each second.

Synchronization:

  • sync.Mutex for collectedMetadata slice
  • sync.RWMutex for shared backlink map
  • Atomic counters for page and error metrics
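
A minimal sketch of the page worker pool pattern described above (identifiers such as worklist and numWorkers are illustrative, not the exact names used in main.go):

package main

import (
  "fmt"
  "sync"
)

func main() {
  worklist := make(chan string, 100) // URLs waiting to be crawled
  var collected []string             // stands in for collectedMetadata
  var mu sync.Mutex                  // guards collected, as described above
  var wg sync.WaitGroup

  numWorkers := 5 // corresponds to the -w flag
  for i := 0; i < numWorkers; i++ {
    wg.Add(1)
    go func() {
      defer wg.Done()
      for url := range worklist {
        // Fetching and parsing would happen here; this sketch just records the URL.
        mu.Lock()
        collected = append(collected, url)
        mu.Unlock()
      }
    }()
  }

  for _, seed := range []string{"https://example.com", "https://example.org"} {
    worklist <- seed
  }
  close(worklist)
  wg.Wait()
  fmt.Println("crawled:", collected)
}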

Pipeline Overview

  1. Seed start URLs into worklist
  2. Page workers fetch and parse pages
  3. Extracted links sent to linkQueue, new URLs re-seeded
  4. Metadata sent to metadataQueue for aggregation
  5. Image URLs queued in imageQueue, downloaded by image workers
  6. Metadata handler periodically persists crawled_data.json
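
To illustrate step 3, a stdlib-only sketch of re-seeding only unseen links, with a simplified stand-in for the thread-safe dedup kept in crawler/visited.go (names are illustrative):

package main

import (
  "fmt"
  "sync"
)

// visitedSet is a simplified stand-in for the dedup store in crawler/visited.go.
type visitedSet struct {
  mu   sync.Mutex
  seen map[string]bool
}

// Add marks the URL as seen and reports whether it was new.
func (v *visitedSet) Add(url string) bool {
  v.mu.Lock()
  defer v.mu.Unlock()
  if v.seen[url] {
    return false
  }
  v.seen[url] = true
  return true
}

func main() {
  visited := &visitedSet{seen: make(map[string]bool)}
  discovered := []string{"https://example.com/a", "https://example.com/a", "https://example.com/b"}

  var reseeded []string
  for _, link := range discovered {
    if visited.Add(link) { // only URLs not seen before go back onto the worklist
      reseeded = append(reseeded, link)
    }
  }
  fmt.Println(reseeded) // [https://example.com/a https://example.com/b]
}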

Rate Limiting & Robots.txt

  • Cached per-host robots directives using github.com/temoto/robotstxt
  • Default 1‑second delay enforced per host
  • Override robots and limits with --disobey
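
A stdlib-only sketch of the per-host delay idea (the actual implementation in crawler/crawler.go, including its use of github.com/temoto/robotstxt, may differ):

package main

import (
  "fmt"
  "net/url"
  "sync"
  "time"
)

// hostLimiter enforces a minimum delay between requests to the same host.
type hostLimiter struct {
  mu    sync.Mutex
  last  map[string]time.Time
  delay time.Duration
}

func newHostLimiter(delay time.Duration) *hostLimiter {
  return &hostLimiter{last: make(map[string]time.Time), delay: delay}
}

// Wait blocks until the configured delay has passed since the previous
// request to the URL's host, then records the new request time.
func (l *hostLimiter) Wait(rawURL string) error {
  u, err := url.Parse(rawURL)
  if err != nil {
    return err
  }
  l.mu.Lock()
  wait := l.delay - time.Since(l.last[u.Host])
  if wait < 0 {
    wait = 0
  }
  l.last[u.Host] = time.Now().Add(wait)
  l.mu.Unlock()
  time.Sleep(wait)
  return nil
}

func main() {
  limiter := newHostLimiter(1 * time.Second) // default 1-second per-host delay
  for _, u := range []string{"https://example.com/a", "https://example.com/b"} {
    if err := limiter.Wait(u); err != nil {
      fmt.Println("skipping bad URL:", err)
      continue
    }
    fmt.Println(time.Now().Format("15:04:05.000"), "fetching", u)
  }
}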

Site Scoring Algorithm

Implemented in crawler/score.go:

  1. Base Score: up to 5 points for tags + up to 8 bonus points for title, description, favicon, and images (max 13).
  2. Backlink Influence: second pass blends 10% of average backlink scores into each page’s score.
  3. Progress bar displayed during scoring.

Rescore-only mode: --score reads existing JSON, applies scoring, writes back, then exits.
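
An illustrative sketch of the two passes, assuming the backlink blend is a 90/10 weighted mix; the exact weights and field checks live in crawler/score.go:

package main

import "fmt"

// pageInfo is a simplified view of PageMetadata with just what scoring needs.
type pageInfo struct {
  Tags      []string
  HasTitle  bool
  HasDesc   bool
  HasIcon   bool
  HasImages bool
  Backlinks []string
  Score     float64
}

// baseScore: up to 5 points from tags plus bonus points for title,
// description, favicon, and images (13 max). Weights here are illustrative.
func baseScore(p *pageInfo) float64 {
  s := float64(len(p.Tags))
  if s > 5 {
    s = 5
  }
  if p.HasTitle {
    s += 2
  }
  if p.HasDesc {
    s += 2
  }
  if p.HasIcon {
    s += 2
  }
  if p.HasImages {
    s += 2
  }
  return s
}

func main() {
  pages := map[string]*pageInfo{
    "https://a.example": {Tags: []string{"go", "crawler"}, HasTitle: true, HasDesc: true},
    "https://b.example": {HasTitle: true, Backlinks: []string{"https://a.example"}},
  }

  // Pass 1: base scores.
  base := make(map[string]float64)
  for url, p := range pages {
    base[url] = baseScore(p)
    p.Score = base[url]
  }

  // Pass 2: blend 10% of the average base score of each page's backlinkers.
  for _, p := range pages {
    if len(p.Backlinks) == 0 {
      continue
    }
    var sum float64
    for _, b := range p.Backlinks {
      sum += base[b] // backlinkers not crawled contribute 0
    }
    avg := sum / float64(len(p.Backlinks))
    p.Score = 0.9*p.Score + 0.1*avg
  }

  for url, p := range pages {
    fmt.Printf("%s -> %.2f\n", url, p.Score)
  }
}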

Persistence & Atomic Writes

  • Metadata saved every 10s via saveMetadataToFile
  • Writes to temporary file then os.Rename for atomicity
  • Persistent visited URLs stored in visited_sites.json
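
A sketch of the write-then-rename pattern; the actual saveMetadataToFile in main.go may differ in detail:

package main

import (
  "encoding/json"
  "log"
  "os"
  "path/filepath"
)

// saveJSONAtomically writes data to a temporary file in the same directory
// and then renames it over the target, so readers never see a partial file.
func saveJSONAtomically(path string, v interface{}) error {
  data, err := json.MarshalIndent(v, "", "  ")
  if err != nil {
    return err
  }
  tmp, err := os.CreateTemp(filepath.Dir(path), ".crawl-*.tmp")
  if err != nil {
    return err
  }
  defer os.Remove(tmp.Name()) // no-op after a successful rename
  if _, err := tmp.Write(data); err != nil {
    tmp.Close()
    return err
  }
  if err := tmp.Close(); err != nil {
    return err
  }
  return os.Rename(tmp.Name(), path)
}

func main() {
  pages := []map[string]string{{"url": "https://example.com", "title": "Example"}}
  if err := saveJSONAtomically("crawled_data.json", pages); err != nil {
    log.Fatal(err)
  }
}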

Logging

  • Error Logger: always writes to crawl.log
  • Info Logger: writes to crawl.log only if -v specified
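
A minimal sketch of that setup (logger names and prefixes are illustrative):

package main

import (
  "flag"
  "io"
  "log"
  "os"
)

func main() {
  verbose := flag.Bool("v", false, "enable verbose info logging")
  flag.Parse()

  logFile, err := os.OpenFile("crawl.log", os.O_APPEND|os.O_CREATE|os.O_WRONLY, 0644)
  if err != nil {
    log.Fatal(err)
  }
  defer logFile.Close()

  // Errors always go to crawl.log; info messages only when -v is set.
  errorLog := log.New(logFile, "ERROR ", log.LstdFlags)
  infoOut := io.Discard
  if *verbose {
    infoOut = logFile
  }
  infoLog := log.New(infoOut, "INFO  ", log.LstdFlags)

  infoLog.Println("crawler starting") // dropped unless -v was passed
  errorLog.Println("example error entry")
}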

Performance Tuning

  • Adjust --w and --iw to match I/O vs CPU workload
  • Monitor log for rate-limit warnings and increase delay if needed

Testing & Extensibility

  • Unit tests can target parser.go, visited.go, and crawler.go
  • Extend with custom metadata plugins or distributed work queues
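
As an example, a table-driven test for a hypothetical link-resolution helper of the kind parser.go needs; the helper is included so the test is self-contained and is not the actual function name in the repo:

package crawler

import (
  "net/url"
  "testing"
)

// resolveLink turns a relative href into an absolute URL against the page's base.
func resolveLink(base, href string) (string, error) {
  b, err := url.Parse(base)
  if err != nil {
    return "", err
  }
  h, err := url.Parse(href)
  if err != nil {
    return "", err
  }
  return b.ResolveReference(h).String(), nil
}

func TestResolveLink(t *testing.T) {
  cases := []struct {
    base, href, want string
  }{
    {"https://example.com/docs/", "guide.html", "https://example.com/docs/guide.html"},
    {"https://example.com/docs/", "/about", "https://example.com/about"},
    {"https://example.com/", "https://other.org/x", "https://other.org/x"},
  }
  for _, c := range cases {
    got, err := resolveLink(c.base, c.href)
    if err != nil {
      t.Fatalf("resolveLink(%q, %q): %v", c.base, c.href, err)
    }
    if got != c.want {
      t.Errorf("resolveLink(%q, %q) = %q, want %q", c.base, c.href, got, c.want)
    }
  }
}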

License

Distributed under the MIT License. See LICENSE for details.
