Wuzzy Search: A Technical Deep Dive into Decentralized Web Crawling and Search on AO

Demo Deployment

Demo Video: https://www.youtube.com/watch?v=V_kjW0JcP1U
Hyperbeam node: https://wuzzy-hyperbeam.hel.memeticblock.net
Primary Nest Process Id: 1X_nt5ctoJTw6Dc3M6x34_lFTWGTl-jhW8MY7Vff4fA
Crawler 1 Process Id: 8Or2nlJ-8jDzen1FWC1h11E6IYvXOROC2oRjBJLTKE8
Crawler 2 Process Id: CbumKxXkF3VflozGBWMfbrVfB7s48lJPb3mgw6wAo7g
Crawler 3 Process Id: ttBl4Hru8cdVhXsFoFdyBIDS71nch6IKB0n5dNJUqfc

Introduction

Wuzzy Search represents a groundbreaking approach to web indexing and search, built entirely on the AO (Actor Oriented) protocol and the Arweave ecosystem. Unlike traditional search engines that rely on centralized infrastructure, Wuzzy implements a fully decentralized architecture where autonomous agents crawl, index, and serve search results in a distributed manner.

System Architecture

Core Components

Wuzzy Search consists of two primary components that work in concert to provide decentralized search capabilities:

Wuzzy Nest: The central search index that acts as the repository for all crawled content. Each Nest maintains its own document database, search algorithms, and access control mechanisms. Multiple Nests can exist independently, creating a distributed network of specialized search indexes.

Wuzzy Crawler: Autonomous processes that fetch, parse, and process web content. Crawlers can be spawned independently and configured to focus on specific domains or content types. They operate autonomously, making decisions about which links to follow and what content to index.

Distributed Architecture Model

The system employs a hub-and-spoke model where:

A Nest serves as the central hub for document storage and search operations
Multiple Crawlers act as distributed spokes, each responsible for crawling specific domains or URL patterns
Crawlers submit processed documents to their designated Nest
Users query the Nest directly for search results
Each component operates as an independent AO process with its own state and permissions

This architecture enables horizontal scaling by adding more Crawlers to handle increased crawling loads or by deploying specialized Crawlers for different content types or domains.

Technical Implementation

AO Process Architecture

Both Nest and Crawler components are implemented as AO processes written in Lua, leveraging the hyper-aos runtime environment. The choice of AO provides several technical advantages:

Message-based Communication: All interactions between components occur through structured messages, ensuring loose coupling and fault tolerance
State Persistence: Process state is automatically persisted on Arweave, providing durability and recoverability
Parallel Execution: Multiple processes can operate simultaneously without coordination overhead
Built-in Security: AO's permission model provides natural access control mechanisms

Search Algorithms

Wuzzy implements two distinct search algorithms to cater to different use cases:

Simple Search Algorithm

The simple search algorithm provides fast, straightforward text matching:

Case-insensitive pattern matching across document content
Query terms are converted to regex patterns supporting both uppercase and lowercase variants
Results are ranked by term frequency (number of matches in the document)
Context highlighting provides ~100 characters before and after matches
Optimized for speed and simplicity, making it ideal for real-time search interfaces

BM25 (Best Matching 25) Algorithm

The BM25 implementation provides sophisticated relevance ranking based on term frequency and document length normalization:

BM25 Score = IDF × (tf × (k₁ + 1)) / (tf + k₁ × (1 - b + b × (|d| / avgdl)))

Where:

IDF (Inverse Document Frequency): log(1 + (N - df + 0.5) / (df + 0.5))
tf: Term frequency in the document
k₁: Controls term frequency saturation (default: 1.2)
b: Controls document length normalization (default: 0.75)
|d|: Document length
avgdl: Average document length across the corpus
N: Total number of documents
df: Number of documents containing the term

The implementation includes several optimizations:

Pre-calculation of average document length for efficient scoring
Dynamic IDF computation based on actual term occurrence
Memory-efficient storage of document statistics
Configurable parameters for fine-tuning relevance scoring

Content Processing Pipeline

The Crawler implements a sophisticated content processing pipeline that handles multiple content types and protocols:

HTML Processing

Parsing: Uses a custom HTML parser to extract structured content from web pages
Content Extraction: Removes HTML tags while preserving text content and structure
Entity Decoding: Converts HTML entities (&, <, etc.) to their text equivalents
Metadata Extraction: Pulls title tags and meta descriptions for enhanced search results
Link Discovery: Identifies and extracts all anchor tags for potential crawling targets

Link Processing and Domain Management

URL Normalization: Converts relative URLs to absolute URLs and removes query parameters and fragments
Domain Filtering: Only follows links within configured crawl domains to prevent crawler drift
Deduplication: Maintains an in-memory cache of previously crawled URLs to avoid redundant processing
Queue Management: Implements a breadth-first crawling strategy using internal queue structures

Content Type Support

Current implementation supports:

text/html: Full HTML parsing with link extraction and content processing
text/plain: Direct text content indexing
Future Extensions: The architecture cam support pluggable content processors for images, audio, video, and other media types

Protocol Support and Web Integration

Wuzzy provides comprehensive protocol support spanning traditional web and permaweb infrastructure:

Traditional Web Protocols

HTTP/HTTPS: Standard web crawling through Hyperbeam's relay device
Error Handling: Gracefully ignores missing or empty pages or content

Permaweb Integration

ARNS (Arweave Name System): Native support for arns:// URLs enabling decentralized domain resolution
Direct Arweave Access: ar:// protocol support for accessing content by transaction ID
Permanent Archival: All crawled content is automatically stored on Arweave, ensuring permanent accessibility

Access Control and Security

The system implements a comprehensive role-based access control (RBAC) system:

Permission Roles

Owner: Full administrative access to all operations
Admin: Administrative access with delegation capabilities
Component-specific roles: Granular permissions for operations like Index-Document, Add-Crawler, Request-Crawl

Security Features

Message Authentication: All inter-process communication is authenticated through AO's built-in security model
Permission Validation: Every operation validates sender permissions before execution
State Protection: Process state can only be modified by authorized principals
Audit Trail: All operations are logged in the AO message history for transparency

API Design and Integration

RESTful Query Interface

Wuzzy provides HTTP-based search queries through the Hyperbeam infrastructure:

GET /{nestId}/now/~lua@5.3a&module={viewModule}/search_bm25/serialize~json@1.0?query={searchTerm}

Response Format

Search results are returned as flat JSON structures optimized for client consumption:

interface WuzzyNestSearchResults {
  search_type: string
  total_hits: number
  has_more: 'true' | 'false'
  from: number
  page_size: number
  result_count: number
  
  // Flattened result arrays
  [key: `${number}_docid`]: string
  [key: `${number}_title`]: string | undefined
  [key: `${number}_description`]: string | undefined
  [key: `${number}_content`]: string
  [key: `${number}_score`]: number
}

Pagination Support

Offset-based Pagination: Uses from parameter for result offsetting
Configurable Page Size: Default 10 results, customizable via parameters
Result Metadata: Includes total hit counts and pagination status

Performance and Scalability

Horizontal Scaling Model

Wuzzy's architecture enables several scaling strategies:

Crawler Scaling

Domain Specialization: Deploy dedicated Crawlers for specific domains or content types
Geographic Distribution: Distribute Crawlers across different regions for improved latency
Load Balancing: Multiple Crawlers can work on different URL sets simultaneously
Resource Optimization: Each Crawler operates independently with its own resource allocation

Index Scaling

Multiple Nests: Deploy specialized Nests for different content categories
Federated Search: Implement client-side search federation across multiple Nests
Content Partitioning: Distribute documents across Nests based on domain, content type, or other criteria

Storage and Memory Management

State Optimization

Incremental Updates: Document updates modify only changed content rather than full re-indexing
Memory-efficient Storage: Optimized data structures for large-scale document collections

Caching Strategies

Crawl Memory: In-memory tracking of recently crawled URLs to prevent duplication
Queue Management: Efficient queue structures for managing large crawl backlogs

Future Extensions and Roadmap

Planned Enhancements

Advanced Content Processing

Multi-media Support: Processors for images, videos, and audio content
Content Classification: AI-powered content categorization, summarization, and tagging

Search Improvements

Semantic Search: Integration with LLM-based semantic understanding
Query Expansion: Automatic query term expansion for improved recall

Operational Features

Monitoring and Analytics: Comprehensive dashboard for system health and performance
Auto-scaling: Dynamic Crawler spawns from Nests based on crawl task depth

Conclusion

Wuzzy Search represents a paradigm shift toward truly decentralized web search infrastructure. By leveraging the AO protocol and Arweave's permanent storage capabilities, it creates a censorship-resistant, globally accessible search platform that operates without centralized control points.

The system's modular architecture, sophisticated search algorithms, and comprehensive protocol support make it suitable for a wide range of applications, from specialized domain search to general-purpose web discovery. Its integration with the permaweb ensures that indexed content remains permanently accessible, creating a lasting digital archive of human knowledge.

As the permaweb ecosystem continues to evolve, Wuzzy Search provides the essential infrastructure for content discovery and access, enabling the next generation of decentralized applications and services. The system's open architecture and extensible design position it as a foundational component for the decentralized web's future development.

The technical innovations demonstrated in Wuzzy Search, from autonomous agent-based crawling to decentralized search ranking, establish new patterns for building resilient, scalable infrastructure on blockchain-based platforms. This approach offers a compelling alternative to traditional search architectures while maintaining the performance and features users expect from modern search engines.

FilesExpand file tree

summary.md

Latest commit

History