Skip to content

Latest commit

 

History

History
221 lines (159 loc) · 11.5 KB

File metadata and controls

221 lines (159 loc) · 11.5 KB

Wuzzy Search: A Technical Deep Dive into Decentralized Web Crawling and Search on AO

Demo Deployment

Introduction

Wuzzy Search represents a groundbreaking approach to web indexing and search, built entirely on the AO (Actor Oriented) protocol and the Arweave ecosystem. Unlike traditional search engines that rely on centralized infrastructure, Wuzzy implements a fully decentralized architecture where autonomous agents crawl, index, and serve search results in a distributed manner.

System Architecture

Core Components

Wuzzy Search consists of two primary components that work in concert to provide decentralized search capabilities:

Wuzzy Nest: The central search index that acts as the repository for all crawled content. Each Nest maintains its own document database, search algorithms, and access control mechanisms. Multiple Nests can exist independently, creating a distributed network of specialized search indexes.

Wuzzy Crawler: Autonomous processes that fetch, parse, and process web content. Crawlers can be spawned independently and configured to focus on specific domains or content types. They operate autonomously, making decisions about which links to follow and what content to index.

Distributed Architecture Model

The system employs a hub-and-spoke model where:

  • A Nest serves as the central hub for document storage and search operations
  • Multiple Crawlers act as distributed spokes, each responsible for crawling specific domains or URL patterns
  • Crawlers submit processed documents to their designated Nest
  • Users query the Nest directly for search results
  • Each component operates as an independent AO process with its own state and permissions

This architecture enables horizontal scaling by adding more Crawlers to handle increased crawling loads or by deploying specialized Crawlers for different content types or domains.

Technical Implementation

AO Process Architecture

Both Nest and Crawler components are implemented as AO processes written in Lua, leveraging the hyper-aos runtime environment. The choice of AO provides several technical advantages:

  • Message-based Communication: All interactions between components occur through structured messages, ensuring loose coupling and fault tolerance
  • State Persistence: Process state is automatically persisted on Arweave, providing durability and recoverability
  • Parallel Execution: Multiple processes can operate simultaneously without coordination overhead
  • Built-in Security: AO's permission model provides natural access control mechanisms

Search Algorithms

Wuzzy implements two distinct search algorithms to cater to different use cases:

Simple Search Algorithm

The simple search algorithm provides fast, straightforward text matching:

  • Case-insensitive pattern matching across document content
  • Query terms are converted to regex patterns supporting both uppercase and lowercase variants
  • Results are ranked by term frequency (number of matches in the document)
  • Context highlighting provides ~100 characters before and after matches
  • Optimized for speed and simplicity, making it ideal for real-time search interfaces

BM25 (Best Matching 25) Algorithm

The BM25 implementation provides sophisticated relevance ranking based on term frequency and document length normalization:

BM25 Score = IDF × (tf × (k₁ + 1)) / (tf + k₁ × (1 - b + b × (|d| / avgdl)))

Where:

  • IDF (Inverse Document Frequency): log(1 + (N - df + 0.5) / (df + 0.5))
  • tf: Term frequency in the document
  • k₁: Controls term frequency saturation (default: 1.2)
  • b: Controls document length normalization (default: 0.75)
  • |d|: Document length
  • avgdl: Average document length across the corpus
  • N: Total number of documents
  • df: Number of documents containing the term

The implementation includes several optimizations:

  • Pre-calculation of average document length for efficient scoring
  • Dynamic IDF computation based on actual term occurrence
  • Memory-efficient storage of document statistics
  • Configurable parameters for fine-tuning relevance scoring

Content Processing Pipeline

The Crawler implements a sophisticated content processing pipeline that handles multiple content types and protocols:

HTML Processing

  • Parsing: Uses a custom HTML parser to extract structured content from web pages
  • Content Extraction: Removes HTML tags while preserving text content and structure
  • Entity Decoding: Converts HTML entities (&, <, etc.) to their text equivalents
  • Metadata Extraction: Pulls title tags and meta descriptions for enhanced search results
  • Link Discovery: Identifies and extracts all anchor tags for potential crawling targets

Link Processing and Domain Management

  • URL Normalization: Converts relative URLs to absolute URLs and removes query parameters and fragments
  • Domain Filtering: Only follows links within configured crawl domains to prevent crawler drift
  • Deduplication: Maintains an in-memory cache of previously crawled URLs to avoid redundant processing
  • Queue Management: Implements a breadth-first crawling strategy using internal queue structures

Content Type Support

Current implementation supports:

  • text/html: Full HTML parsing with link extraction and content processing
  • text/plain: Direct text content indexing
  • Future Extensions: The architecture cam support pluggable content processors for images, audio, video, and other media types

Protocol Support and Web Integration

Wuzzy provides comprehensive protocol support spanning traditional web and permaweb infrastructure:

Traditional Web Protocols

  • HTTP/HTTPS: Standard web crawling through Hyperbeam's relay device
  • Error Handling: Gracefully ignores missing or empty pages or content

Permaweb Integration

  • ARNS (Arweave Name System): Native support for arns:// URLs enabling decentralized domain resolution
  • Direct Arweave Access: ar:// protocol support for accessing content by transaction ID
  • Permanent Archival: All crawled content is automatically stored on Arweave, ensuring permanent accessibility

Access Control and Security

The system implements a comprehensive role-based access control (RBAC) system:

Permission Roles

  • Owner: Full administrative access to all operations
  • Admin: Administrative access with delegation capabilities
  • Component-specific roles: Granular permissions for operations like Index-Document, Add-Crawler, Request-Crawl

Security Features

  • Message Authentication: All inter-process communication is authenticated through AO's built-in security model
  • Permission Validation: Every operation validates sender permissions before execution
  • State Protection: Process state can only be modified by authorized principals
  • Audit Trail: All operations are logged in the AO message history for transparency

API Design and Integration

RESTful Query Interface

Wuzzy provides HTTP-based search queries through the Hyperbeam infrastructure:

GET /{nestId}/now/~lua@5.3a&module={viewModule}/search_bm25/serialize~json@1.0?query={searchTerm}

Response Format

Search results are returned as flat JSON structures optimized for client consumption:

interface WuzzyNestSearchResults {
  search_type: string
  total_hits: number
  has_more: 'true' | 'false'
  from: number
  page_size: number
  result_count: number
  
  // Flattened result arrays
  [key: `${number}_docid`]: string
  [key: `${number}_title`]: string | undefined
  [key: `${number}_description`]: string | undefined
  [key: `${number}_content`]: string
  [key: `${number}_score`]: number
}

Pagination Support

  • Offset-based Pagination: Uses from parameter for result offsetting
  • Configurable Page Size: Default 10 results, customizable via parameters
  • Result Metadata: Includes total hit counts and pagination status

Performance and Scalability

Horizontal Scaling Model

Wuzzy's architecture enables several scaling strategies:

Crawler Scaling

  • Domain Specialization: Deploy dedicated Crawlers for specific domains or content types
  • Geographic Distribution: Distribute Crawlers across different regions for improved latency
  • Load Balancing: Multiple Crawlers can work on different URL sets simultaneously
  • Resource Optimization: Each Crawler operates independently with its own resource allocation

Index Scaling

  • Multiple Nests: Deploy specialized Nests for different content categories
  • Federated Search: Implement client-side search federation across multiple Nests
  • Content Partitioning: Distribute documents across Nests based on domain, content type, or other criteria

Storage and Memory Management

State Optimization

  • Incremental Updates: Document updates modify only changed content rather than full re-indexing
  • Memory-efficient Storage: Optimized data structures for large-scale document collections

Caching Strategies

  • Crawl Memory: In-memory tracking of recently crawled URLs to prevent duplication
  • Queue Management: Efficient queue structures for managing large crawl backlogs

Future Extensions and Roadmap

Planned Enhancements

Advanced Content Processing

  • Multi-media Support: Processors for images, videos, and audio content
  • Content Classification: AI-powered content categorization, summarization, and tagging

Search Improvements

  • Semantic Search: Integration with LLM-based semantic understanding
  • Query Expansion: Automatic query term expansion for improved recall

Operational Features

  • Monitoring and Analytics: Comprehensive dashboard for system health and performance
  • Auto-scaling: Dynamic Crawler spawns from Nests based on crawl task depth

Conclusion

Wuzzy Search represents a paradigm shift toward truly decentralized web search infrastructure. By leveraging the AO protocol and Arweave's permanent storage capabilities, it creates a censorship-resistant, globally accessible search platform that operates without centralized control points.

The system's modular architecture, sophisticated search algorithms, and comprehensive protocol support make it suitable for a wide range of applications, from specialized domain search to general-purpose web discovery. Its integration with the permaweb ensures that indexed content remains permanently accessible, creating a lasting digital archive of human knowledge.

As the permaweb ecosystem continues to evolve, Wuzzy Search provides the essential infrastructure for content discovery and access, enabling the next generation of decentralized applications and services. The system's open architecture and extensible design position it as a foundational component for the decentralized web's future development.

The technical innovations demonstrated in Wuzzy Search, from autonomous agent-based crawling to decentralized search ranking, establish new patterns for building resilient, scalable infrastructure on blockchain-based platforms. This approach offers a compelling alternative to traditional search architectures while maintaining the performance and features users expect from modern search engines.