- Demo Video: https://www.youtube.com/watch?v=V_kjW0JcP1U
- Hyperbeam node: https://wuzzy-hyperbeam.hel.memeticblock.net
- Primary Nest Process Id:
1X_nt5ctoJTw6Dc3M6x34_lFTWGTl-jhW8MY7Vff4fA - Crawler 1 Process Id:
8Or2nlJ-8jDzen1FWC1h11E6IYvXOROC2oRjBJLTKE8 - Crawler 2 Process Id:
CbumKxXkF3VflozGBWMfbrVfB7s48lJPb3mgw6wAo7g - Crawler 3 Process Id:
ttBl4Hru8cdVhXsFoFdyBIDS71nch6IKB0n5dNJUqfc
Wuzzy Search represents a groundbreaking approach to web indexing and search, built entirely on the AO (Actor Oriented) protocol and the Arweave ecosystem. Unlike traditional search engines that rely on centralized infrastructure, Wuzzy implements a fully decentralized architecture where autonomous agents crawl, index, and serve search results in a distributed manner.
Wuzzy Search consists of two primary components that work in concert to provide decentralized search capabilities:
Wuzzy Nest: The central search index that acts as the repository for all crawled content. Each Nest maintains its own document database, search algorithms, and access control mechanisms. Multiple Nests can exist independently, creating a distributed network of specialized search indexes.
Wuzzy Crawler: Autonomous processes that fetch, parse, and process web content. Crawlers can be spawned independently and configured to focus on specific domains or content types. They operate autonomously, making decisions about which links to follow and what content to index.
The system employs a hub-and-spoke model where:
- A Nest serves as the central hub for document storage and search operations
- Multiple Crawlers act as distributed spokes, each responsible for crawling specific domains or URL patterns
- Crawlers submit processed documents to their designated Nest
- Users query the Nest directly for search results
- Each component operates as an independent AO process with its own state and permissions
This architecture enables horizontal scaling by adding more Crawlers to handle increased crawling loads or by deploying specialized Crawlers for different content types or domains.
Both Nest and Crawler components are implemented as AO processes written in Lua, leveraging the hyper-aos runtime environment. The choice of AO provides several technical advantages:
- Message-based Communication: All interactions between components occur through structured messages, ensuring loose coupling and fault tolerance
- State Persistence: Process state is automatically persisted on Arweave, providing durability and recoverability
- Parallel Execution: Multiple processes can operate simultaneously without coordination overhead
- Built-in Security: AO's permission model provides natural access control mechanisms
Wuzzy implements two distinct search algorithms to cater to different use cases:
The simple search algorithm provides fast, straightforward text matching:
- Case-insensitive pattern matching across document content
- Query terms are converted to regex patterns supporting both uppercase and lowercase variants
- Results are ranked by term frequency (number of matches in the document)
- Context highlighting provides ~100 characters before and after matches
- Optimized for speed and simplicity, making it ideal for real-time search interfaces
The BM25 implementation provides sophisticated relevance ranking based on term frequency and document length normalization:
BM25 Score = IDF × (tf × (k₁ + 1)) / (tf + k₁ × (1 - b + b × (|d| / avgdl)))
Where:
- IDF (Inverse Document Frequency):
log(1 + (N - df + 0.5) / (df + 0.5)) - tf: Term frequency in the document
- k₁: Controls term frequency saturation (default: 1.2)
- b: Controls document length normalization (default: 0.75)
- |d|: Document length
- avgdl: Average document length across the corpus
- N: Total number of documents
- df: Number of documents containing the term
The implementation includes several optimizations:
- Pre-calculation of average document length for efficient scoring
- Dynamic IDF computation based on actual term occurrence
- Memory-efficient storage of document statistics
- Configurable parameters for fine-tuning relevance scoring
The Crawler implements a sophisticated content processing pipeline that handles multiple content types and protocols:
- Parsing: Uses a custom HTML parser to extract structured content from web pages
- Content Extraction: Removes HTML tags while preserving text content and structure
- Entity Decoding: Converts HTML entities (&, <, etc.) to their text equivalents
- Metadata Extraction: Pulls title tags and meta descriptions for enhanced search results
- Link Discovery: Identifies and extracts all anchor tags for potential crawling targets
- URL Normalization: Converts relative URLs to absolute URLs and removes query parameters and fragments
- Domain Filtering: Only follows links within configured crawl domains to prevent crawler drift
- Deduplication: Maintains an in-memory cache of previously crawled URLs to avoid redundant processing
- Queue Management: Implements a breadth-first crawling strategy using internal queue structures
Current implementation supports:
text/html: Full HTML parsing with link extraction and content processingtext/plain: Direct text content indexing- Future Extensions: The architecture cam support pluggable content processors for images, audio, video, and other media types
Wuzzy provides comprehensive protocol support spanning traditional web and permaweb infrastructure:
- HTTP/HTTPS: Standard web crawling through Hyperbeam's relay device
- Error Handling: Gracefully ignores missing or empty pages or content
- ARNS (Arweave Name System): Native support for
arns://URLs enabling decentralized domain resolution - Direct Arweave Access:
ar://protocol support for accessing content by transaction ID - Permanent Archival: All crawled content is automatically stored on Arweave, ensuring permanent accessibility
The system implements a comprehensive role-based access control (RBAC) system:
- Owner: Full administrative access to all operations
- Admin: Administrative access with delegation capabilities
- Component-specific roles: Granular permissions for operations like
Index-Document,Add-Crawler,Request-Crawl
- Message Authentication: All inter-process communication is authenticated through AO's built-in security model
- Permission Validation: Every operation validates sender permissions before execution
- State Protection: Process state can only be modified by authorized principals
- Audit Trail: All operations are logged in the AO message history for transparency
Wuzzy provides HTTP-based search queries through the Hyperbeam infrastructure:
GET /{nestId}/now/~lua@5.3a&module={viewModule}/search_bm25/serialize~json@1.0?query={searchTerm}
Search results are returned as flat JSON structures optimized for client consumption:
interface WuzzyNestSearchResults {
search_type: string
total_hits: number
has_more: 'true' | 'false'
from: number
page_size: number
result_count: number
// Flattened result arrays
[key: `${number}_docid`]: string
[key: `${number}_title`]: string | undefined
[key: `${number}_description`]: string | undefined
[key: `${number}_content`]: string
[key: `${number}_score`]: number
}- Offset-based Pagination: Uses
fromparameter for result offsetting - Configurable Page Size: Default 10 results, customizable via parameters
- Result Metadata: Includes total hit counts and pagination status
Wuzzy's architecture enables several scaling strategies:
- Domain Specialization: Deploy dedicated Crawlers for specific domains or content types
- Geographic Distribution: Distribute Crawlers across different regions for improved latency
- Load Balancing: Multiple Crawlers can work on different URL sets simultaneously
- Resource Optimization: Each Crawler operates independently with its own resource allocation
- Multiple Nests: Deploy specialized Nests for different content categories
- Federated Search: Implement client-side search federation across multiple Nests
- Content Partitioning: Distribute documents across Nests based on domain, content type, or other criteria
- Incremental Updates: Document updates modify only changed content rather than full re-indexing
- Memory-efficient Storage: Optimized data structures for large-scale document collections
- Crawl Memory: In-memory tracking of recently crawled URLs to prevent duplication
- Queue Management: Efficient queue structures for managing large crawl backlogs
- Multi-media Support: Processors for images, videos, and audio content
- Content Classification: AI-powered content categorization, summarization, and tagging
- Semantic Search: Integration with LLM-based semantic understanding
- Query Expansion: Automatic query term expansion for improved recall
- Monitoring and Analytics: Comprehensive dashboard for system health and performance
- Auto-scaling: Dynamic Crawler spawns from Nests based on crawl task depth
Wuzzy Search represents a paradigm shift toward truly decentralized web search infrastructure. By leveraging the AO protocol and Arweave's permanent storage capabilities, it creates a censorship-resistant, globally accessible search platform that operates without centralized control points.
The system's modular architecture, sophisticated search algorithms, and comprehensive protocol support make it suitable for a wide range of applications, from specialized domain search to general-purpose web discovery. Its integration with the permaweb ensures that indexed content remains permanently accessible, creating a lasting digital archive of human knowledge.
As the permaweb ecosystem continues to evolve, Wuzzy Search provides the essential infrastructure for content discovery and access, enabling the next generation of decentralized applications and services. The system's open architecture and extensible design position it as a foundational component for the decentralized web's future development.
The technical innovations demonstrated in Wuzzy Search, from autonomous agent-based crawling to decentralized search ranking, establish new patterns for building resilient, scalable infrastructure on blockchain-based platforms. This approach offers a compelling alternative to traditional search architectures while maintaining the performance and features users expect from modern search engines.