A high-performance Rust application with a concurrent pipeline architecture that streams images from an S3 bucket, processes them in parallel, and uploads them to Meilisearch with optimal throughput and constant memory usage.
- Concurrent Pipeline Architecture: Three-stage pipeline with S3 listing, image processing, and uploading running concurrently
- True Streaming: Processing starts immediately as images are discovered, no waiting for full S3 scan
- Memory Efficient: Constant memory usage regardless of dataset size using channels and bounded queues
- S3 Integration: Recursively scans S3 buckets using rusty-s3 with pagination
- Parallel Image Processing: Configurable concurrent image downloads and processing
- Intelligent Filtering: Advanced monocolor detection with compression artifact tolerance
- Intelligent Batching: Dynamic batch uploading with adaptive sizing - never skips large images
- Batch Deletion: Efficiently removes low-color images using Meilisearch's batch delete API
- Built-in Resilience: Retry logic with exponential backoff for transient failures
- Base64 Encoding: Converts images to base64 for Meilisearch storage
- Highly Configurable: Command-line options for all performance parameters
- Dry Run Mode: Test configuration without uploading to Meilisearch
- Real-time Monitoring: Live progress tracking and detailed statistics
Make sure you have Rust installed, then build the project:
```bash
cargo build --release
```

The binary will be available at `target/release/multimedia-commons-meilisearch-uploader`.

```bash
./target/release/multimedia-commons-meilisearch-uploader
```

This will use the default configuration:
- S3 Bucket: multimedia-commons
- S3 Region: us-west-2
- S3 Prefix: data/images/
- Meilisearch URL: https://ms-66464012cf08-103.fra.meilisearch.io
```bash
./target/release/multimedia-commons-meilisearch-uploader \
    --bucket my-bucket \
    --region us-east-1 \
    --prefix images/ \
    --meilisearch-url https://my-meilisearch.com \
    --meilisearch-key your-api-key \
    --max-downloads 100 \
    --max-uploads 20 \
    --batch-size 50
```

Test the configuration without uploading to Meilisearch:

```bash
./target/release/multimedia-commons-meilisearch-uploader --dry-run
```

| Option | Default | Description |
|---|---|---|
| `--bucket` | multimedia-commons | S3 bucket name |
| `--region` | us-west-2 | S3 region |
| `--prefix` | data/images/ | S3 prefix path |
| `--meilisearch-url` | https://ms-66464012cf08-103.fra.meilisearch.io | Meilisearch URL |
| `--meilisearch-key` | (default provided) | Meilisearch API key |
| `--max-downloads` | 50 | Maximum concurrent downloads |
| `--max-uploads` | 10 | Maximum concurrent uploads |
| `--batch-size` | 100 | Number of documents per batch |
| `--max-batch-bytes` | 104857600 | Maximum batch size in bytes (100 MB) |
| `--dry-run` | false | Don't upload to Meilisearch |
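The defaults in the table can be summarized as a plain configuration struct. This is only an illustration of the documented defaults (the application itself parses these flags with clap); the struct and field names here are hypothetical mirrors of the CLI options.

```rust
// Illustrative only: the documented defaults as a plain struct.
// The real application derives its CLI parser with clap.
#[derive(Debug)]
struct Config {
    bucket: String,
    region: String,
    prefix: String,
    max_downloads: usize,
    max_uploads: usize,
    batch_size: usize,
    max_batch_bytes: u64,
    dry_run: bool,
}

impl Default for Config {
    fn default() -> Self {
        Config {
            bucket: "multimedia-commons".into(),
            region: "us-west-2".into(),
            prefix: "data/images/".into(),
            max_downloads: 50,
            max_uploads: 10,
            batch_size: 100,
            max_batch_bytes: 100 * 1024 * 1024, // 104857600, i.e. 100 MiB
            dry_run: false,
        }
    }
}

fn main() {
    println!("{:#?}", Config::default());
}
```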
Each image is converted to a JSON document with the following structure:
```json
{
  "id": "filename_without_extension",
  "base64": "base64_encoded_image_data",
  "url": "https://bucket.s3-region.amazonaws.com/path/to/image.jpg"
}
```

The application uses AWS credentials from the environment. You have several options:
```bash
export AWS_ACCESS_KEY_ID=your_access_key
export AWS_SECRET_ACCESS_KEY=your_secret_key
```

Create `~/.aws/credentials`:

```ini
[default]
aws_access_key_id = your_access_key
aws_secret_access_key = your_secret_key
```

If running on EC2, the application will automatically use IAM roles.
For public buckets, the application will attempt anonymous access if no credentials are found.
Note: The multimedia-commons bucket is publicly accessible, so you can run the application without AWS credentials for read-only access.
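The per-image document shown earlier can be sketched as a small key-to-JSON mapping. This is a minimal illustration, not the application's actual code (which uses serde for serialization); the `make_document` helper and the hand-rolled JSON formatting are assumptions made to keep the sketch dependency-free.

```rust
// Sketch: how an S3 key could map to the document fields shown above.
// Hypothetical helper; the real application serializes with serde.
fn make_document(bucket: &str, region: &str, key: &str, base64_data: &str) -> String {
    // "data/images/abc123.jpg" -> filename "abc123.jpg" -> id "abc123"
    let filename = key.rsplit('/').next().unwrap_or(key);
    let id = filename
        .rsplit_once('.')
        .map(|(stem, _)| stem)
        .unwrap_or(filename);
    let url = format!("https://{bucket}.s3-{region}.amazonaws.com/{key}");
    format!("{{\"id\":\"{id}\",\"base64\":\"{base64_data}\",\"url\":\"{url}\"}}")
}

fn main() {
    let doc = make_document(
        "multimedia-commons",
        "us-west-2",
        "data/images/abc123.jpg",
        "AAAA",
    );
    println!("{doc}");
}
```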
The application uses a concurrent pipeline architecture for maximum performance:
```
S3 Lister → [Channel] → Image Processor → [Channel] → Batch Uploader
    ↓                        ↓                          ↓
Discovers images         Downloads &                Uploads to
continuously             processes images           Meilisearch
                         in parallel                in batches
```
- Concurrent Execution: All three stages run simultaneously for maximum throughput
- No Blocking: Image processing starts immediately as images are discovered
- Constant Memory: Bounded channels prevent memory buildup regardless of dataset size
- Configurable Parallelism: Control concurrent downloads and uploads independently
- Efficient Resource Usage: CPU, network, and memory optimally utilized
- S3 Streaming: Continuous S3 object discovery with pagination
- Parallel Processing: Up to N concurrent image downloads/processing (configurable)
- Adaptive Batching: Intelligent batch sizing that handles large images by sending them separately
- Batch Deletion: Groups low-color image deletions into batches of 50 for efficient API usage
- Advanced Filtering: Grid-based monocolor detection with compression tolerance
- Retry Logic: Exponential backoff for transient failures
- Real-time Stats: Live progress monitoring across all pipeline stages
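The bounded-channel backpressure behind these properties can be sketched with std threads. The real application runs on tokio with async channels; this std-only sketch only demonstrates the core idea: a full channel blocks the sender, so upstream stages slow down automatically and memory stays bounded regardless of how many keys S3 returns.

```rust
use std::sync::mpsc::sync_channel;
use std::thread;

// Three-stage pipeline sketch with bounded (backpressured) channels.
// Stage names mirror the diagram above; payloads are stand-in strings.
fn run_pipeline(total_keys: usize) -> usize {
    let (key_tx, key_rx) = sync_channel::<String>(8); // lister -> processor
    let (doc_tx, doc_rx) = sync_channel::<String>(8); // processor -> uploader

    // Stage 1: "S3 lister" discovers keys continuously.
    let lister = thread::spawn(move || {
        for i in 0..total_keys {
            // send() blocks when the buffer is full: natural backpressure.
            key_tx.send(format!("data/images/img{i}.jpg")).unwrap();
        }
    });

    // Stage 2: "processor" turns keys into documents as they arrive.
    let processor = thread::spawn(move || {
        for key in key_rx {
            doc_tx.send(format!("doc for {key}")).unwrap();
        }
    });

    // Stage 3: "uploader" drains documents; here it just counts them.
    let uploaded = doc_rx.iter().count();
    lister.join().unwrap();
    processor.join().unwrap();
    uploaded
}

fn main() {
    println!("uploaded {} documents", run_pipeline(20));
}
```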
Throughput Optimization:
- `--max-downloads`: Controls concurrent image processing (default: 50)
  - Higher values mean more parallel processing but more memory/CPU usage
  - Tune based on your system resources and S3 rate limits
 
- `--max-uploads`: Controls Meilisearch upload concurrency (default: 10)
  - Tune based on your Meilisearch instance capacity
 
- `--batch-size`: Target documents per upload batch (default: 100)
  - Larger batches mean fewer API calls but more memory per batch
  - Large images are automatically sent in separate batches to ensure no data loss
 
Memory Management:
- Channel buffer sizes are automatically tuned for optimal memory usage
- The pipeline maintains constant memory regardless of dataset size
- Processing memory scales only with the `--max-downloads` setting
Monitoring:
- Watch the live output to see pipeline balance
- Optimal setup: S3 discovery keeps ahead of processing, processing keeps ahead of uploads
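The adaptive batching rule described above can be sketched as a pure function over (document id, size) pairs. This is an assumed reconstruction of the behavior, not the application's actual code: flush the current batch when the next document would exceed the count or byte limits, and ship an oversized document alone in its own batch so nothing is skipped.

```rust
// Sketch of adaptive batching under a document-count cap (--batch-size)
// and a byte cap (--max-batch-bytes). Hypothetical helper.
fn split_into_batches(
    docs: &[(String, usize)],
    max_docs: usize,
    max_bytes: usize,
) -> Vec<Vec<String>> {
    let mut batches = Vec::new();
    let mut current: Vec<String> = Vec::new();
    let mut current_bytes = 0usize;
    for (id, size) in docs {
        // Flush first if adding this document would overflow either limit.
        let overflows = current_bytes + size > max_bytes || current.len() == max_docs;
        if overflows && !current.is_empty() {
            batches.push(std::mem::take(&mut current));
            current_bytes = 0;
        }
        current.push(id.clone());
        current_bytes += size;
        // A single document larger than the byte cap is still sent:
        // it goes out immediately as a batch of one, never skipped.
        if current_bytes > max_bytes {
            batches.push(std::mem::take(&mut current));
            current_bytes = 0;
        }
    }
    if !current.is_empty() {
        batches.push(current);
    }
    batches
}

fn main() {
    let docs: Vec<(String, usize)> = vec![
        ("a".into(), 40),
        ("b".into(), 40),
        ("huge".into(), 500), // exceeds the 100-byte cap on its own
        ("c".into(), 40),
    ];
    for batch in split_into_batches(&docs, 100, 100) {
        println!("{batch:?}");
    }
}
```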
- Supported Formats: JPEG, PNG (detected by file extension)
- Advanced Color Analysis: Counts unique colors per image (minimum 40 colors to be considered rich content)
- Base64 Encoding: All valid images are encoded to base64 for Meilisearch storage
- Pipeline Processing: Images flow through the pipeline as discovered - no batching in memory
- Concurrent Downloads: Multiple images processed simultaneously with semaphore-based rate limiting
- Adaptive Upload Strategy: Large images that exceed batch size limits are sent in separate batches
- Batch Deletion Strategy: Low-color images are queued and deleted in batches of 50 using Meilisearch's batch API
- Zero Data Loss: All images are processed - rich images uploaded, simple images deleted from index
- Graceful Error Handling: Failed downloads/processing are logged and counted but don't stop the pipeline
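The color-richness check above can be sketched over decoded RGB pixels. This shows only the core idea, assuming the 40-color threshold quoted above; the application's actual filter is grid-based and tolerates compression artifacts, which this sketch omits.

```rust
use std::collections::HashSet;

// Core of the monocolor filter: count unique RGB colors and keep the
// image only if it has at least `min_colors` distinct values.
// Simplified sketch; the real filter adds grid sampling and
// compression-artifact tolerance.
fn is_rich(pixels: &[[u8; 3]], min_colors: usize) -> bool {
    let unique: HashSet<[u8; 3]> = pixels.iter().copied().collect();
    unique.len() >= min_colors
}

fn main() {
    // A monocolor "image": one repeated pixel value.
    let flat = vec![[200u8, 200, 200]; 1024];
    // A grayscale gradient with 256 distinct colors.
    let gradient: Vec<[u8; 3]> = (0u8..=255).map(|v| [v, v, v]).collect();
    println!("flat rich? {}", is_rich(&flat, 40)); // false
    println!("gradient rich? {}", is_rich(&gradient, 40)); // true
}
```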
- `anyhow`: Error handling
- `base64`: Base64 encoding
- `clap`: Command-line parsing
- `futures`: Async utilities
- `image`: Image processing
- `reqwest`: HTTP client
- `rusty-s3`: S3 client
- `serde`: Serialization
- `tokio`: Async runtime
- `url`: URL parsing
The application includes comprehensive error handling:
- Automatic retries for transient failures
- Graceful handling of invalid images
- Logging of errors without stopping the entire process
- Final summary of errors encountered
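The retry-with-exponential-backoff policy can be sketched as a small generic helper: double the delay after each transient failure, up to a fixed attempt budget. The attempt count and delays below are illustrative placeholders, not the application's actual values.

```rust
use std::thread::sleep;
use std::time::Duration;

// Retry an operation up to `attempts` times, doubling the delay
// between attempts (exponential backoff). Illustrative sketch.
fn retry<T, E>(mut attempts: u32, mut op: impl FnMut() -> Result<T, E>) -> Result<T, E> {
    let mut delay = Duration::from_millis(10);
    loop {
        match op() {
            Ok(v) => return Ok(v),
            // Budget exhausted: surface the final error.
            Err(e) if attempts <= 1 => return Err(e),
            Err(_) => {
                sleep(delay);
                delay *= 2; // exponential backoff
                attempts -= 1;
            }
        }
    }
}

fn main() {
    let mut calls = 0;
    // Fails twice, then succeeds: the transient-failure shape retried above.
    let result = retry(5, || {
        calls += 1;
        if calls < 3 { Err("transient") } else { Ok(calls) }
    });
    println!("{result:?} after {calls} calls");
}
```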
- Certificate Errors: Make sure your system time is correct and you have updated CA certificates
- Access Denied: Verify your AWS credentials have S3 read permissions
- Out of Memory: Reduce `--max-downloads` if processing very large images (the pipeline itself uses constant memory)
- Slow Processing: Increase `--max-downloads` for more parallel processing, but watch system resources
- Large Image Handling: Very large images are automatically sent in separate batches with logging
- Meilisearch Errors: Check that your Meilisearch URL and API key are correct
- Pipeline Stalls: If one stage becomes a bottleneck, tune the related concurrency parameters
Use the --dry-run flag to test the pipeline without uploading to Meilisearch:
```bash
# Test with lower concurrency to see pipeline stages clearly
./target/release/multimedia-commons-meilisearch-uploader --dry-run --max-downloads 5 --batch-size 10

# Test with higher concurrency for performance evaluation
./target/release/multimedia-commons-meilisearch-uploader --dry-run --max-downloads 20 --batch-size 50

# Test batch handling with smaller limits to see adaptive batching
./target/release/multimedia-commons-meilisearch-uploader --dry-run --max-batch-bytes 100000 --batch-size 3

# See what would be deleted vs uploaded
./target/release/multimedia-commons-meilisearch-uploader --dry-run --max-downloads 50
```

This project is licensed under the MIT License.