Multimedia Commons Meilisearch Uploader

A Rust application built as a concurrent pipeline: it streams images from an S3 bucket, processes them in parallel, and uploads them to Meilisearch in batches. Bounded channels between the stages keep memory usage constant regardless of dataset size.

Features

Concurrent Pipeline Architecture: Three-stage pipeline with S3 listing, image processing, and uploading running concurrently
True Streaming: Processing starts immediately as images are discovered, no waiting for full S3 scan
Memory Efficient: Constant memory usage regardless of dataset size using channels and bounded queues
S3 Integration: Recursively scans S3 buckets using rusty-s3 with pagination
Parallel Image Processing: Configurable concurrent image downloads and processing
Intelligent Filtering: Advanced monocolor detection with compression artifact tolerance
Intelligent Batching: Dynamic batch uploading with adaptive sizing - never skips large images
Batch Deletion: Efficiently removes low-color images using Meilisearch's batch delete API
Built-in Resilience: Retry logic with exponential backoff for transient failures
Base64 Encoding: Converts images to base64 for Meilisearch storage
Highly Configurable: Command-line options for all performance parameters
Dry Run Mode: Test configuration without uploading to Meilisearch
Real-time Monitoring: Live progress tracking and detailed statistics

Installation

Make sure you have Rust installed, then build the project:

cargo build --release

The binary will be available at target/release/multimedia-commons-meilisearch-uploader.

Usage

Basic Usage

./target/release/multimedia-commons-meilisearch-uploader

This will use the default configuration:

S3 Bucket: multimedia-commons
S3 Region: us-west-2
S3 Prefix: data/images/
Meilisearch URL: https://ms-66464012cf08-103.fra.meilisearch.io

Custom Configuration

./target/release/multimedia-commons-meilisearch-uploader \
    --bucket my-bucket \
    --region us-east-1 \
    --prefix images/ \
    --meilisearch-url https://my-meilisearch.com \
    --meilisearch-key your-api-key \
    --max-downloads 100 \
    --max-uploads 20 \
    --batch-size 50

Dry Run

Test the configuration without uploading to Meilisearch:

./target/release/multimedia-commons-meilisearch-uploader --dry-run

Command Line Options

Option	Default	Description
`--bucket`	`multimedia-commons`	S3 bucket name
`--region`	`us-west-2`	S3 region
`--prefix`	`data/images/`	S3 prefix path
`--meilisearch-url`	`https://ms-66464012cf08-103.fra.meilisearch.io`	Meilisearch URL
`--meilisearch-key`	(default provided)	Meilisearch API key
`--max-downloads`	`50`	Maximum concurrent downloads
`--max-uploads`	`10`	Maximum concurrent uploads
`--batch-size`	`100`	Number of documents per batch
`--max-batch-bytes`	`104857600`	Maximum batch size in bytes (100 MiB)
`--dry-run`	`false`	Don't upload to Meilisearch

Output Format

Each image is converted to a JSON document with the following structure:

{
  "id": "filename_without_extension",
  "base64": "base64_encoded_image_data",
  "url": "https://bucket.s3-region.amazonaws.com/path/to/image.jpg"
}
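
As an illustration of how such a document could be assembled, here is a minimal sketch that uses only the standard library. The names `ImageDocument`, `build_document`, and `base64_encode` are hypothetical and are not taken from the project source; the hand-written encoder stands in for the `base64` crate, and the URL pattern simply follows the example above.

```rust
use std::path::Path;

/// Mirrors the three fields of the JSON document shown above.
#[derive(Debug, PartialEq)]
struct ImageDocument {
    id: String,
    base64: String,
    url: String,
}

const ALPHABET: &[u8; 64] =
    b"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

/// Standard base64 with `=` padding (stand-in for the `base64` crate).
fn base64_encode(data: &[u8]) -> String {
    let mut out = String::with_capacity((data.len() + 2) / 3 * 4);
    for chunk in data.chunks(3) {
        // Pack up to three bytes into a 24-bit group, zero-filling the tail.
        let b0 = chunk[0] as u32;
        let b1 = *chunk.get(1).unwrap_or(&0) as u32;
        let b2 = *chunk.get(2).unwrap_or(&0) as u32;
        let n = (b0 << 16) | (b1 << 8) | b2;
        out.push(ALPHABET[(n >> 18) as usize & 63] as char);
        out.push(ALPHABET[(n >> 12) as usize & 63] as char);
        out.push(if chunk.len() > 1 { ALPHABET[(n >> 6) as usize & 63] as char } else { '=' });
        out.push(if chunk.len() > 2 { ALPHABET[n as usize & 63] as char } else { '=' });
    }
    out
}

/// Builds a document from an S3 key and the raw image bytes.
/// Returns `None` if the key has no usable file name.
fn build_document(bucket: &str, region: &str, key: &str, bytes: &[u8]) -> Option<ImageDocument> {
    // The id is the file name without its extension.
    let id = Path::new(key).file_stem()?.to_str()?.to_owned();
    Some(ImageDocument {
        id,
        base64: base64_encode(bytes),
        // Assumed URL pattern, copied from the example document above.
        url: format!("https://{bucket}.s3-{region}.amazonaws.com/{key}"),
    })
}
```

Serializing the struct to JSON is left out here; the project lists `serde` among its dependencies for that step.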

AWS Credentials

The application uses AWS credentials from the environment. You have several options:

Option 1: Environment Variables

export AWS_ACCESS_KEY_ID=your_access_key
export AWS_SECRET_ACCESS_KEY=your_secret_key

Option 2: AWS Credentials File

Create ~/.aws/credentials:

[default]
aws_access_key_id = your_access_key
aws_secret_access_key = your_secret_key

Option 3: IAM Roles (for EC2 instances)

If running on EC2, the application will automatically use IAM roles.

Option 4: Anonymous Access

For public buckets, the application will attempt anonymous access if no credentials are found.

Note: The multimedia-commons bucket is publicly accessible, so you can run the application without AWS credentials for read-only access.

Performance

The application uses a concurrent pipeline architecture for maximum performance:

Pipeline Architecture

S3 Lister → [Channel] → Image Processor → [Channel] → Batch Uploader
    ↓                        ↓                          ↓
Discovers images         Downloads &                Uploads to
continuously            processes images            Meilisearch
                        in parallel                 in batches

Concurrent Execution: All three stages run simultaneously for maximum throughput
No Blocking: Image processing starts immediately as images are discovered
Constant Memory: Bounded channels prevent memory buildup regardless of dataset size
Configurable Parallelism: Control concurrent downloads and uploads independently
Efficient Resource Usage: Downloads, image processing, and uploads overlap, so network and CPU stay busy at the same time
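
The three-stage layout can be sketched with bounded channels from the standard library. This is a simplified stand-in, not the project's code: the real application runs on `tokio`, while the sketch uses OS threads and `std::sync::mpsc::sync_channel`. The function name `run_pipeline`, the channel capacity of 4, and the `doc:` prefix are all hypothetical.

```rust
use std::sync::mpsc::sync_channel;
use std::thread;

/// Runs lister -> processor -> uploader concurrently and returns the
/// batches the uploader stage would have sent.
fn run_pipeline(keys: Vec<String>, batch_size: usize) -> Vec<Vec<String>> {
    let batch_size = batch_size.max(1);
    // Bounded channels: a full channel blocks the sender, so a fast stage
    // cannot pile up unbounded work in front of a slow one.
    let (key_tx, key_rx) = sync_channel::<String>(4);
    let (doc_tx, doc_rx) = sync_channel::<String>(4);

    // Stage 1: "list" object keys (here they are simply handed in).
    let lister = thread::spawn(move || {
        for key in keys {
            if key_tx.send(key).is_err() {
                break; // downstream stage has gone away
            }
        }
    });

    // Stage 2: turn each key into a document (placeholder processing).
    let processor = thread::spawn(move || {
        for key in key_rx {
            if doc_tx.send(format!("doc:{key}")).is_err() {
                break;
            }
        }
    });

    // Stage 3: group documents into batches of `batch_size`.
    let uploader = thread::spawn(move || {
        let mut batches = Vec::new();
        let mut current = Vec::new();
        for doc in doc_rx {
            current.push(doc);
            if current.len() == batch_size {
                batches.push(std::mem::take(&mut current));
            }
        }
        if !current.is_empty() {
            batches.push(current); // flush the final partial batch
        }
        batches
    });

    lister.join().unwrap();
    processor.join().unwrap();
    uploader.join().unwrap()
}
```

Each stage ends when the sender feeding it is dropped, so shutdown propagates from the lister to the uploader without extra signalling.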

Performance Features

S3 Streaming: Continuous S3 object discovery with pagination
Parallel Processing: Up to N concurrent image downloads/processing (configurable)
Adaptive Batching: Intelligent batch sizing that handles large images by sending them separately
Batch Deletion: Groups low-color image deletions into batches of 50 for efficient API usage
Advanced Filtering: Grid-based monocolor detection with compression tolerance
Retry Logic: Exponential backoff for transient failures
Real-time Stats: Live progress monitoring across all pipeline stages
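
Grouping deletions into batches of 50 could look like the sketch below. The helper names are hypothetical. The path follows Meilisearch's documented delete-batch route, but the index uid the application actually uses is not stated here, so it is left as a parameter.

```rust
/// Batch size for deletions, as described in the list above.
const DELETE_BATCH_SIZE: usize = 50;

/// Splits queued document ids into groups of at most 50.
fn deletion_batches(ids: &[String]) -> Vec<Vec<String>> {
    ids.chunks(DELETE_BATCH_SIZE)
        .map(|chunk| chunk.to_vec())
        .collect()
}

/// Path of the Meilisearch delete-batch route for a given index.
/// `index_uid` is a placeholder for whatever index the tool targets.
fn delete_batch_path(index_uid: &str) -> String {
    format!("/indexes/{index_uid}/documents/delete-batch")
}
```

Each group would then be sent as one request whose body is a JSON array of ids, so 120 queued deletions cost three API calls instead of 120.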

Performance Tuning

Throughput Optimization:

--max-downloads: Controls concurrent image processing (default: 50)
- Higher values = more parallel processing but more memory/CPU usage
- Tune based on your system resources and S3 rate limits
--max-uploads: Controls Meilisearch upload concurrency (default: 10)
- Tune based on your Meilisearch instance capacity
--batch-size: Target documents per upload batch (default: 100)
- Larger batches = fewer API calls but more memory per batch
- Large images are automatically sent in separate batches to ensure no data loss
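
To make the interaction between `--batch-size` and `--max-batch-bytes` concrete, the following sketch plans batches from a list of document sizes. It is one plausible reading of the behaviour described above, not the project's implementation: the function name `plan_batches` is hypothetical, and flushing the pending batch before sending an oversized document is an assumption made to keep upload order stable.

```rust
/// Groups document sizes (in bytes) into upload batches.
/// A batch closes when it holds `max_docs` documents or when adding the
/// next document would exceed `max_bytes`. A document larger than
/// `max_bytes` is never dropped: it is sent alone in its own batch.
fn plan_batches(sizes: &[usize], max_docs: usize, max_bytes: usize) -> Vec<Vec<usize>> {
    let max_docs = max_docs.max(1);
    let mut batches = Vec::new();
    let mut current: Vec<usize> = Vec::new();
    let mut current_bytes = 0usize;

    for &size in sizes {
        if size > max_bytes {
            // Oversized document: flush what is pending, then send it alone.
            if !current.is_empty() {
                batches.push(std::mem::take(&mut current));
                current_bytes = 0;
            }
            batches.push(vec![size]);
            continue;
        }
        // Close the pending batch if this document would not fit.
        if current.len() >= max_docs || current_bytes + size > max_bytes {
            batches.push(std::mem::take(&mut current));
            current_bytes = 0;
        }
        current.push(size);
        current_bytes += size;
    }
    if !current.is_empty() {
        batches.push(current); // flush the final partial batch
    }
    batches
}
```

With a limit of 3 documents and 100 bytes, the sizes 10, 20, 500, 30, 40 come out as three batches: 10 and 20 together, 500 alone, then 30 and 40 together.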

Memory Management:

Channel buffer sizes are automatically tuned for optimal memory usage
The pipeline maintains constant memory regardless of dataset size
Processing memory scales with --max-downloads setting only
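
Why processing memory depends only on `--max-downloads` can be seen in a small semaphore sketch: at most `limit` tasks hold an image in memory at once, however many are queued. The `Semaphore` type below is hand-rolled on `Mutex` and `Condvar` for illustration. The application is async, so it presumably uses an async semaphore instead, and `peak_concurrency` is a hypothetical helper that only measures the effect.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Arc, Condvar, Mutex};
use std::thread;
use std::time::Duration;

/// Minimal counting semaphore built on a mutex and a condition variable.
struct Semaphore {
    permits: Mutex<usize>,
    available: Condvar,
}

impl Semaphore {
    fn new(permits: usize) -> Self {
        Semaphore { permits: Mutex::new(permits), available: Condvar::new() }
    }

    /// Blocks until a permit is free, then takes it.
    fn acquire(&self) {
        let mut permits = self.permits.lock().unwrap();
        while *permits == 0 {
            permits = self.available.wait(permits).unwrap();
        }
        *permits -= 1;
    }

    /// Returns a permit and wakes one waiting task.
    fn release(&self) {
        *self.permits.lock().unwrap() += 1;
        self.available.notify_one();
    }
}

/// Runs `tasks` simulated downloads with at most `limit` in flight and
/// returns the highest number observed running at the same time.
fn peak_concurrency(tasks: usize, limit: usize) -> usize {
    let sem = Arc::new(Semaphore::new(limit));
    let active = Arc::new(AtomicUsize::new(0));
    let peak = Arc::new(AtomicUsize::new(0));

    let handles: Vec<_> = (0..tasks)
        .map(|_| {
            let (sem, active, peak) = (sem.clone(), active.clone(), peak.clone());
            thread::spawn(move || {
                sem.acquire();
                let now = active.fetch_add(1, Ordering::SeqCst) + 1;
                peak.fetch_max(now, Ordering::SeqCst);
                thread::sleep(Duration::from_millis(2)); // simulated download
                active.fetch_sub(1, Ordering::SeqCst);
                sem.release();
            })
        })
        .collect();

    for handle in handles {
        handle.join().unwrap();
    }
    peak.load(Ordering::SeqCst)
}
```

Sixteen queued tasks with a limit of 4 never show more than 4 running together, so peak memory is roughly the limit times the largest image size.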

Monitoring:

Watch the live output to see pipeline balance
Optimal setup: S3 discovery stays ahead of processing, and processing stays ahead of uploads

Image Processing

Supported Formats: JPEG, PNG (detected by file extension)
Advanced Color Analysis: Counts unique colors per image (minimum 40 colors to be considered rich content)
Base64 Encoding: All valid images are encoded to base64 for Meilisearch storage
Pipeline Processing: Images flow through the pipeline as discovered - no batching in memory
Concurrent Downloads: Multiple images processed simultaneously with semaphore-based rate limiting
Adaptive Upload Strategy: Large images that exceed batch size limits are sent in separate batches
Batch Deletion Strategy: Low-color images are queued and deleted in batches of 50 using Meilisearch's batch API
Zero Data Loss: Every successfully processed image is acted on - rich images are uploaded, low-color images are deleted from the index
Graceful Error Handling: Failed downloads/processing are logged and counted but don't stop the pipeline
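
The color check can be sketched as a unique-color count with an early exit at the 40-color threshold. This sketch works on raw RGB triples rather than the `image` crate's types so that it stays self-contained. The names are hypothetical, and dropping low bits per channel is only one assumed way to tolerate compression artifacts; the project's grid-based sampling is not reproduced here.

```rust
use std::collections::HashSet;

/// Minimum number of distinct colors for an image to count as rich content.
const MIN_UNIQUE_COLORS: usize = 40;

/// Drops the lowest `bits_dropped` bits of each channel so that
/// near-identical shades (for example JPEG noise) collapse into one color.
fn quantize(pixel: [u8; 3], bits_dropped: u32) -> [u8; 3] {
    // `checked_shr` returns None for shifts of 8 or more; treat that as 0.
    pixel.map(|channel| channel.checked_shr(bits_dropped).unwrap_or(0))
}

/// Returns true as soon as 40 distinct (quantized) colors have been seen.
fn has_rich_colors(pixels: &[[u8; 3]], bits_dropped: u32) -> bool {
    let mut seen = HashSet::new();
    for &pixel in pixels {
        seen.insert(quantize(pixel, bits_dropped));
        if seen.len() >= MIN_UNIQUE_COLORS {
            return true; // early exit: no need to scan the whole image
        }
    }
    false
}
```

The early exit keeps the check cheap for rich images, since only low-color images force a scan of every pixel.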

Dependencies

anyhow - Error handling
base64 - Base64 encoding
clap - Command line parsing
futures - Async utilities
image - Image processing
reqwest - HTTP client
rusty-s3 - S3 client
serde - Serialization
tokio - Async runtime
url - URL parsing

Error Handling

The application includes comprehensive error handling:

Automatic retries for transient failures
Graceful handling of invalid images
Logging of errors without stopping the entire process
Final summary of errors encountered
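
A retry loop with exponential backoff, in its simplest blocking form, might look like this. The function names, the doubling factor, and the attempt limit are assumptions for illustration, since the project's actual retry parameters are not documented here. An async version would await a timer instead of calling `thread::sleep`.

```rust
use std::thread;
use std::time::Duration;

/// Delay before retry number `attempt` (0-based): base, 2x base, 4x base, ...
fn backoff_delay(base: Duration, attempt: u32) -> Duration {
    base.saturating_mul(2u32.saturating_pow(attempt))
}

/// Calls `op` until it succeeds or `max_attempts` calls have failed,
/// sleeping with exponentially growing delays between attempts.
/// Returns the last error if every attempt fails.
fn retry_with_backoff<T, E>(
    max_attempts: u32,
    base: Duration,
    mut op: impl FnMut() -> Result<T, E>,
) -> Result<T, E> {
    let mut attempt = 0;
    loop {
        match op() {
            Ok(value) => return Ok(value),
            // Out of attempts: surface the error to the caller.
            Err(err) if attempt + 1 >= max_attempts => return Err(err),
            Err(_) => {
                thread::sleep(backoff_delay(base, attempt));
                attempt += 1;
            }
        }
    }
}
```

In practice only transient failures (timeouts, throttling responses) should be retried; a permanent error such as an invalid API key is better returned immediately.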

Troubleshooting

Common Issues

Certificate Errors: Make sure your system time is correct and you have updated CA certificates
Access Denied: Verify your AWS credentials have S3 read permissions
Out of Memory: Reduce --max-downloads if processing very large images (the pipeline itself uses constant memory)
Slow Processing: Increase --max-downloads for more parallel processing, but watch system resources
Large Image Handling: Very large images are automatically sent in separate batches with logging
Meilisearch Errors: Check that your Meilisearch URL and API key are correct
Pipeline Stalls: If one stage becomes a bottleneck, tune the related concurrency parameters

Testing

Use the --dry-run flag to test the pipeline without uploading to Meilisearch:

# Test with lower concurrency to see pipeline stages clearly
./target/release/multimedia-commons-meilisearch-uploader --dry-run --max-downloads 5 --batch-size 10

# Test with higher concurrency for performance evaluation
./target/release/multimedia-commons-meilisearch-uploader --dry-run --max-downloads 20 --batch-size 50

# Test batch handling with smaller limits to see adaptive batching
./target/release/multimedia-commons-meilisearch-uploader --dry-run --max-batch-bytes 100000 --batch-size 3

# See what would be deleted vs uploaded
./target/release/multimedia-commons-meilisearch-uploader --dry-run --max-downloads 50

License

This project is licensed under the MIT License.

