@solaws solaws commented Aug 28, 2025

This project contains a document vectorization pipeline built on AWS services. It processes text, PDF, and Word documents: it extracts their content, generates vector embeddings in parallel, and stores them in a PostgreSQL database optimized for vector search.
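
As a rough illustration of the "generate vector embeddings in parallel" step described above, here is a minimal sketch. It is not the code in this pull request: `chunk_text`, `embed_chunk`, and `embed_document` are hypothetical names, the embedding is a hash-derived placeholder rather than a real model call, and a local thread pool stands in for whatever parallelism the workflow actually uses. The storage step is omitted.

```python
from concurrent.futures import ThreadPoolExecutor
import hashlib


def chunk_text(text: str, size: int = 200) -> list[str]:
    """Split extracted text into fixed-size character chunks (assumed strategy)."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def embed_chunk(chunk: str, dim: int = 8) -> list[float]:
    """Placeholder embedding: a deterministic hash-derived vector, NOT a real model."""
    digest = hashlib.sha256(chunk.encode("utf-8")).digest()
    return [b / 255.0 for b in digest[:dim]]


def embed_document(text: str, max_workers: int = 4) -> list[tuple[str, list[float]]]:
    """Embed all chunks concurrently; pool.map preserves the original chunk order."""
    chunks = chunk_text(text)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        vectors = list(pool.map(embed_chunk, chunks))
    return list(zip(chunks, vectors))
```

Each returned pair holds a chunk and its vector, which is the shape a later step would need in order to write one database row per chunk.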

Thank you

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

soojo added 2 commits August 28, 2025 12:41
- Enhanced README.md with workflow diagram and detailed architecture
- Added complete example-workflow.json with all required metadata
- Created resources folder with workflow diagram and author photos
- Added professional author information for Solomon Ojo and Dave Horne
- Included comprehensive deployment guides and resource links
- Ready for AWS Step Functions workflows collection contribution
@solaws solaws closed this Sep 11, 2025
@solaws solaws reopened this Sep 11, 2025
- Complete document vectorization pipeline implementation
- Enhanced README.md with workflow diagram and comprehensive documentation
- Added example-workflow.json with all required metadata for AWS samples
- Included resources folder with workflow diagram and author photos
- Added professional author information for Solomon Ojo and Dave Horne
- All Lambda functions, deployment scripts, and configuration files
- Ready for production use and AWS Step Functions workflows collection
solaws and others added 16 commits October 9, 2025 11:36
Added a section detailing the data processing layers in the pipeline, explaining the purpose of each layer and its role in the data transformation workflow.
Removed quick start deployment instructions from README.
Updated resource links and author information in workflow metadata.
Updated the state machine to handle document processing and vector embedding generation.
Updated the AWS CloudFormation template to enhance the document processing workflow by modifying descriptions, adjusting security group rules, and updating Lambda function runtimes to Python 3.12. Removed unnecessary resources and added outputs for database secret and cluster ARNs.
Updated AvailabilityZone retrieval and added a custom S3 policy for Lambda.
…ries layer

- Reduced DatabaseInitFunction timeout from 300s to 120s (2 minutes)
- Created missing functions/shared directory with security utilities
- Added shared libraries for secure XML parsing and subprocess execution
- Updated template.yaml with proper layer structure
- Cleaned up demo configuration
@solaws solaws requested a review from bfreiberg October 9, 2025 18:23
Contributor

@bfreiberg bfreiberg left a comment

Looks good, thanks for your contribution. Your workflow will be merged to Serverlessland soon.
