
feat: refactor etl #563

Merged: 12 commits into dev on Mar 10, 2025
Conversation

IcyKallen (Contributor)

@IcyKallen commented Mar 10, 2025

Fixes #

🤖 AI-Generated PR Description (Powered by Amazon Bedrock)

Description

This pull request includes a variety of changes across multiple files and directories. The primary changes are:

  1. Updates to the infrastructure code for the API, knowledge base, and IAM helper.
  2. Modifications to the ETL (Extract, Transform, Load) Lambda functions and related scripts.
  3. Updates to the job Lambda function, including changes to its dependency package (llm_bot_dep): modifications to various loader modules and utility functions, plus new modules for processing parameters and shared utilities.
  4. Removal of the opensearch_vector_search.py module from the llm_bot_dep package.
  5. Changes to the Glue job script.
  6. Updates to the common utility functions for online Lambda functions.

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

Motivation and Context

[Provide the motivation and context for the changes here, including any dependencies or requirements.]

File Stats Summary

Number of files involved in this PR: 27.

The file changes summary is as follows:

  • source/lambda/job/dep/llm_bot_dep/opensearch_vector_search.py (0 added, 1400 removed): This file is removed in this PR.
  • source/lambda/job/dep/llm_bot_dep/sm_utils.py (3 added, 5 removed): Removes the hard-coded AWS region in favor of the default AWS region and removes unnecessary whitespace lines.
  • source/lambda/job/build_whl.sh (8 added, 0 removed): Changes the current working directory to the script's own location, so the script runs correctly regardless of the directory it is invoked from.
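A minimal sketch of the kind of guard described for build_whl.sh. The build command shown is illustrative only; the real script builds the llm_bot_dep wheel.

```shell
#!/usr/bin/env bash
# Resolve this script's own directory and cd into it, so relative paths
# (e.g. setup.py, dist/) work no matter where the script is invoked from.
SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)"
cd "$SCRIPT_DIR" || exit 1
echo "Building wheel from $SCRIPT_DIR"
# The actual wheel build step depends on the project's packaging setup, e.g.:
# pip wheel . -w dist/
```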
  • source/infrastructure/lib/shared/iam-helper.ts (1 added, 1 removed): Removes a trailing newline character at the end of the file.
  • source/lambda/job/dep/llm_bot_dep/utils/secrets_manager_utils.py (33 added, 0 removed): Defines a function that retrieves an API key from AWS Secrets Manager using the Boto3 library.
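A sketch of what such a Secrets Manager lookup might look like. The function names and the `api_key` payload field are assumptions, not the module's actual API; `get_secret_value` is the standard Boto3 call.

```python
import json


def extract_api_key(secret_string: str, field: str = "api_key") -> str:
    """Parse a SecretString payload; plain (non-JSON) strings are returned as-is."""
    try:
        payload = json.loads(secret_string)
    except json.JSONDecodeError:
        return secret_string
    if field not in payload:
        raise KeyError(f"Secret payload has no '{field}' field")
    return payload[field]


def get_api_key(secret_name: str, region_name: str = None) -> str:
    """Fetch a secret from AWS Secrets Manager and return its API key."""
    import boto3  # imported lazily so the pure helper above stays testable

    client = boto3.client("secretsmanager", region_name=region_name)
    response = client.get_secret_value(SecretId=secret_name)
    return extract_api_key(response["SecretString"])
```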
  • source/lambda/online/common_logic/common_utils/response_utils.py (8 added, 2 removed): Imports DynamoDBChatMessageHistory, reorganizes imports, and adds blank lines for readability.
  • source/lambda/etl/sfn_handler.py (7 added, 0 removed): Adds validation for the required fields (modelId, modelSecretName, modelApiUrl) when the OpenAI model provider is used, raising a ValueError if any are missing.
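The validation described could look roughly like this. The provider string `"openai"` and the flat event shape are assumptions; the real sfn_handler.py may differ.

```python
def validate_model_fields(event: dict) -> None:
    """Raise ValueError when the OpenAI provider is selected but any of its
    required connection fields is missing or empty."""
    required = ("modelId", "modelSecretName", "modelApiUrl")
    if event.get("modelProvider") == "openai":  # provider literal is assumed
        missing = [name for name in required if not event.get(name)]
        if missing:
            raise ValueError(
                "Missing required fields for the OpenAI model provider: "
                + ", ".join(missing)
            )
```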
  • source/lambda/job/dep/llm_bot_dep/enhance_utils.py (2 added, 5 removed): Comments out the NLTK tokenization step, adds a TODO to use a better tokenization method, and removes the code that calculated the total token count and chunk number.
  • source/lambda/job/dep/llm_bot_dep/schemas/processing_parameters.py (84 added, 0 removed): Defines two Pydantic models, VLLMParameters and ProcessingParameters, for configuring the VLLM model and ETL document processing operations, respectively.
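The shape of these two models might be roughly as follows. The PR uses Pydantic; stdlib dataclasses are shown here for a self-contained sketch, and every field name is inferred from the summary, not taken from the actual file.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class VLLMParameters:
    """Connection settings for the visual LLM used during document parsing."""
    model_provider: str = "bedrock"  # default is an assumption
    model_id: Optional[str] = None
    model_api_url: Optional[str] = None
    model_secret_name: Optional[str] = None


@dataclass
class ProcessingParameters:
    """Per-document ETL inputs passed to the file loaders."""
    source_bucket_name: str = ""
    source_object_key: str = ""
    file_type: Optional[str] = None
    rows_per_document: int = 1  # used by CSV/XLSX loaders
    vllm_parameters: Optional[VLLMParameters] = None
```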
  • source/lambda/job/dep/llm_bot_dep/utils/s3_utils.py (153 added, 0 removed): Provides utility functions for interacting with AWS S3 and loading content from files or S3 objects, including encoding detection and uploading/downloading files to/from S3.
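Encoding detection in such a module might be approximated like this; the real code likely uses a charset-detection library, so this ordered-fallback version is only a simplified stand-in.

```python
def decode_with_fallback(raw: bytes,
                         encodings=("utf-8", "gb18030", "latin-1")) -> str:
    """Try candidate encodings in order and return the first clean decode."""
    for encoding in encodings:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    # Last resort: never fail, just substitute undecodable bytes.
    return raw.decode("utf-8", errors="replace")
```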
  • source/lambda/job/dep/llm_bot_dep/loaders/auto.py (64 added, 58 removed): Introduces a registry of file processors keyed by file extension, a function that dispatches objects to the appropriate processor from the registry, and a markdown header text splitter applied after processing, except for certain file types such as CSV, XLSX, and JSONL.
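An extension-to-processor registry of this kind can be sketched as below; the processor bodies and names here are placeholders, not the actual loader functions.

```python
from typing import Callable, Dict, List

PROCESSOR_REGISTRY: Dict[str, Callable[[str], List[str]]] = {}


def register_processor(*extensions: str):
    """Decorator mapping one or more file extensions to a processor."""
    def wrap(func):
        for ext in extensions:
            PROCESSOR_REGISTRY[ext.lower()] = func
        return func
    return wrap


@register_processor("txt", "md")
def process_text(key: str) -> List[str]:
    return [f"text-doc:{key}"]


@register_processor("csv", "xlsx", "jsonl")
def process_tabular(key: str) -> List[str]:
    return [f"tabular-doc:{key}"]


# Tabular outputs skip the markdown header splitter, as in the PR summary.
SKIP_MARKDOWN_SPLIT = {"csv", "xlsx", "jsonl"}


def process_object(file_type: str, key: str) -> List[str]:
    processor = PROCESSOR_REGISTRY.get(file_type.lower())
    if processor is None:
        raise ValueError(f"Unsupported file type: {file_type}")
    docs = processor(key)
    if file_type.lower() not in SKIP_MARKDOWN_SPLIT:
        # the real auto.py applies a markdown header text splitter here
        pass
    return docs
```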
  • source/lambda/etl/main.py (8 added, 0 removed): Adds four new parameters (modelProvider, modelId, modelApiUrl, and modelSecretName) with default values to the response dictionary in both offline and online modes.
  • source/lambda/job/dep/llm_bot_dep/loaders/text.py (58 added, 42 removed): Adds support for loading text files from S3 via the CustomTextLoader class; refactors process_text to accept a ProcessingParameters object containing the bucket and key; removes the text pre-processing and splitting logic, returning the entire document as a list.
  • source/lambda/job/dep/llm_bot_dep/loaders/html.py (50 added, 22 removed): Refactors CustomHtmlLoader to accept a file path and S3 URI, download the file locally, process it with a new VLLMParameters object, and return a single Document instead of a list; process_html is updated to use the new loader and handle temporary file creation and deletion.
  • source/infrastructure/lib/knowledge-base/knowledge-base-stack.ts (15 added, 3 removed): Adds support for custom language models, including the ability to specify the model provider, model ID, model API URL, and model secret name; updates the langchain_community package version; adds pypdf to the additional Python modules; reduces the retry attempts for the offline Glue job to 1.
  • source/lambda/job/dep/llm_bot_dep/loaders/docx.py (44 added, 41 removed): Updates CustomDocLoader to load and process docx files from an S3 bucket via a temporary local file and return the processed documents without splitting them; process_doc now handles ProcessingParameters and VLLMParameters objects and uses the CustomDocLoader and CustomHtmlLoader classes.
  • source/lambda/job/dep/llm_bot_dep/loaders/csv.py (48 added, 47 removed): Improves CustomCSVLoader by processing through a temporary file, downloading files from S3, handling file encodings, and packing multiple rows per document based on a specified parameter; introduces a new process_csv function for the CSV processing workflow.
  • source/infrastructure/lib/api/intention-management.ts (8 added, 8 removed): Changes how the Lambda function code is packaged and deployed, using a custom command to copy the necessary files to a temporary directory before deploying the Lambda function.
  • source/lambda/job/dep/llm_bot_dep/splitter_utils.py (36 added, 78 removed): Removes the _make_spacy_pipeline_for_splitting function, comments out the related NLTKTextSplitter and SpacyTextSplitter classes, and removes redundant lines in the find_child, parse_string_to_xml_node, extract_headings, _set_chunk_id, and split_text functions.
  • source/lambda/job/dep/llm_bot_dep/loaders/image.py (62 added, 53 removed): Refactors the image processing pipeline to download the image to a temporary local file and updates the logic for uploading the processed image to S3; introduces the ProcessingParameters class for input parameters and uses functions from the llm_bot_dep.utils.s3_utils module for S3 operations.
  • source/lambda/job/dep/llm_bot_dep/loaders/markdown.py (55 added, 28 removed): Refactors CustomMarkdownLoader to inherit from TextLoader, adding a lazy_load method that yields Documents and a load method that returns a list of Documents; implements process_md to download the file from S3 to a temporary local file, load and process it with CustomMarkdownLoader, and clean up the temporary file.
  • source/lambda/job/dep/llm_bot_dep/figure_llm.py (126 added, 75 removed): Adds support for OpenAI models in the figureUnderstand class, refactors the image upload process, and introduces a new VLLMParameters class for managing LLM parameters, along with minor bug fixes and logging improvements.
  • source/lambda/job/dep/llm_bot_dep/loaders/json.py (44 added, 51 removed): Updates CustomJsonLoader to load JSON files from a local file path or an S3 URI; process_json now downloads the file from S3 to a temporary local file, loads it with CustomJsonLoader, and returns a list of processed documents; also imports required modules and removes unnecessary code.
  • source/lambda/job/dep/llm_bot_dep/loaders/jsonl.py (96 added, 51 removed): Introduces a new CustomJsonlLoader class extending BaseLoader that downloads a JSON Lines (JSONL) file from S3 to a temporary local file, processes it line by line into Document objects with metadata, and returns the list of documents; process_jsonl now uses the new loader and accepts a ProcessingParameters object with the S3 bucket and object key.
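Line-by-line JSONL parsing of this sort can be sketched as below. The minimal Document stand-in and the `content` field name are assumptions; the real loader builds langchain Document objects.

```python
import json
from dataclasses import dataclass, field


@dataclass
class Document:
    """Minimal stand-in for langchain's Document class."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def load_jsonl(lines, source: str):
    """Parse JSONL lines into Documents, skipping blanks and keeping any
    extra record fields as metadata."""
    docs = []
    for i, line in enumerate(lines):
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        content = record.pop("content", "")  # assumed field name
        docs.append(Document(page_content=content,
                             metadata={"source": source, "row": i, **record}))
    return docs
```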
  • source/lambda/job/dep/llm_bot_dep/loaders/xlsx.py (106 added, 70 removed): Introduces a new CustomXlsxLoader class extending BaseLoader from langchain_community.document_loaders.base that loads an Excel file from a file path or an S3 URI, processes the data, and returns a list of Document objects; process_xlsx now uses the loader, accepts a ProcessingParameters object with the source bucket name, object key, and rows per document, downloads the file locally, and cleans up the temporary file.
  • source/lambda/job/dep/llm_bot_dep/loaders/pdf.py (202 added, 219 removed): Adds support for processing PDFs with a Visual Language Large Model (VLLM) and improves the PDF processing pipeline: configurable VLLM parameters (model provider, model ID, API URL, and secret name); a new process_small_pdf method that processes small PDFs directly without splitting; refactored PDF splitting and chunking logic; improved error handling and logging; removal of unnecessary code and dependencies; S3 operations optimized via utility functions; a simplified process_pdf function built on the ProcessingParameters schema.
  • source/lambda/job/glue-job-script.py (78 added, 153 removed): Adds new parameters MODEL_PROVIDER, MODEL_ID, MODEL_API_URL, and MODEL_SECRET_NAME for configuring the language model; replaces cb_process_object with process_object from a different module; introduces the ProcessingParameters and VLLMParameters classes to encapsulate processing and model parameters; renames S3FileProcessor to S3FileIterator and removes its file content processing logic; adds an update_etl_object_table function to record processing status in the ETL object table; refactors ingestion_pipeline and delete_pipeline to use ProcessingParameters; modifies create_processors_and_workers to use the new S3FileIterator.
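Wiring the four new Glue job arguments into model configuration might look like this; the argument dict shape (as produced by Glue's getResolvedOptions) and the bedrock default are assumptions.

```python
def build_model_args(job_args: dict) -> dict:
    """Map the new Glue job arguments (MODEL_PROVIDER, MODEL_ID,
    MODEL_API_URL, MODEL_SECRET_NAME) to a model configuration dict."""
    return {
        "model_provider": job_args.get("MODEL_PROVIDER", "bedrock"),
        "model_id": job_args.get("MODEL_ID"),
        "model_api_url": job_args.get("MODEL_API_URL"),
        "model_secret_name": job_args.get("MODEL_SECRET_NAME"),
    }
```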

@IcyKallen merged commit 9c4a401 into dev on Mar 10, 2025
6 checks passed
@IcyKallen deleted the xuhan-dev branch on March 10, 2025 at 09:52