
feat: refactor etl #563

Merged: 12 commits into dev on Mar 10, 2025
Conversation

IcyKallen (Contributor)

@IcyKallen commented Mar 10, 2025

Fixes #

🤖 AI-Generated PR Description (Powered by Amazon Bedrock)

Description

This pull request includes a variety of changes across multiple files and directories. The primary changes are:

  1. Updates to the infrastructure code for the API, knowledge base, and IAM helper.
  2. Modifications to the ETL (Extract, Transform, Load) Lambda functions and related scripts.
  3. Updates to the job Lambda function, including changes to its dependency package (llm_bot_dep): modifications to various loader modules and utility functions, plus new modules for processing parameters and shared utilities.
  4. Removal of the opensearch_vector_search.py module from the llm_bot_dep package.
  5. Changes to the Glue job script.
  6. Updates to the common utility functions for online Lambda functions.

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

Motivation and Context

[Provide the motivation and context for the changes here, including any dependencies or requirements.]

File Stats Summary

Number of files involved in this PR: 27.

The file changes summary is as follows:

  • source/lambda/job/dep/llm_bot_dep/opensearch_vector_search.py (0 added, 1400 removed): This file is removed in this PR.
  • source/lambda/job/dep/llm_bot_dep/sm_utils.py (3 added, 5 removed): Removes the hard-coded AWS region in favor of the default AWS region and removes unnecessary whitespace lines.
  • source/lambda/job/build_whl.sh (8 added, 0 removed): Changes the current working directory to the script's own location, so the script runs correctly regardless of the directory it is invoked from.
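A minimal sketch of the kind of guard described for build_whl.sh. The build command shown is illustrative only; the real script builds the llm_bot_dep wheel.

```shell
#!/usr/bin/env bash
# Resolve this script's own directory and cd into it, so relative paths
# (e.g. setup.py, dist/) work no matter where the script is invoked from.
SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)"
cd "$SCRIPT_DIR" || exit 1
echo "Building wheel from $SCRIPT_DIR"
# The actual wheel build step depends on the project's packaging setup, e.g.:
# pip wheel . -w dist/
```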
  • source/infrastructure/lib/shared/iam-helper.ts (1 added, 1 removed): Removes a trailing newline character at the end of the file.
  • source/lambda/job/dep/llm_bot_dep/utils/secrets_manager_utils.py (33 added, 0 removed): Defines a function that retrieves an API key from AWS Secrets Manager using the Boto3 library.
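A sketch of what such a Secrets Manager lookup might look like. The function names and the `api_key` payload field are assumptions, not the module's actual API; `get_secret_value` is the standard Boto3 call.

```python
import json


def extract_api_key(secret_string: str, field: str = "api_key") -> str:
    """Parse a SecretString payload; plain (non-JSON) strings are returned as-is."""
    try:
        payload = json.loads(secret_string)
    except json.JSONDecodeError:
        return secret_string
    if field not in payload:
        raise KeyError(f"Secret payload has no '{field}' field")
    return payload[field]


def get_api_key(secret_name: str, region_name: str = None) -> str:
    """Fetch a secret from AWS Secrets Manager and return its API key."""
    import boto3  # imported lazily so the pure helper above stays testable

    client = boto3.client("secretsmanager", region_name=region_name)
    response = client.get_secret_value(SecretId=secret_name)
    return extract_api_key(response["SecretString"])
```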
  • source/lambda/online/common_logic/common_utils/response_utils.py (8 added, 2 removed): Imports DynamoDBChatMessageHistory, reorganizes imports, and adds blank lines for readability.
  • source/lambda/etl/sfn_handler.py (7 added, 0 removed): Adds validation for the required fields (modelId, modelSecretName, modelApiUrl) when the OpenAI model provider is used, raising a ValueError if any are missing.
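The validation described could look roughly like this. The provider string `"openai"` and the flat event shape are assumptions; the real sfn_handler.py may differ.

```python
def validate_model_fields(event: dict) -> None:
    """Raise ValueError when the OpenAI provider is selected but any of its
    required connection fields is missing or empty."""
    required = ("modelId", "modelSecretName", "modelApiUrl")
    if event.get("modelProvider") == "openai":  # provider literal is assumed
        missing = [name for name in required if not event.get(name)]
        if missing:
            raise ValueError(
                "Missing required fields for the OpenAI model provider: "
                + ", ".join(missing)
            )
```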
  • source/lambda/job/dep/llm_bot_dep/enhance_utils.py (2 added, 5 removed): Comments out the NLTK tokenization step, adds a TODO to use a better tokenization method, and removes the code that calculated the total token count and chunk number.
  • source/lambda/job/dep/llm_bot_dep/schemas/processing_parameters.py (84 added, 0 removed): Defines two Pydantic models, VLLMParameters and ProcessingParameters, for configuring the VLLM model and ETL document processing operations, respectively.
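The shape of these two models might be roughly as follows. The PR uses Pydantic; stdlib dataclasses are shown here for a self-contained sketch, and every field name is inferred from the summary, not taken from the actual file.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class VLLMParameters:
    """Connection settings for the visual LLM used during document parsing."""
    model_provider: str = "bedrock"  # default is an assumption
    model_id: Optional[str] = None
    model_api_url: Optional[str] = None
    model_secret_name: Optional[str] = None


@dataclass
class ProcessingParameters:
    """Per-document ETL inputs passed to the file loaders."""
    source_bucket_name: str = ""
    source_object_key: str = ""
    file_type: Optional[str] = None
    rows_per_document: int = 1  # used by CSV/XLSX loaders
    vllm_parameters: Optional[VLLMParameters] = None
```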
  • source/lambda/job/dep/llm_bot_dep/utils/s3_utils.py (153 added, 0 removed): Provides utility functions for interacting with AWS S3 and loading content from files or S3 objects, including encoding detection and uploading/downloading files to/from S3.
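Encoding detection in such a module might be approximated like this; the real code likely uses a charset-detection library, so this ordered-fallback version is only a simplified stand-in.

```python
def decode_with_fallback(raw: bytes,
                         encodings=("utf-8", "gb18030", "latin-1")) -> str:
    """Try candidate encodings in order and return the first clean decode."""
    for encoding in encodings:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    # Last resort: never fail, just substitute undecodable bytes.
    return raw.decode("utf-8", errors="replace")
```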
  • source/lambda/job/dep/llm_bot_dep/loaders/auto.py (64 added, 58 removed): Introduces a registry of file processors keyed by file extension, a function that dispatches objects to the appropriate processor from the registry, and a markdown header text splitter applied after processing, except for certain file types such as CSV, XLSX, and JSONL.
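An extension-to-processor registry of this kind can be sketched as below; the processor bodies and names here are placeholders, not the actual loader functions.

```python
from typing import Callable, Dict, List

PROCESSOR_REGISTRY: Dict[str, Callable[[str], List[str]]] = {}


def register_processor(*extensions: str):
    """Decorator mapping one or more file extensions to a processor."""
    def wrap(func):
        for ext in extensions:
            PROCESSOR_REGISTRY[ext.lower()] = func
        return func
    return wrap


@register_processor("txt", "md")
def process_text(key: str) -> List[str]:
    return [f"text-doc:{key}"]


@register_processor("csv", "xlsx", "jsonl")
def process_tabular(key: str) -> List[str]:
    return [f"tabular-doc:{key}"]


# Tabular outputs skip the markdown header splitter, as in the PR summary.
SKIP_MARKDOWN_SPLIT = {"csv", "xlsx", "jsonl"}


def process_object(file_type: str, key: str) -> List[str]:
    processor = PROCESSOR_REGISTRY.get(file_type.lower())
    if processor is None:
        raise ValueError(f"Unsupported file type: {file_type}")
    docs = processor(key)
    if file_type.lower() not in SKIP_MARKDOWN_SPLIT:
        # the real auto.py applies a markdown header text splitter here
        pass
    return docs
```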
  • source/lambda/etl/main.py (8 added, 0 removed): Adds four new parameters (modelProvider, modelId, modelApiUrl, and modelSecretName) with default values to the response dictionary in both offline and online modes.
  • source/lambda/job/dep/llm_bot_dep/loaders/text.py (58 added, 42 removed): Adds support for loading text files from S3 via the CustomTextLoader class; refactors process_text to accept a ProcessingParameters object containing the bucket and key; removes the text pre-processing and splitting logic, returning the entire document as a list.
  • source/lambda/job/dep/llm_bot_dep/loaders/html.py (50 added, 22 removed): Refactors CustomHtmlLoader to accept a file path and S3 URI, download the file locally, process it with a new VLLMParameters object, and return a single Document instead of a list; process_html is updated to use the new loader and handle temporary file creation and deletion.
  • source/infrastructure/lib/knowledge-base/knowledge-base-stack.ts (15 added, 3 removed): Adds support for custom language models, including the ability to specify the model provider, model ID, model API URL, and model secret name; updates the langchain_community package version; adds pypdf to the additional Python modules; reduces the retry attempts for the offline Glue job to 1.
  • source/lambda/job/dep/llm_bot_dep/loaders/docx.py (44 added, 41 removed): Updates CustomDocLoader to load and process docx files from an S3 bucket via a temporary local file and return the processed documents without splitting them; process_doc now handles ProcessingParameters and VLLMParameters objects and uses the CustomDocLoader and CustomHtmlLoader classes.
  • source/lambda/job/dep/llm_bot_dep/loaders/csv.py (48 added, 47 removed): Improves CustomCSVLoader by processing through a temporary file, downloading files from S3, handling file encodings, and packing multiple rows per document based on a specified parameter; introduces a new process_csv function for the CSV processing workflow.
  • source/infrastructure/lib/api/intention-management.ts (8 added, 8 removed): Changes how the Lambda function code is packaged and deployed, using a custom command to copy the necessary files to a temporary directory before deploying the Lambda function.
  • source/lambda/job/dep/llm_bot_dep/splitter_utils.py (36 added, 78 removed): Removes the _make_spacy_pipeline_for_splitting function, comments out the related NLTKTextSplitter and SpacyTextSplitter classes, and removes redundant lines in the find_child, parse_string_to_xml_node, extract_headings, _set_chunk_id, and split_text functions.
  • source/lambda/job/dep/llm_bot_dep/loaders/image.py (62 added, 53 removed): Refactors the image processing pipeline to download the image to a temporary local file and updates the logic for uploading the processed image to S3; introduces the ProcessingParameters class for input parameters and uses functions from the llm_bot_dep.utils.s3_utils module for S3 operations.
  • source/lambda/job/dep/llm_bot_dep/loaders/markdown.py (55 added, 28 removed): Refactors CustomMarkdownLoader to inherit from TextLoader, adding a lazy_load method that yields Documents and a load method that returns a list of Documents; implements process_md to download the file from S3 to a temporary local file, load and process it with CustomMarkdownLoader, and clean up the temporary file.
  • source/lambda/job/dep/llm_bot_dep/figure_llm.py (126 added, 75 removed): Adds support for OpenAI models in the figureUnderstand class, refactors the image upload process, and introduces a new VLLMParameters class for managing LLM parameters, along with minor bug fixes and logging improvements.
  • source/lambda/job/dep/llm_bot_dep/loaders/json.py (44 added, 51 removed): Updates CustomJsonLoader to load JSON files from a local file path or an S3 URI; process_json now downloads the file from S3 to a temporary local file, loads it with CustomJsonLoader, and returns a list of processed documents; also imports required modules and removes unnecessary code.
  • source/lambda/job/dep/llm_bot_dep/loaders/jsonl.py (96 added, 51 removed): Introduces a new CustomJsonlLoader class extending BaseLoader that downloads a JSON Lines (JSONL) file from S3 to a temporary local file, processes it line by line into Document objects with metadata, and returns the list of documents; process_jsonl now uses the new loader and accepts a ProcessingParameters object with the S3 bucket and object key.
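Line-by-line JSONL parsing of this sort can be sketched as below. The minimal Document stand-in and the `content` field name are assumptions; the real loader builds langchain Document objects.

```python
import json
from dataclasses import dataclass, field


@dataclass
class Document:
    """Minimal stand-in for langchain's Document class."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def load_jsonl(lines, source: str):
    """Parse JSONL lines into Documents, skipping blanks and keeping any
    extra record fields as metadata."""
    docs = []
    for i, line in enumerate(lines):
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        content = record.pop("content", "")  # assumed field name
        docs.append(Document(page_content=content,
                             metadata={"source": source, "row": i, **record}))
    return docs
```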
  • source/lambda/job/dep/llm_bot_dep/loaders/xlsx.py (106 added, 70 removed): Introduces a new CustomXlsxLoader class extending BaseLoader from langchain_community.document_loaders.base that loads an Excel file from a file path or an S3 URI, processes the data, and returns a list of Document objects; process_xlsx now uses the loader, accepts a ProcessingParameters object with the source bucket name, object key, and rows per document, downloads the file locally, and cleans up the temporary file.
  • source/lambda/job/dep/llm_bot_dep/loaders/pdf.py (202 added, 219 removed): Adds support for processing PDFs with a Visual Language Large Model (VLLM) and improves the PDF processing pipeline: configurable VLLM parameters (model provider, model ID, API URL, and secret name); a new process_small_pdf method that processes small PDFs directly without splitting; refactored PDF splitting and chunking logic; improved error handling and logging; removal of unnecessary code and dependencies; S3 operations optimized via utility functions; a simplified process_pdf function built on the ProcessingParameters schema.
  • source/lambda/job/glue-job-script.py (78 added, 153 removed): Adds new parameters MODEL_PROVIDER, MODEL_ID, MODEL_API_URL, and MODEL_SECRET_NAME for configuring the language model; replaces cb_process_object with process_object from a different module; introduces the ProcessingParameters and VLLMParameters classes to encapsulate processing and model parameters; renames S3FileProcessor to S3FileIterator and removes its file content processing logic; adds an update_etl_object_table function to record processing status in the ETL object table; refactors ingestion_pipeline and delete_pipeline to use ProcessingParameters; modifies create_processors_and_workers to use the new S3FileIterator.
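Wiring the four new Glue job arguments into model configuration might look like this; the argument dict shape (as produced by Glue's getResolvedOptions) and the bedrock default are assumptions.

```python
def build_model_args(job_args: dict) -> dict:
    """Map the new Glue job arguments (MODEL_PROVIDER, MODEL_ID,
    MODEL_API_URL, MODEL_SECRET_NAME) to a model configuration dict."""
    return {
        "model_provider": job_args.get("MODEL_PROVIDER", "bedrock"),
        "model_id": job_args.get("MODEL_ID"),
        "model_api_url": job_args.get("MODEL_API_URL"),
        "model_secret_name": job_args.get("MODEL_SECRET_NAME"),
    }
```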

@IcyKallen merged commit 9c4a401 into dev on Mar 10, 2025
6 checks passed
@IcyKallen deleted the xuhan-dev branch on March 10, 2025 at 09:52