Fixes #
🤖 AI-Generated PR Description (Powered by Amazon Bedrock)
Description
This pull request includes a variety of changes across multiple files and directories. The primary changes are:
- The llm_bot_dep package, which has seen modifications to various loader modules and utility functions, along with the addition of new modules for processing parameters and utility functions.
- The opensearch_vector_search.py module from the llm_bot_dep package.

Type of change
Motivation and Context
[Provide the motivation and context for the changes here, including any dependencies or requirements.]
File Stats Summary
Number of files involved in this PR: 27. The file changes summary is as follows:

| File | Lines changed | Summary |
| --- | --- | --- |
| source/lambda/job/dep/llm_bot_dep/loaders/html.py | 50 added, 22 removed | The code changes involve refactoring the CustomHtmlLoader class to accept a file path and S3 URI, downloading the file locally, processing it with a new VLLMParameters object, and returning a single Document object instead of a list. The process_html function is updated to use the new loader and handle temporary file creation/deletion. |
| source/infrastructure/lib/knowledge-base/knowledge-base-stack.ts | 15 added, 3 removed | The code changes add support for custom language models, including the ability to specify the model provider, model ID, model API URL, and model secret name. Additionally, it updates the langchain_community package version and adds the pypdf package to the list of additional Python modules. The retry attempts for the offline Glue job are also reduced to 1. |
| source/lambda/job/dep/llm_bot_dep/loaders/docx.py | 44 added, 41 removed | The code changes involve updating the CustomDocLoader class to load and process docx files from an S3 bucket, using a temporary local file, and returning the processed documents without splitting them. The process_doc function has been updated to handle ProcessingParameters and VLLMParameters objects, and to use the CustomDocLoader and CustomHtmlLoader classes. |
| source/lambda/job/dep/llm_bot_dep/loaders/csv.py | 48 added, 47 removed | The code changes include improvements to the CustomCSVLoader class, such as using a temporary file for processing, downloading files from S3, handling file encodings, and processing multiple rows per document based on a specified parameter. Additionally, a new process_csv function is introduced to handle the CSV processing workflow. |
| source/infrastructure/lib/api/intention-management.ts | 8 added, 8 removed | The code changes involve modifications to the way the Lambda function code is packaged and deployed, utilizing a custom command to copy the necessary files from the local file system to a temporary directory before deploying the Lambda function. |
| source/lambda/job/dep/llm_bot_dep/splitter_utils.py | 36 added, 78 removed | The code changes remove the _make_spacy_pipeline_for_splitting function and comment out the related classes NLTKTextSplitter and SpacyTextSplitter. They also remove some redundant lines of code in the find_child, parse_string_to_xml_node, extract_headings, _set_chunk_id, and split_text functions. |
| source/lambda/job/dep/llm_bot_dep/loaders/image.py | 62 added, 53 removed | The code changes involve refactoring the image processing pipeline, including downloading the image file locally, using a temporary file, and updating the logic for uploading the processed image to S3. They also introduce a new ProcessingParameters class to manage input parameters and use functions from the llm_bot_dep.utils.s3_utils module for S3 operations. |
| source/lambda/job/dep/llm_bot_dep/loaders/markdown.py | 55 added, 28 removed | The code changes refactor CustomMarkdownLoader to inherit from TextLoader, add a lazy_load method to yield Documents, and add a load method to return a list of Documents. They also implement a process_md function to download the file from S3, create a temporary local file, use CustomMarkdownLoader to load and process the file, and clean up the temporary file. |
| source/lambda/job/dep/llm_bot_dep/figure_llm.py | 126 added, 75 removed | The code changes add support for OpenAI models in the figureUnderstand class, refactor the image upload process, and introduce a new VLLMParameters class for managing LLM parameters. Additionally, they include minor bug fixes and logging improvements. |
| source/lambda/job/dep/llm_bot_dep/loaders/json.py | 44 added, 51 removed | The code changes update the CustomJsonLoader class to load JSON files from a local file path or an S3 URI. The process_json function is modified to download the file from S3 to a temporary local file, use the CustomJsonLoader to load the content, and return a list of processed documents. The changes also include importing required modules and removing unnecessary code. |
| source/lambda/job/dep/llm_bot_dep/loaders/jsonl.py | 96 added, 51 removed | The code changes introduce a new class CustomJsonlLoader that extends BaseLoader to load JSON Lines (JSONL) files. It downloads the JSONL file from an S3 bucket to a temporary local file, processes the file line by line to create Document objects with metadata, and returns a list of these documents. The process_jsonl function has been updated to use the new CustomJsonlLoader class and to accept a ProcessingParameters object containing the S3 bucket and object key information. |
| source/lambda/job/dep/llm_bot_dep/loaders/xlsx.py | 106 added, 70 removed | The code changes introduce a new class CustomXlsxLoader that extends BaseLoader from langchain_community.document_loaders.base. It loads an Excel file from a file path or an S3 URI, processes the data, and returns a list of Document objects. The process_xlsx function is updated to use the CustomXlsxLoader class and accepts a ProcessingParameters object containing the source bucket name, object key, and rows per document for Excel files. It downloads the file locally, uses the loader, and cleans up the temporary file. |
| source/lambda/job/dep/llm_bot_dep/loaders/pdf.py | 202 added, 219 removed | This code change adds support for processing PDFs using a visual large language model (VLLM) and makes several improvements to the PDF processing pipeline: it adds support for configuring VLLM parameters (model provider, model ID, API URL, and secret name); implements a new process_small_pdf method to process small PDFs directly without splitting; refactors the PDF splitting and chunking logic; improves error handling and logging; removes unnecessary code and dependencies; optimizes S3 operations using utility functions; and simplifies the process_pdf function while introducing a ProcessingParameters schema. |
| source/lambda/job/glue-job-script.py | 78 added, 153 removed | This code change adds new parameters MODEL_PROVIDER, MODEL_ID, MODEL_API_URL, and MODEL_SECRET_NAME for configuring the language model; replaces cb_process_object with process_object from a different module; introduces ProcessingParameters and VLLMParameters classes to encapsulate processing and model parameters; renames S3FileProcessor to S3FileIterator and removes the file content processing logic; adds an update_etl_object_table function to update the ETL object table with processing status; refactors ingestion_pipeline and delete_pipeline to use ProcessingParameters; and modifies create_processors_and_workers to use the new S3FileIterator. |
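
The summaries above repeatedly reference ProcessingParameters and VLLMParameters objects that are threaded through the loaders and the Glue job. As a rough illustration only (the field names below are assumptions inferred from the summaries, not the actual definitions in llm_bot_dep), these could be simple dataclasses:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class VLLMParameters:
    # Matches the four model settings the summaries mention:
    # provider, model ID, API URL, and the secret holding credentials.
    model_provider: Optional[str] = None
    model_id: Optional[str] = None
    model_api_url: Optional[str] = None
    model_secret_name: Optional[str] = None


@dataclass
class ProcessingParameters:
    # S3 location of the object to process, per the loader summaries.
    source_bucket_name: str
    source_object_key: str
    # Used by the csv/xlsx loaders; the default of 1 is an assumption.
    rows_per_document: int = 1
    # Model settings for loaders that call a VLLM (pdf, html, docx).
    vllm_parameters: Optional[VLLMParameters] = None
```

Encapsulating the parameters this way is what lets the summaries describe process_doc, process_pdf, and the Glue pipelines all accepting a single ProcessingParameters object instead of long positional argument lists.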
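
Several loader rows (markdown, json, jsonl, csv, xlsx) describe the same flow: download the S3 object to a temporary local file, load it, and clean the file up afterwards. A minimal sketch of that pattern, with the helper name, signature, and Document stand-in all hypothetical (the real loaders use boto3 and the llm_bot_dep.utils.s3_utils helpers):

```python
import os
import tempfile
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Document:
    # Stand-in for the langchain Document: page content plus metadata.
    page_content: str
    metadata: dict = field(default_factory=dict)


def load_via_temp_file(
    download: Callable[[str], None],   # e.g. writes the S3 object to the given path
    load: Callable[[str], List[Document]],
    suffix: str = "",
) -> List[Document]:
    """Download an object to a temporary local file, load it, and always clean up."""
    fd, local_path = tempfile.mkstemp(suffix=suffix)
    os.close(fd)  # mkstemp opens the file; close the fd so callbacks can reopen it
    try:
        download(local_path)
        return load(local_path)
    finally:
        os.remove(local_path)  # temp file is deleted even if loading fails
```

A process_md-style function would then pass a download callback that fetches the object from the source bucket and a load callback that wraps the concrete loader class.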