diff --git a/experimental/serverless-fleets/images/examples_docling_flow.png b/experimental/serverless-fleets/images/examples_docling_flow.png index b85eb8720..0273edd7b 100644 Binary files a/experimental/serverless-fleets/images/examples_docling_flow.png and b/experimental/serverless-fleets/images/examples_docling_flow.png differ diff --git a/experimental/serverless-fleets/tutorials/docling/Dockerfile b/experimental/serverless-fleets/tutorials/docling/Dockerfile deleted file mode 100644 index 76f05d9d8..000000000 --- a/experimental/serverless-fleets/tutorials/docling/Dockerfile +++ /dev/null @@ -1,34 +0,0 @@ -FROM python:3.11-slim-bookworm - -ENV GIT_SSH_COMMAND="ssh -o StrictHostKeyChecking=no" - -RUN apt-get update \ - && apt-get install -y libgl1 libglib2.0-0 curl wget git procps \ - && apt-get clean - -# This will install torch with *only* cpu support -# Remove the --extra-index-url part if you want to install all the gpu requirements -# For more details in the different torch distribution visit https://pytorch.org/. - -# without GPU: -# RUN pip install --no-cache-dir docling --extra-index-url https://download.pytorch.org/whl/cpu - -# with GPU: -RUN pip install --no-cache-dir docling --extra-index-url https://download.pytorch.org/whl/cu126 - -ENV HF_HOME=/tmp/ -ENV TORCH_HOME=/tmp/ - -COPY minimal.py /root/minimal.py - -RUN docling-tools models download - -# On container environments, always set a thread budget to avoid undesired thread congestion. -ENV OMP_NUM_THREADS=4 - -# On container shell: -# > cd /root/ -# > python minimal.py - -# Running as `docker run -e DOCLING_ARTIFACTS_PATH=/root/.cache/docling/models` will use the -# model weights included in the container image. 
diff --git a/experimental/serverless-fleets/tutorials/docling/README.md b/experimental/serverless-fleets/tutorials/docling/README.md index 83fb7fec4..de20da89e 100644 --- a/experimental/serverless-fleets/tutorials/docling/README.md +++ b/experimental/serverless-fleets/tutorials/docling/README.md @@ -4,8 +4,7 @@ This tutorial provides a comprehensive guide on using Docling to convert PDFs in Key steps covered in the Tutorial: 1. Upload the examples PDFs to COS -2. Containerization with Code Engine: Build the Docling container and push it to a registry for deployment. -3. Run a fleet of workers that automatically runs the container, ensuring scalability and efficiency. +2. Run a fleet of workers that automatically runs the official docling container, ensuring scalability and efficiency. 4. Download the resulting markdown files from COS This setup is ideal for automating document conversion workflows in a cost-effective, serverless environment. @@ -26,17 +25,41 @@ ls data/tutorials/docling/pdfs ./upload ``` -### Step 2 - Build and Push the container registry +### Step 2 - Review the commands -Build the container image using Code Engine's build capabilities by running the following command in the `tutorials/docling` directory. +Review `commands.jsonl`, which defines one task per PDF: the docling command to run and its arguments. ``` cd tutorials/docling -./build +cat commands.jsonl ``` + +
+ Output + +``` +➜ cat commands.jsonl + +{ "command":"docling", "args": ["--num-threads", "24", "/mnt/ce/data/tutorials/docling/pdfs/2203.01017v2.pdf", "--output", "/mnt/ce/data/result/docling_2203.01017v2.pdf.md" ]} +{ "command":"docling", "args": ["--num-threads", "24", "/mnt/ce/data/tutorials/docling/pdfs/2206.01062.pdf", "--output", "/mnt/ce/data/result/docling_2206.01062.pdf.md" ]} +{ "command":"docling", "args": ["--num-threads", "24", "/mnt/ce/data/tutorials/docling/pdfs/2305.03393v1-pg9.pdf", "--output", "/mnt/ce/data/result/docling_2305.03393v1-pg9.pdf.md" ]} +{ "command":"docling", "args": ["--num-threads", "24", "/mnt/ce/data/tutorials/docling/pdfs/2305.03393v1.pdf", "--output", "/mnt/ce/data/result/docling_2305.03393v1.pdf.md" ]} +{ "command":"docling", "args": ["--num-threads", "24", "/mnt/ce/data/tutorials/docling/pdfs/amt_handbook_sample.pdf", "--output", "/mnt/ce/data/result/docling_amt_handbook_sample.pdf.md" ]} +{ "command":"docling", "args": ["--num-threads", "24", "/mnt/ce/data/tutorials/docling/pdfs/code_and_formula.pdf", "--output", "/mnt/ce/data/result/docling_code_and_formula.pdf.md" ]} +{ "command":"docling", "args": ["--num-threads", "24", "/mnt/ce/data/tutorials/docling/pdfs/picture_classification.pdf", "--output", "/mnt/ce/data/result/docling_picture_classification.pdf.md" ]} +{ "command":"docling", "args": ["--num-threads", "24", "/mnt/ce/data/tutorials/docling/pdfs/redp5110_sampled.pdf", "--output", "/mnt/ce/data/result/docling_redp5110_sampled.pdf.md" ]} +{ "command":"docling", "args": ["--num-threads", "24", "/mnt/ce/data/tutorials/docling/pdfs/right_to_left_01.pdf", "--output", "/mnt/ce/data/result/docling_right_to_left_01.pdf.md" ]} +{ "command":"docling", "args": ["--num-threads", "24", "/mnt/ce/data/tutorials/docling/pdfs/right_to_left_02.pdf", "--output", "/mnt/ce/data/result/docling_right_to_left_02.pdf.md" ]} +{ "command":"docling", "args": ["--num-threads", "24", "/mnt/ce/data/tutorials/docling/pdfs/right_to_left_03.pdf", 
"--output", "/mnt/ce/data/result/docling_right_to_left_03.pdf.md" ]} +``` +
+
+ ### Step 3 - Run the Fleet -Now run the fleet to process the PDFs. In this tutorial we use the static array index with `--task 11` to specify the tasks for the 11 pdfs. The command is a bash script which is using the `CE_TASK_ID`, which contains values `0..10`, to fetch the pdf file. It's then running docling with 24 CPUs on the mx3d-24x240 worker. Therefore it's only running one instance per worker and utilizing the full worker. We run 4 instance and workers in parallel. Run the fleet with the following command in the `tutorials/docling` directory. +Now run the fleet to process the PDFs. In this tutorial we pass `--tasks-from-file commands.jsonl` to define one task for each of the 11 PDFs. We give each task 24 vCPU, run docling with `--num-threads 24`, and choose the mx3d-24x240 worker profile, which has 24 vCPU. Each worker therefore runs only one docling command at a time, and each PDF gets a full worker. With `--max-scale 4`, we run 4 instances, and therefore 4 workers, in parallel. + +Launch the fleet with the following command in the `tutorials/docling` directory. ``` ./run ``` @@ -53,12 +76,9 @@ ibmcloud code-engine experimental fleet run --name fleet-0eb02f2f-1 --registry-secret fleet-registry-secret --worker-profile mx3d-24x240 --max-scale 4 - --tasks 11 + --tasks-from-file commands.jsonl --cpu 24 --memory 240G - --command=bash - --arg -c - --arg mkdir -p /mnt/ce/data/result/$CE_FLEET_ID/; cd /mnt/ce/data/tutorials/docling/pdfs; files=( * ); docling --artifacts-path=/root/.cache/docling/models ${files[CE_TASK_ID]} --num-threads 24 --output /mnt/ce/data/result/$CE_FLEET_ID/; Preparing your tasks: ⠼ Please wait...took 11.233582 seconds. Preparing your tasks: ⠴ Please wait... COS Bucket used 'ce-fleet-sandbox-data-fbfdde1d'... @@ -139,7 +159,7 @@ Succeeded Tasks: 0
-If you like you can jump to the machine and see docling processing by running the following command in the root directory: +(Optional) If you like, you can jump to the machine and watch docling processing by running the following command in the root directory: ``` ./jump ``` @@ -147,6 +167,19 @@ If you like you can jump to the machine and see docling processing by running th You can use `htop` to see that docling is processing the PDFs ![](../../images/examples_docling.jpg) + +#### Playing with more parallelism + +If you want to modify the tutorial to add more parallelism, e.g. to run 4 docling commands per worker, you could change the arguments and the run script as follows: +1. Set the arguments in commands.jsonl to `--num-threads 6` +2. Set the CPU per task to `--cpu 6` +Now, with `--max-scale 4` you would only get a single worker. Set `--max-scale 8` to get 2 workers, each processing 4 docling commands. + +#### Run with a Serverless GPU + +Run `./run_gpu` to launch the docling commands on a GPU. This example brings up a single `gx3-24x120x1l40s` worker and processes the 11 PDFs sequentially. + + ### Step 4 - Download results Download the results from the COS by running the following command in the root directory: @@ -156,7 +189,7 @@ Download the results from the COS by running the following command in the root d You can find the results under ``` -ls -l data/result// +ls -l data/result/docling_* ``` diff --git a/experimental/serverless-fleets/tutorials/docling/build b/experimental/serverless-fleets/tutorials/docling/build deleted file mode 100755 index 2eb4fbe99..000000000 --- a/experimental/serverless-fleets/tutorials/docling/build +++ /dev/null @@ -1,11 +0,0 @@ -#!/bin/sh - -REGISTRY=$(ibmcloud ce secret get -n fleet-registry-secret --output json | jq -r '.data.server') - -uuid=$(uuidgen | tr '[:upper:]' '[:lower:]' | awk -F- '{print $1}') - -ibmcloud ce buildrun submit --source . 
--strategy dockerfile --image $REGISTRY/ce--fleet-docling/docling:latest --registry-secret fleet-registry-secret --name ce--fleet-docling-build-${uuid} --size xxlarge --timeout 1800 - -ibmcloud ce buildrun logs -f -n ce--fleet-docling-build-${uuid} - -# takes about 365.8s. \ No newline at end of file diff --git a/experimental/serverless-fleets/tutorials/docling/commands.jsonl b/experimental/serverless-fleets/tutorials/docling/commands.jsonl new file mode 100644 index 000000000..47eba41fb --- /dev/null +++ b/experimental/serverless-fleets/tutorials/docling/commands.jsonl @@ -0,0 +1,11 @@ +{ "command":"docling", "args": ["--num-threads", "24", "/mnt/ce/data/tutorials/docling/pdfs/2203.01017v2.pdf", "--output", "/mnt/ce/data/result/docling_2203.01017v2.pdf.md" ]} +{ "command":"docling", "args": ["--num-threads", "24", "/mnt/ce/data/tutorials/docling/pdfs/2206.01062.pdf", "--output", "/mnt/ce/data/result/docling_2206.01062.pdf.md" ]} +{ "command":"docling", "args": ["--num-threads", "24", "/mnt/ce/data/tutorials/docling/pdfs/2305.03393v1-pg9.pdf", "--output", "/mnt/ce/data/result/docling_2305.03393v1-pg9.pdf.md" ]} +{ "command":"docling", "args": ["--num-threads", "24", "/mnt/ce/data/tutorials/docling/pdfs/2305.03393v1.pdf", "--output", "/mnt/ce/data/result/docling_2305.03393v1.pdf.md" ]} +{ "command":"docling", "args": ["--num-threads", "24", "/mnt/ce/data/tutorials/docling/pdfs/amt_handbook_sample.pdf", "--output", "/mnt/ce/data/result/docling_amt_handbook_sample.pdf.md" ]} +{ "command":"docling", "args": ["--num-threads", "24", "/mnt/ce/data/tutorials/docling/pdfs/code_and_formula.pdf", "--output", "/mnt/ce/data/result/docling_code_and_formula.pdf.md" ]} +{ "command":"docling", "args": ["--num-threads", "24", "/mnt/ce/data/tutorials/docling/pdfs/picture_classification.pdf", "--output", "/mnt/ce/data/result/docling_picture_classification.pdf.md" ]} +{ "command":"docling", "args": ["--num-threads", "24", "/mnt/ce/data/tutorials/docling/pdfs/redp5110_sampled.pdf", 
"--output", "/mnt/ce/data/result/docling_redp5110_sampled.pdf.md" ]} +{ "command":"docling", "args": ["--num-threads", "24", "/mnt/ce/data/tutorials/docling/pdfs/right_to_left_01.pdf", "--output", "/mnt/ce/data/result/docling_right_to_left_01.pdf.md" ]} +{ "command":"docling", "args": ["--num-threads", "24", "/mnt/ce/data/tutorials/docling/pdfs/right_to_left_02.pdf", "--output", "/mnt/ce/data/result/docling_right_to_left_02.pdf.md" ]} +{ "command":"docling", "args": ["--num-threads", "24", "/mnt/ce/data/tutorials/docling/pdfs/right_to_left_03.pdf", "--output", "/mnt/ce/data/result/docling_right_to_left_03.pdf.md" ]} diff --git a/experimental/serverless-fleets/tutorials/docling/create_commands b/experimental/serverless-fleets/tutorials/docling/create_commands new file mode 100755 index 000000000..1af1681bf --- /dev/null +++ b/experimental/serverless-fleets/tutorials/docling/create_commands @@ -0,0 +1,9 @@ +#!/bin/sh + +#ls -l ../../data/tutorials/docling/pdfs/*.pdf | awk '{ printf " { \"command\":\"docling\", \"args\": [\"--num-threads\", \"24\", \""$8"\", \"--output\", \"/mnt/ce/data/result\"$8".md\" ]}\n" }' > commands.jsonl + +cd ../../data/tutorials/docling/pdfs +for file in *.pdf; do echo "{ \"command\":\"docling\", \"args\": [\"--num-threads\", \"24\", \"/mnt/ce/data/tutorials/docling/pdfs/"$file""\", \"--output\", \"/mnt/ce/data/result/docling_""$file".md\" ]}"; done > commands.jsonl +cd - +mv ../../data/tutorials/docling/pdfs/commands.jsonl . 
+cat commands.jsonl diff --git a/experimental/serverless-fleets/tutorials/docling/minimal.py b/experimental/serverless-fleets/tutorials/docling/minimal.py deleted file mode 100644 index 66bd2c85f..000000000 --- a/experimental/serverless-fleets/tutorials/docling/minimal.py +++ /dev/null @@ -1,7 +0,0 @@ -from docling.document_converter import DocumentConverter - -source = "https://arxiv.org/pdf/2408.09869" # document per local path or URL -converter = DocumentConverter() -result = converter.convert(source) -print(result.document.export_to_markdown()) -# output: ## Docling Technical Report [...]" diff --git a/experimental/serverless-fleets/tutorials/docling/run b/experimental/serverless-fleets/tutorials/docling/run index 64ff745aa..b26affafd 100755 --- a/experimental/serverless-fleets/tutorials/docling/run +++ b/experimental/serverless-fleets/tutorials/docling/run @@ -4,39 +4,25 @@ set -e uuid=$(uuidgen | tr '[:upper:]' '[:lower:]' | awk -F- '{print $1}') -IMAGE=$(ibmcloud cr images | grep "ce--fleet-docling" | awk '{print $1}') - -if [ -z "${IMAGE}" ]; then - echo "no image found. 
pls build a docling image with ./build.sh" - exit -1 -else - echo "using image: $IMAGE" -fi +IMAGE="quay.io/docling-project/docling-serve-cpu" echo ibmcloud code-engine experimental fleet run --name "fleet-${uuid}-1" echo " "--image $IMAGE echo " "--registry-secret fleet-registry-secret echo " "--worker-profile mx3d-24x240 echo " "--max-scale 4 -echo " "--tasks 11 +echo " "--tasks-from-file commands.jsonl echo " "--cpu 24 echo " "--memory 240G -echo " "--command="bash" -echo " "--arg "-c" -echo " "--arg "mkdir -p /mnt/ce/data/result/\$CE_FLEET_ID/; cd /mnt/ce/data/tutorials/docling/pdfs; files=( * ); docling --artifacts-path=/root/.cache/docling/models \${files[CE_TASK_ID]} --num-threads 24 --output /mnt/ce/data/result/\$CE_FLEET_ID/;" - ibmcloud code-engine experimental fleet run --name "fleet-${uuid}-1" \ --image $IMAGE \ --registry-secret fleet-registry-secret \ --worker-profile mx3d-24x240 \ --max-scale 4 \ ---tasks 11 \ +--tasks-from-file commands.jsonl \ --cpu 24 \ --memory 240G \ ---command="bash" \ ---arg "-c" \ ---arg "mkdir -p /mnt/ce/data/result/\$CE_FLEET_ID/; cd /mnt/ce/data/tutorials/docling/pdfs; files=( * ); docling --artifacts-path=/root/.cache/docling/models \${files[CE_TASK_ID]} --num-threads 24 --output /mnt/ce/data/result/\$CE_FLEET_ID/;" ibmcloud code-engine experimental fleet get --name "fleet-${uuid}-1" diff --git a/experimental/serverless-fleets/tutorials/docling/run_gpu b/experimental/serverless-fleets/tutorials/docling/run_gpu new file mode 100755 index 000000000..1c856df5c --- /dev/null +++ b/experimental/serverless-fleets/tutorials/docling/run_gpu @@ -0,0 +1,29 @@ +#!/bin/bash + +set -e + +uuid=$(uuidgen | tr '[:upper:]' '[:lower:]' | awk -F- '{print $1}') + +# https://github.com/docling-project/docling-serve?tab=readme-ov-file#container-images +IMAGE="quay.io/docling-project/docling-serve" + +echo ibmcloud code-engine experimental fleet run --name "fleet-${uuid}-1" +echo " "--image $IMAGE +echo " "--registry-secret 
fleet-registry-secret +echo " "--worker-profile gx3-24x120x1l40s +echo " "--max-scale 1 +echo " "--tasks-from-file commands.jsonl +echo " "--cpu 24 +echo " "--memory 120G + +ibmcloud code-engine experimental fleet run --name "fleet-${uuid}-1" \ +--image $IMAGE \ +--registry-secret fleet-registry-secret \ +--worker-profile gx3-24x120x1l40s \ +--max-scale 1 \ +--tasks-from-file commands.jsonl \ +--cpu 24 \ +--memory 120G \ + +ibmcloud code-engine experimental fleet get --name "fleet-${uuid}-1" +
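
The `create_commands` script in this change builds each JSON line with nested escaped quotes inside `echo`, and hard-codes the thread count. As a rough sketch only, the same lines could be produced with a single `printf` format string, which also makes the "more parallelism" variant (`--num-threads 6`) a one-argument change. The function name `emit_commands` and its two parameters are hypothetical and not part of this patch; the `/mnt/ce/data/...` paths and the `docling_<file>.md` output naming are copied from `commands.jsonl`. The sketch assumes file names contain no double quotes, backslashes, or control characters, because it performs no JSON escaping.

```shell
#!/bin/sh
# Hypothetical helper (not part of the patch): print one JSON task line per PDF.
# Assumption: file names need no JSON escaping (no quotes, backslashes, control chars).
emit_commands() {
  pdf_dir=$1   # local directory that holds the PDFs
  threads=$2   # value passed to docling via --num-threads

  for path in "$pdf_dir"/*.pdf; do
    # When the directory has no PDFs the glob stays unexpanded; skip it.
    [ -e "$path" ] || continue
    file=${path##*/}   # strip the directory part, keep the file name
    printf '{ "command":"docling", "args": ["--num-threads", "%s", "/mnt/ce/data/tutorials/docling/pdfs/%s", "--output", "/mnt/ce/data/result/docling_%s.md" ]}\n' \
      "$threads" "$file" "$file"
  done
}
```

Run from `tutorials/docling`, `emit_commands ../../data/tutorials/docling/pdfs 24 > commands.jsonl` should yield the same eleven lines as the committed file, and passing `6` instead of `24` gives the arguments for the 4-commands-per-worker setup described in the README.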