Commit 5863cce

jeremiaswerner authored and reggeenr committed
rebase docling example to use official image and tasks-from-file
1 parent 3a1a55a commit 5863cce

8 files changed

+62
-81
lines changed

experimental/serverless-fleets/tutorials/docling/Dockerfile

Lines changed: 0 additions & 34 deletions
This file was deleted.

experimental/serverless-fleets/tutorials/docling/README.md

Lines changed: 39 additions & 12 deletions
@@ -4,8 +4,7 @@ This tutorial provides a comprehensive guide on using Docling to convert PDFs in

 Key steps covered in the Tutorial:
 1. Upload the examples PDFs to COS
-2. Containerization with Code Engine: Build the Docling container and push it to a registry for deployment.
-3. Run a fleet of workers that automatically runs the container, ensuring scalability and efficiency.
+2. Run a fleet of workers that automatically runs the official docling container, ensuring scalability and efficiency.
 4. Download the resulting markdown files from COS

 This setup is ideal for automating document conversion workflows in a cost-effective, serverless environment.
@@ -26,17 +25,39 @@ ls data/tutorials/docling/pdfs
 ./upload
 ```

-### Step 2 - Build and Push the container registry
+### Step 2 - Review the commands

-Build the container image using Code Engine's build capabilities by running the following command in the `tutorials/docling` directory.
+Review the `commands.jsonl` which defines the tasks to run the docling command and arguments for each of the pdfs.
 ```
 cd tutorials/docling
-./build
+cat commands.jsonl
 ```

+<a name="Output"></a>
+<details>
+<summary>Output</summary>
+
+```
+➜ cat commands.jsonl
+
+{ "command":"docling", "args": ["--num-threads", "24", "/mnt/ce/data/tutorials/docling/pdfs/2203.01017v2.pdf", "--output", "/mnt/ce/data/result/docling_2203.01017v2.pdf.md" ]}
+{ "command":"docling", "args": ["--num-threads", "24", "/mnt/ce/data/tutorials/docling/pdfs/2206.01062.pdf", "--output", "/mnt/ce/data/result/docling_2206.01062.pdf.md" ]}
+{ "command":"docling", "args": ["--num-threads", "24", "/mnt/ce/data/tutorials/docling/pdfs/2305.03393v1-pg9.pdf", "--output", "/mnt/ce/data/result/docling_2305.03393v1-pg9.pdf.md" ]}
+{ "command":"docling", "args": ["--num-threads", "24", "/mnt/ce/data/tutorials/docling/pdfs/2305.03393v1.pdf", "--output", "/mnt/ce/data/result/docling_2305.03393v1.pdf.md" ]}
+{ "command":"docling", "args": ["--num-threads", "24", "/mnt/ce/data/tutorials/docling/pdfs/amt_handbook_sample.pdf", "--output", "/mnt/ce/data/result/docling_amt_handbook_sample.pdf.md" ]}
+{ "command":"docling", "args": ["--num-threads", "24", "/mnt/ce/data/tutorials/docling/pdfs/code_and_formula.pdf", "--output", "/mnt/ce/data/result/docling_code_and_formula.pdf.md" ]}
+{ "command":"docling", "args": ["--num-threads", "24", "/mnt/ce/data/tutorials/docling/pdfs/picture_classification.pdf", "--output", "/mnt/ce/data/result/docling_picture_classification.pdf.md" ]}
+{ "command":"docling", "args": ["--num-threads", "24", "/mnt/ce/data/tutorials/docling/pdfs/redp5110_sampled.pdf", "--output", "/mnt/ce/data/result/docling_redp5110_sampled.pdf.md" ]}
+{ "command":"docling", "args": ["--num-threads", "24", "/mnt/ce/data/tutorials/docling/pdfs/right_to_left_01.pdf", "--output", "/mnt/ce/data/result/docling_right_to_left_01.pdf.md" ]}
+{ "command":"docling", "args": ["--num-threads", "24", "/mnt/ce/data/tutorials/docling/pdfs/right_to_left_02.pdf", "--output", "/mnt/ce/data/result/docling_right_to_left_02.pdf.md" ]}
+{ "command":"docling", "args": ["--num-threads", "24", "/mnt/ce/data/tutorials/docling/pdfs/right_to_left_03.pdf", "--output", "/mnt/ce/data/result/docling_right_to_left_03.pdf.md" ]}
+```
+</details>
+<br/>
+
 ### Step 3 - Run the Fleet

-Now run the fleet to process the PDFs. In this tutorial we use the static array index with `--task 11` to specify the tasks for the 11 pdfs. The command is a bash script which is using the `CE_TASK_ID`, which contains values `0..10`, to fetch the pdf file. It's then running docling with 24 CPUs on the mx3d-24x240 worker. Therefore it's only running one instance per worker and utilizing the full worker. We run 4 instance and workers in parallel. Run the fleet with the following command in the `tutorials/docling` directory.
+Now run the fleet to process the PDFs. In this tutorial we use the static array index with `--tasks-from-file commands.jsonl` to specify the tasks for the 11 pdfs. We give each task 24 vCPU, run docling with `--num-threads 24` and choose a mx3d-24x240 worker profile with 24 vCPU. Therefore we run only 1 docling command per worker at a time and utilize the full worker per pdf processing. We run `--max-scale 4` instances and workers in parallel. Launch the fleet with the following command in the `tutorials/docling` directory.
 ```
 ./run
 ```
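Before launching the fleet, it can help to sanity-check the task file reviewed in Step 2. The following is a minimal sketch and not part of the tutorial scripts: `count_tasks` and `check_tasks` are hypothetical helper names, and the check only looks at the shape of each line (a `{ ... }` object containing a `"command"` key). It is not a JSON parser.

```shell
#!/bin/sh
# Hypothetical helpers for inspecting a tasks file such as commands.jsonl.

# count_tasks FILE
# Print the number of non-blank lines, i.e. the number of tasks the file defines.
count_tasks() {
    grep -c '[^[:space:]]' "$1"
}

# check_tasks FILE
# Report every non-blank line that does not look like a task object.
# Exits non-zero if at least one such line was found.
check_tasks() {
    awk '
        /^[ \t]*$/ { next }                                  # skip blank lines
        !/^[ \t]*[{].*"command"[ \t]*:.*[}][ \t]*$/ {
            printf "line %d does not look like a task object\n", NR
            bad = 1
        }
        END { exit bad }
    ' "$1"
}
```

Run from the `tutorials/docling` directory, `count_tasks commands.jsonl` should report 11 for the file shown in Step 2, one task per PDF, and `check_tasks commands.jsonl` should print nothing.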
@@ -53,12 +74,9 @@ ibmcloud code-engine experimental fleet run --name fleet-0eb02f2f-1
 --registry-secret fleet-registry-secret
 --worker-profile mx3d-24x240
 --max-scale 4
---tasks 11
+--tasks-from-file commands.jsonl
 --cpu 24
 --memory 240G
---command=bash
---arg -c
---arg mkdir -p /mnt/ce/data/result/$CE_FLEET_ID/; cd /mnt/ce/data/tutorials/docling/pdfs; files=( * ); docling --artifacts-path=/root/.cache/docling/models ${files[CE_TASK_ID]} --num-threads 24 --output /mnt/ce/data/result/$CE_FLEET_ID/;
 Preparing your tasks: ⠼ Please wait...took 11.233582 seconds.
 Preparing your tasks: ⠴ Please wait...
 COS Bucket used 'ce-fleet-sandbox-data-fbfdde1d'...
@@ -139,14 +157,23 @@ Succeeded Tasks: 0
 </details>
 <br/>

-If you like you can jump to the machine and see docling processing by running the following command in the root directory:
+(optional) If you like you can jump to the machine and see docling processing by running the following command in the root directory:
 ```
 ./jump <IP>
 ```

 You can use `htop` to see that docling is processing the PDFs
 ![](../../images/examples_docling.jpg)

+
+#### Playing with more parallelism
+
+If you want to modify the tutorial to add some more parallelism, e.g. to run 4 docling commands per worker, you could change the arguments and run script as follows:
+1. the arguments in commands.jsonl to `--num-threads 6`
+2. the cpu per task to `--cpu 6`
+Now, with `--max-scale 4` you would only get a single worker. Modify `--max-scale 8` to get 2 workers, each processing 4 docling commands.
+
+
 ### Step 4 - Download results

 Download the results from the COS by running the following command in the root directory:
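The worker counts quoted for the parallelism variant above follow from simple integer arithmetic. The sketch below reproduces it under one stated assumption: tasks are packed onto workers by vCPU only, so memory and any other scheduling constraints are ignored. `tasks_per_worker` and `workers_needed` are hypothetical helper names, not Code Engine commands.

```shell
#!/bin/sh
# Hypothetical helpers illustrating the sizing arithmetic (vCPU only, memory ignored).

# tasks_per_worker WORKER_VCPU CPU_PER_TASK
# How many tasks fit on one worker at the same time.
tasks_per_worker() {
    echo $(( $1 / $2 ))
}

# workers_needed MAX_SCALE WORKER_VCPU CPU_PER_TASK
# Workers required to run MAX_SCALE tasks at once (ceiling division).
workers_needed() {
    per_worker=$(( $2 / $3 ))
    echo $(( ($1 + per_worker - 1) / per_worker ))
}

# mx3d-24x240 provides 24 vCPU per worker.
echo "cpu 24, max-scale 4: $(tasks_per_worker 24 24) per worker, $(workers_needed 4 24 24) worker(s)"
echo "cpu 6,  max-scale 4: $(tasks_per_worker 24 6) per worker, $(workers_needed 4 24 6) worker(s)"
echo "cpu 6,  max-scale 8: $(tasks_per_worker 24 6) per worker, $(workers_needed 8 24 6) worker(s)"
```

With the tutorial defaults this gives one task per worker and four workers. With `--cpu 6` it gives four tasks per worker, so `--max-scale 4` fits on a single worker and `--max-scale 8` needs two.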
@@ -156,7 +183,7 @@ Download the results from the COS by running the following command in the root d

 You can find the results under
 ```
-ls -l data/result/<fleet-id>/
+ls -l data/result/docling_*
 ```

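After the download, one way to confirm that every task produced something is to compare the `--output` targets in `commands.jsonl` with what exists locally. This is only a sketch: `expected_outputs` and `missing_results` are hypothetical helpers, and the mapping from the container path `/mnt/ce/data/` to the local `data/` directory is an assumption based on the paths used in this tutorial.

```shell
#!/bin/sh
# Hypothetical helpers for checking downloaded results against the tasks file.

# expected_outputs FILE
# Print the value that follows "--output" in each task line.
expected_outputs() {
    sed -n 's/.*"--output",[[:space:]]*"\([^"]*\)".*/\1/p' "$1"
}

# missing_results FILE
# Map each container path to the assumed local path and report the ones
# that do not exist. Run from the directory that contains data/.
missing_results() {
    expected_outputs "$1" | sed 's|^/mnt/ce/data/|data/|' | while IFS= read -r path; do
        [ -e "$path" ] || echo "missing: $path"
    done
}
```

For example, `missing_results tutorials/docling/commands.jsonl` run from the root directory prints nothing when every expected output path is present.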

experimental/serverless-fleets/tutorials/docling/build

Lines changed: 0 additions & 11 deletions
This file was deleted.
Lines changed: 11 additions & 0 deletions
@@ -0,0 +1,11 @@
+{ "command":"docling", "args": ["--num-threads", "24", "/mnt/ce/data/tutorials/docling/pdfs/2203.01017v2.pdf", "--output", "/mnt/ce/data/result/docling_2203.01017v2.pdf.md" ]}
+{ "command":"docling", "args": ["--num-threads", "24", "/mnt/ce/data/tutorials/docling/pdfs/2206.01062.pdf", "--output", "/mnt/ce/data/result/docling_2206.01062.pdf.md" ]}
+{ "command":"docling", "args": ["--num-threads", "24", "/mnt/ce/data/tutorials/docling/pdfs/2305.03393v1-pg9.pdf", "--output", "/mnt/ce/data/result/docling_2305.03393v1-pg9.pdf.md" ]}
+{ "command":"docling", "args": ["--num-threads", "24", "/mnt/ce/data/tutorials/docling/pdfs/2305.03393v1.pdf", "--output", "/mnt/ce/data/result/docling_2305.03393v1.pdf.md" ]}
+{ "command":"docling", "args": ["--num-threads", "24", "/mnt/ce/data/tutorials/docling/pdfs/amt_handbook_sample.pdf", "--output", "/mnt/ce/data/result/docling_amt_handbook_sample.pdf.md" ]}
+{ "command":"docling", "args": ["--num-threads", "24", "/mnt/ce/data/tutorials/docling/pdfs/code_and_formula.pdf", "--output", "/mnt/ce/data/result/docling_code_and_formula.pdf.md" ]}
+{ "command":"docling", "args": ["--num-threads", "24", "/mnt/ce/data/tutorials/docling/pdfs/picture_classification.pdf", "--output", "/mnt/ce/data/result/docling_picture_classification.pdf.md" ]}
+{ "command":"docling", "args": ["--num-threads", "24", "/mnt/ce/data/tutorials/docling/pdfs/redp5110_sampled.pdf", "--output", "/mnt/ce/data/result/docling_redp5110_sampled.pdf.md" ]}
+{ "command":"docling", "args": ["--num-threads", "24", "/mnt/ce/data/tutorials/docling/pdfs/right_to_left_01.pdf", "--output", "/mnt/ce/data/result/docling_right_to_left_01.pdf.md" ]}
+{ "command":"docling", "args": ["--num-threads", "24", "/mnt/ce/data/tutorials/docling/pdfs/right_to_left_02.pdf", "--output", "/mnt/ce/data/result/docling_right_to_left_02.pdf.md" ]}
+{ "command":"docling", "args": ["--num-threads", "24", "/mnt/ce/data/tutorials/docling/pdfs/right_to_left_03.pdf", "--output", "/mnt/ce/data/result/docling_right_to_left_03.pdf.md" ]}
Lines changed: 9 additions & 0 deletions
@@ -0,0 +1,9 @@
+#!/bin/sh
+
+#ls -l ../../data/tutorials/docling/pdfs/*.pdf | awk '{ printf " { \"command\":\"docling\", \"args\": [\"--num-threads\", \"24\", \""$8"\", \"--output\", \"/mnt/ce/data/result\"$8".md\" ]}\n" }' > commands.jsonl
+
+cd ../../data/tutorials/docling/pdfs
+for file in *.pdf; do echo "{ \"command\":\"docling\", \"args\": [\"--num-threads\", \"24\", \"/mnt/ce/data/tutorials/docling/pdfs/"$file""\", \"--output\", \"/mnt/ce/data/result/docling_""$file".md\" ]}"; done > commands.jsonl
+cd -
+mv ../../data/tutorials/docling/pdfs/commands.jsonl .
+cat commands.jsonl
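The quoting in the `for` loop above is easy to get wrong when the thread count or paths change. As an alternative sketch, not the script from this commit, the same task lines can be produced with `printf`, which keeps the JSON template in one single-quoted string. `emit_task` and `generate_tasks` are hypothetical names. File names containing double quotes or backslashes are not escaped here, so the sketch assumes plain file names like the ones in the tutorial.

```shell
#!/bin/sh
# Hypothetical alternative to the generator loop; assumes plain PDF file names.

# emit_task THREADS PDF_NAME
# Print one task line in the format used by commands.jsonl.
emit_task() {
    printf '{ "command":"docling", "args": ["--num-threads", "%s", "/mnt/ce/data/tutorials/docling/pdfs/%s", "--output", "/mnt/ce/data/result/docling_%s.md" ]}\n' "$1" "$2" "$2"
}

# generate_tasks THREADS PDF_DIR
# Emit one task line per *.pdf file found in PDF_DIR.
generate_tasks() {
    for file in "$2"/*.pdf; do
        [ -e "$file" ] || continue          # the glob matched nothing
        emit_task "$1" "$(basename "$file")"
    done
}
```

Called as `generate_tasks 24 ../../data/tutorials/docling/pdfs > commands.jsonl` from the `tutorials/docling` directory, this avoids the `cd` and `mv` steps. Passing `6` instead of `24` would produce the task file for the parallelism variant described in the README.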

experimental/serverless-fleets/tutorials/docling/minimal.py

Lines changed: 0 additions & 7 deletions
This file was deleted.

experimental/serverless-fleets/tutorials/docling/run

Lines changed: 3 additions & 17 deletions
@@ -4,39 +4,25 @@ set -e

 uuid=$(uuidgen | tr '[:upper:]' '[:lower:]' | awk -F- '{print $1}')

-IMAGE=$(ibmcloud cr images | grep "ce--fleet-docling" | awk '{print $1}')
-
-if [ -z "${IMAGE}" ]; then
-echo "no image found. pls build a docling image with ./build.sh"
-exit -1
-else
-echo "using image: $IMAGE"
-fi
+IMAGE="quay.io/docling-project/docling-serve-cpu"

 echo ibmcloud code-engine experimental fleet run --name "fleet-${uuid}-1"
 echo " "--image $IMAGE
 echo " "--registry-secret fleet-registry-secret
 echo " "--worker-profile mx3d-24x240
 echo " "--max-scale 4
-echo " "--tasks 11
+echo " "--tasks-from-file commands.jsonl
 echo " "--cpu 24
 echo " "--memory 240G
-echo " "--command="bash"
-echo " "--arg "-c"
-echo " "--arg "mkdir -p /mnt/ce/data/result/\$CE_FLEET_ID/; cd /mnt/ce/data/tutorials/docling/pdfs; files=( * ); docling --artifacts-path=/root/.cache/docling/models \${files[CE_TASK_ID]} --num-threads 24 --output /mnt/ce/data/result/\$CE_FLEET_ID/;"
-

 ibmcloud code-engine experimental fleet run --name "fleet-${uuid}-1" \
 --image $IMAGE \
 --registry-secret fleet-registry-secret \
 --worker-profile mx3d-24x240 \
 --max-scale 4 \
---tasks 11 \
+--tasks-from-file commands.jsonl \
 --cpu 24 \
 --memory 240G \
---command="bash" \
---arg "-c" \
---arg "mkdir -p /mnt/ce/data/result/\$CE_FLEET_ID/; cd /mnt/ce/data/tutorials/docling/pdfs; files=( * ); docling --artifacts-path=/root/.cache/docling/models \${files[CE_TASK_ID]} --num-threads 24 --output /mnt/ce/data/result/\$CE_FLEET_ID/;"

 ibmcloud code-engine experimental fleet get --name "fleet-${uuid}-1"
