When split_pdf_page is enabled, only the last set of API requests succeeds. #220

@issj6

Describe the bug
When I set split_pdf_page=True and split_pdf_concurrency_level=15, every split request except the last one fails.
For example, if the PDF is divided into 10 sets, the client logs the following:
ERROR: Failed to send request for page 1
...
WARNING: Failed to partition set #1, its elements will be omitted in the final result.
...
WARNING: Failed to partition set #9, its elements will be omitted in the final result.
INFO: Successfully partitioned set #10, elements added to the final result.
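
To see exactly which pages are dropped from the final result, the WARNING lines can be matched against the "Partitioning set" lines in the same log. The following is a minimal sketch and not part of unstructured-client: the helper name `omitted_page_ranges` and the regular expressions are my own assumptions, based only on the log format shown in this issue.

```python
import re

# Patterns are assumptions derived from the log lines in this issue.
# "\S*?" tolerates an optional prefix in front of "#N".
RANGE_RE = re.compile(r"Partitioning set \S*?#(\d+) \(pages (\d+)-(\d+)\)")
FAILED_RE = re.compile(r"Failed to partition set \S*?#(\d+)")


def omitted_page_ranges(log_text):
    """Return (set_number, (first_page, last_page)) for every failed set."""
    ranges = {
        int(m.group(1)): (int(m.group(2)), int(m.group(3)))
        for m in RANGE_RE.finditer(log_text)
    }
    failed = sorted({int(m.group(1)) for m in FAILED_RE.finditer(log_text)})
    # A failed set with no matching "Partitioning set" line maps to None.
    return [(number, ranges.get(number)) for number in failed]


sample_log = """\
INFO: Partitioning set #1 (pages 1-2).
INFO: Partitioning set #2 (pages 3-3).
WARNING: Failed to partition set #1, its elements will be omitted in the final result.
INFO: Successfully partitioned set #2, elements added to the final result.
"""

result = omitted_page_ranges(sample_log)
print(result)
```

Running the same helper over the full console output of a failing run lists every page range that is missing from the returned elements.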

To Reproduce
code:

import os

import requests
import unstructured_client
from unstructured_client.models import operations, shared

# Placeholder credentials; the server URL is left empty here.
os.environ["UNSTRUCTURED_API_KEY"] = "EMPTY"
os.environ["UNSTRUCTURED_API_URL"] = ""

requests_client = requests.Session()
client = unstructured_client.UnstructuredClient(
    api_key_auth=os.getenv("UNSTRUCTURED_API_KEY"),
    server_url=os.getenv("UNSTRUCTURED_API_URL"),
    client=requests_client,
)

filename = "./test_pdf.pdf"

# Read the PDF up front so the file handle is closed before the request is sent.
with open(filename, "rb") as file:
    content = file.read()

req = operations.PartitionRequest(
    partition_parameters=shared.PartitionParameters(
        files=shared.Files(
            content=content,
            file_name=filename,
        ),
        strategy=shared.Strategy.HI_RES,
        split_pdf_page=True,
        split_pdf_concurrency_level=15,
        chunking_strategy=shared.ChunkingStrategy("by_title"),
    )
)

try:
    res = client.general.partition(req)
    element_dicts = list(res.elements)

    print(element_dicts)
    for element in element_dicts:
        print(element["text"])
except Exception as e:
    print(e)

Console output:

INFO: Preparing to split document for partition.
INFO: Concurrency level set to 15
INFO: Splitting pages 1 to 23 (23 total)
INFO: Determined optimal split size of 2 pages.
INFO: Partitioning 11 files with 2 page(s) each.
INFO: Partitioning 1 file with 1 page(s).
INFO: Partitioning set #1 (pages 1-2).
INFO: Partitioning set #2 (pages 3-4).
INFO: Partitioning set #3 (pages 5-6).
INFO: Partitioning set #4 (pages 7-8).
INFO: Partitioning set #5 (pages 9-10).
INFO: Partitioning set #6 (pages 11-12).
INFO: Partitioning set #7 (pages 13-14).
INFO: Partitioning set #8 (pages 15-16).
INFO: Partitioning set #9 (pages 17-18).
INFO: Partitioning set #10 (pages 19-20).
INFO: Partitioning set #11 (pages 21-22).
INFO: Partitioning set #12 (pages 23-23).
ERROR: Failed to send request for page 1
ERROR: Failed to send request for page 3
ERROR: Failed to send request for page 5
ERROR: Failed to send request for page 7
ERROR: Failed to send request for page 9
ERROR: Failed to send request for page 11
ERROR: Failed to send request for page 13
ERROR: Failed to send request for page 15
ERROR: Failed to send request for page 17
ERROR: Failed to send request for page 19
ERROR: Failed to send request for page 21
WARNING: Failed to partition set #1, its elements will be omitted in the final result.
WARNING: Failed to partition set #2, its elements will be omitted in the final result.
WARNING: Failed to partition set #3, its elements will be omitted in the final result.
WARNING: Failed to partition set #4, its elements will be omitted in the final result.
WARNING: Failed to partition set #5, its elements will be omitted in the final result.
WARNING: Failed to partition set #6, its elements will be omitted in the final result.
WARNING: Failed to partition set #7, its elements will be omitted in the final result.
WARNING: Failed to partition set #8, its elements will be omitted in the final result.
WARNING: Failed to partition set #9, its elements will be omitted in the final result.
WARNING: Failed to partition set #10, its elements will be omitted in the final result.
WARNING: Failed to partition set #11, its elements will be omitted in the final result.
INFO: Successfully partitioned set #12, elements added to the final result.
INFO: Successfully partitioned the document.
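
For reference, the split plan in the log (23 pages, concurrency level 15, split size 2, giving 11 two-page files plus 1 one-page file) can be reproduced with a short calculation. This is only a sketch of the arithmetic under my own assumptions: the function name `plan_splits` and the two-page minimum are hypothetical, and the client's actual sizing logic may differ.

```python
import math


def plan_splits(num_pages, concurrency, min_pages=2):
    """Return 1-based inclusive (first_page, last_page) ranges.

    Assumption: split size is ceil(num_pages / concurrency), but never
    smaller than min_pages. This is not taken from the client's source.
    """
    split_size = max(min_pages, math.ceil(num_pages / concurrency))
    return [
        (start, min(start + split_size - 1, num_pages))
        for start in range(1, num_pages + 1, split_size)
    ]


ranges = plan_splits(num_pages=23, concurrency=15)
for number, (first, last) in enumerate(ranges, start=1):
    print(f"set #{number}: pages {first}-{last}")
```

With these inputs the sketch yields 12 ranges, from pages 1-2 up to pages 23-23, which matches the "Partitioning set" lines above. Only the final, single-page set succeeds in the failing run.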
