bug/Wrongly detected fileType for exported documents #3980

Open
srisudarsan opened this issue Apr 6, 2025 · 0 comments
Labels
bug Something isn't working

Comments

@srisudarsan
Contributor

Describe the bug
I have a document exported from Confluence that downloads as a .doc file. Partitioning this file fails because the file type cannot be detected. This occurs when the file is passed as an in-memory byte stream (similar to how the unstructured Python client SDK sends it), not when the open file object is passed directly.

Partitioning fails with the message "unstructured.partition.common.UnsupportedFileFormatError: Partitioning is not supported for the FileType.UNK file type." when using unstructured directly, and with "expected str, bytes or os.PathLike object, not int" when using the client SDK.

To Reproduce
from io import BytesIO
from unstructured.partition.auto import partition

with open("Test.doc", "rb") as f:
    # Do not pass the file object directly; wrap its bytes in BytesIO,
    # similar to how the client SDK sends an uploaded file.
    elements = partition(file=BytesIO(f.read()))

Expected behavior
The file type should be detected as .doc, and partition should return the document's elements.

Screenshots
NA

Environment Info
Python version: 3.10.15
unstructured version: 0.17.5
unstructured-inference version: 0.8.10

Additional context
The .doc file exported by Confluence does not contain magic bytes, so OLE file detection fails, as do all the other detection steps. When detection reaches the last check, it fails because the file name returned is random and is sometimes an integer.
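The integer file name is consistent with how some Python file objects expose `name`: on POSIX systems, for example, `tempfile.TemporaryFile()` typically reports its file descriptor number there. The sketch below is not the library's code. It uses a hypothetical stand-in class, `FdBackedStream`, and a hypothetical helper, `extension_from_name`, to show how trusting such a name as a path produces the TypeError quoted above.

```python
import os
from io import BytesIO


class FdBackedStream(BytesIO):
    """Hypothetical stand-in for a stream whose `name` is an integer file descriptor."""

    name = 7  # illustrative value, not taken from the issue


def extension_from_name(file) -> str:
    """Naive extension lookup that assumes `file.name` is path-like."""
    return os.path.splitext(file.name)[1]


def try_extension(file) -> str:
    """Return the extension, or the error text if `name` is not path-like."""
    try:
        return extension_from_name(file)
    except TypeError as exc:
        return f"TypeError: {exc}"


result = try_extension(FdBackedStream(b"data"))
print(result)
```

Any lookup that passes `file.name` to `os.path` functions without first checking its type would fail the same way.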

Since this is the last attempt to identify the extension, can we use the metadata file name before doing this check?
PR #3786 proposes a potential fix.
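The idea in the question above, preferring a caller-supplied metadata file name before falling back to the stream's own `name`, could look roughly like this. It is an illustration only and not the code from the linked PR; `resolve_extension` and its `metadata_filename` argument are hypothetical names chosen for this example.

```python
import os
from io import BytesIO
from typing import IO, Optional


def resolve_extension(
    file: IO[bytes], metadata_filename: Optional[str] = None
) -> Optional[str]:
    """Pick an extension, trusting the metadata file name over `file.name`."""
    # First choice: the file name the caller supplied as metadata.
    if metadata_filename:
        ext = os.path.splitext(metadata_filename)[1].lower()
        if ext:
            return ext

    # Fallback: the stream's own name, but only if it is actually path-like.
    # An integer file descriptor is skipped instead of raising TypeError.
    name = getattr(file, "name", None)
    if isinstance(name, (str, bytes, os.PathLike)):
        ext = os.path.splitext(os.fsdecode(name))[1].lower()
        return ext or None

    return None


stream = BytesIO(b"exported content")
stream.name = 12  # simulate a stream whose name is an integer

with_metadata = resolve_extension(stream, metadata_filename="Test.doc")
without_metadata = resolve_extension(stream)
print(with_metadata, without_metadata)
```

Returning `None` rather than raising leaves the caller free to report an unsupported file type in its own terms.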

@srisudarsan added the bug label Apr 6, 2025