Detect Confluence-exported DOC files stored in FileContents stream #4374

Open
zanvari wants to merge 1 commit into Unstructured-IO:main from zanvari:fix-confluence-doc-filetype

@zanvari zanvari commented Jun 12, 2026

Summary

Fixes #3980.

Some Confluence-exported legacy .doc files are stored as OLE/CFB containers but do not contain the standard WordDocument stream used for DOC detection. Instead, they store document content in a FileContents stream.

This change extends _OleFileDetector to recognize FileContents as a valid DOC indicator and return FileType.DOC.
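The rule described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the `FileType` enum here is a local stand-in, and `DOC_STREAM_NAMES` and `detect_doc_from_streams` are hypothetical names. It assumes the detector already has the names of the container's root-level streams in hand.

```python
from enum import Enum
from typing import Iterable, Optional


class FileType(Enum):
    # Local stand-in for the library's FileType enum; only the member needed here.
    DOC = "doc"


# Root-level OLE stream names treated as DOC indicators in this sketch.
# "WordDocument" is the standard stream; "FileContents" is the Confluence case.
DOC_STREAM_NAMES = frozenset({"WordDocument", "FileContents"})


def detect_doc_from_streams(stream_names: Iterable[str]) -> Optional[FileType]:
    """Return FileType.DOC when any DOC-indicating stream is present, else None."""
    if any(name in DOC_STREAM_NAMES for name in stream_names):
        return FileType.DOC
    return None
```

Keeping the indicator names in one set means the standard `WordDocument` path and the `FileContents` path share a single membership check, so existing detection is unaffected by the new name.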

Changes

  • Detect OLE files containing a FileContents stream as FileType.DOC
  • Add a unit test covering Confluence-exported DOC detection
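For readers who want to see what such a test could look like without a binary fixture, here is a hedged sketch. `FakeOle` and `is_doc` are hypothetical and are not taken from this PR's test. The stub's `listdir()` mimics the shape returned by olefile's `OleFileIO.listdir()`, one list of path components per stream. Treating only root-level streams as indicators is an assumption of this sketch.

```python
class FakeOle:
    """Hypothetical stand-in for an opened OLE container."""

    def __init__(self, stream_paths):
        self._stream_paths = stream_paths

    def listdir(self):
        # One list of path components per stream, e.g. [["FileContents"]].
        return self._stream_paths


def is_doc(ole) -> bool:
    # Hypothetical detector under test: only root-level streams count.
    root_names = {path[0] for path in ole.listdir() if len(path) == 1}
    return bool(root_names & {"WordDocument", "FileContents"})


def test_confluence_export_is_detected_as_doc():
    assert is_doc(FakeOle([["FileContents"]]))


def test_standard_doc_is_still_detected():
    assert is_doc(FakeOle([["WordDocument"]]))


def test_unrelated_or_nested_streams_are_not_doc():
    assert not is_doc(FakeOle([["Workbook"]]))
    # A FileContents stream nested inside a storage is ignored in this sketch.
    assert not is_doc(FakeOle([["SomeStorage", "FileContents"]]))
```

The three cases cover the new behaviour, the existing behaviour, and a negative case, which roughly matches the intent stated in the list above.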

Testing

python3 -m compileall unstructured/file_utils/filetype.py
python3 -m pytest test_unstructured/file_utils/test_filetype.py -k OleFileDetector -v

Result:

  • 6 tests passed
  • Existing OLE file detection tests continue to pass

@cubic-dev-ai cubic-dev-ai Bot left a comment


Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No issues found across 2 files

Shadow auto-approve: would auto-approve. Adds detection of Confluence-exported DOC files by recognizing a FileContents OLE stream. The change is small, well-tested, and only affects file type detection logic without impacting broader processing.

Development

Successfully merging this pull request may close these issues.

bug/Wrongly detected fileType for exported documents