Mimetype based detection in SimpleDirectoryReader #15436
Blackskyliner started this conversation in Ideas
Hi there,
Is there interest in a patch that extends the SimpleDirectoryReader so it can also read file magic values and guess possible extensions for lookup in the file extractor dict?
I implemented this for myself by blatantly copying the current SimpleDirectoryReader code and adding detection through the packages `python-magic` and `mimetypes`.
Background on why I think this is useful: if you use S3 storage and/or implement or use your own deduplicating filesystem based on hashed file names, a file may simply have no extension. But the magic number within the file will always exist. (That's the case I needed it for.)
Before I go through the cleanup needed to bring this up to PR quality, I would like to know whether the general idea would be accepted and whether there is any need for it (or maybe I am uninformed and there is already MIME-type based detection hidden somewhere in another reader component).
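To make the idea concrete, here is a minimal sketch of the detection step, not the actual patch. The helper names (`sniff_mime_with_libmagic`, `guess_extension`, `pick_extractor`) are made up for illustration, and it assumes the file extractor dict is keyed by extensions such as `".pdf"`. It relies on `magic.from_buffer(..., mime=True)` from `python-magic` and on `mimetypes.guess_extension` from the standard library.

```python
import mimetypes
from pathlib import Path
from typing import Any, Callable, Dict, Optional


def sniff_mime_with_libmagic(head: bytes) -> str:
    """Return the MIME type of a byte prefix using python-magic."""
    # Imported lazily so the rest of the sketch works without libmagic installed.
    import magic  # third-party: pip install python-magic

    return magic.from_buffer(head, mime=True)


def guess_extension(
    path,
    sniff: Callable[[bytes], str] = sniff_mime_with_libmagic,
    head_size: int = 2048,
) -> Optional[str]:
    """Return the file's extension, falling back to magic-number detection."""
    path = Path(path)
    if path.suffix:
        # An existing extension wins; normalize it to lower case.
        return path.suffix.lower()
    # No extension (e.g. a hashed object name): inspect the first bytes.
    with path.open("rb") as fh:
        head = fh.read(head_size)
    mime = sniff(head)
    # Map the MIME type back to an extension; None if the type is unknown.
    return mimetypes.guess_extension(mime)


def pick_extractor(
    path,
    file_extractor: Dict[str, Any],
    sniff: Callable[[bytes], str] = sniff_mime_with_libmagic,
) -> Optional[Any]:
    """Look up a reader in an extension-keyed dict (hypothetical helper)."""
    ext = guess_extension(path, sniff)
    return file_extractor.get(ext) if ext else None
```

The `sniff` parameter exists only so the MIME detection can be swapped out, for example in tests or on systems without libmagic. A file that already has an extension never reaches the magic-number check, so existing behaviour would stay unchanged for normally named files.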