@fedexman

Fix media type inference for URLs with query parameters

When using presigned URLs (e.g. AWS S3) with ImageUrl, AudioUrl, or VideoUrl, media type inference fails.

https://pics.s3.ap-northeast-1.amazonaws.com/test/Capture-2025-11-21-112402.png?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=fdafdas%2F20251121%2Fap-northeast-1%2Fs3%2Faws4_request&X-Amz-Date=20251121T023200Z&X-Amz-Expires=3600&X-Amz-SignedHeaders=host&X-Amz-Security-Token=IQoJb3JpZ2luX2VjEDoaD

Passing this URL to the agent produces the following error:

Internal server error: Could not infer media type from image URL: 
...
Explicitly provide a `media_type` instead

The reason is that the _infer_media_type function only checks the end of the URL, e.g. url.endswith('.mkv'), and does not parse the URL, so the query string hides the file extension.

I propose parsing the URL first and checking only its path:

from urllib.parse import urlparse

path = urlparse(self.url).path
if path.endswith('.mkv'):
    return 'video/x-matroska'
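
As a rough, standalone sketch of the proposal above: the function name `infer_video_media_type` and the shortened example URL are hypothetical and only stand in for the method on the URL classes. Only the `urlparse(...).path` check comes from the proposal itself.

```python
from typing import Optional
from urllib.parse import urlparse


def infer_video_media_type(url: str) -> Optional[str]:
    """Hypothetical helper: look at the extension of the URL path only."""
    # urlparse separates the path from the query string, so presigned
    # parameters such as X-Amz-Expires no longer hide the extension.
    path = urlparse(url).path
    if path.endswith('.mkv'):
        return 'video/x-matroska'
    return None


# Illustrative presigned-style URL (not a real object).
presigned = 'https://example.com/test/clip.mkv?X-Amz-Expires=3600&X-Amz-SignedHeaders=host'

print(presigned.endswith('.mkv'))         # the raw URL ends with the query string
print(infer_video_media_type(presigned))  # the parsed path ends with .mkv
```

A plain `endswith` on the whole URL sees the query string as the suffix, while the parsed path still ends with the file extension.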

Member

Why aren't we using the mimetypes stdlib module? mimetypes.guess_type() already parses URLs and the current implementation doesn't take into account case insensitivity, etc.
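
To illustrate the suggestion, here is a hedged sketch built on `mimetypes.guess_type()`. It strips the query string with `urlparse` first, because how much URL parsing `guess_type()` does on its own may differ between Python versions. The helper name `guess_media_type` and the example URL are made up for illustration.

```python
import mimetypes
from typing import Optional
from urllib.parse import urlparse


def guess_media_type(url: str) -> Optional[str]:
    """Hypothetical helper: guess a media type from the path of a URL."""
    path = urlparse(url).path
    # guess_type returns a (type, encoding) tuple; only the type is needed here.
    media_type, _encoding = mimetypes.guess_type(path)
    return media_type


# An upper-case extension plus a query string: both defeat a plain
# url.endswith('.png') check, but the stdlib lookup also tries the
# lower-cased extension.
print(guess_media_type('https://example.com/test/Capture.PNG?X-Amz-Expires=3600'))
```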

Collaborator

@Viicos Interestingly we already use that in DocumentUrl._infer_media_type, after checking a bunch of types ourselves :/

@fedexman Can you see if we can use mimetypes.guess_type() for all of these?

The method can be changed to just return str rather than XMediaType, as I don't think that type is used on any public fields.

@github-actions

This PR is stale, and will be closed in 3 days if no reply is received.

@github-actions github-actions bot added the Stale label Nov 29, 2025
@fedexman
Author

fedexman commented Dec 1, 2025

I'll update this week 🙏🙏

@DouweM DouweM removed the Stale label Dec 1, 2025
@fedexman
Author

fedexman commented Dec 3, 2025

I refactored using mimetypes. Some types are defined in the standard library; the others are read from OS files that depend on the machine. I manually added every type that is not in the standard library to get reliable behavior.
Ready for re-review 🙇

@fedexman fedexman requested a review from DouweM December 3, 2025 13:44
Comment on lines 30 to 48
# Register manually MIME types that are not in the standard library
# Document types
mimetypes.add_type('text/markdown', '.mdx')
mimetypes.add_type('text/x-asciidoc', '.asciidoc')

# Video types
mimetypes.add_type('video/3gpp', '.three_gp')
mimetypes.add_type('video/x-flv', '.flv')
mimetypes.add_type('video/x-matroska', '.mkv')
mimetypes.add_type('video/x-ms-wmv', '.wmv')

# Audio types
mimetypes.add_type('audio/flac', '.flac')
mimetypes.add_type('audio/mpeg', '.mp3')
mimetypes.add_type('audio/ogg', '.oga')
# override stdlib mimetypes that use x- prefix with standard types
mimetypes.add_type('audio/aac', '.aac')
mimetypes.add_type('audio/aiff', '.aiff')
mimetypes.add_type('audio/wav', '.wav')
Member

This will affect the global mimetypes db for the current interpreter. Let's instantiate an explicit MimeTypes object instead, and attach the additional types directly to it.

You can then use your instance's guess_type() directly instead of the module-level guess_type().

Author

Should it be created at module level, or in the class init? 🤔

Member

At the module level. It's fine to share the added MIME types across the classes, and instantiating a MimeTypes object each time could hurt performance.
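
A minimal sketch of what the review thread above describes, assuming a private module-level instance. The names `_mime_types` and `infer_media_type` are hypothetical, and only two of the extra types are registered to keep the example short.

```python
import mimetypes
from typing import Optional
from urllib.parse import urlparse

# Module-level instance: created once at import time and shared by all callers.
# Types added here live on this instance only, not in the interpreter-wide
# mimetypes database used by the module-level functions.
_mime_types = mimetypes.MimeTypes()
_mime_types.add_type('video/x-matroska', '.mkv')
_mime_types.add_type('audio/wav', '.wav')


def infer_media_type(url: str) -> Optional[str]:
    """Hypothetical helper: guess the media type from the URL path."""
    # Drop the query string and fragment before looking at the extension.
    path = urlparse(url).path
    media_type, _encoding = _mime_types.guess_type(path)
    return media_type


print(infer_media_type('https://example.com/a/clip.mkv?X-Amz-Expires=3600'))
print(infer_media_type('https://example.com/a/sound.wav'))
```

Calling `guess_type()` on the instance rather than on the `mimetypes` module keeps the extra registrations from leaking into other code running in the same interpreter.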
