
[utils] Shorten proposed file name on create if too long #29989

Open
wants to merge 2 commits into master from df-utils-trimfname-patch

Conversation

dirkf
Contributor

@dirkf dirkf commented Sep 19, 2021

Please follow the guide below


Before submitting a pull request make sure you have:

In order to be accepted and merged into youtube-dl each piece of code must be in public domain or released under Unlicense. Check one of the following options:

  • I am the original author of this code and I am willing to release it under Unlicense
  • I am not the original author of this code but it is in public domain or released under Unlicense (provide reliable evidence)

What is the purpose of your pull request?

  • Bug fix
  • Improvement
  • New extractor
  • New feature

Description of your pull request and other information

If the destination filename constructed from the output template is too long for the filesystem, the yt-dl download fails. The user can't predict how long extracted fields (eg title) may be, so this may be unexpected. The user then has to work out how to modify the output template (eg %(title)s -> %(title).25s) and run the command again.

Consequently there have been suggestions (of which #29912 is the most recent) that the destination filename should be shortened automatically. This doesn't seem unreasonable, since sanitize_path() already has the potential to edit the filename. PR #25475 failed because it was platform-specific.

This PR adds a function reduce_filename() to utils.py that shortens the filename component by a specified factor, and adds logic to sanitize_open() to detect errno.ENAMETOOLONG and iteratively call reduce_filename() until opening either succeeds, or fails for some other reason.

Resolves #30361, resolves #29912, resolves #29975, resolves #29627, resolves #28626, resolves #7765, resolves #5921, resolves #5908, probably others.
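The mechanism described above can be sketched roughly as follows. This is an illustrative sketch, not the PR's actual code: the function names match the description, but the halving factor and the guard logic are my assumptions.

```python
import errno
import os.path


def reduce_filename(filename, factor=2):
    # Illustrative sketch: shorten only the name component (not the
    # directory path or the extension) by the given factor.
    head, tail = os.path.split(filename)
    stem, ext = os.path.splitext(tail)
    keep = max(1, len(stem) // factor)
    return os.path.join(head, stem[:keep] + ext)


def sanitize_open_sketch(filename, mode):
    # Iteratively retry with a shorter name while the OS reports
    # ENAMETOOLONG; re-raise any other error, or give up once the
    # name can no longer be shortened.
    while True:
        try:
            return open(filename, mode), filename
        except OSError as e:
            shorter = reduce_filename(filename)
            if e.errno != errno.ENAMETOOLONG or shorter == filename:
                raise
            filename = shorter
```

On a typical POSIX filesystem with a 255-byte component limit, a 600-character proposed name would be halved twice before the open succeeds.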

@dirkf dirkf force-pushed the df-utils-trimfname-patch branch 9 times, most recently from 827c7c6 to aaa81b7 Compare September 19, 2021 13:30
@dirkf dirkf marked this pull request as ready for review September 19, 2021 15:04
@dirkf
Contributor Author

dirkf commented Sep 19, 2021

Apparently the suspicion that I voiced here is valid.

For UNIX-like platforms, an over-long filename gives ENAMETOOLONG as expected.

For Windows, there may be two cases:

  • if the over-long filename is caught by the Windows API, we get EINVAL (contrary to Microsoft's open() documentation; perhaps it's in the CPython interface layer);
  • possibly, if the over-long filename is detected at the filesystem level (eg, with the \\?\... filename syntax), we might still get ENAMETOOLONG.

Any test results from versions of Windows other than what's in the CI VMs, or from OSX, would be welcome.
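A minimal probe for checking which errno a given platform actually reports for an over-long name (the function name is mine, not from the PR):

```python
import errno
import os
import tempfile


def long_name_errno(directory, length=4096):
    # Diagnostic: attempt to create a file with an over-long name
    # component and report the resulting errno (ENAMETOOLONG on
    # POSIX; reportedly EINVAL via the Windows API in some cases).
    try:
        with open(os.path.join(directory, 'x' * length), 'w'):
            return None  # the filesystem accepted it
    except OSError as e:
        return e.errno


code = long_name_errno(tempfile.gettempdir())
print(errno.errorcode.get(code, code))
```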

@tylerszabo
Contributor

This is a promising approach; however, the current implementation is highly likely to clobber the video ID (which IMO is the more important part) in the default templates, as the video ID is placed at the end. A similar shortening that trims in the middle might be a more reliable approach.

There's also the issue of temp files: as the temp files get increasing integer suffixes, the prefixes will change each time a suffix grows longer. Without unit test coverage ensuring this behaviour doesn't break assembling chunks, I'm concerned that this might cause a regression at some point.

@dirkf
Contributor Author

dirkf commented Oct 13, 2021

The aim is to avoid what people have often complained about, namely having a download fail just because the proposed filename is too long for the file system:

  • the textwrap shortening does occur in the middle;
  • the --download-archive option exists for users who want to record the exact ID of each download, but perhaps there should be a variation, say --download-archive-index, which records both the ID and the filename under which the item was saved?
  • possibly the default values for the reduce_filename() parameters should be varied?
  • if you can suggest some test cases for temp files I'll be happy to add them, but the sanitize_open() function is expected to change the proposed filename and it's up to the caller to manage the result.

@tylerszabo
Contributor

I'm still concerned that this will create more bugs than it fixes. Failing fast with a clear error message, or altering the default templates to reduce the likelihood of exceeding the length limit, would both offer more predictable behaviour without introducing additional edge cases that lack test coverage.

The same URL on the same OS with the same template will result in different filenames if saved on different filesystems with this change.

@dirkf
Contributor Author

dirkf commented Oct 13, 2021

On reflection (short-term memory < 3 weeks), the shortening doesn't occur in the middle, so I had a different shortening library in mind: more to consider.

Again, many users don't consider it acceptable that a download should fail just because of filename length restrictions. Personally, I haven't suffered from this thanks to ext2/4, but I don't believe that it can be a bug if the download is successfully saved under some reasonable filename, actually the reverse.

Plainly the likelihood of needing to shorten filenames can be reduced if extractors don't (Twitter, eg) generate pointlessly long titles or IDs. But I do agree with users who are disappointed when their download that appeared to work fails because of some filename issue.

Already we have the --restrict-filenames option which will corrupt IDs containing &.

  • Should the PR introduce --[no-]restrict-filename-length to en/[dis]/able its functionality?
  • Should there be a "WARNING: yt-dl wanted to save this as 'xxx' but the file name was shortened to 'xx'"?

The issue of test coverage is addressable if further test cases are provided.

@tylerszabo
Contributor

I like the proposed solution of --[no-]restrict-filename-length as that would give more control over the behavior.

It's worth noting that it might not be apparent right away that a name needs to be shortened: without knowing that "part 10000" will be the one to exceed the length limit, the download may fail deep into a download (likely many gigabytes in, unless chunks are very small), so it's important that any implementation be resilient to resuming and to other operations involving part/temp files like merging and remuxing.

I'd like to propose another option: a "fallback short name". Perhaps %(extractor).20s_%(id).200s.ext or some other name that's unlikely to cause errors even when writing a part file with a long suffix, similar to the idea I noted in the issue for temp files: that would be predictable, consistent, unlikely to encounter the issue, and still clear to users without producing unfriendly final output (temp files are deleted by default).

For the purposes of plumbing it might be handy to use a less friendly, safer, but still intelligible name (such as extractor_id) for operations up until the final rename according to the user's preferred templates.

But all of this aside, I think a friendly error (instead of some error code) that just says "filename too long, use -o" or something like that before a download even starts would be super, especially for those who are using this in scripts (like me). This could be predicted by using a longer-than-expected suffix as a test: something like attempting to create a file called filename + "LONG_FILENAME_TEST.deleteme" before attempting the download.
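The pre-flight check proposed here might look something like this sketch (the function name and the margin string are illustrative, and the probe is deliberately coarse: any failure to create the test file counts as a failure):

```python
import os


def can_create_with_margin(filename, margin='LONG_FILENAME_TEST.deleteme'):
    # Sketch of the proposed pre-flight check: before downloading, try
    # to create a file with a longer-than-expected suffix, so that a
    # "filename too long" error surfaces up front rather than deep
    # into a download.
    probe = filename + margin
    try:
        with open(probe, 'w'):
            pass
    except OSError:
        return False
    os.remove(probe)
    return True
```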

I'm of the opinion that software that clearly reports conditions (especially when it gives meaningful hints as to how to fix the issue) is preferable to software that tries to automatically avoid the error in a non-obvious way. I've spent way more time recovering from issues created by "helpful" software that silently did things differently than it normally would than I have by trying to figure out a way to handle an edge case.

@dirkf
Contributor Author

dirkf commented Oct 25, 2021

There's also the issue of temp files - as the temp files get increased integer suffixes the temp file prefixes will change as the suffixes increase. Without unit test coverage ensuring this behavior doesn't break assembling chunks I'm concerned that this might cause a regression at some point.

Since the sanitize_open() function can change the proposed filename already, yt-dl has to be able to reassemble fragments whose filenames may not be as expected. At least in downloader/fragment.py, the actual filename for the fragment is stashed in the download fragment context, so it doesn't need to follow any pattern.

One possible failure mode: if a directory containing an incomplete download is resumed in a context that changes the available filename length, eg after moving the directory to a shorter pathname, or upgrading the filesystem to support longer pathnames. The resume logic depends on the name of the .ytdl file being derived repeatably from the output filename. My view: don't do that, but if you do, be prepared to have to start the download from scratch. The same problem could already be manifested if a directory containing an incomplete POSIX download is resumed in a Windows system.

Fragment filename management may be more tricky if simultaneous fragment downloads are supported (yt-dlp?).

@dirkf dirkf force-pushed the df-utils-trimfname-patch branch 2 times, most recently from dfa5e93 to a0159ad Compare October 25, 2021 13:06
@dirkf
Contributor Author

dirkf commented Oct 25, 2021

On reflection (short-term memory < 3 weeks), the shortening doesn't occur in the middle, so I had a different shortening library in mind: more to consider.

So now we have a solution based on the repr/reprlib (Py2/3) module, imported as compat_reprlib, that tries to truncate in the middle of the filename body, since there didn't seem to be a good way of doing that with textwrap. No doubt a better solution could be coded from scratch but this seems adequate for what is already an edge case.
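For illustration, this is how the stdlib reprlib module truncates in the middle (the PR imports it as compat_reprlib for Py2/3 compatibility; this standalone sketch uses the Python 3 name directly, and note that the output includes the quote characters of the repr):

```python
import reprlib

shortener = reprlib.Repr()
shortener.maxstring = 20  # total length of the truncated repr

# reprlib keeps the start and the end of the string and elides the
# middle with '...', which preserves a trailing video ID better than
# end-truncation would.
print(shortener.repr('some-very-long-video-title-from-an-extractor'))
```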

@pukkandan
Contributor

If the actual filename is different from the one in the info-dict, postprocessors could fail

Since the sanitize_open() function can change the proposed filename already, yt-dl has to be able to reassemble fragments whose filenames may not be as expected

Actually, no. It appears like that when looking at sanitize_open, but sanitize_path is also called when creating the filename.

return sanitize_path(filename)

So the except block in sanitize_open doesn't actually do anything. I haven't looked at the blame, but I assume it is a remnant from before the output template was a thing

@dirkf
Contributor Author

dirkf commented Jan 31, 2022

Thanks for the tip.

Looking again, it seems that sanitize_open() is called when downloading, though not necessarily with the exact filename that came from the quoted line of prepare_filename(). First of all 'f%s' % f['format_id'] is inserted into the prepared filename. Then in the HTTP downloader .part is added, and in the Fragment downloader .ytdl is added.

A format_id is forced to be a string and then sanitised to remove spaces and format selection syntax (but there was a case where a character generated by the extractor in a format_id still clashed with format selection syntax: =, BBC?). The characters added in the downloader are valid according to sanitize_path(), but if a format_id contained an invalid Windows filename character that wasn't sanitised (:, say), the except block of sanitize_open() could be entered. However the second call to sanitize_path() shouldn't affect the stem resulting from prepare_filename().

If the inserted and appended elements happen to cause the filename to exceed a length limit, the except block won't help, and the error will be re-raised.

So this is a problem:

  • naïvely abbreviating the filename in the downloader could give a different stem according to the length of the format_id, invalidating the caller's info_dict;
  • as there doesn't seem to be length limit for a format_id, either a limit has to be imposed, or a fake format_id could be used ('f{i}' for i in enumerate(formats)), or the maximum format_id length for the current item could be determined, so that a known additional length can be allowed for when abbreviating the filename;
  • if the filename is abbreviated when sanitize_open() is called, the inserted/appended elements must have a known maximum length and be ignored when abbreviating, or
  • the filename must be abbreviated, allowing for the format_id and other elements potentially added by the downloader, before calling any downloader, say by creating a temporary file with dummy additional elements.

Gruesome, really.

@pukkandan
Contributor

as there doesn't seem to be length limit for a format_id, either a limit has to be imposed, or a fake format_id could be used ('f{i}' for i in enumerate(formats)), or the maximum format_id length for the current item could be determined, so that a known additional length can be allowed for when abbreviating the filename;

We could also hash the format id if it is too long. But there may be people who depend on the intermediate filenames in their scripts

I also thought about having the downloader return the final filename and then overwriting it in the infodict. But this would cause the --get-filename and actual filename to be different, potentially breaking people's workflows. To avoid that, it seems necessary that the final filename be determined in prepare_filename itself. We can sanitize the temporary filenames again in the downloader. It is mostly not an issue as long as the final filename can be fixed ahead of time.
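The hashing idea could be sketched like this (the function name, digest length, and threshold are illustrative, not from the discussion):

```python
import hashlib


def clip_format_id(format_id, max_len=32):
    # Sketch of the hashing suggestion: keep short format IDs as-is,
    # and replace over-long ones with a fixed-length digest so the
    # extra filename length is bounded and predictable.
    if len(format_id) <= max_len:
        return format_id
    return hashlib.sha256(format_id.encode('utf-8')).hexdigest()[:max_len]
```

The trade-off noted above applies: the digest is deterministic, but any script that matched the original intermediate filenames would break.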

@akwiatkowski

While I wait for this to be merged, I added my own alias which helped. Keep in mind I'm not a bash expert and truncation is not included here, but this helped me.

alias ydl='function _ydl() { command="youtube-dl \"$1\" -o \"`youtube-dl --get-filename \"$1\" | perl -pe \"s/[^a-zA-Z0-9\. ]//g\" | perl -pe \"s/\s{2,}/ /g\" `\" "; echo "$command"; bash -c "$command"; }; _ydl'

@rubyFeedback

rubyFeedback commented Oct 5, 2022

Unfortunately I am really bad with bash code / shell scripts. Can we have some workaround like an additional command-line flag? Something like --safe-download or so, that will download no matter what. Right now I don't know how to get the file name if the name is too long; yt-dl just errors out on me and nothing gets downloaded as a result.

@dirkf
Contributor Author

dirkf commented Oct 5, 2022

Just put an output template specification in your config file or on the command-line, replacing the default %(title)s-%(id)s.%(ext)s:

-o "%(title).200s-%(id)s.%(ext)s"

Or change 200 to a smaller number if that doesn't work for you.

@mzhboy

mzhboy commented Nov 2, 2022

For CJK users 200 is still too long. My suggestion:
230/3 = 76, suitable for most cases,
or
230/4 = 57, to make sure the filename always fits.
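The arithmetic here assumes a 255-byte filename limit with roughly 230 bytes left for the title after the ID and extension; a quick check of the UTF-8 byte counts:

```python
# CJK characters are 3 bytes each in UTF-8 (4 for some supplementary
# characters and emoji), so within a ~230-byte budget the safe
# character count is 230 // 3 = 76, or 230 // 4 = 57 to be certain.
title = '\u5f71' * 76                # 76 CJK characters
print(len(title.encode('utf-8')))    # 228 bytes: within the budget
print(len(('\u5f71' * 77).encode('utf-8')))  # 231 bytes: over it
```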

@dirkf
Contributor Author

dirkf commented Nov 3, 2022

This is why a better solution is the approach attempted in the PR: ie, detecting what actual length is valid.

@kapitaali

kapitaali commented Mar 21, 2023

While I wait to be merged I added my own alias which helped. Keep in mind I'm not bash expert and truncation is not included here but this helped me.

alias ydl='function _ydl() { command="youtube-dl \"$1\" -o \"`youtube-dl --get-filename \"$1\" | perl -pe \"s/[^a-zA-Z0-9\. ]//g\" | perl -pe \"s/\s{2,}/ /g\" `\" "; echo "$command"; bash -c "$command"; }; _ydl'

I would love to use this but EXT filesystem says that the file name is too long. Otherwise a great hack and I'd like to use it if it would shorten it.

So should we add an extra layer of regex, or how else would you shorten the length to 255 characters?

@dirkf
Contributor Author

dirkf commented Mar 21, 2023

Just use the known work-around.
