
[utils] Shorten proposed file name on create if too long #29989

Open
wants to merge 2 commits into master from df-utils-trimfname-patch

Conversation

dirkf
Contributor

@dirkf dirkf commented Sep 19, 2021

Please follow the guide below


Before submitting a pull request make sure you have:

In order to be accepted and merged into youtube-dl each piece of code must be in public domain or released under Unlicense. Check one of the following options:

  • I am the original author of this code and I am willing to release it under Unlicense
  • I am not the original author of this code but it is in public domain or released under Unlicense (provide reliable evidence)

What is the purpose of your pull request?

  • Bug fix
  • Improvement
  • New extractor
  • New feature

Description of your pull request and other information

If the destination filename constructed from the output template is too long for the filesystem, the yt-dl download fails. The user can't predict how long extracted fields (eg title) may be, so this may be unexpected. The user then has to work out how to modify the output template (eg %(title)s -> %(title).25s) and run the command again.

Consequently there have been suggestions (of which #29912 is the most recent) that the destination filename should be shortened automatically. This doesn't seem unreasonable, since sanitize_path() already has the potential to edit the filename. PR #25475 failed because it was platform-specific.

This PR adds a function reduce_filename() to utils.py that shortens the filename component by a specified factor, and adds logic to sanitize_open() to detect errno.ENAMETOOLONG and iteratively call reduce_filename() until opening either succeeds, or fails for some other reason.

Resolves #30361, resolves #29912, resolves #29975, resolves #29627, resolves #28626, resolves #7765, resolves #5921, resolves #5908, probably others.
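The mechanism described above can be sketched roughly as follows. This is an illustrative sketch, not the PR's actual code: the function names match the description, but the halving factor and the guard logic are my assumptions.

```python
import errno
import os.path


def reduce_filename(filename, factor=2):
    # Illustrative sketch: shorten only the name component (not the
    # directory path or the extension) by the given factor.
    head, tail = os.path.split(filename)
    stem, ext = os.path.splitext(tail)
    keep = max(1, len(stem) // factor)
    return os.path.join(head, stem[:keep] + ext)


def sanitize_open_sketch(filename, mode):
    # Iteratively retry with a shorter name while the OS reports
    # ENAMETOOLONG; re-raise any other error, or give up once the
    # name can no longer be shortened.
    while True:
        try:
            return open(filename, mode), filename
        except OSError as e:
            shorter = reduce_filename(filename)
            if e.errno != errno.ENAMETOOLONG or shorter == filename:
                raise
            filename = shorter
```

On a typical POSIX filesystem with a 255-byte component limit, a 600-character proposed name would be halved twice before the open succeeds.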

@dirkf dirkf force-pushed the df-utils-trimfname-patch branch 9 times, most recently from 827c7c6 to aaa81b7 Compare September 19, 2021 13:30
@dirkf dirkf marked this pull request as ready for review September 19, 2021 15:04
@dirkf
Contributor Author

dirkf commented Sep 19, 2021

Apparently the suspicion that I voiced here is valid.

For UNIX-like platforms, an over-long filename gives ENAMETOOLONG as expected.

For Windows, there may be two cases:

  • if the over-long filename is caught by the Windows API, we get EINVAL (contrary to Microsoft's open() documentation; perhaps it's in the CPython interface layer);
  • possibly, if the over-long filename is detected at the filesystem level (eg, with the \\?\... filename syntax), we might still get ENAMETOOLONG.

Any test results from versions of Windows other than what's in the CI VMs, or from OSX, would be welcome.
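A minimal probe for checking which errno a given platform actually reports for an over-long name (the function name is mine, not from the PR):

```python
import errno
import os
import tempfile


def long_name_errno(directory, length=4096):
    # Diagnostic: attempt to create a file with an over-long name
    # component and report the resulting errno (ENAMETOOLONG on
    # POSIX; reportedly EINVAL via the Windows API in some cases).
    try:
        with open(os.path.join(directory, 'x' * length), 'w'):
            return None  # the filesystem accepted it
    except OSError as e:
        return e.errno


code = long_name_errno(tempfile.gettempdir())
print(errno.errorcode.get(code, code))
```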

@tylerszabo
Contributor

This is a promising approach; however, the current implementation is highly likely to clobber the video ID (which IMO is the more important part) in the default templates, as the video ID is placed at the end. A similar shortening that trims in the middle might be a more reliable approach.

There's also the issue of temp files: as the temp files get increasing integer suffixes, the prefixes will change each time a suffix grows longer. Without unit test coverage ensuring this behaviour doesn't break assembling chunks, I'm concerned that this might cause a regression at some point.

@dirkf
Contributor Author

dirkf commented Oct 13, 2021

The aim is to avoid what people have often complained about, namely having a download fail just because the proposed filename is too long for the file system:

  • the textwrap shortening does occur in the middle;
  • the --download-archive option exists for users who want to record the exact ID of each download, but perhaps there should be a variation, say --download-archive-index, which records both the ID and the filename under which the item was saved?
  • possibly the default values for the reduce_filename() parameters should be varied?
  • if you can suggest some test cases for temp files I'll be happy to add them, but the sanitize_open() function is expected to change the proposed filename and it's up to the caller to manage the result.

@tylerszabo
Contributor

I'm still concerned that this will create more bugs than it fixes. Failing fast with a clear error message, or altering the default templates to reduce the likelihood of exceeding the length limit, would both offer more predictable behaviour without introducing additional edge cases that lack test coverage.

The same URL on the same OS with the same template will result in different filenames if saved on different filesystems with this change.

@dirkf
Contributor Author

dirkf commented Oct 13, 2021

On reflection (short-term memory < 3 weeks), the shortening doesn't occur in the middle, so I had a different shortening library in mind: more to consider.

Again, many users don't consider it acceptable that a download should fail just because of filename length restrictions. Personally, I haven't suffered from this thanks to ext2/4, but I don't believe that it can be a bug if the download is successfully saved under some reasonable filename, actually the reverse.

Plainly the likelihood of needing to shorten filenames can be reduced if extractors don't (Twitter, eg) generate pointlessly long titles or IDs. But I do agree with users who are disappointed when their download that appeared to work fails because of some filename issue.

Already we have the --restrict-filenames option which will corrupt IDs containing &.

  • Should the PR introduce --[no-]restrict-filename-length to en/[dis]/able its functionality?
  • Should there be a "WARNING: yt-dl wanted to save this as 'xxx' but the file name was shortened to 'xx'"?

The issue of test coverage is addressable if further test cases are provided.

@tylerszabo
Contributor

I like the proposed solution of --[no-]restrict-filename-length as that would give more control over the behavior.

It's worth noting that it might not be apparent right away that a name needs to be shortened: without knowing that "part 10000" will be the one to exceed the length limit, the download may fail deep into a download (likely many gigabytes in, unless chunks are very small), so it's important that any implementation be resilient to resuming and to other operations involving part/temp files like merging and remuxing.

I'd like to propose another option: a "fallback short name". Perhaps %(extractor).20s_%(id).200s.ext or some other name that's unlikely to cause errors even when writing a part file with a long suffix, similar to the idea I noted in the issue for temp files: that would be predictable, consistent, unlikely to encounter the issue, and still clear to users without producing unfriendly final output (temp files are deleted by default).

For the purposes of plumbing it might be handy to use a less friendly, safer, but still intelligible name (such as extractor_id) for operations up until the final rename according to the user's preferred templates.

But all of this aside, I think a friendly error (instead of some error code) that just says "filename too long, use -o" or something like that before a download even starts would be super, especially for those who are using this in scripts (like me). This could be predicted by using a longer-than-expected suffix as a test: something like attempting to create a file called filename + "LONG_FILENAME_TEST.deleteme" before attempting the download.
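The pre-flight check proposed here might look something like this sketch (the function name and the margin string are illustrative, and the probe is deliberately coarse: any failure to create the test file counts as a failure):

```python
import os


def can_create_with_margin(filename, margin='LONG_FILENAME_TEST.deleteme'):
    # Sketch of the proposed pre-flight check: before downloading, try
    # to create a file with a longer-than-expected suffix, so that a
    # "filename too long" error surfaces up front rather than deep
    # into a download.
    probe = filename + margin
    try:
        with open(probe, 'w'):
            pass
    except OSError:
        return False
    os.remove(probe)
    return True
```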

I'm of the opinion that software that clearly reports conditions (especially when it gives meaningful hints as to how to fix the issue) is preferable to software that tries to automatically avoid the error in a non-obvious way. I've spent way more time recovering from issues created by "helpful" software that silently did things differently than it normally would than I have by trying to figure out a way to handle an edge case.

@dirkf
Contributor Author

dirkf commented Oct 25, 2021

There's also the issue of temp files - as the temp files get increased integer suffixes the temp file prefixes will change as the suffixes increase. Without unit test coverage ensuring this behavior doesn't break assembling chunks I'm concerned that this might cause a regression at some point.

Since the sanitize_open() function can change the proposed filename already, yt-dl has to be able to reassemble fragments whose filenames may not be as expected. At least in downloader/fragment.py, the actual filename for the fragment is stashed in the download fragment context, so it doesn't need to follow any pattern.

One possible failure mode: if a directory containing an incomplete download is resumed in a context that changes the available filename length, eg after moving the directory to a shorter pathname, or upgrading the filesystem to support longer pathnames. The resume logic depends on the name of the .ytdl file being derived repeatably from the output filename. My view: don't do that, but if you do, be prepared to have to start the download from scratch. The same problem could already be manifested if a directory containing an incomplete POSIX download is resumed in a Windows system.

Fragment filename management may be more tricky if simultaneous fragment downloads are supported (yt-dlp?).

@dirkf dirkf force-pushed the df-utils-trimfname-patch branch 2 times, most recently from dfa5e93 to a0159ad Compare October 25, 2021 13:06
@dirkf
Contributor Author

dirkf commented Oct 25, 2021

On reflection (short-term memory < 3 weeks), the shortening doesn't occur in the middle, so I had a different shortening library in mind: more to consider.

So now we have a solution based on the repr/reprlib (Py2/3) module, imported as compat_reprlib, that tries to truncate in the middle of the filename body, since there didn't seem to be a good way of doing that with textwrap. No doubt a better solution could be coded from scratch but this seems adequate for what is already an edge case.
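For illustration, this is how the stdlib reprlib module truncates in the middle (the PR imports it as compat_reprlib for Py2/3 compatibility; this standalone sketch uses the Python 3 name directly, and note that the output includes the quote characters of the repr):

```python
import reprlib

shortener = reprlib.Repr()
shortener.maxstring = 20  # total length of the truncated repr

# reprlib keeps the start and the end of the string and elides the
# middle with '...', which preserves a trailing video ID better than
# end-truncation would.
print(shortener.repr('some-very-long-video-title-from-an-extractor'))
```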

@pukkandan
Contributor

If the actual filename is different from the one in the info-dict, postprocessors could fail

Since the sanitize_open() function can change the proposed filename already, yt-dl has to be able to reassemble fragments whose filenames may not be as expected

Actually, no. It appears like that when looking at sanitize_open, but sanitize_path is also called when creating the filename.

return sanitize_path(filename)

So the except block in sanitize_open doesn't actually do anything. I haven't looked at the blame, but I assume it is a remnant from before the output template was a thing

@dirkf
Contributor Author

dirkf commented Jan 31, 2022

Thanks for the tip.

Looking again, it seems that sanitize_open() is called when downloading, though not necessarily with the exact filename that came from the quoted line of prepare_filename(). First of all 'f%s' % f['format_id'] is inserted into the prepared filename. Then in the HTTP downloader .part is added, and in the Fragment downloader .ytdl is added.

A format_id is forced to be a string and then sanitised to remove spaces and format selection syntax (but there was a case where a character generated by the extractor in a format_id still clashed with format selection syntax: =, BBC?). The characters added in the downloader are valid according to sanitize_path(), but if a format_id contained an invalid Windows filename character that wasn't sanitised (:, say), the except block of sanitize_open() could be entered. However the second call to sanitize_path() shouldn't affect the stem resulting from prepare_filename().

If the inserted and appended elements happen to cause the filename to exceed a length limit, the except block won't help, and the error will be re-raised.

So this is a problem:

  • naïvely abbreviating the filename in the downloader could give a different stem according to the length of the format_id, invalidating the caller's info_dict;
  • as there doesn't seem to be length limit for a format_id, either a limit has to be imposed, or a fake format_id could be used ('f{i}' for i in enumerate(formats)), or the maximum format_id length for the current item could be determined, so that a known additional length can be allowed for when abbreviating the filename;
  • if the filename is abbreviated when sanitize_open() is called, the inserted/appended elements must have a known maximum length and be ignored when abbreviating, or
  • the filename must be abbreviated, allowing for the format_id and other elements potentially added by the downloader, before calling any downloader, say by creating a temporary file with dummy additional elements.

Gruesome, really.

@pukkandan
Contributor

as there doesn't seem to be length limit for a format_id, either a limit has to be imposed, or a fake format_id could be used ('f{i}' for i in enumerate(formats)), or the maximum format_id length for the current item could be determined, so that a known additional length can be allowed for when abbreviating the filename;

We could also hash the format id if it is too long. But there may be people who depend on the intermediate filenames in their scripts

I also thought about having the downloader return the final filename and then overwriting it in the infodict. But this would cause the --get-filename and actual filename to be different, potentially breaking people's workflows. To avoid that, it seems necessary that the final filename be determined in prepare_filename itself. We can sanitize the temporary filenames again in the downloader. It is mostly not an issue as long as the final filename can be fixed ahead of time.
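The hashing idea could be sketched like this (the function name, digest length, and threshold are illustrative, not from the discussion):

```python
import hashlib


def clip_format_id(format_id, max_len=32):
    # Sketch of the hashing suggestion: keep short format IDs as-is,
    # and replace over-long ones with a fixed-length digest so the
    # extra filename length is bounded and predictable.
    if len(format_id) <= max_len:
        return format_id
    return hashlib.sha256(format_id.encode('utf-8')).hexdigest()[:max_len]
```

The trade-off noted above applies: the digest is deterministic, but any script that matched the original intermediate filenames would break.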

@akwiatkowski

While I wait for this to be merged, I added my own alias which helped. Keep in mind I'm not a bash expert and truncation is not included here, but this helped me.

alias ydl='function _ydl() { command="youtube-dl \"$1\" -o \"`youtube-dl --get-filename \"$1\" | perl -pe \"s/[^a-zA-Z0-9\. ]//g\" | perl -pe \"s/\s{2,}/ /g\" `\" "; echo "$command"; bash -c "$command"; }; _ydl'

@rubyFeedback

rubyFeedback commented Oct 5, 2022

Unfortunately I am really bad with bash code / shell scripts. Can we have some workaround like an additional command-line flag? Something like --safe-download or so, that will download no matter what. Right now I don't know how to get the file name if the name is too long; yt-dl just errors out on me and nothing gets downloaded as a result.

@dirkf
Contributor Author

dirkf commented Oct 5, 2022

Just put an output template specification in your config file or on the command-line, replacing the default %(title)s-%(id)s.%(ext)s:

-o "%(title).200s-%(id)s.%(ext)s"

Or change 200 to a smaller number if that doesn't work for you.

@mzhboy

mzhboy commented Nov 2, 2022

For CJK users 200 is still too long. My suggestion:
230/3 = 76, suitable for most cases,
or
230/4 = 57, to make sure the filename always fits.
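The arithmetic here assumes a 255-byte filename limit with roughly 230 bytes left for the title after the ID and extension; a quick check of the UTF-8 byte counts:

```python
# CJK characters are 3 bytes each in UTF-8 (4 for some supplementary
# characters and emoji), so within a ~230-byte budget the safe
# character count is 230 // 3 = 76, or 230 // 4 = 57 to be certain.
title = '\u5f71' * 76                # 76 CJK characters
print(len(title.encode('utf-8')))    # 228 bytes: within the budget
print(len(('\u5f71' * 77).encode('utf-8')))  # 231 bytes: over it
```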

@dirkf
Contributor Author

dirkf commented Nov 3, 2022

This is why a better solution is the approach attempted in the PR: ie, detecting what actual length is valid.

@kapitaali

kapitaali commented Mar 21, 2023

While I wait to be merged I added my own alias which helped. Keep in mind I'm not bash expert and truncation is not included here but this helped me.

alias ydl='function _ydl() { command="youtube-dl \"$1\" -o \"`youtube-dl --get-filename \"$1\" | perl -pe \"s/[^a-zA-Z0-9\. ]//g\" | perl -pe \"s/\s{2,}/ /g\" `\" "; echo "$command"; bash -c "$command"; }; _ydl'

I would love to use this but EXT filesystem says that the file name is too long. Otherwise a great hack and I'd like to use it if it would shorten it.

So should we add an extra layer of regex, or how else would you shorten the length to 255 characters?

@dirkf
Contributor Author

dirkf commented Mar 21, 2023

Just use the known work-around.
