Description
Currently prefetch is fixed (default = 2), which works poorly across different file sizes and usually requires manual tuning. It is especially painful for datasets with a large number of small files.
Prefetch can be derived automatically from file size.
Parallelism is more sensitive and should stay user-controlled. The computed prefetch can be treated as a total budget and split across workers.
Heuristic:
    if avg_file_size <= 256 * KiB:  # aggressive: latency/overhead dominates
        # 4 MiB is the target total size of data in flight
        total_prefetch = min(max(4 * MiB / avg_file_size, 8), 128)
    elif avg_file_size <= 64 * MiB:  # moderate
        total_prefetch = 8
    else:  # conservative
        total_prefetch = 2
    # split the total budget across parallel workers
    prefetch = math.ceil(total_prefetch / parallel)
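The heuristic above can be sketched as a small runnable function. This is only an illustration: the name `auto_prefetch`, the `KiB`/`MiB` constants, and the guard against a zero average size are assumptions, not part of any existing API.

```python
import math

KiB = 1024
MiB = 1024 * KiB


def auto_prefetch(avg_file_size: float, parallel: int = 1) -> int:
    """Derive a per-worker prefetch value from the average file size (hypothetical helper)."""
    if parallel < 1:
        raise ValueError("parallel must be >= 1")
    # Assumption: treat an average of 0 bytes (all files empty) as 1 byte
    # so the division below cannot fail.
    size = max(avg_file_size, 1)
    if size <= 256 * KiB:
        # Aggressive: aim for about 4 MiB in flight, clamped to [8, 128].
        total = min(max(4 * MiB / size, 8), 128)
    elif size <= 64 * MiB:
        total = 8  # moderate
    else:
        total = 2  # conservative
    # The total budget is split across workers, rounding up.
    return math.ceil(total / parallel)
```

For example, 32 KiB files with `parallel=4` give a total budget of 128 and a per-worker prefetch of 32, while 1 GiB files with `parallel=4` give a per-worker prefetch of 1.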
This should be set before any UDF that takes File as input runs. chain.avg("file.size") is the only overhead. An explicit settings(prefetch=5) takes priority if defined. If the UDF takes multiple File inputs, use the first one for estimation.
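The precedence rules could look roughly like the sketch below. All names here (`resolve_prefetch`, `avg_size_of`, `derive`, `DEFAULT_PREFETCH`) are hypothetical and used only for illustration; the averaging query and the size-based heuristic are passed in as callables rather than tied to a real API.

```python
from typing import Callable, Optional, Sequence

DEFAULT_PREFETCH = 2  # the current fixed default mentioned in this issue


def resolve_prefetch(
    explicit: Optional[int],
    file_columns: Sequence[str],
    avg_size_of: Callable[[str], float],
    derive: Callable[[float], int],
) -> int:
    """Pick the prefetch value for a UDF (hypothetical helper)."""
    # An explicit user setting always wins; no averaging query is run.
    if explicit is not None:
        return explicit
    # No File inputs: nothing to estimate from, keep the default.
    if not file_columns:
        return DEFAULT_PREFETCH
    # Multiple File inputs: estimate from the first one only.
    avg_size = avg_size_of(f"{file_columns[0]}.size")
    return derive(avg_size)
```

Checking the explicit setting first means the averaging query is skipped entirely whenever the user has already chosen a prefetch value.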
This removes the need for manual tuning and adapts to small vs large files.