Dynamic prefetch #1706

@dmpetrov

Description

Currently prefetch is fixed (default = 2), which works poorly across different file sizes and always requires manual tuning. It's especially painful with a large number of small files.

Prefetch can be derived automatically from file size.

Parallelism is more sensitive and should stay user-controlled. The computed prefetch can be treated as a total budget and split across workers.

Heuristic:

if avg_file_size <= 256 * KiB:   # aggressive: latency/overhead dominates
    # 4 MiB is the target total size of data in flight; clamp the count to [8, 128]
    # (Python has no math.clamp, so use min/max; max(..., 1) avoids division by zero)
    total_prefetch = min(max(4 * MiB / max(avg_file_size, 1), 8), 128)
elif avg_file_size <= 64 * MiB:  # moderate
    total_prefetch = 8
else:
    total_prefetch = 2  # conservative

prefetch = math.ceil(total_prefetch / parallel)  # per-worker share of the total budget
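
A minimal runnable sketch of the heuristic above, for discussion only. The function name `compute_prefetch`, the `KiB`/`MiB` constants, and the guards for empty files and `parallel < 1` are assumptions made for illustration, not existing API:

```python
import math

KiB = 1024
MiB = 1024 * KiB


def compute_prefetch(avg_file_size: float, parallel: int = 1) -> int:
    """Return a per-worker prefetch value from the average file size in bytes.

    Hypothetical helper: the thresholds come from the heuristic in this issue.
    """
    if parallel < 1:
        raise ValueError("parallel must be >= 1")

    if avg_file_size <= 256 * KiB:
        # Aggressive: aim for ~4 MiB of data in flight, clamped to [8, 128].
        size = max(avg_file_size, 1)  # assumption: treat empty files as 1 byte
        total = min(max(math.ceil(4 * MiB / size), 8), 128)
    elif avg_file_size <= 64 * MiB:
        total = 8  # moderate
    else:
        total = 2  # conservative

    # Treat the total as a budget and split it across workers, rounding up.
    return math.ceil(total / parallel)
```

Because the split rounds up, the combined prefetch across workers can slightly exceed the budget (for example, 3 workers with a budget of 8 get 3 each), and each worker always gets at least 1.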

This should be computed before running a UDF that takes File as input. chain.avg("file.size") is the only overhead. An explicit settings(prefetch=5) takes priority if defined. If the UDF takes multiple File inputs, use the first one for estimation.
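
The precedence rules could be sketched roughly as follows. Everything here is hypothetical: `resolve_prefetch` and its parameters are illustration-only names, `avg_size_of` stands in for a call such as chain.avg("file.size"), and `derive` stands in for the size-based heuristic.

```python
from typing import Callable, Optional, Sequence


def resolve_prefetch(
    user_prefetch: Optional[int],
    file_columns: Sequence[str],
    avg_size_of: Callable[[str], float],
    derive: Callable[[float], int],
) -> Optional[int]:
    """Pick the prefetch value for a UDF (hypothetical helper).

    user_prefetch: value from an explicit settings(prefetch=...), or None.
    file_columns:  names of the UDF's File inputs, in signature order.
    avg_size_of:   returns the average of a column, e.g. "file.size".
    derive:        maps an average file size in bytes to a prefetch value.
    """
    # An explicit user setting always wins, including prefetch=0.
    if user_prefetch is not None:
        return user_prefetch

    # No File input: nothing to estimate from, so leave the default in place.
    if not file_columns:
        return None

    # With several File inputs, only the first one is used for estimation.
    avg_size = avg_size_of(f"{file_columns[0]}.size")
    return derive(avg_size)
```

Checking the user setting first means the average-size query is skipped entirely when prefetch is set manually, so the extra overhead is only paid when the value is actually derived.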

This removes the need for manual tuning and adapts to small vs large files.
