Coalesce and parallelize partial shard reads #3004
Conversation
I'm curious what folks think of this and am looking for input! One particular design question to note: this adds a nested `concurrent_map` (there's already one in the codec pipeline's `read`/`read_batch`), so it's technically possible to have concurrency of […]
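For intuition on that design question, here's a minimal, self-contained sketch (not zarr-python's actual implementation; names are illustrative) of why nesting two concurrency-limited maps can multiply the number of in-flight requests:

```python
import asyncio


async def concurrent_map(items, fn, limit):
    # Simplified stand-in for a concurrency-limited map: at most
    # `limit` calls to `fn` are in flight at once.
    sem = asyncio.Semaphore(limit)

    async def run(item):
        async with sem:
            return await fn(item)

    return await asyncio.gather(*(run(item) for item in items))


async def main():
    in_flight = 0
    peak = 0

    async def fake_get(_key):
        nonlocal in_flight, peak
        in_flight += 1
        peak = max(peak, in_flight)
        await asyncio.sleep(0.01)  # pretend to wait on the store
        in_flight -= 1

    async def read_shard(_shard):
        # Inner map: partial chunk reads within one shard.
        await concurrent_map(range(10), fake_get, limit=10)

    # Outer map: one task per shard, as in a batched read.
    await concurrent_map(range(10), read_shard, limit=10)

    # Each limit is respected individually, but together up to
    # limit_outer * limit_inner gets can be in flight at once.
    print("peak concurrent gets:", peak)


asyncio.run(main())
```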
@aldenks thanks for working on this, it's a great optimization direction to pursue. That being said, since this is a performance optimization, it might be helpful to get some metrics for how much perf we get with these changes, in terms of speed and also latency / number of requests. If reducing requests is the goal, you could consider implementing a […]. As for the specific code changes (and the nested concurrent map), nothing jumps out to me as problematic, but that might just be my ignorance of async python and the sharding code 😁
I'm working on a couple of performance tests as Davis suggested above. Trying airspeed velocity, which is new to me but looks useful.
Something like this might be more useful than asv here. A regression test of this form would also be great, if it can be made to work.
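A hedged sketch of what such a regression test might look like, counting store `get` calls with a subclass (the `CountingStore` wrapper and the threshold are assumptions for illustration, not existing zarr API or the PR's actual test):

```python
import zarr
from zarr.storage import MemoryStore


class CountingStore(MemoryStore):
    """MemoryStore subclass that counts `get` calls (illustrative)."""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.get_calls = 0

    async def get(self, key, prototype, byte_range=None):
        self.get_calls += 1
        return await super().get(key, prototype, byte_range)


def test_partial_shard_read_request_count():
    store = CountingStore()
    arr = zarr.create_array(
        store=store,
        shape=(256, 256),
        shards=(256, 256),
        chunks=(32, 32),
        dtype="uint8",
    )
    arr[:] = 1
    store.get_calls = 0  # ignore requests made while writing

    arr[:64, :64]  # reads a 2x2 block of chunks from one shard

    # With coalescing, the nearby chunk ranges plus the shard index
    # should arrive in only a few requests (threshold illustrative).
    assert store.get_calls <= 4
```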
Got pulled away after pushing those changes but before describing them -- I'll update this tomorrow with profiling results.
Alright! I don't necessarily need a detailed review at this point -- if y'all let me know any open questions you have or give a thumbs up, I'll work on the remaining TODOs in the description.

**Tests**

Added […]
**Profiling**

I added a slow test which reads a few subsets of a large-ish shard containing small-ish chunks under a range of settings. See […]. The test is currently skipped, and I don't necessarily think it should be a test; it probably belongs in […]. To recreate the figures below:
**Plotting code**

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

df_coalesce = pd.read_json(
    "zarr-python-partial-shard-read-performance-with-coalesce.json"
)
df_no_coalesce = pd.read_json(
    "zarr-python-partial-shard-read-performance-no-coalesce.json"
)
df_no_coalesce["coalesce_max_gap"] = "Before optimization"
df = pd.concat([df_no_coalesce, df_coalesce])

# Define statements and metrics
statements = df["statement"].unique()
metrics = ["time", "store_get_calls"]
metric_titles = ["Execution Time", "Number of Store Get Calls"]
metric_ylabels = ["Time (s)", "Number of Calls"]

# Create labels for coalesce values
coalesce_values = df["coalesce_max_gap"].unique()
coalesce_labels = [
    x if isinstance(x, str) else "Disabled" if x == -1 else f"{x/1024/1024:.0f}MiB"
    for x in coalesce_values
]

# Get unique get_latency values
latency_values = df["get_latency"].unique()

# For each unique get_latency value, create a separate figure
for latency in latency_values:
    latency_df = df[df["get_latency"] == latency]
    max_time = latency_df["time"].max()
    max_store_get_calls = latency_df["store_get_calls"].max()

    plt.figure(figsize=(16, 12))
    plt.suptitle(
        "Performance when reading a subset of a shard\n"
        "Execution time and count of store `get` calls for 3 array access patterns\n"
        "Array and shard shape: (512, 512, 512), chunk shape: (64, 64, 64)\n"
        "Store get latency: local ssd"
        + (
            ""
            if latency == 0
            else f" + {latency * 1000:.0f}ms (simulated object storage)"
        ),
        fontsize=14,
    )

    # One row per statement, two columns: time and store_get_calls
    for i, statement in enumerate(statements):
        statement_df = latency_df[latency_df["statement"] == statement]

        # Plot for time
        plt.subplot(3, 2, i * 2 + 1)
        sns.barplot(
            data=statement_df,
            x="concurrency",
            y="time",
            hue="coalesce_max_gap",
            hue_order=coalesce_values,
            palette="muted",
            errorbar=None,
        )
        plt.title(f"{statement} - {metric_titles[0]}")
        plt.xlabel("Concurrency")
        plt.ylabel(metric_ylabels[0])
        plt.ylim(0, max_time * 1.1)
        plt.legend(title="Coalesce Max Gap", labels=coalesce_labels)

        # Plot for store_get_calls
        plt.subplot(3, 2, i * 2 + 2)
        sns.barplot(
            data=statement_df,
            x="concurrency",
            y="store_get_calls",
            hue="coalesce_max_gap",
            hue_order=coalesce_values,
            palette="muted",
            errorbar=None,
        )
        plt.title(f"{statement} - {metric_titles[1]}")
        plt.xlabel("Concurrency")
        plt.ylabel(metric_ylabels[1])
        plt.ylim(0, max_store_get_calls * 1.1)
        plt.legend(title="Coalesce Max Gap", labels=coalesce_labels)

    plt.tight_layout()
    plt.savefig(f"zarr-metrics-by-statement-with-latency-{latency}s.png", dpi=300)
    plt.show()
    plt.close()
```

**Local SSD**

The results shown in this figure were read from my local SSD. Before this optimization, and when run on my SSD, the time to read chunks serially in the loop in […]

**Simulated Object Storage**

This optimization matters most when reading from higher-latency storage. To simulate this I added 10ms of latency to each `get` call. (I've read that with specific networking setups you can get down to 1ms latency from S3, and I also regularly see >100ms in practice, so 10ms feels fair.)

**Takeaways**

None of these are particularly surprising, but my takeaways are: […]
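Regarding the simulated latency above: it can be injected with a thin store subclass along these lines (a sketch under assumed names, not the test's actual code):

```python
import asyncio

from zarr.storage import MemoryStore


class SimulatedLatencyStore(MemoryStore):
    """MemoryStore that sleeps before each `get` to mimic object storage."""

    get_latency = 0.010  # seconds per request (10 ms)

    async def get(self, key, prototype, byte_range=None):
        await asyncio.sleep(self.get_latency)
        return await super().get(key, prototype, byte_range)
```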
This is very cool! On the topic of benchmarking: when reading compressed & sharded datasets, LDeakin's `zarr_benchmarks` […]. Hopefully this PR will help! If it's helpful, and if this PR is ready, then I could try running LDeakin's `zarr_benchmarks` against this branch.
Thanks @JackKelly! I'd be happy for more benchmarking of this. I took a look at LDeakin's zarr_benchmarks, and I'd expect the changes in this PR to have no impact there: all the read tests appear to read complete shards, and this PR doesn't touch that path. If you're adding benchmarks, I might advocate for one that reads subsets of multiple shards, as I find that's a pretty common read pattern in practice.
Noted! I've started making some notes on my plans for benchmarking here. Feel free to comment! (BTW, please don't let my benchmarks hold up this PR! IMHO, the benchmarks that Alden has presented above are more than sufficient to demonstrate that this PR speeds up partial shard reads.)
(Just a quick note to say that I'm afraid I'm now planning to experiment with storing NWP data in Parquet (using Polars)... so I might not get round to benchmarking Zarr readers soon, I'm sorry...)
Any maintainers: I don't necessarily need a detailed review at this point -- if y'all let me know any open questions you have or give a thumbs up, I'll work on the remaining TODOs in the description.
This PR optimizes reading more than one, but not all, of the chunks in a shard. Towards #1758.
To be explicit, in my mind the optimizations in this have two goals:

1. Speed up partial shard reads, particularly against higher-latency stores, by issuing the `get` calls for the needed chunks concurrently.
2. Reduce the number of store requests (and the associated cost on object stores) by coalescing reads of nearby chunks.
Mostly these goals aren't in conflict. The maximum coalesced size option provides a knob to ensure we aren't coalescing requests into a single very large request that would be faster to make as multiple concurrent calls.
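To make the coalescing rule concrete, here's a hedged sketch of the grouping pass over sorted byte ranges (the names `max_gap` and `coalesce_max_bytes` are illustrative, not the PR's actual option names):

```python
def coalesce_ranges(ranges, max_gap, coalesce_max_bytes):
    """Group sorted (start, stop) byte ranges into coalesced reads.

    Adjacent ranges are merged when the gap between them is at most
    `max_gap` and the merged request stays under `coalesce_max_bytes`,
    so many small requests are combined without ever producing one
    huge request that would be faster as several concurrent calls.
    """
    groups = []
    for start, stop in sorted(ranges):
        if (
            groups
            and start - groups[-1][1] <= max_gap
            and stop - groups[-1][0] <= coalesce_max_bytes
        ):
            groups[-1][1] = max(groups[-1][1], stop)
        else:
            groups.append([start, stop])
    return [tuple(g) for g in groups]


# Example: ranges 0-100 and 120-200 merge (gap of 20 <= 64), while
# 1000-1100 is too far away and starts a new request.
print(coalesce_ranges([(0, 100), (120, 200), (1000, 1100)], 64, 10_000))
# [(0, 200), (1000, 1100)]
```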
TODO:

- [ ] New/modified features documented in `docs/user-guide/*.rst`
- [ ] Changes documented as a new file in `changes/`