
Overhaul AiohttpStore #65

Draft
kylebarron wants to merge 3 commits into main from kyle/explore-1

@kylebarron
Member

Change list

AiohttpStore should not implement synchronous methods

As an async backend, AiohttpStore should not implement synchronous methods, because doing so obscures for downstream users which requests carry extra overhead and which don't.

AiohttpStore cannot implement GetAsync

I hit this and then had a 20 minute conversation about it with @vincentsarago and @geospatial-jeff 😅.

The problem is: object stores guarantee more information than the HTTP spec. If we declare obspec to be a specification modeling object stores, then we cannot satisfy the spec with generic HTTP responses.

The existing aiohttp implementation worked by materializing the HTTP response into a single buffer. That is not an option: GetAsync is intended to represent streaming HTTP requests. Downloading the entire response up front gives poor memory characteristics for any library that depends on AiohttpStore.

I tried to model an AiohttpGetResultAsync holding a response: aiohttp.ClientResponse. But HTTP responses do not always have a content length defined (for example, when the server uses chunked transfer encoding). This is in contrast to object stores, which guarantee the content length will always exist, even for streamed responses.

AiohttpStore cannot implement HeadAsync

For the same reason as above: since the HTTP content-length is not always defined, the store cannot guarantee the object size that HeadAsync is expected to return.

Alternatively, we could error at runtime

Because content-length is defined most of the time, but not always, it would be possible to simply raise an error at runtime when content-length is missing. This is the approach taken by object_store: apache/arrow-rs-object-store#340
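As a rough illustration of the "error at runtime" idea, here is a minimal sketch that reads the size from response headers and raises when it is absent. The names `MissingContentLengthError` and `size_from_headers` are hypothetical; they are not part of obspec, aiohttp, or object_store, and a real store would likely read the headers from an aiohttp response rather than a plain mapping.

```python
from collections.abc import Mapping


class MissingContentLengthError(ValueError):
    """Raised when a response does not report its size (hypothetical name)."""


def size_from_headers(headers: Mapping[str, str]) -> int:
    """Return the Content-Length as an int, or raise if the server omitted it."""
    # Plain dicts are case-sensitive, so normalize header names before the lookup.
    normalized = {name.lower(): value for name, value in headers.items()}
    raw = normalized.get("content-length")
    if raw is None:
        # Assumption: failing loudly is preferable to reporting an unknown size.
        raise MissingContentLengthError(
            "response has no Content-Length header; object size is unknown"
        )
    return int(raw)
```

With this shape, callers still see a plain `int` for the size, and the failure surfaces only for the servers that do not report a length.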

AiohttpStore implements GetRangeAsync and GetRangesAsync

Given the above, the AiohttpStore implements only the range-request functionality.
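For context, range requests map onto the HTTP `Range` header, so they do not depend on knowing the full object size. The sketch below assumes byte ranges are given as half-open `[start, end)` intervals; the helper names `range_header` and `range_headers` are illustrative only and are not taken from this PR.

```python
from collections.abc import Sequence


def range_header(start: int, end: int) -> dict[str, str]:
    """Build a Range header for the half-open byte interval [start, end)."""
    if start < 0 or end <= start:
        raise ValueError(f"invalid byte range: start={start}, end={end}")
    # HTTP byte ranges are inclusive on both ends, hence end - 1.
    return {"Range": f"bytes={start}-{end - 1}"}


def range_headers(starts: Sequence[int], ends: Sequence[int]) -> list[dict[str, str]]:
    """Build one Range header per (start, end) pair, for separate requests."""
    if len(starts) != len(ends):
        raise ValueError("starts and ends must have the same length")
    return [range_header(start, end) for start, end in zip(starts, ends)]
```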

@maxrjones
Member

@kylebarron as another alternative, could you broaden the type for size in ObjectMeta in obspec to int | None such that HTTP-based stores can also use the protocols?

@maxrjones
Member

Also here's an example of the current version in use, following up from your question today. VirtualiZarr uses reader = BlockStoreReader(store, path) to get a file-like object to pass to h5py.File(reader, mode="r")

@kylebarron
Member Author

> @kylebarron as another alternative, could you broaden the type for size in ObjectMeta in obspec to int | None such that HTTP-based stores can also use the protocols?

Of course, yes, that's possible to do. But the problem is that it starts to fundamentally break down our abstractions. Knowing the length of an object is obviously extremely valuable information. Right now that length is always known because object stores always know that. If we changed the type to int | None, then downstream applications can no longer rely on information about size, even though all object stores provide that information.

@kylebarron
Member Author

kylebarron commented Feb 10, 2026

> Also here's an example of the current version in use, following up from your question today. VirtualiZarr uses reader = BlockStoreReader(store, path) to get a file-like object to pass to h5py.File(reader, mode="r")

I think we should probably use synchronous HTTP clients whenever the calling code is synchronous, and async clients whenever it is asynchronous.

Creating a new aiohttp client session for every request incurs a lot of overhead and we shouldn't allow that.

For the AiohttpStore, I'd lean towards requiring the end user to pass in a valid aiohttp.ClientSession.

# Must run inside a coroutine; base_url and token are supplied by the caller.
async with aiohttp.ClientSession() as session:
    store = AiohttpStore(
        session,
        base_url,
        headers={"Authorization": f"Bearer {token}"},
    )
    registry = ObjectStoreRegistry({base_url: store})

Then

  1. The AiohttpStore doesn't have to manage any setup/teardown state.
  2. The registry itself doesn't need to define setup/teardown. I don't like how the current registry checks for __aenter__ and __aexit__ on each of the stores, because neither method is defined on any of our protocols.

Alternatively, end users who want to make synchronous requests can use a RequestsStore.

@maxrjones
Member

maxrjones commented Feb 10, 2026

> > @kylebarron as another alternative, could you broaden the type for size in ObjectMeta in obspec to int | None such that HTTP-based stores can also use the protocols?

> Of course, yes, that's possible to do. But the problem is that it starts to fundamentally break down our abstractions. Knowing the length of an object is obviously extremely valuable information. Right now that length is always known because object stores always know that. If we changed the type to int | None, then downstream applications can no longer rely on information about size, even though all object stores provide that information.

Downstream applications can raise an Error if size is None and continue to rely on it. It's a tradeoff between optimizing for object storage and allowing extensibility to other protocols. For my use case, which is out-of-region Earth data usage, the value of obspec-utils is largely in the extensibility. That is why I'd prefer the option that lets an HTTP-based store implement the Head protocol.

@kylebarron
Member Author

> Downstream applications can raise an Error if size is None and continue to rely on it.

I don't think that really works. Downstream applications wouldn't know when they can raise an error and when they can't, or which parts of obspec they can disregard.

I feel like if we're going to go the "raise an exception" route, it would be better to do it in the AiohttpStore itself.

@maxrjones
Member

> > Downstream applications can raise an Error if size is None and continue to rely on it.

> I don't think that really works. Downstream applications wouldn't know when they can raise an error and when they can't, or which parts of obspec they can disregard.

> I feel like if we're going to go the "raise an exception" route, it would be better to do it in the AiohttpStore itself.

Alright, it sounds like the "raise an exception" route is the best compromise, since it avoids entirely removing the HeadAsync and GetAsync operations from AiohttpStore. Separately, we'll implement a synchronous store based on requests.
