[feature request] SFTPStore #336

danielgafni opened this issue Mar 10, 2025 · 12 comments

@danielgafni

It would be great if obstore had an interface for SFTP.

@kylebarron
Member

kylebarron commented Mar 10, 2025

This is unlikely to happen natively unless the Rust object_store library adds support for it (which I don't think is likely).

However, this is why I've started thinking about exposing the obstore API as its own specification (#330). A third party could then write its own Python implementation that conforms to the same API that obstore uses.

@danielgafni
Author

> This is unlikely to happen natively unless the Rust object_store library adds support for it

Alright, I can move this issue there.

> which I don't think is likely

Could you elaborate on that? It seems like quite a common use case.

> thinking about exposing the obstore API as its own specification (#330)

I think this is going to be useful!

@kylebarron
Member

> > which I don't think is likely
>
> Could you elaborate on that? It seems like quite a common use case.

I expect they'll say that the primary storage backends for the DataFusion, Parquet, and Arrow communities are commodity object storage solutions, specifically S3, GCS, and Azure. I doubt they'll want to take on the maintenance burden of an SFTP implementation as well, but I don't know for sure 🤷‍♂️.

You could also create your own Rust crate that implements the ObjectStore trait on top of an SFTP connection. You could then export your own Python package that implements obspec (#330), and it would work exactly the same as obstore.

https://github.com/Eugeny/russh looks like a great pure-Rust library for working with SSH and SFTP. Might be a good place to look if you wanted to implement your own Rust SFTP ObjectStore backend.

Or, you could implement an SFTP backend in pure Python, as long as you find an async SFTP client in Python.
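If only a blocking SFTP client is available, one possible fallback is to push each blocking call onto a worker thread so the store still exposes an async method. This is an assumption, not something proposed in this thread. In the minimal sketch below, `BlockingStore`, `AsyncAdapter`, and `get_async` are hypothetical names, and a dictionary stands in for the remote server.

```python
import asyncio


class BlockingStore:
    """Hypothetical stand-in for a store built on a blocking SFTP client."""

    def __init__(self, objects: dict[str, bytes]) -> None:
        self._objects = objects

    def get(self, path: str) -> bytes:
        # A real backend would perform a blocking network read here.
        return self._objects[path]


class AsyncAdapter:
    """Hypothetical wrapper that exposes a blocking store through async methods."""

    def __init__(self, inner: BlockingStore) -> None:
        self._inner = inner

    async def get_async(self, path: str) -> bytes:
        # Run the blocking call in a worker thread so the event loop stays free.
        return await asyncio.to_thread(self._inner.get, path)


async def main() -> bytes:
    store = AsyncAdapter(BlockingStore({"a.txt": b"hello"}))
    return await store.get_async("a.txt")


if __name__ == "__main__":
    print(asyncio.run(main()))
```

Each call occupies a thread while it waits, so a natively async client would likely scale better under many concurrent requests.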

@kylebarron
Member

It doesn't look like russh supports range requests for file reads, though, which could be problematic: https://docs.rs/russh-sftp/2.0.8/russh_sftp/client/struct.SftpSession.html#method.read

@kylebarron
Member

It looks like paramiko supports range requests: https://docs.paramiko.org/en/stable/api/sftp.html#paramiko.sftp_file.SFTPFile.read
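A range request against a file-like object amounts to a seek followed by a bounded read. The sketch below works with any seekable binary file; `read_range` and `sftp_read_range` are hypothetical helper names, and `sftp_read_range` assumes its `sftp` argument is an already connected `paramiko.SFTPClient`. The demonstration uses an in-memory file, so no server is needed.

```python
import io
from typing import BinaryIO


def read_range(f: BinaryIO, start: int, end: int) -> bytes:
    """Return the bytes in the half-open range [start, end) of a seekable file."""
    if start < 0 or end < start:
        raise ValueError("expected 0 <= start <= end")
    f.seek(start)
    remaining = end - start
    chunks = []
    while remaining > 0:
        chunk = f.read(remaining)  # may return fewer bytes than requested
        if not chunk:  # reached end of file
            break
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)


def sftp_read_range(sftp, path: str, start: int, end: int) -> bytes:
    """Hypothetical helper; `sftp` is assumed to be a connected paramiko.SFTPClient."""
    with sftp.open(path, "rb") as f:
        return read_range(f, start, end)


# Local demonstration with an in-memory file instead of a real SFTP connection.
demo = io.BytesIO(b"0123456789")
print(read_range(demo, 2, 5))  # b'234'
```

A range that runs past the end of the file returns only the bytes that exist, rather than raising.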

@ion-elgreco

@kylebarron you could perhaps use opendal, which has SFTP support, and then use https://docs.rs/object_store_opendal/0.50.0/object_store_opendal/struct.OpendalStore.html for the ObjectStore wrapper

@danielgafni
Author

Wow that project looks amazing

@kylebarron
Member

Sorry, I have no interest in taking on the maintenance of SFTP in this library. But I am fully supportive of you making an API-compatible "obspec-sftp" library, which would be interchangeable with obstore. Indeed, this is the goal of obspec: a compatible ecosystem that I don't have to maintain all by myself.

Do either of you have any feedback on obspec so far?

Also note that opendal have their own Python binding: https://opendal.apache.org/docs/python/opendal.html

@danielgafni
Author

I like the idea of obspec.

May I ask why currently all these functions (I assume they are supposed to be standalone functions like in obstore) do not take a store argument?

@kylebarron
Member

> May I ask why currently all these functions (I assume they are supposed to be standalone functions like in obstore) do not take a store argument?

There are two ways to handle subtyping in Python: nominal and structural. Neither really supports pure functions. You can use protocols with pure functions, but then the protocol has to be defined with a single __call__ method, like how the credential providers are typed:

@staticmethod
def __call__() -> S3Credential | Coroutine[Any, Any, S3Credential]:
    """Return an `S3Credential`."""

You can't use structural subtyping based on the name of a pure function.
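To illustrate the point, here is a minimal sketch of a callback protocol. `CredentialProvider`, `static_credentials`, and `connect` are hypothetical names, not obstore APIs. A type checker accepts the plain function because its call signature matches `__call__`, not because of what it is named.

```python
from typing import Protocol


class CredentialProvider(Protocol):
    """Hypothetical callback protocol: anything callable with this signature."""

    def __call__(self) -> dict[str, str]: ...


def static_credentials() -> dict[str, str]:
    # Placeholder values for illustration only.
    return {"access_key_id": "example", "secret_access_key": "example"}


def connect(provider: CredentialProvider) -> dict[str, str]:
    # The function name never appears in the protocol; only the signature matters.
    return provider()


print(connect(static_credentials)["access_key_id"])  # example
```

Renaming `static_credentials` would change nothing for the type checker, which is why a protocol cannot pick out a specific top-level function such as `get` or `list` by name.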

So in order to handle an object store with multiple kinds of operations (get, list, etc.), the protocols need to be defined in terms of methods. That's why all of the protocols you see in obspec take self as a parameter.

And then in obstore we now have a wrapper to expose class methods in addition to top-level pure functions.
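A minimal sketch of method-based structural subtyping follows. `SupportsGet`, `SupportsList`, `InMemoryStore`, and `fetch` are hypothetical names, not the actual obspec protocols. The store class inherits from neither protocol and conforms only because it has methods with matching names and signatures.

```python
from collections.abc import Iterable
from typing import Protocol, runtime_checkable


@runtime_checkable
class SupportsGet(Protocol):
    def get(self, path: str) -> bytes: ...


@runtime_checkable
class SupportsList(Protocol):
    def list(self, prefix: str) -> Iterable[str]: ...


class InMemoryStore:
    """Does not inherit from either protocol; it only has matching methods."""

    def __init__(self, objects: dict[str, bytes]) -> None:
        self._objects = objects

    def get(self, path: str) -> bytes:
        return self._objects[path]

    def list(self, prefix: str) -> Iterable[str]:
        return sorted(k for k in self._objects if k.startswith(prefix))


def fetch(store: SupportsGet, path: str) -> bytes:
    # Accepts any object with a compatible `get` method.
    return store.get(path)


store = InMemoryStore({"a/1": b"x", "a/2": b"y", "b/1": b"z"})
print(fetch(store, "a/1"))              # b'x'
print(isinstance(store, SupportsList))  # True
```

Keeping one small protocol per operation lets a caller require only what it uses, so a read-only backend could satisfy `SupportsGet` without implementing anything else.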

@danielgafni
Author

Uhh I've missed that PR. Everything makes complete sense now, thanks!

@kylebarron
Member

In #346 I ran into a core problem with obspec-obstore: it's really hard to ensure that any raised exceptions come from (or are subclassed from) obspec. I think it's probably crucial for obspec to have common exceptions, so that callers of obspec can rely on a known set of exceptions to catch.
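One way a shared exception hierarchy could look is sketched below. The names `StoreError`, `NotFoundError`, `DictStore`, and `get_or_default` are hypothetical; obspec does not define them in this thread. The idea is that each backend translates its own errors into the shared classes, so callers can catch one known type regardless of the implementation.

```python
class StoreError(Exception):
    """Hypothetical base class for every error a conforming store may raise."""


class NotFoundError(StoreError):
    """Hypothetical error: the requested path does not exist."""


class DictStore:
    """Toy backend whose native failure is a KeyError."""

    def __init__(self, objects: dict[str, bytes]) -> None:
        self._objects = objects

    def get(self, path: str) -> bytes:
        try:
            return self._objects[path]
        except KeyError as exc:
            # Translate the backend-specific error into the shared hierarchy.
            raise NotFoundError(path) from exc


def get_or_default(store, path: str, default: bytes) -> bytes:
    # Caller code depends only on the shared exception, not on the backend.
    try:
        return store.get(path)
    except NotFoundError:
        return default


store = DictStore({"present": b"data"})
print(get_or_default(store, "missing", b"fallback"))  # b'fallback'
```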

I think to keep things moving, it's best to release a new version of obstore without obspec, and then we can come back to obspec after release.
