Do not expose S3 credentials to JupyterHub users #47
Comments
Definitely an issue. Not sure how to solve this on this side; it might be something you have to ask on the Jupyter side, if there is a way to disable logging of specific variables in the config.
I'm thinking about using a proxy to connect to S3, and providing unique access tokens for each Hub user. The idea is that the proxy evaluates the token, the logged-in user, and the action they're trying to perform, and determines whether it's a valid action or not. It's important to mention that I'm using a user-based prefix strategy, where each user has their own namespace within the S3 bucket. The user should only have permission to read/write their namespace in the bucket. It might work, but it requires some implementation on top of your s3contents app; a rough sketch of the check is below.
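To make the idea concrete, a minimal sketch of the authorization check the proxy could run, assuming the user-based prefix strategy described above; the function name and action names are hypothetical:

```python
# Hypothetical authorization check inside the proxy: given the user
# resolved from the token and the S3 object key being accessed, allow
# only keys under that user's own prefix in the shared bucket.
def is_allowed(user: str, action: str, key: str) -> bool:
    """Allow read/write only inside the user's namespace, e.g. '<user>/...'."""
    if action not in {"read", "write"}:
        return False
    return key.startswith(f"{user}/")
```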
That makes sense. You can take a look at this issue that faces a similar problem: #45, but it doesn't solve the credentials issue. One solution might be to pass the credentials from a JupyterHub setting that the users have to input.
I wrote that issue. 😂
ROFL :)
There is always setting up an IAM role for the host. That does mean users have access from a server standpoint (they can fetch whatever is allowed by that role), but at least it doesn't expose the tokens directly in the config.
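For completeness, a minimal sketch of what that looks like, assuming the host has an instance role with access to the bucket. With no explicit keys configured, s3contents should fall back to the default AWS credential chain (worth verifying against the README); the bucket name is hypothetical:

```python
# jupyter_notebook_config.py
from s3contents import S3ContentsManager

c = get_config()
c.NotebookApp.contents_manager_class = S3ContentsManager
c.S3ContentsManager.bucket = "my-notebooks-bucket"  # hypothetical bucket name
# No access_key_id / secret_access_key on purpose: the underlying AWS
# client resolves credentials from the instance role via the default chain,
# so no token ever appears in this file.
```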
Yes, we thought about IAM roles as well. But, as you say, the correct way would be to have a different user in AWS for each user in your system, which is not possible if you have a ton of users. That's why we discarded that option.
@martinzugnoni Is this issue still a problem for you? I have a few different workarounds that I could type up if they would help you.

I have one block of JupyterHub config that drives AWS to dynamically create a new IAM User for each user on JupyterHub. It creates them at first login if they don't already exist; either way, it pulls the IAM User keys at each login (which does make it really hard to use the keys for anything else). It should work for up to 4,999 users (one IAM User is consumed by JupyterHub to do the work).

I also have a block that starts out the same but then generates temporary keys off of the IAM User keys, so those keys expire and become harmless, in exchange for a maximum time the user's session can run. Depending on the style chosen, the max time on the temp keys is 12 or 36 hours (min time 10 minutes).

The above could also be switched to use federated users, which allows a basically unlimited number of users but forces a 12-hour max on the keys.

Assuming you still have an issue, let me know which constraint is more important to you and I'll see if I can make some sample code. A rough sketch of the temporary-keys variant follows below.
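For illustration, a minimal sketch of the temporary-keys idea, assuming boto3 and the per-user prefix layout mentioned earlier in the thread. The function name and the inline policy are hypothetical, not my exact setup:

```python
# Mint expiring S3 credentials scoped to one user's prefix. The hub's own
# long-lived keys stay on the hub; only the temporary, scoped-down keys
# reach the user's notebook environment.
import json

import boto3

def make_temp_s3_keys(hub_username: str, bucket: str, duration_seconds: int = 12 * 3600):
    """Return expiring credentials limited to s3://<bucket>/<hub_username>/*."""
    policy = {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:PutObject", "s3:ListBucket"],
            "Resource": [
                f"arn:aws:s3:::{bucket}",
                f"arn:aws:s3:::{bucket}/{hub_username}/*",
            ],
        }],
    }
    sts = boto3.client("sts")  # authenticates with the hub's own credentials
    resp = sts.get_federation_token(
        Name=hub_username[:32],            # federated user names max out at 32 chars
        Policy=json.dumps(policy),
        DurationSeconds=duration_seconds,  # 900 s minimum, up to 36 h for federation tokens
    )
    # Contains AccessKeyId, SecretAccessKey, SessionToken, Expiration
    return resp["Credentials"]
```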
@martinzugnoni We have a different approach that may be useful for you or other people dealing with this problem. Our whole installation is based on OpenShift, but this should work at least in any Kubernetes environment, and probably in others as well.

We have a central datalake based on Ceph. Each user has their own S3 credentials and a set of buckets they have access to (instead of one bucket and prefixes). We store all user information (credentials) in a HashiCorp Vault instance. Users authenticate themselves on JupyterHub through OAuth using Keycloak. We store access information, basically the JWT access_token and refresh_token (encrypted), in the JupyterHub database (tokens are refreshed periodically).

When a user launches a notebook, we use the pre_spawn_start function from JupyterHub to connect to Vault using the access_token and retrieve the user's aws_key and secret. We have a special dynamic policy in Vault attached to the path where we store secrets (/some_path/user_id/secret) that allows each user to retrieve their secrets and only theirs (which is why we need a valid access token from Keycloak to enforce the policy). Then we simply inject the secrets as env vars in the notebook and use S3Contents to connect. No problem if the user sees them, as they are their own! A sketch of the hook follows below.
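As a rough illustration of the hook (not our exact code), a minimal sketch assuming hvac as the Vault client, a JWT auth method configured in Vault, and auth_state enabled in JupyterHub; the class, Vault URL, role, and secret path are all hypothetical:

```python
import hvac
from oauthenticator.generic import GenericOAuthenticator

class VaultAwareAuthenticator(GenericOAuthenticator):
    async def pre_spawn_start(self, user, spawner):
        """Fetch the user's S3 keys from Vault and inject them as env vars."""
        auth_state = await user.get_auth_state()
        if not auth_state:
            return  # auth_state not enabled, or the token has expired

        client = hvac.Client(url="https://vault.example.com")  # hypothetical URL
        # Log in to Vault with the user's own JWT, so the per-user Vault
        # policy decides what they are allowed to read.
        client.auth.jwt.jwt_login(role="jupyterhub", jwt=auth_state["access_token"])

        secret = client.secrets.kv.v2.read_secret_version(
            path=f"users/{user.name}/s3",  # hypothetical path layout
        )["data"]["data"]

        # These are the user's *own* credentials, so no harm if they see them.
        spawner.environment["AWS_ACCESS_KEY_ID"] = secret["aws_key"]
        spawner.environment["AWS_SECRET_ACCESS_KEY"] = secret["aws_secret"]
```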
@martinzugnoni Here is the article I published with more details on our implementation. |
Thanks @guimou for the Medium article. A question if you don't mind: in the section ####################### |
Is this still an issue? I've tried looking for the jupyter_notebook_config.py in my bare-metal (two-node) k3s cluster from the view of a regular user, and I can't seem to find it in
While configuring s3contents, we need to provide credentials to connect to the S3 service. We can do that either by writing them directly into the ~/.jupyter/jupyter_notebook_config.py config file, or by reading them from env variables. In either case, any logged-in user can read the config file or review the exposed env variables using a terminal session.
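For example, a config along the lines of the s3contents README would look roughly like this (bucket name hypothetical), with the keys sitting in plain text:

```python
# ~/.jupyter/jupyter_notebook_config.py
from s3contents import S3ContentsManager

c = get_config()
c.NotebookApp.contents_manager_class = S3ContentsManager
c.S3ContentsManager.access_key_id = "AKIA..."       # readable by any logged-in user
c.S3ContentsManager.secret_access_key = "..."       # likewise exposed
c.S3ContentsManager.bucket = "my-notebooks-bucket"  # hypothetical bucket name
```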
I can't find a way to securely connect to S3 without exposing the credentials to the JupyterHub users.
I'm using the dockerspawner, with the official jupyterhub/singleuser image. Any suggestions?

Thanks.