OAuth proxy for SSO to eliminate basic auth #613

Open
stephen-soltesz opened this issue Jan 16, 2020 · 17 comments

Comments

@stephen-soltesz
Contributor

Discussed in the 2020-01-15 Monitoring eng meeting: basic auth for prometheus and related services has reached its limit. It should be possible to deploy an OAuth proxy that supports SSO.

Some pointers (not guaranteed to be helpful, but evidence that folks are doing this kind of thing):

@robertodauria
Contributor

Thank you @stephen-soltesz for researching this!

I'm adding another one to the list. This solution is basically the same as your first link (oauth2_proxy) but it lets the NGINX Ingress Controller (which we already use for TLS) create the nginx configuration automatically by adding a couple of annotations:

@nkinkade
Contributor

nkinkade commented Jan 8, 2021

@robertodauria: I have lost state on the general work you were doing for this issue. My general understanding is that you hit an impasse which, at the time, seemed insurmountable. Is this correct? I notice that Prometheus 2.23.0 makes the React UI the default one. There is likely a way to revert to the "Classic" UI, but I haven't tried to update Prometheus to find out whether this will be a blocker for updating Prometheus.

Would you update this issue with some details on where this stands and the block you hit? You have told me via VC and/or Slack what the block was, but apparently it didn't stick. I want to see if we can find a way to make this work to clear the way for upgrading Prometheus (and to just generally have a better auth flow).

Issue #595 is related/blocked by this issue.

@nkinkade
Contributor

nkinkade commented Jan 8, 2021

Is this the complete set of changes you arrived at before hitting a wall?

https://github.com/m-lab/prometheus-support/compare/sandbox-roberto-oauth-proxy

@robertodauria
Contributor

@nkinkade Yes. I have proposed a VC to discuss the status and next steps, please let me know if that works or you prefer rescheduling.

@stephen-soltesz
Contributor Author

I'm excited to see this progressing. 👍

@nkinkade
Contributor

nkinkade commented Jan 13, 2021

Notes from a meeting with @robertodauria this morning regarding blockers for this issue:

Problem 1

  • Grafana was not working with the oauth2_proxy
  • something to do with Grafana not sending along the JWT auth token in the HTTP headers

Problem 2

  • How do we get all the other services that access Prometheus (mlab-ns, rebot, etc) to also support this new OAuth auth mechanism?
  • Can our nginx ingress support both HTTP Basic auth and OAuth, such that we can migrate Grafana while still leaving other services using HTTP Basic auth, until we can migrate them?

Problem 3

  • need oauth_proxy in both the platform-cluster and prometheus-federation

Problem 4

  • prometheus-federation scrapes the platform-cluster instance using http basic auth, so prometheus-federation is another consumer.

Services that access prometheus that would need to support oauth:

  • mlab-ns
  • rebot
  • grafana
  • prometheus-federation (consumes platform-cluster through federation using basic auth)

@nkinkade
Contributor

It appears that Prometheus now (as of the latest version, 2.24.0) natively supports TLS and HTTP basic auth:

https://prometheus.io/docs/prometheus/latest/configuration/https/
https://inuits.eu/blog/prometheus-server-tls/

I will experiment with this next week. It's unclear why basic auth fails with the new, default React UI behind the nginx ingress yet would work with the new native support, but I may be missing something.

This would not take us to the next level for authentication or authorization, but could possibly be a workaround to start using the latest UI and newer versions of Prometheus.

@nkinkade
Contributor

nkinkade commented Feb 1, 2021

@robertodauria, @stephen-soltesz: now that issue #595 is resolved and closed, no longer blocking us from safely updating Prometheus to the latest versions, I propose that we either close or backlog this issue due to the difficulty of making OAuth work for all possible consumers of Prometheus data in both clusters (platform and prometheus-federation). Do either of you have an opinion on this?

@stephen-soltesz
Contributor Author

An SSO-like solution would greatly simplify the team's access to these services without sacrificing operational security. The basic auth solution was only a little better than nothing, and it adds friction every time I need to open prometheus directly. The experience is more halting the less familiar one is with these systems. So, I fear it will discourage new team members from using and contributing to the monitoring system as a well curated whole. This could be one of the "broken windows" that make it easier to rationalize the next partial or "hacky" approach. See: https://en.wikipedia.org/wiki/Broken_windows_theory

I would like a better picture of what we would have to change in order to use the oauth proxy. Or, phrased differently, if we were starting from scratch, how would we organize the pieces of the system to work the way we want?

For example, once we're able to retire mlab-ns's usage (admittedly an indefinite period in the future), then rebot and grafana should be able to access prometheus directly over the private GKE network (right?), and then the question is whether a GKE network could communicate privately with the platform cluster or not.

@nkinkade
Contributor

nkinkade commented Feb 1, 2021

In my mind, at the moment, the question isn't whether using OAuth would be a benefit or not; it clearly would be. The problem lies in migrating two clusters to that authentication mechanism, along with any services that need access to Prometheus metrics in either cluster, and sometimes both at the same time. And this from a system (Prometheus) that does not support OAuth logins, requiring additional proxying services in the cluster or at the edge, adding possibly a non-trivial amount of complex technical overlay to the overall system.

My recollection isn't that we implemented HTTP basic authentication as a solution that was "only a little better than nothing". Indeed, my recollection is that early on we had no authentication at all, and didn't consider it any sort of major shortcoming, other than the possibility of bots or malicious people swamping our Prometheus instances with expensive queries. Or in a less likely scenario someone leveraging near real-time telemetry to attempt to compromise the overall health of the system more effectively. I don't believe we felt that "security", as such, was the major consideration, but more we wanted to just put up some basic barrier to prevent flagrant abuse, either unintentional or intentional, and HTTP basic auth provided that pretty well without the need to do much else (other than use the nginx proxy, which today isn't even necessary any longer).

That aside, probably the biggest blocker right now is our "federated" scraping. We scrape the platform cluster from the prometheus-federation cluster using basic auth, which is one of just a couple auth mechanisms Prometheus even supports for scraping, the other being a bearer token. Possibly we could obviate the need for scraping the platform cluster at all if we were to migrate platform cluster alerting to the platform cluster?
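
To make the two scrape auth mechanisms concrete, here is a minimal Python sketch of a federation request carrying either basic auth or a bearer token. It is an illustration only, not the configuration used in this repository: the `federate_request` helper, the host name, and the matcher are hypothetical, while `/federate` with repeated `match[]` parameters is Prometheus's federation endpoint.

```python
import base64
from urllib.parse import urlencode
from urllib.request import Request


def federate_request(base_url, matchers, username=None, password=None, token=None):
    """Build (but do not send) a request to a Prometheus /federate endpoint.

    Hypothetical helper: exactly one of (username, password) or token is
    expected; with neither, the request carries no credentials.
    """
    # Each series selector is passed as a repeated match[] parameter.
    query = urlencode([("match[]", m) for m in matchers])
    req = Request(f"{base_url}/federate?{query}")
    if token is not None:
        # Bearer token: the other scrape auth mechanism mentioned above.
        req.add_header("Authorization", f"Bearer {token}")
    elif username is not None:
        # HTTP basic auth: base64 of "user:password".
        raw = f"{username}:{password}".encode("utf-8")
        req.add_header("Authorization", "Basic " + base64.b64encode(raw).decode("ascii"))
    return req


# Example host and credentials are placeholders, not real values.
req = federate_request(
    "https://prometheus.example.net", ['{job="node"}'], username="u", password="p"
)
print(req.full_url)
print(req.get_header("Authorization"))
```

Neither mechanism involves a browser redirect, which is why an OAuth login flow does not drop in as a replacement for scrape jobs.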

Then we have, as you mentioned, mlab-ns. I'm sure that there is some python module that would allow us to use OAuth there, but how much effort do we want to put into any engineering work on the mlab-ns code base?

As it stands, all of these components already natively support HTTP basic authentication:

  • User browsers
  • Prometheus (for scrape jobs)
  • Prometheus (for clients to access scraped metrics e.g., Grafana, though we aren't leveraging this now)
  • Grafana
  • nginx (ingress controller in both clusters)
  • urllib2 (used by mlab-ns to authenticate to nginx ingresses)
  • Golang http package (used by rebot to query Prometheus)
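
As an illustration of what "native support" looks like on the client side, here is a minimal sketch using Python 3's `urllib.request`, the successor to the `urllib2` module named above. The `basic_auth_opener` helper, the host name, and the credentials are hypothetical, and this is not the mlab-ns code.

```python
from urllib.request import (
    HTTPBasicAuthHandler,
    HTTPPasswordMgrWithDefaultRealm,
    build_opener,
)


def basic_auth_opener(base_url, username, password):
    """Return (opener, manager) that answer basic auth challenges for base_url.

    Hypothetical helper. The handler sends credentials only after the server
    replies with a 401 challenge; the manager is returned for inspection.
    """
    manager = HTTPPasswordMgrWithDefaultRealm()
    # realm=None means "use these credentials for any realm under this URL".
    manager.add_password(None, base_url, username, password)
    return build_opener(HTTPBasicAuthHandler(manager)), manager


# Placeholder host and credentials; nothing is sent until opener.open() is called.
opener, manager = basic_auth_opener("https://prometheus.example.net/", "u", "p")
# opener.open("https://prometheus.example.net/api/v1/query?query=up")
```

The standard library alone is enough here, which is the point of the list above: none of these clients needs an extra dependency to speak basic auth.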

I'm curious where you currently find the major "friction" in using Prometheus with HTTP basic authentication? For me, there used to be some friction in constantly needing to open some "AAA Prometheus Links" dashboard in Grafana, but a year or two ago I simply added every link to my startup pages to "prime" my browser session to already be authenticated to all clusters and in all projects. And as of today, I have even further eliminated all friction by installing the Chrome extension "Multipass", which lets me use a regex to match sites with some stored basic auth credentials. Granted, this only works in my local browser, but then again I can't think of a time I really needed to access Prometheus in some way other than through my browser.

@robertodauria
Contributor

> For example, once we're able to retire mlab-ns's usage (admittedly an indefinite period in the future), then rebot and grafana should be able to access prometheus directly over the private GKE network (right?), and then the question is whether a GKE network could communicate privately with the platform cluster or not.

There was a proposal to make Grafana contact Prometheus over private networks only, but that would mean creating inter-project networks and I thought you said it's something we want to avoid. Sandbox Grafana today can access staging/production Prometheus with HTTP basic auth, even if they are in separate GCP projects, which I think is a desirable behavior and something we want to keep.

The last time I spent a few days trying to make this work, I could log into Prometheus with my @measurementlab.net account with oauth-proxy, but then the proxy didn't like the way Grafana passed the OAuth token. Specifically, the oauth-proxy logs mentioned it could not find a valid token in the request, even after enabling "Forward OAuth Identity" and/or "With credentials" in the Data Source configuration, and I could not find a way to work around that.

A possible next step to figure out how the components interact (and if the problem is Grafana's handling of the OAuth token) could be writing a small PoC which we deploy behind nginx/oauth-proxy with an endpoint that sends an authenticated request on behalf of the currently logged in user to another service with OAuth authentication enabled. If that works, it points towards an issue with how Grafana sends the token along with the request rather than something wrong in oauth-proxy's configuration.
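
The core of such a PoC could be sketched roughly as below. This is a hypothetical outline, not code that was written for this issue: the `build_downstream_request` helper, the choice of `Authorization` and `Cookie` as the credential-carrying headers, and the example URL are all assumptions.

```python
from urllib.request import Request

# Headers that might carry the logged-in user's credentials (an assumption).
CREDENTIAL_HEADERS = ("Authorization", "Cookie")


def build_downstream_request(incoming_headers, downstream_url):
    """Copy credential headers from an incoming request onto a new one.

    Returns the unsent request plus the names of the credential headers that
    were actually present, to show what the downstream proxy would receive.
    """
    # HTTP header names are case-insensitive, so normalize before lookup.
    normalized = {name.lower(): value for name, value in incoming_headers.items()}
    req = Request(downstream_url)
    forwarded = []
    for name in CREDENTIAL_HEADERS:
        value = normalized.get(name.lower())
        if value:
            req.add_header(name, value)
            forwarded.append(name)
    return req, forwarded


# Placeholder headers and URL for illustration.
req, forwarded = build_downstream_request(
    {"cookie": "_oauth2_proxy=abc", "Accept": "*/*"},
    "https://service.example.net/api",
)
print(forwarded)
```

If `forwarded` comes back empty for a logged-in user, the credentials are being dropped before they reach the endpoint, which would narrow down where the token goes missing.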

I agree having an SSO for the components of our infrastructure to seamlessly work with a single user authentication would be nice, even if perhaps I don't see it as fundamental - HTTP basic auth is a bit inconvenient and doesn't allow managing users but it works. I'm not convinced there is something wrong with how the pieces of our infrastructure are currently organized, but rather an issue with Grafana or oauth-proxy (or how I had configured them at the time) that we need to figure out before we can make progress on this.

@nkinkade
Contributor

nkinkade commented Feb 2, 2021

As far as I can tell, HTTP basic auth is actually working quite well, is natively supported by all our tools, and suits our needs, minus a single use case: operators accessing the Prometheus Expression Browser in their local browsers. Is this correct? If so, I've found that a simple browser extension eliminates issues for this use case. The more I think about it, OAuth feels more like a protocol designed to be interacted with directly by a person in their browser (fitting this use case I mention), but not so much as an authentication mechanism for non-humans (mlab-ns, rebot, Grafana accessing the backend datastore, etc.).

@robertodauria
Contributor

robertodauria commented Feb 2, 2021

Regardless of whether we are going to implement this or not, last night I managed to get sandbox Grafana configured and correctly passing the authentication to a Prometheus data source in "Server" mode, meaning that it's the backend that connects to Prometheus and not the user's browser -- all of our data sources use server mode already.

Changes I made:

  • Grafana .ini:
    • Disable internal OAuth login with Google
    • Enable automatic OAuth sign-on
    • Specify which header to use to get the username from oauth2_proxy (X-Email)
    • Enable automatic sign-up for users coming from OAuth, with default role Viewer
  • Ingress configuration
    • Set up an oauth2 ingress for each subdomain, pointing to the same oauth2_proxy service
      • prometheus.mlab-sandbox.measurementlab.net/oauth2
      • grafana.mlab-sandbox.measurementlab.net/oauth2
    • Add the following obscure-looking snippet (to the grafana and prometheus ingresses, NOT to the oauth2_proxy ingresses), which puts the email associated with the OAuth authentication in a header:
    nginx.ingress.kubernetes.io/configuration-snippet: |
          auth_request_set $user   $upstream_http_x_auth_request_user;
          auth_request_set $email  $upstream_http_x_auth_request_email;
          proxy_set_header X-User  $user;
          proxy_set_header X-Email $email;
    
  • Data source configuration
    • Set to forward the _oauth2_proxy cookie - no other options needed, no "Forward OAuth" nor "With Credentials"
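
Conceptually, the cookie forwarding in the last step amounts to passing along only the proxy's session cookie from the user's request. The Python sketch below illustrates that idea and is not Grafana's implementation: `keep_cookies` is a hypothetical helper, and the other cookie names in the example are made up.

```python
from http.cookies import SimpleCookie


def keep_cookies(cookie_header, names=("_oauth2_proxy",)):
    """Return a Cookie header value containing only the named cookies.

    Hypothetical helper: every other cookie in the incoming header is dropped.
    """
    jar = SimpleCookie()
    jar.load(cookie_header)
    kept = [f"{name}={jar[name].value}" for name in names if name in jar]
    return "; ".join(kept)


# Only the proxy session cookie survives; the other names are placeholders.
print(keep_cookies("grafana_session=xyz; _oauth2_proxy=abc123; theme=dark"))
```

Forwarding just this one cookie lets the proxy in front of the data source recognize the same session without exposing the user's other cookies to it.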

This, however, doesn't solve the problem of cross-project authentication (e.g. sandbox grafana connecting to staging/prod prometheus). We should either have one instance of oauth2_proxy that's shared across the projects, or multiple instances of oauth2_proxy with common session storage (I see Redis is an option).

This is likely the only place on the Internet where all of this is documented: https://stackoverflow.com/questions/62559654/grafana-oauth-proxy-still-displaying-native-login-form

@nkinkade
Contributor

nkinkade commented Feb 2, 2021

@robertodauria: This is great! Thank you for doing this. Would you be able to finish setting this up such that interactive users (OAuth) access URLs like:

https://prometheus.mlab-.measurementlab.net
https://prometheus-platform-cluster.mlab-.measurementlab.net

... and automation (basic auth... mlab-ns, rebot, etc.) use URLs like the following (path doesn't matter, just something logical):

https://prometheus.mlab-.measurementlab.net/basicauth
https://prometheus-platform-cluster.mlab-.measurementlab.net/basicauth

This will require us to manually modify automated services to use the new URL, but will keep things more simple for humans.

@nkinkade
Contributor

nkinkade commented Feb 9, 2021

@robertodauria: Do you consider this issue resolved now? If so, would you close this issue?

@robertodauria
Contributor

Not yet. We've made some progress, but most of the issues you outlined at #613 (comment) are still outstanding - for example, all the clients have to be updated to use the -basicauth URL, and the platform-cluster Prometheus isn't using OAuth yet.

@stephen-soltesz
Contributor Author

#795 completed support for oauth or basicauth on prometheus federation and platform-cluster prometheus (including datasources in Grafana).

What do you all think about making this the standard configuration for prometheus in the data-processing cluster as well as for alertmanager? Is there some configuration that would make it easier to make this the default?
