Mitigating Let's Encrypt Rate Limiting Issues #174

osterman · 2018-07-14T00:01:38Z

what

We're concerned about Let's Encrypt rate limiting issues. It's fair enough to switch our staging environment over to Let's Encrypt's staging environment, but I'm concerned about rate limits in production.

why

It basically means we could be blocked from making changes to our infrastructure if Let's Encrypt rate limits us again, so we need some solution to that. Naively, we could switch to a wildcard cert for *.example.net and just make sure all of the servers use a DNS name such as server-123-123.example.net.
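To make the naming constraint concrete, here is a minimal sketch of how a wildcard name is usually matched, assuming the common rule that `*` stands for exactly one left-most DNS label. The function name `wildcard_covers` is made up for illustration; it is not part of any library.

```python
def wildcard_covers(pattern: str, hostname: str) -> bool:
    """Return True if a certificate name such as *.example.net covers hostname.

    Assumption: "*" matches exactly one left-most DNS label, never several.
    """
    pattern_labels = pattern.lower().rstrip(".").split(".")
    host_labels = hostname.lower().rstrip(".").split(".")
    # A single-label wildcard can never change the number of labels.
    if len(pattern_labels) != len(host_labels):
        return False
    if pattern_labels[0] == "*":
        # The wildcard needs one non-empty label; the remaining labels must be equal.
        return bool(host_labels[0]) and pattern_labels[1:] == host_labels[1:]
    return pattern_labels == host_labels


print(wildcard_covers("*.example.net", "server-123-123.example.net"))
print(wildcard_covers("*.example.net", "ourapp.us-west-2.staging.example.net"))
```

Under that assumption, flat names like server-123-123.example.net are covered, while nested names like ourapp.us-west-2.staging.example.net would need their own wildcard or SAN entry.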

osterman · 2018-07-14T00:31:49Z

There are a few options.

option 1

Use an ACM certificate provisioned with Terraform and associated with the nginx-ingress load balancer.

https://github.com/cloudposse/terraform-aws-acm-request-certificate

Reference implementation here: https://github.com/cloudposse/terraform-root-modules/tree/master/aws/acm

Then set the ingress annotations to leverage this ACM certificate (e.g. SAN for *.ourapp.us-west-2.staging.example.net, ourapp.us-west-2.staging.example.net)

AWS Service annotations

service.beta.kubernetes.io/aws-load-balancer-ssl-cert (IAM or ACM ARN)
(via: https://gist.github.com/mgoodness/1a2926f3b02d8e8149c224d25cc57dc1)

These are passed to the Helm chart in the helmfile.yaml
https://github.com/cloudposse/geodesic/blob/master/rootfs/conf/kops/helmfile.yaml#L556-L557
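
As a rough illustration of option 1, the sketch below builds the values that would carry the ACM certificate ARN to the ingress controller's Service. Only the `aws-load-balancer-ssl-cert` annotation comes from this thread. The two companion annotations, the `controller.service.annotations` nesting, and the ARN are assumptions for illustration and should be checked against the chart and helmfile actually in use.

```python
import json


def ingress_service_values(cert_arn: str) -> dict:
    """Build Helm-style values for the ingress controller Service (sketch only)."""
    annotations = {
        # From the thread: IAM or ACM ARN of the certificate served by the ELB.
        "service.beta.kubernetes.io/aws-load-balancer-ssl-cert": cert_arn,
        # Assumption: TLS terminates at the ELB and plain HTTP goes to nginx.
        "service.beta.kubernetes.io/aws-load-balancer-backend-protocol": "http",
        # Assumption: only the Service port named "https" uses the certificate.
        "service.beta.kubernetes.io/aws-load-balancer-ssl-ports": "https",
    }
    # Assumption: the chart reads Service annotations from this key path.
    return {"controller": {"service": {"annotations": annotations}}}


# Placeholder ARN, not a real certificate.
example_arn = "arn:aws:acm:us-west-2:111111111111:certificate/EXAMPLE"
print(json.dumps(ingress_service_values(example_arn), indent=2))
```

The printed JSON is also valid YAML, so it can be compared directly against the `values` section of the helmfile.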

option 2

Use a different operational domain for production to reduce sharing across stages. E.g. treat example.net as a staging domain and example.co as the production operations domain. This is what another one of our customers does. They incidentally use ACM certs as well, but only because we started this journey before kube-lego existed.

other considerations

The likelihood of getting rate limited in production is small for a few reasons:

Very few new services are launched.
Namespaces are seldom, if ever, destroyed.
Certificates are long-lived, so requests to the Let's Encrypt API are few and far between. They can be renewed well before the 90-day expiry, so a rate limit would have to stay in effect for several days before a renewal ultimately failed or timed out.
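
The renewal argument in the last point can be checked with simple date arithmetic. This is a sketch only: the 90-day lifetime is from the point above, while the 30-day renew-ahead window and the issue date are hypothetical values chosen for illustration, not settings taken from this cluster.

```python
from datetime import date, timedelta

CERT_LIFETIME = timedelta(days=90)  # certificate validity mentioned above


def renewal_slack(issued_on: date, renew_before_expiry: timedelta):
    """Return (renew_on, expires_on, slack_days) for one certificate.

    slack_days is how long renewals could keep failing, for example because
    of a rate limit, before the certificate actually expires.
    """
    expires_on = issued_on + CERT_LIFETIME
    renew_on = expires_on - renew_before_expiry
    return renew_on, expires_on, (expires_on - renew_on).days


# Hypothetical inputs: issued on the day of this issue, renewed 30 days early.
renew_on, expires_on, slack_days = renewal_slack(date(2018, 7, 14), timedelta(days=30))
print(renew_on, expires_on, slack_days)
```

With those assumed inputs the first renewal attempt would fall on 2018-09-12 and expiry on 2018-10-12, leaving 30 days of retries before anything breaks.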

The reason you're at elevated risk in staging is the large number of publicly exposed services that result from running "unlimited staging environments". Moving staging to the Let's Encrypt staging environment reduces the risk of inducing rate limits in production. Using an entirely separate domain in production mitigates the impact even further.
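
One way to keep non-production stages off the production endpoint is to choose the ACME directory by stage. The sketch below assumes the ACME v2 directory URLs and a stage called `prod`; both the stage names and the helper `acme_directory` are hypothetical, and the ACME client in the cluster may expect a different API version or setting.

```python
# Let's Encrypt ACME v2 directory URLs (assumed API version for this sketch).
ACME_DIRECTORIES = {
    "production": "https://acme-v02.api.letsencrypt.org/directory",
    "staging": "https://acme-staging-v02.api.letsencrypt.org/directory",
}


def acme_directory(stage: str) -> str:
    """Pick the ACME directory for a deployment stage (stage names are hypothetical)."""
    # Anything that is not explicitly production defaults to the staging endpoint,
    # so a new or misnamed environment cannot consume production rate limits.
    environment = "production" if stage == "prod" else "staging"
    return ACME_DIRECTORIES[environment]


for stage in ("prod", "staging", "pr-123"):
    print(stage, acme_directory(stage))
```

Certificates issued by the staging endpoint are not trusted by browsers, which is acceptable for throwaway environments but is the reason production has to stay on the production endpoint.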

osterman added documentation customer labels Jul 14, 2018

osterman self-assigned this Jul 16, 2018
