10 changes: 6 additions & 4 deletions cluster-init/Makefile
@@ -2,23 +2,25 @@ overlay := non-prod
 
 .PHONY: default
 default: ## Deploy cluster management tools
-	cd ../kube-state-metrics/ && make
 	cd ../sealed-secrets && make overlay=$(overlay)
 	cd ../cert-manager && make overlay=$(overlay)
 	cd ../istio && make overlay=$(overlay)
-	cd ../observability/ && make
+	cd ../observability/kube-state-metrics/ && make
+	cd ../observability/prometheus && make
+	cd ../observability/oms-agent && make
 	cd ../egress && make overlay=$(overlay)
 	cd ../argo-cd && make overlay=$(overlay)
 
 .PHONY: delete
 delete: ## Remove cluster management tools
 	cd ../argo-cd && make delete overlay=$(overlay) || true
 	cd ../egress && make delete overlay=$(overlay) || true
-	cd ../observability/ && make delete || true
+	cd ../observability/oms-agent && make delete || true
+	cd ../observability/prometheus && make delete || true
+	cd ../observability/kube-state-metrics/ && make delete || true
 	cd ../istio && make delete overlay=$(overlay) || true
 	cd ../cert-manager && make delete overlay=$(overlay) || true
 	cd ../sealed-secrets && make delete overlay=$(overlay) || true
-	cd ../kube-state-metrics/ && make delete || true
 
 .PHONY: help
 help: ## Display this help screen
11 changes: 11 additions & 0 deletions istio/Makefile
@@ -37,6 +37,17 @@ init1: ## Install SSL certs and Istio profile
 init2: ## Install custom manifests
 	kustomize build overlays-2/$(overlay) | kubectl apply -f -
 
+.PHONY: restart_proxies
+restart_proxies: ## Restarts all istio dataplane proxies, can be used when rolling out upgrade
+	kubectl rollout restart deployment/argocd-application-controller -n argocd
+	kubectl rollout restart deployment/argocd-dex-server -n argocd
+	kubectl rollout restart deployment/argocd-redis -n argocd
+	kubectl rollout restart deployment/argocd-repo-server -n argocd
+	kubectl rollout restart deployment/argocd-server -n argocd
+	kubectl rollout restart deployment/doc-index-updater -n doc-index-updater
+	kubectl rollout restart deployment/medicines-api -n medicines-api
+	cd ../observability/prometheus && make

Contributor Author:

Small point, wondering if this line is valuable. Probs won't restart prometheus pods unless there's an update. Maybe we should update the readme to do the make instead? What do you think?

Contributor:

This command will be used when istio has been upgraded, in which case I think this line will cause the prometheus pod to restart due to istioctl kube-inject, which is what we want - is that right?

Contributor Author:

Yep, that's true. 👍


 .PHONY: delete
 delete: ## Remove Istio
 	kubectl delete istiooperators.install.istio.io -n istio-system istiocontrolplane --ignore-not-found || true
33 changes: 0 additions & 33 deletions istio/init-1/profile.yaml
@@ -7,15 +7,6 @@ spec:
 meshConfig:
 outboundTrafficPolicy:
 mode: REGISTRY_ONLY
-addonComponents:
-kiali:
-enabled: true
-grafana:
-enabled: false
-prometheus:
-enabled: true
-tracing:
-enabled: true
 components:
 pilot:
 enabled: true
@@ -31,24 +22,6 @@ spec:
 patches:
 - path: spec.minReplicas
 value: 2
-telemetry:
-enabled: true
-k8s:
-resources:
-requests:
-cpu: "200m"
-memory: "500M"
-overlays:
-- kind: Deployment
-name: istio-telemetry
-patches:
-- path: spec.replicas
-value: 2
-- kind: HorizontalPodAutoscaler
-name: istio-telemetry
-patches:
-- path: spec.minReplicas
-value: 2
 ingressGateways:
 - name: istio-ingressgateway
 enabled: true
@@ -58,9 +31,3 @@ spec:
 values:
 sidecarInjectorWebhook:
 rewriteAppHTTPProbe: true
-telemetry:
-enabled: true
-v1:
-enabled: false
-v2:
-enabled: true
15 changes: 0 additions & 15 deletions kube-state-metrics/base/cluster-role-binding.yaml

This file was deleted.

117 changes: 0 additions & 117 deletions kube-state-metrics/base/cluster-role.yaml

This file was deleted.

43 changes: 0 additions & 43 deletions kube-state-metrics/base/deployment.yaml

This file was deleted.

8 changes: 0 additions & 8 deletions kube-state-metrics/base/kustomization.yaml

This file was deleted.

7 changes: 0 additions & 7 deletions kube-state-metrics/base/service-account.yaml

This file was deleted.

18 changes: 0 additions & 18 deletions kube-state-metrics/base/service.yaml

This file was deleted.

63 changes: 63 additions & 0 deletions observability/README.md
@@ -0,0 +1,63 @@
# Monitoring

## AKS

Azure Kubernetes Service (AKS) provides good high-level monitoring of the cluster, such as the CPU and memory usage of each node. To view this, find the cluster in the Azure Portal and click on the "Insights" tab.

## Custom dashboards

We have custom dashboards for the doc-index-updater that can be found by searching for "Shared Dashboards" in the Azure Portal.

They are set up in the following way:

- [Prometheus](https://prometheus.io/) scrapes metrics from different pods in the cluster (such as [Istio](https://istio.io/) and [kube-state-metrics](https://github.com/kubernetes/kube-state-metrics#overview)).
- [Azure's OMS agent](https://docs.microsoft.com/en-us/azure/azure-monitor/platform/log-analytics-agent) scrapes this data and adds it to the log analytics workspace for the cluster.
- The Azure Monitor dashboard runs queries against the log analytics workspace and plots the results.

### Prometheus

Prometheus is no longer installed by Istio, so we have a set of [manifests](./prometheus) for that.

There are two parts to the [config](./prometheus/overlay/prometheus-cm.yaml):

- `prometheus.yml` specifies what pods to scrape and other general settings
- `prometheus.rules.yml` specifies some [Prometheus rules](https://prometheus.io/docs/prometheus/latest/configuration/recording_rules/). Each rule is a query that Prometheus runs regularly, storing the result as a new metric. These are the metrics we export to Azure Monitor, by setting the `azure_monitor: true` label on each rule (see the Azure OMS agent section below).

Prometheus [stores its data locally on disk](https://prometheus.io/docs/prometheus/latest/storage/). This means that if the Prometheus pod is deleted then Prometheus's database is deleted as well. **This happens if you run `make` in the deployments repo** in order to force Prometheus to refresh its config. It is possible to make [Prometheus reload its config whilst still running](https://prometheus.io/docs/prometheus/latest/configuration/configuration/) if you enable the `--web.enable-lifecycle` flag, but I haven't figured out how to inject that into the Istio profile yet.
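
As a minimal sketch of the live reload mentioned above: assuming Prometheus has been started with `--web.enable-lifecycle` and is reachable locally (for example through a `kubectl port-forward` to the `prometheus` service in `istio-system`), a reload is a `POST` to the `/-/reload` endpoint. The `PROMETHEUS_URL` variable and the `reload_url` and `reload_prometheus` helper names are hypothetical and are not defined anywhere in this repo:

```shell
# Sketch only. Assumes Prometheus runs with --web.enable-lifecycle and is
# reachable at PROMETHEUS_URL, for example after running:
#   kubectl port-forward -n istio-system svc/prometheus 9090:9090
PROMETHEUS_URL="${PROMETHEUS_URL:-http://localhost:9090}"

# reload_url (hypothetical helper) prints the lifecycle endpoint that
# triggers a config reload, tolerating a trailing slash on PROMETHEUS_URL.
reload_url() {
  printf '%s/-/reload\n' "${PROMETHEUS_URL%/}"
}

# reload_prometheus (hypothetical helper) sends the reload request.
# --fail makes curl exit non-zero if Prometheus answers with an HTTP error,
# which is what happens when the lifecycle API is not enabled.
reload_prometheus() {
  curl --fail --silent --show-error -X POST "$(reload_url)"
}

echo "Reload endpoint: $(reload_url)"
```

Calling `reload_prometheus` once the updated config is visible inside the pod should refresh the config without deleting the pod, so the on-disk data would be kept.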

### Azure OMS agent

The OMS agent pulls logs and metrics from the Kubernetes cluster and adds them to a log analytics workspace.

This is configured by the `oms_agent` block in terraform (in the [products](https://github.com/MHRA/products) repo):

```terraform
resource "azurerm_kubernetes_cluster" "cluster" {
# ...other properties...

addon_profile {
oms_agent {
enabled = true
log_analytics_workspace_id = azurerm_log_analytics_workspace.cluster.id
}
}
}
```

The configuration for the OMS agent lives [here](./oms-agent/container-azm-ms-agentconfig.yaml).

In this configuration we tell the OMS agent to only scrape Prometheus metrics which have the label `azure_monitor: true` by setting the scrape URLs in `prometheus-data-collection-settings` to:

```yaml
urls = [
"http://prometheus.istio-system.svc.cluster.local:9090/federate?match[]={azure_monitor=%22true%22}"
]
```

(This uses [Prometheus federation](https://prometheus.io/docs/prometheus/latest/federation/)).
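
To check by hand which series the OMS agent would pick up, the same federation URL can be queried directly. This is a rough sketch that assumes Prometheus is reachable at `PROMETHEUS_URL` (for example via a port-forward); the variable and the `federate_url` and `fetch_federated` helper names are hypothetical:

```shell
# Sketch only. Assumes Prometheus is reachable at PROMETHEUS_URL, for
# example after running:
#   kubectl port-forward -n istio-system svc/prometheus 9090:9090
PROMETHEUS_URL="${PROMETHEUS_URL:-http://localhost:9090}"

# federate_url (hypothetical helper) prints the /federate URL for a single
# label matcher, writing the quotes around the value as %22 in the same
# way as the OMS agent config.
federate_url() {
  label="$1"
  value="$2"
  printf '%s/federate?match[]={%s=%%22%s%%22}\n' "${PROMETHEUS_URL%/}" "$label" "$value"
}

# fetch_federated (hypothetical helper) prints the matching series.
# --globoff stops curl from treating [] and {} in the URL as glob patterns.
fetch_federated() {
  curl --fail --silent --show-error --globoff "$(federate_url "$1" "$2")"
}

echo "Federation URL: $(federate_url azure_monitor true)"
```

Running `fetch_federated azure_monitor true` should then list only the series produced by rules carrying the `azure_monitor: true` label.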

### Azure Monitor Dashboard

The code for the dashboard lives in terraform in [modules/cluster/dashboard.tf](../modules/cluster/dashboard.tf). The JSON code for the dashboard is pretty gnarly, so if you want to make changes I would recommend making them in the UI, then exporting the dashboard as JSON and popping that into terraform (and don't forget to parametrise things like the subscription id).

The queries for the Azure Monitor dashboard are written using Azure's [Kusto Query Language (KQL)](https://docs.microsoft.com/en-us/azure/data-explorer/kusto/concepts/).
1 change: 1 addition & 0 deletions observability/kube-state-metrics/.gitignore
@@ -0,0 +1 @@
install.yaml
Original file line number Diff line number Diff line change
@@ -1,10 +1,10 @@
 .PHONY: default
 default: ## Deploy using Kustomize
-	kustomize build ./base | kubectl apply -f -
+	kustomize build ./overlay | kubectl apply -f -
 
 .PHONY: delete
 delete: ## Deploy using Kustomize
-	kustomize build ./base | kubectl delete --ignore-not-found -f - || true
+	kustomize build ./overlay | kubectl delete --ignore-not-found -f - || true
 
 .PHONY: help
 help: ## Display this help screen