Commit 13ec216

Merge pull request #54 from prometheus/prom53-impl
Rename proposals to follow PROM-53 design; link old files for compat.
2 parents 7d2aaf1 + 6499e42 commit 13ec216

27 files changed: +2528 -2502 lines changed

proposals/0001-proposal-process.md

Lines changed: 131 additions & 0 deletions
@@ -0,0 +1,131 @@
1+
## Unified Proposal Process for Prometheus and Prometheus-community
2+
3+
* **Owners:**
4+
* [@bwplotka](https://github.com/bwplotka)
5+
6+
* **Related Issues and PRs:**
7+
* https://github.com/bwplotka/proposals/pull/1
8+
9+
* **Implementation Status:**
10+
* Implemented
11+
12+
* **Other docs or links:**
13+
* https://github.com/openshift/enhancements/ (a bit too complex, too tedious to fight with Python linter)
14+
* [KEPs](https://github.com/kubernetes/enhancements)
15+
* [Thanos Proposal process](https://thanos.io/tip/contributing/proposal-process.md/)
16+
17+
> TL;DR: This design document proposes a single way of proposing changes to software, processes and anything else regarding [Prometheus](https://github.com/prometheus) and [Prometheus community](https://github.com/prometheus-community) projects. The proposal is to move to https://github.com/prometheus/proposals as the primary source and record of officially proposed design docs. Design docs will be snapshotted in the form of markdown files in GitHub. Discussions related to proposals will be within relevant GitHub PRs, Issues and Discussions. See a demo repository at https://github.com/bwplotka/proposals and the demo [PR with the new (this) proposal](https://github.com/bwplotka/proposals/pull/1).
18+
19+
## Why
20+
21+
It's essential to clearly explain the reasons behind certain design decisions to have a community consensus. This is especially important in Prometheus, where every decision might have a significant impact given the high adoption and stability of the software and standards we work on. In our world, no decision is perfect, so having a design document that explains the trade-offs we made is essential, so it can also be used later on as a reference and for knowledge-sharing purposes.
22+
23+
It is important to have a single way of proposing those bigger ideas and a single place for reviewing, discovering and working on them, and for keeping a record of past decisions and approvals.
24+
25+
### Pitfalls of the current solution
26+
27+
Currently, we have no strict process for proposing ideas in the form of design docs. We try to maintain an index of Google Docs at https://prometheus.io/docs/introduction/design-doc/, but it has some flaws:
28+
29+
1. We are missing a consistent design doc template that enables authors to focus on content and gives readers a more consistent experience (less friction!).
30+
2. Authors own Google Docs, so even after acceptance, they can change without approvals or notifications (Google Docs are versioned, but it is hard to track changes). They can also get deleted, e.g. check the "Persist Retroactive Rules" design doc on https://prometheus.io/docs/introduction/design-doc/.
31+
3. There is no process related to the discovery, review and approval of design docs (or updating the index). This results in unknowns and simply less motivation for the community to bring ideas to the table. It also makes it hard to find consensus given many discussion channels. Approvals should also be signed, verifiable and transparent.
32+
4. Google Docs are not searchable/discoverable easily. Updating the index manually takes effort, and it's easy to miss.
33+
34+
## Goals
35+
36+
* Allow easy and fruitful collaboration on ideas.
37+
* Allow verifiable and transparent decision-making on design ideas.
38+
* Clearly version and track changes of accepted/rejected proposals.
39+
* Have a consistent design style that is readable and understandable.
40+
* Move [the previous design docs](https://prometheus.io/docs/introduction/design-doc/) to new place/process.
41+
42+
## Non-Goals
43+
44+
Automation for implementation status tracking. We can figure that out later.
45+
46+
## How
47+
48+
I propose a dedicated repository, `github.com/prometheus/proposals`, that will contain proposals in the Markdown format ([GFM](https://github.github.com/gfm/)), formatting/link checking tooling and instructions on how to collaborate on ideas (see [Alternatives 1 and 2](#alternatives) for rationales and alternatives considered around placement).
49+
50+
Markdown files might feel like more overhead than Google Docs, but based on the research and feedback from people, it feels like a good trade-off. It's a good balance between writing effort versus readability, review and decision clarity (see [Alternatives 3](#alternatives)).
51+
52+
Given initial feedback, I would propose not rendering proposals on the Prometheus website. This could confuse proposed features with stable implementation. We don't want to treat those design documents as feature documentation.
53+
54+
One good argument against repositories like this or websites with design docs is the beauty of a stable URL of the Google Document, so it can be referenced and valid longer. However, GitHub allows permalinks. We can also make sure the format of the proposal repo is treated as an immutable resource.
55+
56+
### Details
57+
58+
I propose to host in git only accepted proposals, simply in the root directory. So it will roughly look like this:
59+
60+
![repo view.png](../assets/proposal-repo.png)
61+
62+
Given that, the process of proposing change with the design doc would look as follows (we can use the below text as the initial instruction):
63+
64+
1. Fork `github.com/prometheus/proposals`.
65+
2. Create a GitHub Pull Request with a design document in markdown format to the repository's root directory. Make sure to use the [template](../0000-00-00_template.md) as the guide for what sections should be present in the document. Put the creation date (the day you started preparing this design doc) as the prefix and some unique name as the suffix in the file name.
66+
1. If you prefer Google Docs to any other collaboration tool, feel free to use it in the initial state. We recommend [Open Source Design Doc Template](https://docs.google.com/document/d/1zeElxolajNyGUB8J6aDXwxngHynh4iOuEzy3ylLc72U/edit#). However, the approval process will only happen officially in the Pull Request.
67+
3. An automatic formatter is enabled in the repository. Use `make` locally to format the document. Use `make check` to check all links (this also runs in CI).
68+
4. The design is accepted if the PR is merged into this repository. It's ok to eventually decide to reject the proposal and close the PR with meaningful reasons for why it was rejected.
69+
1. If more eyes are needed, or no consensus was reached: Propose and announce your idea on
70+
[Prometheus DevSummit](https://docs.google.com/document/d/11LC3wJcVk00l8w5P3oLQ-m3Y37iom6INAMEu2ZAGIIE/edit) or mailing list to gather more information. You are welcome to start working on the design doc before a bigger discussion--it is often easier to have a discussion with prior information provided. Be prepared that the idea might be rejected later--still, the record of the document in the Pull Request is useful even in rejected state to inform about past decisions and opportunities considered.
71+
2. To merge the PR, we need approval (consensus) from the maintainers of the related component(s).
72+
3. Optionally: Find a sponsor among Prometheus maintainers to get momentum on a change.
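Step 2 above asks for the creation date as the file name prefix and a unique name as the suffix. A minimal sketch of that convention in Go follows. The helper `proposalFileName`, the `-` separator and the slug rules are assumptions made for illustration; the proposal itself does not prescribe them.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
	"time"
)

// nonAlnum matches runs of characters that are not lowercase ASCII letters or digits.
var nonAlnum = regexp.MustCompile(`[^a-z0-9]+`)

// proposalFileName is a hypothetical helper (not part of any Prometheus tooling).
// It builds "<creation date>-<slug>.md" from the day the author started the
// design doc and a short unique title.
func proposalFileName(year int, month time.Month, day int, title string) string {
	created := time.Date(year, month, day, 0, 0, 0, 0, time.UTC)
	// Lowercase the title and collapse everything else into single dashes.
	slug := nonAlnum.ReplaceAllString(strings.ToLower(title), "-")
	slug = strings.Trim(slug, "-")
	return fmt.Sprintf("%s-%s.md", created.Format("2006-01-02"), slug)
}

func main() {
	fmt.Println(proposalFileName(2022, time.November, 23, "Proposal Process"))
}
```

Using an ISO-style date prefix keeps files sorted chronologically in the repository root, which is one plausible reason for the date-first convention.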
73+
74+
Once the PR is merged, the design doc can still change, but that requires a PR (with a less strict process, but still) reviewed and merged by a maintainer.
75+
76+
Two features are present in the current index page: Implementation Status and TODO design docs.
77+
78+
### TODO Proposals
79+
80+
For `TODO`s, i.e. ideas for design docs we know we want (e.g. as decided at the DevSummit), I propose to use GitHub Issues in the `github.com/prometheus/proposals` repository with appropriate `TODO` labels.
81+
82+
### Implementation Status
83+
84+
For Implementation Status, I propose a "best effort" `Implementation Status` field in markdown and a roughly maintained list of links to relevant PRs and Issues. We can iterate over it, but without automation, I don't expect owners to always update this field with new changes.
85+
86+
## Alternatives
87+
88+
1. Different placement of design docs in markdown: `prometheus/prometheus`.
89+
90+
We could place them in [Prometheus repo](https://github.com/prometheus/prometheus), bringing more visibility. There would be some issues, though:
91+
92+
* CI checks would get run on every design doc change, which will bring a lot of pain. We could invest in special CI rules to avoid this, but it's not trivial.
93+
* Design docs buried somewhere in the Prometheus repo will be less discoverable. It also makes it harder to categorize GitHub Discussions and Issues related to design docs.
94+
* Design docs relate not only to the Prometheus repo but to the full ecosystem, and even to neighbour projects like Alertmanager and the clients. Ideally, we can share the same design process across all the things the Prometheus Team and community help with.
95+
96+
2. Different placement of design docs in markdown: `prometheus/docs`
97+
98+
We could place them in [Prometheus docs](https://github.com/prometheus/docs), which already hosts [index of design docs](https://github.com/prometheus/docs/blob/main/content/docs/introduction/design-doc.md). There are some issues too:
99+
100+
* Still, some CI checks for a website would be unrelated.
101+
* Design docs would be buried somewhere in the website implementation and content, even though the proposals are not rendered on the website. They would be less discoverable, and it would also be harder to categorize GitHub Discussions and Issues related to design docs.
102+
103+
3. Stick to Google Docs for design docs.
104+
105+
It's worth exploring the idea of sticking to Google Docs as we do right now and trying to mitigate the flaws mentioned in [Pitfalls of the current solution](#pitfalls-of-the-current-solution) in some way.
106+
107+
* We could ask to use a consistent template.
108+
* We could maintain a Prometheus Google Drive and ask owners to transfer ownership to us, which would give some immutability. Still, tracking consensus and maintaining this is non-trivial. Discovery is also poor and would mean extra work to maintain an index page like https://prometheus.io/docs/introduction/design-doc/
109+
110+
4. Have `Accepted`, `Rejected`, and `Implemented` directories for different statuses
111+
112+
This is a double maintenance effort. Using GitHub PR Close or Merge actions already indicates approval or rejection, so why not use that?
113+
114+
The implementation status proved to go stale very quickly. Updating it manually is not viable, as we see from the current index page.
115+
116+
5. Move some metadata to the front matter.
117+
118+
We could put the title, author, PRs, date and implementation status as YAML in [front matter](https://frontmatter.codes/docs/markdown#front-matter-highlighting) in markdown. This is useful if we would like to build further automation.
119+
120+
Given YAGNI, I think it's not needed at this point. The regular markdown list is good enough and can be automated/changed in later iterations of the proposal repo.
121+
122+
## Action Plan
123+
124+
The tasks to do in order to migrate to the new idea:
125+
126+
* [ ] Copy https://github.com/bwplotka/proposals to github.com/prometheus/proposals
127+
* [ ] Copy instructions from here to README.md
128+
* [ ] Migrate all the accepted docs from https://prometheus.io/docs/introduction/design-doc/ (mainly from Google Docs). Update status on the way (things are not up-to-date).
129+
* [ ] Migrate TODO ideas to GH issues.
130+
* [ ] Decide what to do with in-progress unapproved proposals. Potentially move to PRs (or ask owners to do so?)
131+
* [ ] Announce changes
Lines changed: 114 additions & 0 deletions
@@ -0,0 +1,114 @@
1+
# Secure Alertmanager cluster traffic
2+
3+
* **Owners:**
4+
* Max Inden [email protected]
5+
6+
* **Implementation:** Implemented
7+
8+
* **Related Issues and PRs:**
9+
* https://github.com/prometheus/alertmanager/pull/2237
10+
11+
> NOTE(bwplotka): This proposal was moved from the [Alertmanager repo](https://github.com/prometheus/alertmanager/blob/6ef6e6868dbeb7984d2d577dd4bf75c65bf1904f/doc/design/secure-cluster-traffic.md) before we had a [unified proposal process](2022-11-23-proposal-process.md), so it does not follow a consistent style guide. For new proposals, see the [README](/README.md).
12+
13+
## Status Quo
14+
15+
Alertmanager supports [high
16+
availability](https://github.com/prometheus/alertmanager/blob/master/README.md#high-availability)
17+
by interconnecting multiple Alertmanager instances building an Alertmanager
18+
cluster. Instances of a cluster communicate on top of a gossip protocol managed
19+
via HashiCorp's [*Memberlist*](https://github.com/hashicorp/memberlist) library.
20+
*Memberlist* uses two channels to communicate: TCP for reliable and UDP for
21+
best-effort communication.
22+
23+
Alertmanager instances use the gossip layer to:
24+
25+
- Keep track of membership
26+
- Replicate silence creation, update and deletion
27+
- Replicate notification log
28+
29+
As of today the communication between Alertmanager instances in a cluster is
30+
sent in clear-text.
31+
32+
## Goal
33+
34+
Instances in a cluster should communicate among each other in a secure fashion.
35+
Alertmanager should guarantee confidentiality, integrity and client authenticity
36+
for each message touching the wire. While this would improve the security of
37+
single datacenter deployments, one could see this as a necessity for
38+
wide-area-network deployments.
39+
40+
## Non-Goal
41+
42+
Even though solutions might also be applicable to the API endpoints exposed by
43+
Alertmanager, it is not the goal of this design document to secure the API
44+
endpoints.
45+
46+
## Proposed Solution - TLS Memberlist
47+
48+
*Memberlist* enables users to implement their own [transport
49+
layer](https://godoc.org/github.com/hashicorp/memberlist#Transport) without the
50+
need of forking the library itself. That transport layer needs to support
51+
reliable as well as best-effort communication. Instead of using TCP and UDP like
52+
the default transport layer of *Memberlist*, the suggestion is to only use TCP
53+
for both reliable as well as best-effort communication. On top of that TCP
54+
layer, one can use mutual TLS to secure all communication. A proof-of-concept
55+
implementation can be found here:
56+
https://github.com/mxinden/memberlist-tls-transport.
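To make the mutual TLS part concrete, here is a minimal sketch using only the Go standard library. It is not taken from the proof-of-concept above: the function names are hypothetical, and the choice of a single shared cluster CA and TLS 1.2 as the minimum version are assumptions for illustration.

```go
package main

import (
	"crypto/tls"
	"crypto/x509"
	"fmt"
)

// mutualTLSConfig is a hypothetical helper. Every instance both listens and
// dials, so one config carries the settings for both roles.
func mutualTLSConfig(cert tls.Certificate, clusterCA *x509.CertPool) *tls.Config {
	return &tls.Config{
		Certificates: []tls.Certificate{cert}, // this instance's identity
		RootCAs:      clusterCA,               // used when dialing a peer
		ClientCAs:    clusterCA,               // used when a peer dials us
		// Reject peers that present no certificate or an untrusted one.
		ClientAuth: tls.RequireAndVerifyClientCert,
		MinVersion: tls.VersionTLS12, // assumed minimum, not from the proposal
	}
}

// requiresClientCert reports whether the config enforces verified client certificates.
func requiresClientCert(cfg *tls.Config) bool {
	return cfg.ClientAuth == tls.RequireAndVerifyClientCert
}

// placeholderConfig builds a config with an empty certificate and CA pool.
// It only illustrates the shape; real deployments load actual key material.
func placeholderConfig() *tls.Config {
	return mutualTLSConfig(tls.Certificate{}, x509.NewCertPool())
}

func main() {
	fmt.Println(requiresClientCert(placeholderConfig()))
}
```

A listener would pass such a config to `tls.NewListener` and a dialer to `tls.Dial`. Setting `ClientAuth` to `tls.RequireAndVerifyClientCert` is what makes the authentication mutual rather than server-only.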
57+
58+
The data gossiped between instances does not have a low-latency requirement that
59+
TCP could not fulfill; the same applies to the relatively low data throughput
60+
requirements of Alertmanager.
61+
62+
TCP connections could be kept alive beyond a single message to reduce latency as
63+
well as handshake overhead costs. While this is feasible in a 3-instance
64+
Alertmanager cluster, the discussed custom implementation would need to limit
65+
the number of open connections for clusters with many instances (#connections =
66+
n*(n-1)/2).
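The formula above counts one connection per pair of instances in a full mesh. A short Go sketch (the function name `meshConnections` is chosen here for illustration) shows how quickly that number grows:

```go
package main

import "fmt"

// meshConnections returns the number of connections needed when each pair of
// the n instances keeps one connection open: n*(n-1)/2.
func meshConnections(n int) int {
	if n < 2 {
		return 0
	}
	return n * (n - 1) / 2
}

func main() {
	for _, n := range []int{3, 10, 50} {
		fmt.Printf("%d instances -> %d connections\n", n, meshConnections(n))
	}
}
```

For 3 instances this gives 3 connections, for 10 instances 45, and for 50 instances 1225, which is why a connection limit matters for larger clusters.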
67+
68+
As of today, Alertmanager already forces *Memberlist* to use the reliable TCP
69+
instead of the best-effort UDP connection to gossip large notification logs and
70+
silences between instances. The reason is that those packets would otherwise
71+
exceed the [MTU](https://en.wikipedia.org/wiki/Maximum_transmission_unit) of
72+
most UDP setups. Splitting packets is not supported by *Memberlist* and was not
73+
considered worth the effort to be implemented in Alertmanager either. For more
74+
info see this [GitHub
75+
issue](https://github.com/prometheus/alertmanager/issues/1412).
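The size-based choice described above can be sketched as follows. This is an illustration only, not Alertmanager's actual code: the names and the 1400-byte limit are assumptions standing in for whatever threshold a given deployment's MTU allows.

```go
package main

import "fmt"

// assumedPacketLimit is a hypothetical payload limit in bytes, chosen to stay
// below a typical Ethernet MTU. It is an assumption, not a documented value.
const assumedPacketLimit = 1400

type channel string

const (
	bestEffort channel = "best-effort (UDP)"
	reliable   channel = "reliable (TCP)"
)

// pickChannel sends payloads that would not fit into a single packet over the
// reliable channel, since splitting packets is not supported.
func pickChannel(payloadLen, limit int) channel {
	if payloadLen > limit {
		return reliable
	}
	return bestEffort
}

func main() {
	fmt.Println(pickChannel(200, assumedPacketLimit))     // small membership update
	fmt.Println(pickChannel(64*1024, assumedPacketLimit)) // large notification log
}
```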
76+
77+
With the last [Prometheus developer
78+
summit](https://docs.google.com/document/d/1-C5PycocOZEVIPrmM1hn8fBelShqtqiAmFptoG4yK70/edit)
79+
in mind, the Prometheus project's preferred security mechanism seems to be mutual
80+
TLS. Having Alertmanager use the same mechanism would ease deployment with the
81+
rest of the Prometheus stack.
82+
83+
As a side effect (benefit) Alertmanager would only need a single open port (TCP
84+
traffic) instead of two open ports (TCP and UDP traffic) for cluster
85+
communication. This does not affect the API endpoint which remains a separate
86+
TCP port.
87+
88+
## Alternatives
89+
90+
### Symmetric Memberlist
91+
92+
*Memberlist* supports [symmetric key
93+
encryption](https://godoc.org/github.com/hashicorp/memberlist#Keyring) via
94+
AES-128, AES-192 or AES-256 ciphers. One can specify multiple keys for rolling
95+
updates. Securing the cluster traffic via symmetric encryption would just
96+
involve small configuration changes in the Alertmanager code base.
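AES-128, AES-192 and AES-256 correspond to keys of 16, 24 and 32 bytes. As a hedged sketch of what generating such a shared key could look like, the hypothetical helper below (standard library only, not the *Memberlist* API) creates a random key of a valid size:

```go
package main

import (
	"crypto/rand"
	"fmt"
)

// newClusterKey is a hypothetical helper that returns a random symmetric key.
// Only 16, 24 and 32 bytes are accepted, matching AES-128, AES-192 and AES-256.
func newClusterKey(size int) ([]byte, error) {
	switch size {
	case 16, 24, 32:
	default:
		return nil, fmt.Errorf("invalid key size %d: want 16, 24 or 32 bytes", size)
	}
	key := make([]byte, size)
	if _, err := rand.Read(key); err != nil {
		return nil, err
	}
	return key, nil
}

func main() {
	key, err := newClusterKey(32)
	if err != nil {
		panic(err)
	}
	fmt.Printf("generated a %d-bit key\n", len(key)*8)
}
```

Every instance would need the same key, and a rolling update would add the new key alongside the old one before retiring the old one.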
97+
98+
### Replace Memberlist
99+
100+
Coordinating membership might not be required by the Alertmanager cluster
101+
component. Instead this could be bound to static configuration or e.g. DNS
102+
service discovery. On the other hand, gossiping silences and notifications is
103+
ideally done in an eventually consistent gossip fashion, given that Alertmanager
104+
is supposed to scale beyond a 3-instance cluster and beyond local-area-network
105+
deployments. With these requirements in mind, replacing *Memberlist* with an
106+
entirely self-built communication layer is a major undertaking.
107+
108+
### TLS Memberlist with DTLS
109+
110+
Instead of redirecting all best-effort traffic via the reliable channel as
111+
proposed above, one could also secure the best-effort channel itself using UDP
112+
and [DTLS](https://en.wikipedia.org/wiki/Datagram_Transport_Layer_Security) in
113+
addition to securing the reliable traffic via TCP and TLS. DTLS is not supported
114+
by the Go standard library.
