# Secure Alertmanager cluster traffic

Type: Design document

Date: 2019-02-21

Author: Max Inden

## Status Quo

Alertmanager supports [high availability](https://github.com/prometheus/alertmanager/blob/master/README.md#high-availability) by interconnecting multiple Alertmanager instances to form an Alertmanager cluster. Instances of a cluster communicate over a gossip protocol managed by HashiCorp's [_Memberlist_](https://github.com/hashicorp/memberlist) library. _Memberlist_ uses two channels to communicate: TCP for reliable and UDP for best-effort communication.

Alertmanager instances use the gossip layer to:

- Keep track of membership
- Replicate silence creation, update and deletion
- Replicate the notification log

As of today, the communication between Alertmanager instances in a cluster is sent in clear text.

## Goal

Instances in a cluster should communicate with each other securely. Alertmanager should guarantee confidentiality, integrity and client authenticity for each message touching the wire. While this would improve the security of single-datacenter deployments, it can be seen as a necessity for wide-area-network deployments.

## Non-Goal

Even though the solutions might also be applicable to the API endpoints exposed by Alertmanager, securing those endpoints is not a goal of this design document.

## Proposed Solution - TLS Memberlist

_Memberlist_ enables users to implement their own [transport layer](https://godoc.org/github.com/hashicorp/memberlist#Transport) without forking the library. That transport layer needs to support reliable as well as best-effort communication. Instead of using TCP and UDP like the default transport layer of _Memberlist_, the suggestion is to use only TCP for both reliable and best-effort communication. On top of that TCP layer, one can use mutual TLS to secure all communication.
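
As a rough illustration of the mutual TLS idea, the following Go sketch uses only the standard library to send one message between two peers over TCP, with each side presenting a certificate and verifying the other. It is not a _Memberlist_ transport and does not implement the `Transport` interface. The names `peerName`, `newPeerIdentity`, `mutualTLSConfig` and `exchange` are hypothetical, and the throwaway self-signed certificate is an assumption standing in for certificates issued by a cluster CA.

```go
package main

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/tls"
	"crypto/x509"
	"crypto/x509/pkix"
	"fmt"
	"io"
	"math/big"
	"time"
)

const peerName = "alertmanager-peer" // hypothetical name, not an Alertmanager setting

// newPeerIdentity makes a throwaway self-signed certificate that doubles as the trust root.
func newPeerIdentity() (tls.Certificate, *x509.CertPool, error) {
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		return tls.Certificate{}, nil, err
	}
	tmpl := &x509.Certificate{
		SerialNumber:          big.NewInt(1),
		Subject:               pkix.Name{CommonName: peerName},
		DNSNames:              []string{peerName},
		NotBefore:             time.Now().Add(-time.Hour),
		NotAfter:              time.Now().Add(time.Hour),
		KeyUsage:              x509.KeyUsageDigitalSignature | x509.KeyUsageCertSign,
		ExtKeyUsage:           []x509.ExtKeyUsage{x509.ExtKeyUsageServerAuth, x509.ExtKeyUsageClientAuth},
		BasicConstraintsValid: true,
		IsCA:                  true,
	}
	der, err := x509.CreateCertificate(rand.Reader, tmpl, tmpl, &key.PublicKey, key)
	if err != nil {
		return tls.Certificate{}, nil, err
	}
	leaf, err := x509.ParseCertificate(der)
	if err != nil {
		return tls.Certificate{}, nil, err
	}
	roots := x509.NewCertPool()
	roots.AddCert(leaf)
	return tls.Certificate{Certificate: [][]byte{der}, PrivateKey: key, Leaf: leaf}, roots, nil
}

// mutualTLSConfig serves dialing and accepting: present cert, verify the other side.
func mutualTLSConfig(cert tls.Certificate, roots *x509.CertPool) *tls.Config {
	return &tls.Config{
		Certificates: []tls.Certificate{cert},
		RootCAs:      roots, // checked when dialing a peer
		ClientCAs:    roots, // checked when accepting a peer
		ClientAuth:   tls.RequireAndVerifyClientCert,
		ServerName:   peerName,
		MinVersion:   tls.VersionTLS12,
	}
}

// exchange sends msg over loopback TCP with mutual TLS and returns what was received.
func exchange(msg string) (string, error) {
	cert, roots, err := newPeerIdentity()
	if err != nil {
		return "", err
	}
	ln, err := tls.Listen("tcp", "127.0.0.1:0", mutualTLSConfig(cert, roots))
	if err != nil {
		return "", err
	}
	defer ln.Close()
	buf := make([]byte, len(msg))
	done := make(chan error, 1)
	go func() {
		conn, err := ln.Accept()
		if err != nil {
			done <- err
			return
		}
		defer conn.Close()
		_, err = io.ReadFull(conn, buf) // the first read runs the handshake
		done <- err
	}()
	conn, err := tls.Dial("tcp", ln.Addr().String(), mutualTLSConfig(cert, roots))
	if err != nil {
		return "", err
	}
	defer conn.Close()
	if _, err := conn.Write([]byte(msg)); err != nil {
		return "", err
	}
	if err := <-done; err != nil {
		return "", err
	}
	return string(buf), nil
}

func main() {
	fmt.Println(exchange("silence-update"))
}
```

Setting `ClientAuth` to `tls.RequireAndVerifyClientCert` is what makes the TLS session mutual: the accepting side rejects a dialing peer that cannot present a certificate chaining to `ClientCAs`.
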
A proof-of-concept implementation can be found here: https://github.com/mxinden/memberlist-tls-transport.

The data gossiped between instances does not have a low-latency requirement that TCP could not fulfill. The same applies to the relatively low data throughput requirements of Alertmanager.

TCP connections could be kept alive beyond a single message to reduce latency as well as handshake overhead. While this is feasible in a 3-instance Alertmanager cluster, the discussed custom implementation would need to limit the number of open connections for clusters with many instances, since a full mesh of n instances requires n*(n-1)/2 connections.

As of today, Alertmanager already forces _Memberlist_ to use the reliable TCP connection instead of the best-effort UDP connection to gossip large notification logs and silences between instances. The reason is that those packets would otherwise exceed the [MTU](https://en.wikipedia.org/wiki/Maximum_transmission_unit) of most UDP setups. Splitting packets is not supported by _Memberlist_, and implementing it in Alertmanager was not considered worth the effort either. For more information see this [GitHub issue](https://github.com/prometheus/alertmanager/issues/1412).

With the last [Prometheus developer summit](https://docs.google.com/document/d/1-C5PycocOZEVIPrmM1hn8fBelShqtqiAmFptoG4yK70/edit) in mind, the Prometheus project's preferred security mechanism seems to be mutual TLS. Having Alertmanager use the same mechanism would ease deployment alongside the rest of the Prometheus stack.

As a beneficial side effect, Alertmanager would need only a single open port (TCP traffic) instead of two open ports (TCP and UDP traffic) for cluster communication. This does not affect the API endpoint, which remains on a separate TCP port.

## Alternative Solutions

### Symmetric Memberlist

_Memberlist_ supports [symmetric key encryption](https://godoc.org/github.com/hashicorp/memberlist#Keyring) via AES-128, AES-192 or AES-256 ciphers. One can specify multiple keys for rolling updates.
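
To make the key handling more concrete, here is a minimal sketch that uses Go's standard library rather than _Memberlist_ itself. It assumes AES in GCM mode, and the `keyring` type with its `seal` and `open` methods is a hypothetical stand-in, not the library's `Keyring`. The key length (16, 24 or 32 bytes) selects AES-128, AES-192 or AES-256. The first key encrypts and every key is tried for decryption, which is one way multiple keys can support a rolling update.

```go
package main

import (
	"crypto/aes"
	"crypto/cipher"
	"crypto/rand"
	"errors"
	"fmt"
)

// keyring is a hypothetical stand-in, not Memberlist's Keyring type.
// keys[0] is the primary key used to encrypt; all keys are tried to decrypt.
type keyring struct{ keys [][]byte }

// newGCM builds an AES-GCM AEAD. A key of 16, 24 or 32 bytes selects
// AES-128, AES-192 or AES-256; any other length is rejected.
func newGCM(key []byte) (cipher.AEAD, error) {
	block, err := aes.NewCipher(key)
	if err != nil {
		return nil, err
	}
	return cipher.NewGCM(block)
}

// seal encrypts plaintext with the primary key and prepends a random nonce.
func (k keyring) seal(plaintext []byte) ([]byte, error) {
	if len(k.keys) == 0 {
		return nil, errors.New("empty keyring")
	}
	gcm, err := newGCM(k.keys[0])
	if err != nil {
		return nil, err
	}
	nonce := make([]byte, gcm.NonceSize())
	if _, err := rand.Read(nonce); err != nil {
		return nil, err
	}
	return gcm.Seal(nonce, nonce, plaintext, nil), nil
}

// open tries every key in turn and returns the first successful decryption.
func (k keyring) open(msg []byte) ([]byte, error) {
	for _, key := range k.keys {
		gcm, err := newGCM(key)
		if err != nil {
			return nil, err
		}
		if len(msg) < gcm.NonceSize() {
			return nil, errors.New("message too short")
		}
		nonce, ciphertext := msg[:gcm.NonceSize()], msg[gcm.NonceSize():]
		if plaintext, err := gcm.Open(nil, nonce, ciphertext, nil); err == nil {
			return plaintext, nil
		}
	}
	return nil, errors.New("no key in the keyring decrypts this message")
}

func main() {
	oldKey := make([]byte, 16) // AES-128; all-zero demo key, never use in practice
	newKey := make([]byte, 32) // AES-256
	newKey[0] = 1

	// During a rolling update the sender still encrypts with the old key,
	// while the receiver already prefers the new key but keeps the old one.
	sender := keyring{keys: [][]byte{oldKey}}
	receiver := keyring{keys: [][]byte{newKey, oldKey}}

	msg, err := sender.seal([]byte("silence-update"))
	if err != nil {
		panic(err)
	}
	plaintext, err := receiver.open(msg)
	fmt.Println(string(plaintext), err)
}
```

Note that symmetric encryption of this kind gives confidentiality and integrity to anyone holding a key, but it does not identify individual peers the way per-instance certificates do.
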
Securing the cluster traffic via symmetric encryption would require only small configuration changes in the Alertmanager code base.

### Replace Memberlist

Coordinating membership might not be required by the Alertmanager cluster component. Instead, membership could be derived from static configuration or, for example, DNS service discovery. On the other hand, silences and notifications are ideally gossiped in an eventually consistent fashion, given that Alertmanager is supposed to scale beyond a 3-instance cluster and beyond local-area-network deployments. With these requirements in mind, replacing _Memberlist_ with an entirely self-built communication layer would be a major undertaking.

### TLS Memberlist with DTLS

Instead of redirecting all best-effort traffic via the reliable channel as proposed above, one could also secure the best-effort channel itself using UDP and [DTLS](https://en.wikipedia.org/wiki/Datagram_Transport_Layer_Security), in addition to securing the reliable traffic via TCP and TLS. DTLS is not supported by the Go standard library.