doc: Add 'Secure Alertmanager cluster traffic' design document
Signed-off-by: Max Leonard Inden <IndenML@gmail.com>
Commit d81b9a5435 (parent 51eebbef85)

@@ -0,0 +1,117 @@

# Secure Alertmanager cluster traffic

Type: Design document

Date: 2019-02-21

Author: Max Inden <IndenML@gmail.com>

## Status Quo

Alertmanager supports [high
availability](https://github.com/prometheus/alertmanager/blob/master/README.md#high-availability)
by interconnecting multiple Alertmanager instances to form an Alertmanager
cluster. Instances of a cluster communicate via a gossip protocol managed by
HashiCorp's [_Memberlist_](https://github.com/hashicorp/memberlist) library.
_Memberlist_ uses two channels to communicate: TCP for reliable and UDP for
best-effort communication.

Alertmanager instances use the gossip layer to:

- Keep track of membership
- Replicate silence creation, update and deletion
- Replicate the notification log

As of today, the communication between Alertmanager instances in a cluster is
sent in clear text.

## Goal

Instances in a cluster should communicate with each other in a secure fashion.
Alertmanager should guarantee confidentiality, integrity and client authenticity
for each message touching the wire. While this would improve the security of
single-datacenter deployments, one could see it as a necessity for
wide-area-network deployments.

## Non-Goal

Even though the solutions discussed here might also be applicable to the API
endpoints exposed by Alertmanager, securing those endpoints is not a goal of
this design document.

## Proposed Solution - TLS Memberlist

_Memberlist_ enables users to implement their own [transport
layer](https://godoc.org/github.com/hashicorp/memberlist#Transport) without
needing to fork the library itself. That transport layer needs to support
reliable as well as best-effort communication. Instead of using TCP and UDP like
the default transport layer of _Memberlist_, the suggestion is to use only TCP
for both reliable and best-effort communication. On top of that TCP layer, one
can use mutual TLS to secure all communication. A proof-of-concept
implementation can be found here:
https://github.com/mxinden/memberlist-tls-transport.

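To illustrate what "mutual" means in mutual TLS, here is a minimal sketch of the
two TLS configurations a pair of peers would need. It is written in Python with
only the standard `ssl` module for brevity and is not taken from the
proof-of-concept. The function name `peer_tls_context` and all file paths are
hypothetical.

```python
import ssl


def peer_tls_context(server_side, ca_file=None, cert_file=None, key_file=None):
    """Build a TLS context that verifies the remote peer's certificate.

    All parameters are illustrative; paths would point at the cluster's CA
    bundle and at this instance's own certificate and private key.
    """
    protocol = ssl.PROTOCOL_TLS_SERVER if server_side else ssl.PROTOCOL_TLS_CLIENT
    ctx = ssl.SSLContext(protocol)
    ctx.minimum_version = ssl.TLSVersion.TLSv1_2
    # A server context does not verify the connecting side by default.
    # Requiring a certificate here is what turns plain TLS into mutual TLS.
    ctx.verify_mode = ssl.CERT_REQUIRED
    if ca_file is not None:
        # Accept only peers whose certificate chains to this CA.
        ctx.load_verify_locations(cafile=ca_file)
    if cert_file is not None:
        # Present this instance's own certificate to the peer.
        ctx.load_cert_chain(certfile=cert_file, keyfile=key_file)
    return ctx
```

The listening side would wrap accepted sockets with
`ctx.wrap_socket(sock, server_side=True)`, and the dialing side would wrap its
socket with `ctx.wrap_socket(sock, server_hostname=...)`. Because every instance
both listens and dials, each one needs both contexts.
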
The data gossiped between instances does not have a low-latency requirement that
TCP could not fulfill. The same applies to the relatively low data throughput
requirements of Alertmanager.

TCP connections could be kept alive beyond a single message to reduce latency as
well as handshake overhead. While this is feasible in a 3-instance Alertmanager
cluster, the discussed custom implementation would need to limit the number of
open connections for clusters with many instances (#connections = n*(n-1)/2 for
a full mesh of n instances).

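To make the growth concrete, the formula can be evaluated for a few cluster
sizes. This is plain arithmetic rather than anything specific to Alertmanager.
The helper name `full_mesh_connections` is made up for this example.

```python
def full_mesh_connections(n):
    """Number of connections needed so every pair of n instances shares one."""
    if n < 0:
        raise ValueError("instance count must be non-negative")
    return n * (n - 1) // 2


for n in (3, 10, 50):
    print(n, full_mesh_connections(n))
# prints:
# 3 3
# 10 45
# 50 1225
```

The count grows quadratically, which is why keeping every connection open is
reasonable for three instances but calls for a cap in larger clusters.
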
As of today, Alertmanager already forces _Memberlist_ to use the reliable TCP
connection instead of the best-effort UDP connection to gossip large
notification logs and silences between instances. The reason is that those
packets would otherwise exceed the
[MTU](https://en.wikipedia.org/wiki/Maximum_transmission_unit) of most UDP
setups. Splitting packets is not supported by _Memberlist_ and was not
considered worth the effort to implement in Alertmanager either. For more info
see this [GitHub
issue](https://github.com/prometheus/alertmanager/issues/1412).

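The routing decision described above can be sketched as a simple size check.
This is an illustration in Python, not Alertmanager's actual code. The limit of
1400 bytes and the names `ASSUMED_PACKET_LIMIT` and `choose_channel` are
assumptions chosen for the example.

```python
# Assumed budget for a single datagram, in bytes. Illustrative only: the real
# threshold depends on the network's MTU and on protocol overhead.
ASSUMED_PACKET_LIMIT = 1400


def choose_channel(payload, packet_limit=ASSUMED_PACKET_LIMIT):
    """Pick the channel for a message based purely on its encoded size."""
    if len(payload) <= packet_limit:
        return "best-effort"  # fits into one datagram
    return "reliable"  # too large for one datagram, send over TCP
```

With a TCP-only transport this distinction disappears, since every message
travels over the reliable channel regardless of size.
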
With the last [Prometheus developer
summit](https://docs.google.com/document/d/1-C5PycocOZEVIPrmM1hn8fBelShqtqiAmFptoG4yK70/edit)
in mind, the Prometheus project's preferred security mechanism seems to be
mutual TLS. Having Alertmanager use the same mechanism would ease deployment
with the rest of the Prometheus stack.

As a side benefit, Alertmanager would only need a single open port (TCP
traffic) instead of two open ports (TCP and UDP traffic) for cluster
communication. This does not affect the API endpoint, which remains on a
separate TCP port.

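Carrying both traffic classes over one TCP port implies some way to tell
messages apart and to find their boundaries in the byte stream. The following
is a minimal sketch of one possible framing, assuming a hypothetical one-byte
kind tag followed by a four-byte length prefix. It does not describe the wire
format of the proof-of-concept, and all names are made up for the example.

```python
import io
import struct

KIND_PACKET = 0  # message that the default transport would have sent via UDP
KIND_STREAM = 1  # message that needs reliable delivery

# Network byte order: 1-byte kind, 4-byte unsigned payload length.
HEADER = struct.Struct("!BI")


def encode_frame(kind, payload):
    """Prefix the payload with its kind and length."""
    return HEADER.pack(kind, len(payload)) + payload


def read_exact(stream, n):
    """Read exactly n bytes, since a stream may return short reads."""
    buf = bytearray()
    while len(buf) < n:
        chunk = stream.read(n - len(buf))
        if not chunk:
            raise EOFError("connection closed mid-frame")
        buf.extend(chunk)
    return bytes(buf)


def decode_frame(stream):
    """Read one frame and return (kind, payload)."""
    kind, length = HEADER.unpack(read_exact(stream, HEADER.size))
    return kind, read_exact(stream, length)


# Two frames written back to back can be read back one at a time.
wire = io.BytesIO(
    encode_frame(KIND_PACKET, b"ping") + encode_frame(KIND_STREAM, b"state")
)
first = decode_frame(wire)
second = decode_frame(wire)
```

Here `io.BytesIO` stands in for a TLS-wrapped socket. Any object with a
blocking `read` method would work the same way.
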
## Alternative Solutions

### Symmetric Memberlist

_Memberlist_ supports [symmetric key
encryption](https://godoc.org/github.com/hashicorp/memberlist#Keyring) via
AES-128, AES-192 or AES-256 ciphers. One can specify multiple keys for rolling
updates. Securing the cluster traffic via symmetric encryption would involve
only small configuration changes in the Alertmanager code base.

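The rolling-update idea can be sketched with a toy keyring. The key lengths of
16, 24 and 32 bytes follow directly from the AES variants named above.
Everything else, including the class `ToyKeyring` and its method names, is a
simplified Python model for illustration and not the _Memberlist_ API. No
actual encryption is performed.

```python
AES_BY_KEY_LENGTH = {16: "AES-128", 24: "AES-192", 32: "AES-256"}


def cipher_for_key(key):
    """Return the AES variant implied by the key length, or raise."""
    try:
        return AES_BY_KEY_LENGTH[len(key)]
    except KeyError:
        raise ValueError("key must be 16, 24 or 32 bytes long") from None


class ToyKeyring:
    """Holds one primary key for encrypting and extra keys for decrypting."""

    def __init__(self, primary):
        cipher_for_key(primary)  # validates the length
        self.keys = [primary]  # index 0 is the primary key

    @property
    def primary(self):
        return self.keys[0]

    def add_key(self, key):
        """Install a key for decryption without encrypting with it yet."""
        cipher_for_key(key)
        if key not in self.keys:
            self.keys.append(key)

    def use_key(self, key):
        """Promote an already installed key to primary."""
        if key not in self.keys:
            raise KeyError("key is not installed")
        self.keys.remove(key)
        self.keys.insert(0, key)

    def remove_key(self, key):
        """Drop a non-primary key."""
        if key == self.primary:
            raise ValueError("cannot remove the primary key")
        self.keys.remove(key)
```

A rolling update would then proceed in three steps across the cluster: install
the new key on every instance, promote it to primary on every instance, and
finally remove the old key. At each step every instance can still decrypt what
any other instance sends.
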
### Replace Memberlist

Coordinating membership might not be required by the Alertmanager cluster
component. Instead, membership could be bound to static configuration or, for
example, DNS service discovery. On the other hand, gossiping silences and
notifications is ideally done in an eventually consistent gossip fashion, given
that Alertmanager is supposed to scale beyond a 3-instance cluster and beyond
local-area-network deployments. With these requirements in mind, replacing
_Memberlist_ with an entirely self-built communication layer would be a major
undertaking.

### TLS Memberlist with DTLS

Instead of redirecting all best-effort traffic via the reliable channel as
proposed above, one could also secure the best-effort channel itself using UDP
and [DTLS](https://en.wikipedia.org/wiki/Datagram_Transport_Layer_Security) in
addition to securing the reliable traffic via TCP and TLS. However, DTLS is not
supported by the Go standard library.