Errcheck [1] enforces error handling accross all go files. Functions can
be excluded via `scripts/errcheck_excludes.txt`.
This patch adds errcheck to the `test` Make target.
[1] https://github.com/kisielk/errcheck
Signed-off-by: Max Leonard Inden <IndenML@gmail.com>
Alertmanager is exiting with a non-zero exit code if the initial cluster
join fails. This behavior could be not wanted because:
- As Alertmanager is a critical component with an at-least-once
guarantee, failing on joining the cluster is unnecessary as
Alertmanager still functions by itself.
- In an environment like Kubernetes discovering peers via DNS, peers
might roll out one-by-one, leaving the DNS entries unpopulated for the
first peer of a set. Failing on initial join prevents a roll-out.
Instead of failing on the initial join this patch only logs the failure.
The cluster can be later joined via the `handleReconnect`.
This is a regression introduced in PR #1456 [1].
[1] https://github.com/prometheus/alertmanager/pull/1456
Signed-off-by: Max Leonard Inden <IndenML@gmail.com>
* cluster: make sure we don't miss the first pushPull
During the join, memberlist initiates a pushPull to get initial data.
Unfortunately, at this point the nflog and silence listener have not
been registered yet, so the first data arrives only after one pushPull
cycle (1min by default !).
Signed-off-by: Corentin Chary <c.chary@criteo.com>
* Gossip large messages via SendReliable
For messages beyond half of the maximum gossip
packet size, send the message to all peer nodes
via TCP.
The choice of "larger than half the max gossip
size" is relatively arbitrary. From brief testing,
the overhead from memberlist on a packet seemed to
only use ~3 of the available 1400 bytes, and most
gossip messages seem to be <<500 bytes.
* Add tests for oversized/normal message gossiping
* Make oversize metric names consistent
* Remove errant printf in test
* Correctly increment WaitGroup
* Add comment for OversizedMessage func
* Add metric for oversized messages dropped
Code was added to drop oversized messages if the
buffered channel they are sent on is full. This
is a good thing to surface as a metric.
* Add counter for total oversized messages sent
* Change full queue log level to debug
Was previously a warning, which isn't necessary
now that there is a metric tracking it.
Signed-off-by: stuart nelson <stuartnelson3@gmail.com>
* initial impl
Signed-off-by: stuart nelson <stuartnelson3@gmail.com>
* Add reconnectTimeout
Signed-off-by: stuart nelson <stuartnelson3@gmail.com>
* Fix locking
Signed-off-by: stuart nelson <stuartnelson3@gmail.com>
* Remove unused PeerStatuses
Signed-off-by: stuart nelson <stuartnelson3@gmail.com>
* Add metrics
Signed-off-by: stuart nelson <stuartnelson3@gmail.com>
* Actually use peerJoinCounter
Signed-off-by: stuart nelson <stuartnelson3@gmail.com>
* Cleanup peers map on peer timeout
Signed-off-by: stuart nelson <stuartnelson3@gmail.com>
* Add reconnect test
Signed-off-by: stuart nelson <stuartnelson3@gmail.com>
* test removing failed peers
Signed-off-by: stuart nelson <stuartnelson3@gmail.com>
* Use peer address as map key
If a peer is restarted, it will rejoin with the
same IP but different ULID. So the node will
rejoin the cluster, but its peers will never
remove it from their internal list of failed nodes
because its ULID has changed.
Signed-off-by: stuart nelson <stuartnelson3@gmail.com>
* Add failed peers from creation
Signed-off-by: stuart nelson <stuartnelson3@gmail.com>
* Remove warnIfAlone()
Signed-off-by: stuart nelson <stuartnelson3@gmail.com>
* Update metric names
Signed-off-by: stuart nelson <stuartnelson3@gmail.com>
* Address comments
Signed-off-by: stuart nelson <stuartnelson3@gmail.com>
The cluster configuration uses DefaultLANConfig which seems
to be quite sensitive to WAN conditions. Allowing the tuning of these 3
parameters (TCP Timeout, Probe Interval and Probe Timeout) makes
clustering more robust across WAN connections.
Signed-off-by: Rhys Meaclem <rhysmeaclem@gmail.com>
With prometheus/prometheus commit
e114ce0ff7a1ae06b24fdc479ffc7422074c1ebe [1] Prometheus switches from
using `api/alerts` to `api/v1/alerts`. This commit is included starting
from Prometheus v0.17.0. As discussed on the prometheus-developers
mailing list [2] the deprecation period is long over.
[1] github.com/prometheus/prometheus/commit/e114ce0ff7a1ae06b24fdc479ffc7422074c1ebe
[2]
https://groups.google.com/d/msg/prometheus-developers/2CCuFTMbmAg/Qg58rvyzAQAJ
Signed-off-by: Max Leonard Inden <IndenML@gmail.com>
* Use default values to store values from config
* fix typo and reserved keywork
* move to long help texts
* add one more unit test for resolver
* update comments
Signed-off-by: Simon Pasquier <spasquie@redhat.com>
This change replaces the deprecated InstrumentHandler function by
the equivalent functions from the promhttp package.
The following metrics are removed:
* http_request_duration_microseconds (Summary).
* http_request_size_bytes (Summary).
* http_requests_total (Counter).
And the following metrics are added instead:
* alertmanager_http_request_duration_seconds (Histogram).
* alertmanager_http_response_size_bytes (Histogram).
* promhttp_metric_handler_requests_in_flight (Gauge).
* promhttp_metric_handler_requests_total (Counter).
* cluster: add alertmanager_cluster_messages_queued metric
* cluster: add metrics for sent messages
This change adds 2 new metrics:
- alertmanager_cluster_messages_sent_total
- alertmanager_cluster_messages_sent_size_total
* Fix marshaling for entries being broadcast
Individual notifications logs and silences being broadcast to the other
peers need to be encoded using the same length-delimited format as when
doing full-state synchronization.
* main: fix argument order for cluster.Join()
cluster.Join() was called with the push/pull and gossip interval
parameters being swapped one for another.
* Wait for the gossip to settle before sending notifications
See #1209 for details.
As an heuristic for mesh readyness, try to see if
the mesh looks stable (the number of peers isn't changing too much).
This implementation always mark the altermanager as ready after a maximum of 60s.
This adds one new flags to control this behavior:
```
--cluster.settle-timeout=60s mesh settling timeout. Do not wait more than this duration on startup.
```
It also adds `/-/ready` which always return 200 (in order to make it clear
that we are ready as soon as we can receive requests).
The mesh status is exposed in `/api/v1/status` and visible on `/#/status`.
* cluster: fix typos and base interval on gossipInterval
* cluster: gather alertmanager_peer_position all the time
This change moves the gathering of the alertmanager_peer_position metric
outside of the clusterWait() function so that the metric is computed
accurately even when no alerting group fires.
* cluster: add alertmanager_cluster_health_score metric
This metric is retrieved from the memberlist library.
* Add mesh metrics
This change adds 2 new metrics for the mesh:
* alertmanager_peer_connection, state of the connection between the
Alertmanager instance and a peer.
* alertmanager_peer_terminations_total, total number of terminated
connection.
It also moves the gathering of the alertmanager_peer_position metric
outside of the meshWait() function so that the metric is computed
accurately even when no alerting group fires.
* Remove 'nick' label from alertmanager_peer_connection metric
* Switch cmd/amtool to kingpin
* Touch-ups
* Implement long help
* Add missing short-form of --output
* Fix backwards compatibility for config file options
* Fix vendoring
* Review fixes
* Fix flag word order
This adds metrics that look like this:
```
alertmanager_alerts{state="active"} 6
alertmanager_alerts{state="suppressed"} 0
alertmanager_silences{state="active"} 1
alertmanager_silences{state="expired"} 1
alertmanager_silences{state="pending"} 0
```
This can be used to monitor alertmanager's usage and validate that
alertmanagers in a mesh have a similar number of silences and alerts.
* Fix crash when no mesh router is configured
This adds a check for `meshListen != ""` around the waitFunc code as we
have around the other mesh-related code parts above.
Fixes https://github.com/prometheus/alertmanager/issues/914
* Update bindata
Infer path from Navigation.Location
Build uses template, local dev uses elm-reactor
Remove unneeded local dev go server
Add script.js make target
Compiles and uglifies script.js
Before:
~570kb
After:
~170kb
Bootstrap loading state
Add trailing slash via JS & add routePrefix console param
Add Javascript script tag to `index.html` which adds a trailing slash to
the url pathname if none is present. This is done to ensure assets like
`script.js` are loaded properly.
Example without patch:
If the pathname is "mxinden.com/alertmanager" the browser will try to
download the `script.js` asset from "mxinden.com/script.js". This
request will fail.
Example with patch:
If the pathname is "mxinden.com/alertmanager", Javascript redirects the
browser to "mxinden.com/alertmanager/" and then the `script.js` asset
will be downloaded from "mxinden.com/alertmanager/script.js". This
request will succeed.
Add `-web.route-prefix` as a console parameter. This configures a
Prefix for the internal routes of web endpoints. Defaults to path of
-web.external-url like in *Prometheus*.
Trim slashes off of route prefix and add one slash at the beginning.
Make sure route prefix is not empty or just a slash before prefixing
router.
This makes the /alerts endpoint return the alert status introduced by PR
# 717. In addition it enables alert filtering via matchers like on the
/alerts/groups endpoint.
* Implement alertmanager cli tool 'amtool'
The primary goal of an alertmanager tool is to provide a cli interface
for the prometheus alertmanager.
My vision for this tool has two parts:
- Silence management (query, add, delete)
- Alert management (query, maybe more in future?)
Resolves: #567
* Vendor dependencies.
This updates several old dependencies, removes
some that are no longer needed, and adds
`pkg/labels` from prometheus `dev-2.0` branch.
* Add metrics selector parsing code
This is a temporary simplified re-implementation
of promQL's metric selector parsing.
* Add alerts filtering
Filter alerts through `?filter=` query string.
* Add silences filtering
Filter silences through `?filter=` query string.
* Move `parse` to `pkg/parse`
* api: Expose mesh status
The weaveworks mesh package reveals information about the current status
of the mesh network between alertmanager instances. This commit exposes
the current address and connection status of each instance connected to
the targeted alertmanager instance via the /status API endpoint.
* api: Replace LocalConnectionStatus with PeerStatus
Now meshStatus contains all peers (Name, NickName, UID) of the network.
Additionally adding Name and Nickname of current target to toplevel of
meshNetwork object.
* Find MAC address if mesh.hardware-addr not given
Defaulting to the machine's MAC address fails
sometimes fails and causes a panic. Allow the user
to specify custom address to skip this so they can
run AlertManager.
* -mesh.hardware-address -> -mesh.peer-id
* Fix command-line invocation
This commit replaces the previous NotifyInfo provider with the new
nflog package. It needs adjustments in the behavior of the deduping
stage.
The nflog stores notification digests per receiver per alert aggregation
group rather than one entry for alert per receiver. This drastically
reduces the number of entries and removes interference
across aggregation groups.