Commit Graph

36 Commits

Author SHA1 Message Date
stuart nelson 445fbdf1a8
gossip large messages via SendReliable (#1415)
* Gossip large messages via SendReliable

For messages beyond half of the maximum gossip
packet size, send the message to all peer nodes
via TCP.

The choice of "larger than half the max gossip
size" is relatively arbitrary. From brief testing,
the overhead from memberlist on a packet seemed to
only use ~3 of the available 1400 bytes, and most
gossip messages seem to be <<500 bytes.

* Add tests for oversized/normal message gossiping

* Make oversize metric names consistent

* Remove errant printf in test

* Correctly increment WaitGroup

* Add comment for OversizedMessage func

* Add metric for oversized messages dropped

Code was added to drop oversized messages if the
buffered channel they are sent on is full. This
is a good thing to surface as a metric.

* Add counter for total oversized messages sent

* Change full queue log level to debug

Was previously a warning, which isn't necessary
now that there is a metric tracking it.

Signed-off-by: stuart nelson <stuartnelson3@gmail.com>
2018-06-15 13:40:21 +02:00
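
The size rule described in commit 445fbdf1a8 can be sketched as follows. This is a minimal illustration, not the Alertmanager code: the 1400-byte limit is taken from the commit message, while `isOversized`, the `sender` type, and its fields are hypothetical names. Oversized messages go onto a buffered channel (standing in for the TCP path) and are counted as dropped when that channel is full.

```go
package main

import "fmt"

// maxGossipPacketSize is an assumption based on the "1400 bytes" figure
// quoted in the commit message.
const maxGossipPacketSize = 1400

// isOversized reports whether msg is larger than half the maximum gossip
// packet size. The name is hypothetical.
func isOversized(msg []byte) bool {
	return len(msg) > maxGossipPacketSize/2
}

// sender is a hypothetical stand-in for the component that routes messages.
type sender struct {
	gossiped  [][]byte    // messages handed to normal gossip
	reliableQ chan []byte // buffered queue for messages sent over TCP
	dropped   int         // oversized messages dropped because the queue was full
}

// send routes small messages to gossip and large ones to the reliable queue.
// It never blocks: if the queue is full, the message is dropped and counted.
func (s *sender) send(msg []byte) {
	if !isOversized(msg) {
		s.gossiped = append(s.gossiped, msg)
		return
	}
	select {
	case s.reliableQ <- msg:
	default:
		s.dropped++
	}
}

func main() {
	s := &sender{reliableQ: make(chan []byte, 1)}
	s.send(make([]byte, 100)) // small: gossiped
	s.send(make([]byte, 900)) // large: queued for reliable send
	s.send(make([]byte, 900)) // large: queue already full, dropped
	fmt.Println(len(s.gossiped), len(s.reliableQ), s.dropped)
}
```

The non-blocking `select` keeps a slow peer from stalling the sender, which is why a dropped-message counter is useful alongside it.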
stuart nelson 77cc718a81 [nflog] register snapshotSize
This metric was never registered.
2018-06-12 13:59:48 +02:00
stuart nelson 36588c3865
memberlist gossip (#1389)
* Peers further propagate newly received nflogs

If a peer receives an nflog that it hasn't seen
before, queue the message and propagate it further
to other peers. This should ensure that all
peers within a cluster receive all gossip
messages.

Signed-off-by: stuart nelson <stuartnelson3@gmail.com>

* Set Retransmit value based on number of members

For alertmanagers that are brought up with a list
of peers, set the number of message retransmits to
be half of that number. If there are no peers on
start, or there are few, continue to use the
default value of 3.

Signed-off-by: stuart nelson <stuartnelson3@gmail.com>

* [nflog] Move retransmit calculation

Signed-off-by: stuart nelson <stuartnelson3@gmail.com>

* [silence] further gossip silence messages

Signed-off-by: stuart nelson <stuartnelson3@gmail.com>

* Set GossipNodes to equal RetransmitMulti

During a gossip, we send messages to at most
GossipNodes nodes. If possible, we want a message
to reach all nodes as soon as possible.

Signed-off-by: stuart nelson <stuartnelson3@gmail.com>

* Fix rebase

Signed-off-by: stuart nelson <stuartnelson3@gmail.com>
2018-06-08 11:48:42 +02:00
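
The retransmit rule from commit 36588c3865 ("half of the number of peers, otherwise the default of 3") might look like the sketch below. The function name `retransmitCount` is hypothetical, and rounding down via integer division is an assumption; the commit message does not say how odd peer counts are handled.

```go
package main

import "fmt"

// defaultRetransmit is the default value of 3 mentioned in the commit message.
const defaultRetransmit = 3

// retransmitCount returns half the number of configured peers, falling back
// to the default when there are no peers or only a few. The name and the
// rounding behavior are assumptions for illustration.
func retransmitCount(peers int) int {
	if r := peers / 2; r > defaultRetransmit {
		return r
	}
	return defaultRetransmit
}

func main() {
	for _, n := range []int{0, 4, 7, 10, 20} {
		fmt.Printf("peers=%d retransmit=%d\n", n, retransmitCount(n))
	}
}
```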
Simon Pasquier b7d891cf39 notify: notify resolved alerts properly (#1408)
* notify: notify resolved alerts properly

PR #1205, while fixing an existing issue, introduced another bug when
the send_resolved flag of the integration is set to true.

With send_resolved set to false, the semantics remain the same:
AlertManager generates a notification when new firing alerts are added
to the alert group. The notification only carries firing alerts.

With send_resolved set to true, AlertManager generates a notification
when new firing or resolved alerts are added to the alert group. The
notification carries both the firing and resolved alerts.

Signed-off-by: Simon Pasquier <spasquie@redhat.com>

* Fix comments

Signed-off-by: Simon Pasquier <spasquie@redhat.com>
2018-06-08 11:37:38 +02:00
Simon Pasquier 0ebaeccd4b *: add missing license headers
Signed-off-by: Simon Pasquier <spasquie@redhat.com>
2018-05-14 17:37:13 +02:00
Ted Zlatanov b04e9ad19b #1346: move maintenance messages to DEBUG log level (#1347)
Signed-off-by: Ted Zlatanov <tzz@lifelogs.com>
2018-04-30 11:56:17 +02:00
Simon Pasquier a8c995f77c nflog: fix potential panic in decodeState()
Signed-off-by: Simon Pasquier <spasquie@redhat.com>
2018-04-10 10:11:40 +02:00
Simon Pasquier 1531aa66f3 Fix for #1282 (#1286)
* cluster: add alertmanager_cluster_messages_queued metric

* cluster: add metrics for sent messages

This change adds 2 new metrics:

- alertmanager_cluster_messages_sent_total
- alertmanager_cluster_messages_sent_size_total

* Fix marshaling for entries being broadcast

Individual notification logs and silences being broadcast to the other
peers need to be encoded using the same length-delimited format as when
doing full-state synchronization.

* main: fix argument order for cluster.Join()

cluster.Join() was called with the push/pull and gossip interval
parameters swapped.
2018-03-22 13:53:00 +01:00
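
The "length-delimited format" mentioned in commit 1531aa66f3 can be illustrated with a generic sketch. It assumes each entry is prefixed with its length as an unsigned varint, which is a common convention for delimiting protobuf messages; the helper names are hypothetical and the payloads here are plain bytes rather than real nflog or silence entries.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
	"io"
)

// encodeAll writes each entry as <uvarint length><payload>.
// The varint prefix is an assumption about the wire format.
func encodeAll(entries ...[]byte) []byte {
	var buf bytes.Buffer
	var hdr [binary.MaxVarintLen64]byte
	for _, e := range entries {
		n := binary.PutUvarint(hdr[:], uint64(len(e)))
		buf.Write(hdr[:n])
		buf.Write(e)
	}
	return buf.Bytes()
}

// decodeAll reads entries until the input is exhausted. The same decoder
// works for a single broadcast entry and for a full-state dump, which is
// the point of using one format for both.
func decodeAll(data []byte) ([][]byte, error) {
	r := bytes.NewReader(data)
	var out [][]byte
	for {
		size, err := binary.ReadUvarint(r)
		if err == io.EOF {
			return out, nil // clean end of input
		}
		if err != nil {
			return nil, err
		}
		payload := make([]byte, size)
		if _, err := io.ReadFull(r, payload); err != nil {
			return nil, err // truncated entry
		}
		out = append(out, payload)
	}
}

func main() {
	data := encodeAll([]byte("entry-1"), []byte("entry-2"))
	entries, err := decodeAll(data)
	if err != nil {
		fmt.Println("decode error:", err)
		return
	}
	for _, e := range entries {
		fmt.Println(string(e))
	}
}
```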
pasquier-s e8a92f65ef Run staticcheck as part of the build process (#1264)
This change also fixes potential issues highlighted by running
staticcheck.
2018-02-28 17:42:32 +01:00
Fabian Reinartz 247bfff606 cluster: remove MergeSingle 2018-02-09 11:06:51 +01:00
Fabian Reinartz fd49dbb477 *: move to memberlist for clustering 2018-02-08 12:18:44 +01:00
pasquier-s 62b957cc14 Notify only when new firing alerts are added (#1205)
After the initial notification has been sent, AlertManager shouldn't notify the
receiver again when no new alerts have been added to the group during
group_interval.

This change also modifies the acceptance test framework to assert that no
notification has been received in a given interval.
2018-01-23 16:52:03 +01:00
pasquier-s 9b10acae68 Don't notify resolved alerts if none were firing (#1198)
* Don't notify resolved alerts if none were firing

* Fix comments
2018-01-18 11:12:17 +01:00
pasquier-s a7d4e4ea7c Log snapshot sizes on maintenance (#1155)
* Log snapshot sizes on maintenance

* Add metrics for snapshot sizes

This change adds 2 new gauges for tracking the last snapshots' sizes:

  - alertmanager_nflog_snapshot_size_bytes
  - alertmanager_silences_snapshot_size_bytes
2018-01-10 14:53:57 +01:00
Frederic Branczyk bfdff67138 nflog: Copy and replace gossipData instead of modifying it in place (#1121) 2017-12-09 15:22:07 +01:00
Frederic Branczyk 53bd897bd0
Merge pull request #1066 from josedonizetti/add_set_test
Add tests to nflog set
2017-11-02 11:23:19 +01:00
Julius Volz 9b72c10134 Minor code cleanups 2017-11-01 23:08:34 +01:00
Jose Donizetti 511c6bcb6a Add nflog TestQuery (#1070) 2017-11-01 20:38:00 +01:00
Julius Volz fc984941ee nflog: Fix Log() crash when gossip is nil (#1064) 2017-11-01 10:34:40 +01:00
Jose Donizetti bf3f6de719 Add tests to nflog set 2017-11-01 06:44:27 -02:00
Jose Donizetti 359b614f5f Fix documentation (#1065) 2017-11-01 08:41:00 +00:00
Julius Volz 947970af44 Convert Alertmanager to use non-global go-kit loggers
Fixes https://github.com/prometheus/alertmanager/issues/1040
2017-10-22 00:20:40 -07:00
Fabian Reinartz 3269bc39e1 *: switch group key to matcher serialization
Turn the GroupKey into a string that is composed of the matchers of the
path in the routing tree and the grouping labels.
Only hash it at the very end to ensure we don't exceed size limits of
integration APIs.
2017-04-21 12:06:23 +02:00
Fabian Reinartz 4258b028d6 nflog: switch to gogoproto
This switches the nflog to generate Go code via gogoproto and thereby
use standard library timestamp types.
2017-04-18 10:03:57 +02:00
Fabian Reinartz 309c6af4b2
nflog: use alert set instead of hash for deduplication
Building a hash over an entire set of alerts causes problems, because
the hash differs on any change, whereas we only want to send
notifications if the alert and its state have changed. Therefore this
introduces a list of alerts that are active and a list of alerts that
are resolved. If the currently active alerts of a group are a subset of
the ones that have been notified about before then they are
deduplicated. The resolved notifications work the same way, with a
separate list of resolved notifications that have already been sent.
2017-04-13 15:13:47 +02:00
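
The subset rule described in commit 309c6af4b2 can be sketched as below. This only illustrates the rule as stated in that commit message (later commits in this list refine the notification semantics); alert identifiers are shown as strings and all names are hypothetical.

```go
package main

import "fmt"

// isSubset reports whether every element of current is present in notified.
// An empty current set is a subset of anything.
func isSubset(current, notified []string) bool {
	seen := make(map[string]struct{}, len(notified))
	for _, id := range notified {
		seen[id] = struct{}{}
	}
	for _, id := range current {
		if _, ok := seen[id]; !ok {
			return false
		}
	}
	return true
}

// needsNotification returns false (deduplicated) only when both the firing
// and the resolved alerts were already covered by an earlier notification.
func needsNotification(firing, resolved, sentFiring, sentResolved []string) bool {
	return !isSubset(firing, sentFiring) || !isSubset(resolved, sentResolved)
}

func main() {
	sentFiring := []string{"a", "b", "c"}
	sentResolved := []string{"x"}

	// Fewer firing alerts than before: a subset, so deduplicated.
	fmt.Println(needsNotification([]string{"a", "b"}, nil, sentFiring, sentResolved))

	// A new firing alert "d": not a subset, so notify.
	fmt.Println(needsNotification([]string{"a", "d"}, nil, sentFiring, sentResolved))

	// A newly resolved alert "y": not a subset, so notify.
	fmt.Println(needsNotification(nil, []string{"x", "y"}, sentFiring, sentResolved))
}
```

Comparing sets rather than a single hash means that an alert dropping out of the group does not by itself trigger a new notification, whereas any change would alter a hash.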
Fabian Reinartz 1e01b2bdba nflog: add metrics (#518) 2016-11-21 15:22:35 +01:00
Fabian Reinartz b2461bb2d4 *: remove go-kit logging 2016-09-06 11:56:57 +02:00
Fabian Reinartz d6713c8eeb nflog: enable sharing log via gossip 2016-08-19 12:20:04 +02:00
Fabian Reinartz 5dc8286942 nflog: fix maintenance termination 2016-08-19 12:01:16 +02:00
Fabian Reinartz 72fdf3d3ab *: integrate nflog
This commit replaces the previous NotifyInfo provider with the new
nflog package. It needs adjustments in the behavior of the deduping
stage.
The nflog stores notification digests per receiver per alert aggregation
group rather than one entry per alert per receiver. This drastically
reduces the number of entries and removes interference
across aggregation groups.
2016-08-18 15:52:28 +02:00
Fabian Reinartz a42a473213 nflog: add doc comments, license headers 2016-08-18 14:08:01 +02:00
Fabian Reinartz 4a5df40539 nflog: add logging 2016-08-16 16:33:17 +02:00
Fabian Reinartz 48b6c8ff70 nflog: add initial tests 2016-08-16 11:11:48 +02:00
Fabian Reinartz 086d581cf8 nflog: add gc/snapshotting maintenance, remove delete
This removes the Delete function from the interface as the log
should be append-only and only be reduced by expired entries.
This also adds an argument to configure a background processing routine,
which periodically garbage collects and snapshots.
2016-08-16 11:11:48 +02:00
Fabian Reinartz 80afd502d5 nflog: add mesh gossip support 2016-08-16 11:11:48 +02:00
Fabian Reinartz 3d8e60ded7 nflog: add notification log package
This adds a new nflog package meant to replace provider.Notifies. It
has a central protobuf type package, which is also meant for usage for
other packages and the API.
The generated Go types are also the in-memory representation.
2016-08-16 11:11:48 +02:00