Changes to the UI:
- "Active Since" timestamps are now human-readable.
- Alerting rules are now pretty-printed better.
- Labels are no longer just strings, but alert bubbles (like we do on
the status page for base labels).
- Alert states and target health states are now capitalized in the
presentation layer rather than at the source.
This commit adds the honor_labels and params arguments to the scrape
config. This allows to specify query parameters used by the scrapers
and handling scraped labels with precedence.
The main purpose of this is to allow for blacklisting
of expensive metrics as a tactical option.
It could also find uses for renaming and removing labels
from federation.
The main purpose of this is to allow for blacklisting
of expensive metrics as a tactical option.
It could also find uses for renaming and removing labels
from federation.
Figuring out what's going on with the new service discovery
and labels is difficult. Add a popover with the labels
to the target table to make things simpler, and help
discovery of potentially useful labels.
This change is conceptually very simple, although the diff is large. It
switches logging from "github.com/golang/glog" to
"github.com/prometheus/log", while not actually changing any log
messages. V(1)-style logging has been changed to be log.Debug*().
Appending to the storage can block for a long time. Timing out
scrapes can also cause longer blocks. This commit avoids that those
blocks affect other compnents than the target itself.
Also the Target interface was removed.
The target implementation and interface contain methods only serving a
specific purpose of the templates. They were moved to the template
as they operate on more fundamental target data.
This commits adds file based service discovery which reads target
groups from specified files. It detects changes based on file watches
and regular refreshes.
With this commit, sending SIGHUP to the Prometheus process will reload
and apply the configuration file. The different components attempt
to handle failing changes gracefully.
This commit adds a relabelling stage on the set of base
labels from which a target is created. It allows to drop
targets and rewrite any regular or internal label.
This commit changes the configuration interface from job configs to scrape
configs. This includes allowing multiple ways of target definition at once
and moving DNS SD to its own config message. DNS SD can now contain multiple
DNS names per configured discovery.
This commit shifts responsibility for maintaining targets from providers and
pools to the target manager. Target groups have a source name that identifies
them for updates.
/api/targets was undocumented and never used and also broken.
Showing instance and job labels on the status page (next to targets)
does not make sense as those labels are set in an obvious way.
Also add a doc comment to TargetStateToClass.
The one central sample ingestion channel has caused a variety of
trouble. This commit removes it. Targets and rule evaluation call an
Append method directly now. To incorporate multiple storage backends
(like OpenTSDB), storage.Tee forks the Append into two different
appenders.
Note that the tsdb queue manager had its own queue anyway. It was a
queue after a queue... Much queue, so overhead...
Targets have their own little buffer (implemented as a channel) to
avoid stalling during an http scrape. But a new scrape will only be
started once the old one is fully ingested.
The contraption of three pipelined ingesters was removed. A Target is
an ingester itself now. Despite more logic in Target, things should be
less confusing now.
Also, remove lint and vet warnings in ast.go.
The current wording suggests that a target is not reachable at all,
although it might also get set when the target was reachable, but there
was some other error during the scrape (invalid headers or invalid
scrape content). UNHEALTHY is a more general wording that includes all
these cases.
For consistency, ALIVE is also renamed to HEALTHY.
This is now not even trying to throttle in a benign way, but creates a
fully-fledged error. Advantage: It shows up very visible on the status
page. Disadvantage: The server does not really adjusts to a lower
scraping rate. However, if your ingestion backs up, you are in a very
irregulare state, I'd say it _should_ be considered an error and not
dealt with in a more graceful way.
In different news: I'll work on optimizing ingestion so that we will
not as easily run into that situation in the first place.
The simple algorithm applied here will increase the actual interval
incrementally, whenever and as long as the scrape itself takes longer
than the configured interval. Once it takes shorter again, the actual
interval will iteratively decrease again.
- Move CONTRIBUTORS.md to the more common AUTHORS.
- Added the required NOTICE file.
- Changed "Prometheus Team" to "The Prometheus Authors".
- Reverted the erroneous changes to the Apache License.
The "Address" is actually a URL which may contain username and
password. Calling this Address is misleading so we rename it.
Change-Id: I441c7ab9dfa2ceedc67cde7a47e6843a65f60511
Essentially:
- Remove unused code.
- Make it 'go vet' clean. The only remaining warnings are in generated code.
- Make it 'golint' clean. The only remaining warnings are in gerenated code.
- Smoothed out same minor things.
Change-Id: I3fe5c1fbead27b0e7a9c247fee2f5a45bc2d42c6
Also, fix problems in shutdown.
Starting serving and shutdown still has to be cleaned up properly.
It's a mess.
Change-Id: I51061db12064e434066446e6fceac32741c4f84c
- Always spell out the time unit (e.g. milliseconds instead of ms).
- Remove "_total" from the names of metrics that are not counters.
- Make use of the "Namespace" and "Subsystem" fields in the options.
- Removed the "capacity" facet from all metrics about channels/queues.
These are all fixed via command line flags and will never change
during the runtime of a process. Also, they should not be part of
the same metric family. I have added separate metrics for the
capacity of queues as convenience. (They will never change and are
only set once.)
- I left "metric_disk_latency_microseconds" unchanged, although that
metric measures the latency of the storage device, even if it is not
a spinning disk. "SSD" is read by many as "solid state disk", so
it's not too far off. (It should be "solid state drive", of course,
but "metric_drive_latency_microseconds" is probably confusing.)
- Brian suggested to not mix "failure" and "success" outcome in the
same metric family (distinguished by labels). For now, I left it as
it is. We are touching some bigger issue here, especially as other
parts in the Prometheus ecosystem are following the same
principle. We still need to come to terms here and then change
things consistently everywhere.
Change-Id: If799458b450d18f78500f05990301c12525197d3
Having metrics with variable timestamps inconsistently
spaced when things fail will make it harder to write correct rules.
Update status page, requires some refactoring to insert a function.
Change-Id: Ie1c586cca53b8f3b318af8c21c418873063738a8
Now that the subtle bug in matttproud/golang_protobuf_extensions is
fixed, we do not need to copy the bytes of a scrape into a buffer
first before starting to parse it.
Change-Id: Ib73ecae16173ddd219cda56388a8f853332f8853