Various doc tweaks (#8111)
Signed-off-by: Anthony D'Atri <anthony.datri@gmail.com>

README.md

@@ -13,19 +13,18 @@ examples and guides.
 Prometheus, a [Cloud Native Computing Foundation](https://cncf.io/) project, is a systems and service monitoring system. It collects metrics
 from configured targets at given intervals, evaluates rule expressions,
-displays the results, and can trigger alerts if some condition is observed
-to be true.
+displays the results, and can trigger alerts when specified conditions are observed.
 
-Prometheus's main distinguishing features as compared to other monitoring systems are:
+The features that distinguish Prometheus from other metrics and monitoring systems are:
 
-- a **multi-dimensional** data model (time series defined by metric name and set of key/value dimensions)
+- A **multi-dimensional** data model (time series defined by metric name and set of key/value dimensions)
 - PromQL, a **powerful and flexible query language** to leverage this dimensionality
-- no dependency on distributed storage; **single server nodes are autonomous**
-- time series collection happens via a **pull model** over HTTP
-- **pushing time series** is supported via an intermediary gateway
-- targets are discovered via **service discovery** or **static configuration**
-- multiple modes of **graphing and dashboarding support**
-- support for hierarchical and horizontal **federation**
+- No dependency on distributed storage; **single server nodes are autonomous**
+- An HTTP **pull model** for time series collection
+- **Pushing time series** is supported via an intermediary gateway for batch jobs
+- Targets are discovered via **service discovery** or **static configuration**
+- Multiple modes of **graphing and dashboarding support**
+- Support for hierarchical and horizontal **federation**
 
 ## Architecture overview

@@ -56,9 +55,9 @@ Prometheus will now be reachable at http://localhost:9090/.
 ### Building from source
 
-To build Prometheus from the source code yourself you need to have a working
+To build Prometheus from source code, first ensure that you have a working
 Go environment with [version 1.13 or greater installed](https://golang.org/doc/install).
-You will also need to have [Node.js](https://nodejs.org/) and [Yarn](https://yarnpkg.com/)
+You also need [Node.js](https://nodejs.org/) and [Yarn](https://yarnpkg.com/)
 installed in order to build the frontend assets.
 
 You can directly use the `go` tool to download and install the `prometheus`

@@ -6,9 +6,9 @@ sort_rank: 1
 # Getting started
 
 This guide is a "Hello World"-style tutorial which shows how to install,
-configure, and use Prometheus in a simple example setup. You will download and run
+configure, and use a simple Prometheus instance. You will download and run
 Prometheus locally, configure it to scrape itself and an example application,
-and then work with queries, rules, and graphs to make use of the collected time
+then work with queries, rules, and graphs to use collected time
 series data.
 
 ## Downloading and running Prometheus

@@ -25,12 +25,12 @@ Before starting Prometheus, let's configure it.
 ## Configuring Prometheus to monitor itself
 
-Prometheus collects metrics from monitored targets by scraping metrics HTTP
-endpoints on these targets. Since Prometheus also exposes data in the same
+Prometheus collects metrics from _targets_ by scraping metrics HTTP
+endpoints. Since Prometheus exposes data in the same
 manner about itself, it can also scrape and monitor its own health.
 
 While a Prometheus server that collects only data about itself is not very
-useful in practice, it is a good starting example. Save the following basic
+useful, it is a good starting example. Save the following basic
 Prometheus configuration as a file named `prometheus.yml`:
 
 ```yaml
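# A sketch of the kind of minimal self-scrape configuration this section
# describes (values here are illustrative; the guide's actual file may differ):
global:
  scrape_interval: 15s     # how frequently to scrape targets
  evaluation_interval: 15s # how frequently to evaluate rules

scrape_configs:
  # Prometheus scraping its own /metrics endpoint.
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']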

@@ -79,26 +79,26 @@ navigating to its metrics endpoint:
 ## Using the expression browser
 
-Let us try looking at some data that Prometheus has collected about itself. To
+Let us explore data that Prometheus has collected about itself. To
 use Prometheus's built-in expression browser, navigate to
 http://localhost:9090/graph and choose the "Console" view within the "Graph" tab.
 
 As you can gather from [localhost:9090/metrics](http://localhost:9090/metrics),
-one metric that Prometheus exports about itself is called
+one metric that Prometheus exports about itself is named
 `prometheus_target_interval_length_seconds` (the actual amount of time between
-target scrapes). Go ahead and enter this into the expression console and then click "Execute":
+target scrapes). Enter the following into the expression console and then click "Execute":
 
 ```
 prometheus_target_interval_length_seconds
 ```
 
 This should return a number of different time series (along with the latest value
-recorded for each), all with the metric name
+recorded for each), each with the metric name
 `prometheus_target_interval_length_seconds`, but with different labels. These
 labels designate different latency percentiles and target group intervals.
 
-If we were only interested in the 99th percentile latencies, we could use this
-query to retrieve that information:
+If we are interested only in 99th percentile latencies, we could use this
+query:
 
 ```
 prometheus_target_interval_length_seconds{quantile="0.99"}

@@ -129,8 +129,7 @@ Experiment with the graph range parameters and other settings.
 ## Starting up some sample targets
 
-Let us make this more interesting and start some example targets for Prometheus
-to scrape.
+Let's add additional targets for Prometheus to scrape.
 
 The Node Exporter is used as an example target; for more information on using it,
 [see these instructions.](https://prometheus.io/docs/guides/node-exporter/)

@@ -151,7 +150,7 @@ http://localhost:8081/metrics, and http://localhost:8082/metrics.
 ## Configure Prometheus to monitor the sample targets
 
 Now we will configure Prometheus to scrape these new targets. Let's group all
-three endpoints into one job called `node`. However, imagine that the
+three endpoints into one job called `node`. We will imagine that the
 first two endpoints are production targets, while the third one represents a
 canary instance. To model this in Prometheus, we can add several groups of
 endpoints to a single job, adding extra labels to each group of targets. In
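
A sketch of such a grouped scrape configuration (the ports and `group` label values here are illustrative, following the three example endpoints):

```yaml
scrape_configs:
  - job_name: 'node'
    static_configs:
      # Two production targets and one canary, distinguished by an extra label.
      - targets: ['localhost:8080', 'localhost:8081']
        labels:
          group: 'production'
      - targets: ['localhost:8082']
        labels:
          group: 'canary'
```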

@@ -185,8 +184,8 @@ about time series that these example endpoints expose, such as `node_cpu_seconds
 Though not a problem in our example, queries that aggregate over thousands of
 time series can get slow when computed ad-hoc. To make this more efficient,
-Prometheus allows you to prerecord expressions into completely new persisted
-time series via configured recording rules. Let's say we are interested in
+Prometheus can prerecord expressions into new persisted
+time series via configured _recording rules_. Let's say we are interested in
 recording the per-second rate of cpu time (`node_cpu_seconds_total`) averaged
 over all cpus per instance (but preserving the `job`, `instance` and `mode`
 dimensions) as measured over a window of 5 minutes. We could write this as:
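
As an illustrative sketch (the group and rule names below are examples, not necessarily the ones used later in the guide), such a rule could be expressed in a rule file like this:

```yaml
groups:
  - name: example-cpu-rules
    rules:
      # Average per-second CPU rate over 5m, aggregated across CPUs but
      # keeping the job, instance and mode labels.
      - record: job_instance_mode:node_cpu_seconds:avg_rate5m
        expr: avg by (job, instance, mode) (rate(node_cpu_seconds_total[5m]))
```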

@@ -5,7 +5,7 @@ sort_rank: 7
 # Management API
 
-Prometheus provides a set of management API to ease automation and integrations.
+Prometheus provides a set of management APIs to facilitate automation and integration.
 
 
 ### Health check

@@ -11,10 +11,10 @@ This document offers guidance on migrating from Prometheus 1.8 to Prometheus 2.0
 ## Flags
 
-The format of the Prometheus command line flags has changed. Instead of a
+The format of Prometheus command line flags has changed. Instead of a
 single dash, all flags now use a double dash. Common flags (`--config.file`,
-`--web.listen-address` and `--web.external-url`) are still the same but beyond
-that, almost all the storage-related flags have been removed.
+`--web.listen-address` and `--web.external-url`) remain but
+almost all storage-related flags have been removed.
 
 Some notable flags which have been removed:

@@ -27,11 +27,11 @@ Some notable flags which have been removed:
 - `-query.staleness-delta` has been renamed to `--query.lookback-delta`; Prometheus
 2.0 introduces a new mechanism for handling staleness, see [staleness](querying/basics.md#staleness).
 
-- `-storage.local.*` Prometheus 2.0 introduces a new storage engine, as such all
+- `-storage.local.*` Prometheus 2.0 introduces a new storage engine; as such all
 flags relating to the old engine have been removed. For information on the
 new engine, see [Storage](#storage).
 
-- `-storage.remote.*` Prometheus 2.0 has removed the already deprecated remote
+- `-storage.remote.*` Prometheus 2.0 has removed the deprecated remote
 storage flags, and will fail to start if they are supplied. To write to
 InfluxDB, Graphite, or OpenTSDB use the relevant storage adapter.

@@ -9,15 +9,22 @@ Prometheus includes a local on-disk time series database, but also optionally in
 ## Local storage
 
-Prometheus's local time series database stores time series data in a custom format on disk.
+Prometheus's local time series database stores data in a custom, highly efficient format on local storage.
 
 ### On-disk layout
 
 Ingested samples are grouped into blocks of two hours. Each two-hour block consists of a directory containing one or more chunk files that contain all time series samples for that window of time, as well as a metadata file and index file (which indexes metric names and labels to time series in the chunk files). When series are deleted via the API, deletion records are stored in separate tombstone files (instead of deleting the data immediately from the chunk files).
 
-The block for currently incoming samples is kept in memory and not fully persisted yet. It is secured against crashes by a write-ahead log (WAL) that can be replayed when the Prometheus server restarts after a crash. Write-ahead log files are stored in the `wal` directory in 128MB segments. These files contain raw data that has not been compacted yet, so they are significantly larger than regular block files. Prometheus will keep a minimum of 3 write-ahead log files, however high-traffic servers may see more than three WAL files since it needs to keep at least two hours worth of raw data.
+The current block for incoming samples is kept in memory and is not fully
+persisted. It is secured against crashes by a write-ahead log (WAL) that can be
+replayed when the Prometheus server restarts. Write-ahead log files are stored
+in the `wal` directory in 128MB segments. These files contain raw data that
+has not yet been compacted; thus they are significantly larger than regular block
+files. Prometheus will retain a minimum of three write-ahead log files.
+High-traffic servers may retain more than three WAL files in order to keep at
+least two hours of raw data.
 
-The directory structure of a Prometheus server's data directory will look something like this:
+A Prometheus server's data directory looks something like this:
 
 ```
 ./data
|
|||
```
|
||||
|
||||
|
||||
-Note that a limitation of the local storage is that it is not clustered or replicated. Thus, it is not arbitrarily scalable or durable in the face of disk or node outages and should be treated as you would any other kind of single node database. Using RAID for disk availability, [snapshots](querying/api.md#snapshot) for backups, capacity planning, etc, is recommended for improved durability. With proper storage durability and planning, storing years of data in the local storage is possible.
+Note that a limitation of local storage is that it is not clustered or
+replicated. Thus, it is not arbitrarily scalable or durable in the face of
+drive or node outages and should be managed like any other single node
+database. The use of RAID is suggested for storage availability, and [snapshots](querying/api.md#snapshot)
+are recommended for backups. With proper
+architecture, it is possible to retain years of data in local storage.
 
 Alternatively, external storage may be used via the [remote read/write APIs](https://prometheus.io/docs/operating/integrations/#remote-endpoints-and-storage). Careful evaluation is required for these systems as they vary greatly in durability, performance, and efficiency.

@@ -56,37 +68,46 @@ For further details on file format, see [TSDB format](/tsdb/docs/format/README.m
 The initial two-hour blocks are eventually compacted into longer blocks in the background.
 
-Compaction will create larger blocks up to 10% of the retention time, or 31 days, whichever is smaller.
+Compaction will create larger blocks containing data spanning up to 10% of the retention time, or 31 days, whichever is smaller.
 
 ## Operational aspects
 
-Prometheus has several flags that allow configuring the local storage. The most important ones are:
+Prometheus has several flags that configure local storage. The most important are:
 
-* `--storage.tsdb.path`: This determines where Prometheus writes its database. Defaults to `data/`.
-* `--storage.tsdb.retention.time`: This determines when to remove old data. Defaults to `15d`. Overrides `storage.tsdb.retention` if this flag is set to anything other than default.
-* `--storage.tsdb.retention.size`: [EXPERIMENTAL] This determines the maximum number of bytes that storage blocks can use. The oldest data will be removed first. Defaults to `0` or disabled. This flag is experimental and can be changed in future releases. Units supported: B, KB, MB, GB, TB, PB, EB. Ex: "512MB"
-* `--storage.tsdb.retention`: This flag has been deprecated in favour of `storage.tsdb.retention.time`.
-* `--storage.tsdb.wal-compression`: This flag enables compression of the write-ahead log (WAL). Depending on your data, you can expect the WAL size to be halved with little extra cpu load. This flag was introduced in 2.11.0 and enabled by default in 2.20.0. Note that once enabled, downgrading Prometheus to a version below 2.11.0 will require deleting the WAL.
+* `--storage.tsdb.path`: Where Prometheus writes its database. Defaults to `data/`.
+* `--storage.tsdb.retention.time`: When to remove old data. Defaults to `15d`. Overrides `storage.tsdb.retention` if this flag is set to anything other than default.
+* `--storage.tsdb.retention.size`: [EXPERIMENTAL] The maximum number of bytes of storage blocks to retain. The oldest data will be removed first. Defaults to `0` or disabled. This flag is experimental and may change in future releases. Units supported: B, KB, MB, GB, TB, PB, EB. Ex: "512MB"
+* `--storage.tsdb.retention`: Deprecated in favor of `storage.tsdb.retention.time`.
+* `--storage.tsdb.wal-compression`: Enables compression of the write-ahead log (WAL). Depending on your data, you can expect the WAL size to be halved with little extra cpu load. This flag was introduced in 2.11.0 and enabled by default in 2.20.0. Note that once enabled, downgrading Prometheus to a version below 2.11.0 will require deleting the WAL.
 
-On average, Prometheus uses only around 1-2 bytes per sample. Thus, to plan the capacity of a Prometheus server, you can use the rough formula:
+Prometheus stores an average of only 1-2 bytes per sample. Thus, to plan the capacity of a Prometheus server, you can use the rough formula:
 
 ```
 needed_disk_space = retention_time_seconds * ingested_samples_per_second * bytes_per_sample
 ```
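
For example, with the default `15d` retention (1,296,000 seconds), an assumed ingestion rate of 100,000 samples per second, and 2 bytes per sample, the formula gives roughly 1,296,000 * 100,000 * 2 ≈ 259 GB of needed disk space (the ingestion rate here is only an illustrative figure).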
 
-To tune the rate of ingested samples per second, you can either reduce the number of time series you scrape (fewer targets or fewer series per target), or you can increase the scrape interval. However, reducing the number of series is likely more effective, due to compression of samples within a series.
+To lower the rate of ingested samples, you can either reduce the number of time series you scrape (fewer targets or fewer series per target), or you can increase the scrape interval. However, reducing the number of series is likely more effective, due to compression of samples within a series.
 
-If your local storage becomes corrupted for whatever reason, your best bet is to shut down Prometheus and remove the entire storage directory. You can try removing individual block directories, or WAL directory, to resolve the problem, this means losing a time window of around two hours worth of data per block directory. Again, Prometheus's local storage is not meant as durable long-term storage.
+If your local storage becomes corrupted for whatever reason, the best
+strategy to address the problem is to shut down Prometheus, then remove the
+entire storage directory. You can also try removing individual block directories,
+or the WAL directory, to resolve the problem. Note that this means losing
+approximately two hours of data per block directory. Again, Prometheus's local
+storage is not intended to be durable long-term storage; external solutions
+offer extended retention and data durability.
 
-CAUTION: Non-POSIX compliant filesystems are not supported by Prometheus' local storage as unrecoverable corruptions may happen. NFS filesystems (including AWS's EFS) are not supported. NFS could be POSIX-compliant, but most implementations are not. It is strongly recommended to use a local filesystem for reliability.
+CAUTION: Non-POSIX compliant filesystems are not supported for Prometheus' local storage as unrecoverable corruptions may happen. NFS filesystems (including AWS's EFS) are not supported. NFS could be POSIX-compliant, but most implementations are not. It is strongly recommended to use a local filesystem for reliability.
 
-If both time and size retention policies are specified, whichever policy triggers first will be used at that instant.
+If both time and size retention policies are specified, whichever triggers first
+will be used.
 
-Expired block cleanup happens on a background schedule. It may take up to two hours to remove expired blocks. Expired blocks must be fully expired before they are cleaned up.
+Expired block cleanup happens in the background. It may take up to two hours to remove expired blocks. Blocks must be fully expired before they are removed.
 
 ## Remote storage integrations
 
-Prometheus's local storage is limited by single nodes in its scalability and durability. Instead of trying to solve clustered storage in Prometheus itself, Prometheus has a set of interfaces that allow integrating with remote storage systems.
+Prometheus's local storage is limited to a single node's scalability and durability.
+Instead of trying to solve clustered storage in Prometheus itself, Prometheus offers
+a set of interfaces that allow integrating with remote storage systems.
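
For illustration, remote integrations are wired up in `prometheus.yml` roughly along these lines (the endpoint URL is a placeholder, not a real service):

```yaml
remote_write:
  - url: "https://remote-storage.example.org/api/v1/write"

remote_read:
  - url: "https://remote-storage.example.org/api/v1/read"
```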
 
 ### Overview