PromQL Workshop – Part 2: Cheat Sheet Deep Dive
This workshop is based on the PromLabs PromQL Cheat Sheet. Each section below introduces a PromQL concept or function, provides a brief explanation, and includes practical examples.
PromQL Metric Types Overview
Prometheus instrumentation client libraries allow you to instrument your services with four different metric types, which offer different API methods for recording values.
Metric Types
A quick summary of metric types:
- Counters track values that can only ever go up, like HTTP request counts or CPU seconds used.
- Gauges track values that can go up or down, like temperatures or disk space.
- Summaries track client-side-calculated quantiles from a set of observations, like request latency percentiles. They also track the total count and total sum of observations.
- Histograms track bucketed histograms of a set of observations like request latencies. They also track the total count and total sum of observations.
Note: Histogram and Summary metrics can both be used for calculating quantiles, but have different trade-offs. The most important one is that Summary metrics cannot be aggregated over dimensions or multiple instances. See the Prometheus documentation for more information.
Metric Types Meaning and Serialization
Let’s have a brief look at the meaning of each metric type and when you would use it:
Counters
Counters are for tracking cumulative totals over time, like the total number of HTTP requests that have been handled so far, the total number of seconds spent handling requests, or the number of errors that have occurred. A counter only decreases when the process that exposes it restarts, in which case its value is reset to 0.
Example serialization:
# HELP http_requests_total The total number of HTTP requests.
# TYPE http_requests_total counter
http_requests_total{method="post",code="200"} 1027
http_requests_total{method="post",code="400"} 3
Gauges
Gauges are for tracking current tallies, or things that can naturally go up or down over time, like memory usage, a queue length, the number of in-flight requests, or a temperature.
Example serialization:
# HELP process_open_fds Number of open file descriptors.
# TYPE process_open_fds gauge
process_open_fds 15
Summaries
Summaries allow you to track the distribution of a set of observed values (most commonly request latencies) as a set of quantiles. A quantile is the same as a percentile, but expressed on a scale from 0 to 1 instead of 0 to 100, so:
- Quantile 0.5 is the 50th percentile
- Quantile 0.99 is the 99th percentile
- …and so on.
A summary metric without any further instrumentation labels (but a few configured output quantiles) would get serialized like this in the exposition format, with the quantile label indicating the quantile:
# HELP rpc_duration_seconds A summary of RPC durations in seconds.
# TYPE rpc_duration_seconds summary
rpc_duration_seconds{quantile="0.01"} 3.102
rpc_duration_seconds{quantile="0.05"} 3.272
rpc_duration_seconds{quantile="0.5"} 4.773
rpc_duration_seconds{quantile="0.9"} 9.001
rpc_duration_seconds{quantile="0.99"} 76.656
rpc_duration_seconds_sum 5.7560473e+04
rpc_duration_seconds_count 2693
NOTE: Besides tracking value distributions in buckets or quantiles, histogram and summary metrics also track the total count and the total sum of observations as a by-product:
- <basename>_sum tracks the total sum of all observed values. In a request latency histogram, this represents the total number of seconds spent handling requests since process start.
- <basename>_count tracks the total count of all observed values. In a request latency histogram, this represents the total number of requests handled so far.
Often you will be interested in tracking these counts anyway, especially for calculating request rates from the <basename>_count metric. If you already have a histogram or summary metric that tracks your request latencies, you don't need to create separate counter metrics for tracking these counts.
Histograms
Histograms allow you to track the distribution of a set of observed values (most commonly request latencies) across a set of buckets. They also track the total number of observed values, as well as the cumulative sum of the observed values.
A histogram metric without any further instrumentation labels would get serialized as a list of counter series (one per bucket), with an le label indicating the latency upper bound of each bucket counter:
# HELP http_request_duration_seconds A histogram of the request duration.
# TYPE http_request_duration_seconds histogram
http_request_duration_seconds_bucket{le="0.025"} 20
http_request_duration_seconds_bucket{le="0.05"} 60
http_request_duration_seconds_bucket{le="0.1"} 90
http_request_duration_seconds_bucket{le="0.25"} 100
http_request_duration_seconds_bucket{le="+Inf"} 105
http_request_duration_seconds_sum 21.322
http_request_duration_seconds_count 105
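One detail worth internalizing from this serialization: the buckets are cumulative, i.e. each le bucket counts all observations less than or equal to its upper bound. A short Python sketch, reusing the counts above, recovers the per-bucket counts:

```python
# Prometheus histogram buckets are cumulative: each "le" bucket counts ALL
# observations <= its upper bound. Subtracting adjacent cumulative values
# yields the count per individual bucket. Data from the example above.

cumulative = [
    (0.025, 20),
    (0.05, 60),
    (0.1, 90),
    (0.25, 100),
    (float("inf"), 105),
]

per_bucket = []
prev = 0
for le, count in cumulative:
    per_bucket.append(count - prev)
    prev = count

print(per_bucket)  # [20, 40, 30, 10, 5]
```
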
When to Use Each Metric Type
Choosing the right metric type is essential for effective monitoring and querying in Prometheus. Here’s a practical guide:
Counters
- Use when: You want to track a value that only ever increases (except for resets on restart).
- Examples: Total HTTP requests, total errors, total bytes sent.
- Best for: Rates, totals, and alerting on increases over time.
- Do not use for: Values that can decrease (e.g., current memory usage).
Gauges
- Use when: You want to track a value that can go up and down.
- Examples: Current memory usage, temperature, number of active connections, queue length.
- Best for: Current state, resource utilization, and values that fluctuate.
- Do not use for: Cumulative totals or strictly increasing values.
Summaries
- Use when: You need quantiles (e.g., median, 99th percentile) of observed values, calculated on the client side.
- Examples: Request latency percentiles, response size percentiles.
- Best for: Per-instance quantile calculations.
- Do not use for: Aggregating quantiles across multiple instances or labels (not possible with summaries).
Histograms
- Use when: You want to observe the distribution of values and need to aggregate quantiles across instances or labels.
- Examples: Request latency distributions, response size distributions.
- Best for: Calculating quantiles (using histogram_quantile()), aggregating across multiple dimensions, and tracking distributions.
- Do not use for: Exact quantile calculation per instance (use summaries if you need client-side quantiles).
Summary Table
| Metric Type | Use For | Avoid For |
|---|---|---|
| Counter | Totals, rates, ever-increasing values | Values that can decrease |
| Gauge | Current state, fluctuating values | Cumulative totals |
| Summary | Per-instance quantiles | Aggregated quantiles across instances |
| Histogram | Aggregated quantiles, distributions | Exact per-instance quantiles |
Inspecting Metrics Endpoints
When you navigate to the /metrics endpoint of an instrumented service (like Prometheus itself) in a browser, you can inspect these metrics manually.
For more details, see the Prometheus exposition formats documentation.
1. Selectors
1.1. Instant Vector Selector
http_requests_total
Returns the latest value for each time series of the metric.
1.2. Range Vector Selector
http_requests_total[5m]
Returns all values in the last 5 minutes for each time series.
1.3. Label Matchers
http_requests_total{job="api", status!="500"}
Selects time series where job is "api" and status is not "500".
1.4 Selecting Series
Selecting series is the foundation of any PromQL query. You can select the latest sample, a range of samples, or filter series based on label values and regular expressions.
Basic Series Selection
- Select latest sample for series with a given metric name:
node_cpu_seconds_total
- Select 5-minute range of samples for series with a given metric name:
node_cpu_seconds_total[5m]
Filtering with Label Matchers
You can refine your series selection by specifying label values.
- Only series with given label values:
node_cpu_seconds_total{cpu="0",mode="idle"}
Filtering series by value
You can filter time series based on their sample values using comparison operators. This allows you to keep only those series that meet certain value conditions, or to return a boolean result (0 or 1) for each series.
Common comparison operators:
| Operator | Description |
|---|---|
| == | Equal to |
| != | Not equal to |
| > | Greater than |
| < | Less than |
| >= | Greater than or equal to |
| <= | Less than or equal to |
Examples:
- Keep only series with a value greater than a threshold:
node_filesystem_avail_bytes > 10*1024*1024
Returns only those series where available bytes are greater than 10 MB.
- Compare two metrics and keep only series where the left is greater than the right:
go_goroutines > go_threads
Returns only those series where go_goroutines is greater than go_threads for matching labels.
- Return 0 or 1 for each series using the bool modifier:
up == bool 0
Returns 1 for targets that are down, 0 otherwise.
Tip:
Use value filtering to focus on series that are above or below thresholds, or to create binary series.
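As a rough mental model, the following Python sketch (with made-up series and values) shows the difference between a plain value filter and the bool modifier:

```python
# Sketch of value filtering vs. the bool modifier, using a dict of
# {labels: latest value} to stand in for an instant vector (made-up data).

series = {'instance="a"': 15.0, 'instance="b"': 3.0, 'instance="c"': 42.0}
threshold = 10.0

# `metric > 10` keeps matching series and drops the rest:
filtered = {labels: v for labels, v in series.items() if v > threshold}

# `metric > bool 10` keeps every series but maps each value to 0 or 1:
boolean = {labels: float(v > threshold) for labels, v in series.items()}

print(filtered)  # only instances a and c survive
print(boolean)   # a -> 1.0, b -> 0.0, c -> 1.0
```
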
Complex Label Matchers
PromQL provides several label matchers for more advanced filtering:
| Operator | Description | Example |
|---|---|---|
| = | Equality | node_cpu_seconds_total{cpu="0"} |
| != | Non-equality | node_cpu_seconds_total{cpu!="0"} |
| =~ | Regex match | node_cpu_seconds_total{mode=~"user\|system"} |
| !~ | Negative regex match | node_cpu_seconds_total{mode!~"idle\|iowait"} |
- Example with complex label matchers:
node_cpu_seconds_total{cpu!="0",mode=~"user|system"}
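Note that PromQL regex matchers are fully anchored: the pattern must match the entire label value, not just a substring. Python's re.fullmatch reproduces this behaviour:

```python
import re

# PromQL regex matchers (=~, !~) are fully anchored: mode=~"user|system"
# must match the WHOLE label value, not a substring. Python's re.fullmatch
# behaves the same way.

pattern = re.compile("user|system")

for mode in ["user", "system", "userspace", "idle"]:
    print(mode, bool(pattern.fullmatch(mode)))
# user and system match; "userspace" does NOT, despite containing "user".
```
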
Offset Modifier
The offset modifier allows you to shift the time for a query, useful for comparing current data with past data.
- Select data from one day ago and shift it to the current time:
process_resident_memory_bytes offset 1d
2. Operators
2.1. Arithmetic Operators
rate(http_requests_total[5m]) * 100
Multiplies the per-second rate by 100.
2.2. Comparison Operators
You can filter series based on their sample values using comparison operators. These can also be used to return 0 or 1 for each compared series instead of filtering.
Operator | Description |
---|---|
== |
Equal to |
!= |
Not equal to |
> |
Greater than |
< |
Less than |
>= |
Greater than or equal to |
<= |
Less than or equal to |
- Only keep series with a sample value greater than a given number:
node_filesystem_avail_bytes > 10*1024*1024
- Only keep series from the left-hand side whose sample values are larger than their right-hand-side matches:
go_goroutines > go_threads
- Instead of filtering, return 0 or 1 for each compared series:
go_goroutines > bool go_threads
2.3. Boolean Modifier
up == bool 0
Returns 1 for targets that are down and 0 for targets that are up, instead of filtering out non-matching series.
2.4. Set Operators
up or http_requests_total
up and http_requests_total
up unless http_requests_total
Combines or filters time series sets.
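The matching rules can be sketched in Python, modelling each instant vector as a dict keyed by its label set (made-up data):

```python
# Sketch of PromQL set operators on instant vectors, modelled as dicts
# keyed by label set (made-up data). Matching is on identical label sets.

lhs = {'{job="api",instance="1"}': 1.0, '{job="api",instance="2"}': 0.0}
rhs = {'{job="api",instance="1"}': 7.0, '{job="api",instance="3"}': 9.0}

# `lhs and rhs`: LHS series that have a match in RHS (LHS values kept).
v_and = {k: v for k, v in lhs.items() if k in rhs}

# `lhs or rhs`: all LHS series, plus RHS series with no LHS match.
v_or = dict(lhs)
for k, v in rhs.items():
    if k not in v_or:
        v_or[k] = v

# `lhs unless rhs`: LHS series with no match in RHS.
v_unless = {k: v for k, v in lhs.items() if k not in rhs}

print(v_and)     # instance 1 only, with the LHS value
print(v_or)      # instances 1, 2 and 3
print(v_unless)  # instance 2 only
```
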
3. Aggregation Operators
3.1. sum()
sum(http_requests_total)
Sums all values.
3.2. sum by (label)
sum by (job) (http_requests_total)
Sums values grouped by job.
3.3. count()
count(up)
Counts the number of time series.
3.4. avg()
avg(http_requests_total)
Averages all values.
3.5. min() / max()
min(up)
max(up)
Finds the minimum or maximum value.
3.6. stddev() / stdvar()
stddev(http_requests_total)
stdvar(http_requests_total)
Standard deviation and variance.
3.7. topk() / bottomk()
topk(3, http_requests_total)
bottomk(3, http_requests_total)
Top or bottom 3 time series by value.
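Conceptually, topk and bottomk select the K series with the highest or lowest current value, as in this Python sketch with made-up totals:

```python
import heapq

# Sketch of topk/bottomk: pick the K series with the highest or lowest
# current value (made-up request totals, keyed by job).

values = {"api": 1027, "web": 553, "db": 89, "cache": 2048}

top3 = heapq.nlargest(3, values.items(), key=lambda kv: kv[1])
bottom3 = heapq.nsmallest(3, values.items(), key=lambda kv: kv[1])

print(top3)     # cache, api, web
print(bottom3)  # db, web, api
```
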
4. Functions: Rates of Increase for Counters
Prometheus counters (like http_requests_total) only ever increase (except when they reset, e.g., on process restart). PromQL provides several functions to analyze how fast these counters are increasing over time. The most common are rate(), irate(), and increase().
rate()
- Purpose: Calculates the average per-second rate of increase of a counter over a specified time range.
- Usage: Best for dashboards and alerting, as it smooths out short-term fluctuations.
- Example:
rate(http_requests_total[5m])
Returns the average number of requests per second over the last 5 minutes.
irate()
- Purpose: Calculates the “instantaneous” per-second rate of increase, using only the two most recent data points in the range.
- Usage: Best for detecting sudden spikes or drops; more sensitive to short-term changes.
- Example:
irate(http_requests_total[5m])
Returns the per-second rate based on the last two samples in the last 5 minutes.
increase()
- Purpose: Calculates the total increase in the counter over the specified time range.
- Usage: Useful for totals over a period (e.g., “How many requests in the last hour?”).
- Example:
increase(http_requests_total[1h])
Returns the total number of requests that occurred in the last hour.
Comparison Table
| Function | What it shows | Use case example |
|---|---|---|
| rate() | Average per-second rate (smoothed) | Dashboards, alerting, trends |
| irate() | Instantaneous per-second rate | Detecting spikes, troubleshooting |
| increase() | Total increase over time range | Counting events in a period |
Practical Examples
- Show the average request rate per job:
sum by (job) (rate(http_requests_total[5m]))
- Detect sudden spikes in error responses:
irate(http_requests_total{status="500"}[1m])
- Count total requests in the last 10 minutes:
increase(http_requests_total[10m])
Tip:
- Use rate() for most monitoring and alerting scenarios.
- Use irate() when you need to see immediate changes.
- Use increase() when you want a total count over a period.
5. Functions: Changes in Gauges
Gauges are metrics that can go up and down (e.g., memory usage, temperature). PromQL provides functions to analyze how these values change over time.
delta()
- Purpose: Calculates the difference between the first and last value of a gauge within a specified time range.
- Usage: Useful for understanding the net change over a period.
- Example:
delta(memory_usage_bytes[10m])
Returns the net change in memory usage over the last 10 minutes.
deriv()
- Purpose: Estimates the per-second derivative (rate of change) of a gauge, using linear regression.
- Usage: Useful for identifying trends or the speed of change.
- Example:
deriv(memory_usage_bytes[5m])
Returns the per-second rate of change in memory usage over the last 5 minutes.
idelta()
- Purpose: Calculates the difference between the last two samples in the range.
- Usage: Useful for detecting sudden jumps or drops.
- Example:
idelta(memory_usage_bytes[5m])
Returns the difference between the last two memory usage samples in the last 5 minutes.
Practical Examples
- Show the net change in available disk space over the last 15 minutes:
delta(node_filesystem_avail_bytes[15m])
- Show the rate of change in active connections:
deriv(active_connections[10m])
- Detect sudden drops in available memory:
idelta(node_memory_MemAvailable_bytes[2m])
Comparison Table: Changes in Gauges
| Function | Description | Example Usage | Typical Use Case |
|---|---|---|---|
| delta() | Difference between first and last value in the range | delta(memory_usage_bytes[10m]) | Net change over a period |
| deriv() | Per-second rate of change (linear regression over the range) | deriv(memory_usage_bytes[5m]) | Trend/speed of change over time |
| idelta() | Difference between the last two samples in the range | idelta(memory_usage_bytes[5m]) | Detecting sudden jumps or drops |
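The difference between delta() and deriv() can be sketched in Python: deriv() fits a least-squares line through all samples in the range, while delta() only compares the endpoints. Made-up samples below; the real functions also extrapolate to the window boundaries:

```python
# Sketch: deriv() as a least-squares (linear regression) slope over the
# (timestamp, value) samples of a gauge, versus a delta()-style endpoint
# difference. Made-up samples; real PromQL also extrapolates.

def lsq_slope(samples):
    """Least-squares slope of value over time."""
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    num = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    den = sum((t - mean_t) ** 2 for t, _ in samples)
    return num / den

samples = [(0, 10.0), (10, 12.0), (20, 14.0), (30, 16.0)]

print(lsq_slope(samples))              # perfectly linear data -> slope 0.2/s
print(samples[-1][1] - samples[0][1])  # delta-style net change: 6.0
```
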
6. Functions: Histogram Quantile
Histograms are used to observe the distribution of values (like request durations or response sizes). Prometheus stores histogram data as multiple time series with bucket boundaries. The histogram_quantile() function estimates quantiles (e.g., median, 95th percentile) from these histograms.
histogram_quantile()
- Purpose: Calculates an estimated quantile (e.g., 0.5 for median, 0.95 for 95th percentile) from histogram buckets.
- Usage: Useful for understanding latency or size distributions.
- Syntax:
histogram_quantile(φ, sum(rate(<histogram_metric>_bucket[range])) by (le))
Where φ is the desired quantile (e.g., 0.95 for the 95th percentile).
Examples
- Calculate the 95th percentile request duration:
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
Estimates the 95th percentile of HTTP request durations over the last 5 minutes.
- Calculate the median (50th percentile) response size:
histogram_quantile(0.5, sum(rate(http_response_size_bytes_bucket[5m])) by (le))
Estimates the median response size over the last 5 minutes.
- Calculate the 99th percentile latency per job:
histogram_quantile(0.99, sum by (job, le) (rate(http_request_duration_seconds_bucket[5m])))
Estimates the 99th percentile latency for each job.
Tip:
- Always use the _bucket metric for histogram_quantile().
- Use sum(rate(...)) by (le) to aggregate across all relevant labels except le.
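Under the hood, histogram_quantile() locates the bucket containing the target rank and interpolates linearly within it. Here is a simplified Python sketch, reusing the example buckets from the serialization section and skipping Prometheus's edge-case handling (empty buckets, NaN inputs, native histograms):

```python
# Simplified sketch of histogram_quantile(): find the cumulative bucket
# containing the target rank, then interpolate linearly inside it.
# Bucket data reuses the serialization example from earlier in this doc.

def histogram_quantile(phi, buckets):
    """buckets: sorted list of (upper_bound, cumulative_count)."""
    total = buckets[-1][1]  # count in the +Inf bucket = total observations
    rank = phi * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if bound == float("inf"):
                return prev_bound  # quantile falls in the open +Inf bucket
            in_bucket = count - prev_count
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / in_bucket
        prev_bound, prev_count = bound, count

buckets = [(0.025, 20), (0.05, 60), (0.1, 90), (0.25, 100), (float("inf"), 105)]

print(histogram_quantile(0.5, buckets))  # median, interpolated ~= 0.0453
```

This also shows why the result is an estimate: inside a bucket, observations are assumed to be evenly distributed.
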
7. Aggregation Over Multiple Series
Aggregation in PromQL allows you to combine multiple time series into a single series or a smaller set of series, using functions like sum, avg, min, max, count, and more. Aggregation is essential for summarizing data across different dimensions (labels), such as combining metrics from all instances of a service or across all endpoints.
How Aggregation Works
- Without grouping: Aggregates all matching series into a single value.
- With by clause: Groups series by the specified label(s) and aggregates within each group.
Examples
- Sum all HTTP requests across all jobs:
sum(http_requests_total)
Returns the total number of HTTP requests across all jobs.
- Sum HTTP requests per job:
sum by (job) (http_requests_total)
Returns the total number of HTTP requests for each job.
- Average CPU usage per instance:
avg by (instance) (process_cpu_seconds_total)
Returns the average CPU usage for each instance.
- Count the number of up targets per job:
count by (job) (up)
Returns the number of targets that are up for each job.
Comparison Table: Aggregation Over Multiple Series
| Aggregation Function | Description | Example Usage | Typical Use Case |
|---|---|---|---|
| sum | Adds up all values | sum(http_requests_total) | Total requests across all series |
| avg | Calculates the average | avg by (job) (process_cpu_seconds_total) | Average CPU usage per job |
| min | Finds the minimum value | min by (instance) (up) | Find the lowest value per instance |
| max | Finds the maximum value | max by (job) (http_requests_total) | Find the highest value per job |
| count | Counts the number of series | count by (job) (up) | Number of up targets per job |
| stddev | Standard deviation | stddev by (job) (process_cpu_seconds_total) | Variability of CPU usage per job |
| stdvar | Variance | stdvar by (job) (process_cpu_seconds_total) | Variance of CPU usage per job |
| topk | Top K series by value | topk(3, sum by (job) (http_requests_total)) | Top 3 jobs by request count |
| bottomk | Bottom K series by value | bottomk(3, sum by (job) (http_requests_total)) | Bottom 3 jobs by request count |
Tip:
- Use the by clause to control the grouping of your aggregation by one or more labels.
- You can use multiple labels in the by clause for multi-dimensional grouping, e.g., sum by (job, instance) (...).
- Omitting the by clause aggregates across all series into a single value.
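The by grouping can be sketched in Python: series are bucketed by the chosen label(s), and the aggregation runs within each bucket (made-up data):

```python
from collections import defaultdict

# Sketch of `sum by (job) (...)`: bucket series by one label, then add up
# the values within each bucket (made-up data).

series = [
    ({"job": "api", "instance": "a"}, 100.0),
    ({"job": "api", "instance": "b"}, 50.0),
    ({"job": "web", "instance": "c"}, 30.0),
]

totals = defaultdict(float)
for labels, value in series:
    totals[labels["job"]] += value  # all labels except "job" are dropped

print(dict(totals))  # {'api': 150.0, 'web': 30.0}
```

Note that the output series keep only the grouping labels; everything else (here, instance) is discarded, just as in PromQL.
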
8. Aggregation Over Time
Aggregation over time in PromQL uses functions that summarize or transform the values of a single time series within a specified time window. These functions are useful for smoothing data, detecting trends, or extracting statistical information from time series.
Common Over Time Functions
| Function | Description | Example Usage | Typical Use Case |
|---|---|---|---|
| sum_over_time() | Sums all values in the range | sum_over_time(http_requests_total[10m]) | Sum of sampled values (for counters, prefer increase()) |
| avg_over_time() | Averages all values in the range | avg_over_time(process_cpu_seconds_total[5m]) | Average CPU usage in the last 5 minutes |
| min_over_time() | Finds the minimum value in the range | min_over_time(node_memory_MemAvailable_bytes[1h]) | Lowest available memory in the last hour |
| max_over_time() | Finds the maximum value in the range | max_over_time(temperature_celsius[30m]) | Highest temperature in the last 30 minutes |
| count_over_time() | Counts the number of values in the range | count_over_time(up[1h]) | Number of samples in the last hour |
| quantile_over_time() | Calculates a quantile (e.g., median) over the range | quantile_over_time(0.5, temperature_celsius[30m]) | Median temperature in the last 30 minutes |
| stddev_over_time() | Standard deviation over the range | stddev_over_time(process_cpu_seconds_total[5m]) | Variability of CPU usage in the last 5 minutes |
| stdvar_over_time() | Variance over the range | stdvar_over_time(process_cpu_seconds_total[5m]) | Variance of CPU usage in the last 5 minutes |
Examples
- Sum of sampled values over the last 10 minutes:
sum_over_time(http_requests_total[10m])
Returns the sum of all samples in the last 10 minutes for each series. Note that for a counter this adds up cumulative snapshots; use increase() to count events over a period.
- Average CPU usage over the last 5 minutes:
avg_over_time(process_cpu_seconds_total[5m])
Returns the average CPU usage in the last 5 minutes for each series.
- Minimum available memory over the last hour:
min_over_time(node_memory_MemAvailable_bytes[1h])
Returns the lowest available memory value in the last hour for each series.
- Median value of a gauge over the last 30 minutes:
quantile_over_time(0.5, temperature_celsius[30m])
Returns the median temperature in the last 30 minutes for each series.
Tip:
- Over time functions always require a range vector selector (e.g., [5m]).
- These functions operate on each time series independently.
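Since each *_over_time() function reduces one series' samples in the window to a single value, plain Python reductions over a list capture the idea (made-up gauge samples; the real quantile_over_time also interpolates between sample values):

```python
# Sketch of *_over_time(): each function reduces the samples of ONE series
# inside the range window to a single value (made-up gauge samples).

values = [21.0, 22.5, 20.0, 23.5, 22.0]  # samples within e.g. a [30m] window

print(sum(values))                # sum_over_time
print(sum(values) / len(values))  # avg_over_time
print(min(values), max(values))   # min_over_time / max_over_time
print(len(values))                # count_over_time
print(sorted(values)[len(values) // 2])  # rough median, quantile_over_time(0.5, ...)
```
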