Building on my previous learning exercises, I need to learn how to query Prometheus so that I can work with production systems running on Kubernetes. For large-scale datasets, I find histograms to be an excellent tool for summarizing and visualizing throughput and latency. However, histograms in Prometheus are all a bit new and confusing to me.

Summaries, Histograms, Oh My!

Prometheus has two metric types for this kind of distribution data -- Summaries and Histograms.

Summaries

Summaries are calculated client side, meaning the application must dedicate CPU cycles to computing these values that it could otherwise spend serving your customers. The percentiles are precomputed and stored in a ready-to-use state in Prometheus. However, you can't calculate any new percentiles that weren't explicitly configured ahead of time. You can think of a Summary as being stored inside Prometheus something like this:
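Here's a rough sketch of the resulting series (the quantiles and sample values here are made up for illustration; the real ones depend on how the client library is configured):

http_response_duration_ms{quantile="0.5"}    320
http_response_duration_ms{quantile="0.9"}    810
http_response_duration_ms{quantile="0.99"}   1500
http_response_duration_ms_sum                512400
http_response_duration_ms_count              1200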

You can see in the sketch above that the metric name provided in the application code, http_response_duration_ms, is stored with an explicit label for each precomputed percentile (the label is called quantile). You can retrieve any of those time series with an instant selector, but you can't apply any new functions. Here's a sample PromQL query for the median value:

http_response_duration_ms{quantile="0.5"}

Sadly, because the percentiles are all precomputed, you can't meaningfully combine these values from multiple Kubernetes pods at query time. Some other system would have to calculate a combined Summary ahead of time...so we're not going to talk about Summaries anymore.
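To make that limitation concrete: the best you could do in PromQL is something like the query below, which averages each pod's precomputed median. That is not the same thing as the true median across all pods.

# averages the per-pod medians; NOT a real cluster-wide median
avg(http_response_duration_ms{quantile="0.5"})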

Histograms

Histograms work differently. The buckets are recorded client side as simple counters, but the quantiles themselves are calculated server side, within Prometheus, at query time. Prometheus stores histograms internally in buckets that have an upper bound (in a label named le), but no lower bound. You must configure the number of buckets and the upper bound of each one ahead of time. Each bucket's time series contains the count of observations that were less than or equal to its le value at a given timestamp.
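Here's a rough sketch of what gets stored for the same metric (the bucket bounds and values are made up for illustration):

http_response_duration_ms_bucket{le="500"}     5
http_response_duration_ms_bucket{le="1000"}    10
http_response_duration_ms_bucket{le="+Inf"}    12
http_response_duration_ms_sum                  6400
http_response_duration_ms_count                12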

Things look quite a bit different. Each bucket is stored as a separate series of the metric name suffixed with _bucket, and the upper bound for that bucket is in the label le. Note that the buckets are cumulative, and there is always a largest bucket with an infinite upper bound, {le="+Inf"}, which will always have the same value as _count. Because you can't meaningfully use the buckets directly AFAICT, you generally use the function histogram_quantile() to estimate whatever quantile you want. The accuracy is based mostly on the bucket sizes you choose in the client: the more (and narrower) the buckets, the more accurate the estimate, but the more data required to store and compute values.

Querying Histograms

Let's start playing with the histogram_quantile() function. Here's a simple PromQL query against the histogram buckets from the sketch above:

histogram_quantile(.99, rate(http_response_duration_ms_bucket[1m]))

Even though this is the simplest usage of histogram_quantile(), there's kind of a lot going on. Let's break it down:

  • .99 - the quantile (a.k.a. percentile) we want, expressed as a value from 0 to 1; as the example after this list shows, you can ask for any quantile at query time
  • rate() - we'll talk more about this below, but basically it converts an ever-increasing counter into a usable per-second rate
  • [1m] - this range selector gives rate() a 1-minute window of samples to compute over
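That first argument is the part Summaries can't give you: because the quantile is just a parameter, you can estimate any quantile without changing the application. For example, the median over a 5-minute window:

histogram_quantile(.5, rate(http_response_duration_ms_bucket[5m]))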

Counters and rate()

If you read the docs, you'll see that counters are "monotonically" increasing. That means that a given metric name with a given set of labels will only ever go up (or reset to zero when the process restarts). The problem is that graphing a line that always goes up just isn't very helpful. So, rate() breaks the range into time slices, finds the increase during each slice, and divides by the slice length to produce a per-second rate. Let's take some observations at different times of a counter as an example:

Time    Counter Value
T1      3
T2      5
T3      7
T4      7
T5      17

So, if we apply rate() to this time series (assuming, for simplicity, that the samples are one second apart so the per-second rate equals the increase), the values come out like this:

Time    rate(counter) Value
T1      0
T2      2
T3      2
T4      0
T5      10

So, basically, because sending histograms to Prometheus generates a whole bunch of different counters (specifically the _bucket series), we normally need to apply rate() first to get usable data.
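That's exactly what the inner expression of our earlier query does; on its own, it gives the per-second rate of observations landing in each bucket:

rate(http_response_duration_ms_bucket[1m])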

Aggregating Histograms by Series

So, let's say you have a lot of pods running a service and you add labels to all your metrics that include the pod name. For this particular metric, you've also included the HTTP status code. However, you now want to build a dashboard out of one metric showing the 99th percentile of something, for example http_response_duration_ms. If you naively use the query I shared above, you'll see a time series graphed for every unique combination of pod and status code!

Let's not do that. Instead, try this:

histogram_quantile(.99, sum(rate(http_response_duration_ms_bucket[1m])) by (le, status_code))

The main thing that's new and different here is sum(..) by (..), which finds all series that share the same values for the listed labels ({le=.., status_code=..}) and adds their values together into a single series. Let's take an example using a single sample/timestamp across many series to show what happens:

le      pod     status_code    value
500     pod1    200            5
500     pod2    200            5
1000    pod3    200            10
1000    pod4    200            10
1000    pod5    404            1

After running the PromQL above, Prometheus sums up any series that have the same le and status_code, so the end result looks like:

le      status_code    value
500     200            10 (5+5)
1000    200            20 (10+10)
1000    404            1

Now, you'll see one series per status code across your whole cluster. If we had any other labels (like endpoint, service, etc), those would also disappear after the aggregation above.
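Taking that one step further: if you want a single 99th-percentile series for the whole service, keep nothing but le in the by clause:

histogram_quantile(.99, sum(rate(http_response_duration_ms_bucket[1m])) by (le))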

Don't Forget Your le!

Behind the scenes, at query time, the histogram_quantile() function will secretly look for the {le=..} label on the provided series. Before I understood this, I kept hitting errors saying:

No datapoints found.

Once you know that histogram_quantile() requires the time series passed in to have an {le=..} label, this actually makes sense. This is why, anytime we combine histogram bucket series, we must be careful to preserve the le label.
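For example, this query sums away every label, including le, so histogram_quantile() has no buckets left to work with and returns nothing useful:

# broken: sum() without by (le) drops the le label
histogram_quantile(.99, sum(rate(http_response_duration_ms_bucket[1m])))

Adding by (le) back to the sum, as we did above, fixes it.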

Conclusion

Using histograms as a lens, we've dug deep into how Prometheus:

  • expects data to come from client applications
  • stores time series
  • converts counters into usable time series using rate()
  • exposes histograms via the histogram_quantile() function
  • allows us to combine series for a metric name across different labels

Hopefully, this lets you start doing interesting analysis of your production systems!