Aggregate functions
This page describes the available functions to assist with performing aggregate calculations.
approx_count_distinct
approx_count_distinct(column_name, precision)
- estimates the number of
distinct non-null
values in IPv4
, int
, or long
columns using the
HyperLogLog data structure, which provides an
approximation rather than an exact count.
The precision of HyperLogLog can be controlled via the optional precision
parameter, typically between 4 and 16. A higher precision leads to more accurate
results with increased memory usage. The default is 1.
This function is useful within high cardinality datasets where an exact count is
not required. Thus consider it the higher cardinality alternative to
count_dinstinct
.
Parameters
column_name
: The name of the column for which to estimate the count of distinct values.precision
(optional): A number specifying the precision of the HyperLogLog algorithm, which influences the trade-off between accuracy and memory usage. A higher precision gives a more accurate estimate, but consumes more memory. Defaults to 1 (lower accuracy, high efficiency).
Return value
Return value type is long
.
Examples
Please note that exact example values will vary as they are approximations derived from the HyperLogLog algorithm.
SELECT approx_count_distinct(ip_address, 5) FROM logs;
approx_count_distinct |
---|
1234567 |
SELECT date, approx_count_distinct(user_id) FROM sessions GROUP BY date;
date | approx_count_distinct |
---|---|
2023-01-01 | 2358 |
2023-01-02 | 2491 |
... | ... |
SELECT region, approx_count_distinct(product_id) FROM sales GROUP BY region;
region | approx_count_distinct |
---|---|
North | 1589 |
South | 1432 |
East | 1675 |
West | 1543 |
SELECT approx_count_distinct(order_id, 8) FROM orders;
approx_count_distinct |
---|
3456789 |
SELECT store_id, approx_count_distinct(transaction_id) FROM transactions GROUP BY store_id;
store_id | approx_count_distinct |
---|---|
1 | 56789 |
2 | 67890 |
... | ... |
approx_percentile
approx_percentile(value, percentile, precision)
calculates the approximate
value for the given non-negative column and percentile using the
HdrHistogram algorithm.
Arguments
value
is any numeric non-negative value.percentile
is adouble
value between 0.0 and 1.0, inclusive.precision
is an optionalint
value between 0 and 5, inclusive. This is the number of significant decimal digits to which the histogram will maintain value resolution and separation. For example, when the input column contains integer values between 0 and 3,600,000,000 and the precision is set to 3, value quantization within the range will be no larger than 1/1,000th (or 0.1%) of any value. In this example, the function tracks and analyzes the counts of observed response times ranging between 1 microsecond and 1 hour in magnitude, while maintaining a value resolution of 1 microsecond up to 1 millisecond, a resolution of 1 millisecond (or better) up to one second, and a resolution of 1 second (or better) up to 1,000 seconds. At its maximum tracked value (1 hour), it would still maintain a resolution of 3.6 seconds (or better).
Return value
Return value type is double
.
Examples
SELECT approx_percentile(latency, 0.99) FROM request_logs;
approx_percentile |
---|
101.5 |