小言_互联网的博客

Flink SQL--分组聚合

365人阅读  评论(0)

一、DISTINCT Aggregation

二、GROUPING SETS

在一个GROUP BY 查询中,根据不同的维度组合进行聚合。GROUPING SETS就是一种将多个GROUP BY逻辑 UNION 在一起。GROUPING SETS会把在单个GROUP BY逻辑中没有参与GROUP BY的那一列置为NULL值。空分组集意味着所有行都聚合到一个组中

SELECT supplier_id, rating, COUNT(*) AS total
FROM (VALUES
    ('supplier1', 'product1', 4),
    ('supplier1', 'product2', 3),
    ('supplier2', 'product3', 3),
    ('supplier2', 'product4', 4))
AS Products(supplier_id, product_id, rating)
GROUP BY GROUPING SETS ((supplier_id, rating), (supplier_id), ())
Results:

+-------------+--------+-------+
| supplier_id | rating | total |
+-------------+--------+-------+
|   supplier1 |      4 |     1 |
|   supplier1 |      3 |     1 |
|   supplier2 |      3 |     1 |
|   supplier2 |      4 |     1 |

|   supplier1 | (NULL) |     2 |
|   supplier2 | (NULL) |     2 |

|      (NULL) | (NULL) |     4 |
+-------------+--------+-------+

 

TTL

在持续查询的过程中,由于用于分组的 key 可能会不断增加,因此计算结果所需要维护的状态也会持续增长。为了防止状态无限增长耗尽资源,Flink Table API 和 SQL 可以在表环境中配置状态的生存时间(TTL)table.exec.state.ttl

配置 TTL 有可能会导致统计结果不准确,这其实是以牺牲正确性为代价换取了资源的释放

2.1、ROLLUP

ROLLUP 相当于 GROUPING SETS 的简化

SELECT supplier_id, rating, COUNT(*)
FROM (VALUES
    ('supplier1', 'product1', 4),
    ('supplier1', 'product2', 3),
    ('supplier2', 'product3', 3),
    ('supplier2', 'product4', 4))
AS Products(supplier_id, product_id, rating)
GROUP BY ROLLUP (supplier_id, rating)

等价于
SELECT supplier_id, rating, COUNT(*)
FROM (VALUES
    ('supplier1', 'product1', 4),
    ('supplier1', 'product2', 3),
    ('supplier2', 'product3', 3),
    ('supplier2', 'product4', 4))
AS Products(supplier_id, product_id, rating)
GROUP BY GROUPING SETS (
    ( supplier_id, rating ),
    ( supplier_id         ),
    (                     )
)


--- 三个字段分组
SELECT supplier_id, rating,product_id, COUNT(*)
FROM (VALUES
    ('supplier1', 'product1', 4),
    ('supplier1', 'product2', 3),
    ('supplier2', 'product3', 3),
    ('supplier2', 'product4', 4))
AS Products(supplier_id, product_id, rating)
GROUP BY ROLLUP(supplier_id, rating,product_id)
等价于
SELECT supplier_id, rating, product_id, COUNT(*)
FROM (VALUES
    ('supplier1', 'product1', 4),
    ('supplier1', 'product2', 3),
    ('supplier2', 'product3', 3),
    ('supplier2', 'product4', 4))
AS Products(supplier_id, product_id, rating)
GROUP BY GROUPING SETS (
    ( supplier_id, product_id, rating ),
    ( supplier_id, product_id         ),
    ( supplier_id                     ),
    (                                 )
)

 

2.2、CUBE


SELECT supplier_id, rating,product_id, COUNT(*)
FROM (VALUES
    ('supplier1', 'product1', 4),
    ('supplier1', 'product2', 3),
    ('supplier2', 'product3', 3),
    ('supplier2', 'product4', 4))
AS Products(supplier_id, product_id, rating)
GROUP BY CUBE(supplier_id, rating,product_id)


等价于
SELECT supplier_id, rating, product_id, COUNT(*)
FROM (VALUES
    ('supplier1', 'product1', 4),
    ('supplier1', 'product2', 3),
    ('supplier2', 'product3', 3),
    ('supplier2', 'product4', 4))
AS Products(supplier_id, product_id, rating)
GROUP BY GROUPING SETS (
    ( supplier_id, product_id, rating ),
    ( supplier_id, product_id         ),
    ( supplier_id,             rating ),
    ( supplier_id                     ),
    (              product_id, rating ),
    (              product_id         ),
    (                          rating ),
    (                                 )
)

 

下面是 CUBE 和 ROLLUP 之间的具体区别:

  • CUBE 生成的结果集显示了所选列中值的所有组合的聚合
  • ROLLUP 生成的结果集显示了所选列中值的某一层次结构的聚合

三、HAVING

用于分组后的过滤条件


转载:https://blog.csdn.net/wuxintdrh/article/details/127570167
查看评论
* 以上用户言论只代表其个人观点,不代表本网站的观点或立场