Skip to content

[ClickHouse] 按时间做累加的SQL怎么写? #11

@GarinZ

Description

@GarinZ

点开文章的朋友们可以先看看讲的是不是你要的效果
初始数据:假设我们的表如下所示,时间戳(ts)我简化成了1到5的数字。

ts amount
1 10
2 5
3 3
4 10
5 5

处理结果:SQL执行完想变成这样

ts amount
1 10
2 15
3 18
4 22
5 27

我先放最终的完整SQL,然后再解释里面让人迷茫的Clickhouse函数。

SELECT ts, arraySum(arraySlice(total, 1, i)) as total_num
FROM (
  SELECT
    groupArray(ts) as t,
    groupArray(amount) as total
  FROM (
    SELECT ts, amount
    FROM [表名]
    WHERE ts >= [开始时间]
    AND ts <= [结束时间]
  )
)
ARRAY JOIN
  t as ts,
  arrayEnumerate(total) as i

STEP1:groupArray

里面除了SQL的基本语法之外,主要有4个看着就蛋疼的Clickhouse函数。我们从最内层往外看。

SELECT
  groupArray(ts) as t,
  groupArray(amount) as total
FROM (
  SELECT ts, amount
  FROM [表名]
  WHERE ts >= [开始时间]
  AND ts <= [结束时间]
)

这个SQL中groupArray比较陌生。这函数简单来说就是把列转成行,执行后的效果就是

t total
[1,2,3,4,5] [10,5,3,10,5]

STEP2:ARRAY JOIN

接下来只看ARRAY JOINarrayEnumerate,我把SQL中只保留ARRAY JOIN

SELECT t, total
FROM (
  SELECT
    groupArray(ts) as t,
    groupArray(amount) as total
  FROM (
    SELECT ts, amount
    FROM [表名]
    WHERE ts >= [开始时间]
    AND ts <= [结束时间]
  )
) ARRAY JOIN t

ARRAY JOIN的主要作用是把数组展开

t total
1 [10,5,3,10,5]
2 [10,5,3,10,5]
3 [10,5,3,10,5]
4 [10,5,3,10,5]
5 [10,5,3,10,5]

STEP3:带alias的ARRAY JOIN

原始SQL中还带了alias有啥用?当我们ARRAY JOIN t的时候,t在SQL中就代表了数组中的元素,而不是数组本身。那如果在查询结果中既包含元素、也包含数组本身怎么办?就得再加上alias了。s

SELECT t, total, ts
FROM (
  SELECT
    groupArray(ts) as t,
    groupArray(amount) as total
  FROM (
    SELECT ts, amount
    FROM [表名]
    WHERE ts >= [开始时间]
    AND ts <= [结束时间]
  )
) ARRAY JOIN t as ts

使用了alias的话,t就变成了数组的引用,ts就变成了数组元素的引用

t total ts
[1,2,3,4,5] [10,5,3,10,5] 1
[1,2,3,4,5] [10,5,3,10,5] 2
[1,2,3,4,5] [10,5,3,10,5] 3
[1,2,3,4,5] [10,5,3,10,5] 4
[1,2,3,4,5] [10,5,3,10,5] 5

STEP4:arrayEnumerate

arrayEnumerate这个方法就和row_number()一样,简单理解就是返回行号。

SELECT t, total, ts, i
FROM (
  SELECT
    groupArray(ts) as t,
    groupArray(amount) as total
  FROM (
    SELECT ts, amount
    FROM [表名]
    WHERE ts >= [开始时间]
    AND ts <= [结束时间]
  )
) ARRAY JOIN 
  t as ts,
  arrayEnumerate(total) as i

查询结果如下:

t total ts i
[1,2,3,4,5] [10,5,3,10,5] 1 1
[1,2,3,4,5] [10,5,3,10,5] 2 2
[1,2,3,4,5] [10,5,3,10,5] 3 3
[1,2,3,4,5] [10,5,3,10,5] 4 4
[1,2,3,4,5] [10,5,3,10,5] 5 5

STEP5:arraySlice和arraySum

那我们绕这么一圈对咱们的目标有啥帮助呢?憋着急,这时候就要用到arraySlicearraySum。我们再看看我们现在已经查询出来的结果。

t total ts i
[1,2,3,4,5] [10,5,3,10,5] 1 1
[1,2,3,4,5] [10,5,3,10,5] 2 2
[1,2,3,4,5] [10,5,3,10,5] 3 3
[1,2,3,4,5] [10,5,3,10,5] 4 4
[1,2,3,4,5] [10,5,3,10,5] 5 5

大家可能已经发现了,根据我们现有的数据,已经能从单行的数据中计算目标数据:

  • 第一行:10 = total[0]
  • 第二行:15 = total[0] + total[1]
  • 以此类推

咱们利用Clickhouse的函数就可以做这种处理:

  • arraySlice:截断数组,arraySlice(数组变量名称, 开始下标, 结束下标),注意开始下标从1开始,并且区间左闭右开[开始下标, 结束下标)。
  • arraySum:将数组中所有元素做加和。

所以咱们再看看这个SQL,截断每个total数组,截取到行号i的前一位。然后把数组加起来。

SELECT ts, arraySum(arraySlice(total, 1, i)) as total_num
FROM (
  SELECT
    groupArray(ts) as t,
    groupArray(amount) as total
  FROM (
    SELECT ts, amount
    FROM [表名]
    WHERE ts >= [开始时间]
    AND ts <= [结束时间]
  )
)
ARRAY JOIN
  t as ts,
  arrayEnumerate(total) as i

最终获得结果就是

ts total_num
1 10
2 15
3 18
4 22
5 27

结尾

按时间累加这个SQL第一次写的话我觉得还是挺绕的,大家可以把Clickhouse函数和细节放一边,先记忆查询结果的处理过程。

accumulation-by-time-range-query-view-steps

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions