[SQL] Aggregate function and window function

Junwoo Ahn·2022년 11월 6일

배경

얼마 전 공모전을 준비하던 중 데이터를 특정 유저에 대해 그 내부적으로 log의 sequence로 정렬해야 했던 경험을 했다. 당시에 python의 pandas dataframe을 변형시키며 구현을 시도했는데, SQL의 window function을 사용할 경우 더 효과적으로 짧은 쿼리로도 결과를 조회 가능하여 이를 정리해보고자 한다.

Aggregate Function과 Window Function에 대한 비교

Aggregate Function 단독 사용 시

Aggregate Function : COUNT, SUM, MIN, MAX, AVG, ... etc

Aggregate function의 경우 주로 결과가 하나의 row로 보여짐. 예를 들어 특정 Table 내의 Column 값들의 총합 (SUM)을 SELECT문으로 조회할 경우 집계함수를 사용한 결과가 합계값 하나로만 나타남.

Window Function, Aggregate Function과 OVER절

window frame

Partition 내에서 row들의 집합

window function

A window function performs a calculation across a set of table rows that are somehow related to the current row. This is comparable to the type of calculation that can be done with an aggregate function. But unlike regular aggregate functions, use of a window function does not cause rows to become grouped into a single output row — the rows retain their separate identities. Behind the scenes, the window function is able to access more than just the current row of the query result.
- postgresql document

window function의 경우 주로 window frame으로 인식되는 단위에서 동작함.

sum(cost) over ( partition by customer_id)

예를 들어 Aggregate function이 특정 partition by Dep_name 에 대해 연산될 경우, Dep_name에서 나눈 partition내의 모든 row가 window frame으로 인식됨.

sum(cost) over ( partition by customer_id order by cost)

뒤에 order by가 추가로 사용되는 경우, order by한 column의 각각의 row 모두가 window frame으로 인식되어 모든 row에 대해 aggregate function(위의 경우 sum)을 적용한 결과가 리턴된다.

예시

with
table_01 as (
select order_id, customer_id, order_date
, lag(order_date) over (partition by customer_id order by order_date) as prev_ord_date
from orders
)
select * from table_01;

LAG 함수와 LEAD 함수

lag(Array) : 이전 행의 값
lead(Array) : 다음 행의 값

(Array의 경우 조회 결과 또는 결과 내 특정 집합을 의미)

동작 순서

customer_id로 partition이 생성됨
order_date로 정렬한 row들 각각이 window frame으로 인식되어 over절 앞의 window function에 활용될 준비 마침
정렬된 order_date가 모두 window frame으로 인식되어 Partition 단위 내에 모든 order_date의 row에 대해 lag를 적용

위와 같이 customer_id에 대해 내부적으로 prev_ord_date가 한 칸 밀려서 생성되어 있는 것을 확인할 수 있다. group by로만 활용하는 방식보다 훨씬 간결하면서도 효과적으로 window frame 별로 연산이 가능하기 때문에 lag 및 lead 함수와 결합하여 사용할 경우 어떤 sequence의 이전 단계와의 값 차이 등을 비교할 시에 매우 효과적인 도구가 될 것 같다 :)