GROUP BY์™€ CTAS

GROUP BY์™€ AGGREGATEํ•จ์ˆ˜


ํ…Œ์ด๋ธ”์˜ ๋ ˆ์ฝ”๋“œ๋ฅผ ๊ทธ๋ฃนํ•‘ํ•˜์—ฌ ๊ทธ๋ฃน๋ณ„๋กœ ๋‹ค์–‘ํ•œ ์ •๋ณด๋ฅผ ๊ณ„์‚ฐํ•œ๋‹ค.

๋‘ ๋‹จ๊ณ„๋กœ ์ด๋ฃจ์–ด์ง

1.๋จผ์ € ๊ทธ๋ฃนํ•‘์„ ํ•  ํ•„๋“œ๋ฅผ ์ง€์ •(ํ•˜๋‚˜ ์ด์ƒ์˜ ํ•„๋“œ๊ฐ€ ๋  ์ˆ˜ ์žˆ์Œ)

  • GROUP BY๋กœ ์ง€์ •(ํ•„๋“œ ์ด๋ฆ„์„ ์‚ฌ์šฉํ•˜๊ฑฐ๋‚˜ ํ•„๋“œ ์ผ๋ จ๋ฒˆํ˜ธ๋ฅผ ์‚ฌ์šฉ)

2.๋‹ค์Œ ๊ทธ๋ฃน๋ณ„๋กœ ๊ณ„์‚ฐํ•  ๋‚ด์šฉ์„ ์ง€์ •

  • ์—ฌ๊ธฐ์„œ AGGREGATEํ•จ์ˆ˜๋ฅผ ์‚ฌ์šฉ
  • COUNT, SUM, AVG, MIN, MAX, LISTAGG,...
    ๋ณดํ†ต ํ•„๋“œ ์ด๋ฆ„์„ ์ง€์ •ํ•˜๋Š” ๊ฒƒ์ด ์ผ๋ฐ˜์ ์ด๋‹ค.

EX1) ์›”๋ณ„ ์„ธ์…˜์ˆ˜๋ฅผ ๊ณ„์‚ฐํ•˜๋Š” SQL
-raw_data.session_timestamp๋ฅผ ์‚ฌ์šฉ(sessionId์™€ tsํ•„๋“œ)

SELECT 
     LEFT(ts, 7) AS mon, -- timestampํ˜•์‹์ด ๋ฌธ์ž์—ด ํ•จ์ˆ˜์— ๋“ค์–ด๊ฐ€ ๋ฌธ์ž์—ด๋กœ ์ถ”์ถœ๋จ
     COUNT(1) AS session_count
FROM raw_data.session_timestamp
GROUP BY 1 -- GROUP BY mon, GROUP BY LEFT(ts, 7)
ORDER BY 1; -- mon ํ•„๋“œ๋ฅผ ๊ธฐ์ค€์œผ๋กœ ์˜ค๋ฆ„์ฐจ์ˆœ ์ •๋ ฌ

EX2) ๊ฐ€์žฅ ๋งŽ์ด ์‚ฌ์šฉ๋œ ์ฑ„๋„์€ ๋ฌด์—‡์ธ๊ฐ€?

  • ๊ฐ€์žฅ ๋งŽ์ด ์‚ฌ์šฉ๋˜์—ˆ๋‹ค๋Š” ์ •์˜๋Š”? ์‚ฌ์šฉ์ž ๊ธฐ๋ฐ˜์ธ์ง€ ์„ธ์…˜ ๊ธฐ๋ฐ˜์ธ์ง€
  • ๋จผ์ € ์–ด๋А ํ…Œ์ด๋ธ”์„ ์‚ฌ์šฉํ•ด์•ผํ•˜๋Š”์ง€ ์ƒ๊ฐํ•˜์ž
SELECT
     channel,
     COUNT(1) AS session_count, -- ์ฑ„๋„๋ณ„ ์„ธ์…˜ ์ˆ˜
     COUNT(DISTINCT userId) AS user_count -- ์ฑ„๋„๋ณ„ ์œ ์ผํ•œ ์‚ฌ์šฉ์ž ์ˆ˜
FROM raw_data.user_session_channel
GROUP BY 1	-- GROUP BY channel
ORDER BY 2 DESC;	-- ORDER BY session_count DESC

EX3) ๊ฐ€์žฅ ๋งŽ์€ ์„ธ์…˜์„ ๋งŒ๋“ค์–ด๋‚ธ ์‚ฌ์šฉ์ž ID๋Š” ๋ฌด์—‡์ธ๊ฐ€?

  • ํ•„์š”ํ•œ ์ •๋ณด : ์„ธ์…˜ ์ •๋ณด, ์‚ฌ์šฉ์ž ์ •๋ณด
  • ๋จผ์ € ์–ด๋Аํ…Œ์ด๋ธ”์„ ์‚ฌ์šฉํ•ด์•ผํ•˜๋Š” ์ง€ ์ƒ๊ฐ
SELECT
     userId, 
     COUNT(1) AS count
FROM raw_data.user_session_channel
GROUP BY 1	-- GROUP BY userId
ORDER BY 2 DESC	-- ORDER BY count DESC
LIMIT 1; -- ๊ฐ€์žฅ ๋งŽ์€ ์„ธ์…˜์„ ๋งŒ๋“  ์‚ฌ์šฉ์ž

EX4) ์›”๋ณ„ ์œ ๋‹ˆํฌํ•œ ์‚ฌ์šฉ์ž ์ˆ˜(MAU)

  • ํ•„์š”ํ•œ ์ •๋ณด : ์‹œ๊ฐ„ ์ •๋ณด, ์‚ฌ์šฉ์ž ์ •๋ณด
SETECT
     TO_CHAR(A.ts, 'YYYY_MM') AS month, 
     COUNT(DISTINCT B.userid) AS mau
FROM raw_data.session_timestamp A
JOIN raw_data.user_session_channel B ON A.sessionid = B.sessionid -- INNER JOIN(์–‘์ชฝํ…Œ์ด๋ธ” ์ค‘์— ๋‹ค ์กฐ๊ฑด์— ๋งž๋Š” ๋ ˆ์ฝ”๋“œ๋งŒ ๋‚จ์Œ)
GROUP BY 1
GROUP BY 1 DESC;

*TO_CHAR: timestamp ํƒ€์ž…์„ ๋ฌธ์ž์—ด๋กœ ๋ณ€ํ™˜
LEFT(A.ts, 7)
DATE_TRUNC('month', A.ts): ์ด๊ฒƒ๋งŒ timestamp ํƒ€์ž…์œผ๋กœ ๋ฐ˜ํ™˜, ํ•ด๋‹น ์›” 1์ผ๋กœ ๋ฐ”๊ฟˆ
SUBSTRING(A.ts, 1, 7)

EX5) ์›”๋ณ„ ์ฑ„๋„๋ณ„ ์œ ๋‹ˆํฌํ•œ ์‚ฌ์šฉ์ž ์ˆ˜

  • ํ•„์š”ํ•œ ์ •๋ณด : ์‹œ๊ฐ„ ์ •๋ณด, ์‚ฌ์šฉ์ž ์ •๋ณด, ์ฑ„๋„ ์ •๋ณด
SETECT
     TO_CHAR(A.ts, 'YYYY_MM') AS month, 
     channel
     COUNT(DISTINCT B.userid) AS mau
FROM raw_data.session_timestamp A
JOIN raw_data.user_session_channel B ON A.sessionid = B.sessionid
GROUP BY 1, 2
GROUP BY 1 DESC, 2; -- ์›”๋ณ„์€ ๋‚ด๋ฆผ์ฐจ์ˆœ ์ฑ„๋„๋ณ„์€ ์˜ค๋ฆ„์ฐจ์ˆœ

CTAS์™€ CTE์†Œ๊ฐœ


CTAS : SELECT๋ฅผ ๊ฐ€์ง€๊ณ  ๊ฐ„๋‹จํ•˜๊ฒŒ ์ƒˆ๋กœ์šด ํ…Œ์ด๋ธ”์„ ๋งŒ๋“œ๋Š” ๋ฐฉ๋ฒ•.

์ž์ฃผ ์กฐ์ธํ•˜๋Š” ํ…Œ์ด๋ธ”๋“ค์ด ์žˆ๋‹ค๋ฉด ์ด๋ฅผ CTAS๋ฅผ ์‚ฌ์šฉํ•ด์„œ ์กฐ์ธํ•ด๋‘๋ฉด ํŽธ๋ฆฌํ•ด์ง„๋‹ค.

DROP TABLE IF EXISTS adhoc.junhee_session_summary; -- ํ…Œ์ด๋ธ”์ด ์ด๋ฏธ ์กด์žฌํ•˜๋ฉด DROP 
CREATE TABLE adhoc.junhee_session_summary AS -- ์ƒˆ๋กœ์šด ํ…Œ์ด๋ธ” ์ƒ์„ฑ
SELECT B.*, A.ts FROM raw_data.session_timestamp A
JOIN raw_data.user_session_channel B ON A.sessionid = B.sessionid;

์›”๋ณ„ ์œ ๋‹ˆํฌํ•œ ์‚ฌ์šฉ์ž ์ˆ˜๋ฅผ ๋‹ค์‹œ ์ž‘์„ฑ
SETECT
     TO_CHAR(A.ts, 'YYYY_MM') AS month, 
     COUNT(DISTINCT userid) AS mau
FROM adhoc.junhee_session_summary
GROUP BY 1
GROUP BY 1 DESC

ํ•ญ์ƒ ์‹œ๋„ํ•ด๋ด์•ผ ํ•˜๋Š” ๋ฐ์ดํ„ฐ ํ’ˆ์งˆ ํ™•์ธ ๋ฐฉ๋ฒ•๋“ค

  • ์ค‘๋ณต๋œ ๋ ˆ์ฝ”๋“œ๋“ค ์ฒดํฌํ•˜๊ธฐ
  • ์ตœ๊ทผ ๋ฐ์ดํ„ฐ์˜ ์กด์žฌ ์—ฌ๋ถ€ ์ฒดํฌํ•˜๊ธฐ(freshness)
  • Primary key uniqyeness๊ฐ€ ์ง€์ผœ์ง€๋Š”์ง€ ์ฒดํฌํ•˜๊ธฐ
  • ๊ฐ’์ด ๋น„์–ด์žˆ๋Š” ์นผ๋Ÿผ๋“ค์ด ์žˆ๋Š”์ง€ ์ฒดํฌํ•˜๊ธฐ
  1. ์ค‘๋ณต๋œ ๋ ˆ์ฝ”๋“œ๋“ค ์ฒดํฌํ•˜๊ธฐ(1)
    ๋‹ค์Œ ๋‘ ๊ฐœ์˜ ์นด์šดํŠธ๋ฅผ ๋น„๊ต
SELECT COUNT(1) FROM adhoc.junhee_session_summary;	-- ๋ชจ๋“  ๋ ˆ์ฝ”๋“œ ์ˆ˜ ์ฒดํฌ

SELECT COUNT(1)
FROM (
    SELECT DISTINCT userId, sessionId, ts, channel	-- ์œ ์ผํ•œ ๋ ˆ์ฝ”๋“œ ์ˆ˜ ์ฒดํฌ
    FROM adhoc.junhee_session_summary
);
  1. ์ค‘๋ณต๋œ ๋ ˆ์ฝ”๋“œ๋“ค ์ฒดํฌํ•˜๊ธฐ(2)
    CTE๋ฅผ ์‚ฌ์šฉํ•ด์„œ ์ค‘๋ณต ์ œ๊ฑฐ ํ›„ ์นด์šดํŠธ ํ•ด๋ณด๊ธฐ
With ds AS (	-- ์ค‘๋ณต ์ œ๊ฑฐ์˜ ๊ฒฐ๊ณผ๊ฐ€ ds๋ผ๋Š” ์ž„์‹œ ํ…Œ์ด๋ธ”๋กœ ๋งŒ๋“ค์–ด์ง
  SELECT DISTINCT userId, sessionId, ts, channel 
  FROM adhoc.junhee_session_summary
)
SELECT COUNT(1)
FROM ds;
  1. ์ตœ๊ทผ ๋ฐ์ดํ„ฐ์˜ ์กด์žฌ ์—ฌ๋ถ€ ์ฒดํฌํ•˜๊ธฐ(freshness)
    ํ•„๋“œ์ค‘์— ๊ฐ€์žฅ ์ž‘์€ ๊ฐ’, ๊ฐ€์žฅ ํฐ ๊ฐ’ ํ™•์ธ
SELECT MIN(ts), MAX(ts)
FROM adhoc.junhee_session_summary;

4.Primary key uniqueness๊ฐ€ ์ง€์ผœ์ง€๋Š”์ง€ ์ฒดํฌํ•˜๊ธฐ
์—ฌ๊ธฐ์„œ๋Š” sessionId๊ฐ€ Primary key

SELECT sessionId, COUNT(1)
FROM adhoc.junhee_session_summary
GROUP BY 1
ORDER BY 2 DESC
LIMIT 1;
  1. ๊ฐ’์ด ๋น„์–ด์žˆ๋Š” ์นผ๋Ÿผ๋“ค์ด ์žˆ๋Š”์ง€ ์ฒดํฌํ•˜๊ธฐ
SELECT
    COUNT(CASE WHEN sessionId is NULL THEN 1 END) sessionid_null_count,	-- ๊ฐ’์ด NULL์ด๋ฉด ์นด์šดํŠธ
    COUNT(CASE WHEN userId is NULL THEN 1 END) userid_null_count,
    COUNT(CASE WHEN ts is NULL THEN 1 END) ts_null_count,
    COUNT(CASE WHEN channel is NULL THEN 1 END) channel_null_count
FROM adhoc.junhee_session_summary;

profile
๊ฒŒ์„๋ €๋˜ ํ”„๋กœ๊ทธ๋ž˜๋ฐ ๊ณต๋ถ€

1๊ฐœ์˜ ๋Œ“๊ธ€

comment-user-thumbnail
2023๋…„ 11์›” 13์ผ

๊ฐ์‚ฌํ•ฉ๋‹ˆ๋‹ค. ์ด๋Ÿฐ ์ •๋ณด๋ฅผ ๋‚˜๋ˆ ์ฃผ์…”์„œ ์ข‹์•„์š”.

๋‹ต๊ธ€ ๋‹ฌ๊ธฐ