๐Ÿ“’ DBT(1)

Kimdongkiยท2024๋…„ 6์›” 6์ผ

DBT

๋ชฉ๋ก ๋ณด๊ธฐ
1/2

๐Ÿ“Œ ELT์˜ ๋ฏธ๋ž˜

  • ETL์€ ๊ฒฐ๊ตญ ELT๋ฅผ ํ•˜๊ธฐ ์œ„ํ•จ์ด๋‹ค. ์ด ๋•Œ ๋ฐ์ดํ„ฐ ํ’ˆ์งˆ ๊ฒ€์ฆ์ด ์ค‘์š”ํ•˜๋‹ค.

  • ๋ฐ์ดํ„ฐ ํ’ˆ์งˆ์˜ ์ค‘์š”์„ฑ ์ฆ๋Œ€

    • ์ž…์ถœ๋ ฅ ์ฒดํฌ

    • ๋” ๋‹ค์–‘ํ•œ ํ’ˆ์งˆ ๊ฒ€์‚ฌ

    • ๋ฆฌ๋‹ˆ์ง€ ์ฒดํฌ

    • ๋ฐ์ดํ„ฐ ํžˆ์Šคํ† ๋ฆฌ ํŒŒ์•…

  • ๋ฐ์ดํ„ฐ ํ’ˆ์งˆ ์œ ์ง€ -> ๋น„์šฉ/๋…ธ๋ ฅ ๊ฐ์†Œ์™€ ์ƒ์‚ฐ์„ฑ ์ฆ๋Œ€์˜ ์ง€๋ฆ„๊ธธ


๐Ÿ“Œ DB Normalization

  • DB๋ฅผ ์ข€๋” ์กฐ์ง์ ์ด๊ณ  ์ผ๊ด€๋œ ๋ฐฉ๋ฒ•์œผ๋กœ ๋””์ž์ธํ•˜๋ ค๋Š” ๋ฐฉ๋ฒ•
    -> DB ์ ‘ํ•ฉ์„ฑ์„ ์‰ฝ๊ฒŒ ์œ ์ง€ํ•˜๊ณ  ๋ ˆ์ฝ”๋“œ๋“ค์„ ์ˆ˜์ •/์‚ญ์ œ/์ ์žฌ๋ฅผ ์šฉ์ดํ•˜๊ฒŒ ํ•˜๋Š” ๊ฒƒ

  • Normalization์— ์‚ฌ์šฉ๋˜๋Š” ๊ฐœ๋…

    • Primary Key
    • Composite Key
    • Foreign Key
  • 1NF(First Normal Form)

    • ํ•œ ์…€์—๋Š”ํ•˜๋‚˜์˜ ๊ฐ’๋งŒ ์žˆ์–ด์•ผํ•œ๋‹ค. (atomicity)
    • Primary Key๊ฐ€ ์žˆ์–ด์•ผํ•œ๋‹ค.
    • ์ค‘๋ณต๋œ ํ‚ค๋‚˜ ๋ ˆ์ฝ”๋“œ๋“ค์ด ์—†์–ด์•ผํ•œ๋‹ค.

-> ์ฆ‰ ๋ชฉํ‘œ๋Š” ์ค‘๋ณต์„ ์ œ๊ฑฐํ•˜๊ณ  atomicity๋ฅผ ๊ฐ–๋Š” ๊ฒƒ์ด๋‹ค.

  • 2NF(Second Nomal Form)
    • 1NF๋ฅผ ๋งŒ์กฑ
    • Primary Key๋ฅผ ์ค‘์‹ฌ์œผ๋กœ ์˜์กด ๊ฒฐ๊ณผ๋ฅผ ์•Œ ์ˆ˜ ์žˆ์–ด์•ผํ•œ๋‹ค.
    • ๋ถ€๋ถ„์ ์ธ ์˜์กด๋„๊ฐ€ ์—†์–ด์•ผ ํ•œ๋‹ค.
      • ์ฆ‰ ๋ชจ๋“  ๋ถ€๊ฐ€ ์†์„ฑ๋“ค์€ Primary Key๋ฅผ ๊ฐ€์ง€๊ณ  ์ฐพ์„ ์ˆ˜ ์žˆ์–ด์•ผํ•œ๋‹ค.

-> ์ค‘๋ณต์„ ์ œ๊ฑฐํ•˜๊ณ  atomicity๋ฅผ ๊ฐ–๋Š” ๊ฒƒ์ด ๋ชฉํ‘œ์ด๋‹ค.

  • 3NF(Third Normal Form)
    • 2NF๋ฅผ ๋งŒ์กฑ
    • ์ „์ด์  ๋ถ€๋ถ„ ์ข…์†์„ฑ์ด ์—†์–ด์•ผํ•œ๋‹ค.
  • Slowly Changing Dimensions

    • DW๋‚˜ DL์—์„œ๋Š” ๋ชจ๋“  ํ…Œ์ด๋ธ”๋“ค์˜ ํžˆ์Šคํ† ๋ฆฌ๋ฅผ ์œ ์ง€ํ•˜๋Š” ๊ฒƒ์ด ์ค‘์š”ํ•˜๋‹ค.

      • ๋ณดํ†ต ๋‘ ๊ฐœ์˜ timestampํ•„๋“œ๋ฅผ ๊ฐ–๋Š” ๊ฒƒ์ด ์ข‹๋‹ค.

        • created_at(์ƒ์„ฑ์‹œ๊ฐ„์œผ๋กœ ํ•œ๋ฒˆ ๋งŒ๋“ค์–ด์ง€๋ฉด ๊ณ ์ •๋œ๋‹ค.)

        • updated_at(๊ผญ ํ•„์š”ํ•˜๋‹ค, ๋งˆ์ง€๋ง‰ ์ˆ˜์ • ์‹œ๊ฐ„์„ ๋‚˜ํƒ€๋‚ธ๋‹ค.)

    • ์ด ๊ฒฝ์šฐ ์ปฌ๋Ÿผ์˜ ์„ฑ๊ฒฉ์— ๋”ฐ๋ผ ์–ด๋–ป๊ฒŒ ์œ ์ง€ํ• ์ง€ ๋ฐฉ๋ฒ•์ด ๋‹ฌ๋ผ์ง„๋‹ค.

      • SCD Type 0
      • SCD Type 1
      • ...
      • SCD Type 4
  • SCD Type 0

    • ํ•œ ๋ฒˆ ์“ฐ๊ณ  ๋‚˜๋ฉด ๋ฐ”๊ฟ€ ์ด์œ ๊ฐ€ ์—†๋Š” ๊ฒฝ์šฐ๋“ค
    • ํ•œ ๋ฒˆ ์ •ํ•ด์ง€๋ฉด ๊ฐฑ์‹ ๋˜์ง€ ์•Š๊ณ  ๊ณ ์ •๋˜๋Š” ํ•„๋“œ๋“ค
    • ex) ๊ณ ๊ฐ ํ…Œ์ด๋ธ”์ด๋ผ๋ฉด ํšŒ์› ๋“ฑ๋ก์ผ, ์ œํ’ˆ์ฒซ ๊ตฌ๋งค์ผ
  • SCD Type 1

    • ๋ฐ์ดํ„ฐ๊ฐ€ ์ƒˆ๋กœ ์ƒ๊ธฐ๋ฉด ๋ฎ์–ด์“ฐ๋ฉด ๋˜๋Š” ์ปฌ๋Ÿผ๋“ค
    • ์ฒ˜์Œ ๋ ˆ์ฝ”๋“œ ์ƒ์„ฑ์‹œ์—๋Š” ์กด์žฌํ•˜์ง€ ์•Š์•˜์ง€๋งŒ ๋‚˜์ค‘์— ์ƒ๊ธฐ๋ฉด์„œ ์ฑ„์šฐ๋Š” ๊ฒฝ์šฐ
    • ex) ๊ณ ๊ฐ ํ…Œ์ด๋ธ”์ด๋ผ๋ฉด ์—ฐ๊ฐ„ ์†Œ๋“ ํ•„๋“œ
  • SCD Type 2

    • ํŠน์ • entity์— ๋Œ€ํ•œ ๋ฐ์ดํ„ฐ๊ฐ€ ์ƒˆ๋กœ์šด ๋ ˆ์ฝ”๋“œ๋กœ ์ถ”๊ฐ€๋˜์–ด์•ผํ•˜๋Š” ๊ฒฝ์šฐ
    • ex) ๊ณ ๊ฐ ํ…Œ์ด๋ธ”์—์„œ ๊ณ ๊ฐ์˜ ๋“ฑ๊ธ‰ ๋ณ€ํ™”
      • tier๋ผ๋Š” ์ปฌ๋Ÿผ์˜ ๊ฐ’์ด "regular"์—์„œ "vip"๋กœ ๋ณ€ํ™”ํ•˜๋Š” ๊ฒฝ์šฐ
      • ๋ณ€๊ฒฝ ์‹œ๊ฐ„๋„๊ฐ™์ด ์ถ”๊ฐ€๋˜์–ด์•ผ ํ•œ๋‹ค.
  • SCD Type 3

    • SCD Type 2์˜ ๋Œ€์•ˆ์œผ๋กœ ํŠน์ • entity ๋ฐ์ดํ„ฐ๊ฐ€ ์ƒˆ๋กœ์šด ์ปฌ๋Ÿผ์œผ๋กœ ์ถ”๊ฐ€๋˜๋Š” ๊ฒฝ์šฐ
    • ex) ๊ณ ๊ฐ ํ…Œ์ด๋ธ”์—์„œ tier๋ผ๋Š” ์ปฌ๋Ÿผ์˜ ๊ฐ’์ด "regular"์—์„œ "vip"๋กœ ๋ณ€ํ™”ํ•˜๋Š” ๊ฒฝ์šฐ
      • previous_tier๋ผ๋Š” ์ปฌ๋Ÿผ ์ƒ์„ฑ
      • ๋ณ€๊ฒฝ ์‹œ๊ฐ„๋„ ๋ณ„๋„ ์ปฌ๋Ÿผ์œผ๋กœ ์กด์žฌํ•ด์•ผํ•œ๋‹ค.
  • SCD Type 4

    • ํŠน์ • entity์— ๋Œ€ํ•œ ๋ฐ์ดํ„ฐ๋ฅผ ์ƒˆ๋กœ์šด Dimension ํ…Œ์ด๋ธ”์— ์ €์žฅํ•˜๋Š” ๊ฒฝ์šฐ
    • SCD Type 2์˜ ๋ณ€์ข…
    • ex) ๋ณ„๋„์˜ ํ…Œ์ด๋ธ”๋กœ ์ €์žฅํ•˜๊ณ  ์ด ๊ฒฝ์šฐ ์•„์˜ˆ ์ผ๋ฐ˜ํ™”ํ•  ์ˆ˜๋„ ์žˆ๋‹ค.

๐Ÿ“Œ DBT(Data Build Tool)

  • Data Build Tool

    • ELT์šฉ ์˜คํ”ˆ์†Œ์Šค : In-warehouse data transformation
    • DBT Labs๋ผ๋Š” ํšŒ์‚ฌ๊ฐ€ ์ƒ์šฉํ™”ํ•จ
    • Analytics Engineer๋ผ๋Š” ๋ง์„ ๋งŒ๋“ค์–ด ๋ƒˆ๋‹ค.
  • ๋‹ค์–‘ํ•œ ๋ฐ์ดํ„ฐ ์›จ์–ดํ•˜์šฐ์Šค๋ฅผ ์ง€์›

    • Redshift, Snowflake, Bigquery, Spark
  • ํด๋ผ์šฐ๋“œ ๋ฒ„์ „๋„ ์กด์žฌํ•œ๋‹ค.

    • dbt Cloud
  • DBT๊ฐ€ ์ง€์›ํ•˜๋Š” ๋ฐ์ดํ„ฐ ์‹œ์Šคํ…œ

    • BigQuery
    • Redshift
    • Snowflake
    • Spark
    • ...
  • DBT ๊ตฌ์„ฑ ์ปดํฌ๋„ŒํŠธ
    • ๋ฐ์ดํ„ฐ ๋ชจ๋ธ(models)
      • ํ…Œ์ด๋ธ”๋“ค์„ ๋ช‡ ๊ฐœ์˜ ํ‹ฐ์–ด๋กœ ๊ด€๋ฆฌํ•œ๋‹ค.
        -> ์ผ์ข…์˜ CTAS(SELECT ๋ฌธ๋“ค), Lineage ํŠธ๋ž˜ํ‚น
      • Table, View, CTE ๋“ฑ๋“ฑ
    • ๋ฐ์ดํ„ฐ ํ’ˆ์งˆ ๊ฒ€์ฆ(tests)
    • ์Šคํƒญ์ƒท(snapshots)

๐Ÿ“Œ DBT ์‚ฌ์šฉ ์‹œ๋‚˜๋ฆฌ์˜ค

  • DBT๊ฐ€ ์–ด๋–ป๊ฒŒ ์‚ฌ์šฉ๋  ์ˆ˜ ์žˆ๋Š”์ง€ ๊ฐ€์ƒ ํ™˜๊ฒฝ์„ ์ƒ๊ฐํ•ด๋ณด์ž.

  • ์š”๊ตฌ์กฐ๊ฑด

    • ๋ฐ์ดํ„ฐ ๋ณ€๊ฒฝ ์‚ฌํ•ญ์„ ์ดํ•ดํ•˜๊ธฐ ์‰ฝ๊ณ  ํ•„์š”ํ•˜๋‹ค๋ฉด ๋กค๋ฐฑ ๊ฐ€๋Šฅ
    • ๋ฐ์ดํ„ฐ๊ฐ„๋ฆฌ๋‹ˆ์ง€ ํ™•์ธ ๊ฐ€๋Šฅ
    • ๋ฐ์ดํ„ฐ ํ’ˆ์งˆ ํ…Œ์ŠคํŠธ ๋ฐ ์—๋Ÿฌ ๋ณด๊ณ 
    • Fact ํ…Œ์ด๋ธ”์˜ ์ฆ๋ถ„ ๋กœ๋“œ(Incremental Update)
    • Dimension ํ…Œ์ด๋ธ” ๋ณ€๊ฒฝ ์ถ”์ (History Table)
    • ์šฉ์ดํ•œ ๋ฌธ์„œ ์ž‘์„ฑ
  • ๋ณดํ†ต ์‚ฌ์šฉํ•˜๋Š” Tech Stack

    • Redshift/Spark/Snowflake/BigQuery
    • DBT
    • Airflow
  • ์‹ค์Šต

    • Redshift ์‚ฌ์šฉ
    • ABํ…Œ์ŠคํŠธ ๋ถ„์„์„ ์‰ฝ๊ฒŒ ํ•˜๊ธฐ ์œ„ํ•œ ELT ํ…Œ์ด๋ธ” ์ƒ์„ฑ
    • ์ž…๋ ฅ ํ…Œ์ด๋ธ”
      • user_event
      • user_variant
      • user_metadata
    • ์ƒ์„ฑ ํ…Œ์ด๋ธ” : Variant๋ณ„ ์‚ฌ์šฉ์ž๋ณ„ ์ผ๋ณ„ ์š”์•ฝ ํ…Œ์ด๋ธ”
      • variant_id
      • user_id
      • datestamp
      • age
      • gender
      • Total impression
      • Total click
      • Total purchase
      • Total revenue

Input Data

  • Production DB์— ์ €์žฅ๋˜๋Š” ์ •๋ณด๋“ค์„ Data Warehouse๋กœ ์ ์žฌํ–ˆ๋‹ค๊ณ  ๊ฐ€์ •
  • raw_data.user_event
    • ์‚ฌ์šฉ์ž/๋‚ ์งœ/์•„์ดํ…œ๋ณ„๋กœ impression์ด ์žˆ๋Š” ๊ฒฝ์šฐ ๊ทธ ์ •๋ณด๋ฅผ ๊ธฐ๋ก
    • impression์œผ๋กœ๋ถ€ํ„ฐ ํด๋ฆญ, ๊ตฌ๋งค, ๊ตฌ๋งค์‹œ ๊ธˆ์•ก์„ ๊ธฐ๋ก
    • ์‹ค์ œ ํ™˜๊ฒฝ์—์„œ๋Š” ์ด๋Ÿฐ aggregate ์ •๋ณด๋ฅผ ๋กœ๊ทธ ํŒŒ์ผ๋“ฑ์˜ ์†Œ์Šค๋กœ๋ถ€ํ„ฐ ๋งŒ๋“ค์–ด๋‚ด๋Š” ํ”„๋กœ์„ธ์Šค๊ฐ€ ํ•„์š”ํ•˜๋‹ค.
CREATE TABLE raw_data.user_event(
	user_id int,
	datestamp timestamp,
	item_id int,
	clicked int,
	purchased int,
	paidamountint
);
  • raw_data.user_variant
    • ์‚ฌ์šฉ์ž๊ฐ€ ์†Œ์†ํ•œ AB test variant๋ฅผ ๊ธฐ๋กํ•œํŒŒ์ผ(control vs. test)
CREATE TABLE raw_data.user_variant(
	user_id int,
    variant_id varchar(32) -- control vs. test
);
  • ๋ณดํ†ต experiment์™€ variant ํ…Œ์ด๋ธ”์ด ๋ณ„๋„๋กœ ์กด์žฌํ•œ๋‹ค.

  • ์–ธ์ œ variant_id๋กœ ์†Œ์†๋˜์—ˆ๋Š”์ง€ Timestampํ•„๋“œ๊ฐ€ ์กด์žฌํ•˜๋Š” ๊ฒƒ์ด ์ผ๋ฐ˜์ ์ด๋‹ค.

  • raw_data.user_metadata

    • ์‚ฌ์šฉ์ž์— ๊ด€ํ•œ metadata ์ •๋ณด๊ฐ€ ๊ธฐ๋ก๋œ ํŒŒ์ผ(์„ฑ๋ณ„, ๋‚˜์ด ๋“ฑ๋“ฑ)
CREATE TABLE raw_data.user_metadata(
	user_id int,
    age varchar(16),
    gender varchar(16)
);

Fact Table

  • ๋ถ„์„์˜ ์ดˆ์ ์ด ๋˜๋Š” ์–‘์  ์ •๋ณด๋ฅผ ํฌํ•จํ•˜๋Š” ์ค‘์•™ ํ…Œ์ด๋ธ”
  • ์ผ๋ฐ˜์ ์œผ๋กœ ๋งค์ถœ ์ˆ˜์ต, ํŒ๋งค๋Ÿ‰, ์ด์ต๊ณผ ๊ฐ™์€ ์ธก์ • ํ•ญ๋ชฉ์ด ํฌํ•จ๋œ๋‹ค.
  • ๋น„์ฆˆ๋‹ˆ์Šค ๊ฒฐ์ •์— ์‚ฌ์šฉ๋œ๋‹ค.
  • Fact ํ…Œ์ด๋ธ”์€ ์ผ๋ฐ˜์ ์œผ๋กœ ์™ธ๋ž˜ ํ‚ค๋ฅผ ํ†ตํ•ด ์—ฌ๋Ÿฌ Dimension ํ…Œ์ด๋ธ”๊ณผ ์—ฐ๊ฒฐ๋œ๋‹ค.
  • ๋ณดํ†ต Fact ํ…Œ์ด๋ธ”์˜ ํฌ๊ธฐ๊ฐ€ ํ›จ์”ฌ ๋” ํฌ๋‹ค.

Dimension Table

  • Fact ํ…Œ์ด๋ธ”์— ๋Œ€ํ•œ ์ƒ์„ธ ์ •๋ณด๋ฅผ ์ œ๊ณตํ•˜๋Š” ํ…Œ์ด๋ธ”
  • ๊ณ ๊ฐ, ์ œํ’ˆ๊ณผ ๊ฐ™์€ ํ…Œ์ด๋ธ”๋กœ Factํ…Œ์ด๋ธ”์— ๋Œ€ํ•œ ์ƒ์„ธ ์ •๋ณด๋ฅผ ์ œ๊ณตํ•œ๋‹ค.
  • Fact ํ…Œ์ด๋ธ”์˜ ๋ฐ์ดํ„ฐ์— ๋งฅ๋ฝ์„ ์ œ๊ณตํ•˜์—ฌ ๋‹ค์–‘ํ•œ ๋ฐฉ์‹์œผ๋กœ ๋ถ„์„ ๊ฐ€๋Šฅํ•˜๊ฒŒ ํ•ด์ค€๋‹ค.
  • Dimension ํ…Œ์ด๋ธ”์€ Primary Key๋ฅผ ๊ฐ€์ง€๋ฉฐ, Factํ…Œ์ด๋ธ”์—์„œ ์ฐธ์กฐํ•œ๋‹ค.(Foreign Key)
  • ๋ณดํ†ต Dimension ํ…Œ์ด๋ธ”์˜ ํฌ๊ธฐ๋Š” ํ›จ์”ฌ ๋” ์ž‘๋‹ค.

์ตœ์ข… ์ƒ์„ฑ ๋ฐ์ดํ„ฐ (ELT Table)

SELECT
	varian_id,
    ue.user_id,
    datestamp,
    age,
    gender,
    COUNT(DISTINCT item_id) num_of_items, -- Total impression
    COUNT(DISTINCT CASE WHEN clicked THEN item_id END) num_of_clicks, -- Total click
    SUM(purchased) num_of_purchases, -- Total purchase
    SUM(paidamount) revenue -- Total revenue
FROMraw_data.user_event ue
JOIN raw_data.user_variant uv ON ue.user_id = uv.user_id
JOIN raw_data.user_metadata um ON uv.user_id = um.user_id
GROUP BY 1, 2, 3, 4, 5;

๐Ÿ“Œ DBT ์„ค์น˜/ํ™˜๊ฒฝ์„ค์ •

  • ์‚ฌ์šฉ์ ˆ์ฐจ
    • dbt ์„ค์น˜
      • dbt Cloud vs. dbtCore
      • git์„ ๋ณดํ†ต ์‚ฌ์šฉํ•œ๋‹ค.
        -> pip install dbt-redshift
        -> dbt-core ๋ชจ๋“ˆ๋„ ๊ฐ™์ด ์„ค์น˜๋œ๋‹ค.
    • dbt ํ™˜๊ฒฝ์„ค์ •
    • Connector ์„ค์ •
      • Connector๊ฐ€ ๋ฐ”๋กœ ๋ฐ”ํƒ•์ด ๋˜๋Š” ๋ฐ์ดํ„ฐ ์‹œ์Šคํ…œ(Redshift, Spark, ...)
    • ๋ฐ์ดํ„ฐ๋ชจ๋ธ๋ง(tier)
      • Raw Data -> Staging -> Core
    • ํ…Œ์ŠคํŠธ ์ฝ”๋“œ ์ž‘์„ฑ
    • Snapshot ์„ค์ •
  • ์„ค์น˜

    • pip install dbt-redshift
      -> ์—ฌ๊ธฐ์„œ ๋งŽ์€ ์˜ค๋ฅ˜๊ฐ€ ๋ฐœ์ƒํ•  ๊ฒƒ์ด๋‹ค.. ํ•˜๋‚˜์”ฉ ๋”ฐ๋ผํ•ด๋ณด์ž.
    • sudo apt-get update
    • get-pip.py ์„ค์น˜
    • pip ์„ค์น˜ ๋””๋ ‰ํ† ๋ฆฌ๋ฅผ PATH์— ์ถ”๊ฐ€ํ•˜๊ธฐ
      • nano ~/.bashrc
      • ๋งจ ์•„๋ž˜์— ์•„๋ž˜ ์ฝ”๋“œ ์ž‘์„ฑ
      • export PATH=$PATH:/home/urface/.local/bin
      • urface ๋Œ€์‹  ๋‹น์‹  local์ด๋ฆ„ ์ž‘์„ฑ
      • source ~/.bashrc (์ €์žฅ)
      • pip3 --version (ํ™•์ธ)
    • PostgreSQL ์„ค์น˜
      • sudo apt-get install postgresql libpq-dev
    • gcc๋“ฑ ํ•„์ˆ˜ ๋„๊ตฌ ์„ค์น˜
      • sudo apt-get install build-essential
    • ๋งŒ์•ฝ psycopg2 ์˜ค๋ฅ˜๊ฐ€ ๋ฐœ์ƒํ•˜๋ฉด ์•„๋ž˜ ๋ช…๋ น์–ด ์ž…๋ ฅ
      • pip3 install --user psycopg2-binary
    • Python ๊ฐœ๋ฐœํŒจํ‚ค์ง€
      • sudo apt-get install python3-dev
    • DBT - Redshift ์„ค์น˜
      • pip install --user dbt-redshift
  • ํ™˜๊ฒฝ ์„ค์ •

    • dbt๋ผ๋Š” ๋””๋ ‰ํ† ๋ฆฌ๋ฅผ ํ•˜๋‚˜ ๋งŒ๋“ค๊ณ  ์ง„ํ–‰ํ•˜์ž.
      • mkdir dbt -> cd dbt
    • init์œผ๋กœ ํ™˜๊ฒฝ์„ค์ •์„ ์ง„ํ–‰ํ•ด์ฃผ์ž.
      • dbt init dbt_name
    • ํ™•์ธํ•ด๋ณด์ž.
      • ls -tl dbt_name
    • dbt_project.yml๊ณผ profiles.yml ํŒŒ์ผ์€ ์ค‘์š”ํ•˜๋‹ค.
  • dbt ํŒŒ์ผ๊ณผ ํด๋”

    • dbt_project.yml : ๋ฉ”์ธ ํ™˜๊ฒฝ ์„ค์ • ํŒŒ์ผ
    • models
    • seeds
    • tests
    • snapshots
    • macros
    • analyses
    • README.md
  • dbt_project.yml

    • name
    • version
    • config-version
    • profile
    • ์•„๋ž˜ ๋‚ด์šฉ์€ ํด๋” ์ด๋ฆ„๋“ค๊ณผ ์ผ์น˜ํ•ด์•ผํ•œ๋‹ค.
    • model-paths:["models"]
    • analysis-paths:["analyses"]
    • test-paths:["tests"]
    • seed-paths:["seeds"]
    • macro-paths:["macros"]
    • snapshot-paths:["snapshots"]
    • target-path: "target" -> ๊ฒฐ๊ณผ๋“ค์ด ์ €์žฅ๋˜๋Š” ํด๋”
    • clean-targets:
      • "target"
      • "dbt_packages"
    • models:
      • learn_dbt:
        • example:
          • +materialized: view
    • ์—ฌ๊ธฐ์„œ example์ดํ•˜ 2์ค„์€ ์‚ญ์ œํ•˜๋ฉฐ models์— ์žˆ๋Š” example ์„œ๋ธŒํด๋”๋„ ์‚ญ์ œํ•œ๋‹ค.
  • dbt_profiles.yml

    • Project_name
    • dev:
      • dbname
      • host
      • password
      • port
      • schema
      • threads
      • type
      • user
    • target : dev

๐Ÿ“Œ DBT Model

1. Model์ด๋ž€?

  • ELT ํ…Œ์ด๋ธ”์„ ๋งŒ๋“ฌ์— ์žˆ์–ด์„œ ๊ธฐ๋ณธ์ด ๋˜๋Š” ๋นŒ๋”ฉ๋ธ”๋ก์ด๋‹ค.
    -> Table, View, CTE์˜ ํ˜•ํƒœ๋กœ ์กด์žฌํ•œ๋‹ค.
  • ์ž…๋ ฅ, ์ค‘๊ฐ„, ์ตœ์ข… ํ…Œ์ด๋ธ”์„ ์ •์˜ํ•˜๋Š” ๊ณณ์ด๋‹ค.
    • Tier(raw, staging,core,...)
    • raw => staging(src) => core

2. View๋ž€?

  • SELECT ๊ฒฐ๊ณผ๋ฅผ ๊ธฐ๋ฐ˜์œผ๋กœ ๋งŒ๋“ค์–ด์ง„ ๊ฐ€์ƒ ํ…Œ์ด๋ธ”
    • ๊ธฐ์กด ํ…Œ์ด๋ธ”์˜ ์ผ๋ถ€ ํ˜น์€ ์—ฌ๋Ÿฌ ํ…Œ์ด๋ธ”๋“ค์„ ์กฐ์ธํ•œ ๊ฒฐ๊ณผ๋ฅผ ์ œ๊ณตํ•œ๋‹ค.
    • CREATE VIEW ์ด๋ฆ„ AS SELECT ...
  • View์˜ ์žฅ์ 
    • ๋ฐ์ดํ„ฐ์˜ ์ถ”์ƒํ™” : ์‚ฌ์šฉ์ž๋Š” View๋ฅผ ํ†ตํ•ด ํ•„์š” ๋ฐ์ดํ„ฐ์— ์ง์ ‘ ์ ‘๊ทผํ•œ๋‹ค.
    • ์›๋ณธ ๋ฐ์ดํ„ฐ๋ฅผ ์•Œ ํ•„์š”๊ฐ€ ์—†๋‹ค.
    • ๋ฐ์ดํ„ฐ๋ณด์•ˆ : View๋ฅผ ํ†ตํ•ด ์‚ฌ์šฉ์ž์—๊ฒŒ ํ•„์š”ํ•œ ๋ฐ์ดํ„ฐ๋งŒ ์ œ๊ณตํ•œ๋‹ค.
    • ์›๋ณธ๋ฐ์ดํ„ฐ ์ ‘๊ทผ ๋ถˆํ•„์š”
    • ๋ณต์žกํ•œ ์ฟผ๋ฆฌ์˜ ๊ฐ„์†Œํ™” : SQL(View)๋ฅผ ์‚ฌ์šฉํ•˜๋ฉด ๋ณต์žกํ•œ ์ฟผ๋ฆฌ๋ฅผ ๋‹จ์ˆœํ™”ํ•œ๋‹ค.
  • View์˜ ๋‹จ์ 
    • ๋งค๋ฒˆ ์ฟผ๋ฆฌ๊ฐ€ ์‹คํ–‰๋˜๋ฏ€๋กœ ์‹œ๊ฐ„์ด ๊ฑธ๋ฆด ์ˆ˜ ์žˆ๋‹ค.
    • ์›๋ณธ ๋ฐ์ดํ„ฐ์˜ ๋ณ€๊ฒฝ์„ ๋ชจ๋ฅด๋ฉด ์‹คํ–‰์ด ์‹คํŒจํ•œ๋‹ค.

3. Model ๊ตฌ์„ฑ์š”์†Œ

  • Input
    • ์ž…๋ ฅ(raw)๊ณผ ์ค‘๊ฐ„(staging, src)๋ฐ์ดํ„ฐ ์ •์˜
    • raw๋Š” CTE๋กœ ์ •์˜
    • staging์€ View๋กœ ์ •์˜
  • Output
    • ์ตœ์ข…(core) ๋ฐ์ดํ„ฐ ์ •์˜
    • core๋Š” Table๋กœ ์ •์˜
  • model ํด๋” ์•„๋ž˜ sql ํŒŒ์ผ๋กœ ์กด์žฌ
    • ๊ธฐ๋ณธ์ ์œผ๋กœ๋Š” SELECT + Jinja ํ…œํ”Œ๋ฆฟ๊ณผ ๋งคํฌ๋กœ
    • ๋‹ค๋ฅธ ํ…Œ์ด๋ธ”๋“ค์„ ์‚ฌ์šฉ ๊ฐ€๋Šฅํ•˜๋‹ค.(reference)
      -> ์ด๋ฅผ ํ†ตํ•ด ๋ฆฌ๋‹ˆ์ง€ ํŒŒ์•…

4. Model ๊ตฌ์„ฑ - input

  • src ํด๋” ์ƒ์„ฑ
    -> mkdir src
    ->cd src

  • models/sec/src_user_event.sql

WITH src_user_evernt AS(
	SELECT * FROM raw_data.user_event
)
SELECT
	user_id,
    datestamp,
    item_id,
    clicked,
    purchased,
    paidamount
FROM
	src_user_event
  • models/sec/src_user_metadata.sql
WITH src_user_metadata AS(
	SELECT * FROM raw_data.user_metadata
)
SELECT
	user_id,
    age,
    gender,
    updated_at
FROM
	src_user_metadata
  • models/sec/src_user_variant.sql
WITH src_user_variant AS(
	SELECT * FROM raw_data.user_variant
)
SELECT
	user_id,
    variant_id
FROM
	src_user_variant

5. Model Building : dbt run

6. Model Building Check

  • ํ•ด๋‹น ์Šคํ‚ค๋งˆ ๋ฐ‘์— ํ…Œ์ด๋ธ” ์ƒ์„ฑ ์—ฌ๋ถ€ ํ™•์ธ
  • dbt run์€ ํ”„๋กœ์ ํŠธ์˜ ๋‹ค์–‘ํ•œ SQL์„ ์‹คํ–‰ํ•œ๋‹ค.
    -> ์ด SQL๋“ค์€ DAG๋กœ ๊ตฌ์„ฑ๋˜์—ˆ๋‹ค.
  • dbt run์€ ๋ณดํ†ต ๋‹ค๋ฅธ ๋” ํฐ ๋ช…๋ น์˜ ์ผ๋ถ€๋กœ ์‹คํ–‰ํ•œ๋‹ค.
    • dbt test
    • dbt docs generate

7. Model ๊ตฌ์„ฑ - output

  • Materialization

    • ์ž…๋ ฅ ๋ฐ์ดํ„ฐ(Table)๋“ค์„ ์—ฐ๊ฒฐํ•˜์—ฌ ์ƒˆ๋กœ์šด ๋ฐ์ดํ„ฐ(Table)์„ ์ƒ์„ฑํ•˜๋Š” ๊ฒƒ์ด๋‹ค.
      -> ๋ณดํ†ต ์—ฌ๊ธฐ์„œ ์ถ”๊ฐ€ transformation์ด๋‚˜ ๋ฐ์ดํ„ฐ clean-up์„ ์ˆ˜ํ–‰ํ•œ๋‹ค.
    • 4๊ฐ€์ง€์˜ ๋‚ด์žฅ materialization์ด ์ œ๊ณต๋œ๋‹ค.
    • ํŒŒ์ผ์ด๋‚˜ ํ”„๋กœ์ ํŠธ๋ ˆ๋ฒจ์—์„œ ๊ฐ€๋Šฅํ•˜๋‹ค.
    • dbt run์„ ๊ธฐํƒ€ ํŒŒ๋ผ๋ฏธํ„ฐ๋ฅผ ๊ฐ€์ง€๊ณ  ์‹คํ–‰ํ•œ๋‹ค.
  • Materialization ์ข…๋ฅ˜

    • View
      • ๋ฐ์ดํ„ฐ๋ฅผ ์ž์ฃผ ์‚ฌ์šฉํ•˜์ง€ ์•Š๋Š” ๊ฒฝ์šฐ
    • Table
      • ๋ฐ์ดํ„ฐ๋ฅผ ๋ฐ˜๋ณตํ•ด์„œ ์ž์ฃผ ์‚ฌ์šฉํ•˜๋Š” ๊ฒฝ์šฐ
    • Incremental (Table Appends)
      • Fact Table
      • ๊ณผ๊ฑฐ ๋ ˆ์ฝ”๋“œ๋ฅผ ์ˆ˜์ •ํ•  ํ•„์š”๊ฐ€ ์—†๋Š” ๊ฒฝ์šฐ
    • Ephemeral(CTE)
      • ํ•œ SELECT์—์„œ ์ž์ฃผ ์‚ฌ์šฉ๋˜๋Š” ๋ฐ์ดํ„ฐ๋ฅผ ๋ชจ๋“ˆํ™” ํ•˜๋Š”๋ฐ ์‚ฌ์šฉํ•œ๋‹ค.
  • Models ์•„๋ž˜์— core Table๋“ค์„ ์œ„ํ•œ ํด๋” ์ƒ์„ฑ

    • dim ํด๋”์™€ fact ํด๋” ์ƒ์„ฑ
      • dim ์•„๋ž˜ ๊ฐ๊ฐ dim_user_variant.sql, dim_user_metadata.sql ์ƒ์„ฑ
      • fact ์•„๋ž˜ fact_user_event.sql ์ƒ์„ฑ
    • ์ด ๋ชจ๋‘๋ฅผ physicaltable๋กœ ์ƒ์„ฑ
  • models/dim/dim_user_variant.sql
    -> Jinja ํ…œํ”Œ๋ฆฟ๊ณผ ref ํƒœ๊ทธ๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ dbt ๋‚ด ๋‹ค๋ฅธ ํ…Œ์ด๋ธ”๋“ค์„ ์•ก์„ธ์Šค ํ•œ๋‹ค.

WITH src_user_variant AS(
	SELECT * FROM {{ ref('src_user_variant') }}
)
SELECT
	user_id,
    variant_id
FROM
	src_user_variant
  • models/dim/dim_user_metadata.sql
    -> ์„ค์ •์—๋”ฐ๋ผ view/table/CTE ๋“ฑ์œผ๋กœ ๋งŒ๋“ค์–ด์„œ ์‚ฌ์šฉ๋œ๋‹ค.
    -> meterialized๋ผ๋Š” ํ‚ค์›Œ๋“œ๋กœ ์„ค์ •ํ•œ๋‹ค.
WITH src_user_metadata AS(
	SELECT * FROM {{ ref('src_user_metadata') }}
)
SELECT
	user_id,
    age,
    gender,
    updated_at
FROM
	src_user_metadata
  • models/fact/fact_user_event.sql
    -> Incremental Table๋กœ ๋นŒ๋“œ (materialized = 'incremental')
{{
	config(
    	materialized = 'incremental',
        on_schema_change='fail' 
    ) 
}} 
WITH src_user_event AS (
	SELECT * FROM {{ ref("src_user_event") }} 
)
SELECT 
	user_id, 
    datestamp, 
    item_id, 
    clicked, 
    purchased, 
    paidamount 
FROM 
	src_user_event
WHERE datestamp is not NULL
{% if is_incremental() %}
	AND datestamp> (SELECT max(datestamp) FROM {{ this }})
{% endif %}
  • incremental_strategy๋„ ์‚ฌ์šฉ๊ฐ€๋Šฅ

    • append
    • merge
    • insert_overwrite
      ->์ด ๊ฒฝ์šฐ unique_key์™€ merge_update_columns ํ•„๋“œ๋ฅผ ์‚ฌ์šฉํ•˜๊ธฐ๋„ ํ•จ
  • ์Šคํ‚ค๋งˆ๊ฐ€ ๋ฐ”๋€ ๊ฒฝ์šฐ ๋Œ€์‘ ๋ฐฉ๋ฒ• ์ง€์ •

    • append_new_columns
    • ignore
    • sync_all_columns
    • fail
  • model์˜ materialized format ๊ฒฐ์ •

    • ์ตœ์ข… Core Table๋“ค์€ View๊ฐ€ ์•„๋‹Œ Table๋กœ ๋นŒ๋“œํ•œ๋‹ค.
    • dbt_project.yml ํŽธ์ง‘
  • dbt compile vs. dbt run

    • dbt compile : SQL ์ฝ”๋“œ๊นŒ์ง€๋งŒ ์ƒ์„ฑ, ์‹คํ–‰X
    • dbt run : ์ƒ์„ฑ๋œ ์ฝ”๋“œ๋ฅผ ์‹ค์ œ ์‹คํ–‰ํ•œ๋‹ค.

0๊ฐœ์˜ ๋Œ“๊ธ€