📒 Spark(9)

Kimdongki · June 20, 2024

📌 Spark File Formats

  • Unstructured

    • Text
  • Semi-structured

    • JSON
    • XML
    • CSV
  • Structured

    • PARQUET
    • AVRO
    • ORC
    • SequenceFile

Feature                   CSV  JSON  PARQUET  AVRO
Column storage            X    X     Y        X
Compression               Y    Y     Y        Y
Splittable                Y    Y     Y        Y
Human readable            Y    Y     X        X
Nested structure support  X    Y     Y        Y
Schema evolution          X    X     Y        Y

  • CSV, JSON: generally not splittable once compressed.
    -> This depends on the compression codec: bzip2 stays splittable, and snappy is splittable when used inside a container format such as Parquet or Avro.
  • PARQUET: Spark's default file format.
  • Gzip-compressed CSV and JSON files are not splittable, so a single executor ends up processing the entire file, which makes out-of-memory errors likely.
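
The splittability point can be illustrated without Spark. The following is a minimal sketch in plain Python, assuming a small made-up CSV payload: a reader dropped into the middle of an uncompressed file can skip to the next newline and carry on, while a reader dropped into the middle of a gzip stream cannot decode anything. The helper names are hypothetical.

```python
import gzip
import zlib

# Hypothetical sample data: 1,000 CSV rows.
rows = [f"{i},user{i}" for i in range(1000)]
plain = ("\n".join(rows) + "\n").encode("utf-8")
compressed = gzip.compress(plain)


def read_plain_from(data, offset):
    """Start at an arbitrary byte offset, skip the partial line, parse the rest."""
    next_newline = data.index(b"\n", offset)
    return data[next_newline + 1:].decode("utf-8").splitlines()


def can_read_gzip_from(data, offset):
    """Try to decompress a gzip stream starting at an arbitrary byte offset."""
    try:
        gzip.decompress(data[offset:])
        return True
    except (OSError, EOFError, zlib.error):
        return False


# A "second worker" starting halfway through each file.
tail_rows = read_plain_from(plain, len(plain) // 2)
gzip_ok_from_start = can_read_gzip_from(compressed, 0)
gzip_ok_from_middle = can_read_gzip_from(compressed, len(compressed) // 2)
```

This is why a gzip-compressed text file ends up as a single input split: only a reader that starts at byte 0 can make sense of it.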

Parquet: Spark's default file format

  • Jointly developed by Twitter and Cloudera. (Doug Cutting)

  • Hybrid Storage (Row Group)
    -> This is the layout Parquet uses. Each data block consists of one Row Group.
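
To make the hybrid layout more concrete, here is a simplified sketch in plain Python. It is not Parquet's actual on-disk encoding; it only shows the idea of cutting rows into row groups and storing each group column by column. The function name and sample records are assumptions made for illustration.

```python
def to_row_groups(rows, columns, rows_per_group):
    """Split rows horizontally into groups, then store each group column-wise."""
    groups = []
    for start in range(0, len(rows), rows_per_group):
        chunk = rows[start:start + rows_per_group]
        # Within one row group, all values of a column sit next to each other.
        groups.append({col: [row[col] for row in chunk] for col in columns})
    return groups


# Hypothetical sample records.
records = [
    {"name": "A", "gender": "F"},
    {"name": "B", "gender": "M"},
    {"name": "C", "gender": "M"},
    {"name": "D", "gender": "F"},
    {"name": "E", "gender": "M"},
]
row_groups = to_row_groups(records, ["name", "gender"], rows_per_group=2)

# Reading a single column touches only that column's chunk in each group.
genders = [g for group in row_groups for g in group["gender"]]
```

The horizontal cut keeps each group small enough to process independently, and the column-wise storage inside a group lets a query skip the columns it does not need.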

📌 Execution Plan

Transformations & Actions

  • Transformations

    • Narrow Dependencies: operations that work independently at the partition level
      -> SELECT, FILTER, MAP, etc.
    • Wide Dependencies: operations that require shuffling
      -> GROUP BY, REDUCE BY, PARTITION BY, REPARTITION, etc.
  • Actions

    • Read, Write, Show, Collect -> these trigger a Job, which is when the code actually runs.
    • Lazy Execution
      -> Because Spark sees more operations before running anything, it can optimize better.
      -> This is also why SQL is preferred.
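
The difference between narrow and wide dependencies can be sketched with plain Python lists standing in for partitions. This is a toy simulation under stated assumptions, not Spark's implementation: the sample rows, the function names, and the byte-sum partitioner are all made up for illustration.

```python
from collections import Counter

# Hypothetical data already split into two partitions of (name, gender) rows.
partitions = [
    [("A", "F"), ("B", "M"), ("C", "M")],
    [("D", "F"), ("E", "M"), ("F", "F")],
]


def narrow_filter(parts, predicate):
    """Narrow dependency: each output partition depends on one input partition."""
    return [[row for row in part if predicate(row)] for part in parts]


def shuffle_by_key(parts, key_fn, num_partitions):
    """Wide dependency: rows are redistributed so equal keys share a partition."""
    shuffled = [[] for _ in range(num_partitions)]
    for part in parts:
        for row in part:
            # Deterministic toy partitioner (Spark uses its own hash function).
            target = sum(key_fn(row).encode("utf-8")) % num_partitions
            shuffled[target].append(row)
    return shuffled


# Filtering never moves rows between partitions.
not_female = narrow_filter(partitions, lambda row: row[1] != "F")

# Grouping needs every row of a key in one place, so rows cross partitions.
by_gender = shuffle_by_key(partitions, lambda row: row[1], num_partitions=2)
counts = Counter(gender for part in by_gender for _, gender in part)
```

In the filter step each partition can be handled in isolation, whereas the grouping step has to look at rows from every input partition before any count is final.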

Transformations & Actions Visualization

# Assumes an existing SparkSession named `spark` and a local test.csv with a header row.
spark.read.option("header", True).csv("test.csv"). \
    where("gender <> 'F'"). \
    select("name", "gender"). \
    groupby("gender"). \
    count(). \
    show()

Jobs, Stages, Tasks

  • Action -> Job -> 1+Stages -> 1+Tasks

  • Action
    -> Creates one Job, and the code actually runs.

  • Job

    • Consists of one or more Stages.
    • A new Stage is created whenever shuffling occurs.
  • Stage

    • Contains Tasks organized as a DAG.
    • These Tasks can run in parallel.
  • Task
    -> The smallest unit of execution, run by an Executor.
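
The rule that a shuffle starts a new Stage can be sketched as a small function. This is a deliberate simplification in plain Python: it treats the chain from the earlier example as a linear list of operation names and assumes only the listed operations shuffle. Real Spark plans are more involved (for instance, part of an aggregation can run before the shuffle), so treat the set of wide operations and the function name as assumptions.

```python
# Simplifying assumption: only these operations force a shuffle.
WIDE_OPS = {"groupBy", "reduceByKey", "repartition"}


def split_into_stages(ops):
    """Cut a linear chain of operations into stages at shuffle boundaries."""
    stages = [[]]
    for op in ops:
        if op in WIDE_OPS and stages[-1]:
            stages.append([])  # A shuffle closes the current stage.
        stages[-1].append(op)
    return stages


# The chain from the example above; show() is the action that triggers the job.
pipeline = ["read", "where", "select", "groupBy", "count", "show"]
stages = split_into_stages(pipeline)
```

Under these assumptions the chain splits into two stages, with the boundary at `groupBy`; a chain without any wide operation stays in a single stage.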

📌 Bucketing & Partitioning

Introduction to Bucketing & Partitioning

  • Both require the Hive metastore -> saveAsTable

  • Both are ways of storing data in a layout optimized for repeated processing later on.

  • Bucketing

    • First ask: is there a column that is frequently used in aggregations, window functions, or joins?
    • If so, save the data as a table bucketed by that column (or columns).
      -> The number of buckets is specified as well.
  • FileSystem Partitioning

    • Originally widely used in Hive.
    • Optimizes storage by building a folder structure based on specific column(s) of the data.
      -> These columns are called the Partition Key.

Bucketing

  • Split a DataFrame by a specific ID and save it as a table.

    • From then on, loading this table shortens repeated processing.

      • Use the bucketBy function of DataFrameWriter.
        -> Specify the number of buckets and the ID column to bucket by.
    • This is usable when you know the characteristics of the data well.

File System Partitioning

  • Data is physically split and stored in a folder ("Partition") structure based on the Partition Key.
    -> This refers to the partitioning used in Hive.

  • Partitioning example

    • What if a large log file is mostly read based on when the data was created?
      -> Store the data itself in a year-month-day structure.
      -> In practice the data is often already stored this way.
  • Advantages of Partitioning

    • The read path can be optimized (less scanning, or none at all).
    • Data management also becomes easier (for example, when applying a retention policy).
  • Use partitionBy of DataFrameWriter.
    -> A poorly chosen Partition Key produces an enormous number of files.
    -> Choose a Partition Key with low cardinality.
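
A minimal sketch of the partitioned write, under a few assumptions: `partition_dir` is a hypothetical helper that only shows the Hive-style `key=value` folder naming, and `save_partitioned` wraps the `DataFrameWriter` chain of `partitionBy` and `saveAsTable`. The base path, table name, and key names are made up for illustration.

```python
def partition_dir(base_path, **keys):
    """Build the Hive-style directory for one partition, e.g. year=2024/month=6."""
    parts = [f"{name}={value}" for name, value in keys.items()]
    return "/".join([base_path.rstrip("/")] + parts)


def save_partitioned(df, table_name, partition_keys):
    """Write `df` (a pyspark.sql.DataFrame) as a table partitioned by the given keys."""
    df.write.mode("overwrite").partitionBy(*partition_keys).saveAsTable(table_name)


# Hypothetical layout for a log table partitioned by date parts.
example_dir = partition_dir("/data/logs", year=2024, month=6, day=20)

# Usage sketch (hypothetical names; needs an active SparkSession):
#   save_partitioned(logs_df, "logs_by_day", ["year", "month", "day"])
```

Each distinct combination of key values becomes its own folder, which is why a high-cardinality key such as a user ID would explode the number of directories and files.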
