📒 Spark(4)

Kimdongki · June 18, 2024

📌 Spark data structures

  • RDD, DataFrame, Dataset (Immutable Distributed Data)
    -> In 2016 (Spark 2.0), DataFrame and Dataset were unified into a single API.
    -> All three are split into partitions and processed by Spark.
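| | RDD | DataFrame | Dataset |
|---|---|---|---|
| What? | Distributed collection of records (structured & unstructured) | RDD organized into named columns | Extension of DataFrame |
| Introduced in (Spark version) | 1.0 | 1.3 | 1.6 |
| Compile-time type check | No | No | Yes |
| High-level API | No | Yes | Yes |
| Built on Spark SQL | No | Yes | Yes |
| Catalyst Optimizer | No | Yes | Yes |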

1. RDD

  • Low-level data; refers to data distributed across the servers in a cluster.

  • Data exists as individual records, but there is no schema.

    • Supports both structured and unstructured data.
  • Programming productivity is lower than with the higher-level APIs.

2. DataFrame & Dataset

  • Unlike an RDD, they carry field information -> Table

  • A Dataset carries type information and can be used from compiled languages.
    -> Available in Scala/Java

  • PySpark uses DataFrame.


1. Code Analysis : analyzes the code -> raises an error if something is wrong

2. Logical Optimization (Catalyst Optimizer) : finds candidate ways to execute the code (cost calculation).

3. Physical Planning : picks the cheapest plan and turns it into RDD operations.

4. Code Generation (Project Tungsten) : converts the code -> Java bytecode


📌 RDD

  • Immutable data stored in a distributed way
    • An RDD is made up of multiple partitions.
    • Supports low-level functional transformations (map, filter, flatMap, etc.).
  • Ordinary Python data is converted to an RDD with the parallelize function.
    • In the other direction, collect converts an RDD back to Python data.
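from pyspark import SparkContext

sc = SparkContext.getOrCreate()  # reuse the running context or start a local one

py_list = [
    (1, 2, 3, 'a b c'),
    (4, 5, 6, 'd e f'),
    (7, 8, 9, 'g h i')
]
rdd = sc.parallelize(py_list)  # Python list -> RDD
print(rdd.collect())           # RDD -> Python list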


📌 DataFrame

  • Immutable data stored in a distributed way

  • Unlike an RDD, data is organized into columns, like a relational DB table.

    • Very similar to a Pandas DataFrame or a relational DB table.
    • Supports a variety of data sources: HDFS, Hive, external DBs, RDDs, etc.
  • Supported in languages such as Scala, Java, and Python.
