📒 Spark (12)

Kimdongki · June 21, 2024

Spark


📌 Launching a Spark Cluster on AWS

  • Preparing to run Spark on AWS

    • Running it on EMR (Elastic MapReduce) is the most common approach.
  • EMR

    • AWS's managed Hadoop service (on-demand Hadoop)
      -> A service that comes with Hadoop (YARN), Spark, Hive, Notebook, and more preinstalled.

    • Uses EC2 servers as worker nodes and S3 in place of HDFS.

    • Integrates easily with other AWS services (Kinesis, DynamoDB, Redshift, ...).
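Cluster creation can also be scripted instead of clicked through the console. A minimal sketch using boto3's `run_job_flow` (the cluster name, release label, and region below are illustrative placeholders, not values from this post; the actual API call is left commented out):

```python
# Hypothetical sketch: creating a Spark-ready EMR cluster with boto3.
# All names/values below are placeholders, not from the post.
cluster_config = {
    'Name': 'spark-tutorial-cluster',          # placeholder cluster name
    'ReleaseLabel': 'emr-6.10.0',              # pick a current EMR release
    'Applications': [{'Name': 'Spark'}, {'Name': 'Zeppelin'}],
    'Instances': {
        'MasterInstanceType': 'm5.xlarge',
        'SlaveInstanceType': 'm5.xlarge',
        'InstanceCount': 3,                    # 1 master + 2 core nodes
        'KeepJobFlowAliveWhenNoSteps': True,   # stay in Waiting after steps finish
    },
    'JobFlowRole': 'EMR_EC2_DefaultRole',
    'ServiceRole': 'EMR_DefaultRole',
}

# import boto3                                   # uncomment to actually create the cluster
# emr = boto3.client('emr', region_name='us-east-1')
# response = emr.run_job_flow(**cluster_config)
# print(response['JobFlowId'])
```

This mirrors what the console steps below do: pick a software bundle that includes Spark, choose the instance types, and keep the cluster alive for interactive use.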

  • Running and using Spark on EMR

    • Create an EMR cluster in AWS
    • Enable Spark when creating the cluster (selected as an option)
      -> Use S3 as the default file system
    • Use the EMR master node as the driver node
      • Log in to the master node over SSH
        -> Run spark-submit
      • This corresponds to Spark's cluster mode.
  • Summary of Spark cluster managers and execution models

| Cluster manager | Deploy mode | Program run method         |
| --------------- | ----------- | -------------------------- |
| local[n]        | Client      | Spark Shell, IDE, Notebook |
| YARN            | Client      | Spark Shell, Notebook      |
| YARN            | Cluster     | spark-submit               |
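The combinations above translate to `spark-submit` flags roughly as follows; `build_spark_submit` is a made-up helper used only to illustrate how the arguments differ per mode:

```python
# Sketch: mapping the cluster-manager / deploy-mode table to spark-submit
# arguments. The helper name is invented for illustration.
def build_spark_submit(app, master, deploy_mode=None):
    cmd = ['spark-submit', '--master', master]
    if deploy_mode:                       # deploy mode is only meaningful for YARN
        cmd += ['--deploy-mode', deploy_mode]
    cmd.append(app)
    return cmd

# local[n]: driver and executors run in one local JVM with n threads
print(build_spark_submit('job.py', 'local[4]'))
# YARN client mode: the driver runs where you launch the command
print(build_spark_submit('job.py', 'yarn', 'client'))
# YARN cluster mode: the driver runs inside the cluster (used on EMR below)
print(build_spark_submit('job.py', 'yarn', 'cluster'))
```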

📌 Scenario

Step 1

  • Go to the EMR service page (select EMR in the AWS Console)

  • Select Create Cluster

Step 2

  • Create the EMR cluster -> choose a name and the software stack

  • Software configuration

    • Select a bundle that includes Spark
    • Zeppelin
  • What is Zeppelin?
    -> A notebook environment that supports Spark, SQL, and Python

Step 3

  • Choose the cluster hardware, then create it
  • Selecting 3 m5.xlarge nodes -> costs about $35 per day

    • 4 vCPUs * 2
    • 16 GB * 2
  • Select Create Cluster

Step 4

  • Wait for the EMR cluster to finish being created

  • Wait until the cluster status changes to Waiting
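This wait can be automated with a simple polling loop. In the sketch below the status source is injected as a function, so the loop is shown without any AWS call; with boto3 you would read the state from `describe_cluster(ClusterId=...)['Cluster']['Status']['State']`, or use the built-in `cluster_running` waiter:

```python
# Sketch of Step 4: poll the cluster status until it reaches WAITING.
import time

def wait_for_state(get_state, target='WAITING', poll_seconds=30, max_polls=120):
    """Poll get_state() until it returns target; give up on termination."""
    for _ in range(max_polls):
        state = get_state()
        if state == target:
            return True
        if state in ('TERMINATED', 'TERMINATED_WITH_ERRORS'):
            return False
        time.sleep(poll_seconds)
    return False

# Simulated status sequence a new cluster typically goes through
states = iter(['STARTING', 'BOOTSTRAPPING', 'RUNNING', 'WAITING'])
print(wait_for_state(lambda: next(states), poll_seconds=0))  # True
```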

Step 5

  • Open port 22 on the master node
    -> Networking & security -> EC2 security groups (firewall) -> Primary node -> Inbound rules -> Edit inbound rules (if there is no SSH rule)

  • Select the EMR Cluster Summary tab

  • Select the Security Groups for Master link

  • On the Security Groups page, click the master node's security group ID

  • Click the Edit inbound rules button, then select the Add rule button

  • Enter 22 as the port number

  • Select Anywhere-IPv4

  • Select the Save rules button
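The same inbound rule can be expressed as an EC2 API request; a sketch with the boto3 call left commented out (the security-group ID is a placeholder, and opening SSH to 0.0.0.0/0 is for a short-lived tutorial cluster only):

```python
# Sketch of Step 5 as code: the SSH inbound rule added in the console,
# written as an EC2 authorize_security_group_ingress permission.
ssh_rule = {
    'IpProtocol': 'tcp',
    'FromPort': 22,
    'ToPort': 22,
    'IpRanges': [{'CidrIp': '0.0.0.0/0'}],  # "Anywhere-IPv4"
}

# import boto3
# ec2 = boto3.client('ec2')
# ec2.authorize_security_group_ingress(
#     GroupId='sg-0123456789abcdef0',   # placeholder: the master node's security group
#     IpPermissions=[ssh_rule],
# )
print(ssh_rule['FromPort'], ssh_rule['ToPort'])
```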

Step 6

  • View the Spark History Server

Step 7

  • Log in to the Spark master node over SSH
    -> This requires TCP port 22 to be open on the master node.
  • Run the job with spark-submit and debug it.

  • We will run two jobs on AWS EMR.

Step 8

  • Load the input data into S3

  • Upload the Stack Overflow 2022 Developer Survey CSV file to an S3 bucket
    -> 83,339 anonymized survey responses

  • s3://spark-tutorial-dataset/survey_results_public.csv
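The upload itself can be done with boto3's `upload_file` (call left commented out); `split_s3_uri` below is a small made-up convenience for splitting the bucket and key out of the path above:

```python
# Sketch of Step 8: uploading the survey CSV to S3.
def split_s3_uri(uri):
    """Split an s3://bucket/key URI into (bucket, key)."""
    assert uri.startswith('s3://')
    bucket, _, key = uri[len('s3://'):].partition('/')
    return bucket, key

bucket, key = split_s3_uri('s3://spark-tutorial-dataset/survey_results_public.csv')
print(bucket, key)

# import boto3
# boto3.client('s3').upload_file('survey_results_public.csv', bucket, key)
```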

Step 9

  • PySpark job code

  • Analyze the input CSV file and write the result back to S3 (stackoverflow.py)

  • For US developers only, count how they first learned to code and save the counts to S3

📙 stackoverflow.py

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

S3_DATA_INPUT_PATH = 's3://spark-tutorial-dataset/survey_results_public.csv'
S3_DATA_OUTPUT_PATH = 's3://spark-tutorial-dataset/data-output'

spark = SparkSession.builder.appName('Tutorial').getOrCreate()

# Read the survey CSV from S3, treating the first row as the column header
df = spark.read.csv(S3_DATA_INPUT_PATH, header=True)
print('# of records {}'.format(df.count()))

# Keep only US respondents and count the answers to "LearnCode"
learnCodeUS = df.where(col('Country') == 'United States of America').groupby('LearnCode').count()
learnCodeUS.write.mode('overwrite').csv(S3_DATA_OUTPUT_PATH)  # use .parquet() here to write Parquet instead
learnCodeUS.show()

print('Selected data is successfully saved to S3: {}'.format(S3_DATA_OUTPUT_PATH))
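What the job computes can be seen without a cluster: the filter plus groupby-count boils down to this pure-Python aggregation, shown here on a tiny made-up sample (not real survey data):

```python
# The where(...).groupby('LearnCode').count() in stackoverflow.py is
# equivalent to counting LearnCode values among US rows.
from collections import Counter

rows = [  # toy sample rows, invented for illustration
    {'Country': 'United States of America', 'LearnCode': 'Books'},
    {'Country': 'United States of America', 'LearnCode': 'Online Courses'},
    {'Country': 'Germany',                  'LearnCode': 'Books'},
    {'Country': 'United States of America', 'LearnCode': 'Books'},
]

learn_code_us = Counter(
    r['LearnCode'] for r in rows if r['Country'] == 'United States of America'
)
print(learn_code_us)  # Counter({'Books': 2, 'Online Courses': 1})
```

Spark performs the same counting, but distributed across the cluster and streamed from S3 rather than held in one Python list.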

Step 10

  • Run the PySpark job

  • Log in to the Spark master node over SSH and run the job with spark-submit

  • Use the previously downloaded .ppk file for the SSH login

  • Code Run

    • spark-submit --master yarn stackoverflow.py
[hadoop@ip-172-31-47-191 ~]$ nano stackoverflow.py
[hadoop@ip-172-31-47-191 ~]$ spark-submit --master yarn stackoverflow.py

Final

  • Check the PySpark job's results in S3
