📒 Spark (6)

Kimdongki · June 19, 2024


📌 Spark Development Environment Options

  • Local Standalone Spark + Spark Shell
  • Python IDE – PyCharm, Visual Studio
  • Databricks Cloud – the Community Edition is free to use
  • Other notebooks – Jupyter Notebook, Google Colab, Anaconda, etc.

📌 Local Standalone Spark

  • Specify local[n] as the Spark cluster manager.

    • Set master to local[n].
    • master is used to specify the cluster manager.
  • Mainly used for development or simple testing.

  • All processes run inside a single JVM.

    • One Driver and one Executor run.
    • One or more threads run inside the Executor.
  • Number of threads created inside the Executor

    • local: creates only one thread
    • local[*]: creates as many threads as the machine has CPU cores
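The meaning of these master strings can be sketched with a small pure-Python helper. This is only an illustration of the rule above (the function `local_thread_count` is hypothetical, not part of Spark's API; Spark parses these strings internally):

```python
import os
import re

def local_thread_count(master: str) -> int:
    """Illustrative helper: how many executor threads a local-mode
    master string implies. (Hypothetical, not Spark's own parser.)"""
    if master == "local":
        return 1  # bare "local": a single thread
    m = re.fullmatch(r"local\[(\*|\d+)\]", master)
    if m is None:
        raise ValueError(f"not a local-mode master: {master!r}")
    # "local[*]": one thread per CPU core; "local[n]": exactly n threads
    return os.cpu_count() if m.group(1) == "*" else int(m.group(1))

print(local_thread_count("local"))     # 1
print(local_thread_count("local[4]"))  # 4
```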

📌 Using Spark in Google Colab

  • Install PySpark + Py4J
!pip install pyspark==3.3.1 py4j==0.10.9.5
  • Runs Spark in local mode on top of the Google Colab virtual machine

  • Sufficient for development purposes, but processing large data is not possible

  • The Spark Web UI is not accessible by default

    • It can be forced open via ngrok
  • Py4J

    • Lets Python code use Java objects living inside the JVM

📌 Using Spark on Linux

  • Install JDK 8/11
sudo apt install openjdk-8-jdk
sudo apt install openjdk-11-jdk
  • How to switch between JDK 8 and 11
sudo update-alternatives --config java
  • Environment variables (startup script)
nano ~/.bashrc
export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
export PATH=$JAVA_HOME/bin:$PATH

How to Use Two Java Versions

After doing this, the environment variables became a stumbling block.
There are surely several ways to handle it, but I decided to write two scripts and switch versions with the source command.

nano ~/use_java11.sh

export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
export PATH=$JAVA_HOME/bin:$PATH
echo "Switched to JDK 11"
nano ~/use_java8.sh

export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export PATH=$JAVA_HOME/bin:$PATH
echo "Switched to JDK 8"

Finally, both scripts need to be made executable.

chmod +x ~/use_java11.sh
chmod +x ~/use_java8.sh

Now, whenever you want to switch, just run one of the commands below.

source ~/use_java8.sh
source ~/use_java11.sh

Spark Install

Now let's download Spark.

Since we will be downloading from a Linux terminal, just grab the link.

wget https://dlcdn.apache.org/spark/spark-3.5.1/spark-3.5.1-bin-hadoop3.tgz

Extract the archive

tar -xvzf spark-3.5.1-bin-hadoop3.tgz

Spark๋„ ํ™˜๊ฒฝ๋ณ€์ˆ˜๋ฅผ ์ž‘์„ฑํ•ด ์ฃผ์–ด์•ผ ํ•œ๋‹ค.
ํ˜„์žฌ ๋””๋ ‰ํ† ๋ฆฌ + Spark์ด๋ฆ„(spark-3.5.1-bin-hadoop3) ๋ฅผ ์ž‘์„ฑํ•ด์ฃผ์ž.

nano ~/.bashrc

export SPARK_HOME=/home/urface/spark_course/spark-3.5.1-bin-hadoop3
export PATH=$PATH:$SPARK_HOME/bin

Save the file and apply it.

source ~/.bashrc

์ด์ œ Spark์˜ Shell๋กœ ์ง„์ž…ํ•ด๋ณด์ž.

spark-shell

To close spark-shell, type the following.

:q

PySpark Shell

The Spark shell we used above is for Scala.
This time, let's use Spark from Python.
Before that, Py4J must be installed.

pip install py4j

Now it can be run right away.

pyspark

How to exit

exit()

Spark Submit

Spark์„ ์„ค์น˜ํ•˜๋ฉด ๊ฐ™์ด ์„ค์น˜๋˜๋Š” ํŒŒ์ด์˜ ๊ฐ’์„ ๊ณ„์‚ฐํ•˜๋Š” ์˜ˆ์ œ ํ”„๋กœ๊ทธ๋žจ์„ ์‹คํ–‰ํ•ด๋ณด์ž

spark-submit --master 'local[4]'  ./spark-3.5.1-bin-hadoop3/examples/src/main/python/pi.py
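The bundled pi.py estimates pi by Monte Carlo sampling: it throws random points into the unit square and counts how many land inside the quarter circle. A stripped-down pure-Python sketch of the same idea, without Spark's parallelism (the function name and sample count are my own choices for illustration):

```python
import random

def estimate_pi(samples: int, seed: int = 42) -> float:
    """Monte Carlo pi estimate: the fraction of random points in the
    unit square that fall inside the quarter circle, times 4."""
    rng = random.Random(seed)
    inside = 0
    for _ in range(samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            inside += 1
    return 4.0 * inside / samples

print(estimate_pi(100_000))  # ≈ 3.14
```

spark-submit distributes this sampling loop across the local[4] threads; the sketch above only shows the math being parallelized.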

0๊ฐœ์˜ ๋Œ“๊ธ€