📒 Hadoop (2)

Kimdongki · June 17, 2024

DB


📌 Hadoop Install

  • Pseudo-distributed mode runs each Hadoop process in its own JVM on a single machine.
  • An AWS Ubuntu EC2 t2.medium instance is recommended.
    • Java 8 is required.

1. Install

  • If Java is already installed, check the version
java -version

  • If it is not installed, install it. Copying and pasting the command that the shell suggests is all it takes.
    -> Java 8 is used here
sudo apt install openjdk-8-jre-headless

If the error below appears, the local apt package index is out of date. Refresh it with apt update, then run the install command again.

E: Unable to locate package openjdk-8-jre-headless

sudo apt update

2. Installation complete

3. Create a Hadoop account

  • A dedicated account is needed to run the Hadoop cluster.
sudo adduser hdoop
  • I entered only the password and pressed Enter through the remaining prompts.

  • Switch to the new account

su - hdoop

4. Configure Hadoop account permissions

  • Register an SSH key so that the Hadoop account can log in to localhost without a password.

  • Generate the SSH key

ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa

  • Register the SSH key
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

  • Restrict authorized_keys to owner read/write only
chmod 0600 ~/.ssh/authorized_keys
  • Verify
ssh localhost
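
If ssh localhost still asks for a password, file permissions are a common cause, since sshd by default refuses keys kept in files that other users can write to. The sketch below checks for the conventional modes (700 on ~/.ssh, 600 on authorized_keys). The helper name has_mode is hypothetical, and stat -c assumes GNU coreutils as shipped on Ubuntu.

```shell
# Succeed when PATH ($1) has exactly the octal mode given in $2.
# Uses GNU stat (-c '%a'), which is an assumption that holds on Ubuntu.
has_mode() {
  [ "$(stat -c '%a' "$1")" = "$2" ]
}

# Conventional modes for passwordless login; print a fix for anything that differs.
for entry in "$HOME/.ssh:700" "$HOME/.ssh/authorized_keys:600"; do
  path=${entry%%:*}
  mode=${entry##*:}
  if [ -e "$path" ] && ! has_mode "$path" "$mode"; then
    echo "fix with: chmod $mode $path"
  fi
done
```

No output means both paths already have the expected modes, or do not exist yet.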

5. Install Hadoop

Apache
Before installing, follow the Apache link and check the current Hadoop version.

  • Download
wget https://dlcdn.apache.org/hadoop/common/hadoop-3.3.5/hadoop-3.3.5.tar.gz
  • Extract
tar xvf hadoop-3.3.5.tar.gz
  • Verify that the hadoop-3.3.5 directory was extracted

  • Set the Hadoop-related environment variables. Open .bashrc with either vi or nano and add the exports below.

vi .bashrc
nano .bashrc
export HADOOP_HOME=/home/hdoop/hadoop-3.3.5
export HADOOP_INSTALL=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib/native"

Save the file, then reload it:

source .bashrc
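
A quick way to confirm that the variables took effect is to check that HADOOP_HOME points at an unpacked distribution before calling hadoop version. This is only a sketch, and the function name looks_like_hadoop_home is hypothetical.

```shell
# Succeed only if the directory in $1 contains the launcher scripts
# that the Hadoop tarball ships with.
looks_like_hadoop_home() {
  [ -x "$1/bin/hadoop" ] && [ -x "$1/sbin/start-dfs.sh" ]
}

if looks_like_hadoop_home "${HADOOP_HOME:-}"; then
  # PATH now includes $HADOOP_HOME/bin, so the bare command resolves.
  hadoop version
else
  echo "HADOOP_HOME is unset or does not point at a Hadoop directory" >&2
fi
```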

Next, the main configuration files below need to be edited (open each one with either vi or nano).

vi $HADOOP_HOME/etc/hadoop/hadoop-env.sh
nano $HADOOP_HOME/etc/hadoop/hadoop-env.sh

Find the "Generic settings for HADOOP" block, uncomment the export JAVA_HOME= line, and set it as follows.

export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
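
The exact JAVA_HOME path depends on the installed package and the CPU architecture, so it is worth deriving it from the java binary rather than typing it from memory. The sketch below is one way to do that. The function name java_home_from_bin is hypothetical, and it assumes the usual OpenJDK layout where the binary sits under bin/ (or jre/bin/ for a JRE-only Java 8 install).

```shell
# Derive a JAVA_HOME candidate from the resolved path of the java binary.
# Strips the trailing /bin/java and, for JRE-only layouts, a trailing /jre.
java_home_from_bin() {
  home=${1%/bin/java}
  printf '%s\n' "${home%/jre}"
}

# Usage (assumes java is on PATH and GNU readlink is available):
#   java_home_from_bin "$(readlink -f "$(command -v java)")"
```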

Next, edit core-site.xml.

vi $HADOOP_HOME/etc/hadoop/core-site.xml
nano $HADOOP_HOME/etc/hadoop/core-site.xml

Add the following:

<configuration>
<property>
  <name>hadoop.tmp.dir</name>
  <value>/home/hdoop/tmpdate</value>
</property>
<property>
  <name>fs.defaultFS</name>
  <value>hdfs://127.0.0.1:9000</value>
</property>
</configuration>

Next, edit hdfs-site.xml.

vi $HADOOP_HOME/etc/hadoop/hdfs-site.xml
nano $HADOOP_HOME/etc/hadoop/hdfs-site.xml
<configuration>
<property>
  <name>dfs.namenode.name.dir</name>
  <value>/home/hdoop/dfsdata/namenode</value>
</property>
<property>
  <name>dfs.datanode.data.dir</name>
  <value>/home/hdoop/dfsdata/datanode</value>
</property>
<property>
  <name>dfs.replication</name>
  <value>1</value>
</property>
</configuration>

Next, edit mapred-site.xml.

vi $HADOOP_HOME/etc/hadoop/mapred-site.xml
nano $HADOOP_HOME/etc/hadoop/mapred-site.xml
<configuration>
<property>
  <name>mapreduce.framework.name</name>
  <value>yarn</value>
</property>
</configuration>

Finally, edit yarn-site.xml.

vi $HADOOP_HOME/etc/hadoop/yarn-site.xml
nano $HADOOP_HOME/etc/hadoop/yarn-site.xml
<configuration>
<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle</value>
</property>
<property>
  <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
  <value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
<property>
  <name>yarn.resourcemanager.hostname</name>
  <value>localhost</value>
</property>
<property>
  <name>yarn.acl.enable</name>
  <value>0</value>
</property>
<property>
  <name>yarn.nodemanager.env-whitelist</name>
  <value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_MAPRED_HOME</value>
</property>
</configuration>
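
With four XML files edited by hand, a typo in a property name is easy to miss, because Hadoop ignores keys it does not recognize. The sketch below prints the value stored under a given name so each file can be spot-checked. The helper site_prop is hypothetical, and it assumes the one-tag-per-line layout used in this post; it is not a general XML parser.

```shell
# Print every <value> that directly follows <name>KEY</name> in FILE.
# $1 = path to a *-site.xml file, $2 = property name (matched literally).
site_prop() {
  awk -v key="$2" '
    index($0, "<name>" key "</name>") { found = 1; next }
    found && /<value>/ {
      sub(/.*<value>/, "")
      sub(/<\/value>.*/, "")
      print
      found = 0
    }
  ' "$1"
}

# Usage:
#   site_prop "$HADOOP_HOME/etc/hadoop/hdfs-site.xml" dfs.replication
```

If a name appears more than once in a file, every matching value is printed, which makes accidental duplicates visible.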

Now HDFS has to be formatted.

hdfs namenode -format

Start DFS first, then YARN.

cd hadoop-3.3.5/sbin/
./start-dfs.sh
./start-yarn.sh

Because only the JRE was installed and not the JDK, the jps command, which lists the Java processes running on the system, is not available yet.

jps

To install the JDK, exit back to the Ubuntu account

exit

Install the JDK

sudo apt install openjdk-8-jdk-headless

Switch back to the Hadoop account

su - hdoop

Run jps again

jps
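
After start-dfs.sh and start-yarn.sh, a pseudo-distributed setup is expected to show five daemons in the jps listing: NameNode, DataNode, SecondaryNameNode, ResourceManager and NodeManager. The sketch below reads jps output and prints any that are absent. The function name missing_daemons is hypothetical.

```shell
# Read `jps` output ("<pid> <name>" per line) on stdin and print each
# expected daemon whose name does not appear as an exact second field.
missing_daemons() {
  listing=$(cat)
  for d in NameNode DataNode SecondaryNameNode ResourceManager NodeManager; do
    printf '%s\n' "$listing" \
      | awk -v name="$d" '$2 == name { ok = 1 } END { exit !ok }' \
      || echo "$d"
  done
}

# Usage:
#   jps | missing_daemons
```

Empty output means all five are running. If a daemon is missing, its log under $HADOOP_HOME/logs is the first place to look.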

📌 Hadoop Web UI - HDFS

  • NameNode (port: 9870)
  • DataNode (port: 9864)
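
Before opening a browser, the two ports above can be probed from the instance itself. The sketch below maps a daemon name to its port and asks curl for the HTTP status code. The function names ui_port and ui_status are hypothetical, and the host defaults to localhost. Reaching the UI from outside an EC2 instance would additionally require the ports to be open in the security group, which is an assumption about your setup.

```shell
# Map a daemon name to the web UI port listed above.
ui_port() {
  case "$1" in
    namenode) echo 9870 ;;
    datanode) echo 9864 ;;
    *) return 1 ;;
  esac
}

# Print the HTTP status code returned by a daemon's web UI.
# $1 = daemon name, $2 = host (optional, defaults to localhost).
ui_status() {
  port=$(ui_port "$1") || return 1
  curl -s -o /dev/null -w '%{http_code}\n' "http://${2:-localhost}:$port/"
}

# Usage:
#   ui_status namenode
#   ui_status datanode
```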

📌 MapReduce Programming - Word Count

  • Run the WordCount program covered earlier

    • bin/hadoop jar hadoop-*-examples.jar wordcount input output
    • bin/hadoop == bin/yarn
  • Inspect the HDFS input and output

    • bin/hdfs dfs -ls input
    • bin/hdfs dfs -ls output
  • Check the job result in the Hadoop Web UI (Resource Manager)
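
To see what the job computes without a cluster, the same word count can be emulated with a local pipeline: splitting into words plays the role of map, sort groups identical keys like the shuffle, and uniq -c does the reduce. This is only an illustration of the data flow, not how Hadoop runs it, and the function name local_wordcount is hypothetical.

```shell
# Local stand-in for the wordcount job. Reads text on stdin and prints
# "word<TAB>count" lines, the same shape as the job's text output.
local_wordcount() {
  tr -s '[:space:]' '\n' \
    | sed '/^$/d' \
    | sort \
    | uniq -c \
    | awk '{ print $2 "\t" $1 }'
}

# Usage:
#   local_wordcount < some-local-file.txt
#
# Staging input for the real job (file name is a placeholder):
#   hdfs dfs -mkdir -p input
#   hdfs dfs -put some-local-file.txt input
```

In the 3.3.5 tarball used above, the examples jar should be at share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.5.jar under $HADOOP_HOME. The job fails if the output directory already exists, so remove it between runs.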


📌 Problems with MapReduce Programming

  • Low productivity: the data model and the available operations are heavily constrained
  • All input and output goes through disk
    • Suited to batch processing of large data
  • Data skew tends to occur after shuffling
    • The developer has to specify the number of reduce tasks
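
The skew problem is easy to reproduce on paper. Records are routed to reducers by hashing the key modulo the number of reduce tasks, so every record for a hot key lands on the same reducer no matter how many reducers exist. The sketch below counts records per reducer for a list of keys. The hash is a toy (sum of ASCII codes), not Hadoop's own hash, and the function name records_per_reducer is hypothetical.

```shell
# Read one record per line on stdin (key = first field) and print how many
# records each of N reducers ($1) would receive under "hash(key) mod N".
# Toy hash: sum of printable-ASCII character codes of the key.
records_per_reducer() {
  awk -v n="$1" '
    BEGIN { for (i = 32; i < 127; i++) ord[sprintf("%c", i)] = i }
    {
      h = 0
      for (i = 1; i <= length($1); i++) h += ord[substr($1, i, 1)]
      count[h % n]++
    }
    END { for (r = 0; r < n; r++) printf "reducer %d: %d\n", r, count[r] + 0 }
  '
}

# Usage: five records for key "a", one for key "b", two reducers
#   printf '%s\n' a a a a a b | records_per_reducer 2
```

Raising the reducer count does not split a single hot key. For jobs that accept generic options, such as the bundled wordcount example, the count can be set with -D mapreduce.job.reduces=N.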
