📒 Hadoop (1)

Kimdongki · June 17, 2024

DB


📌 Hadoop

  • An open-source software platform for distributed storage and distributed processing of very large data sets on computer clusters built from commodity hardware.

    • HDFS: distributed file system
    • MapReduce: distributed computing system
  • A cluster system made up of many nodes

    • It behaves like a single, very large computer.
    • In reality, many computers are coordinated by complex software.
  • Hadoop 1.0

    • The MapReduce distributed computing system runs on top of HDFS.
      -> Various higher-level computing languages were built on top of MapReduce.
  • Hadoop 2.0

    • The architecture changed significantly.
      • MapReduce became an application that runs on top of a distributed processing system called YARN.
      • Spark also runs as an application layer on top of YARN.


📌 HDFS - Distributed File System

  • Data is split into blocks for storage.
    -> The default block size is 128 MB.

  • Block replication

    • By default, each block is stored redundantly in three places.
    • The replicas are placed so that fault tolerance is preserved.
  • Hadoop 2.0 supports NameNode high availability

    • Active & Standby
      -> The two share an edit log.
    • The Secondary NameNode still exists (for non-HA setups; in an HA pair the Standby NameNode takes over checkpointing).
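
The block size and replication figures above translate into simple arithmetic. The sketch below only illustrates that calculation; it does not call any HDFS API, and the class and method names (`HdfsBlockMath`, `blockCount`, `rawStorageBytes`) are made up for this note. It assumes the default values mentioned above (128 MB blocks, 3 replicas).

```java
// Illustrative arithmetic only; this does not talk to a real HDFS cluster.
class HdfsBlockMath {
    static final long BLOCK_SIZE = 128L * 1024 * 1024; // default block size from the notes
    static final int REPLICATION = 3;                  // each block is stored in three places

    // Number of blocks a file is split into: ceiling of fileSize / BLOCK_SIZE.
    static long blockCount(long fileSizeBytes) {
        return (fileSizeBytes + BLOCK_SIZE - 1) / BLOCK_SIZE;
    }

    // Raw bytes consumed across the cluster once every block is replicated.
    static long rawStorageBytes(long fileSizeBytes) {
        return fileSizeBytes * REPLICATION;
    }

    public static void main(String[] args) {
        long oneGiB = 1024L * 1024 * 1024;
        System.out.println(blockCount(oneGiB));      // 8
        System.out.println(rawStorageBytes(oneGiB)); // 3221225472
    }
}
```

So a 1 GiB file becomes 8 blocks, and with three copies of each block the cluster stores about 3 GiB of raw data for it.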

📌 MapReduce - Distributed Computing

  • Hadoop 1.0

  • It consists of one Job Tracker and many Task Trackers.

    • The Job Tracker splits the work and distributes it to the Task Trackers.
    • The Task Trackers process it in parallel.
  • Only MapReduce is supported.

    • It is not a general-purpose system.

📌 YARN

1. Distributed computing system: Hadoop 2.0 (YARN 1.0)

  • A general-purpose computing framework with fine-grained resource management.

    • Resource Manager
      -> Job Scheduler, Application Manager

    • Node Manager

    • Container

      • App Master
      • Task
  • Spark runs on top of it.

2. How YARN works

  • The client hands the code to run and its environment information to the RM (Resource Manager).

    • The files needed for execution are copied in advance to an HDFS folder that corresponds to the Application ID.
  • The RM obtains a Container from an NM (Node Manager) and launches the AM (Application Master) in it.

    • The AM is the per-program master; one is allocated for each program.
  • The AM asks the RM for the resources needed to process the input data.

    • The RM allocates resources (Containers) with data locality in mind.
  • The AM launches the allocated resources as Containers through the NMs and runs the code inside them.

    • At this point the files needed for execution are first copied from HDFS to the server hosting the Container.
  • Each Task periodically reports its status to the AM (heartbeat).

    • If a Task fails or does not report for a long time, it is re-executed in a different Container.
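
The heartbeat rule in the last step can be pictured with a small bookkeeping sketch. This is a hypothetical model of what an Application Master might track, not the YARN API: `HeartbeatTracker`, the 10-second timeout, and the task IDs are all assumptions made up for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of an Application Master's bookkeeping; not the YARN API.
class HeartbeatTracker {
    private final long timeoutMillis;
    private final Map<String, Long> lastSeen = new HashMap<>();

    HeartbeatTracker(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    // Called whenever a Task reports its status.
    void heartbeat(String taskId, long nowMillis) {
        lastSeen.put(taskId, nowMillis);
    }

    // Tasks silent for longer than the timeout are candidates for
    // re-execution in a different Container.
    List<String> tasksToRelaunch(long nowMillis) {
        List<String> expired = new ArrayList<>();
        for (Map.Entry<String, Long> e : lastSeen.entrySet()) {
            if (nowMillis - e.getValue() > timeoutMillis) {
                expired.add(e.getKey());
            }
        }
        return expired;
    }

    public static void main(String[] args) {
        HeartbeatTracker am = new HeartbeatTracker(10_000); // assumed 10 s timeout
        am.heartbeat("task-1", 0);
        am.heartbeat("task-2", 0);
        am.heartbeat("task-1", 9_000); // task-1 keeps reporting, task-2 goes silent
        System.out.println(am.tasksToRelaunch(12_000)); // [task-2]
    }
}
```

Timestamps are passed in explicitly rather than read from the clock, which keeps the sketch deterministic and easy to reason about.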

3. Hadoop 1.0 vs. Hadoop 2.0

  • The cluster resource manager introduced in Hadoop 2.0 is called YARN.

4. Hadoop 3.0 features

  • Uses YARN 2.0

    • YARN programs can be divided into logical groups (called flows) for resource management.
    • This makes it possible to manage, for example, data ingestion processes and data serving processes separately.
    • The timeline server uses HBase as its default storage. (Hadoop 2.1)
  • File system

    • Multiple standby NameNodes are supported.

    • In addition to HDFS, S3, and Azure Storage, Azure Data Lake Storage and others are supported.


📌 MapReduce Programming

1. Characteristics of MapReduce programming

  • A data set is a collection of key-value pairs and is immutable.

  • Data can be manipulated only through two operations: map and reduce.

    • The two operations always run back to back as a pair.
    • The developer supplies the code for both operations.
  • The MapReduce system gathers the map output and delivers it to the reduce stage.

    • This step is usually called shuffling, and it moves data across the network.

2. Map and Reduce

  • Map: (k, v) -> [(k', v')*]
    • The input is supplied by the system and comes from the HDFS file specified as input.
    • Transforms a key-value pair into a list of new key-value pairs (transformation).
    • Output: it may emit the same key-value pair as the input unchanged, or emit nothing at all.
  • Reduce: (k', [v1', v2', v3', v4', ...]) -> (k'', v'')
    • The input is supplied by the system.
      -> The system groups the map output pairs that share the same key and passes them in as input.
    • Transforms a key and its list of values into a new key-value pair.
    • Similar to GROUP BY in SQL.
    • The output is saved to HDFS.

3. How a MapReduce program runs: an example

  • Word Count
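
To make the Word Count flow concrete, here is a single-process imitation of the three stages (map, shuffle, reduce) in plain Java. It is only a sketch: no Hadoop classes are used, the names (`LocalWordCount`, `map`, `shuffle`, `reduce`, `run`) are made up, and the second input line is invented sample data.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Single-process imitation of map -> shuffle -> reduce; no Hadoop involved.
class LocalWordCount {
    // Map: one input line becomes a list of (word, 1) pairs.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                out.add(new SimpleEntry<>(word, 1));
            }
        }
        return out;
    }

    // Shuffle: group values by key; the TreeMap also keeps keys sorted.
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            grouped.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        }
        return grouped;
    }

    // Reduce: (word, [1, 1, ...]) becomes the sum for that word.
    static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    // Runs the three stages in order over all input lines.
    static Map<String, Integer> run(List<String> lines) {
        List<Map.Entry<String, Integer>> mapped = new ArrayList<>();
        for (String line : lines) {
            mapped.addAll(map(line));
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : shuffle(mapped).entrySet()) {
            result.put(e.getKey(), reduce(e.getKey(), e.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(run(Arrays.asList("the brave yellow lion", "the lion ate the cow")));
        // {ate=1, brave=1, cow=1, lion=2, the=3, yellow=1}
    }
}
```

In a real cluster the map calls run on many machines and the shuffle moves data over the network; here everything happens in one JVM, so only the data flow is the same.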

4. MapReduce programming example

  • Word Count Mapper

  • Map: (k, v) -> [(k', v')*]

    • Transformation
    • Transforms a key-value pair into a list of new key-value pairs.
  • Input: (100, "the brave yellow lion")

  • Output: [("the", 1), ("brave", 1), ("yellow", 1), ("lion", 1)]

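import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Nested inside the enclosing job class (WordCount in the standard Hadoop example).
public static class TokenizerMapper
     extends Mapper<Object, Text, Text, IntWritable>{

  // Reused across calls so no new object is allocated per token.
  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  @Override
  public void map(Object key, Text value, Context context
                  ) throws IOException, InterruptedException {
    // Split the input line on whitespace and emit (word, 1) for every token.
    StringTokenizer itr = new StringTokenizer(value.toString());
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      context.write(word, one);
    }
  }
}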

5. MapReduce programming example

  • Word Count Reducer

  • Reduce: (k', [v1', v2', v3', v4', ...]) -> (k'', v'')

    • Similar to GROUP BY in SQL.
    • Transforms a key and its list of values into a new key-value pair.
  • Input: ("lion": [1, 1, 1])

  • Output: ("lion": 3)

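import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Nested inside the enclosing job class (WordCount in the standard Hadoop example).
public static class IntSumReducer
     extends Reducer<Text,IntWritable,Text,IntWritable> {
  // Reused output value holding the total for the current key.
  private IntWritable result = new IntWritable();

  @Override
  public void reduce(Text key, Iterable<IntWritable> values,
                     Context context
                     ) throws IOException, InterruptedException {
    // Add up every count the system grouped under this key.
    int sum = 0;
    for (IntWritable val : values) {
      sum += val.get();
    }
    result.set(sum);
    context.write(key, result);
  }
}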

6. MapReduce: Shuffling and Sorting

  • Shuffling

    • The process of sending Mapper output to the Reducers.
    • If the amount of data transferred is large, it creates a network bottleneck and takes a long time.
  • Sorting

    • Once a Reducer has received the output of all Mappers, it sorts that output by key.
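
A rough way to picture shuffling and sorting is a routing function plus a sorted map per Reducer. The sketch below is a single-process model with made-up names (`ShuffleSketch`, `partition`, `shuffleAndSort`); the hash formula follows, as far as I know, what Hadoop's default `HashPartitioner` does, but nothing here calls Hadoop itself.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Single-process sketch of shuffling and sorting; no Hadoop classes involved.
class ShuffleSketch {
    // Hash partitioning: the same key is always routed to the same Reducer.
    static int partition(String key, int numReducers) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    // Returns one sorted map per Reducer: key -> values received for that key.
    static List<TreeMap<String, List<Integer>>> shuffleAndSort(
            List<Map.Entry<String, Integer>> mapOutput, int numReducers) {
        List<TreeMap<String, List<Integer>>> reducers = new ArrayList<>();
        for (int i = 0; i < numReducers; i++) {
            reducers.add(new TreeMap<>());
        }
        for (Map.Entry<String, Integer> pair : mapOutput) {
            int target = partition(pair.getKey(), numReducers);
            reducers.get(target)
                    .computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                    .add(pair.getValue());
        }
        return reducers;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> mapOutput = new ArrayList<>();
        for (String w : Arrays.asList("the", "brave", "yellow", "lion", "the", "lion")) {
            mapOutput.add(new SimpleEntry<>(w, 1));
        }
        List<TreeMap<String, List<Integer>>> reducers = shuffleAndSort(mapOutput, 2);
        for (int i = 0; i < reducers.size(); i++) {
            System.out.println("reducer " + i + ": " + reducers.get(i));
        }
    }
}
```

The `TreeMap` stands in for the sort step: each Reducer iterates over its keys in sorted order, and all values for one key arrive together.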

7. MapReduce: Data Skew

  • If the amount of data each Task handles is unbalanced, parallel processing loses much of its benefit.
  • The slowest Task determines the overall processing time.
  • In particular, the amount of data arriving at each Reducer can differ greatly.
    • GROUP BY and JOIN are typical cases.
    • Depending on how the data is processed and on the number of Reducers, out-of-memory errors can occur.
  • One of the reasons data engineers struggle
    • This problem exists in every big data system.
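
Skew is easy to see by counting how many records each Reducer would receive. The sketch below assumes plain hash partitioning and uses invented data (one hot key, `hot_user`, plus 100 ordinary keys); `SkewSketch`, `recordsPerReducer`, and `skewRatio` are hypothetical names, not part of any library.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Counts records per Reducer to show how one hot key unbalances the load.
class SkewSketch {
    // How many records each Reducer would receive under hash partitioning.
    static int[] recordsPerReducer(List<String> keys, int numReducers) {
        int[] counts = new int[numReducers];
        for (String key : keys) {
            counts[(key.hashCode() & Integer.MAX_VALUE) % numReducers]++;
        }
        return counts;
    }

    // Busiest Reducer's load divided by the average load; 1.0 means perfectly even.
    static double skewRatio(int[] counts) {
        int max = 0;
        long total = 0;
        for (int c : counts) {
            max = Math.max(max, c);
            total += c;
        }
        return total == 0 ? 1.0 : max / ((double) total / counts.length);
    }

    public static void main(String[] args) {
        List<String> keys = new ArrayList<>();
        for (int i = 0; i < 900; i++) {
            keys.add("hot_user");    // invented hot key: 90% of all records
        }
        for (int i = 0; i < 100; i++) {
            keys.add("user_" + i);   // invented ordinary keys
        }
        int[] counts = recordsPerReducer(keys, 4);
        System.out.println(Arrays.toString(counts));
        System.out.println(skewRatio(counts));
    }
}
```

Because every `hot_user` record hashes to the same Reducer, that Reducer gets at least 900 of the 1,000 records no matter how many Reducers are added, and it decides when the whole job finishes.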

8. Problems with MapReduce programming

  • Low productivity

    • The programming model is inflexible (only two operations are supported).
    • Tuning and optimization are not easy.
    • Example: when the data distribution is uneven
  • Batch-oriented

    • It is designed primarily for throughput rather than low latency.

9. The rise of MapReduce alternatives

  • More general-purpose large-scale data processing frameworks appeared
    • YARN, Spark
  • The comeback of SQL: Hive, Presto, and others appeared
    • Hive
      • Implemented on top of MapReduce.
      • Focuses on throughput.
      • Well suited to large-scale ETL.
    • Presto
      • Focuses on low latency.
      • Works mainly in memory.
      • Well suited to ad hoc queries.
      • AWS Athena is based on Presto.
