What is Distributed Computing?

우진 · September 4, 2024

Why distributed computing??

We need to mine huge amounts of data, so we have to use the resources of many computers.

In the past, we used supercomputers, but they cost a lot.

So we use distributed computing on commodity hardware instead of a supercomputer.

But commodity hardware comes with some problems…

Issues

1. How to distribute computation ??

In a commodity-hardware system, not every computer has the data it needs.

So the data must be transferred from a computer that has it to a computer that doesn't.

But copying data over a network takes time.

So: don't hand the data over.

💡 Idea
1. Hand over the computation instead (move the computation to the data)
2. Store files multiple times for reliability

Spark/Hadoop address these problems:

  • Storage Infrastructure – file system
    • Google: GFS
    • Hadoop: HDFS
  • Programming model
    • MapReduce
    • Spark
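
To make the "programming model" layer concrete, here is a minimal word-count sketch in PySpark (Spark's Python API). This is my illustration, not from the original post; it assumes pyspark is installed, and input.txt is a placeholder path.

```python
from pyspark import SparkContext

# Run Spark locally; "WordCount" is just an app name for this sketch.
sc = SparkContext("local", "WordCount")

counts = (
    sc.textFile("input.txt")                 # placeholder input file
      .flatMap(lambda line: line.split())    # one record per word
      .map(lambda word: (word, 1))           # emit (word, 1) pairs
      .reduceByKey(lambda a, b: a + b)       # sum the 1s per word
)
print(counts.collect())
sc.stop()
```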

2. Storage Infrastructure

Q. If nodes fail, how can we store data persistently?

A. Distributed File System

What is a Distributed File System??

It provides a global file namespace:

"It feels like using a single computer, even though many computers are connected."

  • Typical usage pattern
    ▪ Huge files (100s of GB to TB)
    ▪ Data is rarely updated in place
    ▪ Reads and appends are common
  • Chunk servers
    ▪ File is split into contiguous chunks
    ▪ Typically each chunk is 16-64 MB
    ▪ Each chunk is replicated (usually 2x or 3x)
    ▪ Replicas are kept in different racks when possible
  • Master node
    ▪ a.k.a. Name Node in Hadoop's HDFS
    ▪ Stores metadata about where files are stored
    ▪ Might be replicated
  • Client library for file access
    1. Talks to the master to find chunk servers
    2. Connects directly to chunk servers to access data

Data is kept in "chunks" spread across machines.

Each chunk is replicated on different machines.

→ Seamless recovery from disk or machine failures

Distributed File System = reliable distributed file system (the replication is what provides the reliability)
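
As a rough illustration of what the master node's metadata might look like, here is a toy Python sketch of chunking and rack-aware replica placement. All names (place_replicas, cs-1, rack-1, etc.) are made up for illustration; real GFS/HDFS placement policies are more involved.

```python
import itertools

CHUNK_SIZE = 64 * 1024 * 1024   # 64 MB, the upper end of the typical 16-64 MB range
REPLICAS = 3                    # common replication factor (2x-3x)

def num_chunks(file_size):
    """A file is split into contiguous fixed-size chunks (ceiling division)."""
    return (file_size + CHUNK_SIZE - 1) // CHUNK_SIZE

def place_replicas(chunk_id, racks):
    """Pick REPLICAS chunk servers for one chunk, preferring different racks."""
    placement = []
    rack_cycle = itertools.cycle(sorted(racks))   # visit racks round-robin
    offset = {}                                   # per-rack cursor so a server isn't reused
    while len(placement) < REPLICAS:
        rack = next(rack_cycle)
        servers = racks[rack]
        i = offset.get(rack, chunk_id)            # chunk-dependent start spreads the load
        placement.append(servers[i % len(servers)])
        offset[rack] = i + 1
    return placement

# Toy cluster: two racks, two chunk servers each (hypothetical names).
racks = {"rack-1": ["cs-1", "cs-2"], "rack-2": ["cs-3", "cs-4"]}

# The master's metadata: chunk id -> where the replicas live.
file_size = 200 * 1024 * 1024   # a 200 MB file -> 4 chunks
metadata = {cid: place_replicas(cid, racks) for cid in range(num_chunks(file_size))}
print(metadata)   # e.g. {0: ['cs-1', 'cs-3', 'cs-2'], 1: ['cs-2', 'cs-4', 'cs-1'], ...}
```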

Google built an early distributed computing system.

That is…

MapReduce

= “style of programming”

Purpose:

“Processing large amounts of data (petabytes or more) on a cluster of low-end computers”

What is a cluster??

  • We need many computers and a network for distributed computing; this collection is a cluster.
  • A cluster contains the computers, a LAN, and the software that ties them together.
  • It combines several relatively inexpensive, low-performance computers to achieve supercomputer-like performance.

Method:

The developer implements Map() and Reduce(); the framework partitions the input data and sends the pieces to the nodes in the cluster so they can be processed in parallel.
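
For example, for word count (the canonical MapReduce example), the only two functions a developer would write might look like this in Python. The names and signatures here are hypothetical, for illustration only:

```python
def map_fn(_, line):
    """Map: for each input element (here, a line of text), emit (word, 1) pairs."""
    for word in line.split():
        yield (word.lower(), 1)

def reduce_fn(word, counts):
    """Reduce: for each key and its list of values, emit the aggregated result."""
    yield (word, sum(counts))
```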

Result:

Developers only need to implement Map() and Reduce(); the framework handles the complex distributed processing on the back end.

Developers can focus purely on the data analysis.

  1. Easy parallel programming
  2. Invisible management of hardware and software failures
  3. Easy management of very-large-scale data

🪜 Steps

  1. Map
    • The user writes a Map function that is applied to each input element
    • The output of the Map function is a set of 0, 1, or more key-value pairs
  2. Group by key
    • The system sorts all the key-value pairs by key and outputs key-(list of values) pairs
  3. Reduce
    • The user-written Reduce function is applied to each key-(list of values) pair
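
A minimal single-machine sketch of these three steps, reusing the hypothetical map_fn/reduce_fn from the word-count example above (repeated here so the block runs on its own). A real framework also distributes the work, schedules it, and recovers from failures; this shows only the data flow:

```python
from collections import defaultdict

def map_fn(_, line):                      # same word-count Map as above
    for word in line.split():
        yield (word.lower(), 1)

def reduce_fn(word, counts):              # same word-count Reduce as above
    yield (word, sum(counts))

def mapreduce(inputs, map_fn, reduce_fn):
    # 1. Map: apply the user's Map function to every input element.
    pairs = []
    for key, value in inputs:
        pairs.extend(map_fn(key, value))

    # 2. Group by key: collect all values that share a key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)

    # 3. Reduce: apply the user's Reduce function to each key-(list of values).
    results = []
    for key in sorted(groups):            # "the system sorts all the key-value pairs by key"
        results.extend(reduce_fn(key, groups[key]))
    return results

lines = [(0, "the quick brown fox"), (1, "the lazy dog")]
print(mapreduce(lines, map_fn, reduce_fn))
# [('brown', 1), ('dog', 1), ('fox', 1), ('lazy', 1), ('quick', 1), ('the', 2)]
```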
