4 Main Features
- Distributed Collection of Data
- Fault Tolerant
- Parallel operation on partitioned data
- Ability to use many data sources
4 Stages of Spark Operation
- Users Manipulate RDD Objects
- Spark builds a DAG of the computation and schedules tasks with the DAG Scheduler
- Task Scheduler launches the tasks on worker instances
- Workers carry out the actual work
RDDs are immutable, lazily evaluated, and cacheable
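Lazy evaluation means that building up a pipeline of transformations does no work until a result is actually demanded. A running SparkContext can't be assumed here, so this is a plain-Python sketch of the same idea using a generator expression; the `trace` list is only there to show when real work happens.

```python
# Plain-Python sketch of lazy evaluation (not PySpark itself):
# a generator, like a chain of RDD transformations, runs nothing
# until a result is actually requested.
trace = []

def slow_square(x):
    trace.append(x)  # record that real work happened
    return x * x

numbers = [1, 2, 3]
pipeline = (slow_square(x) for x in numbers)  # lazy: nothing runs yet
assert trace == []                            # no work done so far

result = list(pipeline)  # forcing the pipeline, like calling an RDD action
# result == [1, 4, 9], trace == [1, 2, 3]
```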
There are two types of RDD operations
- Transformations
- Actions
Basic Actions
- first()
- collect()
- count()
- take()
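Since real PySpark calls need a running SparkContext, here is a plain-Python sketch of what each of the four basic actions returns; the equivalent RDD call is shown in the comments.

```python
# Plain-Python analogues of the four basic RDD actions.
data = [10, 20, 30, 40]

first = data[0]         # rdd.first()   -> first element: 10
collected = list(data)  # rdd.collect() -> all elements as a list
count = len(data)       # rdd.count()   -> number of elements: 4
taken = data[:2]        # rdd.take(2)   -> first 2 elements: [10, 20]
```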
Basic Transformations
- RDD.filter()
: applies a function to each element and returns the elements that evaluate to True
- RDD.map()
: transforms each element one-to-one, preserving the number of elements; very similar in idea to pandas.apply()
- RDD.flatMap()
: transforms each element into 0 to N elements, so it can change the number of elements
Often RDDs hold their values in (key, value) tuples, called pair RDDs.
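The three transformations can be sketched with plain-Python analogues (a live SparkContext is not assumed here); each comment shows the equivalent PySpark call. Note how flatMap produces a variable number of outputs per input.

```python
from itertools import chain

data = [1, 2, 3, 4]

# rdd.filter(lambda x: x % 2 == 0): keep elements that evaluate to True
evens = [x for x in data if x % 2 == 0]  # [2, 4]

# rdd.map(lambda x: x * 10): exactly one output per input
mapped = [x * 10 for x in data]          # [10, 20, 30, 40]

# rdd.flatMap(lambda x: range(x)): 0 to N outputs per input,
# so the element count changes (here 4 inputs -> 10 outputs)
flat = list(chain.from_iterable(range(x) for x in data))
# [0, 0, 1, 0, 1, 2, 0, 1, 2, 3]
```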
reduce() and reduceByKey()
- reduce()
: an action that aggregates RDD elements with a function and returns a single value
- reduceByKey()
: a transformation on a pair RDD that aggregates the values for each key with a function, returning a new pair RDD