4 Main Features
- Distributed Collection of Data
- Fault Tolerant
- Parallel operation on partitioned data
- Ability to use many data sources
4 Stages of Spark Operation
- Users Manipulate RDD Objects
- Spark builds a DAG of the computation and schedules tasks with the DAG Scheduler
- Task Scheduler launches the tasks on worker instances
- Workers carry out the actual work
RDDs are immutable, lazily evaluated, and cacheable
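Lazy evaluation means that building up a pipeline of transformations does no work until a result is actually demanded. A running SparkContext can't be assumed here, so this is a plain-Python sketch of the same idea using a generator expression; the `trace` list is only there to show when real work happens.

```python
# Plain-Python sketch of lazy evaluation (not PySpark itself):
# a generator, like a chain of RDD transformations, runs nothing
# until a result is actually requested.
trace = []

def slow_square(x):
    trace.append(x)  # record that real work happened
    return x * x

numbers = [1, 2, 3]
pipeline = (slow_square(x) for x in numbers)  # lazy: nothing runs yet
assert trace == []                            # no work done so far

result = list(pipeline)  # forcing the pipeline, like calling an RDD action
# result == [1, 4, 9], trace == [1, 2, 3]
```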
There are two types of RDD operations
- Transformations
- Actions
Basic Actions
- first()
- collect()
- count()
- take()
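Since real PySpark calls need a running SparkContext, here is a plain-Python sketch of what each of the four basic actions returns; the equivalent RDD call is shown in the comments.

```python
# Plain-Python analogues of the four basic RDD actions.
data = [10, 20, 30, 40]

first = data[0]         # rdd.first()   -> first element: 10
collected = list(data)  # rdd.collect() -> all elements as a list
count = len(data)       # rdd.count()   -> number of elements: 4
taken = data[:2]        # rdd.take(2)   -> first 2 elements: [10, 20]
```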
Basic Transformations
- RDD.filter()
: applies a function to each element and returns the elements that evaluate to True
- RDD.map()
: transforms each element one-to-one, preserving the number of elements; very similar in idea to pandas.apply()
- RDD.flatMap()
: transforms each element into 0 to N elements, so it can change the number of elements
Often RDDs hold their values in (key, value) tuples, called pair RDDs.
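The three transformations can be sketched with plain-Python analogues (a live SparkContext is not assumed here); each comment shows the equivalent PySpark call. Note how flatMap produces a variable number of outputs per input.

```python
from itertools import chain

data = [1, 2, 3, 4]

# rdd.filter(lambda x: x % 2 == 0): keep elements that evaluate to True
evens = [x for x in data if x % 2 == 0]  # [2, 4]

# rdd.map(lambda x: x * 10): exactly one output per input
mapped = [x * 10 for x in data]          # [10, 20, 30, 40]

# rdd.flatMap(lambda x: range(x)): 0 to N outputs per input,
# so the element count changes (here 4 inputs -> 10 outputs)
flat = list(chain.from_iterable(range(x) for x in data))
# [0, 0, 1, 0, 1, 2, 0, 1, 2, 3]
```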
reduce() and reduceByKey()
- reduce()
: an action that aggregates RDD elements with a function and returns a single value
- reduceByKey()
: a transformation on a pair RDD that aggregates the values for each key with a function, returning a new pair RDD