See how PySpark handles iterative/interactive processing
sc.parallelize(array)
: create RDD from the elements of an array (or list)
sc.textFile({path to file})
: create RDD of lines from a file
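A minimal sketch of both creation routes; the SparkContext settings, the variable names nums and lines, and the file name data.txt are illustrative assumptions:

    from pyspark import SparkContext

    sc = SparkContext("local[*]", "rdd-notes")  # assumed local setup
    nums = sc.parallelize([1, 2, 3, 4, 5])      # RDD from an in-memory list
    lines = sc.textFile("data.txt")             # RDD of lines; placeholder path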
filter(lambda x: x % 2 == 0)
: discard elements for which the predicate returns False
map(lambda x: x * 2)
: multiply each RDD element by 2
map(lambda x: x.split())
: split each string into words (one list per element)
flatMap(lambda x: x.split())
: split each string into words and flatten into one sequence
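Continuing the sketch above (nums and lines as before); note these are transformations and are lazy, so nothing executes until an action is called:

    evens   = nums.filter(lambda x: x % 2 == 0)   # keeps 2 and 4
    doubled = nums.map(lambda x: x * 2)           # 2, 4, 6, 8, 10
    nested  = lines.map(lambda x: x.split())      # one list of words per line
    words   = lines.flatMap(lambda x: x.split())  # one flat sequence of words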
sample(withReplacement=True, fraction=0.25)
: create a sample of roughly 25% of elements, with replacement
union(rdd)
: append rdd to the existing RDD (duplicates are kept)
distinct()
: remove duplicates in RDD
sortBy(lambda x: x, ascending=False)
: sort elements in descending order
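A sketch of these four, again using nums; the second RDD passed to union is made up for illustration:

    quarter = nums.sample(withReplacement=True, fraction=0.25)  # random; size varies run to run
    merged  = nums.union(sc.parallelize([4, 5, 6]))             # 4 and 5 now appear twice
    unique  = merged.distinct()                                 # duplicates removed
    ordered = unique.sortBy(lambda x: x, ascending=False)       # 6, 5, 4, 3, 2, 1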
collect()
: convert RDD to an in-memory list
take(3)
: first 3 elements of RDD
top(3)
: largest 3 elements of RDD (think about how this overlaps with sortBy)
takeSample(withReplacement=True, num=3)
: create a sample of 3 elements, with replacement
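These are actions, so each one triggers computation and returns results to the driver; the comments show the expected output for nums = [1, 2, 3, 4, 5] and the ordered RDD sketched above:

    ordered.collect()                             # [6, 5, 4, 3, 2, 1]
    nums.take(3)                                  # [1, 2, 3] (partition order, not sorted)
    nums.top(3)                                   # [5, 4, 3] (already sorted descending)
    nums.takeSample(withReplacement=True, num=3)  # random, e.g. [2, 5, 2]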
sum()
: find element sum (assumes numeric elements)
mean()
: find element mean (assumes numeric elements)
stdev()
: find element standard deviation (assumes numeric elements)
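For nums = [1, 2, 3, 4, 5]:

    nums.sum()    # 15
    nums.mean()   # 3.0
    nums.stdev()  # ~1.414 (population standard deviation)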