
A quick reference for the PySpark RDD transformations and actions used in iterative/interactive processing.
sc.parallelize(array) : create an RDD from the elements of array (or a list)
sc.textFile({path to file}) : create an RDD of lines from a file
filter(lambda x: x % 2 == 0) : discard elements for which the predicate returns False
map(lambda x: x * 2) : multiply each RDD element by 2
map(lambda x: x.split()) : split each string into a list of words
flatMap(lambda x: x.split()) : split each string into words and flatten the resulting sequences
sample(True, 0.25) : create a sample of 25% of the elements, with replacement
union(rdd) : append rdd to the existing RDD
distinct() : remove duplicates from the RDD
sortBy(lambda x: x, ascending=False) : sort elements in descending order
collect() : convert the RDD to an in-memory list
take(3) : first 3 elements of the RDD
top(3) : top 3 elements of the RDD, in descending order (compare with a sortBy followed by take)
takeSample(True, 3) : create a sample of 3 elements, with replacement
sum() : sum of elements (assumes numeric elements)
mean() : mean of elements (assumes numeric elements)
stdev() : standard deviation of elements (assumes numeric elements)
A short runnable sketch tying these together follows below.
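The sketch below exercises most of the operations above on a local SparkContext. It is a minimal, assumed setup: the app name "rdd-cheatsheet-demo" and the sample data (`nums`, `words`) are illustrative and not from the original notes.

```python
# Minimal sketch, assuming a local Spark installation.
from pyspark import SparkContext

sc = SparkContext("local[*]", "rdd-cheatsheet-demo")  # illustrative app name

nums = sc.parallelize([1, 2, 3, 4, 5, 6])  # RDD from an in-memory list

# Transformations are lazy: nothing runs until an action is called.
evens_doubled = (nums
                 .filter(lambda x: x % 2 == 0)  # keep elements where predicate is True
                 .map(lambda x: x * 2))         # multiply each remaining element by 2

print(evens_doubled.collect())                            # [4, 8, 12]
print(nums.sortBy(lambda x: x, ascending=False).take(3))  # [6, 5, 4]
print(nums.top(3))                                        # [6, 5, 4] -- top() sorts descending itself
print(nums.union(sc.parallelize([5, 6, 7])).distinct().collect())  # 1..7, order not guaranteed
print(nums.sample(True, 0.25).collect())                  # random ~25% sample, with replacement
print(nums.takeSample(True, 3))                           # random list of 3 elements
print(nums.sum(), nums.mean(), nums.stdev())              # 21 3.5 ~1.71 (population stdev)

# map vs flatMap on strings:
words = sc.parallelize(["hello world", "spark rdd"])
print(words.map(lambda x: x.split()).collect())      # [['hello', 'world'], ['spark', 'rdd']]
print(words.flatMap(lambda x: x.split()).collect())  # ['hello', 'world', 'spark', 'rdd']

sc.stop()
```

Note that `sortBy(...).take(3)` and `top(3)` give the same elements here, but `top()` avoids a full sort-and-shuffle by keeping only the largest elements per partition, which is why the notes suggest thinking about it when reaching for sortBy.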