Immutability and Lazy processing
- Python variables
- mutable
- flexibility
- potential for issues with concurrency
- likely adds complexity
- Immutability
- a component of functional programming
- defined once
- unable to be directly modified
- recreated if reassigned
- able to be shared efficiently
- Lazy processing?
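The contrast between mutable and immutable values can be sketched in plain Python. This is an illustration only: the names `original`, `alias`, `point`, and `same_point` are made up for the example, and a list and a tuple stand in for mutable and immutable objects in general. Spark DataFrames follow the immutable pattern: a call such as `df.withColumn(...)` returns a new DataFrame rather than changing the existing one.

```python
# Mutable: two names refer to the same list, so a change made through
# one name is visible through the other.
original = [1, 2, 3]
alias = original
alias.append(4)          # modifies the shared list in place

# Immutable: a tuple cannot be modified in place.
point = (1, 2)
same_point = point
point = point + (3,)     # "reassigning" builds a new tuple; same_point is untouched

try:
    same_point[0] = 99   # direct modification is not allowed
    tuple_modified = True
except TypeError:
    tuple_modified = False
```

Because `same_point` can never change underneath its readers, it can be shared freely, which is the property the notes refer to as "able to be shared efficiently".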
Understanding Parquet
- Spark and CSV files
- slow to parse
- files cannot be filtered (no predicate pushdown)
- The Parquet Format
- columnar data format
- supported in Spark and other data processing frameworks
- supports predicate pushdown
- automatically stores schema info.
df = spark.read.format('parquet').load('filename.parquet')
df = spark.read.parquet('filename.parquet')
df.createOrReplaceTempView('name')
df2 = spark.sql('SQL QUERIES')
Partitioning and lazy processing
- Partitioning
- DataFrames are broken up into partitions
- Partition size can vary
- Each partition is handled independently
- Lazy processing
- Transformations are lazy
- .withColumn(...)
- .select(...)
- Nothing is actually done until an action is performed
- Transformations can be re-ordered for best performance
- Sometimes causes unexpected behavior