PySpark provides several built-in standard functions for working with DataFrames and SQL queries. The snippet below uses a few of them to count the missing values in every column.
from pyspark.sql import functions as F

df.select(
    [F.count(F.when(F.isnull(c), c)).alias(c) for c in df.columns]
).show()
The built-in functions used in the snippet above:
1) select
2) count
3) when
4) isnull
5) alias
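As a quick aside (not part of the original walkthrough), the same per-column null counts can be obtained by casting isNull() to an integer and summing it; a minimal equivalent sketch, assuming the same df:

# Equivalent approach: cast the isNull() boolean to int and sum it per column
df.select(
    [F.sum(F.col(c).isNull().cast("int")).alias(c) for c in df.columns]
).show()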
1) filter
# filter() corresponds to the SQL WHERE clause
# Rows where StoreLocation is null
df.filter(F.col("StoreLocation").isNull()).show()
# Rows where CountyNumber is null but County is not
df.filter(F.col("CountyNumber").isNull()).filter(F.col("County").isNotNull()).count()
In the PySpark DataFrame API, a query can be built up by chaining transformations as above. The example below produces the same result.
df.filter(
    (F.col("CountyNumber").isNull())
    & (F.col("County").isNotNull())
).count()
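Since where() is an alias for filter(), the same condition can also be written as a SQL-style string expression; a minimal sketch, assuming the same df:

# where() is an alias for filter(); it also accepts a SQL expression string
df.where("CountyNumber IS NULL AND County IS NOT NULL").count()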
2) drop
# Drop the StoreLocation and CountyNumber columns
df = df.drop("StoreLocation", "CountyNumber")
df.printSchema()
# Output:
# root
# |-- Invoice/ItemNumber: string (nullable = true)
# |-- Date: string (nullable = true)
# |-- StoreNumber: integer (nullable = true)
# |-- StoreName: string (nullable = true)
# |-- Address: string (nullable = true)
# |-- City: string (nullable = true)
# |-- ZipCode: string (nullable = true)
# |-- County: string (nullable = true)
# |-- CategoryName: string (nullable = true)
region_cols = ["Address", "City", "ZipCode", "County"]
df.select(
    [F.count(F.when(F.isnull(c), c)).alias(c) for c in region_cols]  # region_cols instead of df.columns
).show()
# Output:
# +-------+-----+-------+------+
# |Address| City|ZipCode|County|
# +-------+-----+-------+------+
# | 82627|82626| 82693|159429|
# +-------+-----+-------+------+
The difference is slight, but City has the fewest missing values, at 82,626.
# Keep only the rows where City is not null
df = df.filter(F.col("City").isNotNull())
# Drop the Address, ZipCode, and County columns
df = df.drop("Address", "ZipCode", "County")
df.printSchema()
# Output:
# root
# |-- Invoice/ItemNumber: string (nullable = true)
# |-- Date: string (nullable = true)
# |-- StoreNumber: integer (nullable = true)
# |-- StoreName: string (nullable = true)
# |-- City: string (nullable = true)
# |-- Category: integer (nullable = true)
# |-- CategoryName: string (nullable = true)
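As a sanity check (not part of the original walkthrough), the earlier null-count query can be re-run on the remaining columns to confirm that City no longer contains missing values:

# Re-count nulls per remaining column; City should now report 0
df.select(
    [F.count(F.when(F.isnull(c), c)).alias(c) for c in df.columns]
).show()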