[Spark] Key Spark DataFrame Methods - (4) withColumn

baekdata · March 31, 2022


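The withColumn method

Summary

Use withColumn to update an existing column, change its type, or add a new column.

Usage: withColumn('name of the new or updated column', value of the new or updated column)

When the new or updated value is derived from an existing column,
give the target column name as a string and reference the existing column as a Column, i.e. col('column_name').

A new column can also be added with the select() method.

Rename columns with the withColumnRenamed() method.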

a. Basic usage

When the new or updated value is derived from an existing column,
give the target column name as a string and reference the existing column as a Column, i.e. col('column_name').

# Load the library
from pyspark.sql.functions import col

# Copy: Spark DataFrames have no .copy() method, so copy with select('*')
# (titanic_sdf is assumed to be the Titanic DataFrame loaded earlier)
titanic_sdf_copied = titanic_sdf.select('*')

# Add a new column
titanic_sdf_copied = titanic_sdf_copied.withColumn('Extra_Fare', col('Fare') * 10)

# Update an existing column
titanic_sdf_copied = titanic_sdf_copied.withColumn('Fare', col('Fare') + 20)

# Change the type of an existing column
titanic_sdf_copied = titanic_sdf_copied.withColumn('Fare', col('Fare').cast('integer'))

# Apply several withColumn() calls at once (the parentheses let the chain span lines)
titanic_sdf_copied = (
    titanic_sdf_copied
    .withColumn('Fare', col('Fare') + 20)
    .withColumn('Fare', col('Fare').cast('integer'))
)

b. Constant columns - lit()

The new or updated value must be a Column, so constant values have to be wrapped in lit().

# Updating with a bare constant as below raises an error: the value must be a Column!
# titanic_sdf_copied = titanic_sdf_copied.withColumn('Extra_Fare', 10)

# Import lit
from pyspark.sql.functions import lit

# Use lit() when updating with a constant value
titanic_sdf_copied = titanic_sdf_copied.withColumn('Extra_Fare', lit(10))

# Use lit() as well when creating a new column from a constant value
titanic_sdf_copied = titanic_sdf_copied.withColumn('New_Name', lit('Test_Name'))

c. Adding a new column with select()

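# Load the library
from pyspark.sql.functions import col, split, substring

# Add new columns with select()
titanic_sdf_copied = titanic_sdf_copied.select('*', col('Sex').alias('Gender'))
# substring() positions are 1-based, so (1, 1) takes the first character
titanic_sdf_copied = titanic_sdf_copied.select('*', substring('Cabin', 1, 1).alias('First'))

# The same with withColumn
titanic_sdf_copied = (
    titanic_sdf_copied
    .withColumn('Gender', col('Sex'))
    .withColumn('Cabin_First', substring('Cabin', 1, 1))
)

# Using split: getItem(0) is the piece before the first comma
titanic_sdf_copied = titanic_sdf_copied.withColumn('Sp', split(col('Name'), ',').getItem(0))
# getItem(1) is the second piece; reusing 'Sp' overwrites the previous value
titanic_sdf_copied = titanic_sdf_copied.withColumn('Sp', split(col('Name'), ',').getItem(1))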

d. Renaming columns - withColumnRenamed()

Rename a column with withColumnRenamed('existing column name', 'new column name').

# Rename a column
titanic_sdf_copied = titanic_sdf_copied.withColumnRenamed('Gender', 'Gender_Renamed')

# Be careful: no error is raised even if the column to rename does not exist
titanic_sdf_copied = titanic_sdf_copied.withColumnRenamed('Gender_X', 'Gender_Renamed')
