Replacing specific values in Spark

유알 · May 11, 2024

Replacing a specific value

>>> newDF.show()
+-----------------+-------------------+-----+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
+-----------------+-------------------+-----+
|      New Country|      Other Country|    5|
|     New Country2|    Othre Country 3|    1|
+-----------------+-------------------+-----+

This DataFrame contains a typo: Othre Country 3.
How can we fix it?

In SQL, an UPDATE ... WHERE would do the job. Let's see how to do the same thing in Spark.

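from pyspark.sql.functions import col, when

# Rewrite the column under its own name: fix the typo, keep every other value.
newDF.withColumn(
	"ORIGIN_COUNTRY_NAME",
	when(col("ORIGIN_COUNTRY_NAME") == "Othre Country 3", "Other Country")
		.otherwise(col("ORIGIN_COUNTRY_NAME"))
).show()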

+-----------------+-------------------+-----+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
+-----------------+-------------------+-----+
|      New Country|      Other Country|    5|
|     New Country2|      Other Country|    1|
+-----------------+-------------------+-----+

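I used to think withColumn was only for adding a new column, but it behaves somewhat like a merge (an upsert on the column name).

If the column already exists, it is overwritten.
If it does not exist, it is added.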

What happens if we pass a different column name?

newDF.withColumn(
	"DIFF", # only this name is different
	when(col("ORIGIN_COUNTRY_NAME") == "Othre Country 3", "Other Country")
		.otherwise(col("ORIGIN_COUNTRY_NAME"))
).show()

+-----------------+-------------------+-----+-------------+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|         DIFF|
+-----------------+-------------------+-----+-------------+
|      New Country|      Other Country|    5|Other Country|
|     New Country2|    Othre Country 3|    1|Other Country|
+-----------------+-------------------+-----+-------------+

It is also worth getting familiar with when / otherwise, since the two are used together.

substring

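from pyspark.sql.functions import col, concat, lit, substring, when

# show() returns None, so assign the DataFrame first and display it separately.
newDF = newDF.withColumn(
    "ORIGIN_COUNTRY_NAME",
    when(col("ORIGIN_COUNTRY_NAME").startswith("Othre"), concat(lit("Other"), substring(col("ORIGIN_COUNTRY_NAME"), 6, 8)))
    .otherwise(col("ORIGIN_COUNTRY_NAME"))
)
newDF.show()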

Here the column value is rewritten using concat, lit, and substring.
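
To see what substring(col, 6, 8) actually picks out, it helps to reproduce the slice outside Spark. The following is a plain-Python sketch, not Spark code: `spark_substring` is a hypothetical helper that only mimics the 1-based position and length arguments for positions of 1 or greater.

```python
def spark_substring(value: str, pos: int, length: int) -> str:
    """Plain-Python stand-in for substring(col, pos, len), valid for pos >= 1 only."""
    start = pos - 1  # Spark positions are 1-based, Python indexes are 0-based
    return value[start:start + length]


typo = "Othre Country 3"

# 8 characters starting at position 6: the space after "Othre" through the "y".
tail = spark_substring(typo, 6, 8)

# Same composition as concat(lit("Other"), substring(...)).
fixed = "Other" + tail

print(repr(tail))
print(repr(fixed))
```

Because the length is fixed at 8, the trailing " 3" is cut off, so the result is "Other Country" rather than "Other Country 3".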

regexp_replace

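from pyspark.sql.functions import col, regexp_replace, when

# show() returns None, so assign the DataFrame first and display it separately.
newDF = newDF.withColumn(
    "ORIGIN_COUNTRY_NAME",
    when(col("ORIGIN_COUNTRY_NAME").startswith("Othre"), regexp_replace(col("ORIGIN_COUNTRY_NAME"), "^Othre", "Other"))
    .otherwise(col("ORIGIN_COUNTRY_NAME"))
)
newDF.show()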

The value can also be replaced with a regular expression.
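
The effect of the ^Othre pattern can be tried out with Python's own re module. This is a plain-Python sketch, not Spark code; Spark uses Java regular expressions, but for a simple anchored literal like this one the behavior should be the same. The third sample value is made up to show what the ^ anchor excludes.

```python
import re

samples = [
    "Othre Country 3",  # from the table above: typo at the start
    "Other Country",    # from the table above: already correct
    "Another Othre",    # hypothetical value: typo not at the start
]

# Replace "Othre" only when it appears at the beginning of the string.
results = [re.sub(r"^Othre", "Other", s) for s in samples]

for before, after in zip(samples, results):
    print(f"{before!r} -> {after!r}")
```

Two things follow from this. Unlike the substring version, the regex keeps the rest of the string, so the typo row becomes "Other Country 3". And since values that do not match are returned unchanged, the when(...startswith("Othre")) guard around regexp_replace is likely redundant here.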
