pyspark split() default

장유림 · April 29, 2025

In PySpark, the split() function from pyspark.sql.functions takes a required pattern argument, interpreted as a Java regular expression. Unlike Python's built-in str.split(), there is no whitespace default; to split on any whitespace character (spaces, tabs, newlines), pass the regex "\\s+". The function returns an array of strings. The optional limit parameter, which caps the number of resulting elements, defaults to -1, meaning "apply the pattern as many times as possible."
Python

from pyspark.sql.functions import split
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("SplitExample").getOrCreate()

data = [("apple banana orange",), ("grape kiwi",), ("pear peach plum",)]
df = spark.createDataFrame(data, ["fruits"])

# Split each string on one or more whitespace characters; limit defaults to -1
df = df.withColumn("split_fruits", split(df["fruits"], "\\s+"))
df.show(truncate=False)
# Output
# +-------------------+-----------------------+
# |fruits             |split_fruits           |
# +-------------------+-----------------------+
# |apple banana orange|[apple, banana, orange]|
# |grape kiwi         |[grape, kiwi]          |
# |pear peach plum    |[pear, peach, plum]    |
# +-------------------+-----------------------+

spark.stop()
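
To see the limit parameter in action, here is a minimal sketch (the appName and sample data are illustrative, not part of any API). With limit=2, the pattern is applied at most once, so the array holds at most two elements and the final element keeps the rest of the string.
Python

from pyspark.sql.functions import split
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("SplitLimitExample").getOrCreate()

df = spark.createDataFrame([("apple banana orange",)], ["fruits"])

# limit=2 caps the result at two elements; the remainder stays in the last one
df = df.withColumn("split_fruits", split(df["fruits"], "\\s+", 2))
df.show(truncate=False)
# Output
# +-------------------+----------------------+
# |fruits             |split_fruits          |
# +-------------------+----------------------+
# |apple banana orange|[apple, banana orange]|
# +-------------------+----------------------+

spark.stop()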