[Spark] Specifying basePath when a partition column exists

Woong · January 21, 2025

apache spark basePath partition column

About `basePath`

An option that specifies the root directory of partitioned data.
- Spark uses it as the reference path when inferring partition columns from directory names.

Usage example

Suppose the root directory is /data/ and the partition column is year:

/data/
    year=2020/
    year=2021/
    year=2022/

Set basePath to /data/ so that the partition column year is inferred.
- Spark reads the data under the year=2020 and year=2021 directories and automatically adds a year column.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Example").getOrCreate()

df = spark.read.option("basePath", "/data/").parquet(
    "/data/year=2020",
    "/data/year=2021"
)
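
To make the role of the root directory concrete, here is a minimal pure-Python sketch of the idea, not Spark's actual implementation. The helper `infer_partition_columns` and the file name `part-0000.parquet` are hypothetical and only for illustration. It collects the `key=value` directory names that sit between the base path and the data file, which is why a base path of /data/ yields a year column while a base path of /data/year=2020 yields none. Spark additionally infers column types; this sketch keeps the values as strings.

```python
from pathlib import PurePosixPath


def infer_partition_columns(file_path: str, base_path: str) -> dict:
    """Hypothetical helper: collect key=value directory names between base_path and the file."""
    relative = PurePosixPath(file_path).relative_to(PurePosixPath(base_path))
    columns = {}
    # Only directories count; the last component is the data file itself.
    for part in relative.parts[:-1]:
        if "=" in part:
            key, value = part.split("=", 1)
            columns[key] = value
    return columns


# Base path is the table root: the year=2020 directory is seen as a partition.
with_root = infer_partition_columns("/data/year=2020/part-0000.parquet", "/data/")
# Base path is the partition directory itself: nothing is left to infer from.
without_root = infer_partition_columns("/data/year=2020/part-0000.parquet", "/data/year=2020")

print(with_root)     # {'year': '2020'}
print(without_root)  # {}
```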

Problems encountered when not using `basePath`

Conflicting directory structures detected error

When several directories were passed directly without specifying basePath, the root directory was ambiguous and the following error occurred.

target_path = ["/data/year=" + year for year in ['2020', '2021']]
df = spark.read.parquet(*target_path)

# Or: df = spark.read.parquet("/data/year=2020", "/data/year=2021")

...

('An error occurred: ', Py4JJavaError(u'An error occurred while calling o74.parquet.\n', JavaObject id=o77))
Py4JJavaError: An error occurred while calling o74.parquet.
: java.lang.AssertionError: assertion failed: Conflicting directory structures detected. Suspicious paths:
  file:/data/year=2024
  file:/data/year=2025

If provided paths are partition directories, please set "basePath" in the options of the data source to specify the root directory of the table. If there are multiple root directories, please load them separately and then union them.
...

Solution

Setting basePath tells Spark the root directory explicitly, which resolves the error.

spark.read.option("basePath", "/data/").parquet(
    "/data/year=2020",
    "/data/year=2021"
)
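
The list-based variant from the failing example can be combined with basePath in the same way. The sketch below is one possible way to wrap it: `partition_paths` and `read_partitions` are hypothetical helper names, and `read_partitions` assumes an existing SparkSession is passed in. Only the path-building part runs without Spark.

```python
def partition_paths(root: str, column: str, values) -> list:
    """Hypothetical helper: build partition directory paths under root."""
    root = root.rstrip("/")
    return [f"{root}/{column}={value}" for value in values]


def read_partitions(spark, root: str, column: str, values):
    """Read only the selected partitions while keeping the partition column.

    Assumes `spark` is an existing SparkSession and the data is Parquet.
    """
    paths = partition_paths(root, column, values)
    # basePath points at the table root so the partition column is still inferred.
    return spark.read.option("basePath", root).parquet(*paths)


paths = partition_paths("/data/", "year", ["2020", "2021"])
print(paths)  # ['/data/year=2020', '/data/year=2021']
```

With a live session, `read_partitions(spark, "/data/", "year", ["2020", "2021"])` would issue the same read as the snippet above.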

reference

Spark parquet docs
