[Glue] No space left on device

yozzum · January 8, 2024


Situation

Apache Spark uses the local disk on AWS Glue workers to spill data from memory when it exceeds the heap space defined by the spark.executor.memory configuration parameter.

Wide transformations such as groupByKey(), reduceByKey(), and join() can cause a shuffle.

Spark writes the intermediate data to a local disk before it can exchange that data between the different workers.

At this point, you might get a "No space left on device" or a MetadataFetchFailedException error.
Spark throws this error when there isn't enough disk space left on the executor and it cannot recover.

These types of errors commonly occur when the processing job observes a significant skew in the dataset.
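
To see why skew matters, here is a minimal pure-Python sketch of how records are spread across shuffle partitions by key. It is only an illustration: crc32 stands in for Spark's hash partitioner, and the dataset and key names are hypothetical. Every record with the same key lands in the same partition, so a single hot key can fill one executor's disk while the others stay nearly empty.

```python
import zlib
from collections import Counter


def partition_sizes(keys, num_partitions):
    """Count how many records land in each shuffle partition.

    crc32 is used as a deterministic stand-in for Spark's hash
    partitioner (an assumption made for this illustration).
    """
    sizes = Counter()
    for key in keys:
        sizes[zlib.crc32(key.encode("utf-8")) % num_partitions] += 1
    return [sizes.get(p, 0) for p in range(num_partitions)]


# Hypothetical dataset: one "hot" key holds 90% of the records.
keys = ["customer_0"] * 9000 + [f"customer_{i}" for i in range(1, 1001)]
sizes = partition_sizes(keys, num_partitions=8)

print("records per partition:", sizes)
print("largest partition share:", max(sizes) / len(keys))
```

Adding more partitions or workers does not split the hot key, which is why the partition holding it keeps spilling to disk.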

Task

You need to make sure the error does not occur.

Action

3.1 Use dedicated serverless storage.

Store Spark shuffle and spill data in Amazon Simple Storage Service (Amazon S3) instead of writing it to the AWS Glue worker's local disk. Set the following job parameters:

Key: --write-shuffle-files-to-s3
Value: TRUE

Key: --conf
Value: spark.shuffle.storage.path=s3://custom_shuffle_bucket
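
If the job is defined from code rather than the console, the same key/value pairs could be collected in a dictionary. This is a minimal sketch, not a definitive setup: the helper name is hypothetical, the bucket name is the placeholder used above, and the parameter name is written in lowercase as in the AWS Glue documentation.

```python
def shuffle_to_s3_arguments(shuffle_bucket):
    """Build Glue job parameters that send shuffle files to S3.

    `shuffle_bucket` must be a bucket you own; the value used below is
    only the placeholder from this post.
    """
    return {
        "--write-shuffle-files-to-s3": "TRUE",
        "--conf": f"spark.shuffle.storage.path=s3://{shuffle_bucket}",
    }


default_arguments = shuffle_to_s3_arguments("custom_shuffle_bucket")
for key, value in default_arguments.items():
    print(f"Key: {key}")
    print(f"Value: {value}")

# With boto3 installed and credentials configured, the dict could be passed
# as the DefaultArguments parameter of the Glue create_job / update_job calls.
```

Shuffle files accumulate in the bucket, so an S3 lifecycle rule that expires old objects is worth considering (see the lifecycle link under Result).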

3.2 Scale out or up

Increase the number of workers (horizontal scaling) or upgrade the worker type (vertical scaling). However, scaling might not always work, especially if your data is heavily skewed on a few keys.

3.3 Reduce and filter input data
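
One way to read less data is to filter on partition columns at read time, so that unneeded partitions never reach the shuffle. The sketch below only builds the predicate string; the helper name, partition columns, database, and table names are all hypothetical, and it assumes the source table is partitioned by those columns.

```python
def build_pushdown_predicate(**partition_values):
    """Render partition filters as one predicate string joined by 'and'."""
    clauses = [
        f"{column} == '{value}'" for column, value in partition_values.items()
    ]
    return " and ".join(clauses)


predicate = build_pushdown_predicate(year="2024", month="01")
print(predicate)

# In a Glue script the string could then be passed when reading, e.g.:
#   glueContext.create_dynamic_frame.from_catalog(
#       database="my_db",          # hypothetical database name
#       table_name="my_table",     # hypothetical table name
#       push_down_predicate=predicate,
#   )
```

Selecting only the columns you need before a join or aggregation reduces shuffle volume in the same way.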

3.4 Broadcast small tables
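
When one side of a join is small, it can be copied to every executor so the large side is joined in place and never shuffled. The pure-Python sketch below illustrates that idea only; it is not Spark's implementation, and the table contents are hypothetical.

```python
def broadcast_hash_join(large_rows, small_rows, key):
    """Join two lists of dicts by building a lookup from the small side.

    The small side plays the role of the broadcast table: it is held in
    memory as a dict, and the large side is streamed through it once.
    """
    lookup = {}
    for row in small_rows:
        lookup.setdefault(row[key], []).append(row)

    joined = []
    for row in large_rows:
        for match in lookup.get(row[key], []):
            joined.append({**match, **row})
    return joined


# Hypothetical data: a large fact table and a small dimension table.
orders = [
    {"country": "KR", "amount": 10},
    {"country": "US", "amount": 7},
    {"country": "KR", "amount": 3},
]
countries = [
    {"country": "KR", "name": "Korea"},
    {"country": "US", "name": "United States"},
]

result = broadcast_hash_join(orders, countries, "country")
for row in result:
    print(row)

# In PySpark the equivalent hint is broadcast() from pyspark.sql.functions:
#   large_df.join(broadcast(small_df), "country")
```

The small table must fit comfortably in each executor's memory; otherwise broadcasting trades a disk problem for a memory problem.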

3.5 Use AQE

Adaptive Query Execution (AQE), as described by Databricks, is an optimization technique in Spark SQL that is available in Apache Spark 3.0 and later. It dynamically coalesces shuffle partitions, switches join strategies, and optimizes skew joins.
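
The three behaviours map to Spark SQL configuration properties. The sketch below collects them and renders a single Glue `--conf` value; chaining several settings with ` --conf ` inside one value is a commonly used workaround and is an assumption here, as is the helper name. Recent Spark versions may already enable AQE by default.

```python
# Spark SQL properties that control AQE and two of its features.
AQE_SETTINGS = {
    "spark.sql.adaptive.enabled": "true",
    "spark.sql.adaptive.coalescePartitions.enabled": "true",
    "spark.sql.adaptive.skewJoin.enabled": "true",
}


def to_glue_conf_value(settings):
    """Join name=value pairs into one string for the Glue '--conf' key."""
    pairs = [f"{name}={value}" for name, value in settings.items()]
    return " --conf ".join(pairs)


conf_value = to_glue_conf_value(AQE_SETTINGS)
print("Key: --conf")
print("Value:", conf_value)

# Alternatively, inside the job script each property could be set with
#   spark.conf.set(name, value)
```

A Glue job has only one `--conf` key, so any other Spark setting the job needs would have to be chained into the same value.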

Result

https://repost.aws/knowledge-center/glue-no-spaces-on-device-error

https://docs.aws.amazon.com/glue/latest/dg/monitor-spark-shuffle-manager.html

https://docs.aws.amazon.com/AmazonS3/latest/userguide/how-to-set-lifecycle-configuration-intro.html
