[TroubleShooting] ElasticSearch - disk usage exceeded flood-stage watermark

devhans · September 26, 2023


Environment

ElasticSearch OSS (hereafter ES) was installed in a Docker container and was ingesting data.
Once per second, a job queried a single ES index, sorted the results by timestamp, and fetched the most recent document.
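
For reference, the once-per-second lookup described above might look like the following sketch. The post does not show the actual query, so the index name `my-index-000001`, the field name `timestamp`, and the helper function names are all assumptions made for illustration.

```shell
# Assumed values: adjust the URL, index, and field name to your setup.
ES_URL="${ES_URL:-http://localhost:9200}"
INDEX="${INDEX:-my-index-000001}"

# Build a search body that returns only the newest document,
# sorted by the (assumed) "timestamp" field in descending order.
latest_doc_query() {
  printf '{"size": 1, "sort": [{"timestamp": {"order": "desc"}}]}'
}

# Send the query once; the caller decides how often to poll.
fetch_latest_doc() {
  curl -s -X GET "${ES_URL}/${INDEX}/_search?pretty" \
    -H 'Content-Type: application/json' \
    -d "$(latest_doc_query)"
}
```

A polling loop could then be as simple as `while true; do fetch_latest_doc; sleep 1; done`.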

Identifying the failure

[TOO_MANY_REQUESTS/12/disk usage exceeded flood-stage watermark, index has read-only-allow-delete block]

This error is covered in the official ES documentation.
The disk of the server hosting the Docker container had filled up, so ES blocked the index from accepting any more writes.
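
One way to confirm that the disk really is the cause is to compare the host's disk usage against the flood-stage threshold, which defaults to 95% in ES. The sketch below is only an illustration: the function names are hypothetical, and it works with whole-number percentages as reported by `df`, which is coarser than the check ES performs internally.

```shell
# Print the used percentage (digits only) of the filesystem holding the given path.
disk_used_pct() {
  df -P "$1" | awk 'NR==2 { gsub("%", "", $5); print $5 }'
}

# Succeed when usage is above the threshold.
# The threshold defaults to 95, the default flood-stage watermark in ES.
exceeds_flood_stage() {
  used_pct="$1"
  threshold="${2:-95}"
  [ "$used_pct" -gt "$threshold" ]
}
```

For example, `exceeds_flood_stage "$(disk_used_pct /)" && echo "disk is past the flood stage"`. On the ES side, `GET _cat/allocation?v` reports disk usage per node.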

Resolution

The simplest fix is to remove the block from the affected index, as described in the official documentation.

PUT /my-index-000001/_settings
{
  "index.blocks.read_only_allow_delete": null
}

The equivalent curl command is:

curl -X PUT "localhost:9200/my-index-000001/_settings?pretty" -H 'Content-Type: application/json' -d'
{
  "index.blocks.read_only_allow_delete": null
}
'
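
When several indices were blocked at once, the same setting can be reset on all of them by targeting `_all` instead of a single index name. The following is a minimal sketch; the function names and the `ES_URL` variable are assumptions for illustration, and the call only makes sense after disk space has been freed, since ES will otherwise apply the block again.

```shell
# Assumed endpoint; override ES_URL if ES listens elsewhere.
ES_URL="${ES_URL:-http://localhost:9200}"

# Request body that resets the block setting to its default (unset).
unblock_body() {
  printf '{"index.blocks.read_only_allow_delete": null}'
}

# Clear the block on the given index or pattern; defaults to _all (every index).
clear_read_only_block() {
  target="${1:-_all}"
  curl -s -X PUT "${ES_URL}/${target}/_settings?pretty" \
    -H 'Content-Type: application/json' \
    -d "$(unblock_body)"
}
```

Calling `clear_read_only_block` with no argument targets every index, while `clear_read_only_block my-index-000001` reproduces the single-index request shown earlier.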

If disk usage cannot be reduced, adjusting the following ES settings is another option.

cluster.routing.allocation.disk.threshold_enabled
(Dynamic) Defaults to true. Set to false to disable the disk allocation decider.
cluster.routing.allocation.disk.watermark.low
(Dynamic) Controls the low watermark for disk usage. It defaults to 85%, meaning that Elasticsearch will not allocate shards to nodes that have more than 85% disk used. It can also be set to an absolute byte value (like 500mb) to prevent Elasticsearch from allocating shards if less than the specified amount of space is available. This setting has no effect on the primary shards of newly-created indices but will prevent their replicas from being allocated.
cluster.routing.allocation.disk.watermark.high
(Dynamic) Controls the high watermark. It defaults to 90%, meaning that Elasticsearch will attempt to relocate shards away from a node whose disk usage is above 90%. It can also be set to an absolute byte value (similarly to the low watermark) to relocate shards away from a node if it has less than the specified amount of free space. This setting affects the allocation of all shards, whether previously allocated or not.
cluster.routing.allocation.disk.watermark.enable_for_single_data_node
(Static) For a single data node, the default is to disregard disk watermarks when making an allocation decision. This is deprecated behavior and will be changed in 8.0. This setting can be set to true to enable the disk watermarks for a single data node cluster (will become default in 8.0).
cluster.routing.allocation.disk.watermark.flood_stage
(Dynamic) Controls the flood stage watermark, which defaults to 95%. Elasticsearch enforces a read-only index block (index.blocks.read_only_allow_delete) on every index that has one or more shards allocated on the node, and that has at least one disk exceeding the flood stage. This setting is a last resort to prevent nodes from running out of disk space. The index block is automatically released when the disk utilization falls below the high watermark.

You cannot mix the usage of percentage values and byte values within these settings. Either all values are set to percentage values, or all are set to byte values. This enforcement is so that Elasticsearch can validate that the settings are internally consistent, ensuring that the low disk threshold is less than the high disk threshold, and the high disk threshold is less than the flood stage threshold.
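
Because these watermark settings are dynamic, they can be changed at runtime through the cluster settings API. The sketch below uses percentage values only, in line with the rule above, and rejects values that are not strictly increasing. The function names and the example percentages are assumptions for illustration, not recommended values.

```shell
# Assumed endpoint; override ES_URL if ES listens elsewhere.
ES_URL="${ES_URL:-http://localhost:9200}"

# Build a persistent cluster-settings body from three whole-number percentages.
# Refuses values that do not satisfy low < high < flood stage.
watermark_body() {
  low="$1"; high="$2"; flood="$3"
  if [ "$low" -ge "$high" ] || [ "$high" -ge "$flood" ]; then
    echo "watermarks must satisfy low < high < flood_stage" >&2
    return 1
  fi
  printf '{"persistent": {"cluster.routing.allocation.disk.watermark.low": "%s%%", "cluster.routing.allocation.disk.watermark.high": "%s%%", "cluster.routing.allocation.disk.watermark.flood_stage": "%s%%"}}' "$low" "$high" "$flood"
}

# Apply the new watermarks through the cluster settings API.
set_watermarks() {
  body="$(watermark_body "$1" "$2" "$3")" || return 1
  curl -s -X PUT "${ES_URL}/_cluster/settings?pretty" \
    -H 'Content-Type: application/json' \
    -d "$body"
}
```

For example, `set_watermarks 90 95 97` raises all three thresholds together. Raising them only buys time; the disk still fills up unless data is removed or capacity is added.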

The screenshot below shows the result of sending the request from the Kibana Dev Tools console.
[Image: response via Kibana]

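After this, ES responds to requests again, but this is not a fundamental fix.
The fundamental fix is to reduce disk usage. In the environment above, however, the problem arose inside the Docker container, where the data was being generated, so disk space has to be freed inside the container and the container restarted before the reduced usage is actually recognized.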
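
The check-and-restart steps could be scripted roughly as follows. This is a sketch under assumptions: the container name passed in is hypothetical, and `/usr/share/elasticsearch/data` is the default data path of the official ES Docker images, which may differ in a custom setup.

```shell
# Assumed defaults; override them to match your container setup.
DOCKER="${DOCKER:-docker}"
ES_DATA_PATH="${ES_DATA_PATH:-/usr/share/elasticsearch/data}"

# Show disk usage of the ES data directory as seen from inside the container.
container_disk_usage() {
  "$DOCKER" exec "$1" df -h "$ES_DATA_PATH"
}

# Summarize the space used by images, containers, and volumes on the host.
docker_disk_summary() {
  "$DOCKER" system df
}

# Restart the container once space has been freed.
restart_container() {
  "$DOCKER" restart "$1"
}
```

With a container named `es01` (a placeholder), the flow would be `container_disk_usage es01`, free the space, then `restart_container es01`.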
