26M19c1

Young-Kyoo Kim · May 19, 2026

A specific objectstore’s readiness probe was false for about 3–4 minutes, and during that period quorum-related errors were reported across multiple EC sets.

(related pod log)

API: SYSTEM.storage
Error: Read quorum could not be established on pool: 0, set: 94, expected read quorum: 5, drives-online: 0 (*errors.errorString)
4: cmd/logging.go:228:cmd.storageLogIf()
3: cmd/erasure-server-pool.go:3841:cmd.(*erasureServerPools).Health()
2: cmd/healthcheck-handler.go:120:cmd.clusterCheck()
1: cmd/healthcheck-handler.go:275:cmd.clusterReadCheckHandler()

...

Error: Write quorum could not be established on pool: 0, set: 95, expected read quorum: 5, drives-online: 0 (*errors.errorString)

...

API: GetObject(bucket=xxx, object=xxx)
UserAgent: MinIO ..
Error: context canceled (*errors.errorString)
GetObjectInfo="name=xxx,pool=1,set=93"
4: cmd/logging.go:172:cmd.internalLogIf()
3: cmd/api-errors.go:2955:cmd.toAPIError()
2: cmd/object-handlers.go:490:cmd.objectAPIHandlers.getObjectHandler()
1: cmd/object-handlers.go:791:cmd.objectAPIHandlers.GetObjectHandler()

...

Error: Write quorum could not be established on pool: 0, set: 96, expected read quorum: 5, drives-online: 0 (*errors.errorString)

...

Error: Read quorum could not be established on pool: 0, set: 96, expected read quorum: 5, drives-online: 0 (*errors.errorString)

........

API: SYSTEM.grid
Error: grid: marking http://lake-pool-1-26.xxx ... offline temporarily; caused by dial tcp xxx:9000: i/o timeout (3) (*fmt.wrapError)

As shown above, errors were reported not only for EC sets in pool 0, where the affected node belongs, but also for EC sets in pool 1.
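
To see at a glance which pools and sets the quorum errors refer to, the pod log can be tallied with a short script. The following is a minimal Python sketch, assuming the log lines keep the message format shown above; the function names are illustrative and not part of MinIO.

```python
import re
from collections import Counter

# Matches both "Read quorum ..." and "Write quorum ..." lines as pasted above.
QUORUM_RE = re.compile(
    r"(?P<kind>Read|Write) quorum could not be established on "
    r"pool: (?P<pool>\d+), set: (?P<set>\d+), "
    r"expected \w+ quorum: (?P<quorum>\d+), drives-online: (?P<online>\d+)"
)


def parse_quorum_errors(lines):
    """Return one dict per quorum-error line; other lines are ignored."""
    events = []
    for line in lines:
        match = QUORUM_RE.search(line)
        if match:
            events.append({
                "kind": match["kind"],
                "pool": int(match["pool"]),
                "set": int(match["set"]),
                "quorum": int(match["quorum"]),
                "online": int(match["online"]),
            })
    return events


def count_by_pool_and_set(events):
    """Count quorum errors per (pool, set) pair."""
    return Counter((e["pool"], e["set"]) for e in events)


def count_by_drives_online(events):
    """Count quorum errors per reported drives-online value."""
    return Counter(e["online"] for e in events)
```

Feeding it the pod log line by line (for example from a saved log file) gives the per-set counts and the spread of drives-online values, which makes it easier to compare incidents across pods.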

In the same timeframe, the following entries were found in the server log (messages):

(12 seconds before the pod logs above appeared)

kubelet I0507 prober.go:120 "Probe failed" probeType="Readiness" pod="lake-objectstore/lake-pool-0-47" podUID=xx containerName="sidecar" probeResult="failure" output="HTTP probe failed with statuscode: 502"

..

The cluster is configured with rack awareness, and even if a single node has an issue, the service should continue operating normally using the remaining 127 nodes. However, I do not understand why EC set errors are being reported even from other unrelated pools.

We would like to understand what kinds of conditions can cause readiness probe failures like the above, and which areas we should pay closer attention to in our environment.

Jeeva 01:59 PM
Hi @Youngkyoo Kim, can you please link the cluster to this ticket? Did you see any network glitch during the error period?
What is the current status? Is it still reporting the error?

Youngkyoo Kim 02:24 PM
It occurred yesterday between 3:22 PM and 3:26 PM, and things have been stable since then.

Although it was on a different objectstore, a small number of 5xx errors were also observed at two locations around 11 AM on the 6th (there were no quorum errors like the above at that time).

We have been continuously checking the network side (Cilium), but so far we have not found anything unusual.

Even though this was a case where access to only a single node’s objectstore was unavailable, the overall minio_cluster_erasure_set_write_tolerance metric also dropped based on the threshold. Could that behavior be expected?
(Briefly, for less than 30 seconds, the tolerance value of all EC sets dropped to the minimum level.)

Kranthi 04:57 PM
If rack awareness is used, a single node failure should not have resulted in quorum loss, considering the pods are spread out.
Are these quorum errors from the single pod during this period?

Or do we see these errors on other pods too?

minio_cluster_erasure_set_write_tolerance is a site-wide metric. If a pod is not available, it impacts multiple EC sets and thereby this metric, so the drop is expected.

May 9th, 2026

Kranthi 12:34 AM
If we see these errors only on the impacted pod, it means that this particular pod had issues and logged errors because it lost connections to other pods.
Requests reaching the site should have been served by the other pods and should not have been impacted.

May 11th, 2026

Youngkyoo Kim 11:34 AM
The quorum errors were observed in about 70% of the pods, and the same 30 EC sets appeared repeatedly across different time periods. (There were cases where quorum errors occurred together with 5xx errors, and other cases where they did not.)

Those 30 EC sets (out of the total 135 EC sets in pool 0) are associated with the following pods and drives:

EC set 5: 0-40 ~ 0-47 /export0/data
EC set 6: 0-48 ~ 0-53 /export0/data, 0-0 ~ 0-1 /export1/data
EC set 12: 0-42 ~ 0-49 /export1/data
EC set 19: 0-44 ~ 0-51 /export2/data
EC set 25: 0-38 ~ 0-45 /export3/data
EC set 26: 0-46 ~ 0-53 /export3/data
The same pattern is repeated for the remaining 24 EC sets as follows:

EC sets 32, 33, 39, 46, 52, 53
EC sets 59, 60, 66, 73, 79, 80
EC sets 86, 87, 93, 100, 106, 107
EC sets 113, 114, 120, 127, 133, 134
When the 5xx errors and quorum errors occurred at the same time, there was consistently (both times in the last 2 weeks) an issue (5xx errors) on pod 0-47.
However, there were also cases where 5xx errors occurred on 0-47 without any accompanying quorum errors.

The cluster-wide EC set quorum error I mentioned initially occurred only once on the afternoon of May 7th.
On other days (since the end of March), as in the case described above, 30 EC sets reported quorum errors simultaneously (different groups in each case).
Is that considered a normal situation?
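
The six listed sets and their repetition can be reproduced with simple arithmetic. The sketch below is an inference from the numbers in this thread, not a statement about MinIO internals. It assumes pool 0 has 54 pods (0-0 to 0-53), 20 drives per pod (a figure mentioned later in the thread), drives ordered export by export across the pods, and consecutive groups of 8 drives forming one EC set (54 × 20 / 8 = 135). Under those assumptions the pattern repeats every 27 sets (4 exports × 54 pods / 8 drives).

```python
PODS = 54            # assumption: pool 0 pods are 0-0 .. 0-53
DRIVES_PER_POD = 20  # from the thread; export numbering 0..19 is an assumption
SET_SIZE = 8         # assumption: 54 * 20 drives / 135 sets


def ec_set(pod, export):
    """EC set index of /export<export>/data on pod 0-<pod> under the assumed layout."""
    return (export * PODS + pod) // SET_SIZE


def members(set_index):
    """(pod, export) pairs that make up one EC set under the assumed layout."""
    start = set_index * SET_SIZE
    # divmod gives (export, pod); reverse it to (pod, export).
    return [divmod(i, PODS)[::-1] for i in range(start, start + SET_SIZE)]


def sets_of_pod(pod):
    """All EC sets that contain at least one drive of the given pod."""
    return sorted({ec_set(pod, export) for export in range(DRIVES_PER_POD)})


# The 30 sets reported above: six base sets repeated with a stride of 27.
REPORTED = sorted(base + 27 * k for k in range(5) for base in (5, 6, 12, 19, 25, 26))

if __name__ == "__main__":
    for s in (5, 6, 12, 19, 25, 26):
        print(s, members(s))
    print(sorted(set(REPORTED) - set(sets_of_pod(47))))
```

Under these assumptions, `members()` returns exactly the pod and drive ranges listed for sets 5, 6, 12, 19, 25 and 26, and 20 of the 30 reported sets contain a drive of pod 0-47. The remaining 10 (sets 6 and 25 and their repeats) do not, so the layout alone does not explain them through pod 0-47.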

RIA CHOI 01:57 PM
Additionally, we noticed that pod 0-47 had a significantly higher goroutine count compared to other pods.

Most pods were around ~3k goroutines and rarely exceeded 10k, but pod 0-47 was consistently around ~10k and occasionally spiked up to ~500k goroutines.

Kranthi 02:13 PM
I guess these pods are spread over multiple nodes.
Do we know their rack situation? Could they be on the same rack?

kubectl get objectstore lake -n lake-objectstore -o yaml
kubectl get pods -n lake-objectstore -l v1.min.io/pool=pool-0 -o wide
kubectl get nodes --show-labels

RIA CHOI
Additionally, we noticed that pod 0-47 had a significantly higher goroutine count compared to other pods.

Most pods were around ~3k goroutines and rarely exceeded 10k, but pod 0-47 was consistently around ~10k and occasionally spiked up to ~500k goroutines.

Is this currently being observed as well? How are CPU and memory faring?

RIA CHOI 02:43 PM
To answer your questions:

The affected pods are spread across different racks as expected.
The goroutine count on pod 0-47 repeatedly returned to a normal range (~5k), then temporarily spiked again (up to ~50k or higher).
CPU and memory appear normal overall, and we did not observe obvious OOM conditions.

We'd like to consolidate our outstanding questions:

Cross-pool quorum-related errors from a single pod issue
During the May 7th incident, quorum-related errors were observed on EC sets in both pool 0 and pool 1, even though pod 0-47 appeared to be the primary affected pod. With rack-aware placement and 127 remaining healthy nodes, what conditions could cause quorum-related errors to appear across unrelated EC sets/pools?

Recurring 30-EC-set quorum errors
Since late March, we've observed a recurring pattern where approximately 30 EC sets report quorum errors simultaneously (always different groups each time). Is this considered normal behavior, or could it indicate an underlying issue?

write_tolerance metric drop
During the incident, minio_cluster_erasure_set_write_tolerance briefly dropped to minimum across all EC sets for under 30 seconds. Is this expected behavior when a single pod becomes temporarily unavailable?

Goroutine spike on pod 0-47
Pod 0-47 was consistently running at a higher goroutine count than other pods (~10k vs ~3k normally), and periodically spiked much higher before returning to normal again. Could this type of goroutine accumulation behavior have contributed to the readiness probe failure, RPC timeout behavior, or quorum-related errors?

We just confirmed that pod 0-47 was being used directly as an endpoint by one of the services during the affected timeframe.

Could this potentially have contributed to the issue?

We will investigate this path further on our side as well.

Kranthi 04:19 PM

RIA CHOI
We just confirmed that pod 0-47 was being used directly as an endpoint by one of the services during the affected timeframe.

Could this potentially have contributed to the issue?

We will investigate this path further on our side as well.

Yes, it definitely would.
If it is bombarded by many requests directly, that could lead to a high goroutine count.

Readiness failures would be a result of that.

Is there any specific reason for using this pod as a direct endpoint? This should definitely be avoided.

RIA CHOI 05:29 PM
We've requested that they switch to using the LoadBalancer endpoint instead of directly targeting the pod.

We'll continue monitoring the environment and observe whether the behavior improves.

Krutika Dhananjay 06:38 PM
Noted. Thanks.

May 12th, 2026

Youngkyoo Kim 03:57 PM
When the affected ObjectStore pod was returning 502 errors and failing readiness probes for about 4 minutes, we observed the following behavior at the beginning of the issue:

The pod's memory usage increased by around 40 GB within about 1 minute.
Busy IRQs on the node spiked to around 26% and then dropped again about a minute later.
The pod's goroutine count also increased significantly (close to 100k) before going back down.
Other resource metrics such as CPU, memory, disk, and network usage did not show any other obvious anomalies.
My assumption is that a sudden burst of load may have temporarily prevented the pod from responding properly to health checks, resulting in the 5xx errors.

At the moment, Prometheus accesses the system for archiving purposes through NodePort. I’ll share whether the behavior changes after we switch to accessing it through a Service starting tomorrow.

(Currently, the 5xx error occurs on this pod about 1–3 times per week, and each time the pod recovers and returns to normal after about 4 to 4.5 minutes. It also occurs on a few pods besides the one connected to Prometheus.)

However, regarding the intermittent quorum errors, I would appreciate some advice on what additional areas we should investigate.

As mentioned earlier, the issue is not limited to 1–2 EC sets. Read and write quorum errors occur simultaneously across around 30 EC sets. Over the past two months, this usually happened only once at a specific time.
However, in the case from the afternoon of the 7th, the errors continued for about 6 minutes and affected not just one group of 30 EC sets, but multiple different EC sets as well. For a short period, the tolerance for all EC sets also dropped to the minimum level (minio_cluster_erasure_set_write_tolerance).

Could you explain under what conditions EC set tolerance issues typically occur? Also, what kinds of situations could cause quorum errors to occur simultaneously across around 30 EC sets like this?

Kranthi 06:01 PM
You are right; yes, this could be the consequence of a traffic burst on a single pod.
Also, I guess the client is a bit slow reading the responses, so goroutines might be waiting on downstream I/O.

I thought you had a different client sending requests directly.
Is it running any periodic jobs, such as deletions? It would be important to know that.

We could run a trace (mc admin trace lake) if we see the issue live, to know what requests are coming in.
I guess the issue occurs at random for a short time, so it might be difficult to time.

If I understand correctly, are the 30-EC-set quorum errors reported on multiple pods? Or did a few pods report various EC set quorum errors?
Do you see many pods reporting drives-online: 0?

As a single pod has 20 drives, at most it should impact only 20 EC sets.
Given the memory usage burst, did you notice any impact on the node in the node metrics?

May 13th, 2026

Youngkyoo Kim 11:05 AM
Several pods (25%~74% out of 128 nodes) reported 30-EC-set quorum errors.
Based on what we've identified so far, quorum errors occurred in some cases during the 5xx errors. From the metrics alone, it's hard to determine whether there were disk issues (though it seems unlikely that multiple disks failed simultaneously, and nothing unusual was observed on the network side either).

Harshavardhana 11:24 AM
No, it must be related to the network. Can you share the ethtool -S output from all the nodes?

-> grid: marking lake-pool-1-26 ... offline temporarily; i/o timeout.

This is an indication of that.

May 14th, 2026

Youngkyoo Kim 08:05 AM
Thanks. We'll check it further.
The ethtool -S output is quite large, so it may need to be uploaded through a separate process to comply with security requirements. I’ll check the other items first and then proceed with that.

Jeeva 11:57 AM
Thanks @Youngkyoo Kim, please keep us posted.

May 15th, 2026

Youngkyoo Kim 12:07 PM
After monitoring the system for a few more days, I’ve observed that quorum errors are occurring even in the absence of 5xx errors.

These incidents happen about 1 to 3 times a day, with each instance lasting between 2 and 10 minutes. Notably, the errors are not isolated to a specific node but are appearing across multiple pods.

I have two specific questions regarding this:

When these quorum errors occur, does it imply that actual read and write operations for that specific erasure set are failing?

And what other areas or metrics should I investigate to pinpoint the root cause?

Looking forward to your guidance.

Kranthi 01:33 PM
Just to confirm:
Are all the errors the same? Do they show drives-online: 0?

Like this one? Error: Read quorum could not be established on pool: 0, set: 96, expected read quorum: 5, drives-online: 0 (*errors.errorString)
If that's the case, we suspect a network blip in inter-pod connectivity.

I guess you already have the drives monitored, and they do not show any variation during these periods?

I will gather all the other possible causes of these errors and list them here to help you plan further investigation.

Youngkyoo Kim 01:52 PM
It varies: drives-online: 0 / 1 / 2 / 3 / 4

Youngkyoo Kim 04:30 PM
To provide more context, during that 2-to-3 minute window, we observed a sharp drop in the drives-online metric across multiple Erasure Sets (135 sets in Pool 0):

10 sets: drives-online: 0,
5 sets: drives-online: 1,
5 sets: drives-online: 2,
5 sets: drives-online: 3,
5 sets: drives-online: 4

Surprisingly, other Grafana metrics seemed to show absolutely no anomalies. The 4xx error rates remained at their usual baseline before and after the event, and there were no 5xx errors recorded at all.

Given this discrepancy, I need to know how to definitively verify whether these read/write quorum errors on the EC sets actually caused real data access failures for our applications. Could you guide me on the best methods to cross-check actual user-facing disruptions during such metric drops?
(No abnormal things in audit log)

Ali Yaldaz 04:41 PM
Hi @Youngkyoo Kim,
Is it possible to share the ethtool -S output with us? We need it to troubleshoot the issue.

Kranthi 04:53 PM
Hello @Youngkyoo Kim

Thank you for the details.
Usually, quorum errors come from the causes below:

Drive states are not OK.
 - As your other drive metrics are fine, we do not suspect this.

Peer-RPC failures. This could be the case for us here.
 - Peer pod actually down: you mentioned the pods are healthy during these periods, so this may not be our case.
 - Peer pod overloaded.
 - Local pod's gridConn to the peer is in the clusterDeadline "offline temporarily" state: the circuit breaker is open after past failures.
 - The reporting pod's Health() fan-out raced its 10-second clusterDeadline and substituted offline placeholders for peers whose RPCs didn't complete in time.
 If this is the case, these are only artefacts of the reporting pod's local view and did not cause real user-facing failures.

 Follow-up questions for this:
 - Have you now changed the traffic that was routed to pods directly? If it is still routed that way, the errors would fall into this category.
 - Can you think of any heavy operations during the observed periods (not just the number of operations)? Is this a peak period?

Network event between pods. This could be the case too.
 - The ethtool -S output is required to verify or rule this out.
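
The fan-out case above can be illustrated with a toy model. This is only a sketch of the general idea (a health check that fans out to peers and treats anything slower than a deadline as offline); it is not MinIO's implementation, and `probe`, `health_view` and the simulated delays are hypothetical.

```python
import asyncio


async def probe(peer, delay):
    """Simulated peer RPC: answers after `delay` seconds."""
    await asyncio.sleep(delay)
    return peer


async def health_view(peer_delays, deadline, read_quorum):
    """Fan out to all peers and count only those that answer before the deadline."""
    tasks = [asyncio.create_task(probe(peer, delay))
             for peer, delay in peer_delays.items()]
    done, pending = await asyncio.wait(tasks, timeout=deadline)
    for task in pending:
        task.cancel()  # slow peers are treated as offline in this local view
    online = sorted(task.result() for task in done)
    offline = sorted(set(peer_delays) - set(online))
    return {
        "drives_online": len(online),
        "offline": offline,
        "quorum_ok": len(online) >= read_quorum,
    }


if __name__ == "__main__":
    # Eight peers, four of them slower than the deadline.
    delays = {f"peer-{i}": (0.0 if i < 4 else 0.5) for i in range(8)}
    print(asyncio.run(health_view(delays, deadline=0.1, read_quorum=5)))
```

In this model the slow peers are healthy and would have answered eventually, yet the reporting node still logs fewer drives online than the quorum. That is the sense in which such an error can be a local-view artefact.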
Yesterday

Harshavardhana 05:41 PM
Given this discrepancy, I need to know how to definitively verify whether these read/write quorum errors on the EC sets actually caused real data access failures for our applications. Could you guide me on the best methods to cross-check actual user-facing disruptions during such metric drops?
(No abnormal things in audit log)
MinIO does not hold any single state like that. What is being reported is always transient, which points to a network that is flapping or dropping packets. For applications, however, this may or may not be visible: by the time their requests arrive, the server may already have reconnected.

So the log itself is not a definitive "quorum loss". If it is persistent, yes, but otherwise it may or may not be. It is also a per-node view when there are network splits, not a cluster-wide view.

What you are seeing is definitive proof of network issues in your environment, so that needs to be investigated, i.e. packet loss, packet discards, MTU mismatches, etc.

We probably need the ethtool -S output from the host system to see what those NIC counters indicate.
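
Because the ethtool -S output is large, it may help to reduce it to the counters that changed between two captures. The following is a minimal Python sketch, assuming the usual "name: value" layout of ethtool -S output. Counter names differ per NIC driver, so the keyword list and the sample names used with it are assumptions to adjust.

```python
# Substrings that often appear in loss-related counter names (assumption).
KEYWORDS = ("drop", "discard", "err", "miss", "fifo", "crc")


def parse_ethtool_stats(text):
    """Parse `ethtool -S <iface>` output into {counter_name: int}."""
    stats = {}
    for line in text.splitlines():
        name, sep, value = line.strip().rpartition(":")
        value = value.strip()
        if sep and value.isdigit():
            stats[name.strip()] = int(value)
    return stats


def increased_counters(before, after, keywords=KEYWORDS):
    """Return loss-related counters that grew between two captures, with the delta."""
    deltas = {}
    for name, new_value in after.items():
        if not any(k in name.lower() for k in keywords):
            continue
        delta = new_value - before.get(name, 0)
        if delta > 0:
            deltas[name] = delta
    return deltas
```

Capturing the output on each node before and after an incident window and comparing the two snapshots keeps the shared data small and points at the nodes whose drop or error counters moved.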
