Analyzing etcd performance
The brain of any good Kubernetes cluster is etcd (pronounced et-see-dee), an open source, distributed, consistent key-value store for shared configuration, service discovery, and scheduler coordination of distributed systems or clusters of machines.
As the primary datastore of Kubernetes, etcd stores and replicates all Kubernetes cluster state. Since it is a critical component of a Kubernetes cluster, it is important that etcd has a reliable approach to its configuration and management and is hosted on hardware able to support its rigorous demands. When etcd is experiencing performance issues, the entire cluster suffers and can become unstable and unusable.
In this module we will use a Python script to analyze the etcd pod logs in an OpenShift must-gather, helping us identify common errors and correlate those errors to specific time frames.
etcd-ocp-diag.py
With this Python script we will be able to identify common errors in the etcd pod logs, including:
- "Apply Request Took Too Long" - etcd took longer than expected to apply the request.
- "Failed to Send Out Heartbeat" - The etcd node took longer than expected to send out a heartbeat.
- "Lost Leader" and "Elected Leader" - The node has not heard from the leader within the set period, causing an election.
- "wal: Sync Duration" - The write-ahead log (WAL) sync took too long.
- "The Clock Difference Against Peer" - The peers have too large of a clock difference.
- "Server is Likely Overloaded" - Per the etcd docs, this error is a direct result of slow disk performance.
- "Sending Buffer is Full" - The sending buffer for etcd raft messages is full due to slowness of other nodes.
- and more.
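Under the hood, finding these messages is essentially substring matching against each pod's log and tallying the hits. The following is a minimal illustrative sketch of that idea, not the script's actual implementation; the patterns and sample log lines are simplified examples:

```python
from collections import Counter

# Simplified patterns corresponding to the common etcd error messages above.
ERROR_PATTERNS = [
    "apply request took too long",
    "failed to send out heartbeat",
    "lost leader",
    "elected leader",
    "slow fdatasync",
    "clock difference",
    "server is likely overloaded",
    "sending buffer is full",
]

def count_errors(log_lines):
    """Tally known etcd error messages across a pod's log lines."""
    counts = Counter()
    for line in log_lines:
        lowered = line.lower()
        for pattern in ERROR_PATTERNS:
            if pattern in lowered:
                counts[pattern] += 1
    return counts
```

Running this over each pod's log in a must-gather would yield per-pod counts similar to the --errors output shown later in this module.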
cd ~/Module4/
etcd-ocp-diag.py -h
usage: etcd-ocp-diag.py [-h] --path PATH [--ttl] [--heartbeat] [--election] [--lost_leader] [--fdatasync] [--buffer] [--overloaded] [--etcd_timeout] [--pod POD] [--date DATE] [--compare] [--errors] [--stats] [--previous] [--rotated] [-i]
Process etcd logs and gather statistics.
options:
-h, --help show this help message and exit
--path PATH Path to the must-gather
--ttl Check apply request took too long
--heartbeat Check failed to send out heartbeat
--election Checks for leader elections messages
--lost_leader Checks for lost leader errors
--fdatasync Check slow fdatasync
--buffer Check sending buffer is full
--overloaded Check leader is overloaded likely from slow disk
--etcd_timeout Check etcdserver: request timed out
--pod POD Specify the pod to analyze
--date DATE Specify date for error search in YYYY-MM-DD format
--compare Display only dates or times that happen in all pods
--errors Display etcd errors
--stats Display etcd stats
--previous Use previous logs
--rotated Use rotated logs
-i, --interactive Run in interactive mode
Common Errors
When investigating potential etcd performance issues, we first need to establish which errors are occurring in the etcd pods. To do this we utilize the --errors option of the etcd-ocp-diag.py script.
Comparing the two clusters below, we can see that both have apply request took too long errors, a strong indicator of performance issues. Both also report leader is overloaded likely from slow disk, and in the second cluster we additionally see leader elections, a potential sign of major issues.
etcd-ocp-diag.py --path Cluster_1/must-gather.local --errors
POD ERROR COUNT
etcd-master-0.prod.example.com waiting for ReadIndex response took too long, retrying 24
etcd-master-0.prod.example.com apply request took too long 6105
etcd-master-0.prod.example.com request stats 2829
etcd-master-1.prod.example.com waiting for ReadIndex response took too long, retrying 9
etcd-master-1.prod.example.com apply request took too long 2593
etcd-master-1.prod.example.com leader is overloaded likely from slow disk 330
etcd-master-1.prod.example.com request stats 1075
etcd-ocp-diag.py --path Cluster_2/must-gather.local --errors
POD ERROR COUNT
etcd-master0-openshift waiting for ReadIndex response took too long, retrying 505
etcd-master0-openshift slow fdatasync 25
etcd-master0-openshift apply request took too long 17339
etcd-master0-openshift leader is overloaded likely from slow disk 504
etcd-master0-openshift elected leader 3
etcd-master0-openshift lost leader 3
etcd-master0-openshift lease not found 1
etcd-master0-openshift request stats 16229
etcd-master1-openshift waiting for ReadIndex response took too long, retrying 273
etcd-master1-openshift slow fdatasync 17
etcd-master1-openshift apply request took too long 7087
etcd-master1-openshift leader is overloaded likely from slow disk 214
etcd-master1-openshift elected leader 1
etcd-master1-openshift lost leader 1
etcd-master1-openshift request stats 6103
etcd-master2-openshift waiting for ReadIndex response took too long, retrying 354
etcd-master2-openshift etcdserver: request timed out 3
etcd-master2-openshift slow fdatasync 7
etcd-master2-openshift apply request took too long 12577
etcd-master2-openshift leader is overloaded likely from slow disk 448
etcd-master2-openshift elected leader 2
etcd-master2-openshift lost leader 2
etcd-master2-openshift lease not found 2
etcd-master2-openshift request stats 11688
etcd Error Stats
Now that we know there are performance-related errors in the pod logs, we need to dig into the severity of the performance issues. To do this we use the --stats option. It scans for took too long error messages, collects the reported durations, and calculates the maximum time and when it occurred. It also reports the minimum time over the expected time, the median, and the average, along with the count, and then displays the information. It does the same for the slow fdatasync error message.
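The statistics reported by --stats can be sketched as follows, assuming the timestamps and durations have already been parsed out of the took too long messages (the log parsing itself is omitted, and this is an illustration rather than the script's actual code):

```python
import statistics

def summarize(samples):
    """samples: chronological list of (iso_timestamp, duration_ms) tuples
    extracted from 'took too long' messages for one pod."""
    durations = [d for _, d in samples]
    max_d = max(durations)
    # Timestamp of the worst observation.
    max_ts = next(ts for ts, d in samples if d == max_d)
    return {
        "first": samples[0][0],
        "last": samples[-1][0],
        "max_ms": max_d,
        "max_at": max_ts,
        "min_ms": min(durations),
        "median_ms": statistics.median(durations),
        "average_ms": statistics.mean(durations),
        "count": len(durations),
    }
```

Comparing the median and average against the 200ms expected value gives a quick read on how far out of spec the node is running.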
If you encounter a large number of errors (more than 100 per day on average) or the maximums approach one second or higher, you should dig deeper into etcd performance to see when the slowness is happening and whether there are related issues elsewhere in the cluster.
In the output below we can see that Cluster_2 is experiencing significant performance issues on all 3 etcd nodes, while Cluster_3 shows performance issues, though significant, on only one node.
etcd-ocp-diag.py --path Cluster_2/must-gather.local --stats
Stats about etcd "apply request took too long" messages: etcd-master0-openshift
First Occurrence: 2025-11-17T23:36:03
Last Occurrence: 2025-11-18T00:30:53
Maximum: 82615.2260ms 2025-11-17T23:50:23
Minimum: 200.0120ms
Median: 531.3509ms
Average: 3519.0608ms
Count: 17339
Expected: 200ms
Stats about etcd "apply request took too long" messages: etcd-master1-openshift
First Occurrence: 2025-11-18T00:05:36
Last Occurrence: 2025-11-18T00:31:05
Maximum: 23307.2911ms 2025-11-18T00:30:31
Minimum: 200.0794ms
Median: 478.6989ms
Average: 830.3278ms
Count: 7087
Expected: 200ms
Stats about etcd "apply request took too long" messages: etcd-master2-openshift
First Occurrence: 2025-11-17T23:58:06
Last Occurrence: 2025-11-18T00:31:27
Maximum: 43226.6834ms 2025-11-18T00:04:10
Minimum: 200.0175ms
Median: 540.8824ms
Average: 1761.9831ms
Count: 12577
Expected: 200ms
etcd-ocp-diag.py --path Cluster_3/must-gather.local --stats
Stats about etcd "apply request took too long" messages: etcd-master1.prod-b.openshift.example.com
First Occurrence: 2025-12-03T19:23:37
Last Occurrence: 2025-12-03T19:27:05
Maximum: 13997.5810ms 2025-12-03T19:25:14
Minimum: 201.5809ms
Median: 9999.7021ms
Average: 9151.5463ms
Count: 4907
Expected: 200ms
Searching for Specific Errors
etcd emits well-known error messages that indicate which issues are affecting your cluster, and this script lets you search for them quickly to help determine which problems to focus on.
To do this you run etcd-ocp-diag.py --path <path_to_must_gather> --ttl and it will search all of the etcd pods and return the pod name, date, and count.
In addition to the --ttl option for took too long errors, you can also use the following options:
--ttl Check apply request took too long
--heartbeat Check failed to send out heartbeat
--election Checks for leader elections messages
--lost_leader Checks for lost leader errors
--fdatasync Check slow fdatasync
--buffer Check sending buffer is full
--overloaded Check leader is overloaded likely from slow disk
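Each of these searches boils down to the same pattern: match a known message, then tally the hits per pod per day. A simplified Python sketch of that tallying step (illustrative only; the timestamp format follows the sample outputs, and the pod names in the test are hypothetical):

```python
from collections import Counter

def count_by_pod_and_date(entries):
    """entries: list of (pod, iso_timestamp) tuples for log lines
    that matched one specific error message."""
    counts = Counter()
    for pod, ts in entries:
        date = ts.split("T")[0]  # "2025-11-18T00:04:10" -> "2025-11-18"
        counts[(pod, date)] += 1
    return counts
```

Printing the resulting (pod, date) pairs with their counts produces a table like the --overloaded and --election outputs below.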
etcd-ocp-diag.py --path Cluster_2/must-gather.local --overloaded
POD DATE COUNT
etcd-master0-openshift 2025-11-18 504
etcd-master1-openshift 2025-11-18 214
etcd-master2-openshift 2025-11-17 162
etcd-master2-openshift 2025-11-18 286
etcd-ocp-diag.py --path Cluster_4/must-gather.local --election
POD DATE COUNT
etcd-ocpmstr1.openshift.example.com 2025-12-08 1
etcd-ocpmstr2.openshift.example.com 2025-12-08 1
etcd-ocpmstr3.openshift.example.com 2025-12-08 46
After you have the results for all dates and pods, you can drill down further by specifying the --date and/or --pod options to return the hour and minute the errors happened, or to limit the results to one specific pod.
In this example we can see that a large number of took too long errors occurred on 2025-12-03. When we drill down into that date using the --date option, we can see that the node experienced significant performance issues over a short period of time, which warrants further investigation in the must-gather and with the customer's infrastructure team.
etcd-ocp-diag.py --path Cluster_3/must-gather.local --ttl
POD DATE COUNT
etcd-master1.prod-b.openshift.example.com 2025-12-03 4907
etcd-ocp-diag.py --path Cluster_3/must-gather.local --ttl --date 2025-12-03
POD DATE COUNT MAX_TIME
etcd-master1.prod-b.openshift.example.com 19:23 482 13687.0232ms
etcd-master1.prod-b.openshift.example.com 19:24 1315 13996.3965ms
etcd-master1.prod-b.openshift.example.com 19:25 1432 13997.5810ms
etcd-master1.prod-b.openshift.example.com 19:26 1404 13990.8706ms
etcd-master1.prod-b.openshift.example.com 19:27 274 12668.7780ms
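The per-minute drill-down shown above amounts to bucketing matches by HH:MM and tracking the worst duration in each bucket. A hypothetical sketch of that grouping, assuming the timestamps and durations have already been extracted:

```python
from collections import defaultdict

def drill_down(samples):
    """samples: list of (iso_timestamp, duration_ms) tuples.
    Returns {"HH:MM": (count, max_duration_ms)} sorted by minute."""
    buckets = defaultdict(list)
    for ts, dur in samples:
        hhmm = ts.split("T")[1][:5]  # "2025-12-03T19:23:37" -> "19:23"
        buckets[hhmm].append(dur)
    return {minute: (len(v), max(v)) for minute, v in sorted(buckets.items())}
```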
Finally, you can use the --compare option to see when errors happened on the same date across pods. You can combine it with the --date option to narrow issues down to specific hours or minutes.
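The help text describes --compare as showing dates or times that happen in all pods; judging from the sample output below, it keeps time buckets reported by more than one pod. A sketch under that assumption (not the script's actual logic):

```python
from collections import defaultdict

def compare(per_pod_counts):
    """per_pod_counts: {pod: {date: count}} for one error type.
    Keep only dates reported by more than one pod."""
    by_date = defaultdict(dict)
    for pod, dates in per_pod_counts.items():
        for date, count in dates.items():
            by_date[date][pod] = count
    return {d: pods for d, pods in by_date.items() if len(pods) > 1}
```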
etcd-ocp-diag.py --path Cluster_2/must-gather.local --ttl --compare
Date: 2025-11-17
POD COUNT
etcd-master0-openshift 10079
etcd-master2-openshift 646
Date: 2025-11-18
POD COUNT
etcd-master0-openshift 7260
etcd-master1-openshift 7087
etcd-master2-openshift 11931
etcd-ocp-diag.py --path Cluster_2/must-gather.local --ttl --compare --date 2025-11-18
Date: 00:00
POD COUNT MAX_TIME
etcd-master0-openshift 158 3465.4064ms
etcd-master2-openshift 121 2342.1259ms
Date: 00:01
POD COUNT MAX_TIME
etcd-master0-openshift 308 1403.4716ms
etcd-master2-openshift 213 1148.7342ms
Date: 00:02
POD COUNT MAX_TIME
etcd-master0-openshift 577 1453.5319ms
etcd-master2-openshift 593 1384.3426ms
Date: 00:03
POD COUNT MAX_TIME
etcd-master0-openshift 204 2280.3111ms
etcd-master2-openshift 327 31136.6423ms
Date: 00:04
POD COUNT MAX_TIME
etcd-master0-openshift 73 520.2907ms
etcd-master2-openshift 1154 43226.6834ms
Date: 00:05
POD COUNT MAX_TIME
etcd-master0-openshift 1256 1981.6568ms
etcd-master1-openshift 352 1140.6648ms
etcd-master2-openshift 1245 4330.4708ms
Date: 00:06
POD COUNT MAX_TIME
etcd-master0-openshift 489 1660.3067ms
etcd-master1-openshift 475 1766.2286ms
etcd-master2-openshift 296 1592.4008ms
Date: 00:07
POD COUNT MAX_TIME
etcd-master0-openshift 479 7130.0045ms
etcd-master1-openshift 463 4976.1788ms
etcd-master2-openshift 475
...
Debugging etcd can be complex and there are often a handful of potential root causes. If you find a customer cluster that is showing signs of etcd performance issues, the best course of action is to open a support ticket to engage with our experts. You can also determine next steps by reviewing our detailed etcd debugging article found at https://access.redhat.com/articles/6271341
Lab Scenarios
Scenario 1
The customer reports that cluster performance was bad around midnight from 2025-11-17 into 2025-11-18 on a cluster hosted in VMware on a shared datastore. They had jobs fail and connectivity issues, and are not sure of the cause. Look at the errors and stats for the etcd cluster. We will start by determining how often the apply request took too long errors occurred and their maximum time. From there we need to determine the impact these extremely high took too long messages are having on the cluster by looking for overloaded messages using the --overloaded flag. After confirming whether the disk is overloaded, we will look for any leader issues resulting from the disk performance problems.
etcd-ocp-diag.py --path ./Cluster_2/must-gather.local/ --errors
etcd-ocp-diag.py --path ./Cluster_2/must-gather.local/ --stats
etcd-ocp-diag.py --path ./Cluster_2/must-gather.local/ --ttl --date 2025-11-17
etcd-ocp-diag.py --path ./Cluster_2/must-gather.local/ --ttl --date 2025-11-17 --compare
etcd-ocp-diag.py --path ./Cluster_2/must-gather.local/ --ttl --date 2025-11-18
etcd-ocp-diag.py --path ./Cluster_2/must-gather.local/ --ttl --date 2025-11-18 --compare
etcd-ocp-diag.py --path ./Cluster_2/must-gather.local/ --overloaded --date 2025-11-17
etcd-ocp-diag.py --path ./Cluster_2/must-gather.local/ --overloaded --date 2025-11-18
etcd-ocp-diag.py --path ./Cluster_2/must-gather.local/ --election --date 2025-11-18
Scenario 2
The customer reported that a couple of errors popped up on the cluster, they saw a few connection issues, and then everything went back to normal. They need someone to look into the cluster and see what can be found.
Like in the previous scenario, let’s take a look at the stats and errors, and then try to pinpoint when the issue happened and where the customer can investigate next.
etcd-ocp-diag.py --path ./Cluster_3/must-gather.local/ --errors
etcd-ocp-diag.py --path ./Cluster_3/must-gather.local/ --stats
etcd-ocp-diag.py --path ./Cluster_3/must-gather.local/ --ttl
etcd-ocp-diag.py --path ./Cluster_3/must-gather.local/ --ttl --date 2025-12-03
Scenario 3
This is a tricky situation. Checking the --stats output, we can see etcd nodes 2 and 3 are experiencing significant performance issues, but node 1 is running as expected.
In this scenario we will need to look for a range and a pattern to help narrow down why these nodes are having such high performance issues. Look for a pattern starting on 2025-12-08 and going into 2025-12-09 that will help narrow down and pinpoint exactly when the issue is occurring, and help predict and identify the next occurrence.
etcd-ocp-diag.py --path ./Cluster_4/must-gather.local/ --errors
etcd-ocp-diag.py --path ./Cluster_4/must-gather.local/ --stats
etcd-ocp-diag.py --path ./Cluster_4/must-gather.local/ --ttl --pod etcd-ocpmstr2.openshift.example.com --date 2025-12-08
etcd-ocp-diag.py --path ./Cluster_4/must-gather.local/ --ttl --pod etcd-ocpmstr2.openshift.example.com --date 2025-12-09