Analyzing etcd performance

The brain of any good Kubernetes cluster is etcd (pronounced et-see-dee), an open source, distributed, consistent key-value store for shared configuration, service discovery, and scheduler coordination of distributed systems or clusters of machines.

As the primary datastore of Kubernetes, etcd stores and replicates all Kubernetes cluster state. Since it is a critical component of a Kubernetes cluster, it is important that etcd has a reliable approach to configuration and management and is hosted on hardware that can support its rigorous demands. When etcd experiences performance issues, the entire cluster suffers and can become unstable and unusable.

In this module, we will use a Python script to analyze the etcd pod logs in an OpenShift must-gather, helping identify common errors and correlate those errors to specific time frames.

etcd-ocp-diag.py

With this Python script we can identify common errors in the etcd pod logs, including:

  • "Apply Request Took Too Long" - etcd took longer than expected to apply the request.

  • "Failed to Send Out Heartbeat" - The etcd node took longer than expected to send out a heartbeat.

  • "Lost Leader" and "Elected Leader" - The node hasn’t heard from the leader in a set period, causing an election.

  • "wal: Sync Duration" - The Write Ahead Log Sync Duration took too long.

  • "The Clock Difference Against Peer" - The peers have too large a clock difference.

  • "Server is Likely Overloaded" - This error is a direct result of disk performance, per the etcd docs.

  • "Sending Buffer is Full" - The sending buffer for etcd raft messages is full due to slowness of other nodes.

  • and more.
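
At its core, detecting these errors amounts to matching known message substrings in each log line and tallying the hits. The sketch below illustrates the idea; the pattern list and the `count_errors` helper are illustrative assumptions, not the actual implementation in etcd-ocp-diag.py.

```python
from collections import Counter

# A few of the message substrings listed above (illustrative subset).
ERROR_PATTERNS = [
    "apply request took too long",
    "failed to send out heartbeat",
    "lost leader",
    "elected leader",
    "slow fdatasync",
    "leader is overloaded likely from slow disk",
]

def count_errors(log_lines):
    """Count occurrences of each known error pattern in the given lines."""
    counts = Counter()
    for line in log_lines:
        for pattern in ERROR_PATTERNS:
            if pattern in line:
                counts[pattern] += 1
    return counts

# Hypothetical log lines shaped like etcd warnings in a must-gather.
sample = [
    '2025-11-18T00:04:10Z warn "apply request took too long" took 43.2s',
    '2025-11-18T00:04:11Z warn "apply request took too long" took 1.4s',
    '2025-11-18T00:05:02Z warn leader is overloaded likely from slow disk',
]
print(count_errors(sample))
```

The real script walks every etcd pod log directory under the must-gather path and aggregates these counts per pod, which is what the --errors output below reflects.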

Understanding etcd-ocp-diag.py
cd ~/Module4/
etcd-ocp-diag.py -h
usage: etcd-ocp-diag.py [-h] --path PATH [--ttl] [--heartbeat] [--election] [--lost_leader] [--fdatasync] [--buffer] [--overloaded] [--etcd_timeout] [--pod POD] [--date DATE] [--compare] [--errors] [--stats] [--previous] [--rotated] [-i]

Process etcd logs and gather statistics.

options:
  -h, --help         show this help message and exit
  --path PATH        Path to the must-gather
  --ttl              Check apply request took too long
  --heartbeat        Check failed to send out heartbeat
  --election         Checks for leader elections messages
  --lost_leader      Checks for lost leader errors
  --fdatasync        Check slow fdatasync
  --buffer           Check sending buffer is full
  --overloaded       Check leader is overloaded likely from slow disk
  --etcd_timeout     Check etcdserver: request timed out
  --pod POD          Specify the pod to analyze
  --date DATE        Specify date for error search in YYYY-MM-DD format
  --compare          Display only dates or times that happen in all pods
  --errors           Display etcd errors
  --stats            Display etcd stats
  --previous         Use previous logs
  --rotated          Use rotated logs
  -i, --interactive  Run in interactive mode

Common Errors

When investigating potential etcd performance issues, we first need to establish which errors are occurring in the etcd pods. To do this we use the --errors option of the etcd-ocp-diag.py script.

Comparing the two clusters below, we can see that both have errors for apply request took too long, which is a strong indicator of performance issues. We also see leader is overloaded likely from slow disk on both, and in the second cluster we see leader elections, a potential sign of major issues.

Reviewing Common Errors
etcd-ocp-diag.py --path Cluster_1/must-gather.local --errors
POD                             ERROR                                                   COUNT
etcd-master-0.prod.example.com  waiting for ReadIndex response took too long, retrying    24
etcd-master-0.prod.example.com  apply request took too long                             6105
etcd-master-0.prod.example.com  request stats                                           2829
etcd-master-1.prod.example.com  waiting for ReadIndex response took too long, retrying     9
etcd-master-1.prod.example.com  apply request took too long                             2593
etcd-master-1.prod.example.com  leader is overloaded likely from slow disk               330
etcd-master-1.prod.example.com  request stats                                           1075
etcd-ocp-diag.py --path Cluster_2/must-gather.local --errors
POD                     ERROR                                                   COUNT
etcd-master0-openshift  waiting for ReadIndex response took too long, retrying    505
etcd-master0-openshift  slow fdatasync                                             25
etcd-master0-openshift  apply request took too long                             17339
etcd-master0-openshift  leader is overloaded likely from slow disk                504
etcd-master0-openshift  elected leader                                              3
etcd-master0-openshift  lost leader                                                 3
etcd-master0-openshift  lease not found                                             1
etcd-master0-openshift  request stats                                           16229
etcd-master1-openshift  waiting for ReadIndex response took too long, retrying    273
etcd-master1-openshift  slow fdatasync                                             17
etcd-master1-openshift  apply request took too long                              7087
etcd-master1-openshift  leader is overloaded likely from slow disk                214
etcd-master1-openshift  elected leader                                              1
etcd-master1-openshift  lost leader                                                 1
etcd-master1-openshift  request stats                                            6103
etcd-master2-openshift  waiting for ReadIndex response took too long, retrying    354
etcd-master2-openshift  etcdserver: request timed out                               3
etcd-master2-openshift  slow fdatasync                                              7
etcd-master2-openshift  apply request took too long                             12577
etcd-master2-openshift  leader is overloaded likely from slow disk                448
etcd-master2-openshift  elected leader                                              2
etcd-master2-openshift  lost leader                                                 2
etcd-master2-openshift  lease not found                                             2
etcd-master2-openshift  request stats                                           11688

etcd Error Stats

Now that we know there are performance-related errors in the pod logs, we need to dig into the severity of the performance issues. To do this we use the --stats sub-command. It looks for Took Too Long error messages, collects the expected time, and calculates the maximum time and when it occurred. It also reports the minimum time over the expected time, the median, the average, and the count, then displays the information. It does the same for the slow fdatasync error message.
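
The statistics described above can be computed directly with Python's standard library once the timestamps and durations have been parsed out of the log lines. The `took_too_long_stats` helper below is a hypothetical sketch of that step, not the script's actual code:

```python
import statistics

def took_too_long_stats(samples, expected_ms=200.0):
    """samples: list of (iso_timestamp, duration_ms) tuples parsed from
    'apply request took too long' messages. Returns summary statistics
    like those printed by --stats. Illustrative only."""
    durations = [d for _, d in samples]
    max_ts, max_ms = max(samples, key=lambda s: s[1])
    return {
        # ISO-8601 timestamps sort lexicographically, so min/max work.
        "first": min(t for t, _ in samples),
        "last": max(t for t, _ in samples),
        "max_ms": max_ms,
        "max_at": max_ts,
        "min_ms": min(durations),
        "median_ms": statistics.median(durations),
        "avg_ms": statistics.mean(durations),
        "count": len(durations),
        "expected_ms": expected_ms,
    }

# Made-up durations shaped like the Cluster_2 output below.
samples = [
    ("2025-11-17T23:36:03", 200.0),
    ("2025-11-17T23:50:23", 82615.2),
    ("2025-11-18T00:30:53", 531.3),
]
stats = took_too_long_stats(samples)
print(stats)
```

Reporting the median alongside the average matters here: a handful of extreme outliers (like the 82-second maximum) can inflate the average while the median stays closer to typical behavior.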

If you encounter a large number of errors (averaging over 100 a day), or the maximums approach 1 second or higher, you want to dig deeper into the etcd performance to see when it is happening and whether issues appear elsewhere in the cluster.

In the output below we can see that Cluster_2 is experiencing significant performance issues on all 3 etcd nodes, while Cluster_3 shows performance issues, though significant, on only one node.

Checking for etcd Error Stats
etcd-ocp-diag.py --path Cluster_2/must-gather.local --stats
Stats about etcd "apply request took too long" messages: etcd-master0-openshift
        First Occurrence: 2025-11-17T23:36:03
        Last Occurrence: 2025-11-18T00:30:53
        Maximum: 82615.2260ms 2025-11-17T23:50:23
        Minimum: 200.0120ms
        Median: 531.3509ms
        Average: 3519.0608ms
        Count: 17339
        Expected: 200ms

Stats about etcd "apply request took too long" messages: etcd-master1-openshift
        First Occurrence: 2025-11-18T00:05:36
        Last Occurrence: 2025-11-18T00:31:05
        Maximum: 23307.2911ms 2025-11-18T00:30:31
        Minimum: 200.0794ms
        Median: 478.6989ms
        Average: 830.3278ms
        Count: 7087
        Expected: 200ms

Stats about etcd "apply request took too long" messages: etcd-master2-openshift
        First Occurrence: 2025-11-17T23:58:06
        Last Occurrence: 2025-11-18T00:31:27
        Maximum: 43226.6834ms 2025-11-18T00:04:10
        Minimum: 200.0175ms
        Median: 540.8824ms
        Average: 1761.9831ms
        Count: 12577
        Expected: 200ms
etcd-ocp-diag.py --path Cluster_3/must-gather.local --stats
Stats about etcd "apply request took too long" messages: etcd-master1.prod-b.openshift.example.com
        First Occurrence: 2025-12-03T19:23:37
        Last Occurrence: 2025-12-03T19:27:05
        Maximum: 13997.5810ms 2025-12-03T19:25:14
        Minimum: 201.5809ms
        Median: 9999.7021ms
        Average: 9151.5463ms
        Count: 4907
        Expected: 200ms

Searching for Specific Errors

etcd emits well-known error messages that indicate which issues are affecting your cluster, and this script lets you search for them quickly to help determine which problems to focus on.

To do this, run etcd-ocp-diag.py --path <path_to_must_gather> --ttl and it will search all of the etcd pods and return the pod name, date, and count.

In addition to the Took Too Long search, the following sub-commands are available for specific errors:

  --ttl              Check apply request took too long
  --heartbeat        Check failed to send out heartbeat
  --election         Checks for leader elections messages
  --lost_leader      Checks for lost leader errors
  --fdatasync        Check slow fdatasync
  --buffer           Check sending buffer is full
  --overloaded       Check leader is overloaded likely from slow disk
Using Sub-Commands to look for Specific Errors
etcd-ocp-diag.py --path Cluster_2/must-gather.local --overloaded
POD                     DATE            COUNT
etcd-master0-openshift  2025-11-18      504
etcd-master1-openshift  2025-11-18      214
etcd-master2-openshift  2025-11-17      162
etcd-master2-openshift  2025-11-18      286
etcd-ocp-diag.py --path Cluster_4/must-gather.local --election
POD                                     DATE            COUNT
etcd-ocpmstr1.openshift.example.com     2025-12-08       1
etcd-ocpmstr2.openshift.example.com     2025-12-08       1
etcd-ocpmstr3.openshift.example.com     2025-12-08      46

After you return the results for all dates and pods, you can drill down further by specifying the --date and/or --pod options, which return the hour and minute each error happened and limit the results to one specific pod.

In this example we can see that a large number of Took Too Long errors occurred on 2025-12-03. When we drill down into that date using the --date option, we can see the node experienced significant performance issues over a short period of time, which warrants further investigation in the must-gather and with the customer’s infrastructure team.
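
The drill-down amounts to bucketing the parsed samples by hour and minute, then reporting the count and maximum per bucket. A minimal sketch, assuming the same (timestamp, duration) tuples as before (the `per_minute_summary` name is an assumption, not the script's API):

```python
from collections import defaultdict

def per_minute_summary(samples):
    """Group (iso_timestamp, duration_ms) samples by HH:MM and report
    (count, max_ms) per minute, like the --date drill-down output."""
    buckets = defaultdict(list)
    for ts, ms in samples:
        # ISO timestamps like 2025-12-03T19:25:14 -> key "19:25"
        buckets[ts[11:16]].append(ms)
    return {minute: (len(v), max(v)) for minute, v in sorted(buckets.items())}

# Made-up samples shaped like the Cluster_3 drill-down below.
samples = [
    ("2025-12-03T19:23:37", 482.0),
    ("2025-12-03T19:23:59", 13687.0),
    ("2025-12-03T19:24:01", 13996.4),
]
print(per_minute_summary(samples))
```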

Using the Date and Pod Options
etcd-ocp-diag.py --path Cluster_3/must-gather.local --ttl
POD                                             DATE            COUNT
etcd-master1.prod-b.openshift.example.com       2025-12-03      4907
etcd-ocp-diag.py --path Cluster_3/must-gather.local --ttl --date 2025-12-03
POD                                             DATE    COUNT   MAX_TIME
etcd-master1.prod-b.openshift.example.com       19:23    482    13687.0232ms
etcd-master1.prod-b.openshift.example.com       19:24   1315    13996.3965ms
etcd-master1.prod-b.openshift.example.com       19:25   1432    13997.5810ms
etcd-master1.prod-b.openshift.example.com       19:26   1404    13990.8706ms
etcd-master1.prod-b.openshift.example.com       19:27    274    12668.7780ms

Finally, you can use the --compare option to see when errors happened on the same date across pods. You can combine it with the --date option to narrow issues down to specific hours or minutes.
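
Conceptually, comparing across pods is a set intersection on dates. The sketch below implements the strictest reading of the help text ("dates or times that happen in all pods"); the real script's matching logic may be looser, and the function name and data layout here are assumptions for illustration:

```python
def compare_dates(errors_by_pod):
    """errors_by_pod: {pod: {date: count}}. Return only the dates on
    which every pod logged the error, with per-pod counts."""
    common = set.intersection(*(set(d) for d in errors_by_pod.values()))
    return {
        date: {pod: counts[date] for pod, counts in errors_by_pod.items()}
        for date in sorted(common)
    }

# Counts shaped like the Cluster_2 --ttl output: master1 only logged
# errors on 2025-11-18, so only that date survives the comparison.
errors_by_pod = {
    "etcd-master0-openshift": {"2025-11-17": 10079, "2025-11-18": 7260},
    "etcd-master1-openshift": {"2025-11-18": 7087},
    "etcd-master2-openshift": {"2025-11-17": 646, "2025-11-18": 11931},
}
print(compare_dates(errors_by_pod))
```

Seeing the same minute spike on all pods at once (as in the 00:05-00:07 window below) usually points at a shared dependency such as a common datastore or network, rather than a single bad node.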

Using Compare
etcd-ocp-diag.py --path Cluster_2/must-gather.local --ttl --compare
Date: 2025-11-17
POD                            COUNT
etcd-master0-openshift         10079
etcd-master2-openshift         646

Date: 2025-11-18
POD                            COUNT
etcd-master0-openshift         7260
etcd-master1-openshift         7087
etcd-master2-openshift         11931
etcd-ocp-diag.py --path Cluster_2/must-gather.local --ttl --compare --date 2025-11-18
Date: 00:00
POD                            COUNT      MAX_TIME
etcd-master0-openshift         158        3465.4064ms
etcd-master2-openshift         121        2342.1259ms

Date: 00:01
POD                            COUNT      MAX_TIME
etcd-master0-openshift         308        1403.4716ms
etcd-master2-openshift         213        1148.7342ms

Date: 00:02
POD                            COUNT      MAX_TIME
etcd-master0-openshift         577        1453.5319ms
etcd-master2-openshift         593        1384.3426ms

Date: 00:03
POD                            COUNT      MAX_TIME
etcd-master0-openshift         204        2280.3111ms
etcd-master2-openshift         327        31136.6423ms

Date: 00:04
POD                            COUNT      MAX_TIME
etcd-master0-openshift         73         520.2907ms
etcd-master2-openshift         1154       43226.6834ms

Date: 00:05
POD                            COUNT      MAX_TIME
etcd-master0-openshift         1256       1981.6568ms
etcd-master1-openshift         352        1140.6648ms
etcd-master2-openshift         1245       4330.4708ms

Date: 00:06
POD                            COUNT      MAX_TIME
etcd-master0-openshift         489        1660.3067ms
etcd-master1-openshift         475        1766.2286ms
etcd-master2-openshift         296        1592.4008ms

Date: 00:07
POD                            COUNT      MAX_TIME
etcd-master0-openshift         479        7130.0045ms
etcd-master1-openshift         463        4976.1788ms
etcd-master2-openshift         475
...

Debugging etcd can be complex, and there are often several potential root causes. If you find a customer cluster that is showing signs of etcd performance issues, the best course of action is to open a support ticket to engage with our experts. You can also determine next steps by reviewing our detailed etcd debugging article at https://access.redhat.com/articles/6271341.

Lab Scenarios

Scenario 1

The customer reports cluster performance was bad around midnight from 2025-11-17 into 2025-11-18 on a cluster hosted in VMware on a shared datastore. They had jobs fail and connectivity issues, and are not sure of the cause. Look at the errors and stats for the etcd cluster. We are going to start by determining how often and how severely the apply request took too long errors occurred, and their maximum times. From there we need to determine the impact these extremely high took too long messages are having on the cluster by looking for overloaded messages using the --overloaded flag. After confirming whether the disk is overloaded, we will look to see if there were any leader issues as a result of the disk performance problems.

Scenario 1
etcd-ocp-diag.py --path ./Cluster_2/must-gather.local/ --errors
etcd-ocp-diag.py --path ./Cluster_2/must-gather.local/ --stats
etcd-ocp-diag.py --path ./Cluster_2/must-gather.local/ --ttl --date 2025-11-17
etcd-ocp-diag.py --path ./Cluster_2/must-gather.local/ --ttl --date 2025-11-17 --compare
etcd-ocp-diag.py --path ./Cluster_2/must-gather.local/ --ttl --date 2025-11-18
etcd-ocp-diag.py --path ./Cluster_2/must-gather.local/ --ttl --date 2025-11-18 --compare
etcd-ocp-diag.py --path ./Cluster_2/must-gather.local/ --overloaded --date 2025-11-17
etcd-ocp-diag.py --path ./Cluster_2/must-gather.local/ --overloaded --date 2025-11-18
etcd-ocp-diag.py --path ./Cluster_2/must-gather.local/ --election --date 2025-11-18

Scenario 2

The customer reported that a couple of errors popped up on the cluster, they had a few connection issues, and then everything went back to normal. They need someone to look into the cluster and see what they can find.

Like in the previous scenario, let’s take a look at the stats and errors, and then try to pinpoint when the issue happened and where the customer can investigate next.

Scenario 2
etcd-ocp-diag.py --path ./Cluster_3/must-gather.local/ --errors
etcd-ocp-diag.py --path ./Cluster_3/must-gather.local/ --stats
etcd-ocp-diag.py --path ./Cluster_3/must-gather.local/ --ttl
etcd-ocp-diag.py --path ./Cluster_3/must-gather.local/ --ttl --date 2025-12-03

Scenario 3

This is a tricky situation. Checking the --stats output we can see etcd nodes 2 and 3 are experiencing significant performance issues, but node 1 is running as expected.

In this scenario we will need to look for a range and a pattern to help narrow down why these nodes are having such severe performance issues. Look for a pattern starting on 2025-12-08 and going into 2025-12-09 that will help pinpoint exactly when the issue is occurring and help predict and identify the next occurrence.

Scenario 3
etcd-ocp-diag.py --path ./Cluster_4/must-gather.local/ --errors
etcd-ocp-diag.py --path ./Cluster_4/must-gather.local/ --stats
etcd-ocp-diag.py --path ./Cluster_4/must-gather.local/ --ttl --pod etcd-ocpmstr2.openshift.example.com --date 2025-12-08
etcd-ocp-diag.py --path ./Cluster_4/must-gather.local/ --ttl --pod etcd-ocpmstr2.openshift.example.com --date 2025-12-09