What is overloading my API?

A customer reported an issue where the OpenShift API was down and etcd, kube-apiserver and other control plane pods were in CrashLoopBackOff with extremely high memory usage.

The customer provided the following information:

The hardware of our 3 control plane nodes has reached end of life and we will be replacing them with new hardware.

OpenShift 4.16.36

I removed the first old master node and the API became inaccessible. I believe our cluster is having issues running with just two masters.

What is hitting my API?

How do we check who, what and how often something is hitting the API of an OpenShift cluster?

In this lab we are going to explore the usage of the OpenShift Cluster Debug Tools, specifically kubectl-dev_tool and the kubectl-dev_tool audit subcommand.

In addition to the standard must-gather, there are a number of additional variants for collecting data for special use cases.

For this issue, we asked the customer to collect the audit logs from their cluster so we can analyze every request going to the API.

You can use the following command to collect an audit log must-gather:

oc adm must-gather -- /usr/bin/gather_audit_logs

This will produce a must-gather style directory that can be zipped and uploaded to a support case.

If the API is not available, you can collect the audit logs manually from /var/log/ on each control plane node using ssh and scp.

Start with /var/log/kube-apiserver and, if necessary, the other directories such as etcd, oauth-apiserver, openshift-apiserver and oauth-server.
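Each line in those log files is a single JSON audit event in the standard audit.k8s.io/v1 format. A minimal Python sketch of what one event looks like when parsed (the sample event below is hypothetical, not taken from this cluster):

```python
import json

# A hypothetical kube-apiserver audit event in the standard
# audit.k8s.io/v1 JSON-lines format (one event per line):
sample = ('{"kind":"Event","apiVersion":"audit.k8s.io/v1",'
          '"verb":"get","stage":"ResponseComplete",'
          '"requestURI":"/api/v1/namespaces/demo/configmaps/app",'
          '"user":{"username":"system:serviceaccount:demo:default"},'
          '"responseStatus":{"code":200}}')

event = json.loads(sample)
print(event["verb"], event["user"]["username"], event["requestURI"])
```

These are the same fields (verb, user, request URI, response code, stage) that kubectl-dev_tool filters and groups on.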

Explore the kubectl-dev_tool audit command

Let’s take a look at the kubectl-dev_tool audit subcommand by running it with the -h flag to see the help screen.

kubectl-dev_tool audit -h

On this screen, you will find some examples to give you some understanding of how to use the tool.

When using the -f flag, you can pass a single file, as in the example, or the entire directory.

-f 'audit_logs/kube-apiserver'

Run a command

Let’s run a command against the kube-apiserver and see what kind of information we get back.

cd ~/Module5/module5-must-gather-audit-log/quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-c413593a46a302b8fab6ffa3f2f993b0110c2c84ced3dd399cedb4c66c0c99f7/

Depending on the amount of data and the complexity of the search, it can take some time to process.

Using the example from the help page, we can run a command like this:

kubectl-dev_tool audit -f 'audit_logs/kube-apiserver/cluster-master-02.dmz-audit.log' --verb=get --stage=ResponseComplete --output=top=5 --by=verb

In this case, we are looking at individual GET requests, how long they took, what they accessed and who made the request.

count: 16795, first: 2025-09-08T11:30:32-04:00, last: 2025-09-08T11:34:57-04:00, duration: 4m25.796056s

Top 5 "GET" (of 16795 total hits):
    114x [  1.511842ms] [200-113] /apis/coordination.k8s.io/v1/namespaces/openshift-example-porter-operator/leases/porter-operator            [system:serviceaccount:openshift-example-porter-operator:porter-operator-controller-manager]
     85x [  1.662611ms] [200-84]  /apis/coordination.k8s.io/v1/namespaces/openshift-example-sysdig-agent/leases/kspm-analyzer-leader-election [system:serviceaccount:openshift-example-sysdig-agent:sysdig-shield-host]
     64x [  1.598375ms] [200-63]  /apis/operators.coreos.com/v1/olmconfigs/cluster                                                            [system:serviceaccount:openshift-operator-lifecycle-manager:olm-operator-serviceaccount]
     57x [  2.951438ms] [200-56]  /api/v1/namespaces/devops-archeobot/configmaps/archeobot                                                    [system:serviceaccount:devops-archeobot:default]
     57x [  1.383017ms] [200-56]  /apis/coordination.k8s.io/v1/namespaces/openshift-example-integration/leases/strimzi-cluster-operator       [system:serviceaccount:openshift-example-integration:strimzi-cluster-operator]

The larger and busier the cluster, the shorter the time window the audit logs will cover, since the logs rotate more quickly.

If you have a requirement to retain or later review the audit logs, you need to forward them to an external logging system.

For this lab, each kubectl-dev_tool command takes approximately 50 seconds to run.

Checking for issues

Now that we know a little bit about the kubectl-dev_tool, let’s look at our audit log must-gather for potential issues.

The first thing you want to do is narrow down and define a time range for the issue. Being as specific as possible will help ensure your results are accurate. For example, 100,000 requests against a certain API over a 6 hour window may not be an issue. If those same 100,000 requests actually occur during a 10 minute period within those 6 hours, right when the issue occurs, they are likely worth looking into.
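To make the point concrete, a quick back-of-the-envelope sketch using the hypothetical numbers from the paragraph above:

```python
# The same 100,000 requests look very different depending on the
# window they land in.
requests = 100_000

six_hours = 6 * 60 * 60   # seconds
ten_minutes = 10 * 60

spread_rate = requests / six_hours    # spread evenly: likely background noise
burst_rate = requests / ten_minutes   # concentrated burst: worth investigating

print(f"over 6h: {spread_rate:.1f} req/s, over 10m: {burst_rate:.1f} req/s")
```

Roughly 5 requests per second is unremarkable; over 160 per second from one source during the incident window is a lead.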

In our case, the incident took place from:

2025-09-08T14:15:00Z
to
2025-09-08T15:34:57Z

There is a helpful flag, -o top or --output=top, which will count, group and display the top 10 entries for your search.

If you want more or fewer entries in the top count, you can use -o top=20 or --output=top=5.

Start by taking a high level view. You can be both broad and granular with audit logs, but unless you know exactly what you’re looking for, it’s best to cast a wide net.

Look at the top usage for the common --by= groups like resource and user. We can combine that with --after and --before to narrow the results to our time range.

kubectl-dev_tool audit -f 'audit_logs/kube-apiserver' --by=resource -otop --after 2025-09-08T14:15:00Z --before 2025-09-08T15:34:57Z

The first thing you will see is a summary line showing the count (number of records), followed by the time range and duration the audit logs cover.

count: 3994302, first: 2025-09-08T10:15:00-04:00, last: 2025-09-08T11:34:56-04:00, duration: 1h19m56.999222s

After that, you will see the returned data set.

487532x              v1/endpoints
442694x              v1/configmaps
362910x              v1/persistentvolumeclaims
338418x              v1/persistentvolumes
328024x              v1/secrets
320234x              discovery.k8s.io/endpointslices
279757x              v1/pods
195895x              metrics.k8s.io/pods
181994x              coordination.k8s.io/leases
148746x              apps/deployments
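Conceptually, the --by= / -o top grouping is just a counter over the parsed events. A minimal sketch of the idea with hypothetical events, not the tool’s actual implementation:

```python
from collections import Counter

# Hypothetical parsed audit events, reduced to the field we group on
events = [
    {"resource": "v1/endpoints"},
    {"resource": "v1/configmaps"},
    {"resource": "v1/endpoints"},
    {"resource": "v1/secrets"},
    {"resource": "v1/endpoints"},
]

counts = Counter(e["resource"] for e in events)
for resource, n in counts.most_common(3):
    print(f"{n}x  {resource}")
```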

Similarly, you can run the same command using --by=user, adjusting to -otop=15 to see a larger list of users.

kubectl-dev_tool audit -f 'audit_logs/kube-apiserver' --by=user -otop=15 --after 2025-09-08T14:15:00Z --before 2025-09-08T15:34:57Z
506707x              system:serviceaccount:kube-system:horizontal-pod-autoscaler
298071x              system:serviceaccount:kube-system:endpointslicemirroring-controller
184530x              system:serviceaccount:openshift-example-crunchy:pgo[postgrescluster-controller]
158611x              system:serviceaccount:kyverno:kyverno-reports-controller
89842x               system:serviceaccount:d6af69-test:ha-postgres-crunchydb-instance
87176x               system:serviceaccount:db4642-dev:pay-transparency-dev-crunchy-instance
82224x               system:serviceaccount:e52f12-dev:pg-272ed005-crunchy-instance
81991x               system:serviceaccount:d6af69-dev:ha-postgres-crunchydb-instance
78016x               system:serviceaccount:openshift-infra:serviceaccount-pull-secrets-controller[openshift.io/image-registry-pull-secrets_legacy-token-secrets-controller]
48879x               system:serviceaccount:openshift-example-crunchy:pgo
47722x               system:serviceaccount:openshift-machine-config-operator:machine-config-operator
46633x               system:serviceaccount:c57ee7-dev:dev-crunchy-postgres-instance
41531x               system:node:cluster-app-109.dmz
41342x               system:serviceaccount:openshift-operator-lifecycle-manager:olm-operator-serviceaccount
40934x               system:serviceaccount:openshift-example-integration:strimzi-cluster-operator

Evaluate and dig deeper

Based on the above data, there is nothing glaringly obvious that is wrong with the cluster, but we can see that the API received 3,994,302 resource requests and another 719,823 access review requests (removed via scrubbing) over this 80 minute period. That translates to almost 1000 API requests per second.
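The requests-per-second figure follows directly from those counts:

```python
resource_requests = 3_994_302  # from the count line above
access_reviews = 719_823       # removed via scrubbing, per the text
window_seconds = 80 * 60       # the ~80 minute window

rate = (resource_requests + access_reviews) / window_seconds
print(f"{rate:.0f} requests/second")  # just under 1,000
```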

At its peak, the cluster was exceeding 1200 API requests per second.

Let’s dig a little deeper.

When evaluating the data, always factor in things like the total number of requests, time period and the number of nodes.

Our top resources from the --by=resource command were endpoints, configmaps, subjectaccessreviews, persistentvolumeclaims, persistentvolumes and secrets.

Our top users from the --by=user command were horizontal-pod-autoscaler, endpointslicemirroring-controller, openshift-apiserver-sa, pgo[postgrescluster-controller] and kyverno-reports-controller.

Again, none of those stick out as glaring issues, but let’s dig into the top resources. For this we can use the --resource= flag, in addition to --by= and -o top, to focus on a specific resource.

kubectl-dev_tool audit -f 'audit_logs/kube-apiserver' --resource=endpoints -otop --by=user --after 2025-09-08T14:15:00Z --before 2025-09-08T15:34:57Z

We can see the endpoint activity is spread out, but the top consumers are crunchy / postgres instances from different namespaces.

count: 487532, first: 2025-09-08T10:15:00-04:00, last: 2025-09-08T11:34:56-04:00, duration: 1h19m56.967674s
89645x               system:serviceaccount:d6af69-test:ha-postgres-crunchydb-instance
87151x               system:serviceaccount:db4642-dev:pay-transparency-dev-crunchy-instance
82088x               system:serviceaccount:e52f12-dev:pg-272ed005-crunchy-instance
81963x               system:serviceaccount:d6af69-dev:ha-postgres-crunchydb-instance
38591x               system:serviceaccount:c57ee7-dev:dev-crunchy-postgres-instance
18316x               system:serviceaccount:kube-system:endpoint-controller
4689x                system:serviceaccount:openshift-example-crunchy:pgo[postgrescluster-controller]
1344x                system:serviceaccount:openshift-example-integration:strimzi-cluster-operator
1309x                system:apiserver
1148x                system:serviceaccount:openshift-monitoring:prometheus-k8s
kubectl-dev_tool audit -f 'audit_logs/kube-apiserver' --resource=configmaps --by=user -otop --after 2025-09-08T14:15:00Z --before 2025-09-08T15:34:57Z

We can see the configmap activity is spread out, but crunchy / postgres is there again.

count: 442694, first: 2025-09-08T10:15:00-04:00, last: 2025-09-08T11:34:56-04:00, duration: 1h19m56.999222s
40623x               system:serviceaccount:kyverno:kyverno-reports-controller
25472x               system:serviceaccount:openshift-example-crunchy:pgo[postgrescluster-controller]
6470x                system:node:cluster-app-109.dmz
5870x                system:node:cluster-app-104.dmz
5513x                system:node:cluster-app-40.dmz
5453x                system:node:cluster-app-103.dmz
5423x                system:node:cluster-app-101.dmz
5407x                system:node:cluster-app-87.dmz
5396x                system:node:cluster-app-108.dmz
5389x                system:node:cluster-app-102.dmz
kubectl-dev_tool audit -f 'audit_logs/kube-apiserver' --resource='persistentvolumeclaims' -otop --by=user --after 2025-09-08T14:15:00Z --before 2025-09-08T15:34:57Z

We can see a high volume of persistentvolumeclaim activity across all nodes in the cluster, as well as from crunchy / postgres.

count: 362910, first: 2025-09-08T10:15:00-04:00, last: 2025-09-08T11:34:56-04:00, duration: 1h19m56.992954s
16115x               system:serviceaccount:openshift-example-crunchy:pgo[postgrescluster-controller]
13138x               system:node:cluster-app-109.dmz
11969x               system:node:cluster-app-104.dmz
11681x               system:node:cluster-app-89.dmz
11439x               system:node:cluster-app-87.dmz
11247x               system:node:cluster-app-108.dmz
10692x               system:node:cluster-app-94.dmz
10191x               system:node:cluster-app-84.dmz
10187x               system:node:cluster-app-92.dmz
10028x               system:node:cluster-app-101.dmz
kubectl-dev_tool audit -f 'audit_logs/kube-apiserver' --resource='persistentvolumes' -otop --by=user --after 2025-09-08T14:15:00Z --before 2025-09-08T15:34:57Z

We can see a high volume of persistentvolume activity across all nodes in the cluster.

count: 338418, first: 2025-09-08T10:15:00-04:00, last: 2025-09-08T11:34:56-04:00, duration: 1h19m56.910381s
13109x               system:node:cluster-app-109.dmz
11952x               system:node:cluster-app-104.dmz
11675x               system:node:cluster-app-89.dmz
11429x               system:node:cluster-app-87.dmz
11236x               system:node:cluster-app-108.dmz
10677x               system:node:cluster-app-94.dmz
10190x               system:node:cluster-app-84.dmz
10185x               system:node:cluster-app-92.dmz
10023x               system:node:cluster-app-101.dmz
9551x                system:node:cluster-app-86.dmz
kubectl-dev_tool audit -f 'audit_logs/kube-apiserver' --resource='secrets' -otop --by=user --after 2025-09-08T14:15:00Z --before 2025-09-08T15:34:57Z

We can see the secrets activity is spread out but again, we see crunchy high up in the data.

count: 328024, first: 2025-09-08T10:15:00-04:00, last: 2025-09-08T11:34:56-04:00, duration: 1h19m56.89568s
78016x               system:serviceaccount:openshift-infra:serviceaccount-pull-secrets-controller[openshift.io/image-registry-pull-secrets_legacy-token-secrets-controller]
46030x               system:serviceaccount:openshift-example-crunchy:pgo[postgrescluster-controller]
12091x               system:serviceaccount:openshift-authentication-operator:authentication-operator
10523x               system:serviceaccount:openshift-apiserver-operator:openshift-apiserver-operator
5428x                system:node:cluster-app-109.dmz
5376x                system:serviceaccount:openshift-example-integration:strimzi-cluster-operator
4701x                system:serviceaccount:openshift-example-crunchy:pgo
4567x                system:serviceaccount:e69aae-test:postgresql-operator-manager
4358x                system:node:cluster-app-104.dmz
4350x                system:node:cluster-app-103.dmz

Evaluating the output, the most common theme across all of the data points has been activity from crunchy / postgres. If you refer back to the original --by=user output, there was no obvious single user, but 7 of the top 15 users driving API load were crunchy / postgres related. If we add them all up, a conservative estimate, without secondary and follow-on factors, is 621,275 of 3,994,302 requests, or 15.55% of API traffic.
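That share is a straightforward check against the counts:

```python
crunchy_requests = 621_275   # sum of the crunchy / postgres users above
total_requests = 3_994_302   # total resource requests in the window

share = crunchy_requests / total_requests * 100
print(f"{share:.2f}% of API traffic")  # matches the 15.55% figure
```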

Given that crunchy / postgres is a database, we can hypothesize that the high number of requests against PersistentVolumes and PersistentVolumeClaims is also related.

Let’s take a look at one of the users and see what they are doing. You can do this by passing in the --user= flag along with --by=verb and -o top.

Let’s try to answer the following:

  1. What is the user doing?

  2. What is the problem?

kubectl-dev_tool audit -f 'audit_logs/kube-apiserver' --user='system:serviceaccount:d6af69-test:ha-postgres-crunchydb-instance' --by=verb -otop --after 2025-09-08T14:15:00Z --before 2025-09-08T15:34:57Z
Top 10 "PATCH" (of 90224 total hits):
  89454x [  44.05581ms] [200-89453] /api/v1/namespaces/d6af69-test/endpoints/ha-postgres-crunchydb-ha                                 [system:serviceaccount:d6af69-test:ha-postgres-crunchydb-instance]

What we see is that the majority of activity is from PATCH commands on the endpoints objects.
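A quick rate check puts that single user’s churn in perspective (assuming the same ~80 minute window):

```python
patches = 89_454          # PATCH hits on one endpoints object, from the output above
window_seconds = 80 * 60  # the ~80 minute incident window

rate = patches / window_seconds
print(f"{rate:.1f} PATCH requests/second against a single object")
```

Nearly 19 PATCHes per second against one endpoints object, from one service account, is sustained churn rather than normal reconciliation.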

Let’s take a look at our most active node:

kubectl-dev_tool audit -f 'audit_logs/kube-apiserver' --user='system:node:cluster-app-109.dmz' --by=verb -otop --after 2025-09-08T14:15:00Z --before 2025-09-08T15:34:57Z
Top 10 "GET" (of 27649 total hits):
    898x [  1.439828ms] [200-897] /api/v1/persistentvolumes/pvc-a906c87c-ffe2-4e86-99b0-d8bb0aaaf676                       [system:node:cluster-app-109.dmz]
    898x [  1.728236ms] [200-897] /api/v1/namespaces/ea8776-dev/persistentvolumeclaims/postgresql-mycool-app-patroni-1     [system:node:cluster-app-109.dmz]
    554x [  1.972292ms] [200-553] /api/v1/persistentvolumes/pvc-90a35859-8d42-4986-8ee0-b8c2a1e6d23d                       [system:node:cluster-app-109.dmz]
    554x [  1.469763ms] [200-553] /api/v1/namespaces/32d13a-test/persistentvolumeclaims/pg-1573229e-crunchy-db-b9v6-pgwal  [system:node:cluster-app-109.dmz]
    554x [  2.180034ms] [200-553] /api/v1/persistentvolumes/pvc-a6736361-91e2-4824-a2da-b7d17b804837                       [system:node:cluster-app-109.dmz]
Top 10 "WATCH" (of 11163 total hits):
    170x [6m47.647582582s] [200-169] /api/v1/namespaces/0fad32-tools/secrets?allowWatchBookmarks=true&fieldSelector=metadata.name%3Ddefault-dockercfg-d28lr&resourceVersion=15895096121&timeout=7m6s&timeoutSeconds=426&watch=true                           [system:node:cluster-app-109.dmz]
    164x [5m25.153896615s] [200-163] /api/v1/namespaces/a12c97-dev/secrets?allowWatchBookmarks=true&fieldSelector=metadata.name%3Dchefs-encryption-keys&resourceVersion=15895174082&timeout=6m54s&timeoutSeconds=414&watch=true                              [system:node:cluster-app-109.dmz]
    139x [5m28.029877906s] [200-138] /api/v1/namespaces/a12c97-dev/configmaps?allowWatchBookmarks=true&fieldSelector=metadata.name%3Dchefs-files-config&resourceVersion=15895168736&timeout=9m38s&timeoutSeconds=578&watch=true                              [system:node:cluster-app-109.dmz]
    103x [6m33.862201844s] [200-102] /api/v1/namespaces/4c2ba9-dev/secrets?allowWatchBookmarks=true&fieldSelector=metadata.name%3Dcrunchy-postgres-replication-cert&resourceVersion=15895102534&timeout=9m8s&timeoutSeconds=548&watch=true                   [system:node:cluster-app-109.dmz]
     98x [7m22.541479775s] [200-97]  /api/v1/namespaces/0fad32-tools/configmaps?allowWatchBookmarks=true&fieldSelector=metadata.name%3Dcas-shipit-db-cas-postgres-cluster-pgbackrest&resourceVersion=15895104263&timeout=6m58s&timeoutSeconds=418&watch=true [system:node:cluster-app-109.dmz]

What we see is that the majority of activity is from GET and WATCH commands on PersistentVolumes and PersistentVolumeClaims and other resources across dozens of different Namespaces.

You could continue to analyze more resources and users, but at this point, based on the information reviewed, we were able to form an informed theory.

  1. Even though the API pool members are flapping up and down, millions of requests are still making it to the API server.

  2. We see 15%+ of our API traffic coming from crunchy / postgres related pods.

  3. We see another 17%+ of our API traffic coming from PersistentVolumes and PersistentVolumeClaims.

The conclusion was that the Crunchy Postgres Operator was behaving abnormally due to the API flapping. This led to an unexpected surge of traffic trying to GET, LIST and UPDATE its owned resources.

With this information, we were able to have an informed discussion with the customer, and they were able to take the data to the 3rd party vendor and get the issue addressed to avoid such an API storm in the future.

How did we recover the cluster?

With no way of shutting down workloads or disabling the Operator, we had to get creative to stabilize the control plane.

We applied 2 iptables rules on all of the worker nodes in the cluster to drop traffic to port 6443 on both the FORWARD and OUTPUT chains.

iptables -I FORWARD -p tcp --dport 6443 -j DROP
iptables -I OUTPUT -p tcp --dport 6443 -j DROP

This blocked both the kubelet and running Pods from sending traffic across the SDN to the API, which allowed the API to stabilize and recover.

Once the API was healthy, we removed the iptables rules in blocks of 5 nodes, allowing them to come back and workloads to fully recover.
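For reference, a rule inserted with -I is removed by repeating the same match with -D; something like:

```shell
# Delete the DROP rules added earlier, once the control plane is healthy
iptables -D FORWARD -p tcp --dport 6443 -j DROP
iptables -D OUTPUT -p tcp --dport 6443 -j DROP
```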

I hope you found this introduction to the kubectl-dev_tool useful and can leverage it the next time you have an issue!

You don’t need to have an overloaded API or a performance issue to take a look at the audit logs.

The audit logs and kubectl-dev_tool are equally useful if you want to understand who or what did something in your cluster.

Something or someone deleted your pod, secret, service or other object? That’s in the audit logs! Use the kubectl-dev_tool to find out who did it and when!