What is overloading my API?
A customer reported an issue where the OpenShift API was down: etcd, kube-apiserver and other control plane pods were in CrashLoopBackOff with extremely high memory usage.
What is hitting my API?
How do we check who, what and how often something is hitting the API of an OpenShift cluster?
In this lab we are going to explore the OpenShift Cluster Debug Tools, specifically kubectl-dev_tool and its audit subcommand.
In addition to the standard must-gather, there are a number of additional variants for collecting data for special use cases.
For this issue, we asked the customer to collect the audit logs from their cluster so we can analyze every request going to the API.
|
You can use the following command to collect an audit log must-gather: oc adm must-gather -- /usr/bin/gather_audit_logs This will produce a must-gather style directory that can be zipped and uploaded to a support case. If the API is not available, you can collect the audit logs manually from the nodes with ssh and scp: they are stored under /var/log/kube-apiserver and, if necessary, the other directories such as /var/log/etcd, /var/log/oauth-apiserver, /var/log/openshift-apiserver and /var/log/oauth-server. |
Explore the kubectl-dev_tool audit command
Let’s take a look at the kubectl-dev_tool audit subcommand by running it with the -h flag to view the help screen.
kubectl-dev_tool audit -h
On this screen, you will find some examples showing how to use the tool.
|
When using the -f flag, you can pass a single file, as in the example, or an entire directory: -f 'audit_logs/kube-apiserver' |
Run a command
Let’s run a command against the kube-apiserver and see what kind of information we get back.
cd ~/Module5/module5-must-gather-audit-log/quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-c413593a46a302b8fab6ffa3f2f993b0110c2c84ced3dd399cedb4c66c0c99f7/
Depending on the amount of data and the complexity of the search, it can take some time to process.
Using the example from the help page, we can run a command like this:
kubectl-dev_tool audit -f 'audit_logs/kube-apiserver/cluster-master-02.dmz-audit.log' --verb=get --stage=ResponseComplete --output=top=5 --by=verb
In this case, we are looking at individual GET requests, how long they took, what they accessed and who made the request.
count: 16795, first: 2025-09-08T11:30:32-04:00, last: 2025-09-08T11:34:57-04:00, duration: 4m25.796056s
Top 5 "GET" (of 16795 total hits):
114x [ 1.511842ms] [200-113] /apis/coordination.k8s.io/v1/namespaces/openshift-example-porter-operator/leases/porter-operator [system:serviceaccount:openshift-example-porter-operator:porter-operator-controller-manager]
85x [ 1.662611ms] [200-84] /apis/coordination.k8s.io/v1/namespaces/openshift-example-sysdig-agent/leases/kspm-analyzer-leader-election [system:serviceaccount:openshift-example-sysdig-agent:sysdig-shield-host]
64x [ 1.598375ms] [200-63] /apis/operators.coreos.com/v1/olmconfigs/cluster [system:serviceaccount:openshift-operator-lifecycle-manager:olm-operator-serviceaccount]
57x [ 2.951438ms] [200-56] /api/v1/namespaces/devops-archeobot/configmaps/archeobot [system:serviceaccount:devops-archeobot:default]
57x [ 1.383017ms] [200-56] /apis/coordination.k8s.io/v1/namespaces/openshift-example-integration/leases/strimzi-cluster-operator [system:serviceaccount:openshift-example-integration:strimzi-cluster-operator]
|
The larger and busier the cluster, the shorter the audit log duration will be. If you have a requirement to retain or later review the audit logs, you need to send them to an external logging system for retention. For this lab, each kubectl-dev_tool command takes approximately 50 seconds to run. |
Checking for issues
Now that we know a little bit about the kubectl-dev_tool, let’s look at our audit log must-gather for potential issues.
The first thing you want to do is narrow down and define a time range for the issue. Being as specific as possible will help ensure your results are accurate. As an example, 100,000 requests against a certain API over a 6 hour window may not be an issue. If those 100,000 requests actually occur during a 10 minute period within those 6 hours, coinciding with the issue, they are likely worth looking into.
In our case, the incident took place from:
2025-09-08T14:15:00Z
to
2025-09-08T15:34:57Z
|
There is a helpful flag, -o top (or --output=top), which will count, group and display the top 10 entries for your search. If you want more or fewer entries in the top count, you can use -o top=20 or --output=top=5. |
Start by taking a high level view. You can be both broad and granular with audit logs, but unless you know exactly what you’re looking for, it’s best to cast a wide net.
Look at the top usage for the common --by= groups like resource and user. We can combine that with --after and --before to narrow the results to our time range.
kubectl-dev_tool audit -f 'audit_logs/kube-apiserver' --by=resource -otop --after 2025-09-08T14:15:00Z --before 2025-09-08T15:34:57Z
The first thing you will see is a summary line showing the count (number of records), followed by the time range and duration the audit logs cover.
count: 3994302, first: 2025-09-08T10:15:00-04:00, last: 2025-09-08T11:34:56-04:00, duration: 1h19m56.999222s
After that, you will see the returned data set.
487532x v1/endpoints
442694x v1/configmaps
362910x v1/persistentvolumeclaims
338418x v1/persistentvolumes
328024x v1/secrets
320234x discovery.k8s.io/endpointslices
279757x v1/pods
195895x metrics.k8s.io/pods
181994x coordination.k8s.io/leases
148746x apps/deployments
Similarly, you can run the same command with --by=user, adjusting to -otop=15 to see a larger list of users.
kubectl-dev_tool audit -f 'audit_logs/kube-apiserver' --by=user -otop=15 --after 2025-09-08T14:15:00Z --before 2025-09-08T15:34:57Z
506707x system:serviceaccount:kube-system:horizontal-pod-autoscaler
298071x system:serviceaccount:kube-system:endpointslicemirroring-controller
184530x system:serviceaccount:openshift-example-crunchy:pgo[postgrescluster-controller]
158611x system:serviceaccount:kyverno:kyverno-reports-controller
89842x system:serviceaccount:d6af69-test:ha-postgres-crunchydb-instance
87176x system:serviceaccount:db4642-dev:pay-transparency-dev-crunchy-instance
82224x system:serviceaccount:e52f12-dev:pg-272ed005-crunchy-instance
81991x system:serviceaccount:d6af69-dev:ha-postgres-crunchydb-instance
78016x system:serviceaccount:openshift-infra:serviceaccount-pull-secrets-controller[openshift.io/image-registry-pull-secrets_legacy-token-secrets-controller]
48879x system:serviceaccount:openshift-example-crunchy:pgo
47722x system:serviceaccount:openshift-machine-config-operator:machine-config-operator
46633x system:serviceaccount:c57ee7-dev:dev-crunchy-postgres-instance
41531x system:node:cluster-app-109.dmz
41342x system:serviceaccount:openshift-operator-lifecycle-manager:olm-operator-serviceaccount
40934x system:serviceaccount:openshift-example-integration:strimzi-cluster-operator
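Under the hood, each line in a kube-apiserver audit log is a JSON event, and --by=user is essentially grouping on its user.username field. As a rough sketch only (not a replacement for the tool), a shell function along these lines can approximate that grouping when kubectl-dev_tool is not at hand; the function name and log filename here are illustrative:

```shell
# Crude approximation of `kubectl-dev_tool audit --by=user -otop=N`:
# pull the username field out of each JSON audit event, then count
# occurrences, sort descending and keep the top N. Note this is crude:
# it matches any "username" key, so impersonation entries count too.
top_users() {  # usage: top_users <audit-log-file> <N>
  grep -o '"username":"[^"]*"' "$1" \
    | sed 's/^"username":"//; s/"$//' \
    | sort | uniq -c | sort -rn | head -n "$2"
}
```

For real analysis, prefer kubectl-dev_tool: it understands stages, verbs and time ranges, which a grep pipeline does not.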
Evaluate and dig deeper
Based on the above data, nothing is glaringly wrong with the cluster, but we can see that the API received 3,994,302 resource requests and another 719,823 access review requests (removed via scrubbing) over this 80 minute period. That translates to almost 1,000 API requests per second.
At its peak, the cluster was exceeding 1200 API requests per second.
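The headline numbers can be sanity-checked directly from the summary line: roughly 4.7 million total requests over the 1h19m57s window the logs cover.

```shell
# Back-of-the-envelope API request rate:
# (resource requests + scrubbed access reviews) / window in seconds.
total=$((3994302 + 719823))       # 4,714,125 requests
window=$((1*3600 + 19*60 + 57))   # 1h19m57s = 4797 seconds
echo "$((total / window)) requests/second"   # prints: 982 requests/second
```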
Let’s dig a little deeper.
|
When evaluating the data, always factor in things like the total number of requests, time period and the number of nodes. |
Our top resources from the --by=resource command were endpoints, configmaps, subjectaccessreviews, persistentvolumeclaims, persistentvolumes and secrets.
Our top users from the --by=user command were horizontal-pod-autoscaler, endpointslicemirroring-controller, openshift-apiserver-sa, pgo[postgrescluster-controller] and kyverno-reports-controller.
Again, none of those stick out as glaring issues, but let’s dig into the top resources. For this we can use the --resource= flag, in addition to --by= and -o top, to focus on a specific resource.
kubectl-dev_tool audit -f 'audit_logs/kube-apiserver' --resource=endpoints -otop --by=user --after 2025-09-08T14:15:00Z --before 2025-09-08T15:34:57Z
We can see the endpoint activity is spread out, but the top consumers are crunchy / postgres instances from different namespaces.
count: 487532, first: 2025-09-08T10:15:00-04:00, last: 2025-09-08T11:34:56-04:00, duration: 1h19m56.967674s
89645x system:serviceaccount:d6af69-test:ha-postgres-crunchydb-instance
87151x system:serviceaccount:db4642-dev:pay-transparency-dev-crunchy-instance
82088x system:serviceaccount:e52f12-dev:pg-272ed005-crunchy-instance
81963x system:serviceaccount:d6af69-dev:ha-postgres-crunchydb-instance
38591x system:serviceaccount:c57ee7-dev:dev-crunchy-postgres-instance
18316x system:serviceaccount:kube-system:endpoint-controller
4689x system:serviceaccount:openshift-example-crunchy:pgo[postgrescluster-controller]
1344x system:serviceaccount:openshift-example-integration:strimzi-cluster-operator
1309x system:apiserver
1148x system:serviceaccount:openshift-monitoring:prometheus-k8s
kubectl-dev_tool audit -f 'audit_logs/kube-apiserver' --resource=configmaps --by=user -otop --after 2025-09-08T14:15:00Z --before 2025-09-08T15:34:57Z
We can see the configmap activity is spread out, but crunchy / postgres is there again.
count: 442694, first: 2025-09-08T10:15:00-04:00, last: 2025-09-08T11:34:56-04:00, duration: 1h19m56.999222s
40623x system:serviceaccount:kyverno:kyverno-reports-controller
25472x system:serviceaccount:openshift-example-crunchy:pgo[postgrescluster-controller]
6470x system:node:cluster-app-109.dmz
5870x system:node:cluster-app-104.dmz
5513x system:node:cluster-app-40.dmz
5453x system:node:cluster-app-103.dmz
5423x system:node:cluster-app-101.dmz
5407x system:node:cluster-app-87.dmz
5396x system:node:cluster-app-108.dmz
5389x system:node:cluster-app-102.dmz
kubectl-dev_tool audit -f 'audit_logs/kube-apiserver' --resource='persistentvolumeclaims' -otop --by=user --after 2025-09-08T14:15:00Z --before 2025-09-08T15:34:57Z
We can see a high number of persistentvolumeclaims activity across all nodes in the cluster, as well as from crunchy / postgres.
count: 362910, first: 2025-09-08T10:15:00-04:00, last: 2025-09-08T11:34:56-04:00, duration: 1h19m56.992954s
16115x system:serviceaccount:openshift-example-crunchy:pgo[postgrescluster-controller]
13138x system:node:cluster-app-109.dmz
11969x system:node:cluster-app-104.dmz
11681x system:node:cluster-app-89.dmz
11439x system:node:cluster-app-87.dmz
11247x system:node:cluster-app-108.dmz
10692x system:node:cluster-app-94.dmz
10191x system:node:cluster-app-84.dmz
10187x system:node:cluster-app-92.dmz
10028x system:node:cluster-app-101.dmz
kubectl-dev_tool audit -f 'audit_logs/kube-apiserver' --resource='persistentvolumes' -otop --by=user --after 2025-09-08T14:15:00Z --before 2025-09-08T15:34:57Z
We can see a high number of persistentvolumes activity across all nodes in the cluster.
count: 338418, first: 2025-09-08T10:15:00-04:00, last: 2025-09-08T11:34:56-04:00, duration: 1h19m56.910381s
13109x system:node:cluster-app-109.dmz
11952x system:node:cluster-app-104.dmz
11675x system:node:cluster-app-89.dmz
11429x system:node:cluster-app-87.dmz
11236x system:node:cluster-app-108.dmz
10677x system:node:cluster-app-94.dmz
10190x system:node:cluster-app-84.dmz
10185x system:node:cluster-app-92.dmz
10023x system:node:cluster-app-101.dmz
9551x system:node:cluster-app-86.dmz
kubectl-dev_tool audit -f 'audit_logs/kube-apiserver' --resource='secrets' -otop --by=user --after 2025-09-08T14:15:00Z --before 2025-09-08T15:34:57Z
We can see the secrets activity is spread out but again, we see crunchy high up in the data.
count: 328024, first: 2025-09-08T10:15:00-04:00, last: 2025-09-08T11:34:56-04:00, duration: 1h19m56.89568s
78016x system:serviceaccount:openshift-infra:serviceaccount-pull-secrets-controller[openshift.io/image-registry-pull-secrets_legacy-token-secrets-controller]
46030x system:serviceaccount:openshift-example-crunchy:pgo[postgrescluster-controller]
12091x system:serviceaccount:openshift-authentication-operator:authentication-operator
10523x system:serviceaccount:openshift-apiserver-operator:openshift-apiserver-operator
5428x system:node:cluster-app-109.dmz
5376x system:serviceaccount:openshift-example-integration:strimzi-cluster-operator
4701x system:serviceaccount:openshift-example-crunchy:pgo
4567x system:serviceaccount:e69aae-test:postgresql-operator-manager
4358x system:node:cluster-app-104.dmz
4350x system:node:cluster-app-103.dmz
Evaluating the output, the most common theme across all of the data points has been activity from crunchy / postgres. If you refer back to the original --by=user output, there was no obvious single user, but 7 of the top 15 users driving API load were crunchy / postgres related. If we add them all up, a conservative estimate, without secondary and follow-on factors, is 621,275 of 3,994,302 requests, totaling 15.55% of API traffic.
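The 621,275 figure is simply the sum of the seven crunchy / postgres related entries from the earlier --by=user output, and the percentage follows directly:

```shell
# Sum the seven crunchy / postgres service account counts from the
# --by=user top-15 list, as a share of all 3,994,302 resource requests.
sum=$((184530 + 89842 + 87176 + 82224 + 81991 + 48879 + 46633))
awk -v s="$sum" 'BEGIN { printf "%d requests = %.2f%% of API traffic\n", s, 100 * s / 3994302 }'
```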
Given that crunchy / postgres instances are databases, we can hypothesize that the high number of requests against PersistentVolumes and PersistentVolumeClaims is also related.
Let’s take a look at one of the users and see what they are doing. You can do this by passing in the --user= flag along with --by=verb and -o top.
Let’s try to answer the following:
- What is the user doing?
- What is the problem?
kubectl-dev_tool audit -f 'audit_logs/kube-apiserver' --user='system:serviceaccount:d6af69-test:ha-postgres-crunchydb-instance' --by=verb -otop --after 2025-09-08T14:15:00Z --before 2025-09-08T15:34:57Z
Top 10 "PATCH" (of 90224 total hits):
89454x [ 44.05581ms] [200-89453] /api/v1/namespaces/d6af69-test/endpoints/ha-postgres-crunchydb-ha [system:serviceaccount:d6af69-test:ha-postgres-crunchydb-instance]
What we see is that the majority of activity is PATCH requests against a single endpoints object.
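To put that single object in perspective: assuming the same roughly 80 minute (4,797 second) window as the summary lines above, 89,454 PATCH requests against one endpoints object from one service account works out to a sustained write rate of over 18 per second.

```shell
# Sustained PATCH rate from a single service account on a single
# endpoints object, assuming the ~4797 second audit window.
awk 'BEGIN { printf "%.1f PATCH requests/second\n", 89454 / 4797 }'
```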
Let’s take a look at our most active node:
kubectl-dev_tool audit -f 'audit_logs/kube-apiserver' --user='system:node:cluster-app-109.dmz' --by=verb -otop --after 2025-09-08T14:15:00Z --before 2025-09-08T15:34:57Z
Top 10 "GET" (of 27649 total hits):
898x [ 1.439828ms] [200-897] /api/v1/persistentvolumes/pvc-a906c87c-ffe2-4e86-99b0-d8bb0aaaf676 [system:node:cluster-app-109.dmz]
898x [ 1.728236ms] [200-897] /api/v1/namespaces/ea8776-dev/persistentvolumeclaims/postgresql-mycool-app-patroni-1 [system:node:cluster-app-109.dmz]
554x [ 1.972292ms] [200-553] /api/v1/persistentvolumes/pvc-90a35859-8d42-4986-8ee0-b8c2a1e6d23d [system:node:cluster-app-109.dmz]
554x [ 1.469763ms] [200-553] /api/v1/namespaces/32d13a-test/persistentvolumeclaims/pg-1573229e-crunchy-db-b9v6-pgwal [system:node:cluster-app-109.dmz]
554x [ 2.180034ms] [200-553] /api/v1/persistentvolumes/pvc-a6736361-91e2-4824-a2da-b7d17b804837 [system:node:cluster-app-109.dmz]
Top 10 "WATCH" (of 11163 total hits):
170x [6m47.647582582s] [200-169] /api/v1/namespaces/0fad32-tools/secrets?allowWatchBookmarks=true&fieldSelector=metadata.name%3Ddefault-dockercfg-d28lr&resourceVersion=15895096121&timeout=7m6s&timeoutSeconds=426&watch=true [system:node:cluster-app-109.dmz]
164x [5m25.153896615s] [200-163] /api/v1/namespaces/a12c97-dev/secrets?allowWatchBookmarks=true&fieldSelector=metadata.name%3Dchefs-encryption-keys&resourceVersion=15895174082&timeout=6m54s&timeoutSeconds=414&watch=true [system:node:cluster-app-109.dmz]
139x [5m28.029877906s] [200-138] /api/v1/namespaces/a12c97-dev/configmaps?allowWatchBookmarks=true&fieldSelector=metadata.name%3Dchefs-files-config&resourceVersion=15895168736&timeout=9m38s&timeoutSeconds=578&watch=true [system:node:cluster-app-109.dmz]
103x [6m33.862201844s] [200-102] /api/v1/namespaces/4c2ba9-dev/secrets?allowWatchBookmarks=true&fieldSelector=metadata.name%3Dcrunchy-postgres-replication-cert&resourceVersion=15895102534&timeout=9m8s&timeoutSeconds=548&watch=true [system:node:cluster-app-109.dmz]
98x [7m22.541479775s] [200-97] /api/v1/namespaces/0fad32-tools/configmaps?allowWatchBookmarks=true&fieldSelector=metadata.name%3Dcas-shipit-db-cas-postgres-cluster-pgbackrest&resourceVersion=15895104263&timeout=6m58s&timeoutSeconds=418&watch=true [system:node:cluster-app-109.dmz]
What we see is that the majority of activity is GET and WATCH requests on PersistentVolumes, PersistentVolumeClaims and other resources across dozens of different namespaces.
You could continue to analyze more resources and users, but at this point, based on the information reviewed, we were able to form an informed theory:
- Even though the API pool members are flapping up and down, millions of requests are still making it to the API server.
- We see 15%+ of our API traffic coming from crunchy / postgres related pods.
- We see another 17%+ of our API traffic targeting PersistentVolumes and PersistentVolumeClaims.
The conclusion was that the Crunchy Postgres Operator was behaving abnormally due to the API flapping. This led to an unexpected surge of traffic trying to GET, LIST and UPDATE its owned resources.
With this information, we were able to have an informed discussion with the customer, who took the data to the third-party vendor and got the issue addressed to avoid such an API storm in the future.
How did we recover the cluster?
With no way of shutting down workloads or disabling the Operator, we had to get creative to stabilize the control plane.
We applied two iptables rules on all of the worker nodes in the cluster to drop traffic to port 6443 on both the FORWARD and OUTPUT chains.
iptables -I FORWARD -p tcp --dport 6443 -j DROP
iptables -I OUTPUT -p tcp --dport 6443 -j DROP
This blocked both the kubelet and running pods from sending traffic across the SDN to the API, allowing the API to stabilize and recover.
Once the API was healthy, we removed the iptables rules in batches of 5 nodes, allowing the nodes to rejoin and workloads to fully recover.
I hope you found this introduction to the kubectl-dev_tool useful and can leverage it the next time you have an issue!
|
You don’t need to have an overloaded API or a performance issue to take a look at the audit logs. The audit logs and kubectl-dev_tool are equally useful if you want to understand who or what did something in your cluster. Something or someone deleted your pod, secret, service or other object? That’s in the audit logs! Use the kubectl-dev_tool to find out who did it and when! |