What is overloading my API?
A customer reported an issue where access to the OpenShift API was intermittent and/or slow.
What is hitting my API?
How do we check who and what is hitting the API of an OpenShift cluster, and how often?
In this lab, we are going to explore the OpenShift Cluster Debug Tools, specifically kubectl-dev_tool and its audit subcommand.
In addition to the standard must-gather, there are a number of variants for collecting data for special use cases.
For this exercise, we asked the customer to collect the audit logs from their cluster so we can analyze every request going to the API.
| For this lab, the audit logs were collected with the audit-log variant of must-gather. This produces a local directory containing the audit logs from the cluster's API servers. | 
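If you need to collect this data yourself, the audit-log must-gather is typically gathered with a command along these lines (shown as a sketch; the exact invocation can vary by OpenShift version, so check the documentation for your release):
oc adm must-gather -- /usr/bin/gather_audit_logs
This creates a local must-gather directory containing the audit logs from the cluster's API servers.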
Explore the kubectl-dev_tool audit command
Let's take a look at the kubectl-dev_tool audit subcommand by running it with the -h flag to view the help screen.
On this screen, you will find five examples to give you a sense of how to use the tool.
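If you want to follow along, the help screen can be displayed by running the subcommand with the -h flag:
kubectl-dev_tool audit -h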
Run a command
Let’s run a command against the kube-apiserver and see what kind of information we get back.
cd ~/module5/module5-must-gather-audit-log/data/tmp.nTGcWFOdf5/quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-b9694710390d7e99f1c26959025d6399f6fb5e9809ae49a336a9a6b3dad1b565/
Depending on the amount of data and the complexity of the search, it can take some time to process.
Using the example from the help page, we can run a command like this:
kubectl-dev_tool audit -f 'audit_logs/kube-apiserver' --verb=get --resource='*.*' --resource='-subjectaccessreviews.*' --resource='-tokenreviews.*' | more
In this case, we are looking at individual GET requests, how long they took, what they accessed and who made the request.
23:03:59 [   GET][     1.095ms] [200] /livez                                  [system:anonymous]
23:04:00 [   GET][       352µs] [200] /.well-known/oauth-authorization-server [system:anonymous]
23:04:00 [   GET][       196µs] [200] /.well-known/oauth-authorization-server [system:anonymous]
| The larger and busier the cluster, the shorter the audit log duration will be. If you have a requirement to maintain or potentially review the audit logs, you need to send them to an external logging source for retention. | 
Checking for issues
Now that we know a little bit about the kubectl-dev_tool, let’s look at our audit log must-gather for potential issues.
Start by taking a high-level view. You can be both broad and granular with audit logs, but unless you know exactly what you’re looking for, it’s best to cast a wide net.
Look at the top usage for the common --by= groups, such as resource and user:
kubectl-dev_tool audit -f 'audit_logs/kube-apiserver' --by=resource -otop
kubectl-dev_tool audit -f 'audit_logs/kube-apiserver' --by=user -otop
The first thing you will see is a summary line showing the count (number of records), followed by the time range and duration covered by the audit logs.
count: 2181977, first: 2022-02-07T20:11:40Z, last: 2022-02-08T00:59:59Z, duration: 4h48m18.666203s
After that, you will see the returned data set.
data by resource
464191x              v1/nodes
372952x              v1/configmaps
357233x              v1/pods
99403x               v1/secrets
99280x               authorization.k8s.io/v1/subjectaccessreviews
96385x               coordination.k8s.io/leases
93652x               v1/persistentvolumeclaims
92327x               v1/persistentvolumes
50571x               v1/namespaces
34385x               apps/deployments
data by user
446278x              system:serviceaccount:openshift-example-sysdig-agent:sysdig-agent
76068x               system:apiserver
63661x               system:serviceaccount:openshift-apiserver:openshift-apiserver-sa
39312x               user@identity
36799x               system:serviceaccount:openshift-operators-redhat:elasticsearch-operator
33976x               system:serviceaccount:kube-system:horizontal-pod-autoscaler
33303x               system:serviceaccount:openshift-etcd-operator:etcd-operator
30875x               system:serviceaccount:openshift-operator-lifecycle-manager:olm-operator-serviceaccount
29274x               system:serviceaccount:openshift-cluster-storage-operator:csi-snapshot-controller-operator
28468x               system:serviceaccount:openshift-apiserver-operator:openshift-apiserver-operator
Evaluate and dig deeper
We spotted something suspicious, so let’s drill down a little deeper.
| When evaluating the data, always factor in things like the total number of requests, time period and the number of nodes. | 
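For a rough sense of scale, you can turn the count and duration from the summary line into a sustained request rate. This is only a back-of-the-envelope sketch using the numbers already shown above:
# capture window: 4h48m18s is roughly 17,299 seconds
# sustained rate for the v1/nodes resource over that window
echo "464191 / 17299" | bc -l    # ≈ 26.8 requests per second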
Our top 3 resources from the previous command were nodes, configmaps and pods:
464191x              v1/nodes
372952x              v1/configmaps
357233x              v1/pods
Our top 3 users from the previous command were sysdig-agent, apiserver and openshift-apiserver-sa:
446278x              system:serviceaccount:openshift-example-sysdig-agent:sysdig-agent
76068x               system:apiserver
63661x               system:serviceaccount:openshift-apiserver:openshift-apiserver-sa
One of those sticks out, but let’s first take a look at our top 3 resources. For this we can use the --resource= flag, in addition to --by= and -o top, to drill down on a specific resource.
kubectl-dev_tool audit -f 'audit_logs/kube-apiserver' --resource=configmaps -otop --by=user
kubectl-dev_tool audit -f 'audit_logs/kube-apiserver' --resource=pods -otop --by=user
kubectl-dev_tool audit -f 'audit_logs/kube-apiserver' --resource=nodes -otop --by=user
The data for configmaps and pods looks pretty spread out across a variety of users. There are no obvious bad actors.
data by user for configmaps
16947x               system:serviceaccount:openshift-etcd-operator:etcd-operator
6470x                user@identity
5197x                system:serviceaccount:openshift-kube-controller-manager-operator:kube-controller-manager-operator
4157x                system:serviceaccount:openshift-operators-redhat:elasticsearch-operator
4149x                system:serviceaccount:openshift-kube-apiserver-operator:kube-apiserver-operator
3317x                system:serviceaccount:openshift-logging:cluster-logging-operator
3285x                system:serviceaccount:9b301c-dev:tno
3131x                system:serviceaccount:devops-archeobot:default
3045x                system:kube-scheduler
3016x                system:serviceaccount:openshift-kube-scheduler-operator:openshift-kube-scheduler-operator
data by user for pods
13218x               system:serviceaccount:openshift-multus:multus
8808x                system:node:cluster-app-13.dmz
8763x                system:node:cluster-app-15.dmz
6592x                system:node:cluster-app-16.dmz
4805x                system:serviceaccount:openshift-logging:logcollector
4777x                system:node:cluster-app-07.dmz
4741x                system:node:cluster-app-12.dmz
4644x                system:node:cluster-app-03.dmz
4510x                system:node:cluster-app-31.dmz
4380x                system:node:cluster-app-26.dmz
But for nodes, there is one big outlier in the number of requests being executed. It aligns with the outlier we saw in the previous --by=user output.
data by user for nodes
440987x              system:serviceaccount:openshift-example-sysdig-agent:sysdig-agent
737x                 system:node:cluster-infra-03.dmz
736x                 system:node:cluster-infra-04.dmz
734x                 system:node:cluster-infra-06.dmz
734x                 system:node:cluster-infra-02.dmz
452x                 system:node:cluster-app-26.dmz
449x                 system:node:cluster-app-07.dmz
441x                 system:node:cluster-app-02.dmz
437x                 system:node:cluster-app-41.dmz
437x                 system:node:cluster-app-05.dmz
So let’s take a look at that specific user and see what they are doing. You can do this by passing in the --user= flag along with --by=verb and -o top.
Let’s try to answer the following:
What is the user doing?
What is the problem?
kubectl-dev_tool audit -f 'audit_logs/kube-apiserver' --user=system:serviceaccount:openshift-example-sysdig-agent:sysdig-agent --by=verb -otop
What we see is very interesting:
- The majority are GET requests to the /proxy/metrics endpoint of every node.
- They are all returning HTTP 403 errors.
| An HTTP 403 is a status code meaning access to the requested resource is forbidden. It is returned when the client is not permitted to access the resource despite providing valid credentials, typically due to issues like insufficient permissions for the authenticated account. | 
Top 10 "GET" (of 440076 total hits):
   8313x [   274.335µs] [403-8312] /api/v1/nodes/cluster-app-38.dmz/proxy/metrics    [system:serviceaccount:openshift-example-sysdig-agent:sysdig-agent]
   8309x [   272.092µs] [403-8308] /api/v1/nodes/cluster-app-25.dmz/proxy/metrics    [system:serviceaccount:openshift-example-sysdig-agent:sysdig-agent]
   8308x [   270.327µs] [403-8307] /api/v1/nodes/cluster-app-02.dmz/proxy/metrics    [system:serviceaccount:openshift-example-sysdig-agent:sysdig-agent]
Based on this information, we can conclude that the API was being overloaded with requests from an application called Sysdig, a piece of third-party software used to monitor the OpenShift cluster. The sysdig-agent service account was failing to collect node metrics and, in turn, was spamming the API server.
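If you want to see the individual requests behind those aggregated numbers, you can combine the filters already used in this lab. As a sketch (assuming the same must-gather path as above), something like the following should list the raw GET requests made by that service account:
kubectl-dev_tool audit -f 'audit_logs/kube-apiserver' --user=system:serviceaccount:openshift-example-sysdig-agent:sysdig-agent --verb=get | more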
With this information, you can now have an informed discussion with your customer, provide them with all of the details of the issue, and direct them to open a support case or start a discussion with the third-party vendor to address the issue.
I hope you found this introduction to the kubectl-dev_tool useful and can leverage it the next time you have an issue!
| You don’t need to have an overloaded API or a performance issue to take a look at the audit logs. Something or someone deleted your pod, secret, service or other object? That’s in the audit logs! Use the kubectl-dev_tool audit subcommand to find out who did it and when. |
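For example, as a final sketch reusing only the flags shown earlier in this lab (the verb and resource here are purely illustrative), you could look for who has been deleting secrets with something like:
kubectl-dev_tool audit -f 'audit_logs/kube-apiserver' --verb=delete --resource=secrets --by=user -otop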