Dynamic Rescheduling of Virtual Workloads
Introduction
In Kubernetes, on which OpenShift is based, the default behavior when a node comes under pressure from a lack of physical resources (CPU/memory) is for the kubelet to kill running workloads to free up those resources. The kube-descheduler offers a more graceful alternative: it proactively evicts pods from overutilized nodes so they can be rescheduled on other nodes with more free resources.
In the case of a containerized application with many replicas this often isn't a problem, but for a virtual machine, which is a stateful workload, this behavior can lead to serious issues.
With OpenShift Virtualization, the eviction process is handled gracefully through the virt-api-validator validating webhook. This webhook intercepts all eviction requests and decides how to treat each one. If an eviction request is received for a virtual machine that is able to live migrate, that less-destructive option is taken instead.
For more information on how this works, please see the following link: How Evictions are Handled in OpenShift Virtualization
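As a quick check, you can confirm whether a given VM will live migrate on eviction by inspecting its eviction strategy. A minimal sketch, assuming the strategy is set on the VM spec (the field may be empty if the VM inherits a cluster-wide default):

# Print the VM's eviction strategy; LiveMigrate means eviction
# requests are turned into live migrations instead of shutdowns
oc get vm dynamic-schedule-vm-1 -n dynamic-schedule -o jsonpath='{.spec.template.spec.evictionStrategy}{"\n"}'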
You can benefit from descheduling running pods in situations such as the following:
- Nodes are underutilized or overutilized.
- Pod and node affinity requirements, such as taints or labels, have changed and the original scheduling decisions are no longer appropriate for certain nodes.
- Node failure requires pods to be moved.
- New nodes are added to clusters.
- Pods have been restarted too many times.
In this module, you will:
- Understand the use of the kube-descheduler with the KubeVirtRelieveAndMigrate profile to keep virtual workloads alive.
- Discover how to edit the configuration of the kube-descheduler to change its functionality and its eviction interval.
Prerequisites
- Descheduler Operator is installed
- The KubeVirtRelieveAndMigrate profile is configured
- PSI metrics are enabled on all worker nodes
For your convenience, all of the requirements are already enabled in the lab.
| PSI (Pressure Stall Information) is a Linux kernel feature that measures and reports the time tasks spend waiting for CPU, memory, or I/O resources, providing visibility into resource contention and saturation. It quantifies how much time processes are stalled waiting for resources, helping identify performance bottlenecks by showing what percentage of time your system is under pressure from insufficient resources. |
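If you want to see raw PSI data for yourself, you can read it directly from a node. A minimal sketch, assuming cluster-admin access for oc debug; the node name is one from this lab's cluster:

# Open a debug shell on a node and read its memory pressure counters
oc debug node/worker-cluster-jlkcx-1 -- chroot /host cat /proc/pressure/memory

# Illustrative output format (values will differ):
#   some avg10=0.00 avg60=0.00 avg300=0.00 total=0   <- time at least one task was stalled
#   full avg10=0.00 avg60=0.00 avg300=0.00 total=0   <- time all non-idle tasks were stalled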
Accessing the OpenShift Cluster
Your OpenShift cluster console is available {openshift_cluster_console_url}[here^].
You can log in to the console with:
- User: {openshift_cluster_admin_username}
- Password: {openshift_cluster_admin_password}
You can log in to your OpenShift cluster in the provided terminal by copying and pasting the following command:
oc login -u {openshift_cluster_admin_username} -p {openshift_cluster_admin_password} --server={openshift_api_server_url}
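To confirm the login succeeded, you can check which user you are logged in as and which API server you are talking to:

oc whoami
oc whoami --show-server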
Demonstrating Dynamic Scheduling
| The following module contains advanced actions performed in the OpenShift CLI. Please ensure you are logged in to your cluster in the embedded terminal to the right. |
- Let's begin this module by starting the dynamic-schedule-vm-1 virtual machine found in the dynamic-schedule namespace. This can be done either through the OpenShift console, or through the CLI in the embedded terminal with the following command:

virtctl start dynamic-schedule-vm-1 -n dynamic-schedule

VM dynamic-schedule-vm-1 was scheduled to start
- Check the name of the virtual machine that is running in the dynamic-schedule namespace with the following command:

oc get vmi -n dynamic-schedule

NAME                    AGE     PHASE     IP             NODENAME                        READY
dynamic-schedule-vm-1   2m41s   Running   10.232.0.240   control-plane-cluster-jlkcx-1   True
- Run the following command in the embedded terminal to observe the kube-descheduler operator configuration:

oc -n openshift-kube-descheduler-operator get kubedescheduler cluster -o yaml

apiVersion: operator.openshift.io/v1
kind: KubeDescheduler
metadata:
  name: cluster
  namespace: openshift-kube-descheduler-operator
spec:
  logLevel: Normal
  mode: Predictive
  operatorLogLevel: Normal
  profileCustomizations:
    devEnableSoftTainter: true
    devDeviationThresholds: AsymmetricLow
    devActualUtilizationProfile: PrometheusCPUCombined
  profiles:
  - KubeVirtRelieveAndMigrate
  # This is the interval at which the descheduler will run; set a lower value for testing purposes:
  deschedulingIntervalSeconds: 3600
  managementState: Managed

By default, the mode is set to Predictive, which means the descheduler only simulates and logs potential pod evictions. It does not actually perform any evictions. To enable eviction, set the mode to Automatic.
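If you just want the current mode rather than the full YAML, a jsonpath query works too (a convenience check, not a required lab step):

oc -n openshift-kube-descheduler-operator get kubedescheduler cluster -o jsonpath='{.spec.mode}{"\n"}'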
- Ensure the descheduler pod is running and check its logs in the openshift-kube-descheduler-operator namespace with the following command:

oc logs -n openshift-kube-descheduler-operator -l app=descheduler

I0102 13:57:26.543545       1 lownodeutilization.go:210] "Node has been classified" category="underutilized" node="worker-cluster-jlkcx-2" usage={"MetricResource":"4"} usagePercentage={"MetricResource":4}
I0102 13:57:26.543656       1 lownodeutilization.go:210] "Node has been classified" category="underutilized" node="worker-cluster-jlkcx-1" usage={"MetricResource":"5"} usagePercentage={"MetricResource":5}
I0102 13:57:26.543667       1 lownodeutilization.go:236] "Node is appropriately utilized" node="control-plane-cluster-jlkcx-1" usage={"MetricResource":"13"} usagePercentage={"MetricResource":13}
I0102 13:57:26.543674       1 lownodeutilization.go:248] "Criteria for a node under utilization" MetricResource="0.00%"
I0102 13:57:26.543687       1 lownodeutilization.go:249] "Number of underutilized nodes" totalNumber=2
I0102 13:57:26.543694       1 lownodeutilization.go:250] "Criteria for a node above target utilization" MetricResource="10.00%"
I0102 13:57:26.543704       1 lownodeutilization.go:251] "Number of overutilized nodes" totalNumber=0
I0102 13:57:26.543709       1 lownodeutilization.go:275] "All nodes are under target utilization, nothing to do here"
I0102 13:57:26.543786       1 profile.go:376] "Total number of evictions/requests" extension point="Balance" evictedPods=0 evictionRequests=0
I0102 13:57:26.543827       1 descheduler.go:403] "Number of evictions/requests" totalEvicted=0 evictionRequests=0
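To compare the descheduler's classification with live node utilization, you can pull the same picture from the metrics API (available in this lab environment):

# Show current CPU and memory usage per node
oc adm top nodes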
- Now we are going to deploy a stress test pod in order to demonstrate the functionality of the kube-descheduler. We do so by patching the existing stress-test deployment to use a nodeSelector so that the pod runs on the node where the VM is running.
- We do this by setting the nodeSelector key kubernetes.io/hostname dynamically by:
  - Listing all of the nodes in the cluster.
  - Filtering the results on the specific node where the dynamic-schedule-vm-1 virtual machine is running.
- In order to perform these actions, run the following command:
oc -n dynamic-schedule-stress-test patch deployment stress-test --type merge -p "{\"spec\":{\"template\":{\"spec\":{\"nodeSelector\":{\"kubernetes.io/hostname\":\"$(oc get nodes -o custom-columns=":metadata.name" --no-headers | grep $(oc get vmi dynamic-schedule-vm-1 -n dynamic-schedule -o json | jq -r '.status.nodeName'))\"}}}}}"
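If the nested one-liner is hard to read, here is an equivalent sketch split into two steps (same effect, easier to follow):

# 1. Look up the node where the VM is currently running
NODE=$(oc get vmi dynamic-schedule-vm-1 -n dynamic-schedule -o jsonpath='{.status.nodeName}')

# 2. Pin the stress-test deployment to that node via a nodeSelector
oc -n dynamic-schedule-stress-test patch deployment stress-test --type merge \
  -p "{\"spec\":{\"template\":{\"spec\":{\"nodeSelector\":{\"kubernetes.io/hostname\":\"$NODE\"}}}}}"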
- To increase the stress on the node, we now need to scale the deployment. Do so with the following command:

oc -n dynamic-schedule-stress-test scale deployment stress-test --replicas=1
- Use the following command to verify that the pod is running:

oc get pods -n dynamic-schedule-stress-test -o wide
- If you recall from the configuration earlier, the default descheduling interval is 3600 seconds, meaning the descheduler assesses the status of the cluster once an hour. For the purposes of this lab, we want to decrease that interval so that we can observe it perform the descheduling action.
- We can edit the kubedescheduler cluster resource to set the deschedulingIntervalSeconds to a lower value with the following command:

oc -n openshift-kube-descheduler-operator patch kubedescheduler cluster --type merge -p '{"spec":{"deschedulingIntervalSeconds":5}}'
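You can verify that the new interval took effect with a quick query:

oc -n openshift-kube-descheduler-operator get kubedescheduler cluster -o jsonpath='{.spec.deschedulingIntervalSeconds}{"\n"}'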
- With the interval shortened to 5 seconds, we can now observe the eviction decisions as they happen. View the descheduler pod logs in the openshift-kube-descheduler-operator namespace with the following command:
oc logs -f -n openshift-kube-descheduler-operator -l app=descheduler

I1113 15:24:57.371043       1 lownodeutilization.go:261] "Number of overutilized nodes" totalNumber=1
I1113 15:24:57.371052       1 nodeutilization.go:174] "Total capacity to be moved" MetricResource=143
I1113 15:24:57.371273       1 nodeutilization.go:189] "Pods on node" node="worker01" allPods=72 nonRemovablePods=45 removablePods=27
I1113 15:30:00.673653       1 nodeutilization.go:205] "Evicting pods based on priority, if they have same priority, they'll be evicted based on QoS tiers"
I1113 15:30:00.673755       1 evictions.go:551] "Evicted pod in dry run mode" pod="dynamic-schedule/virt-launcher-dynamic-schedule-vm-1-lltlv" reason="" strategy="LowNodeUtilization" node="worker01" profile="DevKubeVirtRelieveAndMigrate"

We can see that the virt-launcher-dynamic-schedule-vm-1-lltlv pod would have been evicted, but the descheduler was in dry run (Predictive) mode.
- To actually allow the descheduler to evict the pods, we need to set the descheduler mode to Automatic. We can do so with the following command:

oc -n openshift-kube-descheduler-operator patch kubedescheduler cluster --type merge -p '{"spec":{"mode":"Automatic"}}'
- Let's verify the descheduler pod logs in the openshift-kube-descheduler-operator namespace one more time:

oc logs -f -n openshift-kube-descheduler-operator -l app=descheduler

I0102 14:53:47.976452       1 lownodeutilization.go:251] "Number of overutilized nodes" totalNumber=1
I0102 14:53:47.976473       1 nodeutilization.go:185] "Total capacity to be moved" MetricResource=79
I0102 14:53:47.976924       1 nodeutilization.go:200] "Pods on node" node="worker-cluster-jlkcx-2" allPods=44 nonRemovablePods=31 removablePods=13
I0102 14:53:47.976946       1 nodeutilization.go:216] "Evicting pods based on priority, if they have same priority, they'll be evicted based on QoS tiers"
I0102 14:53:47.977036       1 profile.go:376] "Total number of evictions/requests" extension point="Balance" evictedPods=0 evictionRequests=0
I0102 14:53:47.977065       1 descheduler.go:403] "Number of evictions/requests" totalEvicted=0 evictionRequests=0
I0102 14:53:52.963035       1 lownodeutilization.go:210] "Node has been classified" category="overutilized" node="control-plane-cluster-jlkcx-1" usage={"MetricResource":"10"} usagePercentage={"MetricResource":10}
I0102 14:53:52.963113       1 lownodeutilization.go:210] "Node has been classified" category="underutilized" node="worker-cluster-jlkcx-1" usage={"MetricResource":"7"} usagePercentage={"MetricResource":7}
I0102 14:53:52.963143       1 lownodeutilization.go:210] "Node has been classified" category="underutilized" node="worker-cluster-jlkcx-2" usage={"MetricResource":"100"} usagePercentage={"MetricResource":100}

We can see that we have one overutilized node where the stress-test pod is running. Because of this, the VM will be migrated to another node.
- We can confirm that the pods are evicted and rescheduled on another node (using the -w "watch" parameter, as it can take some time for the eviction to happen):

oc get pods -n dynamic-schedule -o wide -w

NAME                                        READY   STATUS      RESTARTS   AGE     IP             NODE                            NOMINATED NODE   READINESS GATES
virt-launcher-dynamic-schedule-vm-1-lltlv   0/2     Completed   0          31m     10.232.0.240   control-plane-cluster-jlkcx-1   <none>           1/1
virt-launcher-dynamic-schedule-vm-1-n6ct7   2/2     Running     0          2m37s   10.234.0.84    worker-cluster-jlkcx-2          <none>           1/1
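Because the eviction of a virt-launcher pod is carried out as a live migration, you can also watch the migration objects themselves (vmim is the KubeVirt short name for VirtualMachineInstanceMigration):

# Watch live migration objects appear as the descheduler evicts the VM
oc get vmim -n dynamic-schedule -w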
- To confirm that the kube-descheduler is working, move the stress test pod to the new node where the VM is now running, and repeat the last few steps to watch it move again:

oc -n dynamic-schedule-stress-test patch deployment stress-test --type merge -p "{\"spec\":{\"template\":{\"spec\":{\"nodeSelector\":{\"kubernetes.io/hostname\":\"$(oc get nodes -o custom-columns=":metadata.name" --no-headers | grep $(oc get vmi dynamic-schedule-vm-1 -n dynamic-schedule -o json | jq -r '.status.nodeName'))\"}}}}}"
- The command we just executed patches the configuration of the stress-test deployment by changing the nodeSelector, which triggers a new rollout onto the node where the dynamic-schedule-vm-1 VM is running. The VM will then be evicted and live migrate to another available node.

oc get pods -n dynamic-schedule -o wide -w

NAME                                        READY   STATUS      RESTARTS   AGE     IP            NODE                            NOMINATED NODE   READINESS GATES
virt-launcher-dynamic-schedule-vm-1-4qhnq   2/2     Running     0          4m28s   10.232.1.70   control-plane-cluster-jlkcx-1   <none>           1/1
virt-launcher-dynamic-schedule-vm-1-n6ct7   0/2     Completed   0          9m49s   10.234.0.84   worker-cluster-jlkcx-2          <none>           1/1
Congratulations, you have completed this module!
| Scale down the dynamic-schedule-stress-test deployment and stop the dynamic-schedule-vm-1 VM using virtctl from your Terminal window to ensure you have enough resources for the next labs. |
oc -n dynamic-schedule-stress-test scale deployment stress-test --replicas=0
virtctl stop dynamic-schedule-vm-1 -n dynamic-schedule
Summary
In this module we explored how eviction policies work for virtual machines and pods, and how we can leverage the kube-descheduler operator in OpenShift Virtualization to ensure VM workloads are not killed when a node runs out of resources, but are instead live-migrated to another node with enough free resources. We demonstrated this by increasing the load on the worker node where our VM was running, and by setting the descheduler interval short enough that we could watch the VM migrate as necessary.