Dynamic Rescheduling of Workloads

Rosetta Stone: VMware

This feature is similar to VMware’s Distributed Resource Scheduler (DRS).

Like DRS, the OpenShift Descheduler is not a mandatory element of VM creation; it can be enabled at any time.

Introduction

You can use the descheduler to evict pods based on specific strategies so that the pods can be rescheduled onto more appropriate nodes. But eviction means stopping a pod and starting it again elsewhere, and nobody wants their VMs restarted just to rebalance the cluster.

With OpenShift Virtualization, this is handled gracefully through the virt-api-validator validating webhook. This webhook intercepts all eviction requests and makes a series of decisions on how to treat each request.

The short summary is that when an eviction request is received for a Virtual Machine that is live migratable, that VM is live migrated instead of being evicted.
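As a quick sanity check, you can inspect a VMI's conditions to see whether it is live migratable. The sketch below parses a trimmed sample of VMI status JSON with jq; on the lab cluster you would feed it real output from `oc get vmi` instead.

```shell
# On the lab cluster, fetch the real status with:
#   oc get vmi dynamic-schedule-vm-1 -n dynamic-schedule -o json
# Here we use a trimmed sample VMI status for illustration.
cat > /tmp/vmi.json <<'EOF'
{"status": {"conditions": [{"type": "LiveMigratable", "status": "True"}]}}
EOF
# Print the LiveMigratable condition: "True" means an eviction request
# for this VM will be turned into a live migration instead of a restart.
jq -r '.status.conditions[] | select(.type == "LiveMigratable") | .status' /tmp/vmi.json
```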

You can benefit from descheduling running pods in situations such as the following:

  • Nodes are underutilized or overutilized.

  • Pod and node affinity requirements, such as taints or labels, have changed and the original scheduling decisions are no longer appropriate for certain nodes.

  • Node failure requires pods to be moved.

  • New nodes are added to clusters.

  • Pods have been restarted too many times.

Objectives

In this module, we will guide you through using the Descheduler with the KubeVirtRelieveAndMigrate profile.

  • This profile evicts pods from high-cost nodes to reduce overall resource expenses and enable workload migration. It also periodically rebalances workloads to help maintain similar spare capacity across nodes, which supports better handling of sudden workload spikes. Nodes can experience the following costs:

    • Resource utilization: Increased resource pressure raises the overhead for running applications.

    • Node maintenance: A higher number of containers on a node increases resource consumption and maintenance costs.

Requirements

  • Descheduler Operator is installed

  • The KubeVirtRelieveAndMigrate profile is configured

  • PSI metrics are enabled on all worker nodes

For your convenience, all of the requirements are already enabled in the lab.

PSI (Pressure Stall Information) is a Linux kernel feature that measures and reports the time tasks spend waiting for CPU, memory, or I/O resources, providing visibility into resource contention and saturation. It quantifies how much time processes are stalled waiting for resources, helping identify performance bottlenecks by showing what percentage of time your system is under pressure from insufficient resources.
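If you are curious, PSI data can be read directly from /proc/pressure on a Linux node (for example via a debug shell on a worker). The sketch below parses a sample line; the numbers are illustrative, not taken from the lab cluster.

```shell
# Example line from /proc/pressure/cpu on a PSI-enabled node (sample data):
line="some avg10=2.04 avg60=0.75 avg300=0.40 total=157622151"
# avg10 = share of the last 10 seconds during which some tasks were
# stalled waiting for CPU. Extract just that percentage:
echo "$line" | grep -o 'avg10=[0-9.]*' | cut -d= -f2
```

Sustained high avg10/avg60 values indicate resource contention, which is why this lab requires PSI metrics to be enabled on all worker nodes.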

Accessing the OpenShift Cluster

Web Console

{openshift_cluster_console_url}[{openshift_cluster_console_url},window=_blank]

CLI Login
oc login -u {openshift_cluster_admin_username} -p {openshift_cluster_admin_password} --server={openshift_api_server_url}
Cluster API

{openshift_api_server_url}[{openshift_api_server_url},window=_blank]

OpenShift Username
{openshift_cluster_admin_username}
OpenShift Password
{openshift_cluster_admin_password}

Instructions

  1. Ensure you are logged in to the OpenShift CLI as the admin user from the terminal window on the right side of your screen and continue to the next step.

  2. Start the dynamic-schedule-vm-1 Virtual Machine

    virtctl start dynamic-schedule-vm-1 -n dynamic-schedule
    VM dynamic-schedule-vm-1 was scheduled to start
  3. Check which virtual machine is running in the dynamic-schedule namespace

    oc get vmi -n dynamic-schedule
    NAME                    AGE     PHASE     IP             NODENAME                        READY
    dynamic-schedule-vm-1   2m41s   Running   10.232.0.240   control-plane-cluster-jlkcx-1   True
  4. Observe the Descheduler Operator configuration

    oc -n openshift-kube-descheduler-operator get kubedescheduler cluster -o yaml
    apiVersion: operator.openshift.io/v1
    kind: KubeDescheduler
    metadata:
      name: cluster
      namespace: openshift-kube-descheduler-operator
    spec:
      logLevel: Normal
      mode: Predictive
      operatorLogLevel: Normal
      profileCustomizations:
        devEnableSoftTainter: true
        devDeviationThresholds: AsymmetricLow
        devActualUtilizationProfile: PrometheusCPUCombined
      profiles:
        - KubeVirtRelieveAndMigrate
      # This is the interval at which the descheduler will run; set a lower value for testing purposes:
      deschedulingIntervalSeconds: 3600
      managementState: Managed

    By default, the mode is set to Predictive, which means the descheduler only simulates and logs potential pod evictions; it does not actually perform any evictions.

    To enable eviction, set the mode to Automatic.

  5. Ensure the descheduler pod is running and check the logs in the openshift-kube-descheduler-operator namespace

    oc logs -n openshift-kube-descheduler-operator -l app=descheduler
    Output
    I0102 13:57:26.543545       1 lownodeutilization.go:210] "Node has been classified" category="underutilized" node="worker-cluster-jlkcx-2" usage={"MetricResource":"4"} usagePercentage={"MetricResource":4}
    I0102 13:57:26.543656       1 lownodeutilization.go:210] "Node has been classified" category="underutilized" node="worker-cluster-jlkcx-1" usage={"MetricResource":"5"} usagePercentage={"MetricResource":5}
    I0102 13:57:26.543667       1 lownodeutilization.go:236] "Node is appropriately utilized" node="control-plane-cluster-jlkcx-1" usage={"MetricResource":"13"} usagePercentage={"MetricResource":13}
    I0102 13:57:26.543674       1 lownodeutilization.go:248] "Criteria for a node under utilization" MetricResource="0.00%"
    I0102 13:57:26.543687       1 lownodeutilization.go:249] "Number of underutilized nodes" totalNumber=2
    I0102 13:57:26.543694       1 lownodeutilization.go:250] "Criteria for a node above target utilization" MetricResource="10.00%"
    I0102 13:57:26.543704       1 lownodeutilization.go:251] "Number of overutilized nodes" totalNumber=0
    I0102 13:57:26.543709       1 lownodeutilization.go:275] "All nodes are under target utilization, nothing to do here"
    I0102 13:57:26.543786       1 profile.go:376] "Total number of evictions/requests" extension point="Balance" evictedPods=0 evictionRequests=0
    I0102 13:57:26.543827       1 descheduler.go:403] "Number of evictions/requests" totalEvicted=0 evictionRequests=0
  6. Deploy a stress test pod to show the descheduler functionality

    1. Patch the existing stress-test deployment to use a nodeSelector so that the pod runs on the node where the VM is running.

      We do this by setting the nodeSelector key kubernetes.io/hostname dynamically by:

      1. Getting all of the nodes

      2. Filtering the result to the specific node where the dynamic-schedule-vm-1 VM is running

        oc -n dynamic-schedule-stress-test patch deployment stress-test --type merge -p "{\"spec\":{\"template\":{\"spec\":{\"nodeSelector\":{\"kubernetes.io/hostname\":\"$(oc get nodes -o custom-columns=":metadata.name" --no-headers | grep $(oc get vmi dynamic-schedule-vm-1 -n dynamic-schedule -o json | jq -r '.status.nodeName'))\"}}}}}"
    2. Scale the deployment

      oc -n dynamic-schedule-stress-test scale deployment stress-test --replicas=1
    3. Verify the pod is running

    oc get pods -n dynamic-schedule-stress-test -o wide
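As an aside, the long one-liner in the patch above can be built in two readable stages. This is a sketch, not part of the lab steps: the node name below is a stub value, and it assumes your oc client supports -o jsonpath (current releases do).

```shell
# On the lab cluster, look the node name up directly instead of grepping:
#   NODE=$(oc get vmi dynamic-schedule-vm-1 -n dynamic-schedule -o jsonpath='{.status.nodeName}')
NODE=worker-cluster-jlkcx-2   # stub value for illustration
# Build the merge patch as a variable, which avoids nested quoting:
PATCH="{\"spec\":{\"template\":{\"spec\":{\"nodeSelector\":{\"kubernetes.io/hostname\":\"$NODE\"}}}}}"
echo "$PATCH"
# Apply it with:
#   oc -n dynamic-schedule-stress-test patch deployment stress-test --type merge -p "$PATCH"
```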
  7. Edit the kubedescheduler cluster resource to set the deschedulingIntervalSeconds to a lower value for testing purposes.

    oc -n openshift-kube-descheduler-operator patch kubedescheduler cluster --type merge -p '{"spec":{"deschedulingIntervalSeconds":5}}'
  8. View the descheduler pod logs in the openshift-kube-descheduler-operator namespace

    oc logs -f -n openshift-kube-descheduler-operator -l app=descheduler
    I1113 15:24:57.371043       1 lownodeutilization.go:261] "Number of overutilized nodes" totalNumber=1
    I1113 15:24:57.371052       1 nodeutilization.go:174] "Total capacity to be moved" MetricResource=143
    I1113 15:24:57.371273       1 nodeutilization.go:189] "Pods on node" node="worker01" allPods=72 nonRemovablePods=45 removablePods=27
    I1113 15:30:00.673653       1 nodeutilization.go:205] "Evicting pods based on priority, if they have same priority, they'll be evicted based on QoS tiers"
    I1113 15:30:00.673755       1 evictions.go:551] "Evicted pod in dry run mode" pod="dynamic-schedule/virt-launcher-dynamic-schedule-vm-1-lltlv" reason="" strategy="LowNodeUtilization" node="worker01" profile="DevKubeVirtRelieveAndMigrate"

    We can see that the virt-launcher-dynamic-schedule-vm-1-lltlv pod would have been evicted, but the Descheduler was in dry-run (Predictive) mode.

  9. To actually evict, set the Descheduler mode to Automatic.

    oc -n openshift-kube-descheduler-operator patch kubedescheduler cluster --type merge -p '{"spec":{"mode":"Automatic"}}'
  10. Verify the descheduler pod logs in the openshift-kube-descheduler-operator namespace

    oc logs -f -n openshift-kube-descheduler-operator -l app=descheduler
    Output
    I0102 14:53:47.976452       1 lownodeutilization.go:251] "Number of overutilized nodes" totalNumber=1
    I0102 14:53:47.976473       1 nodeutilization.go:185] "Total capacity to be moved" MetricResource=79
    I0102 14:53:47.976924       1 nodeutilization.go:200] "Pods on node" node="worker-cluster-jlkcx-2" allPods=44 nonRemovablePods=31 removablePods=13
    I0102 14:53:47.976946       1 nodeutilization.go:216] "Evicting pods based on priority, if they have same priority, they'll be evicted based on QoS tiers"
    I0102 14:53:47.977036       1 profile.go:376] "Total number of evictions/requests" extension point="Balance" evictedPods=0 evictionRequests=0
    I0102 14:53:47.977065       1 descheduler.go:403] "Number of evictions/requests" totalEvicted=0 evictionRequests=0
    I0102 14:53:52.963035       1 lownodeutilization.go:210] "Node has been classified" category="overutilized" node="control-plane-cluster-jlkcx-1" usage={"MetricResource":"10"} usagePercentage={"MetricResource":10}
    I0102 14:53:52.963113       1 lownodeutilization.go:210] "Node has been classified" category="underutilized" node="worker-cluster-jlkcx-1" usage={"MetricResource":"7"} usagePercentage={"MetricResource":7}
    I0102 14:53:52.963143       1 lownodeutilization.go:210] "Node has been classified" category="underutilized" node="worker-cluster-jlkcx-2" usage={"MetricResource":"100"} usagePercentage={"MetricResource":100}

    We can see that we have one overutilized node: the node where the stress-test pod is running.

    Because of this, the VM will be migrated to another node.

  11. Confirm the pods are evicted and rescheduled on another node (use the -w watch flag, as the eviction can take some time)

    oc get pods -n dynamic-schedule -o wide -w
    Output
    NAME                                        READY   STATUS      RESTARTS   AGE     IP             NODE                            NOMINATED NODE   READINESS GATES
    virt-launcher-dynamic-schedule-vm-1-lltlv   0/2     Completed   0          31m     10.232.0.240   control-plane-cluster-jlkcx-1   <none>           1/1
    virt-launcher-dynamic-schedule-vm-1-n6ct7   2/2     Running     0          2m37s   10.234.0.84    worker-cluster-jlkcx-2          <none>           1/1
  12. You can confirm that the eviction is working by moving the stress-test pod to the new node where the VM is running and watching it move again.

    oc -n dynamic-schedule-stress-test patch deployment stress-test --type merge -p "{\"spec\":{\"template\":{\"spec\":{\"nodeSelector\":{\"kubernetes.io/hostname\":\"$(oc get nodes -o custom-columns=":metadata.name" --no-headers | grep $(oc get vmi dynamic-schedule-vm-1 -n dynamic-schedule -o json | jq -r '.status.nodeName'))\"}}}}}"

    The patch changes the nodeSelector and triggers a rollout of the stress-test deployment, pushing it back to the node where the dynamic-schedule-vm-1 VM is running.

    oc get pods -n dynamic-schedule -o wide -w
    Output
    NAME                                        READY   STATUS      RESTARTS   AGE     IP             NODE                            NOMINATED NODE   READINESS GATES
    virt-launcher-dynamic-schedule-vm-1-4qhnq   2/2     Running     0          4m28s   10.232.1.70    control-plane-cluster-jlkcx-1   <none>           1/1
    virt-launcher-dynamic-schedule-vm-1-n6ct7   0/2     Completed   0          9m49s   10.234.0.84    worker-cluster-jlkcx-2          <none>           1/1

Scale down the dynamic-schedule-stress-test deployment and stop the dynamic-schedule-vm-1 VM using virtctl from your Terminal window to ensure you have enough resources for the next labs.

oc -n dynamic-schedule-stress-test scale deployment stress-test --replicas=0

deployment.apps/stress-test scaled

virtctl stop dynamic-schedule-vm-1 -n dynamic-schedule

VM dynamic-schedule-vm-1 was scheduled to stop