Dynamic Rescheduling of Workloads
Introduction
You can use the descheduler to evict pods based on specific strategies so that the pods can be rescheduled onto more appropriate nodes.
You can benefit from descheduling running pods in situations such as the following:
- Nodes are underutilized or overutilized.
- Pod and node affinity requirements, such as taints or labels, have changed and the original scheduling decisions are no longer appropriate for certain nodes.
- Node failure requires pods to be moved.
- New nodes are added to clusters.
- Pods have been restarted too many times.
Objectives
In this module, we will guide you through configuring rescheduling strategies with the following profile:

- KubeVirtRelieveAndMigrate: evicts pods from high-cost nodes to reduce overall resource expenses and enable workload migration. It also periodically rebalances workloads to help maintain similar spare capacity across nodes, which supports better handling of sudden workload spikes. Nodes can experience the following costs:
  - Resource utilization: Increased resource pressure raises the overhead for running applications.
  - Node maintenance: A higher number of containers on a node increases resource consumption and maintenance costs.
Requirements
- The Descheduler Operator must be installed.
- The KubeVirtRelieveAndMigrate profile requires PSI metrics to be enabled on all worker nodes. For your convenience, this is already enabled in this lab.
NOTE: PSI (Pressure Stall Information) is a Linux kernel feature that measures and reports the time tasks spend waiting for CPU, memory, or I/O resources, providing visibility into resource contention and saturation. It quantifies how much time processes are stalled waiting for resources, helping identify performance bottlenecks by showing what percentage of time your system is under pressure from insufficient resources.
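On a PSI-enabled node, pressure data is exposed under /proc/pressure/ (for example /proc/pressure/cpu). As a minimal sketch of what those lines look like and how to read them, the following parses a sample CPU pressure line; the values are illustrative, not taken from this lab:

```shell
# Illustrative sample in the format of /proc/pressure/cpu (values are made up).
# "some" = share of time at least one task was stalled waiting for CPU.
sample='some avg10=1.25 avg60=0.80 avg300=0.40 total=123456789'

# Extract the 10-second average stall percentage.
avg10=$(echo "$sample" | sed -n 's/.*avg10=\([0-9.]*\) .*/\1/p')
echo "CPU pressure over the last 10s: ${avg10}%"
```

A sustained, non-trivial avg10/avg60 value is the kind of signal the KubeVirtRelieveAndMigrate profile uses to judge real node pressure rather than relying on resource requests alone.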
Accessing the OpenShift Cluster
Console: {openshift_cluster_console_url}[{openshift_cluster_console_url},window=_blank]
API server: {openshift_api_server_url}[{openshift_api_server_url},window=_blank]
Username: {openshift_cluster_admin_username}
Password: {openshift_cluster_admin_password}

Log in from the terminal:

oc login -u {openshift_cluster_admin_username} -p {openshift_cluster_admin_password} --server={openshift_api_server_url}
Instructions
- Ensure you are logged in to the OpenShift CLI as the admin user from the terminal window on the right side of your screen and continue to the next step.
- Check which virtual machines are running in the dynamic-schedule namespace:

  oc get vmi -n dynamic-schedule

  NAME                   AGE   PHASE     IP             NODENAME
  dynamic-schedule-vm1   10m   Running   10.129.2.111   worker01
- Observe the Descheduler Operator configuration:

  oc -n openshift-kube-descheduler-operator get kubedescheduler cluster -o yaml

  apiVersion: operator.openshift.io/v1
  kind: KubeDescheduler
  metadata:
    name: cluster
    namespace: openshift-kube-descheduler-operator
  spec:
    logLevel: Normal
    mode: Predictive <1>
    operatorLogLevel: Normal
    profileCustomizations:
      devEnableSoftTainter: true
      devDeviationThresholds: AsymmetricLow
      devActualUtilizationProfile: PrometheusCPUCombined
    profiles:
    - KubeVirtRelieveAndMigrate
    # This is the interval at which the descheduler will run; set a lower value for testing purposes:
    deschedulingIntervalSeconds: 3600
    managementState: Managed

  Notice the mode is set to Predictive. By default, the descheduler does not evict pods. To evict pods, set mode to Automatic.
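If you only want a single field rather than the whole object, you can extract it instead of reading the full YAML. The sketch below pulls the mode out of a sample of the spec shown above so it runs without a cluster; on a live cluster, standard JSONPath output (`oc ... -o jsonpath='{.spec.mode}'`) achieves the same thing:

```shell
# Sample of the spec section shown above (trimmed); on a live cluster use:
#   oc -n openshift-kube-descheduler-operator get kubedescheduler cluster -o jsonpath='{.spec.mode}'
config='mode: Predictive
deschedulingIntervalSeconds: 3600'

# Pick out the mode field.
mode=$(echo "$config" | awk '$1 == "mode:" {print $2}')
echo "Descheduler mode: $mode"
```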
- Ensure the descheduler pod is running and check the logs in the openshift-kube-descheduler-operator namespace:

  oc logs -n openshift-kube-descheduler-operator -l app=descheduler
- Deploy a stress test pod to validate the descheduler functionality:
  - Edit the existing stress-test deployment to use a specific node selector, placing it on a node where at least one VM is running:

    oc -n dynamic-schedule-stress-test patch deployment stress-test --type merge -p '{"spec":{"template":{"spec":{"nodeSelector":{"kubernetes.io/hostname":"worker01"}}}}}'

  - Scale the deployment:

    oc -n dynamic-schedule-stress-test scale deployment stress-test --replicas=1

  - Verify the deployment:

    oc get pods -n dynamic-schedule-stress-test
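To confirm the stress pod actually landed on the node you selected, add `-o wide` and read the NODE column. As a sketch that runs without a cluster, the following parses an illustrative sample of that output (the pod name and IP are made up); on a live cluster, pipe the real command instead:

```shell
# Illustrative sample of `oc get pods -n dynamic-schedule-stress-test -o wide` output.
pods_wide='NAME                          READY   STATUS    RESTARTS   AGE   IP            NODE
stress-test-7d9c6b5f4-abcde   1/1     Running   0          1m    10.129.2.50   worker01'

# NODE is the 7th whitespace-separated column; skip the header row.
node=$(echo "$pods_wide" | awk 'NR > 1 {print $7}')
echo "stress-test pod scheduled on: $node"
```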
- Observe the descheduler pod logs in the openshift-kube-descheduler-operator namespace:

  oc logs -f -n openshift-kube-descheduler-operator -l app=descheduler
- Edit the kubedescheduler cluster resource to set deschedulingIntervalSeconds to a lower value for testing purposes:

  oc -n openshift-kube-descheduler-operator patch kubedescheduler cluster --type merge -p '{"spec":{"deschedulingIntervalSeconds":600}}'

  I1113 15:24:57.371043  1 lownodeutilization.go:261] "Number of overutilized nodes" totalNumber=1
  I1113 15:24:57.371052  1 nodeutilization.go:174] "Total capacity to be moved" MetricResource=143
  I1113 15:24:57.371273  1 nodeutilization.go:189] "Pods on node" node="worker01" allPods=72 nonRemovablePods=45 removablePods=27
  I1113 15:30:00.673653  1 nodeutilization.go:205] "Evicting pods based on priority, if they have same priority, they'll be evicted based on QoS tiers"
  I1113 15:30:00.673755  1 evictions.go:551] "Evicted pod in dry run mode" pod="dynamic-schedule/virt-launcher-vm01" reason="" strategy="LowNodeUtilization" node="worker01" profile="DevKubeVirtRelieveAndMigrate"

  Notice the mode is still set to Predictive, so the eviction runs in dry run mode. To evict pods, set mode to Automatic.
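When the logs get busy, it helps to filter for the dry-run eviction events specifically. As a sketch that runs without a cluster, the following greps an illustrative log line like the one shown above; on a live cluster, pipe `oc logs -n openshift-kube-descheduler-operator -l app=descheduler` instead:

```shell
# Illustrative descheduler log line matching the dry-run eviction seen above.
logs='I1113 15:30:00.673755 1 evictions.go:551] "Evicted pod in dry run mode" pod="dynamic-schedule/virt-launcher-vm01" strategy="LowNodeUtilization" node="worker01"'

# Count dry-run evictions and pull out the affected pod name.
count=$(echo "$logs" | grep -c 'Evicted pod in dry run mode')
pod=$(echo "$logs" | sed -n 's/.*pod="\([^"]*\)".*/\1/p')
echo "dry-run evictions: $count (pod: $pod)"
```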
- Switch the descheduler mode to Automatic:

  oc -n openshift-kube-descheduler-operator patch kubedescheduler cluster --type merge -p '{"spec":{"mode":"Automatic"}}'
- Verify the descheduler pod logs in the openshift-kube-descheduler-operator namespace:

  oc logs -f -n openshift-kube-descheduler-operator -l app=descheduler
- Confirm the pods are evicted and rescheduled on another node:

  oc get pods -n dynamic-schedule -o wide

  NAME                 READY   STATUS    RESTARTS   AGE   IP             NODE       NOMINATED NODE   READINESS GATES
  virt-launcher-vm01   1/1     Running   0          10m   10.129.2.111   worker02   <none>           <none>
  virt-launcher-vm02   1/1     Running   0          10m   10.129.2.112   worker02   <none>           <none>
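A quick way to see where the VMs ended up is to tally pods per node from the `-o wide` output. The sketch below does this over an illustrative sample mirroring the output above, so it runs without a cluster; on a live cluster, pipe the real command instead:

```shell
# Illustrative sample of `oc get pods -n dynamic-schedule -o wide` output.
pods_wide='NAME                 READY   STATUS    RESTARTS   AGE   IP             NODE
virt-launcher-vm01   1/1     Running   0          10m   10.129.2.111   worker02
virt-launcher-vm02   1/1     Running   0          10m   10.129.2.112   worker02'

# Tally pods per node (NODE is column 7); skip the header row.
summary=$(echo "$pods_wide" | awk 'NR > 1 {count[$7]++} END {for (n in count) print n, count[n]}')
echo "$summary"
```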
- You can confirm that eviction is working by placing the stress test pod on the node where the VMs are now running, then confirming that they are evicted and rescheduled on another node:

  oc -n dynamic-schedule patch deployment stress-test --type merge -p '{"spec":{"template":{"spec":{"nodeSelector":{"kubernetes.io/hostname":"worker02"}}}}}'

  oc get pods -n dynamic-schedule -o wide