Dynamic Rescheduling of Workloads

Introduction

You can use the descheduler to evict pods based on specific strategies so that the pods can be rescheduled onto more appropriate nodes.

You can benefit from descheduling running pods in situations such as the following:

  • Nodes are underutilized or overutilized.

  • Pod and node affinity requirements, such as taints or labels, have changed and the original scheduling decisions are no longer appropriate for certain nodes.

  • Node failure requires pods to be moved.

  • New nodes are added to clusters.

  • Pods have been restarted too many times.

Objectives

In this module, we will guide you through configuring rescheduling strategies using the following profile:

  • KubeVirtRelieveAndMigrate

    • This profile evicts pods from high-cost nodes to reduce overall resource expenses and enable workload migration. It also periodically rebalances workloads to help maintain similar spare capacity across nodes, which supports better handling of sudden workload spikes. Nodes can experience the following costs:

      • Resource utilization: Increased resource pressure raises the overhead for running applications.

      • Node maintenance: A higher number of containers on a node increases resource consumption and maintenance costs.

Requirements

  • The Kube Descheduler Operator must be installed

  • The KubeVirtRelieveAndMigrate profile requires PSI metrics to be enabled on all worker nodes. For your convenience, this is already enabled in this lab.
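
For reference, PSI is typically enabled by adding the psi=1 kernel argument to the worker nodes. On OpenShift this is commonly done with a MachineConfig such as the following sketch (the resource name is illustrative, and applying such a change triggers a rolling reboot of the worker pool):

```yaml
# Illustrative sketch only: PSI is already enabled in this lab,
# so you do not need to apply this MachineConfig.
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  name: 99-worker-enable-psi   # illustrative name
  labels:
    machineconfiguration.openshift.io/role: worker
spec:
  kernelArguments:
    - psi=1
```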

PSI (Pressure Stall Information) is a Linux kernel feature that measures the time tasks spend stalled waiting for CPU, memory, or I/O resources, providing visibility into resource contention and saturation. By reporting the percentage of time the system is under pressure from insufficient resources, it helps identify performance bottlenecks.
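
On a node, PSI is exposed under /proc/pressure/cpu, /proc/pressure/memory, and /proc/pressure/io (reachable in OpenShift via oc debug node/<node>). A minimal sketch of parsing one such line, using an illustrative sample value rather than live node data:

```shell
# Example PSI line (format as in /proc/pressure/cpu; values are illustrative):
psi_line='some avg10=1.25 avg60=0.80 avg300=0.30 total=123456'

# Extract the 10-second average stall percentage:
avg10=$(echo "$psi_line" | grep -o 'avg10=[0-9.]*' | cut -d= -f2)
echo "CPU pressure (avg10): ${avg10}%"
```

The "some" line reports the share of time at least one task was stalled; a "full" line (for memory and I/O) reports time all non-idle tasks were stalled simultaneously.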

Accessing the OpenShift Cluster

Web Console

{openshift_cluster_console_url}[{openshift_cluster_console_url},window=_blank]

CLI Login

oc login -u {openshift_cluster_admin_username} -p {openshift_cluster_admin_password} --server={openshift_api_server_url}

Cluster API

{openshift_api_server_url}[{openshift_api_server_url},window=_blank]

OpenShift Username

{openshift_cluster_admin_username}

OpenShift Password

{openshift_cluster_admin_password}

Instructions

  1. Ensure you are logged in to the OpenShift CLI as the admin user from the terminal window on the right side of your screen and continue to the next step.

  2. Check which virtual machines are running in the dynamic-schedule namespace

    oc get vmi -n dynamic-schedule
    NAME    AGE   PHASE     IP             NODENAME
    vm01    10m   Running   10.129.2.111   worker01
    vm02    10m   Running   10.129.2.112   worker01
  3. Observe the Descheduler Operator configuration

    oc -n openshift-kube-descheduler-operator get kubedescheduler cluster -o yaml
    apiVersion: operator.openshift.io/v1
    kind: KubeDescheduler
    metadata:
      name: cluster
      namespace: openshift-kube-descheduler-operator
    spec:
      logLevel: Normal
      mode: Predictive
      operatorLogLevel: Normal
      profileCustomizations:
        devEnableSoftTainter: true
        devDeviationThresholds: AsymmetricLow
        devActualUtilizationProfile: PrometheusCPUCombined
      profiles:
        - KubeVirtRelieveAndMigrate
      # Interval at which the descheduler runs; set a lower value for testing purposes:
      deschedulingIntervalSeconds: 3600
      managementState: Managed
    Notice the mode is set to Predictive. By default, the descheduler does not evict pods. To evict pods, set mode to Automatic.
  4. Ensure the descheduler pod is running and check the logs in the openshift-kube-descheduler-operator namespace

    oc logs -n openshift-kube-descheduler-operator -l app=descheduler
  5. Deploy a stress test pod to validate the descheduler functionality

    1. Edit the existing stress-test deployment to use a specific node selector, placing it on a node where at least one VM is running

      oc -n dynamic-schedule-stress-test patch deployment stress-test --type merge -p '{"spec":{"template":{"spec":{"nodeSelector":{"kubernetes.io/hostname":"worker01"}}}}}'
    2. Scale the deployment

      oc -n dynamic-schedule-stress-test scale deployment stress-test --replicas=1
    3. Verify the deployment

      oc get pods -n dynamic-schedule-stress-test
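
For reference, the stress-test deployment patched and scaled above typically looks something like the following sketch (the image, arguments, and resource values are illustrative; the lab's actual manifest may differ):

```yaml
# Illustrative sketch of a CPU-burning stress deployment; do not apply,
# the lab already provides one in dynamic-schedule-stress-test.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: stress-test
  namespace: dynamic-schedule-stress-test
spec:
  replicas: 1
  selector:
    matchLabels:
      app: stress-test
  template:
    metadata:
      labels:
        app: stress-test
    spec:
      nodeSelector:
        kubernetes.io/hostname: worker01   # set by the patch above
      containers:
        - name: stress
          image: quay.io/example/stress:latest   # illustrative image
          args: ["--cpu", "4"]                   # illustrative stress arguments
          resources:
            requests:
              cpu: "2"
```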
  6. Observe the descheduler pod logs in the openshift-kube-descheduler-operator namespace

    oc logs -f -n openshift-kube-descheduler-operator -l app=descheduler
  7. Edit the kubedescheduler cluster resource to set the deschedulingIntervalSeconds to a lower value for testing purposes.

    oc -n openshift-kube-descheduler-operator patch kubedescheduler cluster --type merge -p '{"spec":{"deschedulingIntervalSeconds":600}}'
    After the shorter interval elapses, the descheduler logs show a dry-run eviction cycle:
    I1113 15:24:57.371043       1 lownodeutilization.go:261] "Number of overutilized nodes" totalNumber=1
    I1113 15:24:57.371052       1 nodeutilization.go:174] "Total capacity to be moved" MetricResource=143
    I1113 15:24:57.371273       1 nodeutilization.go:189] "Pods on node" node="worker01" allPods=72 nonRemovablePods=45 removablePods=27
    I1113 15:30:00.673653       1 nodeutilization.go:205] "Evicting pods based on priority, if they have same priority, they'll be evicted based on QoS tiers"
    I1113 15:30:00.673755       1 evictions.go:551] "Evicted pod in dry run mode" pod="dynamic-schedule/virt-launcher-vm01" reason="" strategy="LowNodeUtilization" node="worker01" profile="DevKubeVirtRelieveAndMigrate"
    Because the mode is set to Predictive, the eviction runs in dry-run mode. To actually evict pods, set the mode to Automatic.
  8. Switch the descheduler mode to Automatic

    oc -n openshift-kube-descheduler-operator patch kubedescheduler cluster --type merge -p '{"spec":{"mode":"Automatic"}}'
  9. Verify the descheduler pod logs in the openshift-kube-descheduler-operator namespace

    oc logs -f -n openshift-kube-descheduler-operator -l app=descheduler
  10. Confirm the pods are evicted and rescheduled on another node

    oc get pods -n dynamic-schedule -o wide
    NAME                     READY   STATUS    RESTARTS   AGE     IP             NODE       NOMINATED NODE   READINESS GATES
    virt-launcher-vm01       1/1     Running   0          10m     10.129.2.111   worker02   <none>           <none>
    virt-launcher-vm02       1/1     Running   0          10m     10.129.2.112   worker02   <none>           <none>
  11. You can confirm that eviction is working by placing the stress-test pod on the node where the VMs are now running and confirming that they are evicted and rescheduled onto another node

    oc -n dynamic-schedule-stress-test patch deployment stress-test --type merge -p '{"spec":{"template":{"spec":{"nodeSelector":{"kubernetes.io/hostname":"worker02"}}}}}'
    oc get pods -n dynamic-schedule -o wide