Module 4: Model serving

This module explores the model serving infrastructure that provides real-time inference capabilities. You’ll create two complementary model servers, configure their endpoints, and test inference to verify predictions.

Learning objectives

By the end of this module, you will be able to:

  • Understand model serving architecture using KServe and OpenVINO

  • Deploy single-model serving platforms in Red Hat OpenShift AI

  • Create and configure Stress Detection and Time to Failure model servers

  • Query model endpoints using the KServe v2 REST API

  • Interpret model predictions (stress scores and time estimates)

  • Understand why models are validated on SNO model servers before deployment to robots

Exercise 4.1: Deploy Stress Detection model server

Model servers provide real-time inference capabilities by loading trained models and exposing REST API endpoints. You’ll deploy the Stress Detection server to validate newly trained models before deployment to robots.

Although models run on robots for production inference, the SNO environment also serves models for:

  • Validation - Test newly trained models before deploying to robots

  • Comparison - Benchmark new models against existing ones

  • Development - Quick testing during model development

Configure the Stress Detection model server

  1. Ensure you’re in the ai-edge-project project in the Red Hat OpenShift AI Dashboard

  2. Click the Deployments tab

  3. Click Deploy model

    Single Model Serving selection
  4. Complete the deployment form with the following configuration (click Next between sections to complete the entire form):

    Table 1. Stress Detection model deployment configuration

    Parameter              Value                  Purpose
    Model location         Existing connection    Uses the Robot MinIO DataConnection
    Connection             vehicle-models         Should be detected automatically
    Path                   stress-detection       Model directory (KServe auto-detects /1/ subdirectory)
    Model type             Predictive model       Standard inference model type
    Model deployment name  Stress Detection       Identifier for the deployed model
    Model framework        openvino_ir - opset13  Matches exported model format from training
    Serving runtime        OpenVINO Model Server  Optimized for Intel architectures and edge devices
    Number of replicas     1                      Single instance for testing purposes
    Deployment strategy    Rolling update         Zero-downtime deployment updates

  5. Click Next to advance to the Review section, then click Deploy model

  6. Once the Status shows Running, check the Internal Endpoint

    Stress Detection model server endpoint

Verify

✓ Stress Detection model deployment appears in Deployments tab

✓ Status shows Running

✓ Endpoint URL is displayed: http://stress-detection-predictor.ai-edge-project.svc.cluster.local
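Besides checking the dashboard, you can confirm the model is serving from a workbench terminal or notebook cell using the KServe v2 per-model readiness endpoint. This is a minimal sketch; the port and the model name `stress-detection` are assumptions based on this lab's deployment, so adjust them to match your Internal Endpoint:

```python
# Readiness check against the KServe v2 REST API (runs only inside the cluster).
# BASE_URL and MODEL_NAME are assumptions taken from this lab's deployment.
import urllib.request

BASE_URL = "http://stress-detection-predictor.ai-edge-project.svc.cluster.local:8888"
MODEL_NAME = "stress-detection"

def ready_url(base_url: str, model_name: str) -> str:
    """Build the KServe v2 per-model readiness URL."""
    return f"{base_url}/v2/models/{model_name}/ready"

def check_ready(base_url: str, model_name: str) -> bool:
    """Return True when the server reports the model as ready (HTTP 200)."""
    with urllib.request.urlopen(ready_url(base_url, model_name), timeout=5) as resp:
        return resp.status == 200

# Example (cluster-internal only):
# print(check_ready(BASE_URL, MODEL_NAME))
```

A 200 response here means the model has been loaded from storage and can accept inference requests.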

Exercise 4.2: Deploy Time to Failure model server

The Time to Failure server predicts the remaining operational life of robot batteries. You’ll deploy this second model server using the same process, demonstrating how multiple models can share infrastructure while maintaining independent configurations.

Configure the Time to Failure model server

  1. Ensure you’re in the Deployments tab

  2. Click Deploy model to add a second model server

  3. Complete the deployment form with the following configuration (click Next between sections to complete the entire form):

    Table 2. Time to Failure model deployment configuration

    Parameter              Value                  Purpose
    Model location         Existing connection    Uses the Robot MinIO DataConnection
    Connection             vehicle-models         Should be detected automatically
    Path                   time-to-failure        Different path for independent versioning
    Model type             Predictive model       Standard inference model type
    Model deployment name  Time to Failure        Identifier for the deployed model
    Model framework        openvino_ir - opset13  Same format as Stress Detection model
    Serving runtime        OpenVINO Model Server  Optimized for Intel architectures and edge devices
    Number of replicas     1                      Single instance for testing purposes
    Deployment strategy    Rolling update         Zero-downtime deployment updates

  4. Click Next to advance to the Review section, then click Deploy model

  5. Once the Status shows Running, check the Internal Endpoint

    Time to Failure model server endpoint

Both models share the same infrastructure (runtime and connection) but maintain independent configurations and storage paths.
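Behind the dashboard form, this pattern corresponds roughly to two KServe InferenceService resources that differ only in name and storage path. The sketch below is illustrative only: the resource names, runtime identifier, and exact field names are assumptions and vary by KServe/OpenShift AI version.

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: stress-detection          # deployment name (assumption)
spec:
  predictor:
    model:
      modelFormat:
        name: openvino_ir         # matches the exported model format
      runtime: ovms               # OpenVINO Model Server runtime (name is an assumption)
      storage:
        key: vehicle-models       # shared data connection
        path: stress-detection    # per-model path for independent versioning
---
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: time-to-failure
spec:
  predictor:
    model:
      modelFormat:
        name: openvino_ir
      runtime: ovms
      storage:
        key: vehicle-models
        path: time-to-failure     # only the name and path differ
```

Keeping the runtime and connection shared while separating the paths is what lets each model be versioned and redeployed independently.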

Verify

✓ Deployments tab shows two deployments: Stress Detection and Time to Failure

✓ Both deployments show Status Running

✓ Time to Failure endpoint URL is displayed: http://time-to-failure-predictor.ai-edge-project.svc.cluster.local

Exercise 4.3: Query model endpoints

Test the model inference endpoints to verify they respond correctly to prediction requests. You’ll use a Jupyter notebook to send test data and receive predictions.

Access the query notebook

  1. Return to the JupyterLab environment (model-training workbench)

  2. Navigate to ai-lifecycle-edge-automation/notebooks/serving/

  3. Open the query_models.ipynb notebook

JupyterLab showing query_models.ipynb notebook in the serving folder

Query Stress Detection model

The notebook contains code to test the Stress Detection endpoint.

  1. Verify the endpoint URL in the first cell

    BASE_URL = "http://stress-detection-predictor.ai-edge-project.svc.cluster.local:8888"
  2. Execute the cells to:

    1. Define the inference endpoint

    2. Prepare test input data (9 normalized feature values)

    3. Send POST request with test data

    4. Receive and display prediction

  3. The model returns a stress score between 0 and 1. The notebook interprets:

    • Score > 0.5 → STRESSED

    • Score ≤ 0.5 → NORMAL
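The steps above can be sketched as a KServe v2 infer call. This is a hedged approximation of what the notebook does, not a copy of it: the tensor name `input` and the model name `stress-detection` are assumptions, so check the notebook cells for the actual values.

```python
# Sketch of a KServe v2 inference request to the Stress Detection endpoint.
# Tensor name, model name, and feature values are illustrative assumptions.
import json
import urllib.request

BASE_URL = "http://stress-detection-predictor.ai-edge-project.svc.cluster.local:8888"

def build_payload(features):
    """Wrap the 9 normalized feature values in a KServe v2 infer request body."""
    return {
        "inputs": [{
            "name": "input",                 # tensor name: assumption
            "shape": [1, len(features)],
            "datatype": "FP32",
            "data": list(features),
        }]
    }

def interpret(score: float) -> str:
    """Apply the notebook's 0.5 threshold to the stress score."""
    return "STRESSED" if score > 0.5 else "NORMAL"

def query_stress(features):
    """POST the features and return the first output value (the stress score)."""
    req = urllib.request.Request(
        f"{BASE_URL}/v2/models/stress-detection/infer",  # model name: assumption
        data=json.dumps(build_payload(features)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=5) as resp:
        return json.load(resp)["outputs"][0]["data"][0]

# Example (cluster-internal only):
# score = query_stress([0.1] * 9)
# print(score, interpret(score))
```

The response body follows the v2 protocol as well: the prediction comes back in `outputs[0].data`, which the notebook then thresholds as shown above.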

Query Time to Failure model

Scroll down in the notebook to the Time to Failure section.

  1. Verify the Time to Failure endpoint URL

    BASE_URL = "http://time-to-failure-predictor.ai-edge-project.svc.cluster.local:8888"
  2. Execute the cells to:

    1. Define the TTF inference endpoint

    2. Prepare test input data (5 feature values)

    3. Send POST request

    4. Receive and display time-to-failure prediction

  3. The model returns the predicted number of hours until battery failure.
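The same v2 call pattern works for the Time to Failure model; only the base URL, model name, and feature count change. A minimal generic sketch, where the tensor name `input` and the model name `time-to-failure` are assumptions to be checked against the notebook:

```python
# Generic KServe v2 infer helper; tensor and model names are assumptions.
import json
import urllib.request

def build_v2_request(features):
    """Wrap one row of feature values in a KServe v2 infer payload."""
    return {
        "inputs": [{
            "name": "input",                 # tensor name: assumption
            "shape": [1, len(features)],
            "datatype": "FP32",
            "data": list(features),
        }]
    }

def v2_infer(base_url, model_name, features, timeout=5):
    """POST to {base_url}/v2/models/{model_name}/infer and return the output data."""
    req = urllib.request.Request(
        f"{base_url}/v2/models/{model_name}/infer",
        data=json.dumps(build_v2_request(features)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)["outputs"][0]["data"]

# Cluster-internal example: 5 feature values in, predicted hours until failure out.
# hours = v2_infer(
#     "http://time-to-failure-predictor.ai-edge-project.svc.cluster.local:8888",
#     "time-to-failure",
#     [0.2, 0.4, 0.1, 0.9, 0.3],
# )[0]
```

Because this is a regression model, the returned value is read directly as hours rather than thresholded like the stress score.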

Verify

✓ Stress Detection endpoint responds successfully

✓ Time to Failure endpoint responds successfully

✓ Both predictions return valid numeric values

Summary

You have successfully deployed and tested the model serving infrastructure:

✓ Stress Detection Server - Deployed OpenVINO Model Server for stress detection inference

✓ Time to Failure Server - Deployed complementary model server with separate storage path

✓ Single-Model Serving - Configured two independent model servers sharing infrastructure

✓ Inference Endpoints - Tested both models using KServe v2 REST API protocol

✓ Predictions Verified - Confirmed stress scores and time estimates are valid

What You’ve Learned:

  • How to deploy single-model serving platforms in Red Hat OpenShift AI

  • Configuring model servers with OpenVINO runtime for edge-optimized inference

  • Using data connections to link model servers to S3-compatible storage

  • RawDeployment mode for predictable resources in edge environments

  • KServe v2 protocol for REST API inference

  • How to query model endpoints with JSON payloads

  • Interpreting model outputs (classification scores and regression predictions)

The model serving infrastructure is fully operational and validated.

Next, you’ll explore the automated pipeline system that retrains models every 10 minutes using fresh data from the robot fleet.