Optimizing LLMs for inference with vLLM on RHEL AI and Red Hat OpenShift AI

What you will learn

This lab will teach you how to optimize AI models using Red Hat’s AI model optimization tools on Red Hat OpenShift® AI, focusing on:

  • OpenShift AI Platform: Get familiar with OpenShift AI’s capabilities for data science projects.

  • Model Optimization: Master the use of llm-compressor to adapt models to your specific hardware.

  • Quantization Techniques: Learn different approaches to model quantization (int4, fp8, int8).

  • Model Evaluation: Understand how to assess model performance using lm-eval.

  • Deployment Optimization: Leverage the vLLM inference server through OpenShift AI for efficient model serving.

The information, code, pipelines, and techniques in this lab are illustrations of what a first prototype could look like.

Lab Walkthrough

This lab is designed to provide you with a structured approach to understanding and utilizing the tools and techniques essential for AI model optimization. Below is an overview of the key activities you will engage in:

  • Get Familiar with OpenShift AI

    • Explore the capabilities of OpenShift AI, which will serve as the foundation for your data science projects.

    • Create Data Science Projects: Establish your own Data Science project to organize and manage your work effectively.

    • Create Data Connections: Learn how to connect various data sources, ensuring that your projects have access to the necessary information.

    • Create Pipeline Servers: Set up a pipeline server, which is crucial for managing and executing your workflows efficiently.

    • Create Workbenches: Access your workbench, a dedicated space for reviewing content, running experiments, and collaborating with peers.

    • Define Pipelines: Design the pathways through which your data will flow, optimizing each step of the process.

  • Optimize Models with llm-compressor

    • Engage in the optimization process by using llm-compressor to enhance your models.

    • Work with llm-compressor through Workbenches and Pipelines.

    • Apply optimization techniques directly within your workbench and pipelines to improve model performance (a minimal quantization sketch follows this list).

  • Evaluate Models with lm-eval

    • Assess the effectiveness of your models using lm-eval. This evaluation will provide insights into their performance and areas for improvement.

  • Deploy Models with vLLM (ServingRuntime)

    • Deploy your models using vLLM, with and without optimization.

    • Compare the performance of your base model against the optimized version to understand the impact of your efforts (a sample client request is sketched at the end of this walkthrough).
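
To give a feel for the optimization step before you reach the notebooks, the snippet below is a minimal sketch of one-shot FP8 quantization with llm-compressor, modeled on its publicly documented examples. The model ID is only a placeholder, and import paths may differ slightly between llm-compressor releases; the Section 3 notebooks are the reference for this lab.

```python
# Illustrative one-shot FP8 quantization with llm-compressor.
# The model ID is a placeholder and import paths can vary between
# llm-compressor releases; the lab notebooks contain the exact recipes.
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

MODEL_ID = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"  # placeholder "small" LLM

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Quantize the weights and activations of all Linear layers to FP8,
# keeping lm_head in higher precision to preserve accuracy.
recipe = QuantizationModifier(targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"])
oneshot(model=model, recipe=recipe)

# Save the compressed checkpoint so vLLM can serve it directly.
save_dir = MODEL_ID.split("/")[-1] + "-FP8-Dynamic"
model.save_pretrained(save_dir, save_compressed=True)
tokenizer.save_pretrained(save_dir)
```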

This walkthrough is designed to equip you with the skills and knowledge needed to get started on your AI optimization projects. We encourage you to engage fully with each step as you progress through the workshop.
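
Once a model (base or optimized) is served through the vLLM ServingRuntime, it exposes an OpenAI-compatible HTTP API. The snippet below is a minimal sketch of the kind of request used when comparing deployments; the endpoint URL, token, and model name are placeholders, and the lab's request.py script is the reference implementation.

```python
# Illustrative client for a vLLM OpenAI-compatible endpoint.
# ENDPOINT, MODEL_NAME, and API_TOKEN are placeholders; use the inference
# route and model name of your own OpenShift AI deployment.
import requests

ENDPOINT = "https://<your-inference-route>/v1/completions"
MODEL_NAME = "<your-model-name>"
API_TOKEN = "<your-token-if-authentication-is-enabled>"

payload = {
    "model": MODEL_NAME,
    "prompt": "Explain model quantization in one sentence.",
    "max_tokens": 64,
    "temperature": 0.0,
}
headers = {"Authorization": f"Bearer {API_TOKEN}"}

# Send the completion request and print the generated text.
response = requests.post(ENDPOINT, json=payload, headers=headers, timeout=60)
response.raise_for_status()
print(response.json()["choices"][0]["text"])
```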

Disclaimer

This lab is an example of what a customer could use to optimize their models for faster inference while reducing hardware costs with OpenShift AI.

This lab uses "small" large language models (LLMs) to speed up the process, but the same techniques apply to larger models.

Repository

The repository at https://github.com/rhpds/showroom-summit2025-lb2959-neural-magic.git contains workshop materials and examples for creating optimized models with Neural Magic solutions on OpenShift.

Repository Structure

Lab Content

The lab content is in the content folder, under the modules/ROOT/pages directory.

Lab Material

The lab-materials/ directory contains the materials for each section of the lab:

  • Section 3 (folder 03):

    • Jupyter notebooks for model optimization techniques and evaluation:

      • Weight quantization: int4

      • Weight and activation quantization: fp8 and int8

  • Section 4 (folder 04):

    • Python script for testing deployed models: request.py

  • Section 5 (folder 05):

    • Pipeline source file (quantization_pipeline.py) for optimizing models while maintaining accuracy; it also includes evaluation steps (a minimal lm-eval sketch follows this list).
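
As a rough illustration of how the evaluation steps in the notebooks and pipeline can be driven from Python, here is a minimal lm-eval sketch. The checkpoint path, task, and parameters are placeholders; the Section 3 notebooks and the Section 5 pipeline contain the configuration actually used in the lab.

```python
# Illustrative lm-eval run against a local (possibly quantized) checkpoint.
# The checkpoint path and task are placeholders; see the lab materials for
# the exact evaluation configuration.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                                        # evaluate a Hugging Face checkpoint
    model_args="pretrained=./my-quantized-model,dtype=auto",
    tasks=["gsm8k"],                                   # placeholder task
    num_fewshot=5,
    batch_size=8,
)
print(results["results"])
```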

Prerequisites

  • OpenShift AI

  • NVIDIA GPUs