Optimizing LLMs for inference with vLLM on RHEL AI and Red Hat OpenShift AI
What you will learn
This lab will teach you how to optimize AI models using Red Hat’s AI model optimization tools on Red Hat OpenShift® AI, focusing on:
- OpenShift AI Platform: Get familiar with OpenShift AI’s capabilities for data science projects.
- Model Optimization: Master the use of `llm-compressor` to adapt models to your specific hardware.
- Quantization Techniques: Learn different approaches to model quantization (`int4`, `fp8`, `int8`).
- Model Evaluation: Understand how to assess model performance using `lm-eval`.
- Deployment Optimization: Leverage the vLLM inference server through OpenShift AI for efficient model serving.
The information, code, pipelines, and techniques in this lab are illustrations of what a first prototype could look like.
Lab Walkthrough
This lab is designed to provide you with a structured approach to understanding and using the tools and techniques essential for AI model optimization. Below is an overview of the key activities you will engage in:
- Get Familiar with OpenShift AI
  - Explore the capabilities of OpenShift AI, which will serve as the foundation for your data science projects.
  - Create Data Science Projects: Establish your own data science project to organize and manage your work effectively.
  - Create Data Connections: Learn how to connect various data sources, ensuring that your projects have access to the necessary information.
  - Create Pipeline Servers: Set up a pipeline server, which is crucial for managing and executing your workflows efficiently.
  - Create Workbenches: Access your workbench, a dedicated space for reviewing content, running experiments, and collaborating with peers.
  - Define Pipelines: Design the pathways through which your data will flow, optimizing each step of the process (a minimal pipeline sketch follows this step).
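As an illustration of the pipeline definition step, here is a minimal sketch of what a pipeline could look like with the Kubeflow Pipelines (kfp v2) SDK, which OpenShift AI data science pipelines build on. The component, pipeline, and model names are hypothetical placeholders, not the lab's actual pipeline code.

```python
# A hypothetical, minimal kfp v2 pipeline; all names are placeholders.
from kfp import compiler, dsl


@dsl.component(base_image="python:3.11")
def quantize_model(model_id: str, output_dir: str) -> str:
    # Placeholder step: a real pipeline would invoke llm-compressor here.
    print(f"Quantizing {model_id} into {output_dir}")
    return output_dir


@dsl.pipeline(name="quantization-pipeline")
def quantization_pipeline(model_id: str = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"):
    quantize_model(model_id=model_id, output_dir="/tmp/quantized")


if __name__ == "__main__":
    # Compile to a YAML file that can be imported into the pipeline server.
    compiler.Compiler().compile(quantization_pipeline, "quantization_pipeline.yaml")
```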
- Optimize Models with llm-compressor
  - Engage in the optimization process by using `llm-compressor` to enhance your models.
  - Work with `llm-compressor` through Workbenches and Pipelines: Apply optimization techniques directly within your workbench and pipelines to improve model performance (see the sketch below).
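To give a flavor of that workflow, here is a minimal sketch of a one-shot `llm-compressor` quantization run, assuming a recent llm-compressor release (the import path has changed between versions). The model, calibration dataset, and output directory are placeholders.

```python
# A minimal llm-compressor sketch; model, dataset, and paths are placeholders.
from llmcompressor import oneshot  # older releases: llmcompressor.transformers
from llmcompressor.modifiers.quantization import GPTQModifier

oneshot(
    model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    dataset="open_platypus",  # calibration data for GPTQ
    recipe=GPTQModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"]),
    output_dir="TinyLlama-1.1B-Chat-v1.0-W4A16",
    max_seq_length=2048,
    num_calibration_samples=512,
)
```

The recipe quantizes the weights of every Linear layer to 4 bits while keeping the `lm_head` output layer in full precision, a common trade-off between size and accuracy.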
- Evaluate Models with lm-eval
  - Assess the effectiveness of your models using `lm-eval`. This evaluation will provide insights into their performance and areas for improvement (see the sketch below).
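Here is a minimal sketch of what a programmatic evaluation could look like with lm-eval's Python API; the model path and task choice are placeholders, and the lab may drive lm-eval through its command-line interface instead.

```python
# A minimal lm-eval sketch; the model path and task are placeholders.
from lm_eval import simple_evaluate

results = simple_evaluate(
    model="hf",                                              # Hugging Face backend
    model_args="pretrained=TinyLlama-1.1B-Chat-v1.0-W4A16",  # placeholder path
    tasks=["gsm8k"],
)
# Per-task metrics (accuracy, etc.) are under the "results" key.
print(results["results"])
```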
- Deploy Models with vLLM (ServingRuntime)
  - Deploy your models using vLLM, with and without optimization.
  - Compare the performance of your base model against the optimized version to understand the impact of your efforts (see the sketch below).
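As an illustration of such a comparison, here is a minimal sketch that sends the same prompt to two vLLM deployments through their OpenAI-compatible completions endpoints and times the responses. The endpoint URLs and served model names are hypothetical; substitute the routes and names OpenShift AI exposes for your deployments.

```python
# A minimal comparison sketch; URLs and the model name are hypothetical.
import time

import requests

ENDPOINTS = {
    "base": "https://tinyllama-base.example.com/v1/completions",
    "optimized": "https://tinyllama-w4a16.example.com/v1/completions",
}

payload = {
    "model": "tinyllama",  # the served model name of each deployment
    "prompt": "Explain quantization in one sentence.",
    "max_tokens": 64,
}

for name, url in ENDPOINTS.items():
    start = time.perf_counter()
    response = requests.post(url, json=payload, timeout=120)
    response.raise_for_status()
    elapsed = time.perf_counter() - start
    print(f"{name}: {elapsed:.2f}s -> {response.json()['choices'][0]['text']!r}")
```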
This walkthrough is designed to equip you with the skills and knowledge you need to get started on your AI optimization projects. We encourage you to engage fully with each step as you progress through the workshop.
Disclaimer
This lab is an example of what a customer could use to optimize their models for enhanced inference while reducing hardware costs using OpenShift AI.
It uses "small" large language models (LLMs) to speed up the process, but the same techniques apply to larger models.
Repository
This repository (https://github.com/rhpds/showroom-summit2025-lb2959-neural-magic.git) contains the workshop materials and examples for creating optimized models with Neural Magic solutions on OpenShift.
Repository Structure
Lab Material
The `lab-materials/` directory contains the materials for each section:

- Section 3 (folder `03`): Jupyter notebooks for the model optimization techniques and their evaluation (see the recipe sketch after this list):
  - Weights quantization: `int4`
  - Weights and activations quantization: `fp8` and `int8`
- Section 4 (folder `04`): Python script for testing deployed models (`request.py`).
- Section 5 (folder `05`): Pipeline source file (`quantization_pipeline.py`) for optimizing models while maintaining accuracy; it also includes evaluation steps.
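As a pointer to what the section 03 notebooks cover, here is a minimal sketch of the three recipe styles, assuming llm-compressor's scheme naming conventions; the exact recipes in the notebooks may differ.

```python
# A sketch of the three quantization recipe styles; scheme names assume
# llm-compressor conventions, and the notebooks' specifics may differ.
from llmcompressor.modifiers.quantization import GPTQModifier, QuantizationModifier

# int4: weights-only quantization (W4A16), calibration-based via GPTQ.
int4_recipe = GPTQModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"])

# int8: weights and activations quantization (W8A8), also calibration-based.
int8_recipe = GPTQModifier(targets="Linear", scheme="W8A8", ignore=["lm_head"])

# fp8: weights and activations with dynamic activation scales; no calibration needed.
fp8_recipe = QuantizationModifier(targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"])
```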