Model Deployment and Configuration

In this module, we will quickly review the different OpenShift AI components and learn how to deploy a model on the platform.

Getting connected to OpenShift AI

For the purposes of this workshop session, we have provisioned an OpenShift Container Platform cluster for each participant, with Red Hat OpenShift AI deployed on it.

Each person attending this lab has full admin access (cluster-admin) to their cluster, which is needed to complete the different exercises and review the various configurations.

If you are viewing these instructions through the workshop environment, the information below will render with your actual values. If not, you will see placeholder values instead.

  • Your username: userX

  • Your password: openshift

At any point, you can access the OpenShift Container Platform console by going to the following URL: https://console-openshift-console.apps.cluster-guid.guid.sandbox.opentlc.com/. This link is also available in the Links section in the top right corner of these instructions.

For now, we are going to use the OpenShift AI Dashboard, which is a web-based interface that allows you to manage and monitor your OpenShift AI environment.

  • Click to open the following URL in a new tab or window and log in:

  • Click on the Login with OpenShift button:

    02 01 login3
  • Select rhsso

    02 01 rhsso
  • If needed, enter your credentials (as detailed above)

    02 01 login1
  • Because the password is so simple (openshift), your browser might display a scary warning about it. It is safe to ignore this message in this workshop environment.

  • Don’t worry if you do not see this warning.

  • After you authenticate, your browser window should look like:

    02 01 rhoai front page

If you got this far and saw all of the above, congratulations: you are properly connected to the Red Hat OpenShift AI Dashboard application!

We are now ready to start!

Overview of OpenShift AI Components

In this section, we will review the different OpenShift AI components and discuss their features.

Selecting the LLM Host Project

  • As you are a cluster admin, you currently see a lot of projects. The one we will be using in the workshop is the LLM Host project. To access it, click on Go to Data Science Projects at the bottom of the Data Science Projects section.

    02 got to dsp
  • In the list of projects, you should see the LLM Host project. Click on it to access the project.

    02 LLM Host project

The OpenShift AI Components

02 project tabs

The different components of OpenShift AI that you can access through the different tabs are:

  • Workbench: This is where you can create and manage various development environments like JupyterLab, VSCode, or other custom Workbenches. It provides a user-friendly interface for data scientists to work with notebooks, libraries, and datasets. There is already a Custom Workbench deployed in the project to explore the storage space where our models are. We will come to it later.

  • Pipelines: We won’t use Pipelines in this workshop, but you can use them to automate the processing of data or the training and deployment of machine learning models.

  • Models: This is where you can manage and deploy machine learning models. You can create, update, and delete models, as well as monitor their performance and usage. You can see that a model is already deployed as an example. We will come to it later.

  • Cluster storage: This is where you can manage the storage resources used by your models and workbenches. You can create, update, and delete storage resources, as well as monitor their usage. At the moment you can see the storage used by the workbench we deployed.

  • Connections: This is where you can manage the connections between your workbenches or model runtimes and other services, such as storage (S3), databases or APIs. You can create, update, and delete connections, as well as see which environment is using them.

  • Permissions: This is where you can manage the permissions for the project. You can create, update, and delete permissions, as well as see which users or groups have access to which resources.

Reviewing and Testing the Deployed Model

Reviewing the Model Configuration

  • As we mentioned before, a model is already deployed in the LLM Host project. To access it, click on the Models tab in the project. The model deployed is the Granite-3.2 model. It is designed to generate human-like text and can be used for various natural language processing tasks.

    02 granite model overview
  • If you click on the expand button at the left of the model, you will see more information about its configuration, such as the resources allocated to it.

    02 granite details
  • If you click on Internal and external endpoint details, you will see the different endpoints available for the model. The Internal endpoint is used to access the model from within the OpenShift cluster, while the External endpoint is used to access the model from outside the cluster (in this case the model is only accessible internally, so the external endpoint address is a placeholder one).

    02 granite endpoints

At this point, you can see that the model is running and ready to be used. We won’t go into more detail at the moment, as you will be able to deploy your own model later in the workshop.

Reviewing the Connection

  • Switching to the Connections tab, you can see there is already a connection called models configured in the project.

    02 connections models overview

You can see that the Granite model server uses this connection, as does the ODH-TEC workbench, which we will explore later.

  • As we will reuse the same connection for the model we will deploy later, let’s take a look at its configuration. Click on the three dots on the right of the connection and select Edit.

    02 models connection edit
  • You can see that this is an S3 connection to object storage provided by Minio. Scrolling down, you can see the different parameters configured for this connection.

    02 models connection details

If you are familiar with S3, you will recognize the different parameters. The Access Key and Secret Key are used to authenticate to the Minio server, while the Bucket parameter is used to specify the bucket where the models are stored. The Endpoint parameter is used to specify the Minio server address.

This information is automatically injected as environment variables into any model runtime or workbench that uses the connection, so you don’t have to set those parameters manually every time. That’s both a time saver and a security feature, as you never have to expose the credentials in your code.
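
For example, from a terminal in a workbench that uses this connection, you can quickly confirm which values were injected. This is just a sanity check, assuming the usual AWS_* variable names that OpenShift AI data connections expose; the exact names may vary slightly between versions:

    # List the S3-related environment variables injected by the "models" connection
    # (assumes the standard AWS_* naming used by OpenShift AI data connections)
    env | grep '^AWS_' | sort
    # Expect entries such as AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY,
    # AWS_S3_ENDPOINT and AWS_S3_BUCKET, with the values you saw in the form.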

Reviewing the Model Files

  • As we said, the model is stored in Minio, an S3-compatible object store. To see those files, we are going to use the ODH-TEC workbench that is already deployed in the project. To access it, click on the Workbench tab in the project, then on Open.

    02 odh tec open
  • This will open a new tab. After logging in and accepting the license, you will access ODH-TEC. It is a simple tool used to manage and view S3 storage.

    02 odh tec

As you can see, ODH-TEC is already configured to use the models connection we just reviewed. You can see the different buckets available in the Minio server, as well as the files stored in each bucket. Take some time to look at the different folders and files. Two models are available: the Granite model we saw, and TinyLlama. We are going to deploy the latter in the next section.
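
If you prefer the command line, you can get the same listing from a workbench terminal. This is an optional sketch, assuming the AWS CLI is installed in your workbench image (it is not required for the workshop) and that the AWS_* variables from the connection are present:

    # List the contents of the models bucket through the Minio S3 API
    # (assumes the aws CLI is available and the connection's AWS_* variables are injected)
    aws --endpoint-url "$AWS_S3_ENDPOINT" s3 ls "s3://$AWS_S3_BUCKET/" --recursive --human-readable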

You can leave this tab open for now, we will come back to it later.

Testing the Model

To test the model, we are going to use the Granite model endpoint. You can use any HTTP client to test the model, but we will use curl for this workshop.

  • To do so, we will first create a new Workbench based on VSCode. On the Workbench tab, click on Create workbench.

    02 create workbench
  • Create your workbench with the following parameters:

    • Name: Model-Test

    • Image selection: code-server

    • Version selection: 2024.2

    • Deployment size: Standard (default and only option in this cluster)

    • Accelerator: None

    • Persistent storage size: 5 GiB (default)

  • See screenshot below for reference:

    02 create workbench full 1
    02 create workbench full 2
  • Then click on Create workbench at the bottom.

    02 create workbench button
  • Your new workbench should appear in the list of workbenches in Starting mode. After a few moments, you can click on Open to access it.

  • After logging in, you will access the VSCode interface. You can close the Welcome tab if you want.

  • Open a Terminal in VSCode by clicking on the Terminal menu, then New Terminal. You can also use the shortcut Ctrl + Shift + ` (the backtick key).

    02 open terminal
  • In the terminal, you can use curl to test the model. The command to use is:

    curl -X POST \
      -H "Content-Type: application/json" \
      -d '{"model": "ibm-granite/granite-3.2-8b-instruct", "prompt": "Hello, how are you?"}' \
      http://granite-3dot2-8b-instruct-predictor.llm-hosting.svc.cluster.local:8080/v1/completions | jq
    When pasting content for the first time, your workbench may prompt you to grant paste access. Please allow it to enable copy-paste functionality.
  • The answer should look like this:

    {
        "id":"cmpl-061bbe3b71394ef98765f44e93c0df9b",
        "object":"text_completion",
        "created":1746042889,
        "model":"ibm-granite/granite-3.2-8b-instruct",
        "choices":[
            {
                "index":0,
                "text":"\",\"assistant_response\": \"Hello! I'm an artificial intelligence,",
                "logprobs":null,
                "finish_reason":"length",
                "stop_reason":null,"prompt_logprobs":null
            }
            ],
        "usage":{
            "prompt_tokens":6,
            "total_tokens":22,
            "completion_tokens":16
            }
    }
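
Note that finish_reason is "length": the completion stopped because it hit the token limit (the default for this OpenAI-style completions API appears to be 16 tokens here, matching completion_tokens in the output). If you want a longer answer, you can pass an explicit max_tokens value; the following is a small variant of the same request, against the same endpoint as above:

    # Same request, asking for a longer completion by setting max_tokens explicitly
    curl -X POST \
      -H "Content-Type: application/json" \
      -d '{"model": "ibm-granite/granite-3.2-8b-instruct", "prompt": "Hello, how are you?", "max_tokens": 100}' \
      http://granite-3dot2-8b-instruct-predictor.llm-hosting.svc.cluster.local:8080/v1/completions | jq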

Perfect, we have verified that the model is working and we can access it from our workbench. Now, let’s try to replicate this with the TinyLlama model. We will deploy it in the next section.

Deploying a new Model

You are now ready to deploy your own instance of the TinyLlama model. This model is a much smaller version of the Llama model that can run on a CPU. We chose it for this exercise because it is much faster to deploy than the Granite model and does not require an additional, expensive GPU.

We have seen through ODH-TEC that the model is already available in our bucket, and that the existing connection already points to this bucket. So we are going to deploy TinyLlama using the same connection as the Granite model.

Deploying the Model

  • In the Dashboard, navigate to the Models tab.

  • Click on the Deploy model button to start the model deployment process.

    02 deploy model button
  • In the deployment form, fill in the following details:

    • Model deployment name: TinyLlama

    • Serving runtime: Custom - vLLM ServingRuntime-CPU

    • Number of model server replicas to deploy: 1

    • Model server size: Medium (IMPORTANT)

    • Accelerator: None

    • Model route: Checked (IMPORTANT)

    • Token authentication: Unchecked (IMPORTANT)

    • Source model location: Existing connection

    • Connection: models

    • Path: TinyLlama/TinyLlama-1.1B-Chat-v1.0

    • Additional serving runtime arguments:

      --served-model-name=tinyllama/tinyllama-1.1b-chat-v1.0
  • Then click on Deploy to start the deployment process. This will take a few minutes to complete. At any time, you can check the status of the deployment by opening the OpenShift console and navigating to the llm-hosting project. You should see a new Pod called tinyllama-predictor-… being deployed; looking at its logs, you should see the model being loaded and the server starting (see the example commands after this list).

  • Once the deployment is complete, click on Internal and external endpoint details to see the different endpoints available for the model.

    02 new model endpoints
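
If you prefer to follow the rollout from a terminal rather than the console, the commands below are a sketch of the equivalent checks. They assume the oc CLI is available and that you are logged in to the cluster, and that the pod name follows the tinyllama-predictor-… pattern mentioned above:

    # Watch the predictor pod come up in the llm-hosting namespace
    oc get pods -n llm-hosting -w

    # Follow its logs once it is running (substitute the actual pod name from the command above;
    # add "-c kserve-container" if asked to pick a container)
    oc logs -n llm-hosting -f tinyllama-predictor-<pod-suffix>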

Testing the Model

Once the model is deployed, you can test it using the same method as before. You can use the same Model-Test workbench we created earlier, or create a new one if you prefer.

  • Please go back to the OpenShift AI Dashboard.

  • In the Model-Test workbench, open a terminal.

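  • In the terminal, run the same kind of curl request as before, replacing the model name with the one we set through --served-model-name and pointing at the new endpoint. The hostname below is an assumption based on the naming pattern of the Granite endpoint; copy the exact internal endpoint (or the external route, over HTTPS) from Internal and external endpoint details if yours differs:

    curl -X POST \
      -H "Content-Type: application/json" \
      -d '{"model": "tinyllama/tinyllama-1.1b-chat-v1.0", "prompt": "Hello, how are you?"}' \
      http://tinyllama-predictor.llm-hosting.svc.cluster.local:8080/v1/completions | jq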
Either way, the answer will take longer than previously, as we are running on a CPU (and you may need to try a few times to get something meaningful while the model "warms up"). It should look like this:

{
    "id":"cmpl-061bbe3b71394ef98765f44e93c0df9b",
    "object":"text_completion",
    "created":1746042889,
    "model":"tinyllama/tinyllama-1.1b-chat-v1.0",
    "choices":[
        {
            "index":0,
            "text":" I'm well. My name is David and I live in Tokyo. Rec",
            "logprobs":null,
            "finish_reason":"length",
            "stop_reason":null,"prompt_logprobs":null
        }
        ],
    "usage":{
        "prompt_tokens":6,
        "total_tokens":22,
        "completion_tokens":16
        }
}

Congratulations, you have successfully deployed and tested a model on OpenShift AI! You can now use this model in your applications, or expose it through the 3Scale API Gateway to make it available to other users. This is what we will do in the next module.