Module 5: Agent & LLM evaluations (Evals)

Observability tells you if your system is working, but evaluations tell you if your agents are working well. In this module, you’ll go beyond basic observability by running LLM evaluations against the Fed Aura Capital prospect agent using MLflow’s evaluation framework, Prompt Registry, and LLM-as-a-Judge scorers.

In this module, you’re wearing your AI Developer / Engineer hat, focusing on model output quality, prompt management, and evaluation design.

This is the AI Developer’s quality assurance layer, the inner loop where you manually evaluate agent outputs during development. In Module 6, you’ll automate this into an outer loop that detects regressions continuously.

Learning objectives

By the end of this module, you’ll be able to:

  • Launch a Jupyter workbench in Red Hat OpenShift AI

  • Register versioned prompts in MLflow’s Prompt Registry

  • Create evaluation datasets with inputs and expected outputs

  • Run deterministic and LLM-as-a-Judge evaluations against a live agent

  • Analyze evaluation results and per-trace assessments in MLflow

Why evaluations matter for AgentOps

In the Fed Aura Capital mortgage system, agents make consequential decisions:

  • The prospect agent qualifies leads. Wrong qualification costs sales.

  • The underwriter agent assesses risk. Incorrect assessment creates liability.

  • The underwriter agent checks compliance. Failures risk legal penalties.

Metrics tell you the agent responded. Traces tell you how it responded. Evaluations tell you if it responded correctly.

Exercise 1: Launch a Jupyter workbench

To run evaluations, you need a Jupyter environment. Red Hat OpenShift AI provides managed workbenches with pre-installed data science libraries.

  1. Access the Red Hat OpenShift AI dashboard directly from the RHOAI Console tab at the top of this page:

    Showroom tabs showing RHOAI Console tab

    Alternatively, you can access it from the OpenShift Console. Click the application launcher (waffle icon) in the top-right corner. Under OpenShift Self Managed Services, click Red Hat OpenShift AI:

    OpenShift Console application launcher showing Red Hat OpenShift AI
  2. In the Red Hat OpenShift AI (RHOAI) dashboard, navigate to Projects and select your workspace (wksp-user1).

  3. Click Create workbench and configure it:

    • Name: wksp-user1

    • Image selection: Jupyter | Data Science | CPU | Python 3.12

    • Version selection: 3.4

    • Leave other settings as defaults

    Red Hat OpenShift AI Create workbench form showing name and image configuration
  4. Click Create workbench and wait for it to start.

    The workbench image download can take a couple of minutes the first time. Please be patient while it pulls the container image and starts.
  5. Once the status shows Running, click Open to launch JupyterLab:

    Red Hat OpenShift AI workbench status showing Running with Open button

Exercise 2: Clone the repository and open the evaluation notebook

This is the inner loop in action: as an AI Developer or Engineer, you use Jupyter notebooks to experiment with prompts, tweak agent behavior, and evaluate outputs interactively. The notebook gives you a fast feedback cycle: change a prompt, re-run the evaluation, inspect the results, and iterate, all before any code reaches a pipeline or production.

Want to learn more about MLflow’s evaluation framework? Open the MLflow Docs tab at the top of this page and explore the Evaluation & Monitoring section, which covers scorers, datasets, and evaluation-driven development workflows in depth.

The evaluation notebook is part of the mortgage-ai application repository. Let’s clone it and set up the environment.

  1. In JupyterLab, click Git > Clone a Repository (or use the Git icon in the left sidebar). Enter the repository URL:

    https://github.com/rh-ai-quickstart/multi-agent-loan-origination.git
    JupyterLab Clone a repo dialog with the multi-agent-loan-origination repository URL

    Click Clone to download the repository.

  2. In the file browser, navigate to multi-agent-loan-origination/evaluations/ and open evaluate_agent.ipynb:

    JupyterLab showing evaluate_agent.ipynb open with Setup cell and environment variables
  3. Run the Install Dependencies cell to install the required packages (MLflow, LangChain, OpenAI, etc.).

  4. In the Setup cell, configure the environment variables. The cell should look like this:

    os.environ["LLM_BASE_URL"] = "<from llm-credentials secret>"
    os.environ["LLM_API_KEY"] = "<from llm-credentials secret>"
    os.environ["LLM_MODEL"] = "gpt-oss-120b"
    os.environ["MLFLOW_TRACKING_URI"] = "https://mlflow.redhat-ods-applications.svc.cluster.local:8443"
    os.environ["MLFLOW_TRACKING_TOKEN"] = "<from terminal>"
    os.environ["MLFLOW_EXPERIMENT_NAME"] = "mortgage-ai"
    os.environ["MLFLOW_TRACKING_AUTH"] = "kubernetes"
    os.environ["MLFLOW_TRACKING_INSECURE_TLS"] = "true"

    You need to fill in 3 values:

    • LLM_BASE_URL and LLM_API_KEY: extract these from the llm-credentials secret. In the OpenShift Console, navigate to Workloads > Secrets in the wksp-user1 namespace, find llm-credentials, scroll down, and click Reveal values:

      OpenShift Console Secrets page showing llm-credentials secret in the workspace namespace
  5. To get the MLFLOW_TRACKING_TOKEN, switch to the Terminal tab and run:

    TOKEN=$(oc login --insecure-skip-tls-verify $(oc whoami --show-server) -u user1 -p openshift > /dev/null 2>&1 && oc whoami --show-token)
    echo ${TOKEN}

    Copy the token value and paste it into the MLFLOW_TRACKING_TOKEN variable in the notebook’s Setup cell.

  6. Run the Setup and Configure MLflow cells. Verify the output shows your MLflow URI, experiment name, and LLM endpoint.

    You can run cells one by one with Shift+Enter, or click the Restart and Run All button (rewind icon) in the toolbar to execute the entire notebook at once:
    JupyterLab toolbar showing Restart and Run All rewind button

Exercise 3: Register a prompt in MLflow’s Prompt Registry

Before running evaluations, let’s register the agent’s system prompt in MLflow’s Prompt Registry. The Prompt Registry provides version control for prompts. Think of it as Git for your system prompts, enabling you to track changes, compare versions, and link evaluation traces back to the exact prompt that produced them.

  1. Run the cells under 1. Register System Prompt in MLflow. The notebook reads the public-assistant system prompt (the same content from config/agents/public-assistant.yaml) and registers it as a versioned artifact:

    Notebook cell registering the public-assistant system prompt in MLflow Prompt Registry
  2. Switch to the MLflow UI (use the MLflow Console tab at the top of this page). Switch to the mortgage-ai-eval workspace by clicking the back arrow next to the current workspace name, then selecting mortgage-ai-eval:

    MLflow workspace selector showing mortgage-ai-eval workspace

    Click Prompts in the left sidebar. You’ll see the registered prompt:

    MLflow Prompts page showing public-assistant-system-prompt at Version 1
  3. Click on public-assistant-system-prompt to inspect the version details:

    MLflow Prompt detail showing version 1 with metadata and prompt text preview

    The prompt detail shows:

    • Version: Version number and registration timestamp

    • Metadata: Agent name, persona, source file, and type tags

    • Commit message: Description of the change (like a Git commit)

    • Prompt text: The full system prompt content

When you update the system prompt and register a new version, MLflow keeps the history. This lets you compare evaluation results across prompt versions, a critical capability for prompt engineering at scale.
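The "Git for prompts" idea can be made concrete: once two versions of a prompt are registered, you can diff their templates the same way you would diff two commits. A minimal illustration using Python's standard difflib module (the prompt text below is invented for the example and is not the actual public-assistant prompt):

```python
import difflib

# Two hypothetical versions of a system prompt (invented for illustration).
v1 = """You are a mortgage assistant for Fed Aura Capital.
Answer product questions factually.
Never promise specific interest rates."""

v2 = """You are a mortgage assistant for Fed Aura Capital.
Answer product questions factually and cite the product name.
Never promise specific interest rates."""

# Produce a unified diff between the two versions, similar in spirit
# to comparing versions side by side in the Prompt Registry UI.
diff = list(difflib.unified_diff(
    v1.splitlines(), v2.splitlines(),
    fromfile="public-assistant-system-prompt@1",
    tofile="public-assistant-system-prompt@2",
    lineterm="",
))
print("\n".join(diff))
```

The registry stores the full text of every version, so this kind of comparison is always possible, and each diff can be correlated with the evaluation results recorded against each version.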

Exercise 4: Create an evaluation dataset

The foundation of good evaluations is a high-quality dataset. The notebook creates a persistent dataset on the MLflow server with representative test cases for the prospect agent.

  1. Run the cells under 2. Create Evaluation Dataset and 3. View the Dataset. The notebook creates 6 test cases, each with:

    • inputs: The user message to send to the agent (e.g., "Tell me about FHA loans")

    • expectations: Expected behavior including keywords, tool calls, topics, and forbidden content

  2. In the MLflow UI, navigate to your experiment and click Datasets. You’ll see the public_assistant_eval dataset with 6 records:

    MLflow Datasets page showing public_assistant_eval with 6 records and their inputs and expectations

    Each record shows the user message input alongside the expected answer and expected tool calls. For example, "Tell me about FHA loans" expects the keyword "FHA" in the response and the product_info tool to be called.

Datasets are stored on the MLflow server, not as local files. This means they’re versioned, shareable across team members, and can be reused across evaluation runs.
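The record structure described above can be sketched as plain Python. The field names mirror the bullets in this exercise; the notebook's exact schema and the affordability_calculator tool name are assumptions for illustration:

```python
# Sketch of the evaluation dataset's record shape: each record pairs
# an input message with the expectations used by the scorers.
# Exact key names in the notebook may differ.
eval_records = [
    {
        "inputs": {"question": "Tell me about FHA loans"},
        "expectations": {
            "expected_keywords": ["FHA"],        # must appear in the response
            "expected_tools": ["product_info"],  # tools the agent should call
            "forbidden": ["guaranteed rate"],    # content that must not appear
        },
    },
    {
        "inputs": {"question": "How much house can I afford on $90k/year?"},
        "expectations": {
            "expected_keywords": ["afford"],
            "expected_tools": ["affordability_calculator"],  # hypothetical tool name
            "forbidden": [],
        },
    },
]

# Every record must expose the same two top-level keys,
# so scorers can be applied uniformly across the dataset.
for record in eval_records:
    assert set(record) == {"inputs", "expectations"}
print(f"{len(eval_records)} records validated")
```

Because each record carries its own expectations, adding a new test case is just appending another dict; the scorers never need to change.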

Exercise 5: Run a simple evaluation

Let’s start with deterministic scorers, which are fast checks that don’t require LLM calls. These are your first line of defense for catching obvious regressions.

Simple scorers

The notebook defines 3 deterministic scorers:

  • contains_expected: Does the response contain the expected keyword? (e.g., does an FHA question response mention "FHA"?)

  • has_numeric_result: Does the response include numeric values like dollar amounts or percentages? (important for affordability calculations)

  • response_length: Is the response at least 50 characters? (catches empty or truncated responses)
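The three checks are simple enough to sketch as plain Python functions. The names match the list above, but these are illustrative implementations; the notebook's actual scorer code may differ:

```python
def contains_expected(response: str, expected_keyword: str) -> bool:
    """Pass if the expected keyword appears in the response (case-insensitive)."""
    return expected_keyword.lower() in response.lower()

def has_numeric_result(response: str) -> bool:
    """Pass if the response contains at least one digit,
    e.g. a dollar amount or a percentage."""
    return any(ch.isdigit() for ch in response)

def response_length(response: str, min_chars: int = 50) -> bool:
    """Pass if the response is at least min_chars long,
    catching empty or truncated replies."""
    return len(response) >= min_chars

answer = "FHA loans require a minimum down payment of 3.5% for qualified borrowers."
print(contains_expected(answer, "FHA"))  # True
print(has_numeric_result(answer))        # True
print(response_length(answer))           # True
```

Because these scorers are pure string checks, they run in microseconds and cost nothing, which is exactly why they make a good first line of defense before any LLM judges are involved.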

Set up the predictor and run the evaluation

Before running the evaluation, the notebook needs 2 pieces: a predictor function that invokes the prospect agent for each test case, and the scorer definitions. The predictor wraps the agent invocation and loads the registered prompt inside the traced context. This is what creates the automatic prompt-trace linkage you’ll explore in Exercise 7.
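The mechanics of this pattern can be sketched without MLflow at all: a predictor maps each test case's inputs to an output, and each scorer maps that output plus the expectations to pass/fail. A toy version with a stubbed agent (the real notebook invokes the live prospect agent through MLflow's evaluation harness, which also records the traces):

```python
# Toy sketch of the predictor/scorer pattern. A stub stands in for the
# live prospect agent; the notebook's real predictor calls the deployed
# agent and loads the registered prompt inside the traced context.
def stub_agent(question: str) -> str:
    return f"Here is some information about your question on {question.split()[-1]}."

def predictor(inputs: dict) -> str:
    # In the real notebook this invokes the prospect agent.
    return stub_agent(inputs["question"])

def contains_expected(output: str, expectations: dict) -> bool:
    return expectations["keyword"].lower() in output.lower()

def response_length(output: str, expectations: dict) -> bool:
    return len(output) >= 50

dataset = [
    {"inputs": {"question": "Tell me about FHA loans"},
     "expectations": {"keyword": "loans"}},
    {"inputs": {"question": "What are current rates"},
     "expectations": {"keyword": "rates"}},
]

scorers = {"contains_expected": contains_expected,
           "response_length": response_length}

# Run every scorer over every test case and aggregate pass counts.
totals = {name: 0 for name in scorers}
for case in dataset:
    output = predictor(case["inputs"])
    for name, scorer in scorers.items():
        totals[name] += scorer(output, case["expectations"])

for name, passed in totals.items():
    print(f"{name}: {passed}/{len(dataset)} passed")
```

MLflow's harness does the same loop for you, but additionally captures a trace per test case and attaches each scorer's verdict to it as an assessment.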

  1. Run the cells under the Predictor with Prompt Linkage and Scorers sections to set up both.

  2. Then run the Run Simple Evaluation cell. The notebook sends each test case to the live agent, collects responses, and scores them:

    Notebook running simple evaluation with 6 examples and 3 scorers showing results

    The evaluation runs all 6 test cases against the prospect agent and reports the aggregate scores.

View results in MLflow

  1. In the MLflow UI, click Evaluation runs in the left sidebar. You’ll see the evaluation run with all 6 traces:

    MLflow Evaluation Runs showing the evaluation run with 6 traces and their requests responses and token counts

    Each row shows the Trace ID, the request sent to the agent, the response received, and the token count.

  2. Click Traces in the left sidebar to see the per-trace assessments. Click Columns and enable All Assessments to see the scorer results for each trace:

    MLflow Traces view with Assessment columns showing contains_expected has_numeric_result and response_length results per trace

    The assessment columns show True/False for each scorer on each trace. You can quickly spot which test cases passed or failed each check.

    Some assessment columns may show null values. This is expected: at this stage you are only running the simple deterministic scorers (contains_expected, has_numeric_result, response_length), not the LLM-as-a-Judge scorers. You'll enable those in Exercise 6, and the remaining columns will populate.
    If assessment columns are not visible in the Traces view, use the Columns dropdown and enable All Assessments. MLflow doesn’t always show them by default.
    MLflow Columns dropdown showing All Assessments option to enable assessment columns
  3. Click on a trace that has the simple scorer columns filled (contains_expected, has_numeric_result, response_length); the other columns should still be null at this point. You'll see the detailed view with the Assessments sidebar:

    MLflow trace detail showing inputs outputs span timeline and Assessments sidebar with Feedback and Expectations

    The detail view combines the trace information (inputs, outputs, span timeline) with the evaluation assessments. The Feedback section shows the deterministic scorer results, and the Expectations section shows the expected values from the dataset.

Exercise 6: Run LLM-as-a-Judge evaluation

Deterministic scorers catch surface-level issues, but they can't assess whether a response is actually helpful. For that, you need LLM-as-a-Judge: using an LLM to evaluate the quality of another LLM's output.

LLM judge scorers

The notebook adds 5 LLM-powered scorers on top of the 3 simple ones:

  • ToolCallCorrectness: Did the agent call the right tools? (e.g., did it use product_info for product questions?)

  • ToolCallEfficiency: Were tool calls minimal and efficient?

  • RelevanceToQuery: Is the response relevant to what the user asked?

  • Safety: Is the response safe and appropriate?

  • Guidelines: Does the response follow custom mortgage assistant guidelines? (helpful, no rate promises, professional language)
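The basic shape of an LLM judge is the same for all five scorers: render a grading prompt, ask a judge model for a verdict, and parse it to pass/fail. A minimal sketch with a stubbed judge (the notebook uses MLflow's built-in judge scorers backed by a real model endpoint; the template and stub below are invented):

```python
# Minimal sketch of an LLM-as-a-Judge scorer.
JUDGE_TEMPLATE = """You are grading a mortgage assistant's answer.
Question: {question}
Answer: {answer}
Is the answer relevant to the question? Reply with only YES or NO."""

def judge_llm(prompt: str) -> str:
    # Stub: a real implementation would call the judge model endpoint.
    # Here we crudely pretend FHA-related prompts are relevant.
    return "YES" if "FHA" in prompt else "NO"

def relevance_to_query(question: str, answer: str) -> bool:
    """Build the grading prompt, get the judge's verdict, parse it."""
    verdict = judge_llm(JUDGE_TEMPLATE.format(question=question, answer=answer))
    return verdict.strip().upper().startswith("YES")

print(relevance_to_query(
    "Tell me about FHA loans",
    "FHA loans are government-backed mortgages with a 3.5% minimum down payment.",
))  # True with this stub
```

The judge call is why these scorers are slower and cost tokens, but it's also what lets them assess qualities (relevance, safety, adherence to guidelines) that no string check can capture.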

Run the evaluation

  1. Run the Run Full LLM-as-a-Judge Evaluation cell. This runs all 8 scorers (3 simple + 5 LLM judges) against the 6 test cases:

    Notebook running LLM-as-a-Judge evaluation with 8 scorers including 5 LLM judges

    This evaluation takes longer than the simple one because each test case is scored by 5 LLM judges in addition to the deterministic checks.

View results in MLflow

  1. In the MLflow UI, click Evaluation runs. You’ll now see 2 runs: the simple evaluation and the LLM-as-a-Judge evaluation:

    MLflow Evaluation Runs showing 2 runs - simple and LLM-as-a-Judge
  2. Click Traces and use the Columns dropdown to enable All Assessments. You’ll see the full picture with both deterministic and LLM judge results:

    MLflow Traces view with full assessment columns including tool_call_correctness safety and relevance

    The additional columns show Pass/Fail for each LLM judge. Notice how tool_call_correctness shows an 83% pass rate (5 of 6 test cases) while safety shows 100%. The agent is safe but occasionally calls the wrong tool.

  3. Click on a trace to see all 8 assessments in the detail view:

    MLflow trace detail showing all 8 assessments including safety mortgage_guidelines tool_call_efficiency tool_call_correctness and relevance_to_query

    The assessments sidebar now shows the full evaluation picture for this single trace: safety (Yes), mortgage_guidelines (No, perhaps the response was too informal), tool_call_efficiency (Yes), tool_call_correctness (Yes), relevance_to_query (Yes), and the 3 deterministic checks.

Exercise 7: Explore prompt-trace linkage

One of the key features of MLflow’s evaluation framework is automatic prompt-trace linkage. Every evaluation trace is linked to the prompt version that produced it, enabling you to track quality across prompt iterations.

  1. In the MLflow UI, navigate to Prompts and click on public-assistant-system-prompt. Click the Traces tab:

    MLflow Prompts Traces tab showing evaluation traces linked to prompt version 1

    All evaluation traces are automatically linked to the prompt version that produced them. When you update the system prompt and register Version 2, future evaluations will link to that new version, letting you compare quality across prompt iterations side by side.
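What the side-by-side comparison looks like in principle: group each trace's scorer result by the prompt version that produced it and compare pass rates. The trace records below are invented for illustration; in practice MLflow records the linkage for you:

```python
from collections import defaultdict

# Invented trace records illustrating prompt-trace linkage: each
# evaluation trace carries the prompt version that produced it.
traces = [
    {"prompt_version": 1, "tool_call_correctness": True},
    {"prompt_version": 1, "tool_call_correctness": False},
    {"prompt_version": 1, "tool_call_correctness": True},
    {"prompt_version": 2, "tool_call_correctness": True},
    {"prompt_version": 2, "tool_call_correctness": True},
    {"prompt_version": 2, "tool_call_correctness": True},
]

# Aggregate the pass rate per prompt version to spot regressions
# (or improvements) between prompt iterations.
passed, total = defaultdict(int), defaultdict(int)
for t in traces:
    v = t["prompt_version"]
    passed[v] += t["tool_call_correctness"]
    total[v] += 1

for v in sorted(total):
    print(f"version {v}: {passed[v]}/{total[v]} passed")
```

This per-version view is what makes regression detection possible: if version 2's pass rate drops below version 1's, you know exactly which prompt change to revisit.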

Module summary

What you accomplished:

  • Launched a Jupyter workbench in Red Hat OpenShift AI and set up the evaluation environment

  • Registered the prospect agent’s system prompt in MLflow’s Prompt Registry

  • Created an evaluation dataset with 6 test cases on the MLflow server

  • Ran deterministic and LLM-as-a-Judge evaluations against the live agent

  • Analyzed per-trace assessments and prompt-trace linkage in MLflow

Key takeaways:

  • MLflow’s Prompt Registry provides version control for system prompts: track changes, compare versions, and link evaluations to specific prompt versions

  • Evaluations combine fast deterministic scorers (keyword checks, length) with LLM judges (tool correctness, safety, relevance) for comprehensive quality assessment

  • Prompt-trace linkage automatically connects evaluation results to the prompt version that produced them, enabling regression detection across prompt iterations

Next steps:

Module 6 will move from the inner loop (manual evaluation) to the outer loop, automating evaluation with AI Pipelines that can be scheduled or triggered to detect regressions continuously.