Module 6: From development to production

In Module 5, you ran evaluations manually from a Jupyter notebook, the inner loop of an AI Developer’s workflow. But manual evaluations don’t scale. In production, you need evaluations that run automatically, whether on a schedule or as part of a CI/CD pipeline.

In this module, you’ll move from the inner loop to the outer loop by importing and running an evaluation pipeline in Red Hat OpenShift AI. The same scorers, datasets, and MLflow integration you used in Module 5 now run as an automated pipeline. No notebook required.

In this module, you’re wearing both hats: AI Developers define the evaluation logic, while SRE/Platform Engineers manage the pipeline infrastructure and scheduling.

Learning objectives

By the end of this module, you’ll be able to:

  • Explain the difference between inner loop (notebook) and outer loop (pipeline) evaluation

  • Import an evaluation pipeline definition into OpenShift AI

  • Run an automated evaluation pipeline and configure its parameters

  • View automated evaluation results in MLflow

  • Detect prompt regressions by comparing evaluation results across prompt versions

From inner loop to outer loop

In Module 5, you evaluated the prospect agent by running cells in a Jupyter notebook. That approach works well during development, but it has limitations:

  • Manual: Someone has to open the notebook and click Run

  • Not reproducible: Results depend on the notebook environment and who ran it

  • Not schedulable: You can’t trigger it automatically on a model update or prompt change

The outer loop solves these problems by packaging the same evaluation logic into an AI Pipeline, a series of containerized steps that run on the OpenShift AI platform.

| Inner Loop (Module 5) | Outer Loop (Module 6) |
|---|---|
| Jupyter notebook | AI Pipeline (Kubeflow Pipelines) |
| Manual execution | Automated, scheduled, or triggered |
| Developer’s workbench | Platform-managed containers |
| Interactive exploration | Reproducible, auditable runs |

Red Hat OpenShift AI includes Data Science Pipelines (based on Kubeflow Pipelines) for orchestrating multi-step ML workflows, including evaluation pipelines.
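Conceptually, the four pipeline steps chain together like ordinary functions. Here is a minimal plain-Python sketch of that control flow; the function names mirror the pipeline ops from Exercise 1, but the bodies are stubs, and the real pipeline wraps each step as a containerized Kubeflow Pipelines component:

```python
# Sketch of the evaluation pipeline's control flow. In the real pipeline,
# each function is a containerized Kubeflow Pipelines component; the
# signatures and stubbed return values here are illustrative only.

def setup_mlflow_op(tracking_uri: str, workspace: str, experiment: str) -> dict:
    """Return the MLflow configuration that later steps will use."""
    return {"tracking_uri": tracking_uri, "workspace": workspace,
            "experiment": experiment}

def create_dataset_op(config: dict) -> list:
    """Create (or fetch) the evaluation dataset -- the same 6 test cases."""
    return [{"inputs": f"question {i}", "expected": f"answer {i}"}
            for i in range(6)]

def run_simple_eval_op(config: dict, dataset: list) -> dict:
    """Run the deterministic scorers over the dataset (stubbed here)."""
    return {"contains_expected": 1.0, "has_numeric_result": 0.5,
            "response_length": 1.0}

def report_results_op(scores: dict) -> str:
    """Print a one-line summary of the evaluation run."""
    summary = ", ".join(f"{k}={v:.0%}" for k, v in sorted(scores.items()))
    print(summary)
    return summary

# Chain the steps in the same order as the pipeline graph.
cfg = setup_mlflow_op("https://mlflow.example.svc:8443", "wksp-user1", "eval")
data = create_dataset_op(cfg)
scores = run_simple_eval_op(cfg, data)
report_results_op(scores)
```

The value of the pipeline form is not the logic itself, which is the same as the notebook's, but that each step runs in its own container with pinned dependencies, so every run is reproducible and auditable.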

Exercise 1: Import an evaluation pipeline

The mortgage-ai project includes a pre-compiled evaluation pipeline that packages the same simple evaluation from Module 5 into 4 automated steps.

  1. In Red Hat OpenShift AI, navigate to Develop & Train > Pipelines > Pipeline definitions:

    OpenShift AI Pipeline definitions page showing no pipelines with Import pipeline button
  2. Click Import pipeline and configure it:

    • Pipeline name: eval-simple-run

    • Select Import by url

    • Paste the pipeline URL:

      https://raw.githubusercontent.com/rh-ai-quickstart/multi-agent-loan-origination/refs/heads/main/evaluations/pipelines_gen/simple-eval-pipeline.yaml
    Import pipeline dialog with name eval-simple-run and Import by url selected

    Click Import pipeline.

  3. The pipeline graph shows the 4 automated steps:

    Pipeline definition graph showing setup-mlflow-op then create-dataset-op then run-simple-eval-op then report-results-op
    • setup-mlflow-op: Configure MLflow tracking URI, workspace, and experiment

    • create-dataset-op: Create the evaluation dataset on the MLflow server (same 6 test cases from Module 5)

    • run-simple-eval-op: Run the 3 deterministic scorers (contains_expected, has_numeric_result, response_length)

    • report-results-op: Print the evaluation summary

This is the same evaluation you ran manually in Module 5, Exercise 5, but packaged as a pipeline that can run without a notebook.
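The three deterministic scorers are simple to reason about. Below is a plausible sketch of their logic; the signatures and the length threshold are assumptions for illustration, and Module 5's actual implementations may differ:

```python
import re

# Illustrative versions of the three deterministic scorers.
# Signatures and the min_chars threshold are assumptions, not the
# exact Module 5 code.

def contains_expected(response: str, expected_keyword: str) -> bool:
    """Pass if the expected product keyword appears in the response."""
    return expected_keyword.lower() in response.lower()

def has_numeric_result(response: str) -> bool:
    """Pass if the response contains a digit (a rate, dollar amount, etc.)."""
    return re.search(r"\d", response) is not None

def response_length(response: str, min_chars: int = 50) -> bool:
    """Pass if the response is long enough to be a substantive answer."""
    return len(response) >= min_chars

answer = "The FHA loan limit for 2024 is $498,257 in most counties."
print(contains_expected(answer, "FHA"),
      has_numeric_result(answer),
      response_length(answer))  # prints: True True True
```

Deterministic scorers like these are cheap to run on every pipeline execution, which is what makes scheduled evaluation practical.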

Exercise 2: Run the evaluation pipeline

  1. Click Actions > Create run:

    Actions dropdown showing Create run and Create schedule options
    Notice the Create schedule option. In production, you would schedule evaluations to run periodically (e.g., every 6 hours) to continuously monitor agent quality.
  2. In the Create run form, set the run name to eval-run-1:

    Create run form showing run type project experiment and run name fields
  3. Scroll down to Parameters and configure the required values:

    • mlflow_tracking_uri: https://mlflow.redhat-ods-applications.svc.cluster.local:8443

    • mlflow_workspace: wksp-user1

    Create run Parameters section showing agent_name dataset_name mlflow_experiment_name mlflow_tracking_uri and mlflow_workspace

    The other parameters (agent_name, dataset_name, mlflow_experiment_name) use sensible defaults. Click Create run.

  4. The pipeline starts executing. You can watch each step complete in the graph view:

    Pipeline run eval-run-1 executing showing the 4-step graph with Running status

    Each step runs in its own container on the OpenShift AI platform. The pipeline authenticates to MLflow using the pod’s Kubernetes service account token, so no manual token management is needed.

Exercise 3: View automated evaluation results in MLflow

Once the pipeline completes, the evaluation results appear in MLflow just like the notebook-driven evaluations from Module 5.

  1. In the MLflow UI (use the MLflow Console tab at the top of this page), navigate to your experiment’s Traces view. You’ll see the 6 new traces from the pipeline run with their assessment results:

    MLflow Traces showing 6 automated evaluation traces with assessment columns contains_expected has_numeric_result and response_length

    The results are identical in structure to what you saw in Module 5: the same scorers, the same dataset, the same assessment columns. The only difference is that these were produced by an automated pipeline instead of a notebook.

  2. To see all assessment and expectation columns, click Columns and check All Assessments and All Expectations:

    MLflow Columns dropdown showing All Assessments and All Expectations checkboxes
If assessment columns are not visible in the Traces view (in this module or in Module 5), use the Columns dropdown to enable All Assessments. MLflow doesn’t always show them by default.

Exercise 4: Catch a bad prompt before production

Automated pipelines ensure evaluations run continuously, but the real payoff comes when they catch a problem before it reaches users. In this exercise, you’ll simulate a realistic scenario: a team member proposes a prompt change to reduce latency, and you use evaluations to discover that the change would silently degrade response quality.

The scenario is simple. A developer on the Fed Aura Capital team notices that the prospect agent spends time calling tools for questions it could answer from general knowledge. They propose a "quick optimization": change TOOL USE (MANDATORY) to TOOL USE (OPTIONAL) in the system prompt, telling the agent to prefer answering from its own knowledge and only call tools when it truly cannot answer on its own.

This sounds reasonable. It might reduce latency. But will the agent still provide accurate, product-specific responses? Let’s find out.

Open the regression detection notebook

  1. In JupyterLab (from Module 5), navigate to multi-agent-loan-origination/evaluations/ and open evaluate_agent_v2.ipynb:

    JupyterLab showing evaluate_agent_v2.ipynb open with Prompt Regression Detection title and environment setup

    This notebook is purpose-built for prompt regression detection. It registers a modified Version 2 of the system prompt, runs the same evaluation dataset from Module 5, and compares results against the Version 1 baseline.

  2. Run the Install Dependencies cell. Run the Setup cells to configure the environment variables (these are the same as Module 5).

Review the modified prompt

  1. Run the cells under 2. Create Modified Prompt (Version 2). The notebook makes a single, subtle change to the system prompt:

    Modified system prompt Version 2 showing TOOL USE changed from MANDATORY to OPTIONAL

    The change: TOOL USE (MANDATORY) becomes TOOL USE (OPTIONAL). The agent is now told to "prefer answering from your general knowledge instead of calling tools" and "Only call tools if you truly cannot answer from your own knowledge."

    This is a realistic scenario. Someone might "optimize" the prompt to reduce tool call latency, not realizing that tool-sourced data (specific product rates, loan limits, affordability calculations) is what makes the responses accurate.

Register Version 2 in MLflow

  1. Run the cells under 3. Register Version 2 in MLflow Prompt Registry. The notebook registers the modified prompt as Version 2 with a descriptive commit message:

    Notebook registering Version 2 prompt in MLflow with commit message V2 Changed TOOL USE from MANDATORY to OPTIONAL

    The output confirms: public-assistant-system-prompt (version 2) with URI prompts:/public-assistant-system-prompt/2. Notice the tags include "purpose": "regression-detection", making it clear this version is being tested, not deployed.

Compare prompt versions in MLflow

  1. Switch to the MLflow UI. Navigate to Prompts and click on public-assistant-system-prompt. Click the Compare tab. MLflow shows a side-by-side diff of Version 2 (left) and Version 1 (right):

    MLflow Prompts Compare view showing side-by-side diff of Version 2 and Version 1 with red and green highlighting on the TOOL USE change

    The diff highlighting makes the change immediately visible: the TOOL USE section changed from (MANDATORY) to (OPTIONAL), and the instructions shifted from "you MUST call the product_info tool" to "you may call the product_info tool, but prefer answering from your general knowledge."

    This is the same experience as reviewing a code diff in a pull request, but for prompts. Before this, prompt changes at Fed Aura Capital were untracked text edits in YAML files. Now every change is versioned, diffable, and linked to evaluation results.

Run evaluations against Version 2

  1. Return to the notebook. Run the cells under 7. Run Evaluation with V2 Prompt. This runs the same 6 test cases and 3 deterministic scorers from Module 5, but with the modified v2 prompt:

    Notebook running V2 evaluation showing 6 examples and 3 scorers with results contains_expected 83 percent has_numeric_result 50 percent response_length 100 percent

    The results are already concerning. Compare with the Module 5 baseline: contains_expected dropped from 100% to 83%, meaning the agent is no longer mentioning expected product keywords in some responses, and has_numeric_result dropped from 50% to 33%. The agent is giving generic answers instead of tool-sourced specifics.

Compare V1 vs V2 results

  1. Run the cells under 9. Compare V1 vs V2 Results. The notebook compares the v2 scores against the v1 baseline and flags regressions:

    Notebook comparing V1 baseline vs V2 modified prompt showing contains_expected 100 percent to 83 percent REGRESSION and has_numeric_result regression

    The comparison table is clear:

    | Scorer | V1 (baseline) | V2 (modified) | Delta | Status |
    |---|---|---|---|---|
    | contains_expected | 100% | 83% | -17% | REGRESSION |
    | has_numeric_result | 50% | 33% | -17% | REGRESSION |
    | response_length | 100% | 100% | 0% | OK |

    Two of three scorers show regressions. The agent still produces long-enough responses (response_length is unchanged), but those responses are now less specific: they miss expected keywords and lack numeric values like rates and dollar amounts. The "optimization" made responses faster but less accurate.
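The comparison logic behind such a table is straightforward. A minimal sketch, assuming the notebook flags any drop beyond a small tolerance (the 1% tolerance and the function shape are assumptions, not the notebook's exact code):

```python
def compare_versions(baseline: dict, candidate: dict, tolerance: float = 0.01):
    """Compare per-scorer pass rates and flag drops beyond the tolerance.

    Returns rows of (scorer, v1, v2, delta, status) with rates formatted
    as percentages. The 1% tolerance is an illustrative choice.
    """
    rows = []
    for scorer, v1 in baseline.items():
        v2 = candidate[scorer]
        delta = v2 - v1
        status = "REGRESSION" if delta < -tolerance else "OK"
        rows.append((scorer, f"{v1:.0%}", f"{v2:.0%}", f"{delta:+.0%}", status))
    return rows

v1_scores = {"contains_expected": 1.00, "has_numeric_result": 0.50,
             "response_length": 1.00}
v2_scores = {"contains_expected": 0.83, "has_numeric_result": 0.33,
             "response_length": 1.00}

for row in compare_versions(v1_scores, v2_scores):
    print(*row)
```

Running this with the module's numbers flags contains_expected and has_numeric_result as regressions and leaves response_length as OK, matching the table above. In a CI/CD quality gate, any REGRESSION row would fail the build.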

View per-trace results in MLflow

  1. In the MLflow UI, navigate to Prompts, click public-assistant-system-prompt, and select the Traces tab. You can now see evaluation traces from both Version 1 and Version 2, with assessment columns showing per-trace pass/fail:

    MLflow Prompts Traces tab showing Version 1 and Version 2 traces side by side with assessment columns contains_expected has_numeric_result and mortgage_guidelines

    The traces are grouped by prompt version. You can see exactly which test cases passed or failed for each version:

    • Version 1 traces show consistent True values for contains_expected — the agent called the right tools and included the expected keywords.

    • Version 2 traces show False for several test cases — the agent answered from general knowledge instead of calling the product_info tool, missing expected keywords.

    Click into any failing trace to understand why the response missed the expected keyword and diagnose the root cause.

  2. To compare the prompt versions side by side, navigate to Prompts > public-assistant-system-prompt and click Text in the comparison view to see the differences between versions:

    MLflow Prompt comparison showing text differences between Version 1 and Version 2

    This lets you correlate specific prompt changes with evaluation regressions — a critical capability for prompt engineering at scale.

The verdict

The prompt change that seemed like a harmless optimization would have caused a 17% regression in response accuracy. Without the evaluation framework you built in Modules 5 and 6, this change would have been deployed to production based on a quick manual test ("it still answers questions, looks good"). Fed Aura Capital’s prospect agent would have started giving generic mortgage advice instead of specific product information, potentially misguiding customers and creating compliance risk.

The evaluation infrastructure caught the regression before it reached a single user. This is the core value of AgentOps: visibility into quality, not just availability.

Module summary

What you accomplished:

  • Imported a pre-compiled evaluation pipeline into OpenShift AI

  • Configured and ran an automated evaluation pipeline

  • Viewed automated evaluation results in MLflow alongside notebook-driven results

  • Detected a prompt regression before production by comparing Version 1 and Version 2 evaluation results

Key takeaways:

  • The inner loop (notebook) is for exploration and development; the outer loop (pipeline) is for automation and production

  • OpenShift AI Pipelines package the same evaluation logic into reproducible, schedulable, containerized runs

  • MLflow’s Prompt Registry combined with evaluations creates a quality gate: every prompt change can be tested, diffed, and compared before deployment

  • Evaluations catch subtle regressions that manual testing would miss, preventing bad prompts from reaching production

Next steps:

The Conclusion will recap the full observability journey, from the black box of Module 1 through metrics, traces, evaluations, and automated pipelines, and provide resources for continuing your AgentOps practice.