Evals with Llama Stack

Any change to one of the following can have a major impact on the behaviors and responses of an AI-infused application:

  • model vendor (e.g. Llama, Qwen, Mistral, Google, OpenAI)

  • model version (e.g. Gemini 2.5, Qwen3)

  • model parameter count (e.g. 7B, 14B, 70B)

  • model quantization (e.g. q4_K_M, fp16)

  • model server vendor (e.g. vLLM, Ollama, OpenAI)

  • model server configuration (e.g. tool-call-parser, chat-template)

  • prompts

  • and of course the application code (e.g. .py, .js, .ts, .java)

Why does this matter? A model upgrade that improves math reasoning might simultaneously degrade tool-calling accuracy. A quantization change that halves memory usage could introduce subtle factual errors. Without automated evals, teams discover these regressions in production — the most expensive place to find them.

Llama Stack includes first-class evaluation (Evals) capabilities. Evals in Llama Stack are dataset-driven and integrate directly with agents, tools, and models, enabling teams to measure correctness, safety, and behavior across real workflows—not isolated prompts. By supporting golden answers, deterministic scoring, and LLM-as-judge patterns, Llama Stack makes evaluations repeatable, automatable, and CI/CD-friendly, so teams can confidently test changes to prompts, tools, models, or agent logic before deploying to production.

Therefore, having an automated way to run evals is mission critical to the ongoing care and feeding of the candidate models behind an agent or any LLM-wrapping application.

For this module, we will focus on how to set up and execute model evals - tests that you run against different candidate models.

Setup

The evals-llama-stack directory contains a set of numbered Python scripts (1_list_eval_related_providers.py, 2_register_dataset_basic_subset_of.py, etc.) that progressively build up the eval pipeline, plus a datasets/ folder with CSV test data containing questions and expected answers. You will run these scripts in order.

Make sure you are in the correct directory

cd $HOME/fantaco-redhat-one-2026/
pwd
/home/lab-user/fantaco-redhat-one-2026

If needed, create a Python virtual environment (venv)

python -m venv .venv

Activate the virtual environment

source .venv/bin/activate

Change to the correct sub-directory

cd evals-llama-stack

Install the dependencies

pip install -r requirements.txt

Objects & APIs

The Llama Stack Evals capability involves the following objects/APIs:

  • Datasets

  • Scoring Functions

  • Benchmarks

  • Evals

  • Jobs

First, verify that the appropriate providers are configured on your Llama Stack server instance.

APIs and providers

python 1_list_eval_related_providers.py
Connecting to Llama Stack server at: http://llamastack-distribution-vllm-service:8321
Fetching available providers...
Found 2 Dataset provider(s):

  Provider ID: huggingface
    Type: remote::huggingface
    API: datasetio

  Provider ID: localfs
    Type: inline::localfs
    API: datasetio

Found 1 Eval provider(s):

  Provider ID: meta-reference
    Type: inline::meta-reference
    API: eval

Found 2 Scoring provider(s):

  Provider ID: basic
    Type: inline::basic
    API: scoring

  Provider ID: llm-as-judge
    Type: inline::llm-as-judge
    API: scoring

What are providers? Providers are pluggable backends that implement Llama Stack APIs. For datasets, localfs stores data on the local filesystem while huggingface pulls from HuggingFace Hub. For scoring, basic uses deterministic string-matching (fast, no LLM call needed), while llm-as-judge uses a second LLM to grade responses (richer feedback, but slower and more expensive). The meta-reference eval provider is Llama Stack’s built-in evaluation engine.
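
The grouping that 1_list_eval_related_providers.py performs can be sketched as a small filter over provider records. The sample records below mirror the output above; on a live server the data would come from something like client.providers.list(), which is left as a comment here since it needs a running Llama Stack instance.

```python
from collections import defaultdict

# Sample records shaped like the server output above. On a live server,
# the real data would come from something like `client.providers.list()`.
providers = [
    {"provider_id": "huggingface", "provider_type": "remote::huggingface", "api": "datasetio"},
    {"provider_id": "localfs", "provider_type": "inline::localfs", "api": "datasetio"},
    {"provider_id": "meta-reference", "provider_type": "inline::meta-reference", "api": "eval"},
    {"provider_id": "basic", "provider_type": "inline::basic", "api": "scoring"},
    {"provider_id": "llm-as-judge", "provider_type": "inline::llm-as-judge", "api": "scoring"},
]

def group_by_api(providers):
    """Group provider records by the API they implement."""
    grouped = defaultdict(list)
    for p in providers:
        grouped[p["api"]].append(p["provider_id"])
    return dict(grouped)

by_api = group_by_api(providers)
for api in ("datasetio", "eval", "scoring"):
    print(f"Found {len(by_api[api])} {api} provider(s): {by_api[api]}")
```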

Datasets

Any datasets registered?

curl -s -H "Authorization: Bearer notapplicable" $LLAMA_STACK_BASE_URL/v1/datasets | jq
{
  "data": []
}

Register a Dataset based on a CSV file.

The CSV file is hosted on GitHub.

The dataset contains rows with an input_query (the question to ask the model), an expected_answer (the "golden answer" to score against), and a chat_completion_input (the prompt formatted for the Llama Stack chat API). For example, one row asks "What is 2+2?" with an expected answer of "4".
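
A minimal illustration of that row shape, parsed with Python's csv module. The rows below are illustrative, not the real file contents, and the exact chat_completion_input encoding (a JSON-encoded message list) is an assumption about the lab's CSV:

```python
import csv, io, json

# Illustrative rows in the same shape as the lab's CSV. The JSON-encoded
# message list in chat_completion_input is an assumption for illustration.
raw = io.StringIO(
    'input_query,expected_answer,chat_completion_input\n'
    '"What is 2+2?","4","[{""role"": ""user"", ""content"": ""What is 2+2?""}]"\n'
    '"What color is the sky?","blue","[{""role"": ""user"", ""content"": ""What color is the sky?""}]"\n'
)

rows = list(csv.DictReader(raw))
for row in rows:
    messages = json.loads(row["chat_completion_input"])
    print(row["input_query"], "->", row["expected_answer"], "|", messages[0]["role"])
```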

python 2_register_dataset_basic_subset_of.py
Connecting to Llama Stack server at: http://llamastack-distribution-vllm-service:8321
Registering dataset: basic-subset-of-evals
Using dataset provider: localfs
Registered dataset: basic-subset-of-evals

List the registered datasets

python 3_list_datasets.py
Connecting to Llama Stack server at: http://llamastack-distribution-vllm-service:8321
Fetching available datasets...
Found 1 dataset(s):

  Dataset ID: basic-subset-of-evals
    Provider: localfs
    Source: https://raw.githubusercontent.com/burrsutter/fantaco-redhat-one-2026/main/evals-llama-stack/datasets/basic-subset-of-evals.csv

Scoring Functions

List the scoring functions available

python 4_list_scoring_functions.py
Connecting to Llama Stack server at: http://llamastack-distribution-vllm-service:8321
Fetching available scoring functions...
Found 8 scoring function(s):

  Scoring Function ID: basic::equality
    Provider: basic
    Description: Returns 1.0 if the input is equal to the target, 0.0 otherwise.

  Scoring Function ID: basic::subset_of
    Provider: basic
    Description: Returns 1.0 if the expected is included in generated, 0.0 otherwise.

  Scoring Function ID: basic::regex_parser_multiple_choice_answer
    Provider: basic
    Description: Extract answer from response matching Answer: [the_answer_letter], and compare with expected result

  Scoring Function ID: basic::regex_parser_math_response
    Provider: basic
    Description: For math related benchmarks, extract answer from the generated response and expected_answer and see if they match

  Scoring Function ID: basic::ifeval
    Provider: basic
    Description: Eval intruction follow capacity by checkping how many instructions can be followed in each example

  Scoring Function ID: basic::docvqa
    Provider: basic
    Description: DocVQA Visual Question & Answer scoring function

  Scoring Function ID: llm-as-judge::base
    Provider: llm-as-judge
    Description: Llm As Judge Scoring Function

  Scoring Function ID: llm-as-judge::405b-simpleqa
    Provider: llm-as-judge
    Description: Llm As Judge Scoring Function for SimpleQA Benchmark (https://github.com/openai/simple-evals/blob/main/simpleqa_eval.py)

These scoring functions fall into two categories:

  • Deterministic (basic::*) — Fast, binary (0.0 or 1.0), no LLM call needed. subset_of checks if the expected answer appears anywhere in the response. equality requires an exact match. Best for factual checks with clear right/wrong answers.

  • LLM-as-Judge (llm-as-judge::*) — Uses a second model to evaluate quality and nuance on a richer scale. More expensive but can assess open-ended questions where there is no single correct answer.
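
In spirit, basic::subset_of reduces to a substring check. The sketch below illustrates the behavior described above; the real provider implementation may normalize text differently, so treat this as an approximation:

```python
def subset_of(expected_answer: str, generated_answer: str) -> float:
    """Rough sketch of basic::subset_of: 1.0 if the expected answer
    appears anywhere in the generated text, else 0.0."""
    return 1.0 if expected_answer in generated_answer else 0.0

print(subset_of("4", "2 + 2 equals 4."))              # 1.0
print(subset_of("blue", "The sky is usually blue."))  # 1.0
print(subset_of("4", "The answer is four."))          # 0.0
```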

Benchmarks

List the benchmarks available

python 4_list_benchmarks.py
Connecting to Llama Stack server at: http://llamastack-distribution-vllm-service:8321
Fetching available benchmarks...
No benchmarks found

Register a benchmark. A Benchmark combines a Dataset (what questions to ask) with one or more Scoring Functions (how to grade the answers). An Eval then runs a benchmark against a specific candidate model to produce scored results.

Here’s the snippet of code used by the script to register the benchmark in Llama Stack:

# Create the Llama Stack client
client = LlamaStackClient(base_url=base_url)

provider_id = os.getenv("LLAMA_STACK_BENCHMARK_PROVIDER_ID")
if provider_id:
    logger.info(f"Using benchmark provider: {provider_id}")

logger.info("Registering benchmark: my-basic-quality-benchmark")

try:
    client.benchmarks.register(
        benchmark_id="my-basic-quality-benchmark",
        dataset_id="basic-subset-of-evals",
        scoring_functions=["basic::subset_of"]
    )
    logger.info("Benchmark registered successfully")
except Exception as e:
    logger.error(f"Failed to register benchmark: {e}")

Run this command to register it:

python 5_register_benchmark.py
Connecting to Llama Stack server at: http://llamastack-distribution-vllm-service:8321
Registering benchmark: my-basic-quality-benchmark
Benchmark registered successfully

List benchmarks

python 4_list_benchmarks.py
Connecting to Llama Stack server at: http://llamastack-distribution-vllm-service:8321
Fetching available benchmarks...
Found 1 benchmark(s):

  Benchmark ID: my-basic-quality-benchmark
    Dataset: basic-subset-of-evals
    Scoring Functions: ['basic::subset_of']
    Provider: meta-reference

Evals

Now that you have a dataset, a scoring function, and a benchmark, you are ready to execute your eval. Before you do, check which models are available to you, as one of them will be your CANDIDATE_MODEL.

python 6_list_models.py
Connecting to Llama Stack server at: http://llamastack-distribution-vllm-service:8321
Fetching available models...
Found 7 model(s):

  Model ID: granite-embedding-125m
    Type: embedding
    Provider: sentence-transformers
    Metadata: {'embedding_dimension': 768.0}

  Model ID: sentence-transformers/nomic-ai/nomic-embed-text-v1.5
    Type: embedding
    Provider: sentence-transformers
    Metadata: {'embedding_dimension': 768.0, 'default_configured': True}

  Model ID: vllm/Llama-Guard-3-1B
    Type: llm
    Provider: vllm

  Model ID: vllm/nomic-embed-text-v1-5
    Type: llm
    Provider: vllm

  Model ID: vllm/qwen3-14b
    Type: llm
    Provider: vllm

  Model ID: vllm/granite-4-0-h-tiny
    Type: llm
    Provider: vllm

  Model ID: vllm/llama-scout-17b
    Type: llm
    Provider: vllm

Set your CANDIDATE_MODEL

export CANDIDATE_MODEL=vllm/qwen3-14b

Execute the eval

python 7_execute_eval.py
Connecting to Llama Stack server at: http://llamastack-distribution-vllm-service:8321
Running eval for benchmark: my-basic-quality-benchmark
Using candidate model: vllm/qwen3-14b
Eval job started: 0

Make note of Eval job started: 0. That job ID is needed to retrieve the results of the eval job; set an env var called LLAMA_STACK_JOB_ID accordingly.
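
Under the hood, 7_execute_eval.py likely wraps a call along these lines. The benchmark_config shape is an assumption based on the Llama Stack eval API, and the client call itself is left as a comment since it needs a live server:

```python
import os

# Shape of the config a script like 7_execute_eval.py would pass; the
# exact field names are an assumption based on the Llama Stack eval API.
candidate_model = os.getenv("CANDIDATE_MODEL", "vllm/qwen3-14b")
benchmark_config = {
    "eval_candidate": {
        "type": "model",
        "model": candidate_model,
        "sampling_params": {"max_tokens": 512},
    },
}

# Against a live server this would look roughly like:
# job = client.eval.run_eval(
#     benchmark_id="my-basic-quality-benchmark",
#     benchmark_config=benchmark_config,
# )
# print(f"Eval job started: {job.job_id}")
print(benchmark_config["eval_candidate"]["model"])
```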

You can also get a listing of the current benchmarks

curl -s "$LLAMA_STACK_BASE_URL/v1/eval/benchmarks" | jq
{
  "data": [
    {
      "identifier": "my-basic-quality-benchmark",
      "provider_resource_id": "my-basic-quality-benchmark",
      "provider_id": "meta-reference",
      "type": "benchmark",
      "dataset_id": "basic-subset-of-evals",
      "scoring_functions": [
        "basic::subset_of"
      ],
      "metadata": {}
    }
  ]
}

There is no API to list all jobs, but if you forget the number you can guess likely IDs and check their status:

curl -s "$LLAMA_STACK_BASE_URL/v1/eval/benchmarks/my-basic-quality-benchmark/jobs/0" | jq
{
  "job_id": "0",
  "status": "completed"
}
LLAMA_STACK_JOB_ID=0 python 8_review_eval_job.py
Connecting to Llama Stack server at: http://llamastack-distribution-vllm-service:8321
Fetching eval job result: benchmark=my-basic-quality-benchmark job_id=0
Scores:
  Scoring Function: basic::subset_of
  Aggregated: {'accuracy': {'accuracy': 1.0, 'num_correct': 3.0, 'num_total': 3}}
  Rows: 3
   Row 1: {'score': 1.0} | Generation: {'generated_answer': '<think>\nOkay, the user asked "What is 2+2?" That\'s a straightforward math question. Let me think. In basic arithmetic, 2 plus 2 equals 4. But maybe they want more than that? Let me check if there\'s any context I\'m missing. They didn\'t specify any'}
   Row 2: {'score': 1.0} | Generation: {'generated_answer': '<think>\nOkay, the user is asking, "What color is the sky?" Hmm, that seems straightforward, but I need to make sure I cover all the bases. Let\'s start with the basics. On a clear day, the sky is usually blue. But wait, why blue? I remember something about Rayleigh'}
   Row 3: {'score': 1.0} | Generation: {'generated_answer': "<think>\nOkay, so the user is asking who wrote Romeo and Juliet. Let me think. I know that Shakespeare is the most famous author associated with that play. But wait, I should make sure I'm not missing any details. Let me recall: Romeo and Juliet is one of the most well-known tragedies by William"}

Understanding the output:

  • accuracy: 1.0, num_correct: 3.0, num_total: 3 — all 3 test cases passed. This aggregated score is what you would track over time to detect regressions when changing models or prompts.

  • score: 1.0 on each row means the expected answer (e.g. "4", "blue", "Shakespeare") was found somewhere within the model’s generated response. That is what basic::subset_of does — a simple substring check.

  • The <think> tags you see in the generated answers are Qwen3’s chain-of-thought reasoning. The model "thinks out loud" before producing its final answer. The scorer ignores these tags and just checks if the expected answer appears in the full response.
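
If you ever want to grade only the final answer rather than the full response, the reasoning block is easy to strip before scoring. This is a small sketch, not something the lab scripts do:

```python
import re

def strip_think(text: str) -> str:
    """Remove a leading <think>...</think> reasoning block, if present."""
    return re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL).strip()

raw = "<think>\nThe user asked 2+2. Basic arithmetic.\n</think>\n2 + 2 equals 4."
print(strip_think(raw))  # 2 + 2 equals 4.
```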

The basic::subset_of scorer is effective for clear-cut factual questions, but it cannot assess whether an explanation is clear, accurate, or well-structured. A more sophisticated way to review and judge the response string from a model is to use the LLM-as-judge pattern.

LLM-as-judge

In this pattern, a separate LLM (the "judge") evaluates the candidate model’s responses on a 1-5 scale and provides a written rationale. This gives much richer feedback than a binary pass/fail — the judge can assess accuracy, clarity, completeness, and presentation. The trade-off is cost and speed: every evaluation requires an inference call to the judge model in addition to the candidate.

Set the judge and candidate model environment variables. Here we use Llama Scout 17B as the judge and Qwen3 14B as the candidate being evaluated:

export JUDGE_MODEL=vllm/llama-scout-17b
export CANDIDATE_MODEL=vllm/qwen3-14b

The LLM-as-judge example reuses the earlier dataset, basic-subset-of-evals, but dynamically registers a new scoring function and benchmark:

DATASET_ID = "basic-subset-of-evals"
SCORING_FN_ID = "my-llm-as-judge-scoring-fn"
BENCHMARK_ID = "my-llm-as-judge-benchmark"

Here’s the snippet from the Python script:

# Create the Llama Stack client
client = LlamaStackClient(base_url=LLAMA_STACK_BASE_URL)

try:
    client.scoring_functions.register(
        scoring_fn_id=SCORING_FN_ID,
        description="LLM-as-judge scoring function for evaluating response quality",
        return_type={"type": "string"},
        provider_id="llm-as-judge",
        provider_scoring_fn_id="llm-as-judge-base",
        params={
            "type": "llm_as_judge",
            "judge_model": JUDGE_MODEL,
            "prompt_template": judge_prompt,
        },
    )
    logger.info(f"Scoring function '{SCORING_FN_ID}' registered successfully")
except Exception as e:
    logger.error(f"Failed to register scoring function: {e}")
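
The judge_prompt referenced above is not shown in the snippet. A hypothetical template in the same spirit might look like the one below; the placeholder names ({input_query}, {expected_answer}, {generated_answer}) are an assumption about what the llm-as-judge provider substitutes:

```python
# Hypothetical judge prompt. The placeholder names are an assumption about
# the fields the llm-as-judge provider substitutes into the template.
judge_prompt = """You are grading an AI model's answer.

Question: {input_query}
Expected answer: {expected_answer}
Generated answer: {generated_answer}

Rate the generated answer for quality and accuracy on a scale of 1-5,
where 5 means fully correct and well presented and 1 means incorrect.
Explain your reasoning, then state the score."""

rendered = judge_prompt.format(
    input_query="What is 2+2?",
    expected_answer="4",
    generated_answer="2 + 2 equals 4.",
)
print(rendered)
```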

Execute 9_llm_as_judge.py

python 9_llm_as_judge.py
Connecting to Llama Stack server at: http://llamastack-distribution-vllm-service:8321
Registering scoring function: my-llm-as-judge-scoring-fn
Using judge model: vllm/llama-scout-17b
Using candidate model: vllm/qwen3-14b
Scoring function 'my-llm-as-judge-scoring-fn' registered successfully
Benchmark 'my-llm-as-judge-benchmark' registered successfully
Eval job started: 1

================================================================================
EVALUATION RESULTS
================================================================================

--- Evaluation 1 ---
Generated Answer:

2 + 2 equals **4**.

This is a basic arithmetic operation where adding two units to another two units results in four units. Let me know if you'd like further clarification! 😊
Score: None
Judge Feedback:
I would give this response a score of 5 out of 5 for quality and accuracy. Here's why:

**Accuracy: 5/5**
The generated answer is mathematically correct: 2 + 2 indeed equals 4.

**Quality: 5/5**
The response exceeds the expected answer in several ways:

1. **Confirmation of correctness**: The generated answer reiterates the correct calculation, providing reassurance.
2. **Explanation**: The response provides a brief, clear explanation of the arithmetic operation, which helps to demonstrate understanding and provide context.
3. **Additional support**: The offer to provide further clarification shows a willingness to help and support the user, which is a valuable aspect of a high-quality response.
4. **Tone and presentation**: The use of a friendly tone (😊) and clear formatting (**4**) make the response engaging and easy to read.

Overall, the generated response is not only accurate but also provides a clear, helpful, and well-presented answer that demonstrates a good understanding of the question and the user's needs.

--- Evaluation 2 ---
Generated Answer:

The color of the sky depends on several factors, including the time of day, weather conditions, and atmospheric composition. Here's a breakdown:

1. **Daytime (Midday):**
   The sky typically appe...
Score: None
Judge Feedback:
**Score: 5**

The generated response is of exceptional quality and accuracy for several reasons:

1. **Comprehensive Coverage**: The response thoroughly addresses the question by explaining that the color of the sky is not static but depends on various factors such as time of day, weather conditions, and atmospheric composition. It covers multiple scenarios, including daytime, sunrise/sunset, cloudy or stormy weather, night, and other specific contexts like the Moon and high altitudes.

2. **Scientific Accuracy**: The explanation is scientifically accurate, particularly in discussing Rayleigh scattering as the reason the sky appears blue during the day. It also correctly explains the changes in sky color during sunrise/sunset and the effects of clouds and atmospheric conditions.

3. **Clarity and Detail**: The response is clear, well-organized, and detailed. It uses specific examples and straightforward language to explain complex phenomena, making it easy to understand for a wide range of readers.

4. **Engagement**: The use of emojis (e.g., 🌅) adds a touch of engagement and modern communication style, making the response more appealing to readers.

5. **Contextualization**: By providing a nuanced view that considers different conditions and locations, the response contextualizes the answer, showing that the color of the sky can vary greatly.

Overall, the response demonstrates a deep understanding of the topic, presents information in an accessible way, and fully addresses the question's implications, earning it a perfect score.

--- Evaluation 3 ---
Generated Answer:

**Romeo and Juliet** was written by **William Shakespeare**, one of the most renowned playwrights in English literature. The play is believed to have been composed around **1596–1597** and first per...
Score: None
Judge Feedback:
**Score: 5**

The generated response is of exceptionally high quality and accuracy. Here's why:

1. **Direct and Clear Answer**: The response directly states that William Shakespeare wrote Romeo and Juliet, which matches the expected answer.

2. **Additional Contextual Information**: The response provides additional valuable context about the play, including:
   - The approximate composition date (1596-1597).
   - The first performance by the Lord Chamberlain's Men.
   - The publication details (quarto form in 1597 and inclusion in the First Folio of 1623).

3. **Source Material and Influences**: It discusses the source material that Shakespeare used, including:
   - Arthur Brooke's 1562 poem "The Tragical History of Romeus and Juliet".
   - 16th-century Italian tales by Matteo Bandello and Masuccio Salernitano.

4. **Analysis of Shakespeare's Contribution**: The response highlights how Shakespeare adapted the story into a dramatic tragedy, emphasizing themes that have made it an enduring work in world literature.

5. **Clarity and Structure**: The information is presented clearly and structured logically, making it easy to follow. The use of a "Key Context" section helps to organize the additional information.

6. **Accuracy**: The response is accurate in its details, citing specific dates, sources, and historical context related to the play and its author.

Overall, the response not only answers the question accurately but also provides a rich context that enhances understanding of the play and its significance. This makes it an exemplary response worthy of a score of 5.

================================================================================
Total evaluations: 3
================================================================================

Notice how the judge model evaluates multiple dimensions — accuracy, clarity, completeness, and presentation — and provides a structured written rationale for each score. This is far richer than the binary pass/fail from basic::subset_of. However, the eval is only as good as the judge model’s reasoning ability, so choosing a capable judge model matters.

Now let’s have a little fun and evaluate two different models. This test asks each model to identify itself. The expected answer is "Qwen". When we run a model that is Qwen, it should score well. When we run a completely different model (Granite), it will honestly identify itself as an IBM model — and the judge will score it low because it doesn’t match the expected answer. This demonstrates how evals can verify model identity and catch cases where a model has been swapped or misconfigured in your serving infrastructure.

First, register a new dataset with a "what model are you?" question:

python 10_register_dataset.py \
    --base-url $LLAMA_STACK_BASE_URL \
    --dataset-uri "https://raw.githubusercontent.com/burrsutter/fantaco-redhat-one-2026/main/evals-llama-stack/datasets/what-model-are-you-eval.csv"
Connecting to Llama Stack server at: http://llamastack-distribution-vllm-service:8321
Registering dataset: what-model-are-you-eval
Dataset URI: https://raw.githubusercontent.com/burrsutter/fantaco-redhat-one-2026/main/evals-llama-stack/datasets/what-model-are-you-eval.csv
Using dataset provider: localfs
Purpose: eval/question-answer
Registered dataset: what-model-are-you-eval
export JUDGE_MODEL=vllm/llama-scout-17b
export CANDIDATE_MODEL=vllm/qwen3-14b
python 11_llm_as_judge_what_model_am_i.py
Connecting to Llama Stack server at: http://llamastack-distribution-vllm-service:8321
Registering scoring function: what-model-scoring-fn
Using judge model: vllm/llama-scout-17b
Using candidate model: vllm/qwen3-14b
Scoring function 'what-model-scoring-fn' registered successfully
Benchmark 'what-model-benchmark' registered successfully
Eval job started: 5

================================================================================
WHAT MODEL ARE YOU - EVALUATION RESULTS
================================================================================

--- Evaluation 1 ---
Generated Answer:

I am Qwen, a large language model developed by Alibaba Cloud. I can answer questions, create text, and provide information on a wide range of topics. I support multiple languages, including Chinese ...
Score: None
Judge Feedback:
I would give this response a score of 5.

The generated answer clearly and accurately identifies the model's name and type: "I am Qwen, a large language model developed by Alibaba Cloud." This matches the expected answer exactly, and the additional information provided about the model's capabilities and features is supplementary and not required to meet the identification criteria.

Therefore, the model correctly identifies itself, and I award a score of 5.

================================================================================
Total evaluations: 1
================================================================================

Now try a different model. Granite is an IBM model, so it will not identify itself as Qwen:

export JUDGE_MODEL=vllm/llama-scout-17b
export CANDIDATE_MODEL=vllm/granite-4-0-h-tiny
python 11_llm_as_judge_what_model_am_i.py
Connecting to Llama Stack server at: http://llamastack-distribution-vllm-service:8321
Registering scoring function: what-model-scoring-fn
Using judge model: vllm/llama-scout-17b
Using candidate model: vllm/granite-4-0-h-tiny
Scoring function 'what-model-scoring-fn' already exists, skipping registration
Benchmark 'what-model-benchmark' registered successfully
Eval job started: 6

================================================================================
WHAT MODEL ARE YOU - EVALUATION RESULTS
================================================================================

--- Evaluation 1 ---
Generated Answer: I am an AI language model developed by IBM for generating human-like text based on the input I receive. My main goal is to assist users in various tasks such as answering questions, providing informat...
Score: None
Judge Feedback:
I would give this response a score of 1.

The expected answer is "I am Qwen, a large language model developed by Alibaba Cloud", but the generated answer claims to be an AI language model developed by IBM, which is completely different from the expected answer. This indicates that the model incorrectly identifies itself. Therefore, I give it a score of 1.

A score of 5 would require the model to accurately state "I am Qwen, a large language model developed by Alibaba Cloud". A score of 3 would require a partially correct identification, such as "I am a large language model" without specifying the name or developer, but still conveying the type of model. However, the generated answer not only fails to identify the correct name and developer but also provides incorrect information, warranting a score of 1.

================================================================================
Total evaluations: 1
================================================================================

As expected, Granite scores 1 — it correctly identifies itself as an IBM model, but that doesn’t match the expected answer of "Qwen". This isn’t a failure of the model; it is the eval working as designed. In production, a test like this ensures the model you think you’re running is the one actually serving requests.

Summary

In this module you:

  • Registered a dataset of questions and golden answers with Llama Stack

  • Explored scoring functions — both deterministic (basic::subset_of) and LLM-based (llm-as-judge)

  • Created benchmarks that combine datasets with scoring functions

  • Ran evals against candidate models and reviewed the results

  • Compared two different models (Qwen3 vs Granite) to see how evals catch differences in model behavior

These building blocks — datasets, scoring functions, benchmarks, and evals — are designed to be integrated into CI/CD pipelines so that model changes, prompt updates, or configuration shifts are automatically validated before reaching production.
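
As a closing illustration, a CI gate over the aggregated score from step 8 can be very small. This is a sketch, assuming the aggregated-results dict shape shown earlier in this module:

```python
def eval_gate(aggregated: dict, threshold: float = 0.9) -> bool:
    """Return False if eval accuracy drops below the threshold.
    `aggregated` follows the shape printed by 8_review_eval_job.py."""
    accuracy = aggregated["accuracy"]["accuracy"]
    return accuracy >= threshold

# Result shape from the run above: all 3 of 3 rows correct.
results = {"accuracy": {"accuracy": 1.0, "num_correct": 3.0, "num_total": 3}}
if eval_gate(results):
    print("PASS: candidate model meets the quality bar")
else:
    raise SystemExit("FAIL: eval accuracy regression detected")
```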