Evals with Llama Stack
Any change to one of the following can have a major impact on the behaviors and responses of an AI-infused application:
- model vendor (e.g. Llama, Qwen, Mistral, Google, OpenAI)
- model version (e.g. Gemini 2.5, Qwen3)
- model parameter count (e.g. 7B, 14B, 70B)
- model quantization (e.g. q4_K_M, fp16)
- model server vendor (e.g. vLLM, Ollama, OpenAI)
- model server configuration (e.g. tool-call-parser, chat-template)
- prompts
- and of course the application code (e.g. .py, .js, .ts, .java)
Why does this matter? A model upgrade that improves math reasoning might simultaneously degrade tool-calling accuracy. A quantization change that halves memory usage could introduce subtle factual errors. Without automated evals, teams discover these regressions in production, the most expensive place to find them.
Llama Stack includes first-class evaluation (Evals) capabilities. Evals in Llama Stack are dataset-driven and integrate directly with agents, tools, and models, enabling teams to measure correctness, safety, and behavior across real workflows—not isolated prompts. By supporting golden answers, deterministic scoring, and LLM-as-judge patterns, Llama Stack makes evaluations repeatable, automatable, and CI/CD-friendly, so teams can confidently test changes to prompts, tools, models, or agent logic before deploying to production.
Therefore, having an automated way to perform evals is mission critical to the ongoing care and feeding of the model candidates that support an agent or any LLM-wrapping application.
For this module, we are going to focus on how to set up and execute model evals: tests that you run against different candidate models.
Setup
The evals-llama-stack directory contains a set of numbered Python scripts (1_list_eval_related_providers.py, 2_register_dataset_basic_subset_of.py, etc.) that progressively build up the eval pipeline, plus a datasets/ folder with CSV test data containing questions and expected answers. You will run these scripts in order.
Make sure you are in the correct directory
cd $HOME/fantaco-redhat-one-2026/
pwd
/home/lab-user/fantaco-redhat-one-2026
If needed, create a Python virtual environment (venv)
python -m venv .venv
Activate the virtual environment
source .venv/bin/activate
Change to the correct sub-directory
cd evals-llama-stack
Install the dependencies
pip install -r requirements.txt
Objects & APIs
The Llama Stack Evals capabilities involve the following objects/APIs:
- Datasets
- Scoring Functions
- Benchmarks
- Evals
- Jobs
First, see if the appropriate providers are configured on your Llama Stack Server instance.
APIs and providers
python 1_list_eval_related_providers.py
Connecting to Llama Stack server at: http://llamastack-distribution-vllm-service:8321
Fetching available providers...
Found 2 Dataset provider(s):
Provider ID: huggingface
Type: remote::huggingface
API: datasetio
Provider ID: localfs
Type: inline::localfs
API: datasetio
Found 1 Eval provider(s):
Provider ID: meta-reference
Type: inline::meta-reference
API: eval
Found 2 Scoring provider(s):
Provider ID: basic
Type: inline::basic
API: scoring
Provider ID: llm-as-judge
Type: inline::llm-as-judge
API: scoring
What are providers? Providers are pluggable backends that implement Llama Stack APIs. For datasets, for example, you can use a remote provider such as huggingface or a local one such as localfs, as shown in the listing above.
Datasets
Any datasets registered?
curl -s -H "Authorization: Bearer notapplicable" $LLAMA_STACK_BASE_URL/v1/datasets | jq
{
"data": []
}
Register a Dataset based on a CSV file. The CSV file is hosted on GitHub.
The dataset contains rows with an input_query (the question to ask the model), an expected_answer (the "golden answer" to score against), and a chat_completion_input (the prompt formatted for the Llama Stack chat API). For example, one row asks "What is 2+2?" with an expected answer of "4".
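To make the column layout concrete, here is a short sketch that parses a representative row. The values shown are illustrative, not a verbatim copy of the lab's file:

```python
import csv
import io
import json

# Representative rows in the same three-column layout (illustrative values,
# not a verbatim copy of the lab's CSV file).
sample = (
    'input_query,expected_answer,chat_completion_input\n'
    'What is 2+2?,4,"[{""role"": ""user"", ""content"": ""What is 2+2?""}]"\n'
)

rows = list(csv.DictReader(io.StringIO(sample)))
# chat_completion_input holds the same question pre-formatted as chat messages
messages = json.loads(rows[0]["chat_completion_input"])
```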
python 2_register_dataset_basic_subset_of.py
Connecting to Llama Stack server at: http://llamastack-distribution-vllm-service:8321
Registering dataset: basic-subset-of-evals
Using dataset provider: localfs
Registered dataset: basic-subset-of-evals
List the registered datasets
python 3_list_datasets.py
Connecting to Llama Stack server at: http://llamastack-distribution-vllm-service:8321
Fetching available datasets...
Found 1 dataset(s):
Dataset ID: basic-subset-of-evals
Provider: localfs
Source: https://raw.githubusercontent.com/burrsutter/fantaco-redhat-one-2026/main/evals-llama-stack/datasets/basic-subset-of-evals.csv
Scoring Functions
List the scoring functions available
python 4_list_scoring_functions.py
Connecting to Llama Stack server at: http://llamastack-distribution-vllm-service:8321
Fetching available scoring functions...
Found 8 scoring function(s):
Scoring Function ID: basic::equality
Provider: basic
Description: Returns 1.0 if the input is equal to the target, 0.0 otherwise.
Scoring Function ID: basic::subset_of
Provider: basic
Description: Returns 1.0 if the expected is included in generated, 0.0 otherwise.
Scoring Function ID: basic::regex_parser_multiple_choice_answer
Provider: basic
Description: Extract answer from response matching Answer: [the_answer_letter], and compare with expected result
Scoring Function ID: basic::regex_parser_math_response
Provider: basic
Description: For math related benchmarks, extract answer from the generated response and expected_answer and see if they match
Scoring Function ID: basic::ifeval
Provider: basic
Description: Eval intruction follow capacity by checkping how many instructions can be followed in each example
Scoring Function ID: basic::docvqa
Provider: basic
Description: DocVQA Visual Question & Answer scoring function
Scoring Function ID: llm-as-judge::base
Provider: llm-as-judge
Description: Llm As Judge Scoring Function
Scoring Function ID: llm-as-judge::405b-simpleqa
Provider: llm-as-judge
Description: Llm As Judge Scoring Function for SimpleQA Benchmark (https://github.com/openai/simple-evals/blob/main/simpleqa_eval.py)
These scoring functions fall into two categories: deterministic scorers (the basic:: functions), which grade programmatically with string and regex checks, and LLM-based scorers (the llm-as-judge:: functions), which use another model to grade responses.
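Conceptually, basic::subset_of reduces to a one-line substring check. This sketch mirrors its behavior, not the provider's actual implementation:

```python
def subset_of(expected_answer: str, generated_answer: str) -> float:
    # Score 1.0 if the expected answer appears anywhere in the generation,
    # 0.0 otherwise; mirrors the behavior of basic::subset_of.
    return 1.0 if expected_answer in generated_answer else 0.0

print(subset_of("4", "The answer to 2+2 is 4."))      # scores 1.0
print(subset_of("blue", "The sky appears azure."))    # scores 0.0
```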
Benchmarks
List the benchmarks available
python 4_list_benchmarks.py
Connecting to Llama Stack server at: http://llamastack-distribution-vllm-service:8321
Fetching available benchmarks...
No benchmarks found
Register a benchmark. A Benchmark combines a Dataset (what questions to ask) with one or more Scoring Functions (how to grade the answers). An Eval then runs a benchmark against a specific candidate model to produce scored results.
Here’s the snippet of code used by the script to register the benchmark in Llama Stack (indentation restored; error handling shown abbreviated):
# Create the Llama Stack client
client = LlamaStackClient(base_url=base_url)

provider_id = os.getenv("LLAMA_STACK_BENCHMARK_PROVIDER_ID")
if provider_id:
    logger.info(f"Using benchmark provider: {provider_id}")

logger.info("Registering benchmark: my-basic-quality-benchmark")
try:
    client.benchmarks.register(
        benchmark_id="my-basic-quality-benchmark",
        dataset_id="basic-subset-of-evals",
        scoring_functions=["basic::subset_of"],
    )
except Exception as e:
    logger.error(f"Failed to register benchmark: {e}")
    raise
Run this command to register it:
python 5_register_benchmark.py
Connecting to Llama Stack server at: http://llamastack-distribution-vllm-service:8321
Registering benchmark: my-basic-quality-benchmark
Benchmark registered successfully
List benchmarks
python 4_list_benchmarks.py
Connecting to Llama Stack server at: http://llamastack-distribution-vllm-service:8321
Fetching available benchmarks...
Found 1 benchmark(s):
Benchmark ID: my-basic-quality-benchmark
Dataset: basic-subset-of-evals
Scoring Functions: ['basic::subset_of']
Provider: meta-reference
Evals
Now that you have a dataset, a scoring function, and a benchmark, you are ready to execute your eval. Before you do, it is important to know which models are available to you, as one of them will be your CANDIDATE_MODEL.
python 6_list_models.py
Connecting to Llama Stack server at: http://llamastack-distribution-vllm-service:8321
Fetching available models...
Found 7 model(s):
Model ID: granite-embedding-125m
Type: embedding
Provider: sentence-transformers
Metadata: {'embedding_dimension': 768.0}
Model ID: sentence-transformers/nomic-ai/nomic-embed-text-v1.5
Type: embedding
Provider: sentence-transformers
Metadata: {'embedding_dimension': 768.0, 'default_configured': True}
Model ID: vllm/Llama-Guard-3-1B
Type: llm
Provider: vllm
Model ID: vllm/nomic-embed-text-v1-5
Type: llm
Provider: vllm
Model ID: vllm/qwen3-14b
Type: llm
Provider: vllm
Model ID: vllm/granite-4-0-h-tiny
Type: llm
Provider: vllm
Model ID: vllm/llama-scout-17b
Type: llm
Provider: vllm
Set your CANDIDATE_MODEL
export CANDIDATE_MODEL=vllm/qwen3-14b
Execute the eval
python 7_execute_eval.py
Connecting to Llama Stack server at: http://llamastack-distribution-vllm-service:8321
Running eval for benchmark: my-basic-quality-benchmark
Using candidate model: vllm/qwen3-14b
Eval job started: 0
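The body of 7_execute_eval.py is not reproduced in this module. As a sketch, a script like it would start the job through the llama-stack-client eval API, roughly like this (the sampling parameters here are illustrative assumptions):

```python
def start_eval(client, benchmark_id: str, candidate_model: str):
    # Kick off an eval job: the server asks the candidate model every
    # question in the benchmark's dataset, then scores the answers with
    # the benchmark's scoring functions.
    return client.eval.run_eval(
        benchmark_id=benchmark_id,
        benchmark_config={
            "eval_candidate": {
                "type": "model",
                "model": candidate_model,
                # Illustrative sampling parameters; tune for your models.
                "sampling_params": {"max_tokens": 512},
            },
        },
    )

# Usage (requires a running Llama Stack server):
# from llama_stack_client import LlamaStackClient
# client = LlamaStackClient(base_url=os.environ["LLAMA_STACK_BASE_URL"])
# job = start_eval(client, "my-basic-quality-benchmark", os.environ["CANDIDATE_MODEL"])
# print(f"Eval job started: {job.job_id}")
```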
Make note of the Eval job started: 0 line. That job ID will be needed to reveal the results of the eval job. Set an env var called LLAMA_STACK_JOB_ID accordingly:
export LLAMA_STACK_JOB_ID=0
You can also get a listing of the current benchmarks
curl -s "$LLAMA_STACK_BASE_URL/v1/eval/benchmarks" | jq
{
"data": [
{
"identifier": "my-basic-quality-benchmark",
"provider_resource_id": "my-basic-quality-benchmark",
"provider_id": "meta-reference",
"type": "benchmark",
"dataset_id": "basic-subset-of-evals",
"scoring_functions": [
"basic::subset_of"
],
"metadata": {}
}
]
}
There is no API to list all jobs, but if you forget the job ID you can guess at the number and check its status:
curl -s "$LLAMA_STACK_BASE_URL/v1/eval/benchmarks/my-basic-quality-benchmark/jobs/0" | jq
{
"job_id": "0",
"status": "completed"
}
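The same status check and result retrieval can be scripted. A sketch, assuming the llama-stack-client jobs API (status and retrieve) and a running server:

```python
import time

def wait_for_job(client, benchmark_id: str, job_id: str, poll_seconds: float = 2.0):
    # Poll the job until it leaves the in-progress state, then fetch the
    # scored results. Terminal status names follow the server output above.
    while True:
        job = client.eval.jobs.status(job_id=job_id, benchmark_id=benchmark_id)
        if job.status in ("completed", "failed", "cancelled"):
            break
        time.sleep(poll_seconds)
    return client.eval.jobs.retrieve(job_id=job_id, benchmark_id=benchmark_id)
```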
LLAMA_STACK_JOB_ID=0 python 8_review_eval_job.py
Connecting to Llama Stack server at: http://llamastack-distribution-vllm-service:8321
Fetching eval job result: benchmark=my-basic-quality-benchmark job_id=0
Scores:
Scoring Function: basic::subset_of
Aggregated: {'accuracy': {'accuracy': 1.0, 'num_correct': 3.0, 'num_total': 3}}
Rows: 3
Row 1: {'score': 1.0} | Generation: {'generated_answer': '<think>\nOkay, the user asked "What is 2+2?" That\'s a straightforward math question. Let me think. In basic arithmetic, 2 plus 2 equals 4. But maybe they want more than that? Let me check if there\'s any context I\'m missing. They didn\'t specify any'}
Row 2: {'score': 1.0} | Generation: {'generated_answer': '<think>\nOkay, the user is asking, "What color is the sky?" Hmm, that seems straightforward, but I need to make sure I cover all the bases. Let\'s start with the basics. On a clear day, the sky is usually blue. But wait, why blue? I remember something about Rayleigh'}
Row 3: {'score': 1.0} | Generation: {'generated_answer': "<think>\nOkay, so the user is asking who wrote Romeo and Juliet. Let me think. I know that Shakespeare is the most famous author associated with that play. But wait, I should make sure I'm not missing any details. Let me recall: Romeo and Juliet is one of the most well-known tragedies by William"}
Understanding the output:
- accuracy: 1.0, num_correct: 3.0, num_total: 3 means all 3 test cases passed. This aggregated score is what you would track over time to detect regressions when changing models or prompts.
- score: 1.0 on each row means the expected answer (e.g. "4", "blue", "Shakespeare") was found somewhere within the model’s generated response. That is what basic::subset_of does: a simple substring check.
- The <think> tags you see in the generated answers are Qwen3’s chain-of-thought reasoning. The model "thinks out loud" before producing its final answer. The scorer ignores these tags and just checks if the expected answer appears in the full response.
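A quick sketch shows why the substring check tolerates the reasoning preamble:

```python
# The scorer checks for the expected answer anywhere in the full generation,
# so a chain-of-thought preamble does not change the outcome.
generated = "<think>The user asked a simple math question...</think>\n2 + 2 equals 4."
expected = "4"
score = 1.0 if expected in generated else 0.0
```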
The basic::subset_of scorer is effective for clear-cut factual questions, but it cannot assess whether an explanation is clear, accurate, or well-structured. A more sophisticated way to review and judge the response string from a model is to use the LLM-as-judge pattern.
LLM-as-judge
In this pattern, a separate LLM (the "judge") evaluates the candidate model’s responses on a 1-5 scale and provides a written rationale. This gives much richer feedback than a binary pass/fail — the judge can assess accuracy, clarity, completeness, and presentation. The trade-off is cost and speed: every evaluation requires an inference call to the judge model in addition to the candidate.
Set the judge and candidate model environment variables. Here we use Llama Scout 17B as the judge and Qwen3 14B as the candidate being evaluated:
export JUDGE_MODEL=vllm/llama-scout-17b
export CANDIDATE_MODEL=vllm/qwen3-14b
The LLM-as-judge example re-uses the basic-subset-of-evals dataset from before, but dynamically registers a new scoring function and benchmark:
DATASET_ID = "basic-subset-of-evals"
SCORING_FN_ID = "my-llm-as-judge-scoring-fn"
BENCHMARK_ID = "my-llm-as-judge-benchmark"
Here’s the snippet from the Python script (indentation restored; error handling shown abbreviated):
# Create the Llama Stack client
client = LlamaStackClient(base_url=LLAMA_STACK_BASE_URL)

try:
    client.scoring_functions.register(
        scoring_fn_id=SCORING_FN_ID,
        description="LLM-as-judge scoring function for evaluating response quality",
        return_type={"type": "string"},
        provider_id="llm-as-judge",
        provider_scoring_fn_id="llm-as-judge-base",
        params={
            "type": "llm_as_judge",
            "judge_model": JUDGE_MODEL,
            "prompt_template": judge_prompt,
        },
    )
    logger.info(f"Scoring function '{SCORING_FN_ID}' registered successfully")
except Exception as e:
    logger.error(f"Failed to register scoring function: {e}")
    raise
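The judge_prompt passed as prompt_template is defined elsewhere in the script. A minimal template might look like this (the wording and placeholder names here are illustrative assumptions, not the script's actual prompt; the scoring provider fills in the placeholders for each dataset row before calling the judge model):

```python
# Illustrative judge prompt template; placeholder names are assumptions.
judge_prompt = """You are grading a model's answer to a question.

Question: {input_query}
Expected answer: {expected_answer}
Generated answer: {generated_answer}

Rate the generated answer for quality and accuracy on a scale of 1-5,
starting your reply with "Score: <n>", then explain your reasoning."""

# Example of the prompt a judge model would receive for one dataset row:
filled = judge_prompt.format(
    input_query="What is 2+2?",
    expected_answer="4",
    generated_answer="2 + 2 equals 4.",
)
```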
Execute 9_llm_as_judge.py
python 9_llm_as_judge.py
Connecting to Llama Stack server at: http://llamastack-distribution-vllm-service:8321
Registering scoring function: my-llm-as-judge-scoring-fn
Using judge model: vllm/llama-scout-17b
Using candidate model: vllm/qwen3-14b
Scoring function 'my-llm-as-judge-scoring-fn' registered successfully
Benchmark 'my-llm-as-judge-benchmark' registered successfully
Eval job started: 1
================================================================================
EVALUATION RESULTS
================================================================================
--- Evaluation 1 ---
Generated Answer:
2 + 2 equals **4**.
This is a basic arithmetic operation where adding two units to another two units results in four units. Let me know if you'd like further clarification! 😊
Score: None
Judge Feedback:
I would give this response a score of 5 out of 5 for quality and accuracy. Here's why:
**Accuracy: 5/5**
The generated answer is mathematically correct: 2 + 2 indeed equals 4.
**Quality: 5/5**
The response exceeds the expected answer in several ways:
1. **Confirmation of correctness**: The generated answer reiterates the correct calculation, providing reassurance.
2. **Explanation**: The response provides a brief, clear explanation of the arithmetic operation, which helps to demonstrate understanding and provide context.
3. **Additional support**: The offer to provide further clarification shows a willingness to help and support the user, which is a valuable aspect of a high-quality response.
4. **Tone and presentation**: The use of a friendly tone (😊) and clear formatting (**4**) make the response engaging and easy to read.
Overall, the generated response is not only accurate but also provides a clear, helpful, and well-presented answer that demonstrates a good understanding of the question and the user's needs.
--- Evaluation 2 ---
Generated Answer:
The color of the sky depends on several factors, including the time of day, weather conditions, and atmospheric composition. Here's a breakdown:
1. **Daytime (Midday):**
The sky typically appe...
Score: None
Judge Feedback:
**Score: 5**
The generated response is of exceptional quality and accuracy for several reasons:
1. **Comprehensive Coverage**: The response thoroughly addresses the question by explaining that the color of the sky is not static but depends on various factors such as time of day, weather conditions, and atmospheric composition. It covers multiple scenarios, including daytime, sunrise/sunset, cloudy or stormy weather, night, and other specific contexts like the Moon and high altitudes.
2. **Scientific Accuracy**: The explanation is scientifically accurate, particularly in discussing Rayleigh scattering as the reason the sky appears blue during the day. It also correctly explains the changes in sky color during sunrise/sunset and the effects of clouds and atmospheric conditions.
3. **Clarity and Detail**: The response is clear, well-organized, and detailed. It uses specific examples and straightforward language to explain complex phenomena, making it easy to understand for a wide range of readers.
4. **Engagement**: The use of emojis (e.g., 🌅) adds a touch of engagement and modern communication style, making the response more appealing to readers.
5. **Contextualization**: By providing a nuanced view that considers different conditions and locations, the response contextualizes the answer, showing that the color of the sky can vary greatly.
Overall, the response demonstrates a deep understanding of the topic, presents information in an accessible way, and fully addresses the question's implications, earning it a perfect score.
--- Evaluation 3 ---
Generated Answer:
**Romeo and Juliet** was written by **William Shakespeare**, one of the most renowned playwrights in English literature. The play is believed to have been composed around **1596–1597** and first per...
Score: None
Judge Feedback:
**Score: 5**
The generated response is of exceptionally high quality and accuracy. Here's why:
1. **Direct and Clear Answer**: The response directly states that William Shakespeare wrote Romeo and Juliet, which matches the expected answer.
2. **Additional Contextual Information**: The response provides additional valuable context about the play, including:
- The approximate composition date (1596-1597).
- The first performance by the Lord Chamberlain's Men.
- The publication details (quarto form in 1597 and inclusion in the First Folio of 1623).
3. **Source Material and Influences**: It discusses the source material that Shakespeare used, including:
- Arthur Brooke's 1562 poem "The Tragical History of Romeus and Juliet".
- 16th-century Italian tales by Matteo Bandello and Masuccio Salernitano.
4. **Analysis of Shakespeare's Contribution**: The response highlights how Shakespeare adapted the story into a dramatic tragedy, emphasizing themes that have made it an enduring work in world literature.
5. **Clarity and Structure**: The information is presented clearly and structured logically, making it easy to follow. The use of a "Key Context" section helps to organize the additional information.
6. **Accuracy**: The response is accurate in its details, citing specific dates, sources, and historical context related to the play and its author.
Overall, the response not only answers the question accurately but also provides a rich context that enhances understanding of the play and its significance. This makes it an exemplary response worthy of a score of 5.
================================================================================
Total evaluations: 3
================================================================================
Notice how the judge model evaluates multiple dimensions (accuracy, clarity, completeness, and presentation) and provides a structured written rationale for each score. This is far richer than the binary pass/fail from basic::subset_of.
Now let’s have a little fun and evaluate two different models. This test asks each model to identify itself. The expected answer is "Qwen". When we run a model that is Qwen, it should score well. When we run a completely different model (Granite), it will honestly identify itself as an IBM model — and the judge will score it low because it doesn’t match the expected answer. This demonstrates how evals can verify model identity and catch cases where a model has been swapped or misconfigured in your serving infrastructure.
First, register a new dataset with a "what model are you?" question:
python 10_register_dataset.py \
--base-url $LLAMA_STACK_BASE_URL \
--dataset-uri "https://raw.githubusercontent.com/burrsutter/fantaco-redhat-one-2026/main/evals-llama-stack/datasets/what-model-are-you-eval.csv"
Connecting to Llama Stack server at: http://llamastack-distribution-vllm-service:8321
Registering dataset: what-model-are-you-eval
Dataset URI: https://raw.githubusercontent.com/burrsutter/fantaco-redhat-one-2026/main/evals-llama-stack/datasets/what-model-are-you-eval.csv
Using dataset provider: localfs
Purpose: eval/question-answer
Registered dataset: what-model-are-you-eval
export JUDGE_MODEL=vllm/llama-scout-17b
export CANDIDATE_MODEL=vllm/qwen3-14b
python 11_llm_as_judge_what_model_am_i.py
Connecting to Llama Stack server at: http://llamastack-distribution-vllm-service:8321
Registering scoring function: what-model-scoring-fn
Using judge model: vllm/llama-scout-17b
Using candidate model: vllm/qwen3-14b
Scoring function 'what-model-scoring-fn' registered successfully
Benchmark 'what-model-benchmark' registered successfully
Eval job started: 5
================================================================================
WHAT MODEL ARE YOU - EVALUATION RESULTS
================================================================================
--- Evaluation 1 ---
Generated Answer:
I am Qwen, a large language model developed by Alibaba Cloud. I can answer questions, create text, and provide information on a wide range of topics. I support multiple languages, including Chinese ...
Score: None
Judge Feedback:
I would give this response a score of 5.
The generated answer clearly and accurately identifies the model's name and type: "I am Qwen, a large language model developed by Alibaba Cloud." This matches the expected answer exactly, and the additional information provided about the model's capabilities and features is supplementary and not required to meet the identification criteria.
Therefore, the model correctly identifies itself, and I award a score of 5.
================================================================================
Total evaluations: 1
================================================================================
Now try a different model. Granite is an IBM model, so it will not identify itself as Qwen:
export JUDGE_MODEL=vllm/llama-scout-17b
export CANDIDATE_MODEL=vllm/granite-4-0-h-tiny
python 11_llm_as_judge_what_model_am_i.py
Connecting to Llama Stack server at: http://llamastack-distribution-vllm-service:8321
Registering scoring function: what-model-scoring-fn
Using judge model: vllm/llama-scout-17b
Using candidate model: vllm/granite-4-0-h-tiny
Scoring function 'what-model-scoring-fn' already exists, skipping registration
Benchmark 'what-model-benchmark' registered successfully
Eval job started: 6
================================================================================
WHAT MODEL ARE YOU - EVALUATION RESULTS
================================================================================
--- Evaluation 1 ---
Generated Answer: I am an AI language model developed by IBM for generating human-like text based on the input I receive. My main goal is to assist users in various tasks such as answering questions, providing informat...
Score: None
Judge Feedback:
I would give this response a score of 1.
The expected answer is "I am Qwen, a large language model developed by Alibaba Cloud", but the generated answer claims to be an AI language model developed by IBM, which is completely different from the expected answer. This indicates that the model incorrectly identifies itself. Therefore, I give it a score of 1.
A score of 5 would require the model to accurately state "I am Qwen, a large language model developed by Alibaba Cloud". A score of 3 would require a partially correct identification, such as "I am a large language model" without specifying the name or developer, but still conveying the type of model. However, the generated answer not only fails to identify the correct name and developer but also provides incorrect information, warranting a score of 1.
================================================================================
Total evaluations: 1
================================================================================
As expected, Granite scores 1: it correctly identifies itself as an IBM model, but that doesn’t match the expected answer of "Qwen". This isn’t a failure of the model; it is the eval working as designed. In production, a test like this ensures the model you think you’re running is the one actually serving requests.
Summary
In this module you:
- Registered a dataset of questions and golden answers with Llama Stack
- Explored scoring functions, both deterministic (basic::subset_of) and LLM-based (llm-as-judge)
- Created benchmarks that combine datasets with scoring functions
- Ran evals against candidate models and reviewed the results
- Compared two different models (Qwen3 vs Granite) to see how evals catch differences in model behavior
These building blocks — datasets, scoring functions, benchmarks, and evals — are designed to be integrated into CI/CD pipelines so that model changes, prompt updates, or configuration shifts are automatically validated before reaching production.
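As a sketch of such a CI/CD gate: fail the pipeline when aggregated accuracy drops below a threshold. The dictionary shape below matches the Aggregated line printed by 8_review_eval_job.py; the threshold value is an assumption you would tune per benchmark.

```python
import sys

def gate(aggregated: dict, threshold: float = 0.9) -> None:
    # 'aggregated' matches the shape printed by 8_review_eval_job.py, e.g.
    # {'accuracy': {'accuracy': 1.0, 'num_correct': 3.0, 'num_total': 3}}
    accuracy = aggregated["accuracy"]["accuracy"]
    if accuracy < threshold:
        print(f"FAIL: accuracy {accuracy:.2f} below threshold {threshold:.2f}")
        sys.exit(1)  # non-zero exit fails the CI job
    print(f"PASS: accuracy {accuracy:.2f}")

gate({"accuracy": {"accuracy": 1.0, "num_correct": 3.0, "num_total": 3}})
```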