Exploring Llama Stack

Now that you have successfully deployed a Llama Stack instance, it’s time to explore its capabilities through hands-on interaction with the APIs. In this module, you’ll learn how to connect to your Llama Stack server, discover available models and APIs, and interact with the inference endpoints.

Llama Stack provides multiple API interfaces for different use cases:

  • Inference APIs: For generating responses using language models

  • Agent APIs: For building multi-turn conversational agents with tool calling

  • RAG APIs: For retrieval-augmented generation with vector databases

  • Safety APIs: For content moderation and guardrails

  • Tool APIs: For integrating external capabilities

This exploration will help you understand the Llama Stack API landscape and prepare you for building agentic applications.

Connecting to Llama Stack

export LLAMA_STACK_BASE_URL=http://llamastack-distribution-vllm-service.agentic-{user}.svc:8321
export INFERENCE_MODEL=vllm/qwen3-14b
echo "LLAMA_STACK_BASE_URL="$LLAMA_STACK_BASE_URL
echo "INFERENCE_MODEL="$INFERENCE_MODEL
LLAMA_STACK_BASE_URL=http://llamastack-distribution-vllm-service.agentic-{user}.svc:8321
INFERENCE_MODEL=vllm/qwen3-14b

The Llama Stack server is accessible via its internal Kubernetes service URL on port 8321. The INFERENCE_MODEL variable specifies which model to use for inference requests - in this lab, we're primarily using the Qwen3-14B model.

Reading the service URL: http://llamastack-distribution-vllm-service.agentic-{user}.svc:8321 follows the Kubernetes DNS convention: <service-name>.<namespace>.svc:<port>. This means your requests stay inside the cluster network — no external routing or authentication needed.
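To make the DNS convention concrete, here is a small Python sketch that splits such a URL into its parts. The agentic-user1 namespace below is a made-up stand-in for your own {user} value:

```python
# Decompose a Kubernetes internal service URL of the form
# http://<service-name>.<namespace>.svc:<port> into its parts.
from urllib.parse import urlparse

def parse_service_url(url: str) -> dict:
    parsed = urlparse(url)
    service, namespace, suffix = parsed.hostname.split(".", 2)
    return {
        "service": service,      # the Kubernetes Service name
        "namespace": namespace,  # the project/namespace it lives in
        "suffix": suffix,        # "svc" marks a cluster-internal address
        "port": parsed.port,
    }

parts = parse_service_url(
    "http://llamastack-distribution-vllm-service.agentic-user1.svc:8321"
)
print(parts)
```

Because the hostname ends in .svc, cluster DNS resolves it only from inside the cluster, which is why the lab commands run in-cluster.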

Discovering available resources

List available models

curl -sS $LLAMA_STACK_BASE_URL/v1/models \
     -H "Content-Type: application/json" \
     | jq -r '.data[].identifier'
sentence-transformers/nomic-ai/nomic-embed-text-v1.5
vllm/Llama-Guard-3-1B
vllm/nomic-embed-text-v1-5
vllm/qwen3-14b
vllm/granite-4-0-h-tiny
vllm/llama-scout-17b

Note: Your list of models may vary depending on what is available on the MaaS.

The Llama Stack distribution provides access to multiple vLLM-hosted models (e.g. qwen3-14b, llama-scout-17b) running on a Red Hat OpenShift AI cluster dedicated to model hosting via a model-as-a-service (MaaS) architecture.

Model types in the list: Notice the models fall into two categories. Embedding models (like nomic-embed-text) convert text into numerical vectors for search and retrieval — you’ll use these in the RAG module. LLM models (like qwen3-14b, granite-4-0-h-tiny) are the language models that generate text responses. Llama-Guard-3-1B is a specialized safety model used for content moderation.
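The grouping described above can be sketched with simple name heuristics. This is illustrative only - in a real application you would inspect the model metadata returned by /v1/models rather than pattern-match on identifiers:

```python
# Illustrative grouping of the model identifiers listed above into the
# three categories described in the text, using rough name heuristics.
models = [
    "sentence-transformers/nomic-ai/nomic-embed-text-v1.5",
    "vllm/Llama-Guard-3-1B",
    "vllm/nomic-embed-text-v1-5",
    "vllm/qwen3-14b",
    "vllm/granite-4-0-h-tiny",
    "vllm/llama-scout-17b",
]

def category(identifier: str) -> str:
    name = identifier.lower()
    if "embed" in name:
        return "embedding"  # vector models for search/RAG
    if "guard" in name:
        return "safety"     # content-moderation models
    return "llm"            # text-generating language models

grouped = {}
for m in models:
    grouped.setdefault(category(m), []).append(m)
print(grouped)
```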

List APIs

Llama Stack exposes a comprehensive set of RESTful APIs following the OpenAPI specification. These APIs provide standardized interfaces for inference, agents, RAG, safety, and operational tasks. Let’s explore the available endpoints:

curl -sS $LLAMA_STACK_BASE_URL/openapi.json \
     | jq '.paths | keys'
[
  "/v1/agents",
  "/v1/agents/{agent_id}",
  "/v1/agents/{agent_id}/session",
  "/v1/agents/{agent_id}/session/{session_id}",
  "/v1/agents/{agent_id}/session/{session_id}/turn",
  "/v1/agents/{agent_id}/session/{session_id}/turn/{turn_id}",
  "/v1/agents/{agent_id}/session/{session_id}/turn/{turn_id}/resume",
  "/v1/agents/{agent_id}/session/{session_id}/turn/{turn_id}/step/{step_id}",
  "/v1/agents/{agent_id}/sessions",
  "/v1/chat/completions",
  "/v1/chat/completions/{completion_id}",
  "/v1/completions",
  "/v1/conversations",
  "/v1/conversations/{conversation_id}",
  "/v1/conversations/{conversation_id}/items",
  "/v1/conversations/{conversation_id}/items/{item_id}",
  "/v1/datasetio/append-rows/{dataset_id}",
  "/v1/datasetio/iterrows/{dataset_id}",
  "/v1/datasets",
  "/v1/datasets/{dataset_id}",
  "/v1/embeddings",
  "/v1/eval/benchmarks",
  "/v1/eval/benchmarks/{benchmark_id}",
  "/v1/eval/benchmarks/{benchmark_id}/evaluations",
  "/v1/eval/benchmarks/{benchmark_id}/jobs",
  "/v1/eval/benchmarks/{benchmark_id}/jobs/{job_id}",
  "/v1/eval/benchmarks/{benchmark_id}/jobs/{job_id}/result",
  "/v1/files",
  "/v1/files/{file_id}",
  "/v1/files/{file_id}/content",
  "/v1/health",
  "/v1/inspect/routes",
  "/v1/models",
  "/v1/models/{model_id}",
  "/v1/moderations",
  "/v1/openai/v1/chat/completions",
  "/v1/openai/v1/chat/completions/{completion_id}",
  "/v1/openai/v1/completions",
  "/v1/openai/v1/embeddings",
  "/v1/openai/v1/files",
  "/v1/openai/v1/files/{file_id}",
  "/v1/openai/v1/files/{file_id}/content",
  "/v1/openai/v1/models",
  "/v1/openai/v1/moderations",
  "/v1/openai/v1/responses",
  "/v1/openai/v1/responses/{response_id}",
  "/v1/openai/v1/responses/{response_id}/input_items",
  "/v1/openai/v1/vector_stores",
  "/v1/openai/v1/vector_stores/{vector_store_id}",
  "/v1/openai/v1/vector_stores/{vector_store_id}/file_batches",
  "/v1/openai/v1/vector_stores/{vector_store_id}/file_batches/{batch_id}",
  "/v1/openai/v1/vector_stores/{vector_store_id}/file_batches/{batch_id}/cancel",
  "/v1/openai/v1/vector_stores/{vector_store_id}/file_batches/{batch_id}/files",
  "/v1/openai/v1/vector_stores/{vector_store_id}/files",
  "/v1/openai/v1/vector_stores/{vector_store_id}/files/{file_id}",
  "/v1/openai/v1/vector_stores/{vector_store_id}/files/{file_id}/content",
  "/v1/openai/v1/vector_stores/{vector_store_id}/search",
  "/v1/prompts",
  "/v1/prompts/{prompt_id}",
  "/v1/prompts/{prompt_id}/set-default-version",
  "/v1/prompts/{prompt_id}/versions",
  "/v1/providers",
  "/v1/providers/{provider_id}",
  "/v1/responses",
  "/v1/responses/{response_id}",
  "/v1/responses/{response_id}/input_items",
  "/v1/safety/run-shield",
  "/v1/scoring-functions",
  "/v1/scoring-functions/{scoring_fn_id}",
  "/v1/scoring/score",
  "/v1/scoring/score-batch",
  "/v1/shields",
  "/v1/shields/{identifier}",
  "/v1/tool-runtime/invoke",
  "/v1/tool-runtime/list-tools",
  "/v1/tool-runtime/rag-tool/insert",
  "/v1/tool-runtime/rag-tool/query",
  "/v1/toolgroups",
  "/v1/toolgroups/{toolgroup_id}",
  "/v1/tools",
  "/v1/tools/{tool_name}",
  "/v1/vector-io/insert",
  "/v1/vector-io/query",
  "/v1/vector_stores",
  "/v1/vector_stores/{vector_store_id}",
  "/v1/vector_stores/{vector_store_id}/file_batches",
  "/v1/vector_stores/{vector_store_id}/file_batches/{batch_id}",
  "/v1/vector_stores/{vector_store_id}/file_batches/{batch_id}/cancel",
  "/v1/vector_stores/{vector_store_id}/file_batches/{batch_id}/files",
  "/v1/vector_stores/{vector_store_id}/files",
  "/v1/vector_stores/{vector_store_id}/files/{file_id}",
  "/v1/vector_stores/{vector_store_id}/files/{file_id}/content",
  "/v1/vector_stores/{vector_store_id}/search",
  "/v1/version",
  "/v1alpha/agents",
  "/v1alpha/agents/{agent_id}",
  "/v1alpha/agents/{agent_id}/session",
  "/v1alpha/agents/{agent_id}/session/{session_id}",
  "/v1alpha/agents/{agent_id}/session/{session_id}/turn",
  "/v1alpha/agents/{agent_id}/session/{session_id}/turn/{turn_id}",
  "/v1alpha/agents/{agent_id}/session/{session_id}/turn/{turn_id}/resume",
  "/v1alpha/agents/{agent_id}/session/{session_id}/turn/{turn_id}/step/{step_id}",
  "/v1alpha/agents/{agent_id}/sessions",
  "/v1alpha/eval/benchmarks",
  "/v1alpha/eval/benchmarks/{benchmark_id}",
  "/v1alpha/eval/benchmarks/{benchmark_id}/evaluations",
  "/v1alpha/eval/benchmarks/{benchmark_id}/jobs",
  "/v1alpha/eval/benchmarks/{benchmark_id}/jobs/{job_id}",
  "/v1alpha/eval/benchmarks/{benchmark_id}/jobs/{job_id}/result",
  "/v1alpha/inference/rerank",
  "/v1beta/datasetio/append-rows/{dataset_id}",
  "/v1beta/datasetio/iterrows/{dataset_id}",
  "/v1beta/datasets",
  "/v1beta/datasets/{dataset_id}"
]

This comprehensive API surface demonstrates Llama Stack’s capabilities across the entire AI application lifecycle. Notice the variety of endpoints:

  • Inference endpoints (/v1/chat/completions, /v1/responses, /v1/completions): Different approaches to getting model responses

  • Agent endpoints (/v1/agents/*): Full agent lifecycle including sessions, turns, and steps

  • RAG and vector endpoints (/v1/vector_stores/*, /v1/vector-io/*): Document storage and semantic search

  • Tool runtime (/v1/toolgroups, /v1/tools/*): External tool integration

  • Safety and moderation (/v1/safety/*, /v1/shields/*): Content filtering and guardrails

  • OpenAI compatibility layer (/v1/openai/v1/*): Drop-in replacement for OpenAI API clients

Inference APIs

Llama Stack provides multiple inference APIs to suit different application needs. Let’s explore two primary approaches to getting an answer from large language models.

ChatCompletions API

The ChatCompletions API provides an OpenAI-compatible interface for interactions. This API is familiar to developers who have worked with OpenAI’s GPT models and supports streaming, function calling, and multi-turn conversations.

export QUESTION="what model are you?"

An API_KEY is required by OpenAI-compatible APIs such as v1/chat/completions and v1/responses. Since we are using Llama Stack with self-hosted, private vLLM models, it can be set to any string you like - the value is not actually checked.

Why temperature: 0.0? The temperature parameter controls randomness in model responses. At 0.0, the model always picks the most likely next token, giving deterministic (reproducible) output. Higher values (e.g. 0.7) introduce randomness for more creative or varied responses. For testing and evals, 0.0 is useful because you get the same answer every time.

export API_KEY="not-applicable"
curl -sS $LLAMA_STACK_BASE_URL/v1/chat/completions \
    -H "Content-Type: application/json" \
    -H "Authorization: Bearer $API_KEY" \
    -d "{
       \"model\": \"$INFERENCE_MODEL\",
       \"messages\": [{\"role\": \"user\", \"content\": \"$QUESTION\"}],
       \"temperature\": 0.0
     }" | jq -r '.choices[0].message.content'
I am Qwen, a large language model developed by Alibaba Cloud. I can answer
questions, create text, and assist with various tasks. How can I help you today?
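The effect of temperature described in the note above can be sketched numerically. The logits below are made up for illustration; note that real inference engines special-case temperature 0.0 as greedy (argmax) decoding rather than dividing by zero:

```python
# Sketch of how temperature reshapes a next-token distribution.
import math

def softmax_with_temperature(logits, temperature):
    # As temperature -> 0 the distribution collapses onto the argmax,
    # which is why temperature 0.0 gives deterministic output.
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]           # hypothetical scores for 3 tokens
sharp = softmax_with_temperature(logits, 0.1)   # near-greedy
varied = softmax_with_temperature(logits, 1.0)  # more spread out
print(sharp[0], varied[0])
```

At temperature 0.1 almost all probability mass sits on the top token; at 1.0 the other tokens keep a meaningful share, which is where response variety comes from.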

Change the INFERENCE_MODEL and try again:

export INFERENCE_MODEL=vllm/granite-4-0-h-tiny
curl -sS $LLAMA_STACK_BASE_URL/v1/chat/completions \
    -H "Content-Type: application/json" \
    -H "Authorization: Bearer $API_KEY" \
    -d "{
       \"model\": \"$INFERENCE_MODEL\",
       \"messages\": [{\"role\": \"user\", \"content\": \"$QUESTION\"}],
       \"temperature\": 0.0
     }" | jq -r '.choices[0].message.content'
I am an AI language model developed by IBM for the purpose of assisting with a variety
of tasks such as answering questions, providing explanations, and helping with writing
and research. My exact model name isn't publicly disclosed. However, I'm designed to
provide accurate, relevant, and helpful information across a wide range of topics.

Provider abstraction in action: Notice you used the exact same curl command and only changed the INFERENCE_MODEL environment variable. Llama Stack routed the request to a completely different model behind the scenes. This is the core value of the unified API — your application code doesn’t change when you swap models.
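A minimal sketch of that idea: the request body is constructed identically and only the model field varies. The payload shape mirrors the curl calls above; no server is needed to see the point:

```python
# Sketch of provider abstraction: the ChatCompletions payload is
# identical except for the model field, so swapping models never
# touches application logic.
import json

def chat_payload(model: str, question: str) -> dict:
    return {
        "model": model,
        "messages": [{"role": "user", "content": question}],
        "temperature": 0.0,
    }

qwen = chat_payload("vllm/qwen3-14b", "what model are you?")
granite = chat_payload("vllm/granite-4-0-h-tiny", "what model are you?")

# Everything except the model name is identical.
assert {k: v for k, v in qwen.items() if k != "model"} == \
       {k: v for k, v in granite.items() if k != "model"}
print(json.dumps(qwen, indent=2))
```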

Change the INFERENCE_MODEL back to Qwen3, which we will use for the rest of the lab.

export INFERENCE_MODEL=vllm/qwen3-14b

Responses API

An OpenAI-compatible Responses API (/v1/responses) was developed by OpenAI and is also available via Llama Stack. It represents a more streamlined, purpose-built interface for AI applications. Unlike the ChatCompletions API, which focuses on conversational patterns, the Responses API is designed for modern agentic workflows where you need direct, structured interaction with language models. The Responses API is considered the successor to the ChatCompletions API.

Llama Stack has adopted the Responses API because of its well-earned popularity as a powerful tool for agentic reasoning. This API lets you avoid juggling multiple tool endpoints and identifiers. You make one API call that handles tool discovery, execution planning, and response synthesis automatically. The Responses API supports advanced patterns, such as multi-step reasoning and automatic tool chaining, that previously required extensive custom orchestration with the legacy APIs.

Llama Stack will deprecate the Agents API in favor of the Responses API. The Llama Stack client has already changed its implementation to point to the Responses API in preparation for the migration, but in the near future the Agents API will no longer be available. The suggested approach is to use the Responses API directly, use the LegacyAgent API, or create your own class that simply wraps the Responses API. See this article for more examples. This lab will be updated accordingly when needed.
sequenceDiagram
    participant App as Your Application
    participant LS as Llama Stack Server
    participant Model as vLLM Model
    participant Tools as External Tools
    App->>LS: POST /v1/responses {model, input}
    LS->>LS: Apply safety shields
    LS->>Model: Forward to inference provider
    Model->>Model: Generate response
    Model-->>LS: Raw model output
    LS->>LS: Structure output
    LS->>Tools: (Optional) Execute tools
    Tools-->>LS: Tool results
    LS-->>App: {output: [{content}]}
    Note over App,Tools: Responses API handles orchestration, safety, and tool execution
Figure 1. Responses API workflow

Llama Stack’s Responses API support

Key features of Llama Stack’s Responses API implementation:

  • Provider abstraction: Same API works with any configured inference provider

  • Structured outputs: Native support for JSON schemas, structured data extraction and structured input/output

  • Tool integration: Built-in support for RAG, web search, and custom tools

  • Observability: Request/response logging and monitoring through OpenAPI endpoints

Let’s interact with the Responses API:

export QUESTION="What is the capital of Italy?"
curl -sS "$LLAMA_STACK_BASE_URL/v1/responses" \
    -H "Content-Type: application/json" \
    -H "Authorization: Bearer $API_KEY" \
    -d "{
      \"model\": \"$INFERENCE_MODEL\",
      \"input\": \"$QUESTION\"
    }" | jq -r '.output[0].content[0].text'
The capital of Italy is **Rome**. It is a historic city known for its rich cultural heritage, ancient landmarks like the Colosseum and Vatican City, and its role as the political and administrative center of the country.

The Responses API provides a clean, direct interface for getting model responses. Notice the simplified request structure compared to ChatCompletions: you specify a model and an input rather than managing message arrays. The response is structured as an array of output content, making it easy to extract text, handle tool calls, or process structured data.
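The jq path used above (.output[0].content[0].text) can be sketched in Python against a trimmed, hypothetical response body shaped like the one the server returns:

```python
# Sketch: extracting text from a Responses API result. The sample body
# below is hypothetical and trimmed to the fields the jq path touches.
import json

sample = json.loads("""
{
  "output": [
    {
      "type": "message",
      "content": [
        {"type": "output_text", "text": "The capital of Italy is Rome."}
      ]
    }
  ]
}
""")

# Same navigation as: jq -r '.output[0].content[0].text'
text = sample["output"][0]["content"][0]["text"]
print(text)
```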

This API becomes even more powerful when combined with Llama Stack’s agent framework, which uses the Responses API as the foundation for multi-step reasoning and tool execution.

Tool runtime and capabilities

Beyond inference, Llama Stack provides a runtime environment for tools that agents can use to interact with external systems. These tools extend the capabilities of language models by allowing them to search the web, query databases, perform calculations, and integrate with enterprise systems.

Available tool groups

curl -sS -H "Content-Type: application/json" $LLAMA_STACK_BASE_URL/v1/toolgroups | jq
{
  "data": []
}

Why are tool groups empty? Llama Stack starts with no tools registered by default. This is intentional — tools are capabilities you explicitly opt into. You register them here so that agents can discover and invoke them during their reasoning process. Think of it as giving your agents permission to use specific external capabilities.

Register the RAG and Web Search tool groups.

curl -sS -X POST "$LLAMA_STACK_BASE_URL/v1/toolgroups" \
  -H "Content-Type: application/json" \
  -d '{"toolgroup_id": "builtin::rag", "provider_id": "rag-runtime"}' \
  -w "\nHTTP Status: %{http_code}\n"
null
HTTP Status: 200
curl -sS -X POST "$LLAMA_STACK_BASE_URL/v1/toolgroups" \
  -H "Content-Type: application/json" \
  -d '{"toolgroup_id": "builtin::websearch", "provider_id": "tavily-search"}' \
  -w "\nHTTP Status: %{http_code}\n"
null
HTTP Status: 200
curl -sS -H "Content-Type: application/json" $LLAMA_STACK_BASE_URL/v1/toolgroups | jq
{
  "data": [
    {
      "identifier": "builtin::rag",
      "provider_resource_id": "builtin::rag",
      "provider_id": "rag-runtime",
      "type": "tool_group",
      "mcp_endpoint": null,
      "args": null
    },
    {
      "identifier": "builtin::websearch",
      "provider_resource_id": "builtin::websearch",
      "provider_id": "tavily-search",
      "type": "tool_group",
      "mcp_endpoint": null,
      "args": null
    }
  ]
}
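The jq pipelines in this section boil down to filtering this response by field. A small Python sketch over a trimmed copy of the payload above shows the same selections:

```python
# Sketch: the same filtering the jq pipelines perform, applied to a
# trimmed copy of the /v1/toolgroups response shown above.
toolgroups = {
    "data": [
        {"identifier": "builtin::rag", "provider_id": "rag-runtime"},
        {"identifier": "builtin::websearch", "provider_id": "tavily-search"},
    ]
}

# Like: jq -r '.data[].identifier'
identifiers = [t["identifier"] for t in toolgroups["data"]]

# Like: jq '[.data[] | select(.provider_id == "tavily-search") | .identifier]'
websearch = [
    t["identifier"]
    for t in toolgroups["data"]
    if t["provider_id"] == "tavily-search"
]
print(identifiers, websearch)
```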

RAG tool: Provides document retrieval and semantic search capabilities. There is a module dedicated to exploring RAG.

Web Search tool: Integrates with Tavily and augments the model context with web search data. There is a module dedicated to exploring Web Search.

These tools can be invoked by agents during their reasoning process, allowing them to augment their knowledge with external data. In later modules, you’ll see how agents automatically decide when and how to use these tools.

Model Context Protocol (MCP)

curl -sS -H "Content-Type: application/json" \
    "$LLAMA_STACK_BASE_URL/v1/toolgroups" \
  | jq '[.data[] | select(.provider_id == "model-context-protocol") | .identifier]'
[]

There are no MCP Servers deployed and registered with this Llama Stack instance at this time.

The Model Context Protocol is an open standard for integrating external tools and data sources with language models. While this instance doesn’t have MCP servers configured yet, you’ll learn how to add MCP capabilities in the dedicated MCP module later in this lab.

Summary

In this module, you explored the Llama Stack API landscape through hands-on interaction:

  • Connected to Llama Stack and discovered the models available through the MaaS-hosted vLLM provider

  • Explored the API surface with over 100 endpoints for inference, agents, RAG, safety, and tools

  • Compared inference approaches between the ChatCompletions and Responses APIs

  • Emphasized the Responses API as Llama Stack’s native interface for agentic workflows, with provider abstraction and simplified semantics

In the following modules, you’ll use these APIs to implement RAG, integrate tools via MCP, and build complete agentic applications.