Introduction
TL;DR:
- Ollama runs LLMs locally and exposes an HTTP API (example calls use http://localhost:11434).
- Key endpoints include /api/generate, /api/chat, and /api/embed for embeddings used in RAG pipelines.
- Modelfiles let you package a base model plus parameters and a fixed system prompt.
What is Ollama?
Ollama’s docs show that once it’s running, the API is available and can be called via curl against localhost:11434.
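As a minimal sketch, the generate endpoint can be exercised from any HTTP client; the example below uses Python's requests library and assumes a running local server plus a pulled model (the name llama3 is an assumption, substitute any model you have installed):

```python
import requests

# Call a local Ollama server's /api/generate endpoint.
# Assumes Ollama is running on localhost:11434 and "llama3" has been pulled.
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3",
        "prompt": "Explain what Ollama is in one sentence.",
        "stream": False,  # return a single JSON object instead of a stream
    },
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["response"])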
Why it matters: It provides a straightforward path from local experimentation to app integration using stable HTTP calls.
Core REST APIs (Generate, Chat, Embed)
Ollama’s API reference documents chat/generation endpoints, and the embeddings endpoint is documented separately.
| Goal | Endpoint | Notes |
|---|---|---|
| Text generation | POST /api/generate | Example shown in the API introduction |
| Chat | POST /api/chat | Listed as chat completion in API reference |
| Embeddings | POST /api/embed | Creates vector embeddings |
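A sketch of the chat endpoint, under the same assumptions as above (local server, pulled model):

```python
import requests

# /api/chat: multi-turn messages in, assistant message out.
# The model name "llama3" is an assumption; use any locally pulled model.
resp = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "llama3",
        "messages": [
            {"role": "system", "content": "You are a concise assistant."},
            {"role": "user", "content": "What is a Modelfile?"},
        ],
        "stream": False,
    },
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["message"]["content"])
```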
A minimal RAG flow using /api/embed: index documents as vectors, retrieve the most similar chunks for a query, then generate an answer grounded in the retrieved context.
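Rendered as a Mermaid diagram, one plausible version of that flow (the endpoint names come from the table above; the rest is a generic sketch):

```mermaid
flowchart LR
    D[Documents] -- "POST /api/embed" --> V[(Vector DB)]
    Q[User question] -- "POST /api/embed" --> E[Query vector]
    E -- "similarity search" --> V
    V -- "top-k chunks" --> C[Retrieved context]
    C --> G["POST /api/chat"]
    Q --> G
    G --> A[Grounded answer]
```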
Why it matters: Embeddings are the foundational building block for search + grounded generation workflows.
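The indexing step of that flow might look like the following sketch; the model name nomic-embed-text is an assumption, and any pulled embedding model works:

```python
import requests

# /api/embed: turn text into vectors for a RAG index.
resp = requests.post(
    "http://localhost:11434/api/embed",
    json={
        "model": "nomic-embed-text",
        "input": ["Ollama runs LLMs locally.", "Modelfiles package models."],
    },
    timeout=60,
)
resp.raise_for_status()
vectors = resp.json()["embeddings"]  # one vector per input string
print(len(vectors), len(vectors[0]))
```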
Custom models with Modelfiles
The Modelfile reference documents directives like FROM, PARAMETER, SYSTEM, and TEMPLATE.
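A minimal sketch of a Modelfile using those directives (the base model llama3 and the parameter values are assumptions, not recommendations):

```text
# Package a base model with fixed parameters and a system prompt.
FROM llama3
PARAMETER temperature 0.2
PARAMETER num_ctx 4096
SYSTEM """You are a concise support assistant. Answer only from the provided context."""
```

You would then build and run it with `ollama create support-assistant -f Modelfile` followed by `ollama run support-assistant`.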
Why it matters: Packaging a stable system prompt and parameters improves reuse and operational consistency.
OpenAI-compatible API: verify your supported surface
Ollama’s docs describe an OpenAI-compatible API (including /v1/responses, with the caveat that it is not stateful), and the Ollama blog provides example usage.
A historical GitHub issue shows earlier limitations, so always validate against your installed version.
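As a sketch, the official OpenAI Python client can be pointed at the local server; the api_key value is a placeholder (the client requires one but Ollama ignores it), and llama3 is again an assumption:

```python
from openai import OpenAI

# Point the official OpenAI client at Ollama's OpenAI-compatible surface.
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

completion = client.chat.completions.create(
    model="llama3",
    messages=[{"role": "user", "content": "Say hello from a local model."}],
)
print(completion.choices[0].message.content)
```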
Why it matters: Compatibility can reduce migration work, but the exact endpoint support can change over time.
LangChain and Spring AI integrations
- LangChain: ChatOllama in langchain-ollama (see the sketch after this list)
- Spring AI: OllamaChatModel
Why it matters: You can adopt local LLMs without rewriting your whole application stack (Python or Java).
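A minimal sketch of the LangChain side, assuming langchain-ollama is installed and llama3 has been pulled:

```python
from langchain_ollama import ChatOllama

# ChatOllama talks to the local Ollama server through LangChain's chat interface.
llm = ChatOllama(model="llama3", temperature=0.2)
reply = llm.invoke("Summarize what a vector database does in one sentence.")
print(reply.content)
```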
Conclusion
- Ollama exposes local inference via localhost:11434 and standard REST calls.
- Use /api/embed to build a basic RAG pipeline with a vector database.
- Use Modelfiles to package role-specific behavior and parameters.
- For OpenAI compatibility, confirm which endpoints your version supports.
Summary
- Local REST APIs: /api/generate, /api/chat, /api/embed.
- RAG baseline: embed -> vector search -> chat/generate with retrieved context.
- Modelfile-based customization for reusable, consistent behavior.
Recommended Hashtags
#ollama #llm #localai #rag #embeddings #modelfile #langchain #springai #vectorsearch #mlops
References
- [Introduction, 2025-12-31](https://docs.ollama.com/api/introduction)
- [API Reference (api.md), 2025-12-31](https://github.com/ollama/ollama/blob/main/docs/api.md)
- [Generate embeddings, 2025-12-31](https://docs.ollama.com/api/embed)
- [Modelfile Reference, 2025-12-31](https://docs.ollama.com/modelfile)
- [OpenAI compatibility - Ollama Docs, 2025-12-31](https://docs.ollama.com/api/openai-compatibility)
- [OpenAI compatibility - Ollama Blog, 2024-02-08](https://ollama.com/blog/openai-compatibility)
- [Responses API support issue - GitHub Issues, 2025-04-16](https://github.com/ollama/ollama/issues/10309)
- [ChatOllama integration - LangChain Docs, 2025-12-31](https://docs.langchain.com/oss/python/integrations/chat/ollama)
- [Ollama Chat - Spring AI Reference, 2025-12-31](https://docs.spring.io/spring-ai/reference/api/chat/ollama-chat.html)