Introduction

TL;DR:

  • Ollama runs LLMs locally and exposes an HTTP API (example calls use http://localhost:11434).
  • Key endpoints include /api/generate, /api/chat, and /api/embed for embeddings used in RAG pipelines.
  • Modelfiles let you package a base model plus parameters and a fixed system prompt.

What is Ollama?

Per Ollama’s docs, once the server is running its HTTP API is available on localhost:11434 and can be called with curl or any HTTP client.

flowchart LR
  App[Application] -->|HTTP| Ollama[Ollama Server :11434]
  Ollama --> Model[Local LLM Model]
  App --> Docs[Local Documents]
  Ollama --> Vec[(Vector DB - optional)]

Why it matters: It provides a straightforward path from local experimentation to app integration using stable HTTP calls.
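
As a quick sanity check that the server is up, here is a minimal Python sketch (assuming the requests library; GET /api/version is documented in the API reference):

import requests

# Confirm the local Ollama server is reachable on its default port.
resp = requests.get("http://localhost:11434/api/version", timeout=5)
resp.raise_for_status()
print(resp.json())  # a JSON object containing the server version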

Core REST APIs (Generate, Chat, Embed)

Ollama’s API reference documents chat/generation endpoints, and the embeddings endpoint is documented separately.

Goal            | Endpoint           | Notes
----------------|--------------------|------------------------------------------------
Text generation | POST /api/generate | Example shown in the API introduction
Chat            | POST /api/chat     | Listed as chat completion in the API reference
Embeddings      | POST /api/embed    | Creates vector embeddings
curl http://localhost:11434/api/generate -d '{
  "model": "gemma3",
  "prompt": "Explain Ollama in one paragraph.",
  "stream": false
}'
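
The chat endpoint takes a list of role-tagged messages instead of a single prompt string. A minimal Python sketch of the same kind of call (assuming the requests library and a locally pulled gemma3 model; substitute any model you have):

import requests

# POST /api/chat: multi-turn conversation via role-tagged messages.
payload = {
    "model": "gemma3",  # assumes this model was pulled locally
    "messages": [
        {"role": "system", "content": "You are a concise assistant."},
        {"role": "user", "content": "Explain Ollama in one sentence."},
    ],
    "stream": False,  # return one JSON object instead of a token stream
}
resp = requests.post("http://localhost:11434/api/chat", json=payload)
resp.raise_for_status()
print(resp.json()["message"]["content"])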

Mermaid: A minimal RAG flow using /api/embed

flowchart TD
  D[Documents] --> P[Preprocess & Chunk]
  P --> E1["Ollama /api/embed (docs)"]
  E1 --> V[(Vector DB)]
  Q[Question] --> E2["Ollama /api/embed (query)"]
  E2 --> V
  V --> R[Retrieve Top-K Chunks]
  R --> C[Compose Prompt: Q + Context]
  C --> G[Ollama /api/chat or /api/generate]
  G --> A[Answer]

Why it matters: Embeddings are the foundational building block for search + grounded generation workflows.
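
To make the flow concrete, here is a minimal end-to-end sketch in Python. Assumptions: the requests and numpy libraries, a locally pulled embedding model (nomic-embed-text is used here illustratively), a locally pulled gemma3 chat model, and a plain in-memory list standing in for the vector DB:

import numpy as np
import requests

OLLAMA = "http://localhost:11434"

def embed(texts):
    # POST /api/embed accepts a string or list of strings as "input"
    # and returns one vector per input under "embeddings".
    r = requests.post(f"{OLLAMA}/api/embed",
                      json={"model": "nomic-embed-text", "input": texts})
    r.raise_for_status()
    return np.array(r.json()["embeddings"])

# 1) Embed the document chunks (a list stands in for a real vector DB).
chunks = ["Ollama serves LLMs locally on port 11434.",
          "Modelfiles bundle a base model with parameters and a system prompt.",
          "The /api/embed endpoint produces vector embeddings."]
doc_vecs = embed(chunks)

# 2) Embed the question and retrieve the top-k chunks by cosine similarity.
question = "Which endpoint creates embeddings?"
q_vec = embed([question])[0]
sims = doc_vecs @ q_vec / (np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q_vec))
top_k = [chunks[i] for i in np.argsort(sims)[::-1][:2]]

# 3) Compose a grounded prompt and generate an answer.
prompt = "Answer using only this context:\n" + "\n".join(top_k) + f"\n\nQ: {question}"
r = requests.post(f"{OLLAMA}/api/generate",
                  json={"model": "gemma3", "prompt": prompt, "stream": False})
print(r.json()["response"])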

Custom models with Modelfiles

The Modelfile reference documents directives like FROM, PARAMETER, SYSTEM, and TEMPLATE.

FROM llama3.2
PARAMETER temperature 0.2
SYSTEM """
You are a policy summarizer.
Never invent facts.
Return concise bullet points.
"""

Why it matters: Packaging a stable system prompt and parameters improves reuse and operational consistency.
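
After building the packaged model with ollama create policy-summarizer -f Modelfile (the name policy-summarizer is illustrative), it is invoked like any other model. A minimal Python sketch, assuming the requests library:

import requests

# The custom model already carries its SYSTEM prompt and parameters,
# so the request only needs the user-facing input.
resp = requests.post("http://localhost:11434/api/generate", json={
    "model": "policy-summarizer",  # illustrative name from `ollama create`
    "prompt": "Summarize this leave policy: employees accrue 1.5 days/month.",
    "stream": False,
})
print(resp.json()["response"])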

OpenAI-compatible API: verify your supported surface

Ollama’s docs describe an OpenAI-compatible API surface (including /v1/responses, supported without stateful features), and the Ollama blog walks through example usage. A historical GitHub issue tracked earlier gaps in endpoint coverage, so always validate support against your installed version.

flowchart LR
  Existing[Existing OpenAI-SDK app] -->|change base_url| Compat[Ollama OpenAI-compatible API]
  Compat --> Core[Ollama Core Runtime]
  Core --> Model[Local LLM]

Why it matters: Compatibility can reduce migration work, but the exact endpoint support can change over time.
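
A minimal sketch of the base_url swap using the official OpenAI Python SDK (per the compatibility docs, the api_key value is required by the SDK but ignored by Ollama):

from openai import OpenAI

# Point the OpenAI SDK at the local Ollama server's /v1 surface.
client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama",  # required by the SDK, unused by Ollama
)

chat = client.chat.completions.create(
    model="gemma3",  # any locally pulled model
    messages=[{"role": "user", "content": "Say hello from a local model."}],
)
print(chat.choices[0].message.content)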

LangChain and Spring AI integrations

  • LangChain: ChatOllama in langchain-ollama
  • Spring AI: OllamaChatModel

Why it matters: You can adopt local LLMs without rewriting your whole application stack (Python or Java).
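
On the Python side, a minimal sketch with the langchain-ollama package (assuming pip install langchain-ollama and a locally pulled model):

from langchain_ollama import ChatOllama

# ChatOllama talks to the local Ollama server (default http://localhost:11434).
llm = ChatOllama(model="gemma3", temperature=0.2)

reply = llm.invoke("Explain Ollama in one sentence.")
print(reply.content)  # the model's text response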

Conclusion

  • Ollama exposes local inference via localhost:11434 and standard REST calls.
  • Use /api/embed to build a basic RAG pipeline with a vector database.
  • Use Modelfiles to package role-specific behavior and parameters.
  • For OpenAI compatibility, confirm which endpoints your version supports.

Summary

  • Local REST APIs: /api/generate, /api/chat, /api/embed.
  • RAG baseline: embed -> vector search -> chat/generate with retrieved context.
  • Modelfile-based customization for reusable, consistent behavior.

#ollama #llm #localai #rag #embeddings #modelfile #langchain #springai #vectorsearch #mlops

References

  • Introduction, Ollama Docs (2025-12-31): https://docs.ollama.com/api/introduction
  • API Reference (api.md), Ollama GitHub (2025-12-31): https://github.com/ollama/ollama/blob/main/docs/api.md
  • Generate embeddings, Ollama Docs (2025-12-31): https://docs.ollama.com/api/embed
  • Modelfile Reference, Ollama Docs (2025-12-31): https://docs.ollama.com/modelfile
  • OpenAI compatibility, Ollama Docs (2025-12-31): https://docs.ollama.com/api/openai-compatibility
  • OpenAI compatibility, Ollama Blog (2024-02-08): https://ollama.com/blog/openai-compatibility
  • Responses API support issue, GitHub Issues (2025-04-16): https://github.com/ollama/ollama/issues/10309
  • ChatOllama integration, LangChain Docs (2025-12-31): https://docs.langchain.com/oss/python/integrations/chat/ollama
  • Ollama Chat, Spring AI Reference (2025-12-31): https://docs.spring.io/spring-ai/reference/api/chat/ollama-chat.html