Introduction
TL;DR:
- Ollama runs LLMs locally and exposes an HTTP API (example calls use http://localhost:11434).
- Key endpoints include /api/generate, /api/chat, and /api/embed for embeddings used in RAG pipelines.
- Modelfiles let you package a base model plus parameters and a fixed system prompt.
What is Ollama?
Ollama’s docs show that once it’s running, the API is available and can be called via curl against localhost:11434.
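As a minimal sketch, the generate endpoint can be exercised from any HTTP client; the example below uses Python's requests library and assumes a running local server plus a pulled model (the name llama3 is an assumption, substitute any model you have installed):

```python
import requests

# Call a local Ollama server's /api/generate endpoint.
# Assumes Ollama is running on localhost:11434 and "llama3" has been pulled.
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3",
        "prompt": "Explain what Ollama is in one sentence.",
        "stream": False,  # return a single JSON object instead of a stream
    },
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["response"])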
Why it matters: It provides a straightforward path from local experimentation to app integration using stable HTTP calls.
Core REST APIs (Generate, Chat, Embed)
Ollama’s API reference documents chat/generation endpoints, and the embeddings endpoint is documented separately.
| Goal | Endpoint | Notes |
|---|---|---|
| Text generation | POST /api/generate | Example shown in the API introduction |
| Chat | POST /api/chat | Listed as chat completion in API reference |
| Embeddings | POST /api/embed | Creates vector embeddings |
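A sketch of the chat endpoint, under the same assumptions as above (local server, pulled model):

```python
import requests

# /api/chat: multi-turn messages in, assistant message out.
# The model name "llama3" is an assumption; use any locally pulled model.
resp = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "llama3",
        "messages": [
            {"role": "system", "content": "You are a concise assistant."},
            {"role": "user", "content": "What is a Modelfile?"},
        ],
        "stream": False,
    },
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["message"]["content"])
```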
A minimal RAG flow using /api/embed: index documents as vectors, retrieve the most similar chunks for a query, then generate an answer grounded in the retrieved context.
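Rendered as a Mermaid diagram, one plausible version of that flow (the endpoint names come from the table above; the rest is a generic sketch):

```mermaid
flowchart LR
    D[Documents] -- "POST /api/embed" --> V[(Vector DB)]
    Q[User question] -- "POST /api/embed" --> E[Query vector]
    E -- "similarity search" --> V
    V -- "top-k chunks" --> C[Retrieved context]
    C --> G["POST /api/chat"]
    Q --> G
    G --> A[Grounded answer]
```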
Why it matters: Embeddings are the foundational building block for search + grounded generation workflows.
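The indexing step of that flow might look like the following sketch; the model name nomic-embed-text is an assumption, and any pulled embedding model works:

```python
import requests

# /api/embed: turn text into vectors for a RAG index.
resp = requests.post(
    "http://localhost:11434/api/embed",
    json={
        "model": "nomic-embed-text",
        "input": ["Ollama runs LLMs locally.", "Modelfiles package models."],
    },
    timeout=60,
)
resp.raise_for_status()
vectors = resp.json()["embeddings"]  # one vector per input string
print(len(vectors), len(vectors[0]))
```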
Custom models with Modelfiles
The Modelfile reference documents directives like FROM, PARAMETER, SYSTEM, and TEMPLATE.
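A minimal sketch of a Modelfile using those directives (the base model llama3 and the parameter values are assumptions, not recommendations):

```text
# Package a base model with fixed parameters and a system prompt.
FROM llama3
PARAMETER temperature 0.2
PARAMETER num_ctx 4096
SYSTEM """You are a concise support assistant. Answer only from the provided context."""
```

You would then build and run it with `ollama create support-assistant -f Modelfile` followed by `ollama run support-assistant`.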
Why it matters: Packaging a stable system prompt and parameters improves reuse and operational consistency.
OpenAI-compatible API: verify your supported surface
Ollama’s docs describe an OpenAI-compatible API (including /v1/responses, with the caveat that it is not stateful), and the Ollama blog provides example usage.
A historical GitHub issue shows earlier limitations, so always validate against your installed version.
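As a sketch, the official OpenAI Python client can be pointed at the local server; the api_key value is a placeholder (the client requires one but Ollama ignores it), and llama3 is again an assumption:

```python
from openai import OpenAI

# Point the official OpenAI client at Ollama's OpenAI-compatible surface.
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

completion = client.chat.completions.create(
    model="llama3",
    messages=[{"role": "user", "content": "Say hello from a local model."}],
)
print(completion.choices[0].message.content)
```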
Why it matters: Compatibility can reduce migration work, but the exact endpoint support can change over time.
LangChain and Spring AI integrations
- LangChain: ChatOllama in langchain-ollama (see the sketch after this list)
- Spring AI: OllamaChatModel
Why it matters: You can adopt local LLMs without rewriting your whole application stack (Python or Java).
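A minimal sketch of the LangChain side, assuming langchain-ollama is installed and llama3 has been pulled:

```python
from langchain_ollama import ChatOllama

# ChatOllama talks to the local Ollama server through LangChain's chat interface.
llm = ChatOllama(model="llama3", temperature=0.2)
reply = llm.invoke("Summarize what a vector database does in one sentence.")
print(reply.content)
```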
Conclusion
- Ollama exposes local inference via localhost:11434 and standard REST calls.
- Use /api/embed to build a basic RAG pipeline with a vector database.
- Use Modelfiles to package role-specific behavior and parameters.
- For OpenAI compatibility, confirm which endpoints your version supports.
Summary
- Local REST APIs: /api/generate, /api/chat, /api/embed.
- RAG baseline: embed -> vector search -> chat/generate with retrieved context.
- Modelfile-based customization for reusable, consistent behavior.
Recommended Hashtags
#ollama #llm #localai #rag #embeddings #modelfile #langchain #springai #vectorsearch #mlops
References
- [Introduction, 2025-12-31](https://docs.ollama.com/api/introduction)
- [API Reference (api.md), 2025-12-31](https://github.com/ollama/ollama/blob/main/docs/api.md)
- [Generate embeddings, 2025-12-31](https://docs.ollama.com/api/embed)
- [Modelfile Reference, 2025-12-31](https://docs.ollama.com/modelfile)
- [OpenAI compatibility - Ollama Docs, 2025-12-31](https://docs.ollama.com/api/openai-compatibility)
- [OpenAI compatibility - Ollama Blog, 2024-02-08](https://ollama.com/blog/openai-compatibility)
- [Responses API support issue - GitHub Issues, 2025-04-16](https://github.com/ollama/ollama/issues/10309)
- [ChatOllama integration - LangChain Docs, 2025-12-31](https://docs.langchain.com/oss/python/integrations/chat/ollama)
- [Ollama Chat - Spring AI Reference, 2025-12-31](https://docs.spring.io/spring-ai/reference/api/chat/ollama-chat.html)