Running Models Locally
Configure Engrammic to use local embedding models and LLMs with no cloud API keys
The self-hosted image supports running both the embedding model and the generation LLM on your own hardware. This page covers how to wire up Ollama and the Hugging Face Text Embeddings Inference sidecar.
Before going further, a quick reminder of the two-model split:
- Embedding model (required). Powers remember, learn, and recall. The service will not start meaningfully without one.
- Generation LLM (optional). Powers SAGE synthesis only. Without it, the service runs in passive mode: storage and recall work, synthesis is disabled.
You can run embeddings locally without touching the LLM at all, or run both locally.
Embeddings Locally
The service uses LiteLLM to route embedding calls. Set EMBEDDING_MODEL to an Ollama model string and point OLLAMA_API_BASE at your local Ollama instance.
Container networking
Containers reach an Ollama instance running on the host via host.docker.internal. This hostname resolves automatically on Docker Desktop (macOS, Windows). On Linux it resolves only when the container is started with --add-host host.docker.internal:host-gateway, or with the equivalent extra_hosts entry in the compose file.
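Before starting the stack, it can help to confirm that Ollama is actually answering. The following is a minimal sketch, not part of Engrammic: the function names are hypothetical, and it assumes curl is installed and that Ollama listens on its default port 11434. It queries Ollama's /api/tags endpoint, which lists the locally pulled models.

```shell
# Hypothetical helpers, not shipped with Engrammic.

# Build the Ollama model-listing URL from a base URL, tolerating a trailing slash.
ollama_tags_url() {
  base="${1%/}"
  printf '%s/api/tags\n' "$base"
}

# Return 0 if Ollama answers at the given base URL within 5 seconds.
ollama_reachable() {
  curl --silent --fail --max-time 5 "$(ollama_tags_url "$1")" > /dev/null
}

# Example, run on the host:
#   ollama_reachable http://localhost:11434 && echo "Ollama is up"
```

Running the same check from inside a container against http://host.docker.internal:11434 exercises the hostname resolution described above, provided the container image includes curl.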
Configuration (.env)
# Embedding model (required)
EMBEDDING_MODEL=ollama/nomic-embed-text
OLLAMA_API_BASE=http://host.docker.internal:11434
# Dimensions must match the model exactly
EMBEDDING_DIMENSIONS=768
Pull the model on your host before starting the stack:
ollama pull nomic-embed-text
Changing the embedding model after data exists requires re-embedding everything.
EMBEDDING_DIMENSIONS must match the chosen model. If you swap models later, you must recreate the Qdrant collections and re-embed all stored content. Plan your model choice before writing significant data.
Common dimension values: 768 for nomic-embed-text and vertex_ai/text-embedding-005, 384 for MiniLM-based models, 1536 for openai/text-embedding-3-small.
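Because a mismatch is costly to fix later, a pre-flight check can be worthwhile. The sketch below encodes only the dimension values listed above; the function names are hypothetical, and MiniLM-based models are left out because no exact model string is given for them.

```shell
# Hypothetical helpers; the table holds only the values listed above.

# Print the embedding dimension for a known model string.
# Returns 1 for models not in this short table.
expected_dimensions() {
  case "$1" in
    ollama/nomic-embed-text|nomic-embed-text|vertex_ai/text-embedding-005) echo 768 ;;
    openai/text-embedding-3-small) echo 1536 ;;
    *) return 1 ;;
  esac
}

# Compare a model string against a configured EMBEDDING_DIMENSIONS value.
# Prints "ok" on a match; returns 1 on a mismatch, 2 for an unknown model.
check_dimensions() {
  model="$1"
  configured="$2"
  expected="$(expected_dimensions "$model")" || { echo "unknown model: $model"; return 2; }
  if [ "$expected" = "$configured" ]; then
    echo "ok"
  else
    echo "mismatch: $model expects $expected, got $configured"
    return 1
  fi
}
```

For example, `check_dimensions "$EMBEDDING_MODEL" "$EMBEDDING_DIMENSIONS"` can be run after sourcing the .env file and before the first write.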
Alternative: Hugging Face Text Embeddings Inference (TEI)
TEI is a high-throughput embedding server from Hugging Face. It is a good alternative to Ollama if you need higher throughput or want to run a specific BERT-family model. Add it as a sidecar in your compose file, then set EMBEDDING_MODEL to the appropriate LiteLLM model string and use the sidecar URL as the API base.
See LiteLLM embeddings documentation for the exact model string format.
LLM Locally (optional)
SAGE synthesis uses a separate generation LLM. Unlike embeddings, the LLM provider and model are chosen in models.yaml, not through a single environment variable. You point that file at Ollama and supply the Ollama base URL through the environment.
This uses the host config override directory described in Config Files. In short: the compose file mounts ~/.engrammic/config into the container, and any file you place there wins over the default baked into the image.
Step 1: Override models.yaml
Copy the default out of the image, then edit it:
mkdir -p ~/.engrammic/config
docker run --rm --entrypoint cat \
  europe-north1-docker.pkg.dev/engrammic/releases/engrammic-api:latest \
  /app/config/models.yaml > ~/.engrammic/config/models.yaml
In ~/.engrammic/config/models.yaml, set the reasoning, fast, and query_expander blocks of the active tier to use Ollama:
reasoning:
  provider: ollama
  model: llama3.2
fast:
  provider: ollama
  model: llama3.2
query_expander:
  provider: ollama
  model: llama3.2
Step 2: Point the LLM at your Ollama (.env)
OLLAMA_BASE_URL=http://host.docker.internal:11434
The LLM path uses OLLAMA_BASE_URL, while the embedding path uses OLLAMA_API_BASE. These are different variables. Set both if you are running both models through Ollama.
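Since the two variable names are easy to confuse, a quick check of the .env file can catch a missing one. This is a sketch only: the function name is hypothetical, and it assumes plain NAME=value lines without quoting or an export prefix.

```shell
# Hypothetical helper, not shipped with Engrammic.
# Print each Ollama variable that is absent or empty in the given .env file.
missing_ollama_vars() {
  env_file="$1"
  for name in OLLAMA_API_BASE OLLAMA_BASE_URL; do
    # Match lines of the form NAME=value with a non-empty value.
    if ! grep -Eq "^${name}=.+" "$env_file"; then
      echo "$name"
    fi
  done
}
```

Running `missing_ollama_vars ~/.engrammic/.env` prints nothing when both variables are set. The path is an assumption based on the stack being started from ~/.engrammic/.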
If the active tier's LLM provider has no usable credentials, the service runs in passive mode: no synthesis, but all storage and recall tools work normally.
Pull your chosen model before starting:
ollama pull llama3.2
Full Local Example
A .env that runs both embeddings and the generation LLM through Ollama, with no cloud keys:
# Required core
ENGRAMMIC_LICENSE_KEY=ENGR_your_key_here
POSTGRES_PASSWORD=your-secure-password
# Embeddings via Ollama (required)
EMBEDDING_MODEL=ollama/nomic-embed-text
OLLAMA_API_BASE=http://host.docker.internal:11434
EMBEDDING_DIMENSIONS=768
# Generation LLM via Ollama (optional, also needs the models.yaml override above)
OLLAMA_BASE_URL=http://host.docker.internal:11434
Start the stack from ~/.engrammic/:
docker compose up -d
After changing models.yaml, restart the services that read it so the new config is picked up:
docker compose restart app dagster dagster-daemon reaction-worker
Helpful Links
- Ollama - run models locally
- Ollama model library - available models
- Hugging Face Text Embeddings Inference - embedding sidecar
- LiteLLM Ollama docs - routing details
- LiteLLM embedding providers - full embedding model list
- Configuration reference - all environment variables
- Self-hosting guide - full setup walkthrough