Running Models Locally

Configure Engrammic to use local embedding models and LLMs with no cloud API keys

The self-hosted image supports running both the embedding model and the generation LLM on your own hardware. This page covers how to wire up Ollama and the Hugging Face Text Embeddings Inference sidecar.

Before going further, a quick reminder of the two-model split:

Embedding model (required). Powers remember, learn, and recall. Without one, the service cannot do useful work.
Generation LLM (optional). Powers SAGE synthesis only. Without it, the service runs in passive mode: storage and recall work, synthesis is disabled.

You can run embeddings locally without touching the LLM at all, or run both locally.

Embeddings Locally

The service uses LiteLLM to route embedding calls. Set EMBEDDING_MODEL to an Ollama model string and point OLLAMA_API_BASE at your local Ollama instance.

Container networking

Containers reach a host-running Ollama via host.docker.internal. This hostname resolves automatically on Docker Desktop (macOS, Windows). On Linux it resolves only when the container is started with --add-host host.docker.internal:host-gateway, or with the equivalent extra_hosts entry in the compose file.

Configuration (.env)

# Embedding model (required)
EMBEDDING_MODEL=ollama/nomic-embed-text
OLLAMA_API_BASE=http://host.docker.internal:11434

# Dimensions must match the model exactly
EMBEDDING_DIMENSIONS=768

Pull the model on your host before starting the stack:

ollama pull nomic-embed-text
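Before starting the stack, it can help to confirm that Ollama is reachable and that the model was actually pulled. The sketch below is one way to do that, assuming curl is installed and using Ollama's /api/tags endpoint, which lists locally available models. The function name ollama_has_model is hypothetical and not part of Engrammic or Ollama.

```shell
# Return 0 if the Ollama server at base URL $1 lists a model whose name
# contains $2, and non-zero if the server is unreachable or the model is absent.
# ollama_has_model is a hypothetical helper for illustration.
ollama_has_model() {
  local base="$1" model="$2" body
  # -f: fail on HTTP errors, -s: quiet, --max-time: do not hang on a dead host
  body="$(curl -fs --max-time 5 "${base}/api/tags")" || return 1
  printf '%s' "$body" | grep -qF -- "$model"
}

# Usage from the host (inside a container the base would be
# http://host.docker.internal:11434 instead):
#   ollama_has_model http://localhost:11434 nomic-embed-text && echo "model ready"
```

A non-zero result does not say which of the two checks failed, so run ollama list on the host if you need to tell an unreachable server from a missing model.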

Changing the embedding model after data exists requires re-embedding everything.

EMBEDDING_DIMENSIONS must match the chosen model. If you swap models later, you must recreate the Qdrant collections and re-embed all stored content. Plan your model choice before writing significant data.

Common dimension values: 768 for nomic-embed-text and vertex_ai/text-embedding-005, 384 for MiniLM-based models, 1536 for openai/text-embedding-3-small.
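Because a mismatch is costly to fix later, a small pre-flight check on the .env file can catch it before any data is written. The following is a minimal sketch that encodes only the model strings named explicitly above. The helper names (expected_dims, env_value, check_dims) are hypothetical, and any model not in the table still needs to be checked by hand.

```shell
# Print the expected dimension for a model string listed above,
# or an empty string if the model is not in the table.
expected_dims() {
  case "$1" in
    ollama/nomic-embed-text|vertex_ai/text-embedding-005) echo 768 ;;
    openai/text-embedding-3-small) echo 1536 ;;
    *) echo "" ;;
  esac
}

# Print the value of KEY from a .env file (last assignment wins).
env_value() {
  local key="$1" file="$2"
  sed -n "s/^${key}=//p" "$file" | tail -n 1
}

# Return 0 when EMBEDDING_DIMENSIONS matches the table, 1 on a mismatch,
# and 2 when the model is not in the table.
check_dims() {
  local file="$1" model dims want
  model="$(env_value EMBEDDING_MODEL "$file")"
  dims="$(env_value EMBEDDING_DIMENSIONS "$file")"
  want="$(expected_dims "$model")"
  if [ -z "$want" ]; then
    echo "unknown model '$model': check its dimension manually" >&2
    return 2
  fi
  if [ "$dims" != "$want" ]; then
    echo "mismatch: $model expects $want, .env has '$dims'" >&2
    return 1
  fi
  echo "ok: $model uses $want dimensions"
}

# Usage: check_dims ~/.engrammic/.env
```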

Alternative: Hugging Face Text Embeddings Inference (TEI)

TEI is a high-throughput embedding server from Hugging Face. It is a good alternative to Ollama if you need higher throughput or want to run a specific BERT-family model. Add it as a sidecar in your compose file, then point EMBEDDING_MODEL at the appropriate LiteLLM model string and set the sidecar URL as the API base.

See the LiteLLM embeddings documentation for the exact model string format.

LLM Locally (optional)

SAGE synthesis uses a separate generation LLM. Unlike embeddings, the LLM provider and model are chosen in models.yaml, not through a single environment variable. You point that file at Ollama and supply the Ollama base URL through the environment.

This uses the host config override directory described in Config Files. In short: the compose file mounts ~/.engrammic/config into the container, and any file you place there wins over the default baked into the image.

Step 1: Override models.yaml

Copy the default out of the image, then edit it:

mkdir -p ~/.engrammic/config
docker run --rm --entrypoint cat \
  europe-north1-docker.pkg.dev/engrammic/releases/engrammic-api:latest \
  /app/config/models.yaml > ~/.engrammic/config/models.yaml

In ~/.engrammic/config/models.yaml, set the reasoning, fast, and query_expander blocks of the active tier to use Ollama:

    reasoning:
      provider: ollama
      model: llama3.2
    fast:
      provider: ollama
      model: llama3.2
    query_expander:
      provider: ollama
      model: llama3.2

Step 2: Point the LLM at your Ollama (.env)

OLLAMA_BASE_URL=http://host.docker.internal:11434

The LLM path uses OLLAMA_BASE_URL, while the embedding path uses OLLAMA_API_BASE. These are different variables. Set both if you are running both models through Ollama.
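Since the two names are easy to confuse, a quick check of the .env file can show which of them is missing. This is only a sketch: check_ollama_vars is a hypothetical helper, and an unset OLLAMA_BASE_URL is perfectly valid if you intend to run without a local generation LLM.

```shell
# Report which of the two Ollama URL variables is missing from a .env file.
# Returns 0 only when both are set. check_ollama_vars is a hypothetical helper.
check_ollama_vars() {
  local file="$1" emb llm
  emb="$(sed -n 's/^OLLAMA_API_BASE=//p' "$file" | tail -n 1)"
  llm="$(sed -n 's/^OLLAMA_BASE_URL=//p' "$file" | tail -n 1)"
  [ -n "$emb" ] || echo "OLLAMA_API_BASE is unset (embedding path)"
  [ -n "$llm" ] || echo "OLLAMA_BASE_URL is unset (LLM path)"
  [ -n "$emb" ] && [ -n "$llm" ]
}

# Usage: check_ollama_vars ~/.engrammic/.env && echo "both Ollama URLs are set"
```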

If the active tier's LLM provider has no usable credentials, the service runs in passive mode: no synthesis, but all storage and recall tools work normally.

Pull your chosen model before starting:

ollama pull llama3.2

Full Local Example

A .env with both embeddings and the generation LLM running through Ollama, and no cloud keys:

# Required core
ENGRAMMIC_LICENSE_KEY=ENGR_your_key_here
POSTGRES_PASSWORD=your-secure-password

# Embeddings via Ollama (required)
EMBEDDING_MODEL=ollama/nomic-embed-text
OLLAMA_API_BASE=http://host.docker.internal:11434
EMBEDDING_DIMENSIONS=768

# Generation LLM via Ollama (optional, also needs the models.yaml override above)
OLLAMA_BASE_URL=http://host.docker.internal:11434

Start the stack from ~/.engrammic/:

docker compose up -d

After changing models.yaml, restart the services that read it so the new config is picked up:

docker compose restart app dagster dagster-daemon reaction-worker

Helpful Links

Ollama - run models locally
Ollama model library - available models
Hugging Face Text Embeddings Inference - embedding sidecar
LiteLLM Ollama docs - routing details
LiteLLM embedding providers - full embedding model list
Configuration reference - all environment variables
Self-hosting guide - full setup walkthrough
