Running a large language model (LLM) locally offers serious advantages: privacy, speed, offline capabilities, and cost control. Wrap that in Docker, and you’ve got a clean, portable setup ready for production or local R&D.
Here’s how to get a local LLM running inside a Docker container without rage-quitting or sacrificing GPU horsepower.
Step 1: Pick Your LLM (and Don’t Go Too Big)
Choose a model that fits your system constraints. For local dev, consider:
Mistral 7B (great balance of performance and speed)
Phi-2 (small but solid reasoning)
LLaMA 3 8B or smaller (if you have enough VRAM)
GGUF-quantized models via llama.cpp (ultra-lightweight)
💡 Tip: Use Hugging Face to find Docker-ready or quantized versions.
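If you go the GGUF route, the huggingface_hub client can pull a quantized file straight into a local models/ folder. A minimal sketch; the repo id and file name below are examples, so swap in whichever model you actually pick:
# download_model.py -- fetch a quantized GGUF file from Hugging Face (example repo/file names)
from huggingface_hub import hf_hub_download

model_path = hf_hub_download(
    repo_id="TheBloke/Mistral-7B-Instruct-v0.2-GGUF",   # example repo
    filename="mistral-7b-instruct-v0.2.Q4_K_M.gguf",    # example quantization level
    local_dir="models",                                 # lands in ./models for the volume mount in Step 4
)
print(f"Saved to {model_path}")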
Step 2: Prepare Your Docker Environment
Install Docker, NVIDIA Container Toolkit (for GPU use), and optionally Docker Compose.
# Install Docker
sudo apt install docker.io -y
# Add yourself to docker group
sudo usermod -aG docker $USER
# Install the NVIDIA container runtime (the legacy nvidia-docker2 package)
distribution=$(. /etc/os-release; echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
sudo apt update && sudo apt install -y nvidia-docker2
sudo systemctl restart docker
Note: on newer distributions the toolkit ships as nvidia-container-toolkit rather than nvidia-docker2. If the package above isn’t available, install nvidia-container-toolkit and run sudo nvidia-ctk runtime configure --runtime=docker before restarting Docker.
Make sure GPU containers work:
docker run --rm --gpus all nvidia/cuda:12.0.1-base-ubuntu22.04 nvidia-smi
Step 3: Create a Lightweight Dockerfile for Your LLM
FROM nvidia/cuda:12.0.1-base-ubuntu22.04

RUN apt-get update && apt-get install -y \
    python3 python3-pip git && \
    pip3 install torch transformers accelerate

WORKDIR /app

# Optional: clone a serving frontend like Text Generation WebUI (or swap in llama.cpp)
# and install its dependencies (check the repo for the exact requirements file for your release)
RUN git clone https://github.com/oobabooga/text-generation-webui . && \
    pip3 install -r requirements.txt

EXPOSE 7860
# --listen binds the UI to 0.0.0.0 so it's reachable through the published port
CMD ["python3", "server.py", "--listen"]
You can swap in llama.cpp or vLLM, depending on your architecture.
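For instance, vLLM exposes a simple offline-inference API from Python. A minimal sketch, assuming you’ve pip-installed vllm and using an example Hugging Face model id (swap in whatever you picked in Step 1):
# vllm_demo.py -- minimal offline inference with vLLM
from vllm import LLM, SamplingParams

llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.2")   # example model id; needs enough VRAM
params = SamplingParams(temperature=0.7, max_tokens=128)

outputs = llm.generate(["Explain Docker volumes in one sentence."], params)
print(outputs[0].outputs[0].text)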
Step 4: Run It Like You Mean It
docker build -t local-llm .
docker run --gpus all -p 7860:7860 local-llm
If you’re going the llama.cpp route (this assumes a separate image, here called llama-cpp, built with the llama.cpp binary as its entrypoint):
docker run --gpus all -v $(pwd)/models:/models llama-cpp --model /models/ggml-model.gguf
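If you’d rather drive the GGUF model from Python instead of the llama.cpp CLI, the llama-cpp-python bindings cover the same ground. A rough sketch, assuming the package is installed in the image and the model sits at the mount path used above:
# llama_cpp_demo.py -- load a GGUF model via the llama-cpp-python bindings
from llama_cpp import Llama

llm = Llama(
    model_path="/models/ggml-model.gguf",  # same path as the volume mount above
    n_gpu_layers=-1,                       # offload all layers to the GPU when one is available
    n_ctx=4096,                            # context window size
)

result = llm("Q: What does a Docker volume do? A:", max_tokens=64)
print(result["choices"][0]["text"])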
✅ Optional Bonus: Add a volume for persistent logs or external API integration.
Step 5: API It Up
Use FastAPI or Flask inside the container to serve your local model as a REST API. Great for local dev, automation, or integrating into agent frameworks.
from fastapi import FastAPI
from transformers import pipeline

app = FastAPI()
# Point this at your local model directory (baked into the image or mounted as a volume)
llm = pipeline("text-generation", model="path_to_local_model")

@app.post("/generate")
def generate(prompt: str):
    # prompt arrives as a query parameter, e.g. POST /generate?prompt=Hello
    return {"output": llm(prompt)[0]["generated_text"]}
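Serve it with uvicorn inside the container (for example uvicorn server:app --host 0.0.0.0 --port 8000, assuming the file above is saved as server.py and the port is published with -p 8000:8000), then smoke-test it from the host. A quick check using requests:
# client.py -- smoke-test the /generate endpoint from the host
import requests

resp = requests.post(
    "http://localhost:8000/generate",
    params={"prompt": "Write a haiku about containers."},  # sent as a query parameter
)
resp.raise_for_status()
print(resp.json()["output"])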
Final Thoughts: Containerized Models Are the Future
Dockerizing your LLM setup gives you:
Repeatability across environments
GPU-accelerated local inference
Data sovereignty (hello, enterprise)
Smooth deployment to edge or hybrid infrastructure
Whether you’re running a local agent, testing AI workflows, or building an internal ChatGPT clone, a Docker-wrapped LLM is your launchpad.