Running a large language model (LLM) locally offers serious advantages: privacy, speed, offline capabilities, and cost control. Wrap that in Docker, and you’ve got a clean, portable setup ready for production or local R&D.
Here’s how to get a local LLM running inside a Docker container without rage-quitting or sacrificing GPU horsepower.
Step 1: Pick Your LLM (and Don’t Go Too Big)
Choose a model that fits your system constraints. For local dev, consider:
Mistral 7B (great balance of performance and speed)
Phi-2 (small but solid reasoning)
LLaMA 3 8B or smaller (if you have enough VRAM)
GGUF-quantized models via llama.cpp (ultra-lightweight)
💡 Tip: Use Hugging Face to find Docker-ready or quantized versions.
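If you go the GGUF route, the huggingface_hub client can pull a quantized file straight into a local models/ folder. A minimal sketch; the repo id and file name below are examples, so swap in whichever model you actually pick:
# download_model.py -- fetch a quantized GGUF file from Hugging Face (example repo/file names)
from huggingface_hub import hf_hub_download

model_path = hf_hub_download(
    repo_id="TheBloke/Mistral-7B-Instruct-v0.2-GGUF",   # example repo
    filename="mistral-7b-instruct-v0.2.Q4_K_M.gguf",    # example quantization level
    local_dir="models",                                 # lands in ./models for the volume mount in Step 4
)
print(f"Saved to {model_path}")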
Step 2: Prepare Your Docker Environment
Install Docker, NVIDIA Container Toolkit (for GPU use), and optionally Docker Compose.
# Install Docker
sudo apt install docker.io -y
# Add yourself to docker group
sudo usermod -aG docker $USER
# Install the NVIDIA container runtime (the legacy nvidia-docker2 package)
distribution=$(. /etc/os-release; echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
sudo apt update && sudo apt install -y nvidia-docker2
sudo systemctl restart docker
Note: on newer distributions the toolkit ships as nvidia-container-toolkit rather than nvidia-docker2. If the package above isn’t available, install nvidia-container-toolkit and run sudo nvidia-ctk runtime configure --runtime=docker before restarting Docker.
Make sure GPU containers work:
docker run --rm --gpus all nvidia/cuda:12.0.1-base-ubuntu22.04 nvidia-smi
Step 3: Create a Lightweight Dockerfile for Your LLM
FROM nvidia/cuda:12.0.1-base-ubuntu22.04

RUN apt-get update && apt-get install -y \
    python3 python3-pip git && \
    pip3 install torch transformers accelerate

WORKDIR /app

# Optional: clone a serving frontend like Text Generation WebUI (or swap in llama.cpp)
# and install its dependencies (check the repo for the exact requirements file for your release)
RUN git clone https://github.com/oobabooga/text-generation-webui . && \
    pip3 install -r requirements.txt

EXPOSE 7860
# --listen binds the UI to 0.0.0.0 so it's reachable through the published port
CMD ["python3", "server.py", "--listen"]
You can swap in llama.cpp or vLLM, depending on your architecture.
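For instance, vLLM exposes a simple offline-inference API from Python. A minimal sketch, assuming you’ve pip-installed vllm and using an example Hugging Face model id (swap in whatever you picked in Step 1):
# vllm_demo.py -- minimal offline inference with vLLM
from vllm import LLM, SamplingParams

llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.2")   # example model id; needs enough VRAM
params = SamplingParams(temperature=0.7, max_tokens=128)

outputs = llm.generate(["Explain Docker volumes in one sentence."], params)
print(outputs[0].outputs[0].text)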
Step 4: Run It Like You Mean It
docker build -t local-llm .
docker run --gpus all -p 7860:7860 local-llm
If you’re going the llama.cpp route (this assumes a separate image, here called llama-cpp, built with the llama.cpp binary as its entrypoint):
docker run --gpus all -v $(pwd)/models:/models llama-cpp --model /models/ggml-model.gguf
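If you’d rather drive the GGUF model from Python instead of the llama.cpp CLI, the llama-cpp-python bindings cover the same ground. A rough sketch, assuming the package is installed in the image and the model sits at the mount path used above:
# llama_cpp_demo.py -- load a GGUF model via the llama-cpp-python bindings
from llama_cpp import Llama

llm = Llama(
    model_path="/models/ggml-model.gguf",  # same path as the volume mount above
    n_gpu_layers=-1,                       # offload all layers to the GPU when one is available
    n_ctx=4096,                            # context window size
)

result = llm("Q: What does a Docker volume do? A:", max_tokens=64)
print(result["choices"][0]["text"])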
✅ Optional Bonus: Add a volume for persistent logs or external API integration.
Step 5: API It Up
Use FastAPI or Flask inside the container to serve your local model as a REST API. Great for local dev, automation, or integrating into agent frameworks.
from fastapi import FastAPI
from transformers import pipeline

app = FastAPI()
# Point this at your local model directory (baked into the image or mounted as a volume)
llm = pipeline("text-generation", model="path_to_local_model")

@app.post("/generate")
def generate(prompt: str):
    # prompt arrives as a query parameter, e.g. POST /generate?prompt=Hello
    return {"output": llm(prompt)[0]["generated_text"]}
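Serve it with uvicorn inside the container (for example uvicorn server:app --host 0.0.0.0 --port 8000, assuming the file above is saved as server.py and the port is published with -p 8000:8000), then smoke-test it from the host. A quick check using requests:
# client.py -- smoke-test the /generate endpoint from the host
import requests

resp = requests.post(
    "http://localhost:8000/generate",
    params={"prompt": "Write a haiku about containers."},  # sent as a query parameter
)
resp.raise_for_status()
print(resp.json()["output"])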
Final Thoughts: Containerized Models Are the Future
Dockerizing your LLM setup gives you:
Repeatability across environments
GPU-accelerated local inference
Data sovereignty (hello, enterprise)
Smooth deployment to edge or hybrid infrastructure
Whether you’re running a local agent, testing AI workflows, or building an internal ChatGPT clone, a Docker-wrapped LLM is your launchpad.