vLLM with OpenClaw

Self-host LLMs locally with vLLM—OpenAI-compatible inference

vLLM is a high-throughput inference server for LLMs. It exposes an OpenAI-compatible API, so OpenClaw can connect to it like any OpenAI endpoint. Run Llama, Mistral, Qwen, and other models locally for full privacy and zero API costs.

When to Use vLLM

Full privacy: No data leaves your machine
No API costs: Run indefinitely after setup
GPU acceleration: Higher throughput than CPU-only
Multiple models: Run one vLLM server per model, each on its own port, and point OpenClaw at the one you need

Prefer Ollama if you want a simpler one-command setup. Use vLLM when you need maximum throughput under concurrent load or want to load Hugging Face model checkpoints directly.

Configuration

Point OpenClaw at your vLLM server's OpenAI-compatible endpoint:

vLLM provider config

{
  "agent": {
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "provider": "openai",
    "baseUrl": "http://localhost:8000/v1",
    "apiKey": "dummy"
  }
}

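A local vLLM server ignores the API key unless it was started with --api-key, so a placeholder such as apiKey: "dummy" works. Set baseUrl to your vLLM instance (default port 8000) and keep the /v1 suffix. The model value must match the name vLLM serves: the model passed at launch, or the --served-model-name alias if you set one.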
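A mismatched model name is a common cause of failed requests, so it can help to compare the configured name against what the server reports. The sketch below assumes the server answers GET /v1/models with the usual OpenAI-style list body. The function names are illustrative and not part of OpenClaw or vLLM.

```python
import json
import urllib.request


def fetch_models(base_url: str, api_key: str = "dummy", timeout: float = 5.0) -> dict:
    """GET {base_url}/models and return the parsed JSON body.

    base_url is the same value used for baseUrl in the config,
    e.g. "http://localhost:8000/v1".
    """
    request = urllib.request.Request(
        base_url.rstrip("/") + "/models",
        headers={"Authorization": f"Bearer {api_key}"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


def served_model_ids(models_payload: dict) -> list[str]:
    """Extract the model ids from an OpenAI-style model list response."""
    return [entry["id"] for entry in models_payload.get("data", [])]


def check_configured_model(configured: str, models_payload: dict) -> bool:
    """Return True if the configured model name is one the server reports."""
    return configured in served_model_ids(models_payload)
```

With a server running, check_configured_model("your-model-name", fetch_models("http://localhost:8000/v1")) should return True when the config and the server agree.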

Hardware Requirements

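Approximate VRAM by model size (the lower figures assume quantized weights; FP16 weights alone take about 2 GB per billion parameters):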

7B models: 8–16GB VRAM (or CPU, slower)
13B models: 16–24GB VRAM
70B models: 80GB+ VRAM or multi-GPU
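The figures above can be sanity-checked with simple arithmetic: weight memory is roughly parameter count times bytes per parameter. The sketch below covers weights only and uses 1 GB = 10^9 bytes. KV cache, activations, and runtime overhead come on top, so treat the results as lower bounds rather than exact requirements. The precision labels are common conventions, not vLLM settings.

```python
# Bytes needed to store one parameter at common precisions.
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(params_billions: float, bytes_per_param: float) -> float:
    """Approximate memory for the model weights alone, in GB (1 GB = 1e9 bytes)."""
    # (params_billions * 1e9 params) * bytes_per_param / 1e9 bytes per GB
    return params_billions * bytes_per_param


def weight_table(sizes=(7, 13, 70)) -> dict:
    """Weight memory per model size (billions of parameters) and precision."""
    return {
        size: {name: weight_memory_gb(size, width) for name, width in BYTES_PER_PARAM.items()}
        for size in sizes
    }
```

For example, weight_memory_gb(7, 2.0) gives 14.0, which is why a 7B model only fits in 8 GB of VRAM when its weights are quantized.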

See Hardware requirements for local models. For lighter setups, use Ollama or cloud providers.

Tiered routing

Use a capable model for interactive chat and a cheaper variant (or smaller context window) for scheduled checks. See example setups & model routing.
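As a rough illustration of the idea, the sketch below picks a model per task kind and emits a config in the same shape as the provider config in the Configuration section. It assumes two separate vLLM servers on ports 8000 and 8001. The task kinds, model choices, and function name are hypothetical, and OpenClaw's own routing options may look different.

```python
# Hypothetical routes: one vLLM server per model, each on its own port.
ROUTES = {
    "interactive": {
        "model": "Qwen/Qwen2.5-7B-Instruct",
        "baseUrl": "http://localhost:8000/v1",
    },
    "scheduled": {
        "model": "Qwen/Qwen2.5-1.5B-Instruct",
        "baseUrl": "http://localhost:8001/v1",
    },
}


def agent_config(task_kind: str) -> dict:
    """Build an agent config for the task kind, defaulting to the interactive route."""
    route = ROUTES.get(task_kind, ROUTES["interactive"])
    return {
        "agent": {
            "model": route["model"],
            "provider": "openai",
            "baseUrl": route["baseUrl"],
            "apiKey": "dummy",
        }
    }
```

Unknown task kinds fall back to the more capable model, on the assumption that a slower correct answer is preferable to a cheap wrong one.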

GPU and ops

vLLM needs stable GPU drivers and enough VRAM for your chosen weights. Run health checks after every reboot. If the Gateway starts but model requests return HTTP 500, the inference server, not OpenClaw, is usually what is down. Pin exact model names in your config and re-verify them after upgrades.
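A post-reboot check can be as small as one HTTP probe. The sketch below assumes the vLLM OpenAI-compatible server exposes GET /health at the server root (not under /v1) and returns 200 when it is ready. The function names and messages are illustrative.

```python
import urllib.error
import urllib.request


def probe(url: str, timeout: float = 3.0) -> tuple[bool, str]:
    """Return (ok, detail) for a GET request; ok is True only on HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200, f"HTTP {response.status}"
    except urllib.error.HTTPError as error:
        # The server answered, but with an error status such as 500.
        return False, f"HTTP {error.code}"
    except (urllib.error.URLError, OSError) as error:
        # Connection refused, timeout, DNS failure, and similar.
        return False, f"unreachable: {error}"


def diagnose(server_root: str = "http://localhost:8000") -> str:
    """Summarize whether the inference server looks healthy."""
    ok, detail = probe(server_root.rstrip("/") + "/health")
    if ok:
        return "vLLM is up; if requests still fail, check the model name and OpenClaw config"
    return f"vLLM is not healthy ({detail}); restart the inference server before debugging OpenClaw"
```

Running diagnose() from a startup script or scheduled job gives a quick signal about which side to investigate first.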