vLLM with OpenClaw
Self-host LLMs locally with vLLM—OpenAI-compatible inference
vLLM is a high-throughput inference server for LLMs. It exposes an OpenAI-compatible API, so OpenClaw can connect to it like any OpenAI endpoint. Run Llama, Mistral, Qwen, and other models locally for full privacy and zero API costs.
Prefer Ollama if you want a simpler one-command setup. Use vLLM when you need maximum throughput, multi-model serving, or Hugging Face-format checkpoints that vLLM loads directly.
Point OpenClaw at your vLLM server's OpenAI-compatible endpoint:
{
  "agent": {
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "provider": "openai",
    "baseUrl": "http://localhost:8000/v1",
    "apiKey": "dummy"
  }
}
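A local vLLM server does not check the API key unless it was started with --api-key, so a placeholder such as apiKey: "dummy" works; if you did set --api-key, use that value instead. Set baseUrl to your vLLM instance (default port 8000) and keep the /v1 suffix. The model value must match the name vLLM serves: the model you passed at startup (e.g. via the --model flag), or the alias given with --served-model-name.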
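If you are unsure which model name the server reports, you can ask it. The sketch below is only an illustration, not part of OpenClaw: it assumes the server from the config above is reachable at http://localhost:8000/v1 and reads the OpenAI-style /v1/models listing. The function names model_ids and list_served_models are made up for this example.

```python
import json
import urllib.request


def model_ids(payload: dict) -> list[str]:
    """Extract model IDs from an OpenAI-style /v1/models response body."""
    return [entry["id"] for entry in payload.get("data", [])]


def list_served_models(base_url: str = "http://localhost:8000/v1",
                       api_key: str = "dummy") -> list[str]:
    """Ask the server which model names it serves.

    base_url and api_key are assumptions copied from the example config;
    change them to match your own server.
    """
    request = urllib.request.Request(
        f"{base_url}/models",
        headers={"Authorization": f"Bearer {api_key}"},
    )
    with urllib.request.urlopen(request, timeout=5) as response:
        return model_ids(json.load(response))
```

Calling list_served_models() against a running server should return the names you can put in the model field; copy one of them verbatim.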
vLLM typically needs a GPU with stable drivers and enough VRAM for your chosen weights. See Hardware requirements for local models. For lighter setups, use Ollama or cloud providers.
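As a rough guide to how much GPU memory a model's weights take, multiply the parameter count by the bytes per parameter. The helper below is a back-of-the-envelope sketch with a made-up name (weight_memory_gib); it covers the weights only, and the KV cache and activations need additional memory on top.

```python
def weight_memory_gib(params_billions: float, bytes_per_param: float = 2.0) -> float:
    """Approximate memory for the weights alone, in GiB.

    bytes_per_param is 2.0 for FP16/BF16, 1.0 for 8-bit and 0.5 for
    4-bit quantized weights. KV cache and activations are not included.
    """
    return params_billions * 1e9 * bytes_per_param / 2**30


# An 8B-parameter model in FP16 needs roughly 14.9 GiB for weights alone.
print(round(weight_memory_gib(8), 1))       # 14.9
# The same model with 4-bit weights needs roughly 3.7 GiB.
print(round(weight_memory_gib(8, 0.5), 1))  # 3.7
```

Treat the result as a lower bound when choosing a GPU, not as the total requirement.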
Use a capable model for interactive chat and a cheaper variant (or smaller context window) for scheduled checks. See example setups & model routing.
vLLM needs stable GPU drivers and enough VRAM for your chosen weights. Run health checks after every reboot: if the Gateway starts but model requests return HTTP 500, the inference server is usually what is down, not OpenClaw. After upgrading vLLM or the model, confirm the served model name and pin it explicitly in your config.
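A post-reboot health check can be as small as one HTTP request. This sketch assumes the vLLM OpenAI-compatible server is on localhost:8000 and uses its /health route, which sits at the server root rather than under /v1. The function name vllm_is_healthy and the opener parameter (there only so the check can be exercised without a live server) are illustrative.

```python
import urllib.error
import urllib.request


def vllm_is_healthy(server_url: str = "http://localhost:8000",
                    timeout: float = 5.0,
                    opener=urllib.request.urlopen) -> bool:
    """Return True if GET <server_url>/health answers with HTTP 200.

    server_url is an assumption (vLLM's default port); adjust as needed.
    Connection errors, timeouts and non-2xx replies all count as unhealthy.
    """
    try:
        with opener(f"{server_url}/health", timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        return False
```

If this returns False while the Gateway itself is running, look at the vLLM process and GPU drivers before debugging OpenClaw.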