vLLM with OpenClaw

Self-host LLMs locally with vLLM—OpenAI-compatible inference

vLLM is a high-throughput inference server for LLMs. It exposes an OpenAI-compatible API, so OpenClaw can connect to it like any OpenAI endpoint. Run Llama, Mistral, Qwen, and other models locally for full privacy and zero API costs.
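For example, a typical way to start the server (a sketch assuming vLLM is installed via pip and the model weights fit in VRAM; the model name here is illustrative):

```shell
# Install vLLM (a CUDA-capable GPU is strongly recommended)
pip install vllm

# Start an OpenAI-compatible server on the default port 8000
vllm serve meta-llama/Llama-3.1-8B-Instruct --port 8000
```

Older vLLM releases expose the same server as `python -m vllm.entrypoints.openai.api_server --model <model>`.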

When to Use vLLM

  • Full privacy — No data leaves your machine
  • No API costs — Run indefinitely after setup
  • GPU acceleration — Higher throughput than CPU-only
  • Multiple models — Serve a base model plus LoRA adapters from one instance, or run one instance per model

Prefer Ollama if you want a simpler one-command setup. Use vLLM when you need maximum throughput, high-concurrency batching, or models pulled directly from the Hugging Face Hub.

Configuration

Point OpenClaw at your vLLM server's OpenAI-compatible endpoint:

vLLM provider config
{
  "agent": {
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "provider": "openai",
    "baseUrl": "http://localhost:8000/v1",
    "apiKey": "dummy"
  }
}

vLLM ignores the API key unless the server was started with --api-key, so any placeholder such as "dummy" works for local use. Set baseUrl to your vLLM instance (default port 8000) and model to the name vLLM serves (the value passed to --model, also listed at /v1/models).
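To confirm the endpoint is reachable before pointing OpenClaw at it, you can query the OpenAI-compatible routes directly (a sketch assuming the default host, port, and the example model above):

```shell
# List served models; the "id" field is what goes in the "model" setting
curl http://localhost:8000/v1/models

# Send a minimal chat completion request
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [{"role": "user", "content": "Hello"}]
  }'
```

If both calls return JSON rather than connection errors, the OpenClaw config above should work as-is.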

Hardware Requirements

vLLM typically needs:

  • 7B models: 8–16GB VRAM (or CPU, slower)
  • 13B models: 16–24GB VRAM
  • 70B models: 80GB+ VRAM or multi-GPU
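The rule of thumb behind these ranges: fp16 weights take roughly 2 bytes per parameter, before KV cache and activation overhead (this sketch ignores quantization, which can roughly halve or quarter the footprint):

```shell
# Rough fp16 weight footprint: parameters (in billions) x 2 bytes ~= GB of VRAM.
# KV cache and activations add more on top; vLLM pre-allocates GPU memory,
# tunable with --gpu-memory-utilization.
params_b=7   # model size in billions of parameters
echo "${params_b}B fp16 weights: ~$((params_b * 2)) GB VRAM"
```

A 7B model therefore needs about 14 GB for weights alone in fp16, which is why the practical range above runs 8–16 GB once quantization and overhead are factored in.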

See Hardware requirements for local models. For lighter setups, use Ollama or cloud providers.