vLLM with OpenClaw

Self-host LLMs locally with vLLM—OpenAI-compatible inference

vLLM is a high-throughput inference server for LLMs. It exposes an OpenAI-compatible API, so OpenClaw can connect to it like any OpenAI endpoint. Run Llama, Mistral, Qwen, and other models locally for full privacy and zero API costs.
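For example, a typical way to start the server (a sketch assuming vLLM is installed via pip and the model weights fit in VRAM; the model name here is illustrative):

```shell
# Install vLLM (a CUDA-capable GPU is strongly recommended)
pip install vllm

# Start an OpenAI-compatible server on the default port 8000
vllm serve meta-llama/Llama-3.1-8B-Instruct --port 8000
```

Older vLLM releases expose the same server as `python -m vllm.entrypoints.openai.api_server --model <model>`.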

When to Use vLLM

  • Full privacy — No data leaves your machine
  • No API costs — Run indefinitely after setup
  • GPU acceleration — Higher throughput than CPU-only
  • Multiple models — Serve a base model plus LoRA adapters from one instance, or run one instance per model

Prefer Ollama if you want a simpler one-command setup. Use vLLM when you need maximum throughput, high-concurrency batching, or models pulled directly from the Hugging Face Hub.

Configuration

Point OpenClaw at your vLLM server's OpenAI-compatible endpoint:

vLLM provider config
{
  "agent": {
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "provider": "openai",
    "baseUrl": "http://localhost:8000/v1",
    "apiKey": "dummy"
  }
}

vLLM ignores the API key unless the server was started with --api-key, so any placeholder such as "dummy" works for local use. Set baseUrl to your vLLM instance (default port 8000) and model to the name vLLM serves (the value passed to --model, also listed at /v1/models).
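To confirm the endpoint is reachable before pointing OpenClaw at it, you can query the OpenAI-compatible routes directly (a sketch assuming the default host, port, and the example model above):

```shell
# List served models; the "id" field is what goes in the "model" setting
curl http://localhost:8000/v1/models

# Send a minimal chat completion request
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [{"role": "user", "content": "Hello"}]
  }'
```

If both calls return JSON rather than connection errors, the OpenClaw config above should work as-is.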

Hardware Requirements

vLLM typically needs:

  • 7B models: 8–16GB VRAM (or CPU, slower)
  • 13B models: 16–24GB VRAM
  • 70B models: 80GB+ VRAM or multi-GPU
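The rule of thumb behind these ranges: fp16 weights take roughly 2 bytes per parameter, before KV cache and activation overhead (this sketch ignores quantization, which can roughly halve or quarter the footprint):

```shell
# Rough fp16 weight footprint: parameters (in billions) x 2 bytes ~= GB of VRAM.
# KV cache and activations add more on top; vLLM pre-allocates GPU memory,
# tunable with --gpu-memory-utilization.
params_b=7   # model size in billions of parameters
echo "${params_b}B fp16 weights: ~$((params_b * 2)) GB VRAM"
```

A 7B model therefore needs about 14 GB for weights alone in fp16, which is why the practical range above runs 8–16 GB once quantization and overhead are factored in.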

See Hardware requirements for local models. For lighter setups, use Ollama or cloud providers.