Building a Local LLM Server on Blackwell: Everything That Went Wrong
One machine, on the local network, running a state-of-the-art LLM without sending anything to an external API. Reachable from anywhere on the LAN for AI coding agents and general use.
The hardware arrived as an NVIDIA RTX PRO 6000 Blackwell Workstation Edition — 96 GB GDDR7 VRAM. Getting to a working stack took a full day of debugging and produced more interesting failure modes than expected. Then the model changed. Then web search was added. Then monitoring. This is the full story.
Part 1: The GPU That Wouldn’t Wake Up
When the Pro 6000 was installed and the system booted, lspci found it:
01:00.0 VGA compatible controller: NVIDIA Corporation GB202GL [RTX PRO 6000 Blackwell Workstation Edition]
But nvidia-smi returned: No devices were found.
The driver was loaded. /dev/nvidia0 existed. /proc/driver/nvidia/gpus/ listed the card. But GPU Firmware: N/A. The driver knew the GPU was there but couldn’t initialize it.
The first diagnosis was wrong — I tried switching driver variants (server, consumer, open) and versions (595, 580). None helped. The real problem showed up when checking what firmware each driver package actually ships:
dpkg -L nvidia-firmware-595-server-595.71.05
# → gsp_ga10x.bin (Ampere)
# → gsp_tu10x.bin (Turing)
That’s it. No Blackwell. Every proprietary nvidia driver package — 595, 580, all of them — only ships firmware for Turing and Ampere.
Meanwhile, linux-firmware-nvidia-graphics ships GB202 firmware at /usr/lib/firmware/nvidia/gb202/gsp/gsp-570.144.bin.zst. But the proprietary kernel module uses flat paths (nvidia/<version>/gsp_<arch>.bin) and never looks in per-chip directories. That path layout is only for the open-source kernel module.
The fix:
sudo apt-get install nvidia-driver-580-open
sudo reboot
After reboot, nvidia-smi showed the full 95.6 GB.
The rule for Blackwell on Linux: the proprietary nvidia driver has no GB202 firmware. You must use the open-source kernel module (nvidia-driver-xxx-open). It’s not a preference — the proprietary path simply cannot initialize the hardware. The open module has been co-developed with the kernel community since Turing and is production-quality. For Blackwell, it’s the only option.
Part 2: llama.cpp → vLLM
Before the Pro 6000, the inference stack was simpler: llama-server from llama.cpp, a Q4_K_M quantized 32B model, LiteLLM in front for API format bridging. This worked fine for a single user.
With 96 GB and multiple concurrent clients (coding agents fire parallel requests), two things changed:
- PagedAttention handles concurrent requests with different context lengths efficiently. With one user it barely matters; with 3+ parallel requests, it avoids KV cache OOM and fragmentation that llama.cpp doesn’t manage.
- Continuous batching processes concurrent requests together rather than serially. llama.cpp runs one request at maximum speed; vLLM batches them, using the GPU more efficiently under load.
- Native FP8. Blackwell has dedicated FP8 tensor cores (the GB202 die). vLLM uses them natively. GGUF quantization doesn’t target hardware FP8.
Decision: migrate to vLLM.
Part 3: Python 3.14 Is a Minefield
Ubuntu 26.04 ships Python 3.14.4 as the system Python. This caused two separate failures.
Problem 1: python3.14-venv not installed by default.
sudo apt install python3.14-venv
Easy.
Problem 2: LiteLLM proxy completely broken on Python 3.14.
pip install "litellm[proxy]" failed in two ways:
orjsonhas no pre-built wheel for 3.14. Source build requires Rust.uvloopreferencesasyncio.events.BaseDefaultEventLoopPolicy, which was removed from the standard library in Python 3.14. LiteLLM proxy couldn’t even start.
Fix: a separate virtualenv using Python 3.12 via uv:
~/.local/bin/uv python install 3.12
~/.local/bin/uv venv ~/litellm-env --python 3.12
~/.local/bin/uv pip install "litellm[proxy]" --python ~/litellm-env/bin/python3
The pattern that works: vLLM in ~/vllm-env (Python 3.14 — vLLM is fine with it), LiteLLM in ~/litellm-env (Python 3.12). Two venvs, two systemd services, no shared state between them.
Lesson: For anything mixing C extensions and async frameworks, stay on Python 3.12 until the ecosystem catches up. Python 3.14 broke things that were load-bearing.
Part 4: The 100 GB Root Partition
Downloading Qwen3-Coder-Next-FP8 (the first model choice — an 80B MoE with only 3B parameters active per token):
hf download Qwen/Qwen3-Coder-Next-FP8 --local-dir ~/models/qwen3-coder-next-fp8
Side note: huggingface-cli is deprecated on Ubuntu 26.04. The command is hf. And this is a gated model, so:
hf auth login --token hf_xxx...
The download got through 27 of 40 shards (~57 GB) then died:
OSError: I/O error: No space left on device (os error 28)
Root cause: the LVM logical volume for the root filesystem was provisioned at 100 GB on a 2 TB drive. 57 GB of model plus OS and packages filled it.
Fix — online LVM expansion, no reboot:
sudo lvextend -l +100%FREE /dev/ubuntu-vg/ubuntu-lv
sudo resize2fs /dev/mapper/ubuntu--vg-ubuntu--lv
Root partition went from ~3 GB free to 1.7 TB free while mounted. Cleaned up the 12 incomplete shards, resumed, and all 40 downloaded cleanly. Final size on disk: 75 GB.
Expand LVM before starting large downloads. The 100 GB ceiling was a provisioning oversight that cost an hour.
Part 5: Three vLLM Startup Failures
Creating the systemd service exposed three separate bugs.
Bug 1: --dtype fp8 is not a valid flag.
--dtype fp8
vLLM rejected it. Valid options: auto, bfloat16, float, float16, float32, half. FP8 is a quantization scheme that vLLM auto-detects from the model’s config.json. The correct flag: --dtype auto.
Bug 2: Mamba cache blocks exceeded.
Qwen3-Coder-Next is a hybrid Mamba+Attention architecture — not a pure transformer. vLLM sizes a separate Mamba state cache alongside the standard KV cache. At --gpu-memory-utilization 0.95 and --max-num-seqs 1024, vLLM logged:
max_num_seqs (1024) exceeds available Mamba cache blocks (687)
Fix: --max-num-seqs 512, which fits within the available blocks.
Bug 3: uvloop uninstalled from the wrong venv.
During LiteLLM debugging I ran pip uninstall uvloop — but in ~/vllm-env instead of ~/litellm-env. vLLM’s uvicorn server also depends on uvloop and started failing.
Fix: pip install uvloop in the right venv. Check which python before running pip when multiple venvs are involved.
Part 6: Tool Calling Required Three Separate Fixes
Getting tool calls working for agentic clients involved fixing distinct bugs at different layers.
Bug 1: Empty tool_calls: [] in every streaming chunk.
After enabling tool calling with --enable-auto-tool-choice --tool-call-parser hermes, every streaming chunk contained "tool_calls": [] even for plain text responses. Clients that check for the presence of the tool_calls key (even if empty) misfire.
Fix: a stream-filter proxy between clients and LiteLLM that strips the empty array from SSE chunks:
def _strip(data: dict) -> dict:
for choice in data.get("choices", []):
delta = choice.get("delta")
if isinstance(delta, dict):
tc = delta.get("tool_calls")
if isinstance(tc, list) and len(tc) == 0:
del delta["tool_calls"]
return data
The filter lives on port 4000; LiteLLM moved to port 4001 (internal only).
Bug 2: Wrong tool call parser.
Even with the empty-array fix, tool calls still didn’t work. Debug logging showed the model was responding — but in Qwen3’s native XML format, not Hermes JSON. With --tool-call-parser hermes, vLLM expected JSON inside <tool_call> tags, got XML, failed to parse it, and returned the XML blob as plain text:
{
"finish_reason": "stop",
"message": {
"content": "<tool_call>\n<function=write>\n<parameter=path>hello.txt</parameter>\n</function>\n</tool_call>"
}
}
Fix: vLLM ships a dedicated parser for this model:
--tool-call-parser qwen3_coder
Lesson: before assuming hermes, check what model-specific parsers vLLM ships. ls vllm/tool_parsers/ would have found qwen3coder_tool_parser.py in 10 seconds.
Bug 3: LiteLLM drops tool_choice on Anthropic→OpenAI conversion.
Claude Code uses the Anthropic Messages API. LiteLLM’s format conversion wasn’t correctly propagating tool_choice: auto to vLLM, so the model ignored tool definitions and responded in plain text.
Fix: the stream-filter now handles /v1/messages (Anthropic format) directly — it converts the request to OpenAI format inline and sends it straight to vLLM, bypassing LiteLLM entirely for this path. Tool calling works correctly when going directly to vLLM.
Part 7: The Model Changed
After running Qwen3-Coder-Next-FP8 for a few days, a new model family was released: Qwen3.6.
The comparison:
| Model | VRAM | SWE-bench Verified | Architecture |
|---|---|---|---|
| Qwen3-Coder-Next-FP8 | ~91 GB | >70% | 80B MoE, 3B active |
| Qwen3.6-27B (BF16) | ~52 GB | 77.2% | Dense, 27B |
| Qwen3.6-35B-A3B | ~65 GB | 73.4% | MoE, 3B active |
The 27B dense model outperforms the 80B MoE on the benchmark that matters, while using 40 GB less VRAM. The MoE architecture’s efficient per-token compute doesn’t compensate for its lower accuracy.
With 27B weights at ~52 GB and 96 GB available, BF16 fits comfortably — no need for FP8 quantization on the weights. The KV cache still runs FP8 (via --kv-cache-dtype fp8), halving KV cache memory. Result: 77% benchmark score with no checkpoint-level quantization error on the weights.
The interesting architecture note: Qwen3.6-27B has 64 layers but only 16 are full attention. The other 48 are Gated DeltaNet (linear attention / SSM-style) with a fixed recurrent state — they contribute no growing KV cache. This is why 262K context is feasible at this weight size.
Switch was straightforward: delete the 75 GB model, download the 52 GB replacement, update the systemd unit, restart.
Part 8: Adding Transparent Web Search
Claude Code’s web_search tool normally routes through Anthropic’s servers. Pointed at a local stack, it fails silently — the model can’t execute searches and the Anthropic search backend is unreachable.
The stream-filter proxy was extended to intercept web_search tool calls and execute them via DuckDuckGo (the ddgs Python package — no API key required). The interception is transparent: the client never sees the intermediate tool call round-trip. It receives a final text response that already incorporates the search results.
The loop inside the proxy:
- Send request to vLLM in OpenAI format
- Buffer the full streaming response
- If the model called
web_search: run the DDG search, append the result as a tool message, go to step 1 - If no tool call: convert the final response to the client’s expected format and stream it
One bug found here: Claude Code periodically injects environment context (current directory, git status, current time) as role: "system" messages inside the messages array, mid-conversation — not only in the top-level system field. Qwen3.6-27B’s chat template requires the system message to appear at position 0 only, so vLLM was returning HTTP 400 on real requests while minimal test requests worked fine.
Fix: the Anthropic→OpenAI converter now collects all system content from both the top-level field and any inline role: "system" messages, merges them into a single system message at position 0, and skips inline system messages during the rest of the conversion.
The “Did 0 searches” message in the Claude Code UI is cosmetic — the proxy intercepts the tool call before Claude Code can count it. The model receives real results and the answers reflect it.
Part 9: Making It Observable
Up to this point the only visibility into the stack was journalctl and occasional nvidia-smi calls. With the stack stable, it was time to add proper monitoring.
What I didn’t want to do: use the DCGM exporter (requires nvidia-container-toolkit, which means restarting Docker while vLLM is running) or fight with Linux Docker bridge networking (Prometheus in a container trying to reach host services via host.docker.internal, which doesn’t auto-resolve on Linux).
What I did instead:
nvidia_gpu_exporterv1.4.1 installed from a.debas a plain systemd service. No Docker, no nvidia-container-toolkit. It scrapesnvidia-smiand exposes metrics at:9835. The.debcreates the systemd unit automatically.- Prometheus and Grafana in Docker Compose, both using
network_mode: host. With host networking, Prometheus targets are justlocalhost:8000,localhost:9835,localhost:9100— no networking configuration needed. - node-exporter in the same Compose stack for host CPU, RAM, and disk metrics.
Three scrape targets:
scrape_configs:
- job_name: vllm
static_configs:
- targets: ['localhost:8000']
- job_name: gpu
static_configs:
- targets: ['localhost:9835']
- job_name: node
static_configs:
- targets: ['localhost:9100']
The Grafana dashboard auto-provisions on first boot (JSON file dropped into the provisioning directory). Three rows:
- At a glance: model state, VRAM used/free, GPU utilization, temperature, power draw, active requests, p95 time-to-first-token
- GPU over time: VRAM stacked area (used + free), GPU util + power + temp on a dual-axis panel
- Model performance: request queue depth (running + waiting), latency histograms (TTFT and E2E, p50/p95), output token throughput
The “queue depth growing” signal is the most useful one: it means the model is receiving requests faster than it can serve them.
Current State
Clients (Claude Code, opencode)
│
▼ :4000
stream-filter ← strips empty tool_calls, handles web_search,
│ converts Anthropic↔OpenAI formats
▼ :4001
LiteLLM (Python 3.12) ← API key management, routing alias
│
▼ :8000
vLLM 0.22.0 ← Qwen3.6-27B inference
│
▼
RTX PRO 6000 Blackwell (96 GB GDDR7)
Prometheus :9090 ← vLLM, GPU, node-exporter
Grafana :3000 ← Prometheus
| Metric | Current Value |
|---|---|
| Model | Qwen3.6-27B BF16 |
| VRAM used | ~84 GB of 96 GB |
| Benchmark | 77.2% SWE-bench Verified |
| Context window | 262,144 tokens |
| KV cache dtype | FP8 |
| Tool calling | Working — file, bash, web_search |
| Web search | DuckDuckGo, transparent |
| Monitoring | Prometheus + Grafana, 3 scrape targets |
The Complete Bug List
| # | Problem | Root Cause | Fix |
|---|---|---|---|
| 1 | nvidia-smi: No devices found |
Proprietary driver has no GB202 firmware | nvidia-driver-580-open + reboot |
| 2 | python3.14-venv missing |
Not installed by default on Ubuntu 26.04 | sudo apt install python3.14-venv |
| 3 | huggingface-cli not found |
Deprecated; replaced by hf |
Use hf download |
| 4 | HF token not propagating | Shell export not visible to subshell | hf auth login --token <token> |
| 5 | Download failed at shard 27/40 | Root partition was 100 GB | lvextend -l +100%FREE + resize2fs (online) |
| 6 | LiteLLM proxy broken on Python 3.14 | orjson no wheel; uvloop uses removed stdlib API |
Separate Python 3.12 venv via uv |
| 7 | --dtype fp8 rejected by vLLM |
Not a dtype; FP8 detected from config.json | --dtype auto |
| 8 | Mamba cache blocks exceeded | Hybrid model; Mamba cache sized separately | --max-num-seqs 512 |
| 9 | vLLM crashed after debugging session | uvloop uninstalled from wrong venv |
pip install uvloop in correct venv |
| 10 | Model ignores tool definitions | --enable-auto-tool-choice not set |
Add flag + --tool-call-parser hermes |
| 11 | tool_calls: [] in every SSE chunk |
vLLM injects empty array with auto tool choice | stream-filter strips empty arrays |
| 12 | Model outputs XML, not tool calls | hermes parser expects JSON; Qwen3 outputs XML |
Switch to --tool-call-parser qwen3_coder |
| 13 | tool_choice: auto not forwarded |
LiteLLM Anthropic→OpenAI conversion drops it | Bypass LiteLLM for /v1/messages path |
| 14 | HTTP 400 on real Claude Code requests | Inline role: "system" messages in conversation |
Merge all system content to position 0 |
| 15 | DCGM exporter needs nvidia-container-toolkit | Container-based GPU access requires Docker GPU runtime | Use nvidia_gpu_exporter binary instead |
| 16 | host.docker.internal unresolved on Linux |
Not auto-registered on Docker bridge networks | Use network_mode: host for all containers |
What I’d Do Differently
Start with nvidia-driver-xxx-open for any recent NVIDIA GPU. The open-source module has been the right choice since Ampere and is mandatory for Blackwell. The proprietary driver’s feature gap is real and growing.
Expand the root partition before downloading large models. A 100 GB root volume is a provisioning oversight that interrupts work at the worst moment.
Check for model-specific vLLM tool parsers before assuming hermes. ls vllm/tool_parsers/ shows what’s available. A 10-second check would have caught the hermes/qwen3_coder mismatch.
Use Python 3.12 for anything proxy-adjacent. vLLM and PyTorch are fine on 3.14; the proxy ecosystem (uvicorn, uvloop, LiteLLM) isn’t yet.
Use network_mode: host for Prometheus/Grafana on Linux. Bridge networking + host.docker.internal requires extra configuration that’s easy to get wrong. Host mode just works.