Building a Local LLM Server on Blackwell: Everything That Went Wrong

One machine, on the local network, running a state-of-the-art LLM without sending anything to an external API. Reachable from anywhere on the LAN for AI coding agents and general use.

The hardware arrived as an NVIDIA RTX PRO 6000 Blackwell Workstation Edition — 96 GB GDDR7 VRAM. Getting to a working stack took a full day of debugging and produced more interesting failure modes than expected. Then the model changed. Then web search was added. Then monitoring. This is the full story.

Part 1: The GPU That Wouldn’t Wake Up

When the Pro 6000 was installed and the system booted, lspci found it:

01:00.0 VGA compatible controller: NVIDIA Corporation GB202GL [RTX PRO 6000 Blackwell Workstation Edition]

But nvidia-smi returned: No devices were found.

The driver was loaded. /dev/nvidia0 existed. /proc/driver/nvidia/gpus/ listed the card. But GPU Firmware: N/A. The driver knew the GPU was there but couldn’t initialize it.

The first diagnosis was wrong — I tried switching driver variants (server, consumer, open) and versions (595, 580). None helped. The real problem showed up when checking what firmware each driver package actually ships:

dpkg -L nvidia-firmware-595-server-595.71.05
# → gsp_ga10x.bin  (Ampere)
# → gsp_tu10x.bin  (Turing)

That’s it. No Blackwell. Every proprietary nvidia driver package — 595, 580, all of them — only ships firmware for Turing and Ampere.

Meanwhile, linux-firmware-nvidia-graphics ships GB202 firmware at /usr/lib/firmware/nvidia/gb202/gsp/gsp-570.144.bin.zst. But the proprietary kernel module uses flat paths (nvidia/<version>/gsp_<arch>.bin) and never looks in per-chip directories. That path layout is only for the open-source kernel module.

The fix:

sudo apt-get install nvidia-driver-580-open
sudo reboot

After reboot, nvidia-smi showed the full 95.6 GB.

The rule for Blackwell on Linux: the proprietary nvidia driver has no GB202 firmware. You must use the open-source kernel module (nvidia-driver-xxx-open). It’s not a preference — the proprietary path simply cannot initialize the hardware. The open module has been co-developed with the kernel community since Turing and is production-quality. For Blackwell, it’s the only option.

Part 2: llama.cpp → vLLM

Before the Pro 6000, the inference stack was simpler: llama-server from llama.cpp, a Q4_K_M quantized 32B model, LiteLLM in front for API format bridging. This worked fine for a single user.

With 96 GB and multiple concurrent clients (coding agents fire parallel requests), two things changed:

PagedAttention handles concurrent requests with different context lengths efficiently. With one user it barely matters; with 3+ parallel requests, it avoids KV cache OOM and fragmentation that llama.cpp doesn’t manage.
Continuous batching processes concurrent requests together rather than serially. llama.cpp runs one request at maximum speed; vLLM batches them, using the GPU more efficiently under load.
Native FP8. Blackwell has dedicated FP8 tensor cores (the GB202 die). vLLM uses them natively. GGUF quantization doesn’t target hardware FP8.

Decision: migrate to vLLM.

Part 3: Python 3.14 Is a Minefield

Ubuntu 26.04 ships Python 3.14.4 as the system Python. This caused two separate failures.

Problem 1: python3.14-venv not installed by default.

sudo apt install python3.14-venv

Easy.

Problem 2: LiteLLM proxy completely broken on Python 3.14.

pip install "litellm[proxy]" failed in two ways:

orjson has no pre-built wheel for 3.14. Source build requires Rust.
uvloop references asyncio.events.BaseDefaultEventLoopPolicy, which was removed from the standard library in Python 3.14. LiteLLM proxy couldn’t even start.

Fix: a separate virtualenv using Python 3.12 via uv:

~/.local/bin/uv python install 3.12
~/.local/bin/uv venv ~/litellm-env --python 3.12
~/.local/bin/uv pip install "litellm[proxy]" --python ~/litellm-env/bin/python3

The pattern that works: vLLM in ~/vllm-env (Python 3.14 — vLLM is fine with it), LiteLLM in ~/litellm-env (Python 3.12). Two venvs, two systemd services, no shared state between them.

Lesson: For anything mixing C extensions and async frameworks, stay on Python 3.12 until the ecosystem catches up. Python 3.14 broke things that were load-bearing.

Part 4: The 100 GB Root Partition

Downloading Qwen3-Coder-Next-FP8 (the first model choice — an 80B MoE with only 3B parameters active per token):

hf download Qwen/Qwen3-Coder-Next-FP8 --local-dir ~/models/qwen3-coder-next-fp8

Side note: huggingface-cli is deprecated on Ubuntu 26.04. The command is hf. And this is a gated model, so:

hf auth login --token hf_xxx...

The download got through 27 of 40 shards (~57 GB) then died:

OSError: I/O error: No space left on device (os error 28)

Root cause: the LVM logical volume for the root filesystem was provisioned at 100 GB on a 2 TB drive. 57 GB of model plus OS and packages filled it.

Fix — online LVM expansion, no reboot:

sudo lvextend -l +100%FREE /dev/ubuntu-vg/ubuntu-lv
sudo resize2fs /dev/mapper/ubuntu--vg-ubuntu--lv

Root partition went from ~3 GB free to 1.7 TB free while mounted. Cleaned up the 12 incomplete shards, resumed, and all 40 downloaded cleanly. Final size on disk: 75 GB.

Expand LVM before starting large downloads. The 100 GB ceiling was a provisioning oversight that cost an hour.

Part 5: Three vLLM Startup Failures

Creating the systemd service exposed three separate bugs.

Bug 1: --dtype fp8 is not a valid flag.

--dtype fp8

vLLM rejected it. Valid options: auto, bfloat16, float, float16, float32, half. FP8 is a quantization scheme that vLLM auto-detects from the model’s config.json. The correct flag: --dtype auto.

Bug 2: Mamba cache blocks exceeded.

Qwen3-Coder-Next is a hybrid Mamba+Attention architecture — not a pure transformer. vLLM sizes a separate Mamba state cache alongside the standard KV cache. At --gpu-memory-utilization 0.95 and --max-num-seqs 1024, vLLM logged:

max_num_seqs (1024) exceeds available Mamba cache blocks (687)

Fix: --max-num-seqs 512, which fits within the available blocks.

Bug 3: uvloop uninstalled from the wrong venv.

During LiteLLM debugging I ran pip uninstall uvloop — but in ~/vllm-env instead of ~/litellm-env. vLLM’s uvicorn server also depends on uvloop and started failing.

Fix: pip install uvloop in the right venv. Check which python before running pip when multiple venvs are involved.

Part 6: Tool Calling Required Three Separate Fixes

Getting tool calls working for agentic clients involved fixing distinct bugs at different layers.

Bug 1: Empty tool_calls: [] in every streaming chunk.

After enabling tool calling with --enable-auto-tool-choice --tool-call-parser hermes, every streaming chunk contained "tool_calls": [] even for plain text responses. Clients that check for the presence of the tool_calls key (even if empty) misfire.

Fix: a stream-filter proxy between clients and LiteLLM that strips the empty array from SSE chunks:

def _strip(data: dict) -> dict:
    for choice in data.get("choices", []):
        delta = choice.get("delta")
        if isinstance(delta, dict):
            tc = delta.get("tool_calls")
            if isinstance(tc, list) and len(tc) == 0:
                del delta["tool_calls"]
    return data

The filter lives on port 4000; LiteLLM moved to port 4001 (internal only).

Bug 2: Wrong tool call parser.

Even with the empty-array fix, tool calls still didn’t work. Debug logging showed the model was responding — but in Qwen3’s native XML format, not Hermes JSON. With --tool-call-parser hermes, vLLM expected JSON inside <tool_call> tags, got XML, failed to parse it, and returned the XML blob as plain text:

{
  "finish_reason": "stop",
  "message": {
    "content": "<tool_call>\n<function=write>\n<parameter=path>hello.txt</parameter>\n</function>\n</tool_call>"
  }
}

Fix: vLLM ships a dedicated parser for this model:

--tool-call-parser qwen3_coder

Lesson: before assuming hermes, check what model-specific parsers vLLM ships. ls vllm/tool_parsers/ would have found qwen3coder_tool_parser.py in 10 seconds.

Bug 3: LiteLLM drops tool_choice on Anthropic→OpenAI conversion.

Claude Code uses the Anthropic Messages API. LiteLLM’s format conversion wasn’t correctly propagating tool_choice: auto to vLLM, so the model ignored tool definitions and responded in plain text.

Fix: the stream-filter now handles /v1/messages (Anthropic format) directly — it converts the request to OpenAI format inline and sends it straight to vLLM, bypassing LiteLLM entirely for this path. Tool calling works correctly when going directly to vLLM.

Part 7: The Model Changed

After running Qwen3-Coder-Next-FP8 for a few days, a new model family was released: Qwen3.6.

The comparison:

Model	VRAM	SWE-bench Verified	Architecture
Qwen3-Coder-Next-FP8	~91 GB	>70%	80B MoE, 3B active
Qwen3.6-27B (BF16)	~52 GB	77.2%	Dense, 27B
Qwen3.6-35B-A3B	~65 GB	73.4%	MoE, 3B active

The 27B dense model outperforms the 80B MoE on the benchmark that matters, while using 40 GB less VRAM. The MoE architecture’s efficient per-token compute doesn’t compensate for its lower accuracy.

With 27B weights at ~52 GB and 96 GB available, BF16 fits comfortably — no need for FP8 quantization on the weights. The KV cache still runs FP8 (via --kv-cache-dtype fp8), halving KV cache memory. Result: 77% benchmark score with no checkpoint-level quantization error on the weights.

The interesting architecture note: Qwen3.6-27B has 64 layers but only 16 are full attention. The other 48 are Gated DeltaNet (linear attention / SSM-style) with a fixed recurrent state — they contribute no growing KV cache. This is why 262K context is feasible at this weight size.

Switch was straightforward: delete the 75 GB model, download the 52 GB replacement, update the systemd unit, restart.

Part 8: Adding Transparent Web Search

Claude Code’s web_search tool normally routes through Anthropic’s servers. Pointed at a local stack, it fails silently — the model can’t execute searches and the Anthropic search backend is unreachable.

The stream-filter proxy was extended to intercept web_search tool calls and execute them via DuckDuckGo (the ddgs Python package — no API key required). The interception is transparent: the client never sees the intermediate tool call round-trip. It receives a final text response that already incorporates the search results.

The loop inside the proxy:

Send request to vLLM in OpenAI format
Buffer the full streaming response
If the model called web_search: run the DDG search, append the result as a tool message, go to step 1
If no tool call: convert the final response to the client’s expected format and stream it

One bug found here: Claude Code periodically injects environment context (current directory, git status, current time) as role: "system" messages inside the messages array, mid-conversation — not only in the top-level system field. Qwen3.6-27B’s chat template requires the system message to appear at position 0 only, so vLLM was returning HTTP 400 on real requests while minimal test requests worked fine.

Fix: the Anthropic→OpenAI converter now collects all system content from both the top-level field and any inline role: "system" messages, merges them into a single system message at position 0, and skips inline system messages during the rest of the conversion.

The “Did 0 searches” message in the Claude Code UI is cosmetic — the proxy intercepts the tool call before Claude Code can count it. The model receives real results and the answers reflect it.

Part 9: Making It Observable

Up to this point the only visibility into the stack was journalctl and occasional nvidia-smi calls. With the stack stable, it was time to add proper monitoring.

What I didn’t want to do: use the DCGM exporter (requires nvidia-container-toolkit, which means restarting Docker while vLLM is running) or fight with Linux Docker bridge networking (Prometheus in a container trying to reach host services via host.docker.internal, which doesn’t auto-resolve on Linux).

What I did instead:

nvidia_gpu_exporter v1.4.1 installed from a .deb as a plain systemd service. No Docker, no nvidia-container-toolkit. It scrapes nvidia-smi and exposes metrics at :9835. The .deb creates the systemd unit automatically.
Prometheus and Grafana in Docker Compose, both using network_mode: host. With host networking, Prometheus targets are just localhost:8000, localhost:9835, localhost:9100 — no networking configuration needed.
node-exporter in the same Compose stack for host CPU, RAM, and disk metrics.

Three scrape targets:

scrape_configs:
  - job_name: vllm
    static_configs:
      - targets: ['localhost:8000']
  - job_name: gpu
    static_configs:
      - targets: ['localhost:9835']
  - job_name: node
    static_configs:
      - targets: ['localhost:9100']

The Grafana dashboard auto-provisions on first boot (JSON file dropped into the provisioning directory). Three rows:

At a glance: model state, VRAM used/free, GPU utilization, temperature, power draw, active requests, p95 time-to-first-token
GPU over time: VRAM stacked area (used + free), GPU util + power + temp on a dual-axis panel
Model performance: request queue depth (running + waiting), latency histograms (TTFT and E2E, p50/p95), output token throughput

The “queue depth growing” signal is the most useful one: it means the model is receiving requests faster than it can serve them.

Current State

Clients (Claude Code, opencode)
      │
      ▼  :4000
stream-filter            ← strips empty tool_calls, handles web_search,
      │                     converts Anthropic↔OpenAI formats
      ▼  :4001
LiteLLM (Python 3.12)   ← API key management, routing alias
      │
      ▼  :8000
vLLM 0.22.0             ← Qwen3.6-27B inference
      │
      ▼
RTX PRO 6000 Blackwell (96 GB GDDR7)

Prometheus :9090  ←  vLLM, GPU, node-exporter
Grafana    :3000  ←  Prometheus

Metric	Current Value
Model	Qwen3.6-27B BF16
VRAM used	~84 GB of 96 GB
Benchmark	77.2% SWE-bench Verified
Context window	262,144 tokens
KV cache dtype	FP8
Tool calling	Working — file, bash, web_search
Web search	DuckDuckGo, transparent
Monitoring	Prometheus + Grafana, 3 scrape targets

The Complete Bug List

#	Problem	Root Cause	Fix
1	`nvidia-smi`: No devices found	Proprietary driver has no GB202 firmware	`nvidia-driver-580-open` + reboot
2	`python3.14-venv` missing	Not installed by default on Ubuntu 26.04	`sudo apt install python3.14-venv`
3	`huggingface-cli` not found	Deprecated; replaced by `hf`	Use `hf download`
4	HF token not propagating	Shell export not visible to subshell	`hf auth login --token <token>`
5	Download failed at shard 27/40	Root partition was 100 GB	`lvextend -l +100%FREE` + `resize2fs` (online)
6	LiteLLM proxy broken on Python 3.14	`orjson` no wheel; `uvloop` uses removed stdlib API	Separate Python 3.12 venv via `uv`
7	`--dtype fp8` rejected by vLLM	Not a dtype; FP8 detected from config.json	`--dtype auto`
8	Mamba cache blocks exceeded	Hybrid model; Mamba cache sized separately	`--max-num-seqs 512`
9	vLLM crashed after debugging session	`uvloop` uninstalled from wrong venv	`pip install uvloop` in correct venv
10	Model ignores tool definitions	`--enable-auto-tool-choice` not set	Add flag + `--tool-call-parser hermes`
11	`tool_calls: []` in every SSE chunk	vLLM injects empty array with auto tool choice	stream-filter strips empty arrays
12	Model outputs XML, not tool calls	`hermes` parser expects JSON; Qwen3 outputs XML	Switch to `--tool-call-parser qwen3_coder`
13	`tool_choice: auto` not forwarded	LiteLLM Anthropic→OpenAI conversion drops it	Bypass LiteLLM for `/v1/messages` path
14	HTTP 400 on real Claude Code requests	Inline `role: "system"` messages in conversation	Merge all system content to position 0
15	DCGM exporter needs nvidia-container-toolkit	Container-based GPU access requires Docker GPU runtime	Use `nvidia_gpu_exporter` binary instead
16	`host.docker.internal` unresolved on Linux	Not auto-registered on Docker bridge networks	Use `network_mode: host` for all containers

What I’d Do Differently

Start with nvidia-driver-xxx-open for any recent NVIDIA GPU. The open-source module has been the right choice since Ampere and is mandatory for Blackwell. The proprietary driver’s feature gap is real and growing.

Expand the root partition before downloading large models. A 100 GB root volume is a provisioning oversight that interrupts work at the worst moment.

Check for model-specific vLLM tool parsers before assuming hermes. ls vllm/tool_parsers/ shows what’s available. A 10-second check would have caught the hermes/qwen3_coder mismatch.

Use Python 3.12 for anything proxy-adjacent. vLLM and PyTorch are fine on 3.14; the proxy ecosystem (uvicorn, uvloop, LiteLLM) isn’t yet.

Use network_mode: host for Prometheus/Grafana on Linux. Bridge networking + host.docker.internal requires extra configuration that’s easy to get wrong. Host mode just works.