Anthropic Changed the Rules: Revising My OpenClaw Split Deployment for Zero Cloud Spend
Five days ago, Anthropic pulled the rug out from under every OpenClaw user running Claude through a Pro or Max subscription. Starting April 4, 2026, third-party tools like OpenClaw can no longer draw from your subscription quota. If you want Claude in your agent pipeline, you now pay per token through API keys or Anthropic’s new “extra usage” billing — no more flat-rate all-you-can-eat.
This directly impacts the split-deployment architecture I documented last week: a Minisforum UM890 Pro as the hardened always-on gateway, and a gaming PC (7800X3D + RTX 5080) as the GPU-accelerated inference server, connected over Tailscale. That guide assumed Claude Sonnet as the “thinking” model and the fallback when the gaming PC was off. Under the new billing, that assumption could easily cost $300+/month in API charges.
Time to revise the plan. The goal: spend nothing beyond electricity.
What Anthropic Actually Changed
The short version: Claude subscriptions (Pro at $20/month, Max at $200/month) now only cover Anthropic’s own products — claude.ai, Claude Code CLI, Claude Desktop, and Claude Cowork. Third-party tools like OpenClaw, Cursor, and others are cut off from subscription quota entirely.
If you still want to use Claude with OpenClaw, you have two options: a dedicated API key (pay-per-token at $3/$15 per million input/output tokens for Sonnet 4.6), or Anthropic’s new “extra usage” pay-as-you-go add-on. Both are metered. Both can get expensive fast under agentic workloads, because OpenClaw conversations accumulate context — the 10th turn resends all nine previous turns, so cumulative token consumption grows roughly quadratically with conversation length.
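To make the accumulation math concrete, here is a small sketch. The flat 500 tokens per turn is a hypothetical number for illustration; real turns vary, but the quadratic shape holds:

```python
def cumulative_input_tokens(turns: int, tokens_per_turn: int = 500) -> int:
    """Total input tokens sent over a conversation where every turn
    resends the full history: turn n sends n * tokens_per_turn."""
    return sum(n * tokens_per_turn for n in range(1, turns + 1))

# Doubling the conversation length roughly quadruples total input tokens.
for turns in (10, 20, 40):
    print(f"{turns:>2} turns -> {cumulative_input_tokens(turns):,} input tokens")
# 10 turns ->  27,500; 20 turns -> 105,000; 40 turns -> 410,000
```

That 4x-per-doubling curve is why a chatty agent session costs far more than the sum of its individual messages.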
Anthropic’s stated reasoning is that third-party harnesses bypass their prompt cache optimization layer, meaning a heavy OpenClaw session consumes far more infrastructure than an equivalent Claude Code session. Whether you view this as reasonable capacity management or strategic moat-building probably depends on how much your workflow just broke.
For context on scale: testing by the German tech outlet c’t 3003 found that a single day of OpenClaw usage on Claude Opus consumed over $100 in API-equivalent tokens. Even Sonnet-only usage can run $10-20/day once you factor in context accumulation, tool calls, and system prompt overhead. At those rates, a month of moderate usage would dwarf what I spend on electricity for both machines combined.
Claim Your Credit — You Have 8 Days
Before anything else: if you’re on a Claude Pro or Max subscription, Anthropic is offering a one-time credit equal to your monthly subscription price. You must redeem it by April 17, 2026, and it’s valid for 90 days. Go to your Anthropic account settings and claim it now. It’s free money and it expires permanently if you don’t act.
They’re also offering up to 30% off pre-purchased extra usage bundles, which could be worth it if you decide to keep a small Claude budget for emergencies.
The Original Architecture
Here’s the architecture from my previous post:
```
┌──────────────────────┐ Tailscale (WireGuard) ┌──────────────────────┐
│ UM890 Pro │◄─────────────────────────────────► │ Gaming PC │
│ (Gateway) │ 100.x.x.1 ◄──► 100.x.x.2 │ (Inference) │
│ │ │ │
│ OpenClaw Gateway │ ollama/qwen3.5:27b @ :11434 │ Ollama + CUDA │
│ Docker (rootless) │──────────────────────────────────► │ RTX 5080 (16GB) │
│ Browser automation │ │ Qwen 3.5 27B │
│ Messaging channels │ fallback: Claude Sonnet (cloud) │ │
│ Netdata monitoring │ │ Netdata monitoring │
└──────────────────────┘ └──────────────────────┘
```
The model configuration looked like this:
```json5
{
  agents: {
    defaults: {
      model: {
        primary: "ollama/qwen3.5:27b",
        thinking: "anthropic/claude-sonnet-4-6-20260514",
        fallbacks: ["anthropic/claude-sonnet-4-6-20260514"]
      }
    }
  }
}
```
The idea was sound: local model for routine work, Claude for complex reasoning and as a safety net when the gaming PC is off. But under the new billing, every “thinking” invocation and every fallback hit now costs real money. A few complex coding sessions per day could easily burn $5-10, and that adds up to $150-300/month — exactly the kind of bill I built this hardware setup to avoid.
The Revised Plan: Local-Only Inference
The updated architecture eliminates cloud dependency entirely:
```
┌──────────────────────┐ Tailscale (WireGuard) ┌──────────────────────┐
│ UM890 Pro │◄─────────────────────────────────► │ Gaming PC │
│ (Gateway) │ 100.x.x.1 ◄──► 100.x.x.2 │ (Inference) │
│ │ │ │
│ OpenClaw Gateway │ ollama/qwen3.5:27b @ :11434 │ Ollama + CUDA │
│ Docker (rootless) │──────────────────────────────────► │ RTX 5080 (16GB) │
│ Browser automation │ ollama/qwen3.5:7b @ :11434 │ Qwen 3.5 27B + 7B │
│ Messaging channels │ │ │
│ Netdata monitoring │ NO cloud fallback │ Netdata + nvidia-smi│
└──────────────────────┘ └──────────────────────┘
```
And the revised model configuration:
```json5
{
  agents: {
    defaults: {
      model: {
        primary: "ollama/qwen3.5:27b",
        // No thinking model — primary handles everything
        fallbacks: ["ollama/qwen3.5:7b"]
      }
    }
  },
  models: {
    providers: {
      ollama: {
        baseUrl: "http://100.x.x.2:11434",
        apiKey: "ollama-local",
        api: "ollama" // Native API, NOT /v1 — this is critical for tool calling
      }
    }
  }
}
```
Key changes:
- No Claude anywhere in the model chain. The 27B Qwen model handles all tasks — complex reasoning, coding, multi-step planning. It won’t match Claude Sonnet on the hardest problems, but it’s remarkably capable and the cost per token is exactly $0.
- Fallback is a smaller local model, not a cloud API. If the 27B model fails or is slow, the 7B variant picks up. Both run on the same GPU.
- Local embeddings for memory search. Setting `memorySearch.provider` to `"local"` or `"ollama"` keeps semantic memory queries on-device instead of hitting OpenAI or Voyage embedding APIs.
- QMD (Quick Memory Database) enabled. This is OpenClaw’s local semantic search feature that builds a vector database on-device. It indexes conversation history and documents, then retrieves only relevant context snippets rather than resending the full conversation history to the model. This directly attacks context accumulation, the biggest cost driver in agentic workloads. Even in a local-only setup, it reduces inference time and improves response quality.
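The retrieval idea behind QMD can be sketched independently of OpenClaw: embed snippets with a local model, then pull only the top-k most similar ones into the prompt. Ollama’s `/api/embeddings` endpoint is real; the embedding model name, the Tailscale address, and the snippet store shape are illustrative assumptions:

```python
import json
import math
import urllib.request

OLLAMA_URL = "http://100.x.x.2:11434"  # placeholder Tailscale address from the diagram

def embed(text: str, model: str = "nomic-embed-text") -> list[float]:
    """Fetch an embedding from the local Ollama instance (network call)."""
    req = urllib.request.Request(
        f"{OLLAMA_URL}/api/embeddings",
        data=json.dumps({"model": model, "prompt": text}).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["embedding"]

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k(query_vec: list[float], snippets: list[dict], k: int = 3) -> list[str]:
    """Return the k most relevant snippets instead of the whole history."""
    ranked = sorted(snippets, key=lambda s: cosine(query_vec, s["vec"]), reverse=True)
    return [s["text"] for s in ranked[:k]]
```

Sending three retrieved snippets instead of a forty-turn transcript is what turns quadratic context growth back into roughly constant per-request cost.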
Model Routing: Not Every Task Needs 27 Billion Parameters
OpenClaw supports complexity-based model routing, and this is where the two-model local setup pays off. The routing configuration scores tasks by complexity signals — token count, file count, keywords — and dispatches to the appropriate model:
```json5
{
  routing: {
    complexity_signals: {
      token_count_threshold: 2000,
      file_count_threshold: 3,
      primary_keywords: ["refactor", "architect", "optimize", "security", "debug"],
      economy_file_patterns: ["*.md", "*.json", "*.yaml", "*.toml", "*.txt"]
    }
  }
}
```
Heavy reasoning — architecture decisions, complex debugging, multi-file refactoring — goes to the 27B model. Simple tasks — calendar checks, message forwarding, config file edits, documentation lookups — go to the 7B model, which responds faster and frees VRAM for the next heavy request. Both models cost nothing to run beyond electricity.
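The config above maps naturally onto a scoring function. This is a hypothetical re-implementation of the idea, not OpenClaw’s actual router, but it mirrors the signal names and thresholds from the config:

```python
import fnmatch

# Signal values mirror the routing config above
PRIMARY_KEYWORDS = {"refactor", "architect", "optimize", "security", "debug"}
ECONOMY_FILE_PATTERNS = ["*.md", "*.json", "*.yaml", "*.toml", "*.txt"]
TOKEN_THRESHOLD = 2000
FILE_THRESHOLD = 3

def route(prompt: str, files: list[str], token_count: int) -> str:
    """Return the local model that should handle this request."""
    if token_count > TOKEN_THRESHOLD or len(files) > FILE_THRESHOLD:
        return "ollama/qwen3.5:27b"  # large context or many files: heavy model
    doc_only = bool(files) and all(
        any(fnmatch.fnmatch(f, pat) for pat in ECONOMY_FILE_PATTERNS) for f in files
    )
    if doc_only:
        return "ollama/qwen3.5:7b"   # docs/config-only edits stay on the economy model
    if any(kw in prompt.lower() for kw in PRIMARY_KEYWORDS):
        return "ollama/qwen3.5:27b"  # reasoning-heavy keywords
    return "ollama/qwen3.5:7b"
```

So `route("please refactor the auth module", ["auth.py"], 1500)` lands on the 27B model, while a changelog edit stays on the 7B.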
What About When the Gaming PC Is Off?
This is the real trade-off. The original plan had Claude Sonnet as the graceful degradation path: gaming PC off → agent keeps working through the cloud. Without that fallback, a powered-down gaming PC means the agent stops responding.
My solution: wake-on-demand via Tailscale + Wake-on-LAN. The UM890 gateway detects that the Ollama endpoint is unreachable, sends a WoL magic packet to the gaming PC’s LAN MAC address, and queues incoming requests for 30-60 seconds while the machine boots and loads the model into VRAM. It’s not instant, but it’s functional, and it costs nothing.
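A minimal sketch of the wake-on-demand logic the gateway runs. The MAC address and Tailscale address are placeholders, and the real gateway also queues requests while waiting for the boot:

```python
import socket
import urllib.request

def wol_packet(mac: str) -> bytes:
    """Build a Wake-on-LAN magic packet: 6 x 0xFF followed by the
    target MAC address repeated 16 times (102 bytes total)."""
    mac_bytes = bytes.fromhex(mac.replace(":", "").replace("-", ""))
    if len(mac_bytes) != 6:
        raise ValueError("MAC must be 6 bytes")
    return b"\xff" * 6 + mac_bytes * 16

def wake(mac: str, broadcast: str = "255.255.255.255", port: int = 9) -> None:
    """Broadcast the magic packet on the LAN (UDP port 9 is conventional)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
        s.sendto(wol_packet(mac), (broadcast, port))

def ollama_up(url: str = "http://100.x.x.2:11434", timeout: float = 2.0) -> bool:
    """True if the Ollama endpoint answers; used to decide whether to wake."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        return False
```

The gateway polls `ollama_up()` before dispatching; on failure it calls `wake(...)` and retries for the 30-60 second boot window.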
For overnight hours when I genuinely don’t need the agent, the gaming PC suspends to save power. The gateway returns a polite “agent is resting, will respond when available” to any messaging channels. Not every AI agent needs to be available 24/7.
The One Exception: Emergency Claude Budget
I’m keeping a Claude API key configured but with a hard $2/day spending cap in the Anthropic console. This exists purely for the scenario where I genuinely need frontier-level reasoning on something time-sensitive and the local model isn’t cutting it. At $2/day, the worst-case monthly overshoot is $60 — but I expect actual usage to be near zero most months.
The configuration keeps Claude available but never automatic:
```json5
{
  // Claude is NOT in the fallback chain — it won't trigger automatically
  // Only invocable via explicit /model switch command in a session
  models: {
    providers: {
      anthropic: {
        apiKey: "sk-ant-xxxxx", // Dedicated key with $2/day cap
        api: "anthropic"
      }
    }
  }
}
```
This way, Claude never fires without my conscious decision to invoke it. No surprise bills.
A Note on the Billing Proxy Approach
I’ve seen the openclaw-billing-proxy project making the rounds — it’s a 7-layer proxy that rewrites OpenClaw requests to look like Claude Code requests, injecting billing headers and renaming tool schemas to bypass Anthropic’s detection. It works today. It will almost certainly break tomorrow.
I’d strongly recommend against this path. It explicitly circumvents Anthropic’s billing enforcement, requires constant maintenance as detection layers evolve (it’s already on v2.0 with 30+ pattern bypasses), and risks account termination. Anthropic has shown they’re willing to enforce aggressively here, and building your workflow on a cat-and-mouse evasion game is not a stable foundation.
The right answer is either to pay for what you use, or to not use it. I’m choosing the latter for routine work.
Implementation Status
I have Ubuntu Server 24.04 LTS installed on the UM890 Pro. The gaming PC still needs its Ubuntu install. Here’s the phased execution plan:
Phase 1 — UM890 Pro Base Hardening (today)
- Post-install security baseline: UFW, SSH key-only auth, fail2ban
- Unattended security updates
- Dedicated `openclaw` user with no sudo
- Tailscale installation and authentication
Phase 2 — Gaming PC Setup
- Ubuntu Server 24.04 LTS install
- Nvidia driver installation and verification (`nvidia-smi`)
- Ollama install, pull both Qwen 3.5 27B and 7B models
- Bind Ollama to Tailscale interface only
- UFW restricting port 11434 to `tailscale0`
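A quick way to verify the Phase 2 firewall split: Ollama should answer on the Tailscale address but refuse on the LAN address. A small sketch (both addresses below are placeholders):

```python
import socket

def port_open(host: str, port: int = 11434, timeout: float = 2.0) -> bool:
    """True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Expected after hardening (placeholder addresses):
#   port_open("100.x.x.2")    -> True   (Tailscale interface)
#   port_open("192.168.1.50") -> False  (LAN address blocked by UFW)
```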
Phase 3 — Gateway Configuration
- Rootless Docker for the `openclaw` user
- Node.js 24 via nvm
- OpenClaw install and onboarding with Ollama
- Revised model configuration (local-only with routing)
- QMD and local embeddings enabled
- Gateway bound to loopback, agent sandboxing enabled
Phase 4 — Monitoring & Power Management
- Netdata on both machines (with GPU monitoring on the gaming PC)
- Wake-on-LAN configuration for demand-based power management
- Suspend/wake scripts integrated with the gateway
Phase 5 — Validation
- End-to-end inference testing through messaging channels
- WoL wake-from-suspend testing
- One-week cost audit via `/usage full`
- `openclaw doctor` and `openclaw security audit --deep`
Revised Cost Analysis
| Component | Monthly Estimate |
|---|---|
| UM890 Pro electricity (24/7, ~15W idle) | ~$3 |
| Gaming PC electricity (12hr/day, ~120W avg) | ~$16 |
| Anthropic API (emergency manual-only) | $0–2 |
| Total | ~$19–21/month |
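The electricity rows fall out of watts and hours. This sketch assumes roughly $0.35/kWh, which is my local rate and the only number here not taken from the table; adjust for yours:

```python
def monthly_cost(watts: float, hours_per_day: float,
                 price_per_kwh: float = 0.35, days: int = 30) -> float:
    """Monthly electricity cost for a device at average draw `watts`."""
    kwh = watts / 1000 * hours_per_day * days
    return kwh * price_per_kwh

print(f"UM890 Pro: ${monthly_cost(15, 24):.2f}")   # ~15 W, always on  -> $3.78
print(f"Gaming PC: ${monthly_cost(120, 12):.2f}")  # ~120 W, 12 h/day -> $15.12
```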
Compare this to:
- Original plan (pre-April 4): ~$24–34/month with regular Claude fallback on subscription
- Post-April 4 with old config: Potentially $150–300+/month on metered API billing
- Pure cloud OpenClaw: $300–600/month at moderate usage levels
The hardware pays for itself even faster now.
The Bigger Picture
Anthropic’s move isn’t surprising in hindsight. Flat-rate subscriptions and autonomous agents with unbounded token consumption were never going to coexist sustainably. The real question is whether this pushes the ecosystem toward better local models faster — and based on how capable Qwen 3.5 27B already is on a single RTX 5080, I think the answer is yes.
The split-deployment architecture I designed actually becomes more justified under the new billing. The whole point was to minimize cloud dependency and keep the critical path on hardware I own. Anthropic just made that decision even more economical by making the alternative dramatically more expensive.
I’ll follow up with detailed implementation posts as I work through each phase. If you’re in the same boat — an OpenClaw user suddenly staring at metered billing — the TL;DR is: invest in local inference hardware, configure aggressive model routing, and treat cloud APIs as an emergency escape hatch rather than a default.
The models are good enough. The hardware is affordable enough. And now, the incentive alignment is clear enough.