
Build Local AI Agents on Windows: Microsoft and NVIDIA Stack

Microsoft and NVIDIA unveiled secure agent sandboxing, RTX Spark hardware, and 2× faster inference at COMPUTEX 2026 — here's what the new local AI stack looks like for developers.

Mayur Domadiya
Jun 03, 2026 · 5 min read

Over 100 million NVIDIA RTX PCs now have the infrastructure to run autonomous AI agents locally without a round-trip to the cloud. At COMPUTEX 2026, NVIDIA and Microsoft announced a coordinated set of tools to make that practical: secure container sandboxing via Microsoft eXecution Containers, the RTX Spark line of AI-optimized hardware, and inference engines that run up to 2× faster on consumer GPUs through multi-token prediction and tensor parallelism. These aren't incremental updates. They change the cost equation for shipping agentic AI features that run entirely on users' own hardware.

A Security Layer for Local Agents

The biggest friction in building local AI agents has always been security. An agent that reads files, executes code, and takes actions across apps is useful — and dangerous if not contained. Microsoft eXecution Containers (MXC), announced at Build 2026, solve this at the OS level. MXC defines isolation policies that agents cannot override. The agent lives in a container with explicit permissions for file access, network calls, and system interactions.

NVIDIA OpenShell integrates with MXC to provide a ready-to-use runtime for agent developers. Instead of building sandboxing from scratch, you drop in OpenShell and get policy management, inference routing, and PII obfuscation out of the box. OpenClaw and Hermes Agent are already integrating MXC through OpenShell. That means the Windows desktop is about to get a wave of agents that are both capable and contained.

The practical takeaway for AI engineers: the prompt-injection problem that keeps local agents in demo mode now has a production answer. MXC ensures an agent cannot access the full system even if a malicious instruction slips through. That changes the risk equation for deploying agents on customer laptops.
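To make the containment idea concrete, here is a minimal sketch of a deny-by-default permission check of the kind described above. It is not the MXC or OpenShell API: `AgentPolicy`, `can_read`, `can_connect`, and `guarded_read` are hypothetical names invented for illustration, and a real OS-level sandbox enforces these rules outside the agent's process rather than in Python.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath
from typing import Callable, Tuple


@dataclass(frozen=True)
class AgentPolicy:
    """Hypothetical deny-by-default policy; not the MXC schema."""
    readable_dirs: Tuple[str, ...] = ()
    allowed_hosts: Tuple[str, ...] = ()

    def can_read(self, path: str) -> bool:
        parts = PurePosixPath(path).parts
        # Refuse any path that tries to climb out of an allowed directory.
        if ".." in parts:
            return False
        for root in self.readable_dirs:
            root_parts = PurePosixPath(root).parts
            # Compare whole path components so "/a/project" does not match "/a/projectX".
            if parts[:len(root_parts)] == root_parts:
                return True
        return False

    def can_connect(self, host: str) -> bool:
        # Only hosts listed explicitly are reachable; everything else is denied.
        return host in self.allowed_hosts


def guarded_read(policy: AgentPolicy, path: str, reader: Callable[[str], str]) -> str:
    """Run `reader` only if the policy allows it, whatever the agent was told to do."""
    if not policy.can_read(path):
        raise PermissionError(f"policy denies read: {path}")
    return reader(path)


policy = AgentPolicy(readable_dirs=("/workspace/project",), allowed_hosts=("localhost",))
print(policy.can_read("/workspace/project/notes.txt"))
print(policy.can_read("/workspace/project/../secrets.txt"))
print(policy.can_connect("evil.example"))
```

The point of the design is that the check sits between the agent and the resource, so an injected instruction can change what the agent asks for but not what the policy grants.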

RTX Spark — Hardware Built for Local Agents

Running a 27B-parameter model locally requires GPU memory and compute that most laptops do not have. NVIDIA's RTX Spark product family fills that gap. These small-form-factor desktops and laptops deliver 1 petaflop of AI compute with up to 128 GB of unified memory — enough to run Qwen 3.6 27B or multiple smaller models simultaneously.
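A rough back-of-the-envelope calculation shows why memory is the limiting factor. The sketch below estimates the memory needed for model weights alone at common precisions. The bytes-per-parameter figures are the standard ones for each format, while the 80% headroom factor (leaving room for KV cache, activations, and the OS) is an assumption chosen for illustration, not a figure from NVIDIA.

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"bf16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(params_billion: float, precision: str) -> float:
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billion * BYTES_PER_PARAM[precision]


def fits(params_billion: float, precision: str, memory_gb: float,
         headroom: float = 0.8) -> bool:
    """True if the weights fit within an assumed usable fraction of memory."""
    return weight_memory_gb(params_billion, precision) <= memory_gb * headroom


for precision in ("bf16", "int8", "int4"):
    print(precision, weight_memory_gb(27, precision), "GB")
# bf16 54.0 GB
# int8 27.0 GB
# int4 13.5 GB

print(fits(27, "bf16", 128))  # True: 54 GB of weights within 128 GB unified memory
print(fits(27, "bf16", 16))   # False: far beyond a 16 GB laptop GPU
```

Under these assumptions a 27B model at BF16 leaves well over half of a 128 GB machine free, which is the room that makes running several smaller models side by side plausible.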

Microsoft is shipping a Surface RTX Spark Dev Box, preloaded with Windows configured for AI development and MXC support out of the box. For teams building agents that need to run 24 hours a day, seven days a week on local hardware, this removes the hardware objection from deployment conversations. The hardware exists, the OS is ready, and the sandboxing layer is built in.

The hardware matters more than most developers realize. With cloud-tier compute available at a desktop price point, you can architect agents that assume persistent local compute. That assumption changes how you design memory, batching, and real-time inference paths.

2× Faster Inference on Consumer GPUs

Agents running continuously need efficient inference. NVIDIA collaborated with llama.cpp and vLLM to deliver significant throughput gains on GeForce RTX and RTX PRO GPUs. Two techniques drive most of the improvement.

Multi-Token Prediction

Multi-Token Prediction (MTP) is a form of speculative decoding: lightweight prediction heads that ship with the model draft several tokens ahead, and the main model then verifies the whole draft in a single forward pass. For models that already support it, such as Qwen 3.5 and 3.6, this delivers up to 2× throughput on an RTX 5090 with no quality loss and no additional training required.

Key insight. MTP is the most practical speculative decoding technique for developers because it requires no additional training for models that already support it. You get the speed gain without the pipeline complexity.
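The draft-then-verify loop behind this kind of speculative decoding can be sketched in a few lines. This is a toy illustration with greedy decoding, not the llama.cpp or vLLM implementation: `target_model` and `draft_model` are made-up deterministic functions standing in for the full model and the drafting component, and the verification loop simulates what a real engine does in one batched forward pass.

```python
from typing import Callable, List, Tuple

Model = Callable[[List[int]], int]  # maps a token context to the next token


def target_model(context: List[int]) -> int:
    # Toy stand-in for the full model: a deterministic function of the context.
    return (sum(context) * 31 + len(context)) % 50


def draft_model(context: List[int]) -> int:
    # Toy drafter: agrees with the target except on every fourth position.
    guess = target_model(context)
    return guess if len(context) % 4 != 3 else (guess + 1) % 50


def greedy_decode(prompt: List[int], n_new: int, target: Model) -> List[int]:
    tokens = list(prompt)
    for _ in range(n_new):
        tokens.append(target(tokens))
    return tokens


def speculative_decode(prompt: List[int], n_new: int, k: int,
                       target: Model, draft: Model) -> Tuple[List[int], int]:
    tokens = list(prompt)
    verify_passes = 0
    while len(tokens) - len(prompt) < n_new:
        # 1. Draft k tokens cheaply.
        proposal: List[int] = []
        for _ in range(k):
            proposal.append(draft(tokens + proposal))
        # 2. Verify the draft. A real engine scores all k positions in one
        #    batched forward pass; the inner loop here only simulates that.
        verify_passes += 1
        for i in range(k):
            expected = target(tokens + proposal[:i])
            if proposal[i] != expected:
                # Keep the matching prefix, then take the target's own token.
                tokens.extend(proposal[:i])
                tokens.append(expected)
                break
        else:
            # Whole draft accepted: the same pass also yields one bonus token.
            tokens.extend(proposal)
            tokens.append(target(tokens))
    return tokens[:len(prompt) + n_new], verify_passes


out, passes = speculative_decode([1, 2, 3], 20, 4, target_model, draft_model)
print(out == greedy_decode([1, 2, 3], 20, target_model))  # True
print(passes)
```

Because every accepted token is checked against the target model's own choice, the output matches plain greedy decoding exactly. The speedup comes from needing fewer target passes than generated tokens.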

Programmatic Dependent Launch

Programmatic Dependent Launch (PDL) lets a dependent CUDA kernel in the same stream begin launching before the kernel it depends on has finished, so launch latency and setup work overlap with the tail of the previous kernel instead of running strictly back to back. On Qwen 3.6 35B mixture-of-experts models, MTP and PDL together yield 1.6× faster generation on an RTX 5090. vLLM shows even bigger gains, at 2.6× on DGX Spark, through better BF16 kernel selection for MoE models and reduced CUDA Graph overhead.

These gains are available today through LM Studio, llama.cpp, and vLLM. For agent developers, faster inference means tighter agent loops. An agent that re-plans every 30 seconds instead of every 60 seconds can handle more complex multi-step tasks before the user notices a wait.

Multi-GPU Without the Server Architecture

PC frameworks like llama.cpp and ComfyUI historically optimized for single-GPU setups. NVIDIA worked with both projects to add proper multi-GPU support for RTX PCs with two equivalent GPUs. llama.cpp now supports tensor parallelism, delivering up to 1.8× compute performance and roughly 2× memory capacity on dual-RTX-5070 configurations compared to single-GPU inference.
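The core idea of tensor parallelism can be shown with a single matrix multiply. The sketch below is a pure-Python toy, not llama.cpp's implementation: it splits a weight matrix by columns across two simulated devices, computes each shard's partial output independently, and concatenates the results. All function names are illustrative.

```python
from typing import List

Matrix = List[List[float]]


def matmul(x: Matrix, w: Matrix) -> Matrix:
    """Plain matrix multiply: (rows x inner) @ (inner x cols)."""
    cols = list(zip(*w))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in x]


def split_columns(w: Matrix, parts: int) -> List[Matrix]:
    """Split a weight matrix into `parts` equal column shards."""
    n_cols = len(w[0])
    if n_cols % parts != 0:
        raise ValueError("column count must divide evenly across devices")
    step = n_cols // parts
    return [[row[i * step:(i + 1) * step] for row in w] for i in range(parts)]


def tensor_parallel_matmul(x: Matrix, w: Matrix, parts: int = 2) -> Matrix:
    # Each simulated device stores only its shard: 1/parts of the weights.
    shards = split_columns(w, parts)
    # On real hardware these partial products run concurrently, one per GPU.
    partials = [matmul(x, shard) for shard in shards]
    # Gather step: concatenate each row's partial outputs in shard order.
    return [sum((p[r] for p in partials), []) for r in range(len(x))]


x = [[1.0, 2.0], [3.0, 4.0]]
w = [[1.0, 0.0, 2.0, 1.0], [0.0, 1.0, 1.0, 3.0]]
print(tensor_parallel_matmul(x, w))
# [[1.0, 2.0, 4.0, 7.0], [3.0, 4.0, 10.0, 15.0]]
print(tensor_parallel_matmul(x, w) == matmul(x, w))  # True
```

Each device holds half of the weights and does half of the arithmetic, which is where both the memory and the compute gains come from. The gather step is the communication cost, and it is one reason the compute speedup lands below a full 2×.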

ComfyUI gains two multi-GPU features: splitting Classifier-Free Guidance compute across GPUs, and device-level model chain splitting. Users running high-VRAM workflows no longer hit the memory-swapping overhead that stalls generation.
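Classifier-Free Guidance lends itself to a two-GPU split because each sampling step runs the model twice, once with the prompt and once without, and the two passes are independent until they are combined. The sketch below illustrates that structure with threads standing in for devices. It is an assumption about how such a split can work, not ComfyUI's code: `denoise` is a toy function, and only the combination formula in `cfg_combine` is the standard CFG rule.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List


def denoise(latent: List[float], conditioning: float) -> List[float]:
    # Toy stand-in for one model forward pass on one device.
    return [v * 0.9 + conditioning for v in latent]


def cfg_combine(cond: List[float], uncond: List[float], scale: float) -> List[float]:
    # Standard CFG rule: push the prediction away from the unconditional output.
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]


def cfg_step_parallel(latent: List[float], prompt_signal: float,
                      scale: float) -> List[float]:
    # Two workers play the role of two GPUs, one per pass.
    with ThreadPoolExecutor(max_workers=2) as pool:
        cond_future = pool.submit(denoise, latent, prompt_signal)
        uncond_future = pool.submit(denoise, latent, 0.0)
        return cfg_combine(cond_future.result(), uncond_future.result(), scale)


latent = [1.0, 2.0]
parallel = cfg_step_parallel(latent, 0.5, 7.5)
sequential = cfg_combine(denoise(latent, 0.5), denoise(latent, 0.0), 7.5)
print(parallel == sequential)  # True
```

The two passes share no state, so running them on separate devices changes only the wall-clock time per step, not the result.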

For developers building agentic applications that need larger context windows or concurrent model serving, multi-GPU on a local PC is now viable without moving to a server cluster. The tradeoff is hardware cost. But for teams that already have multiple RTX GPUs in their lab or dev machines, this unlocks local testing at production scale.

What This Means for AI Engineers

The local AI development stack on Windows is no longer a side project. MXC plus OpenShell provides production-grade security for agents. RTX Spark gives them a hardware target with real GPU memory. And 2× faster inference means agents can run persistent loops without burning compute budget.

For AI engineers evaluating where to deploy agents, the local-versus-cloud decision just shifted. Local deployment on Windows now has security guarantees, hardware availability, and inference performance that were available only on servers a year ago. That changes the architecture choices for your next agent project. The question is not whether to build for local, but which tools in this new stack fit your use case.
