Microsoft’s AI Found 16 Windows CVEs — Including 4 Critical RCEs. Here’s How the Agentic Pipeline Actually Works

Microsoft MDASH multi-model agentic scanning harness — AI-powered vulnerability discovery finds 16 Windows CVEs including 4 Critical RCEs in a single Patch Tuesday cycle

Estimated reading time: 16 minutes

As a Power Platform Solution Architect, you don’t patch Windows. You architect on top of it. But when Microsoft’s AI finds 16 Windows CVEs, what you need to know is how those vulnerabilities might affect your solutions. That’s where understanding the MDASH agentic security pipeline helps: it informs how you keep your architecture secure and resilient.

Azure, Dataverse, Power Pages, Power Automate — every one of those services runs on the Windows networking and authentication stack that Microsoft just announced its AI system scanned and found 16 vulnerabilities in. Four of them Critical remote code execution flaws. Most reachable with no credentials.

That’s the context that makes Microsoft’s May 12 announcement worth your time. Not because you need to understand kernel internals, but because:

  • The platform you stake your client architectures on is being scanned by production AI at a depth that wasn’t possible two years ago
  • The security posture of that platform is now a talking point you can use in governance and architecture reviews
  • The architectural principles behind how MDASH was built map directly to problems you solve in Dataverse and Power Platform design

Microsoft’s new agentic scanning harness, codenamed MDASH (Microsoft Security multi-model agentic scanning harness), found those 16 vulnerabilities in a single scan cycle, and fixes for all of them shipped in the same Patch Tuesday. That’s not a research demo. That’s an AI system closing the full loop: find, verify, patch, ship.

Three Architectural Takeaways

Three things stand out from an architecture and governance perspective:

  1. The bugs made it to Patch Tuesday — they passed triage, reproduction, and patch review. That’s the bar most automated scanners never clear, and it’s relevant to how you evaluate security tooling for your clients.
  2. The architecture is model-agnostic by design — the value doesn’t reset when a new model drops, which is exactly the kind of durable investment story you want to see in enterprise security tooling.
  3. 88.45% on the CyberGym public benchmark — highest on the leaderboard, using generally available models. The pipeline is doing most of the work, not the model.

Let’s get into how it works — and then I’ll translate the architectural lessons into what they mean for you.



Microsoft Autonomous Code Security team built MDASH — origin from DARPA AI Cyber Challenge Team Atlanta, designed to prove vulnerabilities end-to-end, not just flag candidates

What MDASH Actually Is

MDASH was built by Microsoft’s Autonomous Code Security (ACS) team. A few members came from Team Atlanta — the group that won the $29.5 million DARPA AI Cyber Challenge by building an autonomous system that found and patched real bugs in complex open-source software. The DARPA challenge is worth knowing about because it required end-to-end proof, not just candidate flagging. You had to demonstrate the exploit, not just describe the vulnerability. That discipline is baked into MDASH’s design.

As a Solution Architect, the constraint Microsoft faces with its own codebase is relevant to how you think about security scanning on your client environments:

  • Everything is proprietary. Windows, Hyper-V, Azure, the driver ecosystem — none of it was ever in a model’s training data. The AI can’t rely on pattern matching. It has to reason about component-specific trust boundaries, ownership semantics, and concurrency models. Sound familiar? Dataverse has its own calling conventions, transaction boundaries, and plugin execution context that generic tools don’t understand either.
  • False positives have a real cost. Every finding has an owner and a triage queue. A tool that floods teams with noise doesn’t get used. The same problem shows up in every enterprise security programme I’ve seen — alert fatigue kills adoption faster than anything else.
  • High-value targets raise the bar. Windows and Azure serve billions of users. A single Critical CVE in tcpip.sys is a very bad day for a very large number of people. The acceptable false-positive rate is near zero.

All of that shapes why MDASH is built the way it is. It’s not optimised to produce the most candidates. It’s optimised to produce findings that survive being proven.


MDASH five-stage vulnerability pipeline — Prepare, Scan, Validate, Dedupe, Prove — codebase in, proven findings out

The Five-Stage Pipeline

Think of MDASH as a pipeline: you put a codebase in one end and get validated, proven findings out the other. There’s no single step where “the AI reads the code and finds bugs.” It’s staged, and each stage has a specific job.

Prepare

Before any agent touches the code, MDASH ingests the source target, builds language-aware indices, and maps the attack surface by analysing past commits. Domain context enters the system here — not through the model’s weights, but through structured tooling that understands what the code does before any scanning starts.

Scan

More than 100 specialised auditor agents run over candidate code paths. Each one was built from deep research into past CVEs and their patches, so they’re not generic “find a bug” agents — they know what specific classes of vulnerability look like in this kind of code. They work independently, and their findings get ensembled into a single report.

Validate

This is the stage that separates real findings from noise, and it’s where MDASH does something most scanners skip entirely. A second cohort of agents — the debaters — argue against each finding. They try to prove it’s not reachable, not exploitable, not real. If the auditor flagged something and the debater can’t knock it down, the finding’s credibility goes up. That disagreement signal is used explicitly.
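To make the debate idea concrete, here is a minimal sketch. It is not MDASH code: the `Finding` fields, the two toy debaters, and the scoring rule (the fraction of debaters that fail to refute a finding) are all assumptions I made up for illustration. Real debaters would be model-backed agents, not one-line checks.

```python
from dataclasses import dataclass

@dataclass
class Finding:
    """A candidate bug raised by an auditor agent (hypothetical shape)."""
    title: str
    reachable_from_network: bool
    attacker_controls_input: bool

# Toy debaters standing in for model-backed agents. Each returns a rebuttal
# string if it can knock the finding down, or None if it cannot.
def reachability_debater(finding):
    return None if finding.reachable_from_network else "code path is not reachable"

def exploitability_debater(finding):
    return None if finding.attacker_controls_input else "attacker cannot influence the input"

DEBATERS = [reachability_debater, exploitability_debater]

def credibility(finding, debaters=DEBATERS):
    """Fraction of debaters that failed to refute the finding (0.0 to 1.0)."""
    failed_rebuttals = sum(1 for debate in debaters if debate(finding) is None)
    return failed_rebuttals / len(debaters)

strong = Finding("UAF in receive path", True, True)
weak = Finding("overflow in debug helper", False, True)
print(credibility(strong))  # 1.0
print(credibility(weak))    # 0.5
```

The point of the sketch is the direction of the signal: a finding earns credibility by surviving attacks, not by the auditor asserting it.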

Dedupe

Semantically equivalent findings get collapsed. Patch-based grouping is the primary mechanism — if two findings trace back to the same root cause, they show up in triage as one item, not ten.
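As a rough illustration of patch-based grouping, the sketch below collapses findings that would be fixed at the same place. The `(patch_file, patch_function)` key is my own hypothetical stand-in for “same root cause”; Microsoft hasn’t published how MDASH actually computes that equivalence.

```python
from collections import defaultdict

def dedupe_by_patch(findings):
    """Collapse findings that would be fixed by the same patch site.

    The (file, function) key is a hypothetical stand-in for "same root cause".
    """
    groups = defaultdict(list)
    for finding in findings:
        key = (finding["patch_file"], finding["patch_function"])
        groups[key].append(finding["id"])
    # One triage item per root cause, listing the raw findings it absorbs.
    return [{"root_cause": key, "finding_ids": ids} for key, ids in groups.items()]

raw = [
    {"id": 1, "patch_file": "parser.c", "patch_function": "read_header"},
    {"id": 2, "patch_file": "parser.c", "patch_function": "read_header"},
    {"id": 3, "patch_file": "session.c", "patch_function": "teardown"},
]
triage = dedupe_by_patch(raw)
print(len(raw), "findings ->", len(triage), "triage items")  # 3 findings -> 2 triage items
```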

Prove

This is where MDASH either proves a finding or drops it. The prove stage constructs actual triggering inputs and validates them dynamically. For C/C++ code, that means ASan integration. If a finding can’t be demonstrated, it doesn’t make the cut.

The CLFS proving plugin is a good example of how this extensibility works. It knows the on-disk container layout, the block-validation sequence, and the in-memory state machine for the Common Log File System — enough to construct triggering log files for candidate findings. The model doesn’t need to know any of this. The plugin embeds it, the model uses it, and the result is bugs that ship to Patch Tuesday rather than bugs that sit in a backlog.
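If it helps to see the staging as code, here is a toy version of the five stages wired together. Everything in it is an assumption for illustration: the stage functions are simple string checks over fake C snippets, where the real stages are agent-driven and the real prove stage runs triggering inputs under tools like ASan. What it does show is the shape: each stage takes the previous stage’s output and narrows it.

```python
def prepare(state):
    # Build a trivial "index": file name -> list of source lines.
    state["index"] = {name: text.splitlines() for name, text in state["sources"].items()}
    return state

def scan(state):
    # Toy auditor: flag every line that calls free().
    state["findings"] = [
        {"file": name, "line": number, "text": line.strip()}
        for name, lines in state["index"].items()
        for number, line in enumerate(lines, start=1)
        if "free(" in line
    ]
    return state

def validate(state):
    # Toy debater: a free() that also NULLs the pointer is argued to be safe.
    state["findings"] = [f for f in state["findings"] if "= NULL" not in f["text"]]
    return state

def dedupe(state):
    # Collapse multiple findings in the same file into one triage item.
    first_per_file = {}
    for finding in state["findings"]:
        first_per_file.setdefault(finding["file"], finding)
    state["findings"] = list(first_per_file.values())
    return state

def prove(state):
    # Toy proof: keep a finding only if the file frees the same expression twice.
    def has_double_free(lines):
        frees = [l.strip() for l in lines if "free(" in l and "= NULL" not in l]
        return len(frees) != len(set(frees))
    state["findings"] = [
        f for f in state["findings"] if has_double_free(state["index"][f["file"]])
    ]
    return state

PIPELINE = [prepare, scan, validate, dedupe, prove]

def run(sources):
    state = {"sources": sources}
    for stage in PIPELINE:
        state = stage(state)
    return state["findings"]

sources = {
    "bad.c": "free(ctx->realm);\nfree(ctx->realm);",
    "ok.c": "free(buf); buf = NULL;",
    "single.c": "free(tmp);",
}
print(run(sources))  # only the bad.c finding survives all five stages
```

Four candidate lines go in at the scan stage and one proven finding comes out, which is the whole argument for staging in miniature.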


Three architectural properties of MDASH — multi-model ensemble with disagreement signal, specialised agents per stage, model-agnostic pipeline

The Three Properties That Make It Work in Practice

1. An ensemble of diverse models

No single model is best at every stage — and MDASH doesn’t pretend otherwise. It runs a configurable panel: a SOTA model as the heavy reasoner, a distilled model as a cost-effective debater for high-volume passes, and a second separate SOTA model as an independent counterpoint. The key insight is that disagreement between models isn’t a problem to resolve — it’s a signal. When two models diverge on a finding, that divergence tells you something about how credible the finding actually is.
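Here is one way to treat disagreement as data rather than noise. This is a sketch under my own assumptions: the model names are placeholders, each model casts a simple yes/no vote, and the “agreement” score is just the size of the majority. MDASH’s actual weighting isn’t public.

```python
def ensemble_verdict(votes):
    """Summarise a panel's votes on one finding.

    votes maps a (placeholder) model name to True if that model
    believes the finding is a real bug, False otherwise.
    """
    total = len(votes)
    yes = sum(1 for vote in votes.values() if vote)
    agreement = max(yes, total - yes) / total  # 1.0 means unanimous
    return {
        "likely_real": yes * 2 > total,   # simple majority
        "agreement": agreement,
        "escalate": agreement < 1.0,      # any split is worth a closer look
    }

unanimous = ensemble_verdict({"reasoner": True, "debater": True, "counterpoint": True})
split = ensemble_verdict({"reasoner": True, "debater": False, "counterpoint": True})
print(unanimous)
print(split)
```

The split case still comes out as “likely real”, but the lower agreement score travels with it, so downstream stages can spend more effort on it.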

2. Specialised agents

An auditor thinks differently to a debater, which thinks differently to a prover. Each stage has its own role, its own prompt regime, its own tools, its own stop criteria. MDASH doesn’t try to cram everything into one agent or one prompt. That might sound obvious, but most AI coding tools still do exactly that.

3. Model-agnostic architecture

This is the one that matters most long-term. When a new model lands, swapping it in is one configuration flip. When a model improves, everything the team already built — scan plugins, scope configurations, proving agents — carries over. The value doesn’t reset. Compare that to a system built around a specific model: the moment a better one ships, you’re rebuilding from scratch.
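A minimal sketch of what “one configuration flip” can look like in practice. The `Model` protocol, the two stub model classes, and the role names in `CONFIG` are all hypothetical; the idea is only that agents depend on an interface and a config entry, never on a specific model.

```python
from typing import Protocol

class Model(Protocol):
    """The only thing an agent is allowed to know about a model."""
    def complete(self, prompt: str) -> str: ...

# Stub models standing in for real model clients (hypothetical names).
class ModelA:
    def complete(self, prompt: str) -> str:
        return f"[model-a] {prompt}"

class ModelB:
    def complete(self, prompt: str) -> str:
        return f"[model-b] {prompt}"

REGISTRY = {"model-a": ModelA, "model-b": ModelB}

# Which model backs which role. Swapping a model is an edit here, nowhere else.
CONFIG = {"auditor": "model-a", "debater": "model-b"}

def model_for(role: str, config: dict = CONFIG) -> Model:
    return REGISTRY[config[role]]()

def audit(snippet: str, config: dict = CONFIG) -> str:
    # The agent logic (prompt, tools, stop criteria) never names a model.
    return model_for("auditor", config).complete(f"audit: {snippet}")

print(audit("memcpy(dst, src, n)"))
upgraded = {**CONFIG, "auditor": "model-b"}  # the "configuration flip"
print(audit("memcpy(dst, src, n)", upgraded))
```

Everything above the config line survives the swap untouched, which is the same property you want from your own plugin and flow designs.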


MDASH May 2026 Patch Tuesday results — 16 CVEs found including 4 Critical RCEs in tcpip.sys, IKEv2, Netlogon, and DNS, most reachable with no credentials

What It Found: The May 2026 Patch Tuesday Cohort

Here’s the full list of 16 CVEs from the May 12 Patch Tuesday that MDASH found. As a Solution Architect, pay attention to the Component column: tcpip.sys, ikeext.dll, netlogon.dll, dnsapi.dll, and http.sys are all part of the Windows infrastructure stack that Azure, Entra ID, and Power Platform run on.

| Component | Description | CVE | Severity | Type |
|---|---|---|---|---|
| tcpip.sys | Remote unauth SSRR IPv4 packets causing UAF | CVE-2026-33827 | Critical | Remote Code Execution |
| ikeext.dll | Unauth IKEv2 SA_INIT double-free → LocalSystem RCE | CVE-2026-33824 | Critical | Remote Code Execution |
| netlogon.dll | Unauthenticated CLDAP User= filter stack overflow | CVE-2026-41089 | Critical | Remote Code Execution |
| dnsapi.dll | Crafted UDP DNS response triggers heap OOB | CVE-2026-41096 | Critical | Remote Code Execution |
| tcpip.sys | NULL deref via crafted IPv6 extension headers | CVE-2026-40413 | Important | Denial of Service |
| tcpip.sys | Kernel DoS via ESP SA refcount underflow | CVE-2026-40405 | Important | Denial of Service |
| tcpip.sys | Use-after-free in Ipv4pReassembleDatagram → info disclosure | CVE-2026-40406 | Important | Information Disclosure |
| tcpip.sys | IPsec cross-SA fragment splicing via reassembly | CVE-2026-35422 | Important | Security Feature Bypass |
| tcpip.sys | Unauthenticated local WFP RPC disables name cache | CVE-2026-32209 | Important | Security Feature Bypass |
| ikeext.dll | Memory leak | CVE-2026-35424 | Important | Denial of Service |
| telnet.exe | OOB read in FProcessSB via malformed TO_AUTH | CVE-2026-35423 | Important | Information Disclosure |
| tcpip.sys | IPv6+TCP MDL-split packet triggers NULL deref | CVE-2026-40414 | Important | Denial of Service |
| tcpip.sys | ICMPv6 packet triggers NdisGetDataBuffer NULL deref | CVE-2026-40401 | Important | Denial of Service |
| tcpip.sys | Pre-auth remote UAF via SA double-decrement | CVE-2026-40415 | Important | Remote Code Execution |
| http.sys | Unauth remote QUIC control-stream OOB read | CVE-2026-33096 | Important | Denial of Service |
| tcpip.sys | Kernel stack buffer overflow via RPC blob | CVE-2026-40399 | Important | Elevation of Privilege |

10 kernel-mode, 6 usermode. Most are reachable from the network with no credentials required.

The ones most worth flagging in a client architecture review: the dnsapi.dll heap out-of-bounds bug (DNS is everywhere in hybrid environments), the netlogon.dll RCE (directly relevant to any AD-connected Power Platform tenant), and both ikeext.dll findings, especially the Critical double-free (these affect any environment using Always-On VPN, DirectAccess, or IPsec connection rules, which are common in enterprise Power Platform on-premises gateway configurations).

All of these were patched in May Patch Tuesday. But the fact that they were found this way is the news.


Two Deep Dives: The Architectural Lessons Worth Stealing

Microsoft walked through two findings in detail and explicitly called them out as the type of bug the multi-model pipeline catches that a single-model harness doesn’t. Both are worth reading carefully, because they show you exactly where the architectural difference shows up in practice.

CVE-2026-33827 — tcpip.sys use-after-free via SSRR: Path object released then reused while three concurrent subsystems can free it first, invisible to single-file analysis

CVE-2026-33827 — Trust Boundary Violation Across Concurrent Subsystems

The bug is in Ipv4pReceiveRoutingHeader, a function in the Windows IPv4 receive path. The function drops its sole owned reference to a Path object, then later reuses the same pointer when handling Strict Source and Record Route (SSRR) processing. Classic use-after-free setup.

But here’s what makes it hard to catch: the timing window is real, and it involves multiple independent subsystems. The path-cache scavenger, explicit flush routines, and interface state-driven garbage collection can all concurrently remove the object and drop the final reference. None of them are synchronised with the receive-side execution window in this function. No lock is held. On an SMP system, the freed object can be reclaimed and overwritten before the subsequent dereference — a race-driven UAF with actual exploitation feasibility.

An attacker can trigger this with crafted IPv4 packets carrying the SSRR option. No credentials, no special setup.

So why did single-model systems miss it? The lifetime violation isn’t locally visible in any single function. The release and reuse are separated by non-trivial control flow — alternate branches, multiple validation checks, early-drop conditions. Without tracking reference ownership across all of that, the model just sees two independent operations. And the decisive signal — the correct version of the same pattern elsewhere in the codebase — only becomes visible when you’re doing cross-file reasoning. A single-shot analysis misses the connection entirely.
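To see the ownership mistake without reading kernel code, here is a deterministic simulation. It is not the Windows implementation: the `Path` class, the reference counts, and the `scavenger` callback are assumptions that stand in for the real objects, and the “race” is replayed in a fixed order so the outcome is repeatable.

```python
class Path:
    """Toy reference-counted object; the cache starts with one reference."""
    def __init__(self):
        self.refs = 1
        self.freed = False

    def acquire(self):
        if self.freed:
            raise RuntimeError("use-after-free")
        self.refs += 1

    def release(self):
        self.refs -= 1
        if self.refs == 0:
            self.freed = True  # memory could now be reclaimed and overwritten

    def next_hop(self):
        if self.freed:
            raise RuntimeError("use-after-free")
        return "10.0.0.1"

def scavenger(path):
    # Stands in for the cache scavenger, flush routine, or interface GC
    # dropping the cache's reference at an inconvenient moment.
    path.release()

def receive_buggy(path, concurrent_free):
    path.acquire()          # the receive path's own reference
    path.release()          # BUG: dropped before the last use
    concurrent_free(path)   # another subsystem drops the final reference
    return path.next_hop()  # pointer reused after the object is gone

def receive_fixed(path, concurrent_free):
    path.acquire()
    try:
        concurrent_free(path)
        return path.next_hop()  # still safe: we hold a reference
    finally:
        path.release()          # release only after the last use

try:
    receive_buggy(Path(), scavenger)
except RuntimeError as err:
    print("buggy:", err)
print("fixed:", receive_fixed(Path(), scavenger))
```

Notice that neither `release()` call looks wrong on its own. The bug only exists in the ordering between them and the later use, which is why a function-at-a-time read doesn’t see it.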

CVE-2026-33824 — IKEv2 IKEEXT double-free: shallow memcpy creates two owners of the same heap allocation, both free it, leading to LocalSystem RCE

CVE-2026-33824 — Shallow Copy Creates Implicit Shared Ownership

This one is in IKEEXT, the Windows component that handles IKE and AuthIP keying for IPsec. It’s reachable over UDP/500 on any host acting as an IKEv2 responder — think RRAS VPN, DirectAccess, Always-On VPN, or any machine with an inbound IPsec connection security rule.

Two UDP packets. No race. No special timing. Deterministic.

The root cause is a shallow copy problem. When IKEEXT reinjects a reassembled IKEv2 fragment through its receive pipeline, it copies the packet’s receive context with a flat memcpy. That copies the struct bytes but not the heap allocations the struct points to. One of those allocations is the attacker-supplied security-realm identifier. After the copy, both the queued context and the live Main Mode SA hold the same pointer — and both think they own it. On teardown, both free it. Double-free of a fixed-size heap chunk. IKEEXT runs as LocalSystem inside svchost.exe. That’s pre-auth RCE into one of the highest-privilege contexts on the machine.
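The same ownership bug is easy to reproduce in a managed language, which makes it a useful teaching example. In the sketch below, `copy.copy` plays the role of the flat memcpy and `copy.deepcopy` plays the role of a correct, allocation-aware copy. The `Buffer` and `RecvContext` classes are hypothetical stand-ins, not IKEEXT structures.

```python
import copy

class Buffer:
    """Toy heap allocation that detects being freed twice."""
    def __init__(self, data):
        self.data = data
        self.freed = False

    def free(self):
        if self.freed:
            raise RuntimeError("double free")
        self.freed = True

class RecvContext:
    """Toy receive context that believes it owns its realm buffer."""
    def __init__(self, realm):
        self.realm = Buffer(realm)

    def teardown(self):
        self.realm.free()

def teardown_both(copy_fn):
    live = RecvContext(b"attacker-supplied-realm")
    queued = copy_fn(live)   # reinjected copy of the context
    live.teardown()
    try:
        queued.teardown()    # second owner frees what it thinks is its buffer
    except RuntimeError as err:
        return str(err)
    return "ok"

print(teardown_both(copy.copy))      # shallow copy: both share one Buffer
print(teardown_both(copy.deepcopy))  # deep copy: each owns its own Buffer
```

If you write Dataverse plugins or Code Apps, this is the pattern to watch for whenever one object is cloned and both copies believe they are responsible for cleaning up the same underlying resource.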

Why Single-Model Systems Missed It

The bug spans six source files: the bad memcpy in ike_A.c, the alias origin in ike_B.c, the wrong free in ike_C.c, the right pattern and the second free in ike_D.c, the remote population in ike_E.c, and the UAF read site in ike_F.c. No single-file analysis connects all of that. The strongest evidence that the bug is real is the correct version of the same pattern immediately after the memcpy in ike_D.c — but you only see that by comparing across files. MDASH’s specialised auditor agents are built to surface exactly that kind of cross-file pattern comparison, and the debate stage forces each finding to hold up under scrutiny before it moves forward.


MDASH benchmark results — 88.45% on CyberGym (top of leaderboard), 96% recall on clfs.sys MSRC cases, 100% on tcpip.sys over five years

How Capable Is MDASH? Retrospective Benchmarks

The Patch Tuesday results are forward-looking. The retrospective benchmarks are where you get ground truth — can it rediscover bugs that real attackers found and that real engineers already patched?

Recall on Historical MSRC Cases

The team ran MDASH against pre-patch snapshots of two heavily reviewed Windows components and measured re-discovery of confirmed historical bugs:

  • clfs.sys: 96% recall across 28 MSRC cases over five years
  • tcpip.sys: 100% recall across 7 MSRC cases over five years
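A quick sanity check on what those percentages mean in absolute terms. Microsoft’s figures are the percentages and case counts; the found-case counts below are my own back-calculation, assuming recall is simply found divided by total and rounded to a whole percent.

```python
def recall_percent(found, total):
    return 100 * found / total

CLFS_CASES = 28
TCPIP_CASES = 7

# Which whole-number case counts round to the published 96% for clfs.sys?
clfs_candidates = [
    found for found in range(CLFS_CASES + 1)
    if round(recall_percent(found, CLFS_CASES)) == 96
]
print(clfs_candidates)                              # [27]
print(round(recall_percent(27, CLFS_CASES), 1))     # 96.4
print(recall_percent(TCPIP_CASES, TCPIP_CASES))     # 100.0
```

Under that assumption, 96% on 28 cases means one miss, and 100% on 7 cases means none. Small denominators are exactly why the finite-case-count caveat matters.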

Think about what the MSRC case database actually represents. These are the bugs real attackers exploited, the ones that triggered emergency patches, the ones defenders had to react to under pressure. A system that rediscovers 96% of a five-year MSRC backlog in a heavily reviewed kernel component isn’t finding theoretical weaknesses. It’s finding the bugs that mattered.

Microsoft is honest about the limits here. These are retrospective recall numbers on a finite case count. They tell you the system would have been useful if it had existed at the time. They don’t guarantee the same rate on the next 38 CLFS bugs. Fair caveat.

CyberGym Public Benchmark

On the public CyberGym benchmark — 1,507 real-world vulnerability reproduction tasks across 188 OSS-Fuzz projects — MDASH scored 88.45%. That’s the top result on the leaderboard, roughly five points ahead of the next entry at 83.1%.
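For a sense of scale, here is the same kind of back-of-envelope arithmetic on the CyberGym numbers. The task count and both scores come from the announcement; the solved and missed counts are my inference, assuming the score is just solved tasks divided by total tasks.

```python
TASKS = 1507
MDASH_SCORE = 88.45   # percent
NEXT_BEST = 83.1      # percent

solved = round(TASKS * MDASH_SCORE / 100)
missed = TASKS - solved
print(solved, missed)                      # 1333 174
print(round(100 * solved / TASKS, 2))      # 88.45
print(round(MDASH_SCORE - NEXT_BEST, 2))   # 5.35
```

So the gap to the next entry is about 5.35 percentage points, and the misses amount to roughly 174 tasks if the assumption holds.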

What makes that number meaningful isn’t just the ranking. It’s that it was achieved with generally available models. No special fine-tune. No proprietary model. The agentic system around the model is doing the heavy lifting — which is exactly the architectural claim Microsoft is making.

Failure Analysis

The remaining ~12% that MDASH missed breaks down into two patterns, and Microsoft published the breakdown:

  1. Wrong code area targeting — 82% of these came from tasks with vague descriptions that lacked function or file identifiers. Better scan input = better results. This is a solvable problem.
  2. Harness format mismatch — the agent built correct reproductions using libFuzzer-style inputs, but the benchmark expected honggfuzz format. Sound finds, wrong format. Also solvable.

Neither failure pattern points to a fundamental model limitation. Both point to tooling and input quality issues.


What MDASH means for Power Platform Solution Architects — platform security posture, client architecture reviews, governance conversations, and shared responsibility model

What This Means for Enterprise Defenders

Let me give you the honest takeaway.

AI vulnerability discovery has crossed from “interesting research project” into “this is how we ship Patch Tuesday now.” That’s the real significance of the May 12 announcement — not the benchmark score, not the CVE count, but the fact that the full loop closed: AI found bugs, teams verified them, patches shipped. At scale. In production.

Here’s what I think the MDASH architecture teaches us, regardless of whether you’re ever going to use MDASH specifically:

The Model Isn’t the Product. The Pipeline Is.

The bugs MDASH found — the tcpip.sys race, the six-file ikeext.dll alias chain — are invisible to a model handed a single function. They become visible when you have a system that can sequence cross-file comparison, multi-step reachability analysis, debate between agents, and end-to-end proof construction. If you’ve been evaluating AI security tools based on “which model does it use,” you’ve been asking the wrong question.

Validation Is Where Most of the Work Actually Lives

A tool that flags candidate bugs without proving them just creates a triage backlog. The reason the Patch Tuesday cohort exists is that MDASH didn’t stop at flagging. It debated, deduped, and proved. Validation isn’t a checkbox at the end of the pipeline. It’s its own sub-system, and it’s where most of the day-to-day engineering effort goes.

Model-Agnostic Architecture Is a Competitive Advantage

The model market is going to keep moving. Any security tool whose core value is locked to a specific model is a tool that needs to be rebuilt every six months. MDASH’s architecture carries investment forward — scan plugins, scope configurations, proving agents — across model generations. That’s what durable value looks like.

When you’re evaluating any AI security tooling, the question worth asking is: what does this system do with the model, and what survives when the next model arrives?

MDASH is currently in limited private preview. If Microsoft extends MDASH to .NET and TypeScript targets, that becomes directly relevant to Power Platform development pipelines. Sign up for the preview to stay ahead of where this goes.


MDASH key takeaways for Power Platform Solution Architects — 16 CVEs, five-stage pipeline, 88.45% CyberGym, model-agnostic architecture, private preview

Key Takeaways

  • 16 CVEs found in Windows networking infrastructure in a single scan cycle, including components (dnsapi.dll, netlogon.dll, ikeext.dll, http.sys) directly relevant to enterprise Power Platform and Azure deployments.
  • MDASH closes the full loop: find → validate → prove → patch → ship. That’s the bar automated security tools almost never clear.
  • The five-stage pipeline (Prepare → Scan → Validate → Dedupe → Prove) is the product. The model is just one input.
  • 100+ specialised agents across stages; multi-model ensemble uses disagreement as a credibility signal.
  • 96% recall on 5 years of clfs.sys MSRC cases; 100% on tcpip.sys — ground-truth retrospective evidence, not just benchmark numbers.
  • 88.45% on CyberGym — top of the public leaderboard, using generally available models.
  • For architecture reviews: use this as evidence that Microsoft’s infrastructure security posture operates at a materially higher level than configuration-layer controls alone.
  • For solution design: the two CVE deep dives are case studies in shared ownership and cross-component trust boundary design — patterns that apply directly to Dataverse plugins, Power Automate flows, and Code Apps.
  • For tooling decisions: MDASH targets C/C++ infrastructure code today. Track it for when it extends to .NET and TypeScript targets.
  • Private preview is open: https://aka.ms/AI-drivenScanningHarness
