Blog Post
July 29, 2024 | From the experts
The Brass Tacks of Windows Kernel Access
By Matt Holland
With contributions from Erik Egsgard and Sean Alexander.
Locking security vendors out of the Windows kernel would be terrible for everybody, except Microsoft.
It’s been an interesting few weeks, hasn’t it?! The last thing I expected when I woke up on July 19th, 2024 was that a large portion of the world’s computers would be stuck in a cyclic BSOD state due to a preventable crash caused by the world’s largest cybersecurity company (by brand), CrowdStrike. This was obviously an unbelievably bad event for the planet. I suspect the cost in downtime will be many billions of dollars, and the impact on human lives immeasurable.
One of the interesting reactions has been “Microsoft should not let security vendors into the kernel”. Three articles have caught my eye over the past few days:
- Microsoft wants to make future CrowdStrike outages impossible, and it could mean big changes for security software - Windows Central
- Microsoft Confirms It Broke Windows As 30-Minute Crashes Hit After Update - Forbes
- 97% of CrowdStrike systems are back online; Microsoft suggests Windows changes - Ars Technica
Given I have spent my entire adult life writing security kernel drivers (for the past 25 years and counting), I have an opinion on this. Pulling security vendors out of the Windows kernel is a rational reaction given the impact of the CrowdStrike incident, but it’s a surface-level reaction that won’t have the outcome that proponents think it will.
But before I dive into the technical weeds, I’d like to share some interesting data taken straight from the delivery of Field Effect MDR to our thousands of customers. I asked the team to look at the number of Blue Screens of Death (BSODs) that we observed across our entire fleet over the first 6 months of 2024, and to categorize them into crashes caused by Microsoft, hardware vendors, and security vendors (we co-exist with many of them). The results may surprise you. It is important to note that about 20% of hosts experienced at least one BSOD during this period, which means 80% of Windows hosts are as stable as a mountain goat.
While this is only a capture of a 6-month period, it is reflective of what we have observed for over five years of MDR delivery:
Security vendors are not the leading cause of Microsoft Windows kernel crashes. Microsoft is.
Also, you may be thinking, “Wait a second, how does Field Effect know the cause of BSODs?” Part of our service is that we profile the daylights out of our agent and any instabilities we observe on customer machines, allowing us to identify the reasons for crashes and the offending drivers or kernel components. This helps us understand customer environments, and it also helps us detect and fix any instabilities that we cause.
OK, back to the technical aspects of security vendors operating in the Windows kernel.
Let me start by giving a quick picture of how security vendors’ drivers for Microsoft Windows work regarding data flow and decision making. Microsoft has a series of structured security APIs that facilitate very straightforward inspection of key operating system events, such as process/thread activity, file activity, registry activity, network activity, etc.
Drivers receive or intercept activity of interest from Windows via layering or callout APIs, potentially make a decision, and return processing to the operating system (or store the information on the side to enhance further decisions), all of which takes place in near real time in the kernel. There are other data feeds, such as Event Tracing for Windows (ETW), that are available in user mode, but they are asynchronous and lack the ability to block in real time. This is important. I’m also ignoring more “creative” ways of intercepting activity in the Windows kernel, but nobody wants me to go down that rabbit hole in a single blog post.
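To make the difference between an inline callback and an asynchronous feed concrete, here is a toy Python model. It does not use any Windows API: `ToyKernel`, `register_inline`, `deny_known_bad`, and the file path are all hypothetical names invented for illustration. All it tries to sketch is that an inline callback gets a vote before the action completes, while an after-the-fact feed can only report what has already happened.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Event:
    kind: str    # e.g. "process_create"
    target: str  # e.g. the image path being launched


class ToyKernel:
    """Hypothetical stand-in for an OS that dispatches security events."""

    def __init__(self) -> None:
        self.inline_callbacks: List[Callable[[Event], bool]] = []
        self.completed: List[Event] = []   # actions that actually ran
        self.async_feed: List[Event] = []  # after-the-fact telemetry

    def register_inline(self, callback: Callable[[Event], bool]) -> None:
        # Inline callbacks run before the action completes and may veto it.
        self.inline_callbacks.append(callback)

    def perform(self, event: Event) -> bool:
        for callback in self.inline_callbacks:
            if not callback(event):
                return False  # blocked before it ever happened
        self.completed.append(event)
        # The asynchronous feed only learns about the event afterwards.
        self.async_feed.append(event)
        return True


def deny_known_bad(event: Event) -> bool:
    """Toy policy: allow everything except a path containing 'evil'."""
    return "evil" not in event.target


bad = Event("process_create", r"C:\temp\evil.exe")

# With an inline callback, the bad launch is stopped at action time.
guarded = ToyKernel()
guarded.register_inline(deny_known_bad)
blocked_inline = not guarded.perform(bad)

# With only an asynchronous feed, the launch completes first; a consumer
# reading the feed later can detect it but cannot undo it.
unguarded = ToyKernel()
ran_anyway = unguarded.perform(bad)
detected_late = [e for e in unguarded.async_feed if not deny_known_bad(e)]
```

In the guarded case the launch never lands in `completed`. In the unguarded case it does, and the same policy can only flag it after the fact.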
There are very tangible benefits to Microsoft both allowing and facilitating this approach for the past 20+ years:
- Action-time blocking is possible: security vendors can block actions as they happen. This is a key consideration because the alternative is a decision made asynchronously, which would allow an attacker to make subsequent offensive moves before a security vendor can take action. And if you’ve ever been an attacker (which I was for many years in previous careers), you would know that those extra microseconds are gold.
- Minimal possible processing: because the “distance” between regular processing and a security driver making an allow/block decision is minimal, the processing cost can theoretically be extremely minor. Obviously, security vendors can make mistakes here, but conceptually the model is best-case for CPU kindness. At the end of the day, there are millions of possible security events per minute that need to be inspected, and residing in the Windows kernel makes inspection significantly more performant.
- Level playing field: unless operating systems become hack-proof, attackers will always have the ability to elevate to privileged access or kernel execution, and having the ability to utilize kernel security drivers allows vendors to take a gun to a gunfight.
I would highlight that Linux has these same benefits architecturally, but I wouldn’t go near the Linux kernel with a 10-foot pole. Plus, you need a custom kernel module (kmod) for each Linux kernel version. There be dragons. 😉
With all of that said, what would disallowing security vendors practically mean for Windows users, the industry, etc? The outcome extends well beyond the assumption that events like the CrowdStrike crash issue would never happen again. Let’s dig deeper…
Significantly Worse Security Outcome
Prior to founding Field Effect, I was part of a world-class team that built incredible offensive capabilities that could quite literally hack into almost any server, desktop or mobile phone on the planet. The platforms that offered the least processing visibility (iOS, Android and macOS), or that facilitated cybersecurity vendors the least (same list), were the easiest to achieve offensive outcomes against. Just look at how easily NSO Group continues to build attack chains that can hack phones around the world, and they aren’t the only ones.
One of the key principles of defending an operating system is having a level of visibility or access that is at least equal to what an attacker can attain. Regarding Windows, security vendors having kernel access gives us the ability to observe and defend. There will always be kernel exploits that allow attackers to get into the Windows kernel, and without security vendors having kernel access, those types of attacks/exploits would be almost impossible to detect and defend against.
Significantly Worse System Performance
As I highlighted above, one of the things that Microsoft has done a great job of is providing security vendors with standardized APIs for intercepting key events. This allows security vendors to very rapidly make decisions and block attackers with very little CPU consumption, assuming of course that the vendor knows what they are doing.
However, there are other models that utilize similar APIs that reside in user mode, and the side effect is pretty bad. I’m referring to macOS, where the side effect is excessive CPU utilization, because there is no other choice. Let me draw a picture of this. Actually, I should first note that I don’t know jack about the macOS kernel, and I’m basing this on conversations with team members who are much smarter and more experienced than me in this area:
Apple's EndpointSecurity (ES) framework delivers events to clients via user mode (i.e. endpointsecurityd). The endpointsecurityd daemon acts as an intermediary between the kernel and user-space clients. The kernel generates events and sends them to endpointsecurityd, which then forwards these events to the EndpointSecurity clients. The clients then decide whether to block the event (if that's an option), ignore it, or consume it for later context.
This might seem like a simple change: compared with the Windows model, one box in the diagram just moves from kernel mode to user mode. However, that move introduces a CPU penalty, because each call transition requires both a process context switch and a CPU mode switch. These aren’t cheap, and they happen in both directions.
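As a back-of-the-envelope sketch of why the extra hop matters, the overhead scales with the number of events multiplied by the number of boundary crossings per event. The function name `crossing_overhead_us` and every number passed to it below are hypothetical placeholders, not measurements of macOS or Windows. The only input borrowed from this post is the rough order of magnitude of events (millions per minute).

```python
def crossing_overhead_us(events: int, crossings_per_event: int,
                         cost_per_crossing_us: int) -> int:
    """Total CPU time, in microseconds, spent purely on boundary crossings.

    Toy model: every event pays the same fixed cost per crossing.
    """
    return events * crossings_per_event * cost_per_crossing_us


# PLACEHOLDER inputs, chosen only to make the arithmetic easy to follow.
EVENTS_PER_MINUTE = 1_000_000  # order of magnitude mentioned in the post
CROSSINGS = 2                  # out to user mode and back again
COST_US = 5                    # hypothetical cost of one crossing

overhead_us = crossing_overhead_us(EVENTS_PER_MINUTE, CROSSINGS, COST_US)
overhead_seconds_per_minute = overhead_us / 1_000_000
```

With these made-up inputs, the model charges 10 seconds of CPU time per minute to crossings alone, before any security logic runs. Real costs depend on hardware and workload, so treat the shape of the formula, not the number, as the takeaway.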
There are also times when this mechanism just decides not to forward events, which is odd considering the criticality of the events. We've seen it on several occasions and can't make heads or tails of it.
In practice, this results in terrible endpoint agent performance in contrast to the Windows model. For those who run Microsoft Defender on macOS, take a look at how much CPU gets used. Now guess what happens when more than one endpoint agent runs on macOS: the impact on the CPU is compounded, and the user at the keyboard perceives it as a laggy experience.
I’m sure Microsoft could do better than this convoluted model, but Windows also produces significantly more security events than macOS, so there would undoubtedly be a perceivable processing penalty.
Also, if you know any endpoint developers for macOS, go give them a hug. They definitely need it after dealing with Apple’s attempt at supporting security vendors.
Microsoft Anti-Trust
Right out of the gate, there would be a massive anti-trust issue for Microsoft unless they were willing to remove all Windows Defender components from the Windows kernel, which of course they would not be willing to do.
If Defender were to remain in the kernel in any fashion and all other security vendors were kicked out, Microsoft would have an unfair advantage over all other endpoint security companies. I suspect Microsoft is already walking that line with access to private APIs. In my opinion, it would start an immediate anti-trust fire in the industry for Microsoft.
Hardware Vendors
Almost every hardware vendor that supports Microsoft Windows builds its own Windows kernel drivers, and has historically had significant challenges with stability. Back in the early 2000s, graphics drivers went through a whole world of stability pain as they caused continuous BSOD headaches. Unless Microsoft is going to go to a closed hardware model like Apple, the thought of removing hardware vendors from the kernel is ludicrous. And if Microsoft isn’t going to boot hardware vendors out of the kernel, booting security vendors out of the kernel would only benefit Microsoft.
What Really Matters
At the end of the day, industry or company size does not directly contribute to stable code in the Windows kernel. Rather, company culture, developer experience and development mindset are what lead to great stable code. Excellent quality assurance, well-designed architectures and rollout procedures that assume the worst are what lead to stable operational outcomes. Those who buy cybersecurity technology should be demanding these things, because it is these things that matter.
Overreacting because of one vendor’s serious mistake, without considering real-world data and realities, does not lead to a better outcome for the world. Kicking security vendors out of the Windows kernel would make the world’s cybersecurity much worse almost immediately.