Kernels, Shells, and Operating Systems (Re)Visited

While recording a podcast episode in the wake of the 2024 CrowdStrike incident, I spoke briefly about the reactions I had seen from the technical community. At that moment, I realized that, even though many of us treat the terms “kernel”, “shell”, and “operating system” as basic, 101-level concepts, we should not. You can become a grizzled IT veteran while knowing surprisingly little about them. I witnessed misinformation spreading from individuals who I thought would know better. Let’s have a serious talk about this.

I referenced a podcast episode in the opener for context. You do not need to watch or listen to it to follow this article. For the sake of completeness, I appeared as a guest, with hosting provided by Project Runspace co-founder Andy Syrewicze in his daily role at Hornetsecurity, on the Security Swarm episode titled “CrowdStrike Chaos, VMware ESXi Vulnerability & More”.

The Operating System Solution

I have noticed a phenomenon that probably has a lot written about it in some journal that I have never read. Within some timeframe after a solution becomes ubiquitous, whether days or decades, people forget why that solution came into existence. Clearly, that phenomenon now applies to computer operating systems. Their core purpose and function seem to have become lost knowledge, even for computer professionals.

Once upon a time, you started a computer, then you put a program into it, and then the computer ran that program. If you wanted to run a different program, first you needed to reset the computer, and then you could put the new program into it. That was the way things worked.

Operating systems solved that problem. An operating system allows a computer to run different software without a complete reset. The ability to have more than one program in computer memory at a time was an even later advancement. After that, multi-tasking significantly changed what we could do with operating systems. Then, multi-threading came along and blew everyone’s mind. Today, we consider all this so normal that we rarely talk about it outside of programming.

What does an operating system do? It manages other software. That is its purpose. When operating systems first appeared, techies called them “supervisors”. Just as human supervisors oversee other humans, software supervisors oversee other software. Operating systems have so many components today that we often forget that they exist for that one fundamental purpose.
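To make the supervisor idea concrete, here is a minimal Python sketch. Treat it as a toy under loud assumptions: the names supervisor, payroll, and inventory are invented for illustration, and plain function calls stand in for loading and running real programs. The only point is that one long-lived piece of software starts other software, one program after another, with no reset in between.

```python
from typing import Callable, Dict, List

# A "program" here is just a function that returns a short result string.
# That is an assumption for illustration; real programs are far richer.
Program = Callable[[], str]


def supervisor(programs: Dict[str, Program]) -> List[str]:
    """Run each program in turn without resetting anything between them."""
    log: List[str] = []
    for name, program in programs.items():
        result = program()  # hand control to the program, then take it back
        log.append(f"{name}: {result}")
    return log


def payroll() -> str:
    # Hypothetical workload
    return "payroll computed"


def inventory() -> str:
    # Hypothetical workload
    return "inventory counted"


run_log = supervisor({"payroll": payroll, "inventory": inventory})
for line in run_log:
    print(line)
```

Everything a modern operating system adds, such as multi-tasking, memory management, and hardware access, grows outward from that loop.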

The Kernel

Reading modern articles and commentary, I could understand if someone took on the belief that operating system kernels have some magical, mythical, incomprehensible existence on a higher plane. They do not.

Right from dictionary.com, definitions #1 and #2:

the softer, usually edible part contained in the shell of a nut or the stone of a fruit.

the body of a seed within its husk or integuments.

The definition for an operating system “kernel” came from definitions like those. It’s the inner parts, the ooey bits, the things regular people don’t want to see. More technically, the kernel does the work of managing the software. In modern computing, that also involves hardware management. That’s really the entire definition.

The Shell

Think of PowerShell, bash, Windows, whatever. The “shell” of an operating system is the part that you can see and touch. Maybe you include all the supporting cast members of programs that perform auxiliary tasks. Maybe you don’t. It doesn’t matter. Shells dwarf kernels, but they need much less explanation.

The BSOD Solution

Wait, can I refer to the blue screen of death (BSOD) as a solution? Yes, I can. A BSOD alerts you to the presence of a problem, no doubt about that. But BSODs themselves solve a problem.

You know from above what a kernel does. So, think about how you would build one. You need to create software that controls software that you know effectively nothing about. You must make sure that all software can run. As a natural side effect of providing that service, you must also protect software from other software. Think about how your kernel would do that.

Take your fresh mental model of your homegrown kernel. Now, imagine that some piece of software, maybe a regular user app, maybe a driver, who knows what, has malfunctioned. If it runs in user space, no big deal. Your kernel can just end it and claw back whatever resources it was using. For this exercise, imagine that the naughty program was running in kernel space. Remember that your kernel can only operate generically because you only have a generic idea of what anyone will want to run on your operating system. If said software runs in kernel space, then it probably needs to do important system level stuff. That could include things that your kernel would never allow software in user space to do. Perhaps it intercepts the activity of other software. Sounds kind of like something that anti-malware would do, doesn’t it?

OK, so this software has malfunctioned. Your kernel does not have a way to know what that software does. It only knows that something went wrong. Since software in kernel space might tinker with other software in kernel space, including the kernel, halting that program may not solve the problem. You might have seen someone refer to this condition as the kernel having an indeterminate state. What do you think that a kernel should do when it does not know its own state? If I give my answer by copying the decisions made by the teams that work on today’s kernels, I would say that a kernel halt is the most appropriate action. That makes sense to me; if the kernel would halt a misbehaving user mode application, then should it not halt a misbehaving kernel mode application? Remember that the difference between user mode and kernel mode is neither cryptic nor magical. Applications in user mode must keep their hands to themselves. Applications in kernel mode have substantially fewer restrictions. Even the kernel cannot know with certainty that a malfunctioning kernel mode program only damaged itself. If it could, then why was that program running in kernel mode in the first place?
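The same reasoning can be sketched as a toy model in Python. Nothing here reflects how Windows or any real kernel is implemented: ToyKernel, KernelHalt, and bad_task are hypothetical names, and a Python exception stands in for a hardware fault. The sketch only shows the decision described above: a fault in user mode costs one program, while a fault in kernel mode means the kernel can no longer trust anything, itself included.

```python
from typing import Callable, List


class KernelHalt(Exception):
    """Raised when the toy kernel can no longer trust its own state."""


class ToyKernel:
    def __init__(self) -> None:
        # Names of user-mode programs that were ended after a fault
        self.terminated: List[str] = []

    def run(self, name: str, task: Callable[[], None], kernel_mode: bool = False) -> str:
        try:
            task()
            return "ok"
        except Exception as fault:
            if kernel_mode:
                # The faulting code could have touched anything, so stop everything.
                raise KernelHalt(f"{name} faulted in kernel mode: {fault}") from fault
            # User-mode code was isolated, so ending it is enough.
            self.terminated.append(name)
            return "terminated"


def bad_task() -> None:
    # Hypothetical malfunction
    raise RuntimeError("wrote outside its own memory")


kernel = ToyKernel()
user_result = kernel.run("user_app", bad_task)

try:
    kernel.run("driver", bad_task, kernel_mode=True)
    halted = False
except KernelHalt:
    halted = True

print(user_result, kernel.terminated, halted)
```

In this model, the uncaught KernelHalt plays the role of the kernel halt. It deliberately does not try to clean up and carry on, because it has no way to know what the faulting code touched.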

On Windows, Microsoft follows up a kernel halt with a BSOD to let you know what happened.

Is a BSOD Appropriate?

For as long as the BSOD has existed, I have heard people say snarky things like, “Thanks, Bill Gates,” or “Thanks, Microsoft” every time one occurs. At first, that was mostly a joke told by people who knew better and laughed at by people who also knew better. I don’t think people know better anymore. When that poorly designed video card driver that your manufacturer hastily published without quality assurance tramples kernel memory, Microsoft did not cause that. When a Fibre Channel adapter has an electrical failure and starts blasting things into memory addresses that do not belong to it, Microsoft did not cause that either. The BSOD limits the extent of the damage.

If you know of a better way to respond to these events, then you might have a calling for kernel design. The entire world could benefit. Please help.

But [Insert Other Kernel Name] Does Not Halt as Often!

I hear forms of this claim frequently, but I have not seen evidence that supports it. I know that one kernel provider places extreme restrictions on the hardware and software allowed to run on its operating system, not to mention in its kernel space. I know another that has effectively no restrictions on what can run, but still has only a fraction of the number of software authors trying to run programs in its kernel space. I have seen no statistically neutral support for this claim.

Can Microsoft Do Anything to Reduce BSODs?

I suspect that Microsoft could reduce BSODs. However, that answer is far too simplistic to have value. The problem: reduce BSODs at what cost? We will not get additional kernel mode policing for free. I can think of two ways.

First, Microsoft could implement gatekeeping on software that makes it into the kernel. Goodbye, independent fly-by-night software geniuses cranking out kernel mode applications. Does the world have many of those? I don’t know. Is that question important? We all know that the world has a lot of criminally minded people cranking out nefarious kernel mode software. Should Microsoft limit kernel access to companies with enough resources to pass their requirements and the bad guys who can figure out how to break in? I would argue for an emphatic, “NO!” Aside from a general opposition to restricting benign programmers because of the actions of malignant programmers, another CrowdStrike-type event will eventually happen even if Microsoft insists on validating every piece of software that runs in its kernel space.

Second, the kernel itself could implement more restrictive policing. For starters, that kind of tightening has gone on continuously since the dawn of operating systems. I remember a virus from the 1980s that would instruct the video display unit to run at frequencies that no video display unit could support, causing systems to overheat and eventually catch fire. Moving hardware access under the purview of the kernel and kernel mode drivers ended all that. However, that also prevented all software from talking directly to hardware. It must now communicate with hardware via the kernel and drivers in kernel mode (for those so inclined, take a free “not all drivers” card here as long as you also take the big note that hardware that doesn’t need a kernel mode driver also won’t accept the kind of commands that necessitate kernel mode drivers). Every CPU cycle and memory access that the kernel needs to inspect kernel mode applications comes out of cycles and bus bandwidth that other applications could have used.

Why Does this Matter?

This discussion matters because the CrowdStrike incident has generated a lot of reactionary response directed at Microsoft. The public still does not know, and may never know, what really happened at CrowdStrike. However, we all know that the problem happened at CrowdStrike. Microsoft did not cause the problem. Simplistically, CrowdStrike either made an honest mistake, or it betrayed customer trust. Either way, we should not direct much blame at Microsoft.

Does that fully absolve Microsoft of the need to take action? I do not know. I might know enough about kernels to speak about them intelligently, but I certainly do not consider myself an expert. I do know that I do not want a hasty reaction to a single event by a third party to introduce slowdowns to my computers. We already went through that after speculative execution side-channel attacks started popping up. We accepted those performance hits because we needed them to stop the bad guys. While we should not ignore the damage that CrowdStrike caused, it would not be fair to classify them as bad guys. It would be even less fair to penalize everyone who creates kernel mode software for Windows.

We technical professionals also need to remember our duty. If we want a superhero slogan, then we can say, “We make the world of technology safer for non-technical people.” Because non-technical people, as well as technical people coming up after us, look to us for answers, we must take care to speak responsibly.