Decision Making for Server Administrators and System Architects

In a perfect world, server administrators would only need to install, maintain, and troubleshoot hardware and software. We would have access to detailed manuals, walkthroughs, and capable support staff. Employers would give us the freedom we need to select and deploy the best options. Industries and verticals would have standardized on the best solutions, and each would accept and integrate public suggestions and solutions. We don’t live in that world. Instead, we deal with conflicting information, politically driven decisions, substandard products, incompetent vendors, and outsized egos. We have fought these problems since the beginning of our profession, and we will always struggle against them. Unfortunately, the battle has produced its share of poor solutions. This article discusses the challenges and provides a sustainable decision-making framework for server administrators and system architects.


Beware Tribal Superstition

Throughout my time as a publishing author, I have never struggled against anything so much as entrenched superstition. At some point, someone reasoned their way to something, that reasoning sounded good to someone else, and it was passed on organically. Sometimes, that reasoning made sense in an earlier period of technology. Other times, no one had a way to test it. Too many times, it became “the way we do things,” and no one can safely challenge it.

I want to say that everything that contributes to these superstitions comes from good intentions. Unfortunately, that’s rarely the case. Mostly, I see false theories pushed by outsized egos. As I said in The Backup Bible, “Success breeds a success mentality.” For many in tech, the absence of overt failure suggests a perfect success. However, we get lucky a lot. You won’t see this problem anywhere as clearly as in the realm of digital security. A handful of threat actors have genuine talent and skill, but most fumble along with others’ leftovers. As a result, some administrators believe that a long track record without an uncovered security breach means that they’ve done a stellar job protecting their infrastructure. More often, it means that they did enough to keep out automated routines, casual interlopers, and the typical unskilled attacker. In worse cases, it could mean that they have a breach and do not know about it; someone intent on stealing information will try to avoid detection. Almost universally, these inflated egos work with small organizations and low-value information. They overlook how easily they can avoid attacks when they draw no interest from attackers.

In the worst of cases, you have senior administrators in technology service providers who have gotten themselves into a loop. During my time as a service provider, I came across this problem far too frequently. Small businesses don’t have the time to maintain a server and network infrastructure, nor do they have the income (or workload) for full-time technology staff. So, they bring in consultants and service providers, usually at an hourly rate. Since the contracting small business lacks the expertise to vet technology contractors, they sometimes wind up with a provider who doesn’t know that much more than they do. In more severe cases, the knowledge level of the contractor doesn’t come into play much; they just take advantage of the small business owner’s ignorance and trust. Either way, this starts a feedback loop with the contractor.

To see the loop, go back to the notion of “Success breeds a success mentality.” As long as the customer doesn’t recognize a problem, they continue to trust their provider. As long as the provider doesn’t encounter problems, they believe that they have none and continue doing things the same way. If they run into trouble, then the provider only needs a good explanation (even if the customer doesn’t understand it) and just enough effort to keep the customer happy. All by itself, this represents a loop. Worse, service providers typically bill their clients per hour, which creates a perverse incentive even when no one acts with deliberate treachery. These elements work against the client’s best interests, and ultimately damage the people working at the service provider as well.

Prove It

You have one simple way to break down superstitions: proof. If someone declares some process “the best” and it seems problematic, ask them for proof. When I do that, I usually get long-winded thought experiments, sometimes with minor anecdotes. That doesn’t work for me. Humans naturally decide what outcome we want, look for the evidence that supports it, and downplay anything that contradicts it. For that reason, I want proof that passes the academic standard: testable, repeatable, and demonstrable. Established academic and scientific disciplines have a process called “peer review” that enforces restrictions on submissions, which then receive critique from other credentialed experts. The peer review system has problems, and bad ideas sometimes proliferate there, but nothing like what happens in the pool of server and network administration knowledge. Our “peer review” mainly consists of Internet mailing lists, forums, and the like, with controls limited to basic moderation of language. In such arenas, the best ideas don’t win; the loudest do. Even during my time as a Microsoft MVP, I have abandoned debates with other recognized experts, despite being able to prove that I was right, because the exchange of ideas had turned into a battle of egos.

Despite the challenges, we have one major fact in our corner: computers are precise. In contrast, medical doctors and mental health practitioners always work with “sometimes”, “usually”, “frequently”, “rarely”, “uncommonly”, or some more scientific variant. Server administrators often get to say, “it always works this way or something has broken”. Modern hardware and software are unquestionably highly complex, but they all boil down to the same category of on/off mechanisms found in the first computing devices. Even when computing devices break, they exhibit predictable patterns. All the causes and fixes lie within the current scope of human knowledge (at least, as far as the parts that systems administrators deal with).

Therefore, whenever we encounter a claim, someone can devise a way to prove or disprove it. Few other disciplines so completely share that ability, and we do not take enough advantage of it. I’ll share some undying myths that I have encountered:

  • You must routinely defragment your disks: As SSD and NVMe take over from mechanical disks, we don’t see this myth as much. It certainly had a hot run for decades, though. The problem was, no one ever proved it. Instead, we got these long (and truthfully, well-designed) thought experiments that explained the complexity of disk I/O and how it left data scattered all over the drive. However, the presented problems always depicted a human as the protagonist. Certainly, trying to store and retrieve data in the way a hard drive does would drive a human to insanity. Computers don’t care that much. Also, the natural course of using a computer rarely results in the thoroughly scattered data distribution that these treatises describe. Sure, your system does some periodic data cleanup that helps a bit, but not full end-to-end, file-by-file, block-by-block defragmentation. And, yes, we have confirmed cases where a thorough defragmentation delivered measurable improvement. However, we do not have any proof that full defragmentation does anything meaningful for a statistically significant number of systems. When you consider how long we have had defragmentation utilities, that void of proof means something. First, put numbers to it (the first sketch after this list works through the arithmetic): if a thorough disk defragmentation pass requires six hours and restores 5% performance on 5% of systems, do you consider that time well spent? What if, to maintain that 5%, you must devote 2 hours per week to keep the loss of performance at 0.1% for the week, and most systems improve by less than 0.01%? Is it still worth it? Now, for a bigger question: will you take the measurements necessary to validate the existence of that improvement? When researching an earlier article, I discovered zero publications by anyone who had done that work. I found a stack of anecdotes, and the article that I wrote resulted in angry e-mails with even more anecdotes, but no data. Personally, I first discovered disk defragmentation in Norton Utilities back in the late 90s. So, in well over twenty years, no one has ever conclusively demonstrated that you need to perform full disk defragmentation regularly. No one. And yet, you could commonly find it on “best practice” lists.
  • Servers should not run antivirus or anti-intrusion systems: Hopefully by now, you have heard of “defense in depth”. Knowing that we cannot stop all threats, we use layers of security: firewalls, multi-factor authentication, passive threat scanning, active scans of all sorts, log reviews, event monitoring, and more. Hopefully, these layers allow us to prevent, or at least uncover, malicious activity. Of course, everything that you add to a system also adds a potential source of problems. We don’t want a cure worse than the disease. So, a thought experiment sprang up that points out that most threat actors make their way in through users. Based on that fact, the reasoning goes, you should not risk your servers with their own antimalware/anti-intrusion layer. They figure that you only need to protect the users’ computers. I encounter this thought experiment frequently, which suggests that a lot of systems administrators think this way. However, I have never seen this experiment paired with data. When I challenge it, the response (if I get one at all) usually assaults my intelligence or competence, and maybe it includes an unverifiable anecdote or two. Let’s begin with an examination of the thought experiment. First, while it is a fact that users are the most common attack vehicle, “most” does not mean “all”. Second, beware anyone who does not realize that the set of people that we call “administrators” also belongs to the set of people that we call “users”. Third, following from the second point, servers are also computers. So, the experiment makes no sense on its own. Next, we need to find out if we have any data to bring to bear. I mostly work with healthcare now; a recent statistic claimed that malicious attacks accounted for 44% of extended downtime, while system failures accounted for 0%. ZERO. Obviously, they rounded that number down (I can attest that failures result in a non-zero amount of lost time). However, if that statistic comes anywhere near the truth (and, in my experience, it does), then avoiding security software on a server in order to avoid problems will mathematically result in zero improvement to uptime while exposing the server to catastrophe-level risk. Instead of doing nothing to avoid the possibility of a bad cure, we should vet our security vendors so that we apply good cures. Sadly, I have seen the no-security-software-on-servers practice employed by service providers that get paid by the hour to clean up messes and would stay in business even if their clients failed. I do not know how many adopted this stance maliciously, but I see a clearly biased path for them to prefer this conclusion.
  • Performance matters above all: Users do not like slow systems. They have jobs to do, and staring at the wait cursor does not get them done. So, it seems that we systems administrators have incentive to build fast systems. So, often at the urging of the people selling us hardware and software, we buy the fastest stuff that management will fund. We stand it up in our labs, run benchmarks and stress tests against it, and compare results with our peers. If we don’t like something, we get on the Internet and sometimes on the phone with our vendor to look for ways to make it faster. We poke and prod and push and tinker until we squeeze every microsecond’s worth of performance out of it. Then, we push to production and… no one cares. We have a parallel in the motor vehicle industry. Look at the staggering difference between high-performance cars and the cars that regular people use on regular roads in their regular lives. All those four-cylinders out there would have no chance in a race against the burly ten-cylinder monsters, but almost no one ever pushes their four-cylinder to its capacity. See, while users will tell you that they don’t like slow systems just as readily as anyone will tell you that they don’t like slow cars, they really mean that they don’t want their computer (or their car) to slow them down. If they don’t see that wait cursor, then the time their computer sits idle means nothing to them, whether that’s 10% or 95%. To some extent, this myth does not matter. However, in the world of small business, where every dollar spent internally has competition, this practice leads to excessive waste. It also generates an unnecessary demand for technology, which, at scale, results in higher pricing for everyone. I see this practice from service providers, some innocently trying to save their own resources with “one-size-fits-most” solutions and others that just don’t know better. Unfortunately, a few charlatans will push the higher-priced items for those higher margins. Again, we use math to beat this one (the second sketch after this list runs the numbers). Sure, we administrators wish we had 10 gigabit or faster networks everywhere. But, when we have twenty users who submit about a dozen requests to the server every hour, each needing a few hundred kilobytes to transfer and store, then due diligence demands that we leave that 10 gigabit network in our dreams.
  • Poor technology decisions are the administrator’s fault: I have been calling out poor administrative practices, so let’s take the heat elsewhere for a while. Most of us who write and speak publicly started our careers in the same trenches as all the other administrators. But then, a lot of us move out of that world and permanently join the speaking and teaching circuits. That often connects us to a breadth of knowledge that few doing the day-to-day work can reasonably access. Over time, some immerse themselves so deeply in what should be that they forget what must be. I have become increasingly irritated at the burden of shame they try to lay on the shoulders of administrators. Has anyone ever chastised you for not upgrading all your operating systems within a short time of release? Do you get hounded for patches a few weeks out of date? Has anyone ever scoffed at your oddball hardware build? When you try to explain that you have internal or vendor restrictions to cope with, do your antagonists blame you for all of it? Have you been told, “Vote with your wallet,” or “just pick another vendor,” or “as the administrator, all technology decisions are yours”? Sure, some administrators are plain lazy and do not do their best. But I would venture that their number pales in comparison to those administrators who know better and want to do better but face some barrier beyond their control. In my experience, the market leader in any given line-of-business software category makes an overall terrible product and has no real competition. Customers in that industry cannot simply go without. So, if the vendor decides that they’re never going to support anything beyond Windows Server 2008 because they don’t want to go 64-bit, guess what happens? Their customers stay on Windows Server 2008. A few will defy that, and find themselves stuck paying software maintenance so that they can still get upgrades but have no backing from support. Others will go to competitors, trading one set of problems for another, and only rarely seeing a net improvement. One of my favorite examples from healthcare is asking the talking heads what to do about a major falling out with an fMRI supplier. When “just go to the competition” involves architects and construction crews and helicopters and traffic diversions and more money than most people will see in a lifetime, you tend to tolerate a lot. It’s easy to tell someone else to make changes when you don’t have to implement them.
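
To make the “put numbers to it” exercise from the defragmentation bullet concrete, here is a minimal back-of-the-envelope sketch in Python. The fleet size is a hypothetical stand-in; the hours and percentages come straight from the example above. Substitute your own measurements before drawing any conclusions.

    # Hypothetical figures from the defragmentation example; replace with your own.
    fleet_size = 100          # systems under management (assumed)
    defrag_hours = 6          # one thorough defragmentation pass per system
    affected_fraction = 0.05  # share of systems that see any gain
    gain_fraction = 0.05      # performance restored on those systems

    hours_spent = fleet_size * defrag_hours
    systems_helped = fleet_size * affected_fraction
    print(f"{hours_spent} hours of maintenance to improve "
          f"{systems_helped:.0f} systems by {gain_fraction:.0%} each")
    # Prints: 600 hours of maintenance to improve 5 systems by 5% each

Once the cost and the benefit sit side by side as numbers, the “is it worth it?” question mostly answers itself.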
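
The same approach answers the 10 gigabit question from the performance bullet. This sketch uses the request rate and payload size given above; the 300-kilobyte payload is an assumed midpoint for “a few hundred kilobytes”.

    # Hypothetical workload from the performance example; replace with your baseline data.
    users = 20
    requests_per_user_per_hour = 12
    kilobytes_per_request = 300   # assumed midpoint of "a few hundred kilobytes"

    bits_per_hour = users * requests_per_user_per_hour * kilobytes_per_request * 1024 * 8
    average_bps = bits_per_hour / 3600

    print(f"Average load: {average_bps / 1_000_000:.2f} Mbps")
    print(f"Share of a 10 Gbps link: {average_bps / 10_000_000_000:.4%}")
    # Prints roughly 0.16 Mbps, or about 0.0016% of a 10 Gbps link

Even if the real workload turns out ten or a hundred times larger, the arithmetic still settles the argument before any purchase orders go out.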

I could go on, and you could probably suggest a lot more. Look for the common thread: if a technology claim within the purview of a server administrator is true, then someone can prove or disprove it. Ask for the receipts.

Recognize Your Limits

I believe that people have good intentions until they prove otherwise. I think that most technology myths stem from ignorance and persist because we have so little time to examine them. Also, even those outsized egos that annoy us came from somewhere innocent. Most people who stay in the technology world have reasonably high intelligence and a high success rate. Unchecked, that grows into arrogance.

A few things to keep in mind:

  1. No matter who you are, if you started right now and devoted the rest of your life to visiting the console of every server on the planet and had no physical or security restrictions stopping you, you could not possibly touch a statistically significant number of them. Of course, the maximum deviation between servers is relatively low (as opposed to, say, the animal kingdom), so that doesn’t mean too much. However, it does mean that we all have limited experience.
  2. If you stay in this field long enough, you will eventually begin to hear things like, “I’ve never seen this before,” when working with support and vendors. You might even start to think that you must be some sort of statistical anomaly. You’re not. If your problems were common and you have experience, then you would not need to engage support.
  3. Few “cookie cutter” environments exist. Every customer, every facility, and every datacenter will have at least one thing that makes it a special snowflake. Just as snowflakes build into snowballs and avalanches, the things that make an environment unique will affect other things. You cannot know everything in advance, and all your experience will never fully prepare you for whatever you see next.
  4. From the perspective of the individual, the world of hardware peripherals and software is infinite. You will know some things fairly well and you will know others very well, but there will always be far more that you know nothing about.

If you can remember all these things, then you can maintain a much needed perspective. You will never know it all. That’s OK, because there are a lot of us. Collectively, we know it all. It’s not all on you.

A Decision-Making Framework for Server Administrators and Systems Architects

The world already has an abundance of decision-making frameworks. Some have fancy acronyms, others have lots of research, and many show up in college curricula around the world. They all look similar, and this one won’t break the mold. It only needs to exist because so many people rise to systems administrator or architect ranks with no exposure to any of the others. Like any good framework, you will not always use every component every time. Mainly, you want to develop habits that will serve you throughout your career.

  1. Understand the problem
  2. Reflect the problem
  3. Measure the problem
  4. Consider solutions
  5. Consult others
  6. Test solutions
  7. Implement solution
  8. Validate solution
  9. Look for additional problems

These seem self-explanatory, but most of the problems outlined earlier in this article arise from skipping them.

Understand the Problem

We could prevent a massive percentage of problems from occurring just by understanding the ones that we have. Our industry places a premium on speed and intelligence. So, finding and implementing a solution quickly often nets a great deal of praise. Admitting that you don’t know something can impair your reputation. If you work in a support organization, your employer may link your performance ratings to your speed and your number of reopened tickets. When you invoice a client by the hour, they expect you to keep that bill to a minimum.

These pressures lead to poor decisions and short-circuited thinking. Many administrators started in quick-response support positions, and they influence the ones who didn’t. The habits developed in end-user support rarely translate well to the datacenter. Some problems certainly have known, quick solutions. Many do not.

Most importantly, listen. Many long-time systems administrators and architects start throwing out solutions before really hearing the problem. They often miss details that would change their answer, or worse, they comment on technologies that they know nothing about just because they heard something familiar.

Reflect the Problem

You can use several verbs here: reflect, repeat, rephrase, etc. However you do it, summarize your understanding of the problem. If you can, state it back to whomever brought it to your attention. Otherwise, or additionally, write it down. This small exercise helps to ensure that you didn’t mistake a symptom for the problem or miss the problem entirely. The time spent on it also allows your brain to absorb the problem and start branching out toward solutions.

Sometimes, you have to loop back to understanding the problem. Really listening requires practice. Some people, especially frustrated users, cannot express their problems clearly. Often, an obvious fix does not address the basis of the problem.

Measure the Problem

Before moving on from understanding the problem, you need to grasp its scope. Consider a classic problem: a user reports that they can’t access a server. You spend a lot of time troubleshooting that user’s computer, only to realize that the entire network segment of that user’s area has failed. You might have caught that with a more careful evaluation of the problem, but probably even faster by finding out if anyone else was having trouble.

Even if that scenario seems improbable (the other users probably wouldn’t sit idly by), such things happen frequently. Also, we have plenty of other scoping difficulties. Someone says their system is slow… what does “slow” mean? You could jump right in and apply everything you know about speeding up computers, but what if the user has unrealistic expectations? What if some upstream problem manifests itself as local slowness? Furthermore, we need to triage. One user’s slowness can wait if you need to restore a crashed server system.

I want you to take away one large thing from this point: you need baselines. You cannot accurately understand slowness or sizing problems without them. For example, if your supervisor asks you to project SAN storage needs for the next five years, you need historical information to do anything except guess. If a vendor barges in and insists that you must upgrade everything to 10 gigabit, you can’t evaluate the truth of that claim without knowing what your typical network load looks like. Capturing baselines takes time, and you need to take them regularly. However, the investment pays off.
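
As a starting point, the sketch below (a minimal illustration, assuming the third-party psutil package and a hypothetical baseline.csv file) appends one sample of common counters to a CSV. Run it on a schedule, and over time you accumulate the history that sizing projections and “is this slow?” questions require.

    # Append one sample of system counters to a CSV file.
    # Run on a schedule (cron, Task Scheduler) to build a baseline over time.
    import csv
    import datetime

    import psutil  # third-party package: pip install psutil

    def capture_sample(path="baseline.csv"):
        disk = psutil.disk_io_counters()
        net = psutil.net_io_counters()
        row = {
            "timestamp": datetime.datetime.now().isoformat(timespec="seconds"),
            "cpu_percent": psutil.cpu_percent(interval=1),
            "memory_percent": psutil.virtual_memory().percent,
            "disk_read_bytes": disk.read_bytes,
            "disk_write_bytes": disk.write_bytes,
            "net_bytes_sent": net.bytes_sent,
            "net_bytes_recv": net.bytes_recv,
        }
        with open(path, "a", newline="") as handle:
            writer = csv.DictWriter(handle, fieldnames=row.keys())
            if handle.tell() == 0:  # brand-new file: write the header first
                writer.writeheader()
            writer.writerow(row)

    if __name__ == "__main__":
        capture_sample()

Dedicated monitoring products do this far better, but even a crude record beats having nothing when someone asks you to project five years of growth.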

Consider Solutions

Once you understand the problem and its scope, then you can start working through solutions. Rarely will you have one perfect answer. Most involve tradeoffs. Most importantly, don’t jump to implement the first idea for anything non-trivial (and even some trivial problems). Reflexive answers lead to problems that persist even though a permanent solution exists.

Remember my warnings about superstitions? If you did not encounter them earlier, possibly while trying to understand the problem, they will likely pop up here. Slow computer? Defrag! What makes that a bad answer? How do you even know the slowness involves disk I/O? Have you even determined that the source of the slowness exists on the reported computer? Even if you have that one in a billion situation where defrag will help, you cannot justify skipping due diligence.

Some models involve intuition. Intuition works best with uncertainty. Computers do not have unpredictability (until advances in AI and new hardware paradigms make this a lie, anyway). Also, you do not need to rationalize your way to a solution, at least not entirely. Whatever you think might work, you can try it and then test it. As an example, a simple performance comparison against a baseline (fancy way to say “before and after pictures”) will allow you to determine conclusively if an optimization was effective. You do not need to guess.
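
As an illustration of that before-and-after comparison, the sketch below assumes two hypothetical CSV baselines (in the format captured earlier) and compares the average of one column; the file names and the column are placeholders.

    # Compare the mean of one metric across two baseline captures.
    # File names and the metric column are hypothetical placeholders.
    import csv
    import statistics

    def mean_metric(path, column):
        with open(path, newline="") as handle:
            return statistics.mean(float(row[column]) for row in csv.DictReader(handle))

    before = mean_metric("baseline_before.csv", "cpu_percent")
    after = mean_metric("baseline_after.csv", "cpu_percent")
    print(f"cpu_percent: {before:.1f} -> {after:.1f} ({after - before:+.1f})")

If the numbers barely move, the “optimization” did not do what its advocates promised, no matter how convincing the thought experiment sounded.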

Gather data! So many solutions, especially superstitious ones, solve problems that no one has. Sometimes they waste money by prompting improper purchases. Often, they cost time by not addressing the real problem. In the worst cases, they remove a symptom while leaving the problem behind. You need that “prove it” mentality here.

Consult Others

Unless you must solve the problem immediately, you have time to find people with more experience. Egos get in the way a lot here, but remember: you have not seen it all. Insulated decision making also results in problems that recur needlessly. You may get new information that makes you reconsider your options. Unfortunately, you will often get suggestions from people who don’t understand the problem. Learn to say, “Thank you for your input,” and move on.

Test Solutions

Resist the urge to implement a fix immediately. Situational demands can override this, of course. When possible, try multiple solutions before implementing anything. You want an understanding of potential side effects. More importantly, many fixes that look viable do not help. You want to find that out in private before putting a non-functional fix in place with an audience.

Failures here may necessitate falling all the way back to considering solutions. Hopefully, you had more than one solution lined up and can simply move to the next.

Something else that almost deserves its own step in the framework: have a rollback plan. Unless your fix ships with some sort of formal way to undo it (like uninstalling a patch), you will need to come up with one. Test that before moving to implementation. Watch for side effects. You cannot consider anything less than a complete restoration of the prior state a clean rollback.

Implement Solution

In probably the most straightforward step, you put your selected and tested fix in place. Remember to follow institutional rules around things such as notifications and change tracking. If possible, take some sort of backup first. While not technically a backup, snapshotting or checkpointing systems can give you a fast way to revert a bad change. Never change a customer’s system without telling them, no matter how minor the change or how certain the repair.

Validate Solution

Everyone has a story of a time that everything worked swimmingly in test and failed spectacularly in production. Remember that bit about every environment being a special snowflake? We should always strive to make test environments a facsimile of production environments, but we will always fall short. Expect something to happen that you did not anticipate. Test thoroughly. Watch for side effects. Again, gather data! You may not immediately know whether something worked. Comparisons between the prior and current state help the most.

Look for Additional Problems

In a way, you will satisfy the search for additional problems while validating previous solutions. However, the larger point is realizing that any problems you find trigger another iteration of the decision-making process. Should you roll back? If you keep the change, treat whatever you found as a new problem/solution branch.

Practice, Practice, Practice

We often joke about how medical professionals say that they “practice” their trade. “Practice” does not seem like something that you should do on a patient! However, it’s an accurate depiction. Each time they carry out a procedure, even one that they perform several times each day, they walk through specific steps in a determined course. Over time, they get faster and produce better results. Likewise, every time that you go through the decision-making process, whether in a live or test environment, you gain experience that you will carry forward.