10 Principles for Successful DevOps Infrastructure Architecture
In my current role at a startup, I’ve been fortunate to be entrusted with creating pretty much the whole of the company’s operations and infrastructure from almost nothing — truly a worthwhile adventure and one of my proudest professional achievements. I wanted to share some of the guiding principles I learned along the way that I find useful for designing robust architecture.
Simple is better than complicated — One of the reasons that electric cars will soon take over the world is that, aside from the impressive technology that goes into making batteries (and the questionable Twitter habits of their creators), electric cars are just simpler than internal combustion cars. Electric cars have only a handful of moving parts — the motor, the suspension, the wheels. This is a huge advantage for durability and reliability. In a word, electric cars are elegant. Internal combustion cars have thousands of moving parts, spinning, crashing, and grinding, each constantly wearing out and getting ready to fail. Similarly, an infrastructure with a large number of moving parts is inviting failure. How many load balancers are you using? How many NAT gateways? How many stages are in your build scripts? How many database servers? Could these be reduced and simplified? It is easy to let complexity sprawl build up in your infrastructure as new services are added. It takes thought, creativity, and discipline to design systems that are both scalable and simple — to design an elegant system.
Dumb is better than smart — One time my team came up with a smart solution for monitoring crashed customer-facing services: we wrote log parsing scripts that searched log output for key phrases such as “500”, “timeout”, and so on, and alerted us when those entries appeared. The scripts that parsed the logs were clever and well constructed. However, they were also a failure. We hadn’t anticipated every failure state in the system, and one day we experienced a customer-facing outage when the authentication service database crashed. Our log parsing scripts did not inform us of this unexpected issue, and the problem persisted for some time before we discovered it. We could have added this edge case to our scripts, but we realized that would only buy us time until the next unexpected edge case came along, and we would miss that one too. Instead, we tossed the whole log-parsing idea and ran a crash program to install active probes. These probes are quite dumb. All they do is send test queries to our production systems, and if those queries fail for any reason, we are alerted and paged. So far, these “dumb” probes have proven highly successful, and we have not missed an outage since adopting them.
Dumb is the difference between a tank and a Ferrari. We all know that Ferraris are cool, high-performance, and fast, but they also break down a lot and need lots of attention. When designing infrastructure, you want tanks that keep rolling through the muck, not sports cars that look good on the mechanic’s lift.
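For the curious, here is a minimal sketch of the kind of probe I mean. The endpoints and the paging function are hypothetical stand-ins, but the shape is the same: make a real request against the service, and page on any failure at all.

```python
# dumb_probe.py: a deliberately "dumb" active probe (sketch).
# It knows nothing about log formats or failure modes; it only asks
# whether a real request against the service still works.
import urllib.request

# Hypothetical endpoints; substitute your real customer-facing checks.
PROBES = {
    "auth-service": "https://auth.example.com/healthz",
    "api": "https://api.example.com/v1/ping",
}


def page_on_call(service: str, reason: str) -> None:
    """Stand-in for whatever actually pages you (PagerDuty, email, carrier pigeon)."""
    print(f"PAGE: probe for {service} failed: {reason}")


def probe(service: str, url: str) -> None:
    try:
        # urlopen raises on timeouts, connection errors, and HTTP error codes,
        # which is exactly the point: any failure, for any reason, pages us.
        urllib.request.urlopen(url, timeout=5)
    except Exception as exc:
        page_on_call(service, repr(exc))


if __name__ == "__main__":
    for name, target in PROBES.items():
        probe(name, target)
```

Run something like this on a schedule every minute or two and you get most of the value with none of the cleverness.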
Beautiful is better than ugly — Look at your infrastructure diagram. How does it make you feel? Are you filled with wonder, as if viewing Starry Night in the Van Gogh Museum, or filled with unease, as if chewing on a piece of aluminum foil? Are there lots of arrows pointing this way and that? Or is the diagram organized and so easy to understand that it flows like a Notorious B.I.G. freestyle? Good architecture is beautiful. Good architecture is elegant and balanced, and it will just look right. If something looks off or ugly, there is a good chance that some experienced and war-weary part of your brain is trying to tell you something. Definitely trust that part of your brain; it will keep you out of trouble.
Symmetry is good — If you draw a diagram of your infrastructure, does it look balanced? Do you have even numbers of things on the left and the right? Or do some things lean over to one side, leaving the other side empty? This might be a clue that you are not distributing your services with an eye to high availability. For example, do you have all of your production systems in one Availability Zone or datacenter and your dev systems in another? What happens, then, if the AZ hosting production goes out of service due to a cut cable, a fire, or a disgruntled employee? Now you are scrambling to deploy your production environment somewhere else. Better order some pizza, because this will take a while. (I know because I have done this!) If you had spread your production systems across both datacenters, your services would have survived the loss of one of them.
It’s tempting to allow things to pile up in your “main” datacenter, but it’s important to resist this temptation and treat all your datacenters or AZs as first-class citizens. Your design and deployment automation should always assume that you are running things in more than one place at the same time. Your infrastructure should be symmetrical.
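As one illustration (a sketch, not our actual tooling; the zone names and services are made up), a placement rule in your deployment automation can simply refuse to deploy anything that would leave a zone empty:

```python
# placement.py: sketch of a placement rule that treats every zone as a first-class citizen.
# Zone names and services are hypothetical.
from itertools import cycle

ZONES = ["zone-a", "zone-b"]


def place(service: str, replicas: int, zones: list[str] = ZONES) -> dict[str, list[str]]:
    """Spread replicas round-robin across zones; refuse lopsided deployments."""
    if replicas < len(zones):
        raise ValueError(f"{service}: need at least one replica in each of {len(zones)} zones")
    placement: dict[str, list[str]] = {zone: [] for zone in zones}
    for i, zone in zip(range(replicas), cycle(zones)):
        placement[zone].append(f"{service}-{i}")
    return placement


if __name__ == "__main__":
    print(place("checkout-api", 4))      # two replicas land in each zone
    try:
        place("billing-worker", 1)       # refused: it would leave a zone empty
    except ValueError as err:
        print(err)
```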
Use clusters with caution — The Boeing 747 Jumbo Jet has four engines for obvious reasons — four engines are clearly safer than two! You can lose two engines and keep flying. However, research in the 1980s showed that this intuitive belief was not actually true. Airplanes with two engines were not only just as safe statistically as four-engine aircraft, but were also cheaper, easier to maintain, and quieter. It turns out that having more engines just opened the door to more engine failures, and in the end the extra engines were simply not worth the trouble. Consequently, Boeing and Airbus no longer make four-engine passenger planes.
This same principle applies to clustering technologies in computing. It’s quite easy to assume clusters automatically bring the benefits of high availability, scalability, and self-healing that the vendors promise, but in my experience, this all comes at a cost, and may not actually make your systems more stable.
A well-engineered clustering technology can actually be a tremendous bonus when it’s working properly. At my current company, we have been using Kubernetes clusters for almost four years. I don’t recall a single case where we had a problem that came down to Kubernetes itself acting up. It’s been rock stable. In fact, our clusters have saved our bacon more than once, when single VMs or instances crashed, or services failed health checks, and Kubernetes acted on its own to respawn lost pods or restart stuck services before we had even reacted to our alerts and monitoring.
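Restarting a stuck service, in particular, only works because each service declares a health check that Kubernetes can act on. Here is a rough sketch of such a check using the official Kubernetes Python client’s model classes; the image name and endpoint are hypothetical.

```python
# liveness_sketch.py: the kind of health check that lets Kubernetes restart a
# stuck service on its own, built with the official Python client's model classes.
# Image name and endpoint are hypothetical; requires the "kubernetes" package.
from kubernetes import client

probe = client.V1Probe(
    http_get=client.V1HTTPGetAction(path="/healthz", port=8080),
    initial_delay_seconds=10,  # give the service time to start
    period_seconds=15,         # check every 15 seconds
    failure_threshold=3,       # three misses in a row and the container is restarted
)

container = client.V1Container(
    name="auth-service",
    image="registry.example.com/auth-service:1.2.3",
    ports=[client.V1ContainerPort(container_port=8080)],
    liveness_probe=probe,
)

# Print the generated spec; in real automation this container would be wrapped
# in a Deployment and applied to the cluster.
print(client.ApiClient().sanitize_for_serialization(container))
```

You would normally declare the same thing in a manifest, but the idea is identical: if the health endpoint stops answering, the container gets restarted without anyone being paged first.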
I can’t be as generous with one of our, shall we say, “Stretchy” log-searching tools that run in a cluster. This technology, although mature and well funded, has been an ongoing headache for our team. It requires constant tweaking, it fails gracelessly into a state that needs extensive repair when it crashes, and it demands a disproportionate amount of effort from the team to keep running. It fails, in my opinion, to satisfy the first two principles on this list: it isn’t simple, and it’s too clever by half. It is great when it works, but the added complexity of running it in a cluster has offset the advantages.
Trust the veterans — By veterans, I mean the grizzled, battle-hardened, scuffed, and battered technologies that have withstood the test of time. Are you doing something that persists lots of data? Surely you need something new, shiny and exciting. NoSQL! Document databases! Graph databases! Clusters! Unlimited Power!
Right, but have you tried this with Postgres yet? Because here’s the thing about Postgres — it just works. It’s dead simple to manage. Everyone knows how to use it, and it’s a lot more flexible than you might think. Postgres has been around for decades for the same reason sharks have been around longer than trees: survival of the fittest. Postgres has been assaulted by Mongo, speared by Aerospike, and had its doom foretold by Cassandra. And yet it lives, because it’s so good at what it does. That’s not to say there are no good use cases for these other technologies; it’s just that you need to be really sure you need what they offer before overlooking the tried and true.
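One concrete example of that flexibility: Postgres can store, index, and query JSON documents natively with its jsonb type, which covers a surprising share of “we need a document database” requirements. A minimal sketch with the psycopg2 driver (connection details and schema are hypothetical):

```python
# jsonb_sketch.py: storing and querying JSON documents in plain Postgres.
# Connection details and schema are hypothetical; requires the psycopg2 package.
import psycopg2
from psycopg2.extras import Json

conn = psycopg2.connect("dbname=app user=app password=secret host=localhost")
with conn, conn.cursor() as cur:
    cur.execute("""
        CREATE TABLE IF NOT EXISTS events (
            id      bigserial PRIMARY KEY,
            payload jsonb NOT NULL
        )
    """)
    cur.execute(
        "INSERT INTO events (payload) VALUES (%s)",
        [Json({"type": "signup", "plan": "pro", "user": "alice"})],
    )
    # Query inside the document; no document database required.
    cur.execute(
        "SELECT payload->>'user' FROM events WHERE payload->>'type' = %s",
        ["signup"],
    )
    print(cur.fetchall())
```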
This same principle can be applied to lots of other technology. You must always weigh the benefits of something new that offers better performance against something old that offers reliability, predictability, and stability. In the end, 99.99% uptime is a strong argument against the fastest of the fast and the newest of the new.
Use the fewest tools possible — Remember PostgreSQL? This is kind of a corollary to that. If you are asked to support a new service, every effort should be made to see if that new service can be made to work with something you are already using. Do you need to automate something? Can it be done with cron and Jenkins or GitLab or CircleCI or WebSphere or whatever wretched deployment tools you are already using? Yeah, maybe not perfectly or in a really cool way, but can you make it work? Because adding a new tool means all of the work of the original task, plus all of the work of supporting and maintaining the new tool. In fact, adding new tools is probably the worst thing you can do for reliability, stability, and the mental health of an operations team. Add tools if you must, but fight them if you can! And always make clear to the other stakeholders in the business how much new tools actually cost.
Community matters — A few years ago I got into a friendly argument with a fellow engineer about whether we should adopt Kubernetes or Docker Swarm. I boldly predicted that Swarm was the one to bet on. There were good reasons for that! We were already using Docker and Docker tooling everywhere. Swarm leveraged the same commands and config formats we were already using. Compared to Kubernetes, it was simpler and had fewer moving parts. It had strong backing from a vendor in case we needed service contracts. My colleague agreed, but pointed out that the community had already chosen, and it had chosen Kubernetes. He was right. By turning the code loose and putting gifted communicators like Kelsey Hightower on the job to evangelize it, Google had put Kubernetes front and center, and the community embraced it wholeheartedly.
So, today Kubernetes is still too complex (in my opinion!), but it is richly documented, well supported by third parties, constantly improved and updated, and has become the de facto standard platform for cloud computing almost everywhere. My lesson: when you are placing your bets on a technology that will scale and grow with your company, go with the one that has the strongest community.
Write more documentation than you think you need — Keep a wiki. Write in it a lot. Make your team write in it a lot. It doesn’t need to be pretty, but it should have lots of pictures, and the pictures should be in a format that is easy to edit so it’s not hard to keep them up to date. Make it a requirement that you can’t close a work ticket until you have updated something in the wiki. If someone asks you a question, you should be able to tell them they can look it up in the wiki, and it should make sense to them. This is like wearing sunscreen; trust me on the wiki.
Do the annoying stuff first — What’s annoying? Security is annoying. User and account management is annoying. Privacy and compliance are annoying. Nobody wants to get involved with that stuff, because it gets in the way of the real work of delivering features and getting paid. The thing is, it’s way harder, maybe impossible, to add these things after you have built all the other parts of your infrastructure. What you get instead is a bunch of hacky and inconsistent workarounds that get bolted on and become a source of needless toil and heartache. Better by far to build these things into your systems from the beginning.
Almost everything has a login, so you need some kind of single sign-on system to manage user accounts. Have a policy in place to enforce its use. Having twenty different logins at work is sucky and inefficient and should be avoided.
Have TLS certs? Better manage those little time bombs before they explode and take production down.
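Even a tiny scheduled check like the following sketch (standard library only; the hostnames are placeholders) beats learning about an expired certificate from a customer:

```python
# cert_check.py: sketch of a "dumb" TLS expiry check, standard library only.
# Hostnames are hypothetical; wire the alert into whatever pages you.
import socket
import ssl
from datetime import datetime, timezone

HOSTS = ["www.example.com", "api.example.com"]
WARN_DAYS = 21


def days_left(host: str, port: int = 443) -> int:
    """Return the number of days until the host's TLS certificate expires."""
    ctx = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=5) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            cert = tls.getpeercert()
    expires = datetime.fromtimestamp(
        ssl.cert_time_to_seconds(cert["notAfter"]), tz=timezone.utc
    )
    return (expires - datetime.now(timezone.utc)).days


if __name__ == "__main__":
    for host in HOSTS:
        remaining = days_left(host)
        if remaining < WARN_DAYS:
            print(f"ALERT: certificate for {host} expires in {remaining} days")
```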
Almost everything has credentials, too. How are you going to keep people from putting those little rascals into their source code? Better to have a mechanism and a policy for that from the beginning than to try to fish them out line by line later, when a big customer decides they want to audit your company’s security policies.
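The mechanism does not have to be fancy at first. A crude pre-commit scan like this sketch (the patterns are illustrative, not exhaustive, and a dedicated secrets scanner and secrets manager are the real goal) already stops the most common mistakes:

```python
# secret_scan.py: crude sketch of a pre-commit credentials check.
# Patterns are illustrative only; real scanners cover far more cases.
import re
import subprocess
import sys

PATTERNS = [
    re.compile(r"AKIA[0-9A-Z]{16}"),                      # AWS access key ID
    re.compile(r"-----BEGIN (RSA |EC )?PRIVATE KEY-----"),
    re.compile(r"(password|secret|api[_-]?key)\s*[:=]\s*['\"][^'\"]+", re.I),
]


def staged_files() -> list[str]:
    """List the files staged for the current commit."""
    out = subprocess.run(
        ["git", "diff", "--cached", "--name-only"],
        capture_output=True, text=True, check=True,
    )
    return [line for line in out.stdout.splitlines() if line]


def main() -> int:
    exit_code = 0
    for path in staged_files():
        try:
            with open(path, encoding="utf-8", errors="ignore") as handle:
                text = handle.read()
        except OSError:
            continue
        for pattern in PATTERNS:
            if pattern.search(text):
                print(f"possible credential in {path}: {pattern.pattern}")
                exit_code = 1  # non-zero exit blocks the commit
    return exit_code


if __name__ == "__main__":
    sys.exit(main())
```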
The lesson here is that if you find yourself in the enviable position of creating infrastructure from scratch, it’s best to lay down the tools and the processes that lock in best practices while you still can.
To sum it up — keep it simple, balance it out, make it pretty, write it down, be a hard case about compliance, keep your head, bring a sweater, and you will enjoy the fruits of a reliable and stress-free infrastructure!