Infrastructure as Code is Not the Answer!
You know the sales pitch, of course. If you work in the DevOps, Platform or SRE field, you have probably made it yourself. Infrastructure as Code! The benefits are many, varied and self-evident:
- It’s reproducible!
- It’s self-documenting!
- It’s visible!
- It prevents mistakes!
- It lowers cost!
- It prevents drift!
- It prevents toil and increases joy!
All of these things, no doubt, can be true. In an ideal world, all of these benefits would be obvious and vast for any organization willing to make the up-front investment in converting the deployment of infra to code and configuration management. And most of us are willing. But what about things that, let’s say, are not in an ideal world? What are the actual costs, risks and difficulties in adopting IAC? Where does the cool sales pitch begin to melt down under the glaring, hot lights of reality? Let’s take a look at some of the benefits above, and examine some of those assumptions.
It’s Reproducible!
Well, sort of. I would say that infrastructure code is awesomely reproducible — for a while. But then… things get flaky. Packages go out of date. Features are deprecated and stop working. Code syntax changes. Images are no longer available. Dockerhub wants you to pay. Someone’s user account got built in, and they don’t work here anymore. Lots of the interlocking dependencies that make a deployment work simply change with time. The point is that deployment code has a shelf life, and if you don’t constantly refresh and update it, it will not work when you suddenly need it to.
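One small hedge against this kind of rot is to at least pin the versions you depend on, so that breakage happens when you choose to upgrade rather than ambiently. As an illustrative sketch (the regex and file layout are simplifying assumptions, not a real HCL parser or linter), here is a Python check that flags Terraform provider blocks carrying no version constraint:

```python
import re
from pathlib import Path

# Illustrative check: flag provider blocks in .tf files that have no
# "version" constraint. A real tool would parse HCL properly; this
# regex sketch only handles the simple, flat block layout.
PROVIDER_BLOCK = re.compile(r'provider\s+"(\w+)"\s*\{([^}]*)\}', re.DOTALL)

def unpinned_providers(tf_source: str) -> list:
    """Return names of providers whose block has no version constraint."""
    flagged = []
    for name, body in PROVIDER_BLOCK.findall(tf_source):
        if "version" not in body:
            flagged.append(name)
    return flagged

def scan_directory(root: str) -> dict:
    """Scan every .tf file under root and report unpinned providers."""
    report = {}
    for tf_file in Path(root).rglob("*.tf"):
        hits = unpinned_providers(tf_file.read_text())
        if hits:
            report[str(tf_file)] = hits
    return report

sample = '''
provider "aws" {
  region = "us-east-1"
}
provider "google" {
  version = "~> 4.0"
  region  = "us-central1"
}
'''
print(unpinned_providers(sample))  # -> ['aws']
```

Pinning won’t stop deprecations or vanished images, but it turns “it silently broke last month” into a diff you can see.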
It’s Self-Documenting!
Again, sort of. There are a couple of flaws with this argument, though. The first issue relates to the problem above. Your code starts to get moth-eaten after a while. This means that an engineer who is viewing your code for the first time may start chasing down rabbit holes, trying to get some old module to work in a new cloud environment where a bunch of assumptions about how things are done no longer apply. I have colleagues who have spent days chasing after failed deployments, debugging things step-by-step, and getting totally confused because what they see in the code doesn’t match reality. In truth, it probably would have been faster and better if they had just started over and written new code.
The second issue is that code is, in my opinion, really not good at being documentation. I have written code myself, come back to it six months later, and been totally baffled by what I was doing there. There are all sorts of hacks, half-measures, and just plain wrong stuff you can do when writing code, because the odds are good that you were learning what you were doing as you were doing it. There are plenty of things you learn in the meantime that can make you want to cry when looking at your first couple of attempts. Infrastructure code is no different, and it can lock into place all sorts of janky assumptions and approaches that will only confuse the engineers who come back to look at it later.
It’s Visible!
Yes, it is visible — the same way it would be visible if I spray-painted Cyrillic graffiti on the walls of my CTO’s corner office. It would be cool and transgressive, but nobody would know what it said. The issue is that the code is, well, code. It’s not something that a non-engineer can easily read, or even an engineer who is not familiar with Terraform or whatever tool you happen to use. It takes practice and training to understand the information it contains, and even experienced engineers may take some time to fully work out where all the info is located, especially if you have lots of nested files and dependencies and scripts that call scripts that call scripts. The point is that, yes, you can see it, but that doesn’t mean you can quickly or easily understand it.
It Prevents Mistakes!
Or, it can amplify them like the deuterium jacket surrounding the core of a hydrogen bomb. I heard a story once of an engineer running a “terraform destroy” on his laptop — in the wrong tab. He was in the Prod directory — and he started nuking production. Realizing his mistake, he Control-C’d out of it, but it was too late. A lot of stuff was already gone, and on top of that, killing the process left the state file in a confused state. They were able to recover by manually fixing the state file and re-running Terraform, and the issue was resolved in a couple of hours. But it was an awfully easy mistake that anybody could have made.
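You can put guardrails in front of this class of mistake. Terraform itself has a `prevent_destroy` lifecycle flag, but teams also wrap the CLI. The sketch below is a hypothetical wrapper (the directory-naming convention and the `PROTECTED_MARKERS` list are assumptions, not anything Terraform ships): it refuses a destroy in anything that looks like production unless the operator retypes the environment name.

```python
# Illustrative destroy guard: a wrapper you might invoke instead of
# calling `terraform destroy` directly. Assumes the convention that
# the working directory path names the environment (e.g. .../prod).
PROTECTED_MARKERS = ("prod", "production")

def destroy_allowed(cwd, typed_confirmation=None):
    """Allow destroys freely outside protected environments; inside one,
    require the operator to have typed the environment name exactly."""
    path_parts = [p.lower() for p in cwd.replace("\\", "/").split("/")]
    protected = [p for p in path_parts if p in PROTECTED_MARKERS]
    if not protected:
        return True
    return typed_confirmation == protected[-1]

# Outside prod, destroys pass; inside prod, a retyped name is required.
# A real wrapper would prompt, then exec terraform on success.
print(destroy_allowed("/repos/infra/staging", None))  # -> True
print(destroy_allowed("/repos/infra/prod", None))     # -> False
print(destroy_allowed("/repos/infra/prod", "prod"))   # -> True
```

Ten lines of friction like this would have turned that outage into a typo.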
It Lowers Cost!
If everything works perfectly, and nothing ever changes, and you do the same things all the time, sure. But in my experience, the effort to keep your code up to date is a sink for engineering time. It takes hours and hours per day for engineers to debug, troubleshoot, learn and write the code. It might take days of work to automate a task that would only take minutes if you just logged into the Console and clicked some boxes. If you use your code all the time, then yes, it’s worth the investment. But you really need to ask yourself how many times you will actually deploy a new RDS instance, or a new CloudFront Distro. Many of these tasks are one-offs that you may never do again. Do you really need to spend days automating that?
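The cost question becomes concrete with a back-of-the-envelope break-even calculation. The numbers below are made up for illustration, and the formula deliberately ignores ongoing maintenance cost, which, as argued above, is not zero:

```python
import math

def runs_to_break_even(hours_to_automate, manual_minutes_per_run,
                       automated_minutes_per_run=1.0):
    """How many runs before the automation pays for itself.
    Ignores the ongoing cost of keeping the code alive."""
    saved_per_run = manual_minutes_per_run - automated_minutes_per_run
    if saved_per_run <= 0:
        return math.inf  # automation never pays off
    return math.ceil(hours_to_automate * 60 / saved_per_run)

# Hypothetical numbers: three working days (24h) to automate an RDS
# deployment that takes 30 minutes of clicking in the Console, versus
# 1 minute to kick off the pipeline.
print(runs_to_break_even(24, 30))  # -> 50
```

Fifty deployments before you break even. For a service you stamp out weekly, easy call; for a one-off CloudFront Distro, the Console wins.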
It Prevents Drift!
This is true if, and only if, you only ever, ever, ever change things in your environment through code. The moment a person has access to the console, you can get drift. So who is this feckless miscreant, logging into the console and making changes out of band? Well, sometimes it’s you. Because there’s an outage and you need to throw a new certificate on your load balancer, or because you’re being DDoSed and you need to add a CloudFront Distro to soak up incoming requests. Or because you need to scale something up to handle unexpected load. Now obviously, you can go back after the fact and retrofit your code to reflect the changes you made. This may seem silly, but in my experience this sort of thing happens with some regularity in the real world.
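If out-of-band changes are a fact of life, you can at least detect drift on a schedule instead of pretending it never happens. Terraform’s `terraform plan -detailed-exitcode` exits 0 when reality matches state and 2 when it doesn’t; the wrapper below is an illustrative sketch of a scheduled drift check, not a production tool (nothing in the snippet actually invokes Terraform, and the paging behavior is just a print):

```python
import subprocess

# Exit codes for `terraform plan -detailed-exitcode`:
#   0 = no changes, 1 = error, 2 = plan succeeded but changes are pending.
CLEAN, ERROR, DRIFT = 0, 1, 2

def plan_command(workdir):
    """Build the drift-check command for one Terraform directory."""
    return ["terraform", f"-chdir={workdir}", "plan",
            "-detailed-exitcode", "-input=false", "-no-color"]

def interpret(exit_code):
    """Map the plan exit code to a human-readable verdict."""
    return {CLEAN: "in sync", DRIFT: "DRIFT DETECTED"}.get(exit_code, "plan failed")

def check_drift(workdir):
    """Run the plan and report. A real setup would post to chat or page
    on drift rather than printing to stdout."""
    result = subprocess.run(plan_command(workdir), capture_output=True)
    print(f"{workdir}: {interpret(result.returncode)}")
    return result.returncode
```

Run from cron or CI against each environment directory, this turns “someone clicked something last month” into a same-day alert, and tells you exactly which code needs retrofitting.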
It Prevents Toil and Increases Joy!
You want toil? How about updating Terraform version 0.10 to Terraform 1.3? There are pretty significant changes in syntax, some of which automated tools can handle. But old versions also required lots of wonky workarounds that newer versions have improved on as features were added. This means that automated syntax-rewriting tools are not enough; you need to refactor the whole repo end-to-end. I have colleagues who have been working on Terraform migrations for more than a year. It’s a huge job that requires many, many engineering hours. It’s sometimes hard to convince your boss that you need to do this, especially if you have already consumed your ration of goodwill in sinking man-months of effort into creating the code in the first place.
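To give a flavor of what changed: in 0.11-era code every expression lived inside `"${...}"` string interpolation, while modern Terraform writes bare expressions. The toy rewriter below handles only the trivial case of a whole-string interpolation; it is a sketch of the mechanical part, not a substitute for the real `terraform 0.12upgrade` tool or the hand refactoring that follows it:

```python
import re

# Matches an attribute value that is exactly one "${...}" interpolation,
# e.g.  ami = "${var.ami_id}"  ->  ami = var.ami_id
# Deliberately naive: mixed strings like "web-${var.env}" are left alone,
# since those remain legal (and necessary) in modern Terraform.
WHOLE_INTERPOLATION = re.compile(r'"\$\{([^{}"]+)\}"')

def modernize_line(line):
    """Rewrite a 0.11-style whole-string interpolation to a bare expression."""
    return WHOLE_INTERPOLATION.sub(r"\1", line)

print(modernize_line('ami = "${var.ami_id}"'))    # -> ami = var.ami_id
print(modernize_line('name = "web-${var.env}"'))  # unchanged
```

The mechanical rewrite is the easy half; the wonky pre-0.12 workarounds (count-based conditionals, null_resource hacks) are what eat the engineering year.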
Then there is the Holy Grail of the Platform Team: a self-service PaaS for the whole company. All of your infra and services have been so thoroughly automated that a non-DevOps engineer can pull a git repo, edit some template files, push them back up, and your CI/CD system will deploy it all automatically. All the logging, alerting, metrics and security policies have been baked in. I frankly love this idea. I’m working on getting my current company to this state right now. But again, this bumps against reality. For example, we used to have a quite complicated automated system for managing user accounts in Github. We would edit files, push them to CI, and users and repos would be populated. However, we realized this was needlessly complicated and dramatically slowed down onboarding of new FTEs and contractors while they waited for somebody technical to edit the automation files. It was much easier and faster to just make sure that team managers were vetted and trained in security policies, and given access to Github directly. We trust that they will click around the right way to add new accounts and give them permissions.
So does this mean that we just abandon the whole idea of IAC and go back to clicking around the Console? Well, not always. I think that’s the best answer. Much like good ideas of the past (think Agile, Scrum, DevOps, Serverless, Microservices), all these things can bring a lot of value. You just can’t get tunnel vision and assume that you need to commit to the idea like it’s your sole meaning in life. IAC has its use case, but that’s just it: a use case. Not every case will be the same, and there are always devilish details to contend with. A good engineer or engineering manager will know when it’s time to let go of the Platonic Ideal of automated perfection and allow people some flexibility when it’s warranted.