Chaos engineering - why and how?

I’ve always said that I would’ve really wanted to be in the meeting room when the Netflix engineers first came up with the idea of pulling the plug on running production services on AWS. Management’s faces would have been worth seeing. :) “So you plan to do what exactly?!?!”

From a crazy idea to common practice

Netflix folks are really the pioneers in this area and have written a lot on the subject. From those early days of it being a slightly crazy idea, the whole culture of unleashing voluntary chaos on services and data centres has evolved quite a bit. In fact, it’s become such a common and well-established practice that it is now commonly referred to as chaos engineering. The term actually summarises the practice really well: what we essentially try to do is cause chaos, but we do it in a controlled manner using well-established engineering practices. Hence the name. There’s even a kind of official manifesto for chaos engineering at http://principlesofchaos.org/

Why

But why would anyone really cause chaos voluntarily? Well, the answer is actually quite simple. Unleashing a monkey with a gun to destroy infrastructure pieces under your services is the best possible test to see if your service really has the resiliency level it needs. Today’s systems are typically highly distributed, with components and microservices running in various environments and possibly even different geographical locations. When a system is composed of multiple processes/components/services running on multiple servers/data centres/regions, you are pretty much bound to Murphy’s law: “Anything that can fail will eventually fail.”

And it’s not really only about seeing if the system is resilient enough to survive random failures. It’s also very much about building confidence in the operations of the system, so that people are able to recover it from different types of error situations. If your system fails in production in a very mysterious way and no-one knows how to recover from it, the impact on paying customers might be pretty significant. If you’ve practiced recovering the system from self-induced chaos, the probability that operations have already seen that kind of error is pretty high and thus the recovery is probably a lot easier. And if you’ve seen those kinds of errors before, naturally you should’ve fixed the system to be resilient to them.

How

There are various ways to cause chaos in your system environment, ranging from manually shutting down servers to automatically unleashing a cyber monkey with a gun to destroy servers and services on your AWS account. The possibilities are almost endless and of course highly dependent on the environment where you run the services.

There are many levels of chaos that can be induced on any given system. The usual approach is to start from the lower levels where individual components, perhaps containers, are made to fail. As you go up the ladder, the blast radius of the induced failure widens: shutting down servers, virtually shutting down entire data centres and, at the far end of the scale, making whole geographic regions disappear from the map. Of course, when using third-party infrastructure or cloud services, it might be impossible to go and actually shut down an entire data centre. But one can emulate that kind of scenario by shutting down every single server in a given data centre, as sketched below.
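On AWS, for example, the closest thing to “a data centre” is an availability zone, and taking one offline can be approximated by stopping every instance running in it. Here’s a minimal sketch of that idea using boto3; the region, the zone and the choice to stop rather than terminate are assumptions you would adapt (and carefully scope) to your own environment.

```python
# A sketch of emulating a "data centre" outage on AWS by stopping every running
# instance in one availability zone. Region and zone are hypothetical examples;
# adapt and scope this to your own environment before running it anywhere.
import boto3

ec2 = boto3.client("ec2", region_name="eu-west-1")

# Find all running instances in the chosen availability zone.
reservations = ec2.describe_instances(
    Filters=[
        {"Name": "availability-zone", "Values": ["eu-west-1a"]},
        {"Name": "instance-state-name", "Values": ["running"]},
    ]
)["Reservations"]

instance_ids = [
    instance["InstanceId"]
    for reservation in reservations
    for instance in reservation["Instances"]
]

if instance_ids:
    # Stop rather than terminate, so the "data centre" can be brought back online.
    ec2.stop_instances(InstanceIds=instance_ids)
    print("Stopped instances: %s" % ", ".join(instance_ids))
```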

Where?

Optimally we should run our chaos testing in production, as that’s the only real way to see if the production setup and services are built in a truly resilient way. Personally, at least, I would not have the guts to start the first chaos experiments directly in production. I think it makes sense to follow a step-wise approach where you first run your chaos experiments in the test or staging environment. Once you build more trust that the system is resilient enough and, more importantly, the team has the confidence that they can fix the system if it actually does break down, you should move on to running the chaos in the actual production environment. As said, you don’t really want to see the system failing in unknown ways in production.

What about containers?

Many teams are now building their systems using containers and container orchestration technologies. From a chaos-inducing point of view, containers make it very easy to cause chaos at an individual component/service level. It’s pretty easy to just go and shoot one container off the planet, causing some unwanted chaos for the system; something like the sketch below is really all it takes.
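For example, using the Docker SDK for Python (the docker package on PyPI), a minimal sketch might look like this. It simply picks one running container at random and kills it, so only point it at an environment where that is acceptable.

```python
# A minimal container-level chaos sketch using the Docker SDK for Python
# (pip install docker): pick one running container at random and kill it.
import random

import docker

client = docker.from_env()

containers = client.containers.list()  # running containers only, by default
if containers:
    victim = random.choice(containers)
    print("Killing container %s (%s)" % (victim.name, victim.short_id))
    victim.kill()
```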

Containers and related technologies also make building chaos-resilient systems slightly easier, as they allow rapid replacements to be spun up for containers that go MIA when servers and data centres shut down. Most container orchestration tools automatically handle cases where a replacement container has to be spun up because its predecessor has left the building. And now that orchestration tools are starting to support running stateful services using container volumes, even the data can move resiliently between the worker nodes running the containers. Of course, that is provided the volume is backed by some resilient data store such as AWS Elastic Block Store.

One good principle of container adoption is “Trust the scheduler”. But to be able to trust it, you need some proof, right? Chaos engineering practices also help you build trust in the selected container orchestration technology, as they allow you to verify, for example, that the corrective actions that are supposed to happen actually do happen, and with the minimum possible disruption to your services. A simple verification loop is sketched below.
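As an illustration of what that “proof” can look like, here is a rough, orchestrator-agnostic sketch: kill one instance of a service, then poll until the expected number of instances is running again. The service label and the expected instance count are hypothetical placeholders; most schedulers attach their own labels to the containers they manage, so you would use those instead.

```python
# Rough "trust, but verify" sketch: kill one instance of a service, then poll
# until the scheduler has brought the running instance count back to the
# expected level. The label and instance count below are hypothetical; use
# whatever labels your orchestrator attaches to the containers it manages.
import random
import time

import docker

SERVICE_LABEL = "my.service=nginx"  # assumption: how your scheduler labels instances
EXPECTED_INSTANCES = 3              # assumption: the desired instance count
TIMEOUT_SECONDS = 120

client = docker.from_env()

def running_instances():
    return client.containers.list(filters={"label": SERVICE_LABEL})

# Introduce the failure.
victim = random.choice(running_instances())
print("Killing %s" % victim.name)
victim.kill()

# Verify the corrective action: the scheduler should restore the instance count.
deadline = time.time() + TIMEOUT_SECONDS
while time.time() < deadline:
    if len(running_instances()) >= EXPECTED_INSTANCES:
        print("Scheduler recovered the service.")
        break
    time.sleep(5)
else:
    print("Service did not recover within %d seconds!" % TIMEOUT_SECONDS)
```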

Tools

Whether or not you really need any tooling is pretty much a question of how automated you want the process to be and also of the size of your environment. If your runtime environment is only a few servers running a handful of containers, causing chaos manually isn’t that hard. Just go to your cloud provider’s management UI and stop random servers. Or SSH into some of the boxes and randomly kill some containers.

The Netflix team has open-sourced a wonderful set of tools called Simian Army. It’s a set of tools to cause chaos specifically on AWS infrastructure and is really nicely automated. It can cause chaos at multiple levels, all the way from shutting down random servers to shutting down things at a region level. Although these tools are for AWS only, setting up something similar for your specific cloud environment is not that big of a task. Stopping random servers through a REST API would be pretty easily doable with some scripting, along the lines of the sketch below.
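For instance, a tiny chaos-monkey-style script that stops a single random instance (the smallest blast radius, in contrast with the availability-zone sketch earlier) could look roughly like this with boto3; in practice you would add a tag filter so that only instances explicitly opted in to chaos are in scope.

```python
# A tiny chaos-monkey-style sketch: stop one random running instance in the
# region. In real use you would add a tag filter so that only instances that
# have explicitly opted in to chaos are eligible.
import random

import boto3

ec2 = boto3.client("ec2", region_name="eu-west-1")

reservations = ec2.describe_instances(
    Filters=[{"Name": "instance-state-name", "Values": ["running"]}]
)["Reservations"]

instances = [i for r in reservations for i in r["Instances"]]
if instances:
    victim = random.choice(instances)["InstanceId"]
    print("Stopping instance %s" % victim)
    ec2.stop_instances(InstanceIds=[victim])
```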

To cause chaos at the container level there’s an open-source tool called Pumba. Pumba runs within the scope of a single node and causes different types of chaos for containers, ranging from stopping them randomly to introducing various network issues.

If you’re running your containers on the Kontena Platform, as everyone should :), one easy way to cause some chaos at the service level is to remove service instances using kontena service rm --instance 10 nginx. The Kontena scheduler will notice this and take corrective action.

There’s also a new toolkit in the making called chaostoolkit. The idea behind it is to provide an abstract way to define your chaos experimentation policies as JSON files. Those policies can then run the experiments against multiple platforms, kinda like Terraform for chaos engineering. The chaostoolkit is really in its early stages, but the general idea is really promising.
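To give a rough feel for the format, an experiment file looks approximately like the following. Treat this as a loose illustration rather than a schema reference: the toolkit is young and its experiment format has been evolving, and the health-check URL and the Kontena command used here are just hypothetical examples.

```json
{
  "version": "1.0.0",
  "title": "Service survives losing one instance",
  "description": "Kill one service instance and check the service keeps answering.",
  "steady-state-hypothesis": {
    "title": "Service responds",
    "probes": [
      {
        "type": "probe",
        "name": "service-must-respond",
        "tolerance": 200,
        "provider": {
          "type": "http",
          "url": "http://my-service.example.com/health"
        }
      }
    ]
  },
  "method": [
    {
      "type": "action",
      "name": "kill-one-instance",
      "provider": {
        "type": "process",
        "path": "kontena",
        "arguments": "service rm --instance 1 nginx"
      }
    }
  ]
}
```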

Summary

Chaos engineering is all about causing voluntary chaos to identify the different ways a system can break. That feeds information back to the engineering teams so they can make the system more and more resilient to these failures, and it also helps operations practice recovering from different kinds of failure situations. Inducing random chaos into your systems of course feels a bit wrong at first, but in the long run it makes your system far more resilient to the chaos it will inevitably face anyway.

About Kontena

Want to learn about real life use cases of Kontena, case studies, best practices, tips & tricks? Need some help with your project? Want to contribute to a project or help other people? Join Kontena Forum to discuss more about Kontena Platform, chat with other happy developers on our Slack discussion channel or meet people in person at one of our Meetup groups located all around the world. Check Kontena Community for more details.

Jussi Nummelin
