A Plea for Idempotence and Immutability

The basic needs for infrastructure configuration are predictability, scalability, and easy recovery in case of failure.
Things will eventually break, for sure, so we have to keep both the risk of failure and the recovery time low (high MTBF, low MTTR).
Setting aside the infrastructure design patterns that help with service availability (HA, DR, Business Continuity and so on), let's focus on the configuration management perspective.
The problem with complex infrastructure is that it has many moving parts that we try to integrate into a bigger system. Each of those components has variable configurations, and many possible configuration states over time, whether desired or not. On the subject of expected state, I think Don Jones really hit the sweet spot when he said it should be driven by policies, in his article about Infrastructure as Code (IaC).
Systems often evolve through unmanaged growth, piling up new features as needs arise, and end up with an untested configuration defined not by a policy but by the accumulation of these more or less chaotic changes.

1. Scaling configuration

Everything is exacerbated at scale, so let's dive first into the scaling challenge.
Imagine a single-node system, configured by several engineers for different features. The node is in a different state (letters in the graph below) after each configuration. Picture a directed graph of such a system, where letters are configuration states, and arrows are the transitions when changes are applied.
 fig1.1
Now add transitions for when you want to revert a change and go back to a previous state, by adding the reciprocal arrow between states.
fig1.2
The first way to simplify such a system is to reduce the number of end states by focusing on what you really care about. Say A is your starting point, and E and F your end states; any other state is then transitional. E and F should be defined by a policy: your target state.
This:
fig1.3
Is an abstracted view of that:
fig1.4
The arrow between A and F is a set of changes which abstracts the path A→B→C→D→F. The transitions haven't changed; we've merely abstracted some of the system's complexity by bundling changes together.
If the node fails completely, you will need to re-apply its configuration from A to get back to E or F.
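The bundling of transitions can be sketched with a small directed graph (a hypothetical illustration using the states from the figures, not tied to any tool):

```python
# Configuration states as a directed graph: each key is a state,
# each value the set of states reachable by applying a change.
raw_transitions = {
    "A": {"B"},
    "B": {"C"},
    "C": {"D"},
    "D": {"F"},
}

def bundle(graph, start, end):
    """Collapse a chain of individual changes into one abstract transition."""
    path, state = [start], start
    while state != end:
        state = next(iter(graph[state]))  # follow the single outgoing change
        path.append(state)
    return {start: {end}}, path

abstracted, path = bundle(raw_transitions, "A", "F")
print(path)        # the detailed path the abstraction hides
print(abstracted)  # the single A -> F transition you now manage
```

The detailed path still exists underneath; the abstraction simply gives you one named transition to manage instead of four.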
Let’s illustrate the impact of managing different transitions.
Increasing the number x of policies in your abstracted view increases the number of transitions y from start to end proportionally (y = x), while if you try to cover every transition from every possible state, the number of transitions grows quadratically (y = x*(x-1)).
fig1.5
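The two growth rates are easy to check numerically (a throwaway sketch of the formulas above):

```python
def managed_transitions(x):
    """One tested transition per policy: start -> end, linear growth."""
    return x

def ad_hoc_transitions(x):
    """Every state reachable from every other state: quadratic growth."""
    return x * (x - 1)

for x in (2, 5, 10, 50):
    print(x, managed_transitions(x), ad_hoc_transitions(x))
# With 50 policies you maintain 50 transitions instead of 2450.
```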
This is what intrinsically happens when systems are manually configured without control or process: you expect the good ol' sysadmins to know what state a configuration is in at any given time, and to be able to transition from that state to any other, on demand.
Monitoring does help keep a view on the state of systems (sometimes assuming that if it works, it has the right configuration), but this again takes real discipline to scale; and again, how do you manage the monitoring configuration itself?
Also, how can you be sure the changes carry a low risk of failure?
Some scenarios can ensue:
– Someone missing a step in a manual configuration guide creates a new end state that is probably unknown and undefined, and it can take as much time (or more) to fix the problem as to restart from scratch.
– Someone 'interactively' makes a change after an automated install/config: who made that change? When? How to revert it? Will the revert work? Does it need a reboot?
– A configuration is sometimes tested in an end state, but not at every transition.
– After a system has been running and patched for years, how do you replace it? How do you replay its configuration? Are you worried your pet server is irreplaceable?
– Sometimes, Change Management procedures are set up to keep track of the changes (transitions) happening to a system, so you can trace and re-apply those changes to a new system. But if those changes are manual, you can still miss a step. Long-lived systems with regular mutation are no easier to manage, and the mutation logging/control becomes a burden, standing in the way of agility, security, and value creation for the business.
This complexity applies to each component of an infrastructure individually, and to the assembly of those components into a whole, and is in my opinion one of the root challenges to scalability. No wonder containerisation, microservices and PaaS are so attractive.
The common approach to complexity is to build abstractions over our systems, by focusing on specific models (whether it's OSI, virtualization, storage, software, stack…).
Imagine that each piece of hardware, each OS, distribution, application stack and so on, is a system component.
The complexity is not only the sum of the components, but also of the possible transitions within each component.
To balance deployment speed with maintainability, in some cases you may introduce a different start state by way of Templates or Golden Images (or, more accurately, think of those as pre-defined transitional states on the path from the raw base/source/binary/ISO to the end states defined in a policy).
Like any other state, those pre-defined transitional states (checkpoints) need to be managed the same way as your end states: you need to be able to replay their configuration from your invariable source, the base source, usually provided as installation media for a specific version.
By reducing the states we allow the system to be in (and monitoring its compliance), we reduce the number of variables, and so enable easier troubleshooting and recovery, because we have well-documented, well-tested states. This does not replace monitoring and inventory, but well implemented it can help.
The direct effect of reducing the number of moving parts of a system is reducing its apparent complexity and enabling abstraction from its underlying systems, but that abstraction must not interfere with predictability.

2. Predictability

So how do you make a system predictable?
(Leaving aside performance degradation, load, and other things that vary over time and are external to configuration changes.)
Once your state policy is defined, you need a reliable way to transition to it from your starting state. Then you need to confirm, and ensure, that you stay in that end state in terms of configuration.
The first key is to remove the human as a variable; we've been making mistakes, and acknowledging as much, for more than 2000 years…
So the state transition from a base state to an end state (your policy), let's call it a deployment for an infrastructure, needs to happen without manual human interaction to be reliable. That's an automated deployment, which starts from some data defining the policy and/or some logic to transition to it from a known, current state.
Then when you change that deployment process, say update a template, you need to re-test it as a unit, then within the whole system (unit test + integration test), before accepting the changes. Whenever you change a template, you need to ensure that everything based on that image still works after the change (avoid regression).
It's those tests that give you the confidence that your system stays predictable, as long as it is in the expected state. Monitoring the state is another requirement, so that your service isn't degraded without you knowing. Benchmarking and load/stress testing ensure that changes to the system have no significant impact on expected performance.
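As a minimal illustration (the function and policy keys are hypothetical), a template rendered purely from policy data can be unit-tested for regressions before a change is accepted:

```python
def render_sshd_config(policy):
    """Render a config file from policy data only -- no hidden inputs,
    so the same policy always yields the same file."""
    lines = [
        f"Port {policy['port']}",
        f"PermitRootLogin {'yes' if policy['root_login'] else 'no'}",
    ]
    return "\n".join(lines) + "\n"

# Unit test: the rendering is deterministic and enforces the policy.
policy = {"port": 22, "root_login": False}
assert render_sshd_config(policy) == render_sshd_config(policy)
assert "PermitRootLogin no" in render_sshd_config(policy)
```

Because the output depends only on the policy data, the test can run on every change, long before anything reaches a real node.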
If the component gets corrupted, you need to be able to re-generate one with the exact same configuration; this is where the definition data and the automation of the template matter, and why they need to be self-sufficient (no manual input required).
The initial approach to computer configuration is an attempt to script it, end to end. Not a bad one, as a piece of code should be reproducible.
The flexibility of scripting languages lets you run many tasks and interact at every layer of a system while managing some kind of orchestration. However, by itself, it offers little abstraction and separation of duty; everything is up to the developer, who has to be skilled and make a special effort to layer the deployment script adequately.
As the development effort to automate the deployment of a full-blown infrastructure becomes really significant, it soon requires standards and discipline to allow several engineers to work together on the same system. The separation of duty between the specific data, the policy definitions, the orchestration, the implementation of a configuration item, and the tests needs to be codified and enforced.
This is why some OSS projects and vendors came up with Configuration Management principles (one being Infrastructure as Code) that help abstract the complexity of the systems layer by layer, with low coupling and high cohesion between components.
In some implementations, the top layer could be the policy describing the expected state, another layer would handle some of the orchestration of the whole infrastructure, then further layers would handle the implementation of sub-systems, which in turn rely on abstractions for their own sub-systems and usually leverage some scripting to get the thing done at the smallest possible unit.
Each component is tested so that it produces the same result every time, and when components are assembled you can rely on the abstractions created.
The configuration is now decoupled into different layers: the further up we go, the more high-level we get, up to the highest level defining the infrastructure, the policy.
This policy is not necessarily manually input; it can be computed following a business logic such as: "we have more users, using more resources, let's deploy more servers in the cloud automatically".
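A computed policy can be as simple as deriving the target server count from current usage (the thresholds here are made-up numbers for illustration):

```python
import math

def desired_servers(active_users, users_per_server=500, minimum=2):
    """Derive the policy's target server count from current usage."""
    return max(minimum, math.ceil(active_users / users_per_server))

print(desired_servers(120))   # quiet period: stay at the minimum
print(desired_servers(4800))  # busy period: scale out
```

The output of such logic becomes part of the policy; the lower layers then converge the infrastructure to it like any other declared state.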
Because the whole definition is text-based (a policy document similar to code) at every layer of the configuration, it is self-documenting.
And that leads us to the last requirement for a system: to be easily recoverable in case of failure, with a low risk and rate of failure.

3. Recoverability

Besides designing your system with redundancy, high availability, and abstracting the hardware with virtualization or cloud, having your infrastructure definition (Policies) at hand means you can automatically deploy another instance of a component in a reliable way.
Because each change to your policy is tested, along with its integration and its dependencies, you can identify and revert or fix a change before it reaches production.
Reducing the feedback loop to avoid issues in production is covered by the basic principles of Continuous Integration / Continuous Delivery; it's just a matter of applying them to configuration management for infrastructure, so I won't go into detail here.
Should your system still have a defect despite your tests, then maybe you forgot to test something, or a test is not accurate; implementing the missing test will fix the problem, forever.
Managing your configuration policies as code, using the best practices of code management such as versioning, code review, unit testing, TDD and so on, will bring your infrastructure the same benefits they bring to software development.
In the same way, it gives you the opportunity to be more agile in developing your infrastructure, and makes it easier to adapt methodologies such as Scrum to your infrastructure projects.

4. Takeaways

In conclusion (and I could have started with this), infrastructure configuration should be manageable in a way that aims to be:
– Idempotent: no matter how many times you apply it, it will produce the same result
– Immutable: policy-driven infrastructure, preventing configuration drift / snowflake configurations. Replace/rebuild components instead of patching/modifying them (once set up, you never change/mutate the instance; you rebuild based on the new DATA)
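Idempotence in that sense can be sketched as an "ensure" function: it describes an end state, and applying it a second time changes nothing (a simplified, hypothetical example):

```python
def ensure_line(lines, wanted):
    """Idempotent step: make sure `wanted` is present exactly once.
    It describes an end state, not an action to repeat."""
    return lines if wanted in lines else lines + [wanted]

config = ["Port 22"]
once = ensure_line(config, "PermitRootLogin no")
twice = ensure_line(once, "PermitRootLogin no")
assert once == twice  # applying it again yields the same result
```

Contrast this with an imperative `append` that would duplicate the line on every run: the "ensure" form is what lets a tool re-apply the whole policy safely, as often as it likes.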

There will be cases where you want some specific mutations to be supported; these will be extra transitions to manage and maintain, but remember that they will become a burden as the system grows.

The policy definition should be the single source of truth for the targeted state, monitoring should report the differences from that state at a point in time, and the infrastructure should be easy to recover. Those recovery processes should also be tested (have you heard of Chaos Monkey?).
Before a change can occur (be delivered to production), it must have been tested and verified. This relates directly to the definition of done in Scrum.
It enables:
– Making change management easier, auditable, and revertible with source control
– Better clarity of the changes to the policy (diff)
– An easier way to continuously improve and develop your infrastructure
– Self-documentation of the policies via a declarative DSL
– Improved migration of servers/services by always knowing what to expect (what it should look like when compliant with the policy)
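Reporting differences from the policy then reduces to a diff between the declared state and the observed state (a sketch; the keys are hypothetical and `observed` would come from your inventory or monitoring):

```python
policy = {"ntp": "pool.ntp.org", "ssh_port": 22, "firewall": "on"}
observed = {"ntp": "pool.ntp.org", "ssh_port": 2222, "firewall": "on"}

def drift(policy, observed):
    """Report every setting whose observed value differs from the policy."""
    return {
        key: (expected, observed.get(key))
        for key, expected in policy.items()
        if observed.get(key) != expected
    }

print(drift(policy, observed))  # {'ssh_port': (22, 2222)}
```

An empty report means the node is compliant; anything else is drift to be remediated by rebuilding, not by hand-editing the node.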
One simple and abstract way to see it:
Policy describes the What (as an end state), not the How.
Logic describes the How, in as many layers as necessary (very simplified here):
– Orchestration: wait for service x before starting service y
– Atomic implementation: /etc/rc.d/init.d/x start; Start-Service y
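The split can be sketched as data (the What) consumed by layered logic (the How); the service names are placeholders and `start` stands in for the atomic implementation above:

```python
# Policy: the What -- an end state, with no instructions on how to get there.
policy = {
    "x": {"state": "running"},
    "y": {"state": "running", "after": "x"},  # y depends on x
}

def start(service):
    # Atomic implementation: the lowest "How" layer
    # (stands in for `/etc/rc.d/init.d/x start` or `Start-Service y`).
    print(f"starting {service}")

def converge(policy):
    """Orchestration layer: bring dependencies up first (wait for x before y)."""
    started = set()

    def bring_up(name):
        dep = policy[name].get("after")
        if dep and dep not in started:
            bring_up(dep)
        if name not in started and policy[name]["state"] == "running":
            start(name)
            started.add(name)

    for name in policy:
        bring_up(name)
    return started

converge(policy)  # starts x, then y
```

The policy stays declarative and diff-able; only the orchestration and atomic layers know anything about ordering or commands.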
I wanted to wait until the last minute to talk about actual tools, as I often see companies trying to adapt tools to their existing practices, thinking they're addressing the problem while they're only moving or hiding part of it, without questioning their processes and approaches.
Many modern tools use similar principles, each with its own advantages and constraints: Chef, Puppet, Terraform, PowerShell DSC, Ansible, SaltStack, CFEngine and others. You'll need to do due diligence to find the one 'fit for YOUR purpose', so start experimenting now!
Add a testing framework, CI with ways to spin up/down your infrastructure at will, and you should have enough to go a decent way.
A word of warning at last: moving off the old ways to such a model is a long and difficult task. You need to reverse-engineer your existing configurations and create the tests that will let you ensure consistency and idempotence, and save you from regression. You will have to establish a single source of truth and restrict access to win back control. Although it may sound like a daunting task, it's merely a way to pay off some of the accumulated technical debt, and it needs to be a progressive but resolute effort.
The light at the end of the tunnel is that you will then gain control of your system and improve your agility, change will become a frictionless process and you will not fear production issues.
Have you tried that route, and did it work for you? Or have you found the obstacles too difficult: technology, human resistance, resources?

