How We're Migrating our Configuration Management System

Marco Giuliani — 15 Jun 2018
Contributors: Sjoerd Oostdijck
At the RIPE NCC we are migrating our configuration management system to a mixed solution using Salt and Ansible. This article describes the issues we were facing with the old system and the selection of a new platform.

Introduction

To cope with the ever-increasing complexity of IT infrastructure, the use of a configuration management system has become customary. Moreover, there is a strong trend towards virtualisation and integration in a so-called “Infrastructure as Code” framework, which helps to manage and oversee an IT environment.

Like many other organisations, the RIPE NCC has been relying for years on CFEngine 3 Community Edition (CF3) and - more recently - also on Ansible to maintain our internal and external services. We used CF3 to manage hundreds of virtual and physical machines in order to provide both internal and external IT services. In addition, Ansible was chosen for the deployment and the management of specific, globally distributed services, for example K-root servers and RIPE Atlas anchors.

Although these tools allowed us to manage our infrastructure quite efficiently until now, we were facing a number of issues with the old system:

  • CF3 Code Maintainability
  • Lack of code-review and lack of structured training 
  • Error-prone architecture
  • Lack of internal knowledge-sharing

Let’s look at these issues in some more detail. 

Code maintainability

A configuration management system generally involves a set of descriptions of a desired-state that one or more systems have to adhere to. These "descriptions" take different names such as policies, states, playbooks, according to the specific system under consideration.
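The idea of a desired-state description can be illustrated with a minimal Python sketch. It is not taken from CF3, Salt or Ansible: the package names and the `plan_changes` and `apply_changes` helpers are hypothetical, and applying a change is only simulated. The point is that the description says what should be true, and the tool works out which actions, if any, are still needed.

```python
from typing import Dict, List

# Hypothetical desired state: package name -> whether it should be installed.
DesiredState = Dict[str, bool]


def plan_changes(desired: DesiredState, actual: DesiredState) -> List[str]:
    """Return the actions needed to make `actual` match `desired`."""
    actions = []
    for package, should_exist in sorted(desired.items()):
        is_installed = actual.get(package, False)
        if should_exist and not is_installed:
            actions.append(f"install {package}")
        elif not should_exist and is_installed:
            actions.append(f"remove {package}")
    return actions


def apply_changes(desired: DesiredState, actual: DesiredState) -> DesiredState:
    """Return the state after applying the plan (simulated, nothing is installed)."""
    converged = dict(actual)
    for package, should_exist in desired.items():
        converged[package] = should_exist
    return converged


desired = {"ntp": True, "telnet": False}
actual = {"telnet": True}

print(plan_changes(desired, actual))      # ['install ntp', 'remove telnet']
converged = apply_changes(desired, actual)
print(plan_changes(desired, converged))   # [] because the state already matches
```

Running the plan a second time yields no actions, which is the property that makes desired-state descriptions safe to apply repeatedly.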

Maintaining an infrastructure basically involves two actions:

  • Creating new descriptions and configurations
  • Updating existing descriptions and configurations

In a complex infrastructure every new description is directly or indirectly related to another existing one. So, in order to write a new policy, one has to be able to predict how the new policy will interact with the existing situation. When the set of existing policies becomes too complex, creating a new description of a desired state becomes tricky. Furthermore, in some cases, for example when we troubleshoot a problem during an emergency, we need to be able to update existing policies quickly.

Readability of the syntax and compliance with well-known standards are also crucial.

We realised that our CF3 policies and configurations had become increasingly difficult to maintain over the years.

Lack of code-review and lack of structured training

When a group of people work together for years on the same code, making changes every day, without a proper review process, things can get quite complicated. This is something we are dramatically improving while re-designing our architecture.

Furthermore, we noticed that learning CF3 proved quite difficult, especially for newcomers, because of its steep learning curve. Newcomers were learning on the job without formal training or mentoring. We have now decided that our new configuration management system will be paired with a structured training process for IT staff.

Error-prone architecture 

Our CF3 architecture relies on Apache Subversion (SVN). Every engineer has the same privileges when committing changes. Although some pre-commit hooks enforce a minimal syntax check, each contributor makes policy changes directly in the SVN repository used by the CF3 master server, so every commit can potentially affect the production environment and harm systems.

Although this happened quite rarely, a change sometimes had an impact on the availability of our services. It became evident that our architecture was too prone to human error.

Lack of internal knowledge-sharing

After some years with CF3, we decided to add Ansible to deploy and manage specific systems. Although we were very happy with the many advantages offered by Ansible over CF3, this had some drawbacks. In particular, we noticed that different teams ended up working with different systems, which made inter-departmental cooperation difficult.

We therefore decided to look for a new configuration management system: a single solution that could be used by all the engineers in the organisation to manage the entire infrastructure.

Selecting a new solution: Salt and Ansible 

In early 2017, we set up a team with engineers from various technical departments to look at the pros and cons of available alternatives. The main goal was to find a single solution suitable for our current and possible future systems. After a first selection, we ended up with two feasible choices: Ansible and Salt.

These systems have many similarities. Both are based on Python. Their configuration files are written in YAML and both rely on Jinja as a templating language. Both are actively developed and supported by a large community, and good documentation is available for both. However, there are also important differences. Let's analyse the most relevant ones.

Client-server connection and availability of a client-side agent

By default, Salt uses ZeroMQ over two persistent TCP connections to exchange data between the "master" and the "minions". Paired with built-in channel encryption, ZeroMQ offers higher performance than Ansible's SSH-based transport. Salt provides a local agent, while Ansible simply relies on SSH. Furthermore, Salt-SSH, an agentless implementation that mimics the behaviour of Ansible, can be used whenever installing a local agent is not desirable. In addition, Salt-Proxy allows the management of devices that cannot run a client-side agent, such as network devices maintained via an API. Finally, for very large distributed Salt architectures, ZeroMQ can be replaced by RAET, which offers even greater communication capabilities.

We want to stress that the availability of a dedicated agent is not better in absolute terms: it depends on the constraints of the architecture you are designing. Although Ansible's minimalistic approach stands out for its simplicity and ease of deployment, Salt is remarkable for its flexibility.

The extended Jinja integration

In Salt, Jinja can be used virtually everywhere: by default, state and pillar files are rendered through Jinja before being parsed as YAML. In Ansible, the use of Jinja is more limited.

Variable storage

Salt manages group-specific or host-specific variables in the "pillar" dictionary, which can be stored on the master or in an external database. The Salt master pushes variables only to the hosts that need them. As far as we know, Ansible does not offer anything similar. To overcome this limitation, at the RIPE NCC we use an rsync-based solution to protect the transmission of Ansible variables over insecure networks.
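The idea that each host receives only the variables targeted at it can be sketched in a few lines of Python. This is an illustration of the principle, not Salt's implementation: the `PILLAR_TOP` table, the `pillar_for` function, the host names and the values are all hypothetical, and plain glob matching stands in for Salt's richer targeting.

```python
from fnmatch import fnmatch
from typing import Any, Dict, List, Tuple

# Hypothetical targeting table: (minion ID glob, variables for matching minions).
PILLAR_TOP: List[Tuple[str, Dict[str, Any]]] = [
    ("*", {"ntp_server": "ntp.example.net"}),
    ("db-*", {"db_password": "not-a-real-secret"}),
]


def pillar_for(minion_id: str) -> Dict[str, Any]:
    """Collect only the variables whose target pattern matches this minion."""
    pillar: Dict[str, Any] = {}
    for pattern, variables in PILLAR_TOP:
        if fnmatch(minion_id, pattern):
            pillar.update(variables)
    return pillar


print(pillar_for("web-01"))  # {'ntp_server': 'ntp.example.net'}
print(pillar_for("db-01"))   # additionally contains 'db_password'
```

In this sketch a web server never sees the database password, because the selection happens before anything is sent to the host.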

Reporting capabilities

Salt offers extensive reporting features through the use of "Returner" modules. By default, the return values of each command executed on a minion are sent back to the Salt master in JSON format through a dedicated TCP channel. However, a returner module can also be configured to send data to external databases. For example, we plan to export our data to an Elasticsearch cluster. Ansible simply uses a local log file on each host. It is worth mentioning that Ansible Tower, provided by Red Hat, offers more extended reporting capabilities, such as central logging.
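As a rough sketch of what a returner does, the Python snippet below takes the return data of one job and forwards it as JSON to a storage backend. It is loosely modelled on the shape of a Salt returner module, which exposes a `returner(ret)` function, but the in-memory `SINK` list, the output field names and the sample job data are assumptions made for illustration. No real database client is used here.

```python
import json
from typing import Any, Dict, List

# Stand-in for an external database; a real returner would write to a remote store.
SINK: List[str] = []


def returner(ret: Dict[str, Any]) -> None:
    """Serialise one job return as JSON and append it to the sink."""
    document = {
        "minion": ret["id"],
        "job": ret["jid"],
        "function": ret["fun"],
        "result": ret["return"],
    }
    SINK.append(json.dumps(document, sort_keys=True))


# Hypothetical return data for a single command run on one minion.
returner({"id": "web-01", "jid": "20180615120000000000", "fun": "test.ping", "return": True})
print(SINK[0])
```

Because every job return passes through the same function, swapping the sink for another backend would not require any change on the hosts that execute the commands.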

Event-driven action

Salt uses Reactor modules to monitor states and act on changes. These modules can be tailored to specific needs by defining custom events.
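The event-driven pattern can be sketched as a small dispatcher that maps event tags to reactions. This is a conceptual illustration rather than Salt's Reactor API: the `on` decorator, the `fire` function, the `custom/disk/full/*` tag and the clean-up action are all hypothetical.

```python
from fnmatch import fnmatch
from typing import Any, Callable, Dict, List, Tuple

Event = Dict[str, Any]
Reaction = Callable[[Event], str]

# Hypothetical reactor table: (event tag glob, reaction to run on a match).
REACTIONS: List[Tuple[str, Reaction]] = []


def on(tag_pattern: str) -> Callable[[Reaction], Reaction]:
    """Register a reaction for every event whose tag matches the pattern."""
    def register(func: Reaction) -> Reaction:
        REACTIONS.append((tag_pattern, func))
        return func
    return register


@on("custom/disk/full/*")
def clean_tmp(event: Event) -> str:
    # A real reaction would trigger a state run; here we only describe it.
    return f"clean /tmp on {event['id']}"


def fire(tag: str, data: Event) -> List[str]:
    """Dispatch an event and return the actions chosen by matching reactions."""
    return [react(data) for pattern, react in REACTIONS if fnmatch(tag, pattern)]


print(fire("custom/disk/full/web-01", {"id": "web-01"}))  # ['clean /tmp on web-01']
print(fire("custom/unrelated", {"id": "web-01"}))          # []
```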

Inter-host dependency definition

Salt, via Orchestrate Runner modules, can define a set of inter-dependencies between different hosts. For example, the Salt master can order the execution of specific states on several hosts, or it can run a state on host A only when some specific modules succeed or fail on host B.

In Ansible you can define inter-host dependencies thanks to its "delegation" feature, which can trigger actions on multiple hosts. However, inter-host dependency is not available in local mode.
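The kind of cross-host ordering described in this section can be sketched as a list of steps in which a step runs only if the step it requires has succeeded. The `Step` type, the `orchestrate` function and the host and step names are hypothetical, and the actions are simulated with functions that simply return success or failure.

```python
from typing import Callable, Dict, List, NamedTuple, Optional


class Step(NamedTuple):
    name: str
    host: str
    action: Callable[[], bool]       # returns True on success
    require: Optional[str] = None    # name of a step that must have succeeded


def orchestrate(steps: List[Step]) -> Dict[str, str]:
    """Run steps in order, skipping any step whose requirement did not succeed."""
    outcome: Dict[str, str] = {}
    for step in steps:
        if step.require is not None and outcome.get(step.require) != "ok":
            outcome[step.name] = "skipped"
            continue
        outcome[step.name] = "ok" if step.action() else "failed"
    return outcome


# Hypothetical plan: only restart the web server if the database migration worked.
plan = [
    Step("migrate_db", "host-b", lambda: False),
    Step("restart_web", "host-a", lambda: True, require="migrate_db"),
]
print(orchestrate(plan))  # {'migrate_db': 'failed', 'restart_web': 'skipped'}
```

The decision is taken centrally, by whoever runs the plan, which is why this style of dependency needs a coordinating node rather than hosts configuring themselves in isolation.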

Ease of deployment and learning curve

Thanks to its simplicity, Ansible appears to be easier to learn and faster to deploy. However, our experience shows that, once the initial deployment has been properly carried out, working with Salt is straightforward for well-trained IT staff. Since the two systems share many similarities, knowledge acquired on one greatly simplifies learning the other. This also facilitates the integration of Salt and Ansible in the same architecture.

Below you can find an overview of all features we evaluated. 

Feature                                                     Ansible                 Salt
----------------------------------------------------------  ----------------------  ------------------------------------------------------
Language                                                    Python                  Python
DSL (Domain Specific Language)                              YAML + Jinja (limited)  YAML + Jinja
Client-server connection                                    SSH                     ZeroMQ via two TCP channels; RAET via UDP channels; SSH
Availability of a client agent                              No                      Yes
Reporting capabilities                                      Limited                 Extended
Variables storage                                           Limited                 Extended
Run one-off commands remotely                               Yes                     Yes
Event-driven actions                                        No                      Yes
Inter-host dependency                                       Limited                 Extended
Ease of deployment                                          Quite easy              Quite complex
Learning curve                                              Easy to learn           Steep
Established knowledge within our organisation               High                    Low
Community support                                           Excellent               Excellent
Documentation quality                                       Excellent               Excellent
Availability of a commercial edition and technical support  Red Hat Ansible Tower   Saltstack Enterprise
License                                                     GPL                     Apache 2.0

Salt, with a sprinkle of Ansible

We initially considered migrating the entire infrastructure to Ansible, since this system was already used to manage part of our infrastructure and especially because of the availability of established knowledge within the RIPE NCC.

However, after careful consideration, Salt appeared to offer better capabilities and more flexibility to future-proof our infrastructure, although this comes at a price in terms of added complexity. We do acknowledge that one of the key features of Ansible is its simplicity.

In the end, although Salt is going to play the primary role in our infrastructure, we are designing a system in which Salt and Ansible are integrated. For example, Ansible can be used to automate the deployment of the Salt master hosts.

This way, we are taking advantage of the best features of both systems. 



Conclusions

We are convinced that we have found the best solution for our new configuration management system: a combination of Salt and Ansible. It will help us to maintain our complex infrastructure in a clean and extendable way. We are currently preparing extensively and training all staff who will work with the new system. In the next articles, we will go into more detail about the actual architecture and our migration strategy.

If you have any experience in this area or if you have gone through a similar exercise, we would love to hear from you.

6 Comments

Ju says:
21 Jun, 2018 01:15 AM
Nice comparison. I would have like few examples on the complex cases to illustrate :)

For Ansible reporting, would recommend to look at ARA by Openstack project (https://github.com/openstack/ara)

Thanks for sharing!
Marco Giuliani says:
22 Jun, 2018 11:32 AM
Thanks for your feedback Ju.
This should be the first of a series of articles about our configuration management systems. In the next articles we will delve into more technical details and we will provide configuration examples. We are still actively deploying our infrastructure.
ARA is a definitely interesting project!
MarcoG
Flo says:
23 Jun, 2018 04:00 PM
Hi,

so, after reading this twice:
You well identified the problems in your old setup.
Then you write about the new one, and basically dwell in technical details, and, besides mentioning there will (wow?) be formal training and review from now on, those points are not described any further.
Right now, from all you wrote, you are prepping yourself to do the same thing again.

A reader would assume those process-side changes aren't anywhere ready to be used, and especially lack goals that decide if they are _working_ once you use them.
And if a reader would have assume that, so should you.

My advice would be to force yourself to write one more post that covers your plans (everything is just a plan) for training and release management on the same level of detail as you did for the ansible and salt combo.
Marco Giuliani says:
29 Jun, 2018 03:02 PM
Hi Flo,
many thanks for reading this article so thoroughly and thanks for the good advices.
I really appreciated your comment.

We are fully aware that the training and reviews processes require a careful planning involving the definition of a series of goals and indicators to properly evaluate whether and to which degree these changes are effective. I am taking into account your suggestion.
Thanks again.
MarcoG
Paul says:
25 Jun, 2018 04:17 PM
This is quite a comprehensive comparison. I agree with most of your findings, except two:
 - Jinja2 templating: Ansible users can also use Jinja variables in playbooks (although they are restricted to one line, due to security concerns: https://github.com/ansible/ansible/issues/8233)
- Salt learning curve: I agree that Salt is more difficult to learn than Ansible, but I consider that 'steep' is a bit much :)

Good luck!
Marco Giuliani says:
29 Jun, 2018 03:09 PM
Hi Paul,
thanks for your feedback and for your correction. I am amending the text :).
About SaltStack learning curve: yes, maybe "steep" is a bit much:).

Thanks,
MarcoG