At the RIPE NCC we are migrating our configuration management system to a mixed solution using Salt and Ansible. This article describes the issues we were facing with the old system and the selection of a new platform.
To cope with the ever-increasing complexity of IT infrastructure, the use of a configuration management system has become customary. Moreover, there is a strong trend towards virtualisation and integration in a so-called “Infrastructure as Code” framework, which helps to manage and oversee an IT environment.
Like many other organisations, the RIPE NCC has been relying for years on CFEngine 3 Community Edition (CF3) and - more recently - also on Ansible to maintain our internal and external services. We used CF3 to manage hundreds of virtual and physical machines in order to provide both internal and external IT services. In addition, Ansible was chosen for the deployment and the management of specific, globally distributed services, for example K-root servers and RIPE Atlas anchors.
Although these tools allowed us to manage our infrastructure quite efficiently until now, we were facing a number of issues with the old system:
- CF3 Code Maintainability
- Lack of code-review and lack of structured training
- Error-prone architecture
- Lack of internal knowledge-sharing
Let’s look at these issues in some more detail.
CF3 Code Maintainability

A configuration management system generally involves a set of descriptions of a desired state that one or more systems have to adhere to. These "descriptions" take different names - policies, states, playbooks - according to the specific system under consideration.
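To make this concrete, here is a minimal Salt state as an example of such a "description" - a sketch only, where the package name `ntp` and service name `ntpd` are illustrative and distribution-dependent:

```yaml
# Minimal Salt state (sketch) describing a desired state:
# the NTP package installed and its service running.
# Package and service names are illustrative.
ntp:
  pkg.installed: []

ntpd:
  service.running:
    - enable: True
    - require:
      - pkg: ntp
```

Whatever the tool, the idea is the same: the file declares *what* the end state should be, and the system works out *how* to get there.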
Maintaining an infrastructure basically involves two actions:
- Creating new descriptions and configurations
- Updating existing descriptions and configurations
In a complex infrastructure every new description is directly or indirectly related to another existing one. So, in order to write a new policy, one has to be able to predict how the new policy will interact with the existing situation. When the set of existing policies becomes too complex, creating a new description of a desired state becomes tricky. Furthermore, in some cases, for example when we troubleshoot a problem during an emergency, we need to be able to update existing policies quickly.
Also, readability of the syntax and compliance with well-known standards are crucial.
We realised that our CF3 policies and configurations have become increasingly difficult to maintain over the years.
Lack of code-review and lack of structured training
When a group of people works together for years on the same code, making changes every day without a proper review process, things can get quite complicated. This is something we are dramatically improving while re-designing our architecture.
Furthermore, we also noticed that, especially for newcomers, learning CF3 proved to be quite difficult due to a steep learning curve. Newcomers were learning on the job without formal training or mentoring. We have now decided that our new configuration management system will be paired with a structured training process for IT staff.
Error-prone architecture

Our CF3 architecture relies on Apache Subversion (SVN). Every engineer shares the same privileges when committing changes. Although some pre-commit hooks ensure a minimum syntax check, each SVN commit may potentially affect the production environment, with the risk of harming systems, since each contributor makes policy changes directly on the SVN repository used by the CF3 master server.
Although this has happened quite rarely, some changes have had an impact on the availability of services. It became evident that our architecture was too prone to human error.
Lack of internal knowledge-sharing
After some years with CF3, we decided to add Ansible to deploy and manage specific systems. Although we were very happy with the many advantages offered by Ansible over CF3, this had some drawbacks. In particular, we noticed that, over time, different teams were working with different tools on different systems, which made inter-departmental cooperation difficult.
We therefore decided to look for a new configuration management system: a single solution that could be used by all the engineers in the organisation to manage the entire infrastructure.
Selecting a new solution: Salt and Ansible
In early 2017, we set up a team with engineers from various technical departments to look at the pros and cons of available alternatives. The main goal was to find a single solution suitable for our current and possible future systems. After a first selection, we ended up with two feasible choices: Ansible and Salt.
These systems have many similarities. Both are based on Python, their configuration files are written in YAML, and both rely on Jinja as the templating language. Both are actively developed and supported by a large community, and good documentation is available for both. However, there are also important differences. Let's analyse the most relevant ones.
Client-server connection and availability of a client-side agent
By default, Salt uses ZeroMQ over two permanent TCP connections for data exchange between the "master" and the "minions". Paired with built-in channel encryption, ZeroMQ offers higher performance than Ansible's SSH-based transport. Salt provides a local agent, while Ansible simply relies on the SSH client. Furthermore, Salt-SSH, an agent-less implementation that mimics the behaviour of Ansible, can be used in all those cases where installing a local agent may not be desirable. In addition, Salt-Proxy allows the management of devices that cannot run a client-side agent, such as network devices maintained via an API. Finally, for very large distributed Salt architectures, ZeroMQ can also be replaced by RAET, which offers even greater communication capabilities.
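As a sketch of the default setup, a minion only needs to know its master; the two ZeroMQ channels correspond to two TCP ports on the master. The hostnames below are placeholders.

```yaml
# /etc/salt/minion (sketch; hostnames are placeholders)
master: salt.example.com
id: web01.example.com

# On the master side, the default ZeroMQ transport listens on two TCP ports:
#   4505 - publish channel (master -> minions)
#   4506 - return channel  (minions -> master)
```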
We want to stress that the availability of a dedicated agent is not better in absolute terms: instead, it depends on the constraints of the architecture you're designing. Although Ansible's minimalistic approach stands out for its simplicity and ease of deployment, Salt is remarkable for its incredible flexibility.
The extended Jinja integration
In Salt, Jinja can be used practically everywhere, while the usage of Jinja in Ansible is limited.
Salt manages group-specific or host-specific variables in the "pillar" dictionary, which can be stored on the master or in external databases. The Salt master pushes variables only to the hosts that need them. As far as we know, Ansible does not offer anything similar. To overcome this limitation, at the RIPE NCC we use an rsync-based solution to protect the transmission of Ansible variables over insecure networks.
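A minimal sketch of how pillar targets variables to specific hosts; the paths, the `web*` target and the `webserver` key are illustrative:

```yaml
# /srv/pillar/top.sls - only minions matching 'web*' receive this pillar
base:
  'web*':
    - webserver

# /srv/pillar/webserver.sls - the variables themselves
webserver:
  listen_port: 8080

# In a state or template, the value is then available via Jinja:
#   {{ pillar['webserver']['listen_port'] }}
```

Because the master renders and pushes pillar data per minion, a host never sees variables that are not targeted at it.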
Salt offers incredible reporting features through the use of "returner" modules. By default, the return values of each command executed on a minion are sent back to the Salt master in JSON format through a dedicated TCP channel. However, a returner module can be configured to send data to external databases as well. For example, we plan to export our data to an Elasticsearch cluster. Ansible simply uses a local log file on each host. It is worth mentioning here that Ansible Tower, provided by Red Hat, offers more extensive reporting capabilities, such as central logging.
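For illustration, a returner can be selected per command (for example `salt '*' test.ping --return elasticsearch`) and configured in the Salt configuration. The option names below follow the Elasticsearch returner's settings as we understand them, but treat them as an assumption to verify against your Salt release:

```yaml
# Configuration sketch for the elasticsearch returner
# (option names to be verified against your Salt version;
# the host address and index name are placeholders)
elasticsearch:
  hosts:
    - '10.10.10.10:9200'
  index: salt-returns
```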
Salt uses Reactor modules to monitor states and act on changes. These modules can be tailored to specific needs by defining custom events.
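A sketch of a reactor setup: the master maps event tags to reactor SLS files; here, a minion start event triggers a highstate on the minion that just came up. The file paths are illustrative.

```yaml
# /etc/salt/master.d/reactor.conf - map event tags to reactor files
reactor:
  - 'salt/minion/*/start':
    - /srv/reactor/start_highstate.sls

# /srv/reactor/start_highstate.sls - run highstate on the minion
# that emitted the start event ('data' holds the event payload)
highstate_on_start:
  local.state.highstate:
    - tgt: {{ data['id'] }}
```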
Inter-host dependency definition
Salt, via Orchestrate Runner modules, can define a set of inter-dependencies between different hosts: for example, the Salt master can order the execution of specific states on several hosts, or it can run a state on host A only when specific modules succeed or fail on host B.
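An orchestration sketch, with illustrative file and target names; the `require` requisite makes the application state run only after the database state succeeds:

```yaml
# /srv/salt/orch/deploy.sls (illustrative names)
# Run from the master with: salt-run state.orchestrate orch.deploy
deploy_database:
  salt.state:
    - tgt: 'db1.example.com'
    - sls: database

deploy_application:
  salt.state:
    - tgt: 'app*'
    - sls: application
    - require:
      - salt: deploy_database
```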
In Ansible you can define inter-host dependencies thanks to its "delegation" feature, which can trigger actions on multiple hosts. However, in local mode, inter-host dependencies are not available.
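A sketch of Ansible's delegation feature; the load balancer hostname and the command it runs are hypothetical:

```yaml
# Ansible play sketch: run a task on a different host than the one
# being configured. Hostname and command are hypothetical.
- hosts: webservers
  tasks:
    - name: Take the web server out of the load balancer pool
      command: /usr/local/bin/disable_backend {{ inventory_hostname }}
      delegate_to: lb.example.com
```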
Ease of deployment and learning curve
Thanks to its simplicity, Ansible appears to be easier to learn and faster to deploy. However, our experience shows that, once the initial deployment has been properly carried out, for well-trained IT staff working with Salt is a breeze. Since both systems share many similarities, knowledge previously acquired on one system greatly simplifies the learning process on the other. This aspect also facilitates the integration of Salt and Ansible in the same architecture.
Below you can find an overview of all features we evaluated.
| Feature | Salt | Ansible |
|---|---|---|
| DSL (Domain Specific Language) | YAML + Jinja | YAML + Jinja (limited) |
| Transport | ZeroMQ via two TCP channels, or RAET via UDP channels | SSH |
| Availability of a client agent | Yes | No (SSH-based) |
| Run one-off commands remotely | Yes | Yes |
| Ease of deployment | More involved | Simple |
| Easy to learn | Steeper learning curve | Easier |
| Established knowledge within our organisation | No | Yes |
| Availability of a commercial edition and technical support | SaltStack Enterprise | Red Hat Ansible Tower |
Salt, with a sprinkle of Ansible
We initially considered migrating the entire infrastructure to Ansible, since this system was already used to manage part of our infrastructure, and especially because of the availability of established knowledge within the RIPE NCC.
However, after careful consideration, Salt appeared to offer better abilities and more flexibility to future-proof our infrastructure, although it comes at a price in terms of added complexity. We do acknowledge that one of the key features of Ansible is its simplicity.
Eventually, although Salt is going to play a primary role in our infrastructure, we are actually designing a system where Salt and Ansible will be integrated. To give an example, Ansible can be used to automate the deployment of the Salt master hosts.
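As a sketch of that integration, an Ansible play could bootstrap a Salt master host; the `saltmasters` inventory group is an assumed name:

```yaml
# Ansible play sketch: bootstrap a Salt master host.
# The 'saltmasters' inventory group name is an assumption.
- hosts: saltmasters
  become: true
  tasks:
    - name: Install the Salt master package
      package:
        name: salt-master
        state: present

    - name: Ensure the salt-master service is enabled and running
      service:
        name: salt-master
        state: started
        enabled: true
```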
This way, we are taking advantage of the best features of both systems.
We are convinced that we have found the best solution for our new configuration management system: a combination of Salt and Ansible. It will help us to maintain our complex infrastructure in a clean and extendable way. We are currently preparing extensively and training all staff who will work with the new system. In upcoming articles, we will go into more detail about the actual architecture and our migration strategy.
If you have any experience in this area or if you have gone through a similar exercise, we would love to hear from you.