At the RIPE NCC, we recently migrated one of the Hardware Security Modules (HSMs) we use to handle keys in RPKI. In this article, we give a detailed overview of the role HSMs play in the broader RPKI system and talk through the steps we took to carry out the migration process.
The RIPE NCC uses Hardware Security Modules (HSMs) to handle keys in Resource Public Key Infrastructure (RPKI), and we use them for two purposes. The ‘online HSM’ handles objects that need to be created continuously, for example when a member creates a Certification Authority (CA) or when a Route Origin Authorisation (ROA) is created. The other HSM is used for our offline Trust Anchor CA.
Having recently carried out a migration of this latter HSM, we thought it would be a good idea to talk about the steps we took to do this. But before getting into that, let's start by taking a closer look at the role our HSMs play in the RIPE NCC RPKI Trust Anchor (TA) system.
The RIPE NCC RPKI Trust Anchor
You can picture Public Key Infrastructure (PKI) as a tree of certificates signed at each level by Certification Authorities (CAs). At the root, there's the Trust Anchor certificate, the only certificate in the tree that is self-signed. In our case the tree has four levels, with the CA at each level signing the certificate (and a manifest) of the next:
1. The offline Trust Anchor CA (with all resources)
2. The operational certificate (with all resources)
3. RIPE NCC managed resources (‘production CA’, with all member resources)
4. Member certificates, with the private key either held in a RIPE NCC HSM (hosted) or controlled by the member (delegated)
Objects for the CAs at levels 2, 3, and 4 have to be created continuously. This is done by ‘RPKI core’. In contrast, the trust anchor itself only needs to sign objects every three months to keep its manifest up to date. The signing request is created by the RPKI online CA software (‘rpki-core’ on GitHub) and signed with the trust anchor utility (‘rpki-ta0’, also on GitHub).
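To make the four-level structure concrete, here is a minimal sketch of the tree as a chain of issuers. It is an illustrative model only: the `Certificate` class and its fields are hypothetical names chosen for this example and do not reflect how rpki-core or rpki-ta0 represent certificates.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Certificate:
    """One level of the certificate tree (illustrative model only)."""
    subject: str
    issuer: Optional["Certificate"] = None  # None stands for "self-signed"

    @property
    def is_self_signed(self) -> bool:
        return self.issuer is None

    def chain_to_root(self) -> list[str]:
        """Return subject names from this certificate up to the trust anchor."""
        chain = []
        cert: Optional["Certificate"] = self
        while cert is not None:
            chain.append(cert.subject)
            cert = cert.issuer
        return chain


# The four levels described above, each signed by the level before it.
trust_anchor = Certificate("offline Trust Anchor CA")
operational = Certificate("operational certificate", issuer=trust_anchor)
production = Certificate("production CA", issuer=operational)
member = Certificate("member certificate", issuer=production)
```

Calling `member.chain_to_root()` walks from level 4 back to level 1, and only `trust_anchor` reports itself as self-signed.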
The Trust Anchor system (level 1) is made up of two main physical components. First, there’s the machine with the trust anchor software. This machine is kept physically secure and strictly offline (there's no network or Bluetooth connection, and all updates are done from physical media). Second, there's the HSM itself, which is also a physical device that you use to securely manage keys. These components are used together in the signing ceremony for the offline CA.
The signing ceremony
When the signing ceremony takes place with the offline CA, all inputs and outputs of the system are transferred over physical media. Three separate roles are used in this ceremony:
- Engineer: facilitates the technical part of the ceremony and performs the commands
- Trusted observer: checks the inputs and outputs, makes sure the audit log is started, and makes sure a backup is made. After the ceremony is complete, the trusted observer transfers the signing response to the engineer
- CA operators (cardholders): each cardholder activates the CA by providing their smartcard and passphrase. Three out of ten cardholders are needed to activate the key of the CA
How we use the HSM for our offline CA
Our HSM is a piece of hardware that is used in combination with what's called ‘security world software’ on the machine to which it's connected. This software stores a ‘security world’ on disk, which contains encrypted keys that can only be used by an HSM that is currently part of that ‘security world’.
The HSM is operated using physical cards. Two types of card set exist: a single administrator card set for the security world and, optionally, operator card sets. The administrator and operator cards are used to authorise operations using the HSM, such as importing the security world (administrator card set) or unlocking a key for use (operator card set).
To authorise an operation, the HSM requires only a subset of the cards in a set, called a quorum. This means that the process can tolerate the loss or corruption of one or more cards without impacting the authorisation of actions on the HSM. Administrator and operator cards are not interchangeable, so an operator cardholder cannot execute actions that need an administrator card. The administrator card set is needed for operations that affect the whole security world, such as recovering keys (migrating keys to a new operator card set) or replacing card sets. Each card is protected by a passphrase, known only to the cardholder.
Both of these card sets have a quorum. For the administrator card set, three out of five cards (and passphrases) must be entered to perform an operation. For the operator card set, three out of ten need to be present to use a key. For the offline CA we use a key that needs authorisation by the operator card set before an application can use it. The total number of cards and the quorum are specified in RIPE NCC’s internal cardholder policy.
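The quorum arithmetic can be sketched in a few lines. This is only an illustration of the k-of-n idea using the numbers given above. The `CardSet` class and the card identifiers are hypothetical, and the sketch does not model passphrases or how the HSM actually verifies cards.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CardSet:
    """A k-of-n card set (illustrative; not the HSM's own data model)."""
    name: str
    total_cards: int  # n: cards issued in the set
    quorum: int       # k: cards needed to authorise an operation

    def is_authorised(self, presented: set[str]) -> bool:
        """True when at least `quorum` distinct cards have been presented."""
        return len(presented) >= self.quorum

    def tolerable_losses(self) -> int:
        """How many cards can be lost before a quorum becomes impossible."""
        return self.total_cards - self.quorum


# Quorums as stated in the text: 3 of 5 administrators, 3 of 10 operators.
admin_cards = CardSet("administrator", total_cards=5, quorum=3)
operator_cards = CardSet("operator", total_cards=10, quorum=3)
```

With these numbers, the administrator set can lose two cards and the operator set seven before a quorum can no longer be formed. Using a set for `presented` means the same card counted twice does not help reach the quorum.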
The operator cards allow access to the application keys and are used for the signing ceremony. A quorum of administrator cards is required to authorise the creation of an operator card set, and only one operator card set can be active for a key. Once the creation is authorised, the operator cardholders present their cards and each chooses a passphrase.
The cardholders for both types of cards are selected based on internal RIPE NCC policy. Cardholders:
- Must be an employee of the RIPE NCC with a permanent contract
- Must be the only representative of their department
- Must not be operationally involved in the RPKI production environment (the RPKI team itself, for example)
In addition, administrator cardholders must have sufficient authority (such as management, or senior staff who have worked for the RIPE NCC for more than 5 years).
Migration to a new HSM
Migrating the HSM was necessary to keep our environment up to date by moving away from old hardware to a newer, supported model. Before this migration we used an nCipher nShield 6000E HSM (Figure 3, left) for the offline CA, and we migrated to an nShield Edge HSM (Figure 3, right). The new HSM does not support the type of smartcard used by the previous HSM. At the same time, the firmware version of the old HSM did not support the type of smartcard supported by the new HSM.
As the HSM is the core of the RPKI system, the decision was made to execute the updates in a separate environment from production. This ensured that any issue occurring during the process would be isolated from production, which matters because firmware versions on the HSM cannot be rolled back.
Since we were also migrating to a different HSM model, we needed to ensure we had a version of the security world that is compatible with the new hardware. To do this, we decided to use a ‘transition’ setup, where the upgrades were applied until we reached a version supported by the new HSM. The result of this migration would be a new version of the security world that is supported by the nShield Edge HSM, which is then copied to the new clean production environment (see Figure 4).
To reach a supported version, we had to perform several consecutive upgrades of the HSM firmware, software and security world, followed by a backup after every upgrade as there could have been changes to the state on the filesystem. The full process is described below.
The migration itself was performed on a separate, temporary machine that had similar hardware to the production one. The software was installed on it to mimic the production environment and then the production data was copied over in the presence of a trusted observer and a consultant. To ensure there was room for potential failure, we did the migration on a spare HSM. In case this was not successful, another spare HSM was available.
The latest backup of production data was used as the starting point for the migration. The data was copied over to the migration machine using an external medium and was loaded after being authorised by the administrator cards. Several version upgrades were required on this migration machine to reach a version compatible with the new hardware. Once this was completed, the administrator card set was reissued to newly selected cardholders.
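The pattern of applying consecutive upgrades with a backup after each one can be pictured as a simple loop. The sketch below is only a schematic of that ordering. The step labels are hypothetical placeholders (the actual version numbers are not listed here), and the two callbacks stand in for manual procedures rather than any real vendor tooling.

```python
from typing import Callable


def run_upgrade_chain(
    steps: list[str],
    apply_upgrade: Callable[[str], None],
    make_backup: Callable[[str], None],
) -> list[str]:
    """Apply each upgrade in order, taking a backup after every step."""
    log = []
    for step in steps:
        apply_upgrade(step)
        log.append(f"upgraded: {step}")
        # Back up immediately, since the upgrade may have changed
        # state on the filesystem.
        make_backup(step)
        log.append(f"backup after: {step}")
    return log


# Hypothetical placeholder labels for the consecutive upgrades.
steps = ["upgrade-1", "upgrade-2", "upgrade-3"]
log = run_upgrade_chain(
    steps,
    apply_upgrade=lambda step: None,  # stand-in for the manual upgrade
    make_backup=lambda step: None,    # stand-in for the manual backup
)
```

Keeping a backup between every pair of upgrades means a failed step can be retried from the most recent known-good state, for example on a spare HSM, rather than from the original production backup.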
During migration, we used the administrator card set to:
- Initialise the HSM, after a firmware upgrade cleared it
- Issue a new administrator card set on a newer type of smartcard that is compatible with the nShield Edge
- Initialise the new HSM, and
- Create the new operator card set and ‘recover’ (migrate) the keys to that operator card set.
To test that the migration to a compatible version was successful, a signing ceremony was performed. After this was successfully completed, a new backup was made and copied to the new hardware, which had the security world software already installed. The new HSM was introduced into the security world after being presented with a quorum of administrator cards.
Operator card set migration
The last step of the migration process was to create a new set of operator cards and assign new cardholders to them. This was done at a later stage due to the unpredictable duration of the first migration (HSM and administrator card sets).
The second migration was executed in the presence of engineers, trusted observers, the administrator cardholders and the ten new operator cardholders. The process was authorised by a quorum of administrator cards and was executed in the new environment that resulted from the previous migration. The keys were migrated to the new operator card set and the old card set was removed from the HSM.
To ensure everything was correctly set up, we had another signing ceremony, this time with the new operator cardholders. As with the previous migration, all important files were backed up on a physical medium once the signing ceremony was completed.
During this migration we recreated an old machine to perform the upgrades, installed and configured a new environment, and migrated the security world and keys to the new HSM. To keep the required operational experience, we plan to actively maintain this offline environment.
Now that the offline TA is up to date, our next HSM-related project will be to update the online HSMs. While the migration for the offline CA was relatively straightforward, migrating the online CA will be more complicated because many more keys are involved. The online CA has thousands of keys, which can be created or deleted at any time, but during the migration to a new firmware version, all keys in that HSM will need to be migrated. If you are using our hosted RPKI model, your key will be migrated. This is expected to start later this year.