Proposed Improvements to "Dummification" of Personal Data in the RIPE Database

Kaveh Ranjbar — May 08, 2013 11:55 AM
The RIPE NCC proposes improvements to the algorithm used for removing personal data from the bulk provisioning of RIPE Database data.

 

The RIPE NCC provides RIPE Database data in bulk format in three different ways:

To adhere to data protection rules and community requirements, the RIPE NCC has to remove what is considered personal data from these data dumps.

Current Situation

The algorithm currently removes all objects containing personal data (PERSON and ROLE objects) and replaces them with a single dummy object:

 

 

It also replaces all references to the PERSON or ROLE objects in other objects in order to keep the whole dataset consistent:

 

Since ORGANISATION and MNTNER objects might also include personal data, the algorithm tries to obfuscate the contents by removing optional attributes and replacing the values in some mandatory attributes:

 

It has been suggested to us that the dummification process goes beyond what is needed to protect the data. For example:

  • Currently, ORGANISATION and MNTNER objects are available through the live RIPE Database without any access restrictions or limits, so it is easy to collect object keys from one of the dummified dumps and retrieve the full objects from the live database.
  • Obfuscating the references adds no value, since all other object types are available without limits from the live RIPE Database or the split dump files. The references are therefore already exposed.

The current dummification process renders the data useless for many purposes, for example:

  • The references between actual resources and their administrative and technical contacts create a meaningful and useful relation between resources and entities which is completely lost in the current dummification process.

  • Useful research data is lost by reducing all objects containing personal data to a single dummy placeholder object.

The RIPE NCC is proposing a new dummification process to address these shortcomings, while staying within the data protection rules.

Proposed algorithm:

  • Keeping the links and references in all of the objects

  • Keeping the PERSON and ROLE objects in the dump with their real NIC handles and only obfuscating the personal data fields:
    • For email addresses, we will keep the domain part of the address and will only obfuscate the email account part
    • For phone and fax numbers, we will keep the first half of the number and will obfuscate the rest
    • For addresses, if the address is longer than two lines, we will keep the last line
    • Names will be fully obfuscated
  • One exception is a ROLE object with an "abuse-mailbox:" attribute. The email value of the "abuse-mailbox:" attribute will not be obfuscated. All other email addresses in the object will have the email account part obfuscated, but none of the other obfuscation that applies to a ROLE object (as described above) will be done. By design, this object will be available in any bulk data without any query limits, so there is little value in obfuscating more of it.
  • In all other objects we will:
    • Always replace the MD5 password hash in the MNTNER objects with a default hash
    • For all email addresses in any attribute, keep the domain part of the address and obfuscate the email account part
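
The rules above can be sketched in a few lines of Python. This is only an illustration of the proposal, not the RIPE NCC's implementation. Several details are assumptions because the proposal does not specify them: the mask string, the default password hash value, the reading of "first half" of a phone number as the first half of its characters, and the choice to mask every address line when the address has two lines or fewer. The function names and the representation of an object as a list of (attribute, value) pairs are also hypothetical.

```python
import re

MASK = "***"                                   # assumption: mask string is not specified
DEFAULT_MD5 = "MD5-PW $1$dummy$placeholder"    # hypothetical default hash value
EMAIL_RE = re.compile(r"[^\s@<>]+@([^\s@<>]+)")


def mask_emails(value):
    """Keep the domain part of every email address, mask the account part."""
    return EMAIL_RE.sub(lambda m: MASK + "@" + m.group(1), value)


def mask_phone(value):
    """Keep the first half of the characters, mask the digits in the rest."""
    keep = len(value) // 2
    return value[:keep] + re.sub(r"\d", "*", value[keep:])


def dummify(obj):
    """Obfuscate one object given as a list of (attribute, value) pairs.

    The first pair is taken to identify the object type.
    """
    obj_type = obj[0][0]
    # Exception: a ROLE object with "abuse-mailbox:" only has its other
    # email addresses masked.
    is_abuse_role = obj_type == "role" and any(a == "abuse-mailbox" for a, _ in obj)
    personal = obj_type in ("person", "role") and not is_abuse_role
    n_address = sum(1 for a, _ in obj if a == "address")
    seen_address = 0
    out = []
    for attr, value in obj:
        if is_abuse_role and attr == "abuse-mailbox":
            pass                                   # left untouched
        elif personal and attr in ("person", "role"):
            value = MASK                           # names are fully obfuscated
        elif personal and attr in ("phone", "fax-no"):
            value = mask_phone(value)
        elif personal and attr == "address":
            seen_address += 1
            is_last = seen_address == n_address
            # Keep the last line only when there are more than two lines.
            if not (n_address > 2 and is_last):
                value = MASK
        elif obj_type == "mntner" and attr == "auth" and value.upper().startswith("MD5-PW"):
            value = DEFAULT_MD5                    # replace the MD5 password hash
        else:
            value = mask_emails(value)             # emails in any other attribute
        out.append((attr, value))
    return out


person = [
    ("person", "John Smith"),
    ("address", "Example Street 1"),
    ("address", "1000 AA Example City"),
    ("address", "Netherlands"),
    ("phone", "+31 20 535 4444"),
    ("e-mail", "john@example.net"),
    ("nic-hdl", "JS1-TEST"),
]
for attr, value in dummify(person):
    print(f"{attr}: {value}")
```

Note that the NIC handle passes through unchanged, so references from other objects to this PERSON object stay valid in the dump.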

If the proposal is accepted, the examples used in the previous section will look like the objects below:

 

 

 

 

Implementation

Since the resulting dataset is still self-consistent and RIPE RPSL compliant, we don't expect any incompatibility with existing tools:

  • We can keep generating both old and new dumps on the RIPE NCC's FTP server. If we go live with this new dummification algorithm, we will move the old data format to a subdirectory on our FTP server and will keep generating files in both formats for 30 days.
  • The same will happen for the NRTM feed. We will switch the main feed to the new format but will keep a server running with the old format, and will point customers to that server if they encounter an incompatibility.

We will decommission the old dummification software if there are no open incompatibility reports after 30 days of running both processes in parallel.

Any feedback about this proposal would be appreciated on the RIPE Database Working Group mailing list's discussion for this topic: http://www.ripe.net/ripe/mail/archives/db-wg/2013-May/004048.html