At RIPE 63, the RIPE NCC was asked to consider options for internationalising the content of the RIPE Database. This article provides a quick overview of the current situation as well as possible developments and challenges.
The RIPE NCC provides services to members from 76 countries. The RIPE Database is a public and open database used by the Internet community from the RIPE NCC service region and beyond.
The RIPE Database design documentation mentions ASCII as the chosen character set for the RIPE Database content. Currently, most of the data is in US-ASCII English characters.
However, the broad user community of the RIPE Database uses many different character sets. In some cases, such as person names or addresses, restricting user input to US-ASCII results in inaccurate data. On the other hand, allowing non-US-ASCII characters brings its own challenges: it might render the data unusable for users who query it, because they might not be able to read the character set used for the data.
Right now there are no policies regarding the use of non-Latin characters in the RIPE Database. From a technical point of view, the core storage of the RIPE Database stores data as bytes and is agnostic to encoding. This means users could choose their own encoding when entering data into the RIPE Database. However, this has not been tested and could generate unexpected results. To analyse the current behaviour, we can divide user interaction with the RIPE Database into two main areas: database updates and queries.
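To illustrate why encoding-agnostic byte storage can produce unexpected results, here is a minimal Python sketch. The example name and variables are purely illustrative and do not come from the RIPE Database; the point is only that the stored bytes carry no record of the encoding that produced them.

```python
# The same text encoded two ways yields two different byte sequences.
name = "Müller"
utf8_bytes = name.encode("utf-8")          # b'M\xc3\xbcller'
latin1_bytes = name.encode("iso-8859-1")   # b'M\xfcller'

# A byte store keeps whatever it is given and does not record the encoding.
stored = utf8_bytes

# A reader who guesses correctly gets the original text back.
as_utf8 = stored.decode("utf-8")           # 'Müller'

# A reader who guesses ISO-8859-1 gets garbled text, with no error raised.
as_latin1 = stored.decode("iso-8859-1")    # 'MÃ¼ller'
```

The wrong guess silently yields different text rather than failing, which is why untested mixed encodings are risky.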
RIPE Database updates
For email updates, the message is decoded based on the encoding used for the email. If no encoding is specified, US-ASCII is assumed. If the email is encoded with UTF-8, the data will be stored as UTF-8.
For Syncupdates, the connection is HTTPS and our servers offer UTF-8 as well as ISO-8859-1, but prefer UTF-8. So if the client supports UTF-8, the data will be stored as UTF-8. The same applies to web-based updates: our web-based update tools (Webupdates and Quick updates) all use HTTPS, and our server offers and prefers UTF-8, falling back to ISO-8859-1 if UTF-8 is not available.
The RIPE Database Update API also uses HTTPS and behaves exactly the same as Syncupdates.
At the data entry level, many attributes, including all object primary keys, restrict the syntax to a very specific set of ASCII characters. Attributes that allow free-form text input, like address and description, would not reject non-ASCII characters during the syntax checking process. However, as this has not been tested, we cannot guarantee that the data finally stored in the database is what you expected.
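The email behaviour described above can be sketched with Python's standard `email` package. This is not the RIPE NCC's update software; the function name `decode_update_body` is hypothetical, and the sketch assumes a simple single-part message.

```python
from email import message_from_bytes


def decode_update_body(raw_message: bytes) -> str:
    """Decode a single-part email body using its declared charset.

    Hypothetical helper: falls back to US-ASCII when no charset is declared.
    """
    msg = message_from_bytes(raw_message)
    # get_content_charset() returns None when the header has no charset.
    charset = msg.get_content_charset() or "us-ascii"
    # decode=True undoes the transfer encoding and returns raw bytes.
    payload = msg.get_payload(decode=True)
    return payload.decode(charset)


utf8_mail = (
    b"Content-Type: text/plain; charset=utf-8\r\n"
    b"Content-Transfer-Encoding: 8bit\r\n"
    b"\r\n"
    b"address: K\xc3\xb6ln\r\n"
)
plain_mail = b"Subject: update\r\n\r\nperson: John Smith\r\n"

utf8_body = decode_update_body(utf8_mail)
plain_body = decode_update_body(plain_mail)
```

In this sketch, a message that contains non-ASCII bytes but declares no charset would raise `UnicodeDecodeError`. How the real software handles that case is not stated here.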
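The server-side preference can be pictured as a small negotiation step. The sketch below assumes the client advertises its supported character sets in an HTTP `Accept-Charset` header, which is an assumption for illustration; the function names and the exact mechanism used by the RIPE NCC servers are not taken from the source.

```python
from typing import Optional, Set

# Server preference order as described in the text: UTF-8 first.
SERVER_CHARSETS = ["utf-8", "iso-8859-1"]


def accepted_charsets(accept_charset: str) -> Set[str]:
    """Parse an Accept-Charset value into the set of charsets with q > 0."""
    accepted = set()
    for item in accept_charset.split(","):
        parts = item.strip().split(";")
        name = parts[0].strip().lower()
        q = 1.0
        for param in parts[1:]:
            key, _, value = param.strip().partition("=")
            if key.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0  # treat a malformed weight as "not accepted"
        if name and q > 0:
            accepted.add(name)
    return accepted


def choose_charset(accept_charset: str) -> Optional[str]:
    """Pick the first server-preferred charset that the client accepts."""
    accepted = accepted_charsets(accept_charset)
    for charset in SERVER_CHARSETS:
        if charset in accepted or "*" in accepted:
            return charset
    return None
```

Because the loop walks the server's list rather than the client's, a client that lists both character sets always ends up with UTF-8, matching the preference described above.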
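The difference between restricted and free-form attributes can be sketched as follows. The pattern below is illustrative only and is not the actual RIPE Database syntax definition; `passes_syntax_check` is a hypothetical name, and `descr` is used as the attribute name for the description.

```python
import re

# Illustrative pattern for a restricted attribute such as a primary key.
# This is NOT the real RIPE Database syntax rule.
RESTRICTED_PATTERN = re.compile(r"[A-Za-z0-9][A-Za-z0-9_-]*")

# Free-form text attributes named in the article.
FREE_FORM = {"address", "descr"}


def passes_syntax_check(attribute: str, value: str) -> bool:
    """Return True if the value would pass this simplified syntax check."""
    if attribute in FREE_FORM:
        # Free-form text is not checked, so non-ASCII input is not rejected.
        return True
    # Restricted attributes must match the ASCII-only pattern in full.
    return RESTRICTED_PATTERN.fullmatch(value) is not None
```

Under these assumptions a key such as `JS1-RIPE` passes, a key containing an accented letter fails, and an address written in Farsi passes unchecked, which is exactly the untested path the text warns about.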
RIPE Database queries
Command line queries return the results as they are stored in the database. Since port 43 queries are based on raw TCP connections, raw data from RIPE Database storage is sent to the user's terminal, and interpretation of the data depends entirely on the behaviour of that terminal. If the terminal supports UTF-8, then the user can see any data that is stored in UTF-8.
Web-based queries and API-based queries behave in the same way. Again, our web server offers UTF-8 and ISO-8859-1 and prefers UTF-8. The data is then presented to the user as it is stored in the RIPE Database.
On the technical side, the update software needs significant end-to-end testing with a range of UTF-8 characters. Depending on the test results, changes may be necessary. From a policy point of view, nothing is set. Most of the current data is in Latin characters, but nothing prevents a user from entering their address in their local script, and nothing requires them to do so. This means that, at the moment, a user in Iran can choose to use Farsi to enter their organisation address. There is no policy that either prohibits or encourages this.
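A port 43 query is simple enough to sketch directly. The code below sends a query over a plain TCP socket and returns the raw bytes. The separate `render` function stands in for the terminal, which is where the bytes are actually interpreted. The host name `whois.ripe.net` and both function names are assumptions for this example, not details given in the text.

```python
import socket


def whois_query(query: str, host: str = "whois.ripe.net", port: int = 43) -> bytes:
    """Send one query over raw TCP and return the undecoded response bytes."""
    with socket.create_connection((host, port), timeout=10) as sock:
        sock.sendall(query.encode("ascii") + b"\r\n")
        chunks = []
        while True:
            chunk = sock.recv(4096)
            if not chunk:  # the server closes the connection when done
                break
            chunks.append(chunk)
    return b"".join(chunks)


def render(raw: bytes, terminal_encoding: str) -> str:
    """Mimic a terminal: interpret the raw bytes with a fixed encoding."""
    return raw.decode(terminal_encoding, errors="replace")


# No network access is needed to see the effect of the terminal's encoding.
raw = "address: Köln".encode("utf-8")
on_utf8_terminal = render(raw, "utf-8")
on_ascii_terminal = render(raw, "ascii")
```

On the UTF-8 "terminal" the text is displayed as stored, while on the ASCII one each byte of the non-ASCII character is shown as a replacement character.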
The benefit is that the address will be more accurate and it will be more relevant for local users. The downside, however, is that the whole address field might be unreadable to any user who cannot read Farsi. Most users will not even know which city the address is registered in.
Looking at similar implementations, the current data set should probably be restricted to US-ASCII characters. This means that all current object attributes would only accept US-ASCII characters. An additional set of optional attributes (mainly concerned with contact and locally relevant information such as name, address, city and description) could be made available in some objects to duplicate the standard information. These could be identified with a suffix on the attribute name, for example "address-local-lang:".
These additional optional attributes should only be allowed in an instance of an object if the original attribute is present. They provide local language versions of existing information rather than replacing existing information. So users will have the option to provide information in their local language, but they always have to provide the information in English for an international audience. It should also be made clear that in case of any dispute or question, the English data is the authoritative registered information. The local language information is just a complementary optional dataset.
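The two rules proposed above (existing attributes accept only US-ASCII, and a local language attribute is allowed only alongside its original) could be checked roughly as follows. This is a sketch of the proposal, not an implemented feature: the `-local-lang` suffix comes from the example in the text, while `validate_object` and the list-of-pairs object model are assumptions made for illustration.

```python
LOCAL_SUFFIX = "-local-lang"


def validate_object(attributes):
    """Check a list of (name, value) pairs against the two proposed rules.

    Returns a list of error strings; an empty list means the object is valid.
    """
    errors = []
    names = {name for name, _ in attributes}
    for name, value in attributes:
        if name.endswith(LOCAL_SUFFIX):
            # Rule 2: a local language attribute needs its original attribute.
            base = name[: -len(LOCAL_SUFFIX)]
            if base not in names:
                errors.append(f"{name}: requires a matching {base} attribute")
        elif not value.isascii():
            # Rule 1: standard attributes accept only US-ASCII characters.
            errors.append(f"{name}: must contain only US-ASCII characters")
    return errors


valid = [
    ("address", "Tehran, Iran"),
    ("address-local-lang", "تهران، ایران"),
]
orphan = [("address-local-lang", "تهران")]
non_ascii = [("address", "تهران")]
```

With these inputs, `validate_object(valid)` returns an empty list, while the other two objects each produce one error. The ASCII attribute stays the authoritative value, and the local language attribute only supplements it.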