The RIPE NCC was requested by the Database Working Group to perform an impact analysis for UTF-8 in the RIPE Database, including the technical and functional effects on the RIPE NCC itself. How would the RIPE Registry function with UTF-8? And what would be the impact on running the Registry?
The RIPE Database is used by a variety of users across our service region (and worldwide) for a variety of purposes. One of these purposes is “Facilitating coordination between network operators (network problem resolution, outage notification etc.)”. In order to support this, the information in the RIPE Database must be in a format that is understandable.
The official working language of the RIPE NCC is English, but many languages are spoken by our staff. This means that, in order to support the functions of the RIPE Database, information not in English must be translated (or transliterated) into English by staff members, or a separate certified translation must be provided.
In recent years, there have been numerous discussions about the need for internationalisation in the RIPE Database.
- At RIPE 63, the RIPE NCC was asked to consider possible options for the internationalisation of RIPE Database content. In April 2012, Kaveh Ranjbar wrote a RIPE Labs article on this topic to provide an overview of the current situation and possible developments and challenges. This article suggested retaining all attributes as Latin-1 but allowing additional optional attributes for local language versions. Since the article was written, the Database was updated to only allow Latin-1 encoding because mixed encoding was causing operational issues.
- In April 2015, Piotr Strzyżewski made a proposal to allow UTF-8 in all free-text attributes of all Database objects, except in primary keys.
- In June 2020, NWI-11 was proposed to support Internationalised Domain Names (IDN) in the RIPE Database. Due to a lack of UTF-8 support to represent IDN, this proposal was implemented as a Punycode conversion of IDN domains in email addresses.
- In July 2020, Cynthia Revström proposed allowing Latin-1 or UTF-8 in organisation names. Currently, the “org-name:” attribute is ASCII only, and this presents an issue when non-ASCII characters are part of the legal name.
Why is UTF-8 needed?
The RIPE Database contains the names and addresses of organisations as well as administrative and technical contacts for resources and routing in the RIPE region. The Database currently supports the Latin-1 (ISO-8859-1) character set, which can represent most characters in western European languages. However, it cannot fully support characters used in other languages and alphabets that are commonly used in the RIPE service region and beyond. Currently, any unsupported characters must be transliterated into Latin-1 (or, in some cases, into ASCII, which is a subset of Latin-1), potentially corrupting the meaning. Switching the RIPE Database from Latin-1 to UTF-8 would allow names and addresses to be properly represented for querying and display.
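To make the difference between the three character sets concrete, the following is a minimal Python sketch that reports the narrowest of ASCII, Latin-1 or UTF-8 able to encode a given name. The function name `narrowest_charset` is hypothetical and is not part of any RIPE Database software; the sample names are taken from this article.

```python
def narrowest_charset(text: str) -> str:
    """Return the narrowest of ASCII, Latin-1 or UTF-8 that can encode text."""
    for charset in ("ascii", "latin-1"):
        try:
            text.encode(charset)
            return charset
        except UnicodeEncodeError:
            continue
    # Every Python str without lone surrogates can be encoded as UTF-8.
    return "utf-8"


for name in ("Kaveh Ranjbar", "Cynthia Revström", "Piotr Strzyżewski"):
    print(f"{name}: {narrowest_charset(name)}")
```

The second name needs Latin-1 because of "ö", and the third needs UTF-8 because "ż" has no Latin-1 code point, so it is the kind of name that must be transliterated today.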
Current rules for names and addresses
The following are the current rules for names and addresses in the RIPE Database:
- Organisation name
- “The organisation name ('org-name:' attribute) is an ASCII-only text attribute. The restriction is because this attribute is a look-up key, and the WHOIS protocol does not allow specifying character sets in queries. The user can put the name of the organisation in non-ASCII character sets in the 'descr:' attribute if required.” (See Appendix A- Syntax of Object Attributes)
- Address
- All Latin-1 characters are allowed in “address:” attributes.
- Phone number
- Only decimal numerals are allowed in “phone:” attributes.
- Email address
- Only ASCII characters are allowed in the local part of email addresses.
- Non-ASCII characters in the domain part are converted to Punycode.
- Person or role name (used as admin, abuse, tech or zone contacts)
- Only ASCII characters are allowed in “person:” and “role:” attributes.
Any non-conforming characters result in a syntax error on WHOIS updates.
Note that the “descr:” and “remarks:” attributes can be used as a workaround to record names in Latin-1 encoding, but these attributes are not searchable and so will not be returned if the Latin-1 name is queried.
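As an illustration of the email address rule above, here is a minimal Python sketch that rejects non-ASCII characters in the local part and converts the domain part to Punycode. It relies on Python's built-in `idna` codec (which implements the older IDNA 2003 rules), and the function name `normalise_email` and the sample domain are assumptions made for this example, not the WHOIS server's actual implementation.

```python
def normalise_email(address: str) -> str:
    """Apply the rule described above: ASCII local part, Punycode domain."""
    local, sep, domain = address.rpartition("@")
    if not sep or not local or not domain:
        raise ValueError(f"not an email address: {address!r}")
    if not local.isascii():
        # Mirrors the syntax error returned for non-conforming characters.
        raise ValueError("only ASCII is allowed in the local part")
    # The built-in codec converts each non-ASCII label to its "xn--" form
    # and leaves ASCII labels unchanged.
    ascii_domain = domain.encode("idna").decode("ascii")
    return f"{local}@{ascii_domain}"


print(normalise_email("info@bücher.example"))  # info@xn--bcher-kva.example
```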
Current rules in RIPE NCC processes
When making a request to the RIPE NCC, any names and addresses entered that are synchronised to the RIPE Database must conform to the same syntax rules. If non-ASCII name and address information is to be supported, the internal Registry must be updated to match.
For example, to become a RIPE NCC member, all names and addresses entered in forms must be ASCII only. No normalisation or transliteration is done upon input. Non-Latin characters were accepted in the past, but these caused problems when handling applications, so validation now requires ASCII only.
Supporting documentation (including company registration papers) submitted to the RIPE NCC must be understood by our staff. We make a best effort to translate information not provided in English during data entry, where a staff member understands the language used. Otherwise, we ask organisations to submit professionally certified translations of documents.
There is a legal requirement that the company name must be added to the Standard Service Agreement (SSA), which is signed when a member joins the RIPE NCC, and which is written in English.
The internal Registry currently requires records to be maintained in a readable and searchable format. That means that at least the legal name, legal address and contact names should be written in ASCII.
Additionally, our payment system cannot currently handle non-Latin characters. The company name and address are used to generate invoices, so any non-Latin characters must be transliterated.
Functional considerations to support UTF-8
The following functional requirements must be considered to support UTF-8. The community needs to define where and how UTF-8 can be used. How will UTF-8 affect using the RIPE Database for its defined purpose?
Firstly, should UTF-8 be allowed in any existing attributes, or should additional attributes be defined to support an alternative representation? It may be difficult to query for a name if the Roman equivalent is not available.
Secondly, should the full UTF-8 character set be allowed, or a subset? For example, RFC 5892 defines a subset useful for Internationalised Domain Names (IDN).
Thirdly, should UTF-8 input be normalised? For example, should unprintable characters be substituted? Should a non-breaking space be replaced with a regular space? Should visually similar characters (homoglyphs) be substituted with a common character (for consistency and to mitigate homoglyph attacks)? Finally, should the software attempt to automatically transliterate non-Roman characters (such as Arabic, Greek or Cyrillic), or should it require that a Roman alternative be added simultaneously?
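One possible answer to some of these normalisation questions can be sketched with Python's standard `unicodedata` module. This is only an illustration under stated assumptions: the choice of NFKC and the decision to drop (rather than substitute) control characters are choices made for the example, not decisions taken by the community or the RIPE NCC, and `normalise` is a hypothetical name.

```python
import unicodedata


def normalise(text: str) -> str:
    """Illustrative normalisation of UTF-8 input (assumed policy)."""
    # NFKC composes combining sequences and applies compatibility
    # mappings, which turns a non-breaking space into a regular space.
    text = unicodedata.normalize("NFKC", text)
    # Drop control characters (category Cc) except tab, LF and CR.
    return "".join(
        ch for ch in text
        if ch in "\t\n\r" or unicodedata.category(ch) != "Cc"
    )


print(normalise("Cafe\u0301") == "Caf\u00e9")  # combining accent is composed
print(normalise("A\u00a0B") == "A B")          # non-breaking space replaced
print(normalise("\u0430") == "a")              # Cyrillic "а" is NOT unified
```

Note that Unicode normalisation alone does not address homoglyphs: the Cyrillic letter "а" (U+0430) and the Latin letter "a" remain distinct, so mitigating homoglyph attacks would need a separate mapping or a restriction on mixing scripts.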
Technical considerations to support UTF-8
It is not technically difficult to support UTF-8 in the RIPE Database. The Database team have already implemented a version of the WHOIS server that supports UTF-8 in Database objects. However, client APIs must change to accommodate UTF-8.
Currently, the RIPE Database is stored in a MariaDB database using the Latin-1 character set. Database objects are stored in binary format (BLOB). A data conversion will be needed to support UTF-8, because UTF-8 is backwards compatible with ASCII but not with Latin-1: a non-ASCII Latin-1 character is stored as a single byte, whereas UTF-8 encodes the same character as two bytes. We can initially label all objects as Latin-1 and convert to UTF-8 incrementally, and we can convert objects to UTF-8 if necessary on query and update. The Database index tables will also need to be converted to support querying in UTF-8. To convert the entire Database to UTF-8 (for all object and index tables), approximately one hour of downtime for WHOIS updates will be needed.
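The need for a conversion, and the idea of labelling each object with its character set, can be sketched in Python as follows. The function `to_utf8` and the way the label is passed are hypothetical and only illustrate the approach; the actual storage schema is not described in this article.

```python
def to_utf8(raw: bytes, charset: str) -> bytes:
    """Return the object bytes as UTF-8, given the label stored with them."""
    if charset == "utf-8":
        raw.decode("utf-8")  # validate only; raises if the label is wrong
        return raw
    return raw.decode(charset).encode("utf-8")


stored = "address: Malmö".encode("latin-1")  # bytes as stored today

# Latin-1 bytes are generally not valid UTF-8, so they cannot simply
# be relabelled: "ö" is the single byte 0xF6 in Latin-1.
try:
    stored.decode("utf-8")
except UnicodeDecodeError as error:
    print("not valid UTF-8:", error.reason)

converted = to_utf8(stored, "latin-1")
print(len(stored), len(converted))  # "ö" grows from one byte to two
```

Objects that contain only ASCII are byte-for-byte identical in both encodings, so only objects with non-ASCII Latin-1 characters actually change during the conversion.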
Currently, any non-Latin-1 encoded data is converted to Latin-1 upon update. Characters are mapped to Latin-1 or are substituted with a question mark if there is no equivalent. Control characters are also substituted (apart from tab, linefeed and carriage return). A non-breaking space is substituted with a regular space, and a soft hyphen is substituted with a regular hyphen. Similar mapping may be necessary for UTF-8.
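The substitutions described above can be approximated with a short Python sketch. It is not the WHOIS server's code: the function name `to_latin1` is hypothetical, the article does not say which character replaces control characters (a question mark is assumed here), and only characters with an exact Latin-1 code point are treated as having an equivalent.

```python
import unicodedata


def to_latin1(text: str) -> str:
    """Approximate the current conversion to Latin-1 on update."""
    # Non-breaking space -> regular space, soft hyphen -> regular hyphen.
    text = text.replace("\u00a0", " ").replace("\u00ad", "-")
    # Substitute control characters except tab, linefeed, carriage return.
    # ASSUMPTION: "?" is used as the substitute character.
    text = "".join(
        ch if ch in "\t\n\r" or unicodedata.category(ch) != "Cc" else "?"
        for ch in text
    )
    # Characters with no Latin-1 code point become a question mark.
    return text.encode("latin-1", errors="replace").decode("latin-1")


print(to_latin1("Piotr Strzyżewski"))  # Piotr Strzy?ewski
print(to_latin1("Cynthia Revström"))   # Cynthia Revström
```

The first example shows how the current behaviour loses information for names outside Latin-1, which is the problem that UTF-8 support is intended to solve.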
The RIPE Database APIs will need changes to support UTF-8:
- Port 43: request and response use the Latin-1 character set. We can keep Latin-1 for compatibility (with substitutions for non-Latin-1 characters) or switch to UTF-8 by default, which will require client changes. We could also add a client flag to specify which character set to use.
- NRTM: similar considerations as port 43.
- REST API and Syncupdates: currently, request data is converted into Latin-1. This will need to change to use UTF-8 directly. The response, however, is always encoded as UTF-8.
- Mailupdates: the client encoding is specified in the “Content-Type:” mail header. Request data is converted into Latin-1, which will need to change to support UTF-8.
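For Mailupdates, honouring the client encoding declared in the “Content-Type:” header could look like the following Python sketch, which uses the standard `email` package. The function name `decode_mail_update`, the fallback to US-ASCII and the sample messages are assumptions for illustration; it handles only simple, non-multipart messages.

```python
from email import message_from_bytes


def decode_mail_update(raw_mail: bytes) -> str:
    """Decode a non-multipart mail body using its declared charset."""
    msg = message_from_bytes(raw_mail)
    # ASSUMPTION: fall back to US-ASCII when no charset is declared.
    charset = msg.get_content_charset() or "us-ascii"
    body = msg.get_payload(decode=True)  # bytes, transfer encoding undone
    return body.decode(charset)


def sample_mail(charset: str) -> bytes:
    """Build a minimal 8bit message carrying one object attribute."""
    headers = (
        f"Content-Type: text/plain; charset={charset}\r\n"
        "Content-Transfer-Encoding: 8bit\r\n\r\n"
    ).encode("ascii")
    return headers + "address: Malmö\r\n".encode(charset)


# The same text arrives as different bytes but decodes identically.
for charset in ("iso-8859-1", "utf-8"):
    print(charset, decode_mail_update(sample_mail(charset)).strip())
```

Once the body is decoded to text, the server can store it as UTF-8 directly instead of converting it to Latin-1 first.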
Conclusion
Supporting UTF-8 in the RIPE Database will allow for proper internationalisation of names and addresses. However, we must be careful not to impede the use of the RIPE Database to facilitate cooperation. We must also facilitate the work of the RIPE NCC in supporting our members and community. Many changes will be needed in the RIPE Database and in RIPE NCC procedures to properly support UTF-8.
References
- “Purpose of the RIPE Database”, Article 3, RIPE Database Terms and Conditions
- “What We Do”, RIPE NCC
- “Internationalisation of the RIPE Database Content” (April 2012), Kaveh Ranjbar, Chief Information Officer, RIPE NCC
- “Proposal to allow UTF8” (April 2015), Piotr Strzyżewski
- NWI-11 Internationalised Domain Names (June 2020), Database Working Group
- “Regarding non-ASCII org-name” (July 2020), Cynthia Revström
- “Appendix A- Syntax of Object Attributes”, RIPE Database Documentation
- “Become a RIPE NCC Member”
- “Language Support” RIPE NCC Services WG session, RIPE 81 (October 2020), Fergal Cunningham, Head of Membership Engagement, RIPE NCC
- “RIPE NCC Standard Service Agreement”
- RFC 5892 “The Unicode Code Points and Internationalized Domain Names for Applications (IDNA)”