Designing a Name Service - Part 4: Validation, Normalization, Encoding

mbodecek · May 15, 2020, 1:05pm

This article is also available as part of our Tezos Name Service publication on Medium. That’s where you can find our future articles on the topic as they become available, but we will post them here too for easy discussion.

Defining what name strings can be registered is an important technical and functional decision that impacts usability and security. In this article, we will consider our options.

Character Set

Unsurprisingly, the choices aren’t that many:

Support for Latin characters only is very easy to implement. However, this option offers no possibility for internationalization, which makes it unusable in our eyes.
Unicode offers good prospects for internationalization, but we have to be more careful about what we allow in our names. The string primitive in Michelson currently doesn’t support Unicode so another on-chain representation has to be found. Options could include Punycode, UTF-7 or using Michelson bytes storing UTF-8 text.

Allowed Names

Let’s discuss the types of characters we want to use in our names. A simple set of rules that comes to mind could be:

Letters and numbers should be allowed, but properly normalized.
Special characters like @, ! or _ do not make good additions to names: they don’t meet our usability objective (easy to read, easy to pronounce) and don’t make names any more meaningful. This also goes for a large number of non-letter Unicode characters, like the emoji characters. However, the Unicode character space currently counts 143,859 characters so making educated judgments on individual characters might not be realistic.
Non-printable characters and whitespace should be explicitly disallowed, as they confuse the user and serve little purpose.
The dot character (.) is allowed but has a specific meaning: separating names into individual labels in a hierarchical system.
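
The rules above could be sketched roughly as follows. This is only an illustration, assuming that Unicode general categories are an acceptable proxy for "letters and numbers"; the function names `is_allowed_label` and `is_allowed_name` are hypothetical and not part of any existing library.

```python
import unicodedata

def is_allowed_label(label: str) -> bool:
    """Accept a non-empty label made only of letters (L*) and decimal digits (Nd)."""
    if not label:
        return False
    for ch in label:
        category = unicodedata.category(ch)
        # Anything else (punctuation, symbols, emoji, whitespace,
        # control and format characters) is rejected.
        if not (category.startswith("L") or category == "Nd"):
            return False
    return True

def is_allowed_name(name: str) -> bool:
    """Split the name on dots and require every label to pass the check."""
    return all(is_allowed_label(label) for label in name.split("."))

print(is_allowed_name("alice.tez"))    # True
print(is_allowed_name("al ice.tez"))   # False (whitespace)
print(is_allowed_name("alice!.tez"))   # False (special character)
```

A category-based check like this is deliberately strict: it also rejects combining marks, which some scripts need, so a real rule set would have to be more nuanced.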

We don’t see a particularly strong reason for arbitrary length limits, although a technical upper limit might be needed to curb gas costs.

Validation and Normalization

Validation means checking that a name conforms to a set of rules. By normalization we mean mapping certain characters to their normalized versions. Usually, this includes case-folding (i.e. conversion to lower-case) and removal of some invisible characters. The process makes sure that ALICE.tez and Alice.tez both map to the canonical name alice.tez. Lower-case letters also tend to be easier to tell apart visually.
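
A minimal normalization sketch using only the Python standard library could look like the one below. It is a simplified stand-in, not the IDNA mapping: the use of NFKC, the choice to drop every format character (category Cf), and the name `normalize_name` are all assumptions made for illustration.

```python
import unicodedata

def normalize_name(name: str) -> str:
    # NFKC folds compatibility variants (e.g. full-width letters) into plain forms.
    text = unicodedata.normalize("NFKC", name)
    # Assumption: treat all format characters (category Cf), such as the
    # soft hyphen or zero-width joiners, as removable invisible characters.
    text = "".join(ch for ch in text if unicodedata.category(ch) != "Cf")
    # casefold() is a Unicode-aware, more aggressive variant of lower().
    text = text.casefold()
    # Case folding may leave text that is no longer in NFKC form, so normalize again.
    return unicodedata.normalize("NFKC", text)

print(normalize_name("ALICE.tez"))  # alice.tez
print(normalize_name("Alice.tez"))  # alice.tez
```

Both inputs end up as the same canonical string, which is the property a registry needs before comparing or storing names.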

Unicode, Punycode, and IDNA

It’s no surprise that in the context of the DNS, there has been a lot of effort spent towards internationalized names. Originally only Latin characters were allowed, which brought a need for a standard that would enable full Unicode support in names but keep compatibility with the existing infrastructure that only supported ASCII.

The result was a comprehensive group of standards called IDNA that allows for internationalization using Unicode-supported scripts. It specifies a validation and normalization algorithm that fits any Unicode text into ASCII space. A reverse algorithm converts it back to Unicode when rendered for the user.

The steps for transforming Unicode names into their ASCII equivalent in IDNA are basically:

1. Normalization: perform case-folding and remove some other variant differences.
2. Validation: check that the string conforms to the DNS label rules (no special characters or non-printables, no leading or trailing hyphens, etc.) and certain Unicode-specific rules.
3. Encoding: convert the string using the Punycode encode algorithm and add the ACE prefix (xn--), resulting in an ASCII string on the output.

This approach has been battle-tested by millions of users as part of the DNS infrastructure. It could be reasonably used for our purposes after adjusting it a bit: we don’t need to store the ACE prefix differentiating Punycode names from regular ASCII names - we simply store all names in Punycode (or another encoding) by default.
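
The three steps, together with the adjustment of dropping the ACE prefix, can be sketched as follows. This is a rough illustration and not a conforming IDNA implementation: normalization and validation are reduced to a few checks, and the names `to_stored_label` and `with_ace_prefix` are hypothetical. Only the `punycode` codec is a real part of the Python standard library.

```python
import unicodedata

ACE_PREFIX = "xn--"

def to_stored_label(label: str, with_ace_prefix: bool = False) -> str:
    # Step 1, normalization (simplified stand-in for the IDNA mapping tables).
    normalized = unicodedata.normalize("NFKC", label).casefold()
    # Step 2, validation (only a small subset of the real rules).
    if not normalized:
        raise ValueError("empty label")
    if normalized.startswith("-") or normalized.endswith("-"):
        raise ValueError("leading or trailing hyphen")
    if any(ch.isspace() or not ch.isprintable() for ch in normalized):
        raise ValueError("whitespace or non-printable character")
    # Step 3, encoding with Python's built-in Punycode codec.
    encoded = normalized.encode("punycode").decode("ascii")
    return ACE_PREFIX + encoded if with_ace_prefix else encoded

print(to_stored_label("B\u00fccher", with_ace_prefix=True))  # xn--bcher-kva
print(to_stored_label("B\u00fccher"))                        # bcher-kva
```

One consequence of storing everything in Punycode is worth noting: as specified in RFC 3492, a pure-ASCII label is emitted with a trailing delimiter (with this codec, `alice` becomes `alice-`), so the stored form of ASCII names would differ slightly from what DNS uses.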

Regardless of the encoding used, well-tested, production-ready libraries that implement steps 1 and 2 are already available. That is a major advantage of using the IDNA mechanism. Another upside is that any kind of future DNS integration is made much easier if our name format is compatible with the DNS name format.

Protection Against Look-Alike Attacks

In our introduction article, we talked about spoofing attacks using Unicode characters that are graphically indistinguishable from their counterparts between scripts. For example, a name in Latin like apex.tez can be imitated using Cyrillic homoglyphs (арех.tez).

The existing countermeasures in the DNS ecosystem generally involve flagging the combination of different Unicode scripts in one label as a potential spoofing attempt. Problematic strings are either rejected (in the case of DNS registrars) or displayed to the user in their Punycode variant (in the case of the browser’s address bar). We could use a similar approach for our purposes.

Note that while the approach covers mixed-script spoofing reasonably well, it does nothing against single-script attacks (like apex vs. арех where each string is based entirely in one script). Susceptibility to single-script spoofing might not even have a solution: two names from different scripts that happen to share the same visual representation are a completely valid use case.
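
A mixed-script check can be approximated as shown below. Python's standard library does not expose the Unicode Script property, so this sketch uses the first word of each character's Unicode name as a stand-in; that heuristic, and the names `letter_scripts` and `is_mixed_script`, are assumptions for illustration only.

```python
import unicodedata

def letter_scripts(label: str) -> set:
    """Approximate the set of scripts used by the letters in a label."""
    scripts = set()
    for ch in label:
        if unicodedata.category(ch).startswith("L"):
            # Heuristic: names start with the script,
            # e.g. "CYRILLIC SMALL LETTER A" or "LATIN SMALL LETTER A".
            scripts.add(unicodedata.name(ch, "UNKNOWN").split(" ")[0])
    return scripts

def is_mixed_script(label: str) -> bool:
    return len(letter_scripts(label)) > 1

latin = "apex"
cyrillic = "\u0430\u0440\u0435\u0445"   # Cyrillic look-alike of "apex"
mixed = "\u0430pex"                     # Cyrillic first letter, Latin rest

print(is_mixed_script(mixed))     # True: flagged as a potential spoof
print(is_mixed_script(latin))     # False
print(is_mixed_script(cyrillic))  # False: the single-script look-alike passes
```

The last line illustrates the limitation described above: a label written entirely in one script is not flagged, even when it is visually identical to a label in another script. A production check would also need to allow legitimate script combinations, such as those used in Japanese.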

On-chain vs Off-chain

Responsibility for normalization and validation could lie:

100% on-chain. Ideally, we would like all validation and normalization to happen on-chain to enforce total compliance. Given the complex rules for the Unicode character space, we believe this to be extremely hard to implement in smart contracts.
Mixed on- and off-chain. We can move the responsibility for validation and normalization to the client and still perform some basic validation on the smart contract level (checking the length, checking for whitespace, etc.). This might prevent naively implemented clients from putting invalid data on-chain in some cases at least.
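
To give a feel for how small the on-chain part of the mixed approach could be, here is a Python model of such basic checks operating on raw UTF-8 bytes. It is not contract code, and the limit `MAX_NAME_BYTES` and the function name `passes_basic_checks` are hypothetical values chosen for the example.

```python
MAX_NAME_BYTES = 100  # hypothetical upper limit, chosen only for this example

def passes_basic_checks(name: bytes) -> bool:
    """Single-pass checks that need no Unicode tables."""
    if not 0 < len(name) <= MAX_NAME_BYTES:
        return False
    # Reject ASCII control characters and the space (0x00-0x20) as well as DEL (0x7F).
    return all(byte > 0x20 and byte != 0x7F for byte in name)

print(passes_basic_checks(b"alice.tez"))                        # True
print(passes_basic_checks(b"al ice.tez"))                       # False
print(passes_basic_checks("b\u00fccher.tez".encode("utf-8")))   # True
```

Checks of this kind only catch ASCII whitespace and control bytes. Non-ASCII whitespace or unnormalized text would still get through, which is why the client-side rules remain essential.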

Even with limited on-chain validation, there can be reasonable confidence that our name rules will be observed as long as the majority of clients implement them correctly. If a rogue client succeeds in storing an invalid name, it won’t be resolved by the majority of users. In effect, it will be as if the name was not stored at all. One way we can promote correct implementations is to provide libraries for popular languages and a comprehensive set of test vectors for those who want to write their own implementation.

We remain skeptical about full on-chain validation and normalization due to gas costs and the high chances of implementation errors. That being said, some prototyping needs to be done to see whether at least a subset of the logic could be implemented on-chain.

Encoding Options

In the context of IDNA, we mentioned Punycode as an algorithm for storing Unicode text in 7-bit ASCII strings, but there are multiple options.

Punycode

Punycode was specifically designed for the DNS. The output strings only contain lower-case ASCII letters, numbers, and hyphens. It is implemented as a state machine iteratively performing changes on a string buffer, which means that even basic operations, like calculating the length of a string, require a fully implemented decode algorithm. Unfortunately, doing that on-chain will be challenging, to say the least.
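
The length problem is easy to see with Python's built-in `punycode` codec. The snippet below is only a demonstration of the mismatch between stored and displayed length; the sample label is an arbitrary choice.

```python
stored = b"bcher-kva"                 # Punycode form of "bücher", without the ACE prefix
decoded = stored.decode("punycode")   # requires running the full decode algorithm

print(len(stored))    # 9: number of stored ASCII characters
print(len(decoded))   # 6: number of characters in the name the user typed
```

Any rule expressed in terms of user-visible characters, such as a minimum or maximum name length, therefore cannot be evaluated on the stored form alone.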

UTF-7

UTF-7 is a lesser-known encoding standard, which fits Unicode text into 7-bit ASCII space. Each Unicode character takes 1 to 5 encoded characters. It is considerably easier to implement than Punycode as operations like calculating the length or checking a string for specific characters are single-pass loops. Another advantage is that Latin UTF-7 characters are human-readable without additional processing.
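
As a small illustration, Python ships a `utf-7` codec that can be used to experiment with this encoding. The helper names `to_utf7` and `count_labels` are hypothetical, and the sketch assumes that the encoder writes the dot directly, which is permitted because the dot is not part of the base64 alphabet UTF-7 uses for shifted runs.

```python
def to_utf7(name: str) -> str:
    """Encode a name as a 7-bit ASCII string using Python's built-in UTF-7 codec."""
    return name.encode("utf-7").decode("ascii")

def count_labels(encoded: str) -> int:
    """Count labels in a single pass over the encoded string."""
    # '.' never occurs inside a base64 run, so every dot is a real separator.
    return 1 + sum(1 for ch in encoded if ch == ".")

ascii_name = to_utf7("alice.tez")
mixed_name = to_utf7("b\u00fccher.tez")

print(ascii_name)                 # alice.tez (plain Latin text stays readable)
print(count_labels(mixed_name))   # 2
```

Splitting into labels or scanning for a forbidden ASCII character works directly on the stored string, without decoding it first.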

UTF-8 or UTF-16 Bytes

Representing names using the bytes type is also possible with one practical disadvantage: byte sequences won’t be human-readable in their JSON representation.
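
The readability issue can be illustrated as follows. The snippet assumes that byte values are rendered as hex strings in JSON, as in the `{"bytes": "..."}` form used by Michelson's JSON representation; the sample name is arbitrary.

```python
import json

name = "b\u00fccher.tez"
raw = name.encode("utf-8")

# JSON has no bytes type, so the value is shown as a hex string.
as_json = json.dumps({"bytes": raw.hex()})
print(as_json)   # {"bytes": "62c3bc636865722e74657a"}

# A client has to decode the hex and then the UTF-8 to show the name again.
restored = bytes.fromhex(json.loads(as_json)["bytes"]).decode("utf-8")
print(restored == name)   # True
```

The conversion is trivial for tooling, but a person inspecting raw contract storage would only see the hex string.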

Namehashes

We have talked about the concept of namehashes in our article about structure. Names represented as namehashes are not publicly readable on-chain. The normalization and validation responsibility lies fully with the client. Because labels can be looked up in a rainbow table, it is quite arbitrary which names are publicly visible and usability suffers. For that reason, we believe namehashes could be used as a lookup format, but an open-text representation will still be needed when registering a name.
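
For readers unfamiliar with the idea, the sketch below shows a recursive namehash in the style popularized by ENS, followed by a toy label lookup table. The exact scheme is an assumption here: SHA-256 is used only as a stand-in hash function, and `label_hash`, `namehash` and the guess list are hypothetical.

```python
import hashlib

def label_hash(label: str) -> bytes:
    return hashlib.sha256(label.encode("utf-8")).digest()

def namehash(name: str) -> bytes:
    """Hash labels from right to left, chaining each into the parent node."""
    node = bytes(32)  # the root node is 32 zero bytes
    if name:
        for label in reversed(name.split(".")):
            node = hashlib.sha256(node + label_hash(label)).digest()
    return node

# A lookup table for labels is just a precomputed hash -> label mapping.
guesses = ["alice", "bob", "tez"]
table = {label_hash(guess): guess for guess in guesses}

print(table.get(label_hash("alice")))                    # alice (a guessable label is recovered)
print(namehash("Alice.tez") == namehash("alice.tez"))    # False
```

The second result shows why normalization falls entirely on the client in this model: the contract only ever sees the hash, so two spellings of the same name become unrelated entries.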

Summary

We outlined our views on a good name validation and normalization process. Then we explored options for character encoding and discussed the feasibility of implementing these algorithms as part of smart contracts.

Join the Conversation

We are excited to start a discussion on this topic with the community! Do you have an opinion on name validation or encoding? Have we missed something? Let’s share ideas.

And if you want to chat, come join TNS Group on Telegram!

mbodecek · May 16, 2020, 12:52pm

Just a small note about strings and UTF-7 I realized after writing this:

Michelson has no bitwise operations on characters (or a dedicated character type, for that matter), so validating characters outside of ASCII would not be pretty: UTF-7 uses 6-bit groups for characters above code point 127, so bit shifting and masking would be needed.
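
To make the point concrete, here is roughly what decoding one shifted UTF-7 run involves, written in Python. It is a sketch that handles only characters in the Basic Multilingual Plane, and the name `decode_shifted_run` is hypothetical.

```python
B64 = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"

def decode_shifted_run(run: str) -> str:
    """Decode the base64 part of a UTF-7 shift sequence (BMP characters only)."""
    buffer = 0   # bit accumulator
    bits = 0     # number of valid bits currently in the accumulator
    chars = []
    for symbol in run:
        buffer = (buffer << 6) | B64.index(symbol)   # append one 6-bit group
        bits += 6
        if bits >= 16:                               # a full UTF-16 code unit is ready
            bits -= 16
            chars.append(chr((buffer >> bits) & 0xFFFF))
            buffer &= (1 << bits) - 1                # keep only the leftover bits
    return "".join(chars)

# "+APw-" is the UTF-7 form of "ü"; the run between "+" and "-" is "APw".
print(decode_shifted_run("APw") == "\u00fc")   # True
```

Every step in the loop relies on shifts and masks, which is exactly the kind of operation that is awkward without bitwise support on characters.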