Support UTF8 in Michelson string

emishur · December 11, 2020, 8:34pm

I would like to suggest a protocol upgrade that converts the Michelson string type from an ASCII-only representation to a Unicode/UTF-8 one.

Currently, the Michelson string type supports ASCII-only text. At first glance this seems like a reasonable approach, but it creates problems when integrating with off-chain tools or storing human-readable information on chain.

Strings Should be Human-Readable.

I have heard an argument that contract developers should put rich data off-chain and keep on chain only minimal information which helps to identify and retrieve related off-chain resources.

In general, it is a good approach, but what if a developer still decides to keep all the data on chain? Strings are supposed to be human-readable (and human-readable does not mean English-only). What if a developer wants to put something like a token symbol, or a short name of the entity represented by the contract or a record, into contract storage? She must limit herself to ASCII-only symbols or figure out how to "sanitize" those strings before putting them on chain.

Interop With Off-chain Tools.

I think it is safe to say that almost all software these days runs on Unicode/UTF-8. Supporting ASCII-only strings on chain makes it hard to integrate with off-chain tools. One example is storing a URL on chain to access off-chain resources. A URL cannot be stored as a Michelson string today, since it may contain non-ASCII (UTF-8) characters.

Look at the TZIP-16 standard. It encodes the external URL as bytes because of the existing limitation of the Michelson string type.
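
To illustrate the workaround described above, here is a minimal Python sketch of how an off-chain tool might turn a non-ASCII string into a hex literal suitable for a Michelson `bytes` value, and back again. The helper names `to_michelson_bytes` and `from_michelson_bytes` are hypothetical and are not part of any Tezos library; the sketch only assumes that the text is encoded as UTF-8 and written as `0x` followed by hex digits.

```python
def to_michelson_bytes(text: str) -> str:
    """Encode text as UTF-8 and render it as a 0x-prefixed hex literal."""
    return "0x" + text.encode("utf-8").hex()


def from_michelson_bytes(literal: str) -> str:
    """Decode a 0x-prefixed hex literal back into text, assuming UTF-8."""
    if not literal.startswith("0x"):
        raise ValueError("expected a 0x-prefixed hex literal")
    return bytes.fromhex(literal[2:]).decode("utf-8")


if __name__ == "__main__":
    encoded = to_michelson_bytes("héllo")
    print(encoded)                        # 0x68c3a96c6c6f
    print(from_michelson_bytes(encoded))  # héllo
```

The round trip works, but the on-chain value is no longer readable without a decoding step, which is the inconvenience this proposal is about.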

All in all, although limiting strings to ASCII-only may have some benefits for the blockchain implementation, it makes the life of an application/contract developer more difficult.

wyc · December 12, 2020, 1:26am

+1 to this general idea, and bonus points for backward compatibility, since ASCII is a subset of UTF-8. Storing URI-like values (aka IRIs) in a String without extra encoding could be a great use case.

Wanted to point out that after some weekend reading I have confirmed that URIs are specified as ASCII-only, as per RFC 3986:

The ABNF notation defines its terminal values to be non-negative
integers (codepoints) based on the US-ASCII coded character set
[ASCII]. Because a URI is a sequence of characters, we must invert
that relation in order to understand the URI syntax. Therefore, the
integer values used by the ABNF must be mapped back to their
corresponding characters via US-ASCII in order to complete the syntax
rules.
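
As a rough illustration of the URI/IRI distinction, the Python sketch below percent-encodes the non-ASCII characters of an IRI-like string so that the result contains only US-ASCII. The function name `iri_to_ascii_uri` and the example URL are hypothetical, and this is a simplification: it does not handle internationalized host names and is not a full implementation of the IRI-to-URI mapping.

```python
from urllib.parse import quote

# Characters left untouched: URI reserved characters plus "%" so that
# sequences that are already percent-encoded are not encoded twice.
RESERVED_AND_PERCENT = ":/?#[]@!$&'()*+,;=%"


def iri_to_ascii_uri(iri: str) -> str:
    """Percent-encode non-ASCII characters as UTF-8 bytes (simplified)."""
    return quote(iri, safe=RESERVED_AND_PERCENT)


if __name__ == "__main__":
    uri = iri_to_ascii_uri("https://example.com/café?q=ü")
    print(uri)            # https://example.com/caf%C3%A9?q=%C3%BC
    print(uri.isascii())  # True
```

So an ASCII-only string can already hold any URI in its percent-encoded form; a UTF-8 string type would be needed only to store the unencoded IRI form directly.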

Would like to hear more thinking about the safe handling of multi-byte runes.
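
One concrete concern is that byte-oriented operations can split a multi-byte character. Michelson has `SIZE` and `SLICE` instructions for strings, and how they would behave on a UTF-8 string type is an open design question. The Python sketch below only illustrates the underlying issue; `byte_slice_is_valid_utf8` is a hypothetical helper, not a proposed semantics.

```python
def byte_slice_is_valid_utf8(text: str, offset: int, length: int) -> bool:
    """Slice the UTF-8 encoding of text by bytes and report whether the
    resulting chunk is still well-formed UTF-8."""
    chunk = text.encode("utf-8")[offset:offset + length]
    try:
        chunk.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


if __name__ == "__main__":
    word = "café"                     # "é" takes two bytes in UTF-8
    print(len(word))                  # 4 code points
    print(len(word.encode("utf-8")))  # 5 bytes
    print(byte_slice_is_valid_utf8(word, 0, 3))  # True: "caf"
    print(byte_slice_is_valid_utf8(word, 0, 4))  # False: cuts "é" in half
    print(byte_slice_is_valid_utf8(word, 0, 5))  # True: whole string
```

A protocol change would have to decide whether sizes and offsets count bytes or code points, and whether a slice that lands inside a character fails or is rejected at type-check time.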