Sunday, February 14, 2021

Unicode URLs

tl;dr: Get URLs that cannot be verbally communicated.


URLs can only contain English characters (really, Roman characters) and a select few special characters like the hyphen (-).  So how is はじめよう.みんな a full domain name, even though it consists of Japanese characters (Hiragana script) and a dot?


The answer lies in character encodings, and translation done by the browser.

There is a character standard called Unicode. It is what allows English, Japanese, Hindi, and other characters to be written in a single document.  URLs are Internet "addresses", the stuff that shows up in the address bar of your browser. These URLs can only be coded in ASCII characters (a-z, A-Z, 0-9, etc.). Due to their long history, URLs cannot be specified directly in Unicode. But creative minds have been hard at work here. There is an encoding for Unicode that uses only the characters that are allowed in URLs. This encoding is called Punycode.  Punycode-encoded names start with 'xn--', and the browser transparently renders the unreadable combination of ASCII characters as the corresponding Unicode characters.


So in the https://はじめよう.みんな example above, はじめよう is xn--p8j9a0d9c9a and みんな is xn--q9jyb4c. So the full URL is https://xn--p8j9a0d9c9a.xn--q9jyb4c/  If you can remember all those characters, you can type them out by hand. Your browser, detecting the Punycode prefix 'xn--', will decode xn--p8j9a0d9c9a into はじめよう and xn--q9jyb4c into みんな.

Try this translation yourself using online punycode converters.
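
If you would rather try it locally, here is a minimal sketch using Python's built-in "idna" codec. That codec follows the older IDNA 2003 rules, so treat it as an illustration of the translation rather than an exact copy of what your browser does. The variable names are just for this example.

```python
# Sketch: translate a Unicode hostname to its ASCII (Punycode) form and back,
# using the "idna" codec that ships with Python (IDNA 2003 rules).
unicode_host = "はじめよう.みんな"

# Unicode -> ASCII: each non-ASCII label is Punycode-encoded and gets "xn--".
ascii_host = unicode_host.encode("idna").decode("ascii")
print(ascii_host)  # xn--p8j9a0d9c9a.xn--q9jyb4c

# ASCII -> Unicode: roughly what the browser does for display.
decoded_host = ascii_host.encode("ascii").decode("idna")
print(decoded_host)  # はじめよう.みんな
```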


This mechanism works for Top Level Domains (the .com in mail.google.com), domain names (the google in mail.google.com), or sub-domains (the mail in mail.google.com).


What do you do with it? You can create URLs that look short but can only be recreated if the person can read/write the language. Here's an example: विक्रम.eggwall.com. This is a URL that can be clicked on but can only be verbally communicated by Hindi speakers.
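
Since the encoding applies to each dot-separated part on its own, only the non-ASCII parts of such a URL change. The sketch below shows that idea with Python's built-in "punycode" codec. The helper names (to_ascii_label, to_unicode_label, convert_host) are made up for this example, and the sketch skips the normalization and validity checks that real IDNA processing performs.

```python
# Sketch: per-label Punycode conversion. Helper names are hypothetical, and
# the normalization/validation steps of real IDNA are deliberately skipped.

def to_ascii_label(label: str) -> str:
    """Encode one label; plain ASCII labels pass through unchanged."""
    if label.isascii():
        return label
    return "xn--" + label.encode("punycode").decode("ascii")

def to_unicode_label(label: str) -> str:
    """Decode one label if it carries the 'xn--' prefix."""
    if label.lower().startswith("xn--"):
        return label[4:].encode("ascii").decode("punycode")
    return label

def convert_host(host: str, convert) -> str:
    """Apply a per-label conversion to every dot-separated part of a host."""
    return ".".join(convert(label) for label in host.split("."))

host = "विक्रम.eggwall.com"
ascii_host = convert_host(host, to_ascii_label)
print(ascii_host)                                  # only the first label gets "xn--"
print(convert_host(ascii_host, to_unicode_label))  # back to the Hindi form
```

The eggwall and com parts come out untouched, which is why a Punycode subdomain can sit under an ordinary ASCII domain.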


You can mix-and-match different scripts. Here's an example with different scripts that I cannot even read. In the screenshot here, you have a URL. Try typing that address out.


There are some Top Level Domains (TLDs) that don't allow Punycode domain names. .fun, for instance, doesn't allow Punycode URLs. Try registering विक्रम.fun, for example. But you can always register a subdomain with Punycode. So you can have विक्रम.circuits.fun.