Sunday, February 14, 2021

Unicode URLs

tl;dr: Get URLs that cannot be verbally communicated.

URLs can only be in English characters (really, Roman characters) and a select few special characters like -.  So how is はじめよう.みんな a full domain name, even though it is written in Japanese (Hiragana script) and a dot?

The answer lies in character encodings, and translation done by the browser.

There is an encoding of characters called Unicode. That is what allows English, Japanese, Hindi, and other characters to be written in a single document.  URLs are Internet "addresses", the stuff that shows up in the address bar of your browser. The domain names in these URLs can only be coded in ASCII characters (a-z, A-Z, 0-9, and the hyphen). Due to their long history, they cannot be specified in Unicode directly. But creative minds have been hard at work here. There is an encoding for Unicode that uses only the characters that are allowed in domain names. This encoding is called Punycode.  Each Punycode-encoded part of a domain name starts with 'xn--', and the browser transparently renders the unreadable combination of ASCII characters as the corresponding Unicode characters.

So in the https://はじめよう.みんな example above, はじめよう is xn--p8j9a0d9c9a and みんな is xn--q9jyb4c. So the full URL is https://xn--p8j9a0d9c9a.xn--q9jyb4c/  If you can remember all those characters, you can type them out by hand. Your browser, detecting the Punycode prefix 'xn--', will decode xn--p8j9a0d9c9a into はじめよう and xn--q9jyb4c into みんな.

Try this translation yourself using online punycode converters.
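The translation can also be tried locally. Here is a minimal sketch using the "idna" codec that ships with Python. The helper names to_ascii and to_unicode are my own, chosen for illustration, and the built-in codec follows an older revision of the IDNA rules, so treat it as a way to experiment rather than as a reference implementation.

```python
# Convert an internationalized domain name to its ASCII (Punycode) form and back.
# Uses Python's built-in "idna" codec; no third-party packages are assumed.

def to_ascii(domain: str) -> str:
    """Encode each label of a Unicode domain name into its 'xn--' form."""
    return domain.encode("idna").decode("ascii")


def to_unicode(domain: str) -> str:
    """Decode an ASCII (Punycode) domain name back into Unicode."""
    return domain.encode("ascii").decode("idna")


unicode_name = "はじめよう.みんな"
ascii_name = to_ascii(unicode_name)

print(ascii_name)              # the form that actually goes over the wire
print(to_unicode(ascii_name))  # what the browser shows in the address bar
```

Each dot-separated part is encoded on its own, which is why the TLD and the domain name each get their own 'xn--' prefix.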

This mechanism works for Top Level Domains (such as .com), domain names (such as google), and sub-domains (such as mail).

What do you do with it? You can create URLs that look short but can only be recreated if the person can read and write the language. Here's an example: विक्रम. This is a URL that can be clicked on but can only be verbally communicated by Hindi speakers.

You can mix-and-match different scripts. Here's an example with different scripts that I cannot even read. In the screenshot here, you have a URL. Try typing that address out.

There are some Top Level Domains (TLDs) that don't allow Punycode domain names. .fun, for instance, doesn't allow them. Try registering विक्रम.fun, for example. But you can always register a subdomain with Punycode. So you can have विक्रम

Tuesday, January 19, 2021

FreeBSD, the alternative

 I like FreeBSD. Something about a barebones, simple UNIX that is clean at the core. Not particularly a pragmatic choice, I'll gladly admit. For most needs, a Linux installation is both simpler and more capable. More people are familiar with Linux, so this is a niche world.

But FreeBSD is a hobby, and a particularly gratifying hobby.

The simplicity of FreeBSD is a huge draw: you can get very far with an ancient Intel machine and 1 GB of RAM. The source code is a pleasure to read, the system is easy to modify, and there is relatively little of it on a base install. Full system upgrades are clean and easy, and the ports system is elegant though often impractical.

For most people, the easiest way to run FreeBSD would be to use a virtual image, and boot it up using QEMU (Linux) or Hyper-V (Windows). Both work very well, and both support networking, which is really what you're looking for. Face it, you're not booting into FreeBSD for its giant catalog of games or media utilities.

Get the VM images from here. qcow2 works for QEMU, vhd works for Hyper-V, and vmdk works for VMware.

For QEMU, I use the following command line:

qemu-system-x86_64 \
        -m 2048 \
        -enable-kvm \
        -hda bsd-with-jail.img \
        -net user,hostfwd=tcp::10023-:22 \
        -net nic \
        -curses

That specifies:

  • (-m 2048) 2048 MB RAM
  • (-enable-kvm) use the Linux KVM hypervisor
  • (-hda bsd-with-jail.img) the disk bsd-with-jail.img as the first hard disk 
  • (-net nic) a full emulated Network Interface Card, which makes your FreeBSD setup trivial, as the guest gets a network interface called em0
  • (-net user,hostfwd=tcp::10023-:22) networking is handled at user level, so no root permissions are required, and port 10023 on the host is forwarded to port 22 on the guest. This allows you to ssh to the FreeBSD guest by ssh'ing to port 10023 on the host.
  • (-curses) Use the text-based interface using libncurses rather than a graphical interface. Great for running on cloud instances, or through GNU screen so you can leave it on forever.

Other ports can be forwarded for running web servers, DNS servers, and so on.
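Several hostfwd rules can be chained in the same -net user option, separated by commas. Here is a small sketch that builds that argument from a table of ports. The helper name user_net_arg and the web server port pair are made up for illustration; only the 10023 to 22 mapping comes from the command line above.

```python
# Build the "-net user" argument for QEMU with several forwarded TCP ports.
# Only the 10023 -> 22 rule is from the command line above; the second
# entry is a hypothetical example.

def user_net_arg(forwards: dict) -> str:
    """Return a 'user,hostfwd=...' value mapping host ports to guest ports."""
    rules = [f"hostfwd=tcp::{host}-:{guest}" for host, guest in forwards.items()]
    return ",".join(["user"] + rules)


forwards = {
    10023: 22,  # ssh into the guest
    10080: 80,  # hypothetical: a web server running in the guest
}

net_arg = user_net_arg(forwards)
print("-net", net_arg)
```

Ports above 1024 are used on the host side so that QEMU can still run without root permissions.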

The same setup can be replicated on Hyper-V. For networking, you can use the default switch. Port forwarding is more complicated, and you could achieve it with ssh tunnels on the Windows host or in the FreeBSD guest.

FreeBSD runs fine with 1 GB of RAM, and is a great system to learn the fundamentals of computing. You can play with DTrace, create constrained environments called jails, or modify the behavior of a clean UNIX kernel.  A lot of this knowledge translates to Mac OS, as the OS X kernel is a hybrid of FreeBSD and Mach, and the userland is very similar to FreeBSD.

You can easily run a dozen jails on 1 GB of RAM, depending on what you do with them.  Each jail is separated from the others and can be a low-cost, throwaway computing environment. Similar to Docker or Linux namespaces, root in a jail is confined and cannot damage the host FreeBSD system.

Set up your own UNIX sandbox today!

Thursday, January 07, 2021

Mac OS Kernel (XNU) source, on GitHub

Few people know that Apple releases the source code for many components of macOS. Many are surprised when they hear it.

Apple has a full site for their open source releases. Their most recent Mac OS release is for 11.0.1 (Big Sur). Not all source is released. There are the usual UNIX utilities: Perl, Python, Bash, and userland tools that make a functioning system. Functioning, but not always updated, as Apple doesn't always track the latest versions. They still ship Python 2.5, and other tools can be many years old. This is why you need to get a functional, bugfixed UNIX userland with Homebrew or other tools.

The most interesting source there would probably be the kernel, called XNU. It is a combination of Mach, a microkernel, and BSD, a monolith. There's some Solaris code there, some Joyent code, some Sun code, NeXT, UC Berkeley, Carnegie Mellon code, and of course, Apple code.

Apple does have an official GitHub profile, which also hosts XNU. But it was last updated two years ago, and stands at version 4903, while the latest version is 7195.

Their open source browser is passable but doesn't show every file properly. I find it much easier to read through code using Emacs, so I pulled it into a GitHub repository if anyone wants to clone it or browse it online.

Image credit: Apple Inc

Tuesday, December 29, 2020

Offline-first, the design of great picture gallery websites

tl;dr: How to use service workers to make your website responsive and functional offline

Websites have come a long way from the boring world of static <html> markup. There's a lot of new functionality in browsers. This new functionality brings benefit for big websites, but these techniques are useful for small personal websites as well.

One of the more important changes has been the support of Service Workers in many browsers. A Service Worker is JavaScript code that can intercept network requests and satisfy them locally on the client machine, rather than making server calls. There's a lot you can do with Service Workers, but I was most interested in making my home picture gallery work offline.

I wanted to allow a person to load up the gallery and be able to view it on their mobile phone or desktop even if the connection is lost. My photographs are shared with static HTML that I create using sigal, a simple image gallery software written in Python. It uses the Galleria library to create web pages that are self-contained. Since the galleries are static HTML and JavaScript, they make a great case study for simple and fast web pages. In the current gallery, the images are downloaded as they are needed, with the next few images prefetched so the user doesn't have to wait. I wanted to make this entirely offline-first, so that the images are downloaded and stored offline. Each image in my gallery is 150 KB, and galleries have 10-20 images in them. The entire gallery is roughly 4 MB, which is tiny. As a result, it loads fast and can be cached offline.

You can always implement your own Service Worker; the interface is straightforward. If you just want to use Service Workers for browser-side caching, there is a much simpler alternative. We're going to use the upup library, which is a JavaScript library that can be configured to store content offline, and cache it.

First, Service Workers need HTTPS support. Get yourself a certificate with Let's Encrypt. This is a nonprofit that has issued 225 million certificates already, and is valiantly helping the entire web move to 100% HTTPS. Get their certificate if you don't have one. Heck, get HTTPS support even if you don't want offline-first on your website. I heartily endorse them, and I have moved to HTTPS thanks to them. You should too.

Now, let's add UpUp to your website. It is important where you place it. The upup library can answer requests for a specific scope (subdirectory of your website). It sets its scope based on its location on your website. Since you want the library to serve content to your entire website, you want the library to be as close to the root level of your website as possible. Let's see a concrete example in action.

If you put the JavaScript in the /js subdirectory of your website, then it can only cache offline content for /js and not for the rest of the site. To serve content for the entire subdomain, you want the JavaScript at the root level.

We're going to put it at the root level, with this in the HTML:
    <script src=""></script>

So download the upup library and put the contents of the dist/ directory at the root level of your website. Putting random JavaScript on your site like this is usually a bad idea, so examine the source code and make sure you understand what it is doing. The service worker source is much more important than the framework code.

Now that you have the library in place, invoke the UpUp.start method, and give it the base web page (index.html) and all the content that you want it to cache. The references here have to be relative to the location of the upup.sw.min.js. If you put the library in the root of your page, all the references here have to be relative to your root page:
          'content-url': 'g_Aircraft/index.html',
          'assets': [

For simple pages like this, I find it helpful to include the <base> tag to remind me that everything is relative to the root:
<base href="">

On this gallery, all images and content are stored in the subdirectory g_Aircraft/. All thumbnails are stored in g_Aircraft/thumbnails/. So you want to load up all the images in UpUp.start:

          'content-url': 'g_Aircraft/index.html',
          'assets': [



You don't need to change anything on the server for this. I use nginx, but anything serving out static pages will do just fine. Offline-first changes your server-side metrics because many requests are handled directly on the client. So you won't be able to see when the client loaded the page again. Browser-side caching messes with these numbers too, so you will have to roll your own JavaScript if you want perfect interaction tracking.

These changes are fine for browsers that don't support Service Workers. Older browsers will be served the static content. Since they don't initialize a Service Worker, all requests will go to the server, as before. The UpUp.start section just gets ignored. Browser-side caching will continue working as before, too.

With this, the UpUp service worker will cache all the content specified in assets above. The user can go offline, and the page still functions normally. The gallery demo is available if you want to play with it.
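Listing every image by hand in assets gets tedious as a gallery grows. Here is a rough sketch of a script that walks the gallery directory and prints the paths relative to the website root, ready to paste into the assets list. The function name list_assets and the set of file extensions are assumptions for illustration. It is not part of UpUp or sigal.

```python
# Print the files of a gallery as paths relative to the website root.
# The directory name matches the gallery above; the extension list is an
# assumption and should be adjusted to what the gallery really contains.
import json
from pathlib import Path


def list_assets(site_root, gallery,
                extensions=(".html", ".js", ".css", ".jpg", ".png")):
    """Return sorted, root-relative POSIX paths of all matching gallery files."""
    site_root = Path(site_root)
    return sorted(
        path.relative_to(site_root).as_posix()
        for path in (site_root / gallery).rglob("*")
        if path.is_file() and path.suffix.lower() in extensions
    )


if __name__ == "__main__":
    # Run from the website root; the output is a JSON array of strings.
    print(json.dumps(list_assets(".", "g_Aircraft"), indent=2))
```

Since the paths come out relative to the root, they line up with the rule that every reference is relative to where the service worker file lives.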

Service Workers add complexity. You can debug the site using Chrome Developer tools -> "Application" -> "Service Workers" or "Web Developer" -> Application -> "Service Workers" in Firefox. You want to check if the service worker initialized and is storing content in "Cache storage" -> upup-cache.

Here's a demo video on Android. You can see the user load up the site on their mobile browser, go offline, and still navigate normally.

Monday, December 21, 2020

Audio feature extraction for Machine Learning

tl;dr: Books and papers for audio processing, for building Machine Learning models.

I've been experimenting with Machine Learning for audio files.

Much machine learning literature for music deals with MIDI files, which are a digital format for specifying notes, duration and loudness. This is the format to use for models that work on the level of individual notes. A simple introduction to training such models is the book "Hands On Machine Learning 2nd edition" (2019). An exercise in Chapter 15 (RNNs) introduces you to Bach chorales, and shows how to generate chords from digital music. Google's Magenta project has datasets and models for such discrete note-level training, generation and inference.

While MIDI is a convenient format for discrete music, most music data is stored as waveforms rather than MIDI. These are either uncompressed or losslessly compressed (WAV / FLAC), or encoded with the lossy MP3 or Ogg Vorbis formats. Extracting features from these files is considerably harder and requires a good understanding of audio analysis and the kinds of features that these waveforms represent. Depending on the audio stream, some understanding of music structure might come in handy.
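To give a flavor of what a waveform feature looks like, here is a toy sketch that computes two of the simplest time-domain features, RMS energy and zero-crossing rate, on a synthesized tone. It uses only the Python standard library and is not the API of any toolkit. The sample rate and test frequency are arbitrary choices, and real pipelines compute such features per short frame and mostly rely on spectral features.

```python
# Two simple time-domain features computed on a raw waveform, given as a
# list of floats in [-1, 1]. The tone is synthesized rather than read from
# a file, so the example has no dependencies.
import math


def rms_energy(samples):
    """Root-mean-square amplitude: a rough measure of loudness."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def zero_crossing_rate(samples):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(
        1 for a, b in zip(samples, samples[1:]) if (a >= 0) != (b >= 0)
    )
    return crossings / (len(samples) - 1)


sample_rate = 44100  # samples per second (assumed)
frequency = 440.0    # Hz, an arbitrary test tone

# One second of a pure sine wave.
tone = [
    math.sin(2 * math.pi * frequency * n / sample_rate)
    for n in range(sample_rate)
]

print("RMS energy:", rms_energy(tone))
print("zero-crossing rate:", zero_crossing_rate(tone))
```

A pure tone crosses zero twice per cycle, so the zero-crossing rate is a crude pitch indicator for clean signals and a noisiness indicator for everything else.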

Essentia is a freely-available library for handling audio information for music analysis, with bindings for C++ and Python. A Python tutorial on Essentia covers some of the basics.

A survey paper "An Evaluation of Audio Feature Extraction toolboxes", by Moffat D., Ronan D., and Reiss J.D. (2015) covers some toolkits in a variety of languages.

In order to use any toolkit for feature extraction, you need to know what features to look for, and which algorithms to select. This is a fast-moving area of research. The book "Fundamentals of Music Processing", by Meinard Müller (2016), covers all the background on audio encoding, music representation, and analysis algorithms. The first two chapters cover the core concepts. Chapters 3 onwards dive into individual topics and can be read in parallel. This allows software engineers to understand the basics, and then immediately focus on the task at hand. This book is dense and requires a firm understanding of Linear Algebra. Once you know the terminology, you can read the relevant papers on feature types and either use a readily available library, or write the extraction code yourself.

Finally, pre-trained models in Essentia allow you to use existing models for classification tasks. An online demo exists to test out the functionality in a browser.

Friday, December 11, 2020

Tailscale: the best, secure, private VPN you need

tl;dr: Tailscale easily sets up remote machines as though they were on your local network (VPN)

Tell me if this situation sounds familiar: you have a variety of machines, at home and at remote locations. All combinations: behind a proxy/NAT,  connected directly, a mix of Linux and Mac/Windows systems, a mixture of physical hardware and cloud instances. You want these machines to behave like they're on a local network and to use them without jumping through hoops. You could access these machines with proxies like ngrok or other tunneling software like ssh, but that's complicated.

You could set up a VPN, but that is time consuming and difficult to manage. Setting up a VPN requires skill and is difficult to get right across host platforms or architectures.

Tailscale is an excellent, simple-to-set-up, and secure VPN, with clients available for all major systems and architectures. You use an existing account (like your Gmail address) for authentication, download the client software for your platform, and authenticate by navigating to a web address. Setting it up is refreshingly simple. It even sets up DNS, so you can refer to your machines by their hostname: p31 can be used instead of the full VPN IP address.

I've used a variety of VPN systems in the past; I've also set up my own tunnels using different providers; I've rolled my own tunnels from first principles. Compared to existing systems, Tailscale is easier to set up, efficient, and has great network performance. Network latencies are lower than traditional hub-and-spoke systems, which relay through a central server. If the central server is located far from both VPN'd machines, network performance is usually poor.

Right now I'm pinging machines that are behind a NAT, accessing web pages on a different physical network, all by referring to simple hostnames. There's arm32, arm64, x64, different operating systems, physical and cloud instances that all appear as a local Class-A network. This is like magic!

Tailscale is also great for working on projects on your cloud or local instance without exposing it to the wild Internet traffic.

Image courtesy: Wikipedia.