Monday, December 21, 2020

Audio feature extraction for Machine Learning

tl:dr; Books and papers for audio processing, for building Machine Learning models.


I've been experimenting with Machine Learning for audio files.

Much machine learning literature for music deals with MIDI files, which are a digital format for specifying notes, duration and loudness. This is the format to use for models that work on the level of individual notes. A simple introduction for training such models is using the book "Hands On Machine Learning 2nd edition" (2019), An exercise in Chapter 15 (RNNs) introduces you to Bach chorales, and shows how to generate chords from digial music. Google's Magenta project has datasets and models for such discrete note-level training, generation and inference.


While MIDI is a convenient format for discrete music, most music data is stored as waveforms rather than MIDI. These are either raw WAV / FLAC, or encoded with the MP3 or Ogg Vorbis encoding. Extracting features from these files is considerably harder and requires a good understanding of audio analysis and the kinds of features that these waveforms represent. Depending on the audio stream, some understanding of music structure might come in handy.

Essentia is a freely-available library for handling audio information for music analysis, with bindings for C++ and Python. A Python tutorial on Essentia covers some of the basics.

A survey paper "An Evaluation of Audio Feature Extraction toolboxes", by Moffat D., Ronan D., and Reiss J.D. (2015) covers some toolkits in a variety of languages.


In order to use any toolkit for feature extraction, you need to know what features to look for, and which algorithms to select. This is a fast-moving area of research. The book: "Fundamentals of Music Processing", by Meinard Müller (2016) covers all the background on audio encoding, music representation, and analysis algorithms. The first two chapters cover the core concepts, and chapters 3 onwards dive into individual topics, and can then be read in parallel. This allows software engineers to understand the basics, and then immediately focus on the task at hand. This book is dense and requires a firm understanding of Linear Algebra. Once you know the terminology, you can read the relevant papers on feature types and either use a readily available library, or write the extraction code yourself.


Finally, pre-trained models in Essentia allow you to use existing models for classification tasks. An online demo exists to test out the functionality in a browser.