Saturday, September 19, 2020

Compiling Tensorflow without AVX support, a Googler's perspective

tl;dr: Tensorflow compilation teaches you about the complexity of present-day software design.


Compiling Tensorflow is a curious experience. If I put on my external-user hat, the process is baffling. Many Tensorflow choices are motivated by development practices inside Google, rather than by common open source development idioms. And so, as a Google engineer, I can explain what is going on and the motivations behind the choices.

Why compile Tensorflow?

I want to run Tensorflow on an old Intel CPU that doesn't have AVX support. AVX (Advanced Vector Extensions) is a set of SIMD "vector" instructions that speed up computation on large data streams, available only on relatively new Intel and AMD processors. The solution is to compile Tensorflow from source, since the prepackaged Tensorflow binaries (after version 1.6) are compiled expecting AVX support on the processor.
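
A quick way to check, on Linux, is to look for the avx flags in /proc/cpuinfo:

    grep -o 'avx[a-z0-9_]*' /proc/cpuinfo | sort -u
    # prints avx, avx2, ... on a capable CPU; no output means no AVX,
    # and the prebuilt wheels will die with "Illegal instruction"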

No problem: I'm an engineer and have done my share of large system compilations. I can do this.

Dependencies

Tensorflow compilation has only been tested on gcc 7.3.0, which was released in January 2018. The default gcc on Ubuntu 20.04 is 9.3.0. A user compiling software from source is probably going to use a recent version of the compiler toolchain; I doubt most users will install an old version of gcc on their machine (or in a Docker image) just to compile Tensorflow. I didn't either, and went with gcc 9.3 with fingers crossed.
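
If you did want to match the tested toolchain, bazel's autoconfigured C++ toolchain picks up the CC and CXX environment variables, so a sketch like this should work (the gcc-7 package names are Ubuntu's; whether pinning 7.x is worth the trouble is another question):

    # hypothetical: pin the build to an older gcc, assuming it's packaged
    sudo apt install gcc-7 g++-7
    export CC=/usr/bin/gcc-7 CXX=/usr/bin/g++-7
    ./configure   # Tensorflow's interactive configure script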

Perhaps this is just the complexity of software development today. Given the pace of development and releases, you cannot possibly support every version of gcc; every version of Ubuntu, Debian, macOS, and Windows; and every combination of compute architecture: x86, x86_64, x86 with AVX, arm64, GPU with CUDA. Add to this the complexity of the different target platforms: Python, C++, ...

Unlike a few years ago, compilers like gcc and llvm are themselves updated frequently. This is great, as bugs get fixed, but it imposes a large burden of supporting different toolchains.

Lessons

Tensorflow downloads its own version of llvm. Instead of relying on the system version of llvm, which might have its own quirks, it just fetches everything itself.

That's not all: Tensorflow downloads all of its dependencies: boringssl, eigen3, the AWS libraries, the protobuf libraries, and llvm-project, from GitHub or their respective repositories. I suspect most of these go into //third_party.

It is an interesting choice to download most of these rather than expecting them to be installed locally. On one level, it reduces the complexity of figuring out why Tensorflow builds on Ubuntu but not on Fedora or FreeBSD. But managing these packages adds complexity of its own: how do you know which version of protobuf or llvm to check out, and what happens if those dependencies are no longer available? The most obvious cost is that you have to compile all your dependencies too, even though a user might already have pre-compiled llvm and protobuf libraries installed.
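
You can see these pinned downloads in the bazel workspace files. In the TF 2.4 tree, something like the following lists the archive rules and the exact versions they fetch (the file layout here is my observation from the source tree, not a stable interface):

    # where each third-party dependency comes from, and its pinned checksum
    grep -n 'tf_http_archive' tensorflow/workspace.bzl | head
    grep -n 'sha256\|strip_prefix' tensorflow/workspace.bzl | head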

If anything, the Tensorflow style looks similar to the Android Open Source Project (AOSP) or FreeBSD's Ports collection. In both of these, the downloaded repository creates a parallel universe of source and objects, and you compile everything from scratch. The notable difference from FreeBSD is that the output of FreeBSD's Ports is installed in /usr/local/ and is then available to the rest of the system. After you compile protobuf for Tensorflow, you still don't have the protobuf library available to the wider system.

The reason for this is probably that Google engineers compile the whole world. Google production binaries shouldn't rely on the specific version of eigen3 you happen to have on your development machine. Instead, you get a specific version of eigen3 from the repository (the monorepo), and use that. Ditto for llvm. Most of this open-source dependency code does not diverge far from upstream, as bugfixes are reported back to the authors. This keeps the dependencies sane. I suspect the versions of llvm and eigen3 chosen here are the same versions that were in the monorepo at the time Tensorflow 2.4 was released.

This differs from other large open source projects. If you want to compile Emacs, you are expected to have all the dependencies locally. It needs libjpeg, so you install that through apt or yum. Then you realize you need the X11 libraries; OK, go get those separately. This is cumbersome, and it increases the risk of a failure at runtime, as your version of libjpeg might not be the one the authors tested against.

Bazel does help when compiling everything. On a subsequent run, it won't need to recompile boringssl. Inside Google, the build system reuses objects from prior runs of other engineers, which vastly speeds up an individual's compilation. An open source developer does not benefit from this on their first compile of Tensorflow: they are starting out cold. Their subsequent runs are sped up, of course, but how often do you compile Tensorflow again? You generate the Python wheel, rm -Rf the checked-out repo, and carry on with your Python data analysis.
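
If you do expect to rebuild, bazel can at least persist its cache across checkouts. The --disk_cache flag writes action outputs to a directory of your choosing, a rough single-machine approximation of the shared cache Googlers enjoy:

    # reuse compiled objects across builds, even after deleting the repo
    bazel build --disk_cache=$HOME/.cache/tf-bazel \
        //tensorflow/tools/pip_package:build_pip_package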


Another quirk: at the end, the bazel server is still running on the machine, and it shuts down only after many hours of disuse. This might be fine for Google engineers, who will be compiling other software soon; for them, the cost of keeping bazel up and running is small compared to the benefit of the pre-warmed caches and memory. I suspect independent open source developers are baffled as to why bazel is holding on to 400+ MB of RAM hours after the compilation is done.
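
If the resident server bothers you, it can be told to exit:

    bazel info server_pid   # the Java process holding that memory
    bazel shutdown          # stop the server; the on-disk caches survive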

The choice of bazel itself is interesting. Most open source software uses the 'make' tool, despite its numerous flaws. Bazel is the open source implementation of Google's internal build tool, so Tensorflow uses that. Even Android AOSP uses make, since bazel wasn't available as open source back in 2008 when AOSP was released.

Other systems

Let's see how other systems manage this sort of complexity.

PyTorch, by comparison, offers a rich selection of choices. You select the build, the OS to run on, which package manager you want to use (Conda/pip/...), whether you have CUDA installed, and if so, which version. It then tells you exactly how to get PyTorch installed on your system.
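
The output of that selector is a one-liner. For a default pip setup it amounts to something like the following (the exact command varies with the options you pick):

    pip3 install torch torchvision   # the selector adds flags for specific CUDA versions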
 
This raises a question: why can't Tensorflow be made available as a pre-compiled binary in many more configurations? The Intel Math Kernel Library (MKL), for example, is available in some 50 variants packaged natively for Ubuntu: libmkl-avx, libmkl-avx2, libmkl-avx512, libmkl-avx512-mic, libmkl-vml-avx, ... These are all variants for specific Intel CPUs, to extract the maximum possible performance from each system. Tensorflow is similar: it is built to efficiently process compute-intensive workloads. Why isn't Tensorflow available in 50 different variants targeting avx, no-avx, avx2, avx512, ...?
Here, I am guessing the choices are due to the Google-engineer/open-source divide. At Google, most engineers run a specific kind of machine, and so the pre-compiled binaries target these workstations and the similar CPUs on cloud compute farms. Most internal (and Google Cloud) users don't deviate from these computing setups, so Tensorflow on a Core 2 Duo from 2006, or on ARM32 or ARM64, isn't a high priority. This is a real lost opportunity, because compiling for multiple targets can be automated. The real cost here is maintenance: if you do provide Tensorflow on a Core 2 Duo or ARM32, you are implicitly providing support for it.
The open source answer here would be to appoint a maintainer for each such architecture. The Macintosh PowerPC port of Linux is still maintained by Benjamin Herrenschmidt, among others; he cares about that architecture, so he helps keep it up and running. The community would probably maintain a no-AVX binary if you empowered them.


The Linux kernel is also an incredibly complex system. You are building an operating system kernel, which by definition is architecture- and device-specific. Even in 2020, you can build the Linux kernel for machines as varied as PowerPC, ARM, MIPS, Intel x86_64, and Intel 386, and of course you can build with or without AVX-dependent code. The Linux kernel depends on very few external libraries and is almost entirely self-contained. It compiles with make, and generates targets for many more architectures than Tensorflow. It has a huge configuration system with many, many choices; most of the complexity is the skill and expertise needed to understand the options and select them. You can always take an existing kernel configuration from a running system, and then run 'make menuconfig' to modify the specific options you want to change.
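
Concretely, that last workflow looks like this:

    # start from the running system's configuration
    cp /boot/config-$(uname -r) .config
    make olddefconfig    # accept defaults for options new to this source tree
    make menuconfig      # adjust just the options you care about
    make -j$(nproc)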



The comparison might not be entirely fair, though. The Linux kernel has been in active development for almost three decades. It was always developed in a decentralized way, and it has perfected open source development and release in the process. The open source process has also been shaped by the quirks of the Linux kernel, to the point where it is difficult to tell whether Linux influences open source or open source influences Linux.

 

Outcome

The compilation took a long time on the old machine: all four CPUs were busy for a whole day. But at the end, I had Tensorflow for Ubuntu 20.04 on x86_64 without AVX support. I tried it out on a Celeron and a Core 2 machine, and it works great. Tensorflow is perfect for these old machines: you can run model training for a few hours, turn the screen off, and leave it alone.
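
For the record, the build itself roughly follows Tensorflow's documented flow. Run on the old machine itself, the default -march=native optimization flag targets the host CPU, which on a pre-AVX processor naturally leaves AVX out:

    ./configure            # accept -march=native when asked for optimization flags
    bazel build --config=opt //tensorflow/tools/pip_package:build_pip_package
    ./bazel-bin/tensorflow/tools/pip_package/build_pip_package /tmp/tensorflow_pkg
    pip3 install /tmp/tensorflow_pkg/tensorflow-*.whl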

Since I have the wheel compiled for Python 3, here it is if anyone needs Tensorflow 2.4 without AVX support for Ubuntu 20.04 and Python 3. If you need another version, find my email address and mail me.

Just for fun, I'd love to compile PyTorch from source as well. It seems to follow the open source paradigm closely: you install specific dependencies using yum/apt, and it uses those directly.


Conclusion

Tensorflow compilation was an interesting process to see from the outside. The compilation is far more complicated than usual because of the wealth of dependencies. Most users are aware of Tensorflow's complexity as a library, but a lot more of it becomes visible when you compile the system yourself. The choices here are motivated by development practices at Google, and they make an interesting case study in large system design.
 

Disclaimer: I'm a Google employee, but these are my own opinions from the public Tensorflow project. I did not examine any Google confidential systems to arrive at these observations.