Building TensorFlow for RISC-V

Background

This document describes the changes needed to a Bytedance internal version of TensorFlow v1.15.3 to enable building for RISC-V targets. The motivation for sharing the details is to provide a starting point for the kind of changes needed to enable building TensorFlow for RISC-V.

The document is based on the work done by Zhiyuan Ji. For any questions, comments or feedback, please reach out to punit.agrawal@bytedance.com.

Introduction

Because performant RISC-V hardware is scarce and tool support on RISC-V hosts is limited, TensorFlow is built using cross-compilers on an x86 host.

Although many missteps were taken while enabling the RISC-V build, once all the pitfalls were identified and the build issues resolved, the actual changes turned out to be relatively small - refer to the patch in the appendix (~300 new lines).

TensorFlow has a simple dependency graph, similar to the one shown below.

Figure: TensorFlow Dependencies

Among these, Eigen is header only and the other dependencies already have support for RISC-V.

Environment

For the build, we use a RISC-V cross-compiler running on x86. For ease of deployment, a container image containing Clang is used.

    clang --version
    clang version 18.1.8 (XXXX)
    Target: riscv64-unknown-linux-gnu
    Thread model: posix
    InstalledDir: /usr/local/bin

Note: An internal compiler release is used here for convenience. Using a similar upstream release should be fine as no additional features are required.

Additionally, a complete x86 toolchain is needed to build binaries that are executed during the bazel compile tasks, such as protoc and llvm-tblgen.

Next, let's prepare the TensorFlow source code and address the problems encountered during the build. One of the advantages of the internal version is that it solves many build issues in the internal environment (library upgrades, build dependencies, etc.) compared to the upstream version.

Bazel

Unsurprisingly, when the unmodified build script is executed, Bazel defaults to building the x86 version of the libraries. Bazel detects that the host is x86_64 and selects the toolchain suited to this host, @local_config_cc//:cc-compiler-k8.

Passing --toolchain_resolution_debug=.* prints toolchain resolution debug information. The output shows that Bazel rejected some Arm toolchains and finally settled on @local_config_cc//:cc-compiler-k8 based on the default target platform, x86_64.

    INFO: ToolchainResolution: Target platform @local_config_platform//:host: Selected execution platform @local_config_platform//:host,
    INFO: ToolchainResolution: Type @bazel_tools//tools/cpp:toolchain_type: target platform @local_config_platform//:host: Rejected toolchain @local_config_cc//:cc-compiler-armeabi-v7a; mismatching values: armv7, android
    INFO: ToolchainResolution: Type @bazel_tools//tools/cpp:toolchain_type: target platform @local_config_platform//:host: execution @local_config_platform//:host: Selected toolchain @local_config_cc//:cc-compiler-k8
    INFO: ToolchainResolution: Target platform @local_config_platform//:host: Selected execution platform @local_config_platform//:host, type @bazel_tools//tools/cpp:toolchain_type -> toolchain @local_config_cc//:cc-compiler-k8

Let's use some tricks to better understand the build process and identify some issues.

Replace native compiler with cross-compiler

When $PATH is hacked to force the use of a GCC-based cross-compiler, some Bazel tasks fail with errors.

The build defaults to flags that are not recognised by the cross-compiler (x86 specific optimizations).

Similarly, there is some architecture specific assembly that will not work on RISC-V.

To solve these errors, check the BUILD files that include the erroring source files. In Bazel, a BUILD file is a collection of rules that describe how to generate the compile commands.

You will find some select statements such as

    # in file third_party/hwloc/BUILD
    src = [
        # some other src or headers
    ] + select({
        "@org_tensorflow//tensorflow:linux_x86_64": [
            "hwloc/topology-linux.c",
            "include/hwloc/linux.h",
            "hwloc/topology-x86.c",
            "include/private/cpuid-x86.h",
        ],
        "//conditions:default": [],
    })

    # in file tensorflow/BUILD
    config_setting(
        name = "linux_x86_64",
        values = {"cpu": "k8"},
        visibility = ["//visibility:public"],
    )

select is a placeholder for a value that will be chosen based on configuration conditions. The condition linux_x86_64 is a config_setting defined in the file tensorflow/BUILD. When this condition is met, Bazel generates compile actions for these x86-specific files, resulting in the error seen above.

It is possible to delete the code segment above to resolve the error, but we can also change the configuration conditions so that linux_x86_64 is not enabled, which is the approach chosen in the linked commit.
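As a rough illustration of working with the conditions rather than deleting code, the select can be given an explicit RISC-V branch so that the x86-only sources are never picked for that target. This is only a sketch and not the change from the linked commit: the linux_riscv64 setting name, the "riscv64" value for --cpu, and the choice of which hwloc files are portable are all assumptions.

```starlark
# tensorflow/BUILD (sketch)
# Hypothetical setting; assumes the build is invoked with --cpu=riscv64.
config_setting(
    name = "linux_riscv64",
    values = {"cpu": "riscv64"},
    visibility = ["//visibility:public"],
)

# third_party/hwloc/BUILD (sketch)
src = [
    # some other src or headers
] + select({
    "@org_tensorflow//tensorflow:linux_x86_64": [
        "hwloc/topology-linux.c",
        "include/hwloc/linux.h",
        "hwloc/topology-x86.c",
        "include/private/cpuid-x86.h",
    ],
    # Assumption: the generic Linux files build on RISC-V,
    # while the x86/cpuid files are left out.
    "@org_tensorflow//tensorflow:linux_riscv64": [
        "hwloc/topology-linux.c",
        "include/hwloc/linux.h",
    ],
    # Any other configuration adds no platform-specific sources.
    "//conditions:default": [],
})
```

With a branch like this, a configuration that matches neither setting still falls through to //conditions:default, so existing builds are unaffected.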

Tools built during the build

After resolving the above, we soon hit another issue. Some tools, such as llvm-tblgen, are built as part of the TensorFlow build and used subsequently during the build. As the cross-compiler builds RISC-V versions of these binaries, they cannot be executed on the host (x86) platform.

So far, Bazel isn't aware that the compiler has been substituted with a cross-compiler. To fix these kinds of issues, let's teach Bazel that we actually are interested in cross-compiling for another architecture.

Teaching Bazel about cross-compilation targets

Fortunately, Bazel has good support for cross-compiling through platform definitions.

Based on this documentation, we need to define a RISC-V target platform that describes the architecture the output libraries will run on, and we also need to provide a cross-compile toolchain for this target platform to use.

In summary, a toolchain describes the build toolset and the flags to be used during compilation. A platform can be thought of as a collection of constraint values that influence the generated build commands. Using these, we can tell Bazel that the target environment is different from the build environment, and Bazel will generate the compile commands accordingly.

Refer to the Bazel documentation to define a toolchain.

Now, the task is to provide a CcToolchainConfigInfo structure that defines the behavior of the toolchain:
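A minimal sketch of such a rule is shown here for orientation, following the pattern in Bazel's C++ toolchain tutorial. It is not the configuration from the patch: the file path, the rule name riscv64_toolchain_config, the toolchain identifier and the LLVM tool names under /usr/local/bin are assumptions, and a real configuration would also need include directories and compile/link flags.

```starlark
# toolchain/riscv64_toolchain_config.bzl (hypothetical path)
load("@bazel_tools//tools/cpp:cc_toolchain_config_lib.bzl", "tool_path")

def _impl(ctx):
    # Bazel looks tools up by these fixed names; "gcc" is the compiler
    # driver slot even when the driver is Clang.
    # Assumption: the LLVM binutils live next to clang in /usr/local/bin.
    tool_paths = [
        tool_path(name = "gcc", path = "/usr/local/bin/clang"),
        tool_path(name = "ld", path = "/usr/local/bin/ld.lld"),
        tool_path(name = "ar", path = "/usr/local/bin/llvm-ar"),
        tool_path(name = "cpp", path = "/usr/local/bin/clang-cpp"),
        tool_path(name = "gcov", path = "/bin/false"),
        tool_path(name = "nm", path = "/usr/local/bin/llvm-nm"),
        tool_path(name = "objdump", path = "/usr/local/bin/llvm-objdump"),
        tool_path(name = "strip", path = "/usr/local/bin/llvm-strip"),
    ]
    return cc_common.create_cc_toolchain_config_info(
        ctx = ctx,
        toolchain_identifier = "riscv64-toolchain",  # hypothetical
        host_system_name = "local",
        target_system_name = "riscv64-unknown-linux-gnu",
        target_cpu = "riscv64",
        target_libc = "unknown",
        compiler = "clang",
        abi_version = "unknown",
        abi_libc_version = "unknown",
        tool_paths = tool_paths,
    )

riscv64_toolchain_config = rule(
    implementation = _impl,
    attrs = {},
    provides = [CcToolchainConfigInfo],
)
```

Hard-coding absolute tool paths keeps the sketch short and suits a container image where the compiler location is fixed; it would need rework for a hermetic toolchain.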

Platform definition -
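A minimal sketch of what such a platform definition might look like, given as an illustration and not as the definition from the patch. The package and target name follow the //platform:linux_riscv64 label used later in this document; the constraint labels assume a Bazel setup where the @platforms repository is available and provides a riscv64 CPU constraint.

```starlark
# platform/BUILD (sketch)
platform(
    name = "linux_riscv64",
    # The target is described purely by constraint values;
    # toolchains declare which constraints they are compatible with.
    constraint_values = [
        "@platforms//os:linux",
        "@platforms//cpu:riscv64",  # assumed to exist in the platforms repo in use
    ],
)
```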

Register and use the platform and toolchain definitions -
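The registration step could look roughly like the sketch below, again modelled on Bazel's C++ toolchain tutorial and not taken from the patch. The //toolchain package, the target names, and the riscv64_toolchain_config rule (assumed to be a rule returning CcToolchainConfigInfo, defined in a hypothetical .bzl file) are all illustrative assumptions.

```starlark
# toolchain/BUILD (hypothetical package)
load(":riscv64_toolchain_config.bzl", "riscv64_toolchain_config")

filegroup(name = "empty")

riscv64_toolchain_config(name = "riscv64_config")

cc_toolchain(
    name = "riscv64_cc_toolchain",
    toolchain_config = ":riscv64_config",
    # The compiler is installed in the container image, so no
    # workspace files need to be staged for the actions.
    all_files = ":empty",
    compiler_files = ":empty",
    dwp_files = ":empty",
    linker_files = ":empty",
    objcopy_files = ":empty",
    strip_files = ":empty",
    supports_param_files = 0,
)

toolchain(
    name = "riscv64_linux_toolchain",
    # Runs on the x86_64 build host ...
    exec_compatible_with = [
        "@platforms//os:linux",
        "@platforms//cpu:x86_64",
    ],
    # ... and produces code for the RISC-V target platform.
    target_compatible_with = [
        "@platforms//os:linux",
        "@platforms//cpu:riscv64",
    ],
    toolchain = ":riscv64_cc_toolchain",
    toolchain_type = "@bazel_tools//tools/cpp:toolchain_type",
)

# WORKSPACE
register_toolchains("//toolchain:riscv64_linux_toolchain")
```

Depending on the Bazel version, platform-based resolution of C++ toolchains may additionally need to be switched on with --incompatible_enable_cc_toolchain_resolution.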

With suitable configurations in place, use --platforms=//platform:linux_riscv64 to build for the RISC-V target. Based on the debug output, it can be seen that Bazel correctly resolves the custom toolchain:

The highlighted sections show that Bazel finds two C++ toolchains - one for //platform:linux_riscv64 and the other for @local_config_platform//:host.

Bazel will use the first toolchain to build the final output libraries, and use the second to build the binaries that need to be executed during the build, such as protoc and llvm-tblgen. To have better control of the toolchain used to build these tools, we need to register an x86 toolchain k8_local_toolchain in the same way. If no local toolchain is declared, Bazel parses and uses the tools it finds on the host.
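A sketch of how the local toolchain might be declared is shown here. Only the name k8_local_toolchain comes from the text above; the //toolchain package and the :k8_cc_toolchain target it points to are hypothetical, and the constraint labels assume the @platforms repository.

```starlark
# toolchain/BUILD (hypothetical package)
toolchain(
    name = "k8_local_toolchain",
    # Both execution and target are the x86_64 Linux build host, so this
    # toolchain is picked for tools that run during the build.
    exec_compatible_with = [
        "@platforms//os:linux",
        "@platforms//cpu:x86_64",
    ],
    target_compatible_with = [
        "@platforms//os:linux",
        "@platforms//cpu:x86_64",
    ],
    # Hypothetical cc_toolchain target wrapping the host compiler.
    toolchain = ":k8_cc_toolchain",
    toolchain_type = "@bazel_tools//tools/cpp:toolchain_type",
)

# WORKSPACE
register_toolchains("//toolchain:k8_local_toolchain")
```

Declaring the host toolchain explicitly pins the compiler and flags used for build-time tools instead of relying on whatever Bazel auto-detects on the host.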

The issues seen so far are the worst of the problems encountered. The remaining issues were around picking the right toolchain for some of the components. A common way to address these issues (e.g. jpeg) was to make use of the platform definition added above.
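One way a BUILD file can key off a platform is a config_setting that matches on constraint values, which a select can then use. The fragment below is a hypothetical sketch of that pattern and not the jpeg change from the patch; every target name in it is made up for illustration, and the constraint labels assume the @platforms repository.

```starlark
# BUILD fragment for a third-party library (hypothetical)
config_setting(
    name = "linux_riscv64",
    # Matches when the target platform carries both constraints,
    # for example when building with --platforms=//platform:linux_riscv64.
    constraint_values = [
        "@platforms//os:linux",
        "@platforms//cpu:riscv64",
    ],
)

cc_library(
    name = "codec",  # hypothetical target
    srcs = ["codec.c"],
    deps = select({
        # Portable C fallback for RISC-V (hypothetical target).
        ":linux_riscv64": [":simd_none"],
        # Existing behaviour for everything else (hypothetical target).
        "//conditions:default": [":simd_x86_64"],
    }),
)
```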

Next steps

  1. Port instructions to upstream version - We've started porting the instructions to the corresponding upstream version. Early work suggests that some of the internal changes make it easier to do the RISC-V cross build. These will need to be ported over.

  2. Port instructions to newer versions - Another option being considered is to skip step 1 and instead enable the RISC-V port on more recent versions. The benefit of this approach is that any changes needed can be contributed upstream.

  3. Performance evaluation and optimization - The goal of the porting effort is to enable running workloads and identify opportunities for optimization. Enabling the community to build and run TensorFlow-based workloads will help accelerate improvements.

Appendix

Bazel

.tf_configure.bazelrc

Patch

A sanitised version of the patch to Bytedance TensorFlow incorporating the changes discussed in this document.