nb.o's Diary: Building TensorFlow 2.7 (CUDA 11.5, cuDNN 8.3.1) with GCC 11.2.1 20211203 on Fedora 35

Purpose

Previously, I wrote a blog post about building TensorFlow v2.7 on Fedora 35.

Since then, the GCC package in Fedora 35 has been updated.

GCC 11.2.1 20210728 → GCC 11.2.1 20211203

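This update, specifically its changes to the libstdc++ header files, turned out to break CUDA-related builds.

Note:

To be fair, CUDA's official GCC support only covers GCC 11 on Fedora 34 (probably 11.1).

NVIDIA CUDA Installation Guide for Linux (accessed 2021-12-18)

GCC 11.2 on Fedora 35 is therefore unsupported, and it was a miracle that the previous build worked at all.

(Even so, it is a bit harsh for the build to stop working so suddenly...)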

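CUDA-related build errors with GCC 11.2.1 20211203

Building with GCC 11.2.1 20211203 on Fedora 35 produces build errors, which I summarized in the issue below.

This is probably not limited to TensorFlow: I expect every CUDA-related build to fail.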

TensorFlow build issue on Fedora 35

Environment

The environment as of 2021-12-18 is as follows.

Fedora 35 x86_64
Python 3.10.0 (default, Oct 4 2021, 00:00:00) [GCC 11.2.1 20210728 (Red Hat 11.2.1-1)] on linux
gcc version 11.2.1 20211203 (Red Hat 11.2.1-7) (GCC)
CUDA 11.5 + cuDNN v8.3.1
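
Since the breakage appeared between two snapshots that share the version number 11.2.1, it can help to pull the snapshot date out of the compiler banner. This is a minimal sketch, assuming the banner has the same shape as the `gcc version` line above. `gcc_snapshot_date` is a hypothetical helper name, not part of GCC.

```shell
# Hypothetical helper: print the 8-digit snapshot date found in a GCC banner line.
gcc_snapshot_date() {
  printf '%s\n' "$1" | sed -n 's/.* \([0-9]\{8\}\) .*/\1/p' | head -n 1
}

# Example using the banner text from the environment list above.
gcc_snapshot_date "gcc version 11.2.1 20211203 (Red Hat 11.2.1-7) (GCC)"
```

To check the compiler actually installed, something like `gcc_snapshot_date "$(gcc --version | head -n 1)"` should work.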

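Preparation

The approach, as usual, is to prepare a separate GCC for CUDA.

This is the same as in the past. This time I prepare GCC 11.1.

Building GCC 11.1

Downloading and modifying the source

Download the GCC 11.1 source.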

wget https://ftp.gnu.org/gnu/gcc/gcc-11.1.0/gcc-11.1.0.tar.gz
tar xf gcc-11.1.0.tar.gz
cd gcc-11.1.0/

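The source fails to build as-is, so part of the code has to be changed.

For the error details and the required change, see the following Gentoo bug report.

Build

This is also the same as before.

Build with the following options.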

./contrib/download_prerequisites
mkdir build
cd build
../configure \
  --enable-bootstrap \
  --enable-languages=c,c++ \
  --prefix=/home/xxxx/gcc/11.1 \
  --enable-shared \
  --enable-threads=posix \
  --enable-checking=release \
  --disable-multilib \
  --with-system-zlib \
  --enable-__cxa_atexit \
  --disable-libunwind-exceptions \
  --enable-gnu-unique-object \
  --enable-linker-build-id \
  --with-gcc-major-version-only \
  --with-linker-hash-style=gnu \
  --enable-plugin \
  --enable-initfini-array \
  --with-isl \
  --enable-libmpx \
  --enable-gnu-indirect-function \
  --build=x86_64-redhat-linux
make -j$(nproc)
make install
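
Before moving on, it may be worth confirming that the installed compiler really is 11.1.0. Because the build above uses `--with-gcc-major-version-only`, `-dumpversion` reports only the major version, so this sketch uses `-dumpfullversion` instead. `check_gcc_version` is a hypothetical helper name.

```shell
# Hypothetical helper: succeed only if the given gcc reports the expected full version.
check_gcc_version() {
  gcc_bin="$1"
  expected="$2"
  # -dumpfullversion prints major.minor.patch regardless of configure options.
  actual="$("$gcc_bin" -dumpfullversion 2>/dev/null)" || return 1
  [ "$actual" = "$expected" ]
}
```

For the install prefix used here, a call such as `check_gcc_version /home/xxxx/gcc/11.1/bin/gcc 11.1.0 && echo ok` should print `ok` if the build succeeded.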

Post-build setup

After the build, create a specs file and edit it.

$ /home/xxxx/gcc/11.1/bin/gcc -dumpspecs > specs
$ vi specs

# before
*link_libgcc:
%D

# after
*link_libgcc:
%{!static:%{!static-libgcc:-rpath /home/xxxx/gcc/11.1/lib64/}} %D

$ mv specs /home/xxxx/gcc/11.1/lib/gcc/x86_64-redhat-linux/11/
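
The manual `vi` edit above can also be scripted. This is a minimal sketch, assuming the dumped specs file contains a `*link_libgcc:` line followed directly by a line that is exactly `%D`, as in the before/after listing. `patch_specs_rpath` is a hypothetical helper name.

```shell
# Hypothetical helper: add an -rpath entry to the link_libgcc spec.
#   $1 = path to the specs file, $2 = library directory to embed as rpath
patch_specs_rpath() {
  specs_file="$1"
  lib_dir="$2"
  tmp_file="${specs_file}.tmp"
  # On the line after "*link_libgcc:", replace a bare "%D" with the rpath form.
  sed "/^\*link_libgcc:\$/{
n
s|^%D\$|%{!static:%{!static-libgcc:-rpath ${lib_dir}}} %D|
}" "$specs_file" > "$tmp_file" && mv "$tmp_file" "$specs_file"
}
```

Run between the `-dumpspecs` and `mv` steps, `patch_specs_rpath specs /home/xxxx/gcc/11.1/lib64/` should produce the same "after" line as the manual edit. Other lines are left unchanged, and the library directory must not contain a `|` character because it is used as the sed delimiter.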

Building TensorFlow

From here on, just build as usual.

Configure

In configure, specify GCC 11.1 as the host compiler that nvcc uses for CUDA.

./configure
You have bazel 3.7.2 installed.
Please specify the location of python. [Default is /home/xxxx/.virtualenvs/tf2.7/bin/python3]:

Found possible Python library paths:
/home/xxxx/.virtualenvs/tf2.7/lib/python3.10/site-packages
/home/xxxx/.virtualenvs/tf2.7/lib64/python3.10/site-packages
Please input the desired Python library path to use. Default is [/home/xxxx/.virtualenvs/tf2.7/lib/python3.10/site-packages]

Do you wish to build TensorFlow with ROCm support? [y/N]:
No ROCm support will be enabled for TensorFlow.

Do you wish to build TensorFlow with CUDA support? [y/N]: y
CUDA support will be enabled for TensorFlow.

Do you wish to build TensorFlow with TensorRT support? [y/N]:
No TensorRT support will be enabled for TensorFlow.

Found CUDA 11.5 in:
/usr/local/cuda-11.5/targets/x86_64-linux/lib
/usr/local/cuda-11.5/targets/x86_64-linux/include
Found cuDNN 8 in:
/usr/local/cuda-11.5/targets/x86_64-linux/lib
/usr/local/cuda-11.5/targets/x86_64-linux/include

Please specify a list of comma-separated CUDA compute capabilities you want to build with.
You can find the compute capability of your device at: https://developer.nvidia.com/cuda-gpus. Each capability can be specified as "x.y" or "compute_xy" to include both virtual and binary GPU code, or as "sm_xy" to only include the binary code.
Please note that each additional compute capability significantly increases your build time and binary size, and that TensorFlow only supports compute capabilities >= 3.5 [Default is: 6.1]:

Do you want to use clang as CUDA compiler? [y/N]:
nvcc will be used as CUDA compiler.

Please specify which gcc should be used by nvcc as the host compiler. [Default is /usr/bin/gcc]: /home/xxxx/gcc/11.1/bin/gcc

Please specify optimization flags to use during compilation when bazel option "--config=opt" is specified [Default is -Wno-sign-compare]:

Would you like to interactively configure ./WORKSPACE for Android builds? [y/N]:
Not configuring the WORKSPACE for Android builds.

Preconfigured Bazel build configs. You can use any of the below by adding "--config=<>" to your build command. See .bazelrc for more details.
--config=mkl # Build with MKL support.
--config=mkl_aarch64 # Build with oneDNN and Compute Library for the Arm Architecture (ACL).
--config=monolithic # Config for mostly static monolithic build.
--config=numa # Build with NUMA support.
--config=dynamic_kernels # (Experimental) Build kernels into separate shared objects.
--config=v1 # Build with TensorFlow 1 API instead of TF 2 API.
Preconfigured Bazel build configs to DISABLE default on features:
--config=nogcp # Disable GCP support.
--config=nonccl # Disable NVIDIA NCCL support.
Configuration finished
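
The interactive answers above can probably be pre-seeded through environment variables, which the TensorFlow configure script reads before prompting. This is a sketch under the assumption that the variable names below are honored by the TensorFlow 2.7 configure script. Check `configure.py` in your checkout before relying on them. The values mirror the answers given in the log above.

```shell
# Assumed variable names; verify against configure.py in the TensorFlow source tree.
export TF_NEED_ROCM=0                      # no ROCm support
export TF_NEED_CUDA=1                      # enable CUDA support
export TF_NEED_TENSORRT=0                  # no TensorRT support
export TF_CUDA_COMPUTE_CAPABILITIES=6.1    # the default accepted in the log
export TF_CUDA_CLANG=0                     # use nvcc, not clang, as the CUDA compiler
export GCC_HOST_COMPILER_PATH=/home/xxxx/gcc/11.1/bin/gcc  # the self-built GCC 11.1
export CC_OPT_FLAGS="-Wno-sign-compare"    # default optimization flags
export TF_SET_ANDROID_WORKSPACE=0          # skip Android WORKSPACE setup
```

With these exported, running `./configure` in the same shell should skip the corresponding prompts. Any prompt that is not covered is still asked interactively.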

Build

All that remains is the build itself. Expect to wait a while.

bazel build \
  --config=cuda \
  --config=v2 \
  --config=nonccl \
  --config=opt \
  //tensorflow/tools/pip_package:build_pip_package
./bazel-bin/tensorflow/tools/pip_package/build_pip_package /tmp/tensorflow_pkg
pip install /tmp/tensorflow_pkg/tensorflow-2.7.0-cp310-cp310-linux_x86_64.whl
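
After installing the wheel, a quick import test can confirm that the package loads and sees the GPU. This is a minimal sketch. `verify_tf_gpu` is a hypothetical helper name, and the interpreter path passed to it is whatever Python the wheel was installed into.

```shell
# Hypothetical helper: import TensorFlow with the given interpreter,
# then print its version and the list of visible GPU devices.
verify_tf_gpu() {
  "$1" -c 'import tensorflow as tf; print(tf.__version__); print(tf.config.list_physical_devices("GPU"))'
}
```

With the virtualenv from the configure step, this would be called as `verify_tf_gpu /home/xxxx/.virtualenvs/tf2.7/bin/python3`. An empty device list suggests that the CUDA libraries were not found at runtime.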


Posted on Thursday, December 23, 2021
