I ran into trouble because I was not able to build my own tensor ops (tensor operators), which I needed to train PointNet, PointNet2 and SpiderCNN for 3D object classification. Some solutions, like this, this, and this, did not work for my setup.
One of the reasons is that TensorFlow was compiled with a different gcc version. Of course, I could have just used the versions suggested in the original repositories, for example TensorFlow 1.4 and CUDA 8.0, but sometimes that does not work on our cluster, since the sysadmin may update the NVIDIA driver and gcc versions and remove the older ones. Thus, TensorFlow needs to be compiled locally in order to match the gcc version.
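A quick way to see whether this is the problem on your machine is to compare the local gcc with the compiler recorded in an existing TensorFlow installation. This is just a sketch, assuming a TensorFlow 1.x build is already importable in the current environment (tf.COMPILER_VERSION is exposed in the 1.x releases):
gcc --version
python -c "import tensorflow as tf; print(tf.COMPILER_VERSION)"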
But I finally got it working by following the steps below and matching the CUDA, NCCL and cuDNN versions described on this page. The following is the cluster environment, setup and installation process.
Environment of the cluster (compute node)
- OS distribution
$ cat /etc/*-release
NAME="Scientific Linux"
PRETTY_NAME="Scientific Linux 7.7 (Nitrogen)"
REDHAT_BUGZILLA_PRODUCT="Scientific Linux 7"
REDHAT_BUGZILLA_PRODUCT_VERSION=7.7
REDHAT_SUPPORT_PRODUCT="Scientific Linux"
REDHAT_SUPPORT_PRODUCT_VERSION="7.7"
Scientific Linux release 7.7 (Nitrogen)
- JDK version 10.0.1
- CUDA version 9.2
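If you need to check the CUDA toolkit and driver versions on a compute node yourself, the standard NVIDIA tools report them (this assumes the CUDA module is loaded, or that /usr/local/cuda-9.2/bin is on your PATH, as described in the build section below):
nvcc --version
nvidia-smi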
Requirements
- cudnn 7.5.6
- Download cuDNN (login required)
- Unzip and move the files to $HOME/local/share (see the extraction sketch after this list)
- nccl 2.5.6
- Download NCCL (login required)
- Unzip and move the files to $HOME/local/share (the sketch after this list covers this step as well)
- bazel 0.24.1 as tested on this page
- Install with argument --user to install it locally in $HOME/bin
chmod +x bazel-<version>-installer-linux-x86_64.sh
./bazel-0.24.1-installer-linux-x86_64.sh --user
- export the path
export PATH="$PATH:$HOME/bin"
- TensorFlow r1.14
- clone the repository and checkout
git clone https://github.com/tensorflow/tensorflow.git
cd tensorflow && git checkout r1.14
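A minimal sketch of the cuDNN and NCCL extraction steps referenced above. The archive and directory names here are hypothetical placeholders, so replace them with the files you actually downloaded:
mkdir -p $HOME/local/share
# cuDNN: the tar archive unpacks into a directory named cuda (archive name is a placeholder)
tar -xzvf cudnn-9.2-linux-x64-v7.tgz
mv cuda $HOME/local/share/cudnn
# NCCL: the tar archive unpacks into a versioned directory (names are placeholders)
tar -xvf nccl_2.5.6-cuda-9.2_x86_64.txz
mv nccl_2.5.6-cuda-9.2_x86_64 $HOME/local/share/nccl_2.5.6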
Build TensorFlow 1.14 on compute node
- Load the CUDA library
The availability of this command depends on your system admin, but if it is not available, you can just export the path yourself.
module load cuda/9.2
export LD_LIBRARY_PATH=/usr/local/cuda-9.2/lib64:$LD_LIBRARY_PATH
- Add your cudnn and nccl library into LD_LIBRARY_PATH as well
Replace "emha" with your home directory name.
export LD_LIBRARY_PATH=/home/emha/local/share/cudnn/lib64:/home/emha/local/share/nccl_2.5.6/lib:$LD_LIBRARY_PATH
- Activate your conda environment if you use conda (in my case, conda activate tf-source)
- Go to the tensorflow directory and configure
The following is my configuration.
./configure
You have bazel 0.24.1 installed.
Please specify the location of python. [Default is /home/emha/anaconda3/envs/tf-source/bin/python]:
Found possible Python library paths:
/home/emha/anaconda3/envs/tf-source/lib/python3.6/site-packages
Please input the desired Python library path to use. Default is [home/emha/anaconda3/envs/tf-source/lib/python3.6/site-packages]
Do you wish to build TensorFlow with XLA JIT support? [y/N]: Y
No XLA JIT support will be enabled for TensorFlow.
Do you wish to build TensorFlow with OpenCL SYCL support? [y/N]: N
No OpenCL SYCL support will be enabled for TensorFlow.
Do you wish to build TensorFlow with ROCm support? [y/N]: N
Do you wish to build TensorFlow with CUDA support? [y/N]: y
CUDA support will be enabled for TensorFlow.
Do you wish to build TensorFlow with TensorRT support? [y/N]: N
No TensorRT support will be enabled for TensorFlow.
Please specify the CUDA SDK version you want to use. [Leave empty to default to CUDA 9.2]: 9.2
Please specify the cuDNN version you want to use. [Leave empty to default to cuDNN 7.0]: 7.0
Please specify the locally installed NCCL version you want to use. [Leave empty to use http://github.com/nvidia/nccl]: 2.5.6
Please specify the comma-separated list of base paths to look for CUDA libraries and headers. [Leave empty to use the default]: /usr/local/cuda-9.2,/home/emha/local/cudnn,/home/emha/local/nccl_2.5.6
- Build with bazel
bazel build --config=opt --config=cuda --copt=--cuda_log //tensorflow/tools/pip_package:build_pip_package
If you encounter a problem with http_archive, add the following to the WORKSPACE file before the http_archive function:
load("@bazel_tools//tools/build_defs/repo:http.bzl", "http_archive", "http_file")
- If everything is fine, you can now build the pip package
You may want to replace the tmp directory and use a directory in your home for later use.
./bazel-bin/tensorflow/tools/pip_package/build_pip_package /tmp/tensorflow_pkgs
- Install the built tensorflow
You should be in the same environment as the one used for building TensorFlow.
pip install /tmp/tensorflow_pkgs/tensorflow-1.14.1-cp36-linux_x86_64.whl
- After installation, if you encounter an error like "cannot find -ltensorflow_framework", it is because libtensorflow_framework.so has been renamed to libtensorflow_framework.so.1; the solution is to create a symlink to that file (see the sketch after this list for where to run it)
ln -s libtensorflow_framework.so.1 libtensorflow_framework.so
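The symlink has to be created in the directory that actually contains libtensorflow_framework.so.1, i.e. the installed tensorflow package directory. A small sketch, assuming the tf-source environment from the configure step (tf.sysconfig.get_lib() simply prints that directory, the same site-packages/tensorflow folder used in the compile step later):
cd $(python -c "import tensorflow as tf; print(tf.sysconfig.get_lib())")
ln -s libtensorflow_framework.so.1 libtensorflow_framework.so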
Note
- You need to export the cudnn and nccl paths you used to build TensorFlow in order to load it, for example as shown below
export LD_LIBRARY_PATH=/home/emha/local/share/nccl_2.5.6/lib:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=/home/emha/local/share/cudnn/lib64:$LD_LIBRARY_PATH
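With those paths exported (and CUDA on LD_LIBRARY_PATH as described in the build section), a quick check that the freshly built TensorFlow actually loads:
python -c "import tensorflow as tf; print(tf.__version__)"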
Compile your own tensor operators
Example bash script showing how to compile the tensor operators (here, the tf_grouping op). Create a bash script like the following and run it.
/usr/local/cuda-9.2/bin/nvcc tf_grouping_g.cu -o tf_grouping_g.cu.o -c -O2 -DGOOGLE_CUDA=1 -x cu -Xcompiler -fPIC
g++ -std=c++11 tf_grouping.cpp tf_grouping_g.cu.o -o tf_grouping_so_hk.so -shared -fPIC -I /home/mwasil2s/anaconda3/envs/tf-source/lib/python3.6/site-packages/tensorflow/include -I /usr/local/cuda-9.2/include -I /home/mwasil2s/anaconda3/envs/tf-source/lib/python3.6/site-packages/tensorflow/include/external/nsync/public -lcudart -L /usr/local/cuda-9.2/lib64 -L /home/mwasil2s/anaconda3/envs/tf-source/lib/python3.6/site-packages/tensorflow -ltensorflow_framework -O2 -D_GLIBCXX_USE_CXX11_ABI=0
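If the compilation succeeds, you can quickly check that the resulting shared library loads by running the following from the directory containing the .so (tf_grouping_so_hk.so is the output name used in the script above):
python -c "import tensorflow as tf; tf.load_op_library('./tf_grouping_so_hk.so'); print('custom op library loaded')"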