UPDATED 12/24/2016 to support TensorFlow r0.12.
Prepare to fall down a rabbit hole of Linux compiler errors — here’s a guide on how to set up a proper Python and TensorFlow development environment on Columbia’s Yeti HPC cluster. This should also work for other RHEL 6.7 and certain CentOS HPC systems where GLIBC and other dependencies are out of date and you don’t have root access to dig deep into the system. A living, breathing guide is on my GitHub here, and I will keep this post updated in case future versions of TensorFlow are easier to install.
Python Setup
Create an environment variable for the directory where we’ll do our installation and computing. Add it to your ~/.profile as well, since later ~/.profile entries depend on it.
export WORK=/vega/<group>/users/<username>
Now, install and set up the latest version of Python (2 or 3).
cd $WORK
mkdir applications
cd applications
mkdir python
cd python
wget https://www.python.org/ftp/python/2.7.12/Python-2.7.12.tgz
tar -xvzf Python-2.7.12.tgz
find Python-2.7.12 -type d | xargs chmod 0755
cd Python-2.7.12
./configure --prefix=$WORK/applications/python --enable-shared
make && make install
export LD_LIBRARY_PATH="$LD_LIBRARY_PATH:$WORK/applications/python/lib"
You can add this Python to your path, but I am just going to work entirely out of virtual environments and will leave the default path as-is. If you’re particular about folder structure, you can install specific Python versions in (for example) $WORK/applications/python/Python-2.7.12 to keep separate versions well-organized and easily available.
Now, we’ll install pip.
cd $WORK/applications
wget https://bootstrap.pypa.io/get-pip.py
$WORK/applications/python/bin/python get-pip.py
Now to install and set up a virtualenv:
$WORK/applications/python/bin/pip install virtualenv
cd $WORK/applications
$WORK/applications/python/bin/virtualenv pythonenv
Now, create an alias in your ~/.profile to allow easy access to the virtualenv.
alias pythonenv="source $WORK/applications/pythonenv/bin/activate"
There you have it! Your own local Python installation in a virtualenv, just a pythonenv command away. You can also install multiple Python versions and pick which one you want for a particular virtualenv. Nice and self-contained.
Bazel Setup
The TensorFlow binary requires GLIBC 2.14, but Yeti runs RHEL 6.7, which ships with GLIBC 2.12. Installing a new GLIBC from source will lead you down a rabbit hole of system dependencies and compilation errors, but we have another option: compile TensorFlow from source with Bazel. Bazel requires JDK 8 (the download below is Oracle’s build):
# Do this in an interactive session because submit queues don't have enough memory.
cd $WORK/applications
wget --no-check-certificate --no-cookies --header "Cookie: oraclelicense=accept-securebackup-cookie" http://download.oracle.com/otn-pub/java/jdk/8u112-b15/jdk-8u112-linux-x64.tar.gz
tar -xzf jdk-8u112-linux-x64.tar.gz
Add these two lines to your ~/.profile:
export PATH=$WORK/applications/jdk1.8.0_112/bin:$PATH
export JAVA_HOME=$WORK/applications/jdk1.8.0_112
Now, get a copy of Bazel. We also need to load a newer copy of gcc to compile Bazel:
wget https://github.com/bazelbuild/bazel/releases/download/0.4.2/bazel-0.4.2-dist.zip
unzip bazel-0.4.2-dist.zip -d bazel
cd bazel
module load gcc/4.9.1
./compile.sh
Add the following to your ~/.profile:
export PATH=$WORK/applications/bazel/output:$PATH
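Whether the source build is needed at all depends on the glibc version of the system at hand. The following is a minimal sketch for checking it against the 2.14 minimum mentioned above; the version_ge helper is a name made up for this example, and it assumes GNU sort (for the -V flag) is available:

```shell
# Ask the C library for its version; glibc systems print e.g. "glibc 2.12".
glibc_version="$(getconf GNU_LIBC_VERSION 2>/dev/null | awk '{print $2}')"
echo "System glibc: ${glibc_version:-unknown}"

# version_ge A B: succeeds when version A >= version B (relies on GNU sort -V).
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# 2.14 is the minimum stated above for the TensorFlow binary.
if version_ge "${glibc_version:-0}" "2.14"; then
  echo "glibc is new enough for the prebuilt binary"
else
  echo "glibc is older than 2.14; build from source"
fi
```

On a system like Yeti this should take the second branch, which is the case this guide covers.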
TensorFlow Setup
We’re going to install TensorFlow from source using Bazel.
Make sure numpy is installed in your pythonenv: pip install numpy.
Clone the TensorFlow repository.
cd $WORK/applications
git clone https://github.com/tensorflow/tensorflow.git
cd tensorflow
git checkout r0.12
We also need to install swig:
cd $WORK/applications
# Get swig-3.0.10.tar.gz from SourceForge.
tar -xzf swig-3.0.10.tar.gz
mkdir swig
cd swig-3.0.10
./configure --prefix=$WORK/applications/swig
make
make install
Add the following to your ~/.profile:
export PATH=$WORK/applications/swig/bin:$PATH
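With the JDK, Bazel, and swig all added to PATH, a quick check that each one resolves can catch a missed ~/.profile edit before the long build starts. This is only a sketch; report_tool is a helper name invented for this example:

```shell
# report_tool NAME: print where NAME resolves on PATH, or return 1 if it is missing.
report_tool() {
  tool_path="$(command -v "$1" 2>/dev/null)" || {
    echo "$1: not found on PATH" >&2
    return 1
  }
  echo "$1 -> ${tool_path}"
}

# Each of these should point somewhere under $WORK/applications.
for tool in java bazel swig; do
  report_tool "${tool}" || true
done
```

If a tool resolves to a system path instead of one under $WORK/applications, the corresponding export line has probably not been sourced yet.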
We need to set the following environment variables. Add them to your ~/.profile:
export LD_LIBRARY_PATH="$LD_LIBRARY_PATH:/usr/local/cuda-7.5/extras/CUPTI/lib64"
export CUDA_HOME=/usr/local/cuda-7.5
Note that /usr/local/cuda-7.5/lib64 is automatically added to $LD_LIBRARY_PATH when you run module load cuda, so we only need to add the other directories. Also note that /usr/local/cuda is symlinked to /usr/local/cuda-7.5, so you don’t need to include the versions in the path directories, but I’m doing it to be explicit.
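To confirm that the CUPTI directory actually ended up on the library path in the current session, something like the following can help. It is a sketch under the paths used above; path_list_contains is a helper name made up for this example:

```shell
# path_list_contains LIST DIR: succeeds when DIR is an entry of the colon-separated LIST.
path_list_contains() {
  case ":$1:" in
    *":$2:"*) return 0 ;;
    *) return 1 ;;
  esac
}

cupti_dir=/usr/local/cuda-7.5/extras/CUPTI/lib64
if path_list_contains "${LD_LIBRARY_PATH:-}" "${cupti_dir}"; then
  echo "CUPTI directory is on LD_LIBRARY_PATH"
else
  echo "CUPTI directory is missing from LD_LIBRARY_PATH"
fi
```

Wrapping both sides in colons makes the match apply to whole entries only, so a directory that merely shares a prefix does not count.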
To install TensorFlow, we need a GPU node and the CUDA libraries, both of which are available in an interactive session. Running module load cuda loads CUDA 7.5 and cuDNN. Then we can install with Bazel:
# This gives you a 1-hour interactive session with GPU support.
# It may take a while to start the interactive session, depending on current wait times.
qsub -I -W group_list=<yetigroup> -l walltime=01:00:00,nodes=1:gpus=1:exclusive_process
# Use latest available gcc for compatibility.
# CUDA loads 7.5 by default.
# Load the proxy to allow TF to download and install protobuf and other dependencies.
module load gcc/4.9.1 cuda proxy
pythonenv
cd $WORK/applications/tensorflow
./configure
# I used all the default settings except for CUDA compute capabilities, which I set to 3.5 for our k20 and k40 GPUs.
Once that is done, make the following change to third_party/gpus/crosstool/clang/bin/crosstool_wrapper_driver_is_not_gcc.tpl to add the -fno-use-linker-plugin compiler flag:
index 20449a1..48a4e60 100755
--- a/third_party/gpus/crosstool/clang/bin/crosstool_wrapper_driver_is_not_gcc.tpl
+++ b/third_party/gpus/crosstool/clang/bin/crosstool_wrapper_driver_is_not_gcc.tpl
@@ -309,6 +309,7 @@ def main():
   # TODO(eliben): rename to a more descriptive name.
   cpu_compiler_flags.append('-D__GCUDACC_HOST__')
+  cpu_compiler_flags.append('-fno-use-linker-plugin')
 
   return subprocess.call([CPU_COMPILER] + cpu_compiler_flags)
 
 if __name__ == '__main__':
Now we can build with Bazel:
bazel build -c opt --config=cuda --verbose_failures //tensorflow/cc:tutorials_example_trainer
The build should fail with an error that goes something like undefined reference to symbol 'ceil@@GLIBC_2.2.5' or undefined reference to symbol 'clock_gettime@@GLIBC_2.2.5'. To fix this, modify LINK_OPTS in bazel-tensorflow/external/protobuf/BUILD by adding the -lm and -lrt flags to //conditions:default:
LINK_OPTS = select({
":android": [],
"//conditions:default": ["-lpthread", "-lm", "-lrt"],
})
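The same edit can be scripted rather than made by hand. The sketch below assumes the unpatched entry reads exactly "//conditions:default": ["-lpthread"], which may not hold for other protobuf versions, and that it is run from $WORK/applications/tensorflow after the first failed build. patch_link_opts is a helper name invented for this example:

```shell
# patch_link_opts FILE: add -lm and -lrt to the default LINK_OPTS entry in FILE.
# Assumes the unpatched line reads exactly: "//conditions:default": ["-lpthread"],
# A copy of the original is kept as FILE.bak.
patch_link_opts() {
  sed -i.bak \
    's|"//conditions:default": \["-lpthread"],|"//conditions:default": ["-lpthread", "-lm", "-lrt"],|' \
    "$1"
}

build_file=bazel-tensorflow/external/protobuf/BUILD
if [ -f "${build_file}" ]; then
  patch_link_opts "${build_file}"
  # Show the patched line so the result can be checked by eye.
  grep -n 'conditions:default' "${build_file}"
fi
```

Running it a second time leaves the file unchanged, because the pattern only matches the unpatched form.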
Restart the build, then run the sample trainer:
bazel build -c opt --config=cuda --verbose_failures //tensorflow/cc:tutorials_example_trainer
bazel-bin/tensorflow/cc/tutorials_example_trainer --use_gpu
If everything goes okay, build the pip wheel:
bazel build -c opt --config=cuda //tensorflow/tools/pip_package:build_pip_package
bazel-bin/tensorflow/tools/pip_package/build_pip_package $WORK/applications/tensorflow
pip install $WORK/applications/tensorflow/tensorflow-0.12.0-cp27-cp27m-linux_x86_64.whl
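The wheel filename above is specific to Python 2.7 and TensorFlow 0.12.0; with Python 3 or a different point release the name will differ. As a hedged alternative, the wheel can be located by pattern instead of being typed out. newest_wheel is a helper name made up for this sketch:

```shell
# newest_wheel DIR: print the most recently modified tensorflow-*.whl in DIR, or fail if none.
newest_wheel() {
  wheel="$(ls -t "$1"/tensorflow-*.whl 2>/dev/null | head -n 1)"
  [ -n "${wheel}" ] && echo "${wheel}"
}

wheel_dir="${WORK:-}/applications/tensorflow"
if wheel_path="$(newest_wheel "${wheel_dir}")"; then
  echo "Installing ${wheel_path}"
  pip install "${wheel_path}"
else
  echo "No TensorFlow wheel found in ${wheel_dir}"
fi
```

Run this with the pythonenv virtualenv active so that pip installs into the right environment.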
Testing TensorFlow
Try training on MNIST data to see if your installation works:
cd tensorflow/models/image/mnist
python convolutional.py
Troubleshooting
- undefined reference to symbol 'clock_gettime@@GLIBC_2.2.5'
  - Add the --linkopt=-lrt flag to bazel build, or see the LINK_OPTS fix above.
- (directory not empty) error during ./configure: Linking of rule '@protobuf//:protoc' failed: crosstool_wrapper_driver_is_not_gcc failed: with /usr/bin/ld: unrecognized option '-plugin'
  - Highwayhash issues.
  - Compile with gcc 4.9.1 or make changes here.
- Undefined reference to symbol 'ceil@@GLIBC_2.2.5'
- ERROR: no such package '@local_config_cuda//crosstool': BUILD file not found on package path.
  - Re-run ./configure before re-running bazel build.
- tensorflow/stream_executor/cuda/cuda_driver.cc:383] Check failed: CUDA_SUCCESS == dynload::cuCtxSetCurrent(context) (0 vs. 216)
  - Set compute context to DEFAULT or EXCLUSIVE_PROCESS by adding the exclusive_process flag to the qsub call (see above).
  - Set CUDA_VISIBLE_DEVICES to the one in exclusive_process mode.