
AWS GPU TensorFlow Docker


Based on this TensorFlow Documentation

Setup Nvidia Drivers on AWS EC2 Instance Host

  • Run these commands one at a time (i.e., don't copy/paste the whole block at once)
sudo add-apt-repository -y ppa:graphics-drivers/ppa
sudo apt-get update
sudo apt-get install -y dkms 
sudo apt-get install -y linux-headers-generic
sudo apt-get install -y nvidia-361
echo blacklist nouveau | sudo tee -a /etc/modprobe.d/blacklist-nouveau.conf
echo options nouveau modeset=0 | sudo tee -a /etc/modprobe.d/blacklist-nouveau.conf
sudo update-initramfs -u
sudo apt-get install -y nvidia-modprobe

Setup Docker on AWS EC2 Instance Host

  • Create either a g2.2xlarge or g2.8xlarge instance

GPU EC2 Instances

  • Note: Only tested with EC2 Instances configured with Ubuntu 14.04

Ubuntu 14.04

  • Setup the latest Docker (1.12+)
  • DO NOT RELY ON THE DEFAULT VERSION PROVIDED BY YOUR OS!
sudo apt-get update
sudo curl -fsSL https://get.docker.com/ | sh
sudo curl -fsSL https://get.docker.com/gpg | sudo apt-key add -

Setup Nvidia Docker on AWS EC2 Instance Host

wget https://github.com/NVIDIA/nvidia-docker/releases/download/v1.0.0-rc.3/nvidia-docker_1.0.0.rc.3-1_amd64.deb
sudo dpkg -i nvidia-docker_1.0.0.rc.3-1_amd64.deb
sudo rm nvidia-docker_1.0.0.rc.3-1_amd64.deb

Download Docker Image

sudo docker pull gcr.io/tensorflow/tensorflow:0.10.0-gpu

Start TensorFlow-GPU Docker Container

  • Start the Docker container with a Jupyter/IPython notebook server
  • Note: This Docker image is from Docker Hub and is based on this Dockerfile
sudo nvidia-docker run -itd --name=tensorflow-gpu -p 8754:8888 -p 6006:6006 gcr.io/tensorflow/tensorflow:0.10.0-gpu

Shell into TensorFlow-GPU Docker Container and Verify Successful Startup

sudo nvidia-docker exec -it tensorflow-gpu bash
nvidia-smi
  • g2.2xlarge EC2 Instance (1 Nvidia K520 GPU)

[Screenshot: AWS GPU Nvidia Docker]

  • g2.8xlarge EC2 Instance (4 Nvidia K520 GPUs)

[Screenshot: AWS 4 GPU Nvidia Docker]

ps -aef | grep jupyter

### EXPECTED OUTPUT ###
...
root         1     0  0 13:40 ?        00:00:00 bash /run_jupyter.sh
root         7     1  0 13:40 ?        00:00:01 /usr/bin/python /usr/local/bin/jupyter-notebook
root        13     7 47 13:46 ?        00:00:29 /usr/bin/python -m ipykernel -f /root/.local/share/jupyter/runtime/kernel-c6494c82-7072-43b4-8ab2-7f5110d6b767.json
...
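
  • (Optional) Verify that TensorFlow itself can see a GPU from inside the container. The snippet below is a minimal sketch along the lines of the standard TensorFlow GPU-usage example (run it with python inside the container); the matrix values are arbitrary, and the device placement log is what matters.

import tensorflow as tf

# Place a small matrix multiply explicitly on the first GPU.
with tf.device('/gpu:0'):
    a = tf.constant([1.0, 2.0, 3.0, 4.0], shape=[2, 2], name='a')
    b = tf.constant([5.0, 6.0, 7.0, 8.0], shape=[2, 2], name='b')
    c = tf.matmul(a, b)

# log_device_placement prints the device each op was assigned to;
# the output should mention /gpu:0 and print [[19. 22.] [43. 50.]].
sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))
print(sess.run(c))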

Verify TensorBoard Setup

Note: Ignore the ERROR about logdir for now; running tensorboard without arguments only verifies that it starts and loads the CUDA libraries

tensorboard

### EXPECTED OUTPUT ###
...
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcublas.so locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcudnn.so locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcufft.so locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcurand.so locally
ERROR:tensorflow:A logdir must be specified. Run `tensorboard --help` for details and examples.
A logdir must be specified. Run `tensorboard --help` for details and examples.

(Multiple GPUs Only) Test Multi-GPU TensorFlow Training Example

  • The following is an example of multi-GPU, data-parallel training across GPUs on a single host; a minimal code sketch of this pattern follows the steps below

Multi-GPU Training

Step 1: CPU transfers the model to each GPU

Step 2: CPU synchronizes and waits for all GPUs to process their batch

Step 3: CPU copies all training results (gradients) back from each GPU

Step 4: CPU builds a new model from the average of the gradients from all GPUs

Step 5: Repeat Step 1 until the stop condition is reached (e.g., --max_steps=1000)
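
  • For reference, the gradient-averaging pattern described in Steps 1-5 looks roughly like the following in TensorFlow 0.10. This is only a minimal sketch of the technique, not the actual cifar10_multi_gpu_train.py script; NUM_GPUS, the toy linear model in build_tower_loss, and the learning rate are illustrative placeholders.

import tensorflow as tf

NUM_GPUS = 2                                        # match --num_gpus
opt = tf.train.GradientDescentOptimizer(0.1)        # placeholder optimizer/learning rate

def build_tower_loss():
    # Placeholder per-tower model: shared variables live on the CPU (Step 1),
    # while each tower computes its loss on its own GPU.
    with tf.device('/cpu:0'):
        w = tf.get_variable('w', shape=[10, 1])
    batch = tf.random_normal([32, 10])              # stand-in for a real input pipeline
    return tf.reduce_mean(tf.square(tf.matmul(batch, w)))

# Steps 1-2: build one model replica ("tower") per GPU, sharing variables,
# and collect each tower's gradients.
tower_grads = []
with tf.variable_scope('model') as vscope:
    for i in range(NUM_GPUS):
        with tf.device('/gpu:%d' % i):
            loss = build_tower_loss()
            vscope.reuse_variables()                # all towers share the same variables
            tower_grads.append(opt.compute_gradients(loss))

# Steps 3-4: average the gradients from all towers and apply a single update.
average_grads = []
for grad_and_vars in zip(*tower_grads):
    grads = [tf.expand_dims(g, 0) for g, _ in grad_and_vars]
    grad = tf.reduce_mean(tf.concat(0, grads), 0)   # tf.concat(dim, values) in TF 0.x
    average_grads.append((grad, grad_and_vars[0][1]))
train_op = opt.apply_gradients(average_grads)

# Step 5: repeat until the stop condition (e.g. max_steps) is reached.
with tf.Session(config=tf.ConfigProto(allow_soft_placement=True)) as sess:
    sess.run(tf.initialize_all_variables())         # TF 0.x initializer
    for step in range(1000):
        sess.run(train_op)

  • The real cifar10_multi_gpu_train.py script layers an input pipeline, learning-rate decay, and summaries on top of this same skeleton.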

  • Download Example Source Code to Your Home Directory
cd ~ && git clone -b r0.10 --single-branch --recurse-submodules https://github.com/tensorflow/tensorflow.git
cd ~/tensorflow/tensorflow/models/image/cifar10

1 GPU: Note examples/sec and sec/batch

python cifar10_multi_gpu_train.py --num_gpus=1 --max_steps=1000

### EXPECTED OUTPUT ###
...
2016-09-08 16:10:30.718689: step 990, loss = 2.51 (717.6 examples/sec; 0.178 sec/batch)

2 GPUs: Note the increase in examples/sec and the decrease in sec/batch

python cifar10_multi_gpu_train.py --num_gpus=2 --max_steps=1000

### EXPECTED OUTPUT ###
...
2016-09-08 16:06:01.299470: step 990, loss = 2.31 (1342.8 examples/sec; 0.095 sec/batch)

4 GPUs: Note the further increase in examples/sec and decrease in sec/batch

python cifar10_multi_gpu_train.py --num_gpus=4 --max_steps=1000

### EXPECTED OUTPUT ###
...
2016-09-08 15:59:51.653752: step 990, loss = 2.36 (1925.9 examples/sec; 0.066 sec/batch)

Run TensorFlow IPython Notebook Examples

  • Navigate your browser to the Jupyter notebook UI and run the example notebooks
http://<your-cloud-ip>:8754

Run TensorBoard UI

tensorboard --logdir <logdir>

### EXPECTED OUTPUT ###
...
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcublas.so locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcudnn.so locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcufft.so locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcurand.so locally
...
  • Navigate your browser to the TensorBoard UI
http://<your-cloud-ip>:6006

Stop and Start AWS EC2 Instance Host to Save Money

  • To save money, you can stop the EC2 instance when you are done experimenting
  • When you start the EC2 instance back up, you will need to start the Docker container
sudo docker start tensorflow-gpu
  • Then you can shell back in to the Docker container
sudo nvidia-docker exec -it tensorflow-gpu bash