# AWS GPU TensorFlow Docker

Based on this TensorFlow documentation
- Install the Nvidia GPU driver and blacklist the default `nouveau` driver
- Run these commands separately (i.e. don't copy/paste the whole block all at once)

```shell
sudo add-apt-repository -y ppa:graphics-drivers/ppa
sudo apt-get update
sudo apt-get install -y dkms
sudo apt-get install -y linux-headers-generic
sudo apt-get install -y nvidia-361
echo blacklist nouveau | sudo tee -a /etc/modprobe.d/blacklist-nouveau.conf
echo options nouveau modeset=0 | sudo tee -a /etc/modprobe.d/blacklist-nouveau.conf
sudo update-initramfs -u
sudo apt-get install -y nvidia-modprobe
```
- Create either a `g2.2xlarge` or a `g2.8xlarge` EC2 instance
- Note: Only tested with EC2 instances running Ubuntu 14.04
- Set up the latest Docker (1.12+)
- DO NOT RELY ON THE DEFAULT VERSION PROVIDED BY YOUR OS!

```shell
sudo apt-get update
sudo curl -fsSL https://get.docker.com/ | sh
sudo curl -fsSL https://get.docker.com/gpg | sudo apt-key add -
```
- Set up nvidia-docker on the AWS EC2 instance host, then pull the TensorFlow GPU image

```shell
wget https://github.com/NVIDIA/nvidia-docker/releases/download/v1.0.0-rc.3/nvidia-docker_1.0.0.rc.3-1_amd64.deb
sudo dpkg -i nvidia-docker_1.0.0.rc.3-1_amd64.deb
sudo rm nvidia-docker_1.0.0.rc.3-1_amd64.deb
sudo docker pull gcr.io/tensorflow/tensorflow:0.10.0-gpu
```
- Start the Docker container with a Jupyter/IPython notebook, then shell into it
- Note: This Docker image is from DockerHub, based on this Dockerfile

```shell
sudo nvidia-docker run -itd --name=tensorflow-gpu -p 8754:8888 -p 6006:6006 gcr.io/tensorflow/tensorflow:0.10.0-gpu
sudo nvidia-docker exec -it tensorflow-gpu bash
```
- Verify that the GPUs are visible from inside the container

```shell
nvidia-smi
```

- `g2.2xlarge` EC2 instance: 1 Nvidia K520 GPU
- `g2.8xlarge` EC2 instance: 4 Nvidia K520 GPUs
- Verify that the Jupyter notebook server is running

```shell
ps -aef | grep jupyter

### EXPECTED OUTPUT ###
...
root 1 0 0 13:40 ? 00:00:00 bash /run_jupyter.sh
root 7 1 0 13:40 ? 00:00:01 /usr/bin/python /usr/local/bin/jupyter-notebook
root 13 7 47 13:46 ? 00:00:29 /usr/bin/python -m ipykernel -f /root/.local/share/jupyter/runtime/kernel-c6494c82-7072-43b4-8ab2-7f5110d6b767.json
...
```
- Verify that TensorBoard launches (for now, ignore the ERROR about a missing logdir)

```shell
tensorboard

### EXPECTED OUTPUT ###
...
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcublas.so locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcudnn.so locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcufft.so locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcurand.so locally
ERROR:tensorflow:A logdir must be specified. Run `tensorboard --help` for details and examples.
A logdir must be specified. Run `tensorboard --help` for details and examples.
```
- The following is an example of multi-GPU, data-parallel training across GPUs on a single host
  - Step 1: CPU transfers the model to each GPU
  - Step 2: CPU synchronizes and waits for all GPUs to process their batches
  - Step 3: CPU copies all training results (gradients) back from the GPUs
  - Step 4: CPU builds a new model from the average of the gradients from all GPUs
  - Step 5: Repeat from Step 1 until a stop condition is reached (i.e. `--max_steps=1000`)
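The five steps above can be sketched in plain Python (this is an illustration of the data-parallel averaging loop, not the actual TensorFlow code; `compute_gradient` is a hypothetical stand-in for one GPU's forward/backward pass):

```python
def compute_gradient(weights, batch):
    # Stand-in for one GPU's work: gradient of a squared-error loss
    # for a one-parameter linear model y = w * x.
    w = weights[0]
    g = sum(2 * (w * x - y) * x for x, y in batch) / len(batch)
    return [g]

def train(weights, shards, steps, lr=0.1):
    for _ in range(steps):                        # Step 5: repeat until max_steps
        grads = [compute_gradient(weights, s)     # Steps 1-2: each "GPU" receives the
                 for s in shards]                 # model and processes its own batch
        avg = [sum(g) / len(grads)                # Steps 3-4: gather gradients and
               for g in zip(*grads)]              # average them on the CPU
        weights = [w - lr * g for w, g in zip(weights, avg)]
    return weights

# Two simulated GPUs, each holding its own shard of data for y = 3x;
# the averaged-gradient updates converge toward w = 3.
shards = [[(1.0, 3.0), (2.0, 6.0)], [(3.0, 9.0), (4.0, 12.0)]]
print(train([0.0], shards, steps=100))
```

The real `cifar10_multi_gpu_train.py` used below follows the same pattern, with TensorFlow placing one model replica on each GPU device.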
- Download the example source code to your home directory

```shell
cd ~ && git clone -b r0.10 --single-branch --recurse-submodules https://github.com/tensorflow/tensorflow.git
```
- Shell commands
- Change to the example directory containing this [code](https://github.com/tensorflow/tensorflow/blob/r0.10/tensorflow/models/image/cifar10/cifar10_multi_gpu_train.py)

```shell
cd ~/tensorflow/tensorflow/models/image/cifar10
```

- 1 GPU: Note the examples/sec and sec/batch figures

```shell
python cifar10_multi_gpu_train.py --num_gpus=1 --max_steps=1000

### EXPECTED OUTPUT ###
...
2016-09-08 16:10:30.718689: step 990, loss = 2.51 (717.6 examples/sec; 0.178 sec/batch)
```

- 2 GPUs: Note the increase in examples/sec and the decrease in sec/batch

```shell
python cifar10_multi_gpu_train.py --num_gpus=2 --max_steps=1000

### EXPECTED OUTPUT ###
...
2016-09-08 16:06:01.299470: step 990, loss = 2.31 (1342.8 examples/sec; 0.095 sec/batch)
```

- 4 GPUs: Note the further increase in examples/sec and decrease in sec/batch

```shell
python cifar10_multi_gpu_train.py --num_gpus=4 --max_steps=1000

### EXPECTED OUTPUT ###
...
2016-09-08 15:59:51.653752: step 990, loss = 2.36 (1925.9 examples/sec; 0.066 sec/batch)
```
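A quick sanity check on scaling from the throughput figures above (your numbers will vary by instance and driver version):

```python
# Speedup relative to the single-GPU run, from the examples/sec values reported above.
single_gpu = 717.6
for gpus, rate in [(2, 1342.8), (4, 1925.9)]:
    print("%d GPUs: %.2fx" % (gpus, rate / single_gpu))
# prints:
# 2 GPUs: 1.87x
# 4 GPUs: 2.68x
```

The less-than-linear scaling reflects the CPU-side synchronization and gradient-averaging overhead described in the steps above.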
- Run the example notebooks by navigating your browser to

```
http://<your-cloud-ip>:8754
```
- Start TensorBoard
- Note: This requires a TensorFlow notebook or script that writes to the logdir with [SummaryWriter](https://www.tensorflow.org/versions/r0.10/api_docs/python/train.html#SummaryWriter)

```shell
tensorboard --logdir <logdir>

### EXPECTED OUTPUT ###
...
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcublas.so locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcudnn.so locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcufft.so locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcurand.so locally
...
```
- Navigate your browser to the TensorBoard UI

```
http://<your-cloud-ip>:6006
```
- To save money, you can stop the EC2 instance when you are done experimenting
- When you start the EC2 instance back up, you will need to restart the Docker container

```shell
sudo docker start tensorflow-gpu
```

- Then you can shell back into the Docker container

```shell
sudo nvidia-docker exec -it tensorflow-gpu bash
```