# AWS GPU TensorFlow Docker

Based on this TensorFlow documentation
- Install the Nvidia GPU driver and blacklist the default `nouveau` driver
- Run these commands separately (i.e. don't copy/paste the whole block all at once)

```shell
sudo add-apt-repository -y ppa:graphics-drivers/ppa
sudo apt-get update
sudo apt-get install -y dkms
sudo apt-get install -y linux-headers-generic
sudo apt-get install -y nvidia-361
echo blacklist nouveau | sudo tee -a /etc/modprobe.d/blacklist-nouveau.conf
echo options nouveau modeset=0 | sudo tee -a /etc/modprobe.d/blacklist-nouveau.conf
sudo update-initramfs -u
sudo apt-get install -y nvidia-modprobe
```
- Create either a `g2.2xlarge` or a `g2.8xlarge` EC2 instance
- Note: Only tested with EC2 instances running Ubuntu 14.04
- Set up the latest Docker (1.12+)
- DO NOT RELY ON THE DEFAULT VERSION PROVIDED BY YOUR OS!

```shell
sudo apt-get update
sudo curl -fsSL https://get.docker.com/ | sh
sudo curl -fsSL https://get.docker.com/gpg | sudo apt-key add -
```
- Set up nvidia-docker on the AWS EC2 instance host, then pull the TensorFlow GPU image

```shell
wget https://github.com/NVIDIA/nvidia-docker/releases/download/v1.0.0-rc.3/nvidia-docker_1.0.0.rc.3-1_amd64.deb
sudo dpkg -i nvidia-docker_1.0.0.rc.3-1_amd64.deb
sudo rm nvidia-docker_1.0.0.rc.3-1_amd64.deb
sudo docker pull gcr.io/tensorflow/tensorflow:0.10.0-gpu
```
- Start the Docker container with a Jupyter/IPython notebook, then shell into it
- Note: This Docker image is from DockerHub, based on this Dockerfile

```shell
sudo nvidia-docker run -itd --name=tensorflow-gpu -p 8754:8888 -p 6006:6006 gcr.io/tensorflow/tensorflow:0.10.0-gpu
sudo nvidia-docker exec -it tensorflow-gpu bash
```
- Verify that the GPUs are visible from inside the container

```shell
nvidia-smi
```

- `g2.2xlarge` EC2 instance: 1 Nvidia K520 GPU
- `g2.8xlarge` EC2 instance: 4 Nvidia K520 GPUs
- Verify that the Jupyter notebook server is running

```shell
ps -aef | grep jupyter

### EXPECTED OUTPUT ###
...
root 1 0 0 13:40 ? 00:00:00 bash /run_jupyter.sh
root 7 1 0 13:40 ? 00:00:01 /usr/bin/python /usr/local/bin/jupyter-notebook
root 13 7 47 13:46 ? 00:00:29 /usr/bin/python -m ipykernel -f /root/.local/share/jupyter/runtime/kernel-c6494c82-7072-43b4-8ab2-7f5110d6b767.json
...
```
- Verify that TensorBoard launches (for now, ignore the ERROR about a missing logdir)

```shell
tensorboard

### EXPECTED OUTPUT ###
...
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcublas.so locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcudnn.so locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcufft.so locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcurand.so locally
ERROR:tensorflow:A logdir must be specified. Run `tensorboard --help` for details and examples.
A logdir must be specified. Run `tensorboard --help` for details and examples.
```
- The following is an example of multi-GPU, data-parallel training across GPUs on a single host
  - Step 1: CPU transfers the model to each GPU
  - Step 2: CPU synchronizes and waits for all GPUs to process their batches
  - Step 3: CPU copies all training results (gradients) back from the GPUs
  - Step 4: CPU builds a new model from the average of the gradients from all GPUs
  - Step 5: Repeat from Step 1 until a stop condition is reached (i.e. `--max_steps=1000`)
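The five steps above can be sketched in plain Python (this is an illustration of the data-parallel averaging loop, not the actual TensorFlow code; `compute_gradient` is a hypothetical stand-in for one GPU's forward/backward pass):

```python
def compute_gradient(weights, batch):
    # Stand-in for one GPU's work: gradient of a squared-error loss
    # for a one-parameter linear model y = w * x.
    w = weights[0]
    g = sum(2 * (w * x - y) * x for x, y in batch) / len(batch)
    return [g]

def train(weights, shards, steps, lr=0.1):
    for _ in range(steps):                        # Step 5: repeat until max_steps
        grads = [compute_gradient(weights, s)     # Steps 1-2: each "GPU" receives the
                 for s in shards]                 # model and processes its own batch
        avg = [sum(g) / len(grads)                # Steps 3-4: gather gradients and
               for g in zip(*grads)]              # average them on the CPU
        weights = [w - lr * g for w, g in zip(weights, avg)]
    return weights

# Two simulated GPUs, each holding its own shard of data for y = 3x;
# the averaged-gradient updates converge toward w = 3.
shards = [[(1.0, 3.0), (2.0, 6.0)], [(3.0, 9.0), (4.0, 12.0)]]
print(train([0.0], shards, steps=100))
```

The real `cifar10_multi_gpu_train.py` used below follows the same pattern, with TensorFlow placing one model replica on each GPU device.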
- Download the example source code to your home directory

```shell
cd ~ && git clone -b r0.10 --single-branch --recurse-submodules https://github.com/tensorflow/tensorflow.git
```
- Shell commands
- Change to the example directory containing this [code](https://github.com/tensorflow/tensorflow/blob/r0.10/tensorflow/models/image/cifar10/cifar10_multi_gpu_train.py)

```shell
cd ~/tensorflow/tensorflow/models/image/cifar10
```

- 1 GPU: Note the examples/sec and sec/batch figures

```shell
python cifar10_multi_gpu_train.py --num_gpus=1 --max_steps=1000

### EXPECTED OUTPUT ###
...
2016-09-08 16:10:30.718689: step 990, loss = 2.51 (717.6 examples/sec; 0.178 sec/batch)
```

- 2 GPUs: Note the increase in examples/sec and the decrease in sec/batch

```shell
python cifar10_multi_gpu_train.py --num_gpus=2 --max_steps=1000

### EXPECTED OUTPUT ###
...
2016-09-08 16:06:01.299470: step 990, loss = 2.31 (1342.8 examples/sec; 0.095 sec/batch)
```

- 4 GPUs: Note the further increase in examples/sec and decrease in sec/batch

```shell
python cifar10_multi_gpu_train.py --num_gpus=4 --max_steps=1000

### EXPECTED OUTPUT ###
...
2016-09-08 15:59:51.653752: step 990, loss = 2.36 (1925.9 examples/sec; 0.066 sec/batch)
```
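A quick sanity check on scaling from the throughput figures above (your numbers will vary by instance and driver version):

```python
# Speedup relative to the single-GPU run, from the examples/sec values reported above.
single_gpu = 717.6
for gpus, rate in [(2, 1342.8), (4, 1925.9)]:
    print("%d GPUs: %.2fx" % (gpus, rate / single_gpu))
# prints:
# 2 GPUs: 1.87x
# 4 GPUs: 2.68x
```

The less-than-linear scaling reflects the CPU-side synchronization and gradient-averaging overhead described in the steps above.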
- Run the example notebooks by navigating your browser to

```
http://<your-cloud-ip>:8754
```
- Start TensorBoard
- Note: This requires a TensorFlow notebook or script that writes to the logdir with [SummaryWriter](https://www.tensorflow.org/versions/r0.10/api_docs/python/train.html#SummaryWriter)

```shell
tensorboard --logdir <logdir>

### EXPECTED OUTPUT ###
...
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcublas.so locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcudnn.so locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcufft.so locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcurand.so locally
...
```
- Navigate your browser to the TensorBoard UI

```
http://<your-cloud-ip>:6006
```
- To save money, you can stop the EC2 instance when you are done experimenting
- When you start the EC2 instance back up, you will need to restart the Docker container

```shell
sudo docker start tensorflow-gpu
```

- Then you can shell back into the Docker container

```shell
sudo nvidia-docker exec -it tensorflow-gpu bash
```