Deep learning can require extremely powerful hardware, often for unpredictable durations of time. Moreover, MXNet can benefit from both multiple GPUs and multiple machines. Accordingly, cloud computing, as offered by AWS and others, is especially well suited to training deep learning models. Using AWS, we can rapidly fire up multiple machines with multiple GPUs each at will and maintain the resources for precisely the amount of time needed.
In this document, we provide a step-by-step guide to setting up an AWS cluster with MXNet. We show how to host data on Amazon S3, launch EC2 GPU instances with the required libraries, build and run MXNet, and train a model across a multi-machine cluster.
Amazon S3 provides distributed data storage which proves especially convenient for hosting large datasets.
To use S3, you need AWS credentials, including an ACCESS_KEY_ID and a SECRET_ACCESS_KEY.
To use MXNet with S3, set the environment variables AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY by adding the following two lines to ~/.bashrc (replacing the example strings with your own credentials):
export AWS_ACCESS_KEY_ID=AKIAIOSFODNN7EXAMPLE
export AWS_SECRET_ACCESS_KEY=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
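To confirm the variables are actually visible in your shell before launching a training job, a quick sanity check can help. The values below are the example keys from the AWS documentation, not real credentials:

```shell
# Example credentials from the AWS documentation; replace with your own.
export AWS_ACCESS_KEY_ID=AKIAIOSFODNN7EXAMPLE
export AWS_SECRET_ACCESS_KEY=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY

# Fail fast with an error if either variable is missing or empty.
: "${AWS_ACCESS_KEY_ID:?AWS_ACCESS_KEY_ID is not set}"
: "${AWS_SECRET_ACCESS_KEY:?AWS_SECRET_ACCESS_KEY is not set}"
echo "S3 credentials are configured"
```

If either variable is unset, the `:?` expansion prints an error and the shell exits, which is easier to diagnose than a failed S3 read later on.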
There are several ways to upload data to S3. One simple way is to use s3cmd. For example:
wget http://data.mxnet.io/mxnet/data/mnist.zip
unzip mnist.zip && s3cmd put t*-ubyte s3://dmlc/mnist/
The Deep Learning AMI is an Amazon Linux image supported and maintained by Amazon Web Services for use on Amazon Elastic Compute Cloud (Amazon EC2). It contains the MXNet v0.9.3 tag and the components needed to get going with deep learning, including Nvidia drivers, CUDA, cuDNN, Anaconda, Python 2, and Python 3.
The AMI IDs are the following:
Now you can launch MXNet directly on an EC2 GPU instance.
You can also run a Jupyter notebook on an EC2 machine. There is a good tutorial on how to connect to a Jupyter notebook running on an EC2 instance.
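A common way to reach a notebook running on an instance is an SSH tunnel. A sketch, where the key path and hostname are placeholders for your own instance:

```shell
# Forward local port 8888 to the notebook running on the instance.
# The key path and hostname below are placeholders, not real values.
ssh -i ~/.ssh/mykey.pem -L 8888:localhost:8888 ubuntu@ec2-public-dns.compute.amazonaws.com
```

With the tunnel open, the notebook is available at http://localhost:8888 in a local browser.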
MXNet requires the following libraries:

- gcc >= 4.8
- CUDA (cuDNN is optional) for GPU linear algebra
- BLAS (cblas, open-blas, atlas, mkl, or others) for CPU linear algebra
- opencv for image augmentations
- curl and openssl for the ability to read/write to Amazon S3

Installing CUDA on EC2 instances requires some effort. Caffe has a good tutorial on how to install CUDA 7.0 on Ubuntu 14.04.
Note: We tried CUDA 7.5 on Nov 7, 2015, but found it problematic.
You can install the rest using the package manager. For example, on Ubuntu:
sudo apt-get update
sudo apt-get install -y build-essential git libcurl4-openssl-dev libatlas-base-dev libopencv-dev python-numpy
The Amazon Machine Image (AMI) ami-12fd8178 has the packages listed above installed.
The following commands build MXNet with CUDA/CUDNN, Amazon S3, and distributed training.
git clone --recursive https://github.com/dmlc/mxnet
cd mxnet; cp make/config.mk .
echo "USE_CUDA=1" >> config.mk
echo "USE_CUDA_PATH=/usr/local/cuda" >> config.mk
echo "USE_CUDNN=1" >> config.mk
echo "USE_BLAS=atlas" >> config.mk
echo "USE_DIST_KVSTORE=1" >> config.mk
echo "USE_S3=1" >> config.mk
make -j$(nproc)
To test whether everything is installed properly, we can try training a convolutional neural network (CNN) on the MNIST dataset using a GPU:
python example/image-classification/train_mnist.py
If you've placed the MNIST data on s3://dmlc/mnist, you can read the data stored on Amazon S3 directly with the following command:
sed -i.bak "s!data_dir = 'data'!data_dir = 's3://dmlc/mnist'!" example/image-classification/train_mnist.py
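If you want to preview what the sed command rewrites before touching the training script, you can run the same substitution on a scratch file (the scratch path below is just for illustration):

```shell
# Recreate the line the command targets in a scratch file.
printf "data_dir = 'data'\n" > /tmp/train_mnist_line.py

# Same substitution as above; -i.bak keeps a backup of the original.
sed -i.bak "s!data_dir = 'data'!data_dir = 's3://dmlc/mnist'!" /tmp/train_mnist_line.py
cat /tmp/train_mnist_line.py
```

The `.bak` backup also makes it easy to revert the real script if you later want to train from local data again.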
Note: You can use sudo ln /dev/null /dev/raw1394 to fix the opencv error libdc1394 error: Failed to initialize libdc1394.
A cluster consists of multiple computers. You can use one computer with MXNet installed as the root computer for submitting jobs, and then launch several slave computers to run the jobs. For example, launch multiple instances using an AMI, e.g., ami-12fd8178, with the dependencies installed. There are two options:
Make all slaves' ports accessible (same for the root) by setting type: All TCP, Source: Anywhere in Configure Security Group.
Use the same pem file as the root computer to access all slave computers, and then copy the pem file to the root computer's ~/.ssh/id_rsa. If you do this, all slave computers can be accessed with SSH from the root.
Now, run the CNN on multiple computers. Assume that we are in a working directory on the root computer, such as ~/train, and that MXNet is built at ~/mxnet. First, copy the MXNet Python package and library into the working directory:
cp -r ~/mxnet/python/mxnet .
cp ~/mxnet/lib/libmxnet.so mxnet/
And then copy the training program:
cp ~/mxnet/example/image-classification/*.py .
cp -r ~/mxnet/example/image-classification/common .
Prepare a hosts file containing the private IP addresses of the slave computers, one per line. For example:
cat hosts
172.30.0.172
172.30.0.171
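Before launching a job, it can be worth checking that the root can actually reach every slave over passwordless SSH. A small sketch (it recreates the example hosts file above so the snippet is self-contained; BatchMode makes ssh fail instead of prompting when key-based login isn't working):

```shell
# Example hosts file; replace the IPs with your own slaves'.
printf '172.30.0.172\n172.30.0.171\n' > hosts

# Try a no-op command on each host; record the result instead of aborting.
while read -r host; do
  if ssh -o StrictHostKeyChecking=no -o BatchMode=yes "$host" true 2>/dev/null; then
    echo "ok: $host"
  else
    echo "unreachable: $host"
  fi
done < hosts > ssh_check.log
cat ssh_check.log
```

Any host reported as unreachable will also fail when launch.py tries to start a worker on it, so this narrows down connectivity problems early.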
Then launch the distributed training job:
../../tools/launch.py -n 2 -H hosts --sync-dir /tmp/mxnet python train_mnist.py --kv-store dist_sync
Note: Sometimes the jobs linger on the slave computers even though you've pressed Ctrl-C at the root node. To terminate them, use the following command:
cat hosts | xargs -I{} ssh -o StrictHostKeyChecking=no {} 'uname -a; pgrep python | xargs kill -9'
Note: The preceding example is very simple to train and therefore isn't a good benchmark for distributed training. Consider using other examples.
It is common to pack a dataset into multiple files, especially when working in a distributed environment. MXNet supports direct loading from multiple data shards. Put all of the record files into a folder, and point the data path to the folder.
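For example, a dataset split into several record shards might be laid out as follows (the folder and file names are hypothetical):

```shell
# Hypothetical shard layout: three .rec files in one folder.
mkdir -p data/train
touch data/train/part-0.rec data/train/part-1.rec data/train/part-2.rec
ls data/train
```

Pointing the data path at data/train (rather than at a single file) lets MXNet pick up all three shards.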
Although using SSH can be simple when you don't have a cluster scheduling framework, MXNet is designed to be portable to various platforms. We provide scripts in tracker for running MXNet on other cluster frameworks, including Hadoop (YARN) and SGE. We welcome contributions from the community with examples of running MXNet on your favorite distributed platform.