# Train with Docker

To simplify installation on GPU machines and let you get started with AutoDist in minutes, the AutoDist team has published a reference Dockerfile.

## Building

First, clone the AutoDist repository:

```shell
git clone https://github.com/petuum/autodist.git
```

Once the repository is cloned, you can build the Docker image with the provided Dockerfile:

```shell
cd autodist
docker build -t autodist:latest -f docker/Dockerfile.gpu .
```

## Running on a single machine

Install nvidia-docker to run the built container with GPU access. `<WORK_DIR>` is the directory where your Python script is located; in this example, we are using `autodist/examples/` as our `<WORK_DIR>`.

```shell
docker run --gpus all -it -v <WORK_DIR>:/mnt autodist:latest
```

Inside the Docker environment, you will be able to run the examples using the machine’s GPUs. Remember to follow the “Getting Started” tutorial to properly set up your `<WORK_DIR>/resource_spec.yml`.
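For a single machine, a minimal `resource_spec.yml` could look like the following sketch (the `gpus` indices are an assumption; adjust them to the GPUs actually present on your machine):

```yaml
nodes:
  - address: localhost
    gpus: [0, 1]
```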

```shell
python /mnt/linear_regression.py
```

## Multiple machine setup

In this section, we describe how to set up passwordless SSH between Docker containers on different machines. If your machines are already set up, you can skip to the “Running on multiple machines” section.

For passwordless authentication, the idea is to create a single private key shared by all containers and to authorize that key on every machine.

Create the `<SHARE_DIR>` that will hold the shared credentials:

```shell
mkdir <SHARE_DIR> && mkdir <SHARE_DIR>/.ssh
```

Create the credentials in the new folder (press Enter at the passphrase prompt, or pass `-N ''`, so that SSH stays non-interactive):

```shell
ssh-keygen -f <SHARE_DIR>/.ssh/id_rsa
```

Append the created public key to the `<SHARE_DIR>/.ssh/authorized_keys` file:

```shell
cat <SHARE_DIR>/.ssh/id_rsa.pub >> <SHARE_DIR>/.ssh/authorized_keys
```

Set up the `<SHARE_DIR>/.ssh` directory with the ownership and permissions that SSH requires.

Note: you may need to use `sudo` if permission is denied.

```shell
chown -R root <SHARE_DIR>/.ssh
chmod 700 <SHARE_DIR>/.ssh
chmod 600 <SHARE_DIR>/.ssh/authorized_keys
```
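The credential-setup steps above can be consolidated into one script. This is only a sketch: the `SHARE_DIR` default is an assumption you should override, and the `chown -R root` step (which usually needs `sudo`) is left out.

```shell
#!/bin/sh
# Sketch consolidating the credential-setup steps above.
# SHARE_DIR is an assumption: override it to your real shared directory.
set -e
SHARE_DIR="${SHARE_DIR:-/tmp/autodist-share}"

# Create the shared credential directory and an unencrypted key pair
# (-N '' sets an empty passphrase so SSH stays non-interactive).
mkdir -p "$SHARE_DIR/.ssh"
ssh-keygen -q -t rsa -N '' -f "$SHARE_DIR/.ssh/id_rsa"

# Authorize the shared key and lock down permissions for sshd.
cat "$SHARE_DIR/.ssh/id_rsa.pub" >> "$SHARE_DIR/.ssh/authorized_keys"
chmod 700 "$SHARE_DIR/.ssh"
chmod 600 "$SHARE_DIR/.ssh/authorized_keys"

# The `chown -R root` step described above still applies; it is omitted
# here because it requires sudo.
```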

Once you have set up the credential folder, make sure that every machine has access to this folder (or a copy of it), and that the ownership and permissions of the files are preserved. You can use rsync or scp to share the credential folder across all machines; for example:

```shell
sudo rsync -av --rsync-path "sudo rsync" <SHARE_DIR>/.ssh/ user@remote:<SHARE_DIR>/.ssh
```

## Running on multiple machines

Once you have set up passwordless SSH, configure `<WORK_DIR>/resource_spec.yml` following the “Getting Started” and “Train on Multiple Nodes” tutorials, with every worker machine’s port set to `<PORT_NUM>`.

This is an example of a `resource_spec.yml` file for a multi-machine setup with 12345 as the `<PORT_NUM>`:

```yaml
nodes:
  # multi-node docker experiment
  - address: 10.20.41.126
    gpus: [0,1]
    chief: true
  - address: 10.20.41.114
    gpus: [0,1]
    ssh_config: conf
  - address: 10.20.41.128
    gpus: [0,1]
    ssh_config: conf
ssh:
  conf:
    username: 'root'
    key_file: '/root/.ssh/id_rsa' # shared credential file
    port: 12345
```

Note: the `resource_spec.yml` file must be inside the `<WORK_DIR>` directory, as that is the directory mounted into the Docker container.

Before running the multi-machine training job, make sure the contents of `<WORK_DIR>` are identical on all node machines. In this example, we are using `autodist/examples/` as our `<WORK_DIR>`.
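One way to check that the directories match is to compute a single digest over the tree on each node and compare the results. The helper below is hypothetical (not part of AutoDist) and assumes GNU coreutils:

```shell
# Hypothetical helper: print one digest for a whole directory tree by
# hashing every file in a stable order, then hashing the list of
# per-file hashes. Assumes GNU coreutils (sha256sum, sort -z).
tree_digest() {
    (cd "$1" && find . -type f -print0 | sort -z \
        | xargs -0 sha256sum | sha256sum | cut -d' ' -f1)
}
# Run `tree_digest <WORK_DIR>` on each node; the digests must all match.
```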

Then, on every AutoDist worker machine, run:

```shell
docker run --gpus all -it --privileged -v <SHARE_DIR>/.ssh:/root/.ssh -v <WORK_DIR>:/mnt --network=host autodist:latest bash -c "/usr/sbin/sshd -p 12345; sleep infinity"
```

This starts an SSH daemon on port 12345 inside each worker container and keeps the container alive.

And on the AutoDist chief machine run:

```shell
docker run -it --gpus all --network=host -v <SHARE_DIR>/.ssh:/root/.ssh:ro -v <WORK_DIR>:/mnt autodist:latest
```

And inside the AutoDist chief’s Docker environment run:

```shell
python /mnt/linear_regression.py
```