Train with Docker
To facilitate the installation process on GPU machines and get started with AutoDist in minutes, the AutoDist team has published a reference Dockerfile.
First, clone the AutoDist repository:

```shell
git clone https://github.com/petuum/autodist.git
```
Once the repository is cloned, build the Docker image with the provided Dockerfile:

```shell
cd autodist
docker build -t autodist:latest -f docker/Dockerfile.gpu .
```
Running on a single machine
Install nvidia-docker to run the built container with GPU access.
`<WORK_DIR>` is the directory where your Python script is located. In this example, we are using `autodist/examples/` as our `<WORK_DIR>`.

```shell
docker run --gpus all -it -v <WORK_DIR>:/mnt autodist:latest
```
Inside the Docker environment, you will be able to run the examples using the machine's GPUs. Remember to follow the "Getting Started" tutorial to set everything up properly.
Multiple Machine Setup
In this section, we describe a way to set up passwordless SSH between Docker containers on different machines. If your machines are already set up, you can skip ahead to the Running on multiple machines section.

For passwordless authentication, the idea is to create one shared private key used by all containers, and to authorize the corresponding public key on all machines.
First, create a directory `<SHARE_DIR>` to hold the shared credentials:

```shell
mkdir <SHARE_DIR> && mkdir <SHARE_DIR>/.ssh
```
Create the credentials in that folder:

```shell
ssh-keygen -f <SHARE_DIR>/.ssh/id_rsa
```
Copy the created public key into the authorized keys file:

```shell
cat <SHARE_DIR>/.ssh/id_rsa.pub >> <SHARE_DIR>/.ssh/authorized_keys
```
Set up the `<SHARE_DIR>/.ssh` directory with the correct ownership and permissions for SSH. Note: you might need to use the `sudo` command if permission is denied.

```shell
chown -R root <SHARE_DIR>/.ssh
chmod 700 <SHARE_DIR>/.ssh
chmod 600 <SHARE_DIR>/.ssh/authorized_keys
```
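The steps above can be combined into one sketch. Here `SHARE_DIR` defaults to a throwaway temporary directory so the script can be tried safely, and the `chown -R root` step is omitted since it requires root; adjust both for a real deployment:

```shell
#!/bin/sh
# Sketch of the credential-folder setup above. SHARE_DIR defaults to a
# temp directory so this can be tried without touching system paths.
set -e
SHARE_DIR=${SHARE_DIR:-$(mktemp -d)}
mkdir -p "$SHARE_DIR/.ssh"

# Shared key pair; -N "" means no passphrase (passwordless login).
ssh-keygen -q -t rsa -N "" -f "$SHARE_DIR/.ssh/id_rsa"

# Authorize the shared key on every machine that mounts this folder.
cat "$SHARE_DIR/.ssh/id_rsa.pub" >> "$SHARE_DIR/.ssh/authorized_keys"

# sshd refuses keys kept in group- or world-accessible locations.
chmod 700 "$SHARE_DIR/.ssh"
chmod 600 "$SHARE_DIR/.ssh/authorized_keys"

echo "credentials ready in $SHARE_DIR/.ssh"
```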
Once you have set up the credential folder, you must make sure all your machines have access to this folder or a copy of it. You also need to make sure that the ownership and permissions of the files do not change. You can use `rsync` (or `scp`) to share this credential folder across all machines. For example:

```shell
sudo rsync -av --rsync-path "sudo rsync" <SHARE_DIR>/.ssh/ user@remote:<SHARE_DIR>/.ssh
```
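To push the folder to a whole list of workers, a small loop helps. `gen_copy_cmds` below is a hypothetical helper that only prints the rsync command for each host, so nothing runs by accident; pipe its output to `sh` once you have checked it:

```shell
#!/bin/sh
# Hypothetical helper: print (do not run) the rsync command that would
# push the shared credential folder to each worker host.
SHARE_DIR=${SHARE_DIR:-/shared/autodist-creds}

gen_copy_cmds() {
    for host in "$@"; do
        printf 'sudo rsync -av --rsync-path "sudo rsync" %s/.ssh/ user@%s:%s/.ssh\n' \
            "$SHARE_DIR" "$host" "$SHARE_DIR"
    done
}

# Worker addresses from the example below; adjust for your cluster.
gen_copy_cmds 10.20.41.114 10.20.41.128
```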
Running on multiple machines
Once you have set up passwordless SSH, configure `<WORK_DIR>/resource_spec.yml` following the "Getting Started" and "Train on Multiple Nodes" tutorials, with every worker machine's SSH port set to the port that the containers' sshd will listen on. This is an example of a `resource_spec.yml` file for a multiple-machine setup with `12345` as the SSH port:
```yaml
nodes:  # multi-nodes docker experiment
  - address: 10.20.41.126
    gpus: [0,1]
    chief: true
  - address: 10.20.41.114
    gpus: [0,1]
    ssh_config: conf
  - address: 10.20.41.128
    gpus: [0,1]
    ssh_config: conf

ssh:
  conf:
    username: 'root'
    key_file: '/root/.ssh/id_rsa'  # shared credential file
    port: 12345
```
The `resource_spec.yml` file must be inside the `<WORK_DIR>` directory, as that is the directory that will be mounted into the Docker container.
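As a quick local sanity check, you can verify that the spec actually sits inside `<WORK_DIR>` (so it gets mounted at `/mnt`) and pins the expected SSH port. `check_spec` is a hypothetical helper, not part of AutoDist:

```shell
#!/bin/sh
# Hypothetical helper: check that resource_spec.yml exists inside the
# directory that will be mounted, and that it uses the expected port.
check_spec() {
    spec="$1/resource_spec.yml"
    if [ ! -f "$spec" ]; then
        echo "missing: $spec"
        return 1
    fi
    if ! grep -q "port: $2" "$spec"; then
        echo "port $2 not found in $spec"
        return 1
    fi
    echo "ok: $spec"
}

# Usage: check_spec /path/to/WORK_DIR 12345
```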
Before running the multi-machine training job, make sure the contents of `<WORK_DIR>` are identical on all node machines. In this example, we are using `autodist/examples/` as our `<WORK_DIR>`.
Then, on each AutoDist worker machine, run:

```shell
docker run --gpus all -it --privileged -v <SHARE_DIR>/.ssh:/root/.ssh -v <WORK_DIR>:/mnt --network=host autodist:latest bash -c "/usr/sbin/sshd -p 12345; sleep infinity"
```
And on the AutoDist chief machine, run:

```shell
docker run -it --gpus all --network=host -v <SHARE_DIR>/.ssh:/root/.ssh:ro -v <WORK_DIR>:/mnt autodist:latest
```
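Once the chief container is up, it can help to confirm that the chief can actually reach each worker's sshd over the shared key before launching training. `check_workers` is a hypothetical helper; `SSH_CMD` defaults to the real ssh invocation for this setup and can be overridden for a dry run:

```shell
#!/bin/sh
# Hypothetical helper: attempt a no-op login to each worker through the
# shared key on the custom sshd port (12345, as configured above).
SSH_CMD=${SSH_CMD:-"ssh -p 12345 -i /root/.ssh/id_rsa -o BatchMode=yes -o StrictHostKeyChecking=no -o ConnectTimeout=3 -l root"}

check_workers() {
    for host in "$@"; do
        if $SSH_CMD "$host" true 2>/dev/null; then
            echo "OK: $host"
        else
            echo "FAILED: $host"
        fi
    done
}

# Usage (from the chief container), with the example worker addresses:
#   check_workers 10.20.41.114 10.20.41.128
```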
And inside the AutoDist chief's Docker environment, run: