# Train with Docker

To simplify installation on GPU machines and get started with AutoDist in minutes, the AutoDist team has published a reference [Dockerfile](https://github.com/petuum/autodist/blob/master/docker/Dockerfile.gpu).

## Building

First clone the AutoDist repository.

```bash
git clone https://github.com/petuum/autodist.git
```

Once we have cloned the repository, we can build the Docker image with the provided Dockerfile.

```bash
cd autodist
docker build -t autodist:latest -f docker/Dockerfile.gpu .
```

## Running on a single machine

Install [nvidia-docker](https://github.com/NVIDIA/nvidia-docker) to run the built container with GPU access. `<WORK_DIR>` is the directory where your Python script is located. In this example, we are using `autodist/examples/` as our `<WORK_DIR>`.

```bash
docker run --gpus all -it -v <WORK_DIR>:/mnt autodist:latest
```

Inside the Docker environment, you will be able to run the examples using the machine's GPUs. Remember to follow the "[Getting Started](getting-started.md)" tutorial to properly set up your `<WORK_DIR>/resource_spec.yml`.

```bash
python /mnt/linear_regression.py
```

## Multiple Machine Setup

This section describes one way to set up passwordless SSH between Docker containers on different machines. If your machines are already set up, you can go directly to the [Running on multiple machines](#running-on-multiple-machines) section.

For passwordless authentication, the idea is to create a shared private key, distribute it to all containers, and authorize this shared key on all machines.

Create a `<SHARE_DIR>` directory to hold the shared credentials.

```bash
mkdir <SHARE_DIR> && mkdir <SHARE_DIR>/.ssh
```

Create the credentials in the created folder (leave the passphrase empty when prompted, so the key can be used without a password).

```bash
ssh-keygen -f <SHARE_DIR>/.ssh/id_rsa
```

Append the created public key to the `<SHARE_DIR>/.ssh/authorized_keys` file.

```bash
cat <SHARE_DIR>/.ssh/id_rsa.pub >> <SHARE_DIR>/.ssh/authorized_keys
```

Set up the `<SHARE_DIR>/.ssh` directory with the correct ownership and permissions for SSH. **Note:** you might need to use `sudo` if permission is denied.

```bash
chown -R root <SHARE_DIR>/.ssh
chmod 700 <SHARE_DIR>/.ssh
chmod 600 <SHARE_DIR>/.ssh/authorized_keys
```

Once you have set up the credential folder, make sure every machine has access to this folder or a copy of it, and that the ownership and permissions of the files do not change. You can use `rsync` or `scp` to share the credential folder across all machines, for example:

```bash
sudo rsync -av --rsync-path "sudo rsync" <SHARE_DIR>/.ssh/ user@remote:<SHARE_DIR>/.ssh
```

## Running on multiple machines

Once passwordless SSH is set up, configure `<WORK_DIR>/resource_spec.yml` following "[Getting Started](getting-started.md)" and "[Train on Multiple Nodes](multi-node.md)", with every worker machine's SSH port set to `<PORT>`. This is an example of a `resource_spec.yml` file for a multiple-machine setup with `12345` as the `<PORT>`:

```yaml
nodes:  # multi-node Docker experiment
  - address: 10.20.41.126
    gpus: [0,1]
    chief: true
  - address: 10.20.41.114
    gpus: [0,1]
    ssh_config: conf
  - address: 10.20.41.128
    gpus: [0,1]
    ssh_config: conf

ssh:
  conf:
    username: 'root'
    key_file: '/root/.ssh/id_rsa'  # shared credential file
    port: 12345
```

**Note**: the `resource_spec.yml` file must be inside the `<WORK_DIR>` directory, since that is the directory mounted inside the Docker container. Before running the multi-machine training job, make sure the contents of `<WORK_DIR>` are identical on all node machines. In this example, we are using `autodist/examples/` as our `<WORK_DIR>`.
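How you distribute `<WORK_DIR>` is up to you; as a rough sketch (assuming `rsync` is available on every node and that a `user` account on the workers can write to `<WORK_DIR>`), you could push the chief's copy to the worker addresses from the example `resource_spec.yml` above:

```bash
# Sketch only: push the chief's <WORK_DIR> (script + resource_spec.yml) to each
# worker so all nodes see identical contents. Adjust the user, hosts, and paths
# to your environment.
for host in 10.20.41.114 10.20.41.128; do
    rsync -av <WORK_DIR>/ user@"${host}":<WORK_DIR>/
done
```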
Then, on every AutoDist worker machine, run:

```bash
docker run --gpus all -it --privileged -v <SHARE_DIR>/.ssh:/root/.ssh -v <WORK_DIR>:/mnt --network=host autodist:latest bash -c "/usr/sbin/sshd -p 12345; sleep infinity"
```

Here `/usr/sbin/sshd -p 12345` starts an SSH daemon on port `12345` inside the container so the chief can reach the worker, and `sleep infinity` keeps the container running.

On the AutoDist chief machine, run:

```bash
docker run -it --gpus all --network=host -v <SHARE_DIR>/.ssh:/root/.ssh:ro -v <WORK_DIR>:/mnt autodist:latest
```

Finally, inside the AutoDist chief's Docker environment, run:

```bash
python /mnt/linear_regression.py
```
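If the job cannot reach the workers, one way to sanity-check the passwordless SSH setup (an optional troubleshooting sketch, not part of the AutoDist tooling) is to connect from inside the chief's container to each worker container's SSH daemon using the shared key:

```bash
# Sketch only: from inside the chief container, confirm that each worker's sshd
# (started above on port 12345) accepts the shared key without prompting for a
# password. The hosts match the example resource_spec.yml.
for host in 10.20.41.114 10.20.41.128; do
    ssh -i /root/.ssh/id_rsa -p 12345 -o StrictHostKeyChecking=no root@"${host}" hostname
done
```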