Cluster

Cluster contains almost all the networking code for AutoDist.

The notable exceptions are ResourceSpec and a couple of functions in autodist.utils.network.

Other modules use Cluster to handle three tasks on the nodes of the cluster defined by the ResourceSpec: 1) copying files, 2) writing files, and 3) running code.

Prerequisites:
  • TensorFlow is already installed in the environment of all nodes.

  • Only graph launching logic is supported. Only one node (the Chief) runs the session client.

  • AutoDist is already installed in the environment of the worker node Chief, where the main script runs.

  • The SSH private key for the other nodes is accessible to AutoDist on the Chief via a given path.

class Cluster(resource_spec: autodist.resource_spec.ResourceSpec)

Bases: object

Cluster manager for TensorFlow servers.

is_chief(address=None)

Check whether an address is the chief.

If address is not provided, the method checks whether the local address is the chief.

Parameters

address (str) – node address, e.g. an IP address

Returns

Whether the given address (or the local node, if no address is given) is the chief

Return type

bool
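The fallback to the local address can be illustrated with a minimal sketch. This is not AutoDist's implementation: `ToyCluster`, `chief_address`, and `local_address` are hypothetical names used only to mimic the documented contract.

```python
class ToyCluster:
    """Hypothetical stand-in that mimics the documented is_chief() contract."""

    def __init__(self, chief_address, local_address):
        self.chief_address = chief_address  # assumed: address of the chief node
        self.local_address = local_address  # assumed: address of this node

    def is_chief(self, address=None):
        # Fall back to the local address when no address is given.
        address = address or self.local_address
        return address == self.chief_address


cluster = ToyCluster(chief_address="10.0.0.1", local_address="10.0.0.2")
local_is_chief = cluster.is_chief()            # checks the local node
given_is_chief = cluster.is_chief("10.0.0.1")  # checks an explicit address
```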

get_address_from_task(job_name, task_index)

Given a job name and task index, return the address.

Parameters
  • job_name (str) – job name

  • task_index (int) – task index

Returns

The address for that task

Return type

str
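As a rough illustration of the lookup, the sketch below assumes a TensorFlow-style cluster spec that maps each job name to a list of "host:port" strings indexed by task index. The spec layout, the `address_from_task` helper, and the choice to strip the port are assumptions; this page does not say whether the returned address includes a port.

```python
# Assumed layout: job name -> list of "host:port" strings, one per task index.
cluster_spec = {"worker": ["10.0.0.1:15000", "10.0.0.2:15000"]}


def address_from_task(spec, job_name, task_index):
    """Return the host part of the task's "host:port" entry (hypothetical helper)."""
    host, _, _port = spec[job_name][task_index].partition(":")
    return host


addr = address_from_task(cluster_spec, "worker", 1)
```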

get_local_address()

Get the local (IP) address.

If the AUTODIST_WORKER environment variable is set, its value is the address of the local node; otherwise the local node is the chief.

Returns

The worker's IP address, or the chief's address by default.

Return type

str
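The environment-variable rule can be sketched as below. Only the variable name AUTODIST_WORKER comes from this page; the `local_address` helper and its `chief_address` and `environ` parameters are hypothetical and exist to keep the example testable.

```python
import os


def local_address(chief_address, environ=None):
    """Return the AUTODIST_WORKER value if set, else the chief address."""
    environ = os.environ if environ is None else environ
    # An unset (or empty) variable means this process runs on the chief.
    return environ.get("AUTODIST_WORKER") or chief_address


on_chief = local_address("10.0.0.1", environ={})
on_worker = local_address("10.0.0.1", environ={"AUTODIST_WORKER": "10.0.0.2"})
```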

get_local_worker_task_index()

Get the (first) TensorFlow task index of the “worker” job for the local node.

Returns

Task index

Return type

int

get_local_session_target()

Get the session target of the local session.

Returns

Local session target

Return type

str
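For orientation, TensorFlow session targets for in-cluster servers are strings of the form "grpc://host:port". The sketch below only shows that string shape; the `session_target` helper and the example address are hypothetical, and the exact value AutoDist returns is not specified on this page.

```python
def session_target(full_address):
    """Build a gRPC session target from a "host:port" string (hypothetical helper)."""
    return "grpc://" + full_address


target = session_target("10.0.0.1:15000")
```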

start()

Start tf.servers on all nodes.

Note that this runs (and should only run) on the chief node.

terminate()

Terminate.

remote_pre_start_tf_server(hostname, tf_server_starter_filepath, working_dir='/tmp/autodist')

Prepare to start a TensorFlow server remotely.

Parameters
  • hostname (str) – host name or address

  • tf_server_starter_filepath (str) – local starter file path

  • working_dir (str) – remote working directory

abstract remote_exec(args, hostname)

Execute a bash script remotely.

Parameters
  • args (list) – bash commands

  • hostname (str) – host name or address

Returns

process handle

Return type

Process

abstract remote_file_write(remote_path, data, hostname)

Write a remote file.

Parameters
  • remote_path (str) – remote file path

  • data (str) – data to be written

  • hostname (str) – host name or address

abstract remote_copy(local_path, remote_path, hostname)

Copy a file to a remote directory.

Parameters
  • local_path (str) – local file path to be copied

  • remote_path (str) – remote directory path

  • hostname (str) – host name or address

class SSHCluster(resource_spec)

Bases: autodist.cluster.Cluster

An AutoDist Cluster Based on SSH.

remote_exec(args, hostname)

Execute a bash script remotely.

Parameters
  • args (list) – bash commands

  • hostname (str) – host name or address

Returns

process handle

Return type

Process

remote_file_write(remote_path, data, hostname)

Write a remote file.

Parameters
  • remote_path (str) – remote file path

  • data (str) – data to be written

  • hostname (str) – host name or address

remote_copy(local_path, remote_path, hostname)

Copy a file to a remote directory.

Parameters
  • local_path (str) – local file path to be copied

  • remote_path (str) – remote directory path

  • hostname (str) – host name or address
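To make the SSH-based operations concrete, the sketch below builds the equivalent OpenSSH command lines for a remote command and a file copy. It is not how SSHCluster is implemented (this page does not describe its transport); `ssh_argv`, `scp_argv`, the `user` default, and the key path are all hypothetical. Only the `ssh -i` and `scp -i` identity-file options are standard OpenSSH usage.

```python
import shlex


def ssh_argv(args, hostname, key_path, user="ubuntu"):
    """Build an OpenSSH argv that runs the command `args` on `hostname`."""
    # shlex.join quotes each argument so the remote shell sees it intact.
    return ["ssh", "-i", key_path, f"{user}@{hostname}", shlex.join(args)]


def scp_argv(local_path, remote_dir, hostname, key_path, user="ubuntu"):
    """Build an scp argv that copies `local_path` into `remote_dir`."""
    return ["scp", "-i", key_path, local_path, f"{user}@{hostname}:{remote_dir}"]


exec_cmd = ssh_argv(["mkdir", "-p", "/tmp/autodist"], "10.0.0.2", "~/.ssh/id_rsa")
copy_cmd = scp_argv("/tmp/starter.py", "/tmp/autodist", "10.0.0.2", "~/.ssh/id_rsa")
```

The resulting lists could be passed to `subprocess.Popen`, which would give a process handle comparable in spirit to the one remote_exec is documented to return.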

get_address_from_task(job_name, task_index)

Given a job name and task index, return the address.

Parameters
  • job_name (str) – job name

  • task_index (int) – task index

Returns

The address for that task

Return type

str

get_local_address()

Get the local (IP) address.

If the AUTODIST_WORKER environment variable is set, its value is the address of the local node; otherwise the local node is the chief.

Returns

The worker's IP address, or the chief's address by default.

Return type

str

get_local_session_target()

Get the session target of the local session.

Returns

Local session target

Return type

str

get_local_worker_task_index()

Get the (first) TensorFlow task index of the “worker” job for the local node.

Returns

Task index

Return type

int

is_chief(address=None)

Check whether an address is the chief.

If address is not provided, the method checks whether the local address is the chief.

Parameters

address (str) – node address, e.g. an IP address

Returns

Whether the given address (or the local node, if no address is given) is the chief

Return type

bool

remote_pre_start_tf_server(hostname, tf_server_starter_filepath, working_dir='/tmp/autodist')

Prepare to start a TensorFlow server remotely.

Parameters
  • hostname (str) – host name or address

  • tf_server_starter_filepath (str) – local starter file path

  • working_dir (str) – remote working directory

start()

Start tf.servers on all nodes.

Note that this runs (and should only run) on the chief node.

terminate()

Terminate.