Cluster
Cluster contains almost all the networking code for AutoDist.
The notable exceptions are ResourceSpec and a couple of functions in autodist.utils.network.
Cluster will be used by other modules to handle:
1) Copying files
2) Writing files
3) Running code on nodes in the cluster defined by the ResourceSpec.
Prerequisites:
* TensorFlow is already installed in the env of all nodes.
* Only supports graph launching logic. Only one node (the Chief) runs the session client.
* AutoDist is already installed in the env of the worker node Chief, where the main script runs.
* The SSH private key to other nodes is accessible by AutoDist on Chief given a path.
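The three operations Cluster handles (copying files, writing files, and running code on remote nodes) can be pictured as SSH and SCP command lines issued from the Chief. The sketch below is illustrative only: it assumes OpenSSH client tools, and the helper names `remote_exec_cmd`, `remote_copy_cmd`, and `remote_write_cmd` are hypothetical, not part of AutoDist. It only builds argument lists and does not run anything.

```python
import shlex


def remote_exec_cmd(host, key_path, command):
    """Build an argv list that would run `command` on `host` over SSH.

    Hypothetical helper; `-i` selects the private key file (OpenSSH).
    """
    return ["ssh", "-i", key_path, host, command]


def remote_copy_cmd(host, key_path, local_path, remote_dir):
    """Build an argv list that would copy a local file to `remote_dir` on `host`."""
    return ["scp", "-i", key_path, local_path, f"{host}:{remote_dir}"]


def remote_write_cmd(host, key_path, contents, remote_path):
    """Build an argv list that would write `contents` to a file on `host`.

    The contents and path are shell-quoted because the remote shell
    interprets the command string.
    """
    shell_cmd = f"printf %s {shlex.quote(contents)} > {shlex.quote(remote_path)}"
    return remote_exec_cmd(host, key_path, shell_cmd)
```

Any of the returned lists could be passed to `subprocess.run` on the Chief, provided the key path mentioned in the prerequisites is readable there.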
-
class Cluster(resource_spec: autodist.resource_spec.ResourceSpec)
Bases: object
Cluster manager for TensorFlow servers.
-
is_chief(address=None)
Check whether an address is chief or not.
If the argument address is not provided, it will check whether the local address is chief.
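The fallback behavior of `is_chief` can be sketched as a standalone function. This is a minimal illustration, not AutoDist's implementation: it assumes the chief address and the local address are already known and passes them in explicitly, whereas the real method takes only `address`.

```python
def is_chief(chief_address, local_address, address=None):
    """Return True if `address` is the chief's address.

    Hypothetical standalone version: when `address` is omitted,
    the local address is checked instead.
    """
    if address is None:
        address = local_address
    return address == chief_address
```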
-
get_address_from_task(job_name, task_index)
Given a job name and task index, return the address.
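As a rough illustration of the lookup, the sketch below assumes a TensorFlow-style cluster spec: a dict mapping each job name to a list of `host:port` strings indexed by task. Both that layout and the choice to strip the port are assumptions made for the example, and the addresses are made up.

```python
def get_address_from_task(cluster_spec, job_name, task_index):
    """Look up the host address for a (job name, task index) pair.

    Hypothetical standalone version; `cluster_spec` is assumed to map
    job names to lists of "host:port" strings.
    """
    host_port = cluster_spec[job_name][task_index]
    # Drop the port, keeping only the host part.
    return host_port.rsplit(":", 1)[0]


# Example cluster spec with made-up addresses.
example_spec = {"worker": ["10.0.0.1:15000", "10.0.0.2:15000"]}
second_worker = get_address_from_task(example_spec, "worker", 1)
```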
-
get_local_address()
Get the local (IP) address.
If the AUTODIST_WORKER environment variable is set, its value is the address of the local node; otherwise the local node is the chief.
- Returns
The worker IP address, or the chief address by default.
- Return type
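The environment-variable rule described for `get_local_address` can be sketched as follows. This is a hedged illustration rather than the library's code: the chief address is passed in explicitly, the `environ` parameter is added only so the function can be exercised without touching the real environment, and treating an empty value as unset is an assumption.

```python
import os


def get_local_address(chief_address, environ=None):
    """Return the local node's address.

    Hypothetical standalone version: if AUTODIST_WORKER is set (and
    non-empty, by assumption), its value is the local address;
    otherwise the local node is taken to be the chief.
    """
    if environ is None:
        environ = os.environ
    return environ.get("AUTODIST_WORKER") or chief_address
```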
-
get_local_worker_task_index()
Get the (first) TensorFlow task index of the “worker” job for the local node.
- Returns
Task index
- Return type
-
get_local_session_target()
Get the session target of the local session.
- Returns
Local session target
- Return type
-
start()
Start tf.servers on all nodes.
Note that this runs (and should only run) on the chief node.
-
remote_pre_start_tf_server(hostname, tf_server_starter_filepath, working_dir='/tmp/autodist')
Prepare to start a TensorFlow server remotely.
-
-
class SSHCluster(resource_spec)
Bases: autodist.cluster.Cluster
An AutoDist Cluster Based on SSH.
-
get_address_from_task(job_name, task_index)
Given a job name and task index, return the address.
-
get_local_address()
Get the local (IP) address.
If the AUTODIST_WORKER environment variable is set, its value is the address of the local node; otherwise the local node is the chief.
- Returns
The worker IP address, or the chief address by default.
- Return type
-
get_local_session_target()
Get the session target of the local session.
- Returns
Local session target
- Return type
-
get_local_worker_task_index()
Get the (first) TensorFlow task index of the “worker” job for the local node.
- Returns
Task index
- Return type
-
is_chief(address=None)
Check whether an address is chief or not.
If the argument address is not provided, it will check whether the local address is chief.
-
remote_pre_start_tf_server(hostname, tf_server_starter_filepath, working_dir='/tmp/autodist')
Prepare to start a TensorFlow server remotely.