Strategy as ProtoBuf Message¶
AutoDist uses Protocol Buffers to standardize the representation of a distribution strategy and its configuration.
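For illustration, the sketch below round-trips a Strategy message through its binary wire format. The import path `autodist.proto.strategy_pb2` is an assumption of this example (it is where protoc-generated Python bindings would typically land), not part of the specification that follows.

```python
# Minimal sketch: serialize and re-parse a Strategy message.
# The import path below is an assumption of this example; the bindings are
# generated from strategy.proto by protoc and may live elsewhere in your build.
from autodist.proto import strategy_pb2

strategy = strategy_pb2.Strategy()
strategy.id = "20200101T000000"            # hypothetical unique identifier
wire_bytes = strategy.SerializeToString()  # compact binary wire format

restored = strategy_pb2.Strategy()
restored.ParseFromString(wire_bytes)
assert restored.id == strategy.id
```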
autodist/proto/graphitem.proto¶
AutoDist graph-item messages.
Represents the TensorFlow computational graph to be distributed, along with its essential metadata.
GraphItem¶
Represents the computational graph (and associated metadata) that the AutoDist backend transforms and distributes.
Field | Type | Description |
---|---|---|
graph_def | google.protobuf.Any | TensorFlow graph_def |
grad_target_pairs | GraphItem.GradTargetPairsEntry | Mapping from grad tensor name to variable name |
info | GraphItem.Info | essential graph metadata (see GraphItem.Info below) |
GraphItem.Info¶
Represents the essential, transformed subset of a TensorFlow MetaGraph.
Right now, it captures an essential AutoDist-specific subset of the MetaGraph collections. In the future, it will generalize to captures.
Field | Type | Description |
---|---|---|
variables | google.protobuf.Any | |
table_initializers | string | |
savers | google.protobuf.Any | |
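As a rough illustration of how these fields fit together, the sketch below packs a TensorFlow GraphDef and a VariableDef into the google.protobuf.Any fields of a GraphItem. The import path `autodist.proto.graphitem_pb2`, the tensor/op names, and the treatment of variables, table_initializers, and savers as repeated fields are all assumptions of this example.

```python
# Minimal sketch of populating a GraphItem (assumed bindings and field labels).
import tensorflow as tf
from autodist.proto import graphitem_pb2

graph = tf.Graph()
with graph.as_default():
    w = tf.compat.v1.get_variable("w", shape=[2, 2])

item = graphitem_pb2.GraphItem()
item.graph_def.Pack(graph.as_graph_def())           # GraphDef wrapped in an Any
item.grad_target_pairs["gradients/w_grad:0"] = "w"  # hypothetical grad tensor -> variable name

# GraphItem.Info -- this sketch assumes the three fields are repeated.
item.info.variables.add().Pack(w.to_proto())             # VariableDef wrapped in an Any
item.info.table_initializers.append("init_all_tables")   # hypothetical initializer op name
```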
autodist/proto/strategy.proto¶
AutoDist distributed strategy messages.
Represents how to distribute a TensorFlow computational graph.
Strategy¶
Represents the strategy the AutoDist backend will implement.
Field | Type | Description |
---|---|---|
id | string | unique strategy identifier |
path | string | optional temporary path where the serialized strategy message is stored |
node_config | Strategy.Node | configurations of individual nodes of the computational graph |
graph_config | Strategy.GraphConfig | configuration of the computational graph as a whole |
Strategy.GraphConfig¶
Represents the configuration of the graph as a whole.
Based on the list of replicas, the AutoDist backend does a combination of in-graph and between-graph distribution.
Field | Type | Description |
---|---|---|
replicas | string | the number of batch-splitting/data-parallel replicas |
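A minimal sketch of filling in the top-level Strategy fields together with its GraphConfig is given below. It assumes the bindings are importable as `autodist.proto.strategy_pb2` and that replicas is a repeated string field with one entry per data-parallel replica; the identifier, path, and device strings are hypothetical.

```python
# Minimal sketch of the top-level Strategy fields plus GraphConfig
# (assumed bindings; `replicas` is assumed to be a repeated string field).
from autodist.proto import strategy_pb2

strategy = strategy_pb2.Strategy()
strategy.id = "example-strategy-001"         # hypothetical unique identifier
strategy.path = "/tmp/example-strategy-001"  # hypothetical temp path for the serialized message

# One entry per batch-splitting / data-parallel replica (hypothetical device strings).
strategy.graph_config.replicas.extend([
    "192.168.0.1:GPU:0",
    "192.168.0.1:GPU:1",
])

with open(strategy.path, "wb") as f:         # persist to the temp path
    f.write(strategy.SerializeToString())
```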
Strategy.Node¶
Represents the configuration of an individual node in the graph.
Right now, these nodes are just variables in the graph, so the only information they contain is how to synchronize the variable’s gradients.
In the future, for node partitioning, these could be any node in the graph. In that case, they would also have more logic for partitioning the op.
Field | Type | Description |
---|---|---|
var_name | string | variable name |
PSSynchronizer | PSSynchronizer | one of the synchronizers to choose (mutually exclusive with AllReduceSynchronizer) |
AllReduceSynchronizer | AllReduceSynchronizer | one of the synchronizers to choose (mutually exclusive with PSSynchronizer) |
partitioner | string | optional partitioner configuration, e.g. `1,2,1` |
part_config | Strategy.Node | Optional node configs for each node partition (if partitioned) |
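The per-node configuration could be built as in the sketch below, assuming node_config and part_config are repeated fields and that the two synchronizer fields form a oneof (so populating one of them selects that synchronizer); the variable names and parameter-server devices are hypothetical.

```python
# Minimal sketch of per-node (per-variable) configuration (assumed bindings;
# node_config/part_config are assumed repeated, and the two synchronizer
# fields are assumed to form a oneof, so setting one selects it).
from autodist.proto import strategy_pb2

strategy = strategy_pb2.Strategy()

# An unpartitioned variable synchronized through a parameter server.
node = strategy.node_config.add()
node.var_name = "dense/kernel:0"                                 # hypothetical variable name
node.PSSynchronizer.reduction_destination = "192.168.0.1:CPU:0"  # hypothetical PS device
node.PSSynchronizer.sync = True

# A partitioned variable: a partitioner string plus one config per partition.
pnode = strategy.node_config.add()
pnode.var_name = "embedding/weights:0"   # hypothetical variable name
pnode.partitioner = "1,2,1"              # partitioner configuration string (see table above)
for ps_device in ("192.168.0.1:CPU:0", "192.168.0.2:CPU:0"):     # hypothetical PS devices
    part = pnode.part_config.add()
    part.PSSynchronizer.reduction_destination = ps_device
```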
autodist/proto/synchronizers.proto¶
AutoDist synchronization messages.
AllReduceSynchronizer¶
Synchronization using AllReduce.
Field | Type | Description |
---|---|---|
spec | AllReduceSynchronizer.Spec | Specification for collective communication |
compressor | AllReduceSynchronizer.Compressor | One of the compressors to choose |
group | int32 | The allreduce group to merge with. The group index should be less than the number of variables |
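A minimal sketch of an AllReduce configuration, assuming the generated bindings are importable as `autodist.proto.synchronizers_pb2` and that the nested enums are addressable as AllReduceSynchronizer.Spec and AllReduceSynchronizer.Compressor:

```python
# Minimal sketch of an AllReduce configuration (assumed bindings).
from autodist.proto import synchronizers_pb2

allreduce = synchronizers_pb2.AllReduceSynchronizer()
allreduce.spec = synchronizers_pb2.AllReduceSynchronizer.Spec.NCCL  # use ncclAllReduce
allreduce.compressor = synchronizers_pb2.AllReduceSynchronizer.Compressor.HorovodCompressor
allreduce.group = 0  # merge this variable into all-reduce group 0
```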
PSSynchronizer¶
Synchronization using a Parameter Server.
Field | Type | Description |
---|---|---|
reduction_destination | string | Parameter Server to use |
local_replication | bool | Whether to create local proxies of each PS variable |
sync | bool | Whether to sync gradients across between-graph replications |
staleness | int32 | Staleness |
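A minimal sketch of a Parameter Server configuration under the same binding assumption; the device string is hypothetical:

```python
# Minimal sketch of a Parameter Server configuration (assumed bindings).
from autodist.proto import synchronizers_pb2

ps = synchronizers_pb2.PSSynchronizer()
ps.reduction_destination = "192.168.0.1:CPU:0"  # hypothetical PS device string
ps.local_replication = True   # create local proxies of the PS variable
ps.sync = True                # synchronize gradients across between-graph replicas
ps.staleness = 0              # staleness bound (left at 0 in this sketch)
```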
AllReduceSynchronizer.Compressor¶
Which gradient compression method to use
Name | Number | Description |
---|---|---|
NoneCompressor | 0 | No compression |
HorovodCompressor | 1 | Horovod's compression |
HorovodCompressorEF | 2 | Horovod's compression with error feedback |
AllReduceSynchronizer.Spec¶
Which communication method to use
Name | Number | Description |
---|---|---|
AUTO | 0 | Let the runtime choose automatically |
NCCL | 1 | Use ncclAllReduce for all-reduce, and ring algorithms for all-gather |
RING | 2 | TensorFlow's ring algorithms for all-reduce and all-gather |
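Since these are standard protobuf enums, their names and numbers can be converted with the generated enum wrappers, as the sketch below shows (again assuming the `autodist.proto.synchronizers_pb2` import path):

```python
# Minimal sketch: mapping enum names to numbers with the generated wrappers
# (assumed bindings; Name()/Value() are standard protobuf enum helpers).
from autodist.proto import synchronizers_pb2

Spec = synchronizers_pb2.AllReduceSynchronizer.Spec
Compressor = synchronizers_pb2.AllReduceSynchronizer.Compressor

assert Spec.Value("NCCL") == 1
assert Spec.Name(2) == "RING"
assert Compressor.Name(0) == "NoneCompressor"
```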