Compute Servers and Distributed Workers
A Remote Services cluster is a collection of nodes of two different
types:
- COMPUTE: A Compute Server node supports the offloading of
optimization jobs. Features include load balancing, queueing, and
concurrent execution of jobs. A Compute Server license is required
on the node. A Compute Server node can also act as a Distributed
Worker.
- WORKER: A Distributed Worker node can be used to execute part of
a distributed algorithm. A license is not necessary to run a
Distributed Worker, because it is always used in conjunction with a
manager (another node or a client program) that requires a license.
A Distributed Worker node can only be used by one manager at a time
(i.e., its job limit is always set to 1).
By default, grb_rs will try to start a node in Compute Server mode,
and the node's license status will be INVALID if no license is found.
In order to start a Distributed
Worker, you need to set the WORKER property in the
grb_rs.cnf configuration file (or use the --worker
command-line flag):
WORKER=true
Once you form your cluster, the node type will be displayed in the
TYPE column of the output of grbcluster nodes:
> grbcluster nodes
ID       ADDRESS       STATUS TYPE    LICENSE PROCESSING #Q #R JL IDLE %MEM  %CPU
b7d037db server1:61000 ALIVE  COMPUTE VALID   ACCEPTING  0  0  10 19m  15.30 5.64
735c595f server2:61000 ALIVE  COMPUTE VALID   ACCEPTING  0  0  10 19m  10.45 8.01
eb07fe16 server3:61000 ALIVE  WORKER  VALID   ACCEPTING  0  0  1  <1s  11.44 2.33
4f14a532 server4:61000 ALIVE  WORKER  VALID   ACCEPTING  0  0  1  <1s  12.20 5.60
The node type cannot be changed once grb_rs has started. If you
wish to change the node type, you need to stop the node, change the
configuration, and restart the node. You may have to update your
license as well.
Distributed Optimization
When using distributed optimization, distributed workers are
controlled by a manager. There are two ways to set up the manager:
- The manager can be a job running on a Compute Server. In this
case, a job is submitted to the cluster and executes on one of the
COMPUTE nodes as usual. When the job reaches the point
where distributed optimization is requested, it also requests
some number of distributed workers (see the DistributedMIPJobs,
ConcurrentJobs, or TuneJobs parameters). WORKER nodes are chosen
first; if not enough are available, COMPUTE nodes are used as well.
The workload associated with managing the distributed algorithm is
quite light, so the initial job acts as both the manager and the
first worker. A sketch of this setup appears after this list.
- The manager can be the client program itself. In this case, the
manager does not participate in the distributed optimization; it
simply coordinates the efforts of the distributed workers. The
manager requests distributed workers from the cluster specified by
the WorkerPool parameter, and again the cluster will select
WORKER nodes first. If not enough are available, it will use
COMPUTE nodes as well. A sketch of this setup also appears after
this list.
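The following is a minimal gurobipy sketch of the first setup, in which
the manager is a job submitted to a Compute Server. The node address and
model file name are placeholders for your own cluster and model, and the
example assumes a distributed MIP solve, so it sets DistributedMIPJobs:

import gurobipy as gp

# Placeholder Compute Server address; the submitted job will act as the
# manager and as the first distributed worker.
env = gp.Env(empty=True)
env.setParam("ComputeServer", "server1:61000")
env.setParam("DistributedMIPJobs", 2)  # request two distributed workers
env.start()

model = gp.read("model.mps", env=env)  # placeholder model file
model.optimize()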
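The second setup, in which the client program itself acts as the manager,
might look like the following sketch. Here the job is not offloaded; the
client only coordinates the workers drawn from the cluster named in the
WorkerPool parameter. The address and model file are again placeholders:

import gurobipy as gp

# Placeholder Remote Services cluster to draw distributed workers from;
# the client machine acts as the manager and must be licensed for
# distributed algorithms (a DISTRIBUTED= line in the license file).
env = gp.Env(empty=True)
env.setParam("WorkerPool", "server3:61000")
env.setParam("DistributedMIPJobs", 2)  # number of distributed workers to request
env.start()

model = gp.read("model.mps", env=env)  # placeholder model file
model.optimize()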
In both cases, the machine where the manager runs must be licensed to
run distributed algorithms (you should see a DISTRIBUTED= line
in your license file).
It is typically better to use the Compute Server itself as the
distributed manager, rather than the client machine. This is
particularly true if the Compute Server and the workers are physically
close to each other, but physically distant from the client
machine. In a typical environment, the client machine will offload the
Gurobi computations onto the Compute Server, and the Compute Server
will then act as the manager for the distributed computation.