Failover

Failover is the mechanism by which an Understudy takes over from another machine in the network when that machine fails. Failover can be triggered in four ways:

Automatically, when a machine times out (for more information on automatic failover, see Failover Timeout)
Manually by a remote transport command
Manually by a button in the Session widget
Via the Failover API
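The four triggers, and the idea behind the automatic one, can be sketched as follows. This is only an illustration: the names `FailoverTrigger`, `has_timed_out`, `last_seen_s` and `timeout_s` are hypothetical, and the rule "silent for longer than the timeout" is an assumption. The actual behaviour is described under Failover Timeout.

```python
from enum import Enum


class FailoverTrigger(Enum):
    """The four ways failover can be triggered (names are illustrative)."""
    TIMEOUT = "automatic, machine timed out"
    REMOTE_TRANSPORT = "manual, remote transport command"
    SESSION_WIDGET = "manual, Session widget button"
    API = "Failover API"


def has_timed_out(last_seen_s: float, now_s: float, timeout_s: float) -> bool:
    """Assumed rule: a machine has timed out once it has been silent
    for longer than timeout_s seconds."""
    return (now_s - last_seen_s) > timeout_s
```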

Failing over a machine

Manual failover

Manual failover of a machine to its Understudy is performed by pressing the replace button next to the failed machine in the Session widget.

What happens when I press replace?

The machine being replaced is marked as “failed”. The Understudy replacing it takes on the characteristics of the failed machine, including its role and feeds. If matrix routing has been configured, the Understudy automatically sends the routing commands to the matrix to switch outputs from the failed machine to the Understudy. Matrix routing is sent as soon as the Understudy has been configured for its new role.

Designer cannot know why the machine has been replaced, nor what the failure condition might be, so it can make no assumptions about the machine’s current state. Designer therefore ignores any communication with the failed machine: while it is marked as failed, no further communication from it is accepted by the rest of the session.

The failed machine is instructed to stop sending out any network data. Depending on how the machine failed, however, Designer cannot guarantee that this message will be received and acted upon, for example if the failure is network-related.
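The replace sequence described above can be modelled with a small sketch. It is not Designer's implementation: `Machine`, `Session`, `replace`, `accepts_message_from` and `matrix_commands` are hypothetical names, and the matrix routing command is represented as a simple (from, to) pair rather than a real protocol message.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Machine:
    name: str
    role: str  # e.g. "director", "actor" or "understudy"
    feeds: List[str] = field(default_factory=list)
    failed: bool = False
    standing_in_for: Optional[str] = None


@dataclass
class Session:
    machines: Dict[str, Machine]
    matrix_configured: bool = False
    # Each entry stands for one routing command: (from_machine, to_machine).
    matrix_commands: List[Tuple[str, str]] = field(default_factory=list)

    def replace(self, failed_name: str, understudy_name: str) -> None:
        failed = self.machines[failed_name]
        understudy = self.machines[understudy_name]
        # 1. The replaced machine is marked as failed.
        failed.failed = True
        # 2. The Understudy takes on the failed machine's role and feeds.
        understudy.role = failed.role
        understudy.feeds = list(failed.feeds)
        understudy.standing_in_for = failed.name
        # 3. Only once reconfigured, and only if matrix routing is
        #    configured, the outputs are switched to the Understudy.
        if self.matrix_configured:
            self.matrix_commands.append((failed.name, understudy.name))

    def accepts_message_from(self, sender_name: str) -> bool:
        # Communication from a machine marked as failed is ignored.
        return not self.machines[sender_name].failed
```

After `session.replace("actor1", "understudy1")` in this model, `understudy1` carries the role and feeds of `actor1`, and `session.accepts_message_from("actor1")` returns `False`.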

Failover and live update

Live update is the d3net network layer responsible for the distribution of showfile edits to all machines in session.

During failover, live update is disconnected on the failed machine to prevent it from sending or receiving edits while it is in a bad state.

Actors and Directors have different live update behaviours when failed over.

Actor failover and live update

When an Actor is failed over, live update is disabled on that machine. The Understudy that takes over from it remains in the live update session and continues to receive edits as normal.

Director failover and live update

When a dedicated or non-dedicated Director is failed over, live update is disabled on that machine and on all other machines in the session. This is because the Director is responsible for marshalling all the live update data in the session, and if it is in a bad state an Understudy cannot reliably take over this responsibility.

Edits made on any machine will no longer be transferred to any other machine in the session until the session is restarted. Avoid making edits until live update can be reconnected.
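The difference between Actor and Director failover can be summarised in a short sketch. The function name `live_update_state` and the role strings are hypothetical, and the role "director" stands for both dedicated and non-dedicated Directors.

```python
from typing import Dict


def live_update_state(roles: Dict[str, str], failed: str) -> Dict[str, bool]:
    """Map each machine name to True if live update stays connected.

    roles maps machine name -> "director", "actor" or "understudy".
    failed is the name of the machine that has been failed over.
    """
    if roles[failed] == "director":
        # Director failover: live update is disabled for the whole session.
        return {name: False for name in roles}
    # Actor failover: only the failed machine is disconnected.
    return {name: name != failed for name in roles}
```

For example, failing over an Actor leaves every other machine, including the Understudy that replaced it, connected, while failing over the Director disconnects every machine.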

Restoring after failover

Previously replaced machines can be restored to their original role by pressing “restore” next to the machine in the Session widget.

When can I restore a failed machine?

What needs to happen between pressing replace and pressing restore depends on the nature of the failure. Depending on the circumstances in which the failover occurred, it may be possible to restore without re-launching Designer on the failed machine, or it may be necessary to re-launch Designer or reboot the machine first. If there is any doubt as to the state of the machine, it is recommended to turn the server off and on again before restoring.

What happens when I press restore?

The Understudy relinquishes its assumed role and is reconfigured back as an Understudy. If matrix routing is configured, it then sends the matrix routing command to switch the outputs back to the restored machine.
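The restore sequence can be sketched in the same spirit. `Machine` and `restore` are hypothetical names, the routing command is again a plain (from, to) pair, and clearing the `failed` flag on the original machine is an assumption about what "restored" implies.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Machine:
    name: str
    role: str  # e.g. "director", "actor" or "understudy"
    failed: bool = False
    standing_in_for: Optional[str] = None


def restore(original: Machine, understudy: Machine,
            matrix_configured: bool) -> List[Tuple[str, str]]:
    """Restore `original` and return the matrix routing commands issued."""
    if understudy.standing_in_for != original.name:
        raise ValueError("this Understudy is not standing in for that machine")
    # 1. The Understudy relinquishes its assumed role.
    understudy.role = "understudy"
    understudy.standing_in_for = None
    # Assumption: the restored machine is no longer marked as failed.
    original.failed = False
    # 2. Only then are outputs switched back, if matrix routing is configured.
    commands: List[Tuple[str, str]] = []
    if matrix_configured:
        commands.append((understudy.name, original.name))
    return commands
```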

If edits are made on the acting Director and the replaced machine is then brought back into the session, the edits already made on the acting Director are synchronised to the joining machine as part of project sync. This covers ONLY the changes as captured at that moment in time.
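The "moment in time" behaviour of project sync can be pictured as a snapshot copy. This is a loose analogy rather than a description of how project sync is implemented; `project_sync` and the edit dictionaries are invented for illustration.

```python
import copy
from typing import Dict, List


def project_sync(acting_director_edits: List[Dict]) -> List[Dict]:
    """Return a copy of the edits that exist at the moment of the sync."""
    return copy.deepcopy(acting_director_edits)


# Edits made on the acting Director before the machine rejoins.
director_edits = [{"layer": "video1", "brightness": 0.8}]

# The replaced machine rejoins and receives the snapshot.
rejoined_edits = project_sync(director_edits)

# An edit made afterwards is not part of that snapshot.
director_edits.append({"layer": "video1", "brightness": 0.5})
```

In this model `rejoined_edits` still holds only the first edit, which mirrors the point that the joining machine receives the changes as they stood when it joined.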

Considerations and recommendations

Where an Understudy targets multiple machines and has replaced one of them, Disguise does not recommend restoring the failed machine unless there is a good reason to do so, for example if another machine fails and the Understudy is now needed to replace it.

Reconfiguring a machine’s role and sending matrix commands over the network is additional work outside normal show operation, in a system that may already be in an abnormal state, and is therefore not recommended unless required.

If appropriate, the failed machine can be investigated, fixed and brought back into the session, but left marked as replaced until the Understudy is needed to take over another role. Should that need arise, the returned machine can be restored and the newly failed machine replaced. Scenarios in which this is needed are extremely rare.