Recommendations for restarting a DHCP failover pair

Question:

It's sometimes necessary to restart an ISC DHCP server. If you have two servers running as a failover pair, then there shouldn't be any significant interruption to client service during the restart - but when you need to restart both servers, what is the recommended process?

Answer:

Individual deployments vary, but here is a generic process that you can tailor to your specific environment's needs.

When restarting a failover pair, restart one first, and then the other - not both at the same time.
While one is being restarted, the other will go into 'communications interrupted' state (and will change its behavior on granting leases accordingly) but then should recover when its partner comes back online.
The servers will log all of these transitions and the server status can also be confirmed via OMAPI.
When restarting the pair, wait for the restart of the first one to complete fully before restarting the second one. That allows them time to reestablish communications and do pool balancing before you take the second one offline for its restart.

Which server should be restarted first?

If you've made significant changes such as extending or removing failover ranges, then after restarting one server there will be a configuration mismatch, and some errors will be logged until both servers have been restarted. Restarting the secondary first is generally considered better in this situation, although it probably won't make much difference to the overall start-up times.

Your goal is that the non-partnership parts of the configuration (i.e. the parts describing the addresses and pools and such) will be identical between the two peers once they have both restarted, although there may be a period during the transition where they are not.

For making the transition, the process flow would be similar to:

1. Modify the secondary's configuration file
2. Stop and restart the secondary using the new configuration
3. Modify the primary's configuration file
4. Stop and restart the primary
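
The process flow above could be scripted along the following lines. This is only a minimal sketch: the three callables (update_config, restart, wait_until_normal) are hypothetical hooks that you would implement for your own environment, and the server labels are illustrative.

```python
from typing import Callable, List, Sequence


def rolling_restart(
    update_config: Callable[[str], None],
    restart: Callable[[str], None],
    wait_until_normal: Callable[[str], bool],
    order: Sequence[str] = ("secondary", "primary"),
) -> List[str]:
    """Update and restart each failover peer in turn, never both at once.

    All three callables are hypothetical, site-specific hooks:
      update_config(server)     installs the new configuration file
      restart(server)           stops and restarts dhcpd on that server
      wait_until_normal(server) returns True once the pair is back in
                                normal state, False on timeout
    """
    completed: List[str] = []
    for server in order:
        update_config(server)
        restart(server)
        # Do not touch the next server until this one has fully recovered.
        if not wait_until_normal(server):
            raise RuntimeError(f"{server} did not return to normal state")
        completed.append(server)
    return completed
```

Raising an error when the first server fails to recover keeps the script from taking the second server offline while its partner is still out of service.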

How can I tell that the first server's restart has completed fully?

Apart from manually watching the servers' logs, there are two strategies that lend themselves to scripting:

Inspect the log entries as the server is restarting: The logged entries as the DHCP server completes the start process should look similar to those below:

dhcpd: Wrote 0 deleted host decls to leases file.
dhcpd: Wrote 0 new dynamic host decls to leases file.
dhcpd: Wrote 72468 leases to leases file.
dhcpd: Listening on LPF/eth0/00:19:b9:df:24:3b/172.16.201.32/27
dhcpd: Sending on LPF/eth0/00:19:b9:df:24:3b/172.16.201.32/27
dhcpd: Sending on Socket/fallback/fallback-net

Next, on the server that has just been restarted, you should see communications being reestablished with each failover partner (if there are multiple partners, you will need to check the failover peer communications status messages for every one of them):

failover peer foo: I move from normal to startup
failover peer foo: peer moves from normal to communications-interrupted
failover peer foo: I move from startup to normal
balancing pool 28593100 172.16.132.0/24  total 11  free 7  backup 4  lts 1  max-own (+/-)1
balanced pool 28593100 172.16.132.0/24  total 11  free 7  backup 4  lts 1  max-misbal 2
failover peer foo: peer moves from communications-interrupted to normal

From ISC DHCP version 4.3, dhcpd will log an explicit message to indicate that it has completed its start process. (This will be documented in the ISC DHCP release notes with reference RT #33208.)
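
A script could track the failover transition messages shown above along these lines. This is a sketch that assumes the log lines have exactly the format of the excerpt; the function names are illustrative and not part of ISC DHCP. It does not look for the explicit start-complete message added in 4.3, because the text of that message is not given here.

```python
import re
from typing import Iterable, Optional, Tuple

# Matches transition lines in the format of the excerpt above, e.g.
#   failover peer foo: I move from startup to normal
#   failover peer foo: peer moves from communications-interrupted to normal
TRANSITION = re.compile(
    r"failover peer (?P<peer>\S+): (?P<who>I move|peer moves) "
    r"from (?P<old>\S+) to (?P<new>\S+)"
)


def latest_states(
    lines: Iterable[str], peer: str
) -> Tuple[Optional[str], Optional[str]]:
    """Return (own_state, partner_state) as last logged for one peer name."""
    own: Optional[str] = None
    partner: Optional[str] = None
    for line in lines:
        match = TRANSITION.search(line)
        if match is None or match.group("peer") != peer:
            continue
        if match.group("who") == "I move":
            own = match.group("new")
        else:
            partner = match.group("new")
    return own, partner


def restart_complete(lines: Iterable[str], peer: str) -> bool:
    """True once both sides were last logged as having moved to normal."""
    return latest_states(lines, peer) == ("normal", "normal")
```

Feed the function only the lines logged since the restart began, otherwise transitions from an earlier run could be mistaken for the current state. Where there are multiple partners, call it once per peer name.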

In the simple case where there are two DHCP servers that partner only with each other, you could alternatively monitor the restart from the other side: watch the logs of the server that has not yet been restarted, and wait until it reports that its partner is fully operational again before restarting it in turn. On the server that has not been restarted, the failover peer communications status messages should look similar to these:

peer foo: disconnected
failover peer foo: I move from normal to communications-interrupted
failover peer foo: peer moves from normal to normal
failover peer foo: I move from communications-interrupted to normal
balancing pool 285ff100 172.16.132.0/24  total 11  free 7  backup 4  lts -1  max-own (+/-)1
balanced pool 285ff100 172.16.132.0/24  total 11  free 7  backup 4  lts -1  max-misbal 2
peer foo: disconnected
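
Seen from the server that stays up, the signal to wait for is its own move into communications-interrupted followed by its own move back to normal. A minimal sketch of that check is below; it assumes the message format of the excerpt above, and the function name is illustrative.

```python
from typing import Iterable


def partner_restart_observed(lines: Iterable[str], peer: str) -> bool:
    """True if this server logged an interruption and then a return to normal.

    A later interruption resets the result, so the answer always reflects
    the most recent outage seen in the supplied lines.
    """
    interrupted = (
        f"failover peer {peer}: I move from normal to communications-interrupted"
    )
    recovered = (
        f"failover peer {peer}: I move from communications-interrupted to normal"
    )
    seen_interruption = False
    done = False
    for line in lines:
        if interrupted in line:
            seen_interruption = True
            done = False
        elif recovered in line and seen_interruption:
            done = True
    return done
```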

Use OMAPI to test the server state - and only restart the second partner when the first one has restarted and completed syncing with and establishing normal communication with its partner.
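
One way to do the OMAPI check is to query the failover-state object with omshell and read back its local-state and partner-state attributes. The sketch below only builds the omshell input and interprets the output. It rests on several assumptions that you should verify against omshell(1) and dhcpd(8) for your version: that OMAPI is enabled, that port 7911 (a value commonly used in examples) matches the omapi-port in your configuration, that omshell prints integers as colon-separated hex bytes, and that state value 2 means normal. The function names are illustrative.

```python
import re
from typing import Dict, Optional

NORMAL = 2  # assumed from dhcpd(8): failover state value 2 means "normal"

STATE_LINE = re.compile(r"(local-state|partner-state)\s*=\s*(\S+)")


def failover_state_script(
    peer: str,
    host: str = "localhost",
    port: int = 7911,  # assumption: must match omapi-port in dhcpd.conf
    key_name: Optional[str] = None,
    key_secret: Optional[str] = None,
) -> str:
    """Build omshell input that opens the failover-state object for a peer."""
    lines = [f"server {host}", f"port {port}"]
    if key_name and key_secret:
        lines.append(f"key {key_name} {key_secret}")
    lines += ["connect", "new failover-state", f'set name = "{peer}"', "open"]
    return "\n".join(lines) + "\n"


def parse_states(output: str) -> Dict[str, int]:
    """Extract local-state and partner-state as integers from omshell output."""
    states: Dict[str, int] = {}
    for name, raw in STATE_LINE.findall(output):
        # omshell is assumed to print integers like 00:00:00:02
        states[name] = int(raw.replace(":", ""), 16) if ":" in raw else int(raw)
    return states


def both_normal(output: str) -> bool:
    """True when both ends of the failover relationship report normal."""
    states = parse_states(output)
    return (
        states.get("local-state") == NORMAL
        and states.get("partner-state") == NORMAL
    )
```

The script text can be piped to omshell on standard input, and the captured output passed to both_normal; repeat the check at intervals until it returns True or a timeout of your choosing expires.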

What is the recommended way to cleanly stop a DHCP server?

'kill' is the recommended option, except where there is a high turnover of leases and the production environment requires a high degree of reliability from DHCP. In that case, we'd suggest that administrators consider using OMAPI to control the daemon instead and to request a graceful shutdown.

The reason for this is that there is the slight possibility that by using kill, administrators may stop dhcpd in the middle of appending a lease to the leases file (in which case it may become corrupted). This risk, while tiny, may be significant enough for some administrators to prefer to use OMAPI instead.
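
For administrators who do choose OMAPI, a graceful shutdown can be requested through omshell by opening the server's control object and setting its state attribute to 2, as described in dhcpd(8). The sketch below wraps that sequence. The host, port and key parameters are assumptions to adapt to your own OMAPI configuration, and the function names are illustrative.

```python
import subprocess
from typing import Optional


def shutdown_script(
    host: str = "localhost",
    port: int = 7911,  # assumption: must match omapi-port in dhcpd.conf
    key_name: Optional[str] = None,
    key_secret: Optional[str] = None,
) -> str:
    """Build omshell input that asks dhcpd to shut down gracefully."""
    lines = [f"server {host}", f"port {port}"]
    if key_name and key_secret:
        lines.append(f"key {key_name} {key_secret}")
    # Opening the control object and setting state to 2 requests shutdown.
    lines += ["connect", "new control", "open", "set state = 2", "update"]
    return "\n".join(lines) + "\n"


def request_graceful_shutdown(**kwargs) -> str:
    """Pipe the script to omshell (assumed to be on PATH); return its output."""
    result = subprocess.run(
        ["omshell"],
        input=shutdown_script(**kwargs),
        capture_output=True,
        text=True,
        check=False,
    )
    return result.stdout
```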

How should a corrupted lease file be mended?

If the lease file has been corrupted in this way, the workaround is to manually edit the lease file and remove the truncated lease.
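
Locating the truncated entry can be partly automated. The sketch below is not an ISC tool. It assumes that the damage is an unterminated block at the very end of the file, and it simply counts braces, so it will be misled by braces inside quoted strings and will not notice an entry cut off before its opening brace. Run it only against a copy of the lease file, with dhcpd stopped, and inspect the result before using it.

```python
def drop_truncated_tail(text: str) -> str:
    """Remove a final top-level block that was never closed.

    Tracks brace depth line by line and remembers where the current
    top-level block started. If the file ends inside a block, everything
    from that starting line onward is dropped.
    """
    lines = text.splitlines(keepends=True)
    depth = 0
    start = None  # index of the line that opened the current top-level block
    for index, line in enumerate(lines):
        new_depth = depth + line.count("{") - line.count("}")
        if depth == 0 and new_depth > 0:
            start = index
        depth = max(new_depth, 0)
        if depth == 0:
            start = None
    if start is None:
        return text  # every block is closed; nothing to remove
    return "".join(lines[:start])
```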

Why is "kill" preferred to OMAPI in these recommendations?

The risks of using kill to stop dhcpd are minimal and acceptable in most environments. Administrators who are not already using OMAPI would find it cumbersome to set up solely for the purpose of controlling the dhcpd shutdown. However, if you already have OMAPI set up, then there is no disadvantage to using it to shut down the server.

Why is there no method to signal dhcpd to shut down gracefully, outside of OMAPI?

Although the two options for stopping DHCP documented above have worked well over the years, we have updated the signal handling to trigger the same graceful shutdown sequence as is invoked by the OMAPI "shutdown" command. This signal-based shutdown can be used for the server, the relay and the client alike. It is available in ISC DHCP versions 4.2.6 and greater, including all 4.3 (and newer) releases.
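
On versions with the updated signal handling, a script can stop the daemon by signalling the process named in its PID file. The sketch below makes two assumptions that are not stated in this article: that the signal is SIGTERM (which is what a plain kill sends by default), and that the PID file is /var/run/dhcpd.pid (adjust this to wherever your build or configuration writes it). The function names are illustrative.

```python
import os
import signal
from pathlib import Path
from typing import Callable


def read_pid(pid_file: str) -> int:
    """Read the daemon's process ID from its PID file."""
    return int(Path(pid_file).read_text().strip())


def request_shutdown(
    pid_file: str = "/var/run/dhcpd.pid",  # assumption: adjust for your system
    send: Callable[[int, int], None] = os.kill,
) -> int:
    """Send SIGTERM (an assumption, see above) to dhcpd; return the PID."""
    pid = read_pid(pid_file)
    send(pid, signal.SIGTERM)
    return pid
```

The send parameter exists only so that the function can be exercised without signalling a real process.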
