Why are lease times short and random in communications-interrupted state?

Question:

We experienced a DHCP server failure at our site. During recovery, with only one server in service, that server was operating in communications-interrupted mode. During this time we observed that it was handing out very short leases of apparently random lengths - for example 30 minutes, 15 minutes, 9 minutes, etc. Once the two servers were back in communication, it reverted to providing leases of the usual configured length.

What is the explanation for the random times on the short leases during the communications-interrupted period of operation? Shouldn't they all be given out as the configured Maximum Client Lead Time (MCLT)?

Answer:

This is the expected operation. Here is an explanation that tries to avoid delving too deeply into the algorithms that are used to calculate lease times given out by a failover pair.

The fundamental relationship on which the DHCPv4 failover protocol depends is: the lease expiration time known to a DHCP client MUST NOT exceed, by more than the MCLT, the later of (a) the partner lifetime acknowledged by the server's failover partner and (b) the current time. The server follows that rule and, depending on the lease state and timers, may calculate different lease lifetimes. A more detailed explanation follows.

The Maximum Client Lead Time is the longest time, beyond the lease end time that is known by both partners of a failover pair, that a lease can be assigned.
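The rule can be sketched in a few lines of Python. This is a simplified illustration, not the code of any particular DHCP server: the function name, its parameters, and the use of plain seconds for all times are assumptions made for the example.

```python
def max_lease_seconds(now, acked_partner_expiry, mclt, desired):
    """Longest lease (in seconds) that keeps the client within the MCLT rule.

    now                  -- current time, in seconds
    acked_partner_expiry -- lease end time the partner has acknowledged
    mclt                 -- configured Maximum Client Lead Time, in seconds
    desired              -- normally configured lease length, in seconds
    """
    # The client's expiry may not pass max(now, acknowledged expiry) + MCLT.
    latest_allowed_expiry = max(now, acked_partner_expiry) + mclt
    return min(desired, latest_allowed_expiry - now)


# Nothing acknowledged yet: the lease is limited to the MCLT.
print(max_lease_seconds(now=0, acked_partner_expiry=0, mclt=600, desired=86400))  # 600

# Partner has acknowledged an expiry one hour ahead: MCLT is added on top.
print(max_lease_seconds(now=1000, acked_partner_expiry=4600, mclt=600, desired=86400))  # 4200
```

When the acknowledged expiry is already in the past, the result falls back to exactly the MCLT, because the current time becomes the reference point.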

This is why a newly booting client is initially assigned a very short lease (of length MCLT). Having done this, the server that assigned the lease notifies its partner with a timestamp far enough in the future that it can provide a lease of the proper length when the client renews halfway through the initial lease period. Assuming the partner acknowledges this later time, all will work as expected.
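The two-step sequence for a new client might look like the following sketch. The MCLT of 10 minutes, the desired lease of one day, and in particular the value sent to the partner (renewal time plus the desired lease length) are assumptions chosen for illustration; a real server's calculation involves more detail.

```python
MCLT = 600        # assumed Maximum Client Lead Time: 10 minutes
DESIRED = 86400   # assumed configured lease length: 1 day


def grant_lease(now, acked_partner_expiry):
    """Lease length allowed by the MCLT rule, capped at the desired length."""
    return min(DESIRED, max(now, acked_partner_expiry) + MCLT - now)


# Step 1: new client at t=0; the partner has acknowledged nothing yet.
first_lease = grant_lease(now=0, acked_partner_expiry=0)

# The server tells its partner about a much later expiry (assumed here to be
# the expected renewal time plus the desired lease), and the partner acks it.
renew_at = first_lease // 2
acked = renew_at + DESIRED

# Step 2: the client renews halfway through the short initial lease.
second_lease = grant_lease(now=renew_at, acked_partner_expiry=acked)

print(first_lease, second_lease)  # 600 86400
```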

Now, think about what happens when a server is down, and/or when a failover pair can't communicate with each other.

The server running alone will be running with the state that it last had, including the lease end times that its partner said it 'knew'. Given this 'state' information about what the other server 'knows', the configured MCLT, and the fact that it is still out of communication with its peer, the newly calculated lease times get shorter and head towards the MCLT. They won't necessarily be that small immediately on a lease renewal, because the existing partner-acknowledged lease end time may still be in the future.
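To see the shrinking effect, the sketch below follows one client through several renewals while the partner is unreachable, so the acknowledged end time never moves. The numbers and the assumption that the client always renews exactly halfway through its lease are illustrative only.

```python
def successive_leases(acked_expiry, mclt, desired, renewals, start=0):
    """Lease lengths granted to one client while the partner cannot ack.

    The acknowledged expiry stays frozen, and the client is assumed to
    renew exactly halfway through each lease.
    """
    now = start
    leases = []
    for _ in range(renewals):
        lease = min(desired, max(now, acked_expiry) + mclt - now)
        leases.append(lease)
        now += lease // 2  # next renewal at the halfway point
    return leases


# Partner last acknowledged an end time 1 hour ahead; MCLT is 10 minutes.
print(successive_leases(acked_expiry=3600, mclt=600, desired=86400, renewals=6))
# [4200, 2100, 1050, 600, 600, 600]
```

Each renewal yields a shorter lease until the acknowledged end time has passed, after which every lease is exactly the MCLT.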

The variance in the lease times depends on how much time was left on the current leases when the clients renewed. Further complications, such as network outages, could mean that some clients had been trying to renew their leases for longer than others.
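As a rough illustration of that variance, the sketch below applies the same rule to several clients that renew at the same moment but have different amounts of partner-acknowledged time remaining. The client names, the MCLT of 9 minutes, and the remaining times are all hypothetical values picked for the example.

```python
def lease_when_alone(now, acked_expiry, mclt, desired):
    """Lease length granted by a server that cannot reach its partner."""
    return min(desired, max(now, acked_expiry) + mclt - now)


MCLT = 540        # hypothetical MCLT: 9 minutes
DESIRED = 86400   # hypothetical configured lease length: 1 day
now = 100_000

# Seconds of partner-acknowledged lease time left for each hypothetical
# client (negative means the acknowledged end time has already passed).
remaining = {"client-a": 1260, "client-b": 360, "client-c": 0, "client-d": -900}

leases = {
    name: lease_when_alone(now, now + left, MCLT, DESIRED)
    for name, left in remaining.items()
}

for name, seconds in leases.items():
    print(f"{name}: {seconds // 60} minutes")
# client-a: 30 minutes
# client-b: 15 minutes
# client-c: 9 minutes
# client-d: 9 minutes
```

Under these assumptions no lease is shorter than the MCLT, and anything longer reflects acknowledged time that had not yet run out.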

A large amount of technical detail is omitted here, including the fact that it's not just a single timestamp controlling what happens in the communication between failover peers - the intention is simply to explain the underlying principle being followed.

For more information, please read:

Section 4.4.1, "MCLT Example", of the DHCPv6 Failover Protocol RFC (RFC 8156): https://datatracker.ietf.org/doc/html/rfc8156. Note that this RFC describes DHCPv6 failover, but DHCPv4 failover is based on the same principles, so the MCLT example applies to both.
