Kea HA Strategies Comparison

Introduction

Kea's High Availability (HA) hook is the recommended solution for high-availability operation. The hook pairs Kea servers (a multi-node solution is also available) in either an active-active or an active-passive scheme. The paired servers monitor each other, and each can take over responding to clients on behalf of its partner if that partner fails.

It is also possible for multiple Kea servers to leverage the lease 'backend' feature to share a single lease database, enabling any Kea server to renew any existing lease. This provides another path to redundancy. This approach has some advantages, which we will discuss below, but it is subject to the performance limitations and availability of the database backend.

The Shared Lease Database Concept

What is it?

The Shared Lease Database concept involves pointing two or more Kea servers at the same database server and schema, so that the servers share lease data. The servers are not, however, aware of each other. Each server has an identical lease-database section (see here). When configured with a lease-database of MySQL or PostgreSQL, the Kea servers do not keep any lease data in memory. All lease data for allocation and storage is accessed via SQL queries. In this way, the servers share data without realizing it. See our Quickstart document for an example of how to configure the Shared Lease Database.
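As a minimal sketch of the idea above, the following Python snippet builds a Dhcp4 configuration for two servers that share one lease-database section. The host name, credentials, subnet, pool, and interface names are hypothetical placeholders, not values from this article or from the Quickstart document.

```python
import json

# Hypothetical connection details; replace with your own database settings.
SHARED_LEASE_DB = {
    "type": "mysql",           # or "postgresql"
    "name": "kea",             # database (schema) name
    "host": "db.example.org",  # the one database server every Kea server uses
    "port": 3306,
    "user": "kea",
    "password": "secret",
}


def dhcp4_config(interface):
    """Return a minimal Dhcp4 config that stores leases in the shared database."""
    return {
        "Dhcp4": {
            "interfaces-config": {"interfaces": [interface]},
            # Every server gets an identical copy of the lease-database section.
            "lease-database": dict(SHARED_LEASE_DB),
            "subnet4": [
                {
                    "id": 1,
                    "subnet": "192.0.2.0/24",
                    "pools": [{"pool": "192.0.2.10 - 192.0.2.200"}],
                }
            ],
        }
    }


# Two servers that differ only in their local interface name.
server_a = dhcp4_config("eth0")
server_b = dhcp4_config("ens3")

if __name__ == "__main__":
    print(json.dumps(server_a, indent=2))
```

Only the interface differs between the two generated configurations; the lease-database and subnet sections match, which is what lets either server renew a lease that the other one allocated.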

Why would I want to use it?

Before the HA Hook (see below) existed, a shared lease database was the only way to achieve high availability in Kea. If the goal is to have more than two servers serving the same subnet or shared network, it is still the only way to accomplish that. Use this method if a comparison of the advantages and drawbacks shows that it is the best choice for your network. Continue reading for the advantages and drawbacks of each method.

What are the advantages?

More than two servers can allocate addresses from the same subnet.
No HA-specific configuration or logic is required. All servers with a matching subnet configured will respond to client traffic.

What are the drawbacks?

Kea 2.6.2 or later is required, because the fixes for two issues (3751 and 3798) are needed for trouble-free use of the Shared Lease Database method.
Storing leases in a database carries a serious performance penalty. This performance report illustrates the difference between database and memfile storage of leases. Please note that the page takes a while to load and doesn't seem to work properly in Firefox.
The only practical lease allocator is the Random Allocator. The default Iterative Allocator can result in multiple servers offering the same address to different clients. While this doesn't cause lasting problems, it does result in retransmissions because the affected clients must start the DORA process over again. Both the Random and Iterative Allocators are notably slow when leases are stored in MySQL or PostgreSQL and the subnet has few free addresses left. The Free Lease Queue Allocator is meant to be much faster in this scenario, but it is not suitable for use with the Shared Lease Database method.
There is no way to use the Kea API to check the status of the Kea servers that are sharing a lease database. Similarly, no log messages reveal the status of the relationship. This is because the participating Kea servers are unaware of each other.
When using the Shared Lease Database method, both the Kea servers and the database servers must be set to UTC as noted in the ARM for both MySQL and PostgreSQL.
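The allocator drawback above can be turned into a simple configuration check. The sketch below assumes the Dhcp4 "allocator" parameter with the values "iterative", "random", and "flq", set either globally or per subnet. The helper name allocator_warnings is hypothetical, and the check ignores shared-network level settings for brevity.

```python
def allocator_warnings(dhcp4):
    """Return warnings for allocators that are unsuited to a shared lease database.

    `dhcp4` is the content of the "Dhcp4" map from a Kea configuration.
    """
    # When no allocator is configured, Kea uses the iterative allocator.
    global_alloc = dhcp4.get("allocator", "iterative")
    scopes = [("global", global_alloc)]

    # A subnet without its own setting inherits the global one.
    for subnet in dhcp4.get("subnet4", []):
        name = "subnet " + str(subnet.get("subnet"))
        scopes.append((name, subnet.get("allocator", global_alloc)))

    warnings = []
    for scope, alloc in scopes:
        if alloc != "random":
            warnings.append(
                f"{scope}: allocator '{alloc}' is not advisable "
                "with a shared lease database"
            )
    return warnings


# Example configurations (subnet values are placeholders).
good = {"allocator": "random", "subnet4": [{"subnet": "192.0.2.0/24"}]}
default = {"subnet4": [{"subnet": "192.0.2.0/24"}]}
mixed = {
    "allocator": "random",
    "subnet4": [{"subnet": "192.0.2.0/24", "allocator": "flq"}],
}

if __name__ == "__main__":
    for cfg in (good, default, mixed):
        print(allocator_warnings(cfg))
```

Running a check like this against each server's configuration before deployment is one way to catch a server that still relies on the default Iterative Allocator.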

The HA Hook

What is it?

The HA hook is a library which provides High Availability functionality to the Kea DHCP servers. This can take the form of "load balancing" or "hot standby" modes. Hot standby is the preferred mode and is the subject of this Quickstart Guide. For those moving to Kea from ISC DHCP, the differences between the HA Hook and ISC DHCP's failover are detailed here. More recent Kea releases support the "hub and spoke" model that is outlined here. And finally, there is, of course, extensive documentation in the ARM.
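For orientation, here is a minimal sketch of what the hook-library portion of a hot-standby configuration might look like, expressed as a Python structure that serializes to Kea's JSON. The library directory, server names, and peer URLs are hypothetical placeholders. The sketch loads the lease commands library alongside the HA library because the HA hook depends on it; consult the ARM for the full parameter list.

```python
import json

# Hypothetical values: the library path, server names, and URLs are placeholders.
HOOKS_DIR = "/usr/lib/kea/hooks"

PEERS = [
    {"name": "server1", "url": "http://192.0.2.1:8000/", "role": "primary"},
    {"name": "server2", "url": "http://192.0.2.2:8000/", "role": "standby"},
]


def ha_hooks(this_server_name):
    """Return a hooks-libraries list for one member of a hot-standby pair."""
    return [
        # The HA hook relies on the lease commands hook to exchange lease updates.
        {"library": f"{HOOKS_DIR}/libdhcp_lease_cmds.so"},
        {
            "library": f"{HOOKS_DIR}/libdhcp_ha.so",
            "parameters": {
                "high-availability": [
                    {
                        # The only value that differs between the two servers.
                        "this-server-name": this_server_name,
                        "mode": "hot-standby",
                        "peers": PEERS,
                    }
                ]
            },
        },
    ]


if __name__ == "__main__":
    print(json.dumps({"hooks-libraries": ha_hooks("server1")}, indent=2))
```

Both servers carry the same peers list; only this-server-name changes, which tells each server which entry in the list describes itself.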

Why would I want to use it?

The HA Hook provides a formalized High Availability method with engineering requirements and designs, including one specific to the Hub and Spoke HA mode (see the Designs Wiki, where several are visible). Administrators who are interested in:

more flexibility with regard to allocators
monitoring of each server's HA status
well defined modes of operation

would be most interested in the HA Hook.

What are the advantages?

Only one server will answer a particular client. In the case of "hot-standby" this makes finding logs simple because all clients are answered from the single active server.
The HA Hook is well documented and its behavior is well defined.
It is easy to understand the status of the server partnership through log messages and the API call status-get.
Any of the lease allocators can be used.
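As an illustration of checking the partnership through status-get, the sketch below builds the command and extracts the HA state from a response. The sample response is hand-written for illustration and only approximates the structure described in the ARM; field names may differ between Kea versions, so verify them against the output of your own servers. The helper names are hypothetical, and sending the command (for example over the control channel) is left out.

```python
import json


def status_get_command(service="dhcp4"):
    """Build the status-get command as a JSON-serializable dict."""
    return {"command": "status-get", "service": [service]}


def ha_summary(result):
    """Extract the HA state of each relationship from one status-get result."""
    relationships = result.get("arguments", {}).get("high-availability", [])
    summary = []
    for rel in relationships:
        servers = rel.get("ha-servers", {})
        local = servers.get("local", {})
        remote = servers.get("remote", {})
        summary.append(
            {
                "mode": rel.get("ha-mode"),
                "local_state": local.get("state"),
                "remote_last_state": remote.get("last-state"),
                "communication_interrupted": remote.get("communication-interrupted"),
            }
        )
    return summary


# Illustrative response only (an assumption, not captured from a real server).
SAMPLE_RESULT = {
    "result": 0,
    "arguments": {
        "high-availability": [
            {
                "ha-mode": "hot-standby",
                "ha-servers": {
                    "local": {"role": "primary", "state": "hot-standby"},
                    "remote": {
                        "role": "standby",
                        "last-state": "hot-standby",
                        "communication-interrupted": False,
                    },
                },
            }
        ]
    },
}

if __name__ == "__main__":
    print(json.dumps(status_get_command()))
    print(ha_summary(SAMPLE_RESULT))
```

A monitoring script could poll each server this way and raise an alert whenever the local state is something other than the expected normal-operation state.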

What are the drawbacks?

Not all clients increment the secs field (DHCPv4) or the elapsed time option (DHCPv6). If there are enough of these clients in the network, automatic failover may not happen. The reason is that the inactive server, when it loses contact with the active server, begins monitoring client traffic. It looks for clients that self-report, using the aforementioned parameters, how long they have been trying to obtain an address. This is controlled by max-ack-delay, which defaults to 10000 ms, and max-unacked-clients, which defaults to 10. The inactive server must see more than max-unacked-clients clients reporting a value greater than max-ack-delay in the secs field (DHCPv4) or the elapsed time option (DHCPv6) before it takes over. This prevents the "split brain scenario". If too few clients implement these fields, the server may fail to fail over. More information about these parameters is available in the ARM. A design document specifically about the split brain scenario and its mitigations is available here.
Administrators sometimes disable this check to avoid a failure to fail over. If they do, the "split brain scenario" may occur, and duplicate address allocation may result.
A specific scenario that may cause some problems, if it is possible in your network, is the case where the two HA participants can still communicate with each other but the active server cannot communicate with some or all clients. In this case, the inactive server (the "standby" in the case of "hot-standby" or the server answering the other 50% of clients in the case of "load-balancing") will not step in and answer the clients because it will not have detected a communications failure. The inactive server could perhaps answer these clients but will not because it isn't aware that it should even monitor client traffic for the active server. Therefore, the clients that are not being answered will continue to receive no response until service is restored to the active server.
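The failure-detection rule described in the first drawback above can be sketched in a few lines of Python. This is a simplified model for DHCPv4 only, not Kea's implementation: it assumes one secs value (in seconds) per distinct client seen after communication with the partner was lost, and it treats a max-unacked-clients value of 0 as disabling the check, which is an assumption based on the ARM's description. The function name is hypothetical.

```python
def should_fail_over(secs_values, max_ack_delay_ms=10000, max_unacked_clients=10):
    """Decide whether the inactive server would take over.

    secs_values: the secs field (in seconds) most recently reported by each
    distinct client observed since contact with the partner was lost.
    """
    # Assumption: 0 disables the check, so the server takes over immediately.
    if max_unacked_clients == 0:
        return True

    # Count clients that report waiting longer than max-ack-delay.
    unacked = sum(1 for secs in secs_values if secs * 1000 > max_ack_delay_ms)

    # Take over only when MORE than max-unacked-clients such clients are seen.
    return unacked > max_unacked_clients


if __name__ == "__main__":
    # 11 clients that have each been retrying for 12 seconds.
    print(should_fail_over([12] * 11))
    # 50 clients that never increment secs: the threshold is never reached.
    print(should_fail_over([0] * 50))
```

The second call shows the drawback in miniature: no matter how many clients are waiting, a population that always reports 0 never pushes the count past the threshold, so the inactive server never takes over.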

Conclusion

The best choice depends largely on your goals. This article is meant to help the administrator understand the benefits and drawbacks of each method. Either one is a valid way to achieve High Availability, depending on your requirements.