Kea HA Quickstart Guide

Introduction

New Kea users often want to set up redundancy. Many have previously used ISC DHCP and its Failover implementation, but High Availability in Kea works differently. The High Availability (HA) hook documentation is complicated because the feature is powerful, with many modes and options available. This quickstart guide describes how to set up, test, and tune the HA hook in "hot-standby" mode, which is the simplest HA configuration.

Why is this the simplest mode, you ask? The "load-balancing" mode is more complicated because the pools need to be split 50/50 between the two Kea servers, and access is controlled by the automatically created HA classes. The "passive-backup" mode is not really a failover mode, as it merely provides a way to back up leases to another server; switching to that backup server requires manual intervention (a configuration change).

The last HA mode isn't really a separate mode of operation, but rather a way to split HA relationships based on subnets instead of an entire server. It is called Hub and Spoke.

DHCPv4 and DHCPv6

This quickstart guide shows kea-dhcp4 binaries and configurations, but HA is equally useful with kea-dhcp6. The url parameter, shown later in this guide, can accept an IPv6 address; in "hot-standby" mode, there are no other considerations. It is possible to use an IPv6 URL with kea-dhcp4 (or an IPv4 URL with kea-dhcp6), but this is not recommended, as certain network outages may not trigger a failover event.

How Does Hot-Standby Mode HA Work?

Briefly, the hot-standby mode consists of two Kea servers in HA communication with each other. The first server is the primary and answers all DHCP traffic during normal operation. The second server is called the standby server. It does not answer DHCP traffic unless there is a failover event. During normal operation, the primary sends lease updates to the standby as leases are allocated.

If the standby server loses contact with the primary server, it begins monitoring DHCP packets. By default, the standby becomes the active DHCP server once more than max-unacked-clients clients report delays longer than max-ack-delay in the SECS field (see RFC 2131, page 10). When the primary returns to service, the standby syncs leases to it. Then the primary becomes active again, and the standby goes back to receiving lease updates and waiting for a new failover event to occur.
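The decision described above can be modeled in a few lines. The following Python sketch is a simplified illustration, not Kea's actual implementation: the function name and the dictionary input are hypothetical, and the default values (10 clients, 10000 ms) are the documented Kea defaults for max-unacked-clients and max-ack-delay.

```python
def partner_considered_down(secs_by_client, max_unacked_clients=10, max_ack_delay_ms=10000):
    """Return True when enough distinct clients report long delays.

    secs_by_client is a hypothetical input: it maps a client identifier
    (for example, a MAC address string) to the highest SECS value seen
    from that client while communication with the partner is interrupted.
    """
    # A value of 0 disables the check: no client evidence is required.
    if max_unacked_clients == 0:
        return True
    # SECS is expressed in seconds; max-ack-delay is in milliseconds.
    unacked = [client for client, secs in secs_by_client.items()
               if secs * 1000 > max_ack_delay_ms]
    # The threshold must be exceeded, not merely reached.
    return len(unacked) > max_unacked_clients
```

With the defaults, ten clients each reporting 11 seconds are not enough to trigger the transition; the eleventh such client is.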

Full details about this mode are available in the Hot-Standby Configuration section of the Kea ARM.

Comparison to ISC DHCP

There was only one mode available in ISC DHCP: load-balancing failover (described in an Internet-Draft that was never published as an RFC). Both servers answered their respective pools of clients when both servers were up, and the surviving server would answer all clients when the partner was down. There are some major differences between the operation of ISC DHCP failover and Kea's "hot-standby" HA mode:

ISC DHCP would answer clients whose SECS field exceeded some max value from both servers, regardless of which server SHOULD have answered based on the hash check of the hardware address. This allowed for some failures to be overcome (such as downstream network problems) that could not be detected by the ISC DHCP servers. In other words, both ISC DHCP servers were always watching all packets and responding as appropriate. In Kea, the standby does not read client traffic unless there is a loss of contact with the primary. This is the case in Kea's "load-balancing" mode as well.
ISC DHCP could run out of addresses on one of the servers, which would then not be able to answer a client that was assigned to it. This was especially problematic when considering that some clients do not make use of the SECS field. These clients would not be answered. ISC DHCP did have a mechanism to transfer leases between the partners, but this was limited to certain percentages of leases. Running with 10% free leases in a pool was nearly a requirement. Kea, in "hot-standby" mode, has all addresses available from the pool.
ISC DHCP did not support DHCPv6 failover, whereas Kea fully supports both DHCPv4 and DHCPv6 HA.
ISC DHCP failover would sometimes lose lease synchronization with its peer (one server would think leases were in use that were not actually in use). At this point, the only available option was to erase the lease file on the secondary and allow a full lease sync to occur. This resulted in the secondary not answering traffic for MCLT seconds. Kea has no such issues.

Overall, the "Hot-Standby" mode in Kea is a more reliable solution than the failover that was available in ISC DHCP.

Configuring HA

In keeping with the goal of making this quickstart guide the simplest way to configure HA, we are providing a minimal configuration that accepts the defaults. The Fine Tuning section below describes the modification of certain defaults, which may be necessary.

There is a simple design consideration that maximizes the utility of the HA hook: the Kea servers should send heartbeats to each other and sync leases over the same interface where customer communication occurs. It is tempting to use an out-of-band administrative network for this communication for security purposes; however, this could lead to Kea not realizing there is an outage and not failing over if the customer-facing interface is down but the administrative interface is still up. Clients should not be able to connect to the HA service ports, so it is important to implement a firewall of some kind to protect these ports. The details of this are outside the scope of this document.

And now, on to the configuration.

The configurations of both Kea servers in an HA relationship should closely match each other. Ideally, there should only be differences with interfaces-config and this-server-name; in particular, it is very important that the shared-network and subnet(4/6) sections match. Other differences should be carefully considered.
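One way to check this is to compare the two configuration files programmatically. The sketch below is a minimal example under stated assumptions: the function names are hypothetical, and it uses Python's standard json module, which rejects the comments that Kea itself tolerates in configuration files, so comments would need to be stripped first.

```python
import json

# Keys that the guide expects to differ between the two HA partners.
EXPECTED_DIFFERENCES = {"interfaces-config", "this-server-name"}

def load_config(path):
    """Parse a comment-free Kea configuration file."""
    with open(path) as handle:
        return json.load(handle)

def differing_paths(a, b, path=""):
    """Yield slash-separated paths at which two parsed JSON values differ."""
    if isinstance(a, dict) and isinstance(b, dict):
        for key in sorted(set(a) | set(b)):
            yield from differing_paths(a.get(key), b.get(key), f"{path}/{key}")
    elif isinstance(a, list) and isinstance(b, list) and len(a) == len(b):
        for index, (x, y) in enumerate(zip(a, b)):
            yield from differing_paths(x, y, f"{path}/{index}")
    elif a != b:
        yield path

def unexpected_differences(conf_a, conf_b):
    """Return differing paths that do not pass through an expected key."""
    return [p for p in differing_paths(conf_a, conf_b)
            if not EXPECTED_DIFFERENCES & set(p.split("/"))]
```

An empty list from unexpected_differences(load_config(primary_path), load_config(standby_path)) suggests the two files differ only where expected; any path it returns deserves a closer look.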

The configuration shown below is from local test lab servers that have 172.28.0.0/24 on a local interface. This test lab is very simple and so this same subnet is used for "clients" (actually, the clients are simulated with perfdhcp). For completeness, the full configuration of these servers is shown in a later section.

Primary Kea Server HA Configuration

...
    "hooks-libraries": [
      {
        "library": "/usr/lib/x86_64-linux-gnu/kea/hooks/libdhcp_lease_cmds.so"
      },
      {
        "library": "/usr/lib/x86_64-linux-gnu/kea/hooks/libdhcp_ha.so",
        "parameters": {
          "high-availability": [
            {
              "this-server-name": "server1",
              "mode": "hot-standby",
              "peers": [
                {
                  "name": "server1",
                  "url": "http://172.28.0.253:8000/",
                  "role": "primary"
                },
                {
                  "name": "server2",
                  "url": "http://172.28.0.254:8000/",
                  "role": "standby"
                }
              ]
            }
          ]
        }
      }
    ],
...

Standby Kea Server HA Configuration

...
    "hooks-libraries": [
      {
        "library": "/usr/lib/x86_64-linux-gnu/kea/hooks/libdhcp_lease_cmds.so"
      },
      {
        "library": "/usr/lib/x86_64-linux-gnu/kea/hooks/libdhcp_ha.so",
        "parameters": {
          "high-availability": [
            {
              "this-server-name": "server2",
              "mode": "hot-standby",
              "peers": [
                {
                  "name": "server1",
                  "url": "http://172.28.0.253:8000/",
                  "role": "primary"
                },
                {
                  "name": "server2",
                  "url": "http://172.28.0.254:8000/",
                  "role": "standby"
                }
              ]
            }
          ]
        }
      }
    ],
...

The configurations of HA between the two servers only differ in the this-server-name parameter; the rest of the parameters are exactly the same. Each hooks-libraries section above shows the server loading two hooks: libdhcp_lease_cmds.so and libdhcp_ha.so. The latter is the actual HA hook that makes all of this possible; the former is a hook that provides some API commands that are used by the HA hook. Both of these hooks are required for HA to function.

In the high-availability section above, you will find two global parameters: this-server-name, which tells the server which entry in the list of peers describes itself, and mode, which tells the server it is in "hot-standby" mode. The peers section lists at least two peers, each with the parameters name (the name referred to by this-server-name), url (the address at which the peer can be reached), and role (which determines whether the peer is the primary or the standby).
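The relationships between these parameters can be checked before the configuration is deployed. The following Python sketch is illustrative only: check_hot_standby is a hypothetical helper, not part of Kea, and the rules it applies (one primary, one standby, this-server-name matching a peer) are a simplified reading of the description above rather than Kea's full validation.

```python
def check_hot_standby(ha):
    """Return a list of problems found in one "high-availability" entry."""
    problems = []
    peers = ha.get("peers", [])
    names = [peer.get("name") for peer in peers]
    roles = [peer.get("role") for peer in peers]
    if ha.get("mode") != "hot-standby":
        problems.append("mode is not hot-standby")
    # this-server-name must select one of the listed peers.
    if ha.get("this-server-name") not in names:
        problems.append("this-server-name does not match any peer name")
    if len(set(names)) != len(names):
        problems.append("peer names are not unique")
    # Assumption: exactly one peer holds each of the two main roles.
    for role in ("primary", "standby"):
        if roles.count(role) != 1:
            problems.append(f"expected exactly one {role} peer")
    return problems
```

Passing either of the high-availability entries shown above returns an empty list, while a typo in this-server-name or a second "primary" role is reported.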

There are many other possible settings. The simplest way to get started is to accept the defaults, which the configuration above does. Testing the HA configuration may reveal settings that are worth changing.

Hook file location

Note that the location of the hook library (specified in the library parameter) may differ on your system. The location /usr/lib/x86_64-linux-gnu/kea/hooks/ is valid on Debian-based systems. The easiest way to find the location to specify for your hooks is with the find command: find / -name libdhcp_ha.so, which will show the location of that particular hook. All of the hooks should be in the same directory.

Starting the Kea Servers with HA

After adding these configurations, use Kea to check that the configuration is valid. This can be done with the Kea daemon itself, using the -t (test) option: kea-dhcp4 -t /etc/kea/kea-dhcp4.conf. (The configuration file location might be different on your system.) After a successful test, restart the Kea service. On a Debian-based system where Kea was installed from ISC packages, this is done with systemd: systemctl restart isc-kea-dhcp4-server. Check that the server is running properly by inspecting the logs: journalctl -u isc-kea-dhcp4-server --no-pager --since=00:00:00 | grep ha-hook. Specifically, look for lines similar to the following on the primary server:

HA_STATE_TRANSITION server1: server transitions from PARTNER-DOWN to HOT-STANDBY state, partner state is READY
HA_LEASE_UPDATES_ENABLED server1: lease updates will be sent to the partner while in HOT-STANDBY state
HA_LOCAL_DHCP_ENABLE local DHCP service is enabled while the server1 is in the HOT-STANDBY state

Similar lines should appear on the standby:

HA_STATE_TRANSITION server2: server transitions from READY to HOT-STANDBY state, partner state is HOT-STANDBY
HA_LEASE_UPDATES_ENABLED server2: lease updates will be sent to the partner while in HOT-STANDBY state
HA_LOCAL_DHCP_ENABLE local DHCP service is enabled while the server2 is in the HOT-STANDBY state

Kea with HA should be running correctly by this point. Tests may now be performed to confirm this and to identify any settings that might need adjusting for your environment.
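Besides reading the logs, the HA state can be queried over the server's control socket. The sketch below is a minimal example under stated assumptions: it sends the HA hook's ha-heartbeat command to the Unix socket path used in this guide's lab configuration (/tmp/kea-dhcp4-socket), and it assumes the reply carries the state under arguments/state. The helper names are hypothetical.

```python
import json
import socket

def send_command(sock, command):
    """Send one command on a connected control socket and return the parsed reply."""
    sock.sendall(json.dumps({"command": command}).encode())
    buf = b""
    while True:
        chunk = sock.recv(4096)
        if not chunk:
            break  # peer closed the connection
        buf += chunk
        try:
            return json.loads(buf)
        except ValueError:
            continue  # reply is not complete yet; keep reading
    return json.loads(buf)

def ha_state(socket_path="/tmp/kea-dhcp4-socket"):
    """Return the HA state reported by the local server, or None if absent."""
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
        sock.connect(socket_path)
        reply = send_command(sock, "ha-heartbeat")
    return reply.get("arguments", {}).get("state")
```

Calling ha_state() on each server during normal operation would be expected to report the same HOT-STANDBY state that appears in the log lines above, most likely in lowercase.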

Testing HA

The following are some simple tests to run to ensure that things are working correctly. Each listing refers to a perfdhcp command and explains the purpose of the test.

perfdhcp -4 -r 1 -R 247 -p 248 -l enp0s8 -Y 1 -y 247 - This perfdhcp command sends DHCPv4 DORA exchanges at the rate of 1 per second, simulating 247 clients (just under the 248 addresses in this guide's lab pool), for 248 seconds, increasing the SECS field starting 1 second into the test for the remaining 247 seconds. Both servers remain up during this test. The primary should answer and the standby should do nothing apart from recording the lease updates sent from the primary. On the primary, you should see normal log messages showing leases being allocated. On the standby, you should see COMMAND_RECEIVED messages containing lease4-update. This indicates that the leases are being shared with the standby.
For this test, first stop the Kea service on the primary with systemctl stop isc-kea-dhcp4-server. (This may not be the correct command depending on the source of your installation.) Now, test using the same perfdhcp command as before. At first, no packets will be answered. The standby will log messages with HA_HEARTBEAT_COMMUNICATIONS_FAILED. Once one minute (the default max-response-delay) has passed, an HA_COMMUNICATION_INTERRUPTED message will be logged. From then on, HA_COMMUNICATION_INTERRUPTED_CLIENT4_UNACKED will be logged for each unacked client until the max-unacked-clients threshold (10 by default) is exceeded. Then the standby will log an HA_STATE_TRANSITION message, noting a transition to the PARTNER-DOWN state, and clients will be answered by the standby. Now start the service on the primary. Several messages about syncing will appear in each server's logs. Once this process is complete, the primary will begin answering clients.

The above is simple functionality testing to make sure that HA is working as intended. There are many network design considerations that affect how well the HA hook will function and how effective a failover recovery can be. It is important to perform tests based on the above with real clients in conditions that closely match your real-world situation, to make sure the configuration will function as expected.

perfdhcp and testing your deployment

The number of clients and the rate of discover/offer/request/acknowledge (DORA) transactions per second can be customized to match your expected environment (e.g., -r 40 -R 50000 to simulate 50,000 clients with a rate of 40 leases per second). In some cases, it may not be possible to fully simulate your DHCP client conditions using a single instance of perfdhcp.
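The test command used earlier follows a simple pattern that can be scaled to other client counts and rates. The Python sketch below is illustrative: perfdhcp_command is a hypothetical helper that only uses the perfdhcp options already shown in this guide, and it assumes the test should run long enough for every simulated client to complete one exchange, plus one second before the SECS field starts increasing.

```python
import math
import shlex

def perfdhcp_command(clients, rate, interface="enp0s8"):
    """Build a perfdhcp command line following the pattern used in this guide."""
    # Seconds needed for every simulated client to complete one exchange.
    duration = math.ceil(clients / rate)
    args = [
        "perfdhcp", "-4",
        "-r", str(rate),          # exchanges per second
        "-R", str(clients),       # number of simulated clients
        "-p", str(duration + 1),  # total test period in seconds
        "-l", interface,
        "-Y", "1",                # start increasing SECS after 1 second
        "-y", str(duration),      # keep increasing SECS for the rest of the test
    ]
    return shlex.join(args)
```

With clients=247 and rate=1 this reproduces the command from the Testing HA section; with clients=50000 and rate=40 the test period becomes 1251 seconds.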

Fine Tuning

The most common parameters that administrators need to tune are max-response-delay, heartbeat-delay, and max-unacked-clients, which are all specified at the same level as this-server-name. There are legitimate reasons to tune these values; a common scenario for each will be covered.

An administrator may want to adjust the length of time between the loss of communication and the transition to the "communications interrupted" state. This is controlled by the max-response-delay and heartbeat-delay parameters. Kea will likely first notice that communication has failed when a heartbeat fails. Once the time specified in max-response-delay has passed without successful communication, the transition to "communications interrupted" occurs (noted in the logs by an HA_COMMUNICATION_INTERRUPTED message). Extending max-response-delay does not typically require any adjustment of heartbeat-delay; the result is simply that more heartbeats are attempted. When shortening max-response-delay, heartbeat-delay should also be decreased so that roughly the same number of heartbeats is attempted before entering the "communications interrupted" state. Both parameters are specified in milliseconds.
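The proportional adjustment can be worked out with simple arithmetic. The following Python sketch is a rough aid, not a Kea tool: the function name is hypothetical, and the default arguments are the documented Kea defaults (max-response-delay of 60000 ms and heartbeat-delay of 10000 ms, which allow about six heartbeats).

```python
def matching_heartbeat_delay(new_max_response_delay_ms,
                             old_max_response_delay_ms=60000,
                             old_heartbeat_delay_ms=10000):
    """Return a heartbeat-delay (ms) that keeps the number of attempts roughly constant."""
    # How many heartbeats fit into the current max-response-delay window.
    attempts = max(1, old_max_response_delay_ms // old_heartbeat_delay_ms)
    # Spread the same number of attempts over the new, shorter window.
    return max(1, new_max_response_delay_ms // attempts)
```

For example, halving max-response-delay to 30000 ms suggests a heartbeat-delay of 5000 ms, so that about six heartbeats are still attempted before the state change.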

An administrator may want to adjust the monitoring performed between the state transitions to "communications interrupted" and then to "partner-down" (signified by a log message similar to HA_STATE_TRANSITION server2: server transitions from HOT-STANDBY to PARTNER-DOWN state, partner state is UNDEFINED). If the administrator wants to wait for more clients to exceed max-ack-delay before transitioning, max-unacked-clients can be increased. Likewise, decreasing the setting lowers the number of clients required before transitioning. A more common situation is that an administrator wants to disable the test period entirely. This is accomplished by setting max-unacked-clients to 0.
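To make the placement of these parameters concrete, the Python sketch below adds a tuning parameter to the standby's high-availability entry from this guide and prints the resulting JSON. The helper with_tuning is hypothetical and exists only for illustration; the same parameters can simply be typed into the configuration file next to this-server-name.

```python
import json

def with_tuning(ha_entry, **params):
    """Return a copy of one "high-availability" entry with extra parameters.

    Keyword names use underscores in place of the hyphens in Kea's
    parameter names, e.g. max_unacked_clients -> "max-unacked-clients".
    """
    tuned = dict(ha_entry)
    for name, value in params.items():
        tuned[name.replace("_", "-")] = value
    return tuned

standby = {
    "this-server-name": "server2",
    "mode": "hot-standby",
    "peers": [
        {"name": "server1", "url": "http://172.28.0.253:8000/", "role": "primary"},
        {"name": "server2", "url": "http://172.28.0.254:8000/", "role": "standby"},
    ],
}

# Disable the test period: no unacked clients are required before partner-down.
print(json.dumps(with_tuning(standby, max_unacked_clients=0), indent=2))
```

Because the two servers' configurations should match apart from this-server-name, the same tuning values would normally be applied on both peers.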

Why would you want to disable the test period?

The test period between the "communications interrupted" and "partner-down" states is meant to avoid the split-brain scenario where the two Kea servers cannot communicate with each other but are still able to communicate with clients. In this case, the standby could transition to active and there could be two servers answering clients, which could lead to duplicate address assignments.

This test period relies on clients increasing the SECS field in the DHCPv4 packet. Not all clients increment this field, instead always setting 0 in the field no matter how long they have been trying to renew or obtain a new address. If there are not enough clients that support the SECS field (e.g., an ISP exclusively distributes the same brand of CPE that is known not to support this field), then the server will never transition to "partner-down," thus defeating the purpose of HA.

There is no evidence (so far) that this is a problem for the similar "elapsed time" field in DHCPv6.

Complete Example Kea HA Configuration

Below are the complete configurations used during the creation of this Kea HA Quickstart Guide. Note that it is very possible to use the HA hook with "Dhcp6" configurations with only minimal changes (such as URL to match your IPv6 network).

Primary Kea Server HA Configuration (Complete)

{
  "Dhcp4": {
    "control-socket": {
      "socket-type": "unix",
      "socket-name": "/tmp/kea-dhcp4-socket"
    },
    "interfaces-config": {
      "interfaces": [
        "enp0s8"
      ]
    },
    "lease-database": {
      "type": "memfile",
      "persist": true,
      "name": "/tmp/kea-dhcp4-leases.csv"
    },
    "multi-threading": {
      "enable-multi-threading": true,
      "thread-pool-size": 4,
      "packet-queue-size": 28
    },
    "cache-threshold": 0.25,
    "calculate-tee-times": true,
    "valid-lifetime": 28800,
    "option-data": [
      {
        "name": "domain-name-servers",
        "data": "192.168.40.42, 192.168.40.82"
      }
    ],
    "hooks-libraries": [
      {
        "library": "/usr/lib/x86_64-linux-gnu/kea/hooks/libdhcp_lease_cmds.so"
      },
      {
        "library": "/usr/lib/x86_64-linux-gnu/kea/hooks/libdhcp_ha.so",
        "parameters": {
          "high-availability": [
            {
              "this-server-name": "server1",
              "mode": "hot-standby",
              "peers": [
                {
                  "name": "server1",
                  "url": "http://172.28.0.253:8000/",
                  "role": "primary"
                },
                {
                  "name": "server2",
                  "url": "http://172.28.0.254:8000/",
                  "role": "standby"
                }
              ]
            }
          ]
        }
      }
    ],
    "subnet4": [
      {
        "subnet": "172.28.0.0/24",
        "id": 1,
        "option-data": [
          {
            "name": "routers",
            "data": "172.28.0.1"
          }
        ],
        "pools": [
          {
            "pool": "172.28.0.2-172.28.0.249"
          }
        ]
      }
    ],
    "loggers": [
      {
        "name": "kea-dhcp4",
        "severity": "INFO",
        "output_options": [
          {
            "output": "stdout"
          }
        ]
      }
    ]
  }
}

Standby Kea Server HA Configuration (Complete)

{
  "Dhcp4": {
    "control-socket": {
      "socket-type": "unix",
      "socket-name": "/tmp/kea-dhcp4-socket"
    },
    "interfaces-config": {
      "interfaces": [
        "enp0s8"
      ]
    },
    "lease-database": {
      "type": "memfile",
      "persist": true,
      "name": "/tmp/kea-dhcp4-leases.csv"
    },
    "multi-threading": {
      "enable-multi-threading": true,
      "thread-pool-size": 4,
      "packet-queue-size": 28
    },
    "cache-threshold": 0.25,
    "calculate-tee-times": true,
    "valid-lifetime": 28800,
    "option-data": [
      {
        "name": "domain-name-servers",
        "data": "192.168.40.42, 192.168.40.82"
      }
    ],
    "hooks-libraries": [
      {
        "library": "/usr/lib/x86_64-linux-gnu/kea/hooks/libdhcp_lease_cmds.so"
      },
      {
        "library": "/usr/lib/x86_64-linux-gnu/kea/hooks/libdhcp_ha.so",
        "parameters": {
          "high-availability": [
            {
              "this-server-name": "server2",
              "mode": "hot-standby",
              "peers": [
                {
                  "name": "server1",
                  "url": "http://172.28.0.253:8000/",
                  "role": "primary"
                },
                {
                  "name": "server2",
                  "url": "http://172.28.0.254:8000/",
                  "role": "standby"
                }
              ]
            }
          ]
        }
      }
    ],
    "subnet4": [
      {
        "subnet": "172.28.0.0/24",
        "id": 1,
        "option-data": [
          {
            "name": "routers",
            "data": "172.28.0.1"
          }
        ],
        "pools": [
          {
            "pool": "172.28.0.2-172.28.0.249"
          }
        ]
      }
    ],
    "loggers": [
      {
        "name": "kea-dhcp4",
        "severity": "INFO",
        "output_options": [
          {
            "output": "stdout"
          }
        ]
      }
    ]
  }
}

HA and Lease Caching

In the configurations above, you can see the "cache-threshold": 0.25 parameter. This enables lease caching, which matters when using the HA hook: without it, a single misbehaving client can cause your Kea HA peers to enter the "terminated" state.

A client sending lease-renewal requests several times per second results in many lease updates on the primary, which are then sent to the standby. The standby processes updates concurrently in multiple threads. When they are all for the same lease, contention on the lease record occurs, leading to some updates being rejected. Once the number of rejected lease updates exceeds the value configured in max-rejected-lease-updates (default 10), the servers enter the "terminated" state.

Lease caching prevents this by re-issuing the same lease. Until the fraction of the lease lifetime configured in cache-threshold has elapsed, the client receives a "renewal" whose lease period is reduced by the time that has passed since the lease was allocated. Since the effective lease end time is unchanged, Kea does not need to update the internal lease record, reducing database and sync activity.
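The effect of cache-threshold can be illustrated with the values from this guide's configuration (valid-lifetime of 28800 seconds and cache-threshold of 0.25, giving a caching window of 7200 seconds). The Python sketch below is a simplified model, not Kea's code: the function name is hypothetical, and the exact handling of the boundary case is an assumption.

```python
def cached_lease_lifetime(age_s, valid_lifetime_s=28800, cache_threshold=0.25):
    """Return the lifetime handed back when the lease is reused, else None.

    age_s is the number of seconds since the lease was allocated.
    """
    # The lease is reused only during the first fraction of its lifetime.
    if age_s < valid_lifetime_s * cache_threshold:
        # The client is told the remaining time, so the end time does not move.
        return valid_lifetime_s - age_s
    # Past the threshold: a normal renewal updates the lease record.
    return None
```

A client that asks again 600 seconds after allocation gets the same lease with 28200 seconds remaining and causes no lease update; once 7200 seconds have passed, the lease is renewed normally.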

Conclusion

This short, simple guide to configuring High Availability in "hot-standby" mode in Kea is intended to help users get started with HA. There are other possible modes and configurations, as previously mentioned. Configuring HA in "load-balancing" and "hub and spoke" modes is covered in the HA section of the Kea ARM.