Kea Database Connection Resilience

Introduction

This document discusses database connectivity problems, parameters that can be adjusted to make Kea compensate for those problems, and some of the implications of doing so.

Scenario

Symptoms

Often this discussion begins with error messages like this:

DATABASE_MYSQL_FATAL_ERROR Unrecoverable MySQL error occurred: unable to execute for <SELECT ...>, reason: Server has gone away (error code: 2006).

DHCP6_PACKET_PROCESS_STD_EXCEPTION ... exception occurred during packet processing: fatal database error or connectivity lost

DHCP4_DB_RECONNECT_FAILED maximum number of database reconnect attempts: 3, has been exhausted without success ...

Cause and Corrective Action

Such messages typically indicate that either the database server is down, or there is a network connectivity problem between the host running Kea and the host running the database.

Generally, the underlying problem will need to be fixed for Kea performance and reliability to be acceptable. Trying to tune around a bad database or network rarely yields good results. MySQL, MariaDB, and PostgreSQL all have high-availability/replication features designed to reduce downtime. Building resilient networks is also a well-developed field of engineering. We suggest focusing efforts there.

However, if you want Kea to try to compensate for a bad database connection, you have some options.

Do You Even Need a Database?

Before we get into the parameters you can change, it is worth asking if you even need to use a database in the first place. Many people assume a database is automatically better, but this is far from true in practice.

For lease storage in particular, memfile (CSV file) storage is faster and almost always more reliable. Using memfile removes dependencies on inter-process communication, network, or other processes. It keeps everything internal to Kea and in local RAM. Lease status can still be obtained using the Kea API.

The only advantages a database presents here are (1) easier integration with other tools in some scenarios (SQL vs JSON) and (2) a lease database shared between servers. Because those advantages are narrow, ISC Support generally recommends memfile for lease storage.
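As a rough sketch of the memfile option described above, a lease backend for kea-dhcp4 might be declared as follows. The persist and lfc-interval values are illustrative assumptions rather than recommendations, and the comments rely on Kea's configuration parser accepting them (strict JSON tools may not).

```json
{
  "Dhcp4": {
    "lease-database": {
      // In-memory lease storage, backed by a local CSV file.
      "type": "memfile",
      // Write leases to disk so they survive a restart.
      "persist": true,
      // Assumed value: run lease file cleanup once per hour (in seconds).
      "lfc-interval": 3600
    }
  }
}
```

With this backend there are no database connection parameters to tune, because no database connection exists.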

Other database uses have similar pros and cons. Host reservations in a central database can be convenient, and may be appropriate if one has hundreds of reservations, but they can also be kept in a plain config file. Any directives in a config backend (CB) can also be placed in a config file. The forensic logging hook can log to a plain text file or syslog/journal.

So if database connectivity is a concern, first review your goals and consider how a database may or may not help you achieve them.

Database Connection Parameters

Kea offers several configuration parameters that control how it behaves when a database is not working properly.

These need to be configured for each database in use -- that is, inside each config file section that declares a database. These include the lease-database, hosts-databases, config-databases, and libdhcp_legal_log sections.

Tip

If all parts of Kea use the same database server, you can declare that once in a separate file, and then include the same file in each section. For example, <?include "/etc/kea/database.conf"?>. See Config Inclusion in the ARM for more.
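A minimal sketch of that tip follows. It assumes the include directive substitutes the file's contents in place, and that /etc/kea/database.conf holds a single map of connection parameters. The database name, host, and credentials are placeholders.

```json
// Hypothetical contents of /etc/kea/database.conf:
//   { "type": "mysql", "name": "kea", "host": "db.example.org",
//     "user": "kea", "password": "secret",
//     "max-reconnect-tries": 3, "on-fail": "stop-retry-exit" }

{
  "Dhcp4": {
    // Both sections reuse the same connection settings.
    "lease-database": <?include "/etc/kea/database.conf"?>,
    "hosts-database": <?include "/etc/kea/database.conf"?>
  }
}
```

Sharing one file also means every section gets the same on-fail value. As the Implications section explains, that is not always what you want.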

Overview

When trying to talk to a database, Kea waits connect-timeout seconds for an answer.
If the timeout is reached, or if the connection explicitly fails, Kea then waits reconnect-wait-time before trying again.
During the wait period, the serve- or stop- part of on-fail determines if Kea attempts to keep servicing DHCP requests.
After waiting, Kea will try the database request again, up to the limit of max-reconnect-tries attempts.
If the retry limit is exceeded, Kea declares that database dead, and will make no further attempts. The -exit or -continue part of on-fail determines if Kea will attempt to keep running without that database.

Note the word "Attempt"

Simply telling Kea to continue serving does not make it capable of doing so. Please read the Implications section before configuring serve- or -continue.
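To make the sequence above concrete, here is a sketch of a lease-database section that sets all four parameters. The database name, host, and credentials are placeholders, and the retry count of 3 is an assumption chosen to match the DHCP4_DB_RECONNECT_FAILED message quoted earlier.

```json
{
  "Dhcp4": {
    "lease-database": {
      "type": "mysql",
      "name": "kea",
      "host": "db.example.org",
      "user": "kea",
      "password": "secret",
      // Wait up to 5 seconds for the database to answer.
      "connect-timeout": 5,
      // Then wait 5000 milliseconds before trying again.
      "reconnect-wait-time": 5000,
      // Give up and declare the database dead after 3 retries.
      "max-reconnect-tries": 3,
      // Do not serve DHCP while retrying; exit once the limit is exceeded.
      "on-fail": "stop-retry-exit"
    }
  }
}
```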

Timeout


Name	`connect-timeout`
Syntax	Positive integer
Units	Whole seconds
Default	5 seconds

How long Kea will wait for a database server to respond, either during the initial connection or during a pending transaction.

Retry Wait


Name	`reconnect-wait-time`
Syntax	Positive integer
Units	Milliseconds
Default	5000 milliseconds (5 seconds)

How long Kea will wait before attempting to connect again, after connect-timeout has expired, or after a database command has failed (connection refused, unexpected error, etc.).

During this time, the first part of on-fail determines how Kea behaves (stop- or serve-).

Maximum Retries


Name	`max-reconnect-tries`
Syntax	Non-negative integer
Default	0 (no retries)

Once this limit is exceeded, Kea internally declares the database dead. It will not attempt further retries. As such, even if the database comes back, Kea will not notice.

Kea will take the action specified by the last part of on-fail (-exit or -continue).

Failure During Operation


Name	`on-fail`
Syntax	Enumeration
Choices	`stop-retry-exit` `serve-retry-exit` `serve-retry-continue`
Default	`stop-retry-exit` for most databases; `serve-retry-continue` for logging

The on-fail value has two meaningful parts: the first (stop or serve) and the last (exit or continue). The middle is always retry.

This is not a retry control

The on-fail parameter controls the finer points of Kea failure response. It generally does not make Kea handle failures better. Do not change on-fail simply because continue or serve sound like more appealing words.

Stop or Serve

stop- versus serve- determines whether Kea will attempt to keep servicing DHCP requests during the retry period.

stop- suppresses DHCP service if the database is in a failing/retrying state. Kea will simply not answer, as if it was not running at all.

serve- tells Kea to attempt to continue servicing DHCP requests during database retries. Depending on the circumstances, this may not do what you want. Please read the Implications section.

If the retry limit is exceeded, serve- no longer has any effect.

Exit or Continue

-exit versus -continue determines what happens if the retry limit is exceeded.

-exit causes the Kea daemon process to exit once a database is declared dead.

-continue means the Kea daemon process will continue running. The database will remain unavailable, and any functions that depend on that database will fail. Please read the Implications section.

Failure At Startup


Name	`retry-on-startup`
Syntax	Boolean
Default	False
Introduced	Kea 2.6

When Kea is started, it attempts to open connections to all configured databases. By default, if any of these fail, it immediately aborts. The assumption is the database should have been made ready before Kea was started.

Alternatively, set retry-on-startup to true, and database problems on startup will be treated the same as database problems after startup. The other parameters described above will be used to determine how Kea behaves.

This parameter is not available in Kea 2.4 or earlier.
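A sketch of enabling this on Kea 2.6 or later is shown below. The connection details are placeholders, and the retry count and wait time are assumed values. Because max-reconnect-tries defaults to 0, setting retry-on-startup alone would not produce any retries.

```json
{
  "Dhcp4": {
    "lease-database": {
      "type": "postgresql",
      "name": "kea",
      "host": "db.example.org",
      "user": "kea",
      "password": "secret",
      // Treat a failed connection at startup like a failure during operation.
      "retry-on-startup": true,
      // Assumed values: retry up to 10 times, 3000 milliseconds apart.
      "max-reconnect-tries": 10,
      "reconnect-wait-time": 3000,
      "on-fail": "stop-retry-exit"
    }
  }
}
```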

Implications

These parameters can have some subtle implications. It is important to understand what Kea can and cannot cope with.

Lease Database

Kea needs a reliable lease storage backend to provide any kind of DHCP service. The lease list is used both to find free addresses for new leases, and to confirm renewals. There is no way to make Kea operational in the face of failed lease storage. Attempting to continue service with an unavailable lease database just causes Kea to log more errors.

For this reason, stop-retry-exit is strongly recommended for a lease database. Other values will just put Kea into a state where it cannot do anything useful until restarted. If you want to have Kea keep trying, see instead Persistent Retries.

Hosts Database

For the hosts database, Kea can still serve dynamic pools without it (using serve-), but reservations will be effectively ignored.

If you depend on reservations, continuing DHCP service without the hosts database could cause chaos. Devices which normally get a static address might get a dynamic address instead. Devices which depend on configuration via DHCP options configured within host reservations may act like they were factory reset.

Keep in mind that if a client that should get a special configuration via a reservation instead gets a "generic" configuration via a dynamic lease, it will continue to use that improper configuration until the lease needs to be renewed. For example, if the lease duration is one week, a database problem that lasts only seconds, combined with serve-, could still leave special clients inoperable for days.

If reservations are more of a nicety for you than a critical feature, serve- may be appropriate. Otherwise, stop- is recommended.
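As an illustration only, a site where reservations are a nicety might configure its hosts database along these lines. The connection details and retry values are placeholders. A site that depends on reservations would use stop-retry-exit instead.

```json
{
  "Dhcp4": {
    "hosts-database": {
      "type": "mysql",
      "name": "kea",
      "host": "db.example.org",
      "user": "kea",
      "password": "secret",
      // Assumed values: retry 5 times, 2000 milliseconds apart.
      "max-reconnect-tries": 5,
      "reconnect-wait-time": 2000,
      // Keep serving dynamic pools while retrying (reservations are
      // effectively ignored in the meantime); exit if retries run out.
      "on-fail": "serve-retry-exit"
    }
  }
}
```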

Logging Databases

Logging (or the lack thereof) does not impact DHCP service itself. Thus, serve-retry-continue is often a good choice for a forensic logging database. This tells Kea to sacrifice logging and continue providing service.

Conversely, in a high-security environment, logging may be considered essential, and stop-retry-exit may be appropriate here as well.
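The two positions above differ only in the on-fail value given to the forensic logging hook. The sketch below shows the service-first choice. The library path and connection details are assumptions that will vary by installation.

```json
{
  "Dhcp4": {
    "hooks-libraries": [
      {
        // Assumed install path; adjust for your packaging.
        "library": "/usr/lib/kea/hooks/libdhcp_legal_log.so",
        "parameters": {
          "type": "mysql",
          "name": "kea_log",
          "host": "db.example.org",
          "user": "kea",
          "password": "secret",
          // Sacrifice logging and keep providing DHCP service.
          // A high-security site might use "stop-retry-exit" here instead.
          "on-fail": "serve-retry-continue"
        }
      }
    ]
  }
}
```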

Durations

The time it takes Kea to retry once will be, at maximum, the sum of connect-timeout and reconnect-wait-time. Mind the units: the former is in seconds, the latter in milliseconds.

Both the timeout and the wait define periods of time when -- in a failure scenario -- Kea cannot use the database. Increasing them just means Kea will wait that much longer before giving up and retrying. To reduce time-to-recovery, decreasing either or both may be appropriate, so that Kea abandons a bad connection sooner, or retries sooner.

However, be aware that reducing them may also increase load on Kea, the database, and/or the network. Some wait is typically appropriate: it allows time for recovery and avoids flooding the network, whereas immediate retries can be counter-productive.

Retry Endlessly or Fail Explicitly?

There are some failure scenarios Kea simply cannot cope with (loss of lease storage being the biggest example). In such cases, it is often better for Kea to abort and exit entirely. This has several benefits:

It gives a clear and obvious sign something is wrong
It allows for supervision software (like runit or systemd) to attempt recovery actions, or restart the process
It is easily checked with monitoring software
It does not waste machine or network resources retrying forever

The general idea is that in the face of a serious problem, it is better to alert a human to investigate, than to have software stuck ineffectually, like a wind-up toy robot walking into a wall.

Scenarios

We can offer some scenarios to illustrate the above. Please understand, these are not recommendations or even suggestions. When it comes to networks, one size definitely does not fit all. Network operators are the ones who know what is best for their systems.

Quicker Recovery

For faster recovery in the face of a momentary problem, reduce the connect-timeout and reconnect-wait-time durations. For example, a timeout of 1 second and a wait of 500 milliseconds (half a second) is still an eternity in computer time.

This can be combined with Persistent Retries.
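A sketch of the durations mentioned above, applied to a lease database, is shown here. The connection details are placeholders. The retry count of 3 is an assumption, included because the default of 0 would mean no retries at all.

```json
{
  "Dhcp4": {
    "lease-database": {
      "type": "mysql",
      "name": "kea",
      "host": "db.example.org",
      "user": "kea",
      "password": "secret",
      // Abandon an unresponsive connection after 1 second.
      "connect-timeout": 1,
      // Retry after 500 milliseconds (half a second).
      "reconnect-wait-time": 500,
      // Assumed value; worst case is 3 x (1 s + 0.5 s) = 4.5 s before giving up.
      "max-reconnect-tries": 3,
      "on-fail": "stop-retry-exit"
    }
  }
}
```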

Persistent Retries

Increase max-reconnect-tries to tell Kea to keep trying.

This can be combined with Quicker Recovery. For example, a timeout of 1 second, a wait of 1000 milliseconds, and 300 retries. Kea will keep trying the database every two seconds, for up to ten minutes. If the database server or an intermediate network device is being rebooted, this lets Kea recover as quickly as reasonably possible. If the database is gone for more than ten minutes, one can assume human intervention is going to be required regardless.
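The ten-minute example above might be written as follows. The connection details are placeholders; the three timing values are the ones from the text.

```json
{
  "Dhcp4": {
    "lease-database": {
      "type": "mysql",
      "name": "kea",
      "host": "db.example.org",
      "user": "kea",
      "password": "secret",
      // Worst case per attempt: 1 s timeout + 1 s wait = 2 s.
      "connect-timeout": 1,
      "reconnect-wait-time": 1000,
      // 300 attempts x 2 s = 600 s, or ten minutes.
      "max-reconnect-tries": 300,
      "on-fail": "stop-retry-exit"
    }
  }
}
```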

Indefinite Retries

Please read and consider Retry Endlessly or Fail Explicitly? before attempting this.

If you want Kea to retry indefinitely, set max-reconnect-tries to a very high value. For example, a retry limit of 2592000 combined with a reconnect-wait-time of 1000 milliseconds would keep retrying for about one month (2,592,000 seconds is 30 days).
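For completeness, the month-long example corresponds to a fragment like the one below. The connection details are placeholders, and the choice of on-fail is an assumption following the lease database advice in the Implications section.

```json
{
  "Dhcp4": {
    "lease-database": {
      "type": "mysql",
      "name": "kea",
      "host": "db.example.org",
      "user": "kea",
      "password": "secret",
      "reconnect-wait-time": 1000,
      // 2592000 tries x 1 s wait = 2,592,000 s = 30 days,
      // plus any time spent waiting on connect-timeout for each attempt.
      "max-reconnect-tries": 2592000,
      "on-fail": "stop-retry-exit"
    }
  }
}
```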