Introduction
This document discusses database connectivity problems, parameters that can be adjusted to make Kea compensate for those problems, and some of the implications of doing so.
Scenario
Symptoms
Often this discussion begins with error messages like this:
DATABASE_MYSQL_FATAL_ERROR Unrecoverable MySQL error occurred: unable to execute for <SELECT ...>, reason: Server has gone away (error code: 2006).
DHCP6_PACKET_PROCESS_STD_EXCEPTION ... exception occurred during packet processing: fatal database error or connectivity lost
DHCP4_DB_RECONNECT_FAILED maximum number of database reconnect attempts: 3, has been exhausted without success ...
Cause and Corrective Action
Such messages typically indicate either that the database server is down, or that there is a network connectivity problem between the host running Kea and the host running the database.
Generally, the underlying problem will need to be fixed for Kea performance and reliability to be acceptable. Trying to tune around a bad database or network rarely yields good results. MySQL, MariaDB, and PostgreSQL all have high-availability/replication features designed to reduce downtime. Building resilient networks is also a well-developed field of engineering. We suggest that efforts be focused there.
However, if you want Kea to try to compensate for a bad database connection, you have some options.
Do You Even Need a Database?
Before we get into the parameters you can change, it is worth asking if you even need to use a database in the first place. Many people assume a database is automatically better, but this is far from true in practice.
For lease storage in particular, memfile (CSV file) storage is faster and almost always more reliable. Using memfile removes dependencies on inter-process communication, network, or other processes. It keeps everything internal to Kea and in local RAM. Lease status can still be obtained using the Kea API.
The only advantages a database presents here are (1) easier integration with other tools in some scenarios (SQL vs JSON) and (2) a shared lease database. Unless one of these is needed, ISC Support generally recommends memfile for lease storage.
Other database uses have similar pros and cons. Host reservations in a central database can be convenient, and may be appropriate if one has hundreds of reservations, but they can also be kept in a plain config file. Any directives in a config backend (CB) can also be placed in a config file. The forensic logging hook can log to a plain text file or syslog/journal.
So if database connectivity is a concern, first review your goals and consider how a database may or may not help you achieve them.
Database Connection Parameters
Kea offers several configuration parameters that control how it behaves when a database is not working properly.
These need to be configured for each database in use -- that is, inside each config file section that declares a database. These include the lease-database, hosts-databases, config-databases, and libdhcp_legal_log sections.
If all parts of Kea use the same database server, you can declare that once in a separate file, and then include the same file in each section. For example, <?include "/etc/kea/database.conf"?>. See Config Inclusion in the ARM for more.
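As a rough sketch of that pattern, the shared file could hold one complete database map, and each section could pull it in. The file contents, host name, and credentials below are placeholders, not values from this article, and the sketch assumes the included file holds a complete JSON map; check Config Inclusion in the ARM before relying on it. Kea's configuration parser accepts comments, which are used here for annotation.

```json
// /etc/kea/database.conf (hypothetical contents)
{
    "type": "mysql",
    "name": "kea",
    "host": "db.example.org",
    "user": "kea",
    "password": "changeme"
}
```

```json
// kea-dhcp4.conf (excerpt)
{
"Dhcp4": {
    // Both sections reuse the same connection definition.
    "lease-database": <?include "/etc/kea/database.conf"?>,
    "hosts-database": <?include "/etc/kea/database.conf"?>
}
}
```

With this layout, any failure-handling parameters placed in the shared file apply to every section that includes it. Sections that need different behavior would need their own files or inline definitions.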
Overview
- When trying to talk to a database, Kea waits connect-timeout seconds for an answer.
- If the timeout is reached, or if the connection explicitly fails, Kea then waits reconnect-wait-time before trying again.
- During the wait period, the serve- or stop- part of on-fail determines if Kea attempts to keep servicing DHCP requests.
- After waiting, Kea will try the database request again, up to the limit of max-reconnect-tries attempts.
- If the retry limit is exceeded, Kea declares that database dead, and will make no further attempts. The -exit or -continue part of on-fail determines if Kea will attempt to keep running without that database.
Simply telling Kea to continue serving does not make it capable of doing so. Please read the Implications section before configuring serve- or -continue.
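To make the sequence above concrete, here is a minimal sketch of a lease-database section with all four parameters set explicitly. The database type, name, host, and credentials are placeholders, and the retry count of 3 is only an illustration, not a recommendation.

```json
{
"Dhcp4": {
    "lease-database": {
        // Placeholder connection details.
        "type": "mysql",
        "name": "kea",
        "host": "db.example.org",
        "user": "kea",
        "password": "changeme",

        // Seconds to wait for the database to answer.
        "connect-timeout": 5,
        // Milliseconds to pause before each retry.
        "reconnect-wait-time": 5000,
        // Retries allowed before the database is declared dead.
        "max-reconnect-tries": 3,
        // Do not serve DHCP while retrying; exit once retries run out.
        "on-fail": "stop-retry-exit"
    }
}
}
```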
Timeout
Name | connect-timeout |
Syntax | Positive integer |
Units | Whole seconds |
Default | 5 seconds |
How long Kea will wait for a database server to respond, whether during the initial connection or during a pending transaction.
Retry Wait
Name | reconnect-wait-time |
Syntax | Positive integer |
Units | Milliseconds |
Default | 5000 milliseconds (5 seconds) |
How long Kea will wait before attempting to connect again, after connect-timeout has expired, or after a database command has failed (connection refused, unexpected error, etc.).
During this time, the first part of on-fail determines how Kea behaves (stop- or serve-).
Maximum Retries
Name | max-reconnect-tries |
Syntax | Non-negative integer |
Default | 0 (no retries) |
Once this limit is exceeded, Kea internally declares the database dead. It will not attempt further retries. As such, even if the database comes back, Kea will not notice.
Kea will take the action specified by the last part of on-fail (-exit or -continue).
Failure During Operation
Name | on-fail |
Syntax | Enumeration |
Choices | stop-retry-exit, serve-retry-exit, serve-retry-continue |
Default | stop-retry-exit for most; serve-retry-continue for logging |
The on-fail value has two meaningful parts: the first (stop or serve) and the last (exit or continue). The middle is always retry.
The on-fail parameter controls the finer points of Kea's failure response. It generally does not make Kea handle failures better. Do not change on-fail simply because continue or serve sound like more appealing words.
Stop or Serve
stop- versus serve- determines whether Kea will attempt to keep servicing DHCP requests during the retry period.
stop- suppresses DHCP service if the database is in a failing/retrying state. Kea will simply not answer, as if it were not running at all.
serve- tells Kea to attempt to continue servicing DHCP requests during database retries. Depending on the circumstances, this may not do what you want. Please read the Implications section.
If the retry limit is exceeded, serve- no longer has any effect.
Exit or Continue
-exit versus -continue determines what happens if the retry limit is exceeded.
-exit causes the Kea daemon process to exit once a database is declared dead.
-continue means the Kea daemon process will continue running. The database will remain unavailable, and any functions that depend on that database will fail. Please read the Implications section.
Failure At Startup
Name | retry-on-startup |
Syntax | Boolean |
Default | False |
Introduced | Kea 2.6 |
When Kea is started, it attempts to open connections to all configured databases. By default, if any of these fail, it immediately aborts. The assumption is the database should have been made ready before Kea was started.
Alternatively, set retry-on-startup to true, and database problems on startup will be treated the same as database problems after startup. The other parameters described above will be used to determine how Kea behaves.
This parameter is not available in Kea 2.4 or earlier.
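A minimal sketch of enabling this for a lease database on Kea 2.6 or later. The connection details and the retry values are placeholders chosen for illustration only.

```json
{
"Dhcp4": {
    "lease-database": {
        // Placeholder connection details; host, user, and password omitted for brevity.
        "type": "mysql",
        "name": "kea",

        // Treat a failed connection at startup like a failure during operation.
        "retry-on-startup": true,

        // These now also govern the startup case (illustrative values).
        "connect-timeout": 5,
        "reconnect-wait-time": 5000,
        "max-reconnect-tries": 10,
        "on-fail": "stop-retry-exit"
    }
}
}
```

This can be useful when Kea and the database start at the same time, for example after a host reboot, and the database takes longer to become ready.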
Implications
These parameters can have some subtle implications. It is important to understand what Kea can and cannot cope with.
Lease Database
Kea needs a reliable lease storage backend to provide any kind of DHCP service. The lease list is used both to find free addresses for new leases, and to confirm renewals. There is no way to make Kea operational in the face of failed lease storage. Attempting to continue service with an unavailable lease database just causes Kea to log more errors.
For this reason, stop-retry-exit is strongly recommended for a lease database. Other values will just put Kea into a state where it cannot do anything useful until restarted. If you want to have Kea keep trying, see instead Persistent Retries.
Hosts Database
For the hosts database, Kea can still serve dynamic pools without it (using serve-), but reservations will be effectively ignored.
If you depend on reservations, continuing DHCP service without the hosts database could cause chaos. Devices which normally get a static address might get a dynamic address instead. Devices which depend on configuration via DHCP options configured within host reservations may act like they were factory reset.
Keep in mind that if a client that should get a special configuration via a reservation instead gets a "generic" configuration via a dynamic lease, it will continue to use that improper configuration until the lease needs to be renewed. For example, if the lease duration is one week, a database problem that lasts only seconds, combined with serve-, could still leave special clients inoperable for days.
If reservations are more of a nicety for you than a critical feature, serve- may be appropriate. Otherwise, stop- is recommended.
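As an illustration of mixing policies, the sketch below keeps the strict setting on the lease database while letting DHCP service continue if only the hosts database is unreachable. It assumes reservations are a nicety in this deployment. Database names, hosts, and retry values are placeholders.

```json
{
"Dhcp4": {
    "lease-database": {
        "type": "mysql",
        "name": "kea_leases",
        "host": "db.example.org",
        // Lease storage is essential: stop serving and exit on failure.
        "on-fail": "stop-retry-exit"
    },
    "hosts-database": {
        "type": "mysql",
        "name": "kea_hosts",
        "host": "db.example.org",
        "max-reconnect-tries": 10,
        "reconnect-wait-time": 2000,
        // Keep serving dynamic pools while the hosts database is retried.
        // Reservations are ignored during that time.
        "on-fail": "serve-retry-continue"
    }
}
}
```

If reservations are critical, changing the hosts-database value to stop-retry-exit restores the stricter behavior.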
Logging Databases
Logging (or the lack thereof) does not impact DHCP service itself. Thus, serve-retry-continue is often a good choice for a forensic logging database. This tells Kea to sacrifice logging and continue providing service.
Conversely, in a high-security environment, logging may be considered essential, and stop-retry-exit may be appropriate here as well.
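A sketch of the first case, with the forensic logging hook writing to a database and DHCP service taking priority over logging. The library path, database name, and retry values are assumptions for illustration; the path in particular varies by platform and package.

```json
{
"Dhcp4": {
    "hooks-libraries": [
        {
            // Hypothetical install path for the forensic logging hook.
            "library": "/usr/lib/kea/hooks/libdhcp_legal_log.so",
            "parameters": {
                // Placeholder connection details.
                "type": "mysql",
                "name": "kea_legal",
                "host": "db.example.org",
                "user": "kea",
                "password": "changeme",

                "max-reconnect-tries": 5,
                "reconnect-wait-time": 2000,
                // Keep serving DHCP even if logging is lost.
                "on-fail": "serve-retry-continue"
            }
        }
    ]
}
}
```

For the high-security case, the same section with on-fail set to stop-retry-exit would make Kea stop rather than run without logging.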
Durations
The time it takes Kea to retry one time will be, at maximum, the sum of connect-timeout and reconnect-wait-time.
Both the timeout and the wait define periods of time when -- in a failure scenario -- Kea cannot use the database. Increasing them just means Kea will wait that much longer before giving up and retrying. To reduce time-to-recovery, decreasing either or both may be appropriate, so that Kea abandons a bad connection sooner, or retries sooner.
However, be aware that reducing them may also increase load on Kea, the database, and/or the network. Some wait is typically appropriate: a pause allows time for recovery and avoids flooding the network, whereas immediate retries can be counter-productive.
Retry Endlessly or Fail Explicitly?
There are some failure scenarios Kea simply cannot cope with (loss of lease storage being the biggest example). In such cases, it is often better for Kea to abort and exit entirely. This has several benefits:
- It gives a clear and obvious sign something is wrong
- It allows for supervision software (like runit or systemd) to attempt recovery actions, or restart the process
- It is easily checked with monitoring software
- It does not waste machine or network resources retrying forever
The general idea is that in the face of a serious problem, it is better to alert a human to investigate, than to have software stuck ineffectually, like a wind-up toy robot walking into a wall.
Scenarios
We can offer some scenarios to illustrate the above. Please understand, these are not recommendations or even suggestions. When it comes to networks, one size definitely does not fit all. Network operators are the ones who know what is best for their systems.
Quicker Recovery
For faster recovery in the face of a momentary problem, reduce the connect-timeout and reconnect-wait-time durations. For example, a timeout of 1 second and a wait of 500 milliseconds (half a second) is still an eternity in computer time.
This can be combined with Persistent Retries.
Persistent Retries
Increase max-reconnect-tries to tell Kea to keep trying.
This can be combined with Quicker Recovery. For example, a timeout of 1 second, a wait of 1000 milliseconds, and 300 retries. Kea will keep trying the database every two seconds, for up to ten minutes. If the database server or an intermediate network device is being rebooted, this lets Kea recover as quickly as reasonably possible. If the database is gone for more than ten minutes, one can assume human intervention is going to be required regardless.
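The example values above could be written as follows. This is a sketch, not a recommendation; the connection details are placeholders, and the arithmetic in the comments assumes every attempt runs into the full timeout.

```json
{
"Dhcp4": {
    "lease-database": {
        // Placeholder connection details; host, user, and password omitted for brevity.
        "type": "mysql",
        "name": "kea",

        // Worst case per attempt: 1 s timeout + 1 s wait = 2 s.
        "connect-timeout": 1,
        "reconnect-wait-time": 1000,
        // 300 attempts x 2 s = 600 s, or ten minutes.
        "max-reconnect-tries": 300,
        "on-fail": "stop-retry-exit"
    }
}
}
```

If the database refuses connections immediately instead of timing out, each attempt takes closer to one second, and the retries run out after roughly five minutes.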
Indefinite Retries
Please read and consider Retry Endlessly or Fail Explicitly? before attempting this.
If you want Kea to retry indefinitely, set max-reconnect-tries to a very high value. For example, a retry limit of 2592000 combined with a reconnect-wait-time of 1000 milliseconds would keep retrying for about one month.
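A sketch of that example, with placeholder connection details. The duration in the comments counts only the waits between attempts; any time spent waiting out connect-timeout adds to it.

```json
{
"Dhcp4": {
    "lease-database": {
        // Placeholder connection details; host, user, and password omitted for brevity.
        "type": "mysql",
        "name": "kea",

        "reconnect-wait-time": 1000,
        // 2592000 attempts x 1 s = 2,592,000 s = 30 days of 86,400 s each.
        "max-reconnect-tries": 2592000,
        "on-fail": "stop-retry-exit"
    }
}
}
```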