What to do if your BIND, ISC DHCP, or Kea DHCP server has crashed
  • 05 Nov 2018

If your BIND, ISC DHCP, or Kea DHCP server crashes (i.e. the daemon terminates unexpectedly), collecting the available evidence and submitting it to us is vital if we are to help diagnose the problem and provide a solution. Below is a list of files and information to collect after a crash. This includes the process core dump. If you don't have one, core dumps may need to be enabled on your system so that one can be collected if the crash happens again.

Enabling core dumps
Note that on many operating systems, process core dumps have to be explicitly enabled, and the permitted core size sometimes has to be raised as well (named can produce very large core dumps). If there is no core dump to be found, resolve this now in case of a recurrence.

The ulimit command handles this in many cases - but you should check that this is appropriate before trying it in your environment:

ulimit -c unlimited

Appropriate write permissions may also need to be set for the directory that the core file is to be written to. (Are you running chrooted? What user does your daemon process run as?) You can test whether core dumps are possible by using gcore or kill -6 (SIGABRT) against a process at a time when restarting it won't impact production.

You can check the details of enabling core files for your environment via the man pages:

man core
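As a rough illustration of the checks described above, the following sketch reports the core size limit of the current shell and, on Linux, the kernel's core file naming pattern. The function names core_limit_status and report_core_settings are hypothetical helpers, not ISC tools, and /proc/sys/kernel/core_pattern exists only on Linux. Note that the limit shown is that of the shell running the script; a daemon started by an init system may run with a different limit.

```shell
# Classify a core size limit as printed by `ulimit -c`.
# (core_limit_status is a hypothetical helper name.)
core_limit_status() {
    case "$1" in
        0)         echo "disabled" ;;
        unlimited) echo "unlimited" ;;
        *)         echo "limited" ;;
    esac
}

# Report the core dump settings visible from the current shell.
# (report_core_settings is a hypothetical helper name.)
report_core_settings() {
    limit=$(ulimit -c)
    echo "core size limit: $limit ($(core_limit_status "$limit"))"
    # Linux only: this pattern controls where cores go and how they are named.
    if [ -r /proc/sys/kernel/core_pattern ]; then
        echo "core_pattern: $(cat /proc/sys/kernel/core_pattern)"
    fi
}

# Usage: report_core_settings
```

A result of "disabled" or "limited" suggests that a crash may leave no core file, or a truncated one, until the limit is raised.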

Information and files to collect and preserve:

Please collect and preserve the details below (in particular the logs and any core files which may be lost if not collected and copied/moved elsewhere right away). We may need some or all of them to diagnose the problem.

  1. Note down and tell us what BIND, Kea, or DHCP was doing at the time - for example, is this a test or a production environment? If testing, how is the test being run, etc.? If in production, was there anything specific happening at the time such as configuration updates or similar maintenance activities?

  2. Look for a core dump. Generally this will be in either the current working directory of the server program, or in the directory of the binary. Depending on how core dumping is managed on that system, it will simply be named 'core' or may have a more esoteric name, possibly involving the PID of the process that died. On MacOS, core files are in directory /cores and suffixed with the PID of the aborting process (e.g. /cores/core.).

    You can check where a core has come from by using the file command:

file core

core: ELF 32-bit MSB core file, SPARC, version 1 (SYSV), SVR4-style, from 'named'
  3. We will need the actual binary that generated the core in order to be able to read it on another machine.

    Note: If the binary was built without any debug information - particularly if it doesn't have the procedure names available for stack traces - you can sometimes get around this by building a new binary with compile option -g from the same source code bundle. This implies all other build/configure options are the same as the original build, and that the compiler/linker/optimizer software hasn't been updated in the build environment since the original binary was created.

  4. Collect any libraries that the binary loaded dynamically from the run-time environment that produced the core.

    Use ldd for a first-pass on what's needed (sometimes we need more libs that ldd doesn't show us at first - they're only exposed when we try to read the core for the first time): 

ldd /usr/sbin/named

     linux-vdso.so.1 =>  (0x00007fff11d4c000)
     libcrypto.so.0.9.8 => /usr/lib64/libcrypto.so.0.9.8 (0x00007f19d4d19000)        
     libc.so.6 => /lib64/libc.so.6 (0x00007f19d49c0000)        
     libdl.so.2 => /lib64/libdl.so.2 (0x00007f19d47bc000)        
     libz.so.1 => /lib64/libz.so.1 (0x00007f19d45a6000)        
     /lib64/ld-linux-x86-64.so.2 (0x00007f19d5096000) 

There is no ldd command in MacOS

Instead, otool -L should provide the same functionality; for example:

$ otool -L /usr/local/sbin/named
  5. Note the environment information - OS and version, hardware, number of CPUs, memory size, BIND/DHCP/Kea version and configure/compile/link options. For BIND, use named -V; for DHCP, dhcpd --version; and for Kea, kea-dhcp4 -V.

  6. Collect the configuration files:

    • named.conf / dhcpd.conf / kea.conf (as appropriate) - key material can be obscured if you prefer

    Use named-checkconf to obscure your key material from named.conf
    The named-checkconf tool has an option -p that outputs the configuration file in canonical format (after checking it). A second option -x obscures secrets by replacing them with strings of question marks ("?"). This means that the following command will parse and output your named.conf in a format suitable for sharing without exposing key material:

    $ named-checkconf -px

    The filename to be checked defaults to /etc/named.conf; to explicitly apply named-checkconf to a configuration file in another location, use the following format:

    $ named-checkconf -px filename

    • any include files that are declared in the main configuration files (look for nested includes too!)
    • leases file (dhcpd/Kea only, or possibly a database dump for Kea, see the kea-admin(8) manual)
    • zone data files (named only) - although these may not be needed initially - please ask!
  7. Gather all the relevant logfiles leading up to and covering the period of the incident.

  8. Think about what might have changed in your environment prior to experiencing this problem. For example:

    • Software upgrade (your BIND, Kea DHCP, or ISC DHCP software)
    • Operating system patches
    • Configuration changes (OS or product related)
    • Loading changes - additional clients for example
    • Networking changes - server moves, network topology changes etc.
    • Firewall updates or configuration changes
    • Seasonal usage changes
  9. If the failures are recurring, is there any pattern to them? For example:

    • Same time of week/day/hour each time (what else is going on at this time?)
    • Always occurs at peaks in server load
    • Related to a specific server operation such as a zone transfer (named) or OMAPI update (dhcpd)
    • Always occurs x days/hours after the last restart
    • Seemingly completely random
    • Increasing or decreasing in frequency
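
Collecting the binary together with the libraries reported by ldd could be scripted along the following lines. This is a minimal sketch for Linux systems with ldd available; the function names ldd_paths and collect_binary_and_libs and the libs/ subdirectory layout are assumptions made for illustration. As noted above, ldd gives only a first pass, so further libraries may still be requested later.

```shell
# Print the on-disk path of each library named in `ldd` output read from stdin.
# Entries with no usable path (such as linux-vdso or "not found") are skipped.
# (ldd_paths is a hypothetical helper name.)
ldd_paths() {
    awk '
        $2 == "=>" && $3 ~ /^\// { print $3; next }   # "name => /path (addr)"
        $1 ~ /^\//               { print $1 }         # "/path (addr)"
    '
}

# Copy a binary and the libraries it links against into a destination directory.
# (collect_binary_and_libs is a hypothetical helper name.)
collect_binary_and_libs() {
    binary=$1
    dest=$2
    mkdir -p "$dest/libs" || return 1
    cp -p "$binary" "$dest/" || return 1
    ldd "$binary" | ldd_paths | while read -r lib; do
        # -L copies the file a symlink points to rather than the link itself.
        cp -pL "$lib" "$dest/libs/"
    done
}

# Usage: collect_binary_and_libs /usr/sbin/named /var/tmp/crash-evidence
```

Copying the files promptly matters because a later package upgrade can replace the binary or its libraries, after which they no longer match the core.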

What to include when reporting the problem:

If you are submitting a bug report or support ticket, it is vital that you collect and preserve as much of the information and as many of the files requested above as possible. It may not be possible to diagnose the cause of the problem if this level of information is not available to us.

However, in the initial report, we prefer just to receive the basic details, and will let you know what else we need once we've reviewed them.

  • Environment information (see 5. and 8.).
  • BIND/DHCP/Kea server version (use "named -V" for BIND).
  • Frequency and impact to your servers or production services of the incident(s) you have experienced or are continuing to experience.
  • What named or dhcpd or the relevant modular kea-* process was doing at the time (see 1.).
  • Extracts from system and application logfiles leading up to and including the incident/crash.
  • If at all possible, a debugger backtrace from the crash from the server that produced it.

If you have a core file and gdb on the server that produced the core, you can obtain a backtrace (snapshot of what each thread was doing and the nested layers of procedure calls and data that was being used/accessed at the time) by launching gdb as follows:

  $ <path-to>/gdb <binary(full path)> <core(full path)>

And then from gdb, type:

   > thread apply all bt full

(Sometimes this will error and fail to complete. If that happens, retry it, omitting 'full'.)

Please include all of the output from gdb from when it is launched, not just the output from the thread apply all bt full command.
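
One way to capture the whole session in a file is to run gdb non-interactively. The sketch below assumes gdb is on the PATH; the function name save_backtrace and the default output file backtrace.txt are hypothetical. It relies on gdb's -batch and -ex options. Batch mode may omit the version banner that an interactive session prints, so consider adding the output of gdb --version separately.

```shell
# Write everything gdb prints while loading the core, followed by a full
# backtrace of all threads, to a file.
# (save_backtrace is a hypothetical helper name.)
save_backtrace() {
    binary=$1
    core=$2
    out=${3:-backtrace.txt}
    if [ ! -r "$binary" ] || [ ! -r "$core" ]; then
        echo "usage: save_backtrace <binary> <core> [output-file]" >&2
        return 2
    fi
    # -batch exits once the commands have run; -ex supplies a command to run.
    gdb -batch -ex "thread apply all bt full" "$binary" "$core" >"$out" 2>&1
}

# Usage: save_backtrace /usr/sbin/named /var/named/core backtrace.txt
```

If the full backtrace fails, the same approach works with "thread apply all bt" in place of "thread apply all bt full".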

Note: MacOS users whose system has not generated a core file may still have access to useful crash data - go to Applications/Console and look at "User Diagnostic Reports" and you may see a report.

Uploading large files to ISC
We prefer to receive large and non-text files compressed and bundled first using tar and gzip/bzip2. Windows users can use zip instead. These files cannot be accepted as email attachments, so please contact us for alternative upload arrangements if you need to submit large files with bug or support tickets.
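
As an illustration, a directory of collected files could be bundled with tar and gzip roughly as follows. The function name bundle_evidence and the default archive name crash-evidence.tar.gz are hypothetical.

```shell
# Bundle a directory of collected evidence into a gzip-compressed tar archive.
# (bundle_evidence is a hypothetical helper name.)
bundle_evidence() {
    src=$1
    archive=${2:-crash-evidence.tar.gz}
    [ -d "$src" ] || { echo "not a directory: $src" >&2; return 1; }
    # -C keeps the paths inside the archive relative to the parent of $src.
    tar -czf "$archive" -C "$(dirname "$src")" "$(basename "$src")"
}

# Usage: bundle_evidence /var/tmp/crash-evidence /var/tmp/crash-evidence.tar.gz
```

Listing the archive afterwards with tar -tzf is a quick way to confirm that the core file, binary, and logs are all included before uploading.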

Bug Reporting

Before submitting a bug report please ensure first that you are running a current version. Also, some packaged versions of BIND 9, Kea DHCP, and ISC DHCP might be built with source code that has been modified by the distributor - in those cases while it may be useful to report the issue directly to ISC (particularly if it might be a potential security issue), we may not be able to successfully diagnose the root cause of the problem if it cannot be reproduced with binaries that have been built directly from source code downloaded from ISC.

To report a bug, please use our Bug Report Form.

Support Customers:

If you have a support subscription with us, please report problems through your support channel first rather than filing a bug report.

Reporting security issues

DNS and DHCP security is critical to the Internet Infrastructure. If you think you may be seeing a potential security vulnerability (for example, a crash with REQUIRE, INSIST, or ASSERT failure), please report it immediately to security-officer@isc.org and do not post it on the public mailing list. We provide numerous alternate ways to contact our security officer alias, including via the Bug Report Form (or see the Kea Known Issues List for Kea) and the ISC Contact form. Please also see our Security Vulnerability Disclosure Policy for details on how we publish security vulnerabilities.
