What to do with a misbehaving BIND server


Sometimes a named process will appear to behave abnormally - for example, it uses more CPU or memory than usual (or less), emits unexpected error messages, fails to respond to queries, or responds negatively or late. It's tempting simply to restart named, or to try a reload/reconfig/flush to see if that helps. If it does help, that's good for the production environment at the time, but the opportunity to collect useful troubleshooting information is destroyed along with the problem.

Here are some things that we'd recommend you do - as many as possible - before attempting to clear the problem. Then report the results and submit the collected data along with a full description of the problem that was encountered and its symptoms.

This checklist assumes that you've already qualified, using dig, the way in which named is not working, confirming subjective or second-hand reports of failure.

  1. Run pstack (or a similar OS-specific tool) against the process 3 or 4 times. This output gives several snapshots of what named is doing at each instant; by comparing them we can see whether threads are progressing or are stuck - for example, on a lock. It also gives clean stack traces of each thread from the run-time environment, without any possibility of mismatched executables and core files.
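For example, a minimal loop along these lines should work on Linux (the snapshot count, interval, and file names are illustrative):

```shell
# Take four stack snapshots of named, two seconds apart.
# pidof is Linux-specific; use "pgrep named" on other platforms.
pid=$(pidof named)
for i in 1 2 3 4; do
    pstack "$pid" > "pstack.$pid.$i.txt"
    sleep 2
done
```

Comparing successive snapshots (e.g. with diff) makes stuck threads stand out: a thread showing an identical stack in every snapshot may be blocked on a lock.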

  2. Obtain a snapshot of the current named status (if named is still consuming CPU, it may be useful to repeat this several times along with step 1):
    rndc status
  3. Generate a list of the client queries that named is currently handling (the default filename is named.recursing). It may be useful to repeat this several times if named is still running and consuming CPU, especially if the reported problems relate to recursive resolution:
    rndc recursing
  4. Get a snapshot of the current state of named's cache (the default filename is named_dump.db). It may be useful to repeat this several times if named is still running and consuming CPU, especially if the reported problems relate to recursive resolution:

    rndc dumpdb -all
  5. Toggle query logging on for a few minutes (if it's not already enabled):

    rndc querylog
  6. Temporarily increase the level of server logging for a few minutes. (This relies on the logging channels being defined so that debug-level output can be written somewhere; if changing the debug level via rndc does not produce additional logging output anywhere, review the logging configuration in named.conf.)

    rndc trace 3
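As a convenience, the repeatable snapshots from steps 2-4 can be captured together in one small script each time you sample the server's state (a sketch, assuming rndc is configured for the local named; the file naming is illustrative):

```shell
#!/bin/sh
# Snapshot named's state; safe to run several times while the problem is visible.
# Steps 5 and 6 (rndc querylog, rndc trace 3) toggle or raise persistent state,
# so run those once by hand rather than per sample; "rndc notrace" resets the
# debug level to 0 when you are done.
ts=$(date +%Y%m%d-%H%M%S)
rndc status > "rndc-status.$ts.txt"   # step 2
rndc recursing                        # step 3: writes named.recursing
rndc dumpdb -all                      # step 4: writes named_dump.db
```

Copy named.recursing and named_dump.db aside between samples so that successive dumps can be compared afterwards.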
  7. Take a snapshot packet trace (wireshark or similar) of both inbound and outbound traffic on the nameserver.  Make sure you trace on all the interfaces on the nameserver host.
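On Linux, for example, a tcpdump capture along these lines should work (the output filename is illustrative; drop the port filter entirely if you suspect fragmented traffic or traffic on non-standard ports):

```shell
# Capture full packets ("-s 0") on all interfaces to a file for later analysis
# in wireshark. "-i any" is Linux-only; name a specific interface elsewhere.
tcpdump -i any -s 0 -w named-trace.pcap port 53
```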

  8. If the problem is that a recursive server appears unable to resolve queries that involve recursion, then it is worth running some tests to see whether the problem is external to named - perhaps in the network environment. On the machine on which the instance of named that you are troubleshooting runs, try using dig +trace to verify connectivity. For example:

    dig +trace www.facebook.com

    Don't use the dig +trace option from your clients for troubleshooting specific server behaviour problems.

    For more information on the +trace option, read:  Why is the outcome different from dig when using the +trace option?

    Depending on the results of this, you can issue direct queries (emulating named's communication with authoritative servers). For example:

    dig @<server IP> +norec +dnssec +multi www.facebook.com
  9. Check OS resource use and whether any limits appear to have been reached (memory use, number of open sockets per process, network statistics, etc.).
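On Linux, for example, something like the following surfaces the relevant numbers (paths and tools vary by OS; these commands are illustrative):

```shell
pid=$(pidof named)                 # or: pgrep named
cat /proc/$pid/limits              # per-process limits (open files, etc.)
ls /proc/$pid/fd | wc -l           # open descriptors vs. the limit above
ps -o pid,vsz,rss,pcpu -p "$pid"   # memory and CPU footprint
ss -s                              # socket summary
netstat -s                         # protocol-level counters (drops, errors)
```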

Once you've done all or some of the above, the pressing need to restart the server will probably mean that there is little else you can do.

Please try to capture a core dump, however (gcore or kill -6 should provide one), rather than using rndc to halt the server - and then follow the checklist of files to submit with a core dump, including the data that was generated before stopping named.
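A sketch of capturing the core (assuming gdb's gcore utility is installed; kill -6 sends SIGABRT and terminates named, so use it only when you are about to restart anyway):

```shell
pid=$(pidof named)
ulimit -c                   # confirm core dumps aren't limited to size 0
gcore -o named.core "$pid"  # non-destructive: writes named.core.<pid>,
                            # and named keeps running
# kill -6 "$pid"            # destructive alternative: SIGABRT aborts named
#                           # and dumps core, subject to the ulimit above
```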