prefetch performance in BIND 9.10

AA-01315

Our new feature, early refresh of cache records (cache prefetch), in BIND 9.10 unfortunately came with a design defect that was not spotted until recently and that can cause performance degradation in some situations.

If you are experiencing unexpectedly poor performance when running BIND 9.10 with prefetch enabled, it is possible that you are affected by this design oversight (which will be remedied in BIND 9.10.4).  The identifying characteristics are:

  • Low QPS - many queries do not receive any response
  • System inbound UDP buffers are full (usually with an ongoing non-zero packet discard rate)
  • named CPU consumption drops below normal levels (named doesn't appear to be doing anything very much)
  • (Sometimes) there is an increase in inbound client queries that corresponds with this sudden performance degradation
  • It may be possible to review the inbound client queries and identify a pattern which would lead to a high prefetch rate
  • This problem may be seen particularly (but not exclusively) when named has poor connectivity, with high latency and a high packet loss rate, to the authoritative servers it queries in order to provide responses to clients.

This problem can occur on servers of all capacities and query rates whenever circumstances cause named to reach the state of resource depletion that underlies the sudden drop in performance.

The resource that can become depleted is the pool of internal structures that are used to hold inbound client queries while the responses to those queries are being assembled.

When a prefetch is triggered, the triggering query response is sent back to the client immediately.  However, instead of being released back to the pool right away, the internal client structure continues to be used as a placeholder/hook for the recursion activity during the ensuing prefetch and is only released afterwards.  When many prefetches are outstanding at once, the pool can run out, and named can no longer take in new client queries, which then back up in the system's inbound UDP buffers.  This was a problem that was very easy for us to address, as the situation parallels the resource management that is necessary when a client query is received that cannot be answered from cache (or authoritative data) and thus requires immediate recursion.

Note that prefetch is enabled by default in BIND 9.10.

The fix for this problem will be released in BIND 9.10.4:

4242.  [bug]      Replace the client if not already replaced when
                  prefetching. [RT #41001]

If you think you might be experiencing this issue now, there is a simple verification test which will also provide you with a viable workaround.  If you disable prefetch and the performance degradation vanishes, then this bug was the cause of your problem.

To disable prefetch, add to your named.conf options:

prefetch 0;
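
For illustration, the setting sits inside the options statement, as in the minimal sketch below.  The comment is only a placeholder for whatever options your configuration already contains; nothing else needs to change.

options {
        // ... your existing options remain as they are ...
        prefetch 0;     // a trigger value of 0 disables prefetch
};

After editing, named-checkconf can be used to check the syntax of the file, and rndc reconfig (or a restart of named) should make the change take effect.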

You can safely re-enable prefetch after upgrading to BIND 9.10.4 (or newer).

The code change for the fix can also be obtained from our public source repository:

https://source.isc.org/