DNSSEC validation and BIND 9 cache

This KB article discusses some of the problems that can be encountered by BIND 9 validating recursive servers due to intermittent problems with authoritative servers providing DNSSEC-signed zones. BIND has competing objectives when handling validation. On the one hand, it does not want to repeatedly query non-responding or faulty authoritative servers (whether the problem lies with the servers themselves, or with middleware such as firewalls or load-balancers), but on the other hand, it also needs to recover reasonably quickly after a fault is repaired.

In some situations, administrators of DNSSEC-validating recursive servers may need to take direct remedial action, rather than waiting for the built-in timeouts. This article explains what actions might help in different circumstances.

This article is applicable to versions of BIND 9 up to and including BIND 9.9.x only.

Significant changes to how a BIND recursive resolver handles EDNS when querying authoritative servers are planned for BIND 9.10. These changes will alter some of the behaviors described below and should mitigate some of the temporary validation failures caused by network or implementation faults.

What can go wrong and why?

- Responses from authoritative servers don't include any RRSIGs. Unsigned responses will fail validation if the parent zone has a signed DS (delegation signer) record for this zone.
- Invalid (or missing) RRSIGs. These cause validation failures when the parent zone provides a signed DS record for the zone. Possible causes of invalid RRSIGs include expired signatures, signatures that do not match their associated RRset, and signatures that do not correspond to a valid key.
- Broken chain of trust. The DNSKEY records don't correspond with the DS record in the parent zone, records are signed with a different key than expected, or the DNSKEY is missing entirely. The responses will fail validation.
- Malformed responses from authoritative servers, causing the validating recursive server to retry without EDNS support. If an authoritative server responds in a broken fashion, BIND discards the response and retries, first with a reduced UDP packet size and then without EDNS0 entirely. If the server responds properly to a query with EDNS0 disabled, BIND marks it as EDNS-incapable. EDNS0 is required for the recursive server to signal (via the DO option) that it would like DNSSEC-signed responses where available. Future queries to this server are therefore sent without DO, its responses omit the RRSIGs needed for DNSSEC validation, and validation fails.
- Intermittent lack of responses from authoritative servers, causing the validating recursive server to retry without EDNS support. Intermittent timeouts when querying authoritative servers cause BIND to retry. However, even if a retry succeeds, current production versions of BIND do not mark a server as EDNS-incapable following retries and fallback due to server timeouts alone.

Under which circumstances does BIND mark an authoritative server as EDNS-incapable?

named records that a server does not understand EDNS if a plain DNS query to that server succeeds after an earlier EDNS query returned SERVFAIL, NOTIMP or FORMERR.
named also records that a server does not understand EDNS if a plain DNS query to that server succeeds after one of the following occurred for an EDNS query:

- the dispatcher returned ISC_R_EOF to an EDNS query
- the parser returned ISC_R_UNEXPECTEDEND
- the parser returned DNS_R_FORMERR

If the authoritative server simply fails to respond when queried with EDNS, named does not mark the server as EDNS-incapable, even if it then receives a valid response to a query without EDNS. This prevents false positives caused by intermittent packet loss.
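The rules above can be summarized as a small decision function. This is only an illustrative sketch of the behavior described in this article, not code taken from named: the EdnsOutcome type and the marks_edns_incapable function are hypothetical names introduced here, and the outcomes are simply the conditions listed above plus "OK" and "TIMEOUT".

```python
from enum import Enum


class EdnsOutcome(Enum):
    """Result of the query sent with EDNS (labels are illustrative)."""
    OK = "ok"
    SERVFAIL = "SERVFAIL"                      # rcode from the server
    NOTIMP = "NOTIMP"                          # rcode from the server
    FORMERR = "FORMERR"                        # rcode from the server
    DISPATCH_EOF = "ISC_R_EOF"                 # dispatcher result
    PARSE_UNEXPECTED_END = "ISC_R_UNEXPECTEDEND"  # parser result
    PARSE_FORMERR = "DNS_R_FORMERR"            # parser result
    TIMEOUT = "timeout"                        # no response at all


# Outcomes that lead to the EDNS-incapable mark, but only when the
# follow-up plain DNS query then succeeds.
_MARKING_OUTCOMES = frozenset({
    EdnsOutcome.SERVFAIL,
    EdnsOutcome.NOTIMP,
    EdnsOutcome.FORMERR,
    EdnsOutcome.DISPATCH_EOF,
    EdnsOutcome.PARSE_UNEXPECTED_END,
    EdnsOutcome.PARSE_FORMERR,
})


def marks_edns_incapable(edns_outcome, plain_query_succeeded):
    """Return True if, per the rules in this article, the server
    would be recorded as not understanding EDNS."""
    return plain_query_succeeded and edns_outcome in _MARKING_OUTCOMES
```

Note that TIMEOUT is deliberately absent from the marking set: a timeout followed by a successful plain DNS answer does not result in the server being marked.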

What is cached?

- Responses from authoritative servers (for the originally received TTL of each RRset). This includes RRSIGs where RRsets are signed, and NSEC and NSEC3 RRsets for signed proof of non-existence.
- DNSKEY and DS RRsets (used to establish the chain of trust).
- The EDNS capability of authoritative nameservers (for up to 30 minutes in BIND 9.0 through 9.9).
- The validation status of RRsets (for the duration of the RRset's TTL). This will generally be one of three states: signed and validated, provably insecure, or pending validation (e.g. when there is a broken chain of trust or when the original query explicitly requested no checking).
- Lameness: when following a delegation, a nameserver responds that it is not authoritative for the domain that has been delegated to it (cached for up to lame-ttl, default 10 minutes).
- Bad cache for DNSSEC validation failures (for at least 30 seconds and up to the period set by lame-ttl). The types of records that may be cached in this way vary with the reason for the validation failure.
- Unreachable cache: a slave server maintains a cache of master servers that do not respond to SOA or zone transfer queries when the slave is attempting a zone data refresh. This cache has no impact on recursive queries; it is listed here only to make clear that it is not relevant to recursive server behavior.

named's cache can be dumped to a disk file for viewing via the rndc utility:

rndc dumpdb -all

The output of this command is a file; by default it is named_dump.db.
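For administrators who script their diagnostics, the dump can be triggered from Python. The following is a minimal sketch that assumes rndc is on the PATH and already configured to talk to the local named. The function names dumpdb_command and dump_cache are hypothetical; the optional view arguments follow rndc's "dumpdb -all [view ...]" usage.

```python
import subprocess


def dumpdb_command(views=(), rndc="rndc"):
    """Build the argument list for `rndc dumpdb -all [view ...]`."""
    return [rndc, "dumpdb", "-all", *views]


def dump_cache(views=(), rndc="rndc"):
    """Ask named to write its dump file.

    Raises subprocess.CalledProcessError if rndc exits non-zero.
    The location of the dump file is decided by named, not by this
    function.
    """
    subprocess.run(dumpdb_command(views, rndc), check=True)
```

Building the argument list separately from running it makes it easy to log or review the exact command before it is executed.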

There is one recursive cache per view (unless the attach-cache option
has been employed). If no views have been defined, then the recursive
cache lives in the default view.

named's cache is divided into sections. The main cache contains the resource records (RRs), including RRSIG records (DNSSEC signatures); it also records the DNSSEC-validation status of cached RRs. The Address Database (ADB) section of the cache is a record of authoritative servers that named has contacted in order to resolve recursive queries from clients. Bad cache holds RRsets that have failed DNSSEC validation.

The cache dump of the main cache lists resource records (RRs) in sets (RRsets), each set prefixed with a line that indicates the level of authority with which the RRset is being held. RRsets may be replaced when new information is received from a more authoritative source. For example, the list of nameservers for a domain received from one of the authoritative servers as an answer to a query for those servers will supersede a list of nameservers included in the additional section of a response from another server.

The cache dump of the ADB lists
most nameservers by name, with various fields that indicate the TTL of
the ADB entry, the IPv4 and IPv6 addresses and reachability of the
server, various flags (including EDNS-capability) and the current SRTT
of each address.

The ADB is keyed by server name and by address, but may also contain unassociated entries (held by IPv4 or IPv6 address alone, with no name). Unassociated entries occur either because there is no name associated with them (for example, addresses in forwarders {}; lists) or because the name associated with an address is retained only while its A/AAAA records in the main cache are unexpired. Unassociated entries will usually be reunited with their name(s) when those servers are used again during iterative query resolution.

ADB entries are maintained by server name, not by zone (this means that a problem with the ADB record for one server can impact many zones). ADB entries are retained for up to 30 minutes, and include flags for lameness, IPv4/IPv6 support and EDNS0 support, as well as the SRTT (Smoothed Round Trip Time).

How to clear cached entries

If there are DNSSEC validation failures as a result of unexpired cached contents, there are various techniques available to resolve the problem:

- Flush the entire named cache (rndc flush). The advantage of this is that there is no need to know which entries need to be cleared: they all will be. The disadvantage is that clearing the entire cache causes a subsequent flood of iterative queries to repopulate the cache with frequently accessed records and server information. Flushing the entire cache clears all resource records (RRs), bad cache (for DNSSEC-validation failures) and also the Address Database (where named tracks the status of authoritative servers that it has queried).
- Flush the cache for a specific name (rndc flushname name [view]). This flushes entries matching the specific name both from the main cache and from the ADB.
  - Use the name of a specific nameserver if there are problems with e.g. the EDNS status of that server.
  - Use the name of specific records that are failing validation to force re-validation on the next client query.
- Flush the cache for a specific name as well as all records below that name (rndc flushtree name [view]). This clears the main cache for those names, but it does not clear any names out of the ADB, so it may not be sufficient for some needs.
- Restart the named daemon.

Bad Cache is not cleared by rndc flushtree

Currently, the only ways to clear validation failures before they expire normally are to flush the entire cache, to identify the affected name and apply rndc flushname, or to restart named.
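The trade-offs between the flush commands can be captured in a small helper that builds the rndc command line and records what each option clears, according to the descriptions in this article. This is a sketch only: FlushAction, flush_action and the scope labels "all", "name" and "tree" are hypothetical names introduced for illustration, and the helper builds the command without running it. The optional view argument to rndc flush is taken from rndc's usage rather than from this article.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FlushAction:
    argv: tuple             # command line to run
    clears_adb: bool        # removes matching ADB entries
    clears_bad_cache: bool  # removes cached validation failures


def flush_action(scope, name=None, view=None):
    """Build the rndc command for a flush scope.

    scope: "all" (rndc flush), "name" (rndc flushname) or
    "tree" (rndc flushtree). For "name", the ADB and bad cache are
    cleared for that name only.
    """
    if scope == "all":
        argv, adb, bad = ["rndc", "flush"], True, True
    elif scope in ("name", "tree"):
        if not name:
            raise ValueError("a name is required for flushname/flushtree")
        if scope == "name":
            argv, adb, bad = ["rndc", "flushname", name], True, True
        else:
            # flushtree leaves the ADB and bad cache untouched
            argv, adb, bad = ["rndc", "flushtree", name], False, False
    else:
        raise ValueError(f"unknown scope: {scope!r}")
    if view is not None:
        argv.append(view)
    return FlushAction(tuple(argv), adb, bad)
```

For example, flush_action("name", "ns1.example.net") yields the command for clearing a single nameserver's stale EDNS status, while checking clears_bad_cache on the "tree" result shows at a glance why flushtree alone will not clear a cached validation failure. The hostname here is a placeholder.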