DNSSEC validation and BIND 9 cache
This KB article discusses some of the problems that can be encountered by BIND 9 validating recursive servers due to intermittent problems with authoritative servers providing DNSSEC-signed zones. BIND has competing objectives when handling validation. On the one hand, it does not want to repeatedly query non-responding or faulty authoritative servers (whether the problem lies with the servers themselves, or with middleware such as firewalls or load-balancers), but on the other hand, it also needs to recover reasonably quickly after a fault is repaired.
In some situations, administrators of DNSSEC-validating recursive servers may need to take direct remedial action, rather than waiting for the built-in timeouts. This article explains what actions might help in different circumstances.
What can go wrong and why?
- Responses from authoritative servers don't include any RRSIGs.
Unsigned responses will fail validation if the parent zone has a signed DS (delegation signer) record for this zone.
- Invalid (or missing) RRSIGs will cause validation failures when the parent zone is providing a signed DS record for the zone.
Possible reasons for invalid RRSIGs are expired signatures, signatures that do not match their associated RRset, signatures that do not correspond to a valid key and so on.
- Broken chain of trust - DNSKEY records don't correspond with the DS record in the parent zone, records are signed with a different key than expected or the DNSKEY is missing entirely.
The responses will fail validation.
- Malformed responses from authoritative servers causing the validating recursive server to retry without EDNS support.
If an authoritative server responds in a broken fashion, BIND will discard its response and retry, first with a reduced UDP packet size and then without EDNS0 entirely. If the authoritative server responds properly to a query with EDNS0 disabled, BIND will mark the server as EDNS-incapable. EDNS0 is required for the recursive server to signal (via the DO option) that it would like DNSSEC-signed responses if they are available. Future queries to this authoritative server will therefore be sent without DO, its responses will omit the RRSIGs needed for DNSSEC validation, and validation will fail.
- Intermittent lack of responses from authoritative servers causing the validating recursive server to retry without EDNS support.
Intermittent timeouts when querying authoritative servers will cause BIND to retry. However, even if there is a successful response following a retry, current production versions of BIND do not mark a server as EDNS-incapable following retries and fall-back due to server timeouts alone.
Under which circumstances does BIND mark an authoritative server as EDNS-incapable?
named will record that a server does not understand EDNS if it gets a successful answer to a plain DNS query after the same server earlier returned SERVFAIL, NOTIMP or FORMERR to an EDNS query.
named will also record that a server does not understand EDNS if it receives a successful response to a plain DNS query from an authoritative server for which one of the following occurred when making an EDNS query:
- the dispatcher returned ISC_R_EOF
- the parser returned ISC_R_UNEXPECTEDEND
- the parser returned DNS_R_FORMERR
If the authoritative server simply fails to respond when queried with EDNS, named does not mark the server as EDNS-incapable, even when receiving a valid response to queries without EDNS (this prevents false-positives due to intermittent packet losses).
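The rules above can be summarized as a small decision function. This is only an illustrative sketch in Python, not named's actual code: the `Outcome` labels and the function name `marks_edns_incapable` are hypothetical, chosen to mirror the conditions listed in this section.

```python
from enum import Enum, auto


class Outcome(Enum):
    """Hypothetical labels for how a single query attempt ended."""
    SUCCESS = auto()
    SERVFAIL = auto()
    NOTIMP = auto()
    FORMERR = auto()          # server answered with rcode FORMERR
    EOF = auto()              # dispatcher returned ISC_R_EOF
    UNEXPECTED_END = auto()   # parser returned ISC_R_UNEXPECTEDEND
    PARSE_FORMERR = auto()    # parser returned DNS_R_FORMERR
    TIMEOUT = auto()          # no response at all


# Outcomes of the EDNS query that count as evidence of a broken server.
# TIMEOUT is deliberately absent: silence alone never marks a server.
EDNS_FAILURES = {
    Outcome.SERVFAIL, Outcome.NOTIMP, Outcome.FORMERR,
    Outcome.EOF, Outcome.UNEXPECTED_END, Outcome.PARSE_FORMERR,
}


def marks_edns_incapable(edns_outcome: Outcome, plain_outcome: Outcome) -> bool:
    """Return True if the server would be recorded as EDNS-incapable.

    Both conditions must hold: the EDNS query failed in one of the listed
    ways, and the follow-up plain DNS query succeeded.
    """
    return edns_outcome in EDNS_FAILURES and plain_outcome is Outcome.SUCCESS


if __name__ == "__main__":
    print(marks_edns_incapable(Outcome.FORMERR, Outcome.SUCCESS))  # True
    print(marks_edns_incapable(Outcome.TIMEOUT, Outcome.SUCCESS))  # False
```

Keeping the timeout case out of the failure set is what protects against false positives caused by intermittent packet loss.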
What is cached?
- Responses from authoritative servers (for the originally-received TTL of each RRset). This includes RRSIGs where RRsets are signed, and NSEC and NSEC3 RRsets for signed proof of non-existence.
- DNSKEY and DS RRsets (used to establish the chain of trust).
- The EDNS-capability of authoritative nameservers (for up to 30 minutes in BIND 9.0 through 9.9).
- The validation status of RRsets following successful validation; successful means that the records are either DNSSEC-authenticated or insecure (for the duration of the RRsets' TTL).
- Lameness: when following a delegation, a nameserver responds that it is not authoritative for the domain that has been delegated to it (for up to lame-ttl, default 10 minutes).
- Bad cache for DNSSEC validation failures (for at least 30 seconds, up to the period set by lame-ttl).
- Unreachable cache: this is where a slave server maintains a cache of master servers that do not respond to SOA or zone transfer queries when the slave is attempting a zone data refresh. This 'cache' area has no impact on recursive queries and is only included in this list in order to highlight that it's not relevant to recursive server behavior.
named's cache can be dumped to a disk file for viewing via the rndc utility:
rndc dumpdb -all
The output of this command is a file; by default it is named_dump.db.
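For repeated troubleshooting it can help to script the dump and then search the resulting file for the name under investigation. The following Python sketch assumes that rndc is on the PATH and that the caller is permitted to control named. The helper names `dump_cache` and `lines_mentioning` are hypothetical, and the search is a plain substring match that makes no assumptions about the dump file layout.

```python
import subprocess


def dump_cache(rndc: str = "rndc") -> None:
    """Ask named to write its cache to the dump file (needs rndc access)."""
    subprocess.run([rndc, "dumpdb", "-all"], check=True)


def lines_mentioning(dump_text: str, name: str) -> list[str]:
    """Return the dump lines that mention `name`.

    The match is a case-insensitive substring test, and a trailing dot
    on `name` is ignored so "example.com" and "example.com." both work.
    """
    needle = name.lower().rstrip(".")
    return [line for line in dump_text.splitlines() if needle in line.lower()]
```

A typical use would be to call `dump_cache()`, read named_dump.db from named's working directory (the exact path depends on the local configuration), and pass its text to `lines_mentioning`. named may still be writing the file when rndc returns, so a short wait before reading it may be needed.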
There is one recursive cache per view (unless the attach-cache option has been employed). If no views have been defined, then the recursive cache lives in the default view.
named's cache is divided into sections. The main cache contains the resource records (RRs), including RRSIG records (DNSSEC signatures); it also records the DNSSEC-validation status of cached RRs. The Address Database (ADB) section of the cache is a record of authoritative servers that named has contacted in order to resolve recursive queries from clients. Bad cache holds RRsets that have failed DNSSEC validation.
The cache dump of the main cache lists resource records (RRs) in sets (RRsets), each set prefixed with a line that indicates the level of authority with which the RRset is being held. RRsets may be replaced when new information is received from a more authoritative source (e.g. the list of nameservers for a domain received from one of the authoritative servers as an answer to a query for those servers will supersede a list of nameservers included in the additional section of a response from another server).
The cache dump of the ADB lists most nameservers by name, with various fields that indicate the TTL of the ADB entry, the IPv4 and IPv6 addresses and reachability of the server, various flags (including EDNS-capability) and the current SRTT of each address.
The ADB is keyed by server name and by address, but may also contain unassociated entries (held by IPv4 or IPv6 address alone, with no name). Unassociated entries occur either because there is no name associated with them (for example, addresses in a forwarders list) or because the name associated with an address is only retained for as long as its A/AAAA records in the main cache are unexpired. Unassociated entries will usually be reunited with their name(s) when those servers are used again during iterative query resolution.
ADB entries are maintained by server name, not by zone (this means that a problem with the ADB record for one server can impact many zones). ADB entries are retained for up to 30 minutes, and include flags for lameness, IPv4/IPv6 support and EDNS0 support as well as the SRTT (Smoothed Round Trip Time).
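Because ADB state is held per server rather than per zone, one bad entry can affect every zone that server hosts. The Python sketch below illustrates that fan-out. The function name `zones_affected_by` and the delegation data are made up for the example and do not reflect any ADB data structure inside named.

```python
def zones_affected_by(server: str, delegations: dict[str, set[str]]) -> list[str]:
    """Return the zones that list `server` among their nameservers, sorted."""
    return sorted(zone for zone, servers in delegations.items() if server in servers)


if __name__ == "__main__":
    # Made-up delegations for illustration only.
    delegations = {
        "example.com.": {"ns1.example.net.", "ns2.example.net."},
        "example.org.": {"ns1.example.net."},
        "example.edu.": {"ns.example.edu."},
    }
    # If ns1.example.net. is wrongly flagged (for example as EDNS-incapable),
    # both example.com. and example.org. can be affected.
    print(zones_affected_by("ns1.example.net.", delegations))
```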
How to clear cached entries
If there are DNSSEC validation failures as a result of unexpired cached contents, there are various techniques available to resolve the problem:
Flush the entire named cache (rndc flush). The advantage of this is that there is no need to know which entries need to be cleared - they all will be. The disadvantage is that clearing the entire cache will cause a subsequent flood of iterative queries in order to repopulate the cache with frequently-accessed records and server information. Flushing the entire cache clears all resource records (RRs), bad cache (for DNSSEC-validation failures) and also the Address Database (where named tracks the status of authoritative servers that it has queried).
Flush the cache for a specific name (rndc flushname name [view]). This flushes entries matching the specific name both from the main cache and from the ADB.
- Use the name of a specific nameserver if there are problems with e.g. the EDNS status of that server.
- Use the name of specific records that are failing validation to force re-validation on the next client query.
Flush the cache for a specific name as well as all records below that name (rndc flushtree name [view]). This will clear the matching records from the main cache, but it will not clear any names out of the ADB, so it may not be sufficient for some needs.
Restart the named daemon.
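When these remedies are scripted, the choice between the three flush commands can be made explicit. The Python sketch below only builds the rndc argument list and does not run anything. The function name `flush_command` and its `scope` values are hypothetical. The optional view argument on a plain rndc flush is not shown in this article and is included on the assumption that the local rndc accepts it.

```python
from typing import Optional


def flush_command(scope: str, name: Optional[str] = None,
                  view: Optional[str] = None) -> list[str]:
    """Build the rndc argument list for one of the flush techniques.

    scope "all"  -> rndc flush              (entire cache, bad cache and ADB)
    scope "name" -> rndc flushname NAME     (main cache and ADB for one name)
    scope "tree" -> rndc flushtree NAME     (name and everything below; not ADB)
    """
    if scope == "all":
        cmd = ["rndc", "flush"]
    elif scope in ("name", "tree"):
        if not name:
            raise ValueError(f"scope {scope!r} requires a name")
        cmd = ["rndc", "flushname" if scope == "name" else "flushtree", name]
    else:
        raise ValueError(f"unknown scope: {scope!r}")
    if view:
        # Assumption: a trailing view name restricts the flush to that view.
        cmd.append(view)
    return cmd


if __name__ == "__main__":
    # Clear a suspect EDNS flag for one nameserver (hypothetical name).
    print(" ".join(flush_command("name", "ns1.example.net")))
    # Force re-validation of a whole subtree in a view called "internal".
    print(" ".join(flush_command("tree", "example.com", "internal")))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` on a host where rndc is configured. Since flushtree leaves the ADB alone, a nameserver problem may still call for a separate flushname on the server's own name.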