checkhints: unable to get root NS rrset from cache: not found

What does this warning mean, and should I be concerned when I see it in my named logs?

checkhints: unable to get root NS rrset from cache: not found

Answer:

Your concern will depend on the circumstances. This message is logged in category general at log level warning. What has happened is that named re-primed the root NS RRset in cache only moments before but, when it comes to use the root NS RRset, the new entries are not available.
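
To make sure these warnings are being captured, category general can be routed to its own channel. This is a minimal logging sketch only - the channel name and file path here are illustrative, not defaults:

logging {
    channel general_log {
        file "/var/log/named/general.log" versions 3 size 20m;
        severity warning;       // this message is logged at warning level
        print-time yes;
        print-category yes;
        print-severity yes;
    };
    category general { general_log; };
};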

It may be logged very occasionally by busy Resolvers (Recursive Servers) that have a large volume of cache flux; their cache content is being updated frequently, with older content being expired to make room for newly learned RRsets.

If there are no operational ill-effects being reported or seen through monitoring, then this is not a problem, although it would be worth checking whether your server needs a larger max-cache-size or is experiencing unusual query patterns that lead to a higher-than-normal cache load.
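
max-cache-size is set in the options block of named.conf. A minimal sketch, with an illustrative value rather than a recommendation (it can also be expressed as a percentage of available memory):

options {
    // Size this to your workload and available RAM:
    max-cache-size 4096M;
};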

If this warning is being seen repeatedly and frequently, along with unexpected failures to resolve client queries (no response or SERVFAIL), then it is likely that your Resolver's cache has become unmanageable, although the underlying root cause may be one of several situations.

See also this engineering ticket: BIND issue #2744

What should I do if I am seeing this warning repeatedly, and my Resolver appears to be having problems responding to client queries?

Restarting named should fix the problem in the short term. A full cache flush (rndc flush) may also achieve the same outcome although, depending on the underlying problem, it may not be as effective, since it removes only the cache content, not the structure.
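
As a sketch (the service name varies by distribution - named is assumed here):

# Flush the cache content (the cache structure is retained):
rndc flush

# Or restart named entirely, rebuilding the cache from scratch:
systemctl restart named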

For the longer term, you need to investigate why this is happening so that you can take appropriate steps to prevent or mitigate recurrences.

For anyone experiencing this problem for the first time, the likelihood is that one or more things have changed in your operating environment, and that these are causing cache content to be more substantial than before, or potentially distributed differently. For example:

  • Installing a version of BIND that defaults to stale-cache-enable yes; (see the configuration sketch after this list)
  • An increase in client queries overall
  • Client query patterns changing - perhaps causing a higher rate than usual of cached negative responses
  • An increase in dual-stack clients querying for AAAA records
  • An increase in clients querying for HTTPS records
  • A new client application that uses DNS-based probing
  • Clients using a tunnelling-over-DNS service
  • Using a client filtering service that works by first resolving the original client query with another private zone name appended to it, checking the response status before allowing the original query through - thus adding the filtering RRsets to cache as well as the actual client query responses.
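
On the first point above, if a newer BIND default has enabled the stale cache and you do not intend to use serve-stale, a minimal sketch of reverting that default would be:

options {
    // Do not retain stale content in cache:
    stale-cache-enable no;
};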

Clues can be found in the BIND statistics and also in a dump of cache.

These counters (available either in the output of rndc stats or via the XML or JSON statistics interface) can be a good indicator that too much cache cleaning is taking place due to memory pressure rather than RRset TTL expiration:

DeleteLRU - "cache records deleted due to memory exhaustion"
DeleteTTL - "cache records deleted due to TTL expiration"

These are counters, so although seeing DeleteLRU far exceed DeleteTTL in a single snapshot of the stats is a good indicator that all is not well with cache, ideally you want to monitor the trend over time.
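
As a sketch, assuming the default statistics file name named.stats and a typical working directory (the path below is an assumption), a snapshot of these two counters can be taken with:

rndc stats
grep -E 'cache records deleted due to (memory exhaustion|TTL expiration)' /var/cache/bind/named.stats

Because rndc stats appends a new snapshot to the file each time it runs, successive runs against the same file make it easy to watch the trend.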

Occasionally reaching max-cache-size (deleting records due to memory exhaustion) due to a spike in client queries or query patterns is not a problem - this is how cache content management is designed to work.

You may also find these statistics interesting:

HeapMemInUse - "cache heap memory in use"
TreeMemInUse - "cache tree memory in use"
HeapMemMax - "cache heap highest memory in use"
TreeMemMax - "cache tree highest memory in use"

All of the above are gauges - they tell you 'this is where we are now' - so a snapshot can be useful, as well as monitoring the pattern over time. The 'Max' values are high-water marks.
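
Both the counters and the gauges above are also available from the statistics channel, if one has been enabled in named.conf:

statistics-channels {
    inet 127.0.0.1 port 8080 allow { 127.0.0.1; };
};

A sketch of querying the JSON interface follows - the address, port, and jq path are assumptions, and the exact JSON layout can differ between BIND versions:

curl -s http://127.0.0.1:8080/json/v1 | jq '.views._default.resolver.cachestats | {DeleteLRU, DeleteTTL, HeapMemInUse, HeapMemMax, TreeMemInUse, TreeMemMax}'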

But wait - these numbers don't add up!

Both cache heap and cache tree memory consumption count collectively towards the total that is limited by max-cache-size. But 12.5% of cache memory is reserved for the ADB (Additional Database), which is where named keeps track of the authoritative servers that it queries, their responsiveness, and other useful information.
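
As an illustration, with max-cache-size set to 8G, roughly 1 GB (12.5%) is set aside for ADB, so the heap and tree gauges for RRset data would be expected to approach only about 7 GB before memory-pressure cleaning begins.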

In addition, reaching max-cache-size is just the trigger for commencing deletion of content based on memory exhaustion. This is a soft limit, not a hard limit - cache may still exceed it because, although some RRsets will be marked as 'expired' as a result, it may not be possible to delete them right away.

Don't be tempted to look at either of these - they are not useful operationally and aren't counting what you might think they are from their names:
HeapMemTotal - "cache heap memory total"
TreeMemTotal - "cache tree memory total"

There are also counters available for what is currently in cache, broken down by RRtype.

These are prefixed with:

  • ! - counters of NXRRSETs (a pseudo-RR indicating that the queried name existed but the type did not)
  • # - stale content (in versions of BIND that have the serve-stale feature)
  • ~ - content that has expired and is waiting on housekeeping/deletion.
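
These per-RRtype counters appear in the 'Cache DB RRsets' section of the statistics dump. As a sketch (the statistics file path is an assumption):

rndc stats
sed -n '/Cache DB RRsets/,/^++/p' /var/cache/bind/named.stats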

If there is an unexpected skew, it might be worth dumping cache to see what's in there:

rndc dumpdb -all

Or - with newer versions of BIND - to also include expired content:

rndc dumpdb -expired
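
The dump is written to the file named by the dump-file option (named_dump.db in named's working directory by default; the path below is an assumption). A rough sketch for spotting a skew by RRtype - the dump format varies, so treat the field positions as approximate:

rndc dumpdb -all
# Count the most common record types in the dump (comment lines excluded):
awk '$1 !~ /^;/ && NF >= 4 {print $4}' /var/cache/bind/named_dump.db | sort | uniq -c | sort -rn | head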

And then decide - is it just that max-cache-size is now insufficient, or does something else need to be done to reduce how many RRsets are being put into cache? Tactics to consider include the following (a configuration sketch follows the list):

  • max-cache-ttl - set an upper limit on cache content TTLs
  • max-ncache-ttl - set an upper limit on negative cache content TTLs
  • max-stale-ttl - if you have the serve-stale feature enabled, how long to retain stale content in cache (noting that for positive and negative content the new 'cap' on retention becomes max-cache-ttl+max-stale-ttl and max-ncache-ttl+max-stale-ttl respectively, meaning that the proportion of negative to positive cache content may be higher than before enabling this feature).
  • Become an authoritative server for any zones that are being queried frequently and that respond with many unique negative answers that are then added to cache (inspecting your cache dump will help to identify these)
  • Become a secondary (or mirror) for the root zone
  • Use RPZ or other techniques to block client queries for domains that you would prefer to not be handling as a resolver
  • Find and correct any misconfigured clients that are making incorrect and unnecessary queries (for example, appending a specific but inappropriate domain to any/all other queries).
  • Check that you are running with a current set of root hints
  • Check that there are no connectivity problems between your Resolver and the IP addresses listed in your root hints (particularly the IPv6 destinations). This should not cause problems, but we have anecdotal evidence that fixing reachability problems and/or outdated root hints was a solution for some servers.
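
As a configuration sketch of some of the tactics above (all values are illustrative, not recommendations):

options {
    max-cache-ttl 86400;    // cap positive cache content at one day
    max-ncache-ttl 3600;    // cap negative cache content at one hour
    max-stale-ttl 3600;     // if serve-stale is enabled, keep stale content for at most an hour
};

// Mirror the root zone locally (BIND 9.14 and newer); the built-in
// primaries for the root zone are used by default:
zone "." {
    type mirror;
};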