Serve-stale Implementation Details
This article describes the serve-stale and prefetch implementations in BIND 9.17.7 and 9.16.9, and discusses how these features interact with fetch-limits and other quota mechanisms. It provides background on the logic as implemented and is not intended to give explicit guidance on how to set these parameters.
The max-stale-ttl configuration is stored in a per-view cache.
An RRset in any given cache is marked as stale during RRset lookup in that cache, if ALL of the following conditions apply:
- The RRset's TTL has expired, i.e. the RRset is "inactive."
- stale-cache-enable is set to yes in the configuration.
- The lookup time is less than or equal to the RRset's TTL + max-stale-ttl, i.e. the lookup happens within the time range denoted by (RRset's TTL, RRset's TTL + max-stale-ttl].
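The three conditions above can be sketched as a small predicate. This is an illustrative model only, not BIND's code: the function name and its parameters are hypothetical, times are plain seconds, and expires_at is assumed to be the absolute time at which the RRset's TTL runs out.

```python
def is_marked_stale(now: float, expires_at: float, max_stale_ttl: float,
                    stale_cache_enable: bool) -> bool:
    """Return True if a lookup at `now` would mark the RRset as stale.

    Hypothetical model: `expires_at` is the time at which the TTL ran out.
    """
    ttl_expired = now > expires_at                          # RRset is "inactive"
    within_stale_range = now <= expires_at + max_stale_ttl  # upper bound is inclusive
    return ttl_expired and stale_cache_enable and within_stale_range
```

All three conditions must hold at the moment of the lookup; the function has no memory of earlier lookups.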
If stale-refresh-time is zero (disabled), then:
- A lookup of a stale RRset in cache takes place only when a previous attempt to refresh the RRset from authoritative servers has failed.
- The lookup in cache happens in the same request, right after the failed attempt to refresh the RRset.
- All subsequent requests for the same RRset follow the same path: try to refresh from name servers, fail, try cache.
- Since the addition of stale-refresh-time, the default behavior in BIND is to have it enabled, with a value of 30 seconds.
If stale-refresh-time is non-zero (enabled), then a lookup MAY return a stale RRset from cache before going into recursion if:
- The RRset is marked as stale.
- A previous attempt to refresh the RRset has failed.
- The lookup happens within the stale-refresh-time period after the refresh failure.
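The check that decides whether a stale answer may be returned before recursion can be modelled roughly as follows. This is a sketch under stated assumptions, not the BIND implementation: the function and parameter names are hypothetical, times are in seconds, and treating the end of the window as inclusive is an assumption.

```python
from typing import Optional


def may_answer_stale_before_recursion(rrset_is_stale: bool,
                                      last_refresh_failure: Optional[float],
                                      now: float,
                                      stale_refresh_time: float) -> bool:
    """Model of the stale-refresh-time shortcut (hypothetical names)."""
    if stale_refresh_time == 0:
        # Disabled: stale data is only consulted after a failed refresh
        # within the same request, never before recursion.
        return False
    if not rrset_is_stale or last_refresh_failure is None:
        return False
    elapsed = now - last_refresh_failure
    # Assumption: the window boundary is treated as inclusive.
    return 0 <= elapsed <= stale_refresh_time
```

With the default of 30 seconds, a lookup 10 seconds after a failed refresh would take the shortcut, while a lookup 45 seconds after it would go back to recursion.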
Fetch-limits include the fetches-per-server and fetches-per-zone quota mechanisms.
The default action taken when a query exceeds any of the fetch-limits is to drop the query.
The response to the client when such a query is dropped varies depending on the fetch-limit triggered, as follows:
- fetches-per-server: the default action is to return a SERVFAIL to the client.
- fetches-per-zone: no responses are sent to the client; the client observes this as a timeout.
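As a compact summary of the two default behaviors just listed, the mapping can be written as a lookup table. The names below are hypothetical and the table only restates the defaults described in this article.

```python
from typing import Optional

# Client-visible result for each fetch-limit, using the defaults described above.
# None means that no response is sent and the client eventually times out.
DEFAULT_CLIENT_OUTCOME = {
    "fetches-per-server": "SERVFAIL",
    "fetches-per-zone": None,
}


def client_outcome(triggered_limit: str) -> Optional[str]:
    """Return what the client sees when the given fetch-limit drops its query."""
    return DEFAULT_CLIENT_OUTCOME[triggered_limit]
```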
Prefetching takes place in the late stage of processing a client query, in the response-building phase; more specifically, it occurs during execution of the following functions:
- query_respond_any: Build the response for a query for type ANY.
- query_addanswer: Fill the ANSWER section of a positive response.
- query_cname: Handle CNAME responses.
- query_dname: Handle DNAME responses.
Prefetching code performs some quota verification, in the following order:
- Check if the recursive-clients quota is below the soft clients value. If yes, prefetch attaches to that quota.
- If there is a fetch context already created for <qname,qtype,qclass> (let's call it curr_fctx), then:
  - fctx_num_clients = number of clients currently associated with that fetch context.
  - If the current client address matches one of the addresses currently associated with curr_fctx, drop the prefetch and log the query as duplicated.
  - Else, if the current client address doesn't match any of the addresses currently associated with curr_fctx, then check if fctx_num_clients is less than the current auto-tuned value for 'clients-per-query'; if the check fails, drop the prefetch.
  - If none of the checks above abort prefetching, attach to curr_fctx.
- If the current number of fetches for the target domain is greater than or equal to the value of fetches-per-zone, then drop the fetch.
- If the number of current queries exceeds max-recursion-queries, then drop the fetch.
- Finally, prefetch tries to find a server address to which to send the query, one that isn't over quota, i.e. a server for which the number of current fetches does not exceed the configured fetches-per-server.
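The order of the quota checks listed above can be illustrated with a function that returns the first reason for dropping a prefetch. This is a simplified sketch, not BIND's code: the PrefetchState fields and the function name are hypothetical, the checks are evaluated strictly in the listed order, a prefetch is assumed to be dropped when the recursive-clients quota is not below the soft value, and all limits are assumed to be positive numbers.

```python
from dataclasses import dataclass
from typing import Optional, Set


@dataclass
class PrefetchState:
    """Hypothetical snapshot of the counters consulted by the checks."""
    recursive_clients: int
    soft_clients_limit: int
    existing_fctx_clients: Optional[Set[str]]  # None: no fetch context exists yet
    clients_per_query_tuned: int               # current auto-tuned value
    zone_fetches: int
    fetches_per_zone: int
    current_queries: int
    max_recursion_queries: int
    server_under_quota: bool                   # some server address is not over quota


def prefetch_drop_reason(client_addr: str, s: PrefetchState) -> Optional[str]:
    """Return the first check that drops the prefetch, or None if it proceeds."""
    # Assumption: not being below the soft value means the prefetch is dropped.
    if s.recursive_clients >= s.soft_clients_limit:
        return "recursive-clients soft limit"
    if s.existing_fctx_clients is not None:
        if client_addr in s.existing_fctx_clients:
            return "duplicate query"
        if len(s.existing_fctx_clients) >= s.clients_per_query_tuned:
            return "clients-per-query"
    if s.zone_fetches >= s.fetches_per_zone:
        return "fetches-per-zone"
    if s.current_queries > s.max_recursion_queries:
        return "max-recursion-queries"
    if not s.server_under_quota:
        return "no server under quota"
    return None
```

Returning a reason string rather than a boolean makes it easy to see which quota stopped a given prefetch when experimenting with different counter values.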
Prefetch, serve-stale, and fetch-limits
How does prefetch interact with fetch-limits?
Prefetch is dropped if either the fetches-per-server or the fetches-per-zone quota is reached.
It is also dropped if any of the following quotas are reached:
- clients-per-query (actually, the value used is a self-adjusted one between clients-per-query and max-clients-per-query)
How does serve-stale interact with fetch-limits?
If stale-refresh-time is zero (disabled), then if a query is dropped due to fetch-limits, no lookup in cache for stale RRsets takes place.
If stale-refresh-time is non-zero (enabled), then stale RRsets are returned before fetch-limits apply, but only if all of the stale-refresh-time constraints already stated apply.
In other words, fetch-limits do not affect the returning of stale RRsets that are eligible under stale-refresh-time (if enabled), as those cache lookups take place before fetch-limit restrictions are applied.
What is the logic path if content has expired and a client query comes in that would normally trigger a fetch (which ought to fail and lead to the content being marked for serving stale), but that fetch never happens because it is dropped due to fetch-limits?
- If the content (RRset) has expired and a query comes in asking for it, then assuming the RRset is not yet marked as stale, the following steps take place:
- If stale-cache-enable is yes, and if the query lookup time is within the period between the expired RRset's TTL and TTL + max-stale-ttl, then mark the RRset as stale.
- If stale-cache-enable is no, then the lookup skips this record (as it is expired).
- Proceed with the fetch, which is dropped due to fetch-limits.
- Respond to the client (or not), depending on which fetch-limit was triggered (see the behavior described in the beginning of this document).
Basically, fetch-limits prevent the returning of stale records, for a couple of reasons:
- A query dropped due to fetch-limits won't activate stale-refresh-time, as this is not considered a real failure in contacting the name servers in an attempt to refresh the given RRset.
- A query dropped due to fetch-limits is not followed by an attempt to retrieve a stale RRset from cache, for the same reason stated above.
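The ordering described in this section (stale-refresh-time shortcut first, then fetch-limits, then a real refresh attempt) can be summarized in a short decision function. This is an illustrative model with hypothetical names, not BIND's code; in_stale_refresh_window stands for "stale-refresh-time is non-zero and a real refresh failure happened within that period", and the final "resolution failure" outcome is a placeholder assumption.

```python
from typing import Optional


def answer_path(rrset_is_stale: bool,
                in_stale_refresh_window: bool,
                fetch_limit_hit: Optional[str],
                refresh_succeeds: bool) -> str:
    """Return which path a query for an expired RRset takes (simplified model)."""
    # 1. The stale-refresh-time shortcut is evaluated before fetch-limits.
    if rrset_is_stale and in_stale_refresh_window:
        return "stale answer from cache"
    # 2. A fetch-limit drop is not a real refresh failure: no stale lookup
    #    follows and stale-refresh-time is not activated.
    if fetch_limit_hit is not None:
        return f"dropped by {fetch_limit_hit}"
    # 3. The fetch proceeds and either refreshes the RRset or really fails.
    if refresh_succeeds:
        return "fresh answer"
    # 4. Only a real failure leads to a lookup of the stale RRset in cache.
    return "stale answer from cache" if rrset_is_stale else "resolution failure"
```

For example, a stale RRset outside the refresh window combined with a fetches-per-zone drop yields "dropped by fetches-per-zone", even though usable stale data is sitting in the cache.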
For a query dropped in this situation, does BIND initiate serve-stale for this RRset?
- The RRset is marked as stale, but unfortunately it remains unused until a real refresh failure happens. Such a failure either activates stale-refresh-time or is followed, in the same request, by an attempt to find the stale RRset in cache. This is something we can address by exempting the first query from fetch-limits.