Serve-stale Implementation Details
  • 25 Nov 2020

This article describes the serve-stale and prefetch implementations in BIND 9.17.7 and 9.16.9, and discusses how these features interact with fetch-limits and other quota mechanisms. It provides background on the logic as implemented and is not intended to give explicit guidance on how to set these parameters.

Serve-stale

  • The max-stale-ttl configuration is stored in a per-view cache.

  • An RRset in any given cache is marked as stale during RRset lookup in that cache, if ALL of the following conditions apply:

    1. The RRset's TTL has expired, i.e. the RRset is "inactive."
    2. stale-cache-enable is set to yes in the configuration.
    3. The lookup time is no later than the expiry of the RRset's TTL plus max-stale-ttl, i.e. the lookup happens within the time range (TTL expiry, TTL expiry + max-stale-ttl].
  • If stale-refresh-time is zero (disabled), then:

    1. A lookup of the stale RRset in cache takes place only after an attempt to refresh the RRset from the authoritative servers has failed.
    2. The lookup in cache happens in the same request, right after the failed attempt to refresh the RRset.
    3. All subsequent requests for the same RRset follow the same path: try to refresh from the name servers, fail, try the cache.
    4. Note that this is not the default: since stale-refresh-time was added, BIND has it enabled by default, with a value of 30 seconds.
  • If stale-refresh-time is non-zero (enabled), then a lookup MAY return a stale RRset from cache before going into recursion if:

    1. The RRset is marked as stale.
    2. A previous attempt to refresh the RRset has failed.
    3. The lookup happens during the period stale-refresh-time after the refresh failure.
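
The marking and lookup rules above can be sketched as a small Python model. This is an illustration under stated assumptions, not BIND's code: the CachedRRset class, its fields, and both function names are hypothetical, all times are plain seconds on a single clock, and only the option names (stale-cache-enable, max-stale-ttl, stale-refresh-time) come from the text.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CachedRRset:
    """Hypothetical cache entry; all times are seconds on one clock."""
    expires_at: float                              # when the RRset's TTL runs out
    stale: bool = False
    last_refresh_failure: Optional[float] = None   # time of last failed refresh


def mark_stale_on_lookup(rrset, now, stale_cache_enable, max_stale_ttl):
    """Mark the RRset stale during lookup only if all three conditions hold."""
    expired = now > rrset.expires_at
    within_stale_range = now <= rrset.expires_at + max_stale_ttl
    if expired and stale_cache_enable and within_stale_range:
        rrset.stale = True
    return rrset.stale


def may_answer_stale_before_recursion(rrset, now, stale_refresh_time):
    """True if a stale answer may be returned before going into recursion."""
    if stale_refresh_time == 0:
        # Disabled: stale data is only consulted after a failed refresh
        # within the same request, never ahead of recursion.
        return False
    if not rrset.stale or rrset.last_refresh_failure is None:
        return False
    elapsed = now - rrset.last_refresh_failure
    return 0 <= elapsed <= stale_refresh_time
```

With stale-refresh-time set to 30, an RRset whose refresh failed at t=150 may be answered from cache at t=170 but not at t=200, at which point the resolver tries the authoritative servers again.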

Prefetch

Fetch-limits include the fetches-per-server and fetches-per-zone quota mechanisms.

The default action taken when a query exceeds any of the fetch-limits is to drop the query.

The response to the client when such a query is dropped varies depending on the fetch-limit triggered, as follows:
- fetches-per-server: the default action is to return a SERVFAIL to the client.
- fetches-per-zone: no responses are sent to the client; the client observes this as a timeout.
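
As a quick reference, the default client-visible outcomes listed above can be written as a lookup table. This is only a restatement of the text in Python; the ClientOutcome enum and the client_outcome function are hypothetical names, not BIND identifiers.

```python
from enum import Enum


class ClientOutcome(Enum):
    SERVFAIL = "SERVFAIL returned to the client"
    TIMEOUT = "no response sent; the client observes a timeout"


# Default behavior per fetch-limit, as described in the text above.
DEFAULT_OUTCOME = {
    "fetches-per-server": ClientOutcome.SERVFAIL,
    "fetches-per-zone": ClientOutcome.TIMEOUT,
}


def client_outcome(triggered_limit: str) -> ClientOutcome:
    """Return what the client sees when the given fetch-limit drops its query."""
    return DEFAULT_OUTCOME[triggered_limit]
```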

Prefetching takes place in the late stage of processing a client query, in the response-building phase; more specifically, it occurs during execution of the following functions:

  • query_respond_any - Build the response for a query for type ANY.
  • query_addanswer - Fill the ANSWER section of a positive response.
  • query_cname - Handle CNAME responses.
  • query_dname - Handle DNAME responses.

Prefetching code performs some quota verification, in the following order:

  1. Check if the recursive-clients quota is below the soft clients value. If yes, prefetch attaches to the recursive-clients quota.
  2. If there is a fetch context already created for <qname,qtype,qclass> (let's call it curr_fctx), then:
    - Let fctx_num_clients = number of clients currently associated with that fetch context.
    - If the current client address matches one of the addresses currently associated with curr_fctx, drop the prefetch and log the query as duplicated.
    - Otherwise, check whether fctx_num_clients is less than the current auto-tuned value for 'clients-per-query'; if it is not, drop the prefetch.
    - If none of the checks above abort prefetching, attach to curr_fctx and proceed.
  3. If the current number of fetches for the target domain is greater than or equal to the value of fetches-per-zone, then drop the fetch.
  4. If the number of current queries exceeds max-recursion-queries, then drop the fetch.
  5. Finally, prefetch tries to find a server address to send the query to, one that isn't over quota, i.e. a server for which the number of current fetches does not exceed the configured fetches-per-server limit. If no such server is found, the prefetch is dropped.
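
The order of the checks above can be sketched as a single Python function. This is a simplified model that follows the list as written and makes no claim about how BIND structures the code internally: FetchContext, ResolverState, their fields, and prefetch_decision are hypothetical names. The function only reports a decision and does not update any counters, and it assumes non-zero limits (a value of 0, which disables fetches-per-zone and fetches-per-server in BIND, is not modeled).

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Set


@dataclass
class FetchContext:
    """Hypothetical stand-in for an existing fetch for <qname,qtype,qclass>."""
    client_addresses: Set[str] = field(default_factory=set)


@dataclass
class ResolverState:
    """Hypothetical snapshot of the counters and limits named in the text."""
    recursive_clients_in_use: int
    recursive_clients_soft_limit: int
    clients_per_query_tuned: int        # current auto-tuned clients-per-query
    zone_fetches: int                   # current fetches for the target domain
    fetches_per_zone: int
    current_queries: int
    max_recursion_queries: int
    server_fetches: Dict[str, int]      # server address -> current fetches
    fetches_per_server: int


def prefetch_decision(state: ResolverState, client_addr: str,
                      curr_fctx: Optional[FetchContext] = None) -> str:
    """Return a short string describing what happens to the prefetch."""
    # 1. recursive-clients usage must be below the soft limit.
    if state.recursive_clients_in_use >= state.recursive_clients_soft_limit:
        return "drop: recursive-clients"

    # 2. Checks against an already existing fetch context, if any.
    if curr_fctx is not None:
        if client_addr in curr_fctx.client_addresses:
            return "drop: duplicate query"
        if len(curr_fctx.client_addresses) >= state.clients_per_query_tuned:
            return "drop: clients-per-query"
        # Otherwise the prefetch would attach to curr_fctx and proceed.

    # 3. fetches-per-zone
    if state.zone_fetches >= state.fetches_per_zone:
        return "drop: fetches-per-zone"

    # 4. max-recursion-queries
    if state.current_queries > state.max_recursion_queries:
        return "drop: max-recursion-queries"

    # 5. Pick the first server that is not over its fetches-per-server quota.
    for server, in_flight in state.server_fetches.items():
        if in_flight <= state.fetches_per_server:
            return f"send to {server}"
    return "drop: fetches-per-server"
```

Returning the name of the limit that fired makes it easy to see which quota stopped a given prefetch when experimenting with different values.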

Prefetch, serve-stale, and fetch-limits

How does prefetch interact with fetch-limits?
Prefetch is dropped if either the fetches-per-server or the fetches-per-zone quota is reached.
It is also dropped if any of the following quotas are reached:
- recursive-clients
- clients-per-query (actually, the value used is a self-adjusted one between clients-per-query and max-clients-per-query).
- max-recursion-queries

How does serve-stale interact with fetch-limits?
- If stale-refresh-time is zero (disabled), then if a query is dropped due to fetch-limits, no lookup in cache for stale RRsets takes place.
- If stale-refresh-time is non-zero (enabled), then the returning of stale RRsets takes place before fetch-limits apply, but only if all stale-refresh-time constraints already stated apply.

In other words, fetch-limits do not affect the returning of stale RRsets that are eligible under stale-refresh-time (if enabled), as those cache lookups take place before fetch-limit restrictions are applied.

What is the logic path if content has expired and a client query comes in that would normally trigger a fetch, but that fetch is dropped because of fetch-limits? (Had it run, the fetch ought to have failed and led to the content being used for serving stale.)

  • If the content (RRset) has expired and a query comes in asking for it, then assuming the RRset is not yet marked as stale, the following steps take place:
    1. If stale-cache-enable is yes, and if the query lookup time falls between the expiry of the RRset's TTL and that expiry plus max-stale-ttl, then mark the RRset as stale.
    2. If stale-cache-enable is no, then the lookup skips this record (as it is expired).
    3. Proceed with the fetch, which is dropped due to fetch-limits.
    4. Respond to the client (or not), depending on which fetch-limit was triggered (see the fetch-limits behavior described earlier in this document).

Basically, fetch-limits prevent the returning of stale records, for a couple of reasons:

  1. A query dropped due to fetch-limits won't activate stale-refresh-time, as this is not considered a real failure in contacting the name servers in an attempt to refresh the given RRset.
  2. A query dropped due to fetch-limits is not followed by an attempt to retrieve the stale RRset from cache, for the same reason.
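
The difference between a real refresh failure and a fetch-limit drop can be illustrated with a toy Python model. It is a sketch under assumptions, not BIND's logic: StaleEntry, handle_query, and the fetch_result labels are hypothetical, the entry is assumed to be within max-stale-ttl for the whole run, and the "resolution failure" result for an entry with no stale data is an assumption of the model.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StaleEntry:
    """Hypothetical expired cache entry that is still within max-stale-ttl."""
    stale: bool = True
    refresh_window_started_at: Optional[float] = None


def handle_query(entry, now, stale_refresh_time, fetch_result):
    """Model one client query.

    fetch_result is what the fetch would do if it is attempted:
    "ok", "failed", or "dropped-by-fetch-limit" (hypothetical labels).
    """
    # Inside an active stale-refresh-time window the stale answer is
    # returned before recursion, so fetch-limits are never consulted.
    started = entry.refresh_window_started_at
    if (stale_refresh_time > 0 and entry.stale and started is not None
            and 0 <= now - started <= stale_refresh_time):
        return "stale answer from cache"

    if fetch_result == "ok":
        entry.stale = False
        entry.refresh_window_started_at = None
        return "fresh answer"

    if fetch_result == "failed":
        # A real refresh failure: look for the stale RRset in the same
        # request and start the stale-refresh-time window.
        entry.refresh_window_started_at = now
        return "stale answer from cache" if entry.stale else "resolution failure"

    # Dropped by fetch-limits: not treated as a refresh failure, so no
    # stale lookup takes place and no window is started.
    return "dropped (SERVFAIL or timeout)"
```

In this model, a query dropped at t=0 gets no stale answer even though the entry is marked stale; once a real failure occurs at t=1, queries up to t=31 are answered from cache regardless of fetch-limits.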

For a query dropped in this situation, does BIND initiate serve-stale for this RRset?

  • The RRset is marked as stale, but unfortunately it remains unused until a real refresh failure happens. Such a failure is followed by an attempt to find the stale RRset in cache and, if stale-refresh-time is enabled, can activate the stale-refresh-time window. This is something we can address by exempting the first query from fetch-limits.