BIND Best Practices - Authoritative

1) Run BIND on a server dedicated to DNS only

Reasons include:

- Minimized risk of impact to DNS services due to other applications consuming server resources (perhaps due to an attack on those services or application error).
- Conversely, minimized risk to other applications as a result of BIND consuming all system or network resources.
- Reduced likelihood of unauthorized access to the DNS server (e.g. via a code defect and root access exploit made possible via another application).
- Improved ability to monitor DNS server performance (since the server is dedicated to one service).
- Improved ability to troubleshoot problems.

2) Run separate authoritative and recursive DNS servers

Do not combine authoritative and recursive nameserver functions -- have each function performed by separate server sets. This advice primarily concerns separation of public-facing authoritative services from internal client-facing recursive services - administrators may, for convenience, choose to serve some internal-only zones authoritatively from their recursive servers, having determined that the benefit outweighs any risks associated with this policy.

If recursive and authoritative functions share one server, a problem that affects only authoritative service (for example, one that causes all of your authoritative servers to fail) will break your recursive service too.

Run multiple distributed authoritative servers, avoiding single points of failure in critical resource paths. Various strategies (including anycast and load-balancing) are available to ensure robust geographic and network diversity in your deployment.

3) Choose appropriate software and hardware

- Run currently supported version(s) of BIND in your environment.
- Subscribe to bind-announce@lists.isc.org to get notified of BIND 9 software updates and security issues.
- Run currently supported version(s) of your chosen operating system.
- Ensure that system outbound network buffers are large enough to handle your rates of outbound traffic. Some OS implementations (some versions of Linux in particular) assume low rates of outbound network traffic by default, but an authoritative server will often respond with significantly larger packets than the queries it received, particularly for signed zones.
- Run a multi-threaded BIND build and launch named with an appropriate number of task threads tuned for the hardware and CPU architecture.
- Ensure (and confirm through testing) that your infrastructure supports EDNS0 and large UDP packet sizes.

4) Prevent external access to internal data by design

Originally, DNS was designed to provide the same data to all clients from all servers. Later, the idea of a split namespace was introduced and implemented in BIND as “views”. Views are often used to separate internal devices from external devices. For example, lab01printer.example.com might be visible from the inside, and possibly via a VPN connection, but it is probably not something you would want to have visible from the outside.

Problems usually occur in two places with views:

1. an accidental “leak” of data that is internal only
2. clients that expect the internal view but receive the external view

Number 1 is usually caused by an incorrect access control list (ACL) that allows the internal zone to be transferred to one or more external-facing servers. This failure may not be obvious without specific testing, for example, creating a canary entry that only resolves in one configuration and then testing for it from locations that should NOT be able to resolve it.

Number 2 occurs when a VPN fails to connect or, when connected, is still seen by the DNS server as “outside” the list of internal networks. Both of these cases boil down to network infrastructure issues.
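
As an illustration of the view mechanism discussed above, here is a minimal sketch of a two-view configuration. The ACL name, address ranges, and file paths are assumptions chosen for the example, not values from any real deployment.

```
# Hypothetical internal address ranges; remember to include VPN pools,
# otherwise VPN clients fall through to the external view.
acl "internal-nets" {
	10.0.0.0/8;
	192.168.0.0/16;
};

# Views are tried in order; the first match-clients that matches wins.
view "internal" {
	match-clients { "internal-nets"; };
	recursion no;
	zone "example.com" {
		type primary;	# "master" on older BIND 9 releases
		file "internal/example.com.db";	# can carry a canary record for leak tests
		allow-transfer { none; };	# never hand the internal copy to another server
	};
};

view "external" {
	match-clients { any; };	# everyone not matched above
	recursion no;
	zone "example.com" {
		type primary;
		file "external/example.com.db";
	};
};
```

A canary name that exists only in the internal zone file can then be queried from an outside location; if it resolves there, an ACL or transfer setting is leaking internal data.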

While internal vs. external DNS names using views are nearly ubiquitous these days, splitting your DNS into internal and external zones solves the problem in a very obvious and safe way. For example, internal names would live only in int.example.com and sub-zones of int, and external name servers would be configured without any knowledge of the int zone.
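
The separate-zone approach might look something like the following sketch. The ACL name, address ranges, and file names are hypothetical; the point is only that the int zone is defined on the internal servers and nowhere else.

```
# --- named.conf fragment for an internal-only server ---
acl "internal-nets" { 10.0.0.0/8; 192.168.0.0/16; };	# hypothetical ranges

zone "int.example.com" {
	type primary;	# "master" on older BIND 9 releases
	file "int.example.com.db";
	allow-query { "internal-nets"; };
	allow-transfer { none; };	# or list only your internal secondaries
};

# --- named.conf fragment for an external-facing server ---
# No "int.example.com" zone statement appears here, and the external
# example.com zone file carries no delegation for int.
zone "example.com" {
	type primary;
	file "example.com.db";
};
```

With this layout there is no view-selection logic to get wrong: an external server cannot leak internal names because it never loads them.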

5) Take basic security measures

Run BIND as an unprivileged user.
To open low-numbered UDP and TCP ports, BIND must be launched as root, but an alternate uid can be specified using the -u command line argument; after opening needed resources named will change its runtime uid to an unprivileged account.
If you follow the preceding advice (running BIND as an unprivileged user on a dedicated server), chrooting is "de-emphasized." Our operations experts feel that chrooting does not substantially improve security under those conditions and do not affirmatively recommend it, but they do not explicitly discourage it.
Use BIND access control mechanisms, such as address match lists, to restrict recursive query service to known and authorized clients. Ideally, your Internet-facing authoritative servers should not perform recursion for any clients at all.
Consider DNSSEC-signing your public authoritative zones. (Recursive servers will then be able to use DNSSEC validation to authenticate your records.) DNSSEC signing does imply additional ongoing maintenance. However, if you operate a service with an increased risk of impersonation - such as a financial service or any public service where the user needs to be sure the resource is really your resource - the effort of signing may be well worth it.
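
A minimal sketch of the last two recommendations on an Internet-facing authoritative server follows. It assumes BIND 9.16 or later (for dnssec-policy) and uses a hypothetical zone file name; treat it as a starting point rather than a complete signing setup.

```
options {
	recursion no;	# answer only from zones this server is authoritative for
};

zone "example.com" {
	type primary;	# "master" on older BIND 9 releases
	file "example.com.db";
	dnssec-policy default;	# built-in signing policy
	inline-signing yes;	# sign a copy; the source zone file stays unsigned
};
```

On a separate recursive server, the equivalent restriction would be an allow-recursion statement naming an ACL of trusted client networks. Signing also requires publishing DS records in the parent zone, which is part of the ongoing maintenance mentioned above.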

6) Prepare for abuse of any external-facing servers

Response Rate Limiting (RRL) can significantly mitigate some DDoS attacks on your authoritative server. For information on Response Rate Limiting, see "A Quick Introduction to Response Rate Limiting".

There are several tuning parameters for RRL, but generally, the default settings are good.

RRL works by sorting responses into different buckets. Each bucket holds the IP address (or a collection of addresses) to which the response is being sent. When a given number of identical responses are seen within a certain length of time in a single bucket, the responses to hosts in that bucket are limited. The tunable parameters include the number of identical responses before limiting is triggered, the length of time a response stays in the bucket, and the size of the network that each bucket contains.

It is impractical to create one bucket per IP address. The default bucket size is a /24 network (256 IP addresses) for IPv4 and a /56 network (256 networks of 18,446,744,073,709,551,616 addresses each) for IPv6. These bucket sizes represent common subnet sizes for each of the address families.

There are some circumstances under which these defaults may be unsuitable, most revolving around the use of NAT on IPv4 networks. If you discover that you are rate-limiting innocent hosts because they share a single NAT’d IP address with a large number of other hosts, you can either change the bucket size or exempt the network(s) by adding them to the exempt-clients list:

options {	# or: view "view_name" {
	# ...
	rate-limit {
		slip 2;	# Every other response truncated
		window 15;	# Seconds per bucket
		responses-per-second 5;	# Number of good responses per prefix-length/sec
		referrals-per-second 5;	# referral responses
		nodata-per-second 5;	# nodata responses
		nxdomains-per-second 5;	# nxdomain responses
		errors-per-second 5;	# Error responses
		all-per-second 20;	# When we drop all
		log-only no;	# "yes" enables debugging mode
		qps-scale 250;	# (qps-scale / current qps) * per-second = scaled limit
		exempt-clients { 127.0.0.1; 192.153.154.0/24; 192.160.238.0/24; };
		ipv4-prefix-length 24;	# Define the IPv4 block size
		ipv6-prefix-length 56;	# Define the IPv6 block size
		max-table-size 20000;	# 40 bytes * this number = max memory
		min-table-size 500;	# pre-allocate to speed startup
	};
	# ...
};

The example above provides all of the tunable parameters, but as noted, the most useful for initial tuning are the “responses-per-second”, “ipv4-prefix-length” and “exempt-clients”.

The “log-only” option can be set to “yes” to test a configuration without actually rate-limiting any responses.
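
For a first trial, a much shorter fragment is usually enough. The sketch below sets only the three parameters singled out above and enables log-only so that nothing is actually limited; the exempted network is a hypothetical placeholder.

```
options {
	rate-limit {
		responses-per-second 5;
		ipv4-prefix-length 24;
		exempt-clients { 127.0.0.1; 192.0.2.0/24; };	# hypothetical NAT'd network
		log-only yes;	# dry run: log what would be limited, limit nothing
	};
};
```

After reviewing the logs for clients that would have been limited, switch log-only to no (or remove the line) to enforce the limits.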

A newer feature worth considering for mitigating amplification attacks is 'minimal-any'. Like RRL, this feature is not enabled by default.
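
Enabling minimal-any takes a single line, which can go in the options block as sketched here or inside a view.

```
options {
	minimal-any yes;	# answer ANY queries over UDP with a single RRset
};
```
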
Provision sufficient capacity to handle burst traffic up to 20x normal level. This overcapacity will help your system withstand spikes in both legitimate and abusive traffic.

Excess capacity must take into account not only server CPU and memory resources but also send and receive capacity along the entire network path.

Consider the length of the TTLs on the delegation records that you manage within your zones and on those provided by the parent zones that delegate authority to your nameservers. Longer TTLs protect the visibility of a zone, including when the parent zone's nameservers are under attack, but shorter ones allow for a faster change of nameservers. See https://www.dns-oarc.net/index.php/oarc/mitigating-dns-denial-of-service-attacks for more information.

In most instances we would not recommend using inbound packet filtering for authoritative nameservers; Response Rate Limiting is the recommended solution. However, there are some circumstances where filtering at very high inbound packet rates can be helpful. Please contact ISC if you think you might benefit from our operational experience in this area.

7) Monitor the service

Put in place monitoring scripts to continually check the health of servers and alert if conditions change substantially.

Conditions to monitor include:
- process presence
- CPU utilization
- memory usage
- network throughput and buffering (inbound/outbound)
- filesystem utilization (on the log filesystem and also the filesystem containing the named working directory)

Logs should be examined periodically for error and warning messages, which may give early warning of incipient problems before they become critical.
Review the logging configuration to ensure it meets your requirements. BIND's logging defaults are generally sane (passing most of the work to syslog) but may not align with organizational policy and/or desired data collection/retention standards.
When using size-limited files for logging, plan the size of the files and number to retain so that an increased level of logging due to a problem is unlikely to cause the logs from the start of the problem to become unavailable. The exact settings will depend on how quickly problems can be detected and the details of the baseline retention policy.
Query logging adds substantial overhead (on the order of 10x) and should only be enabled after careful consideration.
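
The size-limited logging advice above could be expressed roughly as follows. The channel name, path, file size, and version count are assumptions for illustration; choose values that match your own detection times and retention policy.

```
logging {
	channel general_log {
		# Keep up to 10 rotated files of about 50 MB each.
		file "/var/log/named/general.log" versions 10 size 50m;
		severity info;
		print-time yes;
		print-severity yes;
		print-category yes;
	};
	category default { general_log; };
};

options {
	querylog no;	# leave query logging off unless you have decided you need it
};
```

The product of file size and version count is the total history available, so size it to cover the period between the start of a problem and the moment someone is likely to look.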

8) Consider a nanny

By design and for security purposes, BIND's most common failure mode is intentional process termination when it encounters an inconsistent state. If you do not have 24-hour operations support (and possibly even if you do), an automated minder process capable of restarting BIND intelligently is recommended. It is especially helpful if any such script can checkpoint and archive the logs when this happens.

9) Prepare for troubleshooting

Prior to any trouble, ensure a strategy is in place for collecting post-mortem information if a server encounters a problem. This includes:
- Building named with debug symbols enabled.
- Enabling the BIND XML statistics channel for easy data collection.
- Designing an appropriate logging strategy and reserving sufficient space on the log filesystem for information to be collected for a significant context period before an event (several hours at least; 24 hours or more preferred).
- Ensuring that the uid under which named is running has sufficient permission to write a core image to its working directory if it crashes with a segmentation fault, and to write named.dump or named.run files if requested by the operator.
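
The statistics channel mentioned in the list above can be enabled with a short statement such as the one sketched here. The port number is an arbitrary choice for the example, and access is restricted to the local host.

```
statistics-channels {
	# Serve statistics over HTTP on the loopback address only.
	inet 127.0.0.1 port 8053 allow { 127.0.0.1; };
};
```
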
See "What to do with a misbehaving BIND server" and "What to do if your BIND or DHCP server has crashed" for guidance on troubleshooting problems and the type of information that is useful to collect in those circumstances.
Observe query loads periodically to establish baseline expectations. This will enable you to monitor for anything unusual - as defined by the range of 'normal' for your specific operational environment.
Have a strategy that includes a planned upgrade path, so that you can take advantage of improved features and functionality, and a plan for how you will respond if a security advisory is released that could affect your servers and services. See "Which version of BIND do I want to download and install?" for more information.

10) Additional measures for high availability

Our general advice for security practices is included in the list above. However, many large production environments with mission-critical DNS needs may opt to run servers on multiple hardware and OS platforms to increase the "eco-diversity" of their DNS infrastructure. This can also include running different versions of BIND, for resilience against defects that do not affect all currently supported versions.

Many service providers offer "DNS secondary service" to publish your zones. In this situation, you continue to manage your own zones but keep copies updated at the service provider. This option is worth considering for added resilience and extra capacity.
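
On your side, feeding a secondary service provider generally comes down to permitting zone transfers and sending notifies. The sketch below assumes a TSIG-authenticated transfer; the key name, secret, and provider address are all placeholders, and your provider's documentation determines the real values.

```
key "xfer-key" {
	algorithm hmac-sha256;
	secret "c2VjcmV0LXBsYWNlaG9sZGVy";	# placeholder; generate a real one with tsig-keygen
};

zone "example.com" {
	type primary;	# "master" on older BIND 9 releases
	file "example.com.db";
	allow-transfer { key "xfer-key"; };	# only holders of the key may transfer
	also-notify { 198.51.100.53; };	# hypothetical provider transfer address
	notify yes;
};
```
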
We only recommend anycasting in very large deployments or if you already have experience with anycasting.

The concept of anycast is easy to grasp: A netblock is announced on your network in such a way that it appears at more than one location. Someone trying to reach an address in that netblock is routed to the service at the closest (network topology-wise) location to them.

This is very useful in complex networks where there may be tens, hundreds or even thousands of networks, each with its own name server. Put your name servers into the anycast network, configure your network correctly, and provide your clients with the single anycast nameserver address: all of your name servers then share a single network address, and each client uses the closest one.

The initial configuration of the anycast DNS instances must take into account some additional issues. These issues include the ability to quickly withdraw the anycast route from your network in the case of a DNS server malfunction, the ability to transfer zone data between anycast server instances correctly, and the ability of your support team to debug issues that stem from different clients using different servers.

If a DNS server malfunctions - hung, crashed, unable to provide correct data - the route advertisement to that specific DNS server must be quickly (and automatically) removed. Clients attempting to resolve DNS names using the affected server should be routed to other instances.

Debugging a client DNS issue in an anycast network is much more complex and involves many more hands and eyes than does debugging an issue on a traditional network. The most vital concern: When debugging in an anycast environment, be absolutely sure that you and the client with the issue are looking at the same server.