What causes "refresh: failure trying master ...: operation canceled" error messages?
  • 24 Jun 2021

Problem:

Multiple operators have reported that, on some Linux systems running BIND as a secondary, zones can fall behind the primary and messages like the following are logged:

zone my.example.zone/IN: refresh: failure trying master 10.1.2.3#53 (source 0.0.0.0#0): operation canceled
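To get a feel for how often this happens and which zones are affected, the log can be scanned for the message. The following is a minimal sketch, not an ISC tool: the regular expression is modeled on the single log line quoted above and may need adjusting for other BIND versions or logging configurations, and the function name is hypothetical.

```python
import re
from collections import Counter

# Pattern modeled on the log line quoted in this article (an assumption;
# the exact layout may differ between BIND versions).
CANCELED_RE = re.compile(
    r"zone (?P<zone>\S+)/(?P<zclass>\w+): refresh: failure trying master "
    r"(?P<primary>\S+)#(?P<port>\d+) \(source \S+\): operation canceled"
)


def count_canceled_refreshes(lines):
    """Count 'operation canceled' refresh failures per (zone, primary)."""
    counts = Counter()
    for line in lines:
        match = CANCELED_RE.search(line)
        if match:
            counts[(match.group("zone"), match.group("primary"))] += 1
    return counts
```

Passing an open log file (for example `count_canceled_refreshes(open(path))`, where `path` is wherever your named log is written) yields a counter whose largest entries are the zones most affected.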

Solution:

Also check your MTU settings

Since this article was first written, we have received reports that incorrect (too large) MTU settings can also cause this error message. We have not tested this to verify or further characterize the behavior, but we would expect an MTU problem to affect traffic outside of BIND as well, whereas the situation described below does not seem to.
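As a quick first check, the configured MTU of each interface can be read from sysfs on Linux. This is a sketch under assumptions: the 1500-byte threshold is only the common Ethernet default, not a value from the reports we received, and whether a given MTU is "too large" depends on your network path. The function name and parameters are hypothetical.

```python
from pathlib import Path


def interfaces_above(limit=1500, sysfs_root="/sys/class/net", skip=("lo",)):
    """Return {interface: mtu} for interfaces whose MTU exceeds `limit`.

    `limit` defaults to 1500 (an assumption; pick the value that is
    correct for your network). Loopback is skipped by default because
    its MTU is normally very large.
    """
    flagged = {}
    for entry in sorted(Path(sysfs_root).iterdir()):
        mtu_file = entry / "mtu"
        if entry.name in skip or not mtu_file.is_file():
            continue
        mtu = int(mtu_file.read_text().strip())
        if mtu > limit:
            flagged[entry.name] = mtu
    return flagged
```

An interface showing up here is not proof of a problem; it only marks a setting worth comparing against what the rest of the path supports.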

For the most part, this message doesn't indicate a serious problem. BIND will retry the refresh operation, either when it receives another NOTIFY from the primary or when the refresh/retry timer triggers, and usually that succeeds and the zones don't get too far behind.
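To confirm that a secondary really has caught up after a retry, its SOA serial can be compared with the primary's. The sketch below only shows the comparison, using the serial number arithmetic of RFC 1982 so that wrap-around is handled; how the two serials are obtained (for example by querying each server for the zone's SOA record) is left out, and the function name is hypothetical.

```python
SERIAL_BITS = 32
HALF = 2 ** (SERIAL_BITS - 1)


def serial_is_behind(secondary_serial, primary_serial):
    """Return True if the secondary's serial is older than the primary's.

    Uses RFC 1982 serial number arithmetic on 32-bit SOA serials.
    Serials exactly 2**31 apart are undefined in RFC 1982 and are
    reported here as not behind.
    """
    s1, s2 = secondary_serial, primary_serial
    return (s1 < s2 and s2 - s1 < HALF) or (s1 > s2 and s1 - s2 > HALF)
```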

The cases that we have been able to troubleshoot lead us to believe that the problem is caused by one of the Linux netfilter kernel modules, which sometimes erroneously generates a DROP when BIND sends the SOA query that is part of the zone refresh. This causes the sendmsg(2) call to return EPERM, which then results in the above error message.
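The failure mode can be illustrated outside of BIND. The sketch below is not BIND's code; it only shows how an application sees a send that the kernel refused with EPERM, which is the condition described above. The function name and the `sock` parameter (which exists so that a socket can be injected) are hypothetical.

```python
import errno
import socket


def try_send(payload, address, sock=None):
    """Send one UDP datagram with sendmsg and classify the outcome.

    Returns "sent" on success and "eperm" if the kernel refused the
    send with EPERM (as happens when a packet filter drops it on the
    way out). Any other OSError is re-raised unchanged.
    """
    own_socket = sock is None
    if own_socket:
        sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.sendmsg([payload], [], 0, address)
        return "sent"
    except OSError as exc:
        if exc.errno == errno.EPERM:
            return "eperm"
        raise
    finally:
        if own_socket:
            sock.close()
```

Because the problem appears to be intermittent and load-dependent, a probe like this succeeding does not rule the issue out.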

We suspect that this is a race condition of some sort: the error seems to occur only on very busy servers, and intrusive diagnostics (e.g. strace) prevent it from occurring.

One known workaround is unloading the kernel netfilter modules, assuming that you aren't using them.
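Before unloading anything, it helps to see which netfilter-related modules are actually loaded. The sketch below parses text in the format of /proc/modules, where the first field of each line is the module name. The prefix list is an assumption based on common netfilter module naming, not an exhaustive or authoritative list, and the function name is hypothetical.

```python
# Common netfilter-related module name prefixes (an assumption; adjust
# for your kernel).
NETFILTER_PREFIXES = (
    "nf_", "nft_", "nfnetlink", "xt_", "x_tables",
    "ipt_", "ip6t_", "iptable_", "ip6table_", "ip_tables", "ip6_tables",
)


def loaded_netfilter_modules(proc_modules_text):
    """Return names of loaded modules that look netfilter-related."""
    names = []
    for line in proc_modules_text.splitlines():
        fields = line.split()
        if fields and fields[0].startswith(NETFILTER_PREFIXES):
            names.append(fields[0])
    return names
```

On a live system the input would come from reading /proc/modules, for example `loaded_netfilter_modules(open("/proc/modules").read())`.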

We've received reports of this from RHEL and Debian systems. We think it probably has more to do with the kernel version than with the distribution, because in some of the existing reports the kernel version is the only difference between a system experiencing the issue and systems that are not.

If you have encountered this error and wish to submit a report, you can use our online form.

We welcome any new information or insight into this problem. Even if you are unable to provide any new evidence, submitting a bug report lets us add you to the list of those experiencing this issue.