5.4.5 DNS troubleshooting

Understanding DNS Troubleshooting

DNS troubleshooting focuses on figuring out why name resolution does not work as expected and proving where the problem actually is. As an administrator you need to confirm whether the issue is on the client, on one of your DNS servers, in your zone configuration, or out on the wider DNS infrastructure.

DNS itself and the role of authoritative servers, resolvers, caching, and zones are discussed in earlier chapters. Here the focus stays on the practical techniques and tools you use to diagnose and fix problems in a running DNS deployment.

Always verify which DNS server a client is using and which name it is trying to resolve before making changes to DNS infrastructure.

Typical Symptoms of DNS Problems

Many network problems are blamed on DNS, but not all of them are DNS issues. You can recognize real DNS symptoms by looking at how applications fail.

A common sign is that you can reach a service by IP address, but not by hostname. For example, ping 93.184.216.34 works, but ping example.com fails. Web browsers might show messages like “DNS_PROBE_FINISHED_NXDOMAIN” or “Server DNS address could not be found.” Mail delivery problems can manifest as errors about being unable to look up MX records. SSH might hang for a long time at “Connecting” before eventually failing, because the client or the server is waiting on a reverse DNS lookup.

It is important to separate DNS failures from basic network failures. If you cannot reach any IP address at all, the problem is connectivity, not DNS. The next section shows how to confirm this.

First Checks: Network and Resolver

Before you dig into zone files or DNS server logs, verify the basics from the affected client. This helps you avoid chasing configuration issues when the underlying problem is just that the client cannot reach any DNS server.

First, check basic IP connectivity to something known, such as a public IP or your gateway. If this fails, you must fix routing, cabling, or firewall issues before dealing with DNS.

Next, check which resolvers the client is using. On systemd-based Linux systems, resolvectl status shows the current DNS servers, search domains, and which interface they are attached to. On more traditional setups, /etc/resolv.conf contains nameserver lines with the configured resolver IPs and optionally search domains. When systemd-resolved is active, /etc/resolv.conf often lists only the local stub address 127.0.0.53, so resolvectl status is the place to see the real upstream servers. Make sure these IPs actually belong to DNS servers you control or trust.
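As a small sketch of the resolv.conf check, the helper below prints only the configured resolver addresses. The function name list_nameservers is made up for this illustration, and the optional file argument exists only so you can point it at a copy of the file.

```shell
# List the nameserver addresses from a resolv.conf-style file.
# list_nameservers is a hypothetical helper name, not a standard tool.
# Usage: list_nameservers [file]   (defaults to /etc/resolv.conf)
list_nameservers() {
    # Only lines whose first field is exactly "nameserver" are relevant;
    # comments, search, and options lines are skipped.
    awk '$1 == "nameserver" { print $2 }' "${1:-/etc/resolv.conf}"
}
```

Run list_nameservers without arguments on the affected client and compare the result with the resolver addresses you intended to hand out.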

If the client is meant to get DNS settings from DHCP, confirm that the DHCP server is handing out the correct DNS addresses. That belongs to the DHCP chapter in detail, but from the troubleshooting perspective you can compare the intended DNS server addresses with what the client actually uses.

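A simple reachability check is to send a ping to the resolver IP. Keep in mind that ping only tests ICMP: a resolver can answer pings while a firewall blocks UDP or TCP port 53, and some resolvers ignore ping but answer DNS queries. A direct query such as dig @resolver-ip example.com tests the DNS port itself. If the resolver is unreachable, or if a firewall blocks UDP and TCP port 53, name resolution will fail even though DNS itself is correctly configured elsewhere.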

If the resolver is unreachable, or port 53 is blocked, fix connectivity and firewall rules before changing DNS zones or BIND configuration.
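One way to test port 53 itself is to send the same query once over UDP and once over TCP. The sketch below assumes dig is installed; check_port53 is a hypothetical helper name, and the 2 second timeout and the default test name example.com are arbitrary choices.

```shell
# Probe a resolver on port 53 over UDP and TCP separately.
# check_port53 is a hypothetical helper name.
# Usage: check_port53 <resolver-ip> [name]
check_port53() (
    server=$1
    name=${2:-example.com}
    for proto in udp tcp; do
        case $proto in
            udp) flag=+notcp ;;
            tcp) flag=+tcp ;;
        esac
        # dig exits 0 when it gets any reply (even NXDOMAIN or REFUSED)
        # and non-zero when no server could be reached.
        if dig "@$server" "$name" A "$flag" +time=2 +tries=1 >/dev/null 2>&1; then
            echo "$proto: got a reply"
        else
            echo "$proto: no reply"
        fi
    done
)
```

For example, check_port53 192.0.2.53 prints one line per protocol. A reply over UDP but not over TCP often points at a firewall rule that only allows UDP.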

Using dig and drill for Diagnosis

The main tools for DNS troubleshooting are query utilities that talk directly to DNS servers. On most Linux distributions dig from the bind-utils or dnsutils package is the standard choice. There is also drill from the ldns suite, which has similar usage.

You can use dig in several ways. A basic query to whatever resolver the system is using looks like:

dig example.com

The output shows the question, the answer section, and some useful metadata such as the server that replied, the response code (for example NOERROR or NXDOMAIN), and the query time. This immediately confirms whether the resolver could find an answer and which server it contacted.

To bypass your normal resolvers and talk to a specific DNS server directly, give @server before the name. For example, if you want to ask your own authoritative BIND server at 192.0.2.53:

dig @192.0.2.53 example.com

This is essential when you operate your own DNS servers, because you can check whether your server has the correct answer even if recursive resolvers or caches somewhere else are misbehaving.

You can also request specific record types. For example, dig example.com A looks for IPv4 addresses, dig example.com AAAA for IPv6, dig example.com MX for mail records, and dig example.com NS for name server records. For reverse lookups, use the -x option with an IP:

dig -x 203.0.113.5

More advanced diagnostics often rely on +trace. This option makes dig perform the iterative lookups itself, following referrals step by step from the root instead of relying on your resolver's cache. For example:

dig +trace example.com

The output starts with the root servers, shows the delegation to the top level domain, and finally the authoritative servers for example.com. This traces how the name is resolved and is particularly useful for troubleshooting delegation and glue problems in public DNS.

The drill command works similarly. For example, drill example.com or drill -T example.com for tracing. The output is formatted a bit differently but contains the same core information: question, answer, authority, and additional sections.
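When you check many names, it can help to reduce dig output to just the response code. The following is a minimal sketch that reads dig output on standard input; dig_status is a hypothetical helper name, and it relies on the "status:" field that dig prints in its header line.

```shell
# Print the response code (NOERROR, NXDOMAIN, SERVFAIL, ...) found in
# dig output read from stdin. dig_status is a hypothetical helper name.
dig_status() {
    # The dig header line looks like:
    # ;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 12345
    sed -n 's/.*status: \([A-Z]*\),.*/\1/p'
}
```

Typical use is dig example.com | dig_status, or dig @192.0.2.53 example.com | dig_status to ask one specific server.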

Checking Different Record Types

DNS misconfiguration sometimes affects only some record types. For example, a site might resolve correctly for A and AAAA records, but email fails because the MX records are broken. For that reason, you should explicitly query each relevant type when troubleshooting.

To check address records for a host, ask for A and AAAA:

dig www.example.com A
dig www.example.com AAAA

If clients on your network use both IPv4 and IPv6, a missing or wrong AAAA record can cause hard to debug failures. Some applications try IPv6 first and only fall back to IPv4 after a timeout.

To verify mail configuration, use:

dig example.com MX

This shows the priority and hostname of each mail exchanger. You can then perform additional queries on the MX targets to confirm that they have A or AAAA records and reverse DNS entries.

NS records are crucial for domain delegation:

dig example.com NS

Compare this with the list in your registrar’s control panel or the parent zone, like com. If the NS records in the parent do not match the ones in your zone, or if some of them are not reachable, you have a delegation problem.

You may also work with CNAMEs and TXT records. For example, a missing or incorrect CNAME chain can break services like content delivery networks or application front ends. TXT records are often used for SPF, DKIM, and domain ownership verification. Query them directly with dig example.com TXT to verify that the contents match the expected format.

Always verify that MX and NS records point to hostnames that themselves have valid A or AAAA records and are reachable over the network.
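To avoid forgetting a record type, you can loop over the common ones. This is only a sketch: check_types is a hypothetical helper name, the list of types is an assumption about what matters for your domain, and +short hides details such as TTLs and response codes.

```shell
# Print a one-line summary per record type for a name.
# check_types is a hypothetical helper name.
# Usage: check_types <name> [server]
check_types() (
    name=$1
    # Expands to "@server" if a server was given, otherwise to nothing.
    server=${2:+@$2}
    for type in A AAAA MX NS TXT; do
        # $server is intentionally unquoted so that an empty value
        # disappears instead of becoming an empty argument.
        answer=$(dig $server "$name" "$type" +short | paste -sd' ' -)
        echo "$type: ${answer:-<no answer>}"
    done
)
```

For example, check_types example.com 192.0.2.53 asks your own server for each type. Follow up on the MX and NS targets it prints with check_types on those hostnames.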

Comparing Recursive and Authoritative Answers

A frequent source of confusion is that a resolver might give one answer, while the authoritative server gives another. This can happen because of caching, propagation delays, or misconfiguration. To isolate such problems, you should compare recursive and authoritative answers.

First, query using your normal resolver:

dig www.example.com

Note the answer and the server that replied. Then, look up the authoritative name servers for the zone:

dig example.com NS

Pick one of these authoritative servers and send the same query directly to it:

dig @ns1.example.com www.example.com

If the answers differ, you know that either the resolver is using stale cached data or there is a mismatch between different authoritative servers, such as an unsynchronized secondary.

If the authoritative server returns REFUSED or SERVFAIL, you have a problem on that DNS server. That might be an ACL blocking your query, a zone that failed to load, or another server side issue. In contrast, if the authoritative answer is correct but the recursive resolver gives NXDOMAIN or wrong data, you should focus on the resolver configuration or its cache.

By default, dig sets the "recursion desired" flag in its queries. When you know you are talking to an authoritative server, add +norecurse to clear that flag, so the server reports only what it knows from its own zones instead of attempting (or refusing) recursion.
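The comparison can be scripted roughly as follows. compare_answers is a hypothetical helper name, and the sketch assumes the record lives directly in the zone: if the name is a CNAME that points into another zone, the authoritative server may return only the CNAME and the two answers will differ even though nothing is wrong.

```shell
# Compare the default resolver's answer with one authoritative server's answer.
# compare_answers is a hypothetical helper name.
# Usage: compare_answers <name> <authoritative-server> [type]
compare_answers() (
    name=$1
    auth=$2
    type=${3:-A}
    # Sort and join so that record order does not cause a false mismatch.
    recursive=$(dig "$name" "$type" +short | sort | paste -sd' ' -)
    authoritative=$(dig "@$auth" "$name" "$type" +short +norecurse | sort | paste -sd' ' -)
    if [ "$recursive" = "$authoritative" ]; then
        echo "match: $recursive"
    else
        echo "differ"
        echo "  resolver:      $recursive"
        echo "  authoritative: $authoritative"
        exit 1   # leaves only this subshell, giving the function status 1
    fi
)
```

For example, compare_answers www.example.com ns1.example.com returns a non-zero status when the two answers differ, which makes it usable in scripts. Repeat it for each authoritative server to spot an unsynchronized secondary.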

Verifying Zone Files and BIND Configuration

When you manage your own BIND or similar authoritative servers, zone file errors are a common cause of DNS problems. Fortunately, you can validate zone files offline before reloading them.

For BIND, the named-checkconf command checks the global configuration file. Run:

named-checkconf

If there are syntax errors in /etc/named.conf or equivalent, fix them before restarting the service. For individual zones, use named-checkzone. For example, if your domain is example.com and the zone file is /var/named/example.com.zone:

named-checkzone example.com /var/named/example.com.zone

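This validates the zone syntax, checks for some logical errors, and prints warnings. Correct any reported problems before you load the zone. Common zone file mistakes include a missing trailing dot on fully qualified names, invalid TTL values, and malformed record lines. Missing semicolons, by contrast, are a typical named.conf error that named-checkconf reports.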

When working with public DNS, pay close attention to the SOA record. It contains a serial number that secondary servers use to detect changes. If you edit your zone file manually, you must increment the serial each time you make modifications. A popular format uses the date and a counter, for example 2026010801. A simple mathematical rule is:

$$
\text{new\_serial} > \text{old\_serial}
$$

If the serial does not increase, secondary servers will not reload the updated zone, and you might keep serving old records.

Every time you change a zone file manually, increase the SOA serial number so that secondary servers detect the change and transfer the updated zone.
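If you use the date-plus-counter convention, the next serial can be computed instead of typed by hand. The sketch below assumes the YYYYMMDDNN layout shown above; next_serial is a hypothetical helper name, and the optional second argument exists only so the date can be fixed for testing.

```shell
# Compute the next SOA serial in YYYYMMDDNN form.
# next_serial is a hypothetical helper name.
# Usage: next_serial <current-serial> [today as YYYYMMDD]
next_serial() (
    current=$1
    today=${2:-$(date +%Y%m%d)}
    if [ "$current" -lt "${today}00" ]; then
        # First change today: start the counter at 01.
        echo "${today}01"
    else
        # Already changed today (or the serial is ahead): just add one,
        # which keeps the serial strictly increasing.
        echo $((current + 1))
    fi
)
```

After reloading, dig @192.0.2.53 example.com SOA +short against the primary and each secondary shows whether all of them report the same serial.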

After you have fixed any zone issues, reload the DNS server using your service manager and confirm that it started without errors. Then use dig against the authoritative server to verify that the new records are visible.

Cache, TTL, and Propagation Issues

DNS responses are cached for a time defined by the TTL (time to live) field. While caching improves performance and reduces load, it can delay the propagation of changes. Many troubleshooting sessions are in fact about cached data that has not expired yet.

You can see the remaining TTL in the answer section of dig output. For example:

dig www.example.com

The second column in the answer line is the TTL in seconds. If you just changed a record but resolvers still show the old value with a high TTL, it means that some caches have not yet expired the old answer.

You cannot force caches on the internet to expire early. You can only wait for the TTL to reach zero. That is why it is often recommended to set a lower TTL temporarily before a planned change, for example reducing it from 86400 seconds to 300 a day before a migration, then raising it again after the change is complete.

On your own recursive resolvers, you can sometimes flush the cache manually using an administrative tool or by restarting the service. For BIND, rndc flush clears the entire cache, and rndc flushname example.com clears entries associated with a particular name.

When troubleshooting a name that might be affected by caching, it helps to query several different public resolvers, such as the ones operated by major providers. Comparing their answers and TTL values gives you an idea of where in the propagation the new data is.

Never assume a DNS change is live everywhere immediately. Always account for the record TTL when planning and troubleshooting changes.
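To watch TTLs without reading the full dig output, you can filter the answer lines. This is a minimal sketch; answer_ttls is a hypothetical helper name, and it assumes the usual dig record layout of name, TTL, class, type, and data.

```shell
# Print "name type ttl" for each record line in dig output read from stdin.
# answer_ttls is a hypothetical helper name.
answer_ttls() {
    # Record lines have the class "IN" in the third column;
    # header and comment lines do not.
    awk '$3 == "IN" { print $1, $4, $2 }'
}
```

Use it as dig +noall +answer www.example.com | answer_ttls. Against a caching resolver the TTL counts down between runs, while an authoritative server always reports the full configured value.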

Reverse DNS and PTR Issues

Reverse DNS maps IP addresses back to hostnames using PTR records in special in-addr.arpa or ip6.arpa zones. Problems with reverse DNS often show up in mail delivery and access control lists.

To test reverse DNS, use:

dig -x 203.0.113.5

If the answer is NXDOMAIN, no PTR record exists. If there is a PTR but the hostname is wrong or out of date, you might encounter issues with services that rely on hostname checks. For strict mail servers, a mismatch between forward and reverse DNS can cause messages to be rejected.

Reverse DNS zones are usually delegated by the owner of the IP address range, often your ISP or hosting provider. If you run a server on an IP you do not control, you may need to ask that provider to create or change the PTR record. For IP ranges you manage, you configure reverse zones on your own DNS servers just like regular zones, except with the reversed address notation.

When troubleshooting, always verify both directions. First, use dig -x on the IP to check the PTR, then use dig on the returned hostname to ensure it has the correct A or AAAA record that points back to the same IP.
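The two-direction check can be sketched as a single function. fcrdns is a hypothetical helper name; the comparison is a plain text match, so for IPv6 it assumes you pass the address in the same compressed form that dig prints.

```shell
# Check that an IP's PTR name resolves back to the same IP.
# fcrdns is a hypothetical helper name.
# Usage: fcrdns <ip>
fcrdns() (
    ip=$1
    ptr=$(dig -x "$ip" +short | head -n 1)
    if [ -z "$ptr" ]; then
        echo "no PTR record for $ip"
        exit 1
    fi
    # Accept a match in either the A or the AAAA records of the PTR target.
    if dig "$ptr" A +short | grep -qxF "$ip" ||
       dig "$ptr" AAAA +short | grep -qxF "$ip"; then
        echo "ok: $ip -> $ptr -> $ip"
    else
        echo "mismatch: $ip -> $ptr, which does not resolve back to $ip"
        exit 1
    fi
)
```

For example, fcrdns 203.0.113.5 reports whether the mail server's address passes the forward-confirmed check that strict receivers apply.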

DNSSEC Related Problems

DNSSEC adds cryptographic signatures to DNS data. While it improves security, it also introduces new failure modes. If DNSSEC validation fails, resolvers might return SERVFAIL even when an unsigned zone would have responded with useful data.

To inspect DNSSEC related information with dig, add +dnssec:

dig +dnssec example.com

The answer section will include RRSIG records if the zone is signed. You can also look for DS records in the parent zone to confirm that the delegation is correctly secured. For example, to see the DS record that the parent zone publishes for example.com:

dig example.com DS

If your resolver performs DNSSEC validation, a broken chain of trust, expired signatures, or mismatched keys can cause all queries for that zone to fail. Public tools and online validators can be helpful to confirm whether the problem is related to DNSSEC.

On your own resolvers, check the logs for DNSSEC validation errors; they often contain more specific messages describing missing keys or bad signatures. On authoritative servers, ensure that key rollover procedures and signature refresh intervals are configured correctly so that signatures never expire unexpectedly.
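A quick way to tell a validation failure from other causes of SERVFAIL is to repeat the query with the "checking disabled" bit set, which dig exposes as +cdflag. The sketch below only makes sense against a validating resolver; dnssec_suspect is a hypothetical helper name.

```shell
# Distinguish a DNSSEC validation failure from other SERVFAIL causes.
# dnssec_suspect is a hypothetical helper name.
# Usage: dnssec_suspect <name> [resolver]
dnssec_suspect() (
    name=$1
    server=${2:+@$2}
    # Extract the response code from dig's header line.
    status() { sed -n 's/.*status: \([A-Z]*\),.*/\1/p' | head -n 1; }
    # $server is intentionally unquoted so an empty value disappears.
    normal=$(dig $server "$name" | status)
    unchecked=$(dig $server "$name" +cdflag | status)
    echo "validating: ${normal:-none}, checking disabled: ${unchecked:-none}"
    if [ "$normal" = "SERVFAIL" ] && [ "$unchecked" = "NOERROR" ]; then
        echo "likely a DNSSEC validation failure"
    else
        echo "no DNSSEC-specific failure detected"
    fi
)
```

If the name resolves only with checking disabled, look at signatures, keys, and the DS record rather than at the zone data itself.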

Using dig +trace to Debug Delegation

Domain delegation problems occur when the parent zone points to the wrong or unreachable name servers, or when glue records are missing. dig +trace is particularly useful for discovering these issues.

When you run:

dig +trace www.example.com

dig starts at the root and shows each step. You can observe which name servers the root refers you to for the top level domain, which NS records the top level domain servers return for example.com, and finally which authoritative servers reply with the answer (or an error).

If the trace stops with SERVFAIL or NXDOMAIN at the domain level, you know the problem is in the delegation or the authoritative zone. If the trace reaches your authoritative servers but they do not answer correctly, adjust your server configuration or fix the zone.

When dealing with glue records, watch for A and AAAA records in the additional section for the NS targets. If NS hosts are inside the same domain they serve, missing glue can make the domain unreachable, because resolvers do not know the IP addresses of the name servers. In such cases, you must add or correct the necessary A and AAAA records at the parent zone through your registrar or DNS provider.
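To compare the parent's delegation with your own zone without reading a full trace, you can query both sides directly. The sketch below uses hypothetical helper names (parent_ns, child_ns, check_delegation) and assumes you already know one server for the parent zone, for example one of the names that dig com NS returns.

```shell
# NS names the parent zone hands out in its referral (authority section).
# Usage: parent_ns <zone> <parent-server>
parent_ns() {
    dig "@$2" "$1" NS +norecurse +noall +authority |
        awk '$4 == "NS" { print tolower($5) }' | sort
}

# NS names the zone itself publishes (answer section of an authoritative reply).
# Usage: child_ns <zone> <authoritative-server>
child_ns() {
    dig "@$2" "$1" NS +norecurse +noall +answer |
        awk '$4 == "NS" { print tolower($5) }' | sort
}

# Usage: check_delegation <zone> <parent-server> <authoritative-server>
check_delegation() (
    p=$(parent_ns "$1" "$2")
    c=$(child_ns "$1" "$3")
    if [ -n "$p" ] && [ "$p" = "$c" ]; then
        echo "delegation matches"
    else
        printf 'delegation mismatch\nparent:\n%s\nchild:\n%s\n' "$p" "$c"
        exit 1
    fi
)
```

Adding +additional to the parent query also prints the glue A and AAAA records, if the parent has any for the listed name servers.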

Client Side DNS Troubleshooting

Not all DNS problems come from the server side. Clients can also misbehave because of incorrect configuration or local caching. On Linux, modern systems often use systemd-resolved, nscd, or other local resolvers.

To inspect client side status with systemd, use:

resolvectl status

This command shows the default DNS servers, per interface configuration, and whether DNSSEC or other features are enabled. If the client points to an obsolete or unreachable DNS server, change the configuration either manually or through the network manager.

If you suspect a local cache is keeping stale entries, on systemd-resolved you can run resolvectl flush-caches or restart the resolver service; in other setups you can restart nscd or the equivalent. You can also bypass the usual resolver logic by using dig with a specific @server, which ignores /etc/resolv.conf and similar settings.

Hosts file entries in /etc/hosts override DNS lookups for the names listed there. Verify that this file does not contain old or incorrect entries for the hostnames you are testing. Removing or correcting those entries can instantly fix what appears to be a DNS problem, but is in fact a local override.
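Because dig queries DNS directly and never reads /etc/hosts, a local override is easy to miss. The sketch below lists hosts file lines that mention a given name; hosts_override is a hypothetical helper name, and the optional file argument is there only for testing against a copy.

```shell
# Show hosts-file lines that list a hostname as a whole field.
# hosts_override is a hypothetical helper name.
# Usage: hosts_override <name> [file]   (defaults to /etc/hosts)
hosts_override() {
    awk -v name="$1" '
        { sub(/#.*/, "") }    # strip comments before looking at fields
        {
            # Field 1 is the address; the remaining fields are names.
            for (i = 2; i <= NF; i++)
                if (tolower($i) == tolower(name)) { print; break }
        }
    ' "${2:-/etc/hosts}"
}
```

If hosts_override www.example.com prints a line, applications will use that address regardless of DNS. getent hosts www.example.com shows what the system's normal lookup path returns, including such overrides.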

On workstations, firewall or VPN software may intercept DNS queries or redirect them to different servers. When troubleshooting, check active VPN connections and their DNS settings, since they might have installed custom resolvers or search domains that confuse applications.

Using Logs and Monitoring for Persistent Issues

For recurring DNS issues, logs and monitoring provide essential clues. On BIND, you can control logging categories, for example queries, security, or zone transfers. The logs can show whether queries arrive at the server, how they are answered, and whether any zones fail to load.

On systemd based systems, DNS related services log through the journal. You can inspect logs with journalctl filtered by unit name. For instance, if your DNS server runs as named.service, you can use:

journalctl -u named.service

Look for messages about zone load failures, transfer errors, or repeated query patterns that might indicate abuse or misconfiguration.

Monitoring tools can regularly check your important hostnames from multiple vantage points and alert you if responses change unexpectedly. These tools often perform DNS lookups in addition to HTTP or TCP checks. Integrating DNS checks into your monitoring helps you detect issues such as expired zones, unauthorized changes, or misconfigured records quickly.

By combining direct query tools like dig, authoritative and recursive comparisons, cache awareness, and good logging practices, you can systematically narrow down almost any DNS issue to a clear cause and fix it with confidence.
