Case Study: DNS data identifies network performance issues
The Domain Name System (DNS) is a powerful tool for enhancing visibility into all aspects of a network. At an individual query level, DNS records are a strong indicator of user intent – useful for tracking anomalous behavior back to the source device or IP address. In aggregate, DNS traffic paints a picture of how well a network is operating – useful for an overall assessment of network security and performance.
One of BlueCat’s large enterprise customers recently discovered how powerful the DNS protocol can be in identifying and mitigating large-scale network performance problems.
It all started with a noticeable lag in performance for users of the company’s virtual desktop infrastructure (VDI). Certain subnets with newer workstation images were facing particular connectivity problems. These subnets usually had between 1,000 and 3,000 active VDI clients.
DNS data provides critical clues
As the team looked into the DNS request data collected by BlueCat’s intelligent security system, a clear pattern started to emerge. The problematic subnets showed exceedingly high volumes of NXDOMAIN responses, indicating that something wasn’t resolving correctly. At the same time, the subnets also showed a large amount of anomalous PTR (reverse lookup) activity.
The PTR activity all had the same timestamp, indicating a simultaneous barrage of reverse lookups from across the network.
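The two signals described above, a high NXDOMAIN share and PTR queries clustered on a single timestamp, can be pulled out of query logs with a simple per-subnet aggregation. The sketch below is illustrative only: the record layout, the sample data, the /24 grouping, and the function names are assumptions made for the example, not the format or method used by BlueCat's products.

```python
from collections import Counter, defaultdict
from ipaddress import ip_network

# Hypothetical record layout: (timestamp, client_ip, query_type, response_code).
# Real DNS logs will differ; adapt the fields to your own export.
SAMPLE_LOG = [
    ("2024-01-01T09:00:00", "10.1.0.11", "PTR", "NXDOMAIN"),
    ("2024-01-01T09:00:00", "10.1.0.12", "PTR", "NXDOMAIN"),
    ("2024-01-01T09:00:00", "10.1.0.13", "PTR", "NXDOMAIN"),
    ("2024-01-01T09:00:01", "10.1.0.11", "A", "NOERROR"),
    ("2024-01-01T09:00:00", "10.2.0.21", "A", "NOERROR"),
    ("2024-01-01T09:00:02", "10.2.0.22", "PTR", "NOERROR"),
]


def subnet_of(client_ip, prefix_len=24):
    """Return the enclosing subnet of client_ip as a string, e.g. '10.1.0.0/24'."""
    return str(ip_network(f"{client_ip}/{prefix_len}", strict=False))


def summarize(log, prefix_len=24):
    """Per subnet: query count, NXDOMAIN ratio, and largest same-timestamp PTR burst."""
    totals = Counter()
    nxdomain = Counter()
    ptr_by_timestamp = defaultdict(Counter)
    for timestamp, client_ip, qtype, rcode in log:
        subnet = subnet_of(client_ip, prefix_len)
        totals[subnet] += 1
        if rcode == "NXDOMAIN":
            nxdomain[subnet] += 1
        if qtype == "PTR":
            ptr_by_timestamp[subnet][timestamp] += 1
    return {
        subnet: {
            "queries": totals[subnet],
            "nxdomain_ratio": nxdomain[subnet] / totals[subnet],
            "max_ptr_burst": max(ptr_by_timestamp[subnet].values(), default=0),
        }
        for subnet in totals
    }


summary = summarize(SAMPLE_LOG)
for subnet, stats in sorted(summary.items()):
    print(subnet, stats)
```

In this made-up sample, the first subnet stands out on both measures, while the second looks healthy. What counts as an abnormal ratio or burst size depends on the network's own baseline.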
With BlueCat, users can easily adjust the search command to pull up relevant logs. With one click, pivot into the DNS insights and analytics tab for a graphical view of those logs. Integrate easily with Splunk with our free Splunk app to drill deeper into query patterns.
These data points were then triangulated against network utilization information for various applications. It emerged that the local firewall was the largest consumer of DNS on the network. Packet captures from workstations affected by the performance issues also showed a large number of Link-Local Multicast Name Resolution (LLMNR) queries.
Identifying root causes
High utilization of DNS by the local firewall in combination with a surge of LLMNR queries finally allowed the team to piece together the issue. Here’s what was happening:
A newer version of a workstation image contained a firewall setting that enabled reverse name lookups on connections. That same image had LLMNR and NetBIOS enabled.
For an inbound connection, the firewall would attempt to perform a reverse name lookup through a PTR query. Those queries failed due to a Windows registration issue – the clients were not consistently registering reverse records. The DNS result was an NXDOMAIN.
When the lookup failed, the client would send out an LLMNR multicast query to the other clients on the subnet. Those clients would then perform PTR queries on the same record, producing the same NXDOMAIN result.
The firewall kept producing PTR and LLMNR queries across the network in an increasing cascade. There were so many lookups that network performance began to degrade – hence the connectivity issues faced by VDI clients in a growing number of subnets.
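A rough back-of-the-envelope model shows why this fan-out adds up so quickly on subnets of this size. It is a deliberate simplification: it assumes every other client on the subnet reacts to each LLMNR query with exactly one PTR query, and it ignores any further rounds of the cascade. The function names and the rate of two failed lookups per second are hypothetical, chosen only for illustration.

```python
def ptr_queries_per_failed_lookup(clients_on_subnet):
    """One PTR query from the originating client plus one from each other client."""
    if clients_on_subnet < 1:
        raise ValueError("a subnet needs at least one client")
    return 1 + (clients_on_subnet - 1)


def ptr_queries_per_second(clients_on_subnet, failed_lookups_per_second):
    """PTR load on the DNS servers for a given (hypothetical) rate of failed lookups."""
    return ptr_queries_per_failed_lookup(clients_on_subnet) * failed_lookups_per_second


# Subnet sizes from the case study; the rate of 2 failed lookups/s is an assumption.
for clients in (1_000, 3_000):
    print(clients, ptr_queries_per_second(clients, failed_lookups_per_second=2))
```

Under these assumptions, a handful of failed reverse lookups per second on a single subnet is enough to generate thousands of PTR queries per second.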
Solving the problem and monitoring results
Once they discovered the source of the issue, the team turned off the firewall’s reverse lookup function. Returning to the DNS Edge console, the team saw results in real time. That simple switch immediately eliminated around 5,000 queries per second – around half of all queries on the network! The change quickly restored VDI connectivity and dramatically reduced the strain on core network infrastructure. The spike in PTR queries and NXDOMAIN responses vanished.
In this case, the granular, client-level DNS data from DNS Edge provided a critical clue which allowed the network team to identify the source of performance issues. At a basic level, this shows the core value of collecting and analyzing DNS data. Without this information at hand, the team might have gone down the wrong path in attempting to mitigate VDI performance issues. They would never have suspected DNS as the source of the problem, and would probably have pursued other (wrong) root causes instead.
This also highlights the value of a DNS security system which can be deployed at the source of queries. Externally facing DNS firewalls would not have detected the PTR queries and NXDOMAIN responses, as they never made it to the network boundary. Only by looking deeper into the network was the team able to discover the critical information needed to rectify the core issue.
Learn more about the security gold just sitting on your DNS servers, and how it can be used for root cause investigations, in our video about reducing incident response time.