Tales from the Edge: DNS is so much more than a phone book
A conversation on Edge and enterprise use cases with BlueCat’s Chief Strategy Officer, Andrew Wertkin, and podcast hosts Stephen Spector and Rob Hirschfeld.
When the average person thinks of DNS, they picture a phone book that translates human-language domain names into computer-friendly IP addresses. But helping users get to Amazon’s toilet paper page is just the tip of the iceberg when it comes to what DNS can do. Especially in a large enterprise.
As you’re about to read (or hear, if you prefer listening to your podcasts), one area of untapped potential for DNS lies inside the large organization.
BlueCat Chief Strategy Officer Andrew Wertkin recently had a conversation on this note with the hosts of L8ist Sh9y. Andrew, along with Stephen Spector and Rob Hirschfeld, spent half an hour unpacking why DNS – in the enterprise and at the edge – is so interesting.
First, some context
Stephen: Andrew, give us a quick background on BlueCat and then we’ll jump right in.
Andrew: BlueCat is a 20-year-old company that’s been in network, and network security, for quite some time. We’re focused most broadly on DNS, on the private side of DNS: DNS inside of an enterprise. We deal with external public DNS as well. But a huge chunk of our business and our customers are driving solutions for DNS inside the network.
Most of us as individuals, consumers, think about DNS as what’s necessary for us to connect to some third party service or application. Inside a corporation, inside of a data center, there’s a tremendous amount of pressure put on internal DNS. For connecting devices to internal applications, for Active Directory, authentication, Kerberos… I mean there’s many, many use cases that require DNS internally.
Rob: DNS is an essential thing. People, I think, forget that you can’t really do TLS (which is secure sockets communication) without naming servers and having authenticated DNS. So DNS is an essential infrastructure on sites. Do you then basically become the enterprise DNS infrastructure? I know there’s a couple of companies that do that. There’s a couple that are really internet DNS and application focused DNS for huge websites and things like that. Where’s the BlueCat balance?
Andrew: We become the enterprise DNS. That doesn’t mean that some of our customers don’t use us for public DNS, for websites and those things as well. It depends on the company, its size, and what part of the public DNS they’re talking about. Perhaps they delegate a lot of zones to different services based on certain applications, whatever the case might be. But 90% of what we bring to our customers is that enterprise DNS on the inside of the network.
And so we provide the appropriate services for them to tag networks. Part of an enterprise DNS solution is also a layer of IP address management in general. That’s where we’re the single source of truth for all of the networks that are deployed inside of the enterprise. And so we can tag those and locate those to help with that.
So there’s that side as well, but there’s a variety of other use cases like Kerberos authentication, for instance. Simply finding my domain controller, and just on and on and on.
I mean, I was actually just looking earlier today across a variety of different customers. Depending on the customer, like a large university, maybe 80% of the queries are going to the internet and 20% are internal. If you swap that to a large financial customer of ours, it completely swaps along with it.
Where 80% of the queries are internal and 20% are going external, it goes to show the amount of applications, resources, that are used inside the network. (Versus outside the network.) And then, also whether or not there’s a web proxy or explicit web proxy in play that might be executing a lot of the DNS queries for the user on behalf of the user.
Rob: That’s fair. DNS is a core technology, but there’s so many things that impinge on it. Proxy being one of them, load balancer would be another. And it seems like performance of a DNS infrastructure can be super sensitive.
The joke on the internet is it’s always the DNS when you’re troubleshooting problems. So does BlueCat have those services as part of your offering or then do you integrate in, into other services that are out there? DHCP is another one.
Andrew: Yeah, DHCP is another core. This market was always called DDI (DNS, DHCP and IP address management), which is an acronym of acronyms that really nobody knows. And so we tend to just think about it all as DNS and enterprise DNS, which includes these other solutions.
The latency and availability of DNS are critical to availability of applications and services and performance. These are things that we’re providing tools and capabilities around. For customers to understand what the performance of DNS is on the inside. Also, we’re looking proactively for anything that might cause service slow-down or outages or anything like that.
And so we work with our customers to make sure that we’ve deployed enough capacity. And that historically has been capacity that’s usually located in large data centers or large branches.
More so now, as we push more into Edge computing. It also means deploying smaller points of service wherever there might be egress to the internet.
How DNS works at the edge
Rob: So would you, in an Edge location, basically put together a small DNS infrastructure for that site? Is that the idea?
Andrew: Yeah. We have a product profile that’s specifically engineered for that Edge site. Say that Edge site is a retail store. Our data center appliances can go to hundreds of thousands of queries per second.
In a store, you’d be lucky to see a hundred queries per second. Same thing on the DHCP side, where you just expect a much lower volume. However, there’s still all the complexity of internal DNS that gets dragged along with it.
And so we have the sort of smart servers there, which provide a lot of visibility. We don’t know if a customer connected from point A to point B, or any client I should say. But we know there was an intention to make a connection, because there was a DNS lookup. With that intention, it becomes a pretty good proxy for the intent of that device; those signals are extremely interesting to the security team, the network team. There’s a great desire for that.
But then also, we’re able to change the answer to a query based on context. And that’s what really becomes important in general, but especially with Edge based computing.
Rob: So when you say “change the answer”, there’s so much to unpack in what you just said. I think a lot of people don’t realize just how deeply integrated DNS is into infrastructure. And you’re exactly right. A script that runs and is trying to download – malicious or benign – if it’s trying to download materials, it’s literally hitting the DNS every time it’s pulling those down. Pulling them from repos going out the caches. I mean, that is the first indicator of any content activity, any network traversal activity. Is that fair?
Andrew: Yeah, it is. For sure. Therefore, if all of a sudden something’s constantly looking up the same address, and it’s a user-driven device, and it’s looking it up in a pattern that couldn’t possibly be a user, yeah. Somebody could have left some sort of file sharing application running on their device at home, forgot to turn it off, and brought it back inside.
By and large, it represents something that somebody should look at. And just the pattern of DNS is good enough. Even if the connection couldn’t be made. Let’s say you’re blocking lookup to that specific domain, something’s installed in that machine. And so just based on that pattern of queries, it becomes pretty obvious that something’s going wrong there.
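One way to picture the "pattern that couldn’t possibly be a user" is to look at how evenly spaced the lookups are. The sketch below is only an illustration, not BlueCat’s method: the function name and both thresholds are assumptions, and it flags a client/domain pair when the gaps between queries are nearly constant.

```python
from statistics import mean, pstdev


def looks_automated(timestamps, min_queries=10, max_jitter=0.1):
    """Return True if queries for one name arrive at suspiciously regular intervals.

    timestamps: query times in seconds for a single (client, domain) pair.
    min_queries and max_jitter are illustrative thresholds, not tuned values.
    """
    if len(timestamps) < min_queries:
        return False  # too little data to call it a pattern
    ordered = sorted(timestamps)
    gaps = [later - earlier for earlier, later in zip(ordered, ordered[1:])]
    average_gap = mean(gaps)
    if average_gap == 0:
        return True  # many queries at the same instant is not a person either
    # Coefficient of variation: a low value means a near-constant interval.
    return pstdev(gaps) / average_gap <= max_jitter
```

A lookup every 60 seconds for 20 minutes is flagged, while a handful of irregular lookups spread over a few hours is not. A real deployment would need more signals than timing alone.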
Rob: Right. And once again, benign or malevolent. If something’s hitting an address, if it’s hitting a DNS over and over again, and not being able to resolve that address, you’ve now got a non-performing application.
Andrew: Could be. Right.
Rob: And it’s going to at least hit a timeout or something’s going to happen. I mean, that’s pretty normal. Then on Edge, it becomes super expensive because if you’re doing a DNS entry to someplace and running out and pulling data across, you might not even know what’s crossing your router out of that Edge site, but the DNS could give you a pretty good indication of what’s going on.
Andrew: Yeah. It becomes a pretty rich set of data to mine for what’s being used there, but then also to help steer that.
I’ll give you a good example of how one of our Edge customers is using it. They’ve implemented SD-WAN in God knows what percentage of the enterprise at this point. And, this isn’t unique to them either, there’s only certain services that they want to make available for direct internet access. And in the case of this company, it’s basically Office 365. (The amount of network re-architecture and spending that has occurred because of Office 365 is pretty insane, but it’s there.) And so they want to drive that out. And a lot of the SD-WAN vendors sell some pretty sophisticated capabilities to try to sense different applications.
What this company realized is that they can basically steer traffic based on whether or not you can resolve the DNS names. Now we have this policy-driven DNS, and Microsoft provides a web service with all of the fully qualified domain names that are necessary to run Office 365.
And basically those get whitelisted, everything else gets blocked. Their PAC files for the proxy are written such that if they can’t look up the answer in DNS, if DNS doesn’t have an answer, then send it through the proxy. In fact, they’ve taken ridiculously complex PAC files for their explicit proxy, and reduced them to like two lines.
‘If I can look it up in DNS, go directly there; else, go to the proxy.’ That simple and elegant solution allows Office 365 traffic to go directly out to the internet. Everything else still routes back through the proxy.
Somebody I was talking to recently said, “it’s sort of like you build all these great highways with all sorts of expectations for network utilization, but there’s no Waze or Google Maps. There’s no ability to direct that traffic.”
And at the end of the day DNS certainly doesn’t know the route that’s going to be taken. But it knows the address of the service on the other side. It’s what’s providing that. And so it’s a very powerful way to direct traffic in whatever direction necessary.
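The two-line rule Andrew describes can be sketched roughly as follows. Real PAC files are JavaScript (where the standard helper `isResolvable(host)` plays the role of `can_resolve` here); this Python version only mirrors the decision logic. The proxy address and function names are hypothetical.

```python
import socket

PROXY = "PROXY proxy.example.internal:8080"  # hypothetical proxy address


def can_resolve(host):
    """Return True if the local resolver has an answer for host."""
    try:
        socket.getaddrinfo(host, None)
        return True
    except socket.gaierror:
        return False


def find_proxy_for_host(host, resolves=can_resolve):
    """Resolvable names go direct; everything else goes through the proxy."""
    return "DIRECT" if resolves(host) else PROXY
```

The design relies on the policy-driven DNS answering only for the allowed names, so the client needs no per-application rules of its own. The `resolves` parameter is there so the rule can be exercised without touching a real resolver.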
Manipulating DNS responses
Rob: No, it makes a ton of sense. I think the internet is supposed to be resilient so what you’re describing is a feature and a [inaudible] simultaneously from that perspective. But we do provide a lot of hits from that perspective. My company does a lot of DHCP work and there’s a ton of things written down in the spec that provide [inaudible] also that people have been using and playing with.
Are there things that people should be thinking about in application design that would make DNS more powerful?
Andrew: As we build cloud-native applications these days, then the service discovery capabilities of DNS become very important in that process. Not that they weren’t used before, but now there’s much more use of things like that. I can build way less brittle applications and start using service discovery to have the DNS basically tell me what the closest, healthiest service is that I should be connecting to for authentication or whatever that service might be doing with my application. So there’s that side of it, for sure. The thing we see a lot is the misuse of DNS in application development. For instance, those writing the application deciding that they’re going to basically look up DNS on their own, as opposed to going through the operating system.
We see that with, I won’t name them, but one of the security agents out there. What ends up happening is, during the middle of the day, there’s about 8,000 queries per second just for this security agent at one of our large customers. Because it doesn’t allow the operating system to cache. It’s not caching, it’s just pounding away at DNS. And therefore the further away those endpoints are from their DNS, there’ll be a hit on performance.
For the security agent, maybe that’s not a big deal. But when you see a lot of people having strategies of moving portions of applications to Edge or portions of applications to cloud, things that maybe didn’t cause performance issues before now do. There was an assumption that the application was close to DNS. You had a latency of five milliseconds, and that’s now a hundred milliseconds. You see a performance issue, and that performance issue could end up being pretty difficult to nail down. DNS issues are just historically very difficult to understand especially on the performance side.
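To put rough numbers on that, here is a back-of-the-envelope sketch. It assumes lookups happen one after another and that the record’s TTL outlives the whole run, so a caching client pays the round trip once. The function name is made up for the illustration.

```python
def total_lookup_ms(num_lookups, rtt_ms, cached):
    """Time spent waiting on DNS for repeated, serial lookups of one name.

    Assumes the TTL outlives the run: a caching client pays the round trip
    once, while a non-caching client pays it on every lookup.
    """
    if num_lookups <= 0:
        return 0
    return rtt_ms if cached else num_lookups * rtt_ms
```

Under those assumptions, 1,000 uncached lookups cost 5,000 ms of waiting at 5 ms per round trip and 100,000 ms at 100 ms, while a caching client waits 5 ms or 100 ms in total. The same application code gets much slower simply because DNS moved further away.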
Rob: What you’re describing to me is a real Edge challenge case for people. We’re dealing with Edge scenarios where a lot of Edge talking that I’ve done is like, “Oh, I have a gateway and then I have devices behind that gateway. And that everything’s sort of hard coded and yay. It’s great”. Everything has to run through the gateway, but now we’re moving to multiple systems in that Edge gateway and that gateway is becoming a small cluster to do more work.
You could have an application running on that gateway that’s coded to go to the internet to collect an address and then talk to a service based on that address. Yeah, it’s going to work great in the cloud where the DNS is really close and everything is assured, but that Edge site could actually take a significant amount of time just to return the address. And then it could be vulnerable that if the address doesn’t get returned because the link is down, you could actually have local communications that are failing because it couldn’t resolve an address.
Andrew: Yeah, exactly. That’s why we tend to push these days to the Edge, to solve for those sorts of issues. Then there’s a couple of other wonky issues that companies have when they start trying to break out. It’s all because of legacy, data center, web proxies…
Yeah. Just years and years and years of God knows what configuration in there. And people have sort of patched DNS issues in the web proxy.
A good example of that is you, or whatever company, will use the same domain internally and externally. They’re not managing it from like a split horizon perspective.
Their Active Directory domain is the same as their public-facing domain. Or there are some domains that are the same. And the web proxies have this kludge basically to say, “Okay, we don’t know if www.whatever or whatever this application is, is an internal site or an external site?” “So first we’ll check if it’s an internal site and then we’ll send you the external site if it’s not”. Now all of a sudden you start doing direct internet access and stuff isn’t going where it should.
And we’ve seen large enterprises completely halt DIA strategies because it completely broke internal DNS. And so we also started working on and released some capabilities to solve for that. So that you don’t need to completely go fix that and start managing appropriately. That’s actually a year or two of hard manual migration of DNS zones, as opposed to doing some magical stuff that we can do because we can control what the answer is.
And so we’re trying to make it easier to manage a fleet of these servers because, look, DNS is critical infrastructure and people don’t like to change it. They’re terrified of it going down because if it goes down, then everything breaks.
And so when we’re trying to convince a company that instead of having 40 DNS servers in their data centers, they should have hundreds or thousands and deploy wherever they’re going directly to the internet (depending on the type of company), we need to, as a vendor, take on the responsibility of managing the health of those services. Customers don’t want to manage the health of 500 DNS servers.
Rob: Yeah. This is like the East-West firewall problem. All of a sudden you’ve got a lot more firewalls. You’ve got cruft floating around and you don’t want to promote the entire DNS infrastructure into each one of those sub units. It actually needs to be managed specifically for it.
So what you’re describing to me, the Edge case is the same. You’re going to have thousands of DNS sites that have actually slightly bespoke configurations. And then it’s site-specific and then you’ve got alerting and monitoring on top of that. [crosstalk] That’s a distributed app, that’s a significant problem.
Andrew: Right. And then we look at it as an IOT application, to be honest with you and this is why we delivered these capabilities as a SaaS-based application where we can manage the stuff from the cloud. And we do the alerting, and we ensure there’s health, and we look for anomalies in how any of those servers are working. But we deploy it on premises, or in the cloud, wherever the customers want it.
They can deploy it, as a containerized service, or they can deploy it as a VM if they want to. Working with some of the router vendors to try to deploy on router, certainly on CPE depending on the hypervisor. Then we manage it because they can’t think of managing servers in that range without more people.
And obviously they don’t want more people for doing this. And then we also, in this service, we’ve sort of transitioned from relying on our customers understanding the somewhat esoteric nature of DNS configuration and really driving it at a higher level, because actually, yes, there’s lots of different variations depending on site. But the basic configuration is way less complex than a data center-based DNS where there’s many more parameters in place, especially on the authoritative side of DNS.
So we try to make it as easy as possible. And that’s where we’re doing everything from threat protection to traffic steering and anything where us being the first hop in DNS, actually knowing who the client is, because you lose that attribution once you hop to the next bounce, therefore we can apply policy and we can restrict things.
Say I’ve got a Point of Sale machine and the machine only looks up, I don’t know, 15 different DNS names in normal use. If the 16th is google.com, it’s compromised because it shouldn’t do that. And so we can apply specific policy to what certain devices can and can’t do, which is interesting for us as well.
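The Point of Sale example amounts to a per-device allowlist applied at the first hop. A minimal sketch, assuming a hypothetical mapping from device profile to expected names (the profile and domain names below are invented):

```python
def check_query(device_profile, qname, allowlists):
    """Return "allow" or "block" for a query, based on the device's profile.

    allowlists maps a profile name to the set of names that device class
    is expected to look up. Profiles without an entry are not restricted here.
    """
    allowed = allowlists.get(device_profile)
    if allowed is None:
        return "allow"
    name = qname.rstrip(".").lower()  # normalize case and any trailing dot
    return "allow" if name in allowed else "block"


ALLOWLISTS = {
    # hypothetical names a Point of Sale terminal would normally look up
    "pos": {"payments.example.com", "inventory.example.internal"},
}
```

With this table, a "pos" device asking for payments.example.com is allowed and the same device asking for google.com is blocked. A blocked query is also the signal worth alerting on.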
Edge and the enterprise: similar use cases
Rob: One of the things that you said that really comes home with our own experience is that Edge and Enterprise actually have very similar requirements. What you just described is an Edge use case, but that would be just as true of a tablet walking onto a campus as a new device. And you’d actually be able to say, this is a new device, it’s asking for certain things I don’t understand.
How well can the DNS infrastructure then interact with something else to send an alert or do a quarantine or…? Because, right, you can effectively quarantine something just by saying, “I’m not going to answer any DNS requests for you.”
Andrew: Yeah, which is interesting. If you know something’s compromised, yeah, you can alert somebody that it keeps pinging to some known command and control server via DNS, fantastic.
Better is, yeah. “So why reply correctly to any question it’s asking?” You know it’s compromised so just don’t reply to it anymore.
So certainly, we have that stuff and we have the appropriate integrations, but we also do this with some of our customers, like even on the DHCP side. But “If I don’t know this device [and granted, you can spoof the MAC address] then why let it onto a VLAN that routes to other production machines, for instance?”
Now that usually means some coordination with a NAC-based solution so that we can give a specific address, for instance. But also it’s fairly certain these days that if something’s on the IP network, it’s going to issue a DNS query.
So it also just becomes a pretty interesting way to do discovery of what’s on the network, just based on where we’re getting these queries from. Like, “is that IP address supposed to be occupied right now?”
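That discovery idea can be sketched as a comparison between the source addresses seen in DNS queries and the addresses the IP address management records say are assigned. The function name and inputs are assumptions for illustration only.

```python
import ipaddress


def unexpected_clients(query_sources, assigned, network):
    """Return source IPs inside network that sent queries but have no assignment.

    query_sources: source addresses seen in DNS queries (strings).
    assigned: addresses the address-management records list as in use (strings).
    network: the subnet being checked, e.g. "10.0.0.0/24".
    """
    net = ipaddress.ip_network(network)
    known = {ipaddress.ip_address(a) for a in assigned}
    seen = {ipaddress.ip_address(s) for s in query_sources}
    hits = [ip for ip in seen if ip in net and ip not in known]
    return [str(ip) for ip in sorted(hits)]
```

Any address it returns is one that is issuing queries from a spot the records say should be empty, which is exactly the question in the quote above.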
The same thing goes with DHCP, but we just think DNS can be used for way more inside the network.
A lot of the things we use and [inaudible] actually aren’t good for the internet because you, as a consumer, don’t want people munging your DNS query or response. And it’s why things like DNS over HTTPS and other technologies have come along, or protocols. There’s this fear of, “am I getting the actual answer?” DNSSEC validation helps with that, “but can I trust this answer?”
On the inside of the network it’s the enterprise, it’s the corporation’s network. The right answer is what they say is the right answer. So there’s more we can do, a chunk of which there’s no RFC for, because what we’re trying to do is make sure the DNS is healthy. It’s low latency. It’s giving the right answer based on the context of what’s happening right now, as opposed to giving the authoritative answer of a third party company that I’m connecting to. Because in that latter case, I don’t want you to mess with my answer. In the former, I expect you to.
Rob: That’s interesting. So it becomes that you have more data to make a decision. You’re still saying, “Hey, go to this address.” But it’s time-sensitive, situation-sensitive. It’s not exactly like load balancing, which is sort of application. I mean, it feels a little like load balancing because you’re making a decision, like “I’m going to send traffic based on your application performance or other criteria.”
In this case, it’s much more about what the client is doing or what the network topology is. Less so about where you’re sending that traffic to.
Andrew: It’s a bit of both. And we’re going to make a decision based on security, based on health, on network performance, on latency.
There’s a whole variety of ways we can make that decision. But we can’t do that on our own. It means more integrations with different services so that we can provide the right answer, which is an exciting way for us.
I mean for the first 15 years of BlueCat, we’ve been focused on core DNS. Certainly we would sell at the Edge, but we would sell at the Edge where the customer wanted survivability at the Edge. There weren’t additional features or capabilities to make. There wasn’t an opportunity to do something innovative at the Edge. And now, given what’s happening at the Edge, it brings an opportunity that we think is interesting.
Rob: Makes a lot of sense.
Andrew: And it’s fun.
The broader ecosystem
Rob: Fun is always good. I think Edge is super exciting and I think we’re just at the beginning. One thing you had said that I would go back to is the other thing that’s fun and changing a lot right now, which is like service mesh and building some service discovery infrastructure.
There’s an element from a DNS perspective there also, where we’re saying, “Hey, I have an application profile that’s shifting all over the place because I’ve put it in containers and I’m distributing those containers.” So service discovery, service mesh, is there a component in DNS that we should be thinking about for that too?
Andrew: Yeah. And most of those solutions are based on DNS. For instance, Kubernetes has a related project called CoreDNS, that is basically a plugin DNS engine that has a lot of other capabilities, or HashiCorp has a great piece of software, and I’m going to blank on the name….
Rob: Consul.
Andrew: Yeah, Consul. Which has a DNS interface as well as a RESTful interface. DNS has always had a play in that world. Certainly if you’re delivering or developing services on Amazon, then Route 53 becomes an opportunity for service discovery. They change the answer based on health as well.
So there’s definitely a play there. Our play from a broad enterprise DNS perspective is, we don’t try to go into our customers and convince them that they should stop those that are writing new applications, launching new Kubernetes clusters.
They’re going to use what’s native to that system. So instead we just plug into that and we make sure that:
One: we can keep the appropriate compliance. Who’s querying what, when? We can apply policies. Why is this thing querying anything outside of that pod? What services is it supposed to access back in the data center? So there’s a whole compliance side that we play in.
Two: there’s no reason our customers can’t use us in many of those scenarios. There’s no reason for us to go wage that battle. The tools that come with a lot of these systems are purpose-built and could provide more value than a more general tool.
Closing remarks
Rob: This makes sense to me. It feels like we’ve kept circling around sort of the same question that I’m trying to get straight in my mind… There’s DNS that is sort of ‘take me to my app’ DNS.
And then there’s DNS of what the client is doing and what those interactions are. And that’s a side of DNS I hadn’t thought much about. Clearly it’s the heart of what you’re positioning and the analytics that you’re adding.
Andrew: Yeah, you got it. Yeah, look, we in the DNS industry are supposed to be offended by the analogy that DNS is a phonebook. I get why that’s the analogy. Because a huge percentage of what it does is take a name and give you an address, and so, fantastic. It’s just that it was built for way more, and the reality is that in any enterprise scenario it does way more.
Stephen: So, Andrew, this is Stephen, as I said, I was going to break in right when it was exciting. And it was like, we were reaching the crescendo of all the key items, but I always try to keep things at a half hour. We found that works the best for the podcasts, but I want to thank you for joining us. Really good conversation today.
That’s all, folks! Thanks for reading.
*Transcript has been lightly edited for clarity. If you enjoyed this, subscribe to the L8ist Sh9y podcast on SoundCloud or Tweet them at @l8istsh9y