Last updated on April 29, 2021.
A network cannot run by machine alone; network admins are the brains behind it. But humans, as we all know, are prone to making errors.
Some mistakes are minor, fixed before anyone notices. Some might be a little bigger and require an apology, but you clean up and move on.
And then there are those gargantuan, once-in-a-career errors, the horrors of which are forever etched into the recesses of your mind. You remember and replay every step that led up to the fiasco. Systems stop working and you catch legitimate flak from angry users. And you breathe a sigh of relief when you don’t get fired.
We’ve all made those, too.
We recently asked members of BlueCat’s Network VIP community to share their worst manual error horror stories [link accessible with Network VIP membership]. This post will highlight five of those stories, with members detailing the errors they committed, the resulting fallout, and what important lessons they learned.
Just remember: Networking is inherently hard, and sometimes your missteps will seem worse than they are. In Network VIP on Slack, all are welcome to apply to join a community of networking professionals to support you through your challenges and accomplishments alike.
And here’s an extra piece of good news: On April 13, we’re holding the industry’s first-ever DDI Day to celebrate the unsung heroes of the network—you! Register to join us.
With a “/” in a home directory, 16 DNS production servers fail
First up, the chief IT infrastructure engineer for a health care organization recalls that while performing some cleanup on DNS servers, he removed a local account.
As per normal practice, he issued the command “userdel -r <account-name>”. The -r switch indicated the removal of the user’s home directory on the server as well.
Unfortunately, and unbeknownst to him, this account’s home directory was “/”, or the root of the filesystem. In removing the account’s home directory, he had inadvertently told the system to remove all files from the root on down.
“To make the mistake even more egregious (as if it wasn’t egregious enough already),” he writes, “I was using a console program that allowed me to access and issue commands on all of our servers simultaneously.”
Immediately after issuing the command, processes began failing on all 16 production servers across the U.S. It took his team eight hours to rebuild all DNS servers from scratch.
The lesson: Never take shortcuts to save time
Luckily, most DNS functions run from RAM, so DNS resolution mostly continued to work and there were few reports of impact.
He places part of the blame on poor documentation of the account being removed, and part on the shoddy implementation of the product that used the account in the first place. “Why would an account ever have a home directory of ‘/’?” he mused.
He also recognizes that his “foolish attempt” to save time by executing commands on all servers at the same time contributed to the fiasco.
“Before executing commands that remove any data, be sure you know exactly what you’re removing,” he writes. “And never, ever, take shortcuts to save time.”
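One way to act on that advice is to look up an account’s home directory before running a removal with the -r switch. The following is a minimal sketch, not the engineer’s actual procedure: the function names and the list of protected directories are assumptions made for illustration, and the list is not exhaustive.

```python
import posixpath

# Illustrative, non-exhaustive list of paths that should never be treated
# as a removable home directory (an assumption made for this sketch).
PROTECTED_DIRS = {"/", "/bin", "/boot", "/dev", "/etc", "/home", "/lib",
                  "/proc", "/root", "/sbin", "/sys", "/usr", "/var"}


def is_safe_to_remove(home_dir: str) -> bool:
    """Return True only if home_dir is an absolute, non-protected path."""
    if not home_dir or not posixpath.isabs(home_dir):
        return False
    # Collapse "." and ".." segments and repeated leading slashes so that
    # spellings such as "/./" or "//" are recognized as the root directory.
    normalized = "/" + posixpath.normpath(home_dir).lstrip("/")
    return normalized not in PROTECTED_DIRS


def check_account(account_name: str) -> bool:
    """Look up an account's home directory and report whether it is safe."""
    import pwd  # Unix-only module, so it is imported only when needed

    home_dir = pwd.getpwnam(account_name).pw_dir
    safe = is_safe_to_remove(home_dir)
    print(f"{account_name}: home={home_dir!r} safe_to_remove={safe}")
    return safe
```

Running a check like `check_account` on a single server first, and only then moving on to the rest, would also address the second half of the lesson.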
Connecting a device to the network results in DHCP exhaustion
A systems engineer for a networking vendor recalls that his manual error horror occurred when he was conducting a lab refresh as a lab manager.
He loaned a wireless controller to a junior member of another team, telling him to do what he wished with it and that he would clean it up once it was returned. When it came back, it appeared that the junior member hadn’t changed anything on it. So he decided it would be easier to configure it in-band for the new lab.
He connected the device back to where it was before in the data center, both to the production network and his lab network. (This was a normal setup, as the wireless controller acted as the demarcation point.)
He pinged it, checked that he could manage it remotely, confirmed that his credentials were still valid, and went home. He figured he could do the rest of his configuration from his desk the next day.
But things are not always as they appear.
It turned out that proxy ARP (Address Resolution Protocol) was enabled between the lab and production networks.
The device caused DHCP exhaustion overnight, impeding three floors of engineering staff from getting an IP address the next morning. It took the infrastructure team two to three hours to troubleshoot, block the port, reset the DHCP leases, and identify the offending controller setup.
The lesson: Factory reset devices before connecting them
The error left him with some important lessons.
“Always do a factory reset for a device that is connecting to the network. Even if it is a device that is returning,” he writes. “Don’t assume anything: verify. Don’t plug things in right before leaving. And minimize the devices that are on the production network.”
With one file deleted, data for 20,000 users gone
In one of his first jobs as a Unix systems administrator, a community member recalls that the major university he worked for had a text file that contained billing information for about 20,000 user accounts.
The process for editing it was to manually create a lock file—so that no automated processes would conflict with your editing—edit the file, and then delete the lock file. Both the lock file and the real data file began with the same two letters.
Can you guess where this is going? He completed his edits, but then deleted the real data file, not the lock file.
“And just like that,” he wrote, “billing information for 20,000 accounts was gone.” The gaffe was entirely a result of user error, he says, “pure and simple.”
The lesson: Use automated processes when you can
Thanks to nightly backups, he was able to restore the file and carry on. But it imparted some important career lessons.
“Always double-check what you’re doing, especially if it’s destructive in any way,” he writes. “Also, an automated process that checks conditions on things is better than any human. Had we had an automated process, this mistake wouldn’t have even been a possibility.”
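To illustrate what such an automated process might look like, here is a minimal sketch in which code, rather than a person, creates and removes the lock file. The `.lock` suffix, the function names, and the record format are all assumptions made for illustration; the university’s real process is not described in that detail.

```python
import os
from contextlib import contextmanager


@contextmanager
def locked(data_path: str):
    """Hold an exclusive lock file next to data_path while the block runs."""
    lock_path = data_path + ".lock"  # hypothetical naming convention
    # O_CREAT | O_EXCL raises FileExistsError if the lock already exists,
    # so two editors cannot hold the lock at the same time.
    fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    os.close(fd)
    try:
        yield data_path
    finally:
        # Cleanup removes the lock file and nothing else, so the data
        # file can never be deleted by a slip of the fingers.
        os.remove(lock_path)


def append_record(data_path: str, record: str) -> None:
    """Append one line to the data file while holding the lock."""
    with locked(data_path):
        with open(data_path, "a", encoding="utf-8") as handle:
            handle.write(record + "\n")
```

Because the lock path is derived from the data path in one place, nobody has to type either file name at delete time.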
Reusing an IP address stops users from reaching an application
A senior infrastructure integration analyst who works for a non-federal government agency recalls reusing an IP address that was already being used for network address translation (NAT) for a production service.
The NAT took over on the firewall and stopped all existing sessions. This stopped users from reaching the application. The outage lasted about two hours until the error was identified. He then reverted and re-did the change with a different IP address.
The lesson: Properly document IP usage
The reason behind this error was simple enough: poor documentation.
“It happened because we lacked documentation of IP usage,” he wrote. “I also wasn’t experienced enough with that firewall to know all the places to check to be sure my change wouldn’t cause a conflict. But really, the proper documentation would have saved us the headache.”
But even better than simple documentation (and the downsides of IP address spreadsheets are many), consider an IP address management (IPAM) solution, which gives you a single source of truth for the relationships between devices, users, and IP addresses on your network.
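As a rough illustration of the check that documented IP usage makes possible, the sketch below looks up a candidate address in a record of allocations before it is reused. The `ALLOCATIONS` table, its addresses (taken from documentation-only ranges), and the function name are hypothetical; a real IPAM product would supply this data through its own interface, which is not shown here.

```python
import ipaddress
from typing import Mapping, Optional

# Hypothetical allocation records, standing in for an IPAM data source.
ALLOCATIONS = {
    "203.0.113.10": "NAT for production web application",
    "203.0.113.11": "VPN gateway",
    "2001:db8::1": "IPv6 load balancer",
}


def find_conflict(candidate: str,
                  allocations: Mapping[str, str]) -> Optional[str]:
    """Return the documented use of candidate, or None if it is free."""
    # Parsing both sides means different spellings of the same address
    # (for example, expanded and compressed IPv6) still match.
    wanted = ipaddress.ip_address(candidate)
    for address, purpose in allocations.items():
        if ipaddress.ip_address(address) == wanted:
            return purpose
    return None
```

A change script could refuse to proceed whenever `find_conflict` returns anything other than `None`.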
A live production server shut down with the push of a button
Finally, a community member committed this manual error 15 years ago as a systems analyst at a university, but says he still remembers it like it was yesterday.
While conducting maintenance on a test environment server in a data center, he was working at a KVM (keyboard, video, mouse) console on a 1U rack appliance. The appliance slid out on rails, with a fold-up screen and a keyboard and trackpad below (much like a laptop mounted in the server rack). He reached below the KVM console to hit the power button on what he thought was the test server.
Turns out it was the live production server.
This resulted in an outage and degraded service for several hours during the business day. As the server was shut down improperly, it didn’t turn back on cleanly. It required consistency checks and maintenance before full restoration.
Meanwhile, several thousand users could not access email or course content.
The lesson: Know what power button you’re pressing
The data center had identical servers for production and test, which has its benefits. But they were not easily distinguishable when rack-mounted one on top of the other. They looked identical, he recalled, save for a small label that had the server name on each. The servers were also below the KVM console, so, when reaching below without looking, it was easy to mix the two up.
Further adding to the potential for confusion, his admin team occasionally switched the roles of the two servers between test and production in order to perform maintenance.
“The obvious takeaways are to ensure you know what power button you are pressing,” he writes. “Have clearer labels. Even different label colors to help distinguish between production and test infrastructure.
“Look more carefully when pressing a button,” he continued. “I even went as far as putting a piece of tape over the power button for the production server so I’d have to remove it and think twice about pressing the button.”