The Art and Science of Upgrading Infrastructure Services

How can upgrading a service infrastructure be an art AND a science?  I just click a button, the stuff upgrades, and I’m good… right?

Key takeaways

This key takeaway was generated through LLMs crawling the page and coming up with an overview of the content.

The article explains that upgrading infrastructure services is both an art and a science, requiring careful lab testing, documented procedures, and operational planning to avoid business impact from downtime. It outlines practical steps—build a similar test environment, create a test matrix with acceptance criteria, document upgrade procedures, choose an upgrade strategy (slow-roll vs. forklift), and plan backups and rollback processes—to reduce risk during production upgrades. The piece also emphasizes aligning cross-functional resources, validating post-upgrade behavior, and using BlueCat’s experienced teams to support successful upgrades and minimize operational disruption.

Why is lab testing more important than the actual production upgrade for infrastructure services?

Lab testing is critical because it exposes hardware, software, customizations, and integrations to the upgrade process in a low-risk environment, letting you validate success criteria before touching production. The article stresses that even a non‑mirror environment with similar hardware/software can reveal issues, and a test matrix with simple acceptance criteria provides documented proof of what was tested. This reduces the chance of unexpected failures during production, supports change requests, and serves as a cover‑your‑behind record if management asks whether specific scenarios were validated.

How should an organization choose between a slow-roll upgrade and a fork-lift approach?

The article recommends preferring slow-roll and segmented upgrades when possible because they limit blast radius and make reversion simpler if problems arise. However, it acknowledges maintenance windows and operational constraints sometimes force a fork-lift upgrade; in those cases, exhaustive lab testing and meticulous preplanning are even more important. Decision factors include the criticality of services, availability of segmented targets, maintenance window duration, and the organization’s ability to align hands-on resources and cross-team support to address issues quickly.

What operational plans and preparations should be in place in case an upgrade fails?

You must plan for the worst: ensure backup personnel are available and briefed so exhausted staff don’t make mistakes, prepare and test rollback procedures with files and timelines, and determine whether boots‑on‑the‑ground are required and how long a rollback will take. The article also advises establishing go/no‑go checkpoints during windows to avoid partial work that requires full reversion, lining up network, firewall and other teams ahead of time, and validating post‑rollback steps. Engaging BlueCat’s Professional Services or Technical Account Management teams is recommended to leverage their upgrade experience and vetted methodologies.

How can upgrading a service infrastructure be an art AND a science?  I just click a button, the stuff upgrades, and I’m good… right?  We’re talking infrastructure services here, people.  If the infrastructure is unavailable, the business loses money.

If I asked 100 admins, “Do you enjoy testing software upgrades in the lab?”, exactly 100 of them would say, “NO.” But guess what? Testing is more important than actually performing the upgrade.

What do I test?

  • The hardware and software you’re upgrading – You can’t test if you don’t have an environment. It doesn’t have to be a mirror image, but having similar hardware/software is needed, albeit in a reduced capacity.
  • Test matrix with success criteria – A matrix of what you’ve tested and whether it passed (simple acceptance criteria) is essential. It’s a big CYA (cover your behind) move: if management asks, “Did you test XYZ?” you can say, “YES!” Using DNS, DHCP and IP Address Management as an example, your test matrix should include things like:

      o Upgrading your DNS primary and DHCP server(s) from version A to version B
      o Testing a variety of services and devices, if available
      o Validating that any customizations still work, including API environments

  • Having an upgrade document – You’re testing the upgrade in your lab, performing steps that will simply need to be repeated in production, so why not document them? This ensures that nothing gets overlooked or forgotten during the production upgrade. And you might be able to use this document to help support your change request.
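As a rough illustration of the test matrix described above, here is a minimal Python sketch. The `MatrixEntry` and `summarize` names, the result states, and the sample acceptance criteria are all assumptions made for this example; they are not part of any BlueCat tool. The idea is only to record what was tested, the simple pass criterion, and the outcome, so the answer to “Did you test XYZ?” is written down.

```python
from dataclasses import dataclass
from enum import Enum


class Result(Enum):
    NOT_RUN = "not run"
    PASS = "pass"
    FAIL = "fail"


@dataclass
class MatrixEntry:
    """One row of a hypothetical upgrade test matrix."""
    name: str                      # what was tested
    acceptance: str                # simple acceptance criterion
    result: Result = Result.NOT_RUN
    notes: str = ""


def summarize(matrix):
    """Count results and report whether every entry passed."""
    counts = {r: 0 for r in Result}
    for entry in matrix:
        counts[entry.result] += 1
    # An empty matrix is never "ready": nothing has been proven.
    ready = len(matrix) > 0 and counts[Result.PASS] == len(matrix)
    return counts, ready


# Sample rows based on the DNS/DHCP examples in the list above.
matrix = [
    MatrixEntry("Upgrade DNS primary from version A to version B",
                "Server answers queries for lab zones after upgrade",
                Result.PASS),
    MatrixEntry("Upgrade DHCP server from version A to version B",
                "Lab client obtains a lease after upgrade",
                Result.PASS),
    MatrixEntry("API customizations",
                "Existing scripts run without errors"),
]

counts, ready = summarize(matrix)
for entry in matrix:
    print(f"{entry.result.value:8} | {entry.name} | {entry.acceptance}")
print("Every entry passed:", ready)
```

With the sample rows, the last line reports `False` because the API customization entry has not been run yet. A spreadsheet does the same job; what matters is that each row has a criterion and a recorded result.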

What’s my upgrade strategy?
Alright, so you’ve tested the upgrade in the lab in your “spare” time and everything is good. Now do you upgrade with a slow roll, or do a fork-lift upgrade?

  • Slow-roll and segmentation of upgrades, if possible – Again, we’re talking about business-critical core services here. Doing a fork-lift upgrade and then having to revert a large portion of network infrastructure can be painful. That said, maintenance windows for infrastructure services are hard to procure and it’s not always possible to slow-roll. If you have to do a fork-lift upgrade, it’s all the more important to test your upgrade meticulously beforehand.
  • Aligning resources – Most, if not all, enterprises have multiple data centers in multiple locations. It’s important that you’ve got hands and feet ready to hit the DC if problems arise. It’s also important to line up resources from your network teams, firewall/security teams, etc.
  • Go/no-go checkpoints – What happens if you’re a couple of hours into your six-hour maintenance window and you know you simply won’t come close to completing your task? There’s no sense in completing more of the work when you’ll simply need to revert.
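The go/no-go checkpoint above comes down to simple arithmetic on the maintenance window. The sketch below is one possible way to express it in Python; the `go_no_go` function, the `Checkpoint` fields, and every duration in the example are hypothetical, and real estimates would come from your lab runs and upgrade document.

```python
from datetime import timedelta
from typing import NamedTuple


class Checkpoint(NamedTuple):
    go: bool             # True if the remaining work still fits in the window
    rollback_fits: bool  # True if a rollback started now would finish in the window


def go_no_go(window, elapsed, remaining_work, rollback):
    """Evaluate a checkpoint from estimated durations (all timedelta values)."""
    time_left = window - elapsed
    return Checkpoint(
        go=remaining_work <= time_left,
        rollback_fits=rollback <= time_left,
    )


# Hypothetical numbers: two hours into a six-hour window,
# with five hours of work left and a one-hour rollback.
decision = go_no_go(
    window=timedelta(hours=6),
    elapsed=timedelta(hours=2),
    remaining_work=timedelta(hours=5),
    rollback=timedelta(hours=1),
)
print(decision)  # Checkpoint(go=False, rollback_fits=True)
```

In this example the call is no-go: the remaining work no longer fits, but a rollback started now still does, which is the point at which reverting is cheapest.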

Planning for the worst
We know that nothing will go wrong with your upgrade, especially since you’re running BlueCat gear. But you still need to plan for the worst and ensure all your bases are covered – no one ever got in trouble for being prepared.

  • Backup resources – What if something unexpected happens during your six-hour maintenance window? Say the network team forgot to inform you of a core router change, the network is down, and you can’t validate your upgrade. If it takes the network team hours to fix their problems, you’ll have been awake for 24+ hours. People make mistakes when they’re tired. Having a backup resource available and up to speed is a good plan.
  • Roll-back – What if you need to roll back? Do you need “boots on the ground” at remote sites? How long will it take? Do you have the files at the ready? Have you tested the roll-back? Does anything else need to happen after the roll-back has taken place?

Engage the upgrade ninjas
Upgrading infrastructure isn’t something that happens often – maybe once or twice a year. Engaging the BlueCat “upgrade ninjas” will help you navigate through a successful upgrade. Here at BlueCat, we’ve got a handful of teams – from Professional Services to our Technical Account Management teams – that have the field experience and solid, vetted methodologies to ensure a successful upgrade.

Alright, I’m upgraded.  Now what?
OK, your upgrade is done. No alerts have fired. Nothing seems to have blown up. What’s next? VALIDATION! Everyone loves validation – you know, checking log files, running some tests, working with other operational teams, ensuring applications are up and running. When doing a slow-roll upgrade, having some burn-in time before your next major upgrade will be the ultimate validation that your upgrade has been successful.
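“Running some tests” can be as simple as confirming that a known list of names still resolves after the upgrade. The Python sketch below shows one way to do that with the standard library’s `socket.gethostbyname`; the function names, the hostnames, and the fake resolver (used so the example runs without a network) are assumptions for illustration. It covers name resolution only, not the log and application checks mentioned above.

```python
import socket


def validate_resolution(names, resolve=socket.gethostbyname):
    """Try to resolve each name; return {name: address, or None on failure}."""
    results = {}
    for name in names:
        try:
            results[name] = resolve(name)
        except OSError:  # socket.gaierror is a subclass of OSError
            results[name] = None
    return results


def failures(results):
    """Names that did not resolve, sorted for stable reporting."""
    return sorted(name for name, addr in results.items() if addr is None)


def fake_resolver(name):
    """Stand-in resolver with made-up lab data, so the example runs offline."""
    table = {"app.lab.example": "192.0.2.10"}
    if name not in table:
        raise socket.gaierror(f"cannot resolve {name}")
    return table[name]


results = validate_resolution(
    ["app.lab.example", "db.lab.example"],
    resolve=fake_resolver,
)
print(failures(results))  # ['db.lab.example']
```

Dropping the `resolve` argument makes the same function use the system resolver, and running the same list before and after the upgrade gives a simple comparison.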

I’ll leave you with a quote from an unnamed Samurai: “Cry in the dojo, laugh on the battlefield.”  It’s a mantra that the upgrade ninjas at BlueCat try to live by.



BlueCat provides core services and solutions that help our customers and their teams deliver change-ready networks. With BlueCat, organizations can build reliable, secure, and agile mission-critical networks that can support transformation initiatives such as cloud adoption and automation. BlueCat’s growing portfolio includes services and solutions for automated and unified DDI management, network security, multicloud management, and network observability and health.
