As a Citrix Technical Account Manager (TAM), I get to see outages from a unique perspective. While I am not in the trenches anymore, my responsibilities include guiding my customers during some of their most difficult times. I have learned that the work you do before a production outage happens can greatly reduce the time to resolution and improve the overall customer experience.

The possibility of an outage always exists. But in the midst of an event like the COVID-19 pandemic, which has brought new challenges for IT teams, being ready for an outage and being able to respond quickly and efficiently are critical.

During an outage, a well-thought-out plan that you’ve developed ahead of time is key to reducing chaos and returning to normal operations. Here are some of the most common, but often overlooked, parts of a plan that can help ensure your team is prepared to navigate an outage.

Define Roles

It’s important to look beyond your engineering needs as you define roles for staff during outages. Some of the most common roles I’ve seen include:

  • Incident Lead: This person coordinates all things related to the outage.
  • Communication Lead: This person ensures updates go to the correct personnel in a timely manner. Typically, they are the liaison between internal tech teams and leadership.
  • Change Manager: Getting the system back up can involve many changes in a short timeframe, so customers often have one person responsible for both managing and documenting the changes made during troubleshooting.
  • Critical Response Team: This is the technical team responsible for troubleshooting and eventually resolving the issue. These engineers work directly with vendors as needed and coordinate with the other roles to help ensure the response runs smoothly.

Training

An added benefit of working as a Citrix Technical Account Manager is that I get to see various technical support and escalation teams that all have their preferred toolsets to use while troubleshooting customer issues. While there are many tools available, the following are commonly used for most of the issues.

Plan the Communications

In the heat of the moment, providing proactive and timely updates to your stakeholders is critical. Communication is key to providing clarity and ensuring the correct stakeholders are up to speed. Decide in advance which tools you’ll use and how you’ll use them; most customers use email for leadership updates while the technical team uses a bridge and/or a chat tool like Slack or Skype. This will help reduce confusion during an outage and help you optimize your workflow process. Another approach that I’ve seen help things run smoothly is having two phone bridges: one for the technical team and one for management updates. The communication lead typically liaises between the two bridges.

Most customers try to update all teams involved at set intervals with any relevant information, from what troubleshooting is being performed to ideas about why the issue is occurring and ETAs for next steps. Remember, you have two or three different stakeholders and the update to each group should contain different verbiage. Typically, the communications lead handles the interactions between the technical team, leadership, and sometimes the public. When updating each group, remember that a concise message is always best, as long as it contains the relevant information. Developing communication templates can help with efficient communication during an outage.
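As an illustration of what communication templates might look like, here is a minimal Python sketch that keeps one template per audience and fills it in at update time. The audience names, field names, and wording are all hypothetical; adapt them to whatever your teams and leadership actually need to see.

```python
from string import Template

# Hypothetical templates, one per audience. The fields and wording are
# illustrative assumptions, not a prescribed format.
TEMPLATES = {
    "technical": Template(
        "[$time] Incident: $summary\n"
        "Troubleshooting in progress: $actions\n"
        "Working theory: $theory\n"
        "Next update: $next_update"
    ),
    "leadership": Template(
        "[$time] Incident: $summary\n"
        "Impact: $impact\n"
        "ETA for next step: $eta\n"
        "Next update: $next_update"
    ),
}


def render_update(audience: str, **fields: str) -> str:
    """Fill in the template for one audience.

    Raises KeyError if the audience is unknown or a required field is
    missing, so an incomplete update is caught before it is sent.
    """
    return TEMPLATES[audience].substitute(**fields)


if __name__ == "__main__":
    print(render_update(
        "leadership",
        time="14:00",
        summary="Users cannot launch published apps",
        impact="Remote staff in one region",
        eta="15:00",
        next_update="14:30",
    ))
```

Keeping the templates short and audience-specific means the communication lead only has to fill in the blanks at each interval rather than write every update from scratch.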

Lean on Partnerships

It’s important to work efficiently with your internal teams. The customers whose teams work best together usually come through outages with the least damage. This is typically done by taking command of your areas of control and proactively communicating findings. Also, plan to pull in your vendors at the first hint of an issue. At Citrix, we recommend you open a case as soon as possible. Citrix Priority customers even have access to Critical Situation Managers. I can speak from experience about how useful these team members can be in helping you navigate a critical outage. For severity-one issues, our <10-minute initial response target and <4-hour restoration target help accelerate remediation for your most important cases. See Priority Plus features.

Prepare Constantly and Adjust as Needed

Your workflow process around outages should be reviewed and revised frequently to streamline it for your current environment. Request input from all parties involved, from end users to management, to identify potential improvements. Go with what works and remove processes that do not benefit your situation. Re-evaluate the outage process at least every 12 months, or more often if significant changes occur in your organization.

Questions? Tips? Lessons Learned?

Reach out to your Citrix technical account manager if you have any questions or would like to chat about other ways to help with outage management. Don’t have a technical account manager? Contact us for more information. And share your tips and lessons learned in the comments below.