Giving High Quality Support When Things Go Wrong.

It isn’t Always Pretty.
It was a Friday morning, around 7:45am Eastern time. Fridays are typically our slower days, so I had settled in with my breakfast and a glass of tea to update a few help articles. I was clicking through the system taking screenshots when I received a server error.
Around here we’re lucky: our team runs a tight ship, and things very rarely go down. I quickly alerted the technical team (most of whom were home, as it was Friday night in India) on our chat program, Slack, and they jumped on the problem.
The thing is, this time it wasn’t a quick fix. It took them about 3.5 hours to find the problem, push a fix, and deploy it. In my tenure at Recruiterbox, this was the longest downtime we have had.
Obviously this post isn’t about the downtime directly; it’s about how we handled it on Support.
A 3.5-hour downtime beginning at 7:45am Eastern managed to span the beginning of the work day for all of our continental US customers. Each time a new timezone came online, we received another wave of messages. Having a set of messages to send at certain times helped us get back to customers quickly. Below are the steps we took and the lessons we learned.
Step 1 — Update your status page
Our first step in a known downtime is to update our status page. This allows us to alert customers of the downtime. We also added a notice to the banner at the top of the site. For this downtime, accessing the system was hit or miss, depending on which server your request hit. Because of this, the status message helped some customers but not others.
Step 2 — Prepare for the wave of messages
The next thing we did was log off of live chat. This was a spur-of-the-moment decision, but we didn’t want to lose customers in mid-conversation on live chat if they hit the down server. We were also receiving a large volume of emails and wanted to reply to them quickly. Concentrating on just emails allowed us to reach the largest range of users at once.
Once the downtime was identified, we created a standard message so that we could quickly write back to customers who noted the outage. Fast replies are something our team is already known for, but during this time we felt it was critical to assure customers that our team was looking at the issue.
Thank you for reaching out to us — our apologies for the roadblock in your hiring workflow this morning!
Our team is aware of this issue, and is working on it as we speak.
We’ll be back in touch as soon as we know more! Appreciate your patience!
In our ticketing system (Freshdesk) we also created a new status tag for these tickets, to make it easy to find them for follow-up. I also kept a note on my computer with each ticket URL, so I could double-check that we got back to every customer.
Step 3 — If it has been more than 45 minutes, follow up again, and again, and again, until it’s over
Once the downtime had passed the 45 minute mark, we started writing again to those people who had been waiting more than 45 minutes. We wanted them to know we hadn’t forgotten about them and that we were still working on everything. We continued this (with variations to the message) until we were back up.
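For teams that track outage tickets in a spreadsheet or script rather than by hand, the 45-minute check can be sketched roughly as below. This is only an illustration, not what we ran that morning: the `OutageTicket` record and the `tickets_due_for_follow_up` helper are hypothetical names, and the sketch assumes you record when you last wrote to each customer.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# The follow-up interval described in Step 3.
FOLLOW_UP_INTERVAL = timedelta(minutes=45)


@dataclass
class OutageTicket:
    """Hypothetical record for one customer ticket opened during the outage."""
    url: str                 # link to the ticket in the help desk
    last_reply_at: datetime  # when we last wrote to this customer


def tickets_due_for_follow_up(tickets, now, interval=FOLLOW_UP_INTERVAL):
    """Return tickets whose last reply is at least `interval` old, oldest first."""
    due = [t for t in tickets if now - t.last_reply_at >= interval]
    return sorted(due, key=lambda t: t.last_reply_at)
```

Running this every few minutes, and resetting `last_reply_at` after each update is sent, keeps the longest-waiting customers at the top of the list.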
We wanted to give you a quick update -
Our team is still investigating the cause of the current application slowness and error messages. We are working hard to resolve this issue.
Our apologies for the continued delay in your workflow! We will be in touch as soon as we know more!
Step 4 — Follow up (again) once it is over (but not too eagerly)
Once the downtime was over, we made sure to tell everyone. I checked the list of tagged tickets, and the list I’d been keeping in my notes, multiple times to make sure we had written back to every customer.
We also included an apology for the delay to their workflow, and a thank-you for their patience during the event.
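If both lists are available as plain ticket URLs, the cross-check can be done with a couple of set operations. The sketch below is a minimal illustration under that assumption; the function names are hypothetical, and the inputs would have to be exported or copied from your own tools.

```python
def find_missed_customers(tagged_tickets, noted_tickets, all_clear_sent):
    """Return ticket URLs from either list that have not received the all clear."""
    known = set(tagged_tickets) | set(noted_tickets)
    return sorted(known - set(all_clear_sent))


def find_list_mismatches(tagged_tickets, noted_tickets):
    """Return (only_tagged, only_noted): tickets that appear in just one list."""
    tagged, noted = set(tagged_tickets), set(noted_tickets)
    return sorted(tagged - noted), sorted(noted - tagged)
```

An empty result from `find_missed_customers` means every known ticket got the all clear. Anything returned by `find_list_mismatches` is a ticket that was tagged but never noted, or noted but never tagged, and is worth a second look.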
Especially with a downtime this long, it becomes easy to get super excited to give the all clear. Doing so falsely is even worse than waiting a bit longer. Make sure you’re 100% sure you’re back up before you send the all clear message.
Also be sure to return to your status page, and update it.
Thank you for your super awesome patience as we worked through a few different server issues this morning!
We apologize for any delay this caused to your Friday!
The system is back up and running! Our team will continue to monitor for any issues over the next few hours.
We hope you are headed towards a lovely weekend! Please reach out right away if you have any questions!
We also tagged these tickets for follow-up the next day.
Step 5 — Follow up again the next day
Most of the customers we wrote to with the “all clear” wrote back fairly quickly, confirming everything was up and running. Anyone we hadn’t heard from by the next day we wrote to again, just to confirm all was okay. This may seem like an extra step, but the second follow-up did surface a few lingering bugs: things we had assumed were related to the downtime but were actually a different problem altogether.
I wanted to drop you a friendly follow-up: is everything running smoothly for you now?
Please feel free to reach out to us at any time!
Step 6 — Complete your Retrospective
After any event like this, everyone who was involved should complete a retrospective. It is used to explore the root cause of the issue, as well as how it can be handled better in the future. Do this as soon as possible after the event, while it is fresh in everyone’s mind. Share the notes with everyone, even those who weren’t involved in the event.
Additional Thoughts and Lessons Learned:
By the end of this downtime, almost 50 customers had reached out, and we sent each of them a minimum of 3 messages: one acknowledging the event, at least one 45-minute follow-up, and an all clear (and potentially a next-day follow-up). Some customers also wrote back asking questions, which we did our best to answer without making any promises.
We found the pace to be difficult but manageable with two of us answering the emails, tagging the tickets, and monitoring updates from the technical team.
At about 45 minutes after each hour the tickets would begin to slow — but they would quickly pick up again as another US timezone joined the mix. Once we anticipated this, the last two time zones were much easier.
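The timing of those waves can be estimated ahead of time. The sketch below is a rough illustration only: it assumes customers start work at 8am local time (an assumption, not something we measured), treats each zone as a whole number of hours behind Eastern, and uses a hypothetical helper name, `expected_waves`.

```python
from datetime import datetime, timedelta

# Hours each continental US zone lags behind Eastern time.
HOURS_BEHIND_EASTERN = {"Eastern": 0, "Central": 1, "Mountain": 2, "Pacific": 3}


def expected_waves(outage_start, outage_end, workday_start_hour=8):
    """List (zone, eastern_time) pairs for workday starts that fall inside the outage.

    Both datetimes are naive Eastern times on the same day.
    `workday_start_hour` is an assumed local start of the work day.
    """
    waves = []
    for zone, lag in HOURS_BEHIND_EASTERN.items():
        # Local workday start, expressed on the Eastern clock.
        start_eastern = outage_start.replace(
            hour=workday_start_hour, minute=0, second=0, microsecond=0
        ) + timedelta(hours=lag)
        if outage_start <= start_eastern <= outage_end:
            waves.append((zone, start_eastern))
    return waves
```

For an outage running from 7:45am to 11:15am Eastern, this returns one entry per zone, an hour apart, which matches the hourly rhythm we saw in the queue.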
Sure, it would have been simple to slap up an automated reply on the support email address (shudder) and call it a day, but not all of the messages we were getting were related to the downtime.
A few users could still access the system, and they had general, unrelated questions. We made sure to read messages carefully, and notice these — to avoid sending them the standard downtime message.
We also receive a number of emails every day that are not related to the use of the system, such as inquiries from potential new clients. We made sure not to send them a downtime notice either, of course.
Focus. Stop doing everything else.
With only two of us, the only way to manage the pace was by making emailed tickets our only activity. Everything else was stopped. I even told a teammate who started chatting us up about weekend plans to stop. We had one goal — get back to our customers with reassurance as quickly as possible.
We have a variety of ways to access client information, and we were using all of them during this downtime. It’s amazing how blind you suddenly feel when you can’t access the tools you use every single day. Make sure you have redundancies in place for accessing vital data points, and that you’re up to date on how to use them.
Lastly, take a deep breath, and don’t take it personally
When things don’t work, people are upset. This is a given when working in support, but especially so when things go wrong. Sometimes people simply need a person to vent to and, at this moment, you are that person. Take a deep breath. Put on your cheesiest smile, and be ready for a cold beer that night :)
How does your team handle downtimes?
I’d love to learn more from you!
Originally published at inside.recruiterbox.com on April 5, 2016.