April 20th, 2021 Outage Postmortem

What happened?

On Tuesday, April 20th, between 2:28 pm and 4:05 pm EST, the Makeswift app and live pages were down. The downtime was caused by an outage at our authentication provider, Auth0. Auth0 has published more information about the outage.

We apologize for the impact this may have had on your business. On this page, we outline what went wrong and what we are doing to prevent an issue like this from happening again.

Timeline

Apr 20, 2:28 pm

Our monitoring systems alert us about increased error rates in live pages.

Apr 20, 2:45 pm

We notice that live pages are down and start investigating the issue.

Apr 20, 3:00 pm

We implement a temporary fix to get live pages back up.

Apr 20, 3:30 pm

We push the code with the temporary fix.

Apr 20, 3:36 pm

Our integration tests fail due to Auth0 being down.

Apr 20, 3:41 pm

We decide to bypass our CI/CD systems and start building the artifacts with the fixes locally.

Apr 20, 4:05 pm

We successfully deploy the temporary fix and live pages come back online.

Apr 20, 4:07 pm

Auth0 starts to recover, resulting in recovery of the Makeswift app.

After deploying a temporary fix and ensuring our systems were stable, we immediately started to analyze the root cause of the problem and to identify ways in which we could prevent an incident like this from happening again. Below is a list of the underlying problems we identified and a solution to each one.

What went wrong?

Reliance on Auth0 to authenticate the live page renderer with the API resulted in the live pages going down when Auth0 went down.
Reliance on Auth0 for non-authentication queries in our API resulted in our app going down.
Noise in our monitoring and alert systems resulted in a delayed response.
Inability to bypass our CI/CD pipeline for emergency deployments resulted in a delayed resolution.

What are we doing to prevent something like this from happening again?

We will remove the reliance on Auth0 to authenticate the live page renderer with the API.
We will update our API's architecture so that it relies on Auth0 only for authentication requests (i.e., signup, login, and logout).
We will separate our monitoring and alert systems by environment to reduce noise and implement a protocol to handle alerts so that there is no delay between an issue happening and us knowing about it.
We will improve our disaster-recovery protocols by writing scripts that let us quickly deploy fixes and patches when we need to bypass our CI/CD systems.

Again, we apologize for the impact this downtime may have had on your business. We want to assure you that the availability of our services, especially live pages, is our top priority.