This error caused a Microsoft Azure outage

Microsoft Azure DevOps, a suite of application lifecycle services, went down for roughly 10 hours on Wednesday in the South Brazil region owing to a simple coding error.

Eric Mattingly, lead software engineering manager, apologized for the downtime on Friday and explained the cause: a typo that deleted seventeen production databases.

Mattingly noted that Azure DevOps engineers regularly take snapshots of production databases to investigate reported issues or test performance improvements. They also rely on a background job that runs every day and deletes old snapshots after a set period of time.

During a recent sprint (a time-boxed work cycle in Agile jargon), Azure DevOps engineers replaced the deprecated Microsoft.Azure.Management.* packages with the supported Azure.ResourceManager.* NuGet packages.

The package upgrade produced a large pull request that swapped API calls in the older packages for their equivalents in the newer ones. A pull request is a code change that must be reviewed before it is merged into the relevant project, and the error was in this one. As a result, the background snapshot deletion job deleted an entire server.

“Hidden within this pull request was a typo bug in the snapshot deletion job, which swapped out a call to delete the Azure SQL Database for one that deletes the Azure SQL Server that hosts the database,” Mattingly explained.
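The difference between the two calls can be illustrated with a minimal Python sketch. This is not Microsoft's code or the Azure SDK: `FakeSqlServer`, `purge_old_snapshots`, and the `snapshot-` naming prefix are hypothetical names invented for this example, and the server is an in-memory stand-in.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta


@dataclass
class FakeSqlServer:
    """In-memory stand-in for a SQL server hosting several databases (hypothetical)."""
    name: str
    databases: dict = field(default_factory=dict)  # database name -> creation time
    deleted: bool = False

    def delete_database(self, db_name: str) -> None:
        # Narrow operation: removes a single database.
        del self.databases[db_name]

    def delete_server(self) -> None:
        # Broad operation: removes the server and every database on it.
        self.databases.clear()
        self.deleted = True


def purge_old_snapshots(server, now, max_age, prefix="snapshot-"):
    """Delete only snapshot databases older than max_age and return their names."""
    expired = [
        name for name, created in server.databases.items()
        if name.startswith(prefix) and now - created > max_age
    ]
    for name in expired:
        # The reported typo amounted to making the server-level call at this point.
        server.delete_database(name)
    return expired
```

In this sketch, a single misplaced call to `delete_server()` inside the loop would remove the production databases along with the expired snapshot, which is the failure mode the quote describes.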

Although Azure DevOps has tests meant to catch such issues, Mattingly said the faulty code runs only under particular conditions and so isn’t well covered by existing tests. Those conditions presumably require a database snapshot old enough to be picked up by the deletion job.

Mattingly stated that Sprint 222 was deployed internally (Ring 0) without problems because that environment had no snapshot databases. Several days later, the changes were deployed to the customer environment (Ring 1) for the South Brazil scale unit (a cluster of servers dedicated to a given job).

Because that environment included a snapshot database that was old enough to trigger the bug, the background job deleted the “entire Azure SQL Server and all seventeen production databases” for the scale unit.
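A small Python sketch shows why the same faulty code can pass in one environment and be destructive in another. The function name, the seven-day retention period, the `snapshot-` prefix, and the sample data are all assumptions made for illustration; none of them come from Microsoft's report.

```python
from datetime import datetime, timedelta

MAX_AGE = timedelta(days=7)  # assumed retention period, for illustration only


def buggy_cleanup(databases, now):
    """Hypothetical buggy job: wipes every database once any snapshot expires."""
    for name, created in list(databases.items()):
        if name.startswith("snapshot-") and now - created > MAX_AGE:
            databases.clear()  # stands in for deleting the whole server
            break
    return databases


now = datetime(2024, 1, 1)  # arbitrary reference time

# Ring 0 stand-in: no snapshots, so the faulty branch never runs.
ring0 = {"prod-1": now - timedelta(days=400)}

# Ring 1 stand-in: one old snapshot is enough to trigger the bug.
ring1 = {
    "prod-1": now - timedelta(days=400),
    "snapshot-a": now - timedelta(days=30),
}

print(sorted(buggy_cleanup(ring0, now)))  # ['prod-1']
print(sorted(buggy_cleanup(ring1, now)))  # []
```

A test that seeds an expired snapshot before running the job would exercise the faulty branch even in an environment that normally has none.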

All of the data was recovered, but it took more than ten hours. Mattingly explained that there were several reasons for this.

One is that because customers can’t restore a deleted Azure SQL Server themselves, on-call Azure engineers had to handle it, a process that took roughly an hour.

Another reason was that the databases had different backup configurations: some were set up for Zone-redundant backup, while others used the newer Geo-zone-redundant backup. Reconciling this mismatch added several hours to the recovery process.

“Finally,” Mattingly added, “even after databases were restored, the entire scale unit remained inaccessible, even to customers whose data was in those databases, due to a complex set of issues with our web servers.”

These issues stemmed from a server warm-up task that iterated through the list of available databases with a test call. Databases that were still being recovered returned an error, causing the warm-up test to “perform an exponential backoff retry, resulting in warmup taking ninety minutes on average, versus sub-second in a normal situation.”
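To see how exponential backoff can stretch a sub-second check into a very long wait, consider the arithmetic in the Python sketch below. The one-second base delay, the doubling factor, and the retry count are assumptions chosen for illustration; the report does not give the actual retry parameters.

```python
def backoff_delays(base_seconds, factor, retries, cap_seconds=None):
    """Return the wait before each retry, growing geometrically up to an optional cap."""
    delays = []
    delay = base_seconds
    for _ in range(retries):
        delays.append(delay if cap_seconds is None else min(delay, cap_seconds))
        delay *= factor
    return delays


# Uncapped: 1 + 2 + 4 + ... + 2048 seconds across 12 retries.
uncapped = sum(backoff_delays(1.0, 2.0, 12))
print(uncapped, uncapped / 60)  # 4095.0 68.25

# Capping each wait at 30 seconds keeps the total far smaller.
capped = sum(backoff_delays(1.0, 2.0, 12, cap_seconds=30.0))
print(capped)  # 241.0
```

With these assumed numbers, a single failing database adds over an hour of waiting, and a cap on the per-retry delay is one common way to bound that total.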

To make matters worse, the recovery was staggered, and once one or two of the servers began accepting customer traffic again, they would become overloaded and crash. To restore service, all traffic to the South Brazil scale unit had to be blocked until every server was ready to rejoin the load balancer and handle traffic.
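The decision to hold traffic back can be expressed as a simple capacity check, sketched below in Python. The function name and the capacity figures are hypothetical; the report does not describe how the load balancer decision was actually made.

```python
def should_open_traffic(warm_servers, capacity_per_server, expected_load):
    """Return True only when the warmed-up servers can absorb the expected load together."""
    return warm_servers * capacity_per_server >= expected_load


# Two warm servers facing the load meant for ten would be overwhelmed.
print(should_open_traffic(2, 100, 1000))   # False
print(should_open_traffic(10, 100, 1000))  # True
```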

Several fixes and configuration changes have been put in place to prevent the problem from recurring.

“Once again, we apologize to all of the customers who have been impacted by this outage,” Mattingly stated.