--- Azure RCA 03/05/2025 ---
What Happened -
A code bug was detected in one of our environments was repaired and merged back into the current deployment release. This caused a new build to be generated which got erroneously included in the current deployment that impacted some scale units in West Europe and East Us regions.
How did we respond -
We deployed a version of the current build that did not have the fix for the bug.
What went wrong and why -
After the fix for the bug was introduced, it changed the version of the build and it was this new version that was incorrectly used to deploy to the affected scale units.
What will MSFT do to prevent recurrence of this problem -
We will put processes in place to prevent the inclusion of changes that occurred after the deployment has made progress into the production environment. We revise our deployment review process before approval of the deployment.
Summary
An unexpected gateway drop in our US-East Azure IoT Hub lasted beyond the usual 15-minute tolerance. Manual failover was initiated, restoring device connectivity within 45 minutes.
Timeline
Contributing Factors
Resolution
Next Steps