US Network Server Outage

Incident Report for myDevices, Inc

Postmortem

--- Azure RCA 03/05/2025 ---

What Happened -
A code bug was detected in one of our environments was repaired and merged back into the current deployment release. This caused a new build to be generated which got erroneously included in the current deployment that impacted some scale units in West Europe and East Us regions.

How did we respond -
We deployed a version of the current build that did not have the fix for the bug.

What went wrong and why -
After the fix for the bug was introduced, it changed the version of the build and it was this new version that was incorrectly used to deploy to the affected scale units.

What will MSFT do to prevent recurrence of this problem -
We will put processes in place to prevent the inclusion of changes that occurred after the deployment has made progress into the production environment. We revise our deployment review process before approval of the deployment.


Summary

An unexpected gateway drop in our US-East Azure IoT Hub lasted beyond the usual 15-minute tolerance. Manual failover was initiated, restoring device connectivity within 45 minutes.

Timeline

  • 05:15 UTC – Monitoring detected a significant drop of connected devices in US East.
  • 05:30 UTC – Investigation ongoing; no recovery observed.
  • 05:45 UTC – Initiated manual failover; devices began reconnecting after ~10–15 minutes.
  • 06:00 UTC – Majority of devices back online; incident resolved.

Contributing Factors

  • No scheduled maintenance notification for Azure IoT Hub.
  • Delay in disabling automatic gateway offline alerts triggered unnecessary notifications.

Resolution

  • Performed manual failover to a different region, reducing downtime.
  • Devices reconnected steadily after failover.
  • Working with Microsoft on a Root Cause Analysis (RCA).

Next Steps

  • Incorporate Azure IoT Hub status checks into our alerting.
  • Ensure gateway offline alerts are paused during failover.
  • Update incident report once Microsoft’s RCA is received.
Posted Mar 05, 2025 - 00:35 UTC

Resolved

This incident has been resolved.
Posted Feb 22, 2025 - 06:44 UTC

Monitoring

A failover has been initiated for the region, devices are now recovering.
Posted Feb 22, 2025 - 06:10 UTC

Identified

The issue is related to an unexpected drop of devices in our US East Azure IoT Hub service.
Posted Feb 22, 2025 - 06:01 UTC

Update

We are continuing to investigate this issue.
Posted Feb 22, 2025 - 05:45 UTC

Investigating

We are currently investigating a drop of gateways connection in our US Region.
Posted Feb 22, 2025 - 05:15 UTC
This incident affected: LoRaWAN Network (Azure - US East) and Streaming, Device Connectivity, Device History.