Authentication Service Degradation

Incident Report for myDevices, Inc

Postmortem


Impact

  • Users were unable to log in or authenticate during the incident window
  • API calls requiring token validation or refresh returned errors
  • No data loss occurred

Root Cause

Elevated CPU usage on the shared database server, driven by unrelated workloads, caused existing database connections to be dropped. When the database recovered, a surge of queued client requests overwhelmed the authentication service and exhausted its database connection pool. With the pool exhausted, the service's health checks stopped responding, so Kubernetes repeatedly restarted the service before it could stabilize.
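
The failure mode described above can be reproduced in miniature. The sketch below is an illustration only, not the authentication service's code: `BoundedPool`, `handle_request`, `health_check`, and all sizes and timings are hypothetical, and it assumes the health check needs a pooled connection to answer.

```python
import queue
import threading
import time


class BoundedPool:
    """Toy fixed-size connection pool (a stand-in for a real database pool)."""

    def __init__(self, size: int) -> None:
        self._slots = queue.Queue()
        for conn_id in range(size):
            self._slots.put(conn_id)

    def acquire(self, timeout: float) -> int:
        # Raises queue.Empty if no connection frees up within the timeout.
        return self._slots.get(timeout=timeout)

    def release(self, conn_id: int) -> None:
        self._slots.put(conn_id)


def handle_request(pool: BoundedPool, hold_seconds: float) -> None:
    """A queued client request that holds a connection while it runs."""
    try:
        conn = pool.acquire(timeout=5.0)
    except queue.Empty:
        return
    try:
        time.sleep(hold_seconds)
    finally:
        pool.release(conn)


def health_check(pool: BoundedPool, timeout: float = 0.05) -> bool:
    """Hypothetical health check that needs a pooled connection to succeed."""
    try:
        conn = pool.acquire(timeout=timeout)
    except queue.Empty:
        return False
    pool.release(conn)
    return True


def simulate_surge(pool_size: int = 2, surge: int = 8, hold: float = 0.3) -> tuple[bool, bool]:
    """Return health check results (during the surge, after the surge drains)."""
    pool = BoundedPool(pool_size)
    workers = [
        threading.Thread(target=handle_request, args=(pool, hold))
        for _ in range(surge)
    ]
    for worker in workers:
        worker.start()
    time.sleep(0.1)  # give the surge time to take every connection
    during = health_check(pool)
    for worker in workers:
        worker.join()
    after = health_check(pool)
    return during, after
```

In this toy setup the health check fails while the surge holds every connection and passes again once the backlog drains, which is the window in which an orchestrator with tight probe thresholds would restart the process.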

Remediation

  • Database connection pool capacity was increased to handle traffic surges
  • Health check thresholds were tuned to allow adequate recovery time
  • A recovery runbook has been created for faster response to similar incidents
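
To make the health check tuning concrete: Kubernetes restarts a container once its liveness probe has failed `failureThreshold` times in a row, with probes sent every `periodSeconds`. The sketch below estimates how long a sustained outage a probe tolerates. The `ProbeSettings` class, the 90-second recovery time, and both sets of values are hypothetical; the report does not state the actual settings.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProbeSettings:
    """Subset of Kubernetes probe fields relevant to restart timing."""

    period_seconds: int     # maps to periodSeconds
    failure_threshold: int  # maps to failureThreshold

    def tolerated_outage_seconds(self) -> int:
        # Approximate span of consecutive probe failures before a restart.
        return self.period_seconds * self.failure_threshold


def survives(probe: ProbeSettings, recovery_seconds: int) -> bool:
    """True if the service can recover before the probe gives up on it."""
    return probe.tolerated_outage_seconds() > recovery_seconds


# Hypothetical values for illustration only.
strict = ProbeSettings(period_seconds=10, failure_threshold=3)
relaxed = ProbeSettings(period_seconds=10, failure_threshold=12)
```

Under these assumed numbers, a service that needs about 90 seconds to drain its backlog would be restarted under `strict` (a 30-second window) but left alone under `relaxed` (a 120-second window). The estimate ignores probe timeouts and scheduling jitter.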

Prevention

  • Connection pool sizing has been updated to handle peak load
  • Health check configuration has been hardened against transient overload
  • Database connection validation will be added to automatically replace stale connections
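
One common way to implement the connection validation mentioned above is to test each connection when it is checked out and replace it if the test fails. The sketch below shows that idea, using `sqlite3` from the standard library as a stand-in database so it runs anywhere. `ValidatingPool` and its methods are hypothetical names, the pool is single-threaded for brevity, and the production design may differ.

```python
import sqlite3
from collections import deque


class ValidatingPool:
    """Toy pool that validates a connection before handing it out."""

    def __init__(self, dsn: str, size: int) -> None:
        self._dsn = dsn
        self._idle = deque(sqlite3.connect(dsn) for _ in range(size))
        self.replaced = 0  # number of stale connections swapped out

    @staticmethod
    def _is_alive(conn: sqlite3.Connection) -> bool:
        # A cheap round trip; a dropped or closed connection raises here.
        try:
            conn.execute("SELECT 1").fetchone()
            return True
        except sqlite3.Error:
            return False

    def acquire(self) -> sqlite3.Connection:
        conn = self._idle.popleft()
        if not self._is_alive(conn):
            # Discard the stale connection and open a fresh one in its place.
            conn.close()
            conn = sqlite3.connect(self._dsn)
            self.replaced += 1
        return conn

    def release(self, conn: sqlite3.Connection) -> None:
        self._idle.append(conn)
```

A short usage example, simulating a connection that was dropped while idle:

```python
pool = ValidatingPool(":memory:", size=1)
conn = pool.acquire()
conn.close()          # simulate the server dropping the connection
pool.release(conn)
fresh = pool.acquire()  # the stale connection is detected and replaced
```

The trade-off is one extra round trip per checkout in exchange for never handing a dead connection to a request.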
Posted Feb 21, 2026 - 08:53 UTC

Resolved

## Timeline

**07:00 UTC - Issue detected**
Unrelated workloads drove up CPU usage on the shared database server, causing database connectivity issues and the first authentication service failures.

**07:04 UTC - Service degraded**
All authentication requests began failing, including user logins, token grants, and token refreshes.

**07:15 UTC - Investigation started**
Engineering identified the root cause as database connection pool exhaustion following the database instability. Failing Kubernetes health checks were restarting the service repeatedly before it could recover.

**08:00 UTC - Remediation applied**
Health check thresholds were adjusted and the database connection pool was resized to handle the backlog of queued requests.

**~09:30 UTC - Service restored**
The authentication service stabilized and resumed normal operation. No further restarts were observed.
Posted Feb 21, 2026 - 07:00 UTC