Impact
- Users were unable to log in or authenticate during the incident window
- API calls requiring token validation or refresh returned errors
- No data loss occurred
Root Cause
Elevated CPU on the shared database server caused existing database connections to be dropped. On recovery, the authentication service was overwhelmed by a surge of queued client requests, which exhausted the database connection pool. This prevented the service's health checks from responding, causing Kubernetes to repeatedly restart the service before it could stabilize.
Remediation
- Database connection pool capacity was increased to handle traffic surges
- Health check thresholds were tuned to allow adequate recovery time
- A recovery runbook has been created for faster response to similar incidents
Prevention
- Connection pool sizing has been updated to handle peak load
- Health check configuration has been hardened against transient overload
- Database connection validation will be added to automatically replace stale connections