Post-Incident Summary
Availability of Flexpa is critically important to our customers. As part of our normal incident response, we have conducted a post-incident review to determine the cause of this incident.
On April 19, we deployed a new database proxy / connection pooler (PgBouncer) as a new piece of infrastructure to assist in a migration effort. The deployment completed successfully and internal testing showed systems were operating normally.
On April 20, Flexpa experienced two brief API outages cumulatively lasting 17 minutes. We continued to experience high latency throughout the day.
At 12:17 PM EST it was observed that an internal application was not available.
At 12:22 PM EST our active database client connections began to drop dramatically from normal levels. At the same time, an internal responder was not able to successfully connect to the new database proxy infrastructure to investigate.
At 12:43 PM EST an availability monitor on api.flexpa.com was triggered and automatically posted a public status alarm. The internal responder escalated and an incident response call began.
At 12:51 PM EST the responding team restarted the database proxy and client connections successfully resumed.
While external availability of api.flexpa.com was restored at this time, internal applications continued to be unavailable due to database connectivity problems. Investigation continued with a review of the proxy logs. The investigating team determined that the application code used to connect to the database proxy had been misconfigured.
At 5:30 PM EST the incident response team attempted to deploy a configuration change. This configuration change was not successful and resulted in an additional 9 minutes of downtime before being reverted.
As a result, over the course of the evening of April 20, the incident response team removed the database proxy from our infrastructure, completing this work at 11:00 PM EST. Systems have been operating normally since that time.
Moving forward from this incident we are:
- Committing to additional testing of any database proxy infrastructure we add to our system in the future
- Improving alarms for internal services that can serve as a warning sign for external availability issues
- Re-affirming our commitment to operational excellence
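As an illustration of the kind of internal alarm described above, the following is a minimal sketch, not our production monitoring. It assumes connection counts are sampled periodically and compared against a known baseline; the function name `connection_drop_alarm` and the `drop_ratio` and `window` parameters are hypothetical and chosen only for this example.

```python
from statistics import mean


def connection_drop_alarm(samples, baseline, drop_ratio=0.5, window=3):
    """Return True when recent connection counts fall well below baseline.

    samples:    chronological list of active client connection counts
    baseline:   typical connection count under normal load (assumed known)
    drop_ratio: fraction of baseline below which the alarm fires (hypothetical default)
    window:     number of most recent samples to average (hypothetical default)
    """
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    # Not enough data yet to judge a sustained drop.
    if len(samples) < window:
        return False
    # Average the most recent samples to avoid firing on a single dip.
    recent = samples[-window:]
    return mean(recent) < drop_ratio * baseline
```

For example, with a baseline of 100 connections, samples of `[100, 98, 40, 30, 20]` would trigger the alarm, while `[100, 98, 97]` would not. Averaging over a short window is one simple way to flag a sustained drop like the one observed at 12:22 PM before external availability is affected.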
This incident has been resolved.
We are currently investigating this incident.