netPark

Slowness / Outage

Analysis

Around 12:15pm, netPark received an alert from our logging system that one of our production servers had entered a state where it could no longer handle additional requests. The logging system automatically restarted that server, and we began monitoring its activity. A few minutes later we received another alert about a different server being unable to serve requests.

Upon review we realized that our read-only database was locked up with a large number of stuck queries, which had started overflowing to the main database. This caused CPU and memory usage to spike, and soon the databases were unable to handle any more requests. This was around 12:30pm. We spent the next 20 minutes or so clearing the stuck queries, which gradually returned everything to regular performance.

Once everything was back in a good state, we began reviewing all of the logging we have for the servers. We haven’t yet determined what initially caused the queries to back up, but as part of this process we have identified several queries that had a measurable impact on performance. These queries are being tuned to run faster, or are having additional indexes added for them.

We have also added auto-scaling policies that automatically add database resources when we reach certain performance thresholds.
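As a rough illustration of what a threshold-based scaling policy does, the sketch below adds one database replica whenever CPU or memory usage crosses a limit. It is not netPark's actual configuration: the `ScalingPolicy` type, the `desired_replicas` function, and every threshold value are hypothetical and chosen only for the example.

```python
from dataclasses import dataclass


@dataclass
class ScalingPolicy:
    # All values here are hypothetical, not netPark's real thresholds.
    cpu_threshold: float      # percent CPU usage that triggers scaling
    memory_threshold: float   # percent memory usage that triggers scaling
    max_replicas: int         # upper bound on database replicas


def desired_replicas(policy: ScalingPolicy, current_replicas: int,
                     cpu_percent: float, memory_percent: float) -> int:
    """Return the replica count after evaluating the policy once."""
    over_threshold = (cpu_percent >= policy.cpu_threshold
                      or memory_percent >= policy.memory_threshold)
    # Add one replica per evaluation, never exceeding the configured cap.
    if over_threshold and current_replicas < policy.max_replicas:
        return current_replicas + 1
    return current_replicas


policy = ScalingPolicy(cpu_threshold=75.0, memory_threshold=80.0, max_replicas=4)
print(desired_replicas(policy, current_replicas=2, cpu_percent=90.0, memory_percent=50.0))
```

Capping the replica count keeps a runaway load spike from adding resources without limit, which is one common reason such policies include a maximum.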

Until we find the root cause, these changes may not fully resolve the issue, but they should reduce the likelihood of it happening again and lessen the performance impact if it does.

The query changes will be implemented later tonight and should not impact ongoing activities.

Resolved

Everything is currently back in a good state. We’re still reviewing logs for an underlying cause, but we have identified some queries that made the overall database situation worse and will be working through those soon.

Monitoring

At this point the databases are back to normal load and issues are clearing up. We’ll continue to monitor and will update with additional details as we uncover them.

Investigating

We’re currently experiencing database issues that are causing the netPark system to run slowly or not load at all. We’re investigating now.