netPark Outage - Oct. 7th 2024 | US-EAST-2 Region AWS Outage

Post mortem

10/9 @ 7:38 PST - Amazon has issued a statement addressing the outage and outlining the steps it is taking to prevent a recurrence.

Their comment is pasted below in its entirety:

We wanted to provide some additional information. Beginning at 12:27 PM, we experienced increased error rates for S3 GET/PUT requests made in the US-EAST-2 Region. This issue also impacted other AWS Services who use these APIs as part of service operations. The S3 issue was resolved at 12:52 PM, when S3 error rates returned to normal operations. Since then, we have been working to fully understand the root cause. This issue was caused by a latent software issue in a subsystem of S3 that is responsible for assessing metadata (such as versioning) during S3 PUT and GET operations. We have already implemented mitigations, which include removing the software issue in the S3 request path, and have identified new testing to detect these types of software issues in the future.

Resolved
Assessed

On October 7th, 2024, at approximately 3:33 PM EST, netPark’s automated server monitoring system detected performance degradation across several key service components. Our team immediately began investigating the root cause of the degraded status.

Within five minutes, this issue escalated into a full outage. During this time, netPark’s support team quickly responded to customer inquiries via email and voicemail, keeping clients informed of the situation and our efforts to resolve it.

At 3:46 PM EST, we received an alert regarding an API Error Outage affecting the US-EAST-2 Region of Amazon Web Services (AWS), the platform hosting all netPark services. By approximately 3:49 PM EST, our servers began to come back online, and by 3:55 PM EST, all three netPark servers had fully recovered, restoring system operations. In total, the outage impacted seven different netPark services for an estimated 17 minutes.
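For readers who want to check the 17-minute figure, the arithmetic can be reproduced from the timestamps in the timeline. This is only an illustrative sketch: the variable names are hypothetical, and it assumes that "within five minutes" means the full outage began exactly five minutes after the 3:33 PM detection, which the timeline does not state precisely.

```python
from datetime import datetime, timedelta

# Timestamps as reported in the timeline (Eastern time, October 7th, 2024).
detected = datetime(2024, 10, 7, 15, 33)   # degradation first detected
recovered = datetime(2024, 10, 7, 15, 55)  # all three servers recovered

# Assumption: "within five minutes" is treated as exactly five minutes
# after detection; the real escalation time may have been slightly earlier.
full_outage = detected + timedelta(minutes=5)

# Duration of the full outage, and of the whole degraded period, in minutes.
outage_minutes = int((recovered - full_outage).total_seconds() // 60)
degraded_minutes = int((recovered - detected).total_seconds() // 60)

print(f"Full outage: about {outage_minutes} minutes")
print(f"Detection to full recovery: {degraded_minutes} minutes")
```

Under that assumption, the full outage window matches the estimate given above, while the total period from first detection to full recovery is somewhat longer.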

Our team continued to closely monitor the environment until 4:15 PM EST, when all servers and services were confirmed to be fully operational and performing within expected thresholds. We will maintain vigilance throughout the day and provide any additional updates should AWS release further details on the outage in the US-EAST-2 Region.

We sincerely apologize for any inconvenience this disruption may have caused and appreciate your understanding.

7 affected services: