Platform Outage Resulting in 500 Error

Incident Report for Uscreen

Postmortem

What Happened?

On March 13, 2025, we detected a service disruption affecting all users. During this time, some catalog pages and the Uscreen admin area returned errors, preventing store functionality from working as expected. The disruption lasted approximately 19 minutes, including a complete outage of 5-6 minutes before services began recovering.

What Was the Impact?

Total Duration: ~19 minutes (15:01 - 15:20 UTC)
Full Service Disruption: ~5-6 minutes
Degraded Performance: 15:08 - 15:20 UTC
Scope: All stores
Recovery: Initial recovery started at 15:08 UTC, with full restoration by 15:20 UTC

Is Everything Working Now?

Yes, the service has been fully restored and is operating normally.

What Caused the Issue?

Our investigation identified the following:

A feature in the community section was not optimized for stores with a high number of users.
The system was executing resource-intensive operations, causing requests to take longer than expected.
Under heavy load, the database became overwhelmed, leading to errors.
Our monitoring system did not provide useful diagnostics during the incident, delaying troubleshooting.

What Are We Doing to Prevent This in the Future?

We’ve taken immediate steps to mitigate the issue and are working on long-term improvements:

Immediate Fixes:

Applied a temporary adjustment to reduce the server load caused by the feature.
Upgraded our monitoring system for better incident visibility and faster response times.

Planned Improvements:

Optimizing the affected feature to handle high-traffic scenarios efficiently.
Implementing measures in our mobile apps to limit excessive requests.
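
This report does not describe how the mobile apps will limit requests. As an illustration only, one common client-side technique is a token bucket, which allows short bursts but caps the sustained request rate. The sketch below is an assumption, not Uscreen's implementation: the class name, the rate, and the capacity are all hypothetical.

```python
import time


class TokenBucket:
    """Hypothetical client-side limiter: permits short bursts, caps the sustained rate."""

    def __init__(self, rate_per_sec, capacity, clock=time.monotonic):
        self.rate = rate_per_sec      # tokens added per second (assumed value)
        self.capacity = capacity      # maximum burst size (assumed value)
        self.clock = clock            # injectable so the limiter can be tested
        self.tokens = float(capacity)
        self.last = clock()

    def allow(self):
        """Return True if a request may be sent now, False if it should wait."""
        now = self.clock()
        # Refill in proportion to the elapsed time, never above capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

A client would call allow() before each request and delay or drop the request when it returns False, so that a burst of activity in the app does not turn into a burst of load on the server.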

We sincerely apologize for any disruption this may have caused and appreciate your patience as we work to enhance system stability.

Posted Mar 18, 2025 - 13:44 UTC

Resolved

Our team has identified the issue's root cause and implemented a fix. Thank you for your patience while the team worked to fully resolve the issue.

Posted Mar 13, 2025 - 17:27 UTC

Update

Our team has restored service to normal operations; however, they continue to investigate the root cause of the issue. We appreciate your continued patience.

Posted Mar 13, 2025 - 15:56 UTC

Update

Our engineering team is investigating the issue and is working on implementing a solution. We're starting to see improvements.

Posted Mar 13, 2025 - 15:19 UTC

Investigating

Our engineering team is actively investigating the issue and working on a resolution. This affects access to the Admin Area and Catalog pages.

Posted Mar 13, 2025 - 15:07 UTC

This incident affected: Admin Portal and Storefront.