My First Postmortem Report
Debugging project 500 server error
Sun Feb 23 2020
WordPress website running on a LAMP stack incident report
Sunday, February 23, 2020
Earlier this week we experienced a website service outage that affected all users. This incident occurred on February 19, 2020.
Today we’re providing an incident report detailing the nature of the outage and our response to resolve it.
Issue summary
From 12:15 PM (GMT-5) to 2:45 PM (GMT-5), all visitors to our website received a 500 error response. The issue affected 100% of traffic to our website. The root cause was a spelling error in a route file in the WordPress configuration.
Timeline
- 12:00 PM - Edit various files.
- 12:05 PM - Push files to production.
- 12:15 PM - Discover 500 error response when accessing the website.
- 12:20 PM - Begin debugging.
- 1:00 PM - Review a possible solution for the 500 error.
- 1:30 PM - Find no errors in the logs.
- 2:00 PM - Use the strace command on the Apache process to see what happens when the site is requested with curl.
- 2:15 PM - Find a typo in a route file.
- 2:30 PM - Successfully identify the configuration file with the error.
- 2:32 PM - Use a script to fix the file.
- 2:45 PM - Restart the server and receive a 200 response from the web server.
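The strace step in the timeline can be sketched roughly as follows. When a process asks for a file that does not exist, strace typically prints the system call with a `-1 ENOENT` result, so filtering for those lines is one way to surface a misspelled path. The function name `missing_paths` and the sample path are hypothetical and are not taken from the actual incident.

```python
import re

# Matches a failed file-related system call in strace output, for example:
#   open("/some/path", O_RDONLY) = -1 ENOENT (No such file or directory)
# and captures the first quoted argument (the path).
ENOENT_CALL = re.compile(r'\w+\([^"]*"([^"]+)".*\)\s*=\s*-1 ENOENT')


def missing_paths(strace_output):
    """Return the paths a traced process looked for but did not find.

    Paths are returned in order of first appearance, without duplicates.
    """
    found = []
    for line in strace_output.splitlines():
        match = ENOENT_CALL.search(line)
        if match and match.group(1) not in found:
            found.append(match.group(1))
    return found


# Illustrative input only: this path is made up for the example.
sample = (
    'open("/var/www/html/wp-includes/example.phpp", O_RDONLY) '
    '= -1 ENOENT (No such file or directory)\n'
    'open("/etc/hosts", O_RDONLY) = 3\n'
)
print(missing_paths(sample))
```

In practice the input would come from a trace captured while requesting the site with curl. A path ending in something like `.phpp` would then stand out as a likely typo.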
Root cause and resolution
A configuration change was released to production without first being tested. The change included an invalid route for a PHP file necessary for the WordPress configuration. Typically, every new change is first released to a testing environment that replicates our production environment. This time, however, the change was neither tested nor carefully reviewed and approved by one of our senior engineers.
Once the error was found using the strace command, we identified the configuration file using the incorrect route and fixed it.
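The script used during the incident is not reproduced here. As a minimal sketch of the idea, assuming the fix amounts to replacing a misspelled string in a single file, something like the following would do it. The function name `replace_in_file` and the values in the usage comment are hypothetical.

```python
from pathlib import Path


def replace_in_file(path, wrong, right):
    """Replace every occurrence of `wrong` with `right` in the file at `path`.

    Returns the number of occurrences replaced. The file is left untouched
    when nothing matches.
    """
    target = Path(path)
    text = target.read_text(encoding="utf-8")
    count = text.count(wrong)
    if count:
        target.write_text(text.replace(wrong, right), encoding="utf-8")
    return count


# Hypothetical usage (file name and strings are placeholders):
# replace_in_file("/path/to/config-file.php", "misspelled-route", "correct-route")
```

Returning the replacement count makes it easy to confirm that the script changed exactly what was expected before restarting the server.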
Corrective and preventive measures
It is crucial that every commit pushed to production is first tested in a testing environment. Once everything is confirmed to be correct, it can then be pushed to production.
Actions to prevent future issues:
- Test changes in a test environment first.
- Double-check for spelling errors.
- Add monitoring alerts for faster response.
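As one possible starting point for the monitoring item above, the sketch below probes a URL and decides whether an alert is warranted. It is not our production monitoring setup. The function names, the 5-second timeout, and the rule of alerting on any 5xx or missing response are assumptions chosen for illustration.

```python
import urllib.error
import urllib.request


def fetch_status(url, timeout=5.0):
    """Return the HTTP status code for `url`, or None if no response arrived."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        # urlopen raises HTTPError for 4xx/5xx responses; the code is still useful.
        return error.code
    except (urllib.error.URLError, OSError):
        # DNS failure, refused connection, timeout, and similar problems.
        return None


def needs_alert(status):
    """Alert when the site is unreachable or answers with a server error."""
    return status is None or status >= 500


# Hypothetical usage, with a placeholder URL:
# status = fetch_status("http://localhost/")
# if needs_alert(status):
#     print(f"ALERT: website check failed (status={status})")
```

Run on a schedule, for example every minute, a check along these lines would have flagged the 500 responses shortly after the 12:05 PM deploy.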
We are committed to continually and quickly improving our technology and operational processes to prevent this kind of outage in the future.