For several hours, Network Creators and Members were unable to access many
of their Ning Networks. Users were seeing 404 errors across a large
swath of all Ning Networks. I will explain what happened, how we are
fixing it, and how we are going to avoid the issue in the future.
What happened
Early this week, we were seeing issues when deploying new software to our
production environment. As you know, we have been furiously deploying
new features and enhancements to Ning since we launched Ning Mini, Plus,
and Pro. Specifically, we were noticing that some of our servers were
not functioning properly with new code. To resolve this issue, we made
some operational changes in how we maintain these servers. Part of that
change was deleting old software deployments. Unfortunately, one of the
side effects of this was an errant process that started deleting some of
your Ning Networks. Note, however, that this script affected only the
PHP code for some of your Ning Networks. It did not delete any content.
Our incident team was immediately engaged and online. Once we found the
root cause, we stopped the errant process and began restoring your Ning
Networks.
How we fixed it
We keep backups of all your Ning Networks both hourly and daily. We ran a
script that restored the deleted code. The script ran quickly
and restored many Ning Networks within a few hours. As Ning Networks
were restored from the backup, they were immediately available online.
How we will prevent this issue from occurring in the future
We are avoiding this in three ways:
- Strengthen our production review process. All production changes go through a review process at Ning. We make
dozens of changes in our production environment per week without
incident. We are further strengthening this review process with deeper
reviews of all one-time scripts and long-running scheduled jobs/scripts
in our production environment.
- Perform an audit of all our existing jobs. This is something we do regularly, but we will be doing it again immediately in light of the incident.
- Put safeguards in our system against this. Even if another errant script runs in our environment, we are putting
safeguards in our system to prevent this kind of deletion or alteration.
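
For readers who run their own cleanup jobs, here is a minimal sketch of what a deletion safeguard of this kind could look like. It is not Ning's actual implementation, which is not described in this post. The function names, the `allowed_root` directory, and the `max_deletions` limit are all hypothetical. The idea is that a cleanup script may only delete inside one designated directory, refuses unusually large batches, and defaults to a dry run.

```python
import shutil
from pathlib import Path


class UnsafeDeleteError(Exception):
    """Raised when a deletion request fails a safety check."""


def plan_deletions(targets, allowed_root, max_deletions=50):
    """Validate deletion targets before anything is removed.

    allowed_root and max_deletions are hypothetical parameters
    chosen for this illustration.
    """
    root = Path(allowed_root).resolve()
    approved = []
    for target in targets:
        # Resolve ".." segments and symlinks before checking the location.
        path = Path(target).resolve()
        # Refuse anything that is not strictly inside the allowed root.
        if root not in path.parents:
            raise UnsafeDeleteError(f"{path} is outside {root}")
        approved.append(path)
    # Refuse suspiciously large batches; a person should review those.
    if len(approved) > max_deletions:
        raise UnsafeDeleteError(
            f"{len(approved)} deletions exceeds the limit of {max_deletions}"
        )
    return approved


def delete_old_deployments(targets, allowed_root, dry_run=True, max_deletions=50):
    """Delete approved directories; with dry_run=True, only report them."""
    approved = plan_deletions(targets, allowed_root, max_deletions)
    if not dry_run:
        for path in approved:
            shutil.rmtree(path)
    return approved
```

Because `dry_run` defaults to `True`, a script using this helper has to opt in to actually removing anything, and every target is validated before the first deletion starts.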