We recently experienced two unplanned outages on the Ning Platform: the first on Thanksgiving evening, November 22, and the second yesterday, November 27. We know that near-perfect uptime is what you’ve come to expect from Ning, and we wanted to let those who are interested know more about this unplanned maintenance and how we generally respond when outages occur.
Platform uptime and performance are a critical part of the hosted community solution we offer you, and we know that any outage can have a significant effect on the health and activity of your communities. We delivered at least 99.9% uptime consistently from 2011 through October 2012. Obviously, in the past week we’ve failed to meet this standard. I want you to know that we take these outages (and any outage) very seriously. A team of engineers and advocates is always on call 24/7 to quickly resolve significant incidents. If necessary, we call in additional engineers with relevant expertise. Any time we have an outage, we perform a detailed post-mortem to investigate root causes and specify changes to prevent the issue from recurring.
So, what happened?
The first outage resulted from back-end optimization work our engineering team has been doing over the past three months. The team has been making changes to our architecture to take advantage of newer equipment and powerful new cloud services that weren’t available when the back-end was originally designed. Our expectation is that the optimization process will result in stronger performance and improved disaster recovery in the long term. However, one of the changes caused a key set of databases to fail, leading to the outage on November 22. The engineering team has corrected the issue so it will not recur.
The outage yesterday was unfortunately the result of user error by an engineer on our operations team during normal back-end maintenance. We’re revising our systems to prevent this type of error from occurring again.
Again, I realize how important platform uptime and performance are for you and your members, and I apologize for the impact of the outages this last week.
UPDATE ON CSS/JS LOADING ISSUES:
Eric, I noticed something. I did try to file a ticket on this, but I’d rather go through here because of the ticket maintenance window.
Ever since the first maintenance issue I’ve had missing images in at least one discussion. I filed a ticket about the missing images, but I don’t have the ticket number handy right now. At the time I was told that the images must be missing because the members left. I knew that wasn’t the case, but that was the explanation I got.
Recently, while doing a sitemap crawl, I got a new error I’ve never seen before: 504 Gateway Timeout.
When investigating this, I opened the URL and got an XML parse error that said something like “element missing.” This leads me to think that something in the maintenance left these images in place but made them inaccessible. These images are in one of our most popular and important discussions, so it would really be awesome if you could fix that.
And I understand that there is a hope to upgrade to cloud server technology with no ETA, but could you give us a rough guess as to whether it will be weeks or months?
I don't think I have any good answers for you on these questions. I think our more technical team members will need to look into it. I fear any opinion I have would be an uneducated one when it comes to this level of detail. We're always updating and sometimes upgrading our systems, but I can't really speak to the particulars. Sorry, but that is a detail that we probably wouldn't be able to share anyway.
I haven't been able to load pics into albums since yesterday ... I've reinstalled Chrome, cleared my cache, etc., but still no luck. Not sure what a 'forced re-load' is ... (pardon my ignorance, lol!) happy to try it if you can explain what I need to do. Thanks.
PS: Yes I have submitted a ticket :)
PPS: Googled the 'forced reload' thing, have now tried it, but it didn't resolve the problem.