|
|
|
04-25-2022, 07:19 AM
|
Postid: 188700
|
|
CTO FutureQuest, Inc.
Join Date: Jun 1998
Location: Z'ha'dum
Posts: 8,108
|
Re: Network Outage
Current status with email systems.
The storage cluster suffered from metadata corruption, which prevented the SAN from starting properly. We have been working non-stop on fixing this while trying to recover as close to 100% of emails as possible.
Out of a current store of 38,435,157 emails, we were unable to recover 2,452. Those 2,452 emails were held in the memory cache and had not yet been finalized to the storage cluster.
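To put those two figures in proportion, here is a minimal sketch of the arithmetic. The function name is illustrative only; the two numbers are the ones quoted in this post. It works out to a recovery rate of about 99.994%.

```python
# Figures quoted in the post above.
TOTAL_EMAILS = 38_435_157
UNRECOVERED = 2_452


def recovery_rate(total: int, lost: int) -> float:
    """Return the recovered share of `total` as a percentage."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * (total - lost) / total


if __name__ == "__main__":
    # Print the rate to four decimal places.
    print(f"recovered: {recovery_rate(TOTAL_EMAILS, UNRECOVERED):.4f}%")
```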
That part is now done. What remains is fixing some control files that are causing SMTP to hang and not accept inbound email. There are 8 of them that need repair, and we are clearing them and rebuilding.
I am hoping that email should be back in full production within the next couple hours. As best I can see, SMTP processing is the last thing that needs to be fixed.
__________________
The FutureQuest Team
|
|
|
04-25-2022, 07:23 AM
|
Postid: 188701
|
|
Site Owner
Join Date: Sep 2002
Location: Buenos Aires
Posts: 377
|
Re: Network Outage
Quote:
Originally Posted by Terra
I am hoping that email should be back in full production within the next couple hours. As best I can see, SMTP processing is the last thing that needs to be fixed.
|
My site is still down, is this something that you are aware of? I'm on Managed QuestServer
|
|
|
04-25-2022, 07:27 AM
|
Postid: 188702
|
|
Registered User
Join Date: May 2007
Location: Maryland
Posts: 6
|
Re: Network Outage
Good news! I just had some email trickle in from yesterday. (Fingers crossed that the rest is fixed soon.)
|
|
|
04-25-2022, 07:29 AM
|
Postid: 188703
|
|
Systems Administrator
Join Date: Aug 2001
Location: Orlando, FL
Posts: 2,986
|
Re: Network Outage
On the topic of MySQL...
4 MySQL servers were destroyed, 3 of them probably beyond recovery. They were MYSQL07, MYSQL08, MYSQL10, and MYSQL16. Fortunately, we were able to move MYSQL16's disks to a spare machine and get it going fairly quickly, though we were prioritizing email hardware at the time.
One lucky thing did happen here... The other 3 were the oldest MySQL servers we were still using. The next time we needed a MySQL server I was planning to start it with the content from all 3 anyway.
So, if your MySQL data was on any of those 3, it was restored from the ~3:30am backup and is now on MYSQL16. The new MYSQL16 is faster than the old one, and hopefully the new load won't bother it much. If it does, we can redistribute some users later.
If you are still seeing MySQL errors other than network-related ones, the scan/repair is still running and will finish soon. I will post when it is done. If you are getting MySQL networking errors, it probably means you manage your own DNS and are serving out an A record that points to the raw IP, instead of using the mysql.domain.tld.fqdns.net name in your web app config. If you change your 3rd-party DNS to the new IP, let us know, as we can flush the entry out of our cache so it takes effect as soon as we do.
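For anyone who manages their own DNS and wants to check for the stale A record described above, here is a minimal sketch in Python. It assumes you have a self-managed name for your database host and that the provider-side name follows the mysql.domain.tld.fqdns.net pattern; the function names and the `resolver` parameter are illustrative, not part of any FutureQuest tooling.

```python
import socket
from typing import Callable


def resolve(hostname: str) -> str:
    """Resolve a hostname to an IPv4 address using the system resolver."""
    return socket.gethostbyname(hostname)


def points_at_same_host(
    custom_name: str,
    provider_name: str,
    resolver: Callable[[str], str] = resolve,
) -> bool:
    """Return True if both names currently resolve to the same IPv4 address.

    `resolver` is injectable so the comparison can be exercised without
    touching the network.
    """
    return resolver(custom_name) == resolver(provider_name)
```

A call such as `points_at_same_host("mysql.example.com", "mysql.example.com.fqdns.net")` (both names hypothetical) returning False would suggest the self-managed record still points at an old IP. A failed lookup raises `socket.gaierror`, so wrap the call in a try/except if either name might not exist. Using the provider name directly in the web app config avoids the comparison altogether.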
__________________
Kevin
|
|
|
04-25-2022, 07:30 AM
|
Postid: 188704
|
|
Service Assistant
Join Date: Feb 2010
Posts: 235
|
Re: Network Outage
Quote:
Originally Posted by esllou
My site is still down, is this something that you are aware of? I'm on Managed QuestServer
|
I just wanted to note that we do also have your email/ticket and I have forwarded it to the techs to check your MQS.
|
|
|
04-25-2022, 07:31 AM
|
Postid: 188705
|
|
Site Owner
Join Date: Sep 2002
Location: Buenos Aires
Posts: 377
|
Re: Network Outage
thanks a lot Ryan 
|
|
|
04-25-2022, 07:45 AM
|
Postid: 188706
|
|
Site Owner
Join Date: Aug 2020
Posts: 21
|
Re: Network Outage
When can we expect email to return? It has now been almost 24 hours
|
|
|
04-25-2022, 08:30 AM
|
Postid: 188707
|
|
Site Owner
Forum Notability:
156 pts: Ambassador of Goodwill
Join Date: Mar 2005
Posts: 418
|
Re: Network Outage
I have email again and all my sites are back! Hope this is true for everyone else.
Thanks FQ for all your hard work.
|
|
|
04-25-2022, 08:41 AM
|
Postid: 188708
|
|
CTO FutureQuest, Inc.
Join Date: Jun 1998
Location: Z'ha'dum
Posts: 8,108
|
Re: Network Outage
SMTP has been fixed and all mail services are now operational.
We have run various tests and have been watching the log files closely.
If you run into any problems, please let us know so we can take a closer look. There may be some gremlins due to the missing metadata entries from the storage cluster. However, the bulk of them should clear out over time. The ones that don't, I'll have to remove manually. I mention this because I've had to handle them before (rarely). The behavior is this: when a process tries to access the file, the metadata is there but the file itself isn't, which causes the process to go into a D state (uninterruptible sleep) waiting for SAN I/O that will never happen.
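For readers curious what a D state looks like from the outside, here is a minimal sketch that lists processes in uninterruptible sleep on a Linux machine by reading /proc/<pid>/stat. It is a generic illustration with made-up function names, not the tooling used on these servers, and it assumes a standard Linux /proc filesystem.

```python
import os
from typing import Iterator, Tuple


def parse_stat(stat_line: str) -> Tuple[int, str, str]:
    """Split one /proc/<pid>/stat line into (pid, command, state)."""
    # The command is wrapped in parentheses and may itself contain spaces or
    # parentheses, so locate it by the first "(" and the last ")".
    open_paren = stat_line.index("(")
    close_paren = stat_line.rindex(")")
    pid = int(stat_line[:open_paren].strip())
    command = stat_line[open_paren + 1:close_paren]
    state = stat_line[close_paren + 1:].split()[0]
    return pid, command, state


def processes_in_d_state(proc_root: str = "/proc") -> Iterator[Tuple[int, str]]:
    """Yield (pid, command) for every process currently in state D."""
    try:
        entries = os.listdir(proc_root)
    except OSError:
        return  # no /proc here (for example, not a Linux system)
    for entry in entries:
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "stat")) as handle:
                pid, command, state = parse_stat(handle.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited mid-scan, or the line was unreadable
        if state == "D":
            yield pid, command


if __name__ == "__main__":
    for pid, command in processes_in_d_state():
        print(pid, command)
```

A process that shows up here briefly is normal, since ordinary disk waits also pass through D. The case described above is a process that stays in D indefinitely because the I/O it is waiting on never completes.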
Once again, our sincerest apologies for the headaches this has caused. The email systems are incredibly sophisticated, with hundreds of moving parts that have to work in unison. Most of the time, it doesn't take long to sort things out, but in the case of a cold bootstrap, things can get complicated really fast, as our first and foremost priority is to not lose any email. Validating 34M+ emails takes an enormous amount of I/O and time to do right, and it must be done offline.
__________________
The FutureQuest Team
|
|
|
04-25-2022, 08:47 AM
|
Postid: 188709
|
|
Site Owner
Join Date: Apr 2001
Location: UK
Posts: 164
|
Re: Network Outage
"...we were unable to recover 2,452."
Are these emails lost?
|
|
|
All times are GMT -4.