FutureQuest, Inc.
Old 12-09-2020, 12:30 PM   Postid: 188218
harryd42
Site Owner

Join Date: Apr 2017
Posts: 7
Re: [FQuest Notice] SAN work progress updates

Email is working now.
Old 12-09-2020, 12:32 PM   Postid: 188219
photoruss
Site Owner
 
Join Date: Aug 2003
Location: Boston, MA
Posts: 27
Re: [FQuest Notice] SAN work progress updates

Yes, I am getting emails now too.

Thanks to the FQ team for continued hard work!!
Old 12-09-2020, 12:48 PM   Postid: 188220
 Kevin
Systems Administrator
 
 
Join Date: Aug 2001
Location: Orlando, FL
Posts: 2,986
Re: [FQuest Notice] SAN work progress updates

In the spirit of better communications I have made some changes to the Server Status page. I have added SMTP checks to the individual PT servers (what MXLB01 load balances) as well as MXLB01 itself.

I have also added a Load Average column. We have resisted doing this for years because the numbers are deceptive. Since one of the values is the load averaged over 15 minutes, a large spike can keep it red for a while after the load drops back down (even to less than 1). Hopefully it won't be too ugly.
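
To illustrate why the 15-minute figure lags, here is a minimal sketch of an exponentially damped load average of the kind Linux reports. It is not the Server Status page's actual code. The 5-second sample interval follows the Linux kernel's approach; the spike size, durations, and the "red" cutoff of 4.0 are made-up values for illustration only.

```python
import math

SAMPLE_INTERVAL = 5.0  # seconds between samples, as in the Linux kernel


def decay_factor(window_minutes):
    """Per-sample decay for an exponentially damped average over the window."""
    return math.exp(-SAMPLE_INTERVAL / (window_minutes * 60.0))


def simulate(active_counts, window_minutes, start=0.0):
    """Return the damped load average after each sample."""
    factor = decay_factor(window_minutes)
    load = start
    history = []
    for active in active_counts:
        load = load * factor + active * (1.0 - factor)
        history.append(load)
    return history


def seconds_until_below(history, threshold, start_index):
    """Seconds after start_index until the average first drops below threshold."""
    for i in range(start_index, len(history)):
        if history[i] < threshold:
            return (i - start_index + 1) * SAMPLE_INTERVAL
    return None


# Hypothetical scenario: 10 minutes with 40 runnable tasks, then an hour at 0.5.
spike = [40.0] * (10 * 60 // 5)
calm = [0.5] * (60 * 60 // 5)
samples = spike + calm
THRESHOLD = 4.0  # hypothetical "red" cutoff, not the status page's real value

one = simulate(samples, 1)
fifteen = simulate(samples, 15)
t1 = seconds_until_below(one, THRESHOLD, len(spike))
t15 = seconds_until_below(fifteen, THRESHOLD, len(spike))
print("1-minute average recovers after", t1, "seconds")
print("15-minute average recovers after", t15, "seconds")
```

Under these assumed numbers, the 1-minute average is back under the cutoff within a few minutes, while the 15-minute average stays above it for about 25 minutes even though the real load is already 0.5.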
__________________
Kevin
Old 12-09-2020, 12:57 PM   Postid: 188221
Mohawk
Site Owner

Join Date: Oct 2013
Posts: 43
Re: [FQuest Notice] SAN work progress updates

I think that would help. At least we can see whether there are current or recent events that may help explain something going on at that moment with our own systems.
Old 12-09-2020, 12:57 PM   Postid: 188222
 Kevin
Systems Administrator
 
 
Join Date: Aug 2001
Location: Orlando, FL
Posts: 2,986
Re: [FQuest Notice] SAN work progress updates

If anyone just looked at the Server Status page and wondered why there was a bunch of red in the SSH column, we were just hit by a DDoS attack that made SSH access intermittent. Within 2 minutes our automated defenses kicked in and fixed it.
__________________
Kevin
Old 12-09-2020, 01:00 PM   Postid: 188223
JimJ26
Site Owner

Join Date: Sep 2010
Posts: 28
Re: [FQuest Notice] SAN work progress updates

Quote:
Originally Posted by Kevin
A note since several people have asked...

If you are getting certificate errors those are unrelated to this problem. The certificates were coincidentally updated last night. If you are getting these errors it means your email client is configured incorrectly.

If you want to use encrypted email (which of course we recommend) and you get these errors change your incoming and outgoing email server names to mail.questmail.net instead of the domain specific names. Also, use the full email address as the user name.

If you manually force your email client to accept the new certificate instead of making the change, you will be right back to the same problem in a month when the certificates renew again.
So tired of this. Now I have to wipe my phone of all email addresses again. This is insanity. And on the email setup page there is no mention of mail.questmail.net.

I set up as noted above and got a fail. Why am I not surprised?
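
For reference, the settings described in the quoted note can be sketched as follows. This is only an illustration, not an official FutureQuest configuration: the host name and the "full email address as user name" rule come from the note, while the port numbers (993 and 465) are assumptions based on common IMAP and SMTP over TLS defaults, and `client_settings` and `open_imap` are hypothetical helper names.

```python
import imaplib
import ssl

MAIL_HOST = "mail.questmail.net"  # server name given in Kevin's note


def client_settings(email_address):
    """Build the client settings described in the note."""
    return {
        "username": email_address,   # full email address, per the note
        "incoming_host": MAIL_HOST,  # not the domain-specific name
        "outgoing_host": MAIL_HOST,
        "incoming_port": 993,        # assumption: standard IMAP over TLS
        "outgoing_port": 465,        # assumption: standard SMTP over TLS
    }


def open_imap(settings, password):
    """Connect over TLS; the default context checks the certificate's host name."""
    context = ssl.create_default_context()
    conn = imaplib.IMAP4_SSL(
        settings["incoming_host"],
        settings["incoming_port"],
        ssl_context=context,
    )
    conn.login(settings["username"], password)
    return conn


settings = client_settings("user@example.com")  # placeholder address
print(settings)
```

Because the default TLS context verifies that the certificate matches the host name being connected to, connecting with a domain-specific server name that the certificate does not cover would be expected to fail in the way the note describes.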
Old 12-09-2020, 01:02 PM   Postid: 188224
chernove
Site Owner
 
Join Date: Jan 2002
Location: NYC
Posts: 162
Re: [FQuest Notice] SAN work progress updates

Quote:
Originally Posted by Kevin
In the spirit of better communications I have made some changes to the Server Status page. I have added SMTP checks to the individual PT servers (what MXLB01 load balances) as well as MXLB01 itself.

I have also added a Load Average column. We have resisted doing this for years because the numbers are deceptive. Since one of the values is the load averaged over 15 minutes, a large spike can keep it red for a while after the load drops back down (even to less than 1). Hopefully it won't be too ugly.
Thank you, Kevin. Much appreciated.

Not 100% incidentally, Qmail is still extremely laggy for me (1:00 PM EST). But at least it's (kind of) working.
Old 12-09-2020, 01:22 PM   Postid: 188225
 Terra
CTO FutureQuest, Inc.
 
 
Join Date: Jun 1998
Location: Z'ha'dum
Posts: 8,108
Re: [FQuest Notice] SAN work progress updates

The SAN storage cluster has now fully converged and its I/O performance should be back up to speed. There are still some chunk upgrades being done, but they are now running at background (best-effort) priority.

The moment of the SAN full convergence is significant because that is how long I would have had to wait before resetting the mail cluster if I had not taken a chance and tried a modified manual bootstrap procedure to bring it back online ahead of full convergence.

Operating massive SANs can be a beast sometimes, just by the sheer volume of data one has to work with. When they are working properly, they are simply amazing (for so many reasons) - but when they go sideways - one had better hope they have a Ph.D. from space camp to get them back on course.

What happened today was simply a bad combination of events. The storage node that hard locked happened to hold a large number of storage chunks, of which about 50% were still pending replication (due to further offloading migrations). That means many emails ceased to exist in an instant. However, there are safeguards and fail-safes: when a mount tries to access a missing file, it will hold and wait until the file returns (data integrity). This is where the cascading problem comes in, due to a locking bug in the mount daemons that I have not pinned down yet, though I believe I'm getting closer. These types of bugs are some of the hardest to hunt down and isolate because of the multitude of code pathways that can hit that mutex. When IMAP is continually pounding on the mounts (waiting for those files), it freezes up the mounts and goes into a full-on 'D' (uninterruptible sleep) state. Even a 'kill -9' won't break them free, so now I have thousands of unruly IMAP daemons wreaking havoc because they swamp each other out and won't ever notice the file is back online and available to access. This is why I had to reset the mail system (which is the only way to fix this), primarily because of IMAP gone wild, being belligerent and downright stubborn.
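
As a rough illustration of the 'D' state mentioned above, the sketch below lists processes that Linux reports as being in uninterruptible sleep by reading `/proc/<pid>/stat`. It is not a tool FutureQuest uses; `parse_stat` and `find_uninterruptible` are hypothetical helper names, and the scan assumes a Linux-style `/proc` (it returns an empty list elsewhere).

```python
import os


def parse_stat(stat_text):
    """Return (pid, command name, state letter) from a /proc/<pid>/stat line.

    The command name is wrapped in parentheses and may itself contain spaces
    or parentheses, so split on the first "(" and the last ")".
    """
    open_paren = stat_text.index("(")
    close_paren = stat_text.rindex(")")
    pid = int(stat_text[:open_paren])
    comm = stat_text[open_paren + 1:close_paren]
    state = stat_text[close_paren + 1:].split()[0]
    return pid, comm, state


def find_uninterruptible(proc_root="/proc"):
    """List (pid, command) pairs for processes currently in the 'D' state."""
    stuck = []
    if not os.path.isdir(proc_root):
        return stuck
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "stat")) as handle:
                pid, comm, state = parse_stat(handle.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited mid-scan, or the entry was unreadable
        if state == "D":
            stuck.append((pid, comm))
    return stuck


for pid, comm in find_uninterruptible():
    print(pid, comm, "is in uninterruptible sleep")
```

A process in this state is blocked inside the kernel (typically waiting on I/O), which is why signals, including SIGKILL from 'kill -9', are not acted on until the blocking call returns.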

The SAN work that I'm personally doing is to increase its resiliency in the face of a storage node failure, so that it can quickly replicate those chunks to other storage nodes. The prior power outage shook the trees quite a bit and pointed out where resiliency improvements could be made without sacrificing I/O performance. This is one of those technical cases where things like this don't become apparent until it's being driven by a trial-by-fire event.

Please be assured that I take every precaution I can think of to make this SAN overhaul as transparent as possible. However, the combination of two bad events (queued chunk replications and a kernel hard lock) is just one of those things where all of the safety precautions can only take you so far. Sometimes you have to do work that sets up a mutually exclusive scenario, where you can survive one or the other - but not both at the same time. There is no safety net for that, other than fixing the broken storage node and reintroducing it while waiting for the (scanning) metadata registration to complete. Depending on the size of the silo, it can take anywhere from 15 to 90 minutes to fully complete. This scanning time is part of normal operations, unless you are dealing with pending replications. Our SAN is fully replicated 99.99% of the time.
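
The "one or the other, but not both" point can be shown with a toy model. This is only a sketch under invented assumptions: the node names, chunk names, and the dictionary layout are made up for illustration and are not how the SAN actually tracks its chunks.

```python
def unavailable_chunks(placement, failed_nodes):
    """Return the chunks whose every copy lives on a failed node."""
    failed = set(failed_nodes)
    return sorted(
        chunk for chunk, holders in placement.items() if set(holders) <= failed
    )


# Hypothetical placement: two chunks are fully replicated, two are still
# waiting for their second copy because their replication is queued.
placement = {
    "chunk-a": {"node1", "node2"},
    "chunk-b": {"node1", "node3"},
    "chunk-c": {"node1"},  # replication pending
    "chunk-d": {"node1"},  # replication pending
}

lost_to_node1 = unavailable_chunks(placement, ["node1"])
lost_to_node2 = unavailable_chunks(placement, ["node2"])
print("node1 fails:", lost_to_node1)
print("node2 fails:", lost_to_node2)
```

In this toy setup, losing a node while everything is replicated costs nothing, and having replications queued costs nothing while every node stays up. Only the node holding the unreplicated chunks failing makes those chunks unreachable until that node is repaired and reintroduced.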

The scanning time does drive me nuts, and that is another area of code research that I'm digging into to see if I can increase its efficiency. When I need it now and not ~X number of minutes from now - well - I need it now in order to keep our clients happy. Failure is never an option, but sadly in the technical world we sometimes have to handle whatever the failure is as quickly, efficiently and comprehensively as possible while striving to not make any mistakes along the way.
__________________
The FutureQuest Team
Old 12-09-2020, 01:59 PM   Postid: 188227
cetacean
Site Owner
 
Join Date: Feb 2016
Location: Seattle, WA
Posts: 6
Re: [FQuest Notice] SAN work progress updates

I was receiving email until about 10:35am PST, but now nothing is being received via my POP connection. No errors, just no messages that I should be receiving. My account is on SIX.
__________________
Joe
Cetacean Research Technology
https://www.cetaceanresearch.com
Old 12-09-2020, 02:02 PM   Postid: 188228
 Kevin
Systems Administrator
 
 
Join Date: Aug 2001
Location: Orlando, FL
Posts: 2,986
Re: [FQuest Notice] SAN work progress updates

We are not aware of any current problems. The email servers do have a backlog of emails queued up for processing.
__________________
Kevin


All times are GMT -4. The time now is 12:23 AM.


Running on vBulletin®
Copyright © 2000 - 2019, Jelsoft Enterprises Ltd.
Hosted & Administrated by FutureQuest, Inc.
Images & content copyright © 1998-2019 FutureQuest, Inc.