Re: [FQuest Notice] SAN work progress updates
The SAN storage cluster has now fully converged, and its I/O performance should be back up to speed. There are still some chunk upgrades being done, but they are now running at background (best-effort) priority.
The timing of the full SAN convergence is significant: that is how long I would have had to wait before resetting the mail cluster, had I not taken a chance on a modified manual bootstrap procedure to bring it back online ahead of full convergence.
Operating massive SANs can be a beast sometimes, just by the sheer volume of data one has to work with. When they are working properly, they are simply amazing (for so many reasons), but when they go sideways, one had better hope they have a Ph.D. from space camp to get them back on course.
What happened today was simply a bad combination of events. The storage node that hard locked happened to hold a large number of storage chunks, of which about 50% were still pending replication (due to further offloading migrations). That means many emails ceased to exist in an instant. There are safeguards and fail-safes, however: when a mount tries to access a missing file, it will hold and wait until the file returns (data integrity). This is where the cascading problem comes in, due to a locking bug in the mount daemons that I still have not pinned down, though I believe I'm getting closer. These types of bugs are some of the hardest to hunt down and isolate because of the multitude of code paths that can hit that mutex.
When IMAP continually pounds on the mounts (waiting for those files), the mounts freeze up and the IMAP processes go into a full-on 'D' (uninterruptible sleep) state. A 'kill -9' won't even break them free, so now I have thousands of unruly IMAP daemons wreaking havoc: they swamp each other out and will never notice that the file is back online and available to access. This is why I had to reset the mail system (the only way to fix this), primarily because of IMAP gone wild, belligerent and downright stubborn.
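For anyone curious about what that 'D' state looks like from the outside, here is a minimal sketch (not our actual tooling, just an illustration assuming a Linux /proc filesystem) of how one might list processes stuck in uninterruptible sleep, such as IMAP daemons blocked on a hung mount:

    #!/usr/bin/env python3
    # Illustrative sketch only -- not FutureQuest tooling.
    # Lists processes in 'D' (uninterruptible sleep) state by reading the
    # Linux /proc filesystem. A 'kill -9' cannot clear these; they stay
    # put until the blocking I/O (e.g. a hung SAN mount) finally returns.
    import os

    def d_state_processes():
        stuck = []
        for pid in os.listdir("/proc"):
            if not pid.isdigit():
                continue
            try:
                with open(f"/proc/{pid}/stat") as f:
                    data = f.read()
            except (FileNotFoundError, PermissionError):
                continue  # process exited (or is off limits) mid-scan
            # The command name is wrapped in parentheses and may contain
            # spaces, so locate it explicitly; the state flag follows it.
            lpar, rpar = data.index("("), data.rindex(")")
            comm = data[lpar + 1:rpar]
            state = data[rpar + 2:].split()[0]
            if state == "D":
                stuck.append((int(pid), comm))
        return stuck

    if __name__ == "__main__":
        for pid, comm in d_state_processes():
            print(f"PID {pid} ({comm}) is in uninterruptible sleep")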
The SAN work that I'm personally doing is aimed at increasing its resiliency in the face of a storage node failure, so that it can quickly replicate the affected chunks to other storage nodes. The prior power outage shook the trees quite a bit and pointed out where resiliency improvements could be made without sacrificing I/O performance to achieve them. This is one of those technical cases where such weaknesses don't become apparent until a trial-by-fire event drives them into the open.
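To give a rough picture of what that resiliency work is about (purely a hypothetical sketch; the names and data structures below are illustrative, not our actual SAN code), a re-replication pass after a node failure boils down to finding every chunk that has dropped below its target copy count and scheduling new copies onto healthy nodes:

    # Hypothetical sketch of a chunk re-replication planner.
    import random

    REPLICATION_FACTOR = 2  # assumed target number of copies per chunk

    def plan_rereplication(chunk_map, failed_node, healthy_nodes):
        """chunk_map maps chunk_id -> set of node names currently holding it."""
        plan = []
        for chunk, holders in chunk_map.items():
            survivors = holders - {failed_node}
            if not survivors:
                continue  # no surviving copy to read from (the painful case)
            missing = REPLICATION_FACTOR - len(survivors)
            if missing <= 0:
                continue  # still fully replicated despite the failure
            # Pick destination nodes that do not already hold the chunk.
            candidates = [n for n in healthy_nodes if n not in survivors]
            src = next(iter(survivors))
            for dest in random.sample(candidates, min(missing, len(candidates))):
                plan.append((chunk, src, dest))
        return plan

    if __name__ == "__main__":
        chunk_map = {
            "chunk-001": {"node-a", "node-b"},
            "chunk-002": {"node-a"},  # was still pending replication
            "chunk-003": {"node-b", "node-c"},
        }
        for chunk, src, dest in plan_rereplication(chunk_map, "node-a",
                                                   ["node-b", "node-c", "node-d"]):
            print(f"copy {chunk} from {src} to {dest}")

The 'chunk-002' case in that sketch is exactly the situation we hit today: the only copy lived on the failed node, so no amount of re-replication planning can bring it back until that node is repaired and reintroduced.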
Please be assured that I take every precaution I can think of to make this SAN overhaul as transparent as possible. However, the combination of two bad events (queued chunk replications and a kernel hard lock) is one of those things where all of the safety precautions can only take you so far. Sometimes you have to do work that sets up a mutually exclusive scenario, where you can survive one or the other, but not both at the same time. There is no safety net for that, other than fixing the broken storage node and reintroducing it while waiting for the (scanning) metadata registration to complete. Depending on the size of the silo, that can take anywhere from 15 to 90 minutes. This scanning time is part of normal operations and is harmless unless you are dealing with pending replications, and our SAN is fully replicated 99.99% of the time.
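For a rough idea of why that scan scales with silo size (again, a hypothetical illustration, not our actual metadata code; the paths and structures are made up), reintroducing a repaired node essentially means walking every chunk it holds and registering each one back into the metadata index before it can serve I/O again:

    # Hypothetical illustration of a metadata registration scan for a
    # reintroduced storage node.
    import os
    import tempfile

    def register_node_chunks(chunk_dir, metadata_index, node_name):
        """Walk the node's chunk store and record each chunk it holds.

        The scan has to touch every chunk file, which is why it takes
        longer on larger silos."""
        registered = 0
        for entry in os.scandir(chunk_dir):
            if not entry.is_file():
                continue
            chunk_id = entry.name  # assume the filename is the chunk id
            metadata_index.setdefault(chunk_id, set()).add(node_name)
            registered += 1
        return registered

    if __name__ == "__main__":
        # Tiny self-contained demo using a temporary directory as the silo.
        with tempfile.TemporaryDirectory() as demo_dir:
            for chunk_id in ("chunk-001", "chunk-002", "chunk-003"):
                open(os.path.join(demo_dir, chunk_id), "w").close()
            index = {}
            count = register_node_chunks(demo_dir, index, "node-a")
            print(f"registered {count} chunks for node-a: {sorted(index)}")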
The scanning time does drive me nuts, and that is another area of code research I'm digging into to see if I can increase its efficiency. When I need it now, and not ~X minutes from now, well, I need it now in order to keep our clients happy. Failure is never an option, but sadly, in the technical world we sometimes have to handle whatever failure does occur as quickly, efficiently, and comprehensively as possible while striving not to make any mistakes along the way.
__________________
The FutureQuest Team