12-15-2020, 05:25 PM   Postid: 188275
 Terra
CTO FutureQuest, Inc.
 
 
Join Date: Jun 1998
Location: Z'ha'dum
Posts: 8,108
[FQuest Notice] Secondary core router power supply failure

The secondary core routing network was taken offline by a failed power supply, and we have switched our network fully back over to the primary. Unfortunately this requires a fair amount of manual work, as the failover routines are engineered for automatic Primary => Secondary failover. There are many technical reasons for this, but mostly it is to ensure that the secondaries aren't in a marginal state that causes major MSTP storms on the inner network. This was one of the problems we saw a few years ago, and we developed procedures to remove any risk of it happening again.
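
As an illustration only (this is not FutureQuest's actual routing logic), the one-way failover policy described above could be sketched in Python roughly as follows. The names `FailoverController`, `on_health_report`, and `manual_failback` are hypothetical, and the sketch assumes each core's health can be reduced to a simple ok / not-ok flag.

```python
from dataclasses import dataclass
from enum import Enum


class Path(Enum):
    PRIMARY = "primary"
    SECONDARY = "secondary"


@dataclass
class FailoverController:
    """Hypothetical one-way failover: automatic Primary => Secondary only."""

    active: Path = Path.PRIMARY

    def on_health_report(self, primary_ok: bool, secondary_ok: bool) -> Path:
        # Fail over automatically only when the primary is unhealthy AND the
        # secondary looks fully healthy, so traffic never lands on a
        # marginal secondary.
        if self.active is Path.PRIMARY and not primary_ok and secondary_ok:
            self.active = Path.SECONDARY
        # Deliberately no automatic Secondary => Primary branch here.
        return self.active

    def manual_failback(self, primary_verified: bool) -> Path:
        # Failback is an explicit operator action, taken only after the
        # primary has been checked out.
        if not primary_verified:
            raise RuntimeError("primary not verified; refusing failback")
        self.active = Path.PRIMARY
        return self.active
```

In this sketch, a recovered primary never pulls traffic back on its own; someone has to call `manual_failback` after verifying it, which mirrors the manual Secondary => Primary step.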

We are still investigating the cause of the secondary network hardware failure, but all indications are pointing to a failed power supply.

During the last power outage, while we were getting everything back online, the primary routing core was responding in a marginal manner, so the decision was made to cut out the primary network and drop back to the secondary, which appeared to be solid. Once everything was back online we left the secondary backup network in operation as the primary conduit (manually pinned to be safe). A few days after the event, the problems with the primary network were sorted out and fixed, but because we were so deep into the holiday season we elected to hold off on switching everything back around until after the New Year. Historically the backup network has never really had any major problems, which is why we left it pinned up and isolated the primaries to ensure there was no unwanted cross talk until primary => secondary could be fully meshed back together - which in and of itself is a disruptive event that needs a scheduled maintenance window.

All in all, we really did try to do what was best for the stability of our network after coming back from the chaos caused by the last major power outage. Ergo, we didn't want to rock the boat with networking. Yet it is now apparent there was hidden damage to the power supply that didn't even show up in our monitoring system. We watch power supplies for fluctuations through onboard chipset monitoring systems, and this one was all green - until it instantly cut out.

As it stands now, everything is back on the primary network, which is how it normally runs, and we'll be replacing the blown secondary core router. The primary core router was fully checked out while it was offline and we don't believe there are any power supply issues with it.

The Secondary=>Primary meshing work does not need a maintenance window as it isn't a disruptive event. Even if it might think about being disruptive, we completely isolate the secondary network while doing the work.
__________________
The FutureQuest Team

12-15-2020, 05:53 PM   Postid: 188277
chernove
Site Owner
 
Forum Notability: 65 pts: Helpful Contributor
 
Join Date: Jan 2002
Location: NYC
Posts: 162
Re: [FQuest Notice] Secondary core router power supply failure

Not to sound like a broken record, but....I will.

WHERE WAS THE *#&*(#&$ COMMUNICATION DURING THIS LATEST ISSUE?
WHERE?
NO, SERIOUSLY!
NOT ON TWITTER. NOT ON FACEBOOK. NOT ON THE NON-FUNCTIONING FUTUREQUEST.
It's a simple question, and one we've all been asking for going on two months now.
When the crap hits the fan, WHERE do we get the information?

I literally JUST now got a notification that FQ just posted to Twitter. Great. Where were you 4 hours ago?!

SERIOUSLY! Unbelievable.

12-15-2020, 06:30 PM   Postid: 188279
cetacean
Site Owner
 
Forum Notability: 0 pts: Even-handed
 
Join Date: Feb 2016
Location: Seattle, WA
Posts: 6
Re: [FQuest Notice] Secondary core router power supply failure

Well, I'm glad you're back on primary power. I've continued to give FQ the benefit of the doubt regarding these outages, but the total silence on Twitter or Facebook or any other channel for the 3rd or 4th time in the past two months is the last straw. After being with you for the past 11 years, it's time to say goodbye. (My account and those of several of my colleagues had been managed by Artemis, who sadly passed away last year, which is why it looks like I've only been part of the community for a few years.) We'll all be leaving by the end of the year. I hope you work out your stability issues and, more importantly, you learn how to communicate with your customers.
__________________
Joe
Cetacean Research Technology
https://www.cetaceanresearch.com

12-15-2020, 06:45 PM   Postid: 188280
 Terra
CTO FutureQuest, Inc.
 
 
Join Date: Jun 1998
Location: Z'ha'dum
Posts: 8,108
Re: [FQuest Notice] Secondary core router power supply failure

chernove, I was called in at the tail end of the event, as I had been working on the SAN all night and was out getting rest for tonight's SAN work. Once I got in, helped with the postmortem, and got a clearer picture of what was going on, I was able to get a post up here and also on Twitter. Due to a multitude of hacking attempts against our Twitter account, we have it locked down, and currently I'm the only one who can unlock it (tied to my phone and private external email server) until we find a better way to ensure security. This account lockdown is only temporary; it was what needed to be done at the time, even if sub-optimal. It is also quite high on the priority list to resolve.

In regards to Facebook, we are looking at shutting down our presence there due to our disagreements with privacy concerns.

The external communications are being worked on, as I'm looking at getting an outside server on Google's network. Getting this all set up won't happen until 1st Qtr 2021, as our primary focus right now is the SAN overhaul and preparing for a core routing failover system overhaul. We are currently addressing the most critical infrastructure concerns that could negatively impact everyone, and this is part of the power outage cleanup.

I promise you the communication issue is of high concern and we will be working on it after we get the internal cleanup done. I have personally already done the preliminary work of locating an external machine and am still designing what kind of services it will run. There just hasn't been enough time to implement the external server solution yet. As it stands now, on the critical (and unprecedented) must-fix list, it is number three.

We have heard everyone's concern regarding the communication problem and preliminary work is already underway. It is very important to us that we get a solution deployed as quickly as we possibly can.

Please know that we are working day and night to get the critical must-fix items done as rapidly as possible. I have been working pretty much non-stop on overhauling the SAN (for better resiliency) and I'm about 75% finished with it. Tonight will be a big push (transparent) to get that up to around 85%.

The next major concern is the planned overhaul of the core routers, where the new design is meant to prevent what happened today. New technologies have come into existence that will allow us to break the Primary=>Secondary failover trigger system, so that going Secondary=>Primary can be an automated, seamless, and transparent process, unlike the manual one it is now.

In summation, we are working through many critical must-fix issues that were caused by the power outages as quickly as we can. There are some major components that are currently in a fragile (transitioning) state, and one in particular can disrupt email operations as happened the other day.
__________________
The FutureQuest Team

12-15-2020, 07:25 PM   Postid: 188281
Press
Site Owner
 
Forum Notability: 0 pts: Even-handed
 
Join Date: Oct 2007
Posts: 30
Re: [FQuest Notice] Secondary core router power supply failure

getting cannot verify server identity pop-ups on my phone. i guess there's going to be another email outage?! how long will this one be?
__________________
dp
chicago, usa

12-15-2020, 07:34 PM   Postid: 188283
chernove
Site Owner
 
Forum Notability: 65 pts: Helpful Contributor
 
Join Date: Jan 2002
Location: NYC
Posts: 162
Re: [FQuest Notice] Secondary core router power supply failure

Thank you, Terra. I appreciate this all.

Part of the issue is that in a triage situation, one needs to go with what one can. Or, to put it in computing terms, sometimes you just need a kluge job. To wit: I understand that you might have privacy concerns in re Facebook, and long-term you definitely SHOULD think about not using it. But, in the meantime, when everything hit the fan, was there NO ONE at FutureQuest who thought "oh crap, everything's down again. We need to post something RIGHT NOW to tell our clients we know it's happened and that we're dealing with it"? To be honest, it feels to your clients that FQ has not heard anything about this. You say you have, but the evidence is not there. How many times in two months have we all said, "fixing the darn thing can wait; communication can't."

To be honest, this is the first time I didn't completely read your technical explanation ("The secondary core routing network was taken offline ...") And I probably won't. Why? Because I don't care. And many of your other clients don't care, either. A good number of them can't figure out what they say anyway, and also justifiably think "okay, but that's what I'm paying you for. I don't care what the problem is; FIX IT." (In the past, as you know, I've followed your updates very carefully.) And, to be clear, telling us hours after the fact is, in light of recent communication failures, more galling than helpful.

This is in no way meant to diminish YOUR work. On the contrary, as I mentioned previously, it shouldn't be your job at all to tell us what's going on. You're the tech person, not the PR person. But (and I ask this in all seriousness): who IS in charge of communication in these situations? I've been with FQ for 20 years and I honestly don't know the answer to this question. More upsetting--FAR more upsetting--is that I, and every other client you have at this point--have more than a sneaking suspicion that FutureQuest doesn't know the answer, either. Or, just as bad, they know the answer is "No one. No one has been charged with the sole obligation of keeping our clients informed during outages."

My site went down today 15 minutes before I was to direct over a dozen of my clients to go there for a time-sensitive document. It was absolute dumb luck that I had alternative means to get them this information today. If I didn't I'd have been well and truly screwed.
I've been really patient. I've been really faithful. I've LONG praised FQ to the skies and beyond. But the ongoing communication issues are unconscionable, disgraceful, and bordering on unethical.

Now, for the final time: WHAT IS THE GAMEPLAN: SHORT-TERM, MEDIUM-TERM, LONG-TERM? When your site (and everyone else's) unexpectedly goes out HOW DO WE GET AN UPDATE? Don't tell me the long-term plan first. I don't care. You can tell me that when it happens. The email went out completely, what, two weeks ago? While you're making plans, today happened! If our sites and email go out tomorrow, I need to know--RIGHT NOW--where can I go to get the simple message from FQ, "we know; we're on it." Not 4 hours after the fact. Not 2 hours. Not 20 minutes. I'd say a reasonable timeline: within 5 minutes of FQ being aware of a problem, I should know where to go to get that simple message: "we know; we're on it."

You know, it's almost like I (and about a gazillion others) have mentioned this once or twice (or a million times) since October.

So: WHAT IS THE GAMEPLAN?

12-15-2020, 07:45 PM   Postid: 188284
Press
Site Owner
 
Forum Notability: 0 pts: Even-handed
 
Join Date: Oct 2007
Posts: 30
Re: [FQuest Notice] Secondary core router power supply failure

^^^^well said....
__________________
dp
chicago, usa

12-15-2020, 08:02 PM   Postid: 188285
Hotdog
Site Owner

Forum Notability: 0 pts: Even-handed
 
Join Date: Aug 2013
Posts: 24
Re: [FQuest Notice] Secondary core router power supply failure

When I have to tell people, "I don't know and have no way to find out", my reputation is shot. Your fault, my fault or nobody's fault, I have to live with the consequences. After your last snafu we trusted that you understood the importance of communication and that you would put a priority on that. Even a major technical issue should not come out looking like the end of the world. You do not need an eighteen-wheeler to deliver a wheel barrow of information. You did not learn your lesson. Unfortunately, we are learning ours.

12-15-2020, 09:09 PM   Postid: 188286
daniel77
Site Owner

Forum Notability: 0 pts: Even-handed
 
Join Date: Aug 2016
Posts: 48
Re: [FQuest Notice] Secondary core router power supply failure

It takes a long time to build a good reputation, and a short time to destroy one. It is absolutely as amazing how simple and obvious the changes that needed to be made were, as it is that those changes weren't made. I'll be moving my site elsewhere.

12-15-2020, 11:21 PM   Postid: 188288
Kibarrister
Site Owner

Forum Notability: 0 pts: Even-handed
 
Join Date: Nov 2020
Location: Maryland
Posts: 4
Re: [FQuest Notice] Secondary core router power supply failure

Goodbye Futurequest. I should have my website rerouted to a competitor with some sense of competence by this time next week. I’d say I hope you can get your sh!t together, but with all the money and headaches you have cost me these last few months... anyone that stays with this company is nuts.

All times are GMT -4.