"Take these three items, some WD-40, a vise grip, and a roll of duct tape. Any man worth his salt can fix almost any problem with this stuff alone." - Walt Kowalski
"Only two things are infinite - the universe, and human stupidity. And I'm not sure about the universe." - Albert Einstein
Base theme by DesignModo & ported to Powered by Vanilla by Chris Ireland, modified by the "theFB" team.
Comments
It's often a case of the actual failure being minor, but it's getting everything back online and up to speed that takes the time.
I work for part of a very large company that suffered a datacentre being knocked out by a router failure. Although the backup system should have kicked in, it also failed, which meant the entire datacentre went offline. Although it was only offline for under an hour while the faulty equipment was swapped, it took over two days to get all systems back up, and our part of the company took over a week to catch up again after only a day of having to run system-failure procedures.
So either:
(A) They've done an unusually bad job of managing their IT - by either not having DR or allowing someone to start working on both Live and DR at the same time
(B) They have been hacked or the datacentres physically attacked
Add in the fact that their systems have recently been outsourced, so all the people who designed and configured the DR solution are now working elsewhere, and that contributes to the issue too. I bet the outsourcing company never did a DR test when they took over the account.
If the cloud solution is so poor that it's all in a single data-centre and THAT had a power cut long enough for its back-up generators to die ... that could do it - but who operates a system like that?
If it's a private software-as-a-service company that's not Amazon- or Google-cloud sized, then perhaps there are backups in their data centre, but it's all routed through a set of switches that had a separate power supply without backup generators...
But realistically, given the size of the BA operation and modern plans to withstand a DDoS attack, an appropriate out-sourced solution would have multiple distributed servers, probably in multiple geographic locations, each with decent international fibre access... short of a power cut that somehow hits every one of those locations at once, I can't see ANY proper out-sourced solution that BA should be using going down like this...
A cheap out-sourced solution ... maybe one server with a half hour backup power solution... once the server is down it might require hands-on intervention to fix...
Anyone with a budget of tens of thousands of moneys should be able to make a distributed system that's HARD to take down... with the likely millions that BA would throw at it... it should be possible to make it near impossible to kill completely for this amount of time
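For what it's worth, here's a rough Python sketch of the kind of multi-region health-check failover I mean. The endpoint URLs are just made-up placeholders, not anything BA actually runs:

```python
# Minimal sketch of multi-region health-check failover.
# The endpoints below are hypothetical placeholders.
import urllib.request

REGIONS = [
    "https://eu-west.example.com/health",     # hypothetical primary
    "https://eu-central.example.com/health",  # hypothetical secondary
    "https://us-east.example.com/health",     # hypothetical tertiary
]

def first_healthy(endpoints, timeout=2):
    """Return the first region that answers its health check, or None."""
    for url in endpoints:
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                if resp.status == 200:
                    return url
        except OSError:
            continue  # region unreachable, try the next one
    return None

active = first_healthy(REGIONS)
print("Routing traffic to:", active or "NO REGION AVAILABLE")
```

Obviously real traffic steering is done at the DNS or load-balancer layer rather than in a loop like this, but the principle is the same: any single site dying shouldn't take the whole operation with it.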
I also have a hunch that this may be a cyber attack, but they are trying to keep it hush.
If they are indeed cloud-based, it is worth remembering that cloud computing has grown too quickly at times. There are not a lot of big examples of cloud-based disaster recovery (in comparison with 'legacy' style networks). There are still plenty of lessons to be learnt.
They're also going to have a major physical problem to deal with even when they get it running again, because all the planes and crews will now be in the wrong places. I assume that the rotation pattern is probably weekly, so I would expect it to be that long before everything is back to normal.
"Take these three items, some WD-40, a vise grip, and a roll of duct tape. Any man worth his salt can fix almost any problem with this stuff alone." - Walt Kowalski
"Only two things are infinite - the universe, and human stupidity. And I'm not sure about the universe." - Albert Einstein
Maybe Corbyn will rescue BA by nationalising it .... the last time I flew BA it was awful.
Remember, it's easier to criticise than create!
I try to avoid BA, if there's an alternative - I've never been impressed with their service really. I suspect quite a lot of other people may be doing so from now on.
"Take these three items, some WD-40, a vise grip, and a roll of duct tape. Any man worth his salt can fix almost any problem with this stuff alone." - Walt Kowalski
"Only two things are infinite - the universe, and human stupidity. And I'm not sure about the universe." - Albert Einstein
I am aware of a recent lightning strike which fried the mains power transformer at a huge datacenter in London Docklands on a Friday night, and the backup system did not kick in. It took the engineers an entire weekend to source and replace the parts, and get the client computer systems up and running.
So an unexpected, disastrous failure is not entirely unknown.
And apparently the back-up system was down due to some failure or other.
What I don't understand is why there is no manual back-up system for recording the weights and the passenger names and numbers for each aircraft. At least they could still fly with that knowledge.
One of the customers did sound like a drama queen, and she did keep saying "no one told us anything" and "the staff just got up and left". I find both of those hard to believe, especially the staff leaving, unless, as the ex-BA lady said, they were going for a briefing.
One of the really nasty things that happens a lot with power supply failure is when you start a new apparatus room and then populate it with kit which gets powered up as it gets installed. The site goes live and then even more kit gets installed by follow-up projects. At some point the power goes down, and when you re-power it the breaker immediately blows.
The problem?
The supply going into the room can handle the sustained current to power it, but not the inrush current of everything coming on at once. You now have to unplug several hundred boxes, throw the breaker, and replug them one by one.
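To put rough numbers on it (all figures invented for illustration, not real ratings):

```python
# Back-of-the-envelope illustration of the inrush problem described above.
# All figures are assumptions for the sake of the example.
boxes          = 300   # servers in the room
steady_amps    = 1.5   # sustained draw per box
inrush_factor  = 10    # PSU inrush can be many times the steady draw
breaker_rating = 600   # amps the room supply can deliver

steady_total = boxes * steady_amps                  # 450 A - fine
inrush_total = boxes * steady_amps * inrush_factor  # 4500 A - trips instantly

print(f"steady: {steady_total} A, inrush if all start at once: {inrush_total} A")
print("breaker holds" if inrush_total <= breaker_rating else "breaker trips")
```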
Even if you do this you may find that some devices rely on others being available and will shut themselves down if they don't see them. If you don't have intimate knowledge of this ordering you are totally screwed and won't be able to get the services back up.
On top of this, you have every single service that hasn't been properly configured to auto-start, which will just sit there until someone logs into it and works out how to get it back live.
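If you did have that start-up ordering written down, the restart is basically a topological sort over the dependency graph. Here's a toy Python sketch, with the service names and dependencies invented purely for illustration:

```python
# Sketch of dependency-ordered restart, assuming the dependency map is known.
# Service names and dependencies are invented for the example.
from graphlib import TopologicalSorter

# "service": {services it needs up first}
deps = {
    "storage":  set(),
    "database": {"storage"},
    "auth":     {"database"},
    "booking":  {"database", "auth"},
    "check-in": {"booking"},
}

def start(service):
    # Placeholder for whatever actually brings the service up
    # (power on, systemctl start, an API call, ...).
    print(f"starting {service}")

for service in TopologicalSorter(deps).static_order():
    start(service)  # everything it depends on has already started
```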
The company can have all the DR strategies it wants (and probably does), but they are all worthless unless someone has actually yanked the power and most people won't because they are absolutely certain (with good reason) that the whole thing will go to shit, stay down for days, data will be lost and they will get fired.
You might think perhaps that they can just switch over to a backup system in another location, but this is actually a far more complex scenario than you might initially think, because someone like BA always needs consistency in their data between sites (for example, they can't sell the same seat to two different people). Now imagine you are the backup site and you can't contact the live site any more: you don't know if the other site has gone down, or if it's just a network failure. If you get it wrong, either you end up with main and backup running at the same time and royally corrupt all the data, or you don't take over and the whole enterprise shuts down even though you have a viable backup site in working condition. I've seen this go wrong in both directions before.
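One common way round the "is it down, or can I just not see it?" problem is a witness quorum: the backup only promotes itself if it can still reach a majority of independent witnesses. A toy sketch, with placeholder hostnames rather than anything BA-specific:

```python
# Sketch of the promotion decision a backup site has to make, using a simple
# witness quorum so "I can't see the primary" isn't confused with "the
# primary is dead". Hostnames and the reachability check are placeholders.
import socket

WITNESSES = ["witness-1.example.com", "witness-2.example.com", "witness-3.example.com"]
PRIMARY   = "primary.example.com"

def reachable(host, port=443, timeout=2):
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def backup_should_promote():
    if reachable(PRIMARY):
        return False  # primary is fine, do nothing
    votes = sum(reachable(w) for w in WITNESSES)
    # Only take over if a majority of witnesses agree it isn't us
    # that has been cut off from the network.
    return votes > len(WITNESSES) // 2

print("promote backup" if backup_should_promote() else "stay passive")
```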
I could go on. There are lots of ways stuff like this goes horribly wrong.
I wouldn't be surprised if the outsourcing company asked to do a full shutdown DR test as part of handover and BA told them to bugger off.