British Airways computer failure

ICBM · May 2017

A question for the more computer-oriented people here…

How can a global corporation the size of BA have had such a complete IT meltdown due to a 'power supply failure'? Are they lying, or is it possible that their system was really based around one IT centre with one power supply, and as such a single point of failure?

I could be wrong but I assumed they would at least have multiple redundancy, or even a widely-distributed system so that no one failure could bring it down.

And if it is true, were the GMB right to say that it's the result of outsourcing? - I remember they warned at the time that it could lead to trouble. If they're right it took about six months to show up.

m_c · May 2017

It's often a case of the actual failure being minor, but getting everything back online and up to speed that takes time.

I work for part of a very large company who suffered from a datacentre being knocked out due to a router failure. Although the backup system should of kicked in, it also failed, which meant the entire datacentre went offline. Although it was only offline for under an hour while the faulty equipment was swapped, it took over 2 days get all systems back up, and our part of the company took over a week to get caught up again after only a day of having to run system failure procedures.

ToneControl · May 2017

They should have full DR (disaster recovery) for all systems, with duplicate hardware at at least one other physical location. No large business would not have DR.

So either:
(A) They've done an unusually bad job of managing their IT - by either not having DR or allowing someone to start working on both Live and DR at the same time
(B) They have been hacked or the datacentres physically attacked

Axe_meister · May 2017

ICBM said:

A question for the more computer-oriented people here…

How can a global corporation the size of BA have had such a complete IT meltdown due to a 'power supply failure'? Are they lying, or is it possible that their system was really based around one IT centre with one power supply, and as such a single point of failure?

I could be wrong but I assumed they would at least have multiple redundancy, or even a widely-distributed system so that no one failure could bring it down.

And if it is true, were the GMB right to say that it's the result of outsourcing? - I remember they warned at the time that it could lead to trouble. If they're right it took about six months to show up.

Testing. The best DR strategy is worth bugger all without regular testing, including a full DC non graceful shutdown.
Add in the fact that their systems have recently been outsourced so all the people who designed and configured the DR solution are now working elsewhere also contributes to the issue. I bet the outsourcing company never did a DR test when they took over the account

JezWynd · May 2017

This meltdown and M & S sandwiches is sorely testing my faith in BA.

Winny_Pooh · May 2017

I guy I recently worked with used to be an IT consultant with BA for years and the way he described their systems and slow pace it sounded just like the way some NHS trusts work internally.

Myranda · May 2017

If the whole operation is based on cloud Software-as-a-service based somewhere abroad then potentially due to some strange routing issues a power cut in an intervening hop country's incoming fibre station *might* cause problems... the internet is supposed to re-route stuff...

If the cloud solution is so poor that it's all in a single data-centre and THAT had a power cut long enough for its back-up generators to die ... that could do it - but who operates a system like that?

If it's a private Software-as-a-service company that's not like Amazon or Google cloud sized then perhaps there are backups in their data centre but it's rooted through a set of switches that had a separate power supply without backup generators...

But realistically given the size of the BA Operation and modern plans to avoid a DDoS attack, in an appropriate out-sourced solution there would be multiple distributed servers, in probably multiple geographic locations, each with decent international fibre access... short of a power cut that somehow affects multiple geographic locations covering every single distributed server I can't see ANY proper out-sourced solution that BA should be using...

A cheap out-sourced solution ... maybe one server with a half hour backup power solution... once the server is down it might require hands-on intervention to fix...

Anyone with a budget of tens of thousands of moneys should be able to make a distributed system that's HARD to take down... with the likely millions that BA would throw at it... it should be possible to make it near impossible to kill completely for this amount of time

joeyowen · May 2017

hindsight is wonderful, but they probably don't have a mitigation strategy that is anywhere near good enough.
I am also on the hunch that this may be a cyber attack but they are trying to keep hush.

If they are indeed cloud based, it is worth remembering that cloud computing has grown too quickly at times. They is not a lot of big examples of cloud based disaster recovery (in comparison to 'legacy' style networks). There are still plenty of lesson to be learnt.

ICBM · May 2017

Myranda said:

But realistically given the size of the BA Operation and modern plans to avoid a DDoS attack, in an appropriate out-sourced solution there would be multiple distributed servers, in probably multiple geographic locations, each with decent international fibre access... short of a power cut that somehow affects multiple geographic locations covering every single distributed server I can't see ANY proper out-sourced solution that BA should be using...

A cheap out-sourced solution ... maybe one server with a half hour backup power solution... once the server is down it might require hands-on intervention to fix...

Anyone with a budget of tens of thousands of moneys should be able to make a distributed system that's HARD to take down... with the likely millions that BA would throw at it... it should be possible to make it near impossible to kill completely for this amount of time

That's what I thought.

They're also going to have a major physical problem to deal with even when they get it running again, because all the planes and crews will now be in the wrong places. I assume that the rotation pattern is probably weekly, so I would expect it to be that long before everything is back to normal.

Fretwired · May 2017

Do they still use Amadeus? An ancient creaking system ..

Maybe Corbyn will rescue BA by nationalising it .... the last time I flew BA it was awful.

stickyfiddle · May 2017

BA is a budget-quality airline still clinging to the idea that they are world-class. I don't know about their IT infrastructure, but assuming they've applied the same arrogance there as they do to customer service I'm not surprised it's caused this big a fuck-up.

ICBM · May 2017

Fretwired said:

Do they still use Amadeus? An ancient creaking system ..

Maybe Corbyn will rescue BA by nationalising it .... the last time I flew BA it was awful.

I don't actually see airlines as a good target for nationalisation - they're neither a natural monopoly, a public service or a necessity, and effective competition can work between them. Airports on the other hand… would probably be better in public ownership because a proper national strategy could be devised - as opposed to allowing vested business interests to bully the government into the stupidity of expanding Heathrow. Which rather amusingly will - if it ever goes ahead - result in BA having pressed for the demolition of its own headquarters, which is in the way

.

I try to avoid BA, if there's an alternative - I've never been impressed with their service really. I suspect quite a lot of other people may be doing so from now on.

rocktron · May 2017

It depends on what the problem is, and perhaps BA are not misleading the travelling public. Most large datacenters have N+1 redundancy, but even they can be hit by an unexpected disaster.

I am aware of a recent lightning strike which fried the mains power transformer at a huge datacenter in London Docklands on a Friday night, and the backup system did not kick in. It took the engineers an entire weekend to source and replace the parts, and get the client computer systems up and running.

So an unexpected, disastrous failure is not entirely unknown.

mike_l · May 2017

Listening on R5 last night, there was a lady who was ex-BA, and she categorically stated that the current system kept on crashing - usually in one area (ie baggage handling or check-in). She did think that it wasn't fit for the task.

And apparently the back-up system was down due to some failure or other.

What I don't understand is why there is no manual back-up system for checking weights, and customer name and numbers onto each aircraft. At least they could still fly with that knowledge.
One of the customers did sound like a drama queen, and she did keep saying "no one told us anything" and "the staff just got up and left". I find both of those hard to believe, especially the staff leaving, unless, as the ex-BA lady said, they were going for a briefing.

Evilmags · May 2017

BA have been crap for ages to be honest. Willie Walsh is not the most customer oriented man in aviation and for short distance their is no real advantage in not flying a cheap carrier. Long distance Cathay Pacific, Virgin Atlantic and Emirates are miles ahead.

monquixote · May 2017

Distributed systems can fail in all sorts of horrific ways which are hard to recover from. Even if you test a few scenarios it's often when something unexpected happens that you get in real trouble.

One of the really nasty things that happens a lot with power supply failure is when you start a new apparatus room and then start populating it with kit which gets powered up as it gets installed. The site goes live and then even more kit gets installed by follow up projects. At some point the power goes down and when you repower it the breaker immediately blows.
The problem?
The supply going into the room can handle the sustained current to power it, but not in inrush current of everything going on at once. You now have to unplug several hundred boxes throw the breaker and replug them one by one.
Even if you do this you may find that some devices rely on others being available and will shut themselves down if they don't see them. If you don't have intimate knowledge of this ordering you are totally screwed and won't be able to get the services back up.
On top of this you have got every single service which hasn't been configured properly to auto start which will just sit there until someone logs into it and works out how to get it back live.

The company can have all the DR strategies it wants (and probably does), but they are all worthless unless someone has actually yanked the power and most people won't because they are absolutely certain (with good reason) that the whole thing will go to shit, stay down for days, data will be lost and they will get fired.

You might think perhaps that they can just switch over to a backup system in another location, but this is actually a far more complex scenario than you might initially think because for someone like BA they always need consistency in their data between sites. (For example they can't sell the same flight to two different people). Now imagine you are the backup site and you can't contact the live site any more, you don't know if the other site has gone down, or if it's just a network failure. If you get it wrong either you end up with main and backup running at the same time and royally corrupt all the data, or you don't take over and you end up with the whole enterprise shutting down even though you have a viable backup site in working condition. I've seen this go wrong in both directions before.

I could go on. There are lots of ways stuff like this goes horribly wrong.

CarpeDiem · May 2017

There should be Uninterruptable Power Supply in place, so was this not there for part of the system? There should also be alternative hardware in the event of a failure. Since the IT was recently outsourced, everything should have been fully documented and tested, including disaster recovery. Hopefully, BA will investigate thoroughly as to what went wrong and why. They should also consider why there was no clear communications strategy in place when it went wrong. I suspect this incident will feature in future academic studies as an example of poor management and communications.

monquixote · May 2017

CarpeDiem said:

There should be Uninterruptable Power Supply in place, so was this not there for part of the system? There should also be alternative hardware in the event of a failure. Since the IT was recently outsourced, everything should have been fully documented and tested, including disaster recovery. Hopefully, BA will investigate thoroughly as to what went wrong and why. They should also consider why there was no clear communications strategy in place when it went wrong. I suspect this incident will feature in future academic studies as an example of poor management and communications.

Most UPS systems don't last very long in a big data center. It certainly doesn't help you if someone puts a digger through your power lines into the building etc.

I wouldn't be surprised if the outsourcing company asked to do a full shutdown DR test as part of handover and BA told them to bugger of.

rocktron · May 2017

Many Service Delivery Managers are reluctant to permit scheduled downtime of their Production Systems to conduct a DR Test for fear that the Production Systems won't come back up on restart. This is often one major reason that DR Tests do not get done.

57Deluxe · May 2017

cos they were running on Domino servers and Lotus Notes!

JezWynd · May 2017

ICBM said:

Fretwired said:

Do they still use Amadeus? An ancient creaking system ..

Maybe Corbyn will rescue BA by nationalising it .... the last time I flew BA it was awful.

I try to avoid BA, if there's an alternative - I've never been impressed with their service really. I suspect quite a lot of other people may be doing so from now on.

Up until recently, I've been the opposite - always flew BA - partly out of habit, my father flew for them (started out as navigator on Stratocruisers with BOAC). Apart from family loyalty, I always had faith in their maintenance and crews, though that has eroded in recent years when both the planes and the cabin crew have seemed decidedly tatty around the edges.

Howdy, Stranger!

Categories

In this Discussion

Become a Subscriber!

British Airways computer failure

Comments