But I have had CIO roles in which I've managed application dev, test and support teams. Not running anything as complex as Horizon, but still providing run-the-business applications to thousands of users.
No-one would ever tell me that an application was guaranteed 100% fault-free, error-free, robust, or however you'd define it. Code in applications like that is way too complex and there are too many uncontrollable variables to give that guarantee, even for an application that was entirely self-contained.
When you add in integrations with third-party applications (each with their own issues naturally!), "unexpected" (and hence generally untested) user actions, data issues (invalid, unavailable, corrupted, etc) - none of which can be controlled by the dev team - it's not so much a risk that the application will behave unexpectedly, and more a likelihood.
If anyone had told me "it's 100%", I'd not have believed them. It's not achievable.
What's critical is how those unexpected behaviours are trapped and contained, how potential impacts are assessed, and how fixes are developed quickly but rigorously (ie don't rush it and make the problem worse despite all the shouting). Wrapped around all of that activity is comms: ensuring that those affected, and those who need to know, are promptly and effectively informed, so that they understand the issues and react correctly.
Further, it appears that Horizon wasn't even, originally, an ICL/Fujitsu developed application, but had been bought in (at POL's request) from another third party. Which is another layer of risk. I hadn't realised that before today's session.
So, whether or not Horizon was robust (that undefined, subjective measure) isn't really the point here (though I'd say if it correctly processed 99.99%* of transactions, that's pretty robust). The fundamental error seems to have been a refusal by POL Exec to accept that the system could be wrong or create errors (ignorance), compounded by a failure to assess the impact of any errors (incompetence), to communicate issues to those involved, and to resolve them whilst protecting users from adverse consequences.
*99.99966% - I had to look that up to remember the actual six sigma quality measure!
Also, just how likely is it that 5% of your branches are stealing? Surely that should have given some cause for concern.
Not correct maths - you need to work out the system error rate on transaction volumes, not per post office, and you also need to consider the time period.
Using your numbers (I'm not sure they're right), let's assume (very conservatively) an average of 100 transactions per day per branch. In larger branches, that number will be orders of magnitude larger.
Over a 15year period, that's 12,000 x 100 x 312 (working days / yr) x 15 (yrs) = 5,616,000,000 transactions in total.
Not every transaction in each of 600 branches resulted in a system error. Let's assume, again conservatively, that there were 100 error transactions in the affected branches, ie 600 * 100 = 60,000. Those errors didn't happen every day, and certainly not over the full 15yr period. As soon as the error was recognised (failed reconciliations), they were flagged and investigated.
That's a transaction error rate, over the period, of 0.00106838%, or a success rate of 99.998932%, which is virtually the same as the "gold standard" six sigma quality level I quoted above.
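The back-of-envelope calculation above can be checked in a few lines. All the inputs are the (admittedly rough) assumptions from the posts: 12,000 branches, 100 transactions per branch per working day, 312 working days a year, 15 years, and 600 affected branches with roughly 100 erroneous transactions each.

```python
# Rough check of the transaction error rate discussed above.
# All figures are the conservative assumptions from the thread, not real POL data.
branches = 12_000
tx_per_branch_per_day = 100
working_days_per_year = 312
years = 15

total_tx = branches * tx_per_branch_per_day * working_days_per_year * years
error_tx = 600 * 100  # 600 affected branches, ~100 bad transactions each

error_rate = error_tx / total_tx
print(f"total transactions: {total_tx:,}")      # 5,616,000,000
print(f"error rate:   {error_rate:.8%}")        # ~0.00106838%
print(f"success rate: {1 - error_rate:.6%}")    # ~99.998932%
```

Which reproduces the 0.00106838% error rate (99.998932% success) quoted above - close to, though still short of, the 99.99966% six sigma figure.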
But the point remains that it's not the error rate that was the critical issue here.
It's how those errors were handled, both by the application itself, Fujitsu/ICL and - more importantly - POL management - ie your first point.
But I've seen many businesses pick a critical application on which to run their business without due consideration of the underlying technical architecture.
Generally "How much?", "Can we have it ready next Tuesday?" and "Can this screen be blue?" are the critical Exec considerations.
As to robustness, given the NT/VB basis and the complexity of the application, I'd say that those error rates calculated above suggest the system was amazingly "robust". But no software is ever 100%, and it's how the errors are handled that becomes most critical.