But I have had CIO roles in which I've managed application dev, test and support teams. Not running anything as complex as Horizon, but still providing run-the-business applications to thousands of users.
No-one would ever tell me that an application was guaranteed 100% fault-free, error-free, robust, or however you'd define it. Code in applications like that is way too complex and there are too many uncontrollable variables to give that guarantee, even for an application that was entirely self-contained.
When you add in integrations with third-party applications (each with their own issues naturally!), "unexpected" (and hence generally untested) user actions, data issues (invalid, unavailable, corrupted, etc) - none of which can be controlled by the dev team - it's not so much a risk that the application will behave unexpectedly, and more a likelihood.
If anyone had told me "it's 100%", I'd not have believed them. It's not achievable.
What's critical is how those unexpected behaviours are trapped and contained, how potential impacts are assessed, and how fixes are developed quickly but rigorously (ie don't rush it and make the problem worse despite all the shouting). Wrapped around all of that activity is comms: ensuring that those affected, and those who need to know, are promptly and effectively informed, so that they understand the issues and react correctly.
Further, it appears that Horizon wasn't even, originally, an ICL/Fujitsu developed application, but had been bought in (at POL's request) from another third party. Which is another layer of risk. I hadn't realised that before today's session.
So, whether or not Horizon was robust (that undefined, subjective measure) isn't really the point here (though I'd say if it correctly processed 99.99%* of transactions, that's pretty robust). The fundamental error seems to have been a refusal by POL Exec to accept that the system could be wrong or create errors (ignorance), compounded by a failure to assess the impact of any errors (incompetence), to communicate issues to those involved, and to resolve them whilst protecting users from adverse consequences.
*99.99966% - I had to look that up to remember the actual six sigma quality measure!
Also, just how likely is it that 5% of your branches are stealing? Surely that should have given some cause for concern.
Not correct maths - you need to work out the system error rate on transaction volumes, not per post office, and you also need to consider the time period.
Using your numbers (I'm not sure they're right), let's assume (very conservatively) an average of 100 transactions per day per branch. In larger branches, that number will be orders of magnitude larger.
Over a 15year period, that's 12,000 x 100 x 312 (working days / yr) x 15 (yrs) = 5,616,000,000 transactions in total.
Not every transaction in each of 600 branches resulted in a system error. Let's assume, again conservatively, that there were 100 error transactions in the affected branches, ie 600 * 100 = 60,000. Those errors didn't happen every day, and certainly not over the full 15yr period. As soon as the error was recognised (failed reconciliations), they were flagged and investigated.
That's a transaction error rate, over the period, of 0.00106838%, or a success rate of 99.998932%, which is virtually the same as the "gold standard" six sigma quality level I quoted above.
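The back-of-envelope calculation above can be checked in a few lines. All the inputs are the (admittedly rough) assumptions from the posts: 12,000 branches, 100 transactions per branch per working day, 312 working days a year, 15 years, and 600 affected branches with roughly 100 erroneous transactions each.

```python
# Rough check of the transaction error rate discussed above.
# All figures are the conservative assumptions from the thread, not real POL data.
branches = 12_000
tx_per_branch_per_day = 100
working_days_per_year = 312
years = 15

total_tx = branches * tx_per_branch_per_day * working_days_per_year * years
error_tx = 600 * 100  # 600 affected branches, ~100 bad transactions each

error_rate = error_tx / total_tx
print(f"total transactions: {total_tx:,}")      # 5,616,000,000
print(f"error rate:   {error_rate:.8%}")        # ~0.00106838%
print(f"success rate: {1 - error_rate:.6%}")    # ~99.998932%
```

Which reproduces the 0.00106838% error rate (99.998932% success) quoted above - close to, though still short of, the 99.99966% six sigma figure.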
But the point remains that it's not the error rate that was the critical issue here.
It's how those errors were handled, both by the application itself, Fujitsu/ICL and - more importantly - POL management - ie your first point.
But I've seen many businesses pick a critical application on which to run their business without due consideration of the underlying technical architecture.
Generally "How much?", "Can we have it ready next Tuesday?" and "Can this screen be blue?" are the critical Exec considerations.
As to robustness, given the NT/VB basis and the complexity of the application, I'd say that those error rates calculated above suggest the system was amazingly "robust". But no software is ever 100%, and it's how the errors are handled that becomes most critical.