Update 11.58 hours 5 January 2022 – Tower Hamlets website is now back online. It’s amazing what you can do with computers nowadays, isn’t it? Two sources have said it seems the Council’s IT department may well have forgotten to renew the domain name. Which is so negligent it seems about right.
TL;DR: The Tower Hamlets website has been offline for more than a day now. How is this possible? And what has Nasty Nick got to do with it?
At the time of writing the Tower Hamlets Council web site, usually available at www.towerhamlets.org.uk, has been offline for at least 24 hours. This is unprecedented.
The severity of this incident is best explained by using the analogy of a passenger jet that stops flying.
No engines. No electronics. No hope. If you are unfortunate enough to be a passenger or crew member on the plane when it stops flying there is no chance of survival.
Plane falls out of the sky, everyone dies.
The Five Nines
In a previous career this author used to design and build high-availability web systems for a living. Or web sites that would (almost) never fail.
The holy grail of high-availability systems is known as the ‘five nines’ or 99.999% uptime.
For all sorts of geeky reasons it is not possible to build a web system (or any system) that will exhibit 100% uptime but 99.999% is as close as you can get.
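To put a number on the 'five nines', here is a rough sketch of the arithmetic. It assumes a 365-day year, and the function name is made up for illustration.

```python
# Allowed downtime per year for a given availability target.
# Assumption: a non-leap year of 365 days.
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # 31,536,000 seconds


def downtime_budget_seconds(availability_percent: float) -> float:
    """Return the seconds of downtime per year permitted by an uptime target."""
    return SECONDS_PER_YEAR * (1 - availability_percent / 100)


for target in (99.0, 99.9, 99.99, 99.999):
    minutes = downtime_budget_seconds(target) / 60
    print(f"{target}% uptime allows about {minutes:.1f} minutes of downtime a year")
```

On these assumptions 99.999% uptime leaves a budget of roughly 5 minutes and 15 seconds a year. A single outage of 24 hours uses up about 274 years' worth of that budget.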
The need for high-availability web sites has grown as our reliance on sites such as www.towerhamlets.org.uk has increased at a scary rate.
I am old enough to remember when the first public-facing web sites began to emerge and the wonder of being able to use a cranky modem to dial-up (literally) and connect to a site in another country was just this side of comprehensible.
Before we realised what was going on everyone needed their own site which they could, err, well, everyone wanted one.
Within a few years wanted became needed.
High-availability is the concept that your web system, whether it belongs to Tower Hamlets Council, Ikea, Google or East End Enquirer, will continue to work normally after an unexpected failure or error, with the interruption to normal service being either none or so short as not to be noticed by most users.
This means that there can be no single point of failure in a high-availability web system. There is no ‘one’ of anything.
When there is a failure of a critical part of the system – and all technology can fail – the system is designed to automatically flip from the broken bit to a nice shiny working bit.
It is a simple idea which can be expensive, because no single point of failure means redundancy. While the web system chugs along doing what it should do there will be another exact copy of it also chugging along but hidden from ordinary users.
The site stops chugging because of a widget failing in System A and automatically flips over to System B and all is well.
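The flip from System A to System B can be sketched in a few lines. This is only a toy illustration: in a real high-availability system the job is usually done by load balancers and health checks, not application code, and every name below is hypothetical.

```python
from typing import Callable, Sequence


def serve_with_failover(systems: Sequence[Callable[[str], str]], path: str) -> str:
    """Try each redundant system in order and return the first successful response."""
    last_error = None
    for system in systems:
        try:
            return system(path)
        except ConnectionError as error:
            last_error = error  # this copy is broken, so flip to the next one
    raise RuntimeError("every redundant system failed") from last_error


def system_a(path: str) -> str:
    """Hypothetical primary system, simulating a failed widget."""
    raise ConnectionError("widget failure in System A")


def system_b(path: str) -> str:
    """Hypothetical standby system, an exact copy that is still working."""
    return f"200 OK: {path} served by System B"


print(serve_with_failover([system_a, system_b], "/council-tax"))
```

The user asking for the page never sees the failure in System A. The request only fails if every copy is broken, which is what a single point of failure guarantees when there is only one copy.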
While it sounds (and is) very expensive to have a duplicate system essentially doing nothing loitering around just in case the primary system fails, imagine the cost of not having that duplicate redundant system.
Some poor individual is doing those sorts of sums in Tower Hamlets town hall right now.
Elsewhere there will be multiple teams of tech people across the country trying to get the problem fixed.
Been there, done that, got the t-shirt.
The really incredible thing is that the council has no web presence whatsoever. Nothing.
When you type www.towerhamlets.org.uk into your computer you are taken to wherever www.towerhamlets.org.uk is pointing to. It can point anywhere.
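For the curious, the 'pointing' is a name lookup that turns the site name into a numeric address. Here is a minimal sketch using Python's standard socket module. The function name is invented, and the two stub resolvers and the address 192.0.2.10 (taken from a range reserved for documentation) are stand-ins so the example runs without touching the network.

```python
import socket
from typing import Callable, Optional


def where_does_it_point(
    hostname: str,
    resolve: Callable[[str], str] = socket.gethostbyname,
) -> Optional[str]:
    """Return the IP address a name points to, or None if it points nowhere."""
    try:
        return resolve(hostname)
    except socket.gaierror:  # raised when the name cannot be resolved
        return None


def stub_name_missing(hostname: str) -> str:
    """Hypothetical resolver imitating a name that no longer resolves."""
    raise socket.gaierror("name not known")


def stub_name_present(hostname: str) -> str:
    """Hypothetical resolver imitating a working name (documentation address)."""
    return "192.0.2.10"


print(where_does_it_point("www.towerhamlets.org.uk", resolve=stub_name_missing))
print(where_does_it_point("www.towerhamlets.org.uk", resolve=stub_name_present))
```

Leaving out the `resolve` argument makes the function ask the real resolver on your machine. If the answer is nothing at all, no server behind the name will ever be reached, however healthy it is.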
So why is www.towerhamlets.org.uk not pointing anywhere at all? This is a very good indicator of a failure of management at Tower Hamlets Council. Not the IT managers, who will no doubt be held to account when things are back to normal, but the business managers.
Will Tuckley, Tower Hamlets CEO, is at the top of the business manager pile so this is ultimately down to him. He does not need to be a technical expert, but he does need to have employed the right technical experts who work in the right way. He needs to understand the business need for high-availability but does not need to be a load balancer expert.
9/11 changed our society permanently and the web was no exception. News sites such as that for the New York Times (https://www.nytimes.com/) suddenly found the load on their web sites (load = number of simultaneous users) was so great that they were unavailable to most users.
Which is not ideal when there is a massive world event happening outside your window which is generating a massive demand for real-time information from all over the planet.
New Yorkers being New Yorkers, someone took decisive action, and quickly. The New York Times became just one single simple web page. One.
Because the New York Times was only having to serve this single page to the world the load on its servers was hugely reduced so users could find out something as opposed to nothing.
Not ideal but it worked.
All competent web site managers at large organisations should have a similar contingency plan, just a single simple web page which sits all by itself, probably forever, waiting for something to happen.
When that something does happen it has its hour of glory.
There is no point in having this lonely page live on the same system as the main system, because then it too may be unavailable.
It has to live somewhere completely different that is not in any way connected to the main site.
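As a rough idea of how little is needed, here is a sketch of such a lonely page using only Python's standard library. The wording of the notice, the port number and the names are all assumptions for illustration, and answering with a 503 status is one reasonable choice, not the only one.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical wording; a real notice would be agreed in advance.
HOLDING_PAGE = b"""<!doctype html>
<html><head><meta charset="utf-8"><title>Service notice</title></head>
<body><h1>We are having technical problems</h1>
<p>Our main website is unavailable. We are working to fix it.</p>
</body></html>"""


class HoldingPageHandler(BaseHTTPRequestHandler):
    """Answer every GET request with the same self-contained page."""

    def do_GET(self):
        self.send_response(503)  # 503 Service Unavailable: the outage is temporary
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(HOLDING_PAGE)))
        self.end_headers()
        self.wfile.write(HOLDING_PAGE)


def run(port: int = 8080) -> None:
    """Serve the holding page until interrupted. Run it on separate infrastructure."""
    HTTPServer(("", port), HoldingPageHandler).serve_forever()
```

Calling `run()` starts the server. The page has no images, scripts or database behind it, so there is very little to go wrong, but it only helps if it is hosted somewhere that shares nothing with the main site.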
If this had been done at Tower Hamlets users would be able to read a simple statement explaining that there was a technical problem, that it would soon be fixed, and asking them not to try calling the council on the phone.
One reason being that phone systems are often part of the main IT system. So the phones might not be working either.
And what makes this author an expert on high-availability systems?
22 years ago I was the person responsible for designing and managing the web site for the first Big Brother UK.
The only way I managed to get this done was to have a team of exceptional people working with me. It was a privilege to lead them.
In the planning phase we had been warned by Endemol that the UK site might be subject to a massive load.
A gross understatement.
When Nasty Nick was being kicked out of the Big Brother house the load on the UK internet was so great that some Internet Service Providers were begging me to turn off the main site before their systems collapsed.
The first Big Brother UK took place a good five years before something called YouTube, and broadband web access was still in its infancy. And because not many people viewed video content online the structure of the UK internet was not designed to distribute such huge amounts of data, especially not from one site at the same time.
The web server system my team and I, including some exceptional people from Intel UK, had designed never failed. It never crashed. It was always available.
If you could get to it.
The real problem was that the bandwidth demands generated by the Big Brother site were so great that the entire UK internet was grinding to a halt because it was just not designed for it.
Nowadays that is not a problem.
In the planning phase I was sitting outside a cafe in the West End with members of the different technical teams (over 100 people across the planet would eventually become involved) after yet another meeting.
One of my colleagues was scribbling some numbers on a paper napkin. He paused then did his calculations again.
“I think there is a real chance that when the Big Brother site is running at full demand it could be responsible for 25% of all UK internet traffic,” he said.
There was a moment of silence while we absorbed this. Then we got on with the job.
There is no difference between designing a web system to reliably serve video content from a reality TV show and designing a web system to provide the information needs of a local authority.
Both have to be fit for purpose.
The Big Brother site was fit for purpose. The Tower Hamlets Council site was not.
Which is why all over the borough people have no means of using essential council services or just looking up basic information.
Not ideal in an election year.