On Wednesday morning we experienced an outage. It began at 8:57 am, and our team stabilised the platform by 9:08 am, for a total duration of 11 minutes.
Usage at Hero follows a very specific pattern: nearly all of our users log on to take attendance around 9:00 am each day, which creates a rapid spike in load on our servers. We plan for this ramp-up period and provision capacity to anticipate it. In this instance, however, those strategies did not fully absorb the spike, and an issue resulted.
As a result, a few processes managed to “run away” with more compute resources than we anticipated. This degraded one of our servers, which in turn backed up operations on other servers, causing the outage.
The high traffic also exposed a compute issue in the authentication service. Under this pressure, the service was unable to start new instances of itself: each new instance was immediately terminated for high RAM usage (spikes of up to 1.5 GB). This meant the authentication platform was unable to self-heal during the incident.
We are putting additional safeguards in place to prevent individual services from consuming too much compute. We are also in the final stages of an upgrade to our authentication service, which will not only improve performance but also bring a number of significant benefits to all our users.