Degraded performance
Incident Report for Hero
Postmortem

Firstly, we want to apologize for the inconvenience this outage may have caused.

On Saturday 2021-06-16 at 22:00 NZST we upgraded all our backend services. The upgrade changed how our services encode and decode JSON data. As a flow-on effect, under high load the new de/encoder consumed 100-300% more CPU time and 300-500% more memory. On Tuesday 2021-06-16 at 11:15 NZST we rolled back the change on our most affected services, where it had caused the performance degradation.

The figures below show the load on the most affected service before and after our engineers deployed the hotfix:

Fig 1. CPU and memory load across the people service - lower is better.

Fig 2. CPU load for the people service across our cluster - lower is better.

Over the next week, we will finalize the rollback to the old JSON de/encoder for all of our services and stage the change for release next week. This gives our API consumers the coming week to make any changes they need. We are also investigating better processes to avoid performance regressions in the future.
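One common form such a process takes is a pre-release check that compares the CPU time and peak memory of a new JSON de/encoder against the current one. The following is a minimal sketch in Python, for illustration only: it does not describe Hero's actual services or tooling, and the payload, the `measure` and `regressed` helpers, the 1.5x threshold, and the stand-in "candidate" encoder are all assumptions.

```python
import json
import time
import tracemalloc


def measure(encode, decode, payload, rounds=20):
    """Return (cpu_seconds, peak_bytes) for `rounds` encode/decode round trips.

    CPU time and memory are measured in separate passes, because
    tracemalloc itself slows execution and would distort the timing.
    """
    start = time.process_time()
    for _ in range(rounds):
        decode(encode(payload))
    cpu_seconds = time.process_time() - start

    tracemalloc.start()
    for _ in range(rounds):
        decode(encode(payload))
    _, peak_bytes = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return cpu_seconds, peak_bytes


def regressed(baseline, candidate, max_ratio=1.5):
    """True if any candidate metric exceeds its baseline by more than max_ratio."""
    return any(c > b * max_ratio for b, c in zip(baseline, candidate))


# Hypothetical payload standing in for a large API response.
payload = {
    "people": [{"id": i, "name": f"person-{i}", "tags": ["a", "b"]} for i in range(500)]
}


def candidate_encode(obj):
    # Stand-in for a "new" encoder: same output data, but more work per call.
    return json.dumps(obj, indent=2, sort_keys=True)


baseline = measure(json.dumps, json.loads, payload)
candidate = measure(candidate_encode, json.loads, payload)
print("baseline  (cpu s, peak bytes):", baseline)
print("candidate (cpu s, peak bytes):", candidate)
print("regression detected:", regressed(baseline, candidate))
```

A check like this only approximates production behaviour. The degradation described above appeared under high load, so a realistic version would need representative payloads and concurrency, and would likely run as part of a load test rather than a single-process script.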

Team Hero.

Posted Jun 16, 2021 - 15:57 NZST

Resolved
This incident has been resolved.
Posted Jun 16, 2021 - 13:34 NZST
Monitoring
A fix has been implemented and we are monitoring the results.
Posted Jun 15, 2021 - 09:27 NZST
Identified
We're experiencing degraded performance on our servers and are actively investigating to isolate the problem.
Posted Jun 15, 2021 - 09:10 NZST
This incident affected: Hero (Hero APIs).