Service outage
Incident Report for iAdvize (HA)
Postmortem

On September 16th we experienced service disruptions between 10:55am and 12:55pm (UTC+2) on our platform, impacting all iAdvize services.

We quickly traced the problem to our software's orchestrator cluster, which, for reasons not yet known at the time, was unexpectedly restarting or stopping orchestrated jobs.
Further investigation showed that this misbehaviour was caused by a bug in the orchestrator cluster. As a consequence of this bug, all our instances were stuck in a restart loop.

After identifying the problem, we took the following corrective actions:

  • At 12:10pm (UTC+2) we managed to stabilize our orchestrator cluster by applying a bug fix
  • Once the orchestrator cluster was stabilized, we were able to gradually restart all the services

Once all our services had been restarted, the platform returned to normal at 12:55pm (UTC+2) and agents could again receive incoming contacts.

The bug fix we applied will prevent this issue from recurring. No further actions have been identified following this incident.

Posted 6 months ago. Sep 17, 2018 - 11:02 CEST

Resolved
After monitoring the activity of our platform since our last intervention, we have not observed any further disruptions.
Posted 6 months ago. Sep 16, 2018 - 14:03 CEST
Monitoring
The situation is now back to normal: all services have been operating normally since 12:52pm (UTC+2). We appreciate your patience as we worked through this and apologize for the inconvenience.

Our technical team continues to monitor our infrastructure.
Posted 6 months ago. Sep 16, 2018 - 13:18 CEST
Update
Our technical team has performed an intervention. Some services are back; however, the situation is not yet back to normal.

We will post an update as soon as we have more information.
Posted 6 months ago. Sep 16, 2018 - 12:29 CEST
Identified
Our platform is currently not fully reachable, and this incident is impacting all our services. Our technical team is looking into the issue. We will update you as we learn more.
Posted 6 months ago. Sep 16, 2018 - 11:07 CEST
This incident affected: Administrator Web App, Agent Web App, Website widget, Mobile app (for Agent & Ambassador), and API.