Service disruption
Incident Report for iAdvize (HA)
Postmortem

On Tuesday, April 30th, we experienced several service performance degradations on our platform between 9am and 11:45am (UTC+2).

During this incident, the number of contacts handled decreased because of an issue on some of our targeting engine instances (around 20%).

A change made to our firewall rules was blocking some connections between the components of the application on any new targeting engine instance. Instances started before this change behaved correctly.

After identifying the problem, we took the following corrective actions:

  • Faulty firewall rules were fixed on new instances.

  • Our infrastructure code was patched to reflect that change.

Everything went back to normal at 11:45am (UTC+2).

Other actions have been identified to prevent this issue from happening in the future:

  • Improve the firewall change process. Status: done.

  • Improve firewall problem detection with new alerts. Status: to do.
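
As an illustration of the kind of check such an alert could be built on, here is a minimal Python sketch. It is not our actual tooling: the function names, the endpoint list and the timeout are hypothetical. It assumes a new instance can be validated by opening a TCP connection to each component it depends on, and that any failed connection should raise an alert.

```python
import socket
from typing import Iterable, List, Tuple

# (host, port) of a component the instance must be able to reach (hypothetical)
Endpoint = Tuple[str, int]


def unreachable_endpoints(endpoints: Iterable[Endpoint],
                          timeout: float = 2.0) -> List[Endpoint]:
    """Return the endpoints to which a TCP connection could not be opened."""
    failed = []
    for host, port in endpoints:
        try:
            # Open and immediately close a TCP connection.
            with socket.create_connection((host, port), timeout=timeout):
                pass
        except OSError:
            # Covers refused connections, timeouts and unreachable hosts,
            # which is how a blocking firewall rule typically shows up.
            failed.append((host, port))
    return failed


def should_alert(endpoints: Iterable[Endpoint], timeout: float = 2.0) -> bool:
    """Alert as soon as at least one required endpoint is unreachable."""
    return bool(unreachable_endpoints(endpoints, timeout))
```

Run at instance startup and then periodically, a probe like this would have flagged the new instances while the instances started before the change kept passing.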

Posted Apr 30, 2019 - 16:05 CEST

Resolved
Having monitored the activity of our platform since our last intervention, we have not noticed any further disruptions. The situation is back to normal.

Thanks a lot for your patience.
Posted Apr 30, 2019 - 14:11 CEST
Update
The situation is now back to normal. Our technical team is continuing to monitor our infrastructure.
We will provide you with a postmortem on this incident as soon as possible.
Posted Apr 30, 2019 - 11:50 CEST
Update
We are continuing to monitor for any further issues.
Posted Apr 30, 2019 - 11:49 CEST
Monitoring
The situation is now back to normal. Our technical team is continuing to monitor our infrastructure.
We will provide you with a postmortem on this incident as soon as possible.
Posted Apr 30, 2019 - 11:48 CEST
Update
We are still experiencing slowdowns on the platform.
We are restarting services one by one in order to identify the root cause of the bottleneck.
Bot and social services are stopped for the moment.
We are working to restore the service as soon as possible and will update you as we learn more.
Posted Apr 30, 2019 - 11:28 CEST
Update
Because of the problem we are having with the bots, we are stopping all of them.
We are still investigating the issues.
Posted Apr 30, 2019 - 11:08 CEST
Investigating
There is still disruption on the platform. You may experience delays in receiving conversations due to slowdowns on the routing service.
Bots are also experiencing problems.
We are working to restore the service and will update you as we learn more.
Posted Apr 30, 2019 - 10:54 CEST
Monitoring
A patch has been deployed and the situation is improving. We are now monitoring the results.
Posted Apr 30, 2019 - 10:06 CEST
Update
We are continuing to investigate this issue.
Thanks for your patience.
Posted Apr 30, 2019 - 09:39 CEST
Investigating
There is currently an incident impacting targeting and causing a decrease in contacts.
We are working to restore the service and will update you as we learn more.
Posted Apr 30, 2019 - 09:20 CEST