On Tuesday, April 30th, we experienced several service performance degradations on our platform between 9:00am and 11:45am (UTC+2).
During this incident, fewer contacts were handled because of an issue affecting some of our targeting engine instances (around 20%).
A change made to our firewall rules was blocking some connections between the components of the application for any new targeting engine instance. Instances started before this change behaved correctly.
After identifying the problem, we performed the following corrective actions:
Faulty firewall rules were fixed on new instances.
Our infrastructure code was patched to reflect that change.
Everything went back to normal at 11:45am (UTC+2).
We have also identified further actions to prevent this issue from happening again:
Improve the firewall change process. Status: done.
Improve firewall problem detection with new alerts. Status: to do.
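
To illustrate the kind of check a firewall alert could rely on, here is a minimal Python sketch. It is not our actual tooling: the function name, the endpoint list and the timeout are assumptions made for the example. It attempts a TCP connection to each component a new instance depends on and returns those that cannot be reached.

```python
import socket
from typing import Iterable, List, Tuple

# (host, port) pair of a component the instance must be able to reach.
Endpoint = Tuple[str, int]


def unreachable_endpoints(
    endpoints: Iterable[Endpoint], timeout: float = 2.0
) -> List[Endpoint]:
    """Return the endpoints that cannot be reached over TCP.

    Hypothetical helper: a refused connection, a timeout or a DNS failure
    all count as unreachable, since each one raises an OSError subclass.
    """
    failed: List[Endpoint] = []
    for host, port in endpoints:
        try:
            # Open and immediately close a TCP connection.
            with socket.create_connection((host, port), timeout=timeout):
                pass
        except OSError:
            failed.append((host, port))
    return failed
```

Run on a newly started instance against the components it depends on, a non-empty result could raise an alert before the instance starts handling traffic. A blocked connection may be dropped silently rather than refused, which is why the timeout matters.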