Perturbation on Chat initialization
Incident Report for iAdvize (HA)
Postmortem

Between May 16 and May 17, the performance of our Call service, and then of our Chat service, was degraded. We discovered abnormal errors occurring randomly on our servers during calls to our external call provider and to an internal API. The same errors then occurred on our internal calls and impacted Chats on the following day. The error occurred randomly on requests that used an external library. It appeared intermittently on web servers, corrupting the cache of that request on the affected server. On the same day, we were able to track the errors using our call provider’s logs and our internal logs. We applied a patch to prevent cache corruption on the call and chat channels. To prevent this from happening again, we have reinforced our monitoring of the PHP internal cache system and added proactive alerting linked to this stronger monitoring. After comprehensive testing of that system for a week, we can confirm that these actions fixed the issues linked to the cache corruption.

Posted May 30, 2017 - 16:40 CEST

Resolved
Since our last changes yesterday morning, we have not noticed any problems over the past 24 hours.
We consider that the incident can be closed. Our technical team will keep monitoring activity this weekend and over the following days.
We will also publish a Postmortem in the coming days.
Posted May 19, 2017 - 17:22 CEST
Update
We applied several changes this morning.
Chat volume is now back to normal. We are continuing to monitor activity.
Posted May 18, 2017 - 15:29 CEST
Update
Since our intervention yesterday evening, the number of chat conversations has increased significantly, but the situation is not yet back to normal.
We are continuing to monitor activity. In the meantime, our technical team will apply new fixes this morning to stabilize our platform.
We will continue to provide additional updates going forward.
Posted May 18, 2017 - 09:56 CEST
Monitoring
The problem is probably due to cache corruption. An intervention is in progress to mitigate its effects. We will then monitor the situation to confirm that it is fixed.
Posted May 17, 2017 - 17:46 CEST
Update
We are still investigating the issue. We will continue to provide additional updates going forward.
Posted May 17, 2017 - 15:31 CEST
Identified
Our technical team has identified the origin of the issue and added logging to gather more information.
These logs give us better visibility into what is going wrong and are helping us develop a fix.
Posted May 17, 2017 - 12:07 CEST
Investigating
We have identified disruptions in the chat initialization process. When a visitor tries to start a conversation, they may receive an unavailability message in response. You may receive fewer chats than usual.
Our technical team is currently investigating this issue.
Posted May 17, 2017 - 10:44 CEST