Service disruption - iAdvize statistics - conversation history is not reachable

Incident Report for iAdvize (HA)

Postmortem

First of all, we would like to apologize for the incident and its consequences on your use of our solution.
This post mortem intends to explain what happened during this incident.

Summary:

On Tuesday, 4 June 2019, messages from ongoing conversations were not saved for 2 hours and 5 minutes (2:38pm -> 4:43pm UTC+2), due to an endless crash loop of the service in charge of writing them to the database. The crash loop was triggered by an unexpectedly oversized conversation, and there was no proper alerting in place to detect it.

Impacts:

100% of HA customers impacted
On brands’ websites, visitors lost access to previous messages from their ongoing conversation after reloading or changing pages
On the console application, agents lost access to previous messages from their ongoing conversations after reloading the application
On the conversation report, users do not have access to the message history of conversations that occurred during the timeframe of the incident

On the “Contact - responsiveness” report, users have access to incorrect data on the following metrics when selecting a date range corresponding to the incident:
- First message response time
- Response time
- Contacts with no response

Timeline:

The incident started at 2:38pm, but we only identified it at 4pm due to a lack of monitoring on the worker responsible for archiving messages.

After identifying the message archiving service as the cause of the problem, we restarted it at 4:26pm with more RAM allocated, but the problem persisted.

The incident ended at 4:43pm (UTC+2) after we killed the process and cleaned up the corrupted events.
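
To illustrate why a single oversized conversation can block every other message, and how a worker can avoid it, here is a minimal sketch in Python. It is not the actual Archive service code: `drain`, `ArchiveError`, `MAX_EVENT_BYTES` and the retry limit are hypothetical names and values chosen for the example. The idea is that an event which keeps failing is set aside after a few attempts instead of being retried forever.

```python
from collections import deque


class ArchiveError(Exception):
    """Hypothetical error raised when an event cannot be written."""


def drain(queue, write, max_attempts=3):
    """Process events in order and quarantine any event that keeps failing.

    Without the attempt limit, one bad event at the head of the queue
    would be retried forever and nothing behind it would be archived.
    """
    quarantined = []
    while queue:
        event = queue.popleft()
        for _ in range(max_attempts):
            try:
                write(event)
                break  # written successfully, move on to the next event
            except ArchiveError:
                continue  # try again, up to max_attempts times
        else:
            # Every attempt failed: set the event aside for manual review.
            quarantined.append(event)
    return quarantined


# Example: a writer that rejects events above an assumed size limit.
MAX_EVENT_BYTES = 5  # hypothetical limit, tiny for demonstration
stored = []


def write(event):
    if len(event) > MAX_EVENT_BYTES:
        raise ArchiveError("event too large to archive")
    stored.append(event)


pending = deque([b"a", b"x" * 10, b"b"])
rejected = drain(pending, write)
```

In this sketch the oversized event ends up in `rejected`, while the events before and after it are still written.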

Why can’t we recover lost messages?

Today, the Archive service does not write messages to the chat history of closed conversations.

To make this possible in the future, we need to consider how we could re-inject the messages stored in Ejabberd into our different databases. This is a significant task that will take time but would subsequently help to better secure data backup in the event of an incident.

Other actions have been identified to prevent this issue from happening in the future:

Done - Add alerting on the fill level of the archive queue to detect a similar problem more quickly
To do - Analyze the possibility of retrieving messages stored within Ejabberd and reinjecting them into our databases
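
As an illustration of the first action, alerting on the fill level of a queue can be sketched as follows. This is a minimal example under stated assumptions, not the alerting actually deployed: `QueueAlert`, the threshold and the number of consecutive checks are hypothetical. The alert fires only when the backlog stays above the threshold for several checks in a row, so that a short burst of traffic does not trigger it.

```python
from dataclasses import dataclass


@dataclass
class QueueAlert:
    threshold: int         # hypothetical backlog size considered abnormal
    sustained_checks: int  # consecutive checks above threshold before firing
    _streak: int = 0       # how many checks in a row were above threshold

    def observe(self, depth: int) -> bool:
        """Record one queue depth sample; return True if the alert fires."""
        if depth > self.threshold:
            self._streak += 1
        else:
            self._streak = 0  # backlog drained, start counting again
        return self._streak >= self.sustained_checks


alert = QueueAlert(threshold=100, sustained_checks=3)
results = [alert.observe(depth) for depth in [50, 150, 200, 300, 20]]
```

With these sample depths, the alert fires on the fourth check, once the queue has been above the threshold three times in a row, and clears when the backlog drops again.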

Posted Jun 13, 2019 - 09:30 CEST

Resolved

As mentioned above, messages exchanged in conversations during the incident could not be saved between 2:38 pm and 4:43 pm (UTC+2) yesterday. The following statistical indicators are based on the time of arrival of the messages and are therefore also incorrect for this period:
- First message response time
- Response time
- Contacts with no response
Despite our efforts, we are not able to retrieve these missing messages and correct these statistical indicators. We sincerely apologize for this.

After 24 hours of monitoring, we are closing this incident. A post mortem will be published in the coming days with more details on the causes of the incident and the actions that will be taken to avoid this problem in the future.

Posted Jun 05, 2019 - 17:46 CEST

Monitoring

The issue has been identified and a fix has been implemented.
From now on, new conversations will have their history accessible to admins and managers.
The history of the conversations that were closed during the incident may not be recovered.

Posted Jun 04, 2019 - 17:18 CEST

Update

We are continuing to investigate this issue.

Posted Jun 04, 2019 - 16:15 CEST

Investigating

An incident is currently ongoing on the conversation history: messages are not displayed in the report.
We are currently working on the restoration of the service.

This is the only service impacted by the incident. Everything else is working fine (livechat works properly).

Posted Jun 04, 2019 - 16:15 CEST