First of all, we would like to apologize for the incident and its impact on your use of our solution.
This post-mortem explains what happened during the incident.
Summary:
On Tuesday, 4 June 2019, messages from ongoing conversations were not saved for 2 hours and 5 minutes (2:38pm to 4:43pm UTC+2), because the service in charge of writing them to the database was stuck in an endless crash loop. The crash loop was triggered by an unexpectedly oversized conversation, and we had no proper alerting in place to detect it.
Impacts:
In the “Contact - responsiveness” report, users see incorrect data for the following metrics when selecting a date range that covers the incident: First message response time, Response time, and Contacts with no response.
Timeline:
The incident started at 2:38pm, but we only identified it at 4pm because of a lack of monitoring on the worker that archives messages.
After identifying the message archiving service as the cause of the problem, we restarted it at 4:26pm with more RAM allocated, but the problem continued.
The incident ended at 4:43pm (UTC+2), after we killed the process and cleaned up the corrupted events.
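To illustrate why restarting with more RAM did not help while removing the problematic events did, here is a minimal sketch of a queue consumer that sets oversized events aside instead of failing on them. It is not our actual Archive service code: the `drain` function, the `MAX_EVENT_BYTES` limit, and the in-memory queue are all hypothetical and only show the general idea.

```python
from collections import deque

# Hypothetical size limit, chosen only for illustration.
MAX_EVENT_BYTES = 1_000_000


def drain(queue, write, max_bytes=MAX_EVENT_BYTES):
    """Process events in order, setting oversized ones aside.

    Without the size check, a worker that crashes on an oversized event
    would find the same event at the head of the queue after every
    restart, which produces a crash loop.
    """
    quarantined = []
    while queue:
        event = queue.popleft()
        if len(event) > max_bytes:
            # Keep the event for later inspection instead of crashing on it.
            quarantined.append(event)
            continue
        write(event)
    return quarantined
```

With a guard of this kind, one oversized event would delay only itself rather than block every message queued behind it.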
Why can’t we recover lost messages?
Today, the Archive service does not write messages to the chat history of closed conversations.
To make this possible in the future, we need to consider how we could re-inject the messages stored in Ejabberd into our different databases. This is a significant task that will take time but would subsequently help to better secure data backup in the event of an incident.
Other actions have been identified to prevent this issue from happening in the future:
Done - Add alerting on the fill level of the archive queue to detect a similar problem more quickly
To do - Analyze the possibility of retrieving messages stored within Ejabberd and reinjecting them back into our databases
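As an illustration of the queue alerting mentioned above, the following sketch raises an alert when the queue depth stays above a threshold for a sustained period. It is an assumption about how such a check could look, not a description of our monitoring setup: the `QueueDepthAlert` class, its threshold, and its duration are hypothetical.

```python
import time


class QueueDepthAlert:
    """Fire when the queue depth stays above a threshold for a sustained period."""

    def __init__(self, threshold, sustain_seconds):
        # Both values are hypothetical and would need tuning against real traffic.
        self.threshold = threshold
        self.sustain_seconds = sustain_seconds
        self._above_since = None

    def observe(self, depth, now=None):
        """Record one depth sample and return True if the alert should fire."""
        now = time.monotonic() if now is None else now
        if depth <= self.threshold:
            # The queue has drained, so reset the timer.
            self._above_since = None
            return False
        if self._above_since is None:
            self._above_since = now
        return now - self._above_since >= self.sustain_seconds
```

Requiring the depth to stay high for some time avoids alerting on short bursts of traffic, while a stalled worker, whose queue only grows, would still be reported within a few minutes rather than after more than an hour.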