P0 - Service disruption - Impossible to start new conversations & errors on the conversation panel
Incident Report for iAdvize (HA)
Postmortem

Incident

Following major maintenance on our Livechat database (DB) to upgrade it to a new version, scheduled on Feb. 22nd from 6:30 to 8:00 CET, we encountered CPU scalability problems as traffic on the new instance began to increase with the start of the day in Europe. All active connections to this DB slowed down until they timed out. At that point, the DB was frozen and unreachable. As this DB is central to the Livechat app, we faced a general interruption of conversation processing across all channels (chat, call, video, WhatsApp, Facebook, …) supported by the iAdvize platform.

Conversation processing was down on Feb. 22nd, between 9:25 and 10:50 CET.

Resolution

As soon as we became aware of this incident, we shut down the services that display contact notifications, to prevent visitors from trying to start conversations that the system could not handle. Afterwards, we had to shut down several services linked to this DB and manually kill backend processes in order to mitigate the problem and decrease CPU load. Once the CPU level was acceptable again, we ran a system check script to verify DB integrity and optimize its operation. Finally, we were able to restart all services one by one without risking a new CPU burst.

Actions for the future

  • (Done) Identifying and setting up a throttling system for future DB migrations. The aim is to allow a gradual increase of incoming traffic on new instances, in order to keep control over CPU and memory load.
  • (In progress) Improving our internal processes and tools to optimize DB crash resolution time.
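
The gradual traffic increase mentioned in the first action can be illustrated with a minimal sketch. This report does not describe the throttle actually put in place, so everything below is an assumption made for illustration: the linear ramp, the names `allowed_fraction` and `RampUpThrottle`, and the parameters `ramp_s` and `start_fraction` are all hypothetical.

```python
import random
import time


def allowed_fraction(elapsed_s, ramp_s, start_fraction=0.1):
    """Fraction of incoming traffic admitted after elapsed_s seconds.

    Hypothetical linear ramp: starts at start_fraction and reaches 1.0
    (all traffic admitted) once ramp_s seconds have passed.
    """
    if ramp_s <= 0 or elapsed_s >= ramp_s:
        return 1.0
    progress = max(elapsed_s, 0.0) / ramp_s
    return start_fraction + (1.0 - start_fraction) * progress


class RampUpThrottle:
    """Admits a growing share of requests to a freshly migrated instance."""

    def __init__(self, ramp_s, start_fraction=0.1,
                 clock=time.monotonic, rng=random.random):
        self.ramp_s = ramp_s
        self.start_fraction = start_fraction
        self.clock = clock    # injectable so the ramp can be tested
        self.rng = rng        # injectable source of values in [0, 1)
        self.started_at = clock()

    def admit(self):
        """Return True if the current request may reach the new instance."""
        elapsed = self.clock() - self.started_at
        fraction = allowed_fraction(elapsed, self.ramp_s, self.start_fraction)
        return self.rng() < fraction
```

With `RampUpThrottle(ramp_s=1800)`, for example, roughly 10% of requests would be admitted right after the switch and all of them after 30 minutes. Rejected requests would have to be queued or retried by the caller, which this sketch leaves out.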
Posted Feb 26, 2024 - 10:05 CET

Resolved
During the monitoring performed since the last update, no issue has been identified.

As stated earlier, this incident has been resolved since 10:50am CET. We are now closing this incident on the status page.
Posted Feb 22, 2024 - 14:48 CET
Update
The incident has been over since 10:50am CET. The conversation processing flow is now operational.

Thank you for your patience.
Posted Feb 22, 2024 - 11:35 CET
Monitoring
Following technical interventions, we are seeing significant improvements.
iAdvize services have been restarted gradually, while monitoring the results.

You should notice improvements on your side:
- On the visitor side: Notifications are visible on the website and conversations can be started
- For agents: New incoming conversations can arrive on the conversation panel
- For admins & managers: Inconsistencies may still appear in some reports; we are currently working on this.

We are monitoring the situation and will communicate again once it is fully back to normal.
Posted Feb 22, 2024 - 10:58 CET
Update
Our technical team is still actively working to mitigate this incident.
Several interventions have been performed, but the situation is still not back to normal.

We continue to notice impacts on several services:
- On the visitor side: It is impossible to start conversations, no notification is visible on the website
- For agents: Several kinds of impact, such as errors on the conversation panel when handling open conversations, and no new incoming conversations
- For admins & managers: Inconsistencies on some reports (production report does not show connected agents, conversation report does not show past conversations)
Posted Feb 22, 2024 - 10:23 CET
Update
We are still investigating and attempting to mitigate the issue. We will update again as we have more information.
Thank you for your patience.
Posted Feb 22, 2024 - 09:59 CET
Investigating
We are currently experiencing service disruptions, so you may notice issues on several services:
- It is impossible to start new conversations
- Errors on the conversation panel for open conversations
- Inconsistencies on some reports (production report)

Our technical team is currently working on this issue.
Posted Feb 22, 2024 - 09:42 CET
This incident affected: Onsite Channels (Chat, Call, Video, Mobile SDK), Third Party Channels (X, Facebook, Facebook Messenger, WhatsApp, SMS), Visitor’s interface (Engagement Notification, iAdvize Messenger), Conversation Panel (Conversation views, Message exchange, Conversation closure, Conversation transfer, Mirroring / Cobrowsing, Canned answer), and Administration (Statistics).