Following major maintenance on our Livechat database (DB) to upgrade it to a new version, scheduled on Feb. 22nd from 6:30 to 8:00 CET, we encountered CPU scalability problems as traffic on the new instance began to increase with the start of the day in Europe. All active connections to this DB slowed down until they timed out. At that point, the DB was frozen and unreachable. As this DB is central to the Livechat app, we faced a general interruption in conversation processing across all channels (chat, call, video, WhatsApp, Facebook, …) supported by the iAdvize platform.
Conversation processing was down on Feb. 22nd, from 9:25 to 10:50 CET.
As soon as we became aware of the incident, we shut down the services that display contact notifications, to prevent visitors from trying to start conversations that the system could not handle. Afterwards, we had to shut down several services linked to this DB and manually kill backend processes to mitigate the problem and decrease CPU load. Once CPU load was back to an acceptable level, we ran a system check script to verify DB integrity and optimize its operation. Finally, we were able to restart all services one by one without risking a new CPU spike.