Service disruption - Slowdowns

Incident Report for iAdvize (HA)

Postmortem

We experienced several service disruptions between 11:08am and 3:05pm (UTC+1) on the 27th of June.

They mainly affected the handling of post-targeted visitors: a contact proposition (Chat / Call / ...) was frequently not displayed to the visitor. The engagement process could fail while trying to find the best agent for the given visitor.

After several attempts and actions that brought us closer to the root cause, we discovered that the misbehaviour was caused by a previously unknown hard limit (50,000 requests per second) of the database behind the service that handles agent presence. Even though agents were online, the presence service could fail to report their availability, which stopped the engagement process before a contact proposition could be shown to the visitor.

As a corrective action, we scaled up the presence database so that it could handle more queries. This brought the situation back to normal at 3:05pm (UTC+1).

Other actions have been identified to prevent this issue from happening again and to improve our diagnostic capability:

1) Done - Analyse and diagnose misbehaviours and discover root causes more efficiently by centralizing key metrics for every iAdvize service and database

2) To do - Reduce the dependency of the pre-routing process on the presence service

3) To do - Reduce the number of queries on the presence database
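
As an illustration of action 3, one common way to reduce the number of queries is to cache each agent's availability for a very short time. The sketch below is only an assumption about how this could be done, not a description of the iAdvize implementation: the class name PresenceCache, the fetch function and the 2-second lifetime are all hypothetical.

```python
import time
from typing import Callable, Dict, Tuple


class PresenceCache:
    """Keeps agent availability in memory for a short time to spare the database.

    Hypothetical sketch: `fetch` stands for whatever function actually
    queries the presence database; it is not a real iAdvize API.
    """

    def __init__(
        self,
        fetch: Callable[[str], bool],
        ttl_seconds: float = 2.0,  # assumed lifetime, to be tuned
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._clock = clock
        # agent_id -> (time the value was stored, availability)
        self._entries: Dict[str, Tuple[float, bool]] = {}

    def is_available(self, agent_id: str) -> bool:
        now = self._clock()
        entry = self._entries.get(agent_id)
        if entry is not None and now - entry[0] < self._ttl:
            # Fresh enough: answer from memory, no database query.
            return entry[1]
        # Missing or expired: query the database once and remember the answer.
        value = self._fetch(agent_id)
        self._entries[agent_id] = (now, value)
        return value
```

With such a cache, repeated lookups for the same agent within the lifetime trigger a single database query. The trade-off is that availability can be stale for up to that lifetime, so the value would need to stay small.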

Posted Jun 29, 2018 - 09:49 CEST

Resolved

The situation has been stable since 3pm (UTC+1); this incident is now closed.

More technical details about the cause of this incident and the actions we took to resolve it will be published on this page.

Posted Jun 27, 2018 - 18:04 CEST

Monitoring

The last technical intervention ended at 3pm. Since then, the situation has been stable and your agents can receive incoming conversations.
We are continuing to monitor the platform closely to verify that everything is fully stable and back to normal.

Posted Jun 27, 2018 - 16:43 CEST

Update

We have identified a possible cause of these slowdowns on one database. A technical intervention is currently in progress to try to solve this problem.
We will update you as soon as we have more information.

Posted Jun 27, 2018 - 14:58 CEST

Investigating

A new technical intervention was performed at 1:40pm (UTC+1). Since then, we have noticed improvements, but the situation is still not back to normal.
Our technical team is still working on solving this problem. Thank you for your patience.

Posted Jun 27, 2018 - 13:57 CEST

Identified

Despite our intervention, we are again experiencing slowdowns on our platform. As a consequence, the number of incoming contacts has decreased. Our technical team is investigating.

Posted Jun 27, 2018 - 13:08 CEST

Monitoring

The cause of the issue has been identified on our server in charge of message exchange. This service has been restarted and more resources have been allocated to it.
Since 12:23pm (UTC+1), the situation has been back to normal. Our technical team continues to monitor our platform.

Posted Jun 27, 2018 - 12:38 CEST

Investigating

Our technical team is investigating slowdowns on our platform. As a consequence, the number of incoming contacts has decreased.

Posted Jun 27, 2018 - 11:36 CEST