We experienced several service disruptions between 11:08am and 3:05pm (UTC+1) on the 27th of June.
They mainly affected the handling of post-targeted visitors: a contact proposition (Chat / Call / ...) was frequently not displayed to the visitor. The engagement process could fail while trying to find the best agent for a given visitor.
After several attempts that brought us progressively closer to the root cause, we discovered that the misbehaviour was caused by a previously unknown hard limit (50,000 requests per second) on the database behind the service that handles agent presence. Even though agents were online, the presence service could fail to report their availability, which stopped the engagement process before visitors could be shown a contact proposition.
As a corrective action, we scaled up the presence database so that it could handle more queries. This brought the situation back to normal at 3:05pm (UTC+1).
Other actions have been identified to prevent this issue from happening again and to improve our ability to diagnose such problems:
1) Done - Analyse and diagnose misbehaviours, and discover root causes, more efficiently by centralizing key metrics for all iAdvize services and databases
2) To do - Reduce the dependency of the pre-routing process on presence
3) To do - Reduce the number of queries on the presence database
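
As an illustration of action 3, one common way to lower the query rate on a database is to put a short-lived cache in front of the lookup. The sketch below is only an assumption about how this could be done, not a description of the actual iAdvize implementation: the class `PresenceCache`, the `fetch` callback and the 2-second TTL are all hypothetical names and values chosen for the example.

```python
import time
from typing import Callable, Dict, Tuple


class PresenceCache:
    """Short-lived cache in front of a presence lookup (hypothetical sketch)."""

    def __init__(
        self,
        fetch: Callable[[str], bool],
        ttl_seconds: float = 2.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        # `fetch` stands in for the real query against the presence database.
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._clock = clock
        # agent_id -> (time the value was stored, cached availability)
        self._entries: Dict[str, Tuple[float, bool]] = {}

    def is_online(self, agent_id: str) -> bool:
        now = self._clock()
        cached = self._entries.get(agent_id)
        # Serve from the cache while the entry is younger than the TTL.
        if cached is not None and now - cached[0] < self._ttl:
            return cached[1]
        # Otherwise query the database once and remember the answer.
        online = self._fetch(agent_id)
        self._entries[agent_id] = (now, online)
        return online
```

With such a cache, repeated lookups for the same agent within the TTL result in a single database query. The trade-off is that availability can be stale for up to the TTL, so the value would have to be kept small enough for routing decisions to remain acceptable.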