Incident
Between September 2 and 4, our technical infrastructure experienced several periods of slowdown that severely degraded the user experience of the administration and conversation panel.
We are aware of the inconvenience this incident may have caused your business.
We apologize and thank you for your patience and understanding.
The problem first occurred on September 2 at approximately 9:04 am (UTC+2).
For several minutes, the following impacts could be observed intermittently:
- Difficulty for managers to connect to the iAdvize administration
- Disruption of navigation in the administration (blank pages)
- Delay in updating statistical reports
- Intermittent failure to register changes in agents' availability status on their discussion desks, which may have desynchronized their displayed presence from their actual presence
- Intermittent failure to register the closing of conversations, which remained open on agents' discussion desks and may have inflated some reactivity indicators
These episodes of slowdown, each lasting a few minutes, recurred several times until the night of September 4.
Resolution
Our technical team carried out several successive actions to mitigate and correct this incident as soon as possible.
Starting on September 2 at 9:25 am (UTC+2), we increased the number of servers required for the proper functioning of the iAdvize administration and conversation panel. We also adjusted our platform's auto-scaling tool to increase our capacity to absorb the incoming load.
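As an illustration of the kind of adjustment involved (a minimal sketch only; the function, parameter names, and values below are hypothetical and not our actual production configuration), increasing absorption capacity amounts to scaling the requested fleet by a headroom factor rather than tracking the measured load exactly:

```python
# Hypothetical sketch: compute the number of servers to request from the
# hosting provider. All names and values are illustrative, not iAdvize's
# actual configuration.

import math

def servers_to_request(current_load_rps: float,
                       capacity_per_server_rps: float,
                       headroom: float = 1.3,
                       min_servers: int = 2) -> int:
    """Return the server count to request, padded by a headroom factor.

    headroom=1.0 tracks the measured load exactly; values above 1.0
    keep spare capacity to absorb load spikes and provisioning shortfalls.
    """
    needed = current_load_rps / capacity_per_server_rps
    return max(min_servers, math.ceil(needed * headroom))

# Example: 900 req/s at 100 req/s per server needs exactly 9 servers;
# a 30% headroom requests 12 instead.
print(servers_to_request(900, 100))  # -> 12
```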
These actions mitigated the problem, although some temporary slowdowns remained.
Nevertheless, a new incident occurred on September 3 at 12:01 pm (UTC+2).
We observed that we were losing servers in a random and unplanned way, making the administration and conversation panel unstable again.
Despite manually adding servers to compensate for these losses, our hosting provider was unable to provide and maintain enough servers to ensure optimal stability of the platform.
To correct this incident definitively, we planned a maintenance window to migrate part of our technical infrastructure to a new, more robust data center.
This maintenance took place during the night of September 3 to 4 and ended at 3:02 am (UTC+2).
Reasons
This incident was the consequence of two successive events:
- Our hosting provider had difficulty guaranteeing a continuous supply of the type of server iAdvize uses. While we expected a number of servers in line with the actual load on the iAdvize platform, we received only 60-80% of the servers we requested.
- Our platform's auto-scaling tool was configured to track the actual load as closely as possible. Unplanned variations in server provisioning by our hosting provider therefore had a direct impact on the stability of the platform. Had we left more headroom in our configuration, these variations would have been far less noticeable to iAdvize end users (see the sketch after this list).
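A minimal sketch of the arithmetic behind this second point, in Python (the 60% fulfillment rate reflects the low end of the range observed during the incident; the headroom values and function name are purely illustrative assumptions): when a provider delivers only a fraction of the servers requested, the configuration must pad its requests enough that even the worst-case delivery still covers the real load.

```python
# Illustrative arithmetic only; the 60% fulfillment rate comes from the
# incident, the headroom values are hypothetical.

def delivered_capacity(servers_needed: int, headroom: float,
                       fulfillment: float) -> float:
    """Fraction of the real load covered after the provider's shortfall."""
    requested = servers_needed * headroom
    delivered = requested * fulfillment
    return delivered / servers_needed

for headroom in (1.0, 1.25, 1.7):
    worst = delivered_capacity(10, headroom, fulfillment=0.6)
    print(f"headroom {headroom}: worst-case coverage {worst:.0%}")

# headroom 1.0 : worst-case coverage 60%  (unstable, what we experienced)
# headroom 1.25: worst-case coverage 75%  (still short of the load)
# headroom 1.7 : worst-case coverage 102% (the load remains covered)
```

This is the trade-off behind the "more flexible configuration" action listed below: more headroom costs more during quiet periods but absorbs provisioning shortfalls.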
Actions
- Ongoing discussions with our hosting provider to understand why it was unable to provide us with the number of servers expected by iAdvize
- Ongoing evaluation of a better technical partitioning of the iAdvize solution in order to limit the functional impact when we experience a server shortage
- Ongoing review of our incident-handling process by the iAdvize technical team in order to minimize the duration of incidents
- Implementation of probes to better detect abnormal variations in the supply of servers necessary for the proper functioning of iAdvize (a sketch of such a probe follows this list)
- Implementation of an automatic system for switching our technical infrastructure to more efficient servers in response to an unexplained drop in the supply of servers by our hosting provider
- Implementation of a more flexible configuration of our platform's auto-scaling tool
- Implementation of a system to detect and correct desynchronization in agents' presence status
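As a sketch of what the supply probe mentioned above could look like (the threshold, function names, and alerting hook are hypothetical assumptions, not a real iAdvize API):

```python
# Hypothetical probe: compare the number of servers actually provisioned
# against the number requested and raise an alert when the fulfillment
# ratio drops below a threshold. All names and values are illustrative.

def alert(message: str) -> None:
    # Placeholder: in practice this would page the on-call team or
    # trigger the automatic infrastructure switch described above.
    print(message)

def check_server_supply(requested: int, provisioned: int,
                        min_ratio: float = 0.9) -> bool:
    """Return True if the provider's server supply looks healthy."""
    if requested == 0:
        return True
    ratio = provisioned / requested
    if ratio < min_ratio:
        alert(f"Server supply anomaly: {provisioned}/{requested} provisioned "
              f"({ratio:.0%}, threshold {min_ratio:.0%})")
        return False
    return True

# Example: during the incident the fulfillment ratio sat around 0.6-0.8,
# well under a 0.9 threshold.
check_server_supply(requested=10, provisioned=7)
```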