Instabilities on the platform
Incident Report for iAdvize (HA)
Postmortem

Incident

Between September 2 and 4, we experienced several periods of slowdown on our technical infrastructure that severely degraded the user experience of the administration and the conversation panel.

We are aware of the inconvenience this incident may have caused to your business.
We would like to apologize and thank you for your patience and understanding.

The first occurrence of the problem was on September 2 at about 9:04 am (UTC+2).
For several minutes, the following effects could be observed intermittently:

  • Difficulty for managers to connect to the iAdvize administration
  • Disruption of navigation in the administration (blank pages)
  • Delay in updating statistical reports
  • Intermittent failure to register changes in agents' availability status in their discussion desks, which may have left the displayed presence out of sync with the agents' actual presence
  • Intermittent failure to register the closing of conversations, which then remained on the agents' discussion desks and may have caused some reactivity indicators to report longer times

These slowdown episodes, each lasting a few minutes, recurred several times until the night of September 4.

Resolution

Our technical team carried out several successive actions to mitigate and correct this incident as quickly as possible.
On September 2 at 9:25 am (UTC+2), we increased the number of servers needed for the proper functioning of the iAdvize administration and conversation panel. We also adjusted our platform auto-scaling tool to increase our capacity to absorb the incoming load.
These actions mitigated the problem, although some temporary slowdowns remained.

Nevertheless, a new incident occurred on September 3 at 12:01 pm (UTC+2).
We noticed that we were losing several servers in a random and unplanned way, which made the administration and conversation panel unstable again.
Despite manual additions to compensate for these lost servers, our hosting provider was not able to provide and maintain enough servers to keep the platform stable.
To correct this incident permanently, we then scheduled a maintenance operation to migrate part of our technical infrastructure to a new, more robust data center.
This maintenance took place during the night of September 3 to 4 and ended at 3:02 am (UTC+2).

Reasons

This incident is the consequence of a succession of two events:

  • Our hosting provider had difficulties in guaranteeing a continuous supply of the type of server used by iAdvize. While we were expecting a number of servers in line with the actual load on the iAdvize platform, we were only receiving 60-80% of the expected servers.
  • Our platform auto-scaling tool was configured to track the actual load as closely as possible. As a result, unplanned variations in server provisioning by our hosting provider had a direct impact on the stability of the platform. Once we left more headroom in our configuration, these variations were felt less by iAdvize end users.
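
To illustrate how these two events combine, here is a minimal Python sketch. All numbers and function names are hypothetical and chosen only for illustration; they are not the actual iAdvize configuration or tooling. It shows how a provider that delivers only 60-80% of the requested servers leaves a capacity shortfall when the auto-scaler requests exactly what the load needs, and how requesting some headroom narrows that gap.

```python
def servers_requested(needed: int, headroom_pct: int) -> int:
    """Servers the auto-scaler asks for: the load's need plus a headroom
    percentage, rounded up. Integer arithmetic avoids float rounding."""
    return (needed * (100 + headroom_pct) + 99) // 100


def servers_received(requested: int, fulfillment_pct: int) -> int:
    """Servers actually delivered when the provider fulfills only
    fulfillment_pct percent of the request, rounded down."""
    return requested * fulfillment_pct // 100


def shortfall(needed: int, headroom_pct: int, fulfillment_pct: int) -> int:
    """Servers missing relative to what the load needs (never negative)."""
    requested = servers_requested(needed, headroom_pct)
    received = servers_received(requested, fulfillment_pct)
    return max(0, needed - received)


if __name__ == "__main__":
    needed = 100  # hypothetical number of servers required by the load
    for headroom_pct in (0, 25):
        for fulfillment_pct in (60, 80):
            missing = shortfall(needed, headroom_pct, fulfillment_pct)
            print(f"headroom={headroom_pct}% fulfillment={fulfillment_pct}% "
                  f"missing={missing}")
```

With these assumed figures, a configuration with no headroom is short by 20 to 40 servers, while 25% headroom absorbs an 80% fulfillment rate entirely and reduces the gap at 60%.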

Actions

  • Ongoing discussions with our hosting provider to understand why it was unable to provide us with the number of servers expected by iAdvize
  • Ongoing work on how to better split the iAdvize solution technically, in order to limit the functional impact of a server shortage
  • Ongoing review of possible improvements to the iAdvize technical team's incident handling process, in order to minimize the duration of incidents
  • Implementation of probes to better detect abnormal variations in the supply of servers necessary for the proper functioning of iAdvize
  • Implementation of an automatic system for switching our technical infrastructure to more efficient servers in response to an unexplained drop in the supply of servers by our hosting provider
  • Implementation of a more flexible configuration on our platform auto-scaling tool
  • Implementation of a system to detect and correct desynchronization in the agents' presence status
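
As an illustration of the kind of probe mentioned above, the following Python sketch flags an abnormal drop in server supply. It is only a sketch under stated assumptions: the `SupplySample` type, the 90% threshold and the three-sample window are hypothetical and do not describe the probes actually deployed by iAdvize.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class SupplySample:
    expected: int  # servers requested by the auto-scaler (assumed metric)
    running: int   # servers actually in service (assumed metric)


def supply_alert(samples: Sequence[SupplySample],
                 min_ratio_pct: int = 90,
                 consecutive: int = 3) -> bool:
    """Return True when each of the last `consecutive` samples shows fewer
    running servers than min_ratio_pct percent of the expected count.

    Requiring several consecutive low samples avoids alerting on a single
    transient dip, for example while a server is being replaced.
    """
    if len(samples) < consecutive:
        return False
    recent = samples[-consecutive:]
    return all(s.running * 100 < s.expected * min_ratio_pct for s in recent)


if __name__ == "__main__":
    history = [
        SupplySample(expected=10, running=10),
        SupplySample(expected=10, running=8),
        SupplySample(expected=10, running=7),
        SupplySample(expected=10, running=6),
    ]
    print(supply_alert(history))
```

A check like this could feed an alert or trigger the automatic switch to other servers; the right threshold and window depend on how quickly servers are normally replaced.
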
Posted Sep 08, 2021 - 10:37 CEST

Resolved
After a long period of monitoring, we are closing this incident.
A postmortem will be available immediately after this last update.
We thank you again for your patience.
Posted Sep 08, 2021 - 10:37 CEST
Update
We are continuing to monitor the platform following our intervention at 3 am (UTC+2).
Posted Sep 04, 2021 - 09:48 CEST
Monitoring
The maintenance that was scheduled to take place in 4 hours has been carried out early in order to restore service. The solution is working again as it should. We are of course monitoring the platform to confirm that everything keeps working properly.
Posted Sep 04, 2021 - 03:17 CEST
Investigating
The admin and desk applications are currently unavailable. We are investigating the issue.
Posted Sep 04, 2021 - 02:54 CEST
Update
We are planning a maintenance operation tomorrow morning at 7 am (UTC+2), which could cause a 5 to 10 minute production interruption. This maintenance aims to resolve the instability problems encountered over the last few days.

MAINTENANCE PERIOD
- UTC+2 (Paris, France): September 4, 7:00 AM - 7:30 AM
- US/Pacific: September 3, 10:00 PM - 10:30 PM
- US/Eastern: September 4, 01:00 AM - 01:30 AM
- Asia/China: September 4, 01:00 PM - 01:30 PM

We thank you in advance for your patience and understanding.
Posted Sep 03, 2021 - 16:56 CEST
Identified
Despite our actions over the last few days, we noticed new instabilities this morning related to a loss of server instances on our infrastructure.
These instance losses recur and cause instabilities lasting a few minutes each time. We are currently in contact with our hosting provider to investigate this problem.

In parallel, we will intervene in the next few hours to modify our server configuration and ensure stability.

We will post an update once this action is completed.
Posted Sep 03, 2021 - 12:42 CEST
Update
Although we see a clear improvement in the stability of the platform, we will continue to actively monitor it until tomorrow. We will post a new update tomorrow morning.
Posted Sep 02, 2021 - 18:02 CEST
Monitoring
We have not seen any instabilities since our last message at 11:30 am (UTC+2).
Some users may still have an unsynchronized presence status. We are working to solve this desynchronization problem.

We continue to monitor the activity of the platform.
Posted Sep 02, 2021 - 15:16 CEST
Update
We have seen an improvement over the last few minutes and are continuing our actions to ensure that production is stabilized.
Posted Sep 02, 2021 - 11:30 CEST
Update
We are again seeing a high level of errors.
We are continuing our actions to solve the problem.
Posted Sep 02, 2021 - 11:05 CEST
Identified
As a result of our actions, we no longer see any errors in production. Some users may have an unsynchronized presence status (for example appearing in the supervision when they are no longer connected). We are working to solve this problem of desynchronization.
Posted Sep 02, 2021 - 10:49 CEST
Update
We are seeing 404 and 500 errors on the administration and on the various services the platform depends on (presence, conversation, statistics...).
We are continuing our actions to mitigate the ongoing instabilities.
Posted Sep 02, 2021 - 10:17 CEST
Investigating
We have noticed instabilities on all the services of the iAdvize platform. These instabilities are related to a resource management problem in our infrastructure that is causing congestion in some essential services.
Our technical team is mobilized to mitigate and correct these instabilities as soon as possible. We will update this incident regularly to keep you informed.
Posted Sep 02, 2021 - 09:54 CEST
This incident affected: Administration (Login, Engagement settings, Distribution settings, Channel settings, Users management, Statistics, Sending email), Conversation Panel (Login, Conversation views, Message exchange, Conversation closure, Conversation transfer, Mirroring / Cobrowsing, Canned answer), Onsite Channels (Chat, Call, Video, Mobile SDK), Third Party Channels (X, Facebook, Facebook Messenger, WhatsApp, SMS), Visitor’s interface (Engagement Notification, iAdvize Messenger), APIs (Rest API, GraphQL API, Webhook), Bot (Bot service (except IA features)), and Mobile App.