P1 - Disturbances on the Conversations Panel
Incident Report for iAdvize (HA)
Postmortem

Incident

On November 20th (17:19 to 17:52 CET) and November 21st (9:30 to 9:43 CET), we experienced two incidents that degraded the user experience on the Conversation panel and in Administration.

During these periods, conversation processing by agents was disrupted by white screens or error messages. In addition, managers monitoring statistics reports also encountered error messages.

These disturbances resulted from changes made to the platform infrastructure as part of our regular, scheduled system maintenance.

Although initially assessed as not posing a major risk and validated in a pre-production environment, these planned actions had an unexpected impact on platform stability. Access to services critical to the proper operation of the platform was temporarily cut.

Resolution

To resolve this issue, our technical team had to manually change some settings on these critical services and then restart them.

Getting the required underlying services back to their nominal state allowed the Conversation panel and the Administration application to return to their own nominal state.

Actions for the future

  • (Done) Review our internal processes to ensure that customer communication on our status page is more responsive
  • (Done) Review our maintenance process to better identify and scope potential negative impacts on the iAdvize platform and adapt our execution plans accordingly
  • (Done) Improve probes and alerting on failing services so that we can react faster

Focus on the Black Friday period

Looking ahead to the next critical period, we're confident that we'll handle incoming traffic on the iAdvize platform without disruption. 

This incident was the consequence of manual actions whose impact had not been adequately anticipated.

It was not a problem related to traffic management or platform scaling.

In the meantime, we have proactively prepared the iAdvize platform and reviewed our teams' readiness for this high-traffic period.

Our modus operandi is based on three pillars which have already been identified and implemented:

  • Freeze period: no new code deployed to production

  • Stress test: testing the platform’s scalability under heavy, peak-traffic load

  • Team mobilization: assigning the right people to monitor the main components 24/7

Be assured that our team and platform are ready for the end of the year.

Posted Nov 25, 2024 - 14:44 CET

Resolved
After a period of monitoring, we confirm that this incident is resolved.
A post-mortem will be published soon.

We are sorry for the inconvenience caused.
Posted Nov 21, 2024 - 14:09 CET
Monitoring
A fix has been implemented and we are monitoring the results.
Posted Nov 21, 2024 - 10:01 CET
Investigating
We are experiencing some disturbances on the platform.
You may notice blank pages on the Conversation Panel or in Administration.
You may also have difficulty closing conversations.

The technical team is investigating the issue.

We will keep you informed.
Posted Nov 21, 2024 - 09:52 CET
This incident affected: Administration (Login, Engagement settings, Distribution settings, Channel settings, Users management, Statistics, Sending email), Conversation Panel (Login, Conversation views, Message exchange, Conversation closure, Conversation transfer, Mirroring / Cobrowsing, Canned answer), Onsite Channels (Chat, Call, Video, Mobile SDK), Visitor’s interface (Engagement Notification, iAdvize Messenger), APIs (Rest API, GraphQL API, Webhook), Bot (Copilot Shopper, Copilot Agent, Bot service (except AI features)), Third Party Channels (Facebook, Facebook Messenger, WhatsApp, SMS), and Mobile App (iOS/Android).