P1 - Service disruption on Bots
Incident Report for iAdvize (HA)
Postmortem

What happened?

A severe load issue on our Bots backend services prevented Bots from functioning on your websites. Conversations handled exclusively by humans still worked. However, if a bot intervened in the engagement flow used by visitors, the conversation could not take place.

This load issue was caused by an unscheduled self-cleaning script run by the Bots database engine. The cleaning ran on a table holding a large amount of data. As a consequence, critical queries needed by Bots could not complete in a reasonable time, which degraded the whole Bots system.

This issue occurred on October 26th, between 11:10 and 15:47 CEST.

Resolution

Once started, the self-cleaning script cannot be stopped and must run to completion, so we looked for alternative solutions.

To mitigate the issue and restore the Bots services, we performed the following actions:

  • Upgrade the Bots database engine to a higher-performance instance type so that critical Bots queries could run to completion. This action took several hours, which partly explains the duration of the incident.
  • Deploy a new version with patches to reduce the bots' dependence on the overloaded database.
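
The report does not describe what the patches changed. As a purely illustrative sketch, one common way to reduce a service's dependence on a slow database is to give each query a time budget and fall back to the last known good value when the budget is exceeded. Every name below (`fetch_with_fallback`, `query_fn`, `timeout_s`, the in-memory cache) is hypothetical and not taken from the iAdvize codebase.

```python
import concurrent.futures

# Hypothetical in-memory store of the last successful result per key.
_last_good = {}

# Small worker pool used to run queries off the calling thread.
_executor = concurrent.futures.ThreadPoolExecutor(max_workers=4)


def fetch_with_fallback(key, query_fn, timeout_s=0.5):
    """Run query_fn(key) with a time budget.

    Returns the fresh value when the query finishes in time. On timeout,
    returns the last good value for this key, or None if there is none.
    Exceptions raised by query_fn itself are propagated to the caller.
    """
    future = _executor.submit(query_fn, key)
    try:
        value = future.result(timeout=timeout_s)
    except concurrent.futures.TimeoutError:
        # The query is too slow: stop waiting and serve stale data instead.
        # Note: the worker thread keeps running the query in the background;
        # this sketch only stops the caller from blocking on it.
        return _last_good.get(key)
    _last_good[key] = value
    return value
```

With this shape, a caller keeps answering from slightly stale data while the database is overloaded, at the cost of having to decide how stale is acceptable for each kind of data.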

Actions for the future

  • (Done) The Bots database engine upgrade significantly reduced the sustained load. We now have more capacity to handle heavy loads.
  • (Done) We have identified and cleaned up the data table that triggered the self-cleaning.
  • (In progress) We are setting up new probes to detect heavily loaded databases and prevent self-cleaning script launches.
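
As a minimal sketch of what such a probe might check, assuming load is sampled periodically as a fraction between 0 and 1: the database is flagged only when several consecutive samples exceed a threshold, so that a single spike does not raise an alert. The names `check_database_load` and `ProbeResult`, the threshold, and the sample format are all assumptions made for illustration, not details of the actual probes.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class ProbeResult:
    overloaded: bool      # True when the load stayed above the threshold
    average_load: float   # Mean of all samples passed in


def check_database_load(
    samples: Sequence[float],
    threshold: float = 0.8,
    min_breaches: int = 3,
) -> ProbeResult:
    """Flag the database as overloaded when the last `min_breaches`
    samples all exceed `threshold` (hypothetical default values)."""
    if not samples:
        return ProbeResult(overloaded=False, average_load=0.0)
    recent = samples[-min_breaches:]
    # Require a full window of recent samples, all above the threshold.
    sustained = len(recent) == min_breaches and all(s > threshold for s in recent)
    return ProbeResult(
        overloaded=sustained,
        average_load=sum(samples) / len(samples),
    )
```

For example, `check_database_load([0.5, 0.9, 0.95, 0.85])` reports an overload because the last three samples all exceed 0.8, while `[0.9, 0.95, 0.5]` does not.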
Posted Oct 30, 2023 - 11:02 CET

Resolved
The latest intervention we announced was completed successfully.

Since the bot service was restored (3:47pm CEST), our monitoring shows that everything is back to normal. The incident is closed.
Posted Oct 26, 2023 - 20:26 CEST
Update
To stabilize the bot service, our technical team will apply a patch around 7:30pm CEST. During this time, you may experience some disturbances.
We will do our utmost to minimize any impact on production.
Thank you in advance for your understanding.
Posted Oct 26, 2023 - 17:42 CEST
Update
The incident is now resolved.

However, some stuck conversations from this period may still be displayed in the ongoing Production report.

Our technical team is doing its best to close them.
Posted Oct 26, 2023 - 16:56 CEST
Monitoring
Our latest actions have restored the bot.
The situation is now back to normal. Our technical team continues to monitor the service.
Thanks again for your patience during the resolution of the incident.
Posted Oct 26, 2023 - 16:04 CEST
Update
Personal canned answers have been restored.

Your agents can now use them normally.

Our technical team continues working on further actions to restore the bot service.

Thanks.
Posted Oct 26, 2023 - 15:41 CEST
Update
Our technical team is still working on technical interventions to solve the issue.

One action taken involved temporarily suspending our service dedicated to personal canned answers.

As a result, your agents may notice that some personal canned answers normally available are currently missing.
Posted Oct 26, 2023 - 15:07 CEST
Update
Our technical team is still working on several actions to fix the bot service.
Thanks again for your understanding and patience.
Posted Oct 26, 2023 - 14:22 CEST
Identified
The cause of the incident has been identified. We are actively working on a fix to correct this issue as soon as possible.
Thanks again for your patience.
Posted Oct 26, 2023 - 12:43 CEST
Update
We are still investigating and attempting to mitigate the issue on bots. We will post another update as soon as we have more information.
Thank you for your patience.
Posted Oct 26, 2023 - 12:07 CEST
Investigating
We are investigating a problem affecting bots.
Bots may not be able to reply to conversations.
Thanks for your understanding.
Posted Oct 26, 2023 - 11:48 CEST
This incident affected: Bot (Bot service (except IA features)).