P1. Bots are no longer operational (they no longer respond)
Incident Report for iAdvize (HA)
Postmortem

Incident

We had a lag issue on our Bots backend services that prevented Bots managed by iAdvize from functioning on your websites. Conversations handled exclusively by humans were still functional. However, if a bot intervened in the engagement flow used by visitors, conversations stopped at the first stage of the bot scenario.

This lag issue occurred following the release of a version containing the first building blocks of a feature that will soon be available. This release successfully passed all our validation protocols. However, the increase in load that followed the release generated a significant lag in the incoming conversation ingestion service. As a consequence, these incoming conversations exceeded the maximum execution time and were discarded from processing.
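To illustrate the failure mode described above, here is a minimal sketch of how lag in a sequential consumer can push queued events past a maximum execution time. It is not our actual ingestion code: the function name `drain`, the `max_age` parameter, and the simulated clock are assumptions made for the example.

```python
from collections import deque


def drain(queue, now, per_event_cost, max_age):
    """Process (event_id, enqueued_at) pairs one at a time on a simulated clock.

    Hypothetical model: an event whose waiting time already exceeds max_age
    when it reaches the front of the queue is discarded instead of processed.
    Returns (processed_ids, discarded_ids).
    """
    processed, discarded = [], []
    clock = now
    while queue:
        event_id, enqueued_at = queue.popleft()
        if clock - enqueued_at > max_age:
            # The event waited too long behind slower work: drop it.
            discarded.append(event_id)
            continue
        clock += per_event_cost  # simulated time spent handling this event
        processed.append(event_id)
    return processed, discarded


# Five events arrive at the same moment; each takes 10 s to handle.
backlog = deque((i, 0.0) for i in range(5))
processed, discarded = drain(backlog, now=0.0, per_event_cost=10.0, max_age=25.0)
```

In this model, the events at the back of the queue are dropped even though nothing is wrong with them individually; the processing cost of earlier events alone is enough to push them past the limit.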

This issue happened twice on November 29th:

  • 16:45 to 17:22 CET
  • 17:27 to 17:37 CET

Resolution

To mitigate the issue and restore the Bots services, we performed the following actions:

  • Manual cleanup of the event overflow in the incoming conversation ingestion service
  • Rollback of the release that introduced the lag

Actions for the future

  • (Done) Parallelize the event consumers so that they can react faster in case of lag on bots
  • (Done) Put a limit on the event publisher to prevent possible lag on bots
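
The two actions above can be sketched as follows. This is an illustrative example only, not our production code: it assumes a token-bucket limiter on the publisher side and a thread pool on the consumer side, and the names `TokenBucket`, `try_acquire`, and `consume_in_parallel` are hypothetical.

```python
import time
from concurrent.futures import ThreadPoolExecutor


class TokenBucket:
    """Hypothetical publisher-side limiter: at most `rate` events per second
    on average, with bursts of up to `capacity` events."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.tokens = float(capacity)
        self.last = clock()

    def try_acquire(self):
        """Return True if one event may be published now, False otherwise."""
        now = self.clock()
        # Refill tokens for the time elapsed since the last call, up to capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


def consume_in_parallel(events, handler, workers=4):
    """Hypothetical consumer side: handle events on several worker threads
    so that one slow event does not hold up the rest of the backlog.
    Results are returned in the same order as the input events."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(handler, events))
```

Limiting the publisher bounds how fast the backlog can grow, while parallel consumers increase how fast it can shrink. The two measures are complementary.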
Posted Dec 01, 2023 - 16:52 CET

Resolved
This incident has been resolved.
All conversations stuck during the incident have been manually closed.
Thank you for your patience.
Posted Nov 29, 2023 - 20:53 CET
Update
Some conversations that started during the incident are still in progress, but the bot is no longer responding in them. We are going to close these conversations manually.
Posted Nov 29, 2023 - 17:57 CET
Monitoring
We have seen a return to normality in the last 5 minutes following our latest actions.

We are continuing to monitor the situation.
Posted Nov 29, 2023 - 17:43 CET
Investigating
We are seeing new disruptions appear. Bots are taking a long time to respond or are no longer responding in some cases. We are actively working to resolve the problem.
Posted Nov 29, 2023 - 17:31 CET
Monitoring
The service has restarted and we are seeing a return to normal.
Posted Nov 29, 2023 - 17:25 CET
Investigating
We have noticed that the bots are no longer responding.
The technical team is working to resolve the problem.
We're going to restart the service to restore it.
Posted Nov 29, 2023 - 17:19 CET
This incident affected: Bot (Bot service (except IA features)).