P1 - Some bots not able to handle conversations
Incident Report for iAdvize (HA)
Postmortem

1 - What happened?

Over the past few days, we've encountered several instabilities in the processing of conversations by iAdvize bots.

Instead of running the expected scenario, the bots failed to process visitors' messages. As a result, they displayed error messages or did not respond at all, which severely degraded the user experience on your websites.

These instabilities occurred on:

  • August 11th: 9:57 to 11:37 CEST
  • August 15th: 4:45 to 7:54 CEST

2 - What caused the outage?

We are currently working on a major revamp of the technical core of iAdvize bots. The aim of this redesign is to speed up the execution of bot scenarios and to make them easier to maintain, especially during future technical updates.

This work includes the implementation of a new service dedicated to receiving new bot conversations. These conversations are then distributed to a second service, which executes the scenario defined in the iAdvize administration. The instabilities occurred on this new service and were due to the parsing of bot messages: messages in an unknown format were pushed to the system, which delayed the processing of new bot conversations. We identified this on 2 occasions:

  • On August 11th, the delay was small, and only a third of the bots managed by iAdvize were affected.
  • On August 15th, despite the patches applied after the previous instability, we experienced a new problem. The accumulated delay was enough to interrupt conversation processing for all bots.

3 - What was the fix?

  • On August 11th, we mitigated the problem by manually identifying and restarting the instances of the bot conversation reception service that were causing problems.

  • On August 15th, our new probes detected a new delay. We again restarted the faulty instances, but this did not work as expected: a new problem with bot message parsing was discovered. We had to urgently develop a hotfix to unblock a message type that was not recognized by the system.
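
To illustrate the kind of delay detection mentioned above, here is a minimal sketch of a probe that measures how long the oldest unprocessed conversation has been waiting. It is not iAdvize's actual probe: the `PendingConversation` type, the function names, and the 30-second and 120-second thresholds are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class PendingConversation:
    """Hypothetical record of a conversation awaiting processing."""
    conversation_id: str
    received_at: float  # Unix timestamp, in seconds


def oldest_pending_age(pending: Iterable[PendingConversation], now: float) -> float:
    """Return the age in seconds of the oldest pending conversation (0 if none)."""
    timestamps = [c.received_at for c in pending]
    if not timestamps:
        return 0.0
    return max(0.0, now - min(timestamps))


def probe_status(
    pending: Iterable[PendingConversation],
    now: float,
    warn_after: float = 30.0,    # assumed threshold, for illustration only
    alert_after: float = 120.0,  # assumed threshold, for illustration only
) -> str:
    """Classify the processing delay as "ok", "warn", or "alert"."""
    age = oldest_pending_age(pending, now)
    if age >= alert_after:
        return "alert"
    if age >= warn_after:
        return "warn"
    return "ok"
```

Measuring the age of the oldest pending item, rather than the number of pending items, flags a stalled service even when traffic is low.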

4 - How will iAdvize prevent this issue in the future?

  • (In progress) (Tech) Technical audit of the new bot service in order to identify any new potential failure points
  • (Done) (Tech) Improve our probes to detect potential delays in the processing of new conversations by bots
  • (Done) (Tech) Improve bot service reliability by adding safeguards to prevent bots from blocking the processing of new conversations in the case of unsupported message types
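
As an illustration of the last safeguard, the sketch below shows one common way to keep an unsupported message from blocking the messages behind it: set it aside in a separate "dead letter" list and carry on. This is a hypothetical example, not iAdvize's implementation; the JSON format, the `SUPPORTED_TYPES` set, and all function names are assumptions.

```python
import json
from collections import deque
from typing import Callable, Deque, List

# Hypothetical set of message types the parser understands.
SUPPORTED_TYPES = {"text", "quick_reply"}


class UnsupportedMessage(Exception):
    """Raised when a raw message cannot be parsed or has an unknown type."""


def parse_message(raw: str) -> dict:
    """Parse a raw JSON message, rejecting anything not explicitly supported."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise UnsupportedMessage(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict) or data.get("type") not in SUPPORTED_TYPES:
        raise UnsupportedMessage(f"unsupported message: {raw!r}")
    return data


def drain(
    queue: Deque[str],
    handle: Callable[[dict], None],
    dead_letters: List[str],
) -> int:
    """Process every queued message; return how many were handled.

    Unsupported messages are moved to dead_letters for later inspection
    instead of stopping the loop, so later messages are never held up.
    """
    processed = 0
    while queue:
        raw = queue.popleft()
        try:
            message = parse_message(raw)
        except UnsupportedMessage:
            dead_letters.append(raw)
            continue
        handle(message)
        processed += 1
    return processed
```

Keeping the rejected messages rather than discarding them makes it possible to inspect them afterwards and add support for a new message type if needed.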
Posted Aug 17, 2023 - 18:03 CEST

Resolved
After a period of monitoring, we confirm that the incident has been resolved.
Thank you for your patience and understanding.
Posted Aug 16, 2023 - 09:04 CEST
Monitoring
The situation is now back to normal. Our technical team continues to monitor our infrastructure.
Posted Aug 15, 2023 - 08:31 CEST
Investigating
We are currently investigating an issue with our bot service.

Most of the bots are not able to handle conversations. Our technical team is working on solving this problem.
Posted Aug 15, 2023 - 07:16 CEST
This incident affected: Bot (Bot service (except IA features)).