Incident
On November 20th (17:19 to 17:52 CET) and November 21st (9:30 to 9:43 CET), we experienced two incidents that degraded the user experience on the Conversation panel and Administration.
During these periods, agents' conversation processing was disrupted by white screens or error messages. In addition, managers monitoring stats reports were also affected by error messages.
These disturbances resulted from changes made to the platform infrastructure as part of our regular, scheduled system maintenance.
Although initially assessed as not being a major risk and validated in a pre-production environment, these planned actions had an unexpected impact on platform stability. Access to services critical to the proper operation of the platform was temporarily cut.
Resolution
To resolve this issue, our technical team had to manually change some settings on these critical services and then restart them.
Restoring the required underlying services to their nominal state allowed the Conversation panel and the Administration application to return to normal operation.
Actions for the future
Focus on the Black Friday period
Looking ahead to the next critical period, we're confident that we'll handle incoming traffic on the iAdvize platform without disruption.
This incident was the consequence of manual actions whose impact was not adequately anticipated.
It was not related to traffic management or platform scaling.
In the meantime, we have been proactive in getting the iAdvize platform ready, and we have reviewed our teams' preparation for this high-traffic period.
Our approach is based on three pillars, which have already been identified and implemented:
Freeze period: no new code is deployed to production
Stress test: testing the platform's scalability under heavy peak traffic
Team mobilization: assigning the right people to monitor the main components 24/7
Rest assured that our team and platform are ready for the end of the year.