Slowdowns on our platform
Incident Report for iAdvize (HA)
Postmortem

Reason:

Due to performance problems on some of our application servers, new servers were instantiated and production services were moved to them.
The new application instances then all tried to start at the same time, causing disk congestion while application images were being extracted. Application startups therefore timed out and were rescheduled for restart by the application scheduler, which ended up destabilizing the scheduler as a whole.
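
As an illustration of the failure mode described above, the sketch below caps how many application starts run in parallel, so that image extractions cannot all hit the disk at once. It is a minimal example under stated assumptions, not the scheduler actually used on the platform: `start_all`, `start_one`, and `max_parallel` are hypothetical names introduced here.

```python
from concurrent.futures import ThreadPoolExecutor


def start_all(apps, start_one, max_parallel=2):
    """Start every app, running at most `max_parallel` starts at a time.

    `start_one` is a hypothetical callable that extracts the image and
    boots a single application; it is an assumption of this sketch.
    """
    # The pool size is the concurrency cap: extra starts wait in the queue
    # instead of competing for disk I/O.
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        # map() returns results in the same order as `apps`.
        return list(pool.map(start_one, apps))
```

With a cap in place, a mass restart takes longer overall, but each individual start is less likely to exceed its timeout because of disk contention.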

Actions:

  • Migrate application servers to a more stable copy-on-write file system
  • Improve the scheduler's retry handling
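
One common way to keep retries from piling up, sketched here as an assumption rather than as the change that was actually made, is exponential backoff with random jitter: each retry waits longer than the previous one, and the randomness spreads restarts out in time. The function name `backoff_delays` and its parameters are hypothetical.

```python
import random


def backoff_delays(max_retries, base=1.0, cap=60.0, rng=random.random):
    """Yield one randomized delay (in seconds) per retry attempt.

    The upper bound doubles with each attempt (base, 2*base, 4*base, ...)
    and never exceeds `cap`. `rng` must return a float in [0, 1].
    """
    for attempt in range(max_retries):
        ceiling = min(cap, base * 2 ** attempt)
        # "Full jitter": pick a delay anywhere between 0 and the ceiling,
        # so that applications that failed together do not retry together.
        yield rng() * ceiling
```

A scheduler would sleep for each yielded delay before rescheduling the application, and give up once the generator is exhausted.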
Posted 9 months ago. Aug 21, 2018 - 10:38 CEST

Resolved
After monitoring the activity of our platform since our last intervention, we have not noticed any new disruptions. The situation is back to normal.

Thanks a lot for your patience.
Posted 9 months ago. Aug 14, 2018 - 10:19 CEST
Update
The situation is now back to normal. Our technical team continues to monitor our infrastructure.
Posted 9 months ago. Aug 14, 2018 - 01:02 CEST
Update
A patch has been deployed; we are now monitoring the results.
Posted 9 months ago. Aug 14, 2018 - 00:46 CEST
Update
We still have an issue with the conversation panel.
A 404 error is displayed instead of the expected content.
We're on it.
Posted 9 months ago. Aug 13, 2018 - 23:56 CEST
Monitoring
The situation has been back to normal since 10 PM. Our technical team continues to monitor our infrastructure.
Posted 9 months ago. Aug 13, 2018 - 23:41 CEST
Investigating
Our technical team is investigating slowdowns on our platform between 8:04 PM and 9:41 PM. Access to the administration and discussion panels, message exchange, and reports were impacted (data not updated, 404 errors, ...).
Our technical team has performed an emergency reboot in order to restore access to the platform.
Posted 9 months ago. Aug 13, 2018 - 21:45 CEST