Postmortem of myTuleap shutdown - 26.05.2021 • Tuleap Blog

On May 26, 2021, at around 5:00 p.m. (Paris time), all myTuleap platforms were impacted by an outage of approximately 4 hours. The incident occurred while we were migrating myTuleap data to a SAN disk. Our security teams handled the incident, which was resolved around 8:30 p.m. (Paris time) the same day. No data loss or other serious consequences were observed. The platforms are fully functional again as of today.

NB: myTuleap is our Cloud offer. In other words, it provides easy access to our Tuleap Enterprise Edition in SaaS mode. Learn more about myTuleap

Background

For several months now, we have been slowly approaching the maximum disk size limit on myTuleap. In April, we tested Scaleway's SAN disks, which turned out to perform better than local disks. After testing them in pre-production, we decided to migrate the myTuleap data to a SAN disk. On May 25, we saw that the disk had already reached 97% of its capacity, so we decided to speed up the migration.
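As an illustration of the kind of check that flags a disk nearing its limit, here is a minimal Python sketch. It is not the tooling used on myTuleap: the function names, the path and the 90% threshold are assumptions chosen for the example. Only the 97% figure comes from the incident.

```python
import shutil


def usage_percent(total: int, used: int) -> float:
    """Return used space as a percentage of total capacity."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * used / total


def is_over_threshold(total: int, used: int, threshold: float = 90.0) -> bool:
    """True when usage is at or above the (hypothetical) alert threshold."""
    return usage_percent(total, used) >= threshold


def check_path(path: str, threshold: float = 90.0) -> bool:
    """Check the filesystem that contains `path` using shutil.disk_usage."""
    usage = shutil.disk_usage(path)
    return is_over_threshold(usage.total, usage.used, threshold)


if __name__ == "__main__":
    # "." is a placeholder; a real check would point at the data volume.
    print("Over threshold:", check_path("."))
```

With a threshold of 90%, a disk at 97% of its capacity, as observed on May 25, would be flagged well before it fills up.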

Timeline – 26.05.2021

Morning:
- Disk mount test and live ("on the fly") migration test
- Test failed (NFS had to be restarted, so shutting down the service was unavoidable anyway)
- Announcement of a possible full migration in the evening
3:50pm: After agreeing to run the migration at 7 p.m., we rolled back the test "on the fly"
- This rollback overloaded NFS, which disrupted the service on some platforms
- Since several services were already down, we decided to take advantage of the situation and carry out the migration right away
4:15pm: We cut off all services and started data synchronization
6pm: End of synchronization (which took much longer than expected) and deactivation of NFS and DRBD
- Reboot of the ds-001 server to verify that DRBD no longer restarts (DRBD is no longer necessary with the SAN)
- The server did not come back up; we opened a ticket with Scaleway and ran tests
7:25pm: Confirmation that the server is dead (one more)
- Remounting the SAN disk on the backup server ds-002 and disabling DRBD (without reboot test …)
- Remounting NFS on the myTuleap nodes

8pm: Relaunch of myTuleap: OK
8:30pm: Monitoring: OK
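The "approximately 4 hours" figure can be cross-checked against the timeline with a small Python sketch. The choice of bounds is an assumption made for this example: it counts from the full service cut-off (4:15pm) to the final OK at 8:30pm, whereas the first disruptions began at 3:50pm on some platforms. The function name is hypothetical.

```python
from datetime import datetime, timedelta


def elapsed(start: str, end: str, day: str = "2021-05-26") -> timedelta:
    """Return the time elapsed between two HH:MM timestamps on the same day."""
    fmt = "%Y-%m-%d %H:%M"
    t0 = datetime.strptime(f"{day} {start}", fmt)
    t1 = datetime.strptime(f"{day} {end}", fmt)
    if t1 < t0:
        raise ValueError("end must not be earlier than start")
    return t1 - t0


if __name__ == "__main__":
    # Full cut-off to final OK (assumed bounds of the outage).
    print("Full outage:", elapsed("16:15", "20:30"))
    # First partial disruption to final OK.
    print("Including partial disruption:", elapsed("15:50", "20:30"))
```

Under these assumptions the full outage lasted 4 hours 15 minutes, or 4 hours 40 minutes if the partial disruption after the rollback is included.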

Next steps

The service is now 100% operational again after 4 hours of interruption, with no data loss. This is the first unplanned service interruption since the launch of the new myTuleap offer in spring 2017.

Data security is enhanced by the use of the SAN High Availability offering, which is replicated across two data centers. For our part, we are closely monitoring the health of the servers and have ordered servers in advance to deal with the epidemic of fatal reboots.

In the end, paradoxically, the unreliability of the hardware in recent months has allowed us to be ready for a major failure.