On February 4, 2024, at 22h32 UTC, all myTuleap platforms were impacted by an outage of approximately 18 hours. Tuleap Cloud Premium was not affected by the incident. No data loss was observed. We detail below the timeline of events and how we plan to remediate in the long term.

Background

myTuleap is our shared SaaS platform hosted at OVHcloud. It is built on bare-metal servers for the compute part, clustered with Docker Swarm, while storage relies on a High Availability NAS. The incident occurred on the storage side.
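
Since the incident sits at the boundary between compute and storage, a basic storage probe helps frame what follows. Below is a minimal sketch, assuming the data store is an NFS volume mounted at the hypothetical path /mnt/datastore (not our actual layout); it distinguishes an unmounted volume from one that is mounted but unusable.

```python
#!/usr/bin/env python3
"""Minimal storage health probe: is the data store mounted and writable?

Assumes the NFS volume is mounted at /mnt/datastore, a hypothetical path
used for illustration only. Note that a *stale* NFS mount tends to hang
rather than fail fast, so a production probe should add its own timeout
(see the locking probe after the timeline for one way to do that).
"""
import os
import tempfile

DATASTORE = "/mnt/datastore"  # hypothetical mount point

def datastore_is_healthy(path: str = DATASTORE) -> bool:
    # An unmounted share leaves a plain (often empty) directory behind;
    # ismount() tells the two situations apart.
    if not os.path.ismount(path):
        return False
    # A short write round-trip catches a share that is mounted but unusable.
    try:
        with tempfile.NamedTemporaryFile(dir=path) as probe:
            probe.write(b"ok")
            probe.flush()
        return True
    except OSError:
        return False

if __name__ == "__main__":
    print("datastore healthy:", datastore_is_healthy())
```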

Timeline of events

  • 04 Feb 2024 22h32: Our monitoring detects the availability issue.
  • 05 Feb 2024 08h05: Human operator confirms that some myTuleap instances appear to be offline.
  • 05 Feb 2024 08h09: Human operator confirms that none of the myTuleap instances are accessible; investigation begins.
  • 05 Feb 2024 08h22: Communication is sent to customers.
  • 05 Feb 2024 08h28: Initial log analysis seems to show an issue with our data store. The source of the issue is unclear; all data appears to be accessible on the filesystem. In order to protect the integrity of customer data and ease the troubleshooting process, the decision is made to shut down all services.
  • 05 Feb 2024 08h39: The shutdown of all services is started.
  • 05 Feb 2024 08h43: Investigation shows that one of the nodes of the cluster (node 02) ran out of memory after the beginning of the event, due to the heavy rescheduling of services in the cluster. Some processes were killed by the Linux OOM killer (see the log-scan sketch after this timeline).
  • 05 Feb 2024 08h53: Human operator confirms the data seems to be coherent: a canary instance internal to Enalean was restarted in a different environment with data extracted from the data store.
  • 05 Feb 2024 08h53: The node affected by the out-of-memory issue is relieved of its responsibilities in the cluster and rebooted, to ensure the server will be in a working state when the services are restarted.
  • 05 Feb 2024 09h09: The services of our internal canary instance are restarted on the cluster to see how they behave and to allow some live troubleshooting.
  • 05 Feb 2024 09h17: The data store volume is not automatically mounted as expected on the rebooted server, and manual attempts to mount it fail with a mount.nfs: Connection timed out error.
  • 05 Feb 2024 09h33: An unexpected behavior is detected on the DB service of the canary instance: read/write accesses work as expected, but some syscalls like fcntl fail after a long delay, preventing the DB from starting (see the locking probe after this timeline).
  • 05 Feb 2024 09h58: The same issue is observed multiple times, and it turns out to be impossible to mount the volume even on servers where it is currently mounted and accessible.
  • 05 Feb 2024 10h07: Since the data store is hosted on the OVHcloud NAS HA product, a support ticket is created with OVHcloud.
  • 05 Feb 2024 12h20: A recovery plan to rebuild the whole infrastructure from scratch is considered.
  • 05 Feb 2024 13h10: Still no information from OVHcloud; the recovery plan is approved.
  • 05 Feb 2024 14h19: Compute and storage resources have been provisioned for the new infrastructure.
  • 05 Feb 2024 14h40: Data are being restored from our backups. Setup of the new cluster is underway.
  • 05 Feb 2024 14h57: Data restoration process is complete.
  • 05 Feb 2024 15h11: Canary instance has been restarted and is confirmed to work as expected.
  • 05 Feb 2024 15h16: All myTuleap instances are restarted on the new infrastructure.
  • 05 Feb 2024 15h28: Production traffic is moved to the new infrastructure.
  • 05 Feb 2024 15h41: An issue preventing the myTuleap instances from being accessible again is identified; a kernel setting is tweaked.
  • 05 Feb 2024 15h43: All myTuleap instances appear to be accessible to our monitoring probes.
  • 05 Feb 2024 16h06: An issue affecting mail notifications for ~2/3 of the myTuleap instances is identified and fixed.
  • 05 Feb 2024 16h10: Additional log analysis confirms that all instances are accessible again.
  • 05 Feb 2024 16h40: Communication is sent to customers to inform them access has been restored.
  • 06 Feb 2024: Cleanup process of the old infrastructure is started in order to reduce the exposure of customer data.
  • 07 Feb 2024 13h07: OVHcloud acknowledges the ticket and requests more information to troubleshoot the issue, more than 48 hours after the creation of the ticket. Too late.
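
For readers who want to reproduce the out-of-memory diagnosis from the 08h43 entry, here is a minimal sketch that scans the kernel ring buffer for OOM-killer activity. It assumes dmesg is runnable and readable by the current user (on many distributions that requires root or relaxing kernel.dmesg_restrict), and the exact message wording varies between kernel versions.

```python
#!/usr/bin/env python3
"""Scan the kernel ring buffer for OOM-killer activity."""
import subprocess

def oom_kill_events() -> list[str]:
    out = subprocess.run(["dmesg"], capture_output=True, text=True, check=True)
    # Typical victim lines look like "Out of memory: Killed process 1234 (mysqld)".
    return [
        line
        for line in out.stdout.splitlines()
        if "oom-killer" in line.lower() or "out of memory" in line.lower()
    ]

if __name__ == "__main__":
    for event in oom_kill_events():
        print(event)
```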
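The 09h33 observation, reads and writes succeeding while fcntl hangs, is characteristic of NFS, where POSIX advisory locks are serviced by the separate NLM/lockd side channel rather than the regular I/O path. Below is a minimal probe sketch, with an illustrative path and timeout (not taken from our setup), that wraps a blocking fcntl lock in an alarm so a hung lock manager shows up as a timeout instead of a stuck process.

```python
#!/usr/bin/env python3
"""Probe POSIX advisory locking on a file, with a timeout.

On NFS, fcntl() locks go through the NLM/lockd machinery, so they can hang
even while plain read/write calls on the same mount succeed. The path and
timeout below are illustrative assumptions.
"""
import fcntl
import signal

LOCK_TIMEOUT_SECONDS = 10  # illustrative; tune to your monitoring budget

def _on_alarm(signum, frame):
    raise TimeoutError("fcntl lock did not complete in time")

def can_lock(path: str) -> bool:
    """Return True if an exclusive fcntl lock can be taken within the timeout."""
    with open(path, "a") as fh:
        signal.signal(signal.SIGALRM, _on_alarm)
        signal.alarm(LOCK_TIMEOUT_SECONDS)
        try:
            fcntl.lockf(fh, fcntl.LOCK_EX)  # blocking lock, like a DB taking its lock file
            fcntl.lockf(fh, fcntl.LOCK_UN)
            return True
        except (OSError, TimeoutError):
            # EINTR/ENOLCK from a broken lock manager, or our own timeout.
            return False
        finally:
            signal.alarm(0)

if __name__ == "__main__":
    # Hypothetical probe file on the affected volume.
    print("locking works:", can_lock("/mnt/datastore/.lock-probe"))
```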

We are deeply sorry for the inconvenience our customers faced, and we are working toward a more robust approach (see below). On the positive side, the recovery mechanism worked as intended and no data was lost. We managed to rebuild the infrastructure, data included, in less than 3 hours.

Next step

Since the end of 2023, we have been working on a major architecture change for myTuleap hosting. We will move to the OVHcloud Public Cloud offering for more independence between the services and, from what we have observed so far, better reliability. The move is planned for Q1 2024; we will reach out to impacted customers as soon as we have a precise timeline.


About

Manuel Vacelet

As CTO of Enalean and Tuleap product owner within Enalean, Manuel’s mission is to keep Enalean’s tech team happy. This means providing our developers and other tech staffers with a challenging and rewarding work environment guided by a single philosophy: to be accountable to our customers in everything we do. Manuel spends much of his workday…you guessed it…reviewing code! He also dabbles in systems administration, continuous integration, and other software-engineering tasks.