For more live status updates, follow our Discord server.
Release 0.30.0 command failures
Version 0.30.0 of our authentication service introduced a breaking change that caused service-related commands against projects on older versions to fail. Version 0.30.1 was released as a fix.
Oct 24, 2023 - Resolved
Gateway and Deployer performance degradation
On the 26th of September at 3:31 PM UTC we discovered a performance degradation of our services. Project creation was slow, to the point of timing out, and deployments were taking longer than usual. We analyzed our metrics and verified that compute resources were not overused, which led us to believe the degradation was I/O bound. It took some time before we considered excessive disk utilization as the cause, since we did not regard 80% usage as critically high. Ultimately, this was indeed the cause, and freeing up disk space resolved the degradation in both project creation and deployments. Going forward, we will improve our monitoring and alerts, specifically by setting an alert for when disk usage approaches 80%. Much of the disk usage comes from each project container having its own cargo build cache. We aim to change this with our new builder service, which will run on a separate machine dedicated to building deployments.
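As an illustration of the kind of alert described above, here is a minimal sketch of a disk usage check in Python. It is not our actual monitoring setup. The 80% threshold comes from this incident, while the function names, the 5-point margin used to interpret "approaches", and the checked path are assumptions made for the example.

```python
import shutil

# Threshold taken from the incident description above.
ALERT_THRESHOLD = 0.80
# Assumption: "approaches" means within 5 percentage points of the threshold.
ALERT_MARGIN = 0.05


def disk_usage_fraction(path="/"):
    """Return the used fraction (0.0 to 1.0) of the filesystem holding `path`."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total


def should_alert(fraction, threshold=ALERT_THRESHOLD, margin=ALERT_MARGIN):
    """Return True once usage is within `margin` of `threshold` or above it."""
    return fraction >= threshold - margin


if __name__ == "__main__":
    fraction = disk_usage_fraction("/")
    if should_alert(fraction):
        print(f"Disk usage at {fraction:.0%}, approaching {ALERT_THRESHOLD:.0%}")
```

A check like this would typically run on a schedule and feed a notification channel. Alerting slightly below the threshold leaves time to free space before I/O performance degrades.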
Sep 26, 2023 - Resolved
Controller VM status check failures
We experienced platform downtime on 24 September 2023 at 7:30 PM UTC because the Shuttle Controller VM was unreachable. The VM hosts the majority of the Shuttle core services. Rebooting the VM fixed the failing status check, and the services came back online around 8:50 PM UTC. We were unable to determine why the instance became unreachable: we had not set up system log collection for the VM, and the reboot erased the system logs, which we had hoped would show why the VM was unhealthy and unable to respond to requests. We will improve how we handle this kind of downtime by adding notifications for status check failures, adding system log collection, and prioritizing improvements to the platform architecture to make it highly available.
Sep 24, 2023 - Resolved
Logger Release Incident
Following the release on September 18, 2023, at 7:53 AM GMT, the new Logger service failed to start up. This was due to a subnet overlap, which meant the new Logger service was not able to resolve its database instance via DNS. Consequently, project containers were not able to start. By 9:00 AM, the issue was resolved and project creation was restored. This incident could have been prevented if our staging and production environments had been identical, which we have been striving for, but in this case they were not. We are working on setting up a new environment for canary deployments of new infrastructure to prevent this issue in the future.
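A subnet overlap of the kind described above can be caught before a release with a simple pre-flight check. The sketch below uses Python's standard `ipaddress` module. The subnet names and CIDR ranges are hypothetical and do not reflect our actual network layout.

```python
import ipaddress


def find_overlaps(subnets):
    """Return pairs of subnet names whose CIDR ranges overlap.

    `subnets` maps a label to a CIDR string, e.g. {"logger": "10.0.0.0/16"}.
    """
    networks = {name: ipaddress.ip_network(cidr) for name, cidr in subnets.items()}
    names = list(networks)
    overlaps = []
    # Compare every unordered pair exactly once.
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            if networks[first].overlaps(networks[second]):
                overlaps.append((first, second))
    return overlaps


if __name__ == "__main__":
    # Hypothetical ranges, for illustration only.
    planned = {
        "logger": "10.0.0.0/16",
        "gateway": "10.0.1.0/24",
        "database": "10.1.0.0/24",
    }
    for first, second in find_overlaps(planned):
        print(f"Subnet overlap: {first} and {second}")
```

Running such a check against the planned network configuration of each environment would flag conflicting ranges before they reach production.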
Sep 18, 2023 - Resolved
Shuttle Gateway Service Disruption
Following the release on July 31, 2023, at 5:08 PM GMT+2, the Shuttle gateway encountered an issue that prevented it from fulfilling user project requests. The issue was caused by a faulty database migration. To address it, we promptly reverted the release and restored the database from a recent backup, ensuring no data loss occurred. By 6:20 PM, all services were fully operational again. Moving forward, we are committed to improving our release rollback procedures so that we can restore services more quickly after a problematic release. Additionally, we will strengthen our test suite and refine our staging environment to prevent such issues from reaching production.
Jul 31, 2023 - Resolved
MongoDB Data Loss Incident
On Saturday, July 29th, we received a report indicating that a user's MongoDB database had been cleared upon re-deployment. Subsequent investigation by our engineers revealed that this issue arose from the shared MongoDB database container lacking an associated persistent volume. To rectify this situation, a persistent volume was attached to the MongoDB database container, and it was restarted without incurring any additional loss of data. Moving forward, our goal is to expedite the resolution of similar incidents and establish a structured procedure for efficiently restoring the MongoDB database in the event of any future failures.
Jul 29, 2023 - Resolved
We experienced provisioner downtime, which caused the Shuttle platform to have difficulty deploying new projects and serving requests for idled deployments. As next steps, we are working on improving detection and resolution times and on removing the provisioner dependency from the cold start of idled deployments.
Jun 27, 2023 - Resolved
An unsecured endpoint within the auth service enabled unauthorized access to users' API keys
In the development of the authentication service, we established a /login endpoint to facilitate a session-based authentication flow specifically designed for console usage. However, this endpoint was unintentionally left unprotected, creating a potential risk that a user's API key could be accessed by others. The issue was successfully identified and rectified in the 0.19.0 release, thereby mitigating the vulnerability.
Jun 15, 2023 - Resolved