Incident 2023-12-04: Data leak and loss in some free tier databases

#What happened?

0.07% of databases under management were incorrectly configured with an empty backup identifier, which caused a data leak. The conservative fix we applied to the leak led to the possible loss of the most recent data in those databases.

A change that made it possible for an empty backup identifier to be used was made to the system on November 20th. On December 1st, internal procedures led to those backups being used to recreate the databases. This was noticed and reported on December 4th at 8:10 AM UTC, and fixed on December 4th at 9:17 AM UTC.

#What was the root cause?

Databases on Turso's free tier may scale to zero after one hour of inactivity. They are scaled back to one automatically upon receiving a network request. Usually this is completely invisible to the users, except for added latency on the initial request.

However, in rare situations, our cloud provider (fly.io) is unable to restore the process because the host lacks available resources. In those cases, we destroy the machine and recreate it from the S3 backup. Each database has a separate backup identifier, which tells us which snapshot to restore.

From time to time, we migrate databases to a newer version. A bug in that process caused some databases to be configured with an empty backup identifier. In effect, the databases in that small set were now sharing the same backup storage location: instead of pointing to s3://bucket/backup_id/, they were pointing to s3://bucket//.

When some of those databases failed to scale back to one, they were recreated from the shared location behind the empty backup ID. This caused both the data loss and the data leak mentioned above.
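The path collapse described above can be illustrated with a minimal sketch. The helper name `backup_prefix`, the bucket name, and the example identifiers are hypothetical and only mirror the s3://bucket/backup_id/ layout mentioned in the text; this is not the actual Turso code.

```python
def backup_prefix(bucket: str, backup_id: str) -> str:
    """Build the S3 prefix for a database's backups (hypothetical helper)."""
    return f"s3://{bucket}/{backup_id}/"


# Two distinct databases with proper identifiers get distinct prefixes.
good_a = backup_prefix("bucket", "db-a-id")
good_b = backup_prefix("bucket", "db-b-id")

# With an empty identifier, both databases resolve to the same prefix,
# so a restore for one can pick up a snapshot written by the other.
bad_a = backup_prefix("bucket", "")
bad_b = backup_prefix("bucket", "")
```

Here `good_a` and `good_b` differ, while `bad_a` and `bad_b` are both `s3://bucket//`, which is the shared location the affected databases ended up using.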

We fixed the issue by re-running the migration with the correct parameters and recreating the affected databases with their December 1st backup from the correct backup ID. Since we considered any data written after December 1st to be shared between those databases, any writes done after that point were discarded.

#How do we know which databases were affected?

We were able to determine all databases pointing to an invalid and shared backup ID by querying metadata for databases with the faulty backup ID. We could also scan our object store buckets and verify that no other databases were affected.
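A check of this kind could look roughly like the sketch below. The metadata shape (a mapping from database name to backup ID) and the function name `find_affected` are assumptions made for illustration. Flagging IDs shared by more than one database goes slightly beyond the query described above, which only looked for the faulty ID.

```python
from collections import defaultdict


def find_affected(databases: dict[str, str]) -> set[str]:
    """Return names of databases whose backup ID is empty or shared.

    `databases` maps database name -> backup ID (hypothetical metadata shape).
    """
    by_backup_id = defaultdict(list)
    for name, backup_id in databases.items():
        # Treat whitespace-only identifiers the same as empty ones.
        by_backup_id[backup_id.strip()].append(name)

    affected = set()
    for backup_id, names in by_backup_id.items():
        # An empty ID is always invalid; a non-empty ID used by more
        # than one database means those databases share backups.
        if backup_id == "" or len(names) > 1:
            affected.update(names)
    return affected


# Made-up example data, not real database names or identifiers.
metadata = {"db-a": "id-a", "db-b": "", "db-c": "", "db-d": "id-d"}
affected = find_affected(metadata)
```

With the example data, `affected` contains `db-b` and `db-c`, the two databases with an empty backup ID.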

#Lessons learned and Remediation

While we know the root cause quite well by now, there are still a few things we must do to ensure this never happens again. Some of these things are internal processes and others are improving our current mechanisms. In both of these cases, we take this very seriously and have diverted a majority of our engineering efforts to prevent any future data loss and data leaks. Preventing these is the #0 goal for Turso.

While we have not fully reviewed every piece of code related to the incident, we have identified a few big-ticket items that we have already fixed or plan to fix ASAP. As part of that, we expect the following changes:

- Additional internal checks in both our control plane and data plane to ensure backups are for the correct database. This also means improving the data isolation in our backups, similar to how we have data isolation for our running databases.
- Improve our ability to check for faulty configuration and to self-heal or notify a team member that there is an issue so we can fix it. Essentially, have a narrower band of allowed configurations and be stricter by default.
- Improve our deployment methods to prevent migrations from breaking backup IDs.
- Add better mechanisms for being notified of security incidents, either via Twitter or via email at security@turso.tech. Please reach out if you notice anything!
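As one illustration of the stricter configuration checks and backup ownership checks described above, a guard run before any restore might look like the following sketch. Everything here is hypothetical: the allowed ID format, the `BackupConfigError` type, and the idea that each backup records an owning database ID are assumptions, not a description of what Turso has implemented.

```python
import re

# Assumed format: backup IDs are non-empty and use a restricted alphabet.
_BACKUP_ID_RE = re.compile(r"[A-Za-z0-9_-]{1,128}")


class BackupConfigError(ValueError):
    """Raised when a backup configuration must not be used for a restore."""


def validate_backup_id(backup_id: str) -> str:
    """Reject empty or malformed backup IDs instead of silently using them."""
    if not isinstance(backup_id, str) or not _BACKUP_ID_RE.fullmatch(backup_id):
        raise BackupConfigError(f"invalid backup ID: {backup_id!r}")
    return backup_id


def check_restore(database_id: str, backup_id: str, backup_owner: str) -> None:
    """Refuse a restore unless the backup is recorded as owned by this database."""
    validate_backup_id(backup_id)
    if backup_owner != database_id:
        raise BackupConfigError(
            f"backup {backup_id!r} belongs to {backup_owner!r}, not {database_id!r}"
        )
```

Failing closed in this way trades availability for safety: a database with a bad configuration stays down and triggers a notification, rather than being recreated from a location it does not own.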

#Summary

We are embarrassed by this incident and the pain that it has caused our customers and our team. We have implemented, and will continue to implement, improved processes to prevent this in the future, but we need to be more rigorous going forward about how we handle data and the practices we use to prevent these issues. This will have my full attention and priority going into the new year as we plan to provide better features for data isolation and multi-tenancy. Thanks to Schlez for notifying us and allowing us to get to a solution quickly. We hope we can regain some of the trust from the community going forward.