Modular Monolith: DB Schema Migration failure led to an outage.
Postmortems | Links:
LESSON LEARNED: Plan upfront any DB Schema Migration that includes Data Manipulation. Do not change existing Migration Scripts, since they are inherently static. Do not manipulate any data directly inside a schema migration script.
“Learning never exhausts the mind.”
Platform/Client
Context Situation
One of our main customer services is a modular monolith built with Ktor on the backend side, being backed by a PostgreSQL as a DB Engine, using FlywayDB for migrating and performin changes on the database schema.
One of these schema changes (due to constant evolution of our codebase) led to a breaking change and for instance and error, causing an outage on our Development Environment.
Issue Summary
The incident occurred due to a failure in the database schema migration process, where the schema change script included an attempt to create new foreign key constraints (FCK) on tables where the information was inconsistent. As a result, an ERROR was triggered due to the data in both tables being INCONSISTENT.
Resolution
- We run a backup of the database.
- We put our Backend in Maintenance mode.
- An SQL Script was run via console to fix the inconsistent data.
Impact
There was no impact on the final users or business, since the issue affected our Development environment (fortunately), but our system was down for 5 days preventing our teams to test new features and perform demos.
Lessons Learned
When an schema change is performed, these main points should be considered:
- Breaking changes that could affect clients (frontend).
- Complexity of the change that could affect real data.
In order to succeed, a plan should be placed upfront, which should include 2 parts:
- A Schema Migration Script.
- A Data Migration Script (when applies).
Last but not least, apply and keep in mind the BEST PRACTICES when it comes to schema migrations:
- Do not manipulate any data directly inside a schema migration script.
- Always roll forward and in case something fails, create a new migration to rollback.
- Do not change existing scripts, since they are inherently static.