Handling schema migrations on a live database, at scale
Lessons from Production Postgres Migrations: Handling schema migrations on a live database, at scale
Last updated
©2024 by Gooey.AI / Dara.network Inc / support@gooey.ai
Lessons from Production Postgres Migrations: Handling schema migrations on a live database, at scale
Last updated
While working on adding support for workspaces in Gooey.AI (our noun for teams), we had to perform several database updates that would affect very large tables (4.5M+ rows in the runs table, 3.6M+ tables in the transactions table, 750K+ rows in the users table). We use a single PostgreSQL database instance for production use and Django's ORM to help us with migrations.
This presents a few challenges when making schema changes on these tables:
ALTER TABLE
will acquire a table-level ACCESS EXCLUSIVE
lock. This means that no reads or writes to the table will be processed until the schema change is completed. If this is too slow, live users will see incredibly slow load times and timeouts.
migrating data values can be even slower. it would be a disaster to use a computed value as a column's default when adding that column in a migration. you would end up locking the table for all other operations with one very long-running operation - because you would need to use the CPU (a limited resource) for each one of the millions of rows in your table.
production is different from local. it's easy to miss a bottleneck if you don't have that critical scale, as is often the case on a local setup.
indexes are not trivial to use correctly.
OFFSET
in SQL queries is terrible for performance
To deal with these challenges, we had to conduct our work in a non-straightforward manner - which is exactly why I'm writing this blog - so that you don't need to do the same exercise the hard way once again.
What do I mean by simple migrations? Migrations that do one thing. For example, if you're adding a new column, don't also give it a computed value as the default in the same migration. Adding a nullable column, or a column with a static default is fast. Computing a value for each of the millions of rows in your table is not. Whether that computation needs to be done in Python or in PostgreSQL, the same limitation applies.
Another anti-example is adding constraints after adding a column in the same migration. This one is particularly easy to shoot yourself in the foot with. Even worse is when each of these migrations affects different tables. The reason is that all operations in a single migration run within a single transaction. Now, database locks needed for each of these operations are held for the entire duration of the transaction. If you're doing 3 slow operations that acquire an ACCESS EXCLUSIVE
lock on 3 different tables, you slow down more queries, and for longer.
To work around this, Django also allows setting atomic = False
in the migration class to override this default behaviour. In some cases, that is good enough. In others, you might be better off splitting your migrations.
python manage.py runscript ...
over migrations.RunPython
For context, migrations.RunPython
lets you run arbitrary Python code and is commonly used for data migrations once the schema migrations have been made. This is a feature that should be used with care, especially when dealing with large amounts of existing data.
An easy pitfall with using this feature is writing some code that takes too long and making that part of a transaction with a schema change that holds on to an ACCESS EXCLUSIVE
lock. Locks are held for the entire duration of the transaction. If you iterate over all rows of a table, and perform something like instance.save(...)
on each iteration, that's going to take very long and your queries to a live database will be blocked for this entire duration. Of all the pitfalls I can mention in this post, this is probably the worst one.
That suggestion in and of itself does not solve your problem. Very often, there are cases when a schema change needs an accompanying data change for it to make sense. You still end up in an odd situation where the business logic has two places to look for the relevant data.
To take the example that we ran into: we had to move all the relevant billing information from the AppUser
model to the Workspace
model. The business logic needs to be able to find data in the relevant fields in Workspace
table to make use of the new schema. Our business logic would also need to add the relevant data to the Workspace
table for each new AppUser
.
At our scale of 750K+ users, doing this migration in a loop by iterating over each AppUser
would've been extremely slow and we end-up in that weird in-between state for a longer time. So we ended up doing this in three steps:
We add the new column to the database without migrating the historical data. Here we check a relevant row in the new Workspace
table for the billing info. If we do not find that data there, we fallback to looking into a row in the AppUser
table.
We run the data migration as a script after the schema change is completed. The data migration updates rows in batches (we chose a batch size of 10,000 rows). Each batch executes a single SQL UPDATE
query that involves a join with the AppUser
table. We do not use transactions, so we release the row-level lock acquired by the UPDATE
query once it is done and we move on to the next batch.
When this data migration is complete, we update the business logic to remove the fallback and rely only on the Workspace
model.
This lets us do the data migration slowly, at our convenience, and the app keeps running normally throughout.
If you have a staging environment, it could be a good idea to populate the relevant tables there to a scale comparable to your production database. We do not have a staging environment, so we instead did an admin-only release of this feature. Only us, the team members, could access the relevant pages. That gives us time to test at production scale without risking a bad experience for our users.
We did discover bottlenecks with some of our queries there, added the indexes that we needed, and all was good.
Django also provides helpful debugging tools here:
QuerySet().query
string for the raw SQL query that you can print and examine
QuerySet().explain()
to run a SQL EXPLAIN ...
to see what approach the database would take for a query (e.g. whether an index is being used or not)
There are cases when an index might not be used. For example in a SELECT
query that filters with some condition on multiple rows, individual indexes on each of those rows are not helpful and won't be used by the database engine. Instead, you need a composite index on those columns.
Similarly, if you use a SQL function in the query (like LOWER()
or UPPER()
), an index won't be used unless you have specifically added a functional index for LOWER(colname)
or UPPER(colname)
.
It is also a good idea to print qs.explain()
for your query sets during development - to check that an index is being used.
OFFSET
for pagination is terrible for performanceWhen you perform a SQL query with the LIMIT
-OFFSET
style - which seems like an intuitive implementation for pagination - all of the rows before the offset still need to be computed. You're not really saving much on database performance even with pagination. This is mentioned in small print in the PostgreSQL documentation on Limits and Offsets. I like this more detailed blog for an understanding of how OFFSET
works.
The other, much better alternative to pagination, is something called Cursor Pagination. It deserves a post of its own and the linked resource by Disqus devs is a good place to start.
The fundamental idea is this: with knowledge of the results on one page, we can construct a WHERE
clause that will give us the next page. The last result on one page will be sorted higher than the first result on the next page. We use this last result and the fields in the ordering criteria to build a query filter that will give us the next results.
Not surprisingly, there is a certain scale for a web application after which you need nuance and your own reasoning. Lots of common wisdom suddenly becomes bad practice - e.g. LIMIT
-OFFSET
pagination. At a smaller scale, caring about little things would in fact be akin to bike shedding - and really, just bad software engineering. With scale though, details start to matter more. An important part of good engineering at that point is to learn from the wisdom of your predecessors, pay attention to detail, and anticipate what might break. Shipping this new feature that changed so many parts through the app was an exercise for our team in just that.
Written by Kaustubh M Patil, Software Engineer, Gooey.AI