Migrating from PostgreSQL to CockroachDB

14 Jun 2023

DevOps  Howto  Python 

I've recently moved a project over from PostgreSQL to CockroachDB. This was a lot harder than it sounds. If this were a project with a reasonable budget then I'd have either gone with RDS or just paid Supabase rather than going "argh, it's going to exceed the free plan", but no, this is a low-budget thing, so let's go with the complicated but cheaper path (assuming my time is free, which is kinda true when I'm having fun on a personal project).

CockroachDB does have a half-decent migration guide, but unfortunately there are several issues with it that put me off that route.

Instead, I'm going to be doing things with CSV export. As part of this, I'll be showing some links to the scripts I used to make this work. Please note, these are presented without any guarantees that they'll work for you. There's a number of hacks in them that were fine for my use cases, but might not be fine for you. They're great starting points, but you'll probably need to make changes to make them work for you.

Disclaimer over, here's the guide. First up, I'm assuming you've got a sensible method to set up all your tables in CockroachDB. I solved this by installing django-cockroachdb and doing a standard Django python manage.py migrate with the CockroachDB server URL plugged in (after doing some migration squashing to fix some cases where I'd changed primary keys), which got me the tables. If you don't have your database migrations written in a DB-neutral way, I suggest you follow the existing schema migration guide.
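
For the Django route the settings change is minimal. Here's a rough sketch of what I mean; the database name, credentials and host below are placeholders rather than anything real:

# settings.py -- point Django's default database at CockroachDB via the
# django-cockroachdb backend. The name, credentials and host are made up.
DATABASES = {
    "default": {
        "ENGINE": "django_cockroachdb",
        "NAME": "mydb",
        "USER": "username",
        "PASSWORD": "mysecretpassword",
        "HOST": "my-cluster.cockroachlabs.cloud",
        "PORT": "26257",
        "OPTIONS": {"sslmode": "verify-full"},
    },
}

The important bit is the ENGINE; the rest is the usual PostgreSQL-style connection settings.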

Now, we need a script to dump stuff out of PostgreSQL. The whole thing is available, but let's talk through a few notable parts of it. Remember to install psycopg2 if you're running this; the rest is just standard library stuff. It takes a single argument: the URL of the PostgreSQL server, e.g. postgres://username:mysecretpassword@localhost/mydb
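
The excerpts below assume connection boilerplate roughly along these lines (a sketch of the shape of the script, not a verbatim quote from it):

import json
import sys
from pathlib import Path

import psycopg2

# Connect with the URL passed on the command line and prepare a dump folder
# for the CSV files and deps.json that the snippets below write out.
conn = psycopg2.connect(sys.argv[1])
cur = conn.cursor()
dump_folder = Path("dump")
dump_folder.mkdir(exist_ok=True)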

cur.execute("SELECT tablename FROM pg_catalog.pg_tables WHERE schemaname = 'public';")
tables = [".".join(table) for table in cur.fetchall()]

That pg_catalog bit means we're able to get the set of tables in your DB out, rather than hardcoding them. Note the schemaname = 'public', which got the correct set of "my actual app tables" for me, but YMMV.

cur.execute(
    "SELECT conrelid::regclass AS src_table, confrelid::regclass AS dest_table FROM pg_constraint WHERE contype = 'f' AND connamespace = 'public'::regnamespace")
fkeys = cur.fetchall()
deps = dict([(table, []) for table in tables])
for src, dest in fkeys:
    deps[dest].append(src)
json.dump(deps, dump_folder.joinpath("deps.json").open("w"), indent=2)

Similarly to pg_catalog, pg_constraint contains the set of "constraints" on your DB. We're filtering down here to just the foreign keys (contype = 'f') as that gets us the critical "what order should we import data in" info.
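
To make that concrete: for a hypothetical pair of tables where myapp_order has a foreign key to myapp_customer, the resulting deps.json looks something like this (each table maps to the tables that reference it, i.e. the ones that need wiping before it):

{
  "myapp_customer": [
    "myapp_order"
  ],
  "myapp_order": []
}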

sql = f"COPY (SELECT * FROM {table}) TO STDOUT WITH CSV HEADER DELIMITER ';'"
with table_path.open("w") as file:
    cur.copy_expert(sql, file)

Copying the CSV data out interactively would be done with the \copy command in psql, but from the Python interface we need copy_expert, courtesy of a helpful Stack Overflow answer.
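
For reference, the interactive psql equivalent would be something along these lines (the table and output path here are hypothetical):

\copy (SELECT * FROM myapp_customer) TO 'dump/myapp_customer.csv' WITH CSV HEADER DELIMITER ';'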

You should now have a dump folder containing a series of CSV files (well, kinda SSV, as they're semicolon-delimited to reduce the amount of escaping needed) and a deps.json, ready to import into CockroachDB. Here's the full script for doing that; let's again have a look at a few critical parts.

wiped = []

# Keep sweeping the table list until everything has been truncated.
while len(wiped) < len(deps.keys()):
    for table, requires in deps.items():
        if table in wiped:
            continue
        for req in requires:
            if req not in wiped:
                break
        else:
            # Only reached if every table that references this one is already wiped.
            print(f"Wiping {table}")
            cur.execute(f"TRUNCATE {table} CASCADE;")
            wiped.append(table)

This loop wipes the tables (as this is a replacement import), going through the list of tables and only wiping those where we've already wiped everything that depends on them. Note this is pretty much the slowest and crudest way to do this, and it's at least O(N²). I'd care more about that, except it's very clear what it does, N (the number of tables) is probably quite small, and it's a one-shot operation, so this works well enough.

from http.server import HTTPServer, SimpleHTTPRequestHandler
import threading

def run():
    # Serve the working directory (which contains the dump folder) over HTTP
    # so the CockroachDB nodes can fetch the CSV files from us.
    httpd = HTTPServer(("", 9000), SimpleHTTPRequestHandler)
    httpd.serve_forever()

server_thread = threading.Thread(target=run)
server_thread.daemon = True
server_thread.start()

# truncation code from above...

for dump_file in dump_folder.glob("*.csv"):
    # The dump script names each CSV after its table, so recover the table
    # name from the filename and the column list from the CSV's header row.
    table_name = dump_file.with_suffix("").name
    with dump_file.open() as f:
        reader = csv.reader(f, delimiter=";")
        headers = next(reader)
    header = ",".join([f'"{h}"' for h in headers])
    sql = f"IMPORT INTO {table_name} ({header}) CSV DATA ('http://{fileserver}:9000/dump/{dump_file.name}') WITH delimiter = ';', skip='1';"
    cur.execute(sql)

We do two things here. First, we spin up a little HTTP server on a background thread to serve the dump folder, because IMPORT INTO pulls its files from a URL the CockroachDB nodes can reach rather than reading them off the client machine. Second, for each CSV file we work out the table name from the filename and the column list from the header row, then run an IMPORT INTO pointing at that file's URL, with skip='1' so the header row isn't imported as data.

Tada! You now have a fully imported DB. Note that both steps can be done repeatedly (e.g. for testing out the import prior to go-live) as the CockroachDB step wipes any existing data.
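
One extra sanity check that's worth doing on those test runs is comparing per-table row counts. This isn't part of the scripts above, just a sketch assuming you have both connection URLs to hand; CockroachDB speaks the PostgreSQL wire protocol, so psycopg2 works for both ends:

import sys

import psycopg2

# Hypothetical check script: compare row counts per table between the old and
# new databases. Pass the PostgreSQL URL first, the CockroachDB URL second.
pg_cur = psycopg2.connect(sys.argv[1]).cursor()
crdb_cur = psycopg2.connect(sys.argv[2]).cursor()

pg_cur.execute("SELECT tablename FROM pg_catalog.pg_tables WHERE schemaname = 'public';")
for (table,) in pg_cur.fetchall():
    pg_cur.execute(f"SELECT count(*) FROM {table}")
    crdb_cur.execute(f"SELECT count(*) FROM {table}")
    pg_count, crdb_count = pg_cur.fetchone()[0], crdb_cur.fetchone()[0]
    marker = "OK" if pg_count == crdb_count else "MISMATCH"
    print(f"{marker} {table}: postgres={pg_count} cockroach={crdb_count}")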
