After 5 years and many $$$ spent on keeping our data in an AWS RDS Aurora PostgreSQL cluster, we now want to bring the 12 TB of data back to an on-premises PostgreSQL instance, specifically a Crunchy Postgres cluster running in an on-premises Kubernetes cluster. The RDS cluster is running Aurora PostgreSQL 14.7, the on-premises target will (likely) be running PostgreSQL 16.1. The database is an analytics data warehouse fed by a nightly ETL pipeline, that is otherwise quiet (only reads, no writes) for the 21 hours following the pipeline run.
Certainly it would be possible to use pg_dump and pg_restore to bring the data from source to target piece by piece. Given the size of the database, it would be too failure-prone to try to save everything in a single dump. Moreover, it would take too long to finish a single huge dump in the 21 hours between pipeline runs. The disadvantage of this approach is that it would likely take days to finish given the amount of data, and in the meantime the data would have changed in the source. Some tables are partitioned by month, so for those we could dump and restore past months, but many tables are unpartitioned. In sum, using dump and restore would be a tedious and likely an error-prone approach.
I suppose RDS snapshots can only be used within RDS, not in an on-premises PostgreSQL instance, right? That would eliminate that option then.
Having read about someone using logical replication to do something similar with a MySQL database, I have started reading up about using PostgreSQL logical replication with Aurora, which uses the PostgreSQL logical replication feature. It looks like this could be an option for us to transfer the data to our target database.
If need be, we could set up the target to have the same PostgreSQL version as the source (14.7), and upgrade it to 16.1 after the data transfer, in case the version mismatch would be a problem.
Is our idea of using PostgreSQL logical replication to migrate data out of AWS RDS to an on-premises PostgreSQL instance feasible? Is it the "best" solution for our given use case? Has anyone done something similar and can give valuable advice? And lastly, if this approach is sound, how do we set up the logical replication in AWS RDS and in the on-premises PostgreSQL instance?