Monday, June 1, 2020

Replicating Databases in RDS

One Sunday in 2018 I sat for a whole day in Art Caffe on the ground floor of Yaya Centre in Nairobi, on the phone to Norman at AWS Support in Cape Town, discussing DMS for MSSQL servers. After a whole day of screen sharing and being on the call we decided that what we were trying to do was not achievable, but AWS was working on it. The next day AWS sent me an NDA (since expired).

Data replication from on-prem database instances, or between cloud database instances, is an issue that comes up all the time. I now have hands-on experience doing this a few times. This post summarizes my three or so attempts at it, with different sources and targets, and the lessons learnt.

MSSQL to MSSQL using AWS DMS

At the start-up I was working at, we adopted a pre-built mini ERP that covered logistics workflows and the finance / billing aspects. It was built in the 2000s in .NET Classic and ran on IIS and MSSQL Server. The MSSQL server quickly became the single point of lack of scalability in the system. Since AWS does not natively support read replicas for MSSQL RDS instances, I looked at DMS to create these replicas. DMS did not quite work as expected, which led to the conversation with Norman alluded to above. I ended up performing replication using CloudBasic as the SaaS provider, managing the Change Tracking, handling schema changes in the source tables and propagating them to the target replicas. The replication was fine for a single replica, but quickly bogged the source database down as I added more replicas.

As an aside, the same database was also being replicated to a Redshift cluster for BI usage, using tooling provided by Periscope Data.

As part of this exercise I came to appreciate the advantage of write-only / append-only schemas in improving no-lock replication performance (at the cost of storage), as well as the need for timestamp columns such as update_time to perform incremental data transfers. I spent a lot of time reading the Schemaless articles by Uber Engineering on building schemaless DBs on top of RDBMSs like MySQL. I don't 100% agree with their design choices, but they add an interesting perspective. The bottom line: CRUD at scale is HARD.
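To make the incremental-transfer idea concrete, here is a minimal sketch of pulling only the rows changed since the last sync, keyed on an update_time column. It is not code from that project: the DSNs and the shipments table are hypothetical, and it is written against PostgreSQL with psycopg2 rather than MSSQL.

```python
import psycopg2

# Hypothetical DSNs and table name -- not the real systems.
SOURCE_DSN = "host=source-db dbname=erp user=replicator"
TARGET_DSN = "host=replica-db dbname=erp user=replicator"

def incremental_copy(last_sync_time):
    """Copy rows changed since last_sync_time, relying on an indexed
    update_time column maintained on the source table."""
    src = psycopg2.connect(SOURCE_DSN)
    dst = psycopg2.connect(TARGET_DSN)
    try:
        with src.cursor() as read_cur, dst.cursor() as write_cur:
            read_cur.execute(
                "SELECT id, payload, update_time FROM shipments"
                " WHERE update_time > %s ORDER BY update_time",
                (last_sync_time,),
            )
            for row_id, payload, update_time in read_cur:
                # Upsert keeps the job idempotent if it is re-run;
                # assumes a unique constraint on id.
                write_cur.execute(
                    "INSERT INTO shipments (id, payload, update_time)"
                    " VALUES (%s, %s, %s)"
                    " ON CONFLICT (id) DO UPDATE SET"
                    " payload = EXCLUDED.payload,"
                    " update_time = EXCLUDED.update_time",
                    (row_id, payload, update_time),
                )
                last_sync_time = max(last_sync_time, update_time)
        dst.commit()
    finally:
        src.close()
        dst.close()
    return last_sync_time  # persist this watermark for the next run
```

With a purely append-only source table the ON CONFLICT clause goes away and readers never contend with in-place updates, which is the storage-for-performance trade-off mentioned above.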

RDS PostgreSQL to PostgreSQL using DMS

Fast forward a year or so: I am now working at Geoscience Australia, with the Digital Earth Australia program. Everything runs on Kubernetes and is highly scalable. The single point of lack of scalability is again the database; a pattern seems to be emerging here. We were performing a cluster migration in Kubernetes and I offered to investigate DMS again.

There is a small prologue to the MSSQL scenario: at the start of my tenure at the startup I had migrated around 1 million cells from a massive Google Sheet to the MSSQL database, and by the time we hit scalability issues in the single-instance MSSQL we were at 10 million rows in the largest append-only table. The PostgreSQL migration of the datacube tables featured 8-9 million rows in the largest table. However, the database also has lots of indexes and uses PostGIS for some applications, particularly Datacube Explorer. DMS fell down in its support for the geometry columns, but I learnt a lot setting up DMS using Terraform IaC and fine-tuning it for JSON blob columns, which in the Datacube 1.8.x series design can be up to 2MB in size. DMS migrates standard columns separately from LOB columns.
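The LOB handling lives in the DMS replication task settings. Below is a hedged sketch of the relevant knobs, expressed through boto3 rather than the actual Terraform we used; the ARNs, schema name and size limit are placeholders, not the real values.

```python
import json

import boto3

dms = boto3.client("dms", region_name="ap-southeast-2")

# Limited LOB mode truncates anything above LobMaxSize (in KB), so the
# limit needs headroom over the largest datacube metadata blobs (~2MB).
task_settings = {
    "TargetMetadata": {
        "SupportLobs": True,
        "FullLobMode": False,
        "LimitedSizeLobMode": True,
        "LobMaxSize": 4096,  # KB; illustrative, not the tuned value
    }
}

# Select every table in the (placeholder) schema.
table_mappings = {
    "rules": [
        {
            "rule-type": "selection",
            "rule-id": "1",
            "rule-name": "include-datacube-schema",
            "object-locator": {"schema-name": "agdc", "table-name": "%"},
            "rule-action": "include",
        }
    ]
}

dms.create_replication_task(
    ReplicationTaskIdentifier="datacube-migration",            # placeholder
    SourceEndpointArn="arn:aws:dms:region:acct:endpoint:SRC",  # placeholder
    TargetEndpointArn="arn:aws:dms:region:acct:endpoint:TGT",  # placeholder
    ReplicationInstanceArn="arn:aws:dms:region:acct:rep:INST", # placeholder
    MigrationType="full-load-and-cdc",
    TableMappings=json.dumps(table_mappings),
    ReplicationTaskSettings=json.dumps(task_settings),
)
```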

Ultimately DMS was not feasible for the datacube DB migration to a new RDS instance. However, I believe the core datacube tables can be migrated next time I try, with the applications that depend on materialized views and PostGIS set up afresh on the new DB. Also, by the time I try again Amazon may have better PostGIS support. For the cluster migration we ended up using a snapshot of the last DB.

On-prem PostgreSQL to RDS PostgreSQL

There is a Datacube PostgreSQL DB instance at the NCI which needs to be regularly replicated to RDS. It powers the Explorer Datacube application. However, migrating a DB whose largest tables hold around 22 million rows and whose compressed dump is around 11GB, from a server without direct disk access to RDS (where we also don't have disk access), using pg_dump / pg_restore, is a long-running task. Ideally we sort out something incremental, using an update_time column generated with triggers (sketched first below the option list). The options explored so far are:

  • Using an Airflow DAG with Kubernetes Executors wrapped around pg_dump / pg_restore, with some application-specific details (a DAG sketch follows the list).
  • Using the COPY from S3 support in Aurora PostgreSQL; the CSVs restored with the COPY command are generated incrementally (see the aws_s3 import sketch below).
  • Using PostgreSQL publish / subscribe and logical replication, networking over the internet to maintain connectivity securely to the on-prem instance via SSH tunnels and to the RDS instance via EKS port-forwarding (see the publication / subscription sketch below).
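First, a minimal sketch of the trigger-maintained update_time column mentioned above. The dataset table and the trigger / function names are illustrative rather than the actual datacube schema, and the DSN assumes the on-prem instance is reached through an SSH tunnel on a local port.

```python
import psycopg2

DDL = """
ALTER TABLE dataset ADD COLUMN IF NOT EXISTS update_time timestamptz
    NOT NULL DEFAULT now();

CREATE OR REPLACE FUNCTION set_update_time() RETURNS trigger AS $$
BEGIN
    NEW.update_time := now();
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

DROP TRIGGER IF EXISTS dataset_set_update_time ON dataset;
CREATE TRIGGER dataset_set_update_time
    BEFORE INSERT OR UPDATE ON dataset
    FOR EACH ROW EXECUTE PROCEDURE set_update_time();

-- The incremental job filters on update_time, so index it.
CREATE INDEX IF NOT EXISTS ix_dataset_update_time ON dataset (update_time);
"""

# Placeholder DSN; port 5433 stands in for the SSH tunnel endpoint.
conn = psycopg2.connect("host=localhost port=5433 dbname=datacube user=dba")
with conn.cursor() as cur:
    cur.execute(DDL)
conn.commit()
conn.close()
```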
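For the first option, here is a rough sketch of what such a DAG could look like, written with plain BashOperators (under the Kubernetes Executor each task runs in its own pod). The host names, bucket and DAG id are placeholders, and real credentials would be injected via Secrets / Airflow connections rather than inlined.

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash_operator import BashOperator  # Airflow 1.10-style import

default_args = {
    "owner": "dea",  # placeholder owner
    "retries": 1,
    "retry_delay": timedelta(minutes=15),
}

# Hypothetical DAG: dump the on-prem DB to S3, then restore it into RDS.
dag = DAG(
    dag_id="nci_db_to_rds_sync",
    default_args=default_args,
    start_date=datetime(2020, 6, 1),
    schedule_interval="@weekly",
    catchup=False,
)

# PGPASSWORD and AWS credentials come from the task pod's environment.
dump_to_s3 = BashOperator(
    task_id="pg_dump_to_s3",
    bash_command=(
        "pg_dump -h nci-db.example.internal -U datacube -Fc datacube "
        "| aws s3 cp - s3://example-bucket/dumps/datacube.dump"
    ),
    dag=dag,
)

restore_to_rds = BashOperator(
    task_id="pg_restore_to_rds",
    bash_command=(
        "aws s3 cp s3://example-bucket/dumps/datacube.dump - "
        "| pg_restore -h example.rds.amazonaws.com -U datacube "
        "-d datacube --clean --if-exists --no-owner"
    ),
    dag=dag,
)

dump_to_s3 >> restore_to_rds
```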
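For the second option, Aurora / RDS PostgreSQL can pull the incrementally generated CSVs straight from S3 via the aws_s3 extension. A sketch with placeholder table, bucket and key names; the cluster also needs an IAM role that allows it to read the bucket.

```python
import psycopg2

# Placeholder DSN for the target Aurora / RDS PostgreSQL instance.
conn = psycopg2.connect("host=example.rds.amazonaws.com dbname=datacube user=dba")
conn.autocommit = True

with conn.cursor() as cur:
    # One-off: enable the extension that provides aws_s3.table_import_from_s3.
    cur.execute("CREATE EXTENSION IF NOT EXISTS aws_s3 CASCADE;")

    # Load one incremental CSV (rows changed since the previous export).
    cur.execute(
        """
        SELECT aws_s3.table_import_from_s3(
            'dataset_changes',                  -- staging table (placeholder)
            'id,metadata,update_time',          -- column list
            '(FORMAT csv, HEADER true)',        -- COPY options
            aws_commons.create_s3_uri(
                'example-bucket',
                'incremental/dataset-2020-06-01.csv',
                'ap-southeast-2'
            )
        );
        """
    )

conn.close()
```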
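And for the third option, the publish / subscribe setup itself is only a handful of statements on each side, assuming the on-prem instance has wal_level = logical and the RDS engine version supports logical replication. Hosts, table names and credentials below are placeholders, and the schema must already exist on the subscriber because logical replication does not carry DDL.

```python
import psycopg2

# Publisher side: the on-prem instance, reached via the SSH tunnel.
pub = psycopg2.connect("host=localhost port=5433 dbname=datacube user=dba")
with pub.cursor() as cur:
    # Publish only the tables Explorer needs (placeholder table names).
    cur.execute(
        "CREATE PUBLICATION datacube_pub FOR TABLE dataset, dataset_location;"
    )
pub.commit()
pub.close()

# Subscriber side: the RDS instance.
sub = psycopg2.connect("host=example.rds.amazonaws.com dbname=datacube user=dba")
sub.autocommit = True  # CREATE SUBSCRIPTION cannot run inside a transaction block
with sub.cursor() as cur:
    cur.execute(
        "CREATE SUBSCRIPTION datacube_sub "
        "CONNECTION 'host=nci-db.example.internal dbname=datacube "
        "user=replicator password=REDACTED' "
        "PUBLICATION datacube_pub;"
    )
sub.close()
```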