Wednesday, August 29, 2018

Using AWS Athena in Anger

The journey down this particular rabbit hole started with this challenge "Create Read-replicas for MSSQL 2017 database hosted in RDS". Hosting services on AWS typically reduces IT load and makes focus on product development easier, however there are certain edge cases where AWS falls down and requires special intervention.

The above challenge turned out to be one of them. RDS supports seamless read-replicas on a lot of other types of databases , however MSSQL is not one of them. The recommended approach for creating read-replicas for MSSQL is to enable CDC and do so using DMS (Database Migration Service). This is similar to what the RDS does internally for multi-AZ redundancy. We hit another edge-case here, DMS replication support with CDC does not extend to the 2017 version.

We did manage to run DMS successfully from MSSQL 2017 to S3 as sort of a disaster recovery measure. Then I was stuck with the conundrum of what to do with the CSV table dumps this got me, at the AWS Summit in Cape Town one of the solution specialists introduced me to Athena and the adventure began.

Athena is essentially a managed Apache Hive based map-reduce setup which can churn through webserver logs or in this case CSV's to give me back the database I was replicating. The replica is not perfect, it is missing stored procedures and views. These need to be redone in HiveQL.

All tables can be mapped using corresponding schema and DDL to Athena tables, thus replicating fully the database without foreign keys and other niceties. AWS now has cross region S3 replication , this coupled with Athena can give a very Rube-Goldberg cross-region database replication.

Athena is also particularly useful with scrapers where data is being collected from 3rd-party sites and consolidated into S3 tables. It can seamlessly blast through millions of rows of scraped data for goodies in seconds.

This post is a WIP and will be enriched with Athena goodies over time.