Saturday, April 4, 2020

Distributed Locking from Python while running on AWS

In the day and age of eventually consistent web-scale applications the concept of locking may seem very archaic. However in some instances attempting to obtain a lock and failing to do so within a limited window can prevent dogpile effects for expensive server side operations or prevent over-write of already executing long running tasks such as ETL processes.
I have used 3-basic approaches to create distributed locks on AWS with the help of built-in services and accessed them via Python which is what I build most of my sofware in.

File locks upgraded to EFS

File based locks in UNIX file-systems are very common. They are typically created using the flock command, avalaible in Python under os-specific flock API. Also checkout the platform independent filelock. This is well and good for a VM or single application instance. For distributed locking, we will need EFS as the filesystem on which these locks are held, Linux-Kernel and NFS will use byte-range locks to help simulate locally attached file system type locks. However if the client loses connectivity the NFS lock-state cannot be determined, better run that EFS with enough replicas to ensure connectivity.
File locking this way is very useful if we are using EFS for holding large file and processing data anyway.

Redis locks upgraded to ElastiCache

Another popular pattern for holding locks in Python is using Redis. This can be upgraded in the cloud-hosted scenario to Redis-Elasticache, This pairs well with the redis-lock library.
Using redis requires a bit of setup and is subject to similar network vagaries and EFS. It makes sense when using Redis already as an in-memory cache for accelration or as a broker/results mechanism for Celery. Having data encrypted at rest and transit may require running an Stunnel Proxy.

An AWS only Method - DynamoDB

A while ago AWS published an article for creating and holding locks on DynamoDB using a Java lock client. This client creates the lock and holds it live using heart-beats while the relevant code section executes. Since then it has been ported to Python and I am maintaining my own fork.
It works well and helps scale-out singleton processes run as Lambdas to multiple lambdas in a serverless fashion, with a given lambda quickly skipping over a task another lambda is holding a lock on. I have also used it on EC2 based stuff where I was already using DynamoDB for other purposes. This is possibly the easiest and cheapest method for achieving distributed locking. Locally testing this technique is also quite easy using local-dynamodb in a docker container.
Feel free to ping me other distributed locking solutions that work well on AWS and I will try them out.