By Sunil Penumala - August 24, 2018
Customers often ask me, “What are the right design patterns for a serverless ETL workflow?” AWS offers a wide range of serverless and managed services that address various ETL requirements. In this post, I share some of my views on designing an event-driven or real-time ETL workflow using AWS Lambda.
Automation of ETL plays a critical part in the success of any reporting workflow. AWS offers services like AWS Data Pipeline to automate the movement and transformation of data. With AWS Data Pipeline you can define, schedule, and monitor your batch workflows along with all their load dependencies. For event-driven workloads, however, AWS Lambda is the better fit.
AWS Lambda is a serverless compute service that lets you run code without provisioning or managing servers. Lambda allows you to build pipelines that respond quickly to new data, and it automatically hosts and scales them for you. Lambda functions can be invoked directly through APIs or in response to events from other AWS services. The diagram below shows a sample event-driven workflow that transforms data and loads it into Amazon Redshift.
The event considered here is the upload of raw data to an S3 bucket. This action triggers a series of automated steps that complete the ETL workflow and make the data available in the warehouse for consumption. The detailed flow is described below:
Once the raw data lands in the S3 bucket, the object-created event invokes a Lambda function, which performs all defined data validation checks.
Validated data is batched into another S3 bucket. Once the pre-defined input batch size is reached, another Lambda function is invoked to start the transformation process.
Pre-built AWS Glue jobs can be triggered, or a transient Amazon EMR cluster can be used, to transform the validated data.
Transformed data is then made available in S3 to services like Amazon Athena and Amazon Redshift Spectrum, or it can be loaded into Amazon Redshift to use full-fledged warehouse capabilities for reporting needs.
We can use Amazon CloudWatch to monitor these event-driven workflows and set alarms that fire when a step fails.
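The flow above can be sketched in Python, the kind of code a Lambda function might run. This is only a minimal illustration under several assumptions, not a definitive implementation: the validation rules (non-empty `.csv` files), the batch size, the Glue job name `transform-validated-data`, the `--input_keys` job argument, and all function names are hypothetical. The S3 event layout, the Glue `start_job_run` call, and the `AWS/Lambda` `Errors` metric are standard AWS interfaces. The Glue client is passed in as a parameter so the logic can be exercised without AWS access.

```python
from urllib.parse import unquote_plus

BATCH_SIZE = 3                               # hypothetical batch threshold
GLUE_JOB_NAME = "transform-validated-data"   # hypothetical Glue job name


def extract_s3_objects(event):
    """Pull bucket, key, and size out of each object-created record."""
    objects = []
    for record in event.get("Records", []):
        if not record.get("eventName", "").startswith("ObjectCreated:"):
            continue  # ignore deletes and other event types
        s3 = record["s3"]
        objects.append({
            "bucket": s3["bucket"]["name"],
            # Object keys arrive URL-encoded in S3 event notifications.
            "key": unquote_plus(s3["object"]["key"]),
            "size": s3["object"].get("size", 0),
        })
    return objects


def is_valid(obj):
    """Hypothetical validation check: a non-empty CSV file."""
    return obj["key"].endswith(".csv") and obj["size"] > 0


def start_transform_if_batch_ready(glue_client, validated_keys,
                                   batch_size=BATCH_SIZE):
    """Start the Glue job once enough validated files have accumulated."""
    if len(validated_keys) < batch_size:
        return None  # keep batching
    response = glue_client.start_job_run(
        JobName=GLUE_JOB_NAME,
        # "--input_keys" is an assumed argument the job script would read.
        Arguments={"--input_keys": ",".join(validated_keys)},
    )
    return response["JobRunId"]


def lambda_error_alarm_params(function_name):
    """Build keyword arguments for CloudWatch put_metric_alarm that
    raise an alarm when the given Lambda function reports any error."""
    return {
        "AlarmName": f"{function_name}-errors",
        "Namespace": "AWS/Lambda",
        "MetricName": "Errors",
        "Dimensions": [{"Name": "FunctionName", "Value": function_name}],
        "Statistic": "Sum",
        "Period": 300,              # seconds
        "EvaluationPeriods": 1,
        "Threshold": 0,
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "notBreaching",
    }
```

In a deployed function, the client would typically come from `boto3.client("glue")`, and the alarm parameters could be passed as `boto3.client("cloudwatch").put_metric_alarm(**lambda_error_alarm_params(name))`. How the list of validated keys is tracked between invocations (for example, by listing the validated bucket) is left open here.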
AWS Pricing info – https://aws.amazon.com/pricing/?nc2=h_ql_pr&awsm=ql-3
Please feel free to reach out to Sunil Penumala if you have any questions or need additional information.
By Sunil Penumala, Solutions Architect at DataFactZ
The article was originally posted on LinkedIn at https://www.linkedin.com/pulse/event-driven-etl-workflows-using-aws-lambda-sunil-penumala/