Big Data on AWS

By admin - August 29, 2017

AWS offers a broad set of production-hardened services for almost any analytic use case. These services let customers run batch, real-time, and machine learning workloads at scale while minimizing maintenance and administrative overhead and keeping security and costs under control.

AWS services cover all elements of a typical analytic workflow, from data ingestion and storage through ETL, data warehousing, Hadoop/Spark processing, interactive querying, and cataloging, to machine learning and deep learning. Let us see how well the current AWS stack meets the usual analytics platform requirements.

Data Lake


S3 serves as the core component of the data lake, decoupling storage from compute so that data persists independently of any cluster. The beauty of a data lake on S3 is that you can future-proof your analytics platform as new tools and use cases emerge. Currently, S3 is integrated with almost all AWS services and with various vendor-specific and open source frameworks. With S3 as the backbone, you can evolve your platform architecture, replacing obsolete modules with better ones as they become available. Here are a few key features of S3:

  • Massively parallel and scalable
  • Designed for 99.99% availability
  • Storage scales independent of compute
  • Seamless integration with other AWS services
  • Low storage costs (as low as $0.021 / GB)
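A common data-lake convention is to lay out S3 objects with Hive-style partition prefixes (e.g., `dt=YYYY-MM-DD`) so that engines like Athena, Spectrum, and Spark can prune partitions instead of scanning everything. A minimal sketch of that layout (the bucket, prefix, and file names are hypothetical):

```python
import datetime

def partitioned_key(prefix: str, table: str, event_date: datetime.date, filename: str) -> str:
    """Build a Hive-style partitioned S3 key, e.g.
    raw/clicks/dt=2017-08-29/part-0000.json.gz"""
    return f"{prefix}/{table}/dt={event_date.isoformat()}/{filename}"

key = partitioned_key("raw", "clicks", datetime.date(2017, 8, 29), "part-0000.json.gz")
print(key)  # raw/clicks/dt=2017-08-29/part-0000.json.gz

# With boto3, the key would be used when writing the object:
#   boto3.client("s3").put_object(Bucket="my-data-lake", Key=key, Body=data)
```

Because the partition value is encoded in the key itself, a query filtered on `dt` only reads the matching prefixes.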

Data Warehousing and Ad-Hoc SQL

Amazon Redshift is an enterprise-grade, petabyte-scale data warehousing platform designed to deliver high performance while securely running DW queries at scale. The underlying Redshift hardware is built for high-performance data processing: locally attached storage maximizes throughput between CPUs and drives, while compression (often 3x or better), zone maps, and columnar storage make reads efficient. Some key features of Redshift:

  • Relational data warehouse
  • Massively parallel and fully managed
  • Petabyte scale
  • Low costs (as low as $1000/TB/yr, $0.25/hr)
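Data typically lands in Redshift via the `COPY` command, which loads files from S3 in parallel across the cluster's slices. A sketch that assembles such a statement (the table, bucket, and IAM role names are hypothetical; the resulting SQL would be run over JDBC/ODBC or a driver like psycopg2):

```python
def build_copy_statement(table: str, s3_path: str, iam_role: str) -> str:
    """Assemble a Redshift COPY statement that loads gzipped CSV from S3.
    COPY reads all files under the given prefix in parallel."""
    return (
        f"COPY {table} "
        f"FROM '{s3_path}' "
        f"IAM_ROLE '{iam_role}' "
        "FORMAT AS CSV GZIP;"
    )

sql = build_copy_statement(
    "sales",
    "s3://my-data-lake/raw/sales/dt=2017-08-29/",
    "arn:aws:iam::123456789012:role/RedshiftCopyRole",
)
print(sql)
```

Pointing `COPY` at a prefix rather than a single file is what lets Redshift split the load across slices.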


Amazon Athena offers easy-to-use, serverless SQL on S3, with no clusters to manage and no data movement whatsoever. Under the hood, Athena uses Presto as the query engine. Each user session runs on its own underlying managed Presto cluster, allowing Athena to scale concurrency arbitrarily with no instance costs when queries are not running. Customers log in to a web console or use their BI tool of choice to run ANSI SQL queries against datasets in S3.
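Programmatically, running an Athena query comes down to boto3's `start_query_execution` followed by polling for completion. A sketch that builds the request parameters (the database, table, and result-bucket names are hypothetical; the actual API call is shown commented out):

```python
def athena_query_params(sql: str, database: str, output_s3: str) -> dict:
    """Parameters for boto3 athena.start_query_execution: the SQL text,
    the catalog database to resolve table names in, and an S3 location
    where Athena writes the result files."""
    return {
        "QueryString": sql,
        "QueryExecutionContext": {"Database": database},
        "ResultConfiguration": {"OutputLocation": output_s3},
    }

params = athena_query_params(
    "SELECT page, COUNT(*) AS hits FROM clicks WHERE dt = '2017-08-29' GROUP BY page",
    "weblogs",
    "s3://my-athena-results/",
)
# qid = boto3.client("athena").start_query_execution(**params)["QueryExecutionId"]
# ...then poll get_query_execution(QueryExecutionId=qid) until the state is SUCCEEDED.
```

Note the `dt` filter: against a partitioned table, Athena scans (and bills for) only the matching partitions.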

Amazon Redshift Spectrum is a newer service that bridges Redshift and the S3 data lake, using Redshift's powerful query optimizer to query massive datasets on S3 directly, without loading them into the cluster first.

ETL and Compute



AWS Glue is a serverless ETL offering that provides data cataloging, schema inference, and ETL job generation in an automated and scalable fashion. It simplifies and automates the difficult and time-consuming tasks of data discovery, conversion, mapping, and job scheduling. AWS Glue is integrated with Amazon S3, Amazon RDS, and Amazon Redshift, and can connect to any JDBC-compliant data store. Once your data is cataloged and its schema inferred, Glue can automatically generate PySpark code for ETL processes from source to sink. This PySpark code can be edited, executed, and scheduled based on user needs. Here are a few key features of Glue:

  • Serverless architecture. Pay by the job.
  • Hive compatible metadata repository for data sources.
  • Inbuilt crawlers to infer schema and partitions.
  • Automatic code generation in PySpark for customized ETL.
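Cataloging typically starts with a crawler pointed at an S3 prefix; boto3's `create_crawler` takes a role, a target catalog database, and the paths to crawl. A sketch assembling those parameters (the crawler name, role ARN, database, and path are hypothetical):

```python
def glue_crawler_params(name: str, role_arn: str, database: str, s3_path: str) -> dict:
    """Parameters for boto3 glue.create_crawler: crawl an S3 prefix and
    register the inferred tables and partitions in the Glue Data Catalog."""
    return {
        "Name": name,
        "Role": role_arn,
        "DatabaseName": database,
        "Targets": {"S3Targets": [{"Path": s3_path}]},
    }

params = glue_crawler_params(
    "clicks-crawler",
    "arn:aws:iam::123456789012:role/GlueServiceRole",
    "weblogs",
    "s3://my-data-lake/raw/clicks/",
)
# boto3.client("glue").create_crawler(**params)
# After the crawler runs, the "weblogs" database is queryable from Athena,
# Spectrum, and EMR through the shared catalog.
```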

Amazon Kinesis is a streaming platform on AWS for a variety of ingestion and streaming ETL use cases. There are three services in the Kinesis family:

  • Kinesis Streams enables scalable real-time ingestion that is highly available and durable, with a configurable retention period of up to a week.
  • Kinesis Firehose enables you to use Lambda for real-time, inline data transformations and have the data delivered automatically to Amazon S3 and Amazon Redshift. Kinesis Firehose scales automatically based on the workload.
  • Kinesis Analytics allows customers to easily build stream processing applications using standard SQL. It makes it easy to have standing SQL queries on real-time data enabling anomaly detection, sliding windows, and tumbling windows.
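Producing to a Kinesis stream is a single `put_record` call; the partition key determines which shard receives the record, so records sharing a key stay in order. A sketch of the record payload (the stream name and event fields are hypothetical):

```python
import json

def kinesis_record(stream: str, event: dict, partition_key: str) -> dict:
    """Parameters for boto3 kinesis.put_record. Records with the same
    partition key land on the same shard, preserving their relative order."""
    return {
        "StreamName": stream,
        "Data": json.dumps(event).encode("utf-8"),
        "PartitionKey": partition_key,
    }

rec = kinesis_record("clickstream", {"page": "/home", "user": "u42"}, "u42")
# boto3.client("kinesis").put_record(**rec)
```

Using the user ID as the partition key, as here, keeps each user's events ordered while spreading users across shards.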


Amazon EMR is another managed offering: fully customizable, auto-scaling clusters running the latest open source stack. It offers seamless integration of leading data processing engines like Apache Hadoop and Apache Spark with AWS services. EMR also innovates and contributes back to the ecosystem, offering capabilities like HBase on S3, which decouples compute and storage for HBase. With proven customer successes in large-scale data processing, Spark on EMR complements a variety of use cases ranging from standard ETL to machine learning workloads.
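Spark jobs on EMR are commonly submitted as cluster steps that wrap `spark-submit`. A sketch of the step definition passed to boto3's `add_job_flow_steps` (the step name, script location, and arguments are hypothetical):

```python
def spark_step(name: str, script_s3: str, *args: str) -> dict:
    """An EMR step that runs spark-submit on a PySpark script stored in S3.
    command-runner.jar is EMR's wrapper for running commands on the master node."""
    return {
        "Name": name,
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", script_s3, *args],
        },
    }

step = spark_step("daily-etl", "s3://my-code/etl.py", "--date", "2017-08-29")
# boto3.client("emr").add_job_flow_steps(JobFlowId=cluster_id, Steps=[step])
```

Keeping both the script and its input on S3 is what makes the cluster itself disposable: it can be terminated after the step completes and recreated for the next run.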

Visualization

Amazon QuickSight is an easy-to-use, high-performance, serverless BI service. QuickSight lets customers load and refresh data into SPICE, a super-fast, in-memory calculation engine that enables low-latency, interactive exploration of data at scale.

For use cases beyond SQL, AWS offers analytic notebooks like Apache Zeppelin and data exploration UIs like Apache Hue. There are also many vendor tools, such as Tableau and MicroStrategy, available in the AWS Marketplace that integrate with many AWS services.

By Sunil Penumala

https://www.linkedin.com/in/spenumala/