Why Use AWS Redshift Spectrum with a Data Lake
A data lake is a centralized repository that lets you store all your structured and unstructured data at any scale. On AWS, S3 serves as the storage layer for the data lake: it holds data in any format, securely, and at massive scale. However, this creates a “Dark Data” problem: most of the generated data is never made available for analysis. To address this, AWS introduced Redshift Spectrum, an extra layer that sits between the Redshift data warehouse cluster and the data lake in S3.
Key Features
Redshift Spectrum has the following key features:
- A group of AWS-managed nodes running in a private VPC
- Independent of any single Redshift cluster, and available to every Redshift cluster that is Spectrum-enabled
- A single query can span open file formats in S3 and data stored in Redshift, without any ETL
- Fully managed and priced per query, based on the amount of S3 data the query scans, which helps deliver higher performance at lower cost
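To make the pricing point concrete, here is a rough sketch of how the cost of one query scales with the bytes it scans in S3. The rate of 5 USD per TB is an assumption used for illustration only; actual rates vary by region and over time, and the function name is hypothetical.

```python
TB = 1024 ** 4  # bytes in one tebibyte, used here as the billing unit


def estimate_scan_cost(bytes_scanned: int, usd_per_tb: float = 5.0) -> float:
    """Estimate the cost of one query from the bytes it scans in S3.

    usd_per_tb is an assumed rate; check the current AWS pricing page.
    """
    if bytes_scanned < 0:
        raise ValueError("bytes_scanned must be non-negative")
    return bytes_scanned / TB * usd_per_tb


# Scanning a full 2 TB dataset versus roughly 5% of it
# (for example after partition pruning and columnar reads).
full_scan = 2 * TB
pruned_scan = full_scan // 20

print(f"full scan:   ${estimate_scan_cost(full_scan):.2f}")
print(f"pruned scan: ${estimate_scan_cost(pruned_scan):.2f}")
```

The takeaway is that anything that reduces the bytes scanned, such as partitioning or a columnar file format, lowers the cost of the query directly.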
Life of Query
There are three key components to run a query with Redshift Spectrum:
- The external data catalog contains the schema definitions for the data you access in S3. It is the central metadata repository for your data assets. The options are the AWS Athena data catalog (default), AWS Glue, or an Apache Hive metastore (either from your own Hadoop ecosystem or from AWS EMR).
- The external schema is the Redshift schema that references a database in the external data catalog and contains your external tables.
- External tables let you query data in S3 and join it with local Redshift tables in the same query. External tables are read-only.
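The three components above can be sketched as the SQL statements a client would send to the cluster. The Python below only builds and prints the statements; it does not connect to anything. Every identifier in it (schema, catalog database, role ARN, bucket, table and column names) is a hypothetical placeholder, and the table layout is an assumption made for illustration.

```python
# All names below are hypothetical placeholders, not values from this article.
# The statements are built with plain string formatting, so only pass trusted values.

def external_schema_ddl(schema: str, catalog_db: str, iam_role_arn: str) -> str:
    """Map a Redshift schema onto a database in the external data catalog."""
    return (
        f"CREATE EXTERNAL SCHEMA {schema}\n"
        "FROM DATA CATALOG\n"
        f"DATABASE '{catalog_db}'\n"
        f"IAM_ROLE '{iam_role_arn}'\n"
        "CREATE EXTERNAL DATABASE IF NOT EXISTS;"
    )


def external_table_ddl(schema: str, table: str, s3_location: str) -> str:
    """Declare an external table over Parquet files in S3."""
    return (
        f"CREATE EXTERNAL TABLE {schema}.{table} (\n"
        "    salesid   INTEGER,\n"
        "    eventid   INTEGER,\n"
        "    pricepaid DECIMAL(8,2)\n"
        ")\n"
        "PARTITIONED BY (saledate DATE)\n"
        "STORED AS PARQUET\n"
        f"LOCATION '{s3_location}';"
    )


def join_query(schema: str, table: str, local_table: str) -> str:
    """Join the external (S3) table with a local Redshift table in one query."""
    return (
        "SELECT e.eventname, SUM(s.pricepaid) AS revenue\n"
        f"FROM {schema}.{table} AS s\n"
        f"JOIN {local_table} AS e ON s.eventid = e.eventid\n"
        "WHERE s.saledate >= '2024-01-01'\n"
        "GROUP BY e.eventname;"
    )


schema_sql = external_schema_ddl(
    "spectrum", "spectrum_db", "arn:aws:iam::123456789012:role/ExampleSpectrumRole"
)
table_sql = external_table_ddl("spectrum", "sales", "s3://example-bucket/sales/")
query_sql = join_query("spectrum", "sales", "public.event")

for statement in (schema_sql, table_sql, query_sql):
    print(statement, end="\n\n")
```

Because the table is partitioned, each partition would also have to be registered in the catalog (for example with ALTER TABLE ... ADD PARTITION) before the query returns rows from it.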
Here is the life of the query on Redshift Spectrum:
- The query is optimized and compiled at the leader node, which determines what runs locally and what goes to Amazon Redshift Spectrum.
- The query plan is sent to all compute nodes.
- Compute nodes obtain partition information from the data catalog and dynamically prune partitions.
- Each Compute node issues multiple requests to the Redshift Spectrum layer.
- Amazon Redshift Spectrum nodes scan your S3 data.
- Amazon Redshift Spectrum projects, filters, and aggregates.
- Final aggregations and joins with local Amazon Redshift tables are done in the cluster.
- The result is sent back to the client.
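The division of work in the steps above can be illustrated with a toy model. This is not how Redshift is implemented; it is a minimal in-memory sketch, with made-up data and hypothetical function names, of the three ideas involved: pruning partitions before any scan, filtering and partially aggregating in the Spectrum layer, and finishing the aggregation and the join with a local table in the cluster.

```python
from collections import defaultdict

# Stand-in for partitioned S3 data: partition value -> rows of (region, amount).
S3_PARTITIONS = {
    "2024-01": [("east", 10.0), ("west", 5.0)],
    "2024-02": [("east", 7.0), ("west", 3.0), ("east", 1.0)],
    "2024-03": [("west", 8.0)],
}

# Stand-in for a local Redshift table: region code -> display name.
LOCAL_REGIONS = {"east": "Eastern", "west": "Western"}


def prune_partitions(all_partitions, wanted):
    """Compute-node step: keep only the partitions the query needs."""
    return [p for p in all_partitions if p in wanted]


def spectrum_scan(partition, min_amount):
    """Spectrum-layer step: scan one partition, filter rows, partially aggregate."""
    partial = defaultdict(float)
    for region, amount in S3_PARTITIONS[partition]:
        if amount >= min_amount:
            partial[region] += amount
    return dict(partial)


def run_query(wanted_months, min_amount):
    """Cluster step: merge the partial results, then join with the local table."""
    partitions = prune_partitions(S3_PARTITIONS, wanted_months)
    partials = [spectrum_scan(p, min_amount) for p in partitions]

    totals = defaultdict(float)
    for partial in partials:
        for region, amount in partial.items():
            totals[region] += amount

    return {LOCAL_REGIONS[region]: total for region, total in totals.items()}


result = run_query({"2024-01", "2024-02"}, min_amount=2.0)
print(result)  # {'Eastern': 17.0, 'Western': 8.0}
```

In this sketch the "2024-03" partition is never scanned, and only small per-partition summaries travel from the scan step to the final aggregation, which mirrors why pruning and pushdown reduce both the data scanned and the work left for the cluster.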
Lab on Redshift Spectrum
The lab on Redshift Spectrum is covered in my online course.