AWS Data Warehouse – Build with Redshift and QuickSight
AWS Redshift Data Warehouse solution is based on PostgreSQL but beyond just PostgreSQL. It is specifically designed for online analytical processing (OLAP) and business intelligence (BI) applications. Redshift is a fast, fully managed petabyte-scale data warehousing solution to analyze all your data using standard SQL and your existing Business Intelligence (BI) tools. It allows you to run complex analytic queries against petabytes of structured data, using sophisticated query optimization, columnar storage on high-performance local disks, and massively parallel query execution. It’s wrapped with other AWS services such as S3 for importing/exporting data, KMS(key management service) for encrypting database, IAM(identity and access management) for role access and policy setup, VPC(Virtual private cloud) for isolating in secured virtual network, SWF(simple workflow service) for durable, distributed-state management, and CloudWatch for monitoring performance. AWS also provides QuickSight, cloud-powered business analytics service, to build visualizations.
Why build a data warehouse?
A data warehouse is a central repository of information coming from one or more data sources. Data warehousing is really old technology concept. We have a lot more choices in modern database technologies such as relational database, NoSQL, streaming. Why build a data warehouse at all—why not just run analytics queries directly against the database? Let’s take a look at the difference between data warehouse and databases.
The No.1 difference is OLAP vs. OLTP. The data warehouse is designed for online analytical processing and database is designed for online transaction process. For the data warehouse, Data typically flows from transactional systems and other relational databases and typically includes structured, semi-structured, and unstructured data. For databases, Data captured as-is from a single transaction source. Data warehouses are optimized for batched write operations and reading high volumes of data, but databases are optimized for continuous write operations and high volumes of small read operations. Data warehouses employ denormalized schemas like the Star schema and Snowflake schema because of high data throughput requirements, whereas OLTP databases employ highly normalized schemas, which are more suited for high transaction throughput requirements. On data storage standpoint, data warehouse uses columnar storage to optimize for simplicity of access and high-speed query performance. Database use row-oriented physical block to optimize for high throughout write operations.
Redshift and QuickSight
The typical end-to-end analytics processing has the following stages: collect data, store the data, process the data then analyze and visualize the data. Let’s take a look at how AWS services accomplish this end-to-end analytics processing. In the data collection stage, you probably have transactional data, streaming data, log data and Internet of Things data. AWS use RDS and DynamoDB as an OLTP store for you. S3 hold transaction files and/or log data. Streaming data can be easily collected/processed by Kinesis services. Using AWS IoT connected devices interact easily and securely with AWS cloud. AWS offers direct connect, Snowball and Database migration services to support data migration.
There are two types of data processing workflows: batch and real time. OLAP (online analytical processing) is batched-based. OLTP(Online transaction processing) is real-time processing. AWS Lambda can process data directly from AWS IoT or Kinesis Streams. AWS Kinesis Client Library (KCL) is another way to process data from Kinesis Streams. Kinesis Firehose can load streaming data easily into AWS Redshift.
You can store those data in either a data warehouse or data mart. AWS redshift can build data marts in addition to data warehouses. AWS redshift works with ETL tools (e.g. Matillion ETL) Then use with existing BI tools (e.g. Tableau). AWS also offers QuickSight for visualization, Machine Learning for prediction. Apache Zeppelin as an open source BI solution runs on AWS EMR.
Course
I made AWS Data Warehouse – Build with Redshift and QuickSight course on Udemy. The course covers all of the main concepts you need to know about Data Warehouse and Redshift. This course assumes you have no experience in Redshift but are eager to learn AWS solution on Data Warehouse. This course has seven hands-on labs from launching Redshift cluster, loading data, managing cluster, monitoring performance to visualizing data on QuickSight. Plus the advanced experimental sections AWS Machine Learning on Redshift Data and Redshift Spectrum with Data Lake. You will learn Redshift essentials, QuickSight visualization, and Machine Learning prediction. You will also get the basic knowledge of other associated AWS services (e.g. S3, IAM, VPC, CloudWatch, and CloudTrial) during this step-by-step deploying and analytical processing.
Once you have completed this course, you should be able to deploy your data warehouse on Redshift, operate and maintain data, analyze and visualize data on Quicksight, and set up security for Redshift.
Course Agenda
- AWS Data Warehouse Overview
- AWS Redshift System Overview
- Cluster Management
- Monitoring and Logs
- Data Operations
- Analytics and Visualization on QuickSight
- Advanced Experimental Sections
- Security
- What’s Next…
Course Promotions
Please check the following links for the enrollment: