3 Popular Cloud Data Warehouse Solutions – AWS Redshift vs. Snowflake vs. GCP BigQuery
Last time I compared the cloud data warehouse solutions AWS Redshift and GCP BigQuery. Snowflake has recently become increasingly popular, so today let's compare three popular cloud data warehouse solutions: AWS Redshift vs. Snowflake vs. GCP BigQuery.
Market Adoption
All three popular cloud data warehouse solutions can run analytic queries against petabytes to exabytes of data with high scalability, cost-effectiveness, and security. GCP BigQuery was announced in May 2010 and made generally available in November 2011. It leverages Borg, Google's large-scale cluster management system, and Dremel, its execution engine. AWS Redshift is based on an older version of PostgreSQL (8.0.2), with significant changes on top of that version. An initial preview beta was released in November 2012, and the full release was made available on February 15, 2013. Snowflake has run on AWS (using S3 for storage) since 2014, on Microsoft Azure since 2018, and on the Google Cloud Platform since 2019. Both AWS Redshift and GCP BigQuery are Platform as a Service (PaaS) offerings that support querying with ANSI SQL, while Snowflake is delivered as Software as a Service (SaaS). At the time of this post, AWS Redshift is still the leading cloud data warehouse solution in the market, but Snowflake has grown significantly faster than the other two in the past 12 months.
Architecture
In this post, I will only summarize the three cloud data warehouse solutions' architectures. For details, you can review my Redshift course, as well as the Redshift architecture overview and BigQuery architecture overview videos on my channel. You can also review Snowflake's architecture and key features on Snowflake's website.
AWS Redshift
AWS Redshift's fully managed data warehouse solution is based on PostgreSQL but goes well beyond it. The core infrastructure component of an AWS Redshift data warehouse is a cluster, composed of one or more compute nodes plus a leader node; client applications interact directly only with the leader node. Each compute node is partitioned into slices. Redshift reduces I/O through columnar storage, data compression, and massively parallel processing (MPP). AWS Redshift also introduced Redshift Spectrum, which runs SQL queries directly against data stored in AWS S3 buckets. This saves the time and cost of moving data from a storage service into the data warehouse.
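For example, here is a minimal Redshift Spectrum sketch. The schema, database, table, bucket, and IAM role names are all hypothetical placeholders:

```sql
-- Register an external schema backed by the AWS Glue Data Catalog
-- (all names below are hypothetical).
CREATE EXTERNAL SCHEMA spectrum_schema
FROM DATA CATALOG
DATABASE 'spectrum_db'
IAM_ROLE 'arn:aws:iam::123456789012:role/MySpectrumRole'
CREATE EXTERNAL DATABASE IF NOT EXISTS;

-- Define an external table over files that already live in S3.
CREATE EXTERNAL TABLE spectrum_schema.sales (
  sale_id   INTEGER,
  sale_date DATE,
  amount    DECIMAL(10,2)
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION 's3://my-bucket/sales/';

-- Query it directly; no data is loaded into the cluster.
SELECT sale_date, SUM(amount) AS total
FROM spectrum_schema.sales
GROUP BY sale_date;
```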
GCP BigQuery
BigQuery's architecture is different from traditional node-based cloud data warehouses and massively parallel processing (MPP) systems. Its serverless architecture decouples storage and compute and allows each to scale independently on demand; from the users' standpoint, there is no visibility into, or control over, individual servers or clusters of servers. BigQuery has two core services, a storage service and a query service, connected by Google's high-speed Jupiter network. The storage service automatically shards and shuffles data in Colossus, Google's underlying distributed file system, where tables are stored as highly compressed columns for durability and availability; it supports both bulk ingest and streaming ingest. The query service runs interactive or batch queries, integrates with other GCP data processing services through connectors, and can also run query jobs on data in Cloud Storage. Under the hood, BigQuery leverages Borg, the large-scale cluster management system, and Dremel, the execution engine, for these services.
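As a quick illustration of the serverless model, a standard SQL query against one of Google's public sample datasets runs with nothing to provision. I am assuming the bigquery-public-data.usa_names.usa_1910_2013 public dataset here; any public table works the same way:

```sql
-- BigQuery allocates Dremel capacity on demand; there is no cluster to manage.
SELECT name, SUM(number) AS total
FROM `bigquery-public-data.usa_names.usa_1910_2013`
WHERE gender = 'F'
GROUP BY name
ORDER BY total DESC
LIMIT 10;
```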
Snowflake
Snowflake is a SaaS solution requiring zero management from end users. Its three-layer architecture separates data storage, data processing, and data consumption. Similar to shared-disk architectures, Snowflake uses a central data repository for persisted data that is accessible from all compute nodes in the platform; similar to shared-nothing architectures, it processes queries using MPP (massively parallel processing) compute clusters where each node in the cluster stores a portion of the data set locally. It supports multiple ways of connecting to the service, e.g. the web UI, the SnowSQL command-line client, ODBC/JDBC drivers, and native and third-party connectors.
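To make the storage/compute separation concrete, here is a minimal sketch in Snowflake SQL. The warehouse and table names are hypothetical; the point is that independent virtual warehouses share the same centrally stored data without competing for compute:

```sql
-- Two independent virtual warehouses (compute) over the same stored data.
-- All names here are hypothetical.
CREATE WAREHOUSE etl_wh WITH WAREHOUSE_SIZE = 'LARGE';
CREATE WAREHOUSE bi_wh  WITH WAREHOUSE_SIZE = 'SMALL';

-- BI users query on their own cluster, isolated from ETL workloads.
USE WAREHOUSE bi_wh;
SELECT COUNT(*) FROM analytics.public.orders;
```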
Architecture Comparison
PaaS vs. SaaS: Fully managed AWS Redshift, Serverless GCP BigQuery, and SaaS Snowflake on AWS/Azure/GCP
AWS Redshift: Based on PostgreSQL with MPP (massively parallel processing) and columnar storage; Provisioned on clusters and nodes; Redshift Spectrum between Redshift and Data Lake
GCP BigQuery: Decouples storage and compute; Leverages Borg, Dremel, and Jupiter; Columnar storage with compression, stored in Colossus; Foundation for ML/AI
Snowflake: A hybrid of traditional shared-disk and shared-nothing database architectures; Columnar storage and MPP; Separates data storage, data processing, and data consumption
Operation and Maintenance
Let’s compare the features in the following areas:
| Features | AWS Redshift | Snowflake | GCP BigQuery |
| --- | --- | --- | --- |
| Infrastructure management | Fully managed; automated provisioning and automated backup | Zero management for end users; separates data storage, data processing, and data consumption | Completely serverless; separates storage and compute |
| Programmatic interaction | All languages supporting JDBC/ODBC | Native clients (connectors, drivers, etc.) provided by Snowflake, e.g. JDBC/ODBC, Go, .NET, Node.js, PHP, and Python | REST API; client libraries in Java, Python, Node.js, C#, Go, Ruby, and PHP |
| Data ingestion | Load static data from AWS S3, EMR, DynamoDB tables, and remote hosts; load streaming data using Kinesis | Load data from, or define external tables over, AWS S3 / GCP Cloud Storage / Azure; Snowpipe loads small volumes of data continuously; load streaming data from Apache Kafka topics (see the load-statement sketch after this table) | Load data from Cloud Storage, Datastore backups, Dataflow, and streaming sources; use familiar data integration tools, e.g. Informatica, Talend |
| Foundation for ML/AI | Predictive analytics in Redshift with SageMaker is in preview | Supports ML/AI applications | Brings ML to your data with BigQuery ML, plus integrations with AI Platform and TensorFlow (see the BigQuery ML sketch after this table) |
| Result caching | Uses result caching to deliver sub-second response times for repeat queries | All query results are stored for 24 hours (90 days for enterprise customers), or until the underlying data changes | Writes all query results to a table: either a temporary cached results table at no charge, or a permanent table with a storage charge |
| Scalability | Local storage configuration, so users can't scale resources independently; resizing or changing the machine instance type requires cluster reconfiguration, with read-only mode during the reconfiguration | Storage and compute can be scaled up/down independently and immediately; the metadata service also scales as necessary; trillions of rows can be sliced up with ease by multiple concurrent users | Fully elastic data warehouse; automatic and rapid provisioning via Borg (cluster management) and Dremel (execution engine) |
| Maintenance | Automatically runs the VACUUM DELETE operation to reclaim disk space | Low maintenance and zero management for end users | Use expiration settings to remove unneeded tables and partitions |
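To make the data ingestion row concrete, here is a hedged sketch of a basic bulk load in each system. All bucket, stage, dataset, table, and IAM role names are hypothetical:

```sql
-- AWS Redshift: COPY bulk-loads files from S3 into a cluster table.
COPY sales
FROM 's3://my-bucket/sales/'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
CSV;

-- Snowflake: COPY INTO loads files from a named external stage.
COPY INTO sales
FROM @my_s3_stage
FILE_FORMAT = (TYPE = 'CSV');

-- GCP BigQuery: LOAD DATA pulls files from Cloud Storage into a table.
LOAD DATA INTO mydataset.sales
FROM FILES (format = 'CSV', uris = ['gs://my-bucket/sales/*.csv']);
```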
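And since the ML/AI row mentions BigQuery ML, here is its flavor: models are created and used with plain SQL. The dataset, table, and column names below are hypothetical:

```sql
-- Train a logistic regression model entirely in SQL (names hypothetical).
CREATE OR REPLACE MODEL mydataset.purchase_model
OPTIONS (model_type = 'logistic_reg') AS
SELECT
  did_purchase AS label,  -- BigQuery ML uses the column named 'label' as the target
  visits,
  time_on_site
FROM mydataset.user_sessions;

-- Score new rows with ML.PREDICT.
SELECT *
FROM ML.PREDICT(
  MODEL mydataset.purchase_model,
  (SELECT visits, time_on_site FROM mydataset.new_sessions));
```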
Performance
Performance is tricky to compare across data warehouse solutions. It depends on table size, schema complexity, the number of concurrent queries, and so on, and different benchmarks show different results. I suggest you run your own benchmark on all three systems to compare performance for your use case. Take advantage of AWS Redshift's two-month free trial for customers who have never created a Redshift cluster, GCP's $300 credit with its free tier account, and Snowflake's trial account with free credits, all available at the time of this post. I included links to two different performance results in my previous AWS Redshift vs. GCP BigQuery post to show the different stories; those links provide 2020 test results across cloud data warehouse solutions.
Security
Both AWS Redshift and GCP BigQuery provide very similar security features:
- Access controls: Both platforms leverage their IAM to set up roles and permissions
- Encryption at rest: Both platforms leverage their key management systems for database encryption and server/client-side encryption of load data files. BigQuery encrypts by default; Redshift also supports HSM.
- Data in transit: Both platforms include virtual private cloud (VPC) and SSL connections
- Data loss prevention (DLP): AWS's DLP service, Macie, supports AWS S3; Google's Cloud DLP service supports BigQuery.
Snowflake runs on the public clouds (AWS/GCP/Azure), so its security builds on each cloud provider's features; it is fair to say it meets the industry-standard security expected of a cloud solution.
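Beyond platform IAM, the warehouses also expose SQL-level access control. A brief sketch with hypothetical role, group, and table names (BigQuery primarily manages this through IAM roles on datasets and tables):

```sql
-- Snowflake: role-based access control in SQL.
GRANT SELECT ON TABLE analytics.public.orders TO ROLE analyst_role;

-- AWS Redshift: PostgreSQL-style grants to users and groups.
GRANT SELECT ON TABLE orders TO GROUP analysts;
```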
Cost
AWS Redshift's pricing model covers both storage and compute cost. You can choose from RA3 (built on the AWS Nitro System with managed storage), Dense Compute, or Dense Storage node types. The cheapest node, a dc2.large with 160 GB of storage, costs $0.25 per hour, which works out to roughly $0.25 × 730 hours ≈ $182.50 per month of compute. Please go to the AWS Redshift pricing calculator to estimate your cost. For example, with one dc2.large node on demand, 1 TB of additional backup storage, and 10 TB of Redshift Spectrum scanning, the total monthly cost is $256.05, and you can pay upfront for a discount.
GCP BigQuery's pricing model is complicated (e.g. active vs. long-term storage, flat-rate vs. on-demand, streaming inserts vs. queries vs. the Storage API). It separates storage cost from query cost: storage is $0.020 per GB per month, and on-demand queries cost $5 per TB scanned. The storage is cheaper than AWS Redshift's, but query costs can add up quickly. Please go to the GCP Pricing Calculator and select BigQuery to estimate the cost. For example, on demand with 10 TB of storage, 10 GB of streaming inserts, and 1 TB of queries, the total monthly cost is $205.11; almost all of that is storage (10 TB at $0.020/GB ≈ $205), since the first 1 TB of queries each month falls under the free tier. Flat-rate with 100 slots, 10 TB of storage, and 10 GB of streaming inserts comes to $2,410.22 per month.
Snowflake's pricing model is not straightforward. Storage costs $23 per TB per month if paid upfront, or $40 per TB per month on demand. On top of that, compute cost = (minutes consumed × cost per node × nodes per cluster) × number of clusters. So, using Snowflake's pricing to estimate on demand: 10 TB of storage costs $400 per month, and a 10-node cluster of a Large-size warehouse costs 32 credits per hour × $4 per credit on the Enterprise edition = $128 per hour of compute. You can save cost by minimizing compute time and configuring AUTO_SUSPEND = x so a warehouse automatically suspends after x seconds of inactivity.
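A small sketch of that cost-control setting (the warehouse name is hypothetical):

```sql
-- Suspend an idle warehouse automatically to stop consuming credits.
ALTER WAREHOUSE bi_wh SET
  AUTO_SUSPEND = 300   -- suspend after 300 seconds of inactivity
  AUTO_RESUME = TRUE;  -- resume transparently on the next query
```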
Conclusion
The three popular cloud data warehouse solutions AWS Redshift, Snowflake, and GCP BigQuery are all highly scalable enterprise data warehouses that make data analytics more productive at a strong price-performance ratio.
- All three take care of infrastructure management and database administration, so you can focus on business needs using familiar SQL and development tools.
- From a cost standpoint: BigQuery's pricing model is complex and unpredictable, Snowflake's can only be roughly estimated, and Redshift's is simple and predictable. You may want to estimate your cost and run a performance benchmark before selecting a solution.
- I like Snowflake's zero management for end users and how quickly a virtual warehouse can be started.
- I like AWS Redshift's Spectrum concept. Redshift also gives you a fine-grained level of control over your data warehouse setup and performance tuning.
- I'm impressed with BigQuery ML. BigQuery is completely serverless, with the in-memory BI Engine and machine learning built in.
So the right data warehouse decision should depend on your use cases, cost estimates, and performance benchmarks.