Which Is the Right Hadoop Solution for You?
As I mentioned in the Hadoop ecosystem cheat sheet, the Hadoop ecosystem is open source and has plenty of add-on packages, so you can build your own Hadoop system from these free resources. However, installing and setting up such a system is challenging and time-consuming, so choosing the right Hadoop solution matters a great deal to your enterprise. There are two categories of Hadoop solutions: Hadoop distributions and Hadoop cloud services. In the first section, you will look at three of the most popular Hadoop distributions: Cloudera, Hortonworks, and MapR. In the second section, you will look at Hadoop on three cloud providers: Amazon Web Services (AWS), Microsoft Azure, and Google Cloud Platform (GCP).
Hadoop Distributions
Cloudera
Cloudera was the first company to release a commercial Hadoop distribution and continues to be a leader in the industry. Cloudera offers software, services, and support in five bundles, available both on-premises and across multiple cloud providers:
- Enterprise Data Hub: Cloudera’s comprehensive data management platform includes Data Science & Engineering, Operational DB, Data Warehouse, and Cloudera Essentials. The annual fee is $10,000.
- Data Warehouse: High-performance data warehouse for BI and SQL analytics built on the core Cloudera Essentials platform.
- Operational DB: Real-time data serving at scale for relational or NoSQL workloads and for structured or unstructured data, built on the core Cloudera Essentials platform. The annual fee is $8,000.
- Data Science and Engineering: Accelerates exploratory big data processing (ETL), data science, and machine learning on top of the core Cloudera Essentials platform. The annual fee is $6,000.
- Cloudera Essentials: Cloudera’s enterprise-ready management capabilities (Cloudera Manager) and open source platform distribution (CDH). CDH is the most popular Hadoop distribution with 100% open source components. The annual fee is $2,000.
Check Cloudera’s website for the latest features and pricing of each product. If you want to get hands-on with CDH, you can download the Cloudera QuickStart VM. Cloudera also offers a free training course on Cloudera Essentials.
Cloudera also offers a managed service in the cloud:
- Altus Data Engineering: Provides a cloud-native offering of Cloudera Data Engineering. You can deploy Cloudera on all major cloud providers. For example, the hourly charge is $0.08 on an AWS m4.xlarge instance. Check the hourly rate list for other cloud providers and instance types.
Hortonworks
The Hortonworks Data Platform (HDP) is open-source software and was the first Hadoop distribution to support Windows. HDP enables the creation of a secure enterprise data lake and delivers the analytics you need to innovate faster and power real-time business insights. Here is the HDP Hybrid Architecture from its website:
In June 2018, Hortonworks expanded its partnerships with the major cloud providers. Hortonworks release 3.0 has three products: HDP, HDF (Hortonworks DataFlow), and DPS (Hortonworks DataPlane Service). These products are now available on Azure as well as Amazon Web Services (AWS) and the Google Cloud Platform (GCP). Furthermore, a brand-new service, IBM Hosted Analytics with Hortonworks (IHAH), combines HDP, IBM’s Db2 Big SQL, and the IBM Data Science Experience, an AI-oriented offering. For hands-on experience, see the Hortonworks tutorial on HDP.
MapR
MapR positions itself on performance. It claims world-record benchmark results and states that MapR-DB is 4-7x faster than HBase on other distributions. In addition, the DirectShuffle technology leverages the performance advantages of MapR-FS to deliver strong cluster performance, and Direct Access NFS simplifies data ingestion and access. Here is the MapR Converged Data Platform (CDP) architecture from its website:
You can deploy MapR on Amazon AWS, Microsoft Azure, and Google Compute Engine. You can estimate the total cost of ownership with its TCO calculator. MapR offers CDP’s Converged Community Edition for free, so you can try it before you buy it.
Hadoop Cloud Providers
The following cloud providers offer fully managed Hadoop cloud services that make it easy, fast, and cost-effective to process massive amounts of data:
AWS Elastic MapReduce (EMR)
Amazon EMR provides a managed Hadoop framework as a web service. It processes data across dynamically scalable Amazon EC2 instances. You can choose Amazon S3 (EMRFS), the Hadoop Distributed File System (HDFS), or Amazon DynamoDB as the data store. EMR can run popular frameworks such as Apache Spark, HBase, Presto, HUE, Flink, and more, and it supports more Hadoop ecosystem frameworks than the Azure and GCP offerings. Hourly prices range from $0.011/hour to $0.27/hour (roughly $94/year to $2,367/year), plus the cost of EC2, EBS, and S3. Choosing reserved or spot EC2 instances can save you money. To find the total cost of EMR, use the AWS Calculator. EMR is not free under the AWS Free Tier.
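The hourly-to-yearly conversion above can be reproduced with a few lines of Python. This is a minimal sketch that assumes one instance running 24/7 for a 365-day year; the function name `annual_cost` and the constant `HOURS_PER_YEAR` are illustrative and not part of any AWS tool. Under that assumption the results land within a few dollars of the yearly figures quoted above, and the small gap presumably comes from rounding or a different hour count.

```python
# Assumption: continuous usage over a non-leap year (24 * 365 = 8,760 hours).
HOURS_PER_YEAR = 24 * 365


def annual_cost(hourly_rate, hours=HOURS_PER_YEAR):
    """Return the yearly charge for one instance billed at hourly_rate."""
    return hourly_rate * hours


# EMR hourly price range quoted in the article.
# This covers the EMR charge only; EC2, EBS, and S3 are billed separately.
emr_low, emr_high = 0.011, 0.27

for rate in (emr_low, emr_high):
    print(f"${rate:.3f}/hour -> ${annual_cost(rate):,.2f}/year")
```

Passing a smaller `hours` value, such as the number of hours a nightly batch cluster actually runs, gives a more realistic estimate for clusters that are not always on.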
Azure HDInsight
Azure HDInsight enables a broad range of scenarios such as ETL, data warehousing, machine learning, IoT, and more. It supports popular open-source frameworks such as Hadoop, Spark, Hive, LLAP, Kafka, Storm, R, and more. You can choose Azure Blob storage instead of HDFS. Azure uses its own Azure Data Lake platform and its own cloud security framework. Hourly prices range from $0.074/hour to $1.496/hour. Microsoft offers a one-month free trial on Azure, so you can test a Hadoop cluster on HDInsight.
GCP Dataproc
Cloud Dataproc is a managed Spark and Hadoop service offered on the Google Cloud Platform. Dataproc uses Google Compute Engine (GCE) to process data and integrates with Cloud Storage, BigQuery, Bigtable, Stackdriver Logging, and Stackdriver Monitoring. However, Dataproc itself includes only Apache Hadoop, Spark, Hive, and Pig. Hourly prices range from $0.001/hour to $0.640/hour. Google also offers a Free Tier (a 12-month free trial with $300 in credit) that lets you use any GCP product, so you can take this offer to try Dataproc and BigQuery.
Conclusion
In conclusion, you should consider the components, deployment model, performance, security, and cost holistically to choose the right Hadoop solution. Cloudera has the largest user base, and Cloudera Enterprise Data Hub is a comprehensive data management platform with everything you need. But do you genuinely need every component? If your organization doesn’t have many Hadoop experts, choosing a fully managed Hadoop cloud service will let you focus on development. If you need fast performance, you may want to take a look at MapR; if you want low cost, Hortonworks is an excellent choice. Each Hadoop solution takes a different approach to authentication, security policy management, and data encryption, so you should review how each one addresses those needs against your auditing policy and protection requirements. Also consider designing a hybrid Hadoop solution that keeps some jobs on-premises and runs noncritical jobs in the cloud (e.g., on AWS spot instances) to save cost. Finally, if your data is below 1.6 PB, you may want to take a look at the Amazon Redshift data warehouse option.
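The rules of thumb in this conclusion can be written down as a small decision helper. This is only an illustrative sketch: the function `suggest_hadoop_solution`, its three boolean inputs, the order in which the rules are checked, and the Cloudera fallback are all assumptions made for the example, and a real evaluation should also weigh security, components, and the deployment model.

```python
def suggest_hadoop_solution(has_hadoop_experts, needs_top_performance, cost_sensitive):
    """Map the article's rules of thumb to a suggestion (illustrative only)."""
    # Assumption: lack of in-house expertise outweighs the other criteria.
    if not has_hadoop_experts:
        return "Fully managed cloud service (EMR, HDInsight, or Dataproc)"
    if needs_top_performance:
        return "MapR"
    if cost_sensitive:
        return "Hortonworks"
    # Assumption: fall back to the most comprehensive platform.
    return "Cloudera Enterprise Data Hub"


# Example: a team with Hadoop experts that cares most about cost.
print(suggest_hadoop_solution(has_hadoop_experts=True,
                              needs_top_performance=False,
                              cost_sensitive=True))
```

Reordering the checks changes the outcome when several criteria apply at once, so the priority order should reflect what matters most to your organization.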