Apache Hadoop Ecosystem Cheat Sheet
Apache Hadoop 3.1.1 was released on August 8, 2018, with major changes to YARN, such as GPU and FPGA scheduling/isolation, Docker containers on YARN, and more expressive placement constraints. Apache Hadoop has been in development for well over a decade. The term “Hadoop” often refers to the Hadoop ecosystem: the collection of additional software packages that can be installed on top of or alongside Hadoop. With so many add-on libraries built on top of Apache Hadoop, the ecosystem can be a little overwhelming for a newcomer. You will feel like a zookeeper, surrounded by exotic animals (Pig, Hive, Phoenix, Impala) and funny names such as Oozie, Tez, and Sqoop. Therefore, I have made this cheat sheet to help you understand the technologies in the Apache Hadoop ecosystem. I will also write articles comparing different packages so you can easily select the right ones for your own Apache Hadoop ecosystem.
What is Apache Hadoop?
Apache Hadoop is an open-source software framework for storing data and running applications on clusters of commodity hardware. It provides massive distributed storage, enormous processing power through MapReduce, and the ability to handle a virtually limitless number of concurrent tasks or jobs.
The modules of Apache Hadoop include:
- Hadoop Common: The common utilities that support the other Hadoop modules.
- Hadoop Distributed File System (HDFS™): A distributed file system that provides high-throughput access to application data.
- Hadoop YARN: A framework for job scheduling and cluster resource management.
- Hadoop MapReduce: A YARN-based system for parallel processing of large data sets.
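To make the MapReduce model concrete, here is a minimal word-count sketch written for Hadoop Streaming, which lets you express the map and reduce steps as plain scripts that read stdin and write stdout. The file name and the `map`/`reduce` argument convention are my own illustration, not part of Hadoop itself:

```python
#!/usr/bin/env python3
# Minimal word count for Hadoop Streaming (illustrative sketch).
# Hadoop pipes input splits through stdin and collects stdout; run
# with the argument "map" for the map phase, "reduce" for the reduce phase.
import sys

def mapper():
    # Emit a "<word>\t1" pair for every word; Hadoop sorts these
    # pairs by key before the reduce phase begins.
    for line in sys.stdin:
        for word in line.split():
            print(f"{word}\t1")

def reducer():
    # Keys arrive sorted, so identical words are adjacent and can be
    # summed with a simple running counter.
    current, count = None, 0
    for line in sys.stdin:
        word, n = line.rsplit("\t", 1)
        if word != current and current is not None:
            print(f"{current}\t{count}")
            count = 0
        current = word
        count += int(n)
    if current is not None:
        print(f"{current}\t{count}")

if __name__ == "__main__":
    mapper() if sys.argv[1:] == ["map"] else reducer()
```

A job like this is typically submitted through the Hadoop Streaming JAR, passing the script as both the `-mapper` and `-reducer` arguments.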
NoSQL Database
- Apache HBase: A random, real-time read/write NoSQL database (wide column store) to access data in Hadoop. Apache HBase provides Bigtable-like capabilities on top of Hadoop and HDFS.
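As a quick illustration of HBase's random read/write model, here is a minimal sketch using the third-party happybase Python client. It assumes an HBase Thrift server is reachable and that a `users` table with an `info` column family already exists; all of those names are hypothetical:

```python
import happybase  # third-party client that talks to the HBase Thrift server

# Host, table, and column names below are illustrative assumptions.
connection = happybase.Connection("hbase-thrift-host", port=9090)
table = connection.table("users")

# Writes address individual cells by row key and column family:qualifier.
table.put(b"user-42", {b"info:name": b"Ada", b"info:city": b"London"})

# Reads are random access by row key, which is HBase's core strength.
row = table.row(b"user-42")
print(row[b"info:name"])  # b'Ada'
```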
Scripting
- Apache Hive: Facilitates reading, writing, and managing large datasets residing in distributed storage using SQL. Queries are written in a SQL-like language (HiveQL) that Hive compiles into MapReduce jobs (see the sketch after this list).
- Apache Pig: A platform for expressing data analysis programs in a high-level scripting language called Pig Latin. Like Hive, it compiles its scripts into MapReduce jobs, and it is commonly used for ETL.
- Apache Impala: An open-source massively parallel processing (MPP) SQL query engine for data stored in a computer cluster running Apache Hadoop. Unlike Hive, Impala does not translate queries into MapReduce jobs but executes them natively, which typically makes it faster than Hive for interactive queries.
- Apache Drill: An open-source software framework that supports data-intensive distributed applications for interactive analysis of large-scale datasets. It is a schema-free SQL query engine for Hadoop, NoSQL, and cloud storage.
- Apache Phoenix: An open-source, massively parallel relational database engine supporting OLTP for Hadoop, using Apache HBase as its backing store.
- Presto: An open source distributed SQL query engine for running interactive analytic queries against data sources of all sizes ranging from gigabytes to petabytes.
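To illustrate the Hive entry above, here is a minimal sketch using the third-party PyHive client against HiveServer2. The host, credentials, table, and column names are hypothetical; the point is that HiveQL reads like SQL while Hive turns it into distributed jobs:

```python
from pyhive import hive  # third-party client for HiveServer2

# Host, username, and the web_logs table are illustrative assumptions.
conn = hive.connect(host="hiveserver2-host", port=10000, username="analyst")
cursor = conn.cursor()

# HiveQL looks like SQL, but Hive compiles it into MapReduce
# (or Tez) jobs that run across the cluster.
cursor.execute(
    "SELECT page, COUNT(*) AS hits "
    "FROM web_logs GROUP BY page ORDER BY hits DESC LIMIT 10"
)
for page, hits in cursor.fetchall():
    print(page, hits)
```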
Data Serialization
- Apache Avro: A remote procedure call and data serialization framework developed within Apache’s Hadoop project. It uses JSON for defining data types and serializes data in a compact binary format.
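Here is a minimal, self-contained sketch of Avro's model using the fastavro library: the schema is plain JSON, and the data is stored in Avro's compact binary format. The record type and file name are my own examples:

```python
from fastavro import parse_schema, reader, writer

# Avro schemas are plain JSON; this record type is an illustrative example.
schema = parse_schema({
    "type": "record",
    "name": "User",
    "fields": [
        {"name": "name", "type": "string"},
        {"name": "age", "type": "int"},
    ],
})

records = [{"name": "Ada", "age": 36}, {"name": "Alan", "age": 41}]

# The writer embeds the schema in the file header and encodes the
# rows in Avro's compact binary format.
with open("users.avro", "wb") as out:
    writer(out, schema, records)

# A reader recovers both the schema and the records from the file.
with open("users.avro", "rb") as inp:
    for record in reader(inp):
        print(record)
```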
Workflows
- Apache Oozie: A workflow scheduler system that manages Apache Hadoop jobs. Oozie Workflow jobs are Directed Acyclic Graphs (DAGs) of actions. Oozie Coordinator jobs are recurrent Oozie Workflow jobs triggered by time (frequency) and data availability.
- Apache Tez: An application framework that allows a complex directed acyclic graph (DAG) of tasks for processing data. It is built atop Apache Hadoop YARN and is used in some cases as an alternative to Hadoop MapReduce.
- Apache Kafka: Provides a unified, high-throughput, low-latency platform for handling real-time data feeds as a scalable publish/subscribe message queue (see the sketch below).
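To show the pub/sub model Kafka provides, here is a minimal sketch using the third-party kafka-python client; the broker address, topic name, and message contents are illustrative assumptions:

```python
from kafka import KafkaConsumer, KafkaProducer  # kafka-python client

# Broker address and topic name are illustrative assumptions.
producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("clickstream", b'{"user": 42, "page": "/home"}')
producer.flush()  # block until the broker acknowledges the message

# A separate consumer subscribes to the same topic; Kafka decouples
# producers from consumers and replays messages from a stored offset.
consumer = KafkaConsumer(
    "clickstream",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    consumer_timeout_ms=5000,  # stop iterating once no new messages arrive
)
for message in consumer:
    print(message.value)
```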
Connectors
- Apache Flume: A distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data.
- Apache Sqoop: A tool designed for efficiently transferring bulk data (importing/exporting) between Apache Hadoop and structured datastores such as relational databases.
Data Processing
- Apache Flink: A framework and distributed processing engine for stateful computations over unbounded and bounded data streams. Flink is designed to run in all common cluster environments and to perform computations at in-memory speed and at any scale. It supports event-time semantics for out-of-order events, exactly-once semantics, backpressure control, and APIs optimized for writing both streaming and batch applications.
- Apache Spark: A powerful open-source unified analytics engine built around speed, ease of use, and streaming analytics. Spark runs on Hadoop, Mesos, Kubernetes, standalone, or in the cloud. It has the following components (a minimal PySpark sketch follows this list):
- Spark Core: Dispatching, scheduling, and basic I/O functionalities
- Spark SQL: A DSL (domain-specific language) for manipulating DataFrames. Thanks to in-memory computing, its performance can even exceed that of Apache Impala on some workloads.
- Spark Streaming: Fast stream processing via micro-batching
- MLlib: A scalable and easy-to-use machine learning library
- GraphX: A distributed graph processing framework
- Apache Storm: A real-time computation system designed to handle large streams of data within Hadoop. Through its Trident API, it can also do stateful stream processing with micro-batching.
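Here is the minimal PySpark sketch promised above: a DataFrame aggregation that runs the same way locally and on a cluster. The input path and its columns are hypothetical:

```python
from pyspark.sql import SparkSession, functions as F

# Entry point for the DataFrame API; the same code runs locally or on
# YARN, Mesos, and Kubernetes clusters.
spark = SparkSession.builder.appName("log-summary").getOrCreate()

# The input path and its columns (user, bytes) are illustrative assumptions.
logs = spark.read.json("hdfs:///data/access_logs.json")

# DataFrame operations are planned lazily and executed in memory across
# the cluster only when an action such as show() is triggered.
(logs.groupBy("user")
     .agg(F.sum("bytes").alias("total_bytes"))
     .orderBy(F.desc("total_bytes"))
     .show(10))

spark.stop()
```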
Machine Learning
- Apache Mahout: A distributed linear algebra framework and mathematically expressive Scala DSL (domain-specific language) designed to perform predictive analytics on Hadoop data.
- Apache MXNet: A deep learning framework designed for building neural networks and other deep learning applications. MXNet automates common workflows and optimizes numerical computations.
- Spark MLlib, Apache SAMOA (which runs on engines such as Storm), and FlinkML are further machine learning options within the ecosystem.
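As a taste of Spark MLlib, here is a minimal, self-contained classification sketch; the tiny inline dataset and column names exist only to keep the example runnable:

```python
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import VectorAssembler
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("mllib-sketch").getOrCreate()

# A tiny inline dataset (features x1, x2 and a binary label) keeps the
# sketch self-contained; a real job would read from HDFS or Hive.
df = spark.createDataFrame(
    [(0.0, 1.1, 0.0), (1.5, 0.3, 1.0), (2.2, 0.1, 1.0), (0.2, 2.4, 0.0)],
    ["x1", "x2", "label"],
)

# MLlib estimators expect the features collected into a single vector column.
assembler = VectorAssembler(inputCols=["x1", "x2"], outputCol="features")
train = assembler.transform(df)

# Fit a logistic regression model and inspect its predictions.
model = LogisticRegression(featuresCol="features", labelCol="label").fit(train)
model.transform(train).select("label", "prediction").show()

spark.stop()
```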
Coordination
- Apache Zookeeper: A centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services.
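A minimal sketch of what coordination looks like in practice, using the third-party kazoo Python client; the ensemble address, znode path, and stored value are illustrative:

```python
from kazoo.client import KazooClient  # third-party ZooKeeper client

# Ensemble address and the znode path below are illustrative assumptions.
zk = KazooClient(hosts="127.0.0.1:2181")
zk.start()

# znodes form a small, strongly consistent tree that distributed
# processes use to share configuration and coordinate with each other.
zk.ensure_path("/app/config")
zk.set("/app/config", b"max_workers=8")

data, stat = zk.get("/app/config")
print(data.decode(), "version:", stat.version)

zk.stop()
```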
Management and Monitoring
- HCatalog: A table and storage management layer for Hadoop. It allows you to access Hive metastore tables from Pig, Spark SQL, or custom MapReduce applications.
- Ganglia: A scalable distributed monitoring system for high-performance computing systems such as clusters and grids.
- Apache Ambari: A web-based tool for provisioning, managing, and monitoring Apache Hadoop clusters. Allows configuration and management of a Hadoop cluster from one central web UI.
- HUE (Hadoop User Experience): An open-source analytics workbench for browsing, querying, and visualizing data in a web UI. It is developed by Cloudera.
Interactive Notebooks
- Apache Zeppelin: A web-based notebook that enables data-driven, interactive data analytics and collaborative documents with SQL, Scala, and more.
- Jupyter Notebook: An open-source web application that you can use to create and share documents that contain live code, equations, visualizations, and narrative text.
Security
- Apache Ranger: A framework to enable, monitor, and manage comprehensive data security across the Hadoop platform.
- Apache Knox Gateway: A system that provides a single point of authentication and access for Apache Hadoop services in a cluster.
Conclusion
In conclusion, the open-source Apache Hadoop ecosystem provides many add-on libraries to support your projects. However, it can also be challenging and time-consuming to set up the system yourself. In follow-up articles, I will take a look at commercial Hadoop solutions and options for running Hadoop in the cloud.