
Apache Hadoop

An open-source framework for reliable, scalable distributed computing and storage for big data.

Detailed Introduction

Apache Hadoop is an open-source framework for large-scale data processing that provides reliable distributed storage (HDFS) and resource management (YARN), along with a parallel computing model typified by MapReduce. Designed to scale across thousands of commodity nodes, Hadoop emphasizes fault tolerance and data locality, and is widely used as the underlying infrastructure for data lakes and batch processing.
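
To make the programming model concrete, the following is a minimal sketch of the classic word-count job written against the org.apache.hadoop.mapreduce API (Hadoop 2.x/3.x). The input and output paths are passed as command-line arguments and are assumed to be HDFS directories.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: emit (word, 1) for every token in the input split.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce phase: sum the counts emitted for each word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);   // local pre-aggregation before shuffle
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output directory
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Map and reduce tasks are scheduled close to the data they read (data locality), and the combiner reuses the reducer class to pre-aggregate counts on each node before the shuffle.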

Main Features

Core Hadoop modules include Hadoop Common, HDFS, YARN, and MapReduce. Key features include a distributed file system with replication for reliability; YARN-based resource scheduling for flexible workload management; a modular ecosystem (Hive, HBase, Ozone, etc.) for building data platforms; and robust support for large-scale batch processing and storage.
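
As an illustration of the HDFS client API, here is a minimal sketch that writes a file, reads it back, and inspects the replication factor reported by the NameNode. The NameNode address and file path are placeholders, and the code assumes the hadoop-client dependency is on the classpath.

```java
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("fs.defaultFS", "hdfs://namenode:8020"); // placeholder NameNode address
    try (FileSystem fs = FileSystem.get(conf)) {
      Path path = new Path("/tmp/hello.txt");

      // Write: the client streams blocks to DataNodes; each block is
      // replicated according to dfs.replication (3 by default).
      try (FSDataOutputStream out = fs.create(path, true)) {
        out.write("hello hdfs\n".getBytes(StandardCharsets.UTF_8));
      }

      // Read the file back and copy its contents to stdout.
      try (FSDataInputStream in = fs.open(path)) {
        IOUtils.copyBytes(in, System.out, 4096, false);
      }

      // Inspect replication and block size as seen by the NameNode.
      FileStatus status = fs.getFileStatus(path);
      System.out.println("replication=" + status.getReplication()
          + " blockSize=" + status.getBlockSize());
    }
  }
}
```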

Use Cases

Hadoop fits scenarios that require massive data storage and offline processing, such as data warehouse backends, large ETL pipelines, offline feature engineering, log and archive analysis, and serving as part of a data lake consumed by analytics and ML tools. Organizations often pair Hadoop with Spark, Hive, and other ecosystem tools to build a complete platform.

Technical Features

Hadoop focuses on scalability and fault tolerance: HDFS replication and NameNode/journal mechanisms ensure data durability; YARN separates resource management from execution, enabling multiple processing frameworks to run concurrently; the project’s modular architecture and mature community support long-term maintenance and ecosystem growth.
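
For a concrete view of YARN as a shared resource layer, the sketch below lists every application known to the ResourceManager, regardless of which framework submitted it. It assumes a reachable cluster configured through a yarn-site.xml on the classpath.

```java
import java.util.List;

import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class ListYarnApps {
  public static void main(String[] args) throws Exception {
    YarnClient yarnClient = YarnClient.createYarnClient();
    yarnClient.init(new YarnConfiguration()); // reads yarn-site.xml from the classpath
    yarnClient.start();
    try {
      // The ResourceManager tracks applications from every framework
      // running on the cluster (MapReduce, Spark, Tez, ...).
      List<ApplicationReport> apps = yarnClient.getApplications();
      for (ApplicationReport app : apps) {
        System.out.println(app.getApplicationId() + "\t"
            + app.getApplicationType() + "\t"
            + app.getYarnApplicationState());
      }
    } finally {
      yarnClient.stop();
    }
  }
}
```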

Resource Info
🏗️ Framework 💾 Data 🌱 Open Source