Apache Hudi

A storage and incremental processing framework for big data that supports upserts, deletes, and incremental consumption for both real-time and batch analytics.

Apache Hudi layers indexing, write-path optimizations, and version management (a timeline of commits) on top of data lake storage, reducing processing latency and improving storage efficiency. A minimal upsert sketch follows.
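As a rough, hedged illustration of an upsert, here is a minimal PySpark sketch; the table name (`events`), path, and schema are hypothetical placeholders, while the configuration keys are Hudi's standard Spark datasource write options.

```python
# Minimal PySpark sketch of a Hudi upsert; table name, path, and schema
# are hypothetical. Requires the Hudi Spark bundle on the classpath.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("hudi-upsert-sketch")
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .getOrCreate()
)

# Hypothetical CDC batch: one updated row keyed by `uuid`.
updates = spark.createDataFrame(
    [("id-1", "2024-01-02 00:00:00", "us-east", 42.0)],
    ["uuid", "ts", "region", "value"],
)

hudi_options = {
    "hoodie.table.name": "events",                        # target table
    "hoodie.datasource.write.recordkey.field": "uuid",    # record key
    "hoodie.datasource.write.partitionpath.field": "region",
    "hoodie.datasource.write.precombine.field": "ts",     # latest version wins
    "hoodie.datasource.write.operation": "upsert",        # merge, not append
}

# Rows whose key already exists are updated in place; new keys are inserted.
(updates.write.format("hudi")
    .options(**hudi_options)
    .mode("append")
    .save("/tmp/hudi/events"))
```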

Key features

  • Incremental processing and upserts: Merge-style writes for handling change data capture (CDC) and mutable datasets.
  • Versioning and indexing: Built-in indexes and a commit timeline enable efficient reads, incremental pulls, and time-travel queries (see the sketch after this list).
  • Multi-engine compatibility: Works with Spark, Flink, Presto, Trino, Hive, and other analytics engines.
  • Production-ready ecosystem: Documentation, tooling, and deployment patterns proven in production environments.
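As a hedged sketch of incremental consumption, the following reads only records committed after a given instant. The path and begin instant are placeholders (it reuses the SparkSession from the upsert sketch above), and the option names follow Hudi's Spark datasource read options.

```python
# Sketch of an incremental read: pull only records committed after a
# given instant. Path and begin instant are hypothetical placeholders.
incremental_options = {
    "hoodie.datasource.query.type": "incremental",
    # Commit instants are timestamps like "20240102000000000"; pass the
    # last instant your pipeline has already consumed.
    "hoodie.datasource.read.begin.instanttime": "20240101000000000",
}

changes = (
    spark.read.format("hudi")       # `spark` from the upsert sketch above
    .options(**incremental_options)
    .load("/tmp/hudi/events")
)
changes.show()  # only rows written after the begin instant
```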

Use cases

  • Real-time ETL and CDC: A data layer that synchronizes upstream changes into the lake in near real time for analytics.
  • Data governance and compliance: Track data history and metadata for auditability, and apply record-level deletes for compliance requests (see the sketch after this list).
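As an illustrative sketch of a record-level delete (e.g., for a compliance request), the following tombstones matching keys via Hudi's `delete` write operation; keys, path, and schema remain the hypothetical placeholders used above.

```python
# Hedged sketch of record-level deletes. Only the record key (and the
# partition path) matter; other column values are ignored on delete.
to_delete = spark.createDataFrame(
    [("id-1", "2024-01-03 00:00:00", "us-east", 0.0)],
    ["uuid", "ts", "region", "value"],
)

delete_options = {
    "hoodie.table.name": "events",
    "hoodie.datasource.write.recordkey.field": "uuid",
    "hoodie.datasource.write.partitionpath.field": "region",
    "hoodie.datasource.write.precombine.field": "ts",
    "hoodie.datasource.write.operation": "delete",  # tombstone matching keys
}

(to_delete.write.format("hudi")
    .options(**delete_options)
    .mode("append")
    .save("/tmp/hudi/events"))
```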

Technical highlights

  • Write-optimized strategies and indexing reduce write amplification and improve read performance; the copy-on-write and merge-on-read table types let you trade write cost against query latency (a table-type sketch follows this list).
  • Tooling supports migrating existing data lake or warehouse tables onto Hudi.
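As a hedged example of that trade-off, the table type can be chosen at write time via Hudi's standard table-type option; the table, path, and the `updates` DataFrame are the placeholders from the upsert sketch above.

```python
# Choosing a table type (hypothetical table/path, reusing `updates` above).
# COPY_ON_WRITE rewrites file slices on update: cheaper reads, costlier writes.
# MERGE_ON_READ appends delta logs and merges on query: cheaper writes.
mor_options = {
    "hoodie.table.name": "events_mor",
    "hoodie.datasource.write.table.type": "MERGE_ON_READ",
    "hoodie.datasource.write.recordkey.field": "uuid",
    "hoodie.datasource.write.partitionpath.field": "region",
    "hoodie.datasource.write.precombine.field": "ts",
    "hoodie.datasource.write.operation": "upsert",
}

(updates.write.format("hudi")
    .options(**mor_options)
    .mode("append")
    .save("/tmp/hudi/events_mor"))
```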

Resource Info
🌱 Open Source 💾 Data 🔗 Connector