Apache Hudi is a storage and incremental processing framework for big data that brings upserts, deletes, and incremental consumption to data lakes, supporting both near-real-time and batch analytics. It combines indexing, write-path optimizations, and version management to reduce processing latency and improve storage efficiency.
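For context, the sketches later in this document assume a Spark session configured for Hudi roughly as follows; this is a minimal sketch, and the bundle coordinates and version are assumptions that should be matched to your Spark and Scala versions.

```python
# Minimal PySpark session with Hudi support.
# NOTE: the bundle version below is an assumption; use the artifact
# matching your Spark and Scala versions.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("hudi-overview")
    # Hudi's Spark bundle (illustrative version; adjust to your setup).
    .config("spark.jars.packages",
            "org.apache.hudi:hudi-spark3.4-bundle_2.12:0.14.1")
    # Hudi requires Kryo serialization and its Spark SQL extension.
    .config("spark.serializer",
            "org.apache.spark.serializer.KryoSerializer")
    .config("spark.sql.extensions",
            "org.apache.spark.sql.hudi.HoodieSparkSessionExtension")
    .getOrCreate()
)
```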
Key features
- Incremental processing and upserts: Merge incoming writes to handle change data capture (CDC) and mutable datasets (see the write sketch after this list).
- Versioning and indexing: Built-in index and metadata management for efficient reads and time-travel queries.
- Multi-engine compatibility: Works with Spark, Flink, Presto, Trino, Hive, and other analytics engines.
- Production-ready ecosystem: Documentation, tooling, and established deployment patterns for running Hudi in production.
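As a concrete illustration of the upsert path, here is a minimal PySpark sketch using Hudi's Spark datasource; the table name, the record-key and precombine fields (`uuid`, `ts`), and the target path are hypothetical placeholders.

```python
# Sketch of a Hudi upsert via the Spark DataFrame API.
# Field names ("uuid", "ts") and the target path are hypothetical.
updates_df = spark.createDataFrame(
    [("id-1", "2024-01-01 00:00:00", 42)],
    ["uuid", "ts", "value"],
)

(
    updates_df.write.format("hudi")
    # Record key: uniquely identifies each row for upsert matching.
    .option("hoodie.datasource.write.recordkey.field", "uuid")
    # Precombine field: when two writes share a key, the row with the
    # larger value here wins (typically an event timestamp).
    .option("hoodie.datasource.write.precombine.field", "ts")
    .option("hoodie.datasource.write.operation", "upsert")
    .option("hoodie.table.name", "demo_table")
    .mode("append")  # "append" triggers upserts, not duplicate rows
    .save("/tmp/hudi/demo_table")
)
```

Rows whose record key already exists in the table are updated in place; new keys are inserted, which is what distinguishes the upsert operation from a plain append.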
Use cases
- Real-time ETL and CDC: Use Hudi as a storage layer that enables near-real-time synchronization and analytics over changing upstream data (see the incremental-read sketch after this list).
- Data governance and compliance: Track data history and metadata for auditability.
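To make CDC-style consumption concrete, the sketch below issues an incremental query that returns only records committed after a given instant; the begin instant time and path are placeholders continuing the earlier example.

```python
# Sketch of an incremental read: fetch only records written after a
# given commit instant. The begin instant below is a placeholder.
incremental_df = (
    spark.read.format("hudi")
    .option("hoodie.datasource.query.type", "incremental")
    # Only commits strictly after this instant time are returned.
    .option("hoodie.datasource.read.begin.instanttime", "20240101000000")
    .load("/tmp/hudi/demo_table")
)
incremental_df.show()
```

A downstream job can persist the latest commit instant it has processed and pass it as the begin instant on the next run, turning the table into an incremental change feed.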
Technical highlights
- Focused on write-optimized strategies and indexing to reduce write amplification and improve read performance (see the configuration sketch after this list).
- Tooling to support migration from traditional data lakes or warehouses to Hudi.
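As one illustration of those write-side knobs, the options below combine a table type, an index type, and an inline-compaction setting; the specific values are assumptions to validate against the Hudi configuration reference, and the sketch reuses the `spark` session and `updates_df` from the earlier examples.

```python
# Commonly tuned write-path options (values here are illustrative).
write_opts = {
    # MERGE_ON_READ buffers updates in log files to cut write
    # amplification; COPY_ON_WRITE rewrites base files for faster reads.
    "hoodie.datasource.write.table.type": "MERGE_ON_READ",
    # Index type controls how incoming keys are matched to existing files.
    "hoodie.index.type": "BLOOM",
    # Run compaction inline, after a number of delta commits
    # (the threshold below is an illustrative value).
    "hoodie.compact.inline": "true",
    "hoodie.compact.inline.max.delta.commits": "5",
}

(
    updates_df.write.format("hudi")
    .options(**write_opts)
    .option("hoodie.datasource.write.recordkey.field", "uuid")
    .option("hoodie.datasource.write.precombine.field", "ts")
    .option("hoodie.table.name", "demo_table")
    .mode("append")
    .save("/tmp/hudi/demo_table")
)
```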