“If I have seen further, it is by standing on the shoulders of giants.” — Isaac Newton

In the excitement surrounding LLMs, vector databases, and AI agents, it’s easy to forget that modern AI didn’t emerge from a vacuum. Today’s AI revolution stands upon decades of infrastructure work—distributed systems, data pipelines, search engines, and orchestration platforms that were built long before “AI Native” became a buzzword.
This post is a tribute to those traditional open source projects that became the invisible foundation of AI infrastructure. They’re not “AI projects” per se, but without them, the AI revolution as we know it wouldn’t exist.
The Evolution: From Big Data to AI
| Era | Focus | Core Technologies | AI Connection |
|---|---|---|---|
| 2000s | Web Search & Indexing | Lucene, Elasticsearch | Semantic search foundations |
| 2010s | Big Data & Distributed Computing | Hadoop, Spark, Kafka | Data processing at scale |
| 2010s | Cloud Native | Docker, Kubernetes | Model deployment platforms |
| 2010s | Stream Processing | Flink, Storm, Pulsar | Real-time ML inference |
| 2020s | AI Native | Transformers, Vector DBs | Built on everything above |
Big Data Frameworks: The Data Engines
Before we could train models on petabytes of data, we needed ways to store, process, and move that data.
Apache Hadoop (2006)
GitHub: https://github.com/apache/hadoop
Hadoop democratized big data by making distributed computing accessible. Its distributed filesystem (HDFS) and MapReduce programming model proved that clusters of commodity hardware could process web-scale datasets.
Why it matters for AI:
- Modern ML training datasets live in HDFS-compatible storage
- Data lakes built on Hadoop became training data reservoirs
- Proved that distributed computing could scale horizontally
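To make "HDFS-compatible storage" concrete, here is a minimal sketch of how a Python training job might read Parquet data straight off HDFS with PyArrow. The host, port, and dataset path are placeholders, and the connection assumes Hadoop client libraries (libhdfs) are installed:

```python
import pyarrow.dataset as ds
import pyarrow.fs as pafs

# Connect to the HDFS namenode (host/port are placeholders; requires libhdfs).
hdfs = pafs.HadoopFileSystem(host="namenode.example.com", port=8020)

# Training data typically lands in the lake as partitioned Parquet files.
dataset = ds.dataset("/data/training/events", format="parquet", filesystem=hdfs)

# Pull a small sample batch for inspection before launching a full job.
sample = dataset.head(1000)
print(sample.num_rows, sample.schema)
```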
Apache Kafka (2011)
GitHub: https://github.com/apache/kafka
Kafka redefined data streaming with its log-based architecture. It became the nervous system for real-time data flows in enterprises worldwide.
Why it matters for AI:
- Real-time feature pipelines for ML models
- Event-driven architectures for AI agent systems
- Streaming inference pipelines
- Model telemetry and monitoring backbones
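A hedged sketch of such a real-time feature pipeline, using the kafka-python client; the broker address, topic name, and event shape are all placeholders:

```python
import json
from kafka import KafkaConsumer, KafkaProducer

# Upstream: an application service emits raw click events to a topic.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send("click-events", {"user_id": 42, "url": "/pricing"})
producer.flush()

# Downstream: a feature pipeline consumes the stream and derives model features.
consumer = KafkaConsumer(
    "click-events",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)
for message in consumer:
    event = message.value
    print(f"user {event['user_id']} clicked {event['url']}")  # feed a feature store here
    break  # demo only: process one event and stop
```

The same producer/consumer pattern underlies streaming inference and model telemetry: the only thing that changes is what flows through the topic.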
Apache Spark (2014)
GitHub: https://github.com/apache/spark
Spark brought in-memory computing to big data, making iterative algorithms (like ML training) practical at scale.
Why it matters for AI:
- MLlib made ML accessible to data engineers
- Distributed data processing for model training
- Spark ML became the de facto standard for big data ML
- Proved that in-memory computing could accelerate ML workloads
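As a minimal sketch of what MLlib made routine, here is a PySpark training pipeline on toy in-memory data; a real job would read Parquet from the data lake instead:

```python
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import VectorAssembler

spark = SparkSession.builder.appName("mllib-demo").getOrCreate()

# Toy training set: (label, feature_1, feature_2).
df = spark.createDataFrame(
    [(0.0, 1.2, 0.3), (1.0, 3.4, 1.1), (0.0, 0.5, 0.2), (1.0, 2.9, 0.9)],
    ["label", "f1", "f2"],
)

# Assemble raw columns into a feature vector, then fit a classifier.
assembler = VectorAssembler(inputCols=["f1", "f2"], outputCol="features")
lr = LogisticRegression(maxIter=10)
model = Pipeline(stages=[assembler, lr]).fit(df)  # training distributes across the cluster
```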
Search Engines: The Retrieval Foundation
Before RAG (Retrieval-Augmented Generation) became a buzzword, search engines were solving retrieval at scale.
Elasticsearch (2010)
GitHub: https://github.com/elastic/elasticsearch
Elasticsearch made full-text search accessible and scalable. Its distributed architecture and RESTful API became the standard for search.
Why it matters for AI:
- Popularized the distributed inverted index (building on Lucene)
- Proved that horizontal scaling was possible for search workloads
- Many “AI search” systems actually use Elasticsearch under the hood
- Query DSL influenced modern vector database query languages
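For a sense of how much "AI search" is still classic retrieval, here is the BM25 keyword query a RAG system might run before re-ranking or handing passages to an LLM. This is a sketch in elasticsearch-py 8.x style; the index name and field are placeholders:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Classic BM25 full-text retrieval: the first stage of many RAG pipelines.
resp = es.search(
    index="docs",
    query={"match": {"body": "how do I rotate api keys"}},
    size=5,
)
for hit in resp["hits"]["hits"]:
    print(hit["_score"], hit["_source"]["body"][:80])
```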
OpenSearch (2021)
GitHub: https://github.com/opensearch-project/opensearch
When Elastic moved Elasticsearch off the Apache 2.0 license in 2021, AWS forked version 7.10 as OpenSearch, ensuring the search infrastructure remained truly open. OpenSearch continues the mission of accessible, scalable search.
Why it matters for AI:
- Maintains open source innovation in search
- Built-in k-NN vector search brings embedding retrieval to the same engine as full-text search
- Demonstrates community fork resilience
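A brief sketch of that k-NN search with opensearch-py. It assumes an index whose embedding field was created as a `knn_vector` with the k-NN plugin enabled; the index name, field name, and vector dimension are placeholders:

```python
from opensearchpy import OpenSearch

client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])

# Approximate nearest-neighbor query against a knn_vector field.
query_vector = [0.1] * 384  # e.g. a sentence-transformer embedding
resp = client.search(
    index="docs-vectors",
    body={
        "size": 5,
        "query": {"knn": {"embedding": {"vector": query_vector, "k": 5}}},
    },
)
for hit in resp["hits"]["hits"]:
    print(hit["_score"], hit["_id"])
```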
Databases: From SQL to Vectors
The evolution from relational databases to vector databases represents a paradigm shift—but both have AI relevance.
Traditional Databases That Paved the Way
- Dgraph (2015) - Graph database proving that specialized data structures enable new use cases
- TDengine (2019) - Time-series database for IoT ML workloads
- OceanBase (open-sourced 2021) - Distributed SQL database showing that ACID transactions could scale horizontally
Why they matter for AI:
- Proved that specialized database engines could outperform general-purpose ones
- Database internals (indexing, sharding, replication) are now applied to vector databases
- Multi-model databases (graph + vector + relational) are becoming the norm for AI apps
Cloud Native: The Runtime Foundation
When Docker and Kubernetes emerged, they weren’t built for AI—but AI couldn’t scale without them.
Docker (2013) & Kubernetes (2014)
GitHub: https://github.com/kubernetes/kubernetes
Docker made the container the standard unit of software packaging, and Kubernetes became the operating system for cloud-native applications. Its declarative API and controller pattern made it a natural fit for AI workloads.
Why it matters for AI:
- Model deployment platforms (KServe, Seldon Core) run on K8s
- GPU orchestration (NVIDIA GPU Operator, Volcano, HAMi) extends K8s
- Kubeflow made K8s the standard for ML pipelines
- Microservice patterns enable modular AI agent architectures
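To illustrate why K8s fits model serving, here is a sketch using the official Kubernetes Python client to declare a one-replica model server that requests a GPU. The image name and namespace are placeholders, and the GPU resource key assumes the NVIDIA device plugin is installed:

```python
from kubernetes import client, config

config.load_kube_config()  # or load_incluster_config() inside a cluster

# One container serving a model, requesting a single GPU from the scheduler.
container = client.V1Container(
    name="model-server",
    image="registry.example.com/llm-server:latest",  # placeholder image
    resources=client.V1ResourceRequirements(limits={"nvidia.com/gpu": "1"}),
)
template = client.V1PodTemplateSpec(
    metadata=client.V1ObjectMeta(labels={"app": "model-server"}),
    spec=client.V1PodSpec(containers=[container]),
)
deployment = client.V1Deployment(
    metadata=client.V1ObjectMeta(name="model-server"),
    spec=client.V1DeploymentSpec(
        replicas=1,
        selector=client.V1LabelSelector(match_labels={"app": "model-server"}),
        template=template,
    ),
)
client.AppsV1Api().create_namespaced_deployment(namespace="default", body=deployment)
```

The declarative part is the point: you state the desired state (one replica, one GPU) and controllers converge the cluster toward it, which is exactly what platforms like KServe build on.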
Service Mesh & Serverless
Istio (2017), Knative (2018) - Service mesh and serverless platforms that proved:
- Network-level observability applies to AI model calls
- Scale-to-zero is essential for cost-effective inference
- Traffic splitting enables A/B testing of ML models
Why they matter for AI:
- AI Gateway patterns evolved from API gateways + service mesh
- Serverless inference platforms use Knative-style autoscaling
- Observability patterns (tracing, metrics) are now standard for ML systems
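The traffic-splitting idea is simple enough to sketch in a few lines. Istio implements it at the mesh layer, but the same weighted-random routing is what canaries a new model version; the model names and weights here are made up:

```python
import random

def pick_backend(weights: dict[str, float]) -> str:
    """Weighted random choice: the core of canary / A-B traffic splitting."""
    backends = list(weights)
    return random.choices(backends, weights=[weights[b] for b in backends], k=1)[0]

# Send 90% of inference traffic to the stable model, 10% to the canary.
backend = pick_backend({"model-v1": 0.9, "model-v2-canary": 0.1})
print(f"routing request to {backend}")
```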
API Gateways: From REST to LLM
API gateways weren’t designed for AI, but they became the foundation of AI Gateway patterns.
Kong, APISIX, KGateway
These API gateways solved rate limiting, auth, and routing at scale. When LLMs emerged, the same patterns applied:
AI Gateway Evolution:

```text
Traditional API Gateway (2010s)
          ↓
Rate Limiting → Token Bucket Rate Limiting
Auth          → API Key + Organization Management
Routing       → Model Routing (GPT-4 → Claude → Local Models)
Observability → LLM-specific Telemetry (token usage, cost)
          ↓
AI Gateway (2024)
```
Why they matter for AI:
- Proved that centralized API management scales
- Plugin architectures enable LLM-specific features
- Traffic management patterns apply to prompt routing
- Security patterns (mTLS, JWT) now protect AI endpoints
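To show why the token bucket maps so naturally onto LLM traffic, here is a minimal sketch: instead of charging one token per request, the gateway charges by the number of LLM tokens a call consumes. The capacity and refill rate are arbitrary:

```python
import time

class TokenBucket:
    """Token bucket: holds up to `capacity` tokens, refilled at `rate` per second."""

    def __init__(self, capacity: int, rate: float):
        self.capacity = capacity
        self.rate = rate
        self.tokens = float(capacity)
        self.last = time.monotonic()

    def allow(self, cost: int = 1) -> bool:
        # Refill based on elapsed time, capped at capacity, then try to spend.
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False

# An AI gateway charges by LLM tokens consumed, not by request count.
bucket = TokenBucket(capacity=10_000, rate=100.0)  # budget: 100 LLM tokens/sec
if bucket.allow(cost=512):  # a completion estimated at 512 tokens
    print("forward to model backend")
else:
    print("429 Too Many Requests")
```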
Workflow Orchestration: The Pipeline Backbone
Data engineering needs pipelines. ML engineering needs pipelines. AI agents need workflows.
Apache Airflow (2015)
GitHub: https://github.com/apache/airflow
Airflow made pipeline orchestration accessible with its DAG-based approach. It became the standard for ETL and data engineering.
Why it matters for AI:
- ML pipeline orchestration (feature engineering, training, evaluation)
- Proved that DAG-based workflow definition works at scale
- Prompt engineering pipelines use Airflow-style orchestration
- Scheduler patterns are now applied to AI agent workflows
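A minimal sketch of the DAG style that made Airflow the standard, in Airflow 2.x syntax; the task bodies are stubs standing in for real pipeline steps:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract_features():
    print("pull raw events, compute features")

def train_model():
    print("fit the model on the feature set")

def evaluate_model():
    print("score the model, gate the deployment")

with DAG(
    dag_id="ml_training_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",  # `schedule_interval` on older Airflow releases
    catchup=False,
) as dag:
    extract = PythonOperator(task_id="extract_features", python_callable=extract_features)
    train = PythonOperator(task_id="train_model", python_callable=train_model)
    evaluate = PythonOperator(task_id="evaluate_model", python_callable=evaluate_model)

    extract >> train >> evaluate  # features, then training, then evaluation
```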
n8n, Prefect, Flyte
Modern workflow platforms that evolved from Airflow’s foundations:
- n8n (2019) - Visual workflow automation with AI capabilities
- Prefect (2018) - Python-native workflow orchestration for ML
- Flyte (2019) - Kubernetes-native workflow orchestration for ML/data
Why they matter for AI:
- Multi-modal agents need workflow orchestration
- RAG pipelines are essentially ETL pipelines for embeddings
- Prompt chaining is DAG-based orchestration
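The "RAG pipelines are ETL" point can be sketched in a few lines of Prefect 2.x; every task here is a stub standing in for a real document loader, embedding model, and vector store:

```python
from prefect import flow, task

@task
def load_chunks(path: str) -> list[str]:
    return ["chunk one of the doc", "chunk two of the doc"]  # stub loader/splitter

@task
def embed(chunks: list[str]) -> list[list[float]]:
    return [[0.1, 0.2, 0.3] for _ in chunks]  # stub embedding model

@task
def upsert(chunks: list[str], vectors: list[list[float]]) -> None:
    print(f"upserting {len(vectors)} vectors")  # stub vector-store write

@flow
def rag_ingestion(path: str) -> None:
    # Extract -> Transform -> Load, with embedding as the transform step.
    chunks = load_chunks(path)
    vectors = embed(chunks)
    upsert(chunks, vectors)

if __name__ == "__main__":
    rag_ingestion("docs/handbook.md")
```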
Data Formats: The Lakehouse Foundation
Before we could train on massive datasets, we needed formats that supported ACID transactions and schema evolution.
Delta Lake, Apache Iceberg, Apache Hudi
These table formats brought transactional reliability and version history to data lakes.
Why they matter for AI:
- Training datasets need versioning and reproducibility
- Feature stores use Delta/Iceberg as storage formats
- Proved that “big data” could have transactional semantics
- Schema evolution handles ML feature drift
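Dataset versioning is easiest to see in code. This sketch uses the `deltalake` Python package; the table path and version number are placeholders:

```python
import pandas as pd
from deltalake import DeltaTable, write_deltalake

# Appending a batch of features creates a new, immutable table version.
batch = pd.DataFrame({"user_id": [1, 2], "clicks_7d": [14, 3]})
write_deltalake("s3://bucket/features/user_stats", batch, mode="append")

# Reproduce a past training run by pinning the exact version it read ("time travel").
snapshot = DeltaTable("s3://bucket/features/user_stats", version=42)
train_df = snapshot.to_pandas()
print(len(train_df), "rows at version 42")
```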
The Invisible Thread: Why These Projects Matter
What do all these projects have in common?
- They solved scaling first - AI training/inference needs horizontal scaling
- They proved distributed systems work - Modern AI is fundamentally distributed
- They created ecosystem patterns - Plugin systems, extension points, APIs
- They established best practices - Observability, security, CI/CD
- They built developer habits - YAML configs, declarative APIs, CLI tools
The AI Native Continuum
Modern “AI Native” infrastructure didn’t replace these projects—it builds on them:
| Traditional Project | AI Native Evolution | Example |
|---|---|---|
| Hadoop HDFS | Distributed model storage | HDFS for datasets, S3 for checkpoints |
| Kafka | Real-time feature pipelines | Kafka → Feature Store → Model Serving |
| Spark ML | Distributed ML training | MLlib → PyTorch Distributed |
| Elasticsearch | Vector search | ES → Weaviate/Qdrant/Milvus |
| Kubernetes | ML orchestration | K8s → Kubeflow/KServe |
| Istio | AI Gateway service mesh | Istio → LLM Gateway with mTLS |
| Airflow | ML pipeline orchestration | Airflow → Prefect/Flyte for ML |
Why We’re Removing Them from Our AI Resources List
This post honors these projects, but we’re also removing them from our AI Resources list. Here’s why:
They’re not “AI Projects”—they’re foundational infrastructure.
- Hadoop, Kafka, Spark are data engineering tools, not ML frameworks
- Elasticsearch is search, not semantic search
- Kubernetes is general-purpose orchestration
- API gateways serve REST/GraphQL, not just LLMs
But their absence doesn’t diminish their importance.
By removing them, we acknowledge that:
- AI has its own ecosystem - Transformers, vector DBs, LLM ops
- Traditional infra has its own domain - Data engineering, cloud native
- The intersection is where innovation happens - AI-native data platforms, LLM ops on K8s
The Giants We Stand On
The next time you:
- Deploy a model on Kubernetes
- Stream features through Kafka
- Search embeddings with a vector database
- Orchestrate a RAG pipeline with Prefect
Remember: You’re standing on the shoulders of Hadoop, Kafka, Elasticsearch, Kubernetes, and countless others. They built the roads we now drive on.
The Future: Building New Giants
Just as Hadoop and Kafka enabled modern AI, today’s AI infrastructure will become tomorrow’s foundation:
- Vector databases may become the new standard for all search
- LLM observability may evolve into general distributed tracing
- AI agent orchestration may reinvent workflow automation
- GPU scheduling may influence general-purpose resource management
The cycle continues. The giants of today will be the foundations of tomorrow.
Conclusion: Gratitude and Continuity
As we clean up our AI Resources list to focus on AI-native projects, we don’t forget where we came from. Traditional big data and cloud native infrastructure made the AI revolution possible.
To the Hadoop committers, Kafka maintainers, Kubernetes contributors, and all who built the foundation: Thank you.
Your work enabled ChatGPT, enabled Transformers, enabled everything we now call “AI.”
Standing on your shoulders, we see further.
Acknowledgments: This post was inspired by the need to refactor our AI Resources list. The 27 projects mentioned here are being removed—not because they’re unimportant, but because they deserve their own category: The Foundation.
