Standing on Giants' Shoulders: The Traditional Infrastructure Powering Modern AI

Before ChatGPT and TensorFlow, there were Hadoop, Kafka, and Kubernetes. This post honors the traditional open source infrastructure that became the foundation of today’s AI revolution.

“If I have seen further, it is by standing on the shoulders of giants.” — Isaac Newton

Figure 1: Standing on Giants’ Shoulders: The Traditional Infrastructure Powering Modern AI

In the excitement surrounding LLMs, vector databases, and AI agents, it’s easy to forget that modern AI didn’t emerge from a vacuum. Today’s AI revolution stands upon decades of infrastructure work—distributed systems, data pipelines, search engines, and orchestration platforms that were built long before “AI Native” became a buzzword.

This post is a tribute to those traditional open source projects that became the invisible foundation of AI infrastructure. They’re not “AI projects” per se, but without them, the AI revolution as we know it wouldn’t exist.

The Evolution: From Big Data to AI

| Era | Focus | Core Technologies | AI Connection |
|-------|----------------------------------|--------------------------|------------------------------|
| 2000s | Web Search & Indexing | Lucene, Elasticsearch | Semantic search foundations |
| 2010s | Big Data & Distributed Computing | Hadoop, Spark, Kafka | Data processing at scale |
| 2010s | Cloud Native | Docker, Kubernetes | Model deployment platforms |
| 2010s | Stream Processing | Flink, Storm, Pulsar | Real-time ML inference |
| 2020s | AI Native | Transformers, Vector DBs | Built on everything above |

Table 1: Evolution of Data Infrastructure

Big Data Frameworks: The Data Engines

Before we could train models on petabytes of data, we needed ways to store, process, and move that data.

Apache Hadoop (2006)

GitHub: https://github.com/apache/hadoop

Hadoop democratized big data by making distributed computing accessible. Its distributed file system (HDFS) and MapReduce programming model proved that commodity hardware could process web-scale datasets.
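The core idea is simple enough to sketch in plain Python: a map phase emits key-value pairs and a reduce phase aggregates them by key. This toy word count illustrates the paradigm, not Hadoop’s actual API; the real framework adds distribution, fault tolerance, and disk-backed shuffles.

```python
from collections import defaultdict

# Toy MapReduce word count: map emits (word, 1) pairs,
# an implicit shuffle groups them by key, reduce sums the counts.
def map_phase(lines):
    for line in lines:
        for word in line.split():
            yield word.lower(), 1

def reduce_phase(pairs):
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

docs = ["the quick brown fox", "the lazy dog"]
print(reduce_phase(map_phase(docs)))  # {'the': 2, 'quick': 1, ...}
```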

Why it matters for AI:

  • Modern ML training datasets live in HDFS-compatible storage
  • Data lakes built on Hadoop became training data reservoirs
  • Proved that distributed computing could scale horizontally

Apache Kafka (2011)

GitHub: https://github.com/apache/kafka

Kafka redefined data streaming with its log-based architecture. It became the nervous system for real-time data flows in enterprises worldwide.
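A minimal sketch of that nervous-system role: publishing a feature event to a topic with the confluent-kafka client. The broker address, topic name, and event shape are illustrative assumptions, not a prescribed schema.

```python
import json
from confluent_kafka import Producer  # pip install confluent-kafka

# Assumes a local broker; production systems point at a cluster.
producer = Producer({"bootstrap.servers": "localhost:9092"})

# Hypothetical real-time feature event for a downstream ML model.
event = {"user_id": "u42", "clicks_last_5m": 3, "session_seconds": 181}

# Kafka appends each message to a partitioned, replicated log;
# any number of consumers can replay it independently.
producer.produce("user-features", key="u42", value=json.dumps(event))
producer.flush()  # block until the broker acknowledges delivery
```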

Why it matters for AI:

  • Real-time feature pipelines for ML models
  • Event-driven architectures for AI agent systems
  • Streaming inference pipelines
  • Model telemetry and monitoring backbones

Apache Spark (2014)

GitHub: https://github.com/apache/spark

Spark brought in-memory computing to big data, making iterative algorithms (like ML training) practical at scale.
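A hedged sketch of such an iterative workload with pyspark’s MLlib: assemble features, then fit a logistic regression. The inline data is a toy stand-in; a real job would read a distributed dataset from HDFS or a lakehouse table.

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("mllib-sketch").getOrCreate()

# Toy in-memory data; real pipelines read distributed datasets.
df = spark.createDataFrame(
    [(0.0, 1.2, 0.0), (1.5, 0.3, 1.0), (0.2, 1.1, 0.0), (1.8, 0.1, 1.0)],
    ["f1", "f2", "label"],
)

# Iterative optimization is where in-memory caching pays off.
assembler = VectorAssembler(inputCols=["f1", "f2"], outputCol="features")
model = LogisticRegression(maxIter=10).fit(assembler.transform(df))
print(model.coefficients)
spark.stop()
```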

Why it matters for AI:

  • MLlib made ML accessible to data engineers
  • Distributed data processing for model training
  • Spark ML became the de facto standard for big data ML
  • Proved that in-memory computing could accelerate ML workloads

Search Engines: The Retrieval Foundation

Before RAG (Retrieval-Augmented Generation) became a buzzword, search engines were solving retrieval at scale.

Elasticsearch (2010)

GitHub: https://github.com/elastic/elasticsearch

Elasticsearch made full-text search accessible and scalable. Its distributed architecture and RESTful API became the standard for search.
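That Query DSL is plain JSON over REST, which is part of why it spread so widely. A sketch, assuming a local unauthenticated cluster and a hypothetical `articles` index:

```python
import requests  # any HTTP client works; the API is JSON over REST

resp = requests.post(
    "http://localhost:9200/articles/_search",  # assumed local node
    json={
        "query": {"match": {"body": "retrieval augmented generation"}},
        "size": 5,
    },
)
for hit in resp.json()["hits"]["hits"]:
    print(hit["_score"], hit["_source"].get("title"))
```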

Why it matters for AI:

  • Pioneered distributed inverted index structures
  • Proved that horizontal scaling was possible for search workloads
  • Many “AI search” systems actually use Elasticsearch under the hood
  • Query DSL influenced modern vector database query languages

OpenSearch (2021)

GitHub: https://github.com/opensearch-project/opensearch

When AWS forked Elasticsearch after its 2021 license change, it ensured that search infrastructure remained truly open. OpenSearch continues the mission of accessible, scalable search.

Why it matters for AI:

  • Maintains open source innovation in search
  • Ships vector (k-NN) search alongside traditional full-text search (see the query sketch after this list)
  • Demonstrates community fork resilience
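A sketch of an OpenSearch k-NN query, assuming a local cluster with the security plugin disabled and an index whose `embedding` field is mapped as `knn_vector`; the index name and vector values are illustrative:

```python
import requests

query = {
    "size": 3,
    "query": {
        "knn": {
            "embedding": {                       # a knn_vector field
                "vector": [0.1, 0.7, 0.2, 0.4],  # query embedding
                "k": 3,                          # nearest neighbors
            }
        }
    },
}
resp = requests.post("http://localhost:9200/docs/_search", json=query)
print([hit["_id"] for hit in resp.json()["hits"]["hits"]])
```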

Databases: From SQL to Vectors

The evolution from relational databases to vector databases represents a paradigm shift—but both have AI relevance.

Traditional Databases That Paved the Way

  • Dgraph (2015) - Graph database proving that specialized data structures enable new use cases
  • TDengine (2019) - Time-series database for IoT ML workloads
  • OceanBase (2021) - Distributed database showing ACID transactions could scale

Why they matter for AI:

  • Proved that specialized database engines could outperform general-purpose ones
  • Database internals (indexing, sharding, replication) are now applied to vector databases
  • Multi-model databases (graph + vector + relational) are becoming the norm for AI apps

Cloud Native: The Runtime Foundation

When Docker and Kubernetes emerged, they weren’t built for AI—but AI couldn’t scale without them.

Docker (2013) & Kubernetes (2014)

GitHub: https://github.com/kubernetes/kubernetes

Kubernetes became the operating system for cloud-native applications. Its declarative API and controller pattern made it perfect for AI workloads.
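A sketch of that declarative pattern with the official Python client: the dict below mirrors a Deployment YAML and requests one GPU. The image name and namespace are illustrative assumptions.

```python
from kubernetes import client, config  # pip install kubernetes

config.load_kube_config()  # assumes a reachable cluster and kubeconfig

deployment = {
    "apiVersion": "apps/v1",
    "kind": "Deployment",
    "metadata": {"name": "model-server"},
    "spec": {
        "replicas": 2,
        "selector": {"matchLabels": {"app": "model-server"}},
        "template": {
            "metadata": {"labels": {"app": "model-server"}},
            "spec": {
                "containers": [{
                    "name": "server",
                    "image": "example.com/llm-server:latest",  # hypothetical
                    "resources": {"limits": {"nvidia.com/gpu": 1}},  # one GPU
                }],
            },
        },
    },
}

# Declare desired state; the controller reconciles the cluster toward it.
client.AppsV1Api().create_namespaced_deployment(
    namespace="default", body=deployment
)
```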

Why it matters for AI:

  • Model deployment platforms (KServe, Seldon Core) run on K8s
  • GPU orchestration (NVIDIA GPU Operator, Volcano, HAMi) extends K8s
  • Kubeflow made K8s the standard for ML pipelines
  • Microservice patterns enable modular AI agent architectures

Service Mesh & Serverless

Istio (2017), Knative (2018) - Service mesh and serverless platforms that proved:

  • Network-level observability applies to AI model calls
  • Scale-to-zero is essential for cost-effective inference
  • Traffic splitting enables A/B testing of ML models (a toy router is sketched after this list)
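In Istio, the split is a weighted route in a VirtualService; the underlying idea reduces to weighted random choice. A toy, in-process version for intuition only, since a real mesh splits traffic at the network layer:

```python
import random

# Toy weighted split: 90% of requests to the stable model,
# 10% to a canary, mirroring an Istio weighted route.
ENDPOINTS = [("model-v1", 90), ("model-v2-canary", 10)]

def route() -> str:
    names, weights = zip(*ENDPOINTS)
    return random.choices(names, weights=weights, k=1)[0]

hits = sum(route() == "model-v2-canary" for _ in range(10_000))
print(hits)  # roughly 1,000
```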

Why they matter for AI:

  • AI Gateway patterns evolved from API gateways + service mesh
  • Serverless inference platforms use Knative-style autoscaling
  • Observability patterns (tracing, metrics) are now standard for ML systems

API Gateways: From REST to LLM

API gateways weren’t designed for AI, but they became the foundation of AI Gateway patterns.

Kong, APISIX, KGateway

These API gateways solved rate limiting, auth, and routing at scale. When LLMs emerged, the same patterns applied:

AI Gateway Evolution:

| Traditional API Gateway (2010s) | AI Gateway (2024) |
|---------------------------------|-------------------|
| Rate Limiting | Token Bucket Rate Limiting |
| Auth | API Key + Organization Management |
| Routing | Model Routing (GPT-4 → Claude → Local Models) |
| Observability | LLM-specific Telemetry (token usage, cost) |
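The first row of that table names a concrete algorithm worth sketching: a token bucket that meters LLM usage by tokens consumed rather than requests made. This is a toy single-process version; real gateways keep the bucket in shared state such as Redis.

```python
import time

class TokenBucket:
    """Toy token-bucket limiter metering LLM tokens, not requests."""

    def __init__(self, rate: float, capacity: float):
        self.rate = rate          # tokens replenished per second
        self.capacity = capacity  # burst ceiling
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self, cost: float) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False

bucket = TokenBucket(rate=1000, capacity=4000)  # ~1k LLM tokens/sec
print(bucket.allow(cost=3500))  # True: within the burst budget
print(bucket.allow(cost=3500))  # False: bucket not yet refilled
```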

Why they matter for AI:

  • Proved that centralized API management scales
  • Plugin architectures enable LLM-specific features
  • Traffic management patterns apply to prompt routing
  • Security patterns (mTLS, JWT) now protect AI endpoints

Workflow Orchestration: The Pipeline Backbone

Data engineering needs pipelines. ML engineering needs pipelines. AI agents need workflows.

Apache Airflow (2015)

GitHub: https://github.com/apache/airflow

Airflow made pipeline orchestration accessible with its DAG-based approach. It became the standard for ETL and data engineering.
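A minimal sketch of that DAG-based approach using Airflow’s TaskFlow API (Airflow 2.x); the task bodies and paths are placeholders standing in for a real feature/train/evaluate pipeline:

```python
from datetime import datetime
from airflow.decorators import dag, task

@dag(schedule="@daily", start_date=datetime(2024, 1, 1), catchup=False)
def ml_pipeline():
    @task
    def build_features() -> str:
        return "s3://bucket/features/latest"   # hypothetical path

    @task
    def train(features_path: str) -> str:
        return "s3://bucket/models/candidate"  # hypothetical artifact

    @task
    def evaluate(model_path: str) -> None:
        print(f"evaluating {model_path}")

    # Passing outputs between tasks is what wires up the DAG edges.
    evaluate(train(build_features()))

ml_pipeline()
```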

Why it matters for AI:

  • ML pipeline orchestration (feature engineering, training, evaluation)
  • Proved that DAG-based workflow definition works at scale
  • Prompt engineering pipelines use Airflow-style orchestration
  • Scheduler patterns are now applied to AI agent workflows

n8n, Prefect, Flyte

Modern workflow platforms that evolved from Airflow’s foundations:

  • n8n (2019) - Visual workflow automation with AI capabilities
  • Prefect (2018) - Python-native workflow orchestration for ML
  • Flyte (2019) - Kubernetes-native workflow orchestration for ML/data

Why they matter for AI:

  • Multi-modal agents need workflow orchestration
  • RAG pipelines are essentially ETL pipelines for embeddings (see the sketch after this list)
  • Prompt chaining is DAG-based orchestration
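That ETL framing reduces to three stages: extract (chunk documents), transform (embed chunks), load (write vectors to a store). A toy sketch where `embed()` is a hypothetical stand-in for a real embedding model and the store is a plain list:

```python
# Toy RAG ingestion: extract -> transform (embed) -> load.
def embed(text: str) -> list[float]:
    # Hypothetical stand-in for a real embedding model.
    return [float(ord(c) % 7) for c in text[:4]]

def chunk(doc: str, size: int = 40) -> list[str]:
    return [doc[i:i + size] for i in range(0, len(doc), size)]

vector_store: list[tuple[list[float], str]] = []  # (vector, chunk)

def ingest(doc: str) -> None:
    for piece in chunk(doc):                        # extract
        vector_store.append((embed(piece), piece))  # transform + load

ingest("Kafka redefined data streaming with its log-based architecture.")
print(len(vector_store), "chunks indexed")
```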

Data Formats: The Lakehouse Foundation

Before we could train on massive datasets, we needed formats that supported ACID transactions and schema evolution.

Delta Lake, Apache Iceberg, Apache Hudi

These table formats brought ACID reliability to data lakes.
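A sketch of the versioning and time-travel point with the `deltalake` Python package (delta-rs); the local table path is an illustrative assumption:

```python
import pandas as pd
from deltalake import DeltaTable, write_deltalake  # pip install deltalake

path = "/tmp/features_delta"  # illustrative local table path

# Each write is an ACID commit that produces a new table version.
write_deltalake(path, pd.DataFrame({"user": ["a", "b"], "clicks": [3, 7]}))
write_deltalake(path, pd.DataFrame({"user": ["c"], "clicks": [1]}),
                mode="append")

# Time travel: pin a training run to the exact snapshot it used.
print(DeltaTable(path, version=0).to_pandas())  # first commit only
print(DeltaTable(path).version())               # latest version -> 1
```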

Why they matter for AI:

  • Training datasets need versioning and reproducibility
  • Feature stores use Delta/Iceberg as storage formats
  • Proved that “big data” could have transactional semantics
  • Schema evolution handles ML feature drift

The Invisible Thread: Why These Projects Matter

What do all these projects have in common?

  1. They solved scaling first - AI training/inference needs horizontal scaling
  2. They proved distributed systems work - Modern AI is fundamentally distributed
  3. They created ecosystem patterns - Plugin systems, extension points, APIs
  4. They established best practices - Observability, security, CI/CD
  5. They built developer habits - YAML configs, declarative APIs, CLI tools

The AI Native Continuum

Modern “AI Native” infrastructure didn’t replace these projects—it builds on them:

| Traditional Project | AI Native Evolution | Example |
|---------------------|-----------------------------|----------------------------------------|
| Hadoop HDFS | Distributed model storage | HDFS for datasets, S3 for checkpoints |
| Kafka | Real-time feature pipelines | Kafka → Feature Store → Model Serving |
| Spark ML | Distributed ML training | MLlib → PyTorch Distributed |
| Elasticsearch | Vector search | ES → Weaviate/Qdrant/Milvus |
| Kubernetes | ML orchestration | K8s → Kubeflow/KServe |
| Istio | AI Gateway service mesh | Istio → LLM Gateway with mTLS |
| Airflow | ML pipeline orchestration | Airflow → Prefect/Flyte for ML |

Table 2: From Traditional to AI Native

Why We’re Removing Them from Our AI Resources List

This post honors these projects, but we’re also removing them from our AI Resources list. Here’s why:

They’re not “AI Projects”—they’re foundational infrastructure.

  • Hadoop, Kafka, Spark are data engineering tools, not ML frameworks
  • Elasticsearch is search, not semantic search
  • Kubernetes is general-purpose orchestration
  • API gateways serve REST/GraphQL, not just LLMs

But their absence doesn’t diminish their importance.

By removing them, we acknowledge that:

  1. AI has its own ecosystem - Transformers, vector DBs, LLM ops
  2. Traditional infra has its own domain - Data engineering, cloud native
  3. The intersection is where innovation happens - AI-native data platforms, LLM ops on K8s

The Giants We Stand On

The next time you:

  • Deploy a model on Kubernetes
  • Stream features through Kafka
  • Search embeddings with a vector database
  • Orchestrate a RAG pipeline with Prefect

Remember: You’re standing on the shoulders of Hadoop, Kafka, Elasticsearch, Kubernetes, and countless others. They built the roads we now drive on.

The Future: Building New Giants

Just as Hadoop and Kafka enabled modern AI, today’s AI infrastructure will become tomorrow’s foundation:

  • Vector databases may become the new standard for all search
  • LLM observability may evolve into general distributed tracing
  • AI agent orchestration may reinvent workflow automation
  • GPU scheduling may influence general-purpose resource management

The cycle continues. The giants of today will be the foundations of tomorrow.

Conclusion: Gratitude and Continuity

As we clean up our AI Resources list to focus on AI-native projects, we don’t forget where we came from. Traditional big data and cloud native infrastructure made the AI revolution possible.

To the Hadoop committers, Kafka maintainers, Kubernetes contributors, and all who built the foundation: Thank you.

Your work enabled ChatGPT, enabled Transformers, enabled everything we now call “AI.”

Standing on your shoulders, we see further.


Acknowledgments: This post was inspired by the need to refactor our AI Resources list. The 27 projects mentioned here are being removed—not because they’re unimportant, but because they deserve their own category: The Foundation.

Jimmy Song

Focusing on research and open source practices in AI-Native Infrastructure and cloud native application architecture.