“If I have seen further, it is by standing on the shoulders of giants.” — Isaac Newton

In the excitement surrounding LLMs, vector databases, and AI agents, it’s easy to forget that modern AI didn’t emerge from a vacuum. Today’s AI revolution stands upon decades of infrastructure work—distributed systems, data pipelines, search engines, and orchestration platforms that were built long before “AI Native” became a buzzword.
This post is a tribute to those traditional open source projects that became the invisible foundation of AI infrastructure. They’re not “AI projects” per se, but without them, the AI revolution as we know it wouldn’t exist.
The Evolution: From Big Data to AI
| Era | Focus | Core Technologies | AI Connection |
|---|---|---|---|
| 2000s | Web Search & Indexing | Lucene, Elasticsearch | Semantic search foundations |
| 2010s | Big Data & Distributed Computing | Hadoop, Spark, Kafka | Data processing at scale |
| 2010s | Cloud Native | Docker, Kubernetes | Model deployment platforms |
| 2010s | Stream Processing | Flink, Storm, Pulsar | Real-time ML inference |
| 2020s | AI Native | Transformers, Vector DBs | Built on everything above |
Big Data Frameworks: The Data Engines
Before we could train models on petabytes of data, we needed ways to store, process, and move that data.
Apache Hadoop (2006)
GitHub: https://github.com/apache/hadoop
Hadoop democratized big data by making distributed computing accessible. Its distributed filesystem (HDFS) and MapReduce programming model proved that clusters of commodity hardware could process web-scale datasets.
Why it matters for AI:
- Modern ML training datasets live in HDFS-compatible storage
- Data lakes built on Hadoop became training data reservoirs
- Proved that distributed computing could scale horizontally
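To make "HDFS-compatible storage" concrete, here is a minimal sketch of how a Python training job might read Parquet data straight off HDFS with PyArrow. The host, port, and dataset path are placeholders, and the connection assumes Hadoop client libraries (libhdfs) are installed:

```python
import pyarrow.dataset as ds
import pyarrow.fs as pafs

# Connect to the HDFS namenode (host/port are placeholders; requires libhdfs).
hdfs = pafs.HadoopFileSystem(host="namenode.example.com", port=8020)

# Training data typically lands in the lake as partitioned Parquet files.
dataset = ds.dataset("/data/training/events", format="parquet", filesystem=hdfs)

# Pull a small sample batch for inspection before launching a full job.
sample = dataset.head(1000)
print(sample.num_rows, sample.schema)
```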
Apache Kafka (2011)
GitHub: https://github.com/apache/kafka
Kafka redefined data streaming with its log-based architecture. It became the nervous system for real-time data flows in enterprises worldwide.
Why it matters for AI:
- Real-time feature pipelines for ML models
- Event-driven architectures for AI agent systems
- Streaming inference pipelines
- Model telemetry and monitoring backbones
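A hedged sketch of such a real-time feature pipeline, using the kafka-python client; the broker address, topic name, and event shape are all placeholders:

```python
import json
from kafka import KafkaConsumer, KafkaProducer

# Upstream: an application service emits raw click events to a topic.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send("click-events", {"user_id": 42, "url": "/pricing"})
producer.flush()

# Downstream: a feature pipeline consumes the stream and derives model features.
consumer = KafkaConsumer(
    "click-events",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)
for message in consumer:
    event = message.value
    print(f"user {event['user_id']} clicked {event['url']}")  # feed a feature store here
    break  # demo only: process one event and stop
```

The same producer/consumer pattern underlies streaming inference and model telemetry: the only thing that changes is what flows through the topic.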
Apache Spark (2014)
GitHub: https://github.com/apache/spark
Spark brought in-memory computing to big data, making iterative algorithms (like ML training) practical at scale.
Why it matters for AI:
- MLlib made ML accessible to data engineers
- Distributed data processing for model training
- Spark ML became the de facto standard for big data ML
- Proved that in-memory computing could accelerate ML workloads
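As a minimal sketch of what MLlib made routine, here is a PySpark training pipeline on toy in-memory data; a real job would read Parquet from the data lake instead:

```python
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import VectorAssembler

spark = SparkSession.builder.appName("mllib-demo").getOrCreate()

# Toy training set: (label, feature_1, feature_2).
df = spark.createDataFrame(
    [(0.0, 1.2, 0.3), (1.0, 3.4, 1.1), (0.0, 0.5, 0.2), (1.0, 2.9, 0.9)],
    ["label", "f1", "f2"],
)

# Assemble raw columns into a feature vector, then fit a classifier.
assembler = VectorAssembler(inputCols=["f1", "f2"], outputCol="features")
lr = LogisticRegression(maxIter=10)
model = Pipeline(stages=[assembler, lr]).fit(df)  # training distributes across the cluster
```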
Search Engines: The Retrieval Foundation
Before RAG (Retrieval-Augmented Generation) became a buzzword, search engines were solving retrieval at scale.
Elasticsearch (2010)
GitHub: https://github.com/elastic/elasticsearch
Elasticsearch made full-text search accessible and scalable. Its distributed architecture and RESTful API became the standard for search.
Why it matters for AI:
- Popularized the distributed inverted index (building on Lucene)
- Proved that horizontal scaling was possible for search workloads
- Many “AI search” systems actually use Elasticsearch under the hood
- Query DSL influenced modern vector database query languages
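For a sense of how much "AI search" is still classic retrieval, here is the BM25 keyword query a RAG system might run before re-ranking or handing passages to an LLM. This is a sketch in elasticsearch-py 8.x style; the index name and field are placeholders:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Classic BM25 full-text retrieval: the first stage of many RAG pipelines.
resp = es.search(
    index="docs",
    query={"match": {"body": "how do I rotate api keys"}},
    size=5,
)
for hit in resp["hits"]["hits"]:
    print(hit["_score"], hit["_source"]["body"][:80])
```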
OpenSearch (2021)
GitHub: https://github.com/opensearch-project/opensearch
When Elastic moved Elasticsearch off the Apache 2.0 license in 2021, AWS forked version 7.10 as OpenSearch, ensuring the search infrastructure remained truly open. OpenSearch continues the mission of accessible, scalable search.
Why it matters for AI:
- Maintains open source innovation in search
- Built-in k-NN vector search brings embedding retrieval to the same engine as full-text search
- Demonstrates community fork resilience
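A brief sketch of that k-NN search with opensearch-py. It assumes an index whose embedding field was created as a `knn_vector` with the k-NN plugin enabled; the index name, field name, and vector dimension are placeholders:

```python
from opensearchpy import OpenSearch

client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])

# Approximate nearest-neighbor query against a knn_vector field.
query_vector = [0.1] * 384  # e.g. a sentence-transformer embedding
resp = client.search(
    index="docs-vectors",
    body={
        "size": 5,
        "query": {"knn": {"embedding": {"vector": query_vector, "k": 5}}},
    },
)
for hit in resp["hits"]["hits"]:
    print(hit["_score"], hit["_id"])
```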
Databases: From SQL to Vectors
The evolution from relational databases to vector databases represents a paradigm shift—but both have AI relevance.
Traditional Databases That Paved the Way
- Dgraph (2015) - Graph database proving that specialized data structures enable new use cases
- TDengine (2019) - Time-series database for IoT ML workloads
- OceanBase (open-sourced 2021) - Distributed SQL database showing that ACID transactions could scale horizontally
Why they matter for AI:
- Proved that specialized database engines could outperform general-purpose ones
- Database internals (indexing, sharding, replication) are now applied to vector databases
- Multi-model databases (graph + vector + relational) are becoming the norm for AI apps
Cloud Native: The Runtime Foundation
When Docker and Kubernetes emerged, they weren’t built for AI—but AI couldn’t scale without them.
Docker (2013) & Kubernetes (2014)
GitHub: https://github.com/kubernetes/kubernetes
Docker made the container the standard unit of software packaging, and Kubernetes became the operating system for cloud-native applications. Its declarative API and controller pattern made it a natural fit for AI workloads.
Why it matters for AI:
- Model deployment platforms (KServe, Seldon Core) run on K8s
- GPU orchestration (NVIDIA GPU Operator, Volcano, HAMi) extends K8s
- Kubeflow made K8s the standard for ML pipelines
- Microservice patterns enable modular AI agent architectures
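To illustrate why K8s fits model serving, here is a sketch using the official Kubernetes Python client to declare a one-replica model server that requests a GPU. The image name and namespace are placeholders, and the GPU resource key assumes the NVIDIA device plugin is installed:

```python
from kubernetes import client, config

config.load_kube_config()  # or load_incluster_config() inside a cluster

# One container serving a model, requesting a single GPU from the scheduler.
container = client.V1Container(
    name="model-server",
    image="registry.example.com/llm-server:latest",  # placeholder image
    resources=client.V1ResourceRequirements(limits={"nvidia.com/gpu": "1"}),
)
template = client.V1PodTemplateSpec(
    metadata=client.V1ObjectMeta(labels={"app": "model-server"}),
    spec=client.V1PodSpec(containers=[container]),
)
deployment = client.V1Deployment(
    metadata=client.V1ObjectMeta(name="model-server"),
    spec=client.V1DeploymentSpec(
        replicas=1,
        selector=client.V1LabelSelector(match_labels={"app": "model-server"}),
        template=template,
    ),
)
client.AppsV1Api().create_namespaced_deployment(namespace="default", body=deployment)
```

The declarative part is the point: you state the desired state (one replica, one GPU) and controllers converge the cluster toward it, which is exactly what platforms like KServe build on.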
Service Mesh & Serverless
Istio (2017), Knative (2018) - Service mesh and serverless platforms that proved:
- Network-level observability applies to AI model calls
- Scale-to-zero is essential for cost-effective inference
- Traffic splitting enables A/B testing of ML models
Why they matter for AI:
- AI Gateway patterns evolved from API gateways + service mesh
- Serverless inference platforms use Knative-style autoscaling
- Observability patterns (tracing, metrics) are now standard for ML systems
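The traffic-splitting idea is simple enough to sketch in a few lines. Istio implements it at the mesh layer, but the same weighted-random routing is what canaries a new model version; the model names and weights here are made up:

```python
import random

def pick_backend(weights: dict[str, float]) -> str:
    """Weighted random choice: the core of canary / A-B traffic splitting."""
    backends = list(weights)
    return random.choices(backends, weights=[weights[b] for b in backends], k=1)[0]

# Send 90% of inference traffic to the stable model, 10% to the canary.
backend = pick_backend({"model-v1": 0.9, "model-v2-canary": 0.1})
print(f"routing request to {backend}")
```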
API Gateways: From REST to LLM
API gateways weren’t designed for AI, but they became the foundation of AI Gateway patterns.
Kong, APISIX, KGateway
These API gateways solved rate limiting, auth, and routing at scale. When LLMs emerged, the same patterns applied:
AI Gateway Evolution:

```text
Traditional API Gateway (2010s)
          ↓
Rate Limiting → Token Bucket Rate Limiting
Auth          → API Key + Organization Management
Routing       → Model Routing (GPT-4 → Claude → Local Models)
Observability → LLM-specific Telemetry (token usage, cost)
          ↓
AI Gateway (2024)
```
Why they matter for AI:
- Proved that centralized API management scales
- Plugin architectures enable LLM-specific features
- Traffic management patterns apply to prompt routing
- Security patterns (mTLS, JWT) now protect AI endpoints
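To show why the token bucket maps so naturally onto LLM traffic, here is a minimal sketch: instead of charging one token per request, the gateway charges by the number of LLM tokens a call consumes. The capacity and refill rate are arbitrary:

```python
import time

class TokenBucket:
    """Token bucket: holds up to `capacity` tokens, refilled at `rate` per second."""

    def __init__(self, capacity: int, rate: float):
        self.capacity = capacity
        self.rate = rate
        self.tokens = float(capacity)
        self.last = time.monotonic()

    def allow(self, cost: int = 1) -> bool:
        # Refill based on elapsed time, capped at capacity, then try to spend.
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False

# An AI gateway charges by LLM tokens consumed, not by request count.
bucket = TokenBucket(capacity=10_000, rate=100.0)  # budget: 100 LLM tokens/sec
if bucket.allow(cost=512):  # a completion estimated at 512 tokens
    print("forward to model backend")
else:
    print("429 Too Many Requests")
```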
Workflow Orchestration: The Pipeline Backbone
Data engineering needs pipelines. ML engineering needs pipelines. AI agents need workflows.
Apache Airflow (2015)
GitHub: https://github.com/apache/airflow
Airflow made pipeline orchestration accessible with its DAG-based approach. It became the standard for ETL and data engineering.
Why it matters for AI:
- ML pipeline orchestration (feature engineering, training, evaluation)
- Proved that DAG-based workflow definition works at scale
- Prompt engineering pipelines use Airflow-style orchestration
- Scheduler patterns are now applied to AI agent workflows
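A minimal sketch of the DAG style that made Airflow the standard, in Airflow 2.x syntax; the task bodies are stubs standing in for real pipeline steps:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract_features():
    print("pull raw events, compute features")

def train_model():
    print("fit the model on the feature set")

def evaluate_model():
    print("score the model, gate the deployment")

with DAG(
    dag_id="ml_training_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",  # `schedule_interval` on older Airflow releases
    catchup=False,
) as dag:
    extract = PythonOperator(task_id="extract_features", python_callable=extract_features)
    train = PythonOperator(task_id="train_model", python_callable=train_model)
    evaluate = PythonOperator(task_id="evaluate_model", python_callable=evaluate_model)

    extract >> train >> evaluate  # features, then training, then evaluation
```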
n8n, Prefect, Flyte
Modern workflow platforms that evolved from Airflow’s foundations:
- n8n (2019) - Visual workflow automation with AI capabilities
- Prefect (2018) - Python-native workflow orchestration for ML
- Flyte (2019) - Kubernetes-native workflow orchestration for ML/data
Why they matter for AI:
- Multi-modal agents need workflow orchestration
- RAG pipelines are essentially ETL pipelines for embeddings
- Prompt chaining is DAG-based orchestration
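The "RAG pipelines are ETL" point can be sketched in a few lines of Prefect 2.x; every task here is a stub standing in for a real document loader, embedding model, and vector store:

```python
from prefect import flow, task

@task
def load_chunks(path: str) -> list[str]:
    return ["chunk one of the doc", "chunk two of the doc"]  # stub loader/splitter

@task
def embed(chunks: list[str]) -> list[list[float]]:
    return [[0.1, 0.2, 0.3] for _ in chunks]  # stub embedding model

@task
def upsert(chunks: list[str], vectors: list[list[float]]) -> None:
    print(f"upserting {len(vectors)} vectors")  # stub vector-store write

@flow
def rag_ingestion(path: str) -> None:
    # Extract -> Transform -> Load, with embedding as the transform step.
    chunks = load_chunks(path)
    vectors = embed(chunks)
    upsert(chunks, vectors)

if __name__ == "__main__":
    rag_ingestion("docs/handbook.md")
```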
Data Formats: The Lakehouse Foundation
Before we could train on massive datasets, we needed formats that supported ACID transactions and schema evolution.
Delta Lake, Apache Iceberg, Apache Hudi
These table formats brought transactional reliability and version history to data lakes.
Why they matter for AI:
- Training datasets need versioning and reproducibility
- Feature stores use Delta/Iceberg as storage formats
- Proved that “big data” could have transactional semantics
- Schema evolution handles ML feature drift
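Dataset versioning is easiest to see in code. This sketch uses the `deltalake` Python package; the table path and version number are placeholders:

```python
import pandas as pd
from deltalake import DeltaTable, write_deltalake

# Appending a batch of features creates a new, immutable table version.
batch = pd.DataFrame({"user_id": [1, 2], "clicks_7d": [14, 3]})
write_deltalake("s3://bucket/features/user_stats", batch, mode="append")

# Reproduce a past training run by pinning the exact version it read ("time travel").
snapshot = DeltaTable("s3://bucket/features/user_stats", version=42)
train_df = snapshot.to_pandas()
print(len(train_df), "rows at version 42")
```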
The Invisible Thread: Why These Projects Matter
What do all these projects have in common?
- They solved scaling first - AI training/inference needs horizontal scaling
- They proved distributed systems work - Modern AI is fundamentally distributed
- They created ecosystem patterns - Plugin systems, extension points, APIs
- They established best practices - Observability, security, CI/CD
- They built developer habits - YAML configs, declarative APIs, CLI tools
The AI Native Continuum
Modern “AI Native” infrastructure didn’t replace these projects—it builds on them:
| Traditional Project | AI Native Evolution | Example |
|---|---|---|
| Hadoop HDFS | Distributed model storage | HDFS for datasets, S3 for checkpoints |
| Kafka | Real-time feature pipelines | Kafka → Feature Store → Model Serving |
| Spark ML | Distributed ML training | MLlib → PyTorch Distributed |
| Elasticsearch | Vector search | ES → Weaviate/Qdrant/Milvus |
| Kubernetes | ML orchestration | K8s → Kubeflow/KServe |
| Istio | AI Gateway service mesh | Istio → LLM Gateway with mTLS |
| Airflow | ML pipeline orchestration | Airflow → Prefect/Flyte for ML |
Why We’re Removing Them from Our AI Resources List
This post honors these projects, but we’re also removing them from our AI Resources list. Here’s why:
They’re not “AI Projects”—they’re foundational infrastructure.
- Hadoop, Kafka, Spark are data engineering tools, not ML frameworks
- Elasticsearch is search, not semantic search
- Kubernetes is general-purpose orchestration
- API gateways serve REST/GraphQL, not just LLMs
But their absence doesn’t diminish their importance.
By removing them, we acknowledge that:
- AI has its own ecosystem - Transformers, vector DBs, LLM ops
- Traditional infra has its own domain - Data engineering, cloud native
- The intersection is where innovation happens - AI-native data platforms, LLM ops on K8s
The Giants We Stand On
The next time you:
- Deploy a model on Kubernetes
- Stream features through Kafka
- Search embeddings with a vector database
- Orchestrate a RAG pipeline with Prefect
Remember: You’re standing on the shoulders of Hadoop, Kafka, Elasticsearch, Kubernetes, and countless others. They built the roads we now drive on.
The Future: Building New Giants
Just as Hadoop and Kafka enabled modern AI, today’s AI infrastructure will become tomorrow’s foundation:
- Vector databases may become the new standard for all search
- LLM observability may evolve into general distributed tracing
- AI agent orchestration may reinvent workflow automation
- GPU scheduling may influence general-purpose resource management
The cycle continues. The giants of today will be the foundations of tomorrow.
Conclusion: Gratitude and Continuity
As we clean up our AI Resources list to focus on AI-native projects, we don’t forget where we came from. Traditional big data and cloud native infrastructure made the AI revolution possible.
To the Hadoop committers, Kafka maintainers, Kubernetes contributors, and all who built the foundation: Thank you.
Your work enabled ChatGPT, enabled Transformers, enabled everything we now call “AI.”
Standing on your shoulders, we see further.
Acknowledgments: This post was inspired by the need to refactor our AI Resources list. The 27 projects mentioned here are being removed—not because they’re unimportant, but because they deserve their own category: The Foundation.
