From GPU to Token: The 8-Layer Observability Stack for AI Infrastructure

GPU utilization is not the destination. Token cost is the true North Star metric for AI infrastructure.

For the past few years, one of the hottest topics in AI infrastructure has been GPU scheduling. Kubernetes, Volcano, Kueue, and HAMi all address the same fundamental problem: how to use expensive, scarce GPUs more efficiently.

But as more enterprises begin running production-grade Large Language Model (LLM) services, a new phenomenon has emerged: GPU utilization is high, but users still complain about slow response times; GPU clusters are near full capacity, but business throughput hasn’t grown proportionally; VRAM and compute still have headroom, yet TTFT (Time To First Token) continues to degrade. This reveals a fundamental truth — GPU utilization alone is no longer sufficient to describe the real operational state of modern AI systems.

For traditional cloud-native applications, we focus on CPU, memory, network, and disk. For AI systems, however, we also need to care about: whether GPUs are truly performing useful computation, whether NCCL (NVIDIA Collective Communications Library) communication has become a bottleneck, whether Kubernetes has correctly allocated resources, whether KV Cache is exhausted, whether Token latency meets user experience requirements, and whether the cost per Token is reasonable. In other words, the observability target for AI infrastructure has expanded from the GPU to the entire inference chain.

Recently, while participating in the development of an industry standard — “Technical Capability Requirements for Computing Power Efficiency Enhancement: Heterogeneous Computing Services” organized by the China Academy of Information and Communications Technology (CAICT) — a core question came up repeatedly in discussions with experts: How do GPU and Token relate? How do we measure GPU output through Tokens? The standard introduces the concept of “Token as a Service,” shifting the unit of measurement for computing services from traditional GPU hours to Tokens, yet a mature practical framework for building a complete observability chain from hardware to Token was still missing.

Around the same time, I came across a technical article on GPU and LLM observability (GPU & LLM Inference Observability — Layer-by-Layer Coverage) that proposed a constructive approach: decomposing the AI system into eight observability layers from GPU hardware to business cost, which neatly fills the observability gap between GPU and Token. This article reorganizes and interprets the original content, removes product-specific implementation details and vendor promotion, and extends the model by incorporating Kubernetes, HAMi, and modern LLM inference architectures.

This article aims to answer one question:

After we’ve solved GPU scheduling, what should we observe next?

8-Layer Observability Architecture Overview

The diagram below shows the eight observability layers from GPU hardware to business cost, each corresponding to different observability targets and areas of responsibility:

Figure 1: 8-Layer AI Infrastructure Observability Architecture

Original image source: GitHub.

Data from each layer is collected through the OpenTelemetry (OTel) Collector and routed to metrics, tracing, and logging backends, forming a complete observability loop. Let’s examine each layer in detail.

L1 GPU Hardware Layer

This layer focuses on the GPU itself. The core question is: Is the GPU healthy?

The table below lists the key observability metrics for the GPU hardware layer:

Category	Key Metrics
Compute	GPU Utilization, SM Occupancy, Tensor Core Activity
Memory	VRAM Usage, HBM Bandwidth
Interconnect	NVLink Throughput, PCIe Throughput
Reliability	ECC Error, XID Error
Thermal	Temperature, Power Draw, Throttle Reason

Table 1: L1 GPU Hardware Layer Key Metrics

Many teams only track GPU Utilization, but GPU Utilization ≠ GPU Efficiency. The utilization figure reported by tools such as nvidia-smi only says that some kernel was running during the sampling window, not how much of the chip that kernel used. Two GPUs might both show 90% Utilization while one is executing Tensor Core operations and the other is mostly waiting on memory access, so their actual performance can differ widely. SM (Streaming Multiprocessor) Occupancy and Tensor Core Activity are therefore often more valuable than Utilization alone.
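The distinction can be illustrated with a minimal sketch. It assumes each scrape of a GPU yields a utilization ratio and a tensor-pipe activity ratio (DCGM exposes comparable fields such as DCGM_FI_DEV_GPU_UTIL and DCGM_FI_PROF_PIPE_TENSOR_ACTIVE). The `GpuSample` type, the `classify` function, and both thresholds are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class GpuSample:
    """One scrape of a single GPU; both values are fractions in [0, 1]."""
    utilization: float    # share of time at least one kernel was running
    tensor_active: float  # share of cycles the tensor pipe was active


def classify(sample: GpuSample,
             busy_threshold: float = 0.8,
             tensor_threshold: float = 0.3) -> str:
    """Label a sample as underused, busy but stalled, or tensor bound.

    Both thresholds are illustrative assumptions, not recommended values.
    """
    if sample.utilization < busy_threshold:
        return "underused"
    if sample.tensor_active < tensor_threshold:
        return "busy_but_stalled"
    return "tensor_bound"


# Two GPUs with identical utilization but very different useful work.
gpu_a = GpuSample(utilization=0.90, tensor_active=0.55)
gpu_b = GpuSample(utilization=0.90, tensor_active=0.05)
print(classify(gpu_a))  # tensor_bound
print(classify(gpu_b))  # busy_but_stalled
```

A dashboard that shows only the first field would render these two GPUs identically, which is the gap the additional L1 metrics are meant to close.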

L2 CUDA Runtime and Communication Layer

In distributed training, the bottleneck is often communication rather than GPU computation. The core question at this layer is: Is the GPU computing, or waiting for other GPUs?

The table below lists the key metrics for the CUDA runtime and communication layer:

Category	Metrics
NCCL	AllReduce, AllGather, ReduceScatter
Communication	Duration, Bandwidth
Kernel	Execution Time, P99 Duration
Straggler	Rank Skew

Table 2: L2 CUDA Runtime and Communication Layer Key Metrics

In large-scale training scenarios, a single slow node can drag down the entire training job. By monitoring NCCL communication Duration and Bandwidth, as well as Skew between Ranks, you can quickly pinpoint communication bottlenecks.
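A rough sketch of rank-skew detection follows. One assumption matters here: the input is the time each rank needs to reach the collective (its compute time before the AllReduce), not the duration of the collective itself, because the ranks that arrive early are the ones that appear to spend longest inside the collective while they wait. The function names and the 1.5x tolerance are hypothetical.

```python
from statistics import median


def rank_skew(time_to_collective_ms: dict[int, float]) -> float:
    """Slowest rank divided by the median rank; 1.0 means no skew."""
    values = list(time_to_collective_ms.values())
    return max(values) / median(values)


def find_stragglers(time_to_collective_ms: dict[int, float],
                    tolerance: float = 1.5) -> list[int]:
    """Return ranks that arrive later than tolerance x the median rank."""
    typical = median(time_to_collective_ms.values())
    return sorted(rank for rank, t in time_to_collective_ms.items()
                  if t > tolerance * typical)


# Hypothetical per-rank timings for one training step.
arrival_ms = {0: 12.1, 1: 11.8, 2: 12.4, 3: 31.0}
print(find_stragglers(arrival_ms))      # [3]
print(round(rank_skew(arrival_ms), 2))  # 2.53
```

Using the median rather than the mean keeps a single slow rank from shifting the baseline it is compared against.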

L3 Host / OS Layer

Many GPU problems are ultimately not GPU problems. When GPU utilization is abnormal, the root cause may lie in the host’s CPU, memory, disk, or network.

The table below lists the key metrics at the host level:

Category	Metrics
CPU	Utilization, IO Wait
Memory	Usage, Swap
Disk	Throughput, Latency
Network	Retransmit, Bandwidth
Process	RSS, Thread Count

Table 3: L3 Host / OS Layer Key Metrics

A common misconception: when GPU utilization is low, the first reaction is that the GPU isn’t fast enough. In reality, the GPU is often waiting for data — a slow DataLoader, insufficient CPU, network congestion, or inadequate storage performance can all leave the GPU idle.
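To check whether the GPU is starved for input, one simple approach is to measure how much of each step is spent waiting for data. The sketch below assumes the training loop already records two timings per step, a data wait and a compute time; the function name and the sample numbers are hypothetical.

```python
def input_stall_fraction(steps: list[tuple[float, float]]) -> float:
    """Share of wall-clock time the GPU spent waiting for input.

    Each step is a (data_wait_seconds, compute_seconds) pair.
    """
    total_wait = sum(wait for wait, _ in steps)
    total_time = sum(wait + compute for wait, compute in steps)
    return total_wait / total_time if total_time else 0.0


# Hypothetical timings: the GPU computes for 0.1 s but waits about 0.3 s per step.
steps = [(0.30, 0.10), (0.28, 0.10), (0.32, 0.10)]
print(f"{input_stall_fraction(steps):.0%}")  # 75%
```

A high stall fraction points at the host (DataLoader, CPU, storage, or network) rather than at the GPU.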

L4 Kubernetes and Scheduling Layer

This layer is the core of cloud-native AI infrastructure. The question to answer is: Who owns the GPU? You need to know which Pod is using the GPU, which Namespace consumes the most resources, which team has the highest cost, and which model occupies the most VRAM.

The table below lists the key observability dimensions at the scheduling layer:

Category	Key Attributes
Ownership	Pod, Container
Workload	Deployment, Job
Organization	Namespace, Team
Scheduling	GPU Sharing, GPU Partition
Topology	Node, AZ, Region

Table 4: L4 Kubernetes and Scheduling Layer Key Dimensions

Projects like HAMi, Volcano, and Kueue primarily operate at this layer. HAMi solves core problems like GPU Sharing, GPU Partitioning, and heterogeneous GPU scheduling, while observability answers the question: Has scheduling actually improved resource utilization? The observability data at this layer is the foundation for resource auditing and cost allocation.
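As a sketch of how ownership data feeds cost allocation, the snippet below sums GPU-hours per Namespace or per Team. The record layout is an assumption made for this example: each pod record carries the number of GPUs it held (fractional when a GPU is shared) and the hours it held them. All pod, namespace, and team names are invented.

```python
from collections import defaultdict


def gpu_hours_by(records: list[dict], key: str) -> dict[str, float]:
    """Sum GPU-hours per value of `key`, e.g. "namespace" or "team"."""
    totals: dict[str, float] = defaultdict(float)
    for record in records:
        totals[record[key]] += record["gpus"] * record["hours"]
    return dict(totals)


# Hypothetical pod records; 0.5 GPUs means the pod shared one device.
pods = [
    {"pod": "llm-serve-0", "namespace": "inference", "team": "search",
     "gpus": 0.5, "hours": 24},
    {"pod": "llm-serve-1", "namespace": "inference", "team": "ads",
     "gpus": 0.5, "hours": 24},
    {"pod": "pretrain-7", "namespace": "training", "team": "search",
     "gpus": 8, "hours": 10},
]
print(gpu_hours_by(pods, "namespace"))  # {'inference': 24.0, 'training': 80.0}
print(gpu_hours_by(pods, "team"))       # {'search': 92.0, 'ads': 12.0}
```

The same records answer both the Namespace question and the Team question, which is why ownership labels need to be attached at collection time rather than reconstructed later.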

L5 Training Runtime Layer

The training phase requires monitoring the model’s runtime state. The table below lists the key metrics for the training runtime:

Category	Metrics
Efficiency	MFU, TFLOPS, Step Time
Gradient	Norm, NaN, Inf
Loss	Training Loss
Data	DataLoader Wait
Checkpoint	Save Time, Restore Time

Table 5: L5 Training Runtime Key Metrics

The most important metric here is MFU (Model FLOPs Utilization): the FLOPs per second that the model's computation actually requires at the observed throughput, divided by the hardware's theoretical peak FLOPs per second. Many training jobs show high GPU utilization but only 30% MFU, meaning a significant amount of GPU time is not being converted into actual training progress. MFU is the golden metric for training efficiency because it directly reflects how much of the hardware's compute ends up as effective training computation.
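A back-of-the-envelope MFU calculation might look like the sketch below. It relies on the common approximation that a dense transformer needs about 6 FLOPs per parameter per token for one forward and backward pass, ignoring attention FLOPs. That approximation, the function signature, and the job figures are assumptions for illustration, not values from the original article.

```python
def mfu(params: float, tokens_per_second: float,
        num_gpus: int, peak_flops_per_gpu: float) -> float:
    """Model FLOPs Utilization: achieved model FLOP/s over peak FLOP/s.

    Uses the rough 6 * params FLOPs-per-token estimate for one forward
    plus backward pass of a dense transformer (attention FLOPs ignored).
    """
    achieved = 6.0 * params * tokens_per_second
    peak = num_gpus * peak_flops_per_gpu
    return achieved / peak


# Hypothetical job: 7e9 parameters on 8 GPUs with 1e15 peak FLOP/s each,
# processing 57,000 tokens per second across the whole job.
print(f"{mfu(7e9, 57_000, 8, 1e15):.1%}")  # 29.9%
```

Because the numerator counts only the FLOPs the model mathematically requires, time lost to communication, data stalls, or recomputation lowers MFU even when the utilization gauge stays high.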

L6 Inference Engine Layer

This is the most critical layer in production environments and the one seeing the fastest growth in industry attention. The table below lists the core metrics for inference engines:

Metric	Description
TTFT	Time To First Token
ITL	Inter Token Latency
Queue Wait	Request queuing time
Throughput	Token throughput
Batch Size	Batch processing efficiency
KV Cache Usage	KV Cache utilization

Table 6: L6 Inference Engine Key Metrics
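TTFT and ITL from the table above can both be derived from token arrival timestamps. The sketch below assumes the client records when the request was sent and when each output token arrived, in seconds; the function names and timings are hypothetical.

```python
def ttft(request_sent_s: float, token_times_s: list[float]) -> float:
    """Time To First Token: delay from sending the request to the first token."""
    return token_times_s[0] - request_sent_s


def mean_itl(token_times_s: list[float]) -> float:
    """Inter Token Latency: mean gap between consecutive output tokens."""
    gaps = [later - earlier
            for earlier, later in zip(token_times_s, token_times_s[1:])]
    return sum(gaps) / len(gaps) if gaps else 0.0


# Hypothetical stream: first token after 0.4 s, then one token every 50 ms.
sent = 100.00
arrivals = [100.40, 100.45, 100.50, 100.55]
print(round(ttft(sent, arrivals), 2))  # 0.4
print(round(mean_itl(arrivals), 3))    # 0.05
```

TTFT includes queue wait and prefill, while ITL reflects the decode phase, so the two can degrade independently.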

Many teams still focus solely on GPU Utilization, but the true capacity metric for inference systems is often KV Cache (Key-Value Cache) Utilization.

Why KV Cache Matters More Than GPU Utilization

In modern LLM inference systems, GPU compute usually has headroom, but KV Cache often runs out first. A typical scenario: GPU utilization is only 60%, KV Cache is at 95%, new requests start queuing, and TTFT spikes rapidly.

For inference systems, KV Cache is akin to a database’s Buffer Pool: it often determines the capacity ceiling of the entire system. When KV Cache approaches saturation, the engine is forced to evict cached entries or preempt running requests. The affected requests lose their cached context and must recompute it or be retried, which ultimately shows up as latency spikes and throughput drops.
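Why the cache fills before compute does becomes clearer with a rough sizing exercise. The sketch below uses the standard estimate that each token stores one key and one value vector per layer. The model shape, the 20 GiB budget, and the 4,096-token context are hypothetical numbers chosen to keep the arithmetic easy to follow.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int,
                       head_dim: int, bytes_per_value: int = 2) -> int:
    """Bytes of KV Cache one token occupies: a key and a value per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


def max_cached_tokens(free_vram_bytes: int, per_token_bytes: int) -> int:
    """How many tokens fit in the memory left over for the KV Cache."""
    return free_vram_bytes // per_token_bytes


# Hypothetical model: 32 layers, 8 KV heads of dimension 128, 16-bit values.
per_token = kv_bytes_per_token(num_layers=32, num_kv_heads=8, head_dim=128)
print(per_token)       # 131072 bytes, i.e. 128 KiB per token

budget = 20 * 1024**3  # assume 20 GiB remains after loading the weights
tokens = max_cached_tokens(budget, per_token)
print(tokens)          # 163840
print(tokens // 4096)  # 40 concurrent requests at 4,096 tokens each
```

Under these assumptions the system saturates at about 40 long-context requests regardless of how much compute is still idle, which matches the 60% utilization and 95% KV Cache scenario described above.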

L7 GenAI API Layer

With the development of OpenTelemetry GenAI Semantic Conventions, the industry is converging on unified AI observability standards. The table below lists the key metrics at the API layer:

Metric	Description
Input Tokens	Input token count
Output Tokens	Output token count
Request Duration	Request latency
TTFT	First Token latency
Token Throughput	Token throughput

Table 7: L7 GenAI API Layer Key Metrics

This layer marks the shift from infrastructure-centric to application-centric observability. API-layer metrics are what end users and business systems experience directly, making them the core data source for SLA/SLO assessment.
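As an illustration, the sketch below builds a plain attribute dictionary for one request. The `gen_ai.*` keys follow the OpenTelemetry GenAI Semantic Conventions as I understand them at the time of writing; the conventions are still evolving, so the names should be checked against the current specification. The `genai_attributes` helper and the `app.output_tokens_per_second` key are hypothetical and not part of any standard.

```python
def genai_attributes(model: str, input_tokens: int, output_tokens: int,
                     duration_s: float) -> dict:
    """Build span-style attributes describing one LLM request."""
    attributes = {
        # Keys below are intended to match the OTel GenAI conventions.
        "gen_ai.operation.name": "chat",
        "gen_ai.request.model": model,
        "gen_ai.usage.input_tokens": input_tokens,
        "gen_ai.usage.output_tokens": output_tokens,
    }
    # Hypothetical derived attribute, not defined by the conventions.
    attributes["app.output_tokens_per_second"] = (
        output_tokens / duration_s if duration_s > 0 else 0.0
    )
    return attributes


attrs = genai_attributes("example-model", input_tokens=512,
                         output_tokens=128, duration_s=4.0)
print(attrs["gen_ai.usage.output_tokens"])    # 128
print(attrs["app.output_tokens_per_second"])  # 32.0
```

Keeping the standard keys separate from derived ones makes it easier to swap in an instrumentation library later without renaming dashboards.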

L8 Business and Cost Layer

Ultimately, enterprises don’t care about GPU utilization — they care about: Is the GPU creating value? The table below lists the core metrics for the business and cost layer:

Metric	Description
Cost per GPU Hour	GPU hourly cost
Cost per Token	Token cost
Cost per Request	Request cost
Idle GPU Cost	Idle resource cost
Tokens per Watt	Inference efficiency

Table 8: L8 Business and Cost Layer Key Metrics

The core competitive metric for future AI infrastructure may not be GPU Utilization, but rather Cost per Useful Token — the comprehensive cost of producing one useful Token. This metric unifies hardware cost, energy consumption, inference efficiency, and business value into a single measurement framework.
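The relationship between the metrics in this layer can be sketched with simple arithmetic. Two assumptions are labeled here: "useful" is taken to mean the fraction of generated tokens that belong to requests the business actually consumed (not retried or discarded), and "Tokens per Watt" is computed as tokens per second divided by watts, which is the same as tokens per joule. Prices and throughput are invented.

```python
def cost_per_million_tokens(gpu_hour_cost: float, num_gpus: int,
                            tokens_per_second: float) -> float:
    """Cost of producing one million tokens at a steady throughput."""
    hourly_cost = gpu_hour_cost * num_gpus
    tokens_per_hour = tokens_per_second * 3600
    return hourly_cost / tokens_per_hour * 1_000_000


def cost_per_million_useful_tokens(cost_per_million: float,
                                   useful_fraction: float) -> float:
    """Scale the raw cost by the share of tokens that were actually useful."""
    return cost_per_million / useful_fraction


def tokens_per_watt(tokens_per_second: float, power_watts: float) -> float:
    """Tokens per second per watt, i.e. tokens produced per joule."""
    return tokens_per_second / power_watts


# Hypothetical deployment: 8 GPUs at 2.00 per GPU hour, 5,000 tokens/s,
# 80% of tokens useful, 5,600 W total power draw.
raw = cost_per_million_tokens(2.00, 8, 5_000)
print(round(raw, 2))                                       # 0.89
print(round(cost_per_million_useful_tokens(raw, 0.8), 2))  # 1.11
print(round(tokens_per_watt(5_000, 5_600), 2))             # 0.89
```

Idle GPU cost enters the same formula implicitly: the hourly cost is paid whether or not tokens are produced, so lower throughput raises the cost per token directly.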

Cross-Layer Troubleshooting

Real production issues often span multiple layers. The table below lists common cross-layer failure symptoms and their potential causes:

Symptom	Possible Causes
TTFT Increase	CPU IO Wait, Queue Depth, KV Cache Pressure
Throughput Drop	NCCL, Batch Size, Network
Latency Spike	KV Cache Eviction
OOM	Insufficient VRAM, Oversized Batch
Training Stall	NCCL Straggler

Table 9: Cross-Layer Troubleshooting Reference

Modern AI operations are evolving from single-point monitoring to cross-layer observability. The root cause of a single-layer metric anomaly often lies in another layer. Only by building cross-layer correlation capabilities can teams achieve rapid root-cause identification and precise troubleshooting.
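One lightweight way to start on cross-layer correlation is to encode Table 9 as a lookup from symptom to the layers worth inspecting. The symptoms and causes below come from the table; the assignment of each cause to a layer is my own reading of the eight-layer model and should be treated as an assumption.

```python
# Symptom -> list of (possible cause, layer to inspect).
CAUSES = {
    "TTFT Increase": [("CPU IO Wait", "L3"), ("Queue Depth", "L6"),
                      ("KV Cache Pressure", "L6")],
    "Throughput Drop": [("NCCL", "L2"), ("Batch Size", "L6"),
                        ("Network", "L3")],
    "Latency Spike": [("KV Cache Eviction", "L6")],
    "OOM": [("Insufficient VRAM", "L1"), ("Oversized Batch", "L6")],
    "Training Stall": [("NCCL Straggler", "L2")],
}


def layers_to_check(symptom: str) -> list[str]:
    """Distinct layers to inspect for a symptom, in ascending layer order."""
    return sorted({layer for _, layer in CAUSES.get(symptom, [])})


print(layers_to_check("TTFT Increase"))    # ['L3', 'L6']
print(layers_to_check("Throughput Drop"))  # ['L2', 'L3', 'L6']
```

Even this static mapping shows the pattern the section describes: a symptom observed at the inference layer frequently sends the investigation down to the host or communication layers.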

From GPU Control Plane to AI Observability Plane

Over the past few years, the industry has focused primarily on projects like Kubernetes Scheduler, Volcano, Kueue, and HAMi, solving the problem of how to allocate GPUs. In the coming years, the industry will start asking whether GPUs are truly generating value, leading to the formation of a new AI infrastructure technology stack:

GPU Hardware
↓
Kubernetes Control Plane
↓
Inference Runtime
↓
Observability Plane
↓
Optimization Plane

GPU scheduling determines how resources are allocated. Observability determines whether resources are being used correctly. Together, they form the next-generation AI Infrastructure Stack. The evolution from GPU Control Plane to AI Observability Plane marks a new era where AI infrastructure transitions from “resource management” to “value management.”

Summary

AI infrastructure observability is undergoing a fundamental paradigm shift. We used to look only at GPU utilization; now we need to build a complete observability system across eight layers — from GPU hardware, CUDA runtime, host OS, Kubernetes scheduling, training runtime, inference engine, GenAI API, to business cost.

Key takeaways:

L1-L3 focus on hardware and systems: GPU health, CUDA communication efficiency, and host resource adequacy
L4 focuses on resource allocation: Kubernetes scheduling and GPU sharing, with tools like HAMi solving allocation problems
L5-L6 focus on model runtime: Training efficiency (MFU) and inference capacity (KV Cache) are the core metrics
L7-L8 focus on business value: From Token throughput to cost per Token, ultimately measuring whether GPUs are creating value

GPU scheduling is the starting point for AI infrastructure, but not the end goal. The 8-layer observability stack from GPU to Token is the critical closed loop that ensures AI infrastructure truly delivers business value.

References

GPU & LLM Inference Observability — Layer-by-Layer Coverage - GitHub

Jimmy Song
