System Diagnosis Principles: Criteria for Health Status
To maintain the long-term healthy evolution of AI infrastructure, post-mortem summaries are far from sufficient. We need a set of system diagnosis principles to detect hidden risks early and correct deviations.
Based on the Yin-Yang Five Elements Yun model, diagnosis can be conducted along the following five dimensions:
Five-Dimensional Diagnosis Framework
Five Elements Balance Check
Assess the current status of five aspects: Data (Water), Models (Wood), Compute (Fire), Platform (Earth), and Hardware (Metal).
Diagnosis Method
Checklist:
- Can data pipelines keep up with demands? (Water)
- Are model capabilities fully utilized? (Wood)
- Are compute resources effectively used? (Fire)
- Can the platform support current load? (Earth)
- Is hardware becoming a bottleneck? (Metal)
Identify Problems
| Problem Type | Manifestation | Solution |
|---|---|---|
| Short Board | One element significantly weaker than others | Prioritize strengthening that element |
| Overload | One element consumes excessive resources or frequently becomes a bottleneck | Introduce limits or expand other elements to share pressure |
Typical Symptoms
- Water Level Too Low: Data pipelines always lag behind training needs → Replenish data processing capacity
- Metal Overload: Hardware often runs at full capacity or even triggers limit alarms → Expand capacity or impose constraints on upper layers
Most failures stem not from missing components but from a long-term imbalance between the roles the elements play.
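
A minimal sketch of how the checklist and problem table above might be automated. The per-element health scores, the utilization figures, and both thresholds (`weak_gap`, `overload_at`) are illustrative assumptions, not values prescribed by the model.

```python
from statistics import mean

def classify_elements(health, utilization, weak_gap=0.25, overload_at=0.9):
    """Flag short boards and overloaded elements.

    health:      element -> score in [0, 1] (hypothetical, higher is healthier)
    utilization: element -> fraction of capacity in use (hypothetical)
    """
    findings = {}
    average = mean(health.values())
    for element, score in health.items():
        # Short board: the element sits well below the average of all five.
        if average - score >= weak_gap:
            findings[element] = "short board"
    for element, used in utilization.items():
        # Overload takes precedence if an element is flagged as both.
        if used >= overload_at:
            findings[element] = "overload"
    return findings

health = {"water": 0.4, "wood": 0.8, "fire": 0.85, "earth": 0.8, "metal": 0.75}
utilization = {"water": 0.6, "wood": 0.5, "fire": 0.7, "earth": 0.65, "metal": 0.97}
print(classify_elements(health, utilization))
# {'water': 'short board', 'metal': 'overload'}
```

In this made-up data, Water is the short board and Metal is overloaded, which corresponds to the two typical symptoms listed above.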
Qi Flow Smoothness Check
Analyze whether Qi flows smoothly through the system via full-link monitoring.
Diagnosis Method
Key Metrics:
- Latency distribution of key processes
- Queue backlogs
- Resource utilization curves
Qi Smooth vs. Qi Not Smooth
| State | Characteristics |
|---|---|
| Qi Smooth | Processing rates across stages roughly match, with no long-term backlogs or idle resources |
| Qi Not Smooth | One stage remains a bottleneck for long periods, or large amounts of resources sit idle |
Diagnosis Points
Distinguish temporary fluctuations from persistent trends: a brief peak does not necessarily indicate Qi blockage, but a persistent deviation must be addressed.
Tool Support:
- Dashboards and automated alerts
- Timely capture of “stagnant Qi” locations
- Further investigation of causes (which Five Elements imbalance corresponds)
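
The distinction between a brief peak and a persistent deviation can be sketched as a check on how long a metric stays above a threshold. The queue-depth samples, the threshold, and the `min_persistent` window below are hypothetical values chosen for illustration.

```python
def longest_run_above(samples, threshold):
    """Length of the longest consecutive run of samples above threshold."""
    longest = current = 0
    for value in samples:
        current = current + 1 if value > threshold else 0
        longest = max(longest, current)
    return longest

def classify_backlog(samples, threshold, min_persistent):
    """Label a series as 'smooth', 'transient', or 'persistent'."""
    run = longest_run_above(samples, threshold)
    if run >= min_persistent:
        return "persistent"   # sustained deviation: treat as Qi not smooth
    if run > 0:
        return "transient"    # brief peak: note it, but no blockage yet
    return "smooth"

# Hypothetical queue-depth samples, one per monitoring interval.
spike = [10, 12, 250, 11, 9, 13, 10, 12]
stalled = [10, 120, 150, 180, 210, 240, 260, 300]

print(classify_backlog(spike, threshold=100, min_persistent=3))    # transient
print(classify_backlog(stalled, threshold=100, min_persistent=3))  # persistent
```

The same check could be applied to latency percentiles or to idle-resource ratios, with the comparison inverted for metrics where low values are the problem.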
Yin-Yang Dynamics Check
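Assess whether the current strategy and system state lean toward Yang Excess with Yin Deficiency (too much expansion, too little stability) or Yin Excess with Yang Deficiency (too much constraint, too little momentum).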
Diagnosis Method
Qualitative Analysis:
- Check whether recent architecture decisions lean too far toward one extreme.
- Has the team been continuously expanding and adding new features while neglecting stability?
- Or, conversely, are there many layers of approval and strict constraints but little momentum for innovation?
Quantitative Metrics:
| Metric | Yang Excess | Yin Excess |
|---|---|---|
| Change Frequency | Extremely high | Extremely low |
| Incident Rate | Frequent | Extremely low, but only because nothing changes |
| Release Rhythm | Continuous | Long-term stagnation |
Balance Strategy
| State | Symptoms | Solution |
|---|---|---|
| Yang Excess Yin Deficiency | Frequent changes with frequent incidents | Pause releases, focus on addressing hazards (replenish Yin) |
| Yin Excess Yang Deficiency | Long-term no change and stagnation | Introduce challenges and innovation (add Yang) |
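
As a rough sketch, the quantitative metrics can be combined into a simple classifier. The weekly counts and the three cut-off values are assumptions made for this example; the text above only describes them as "extremely high", "extremely low", and "frequent".

```python
def yin_yang_state(changes_per_week, incidents_per_week,
                   high_change=20, low_change=1, high_incident=3):
    """Classify the system as 'yang_excess', 'yin_excess', or 'balanced'.

    All thresholds are illustrative defaults, not prescribed values.
    """
    # Frequent changes together with frequent incidents: too much Yang.
    if changes_per_week >= high_change and incidents_per_week >= high_incident:
        return "yang_excess"
    # Almost no changes at all: too much Yin, regardless of incident count.
    if changes_per_week <= low_change:
        return "yin_excess"
    return "balanced"

REMEDY = {
    "yang_excess": "Pause releases, focus on addressing hazards",
    "yin_excess": "Introduce challenges and innovation",
    "balanced": "No correction needed",
}

state = yin_yang_state(changes_per_week=35, incidents_per_week=5)
print(state, "->", REMEDY[state])
```

A numeric classifier like this complements the qualitative questions rather than replacing them, since a low incident rate alone cannot tell a stable system from a stagnant one.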
Yun Alignment Check
Determine whether the organization’s actions match the system’s current stage, preventing counter-Yun operation.
Diagnosis Method
Combine Business Development and Technical Maturity:
| Error Pattern | Manifestation | Why It Mismatches the Stage |
|---|---|---|
| Premature Standardization | Spending a great deal of effort on process management and cost optimization for emerging projects | These are typically Scale Stage concerns, but the project is still in the Exploration Stage |
| Counter-Yun Exploration | Frequently changing the underlying architecture of widely used platforms without rigorous testing | Inconsistent with the Scale Stage, which calls for stability |
Stage-Strategy Reference Table
| Stage | Should Focus On | Should Not Do |
|---|---|---|
| Exploration Stage | Diversity, flexibility, rapid trial and error | Premature pursuit of efficiency |
| Platform Stage | Standardization, process norms | Frequent arbitrary changes |
| Scale Stage | Optimization, stability, efficiency | Continued unchecked growth |
| Rebalancing Stage | Transformation, breakthrough, innovation | Clinging to the past |
Checklist:
- Which stage are we currently in?
- Do our actions match the stage?
- Do we need to adjust strategy?
When actions do not match the stage, adjust the strategy immediately to avoid working at cross-purposes.
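
The reference table can be encoded as a lookup so that planned initiatives are checked against the current stage. This is only a sketch: the tag names and the sample initiatives are hypothetical, and a real review would still rely on judgment about which stage applies.

```python
# Stage -> what to focus on and what to avoid, following the reference table.
# The tag strings are hypothetical identifiers invented for this example.
STAGE_GUIDE = {
    "exploration": {"focus": {"diversity", "flexibility", "rapid_trial"},
                    "avoid": {"efficiency_optimization"}},
    "platform": {"focus": {"standardization", "process_norms"},
                 "avoid": {"arbitrary_change"}},
    "scale": {"focus": {"optimization", "stability", "efficiency"},
              "avoid": {"unchecked_growth"}},
    "rebalancing": {"focus": {"transformation", "breakthrough", "innovation"},
                    "avoid": {"clinging_to_past"}},
}

def misaligned(stage, initiatives):
    """Return the names of initiatives whose tag the given stage should avoid."""
    avoid = STAGE_GUIDE[stage]["avoid"]
    return sorted(name for name, tag in initiatives.items() if tag in avoid)

planned = {
    "cost dashboard rollout": "efficiency_optimization",
    "try three model-serving prototypes": "rapid_trial",
}
print(misaligned("exploration", planned))  # ['cost dashboard rollout']
```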
Yang Runaway Warning
Pay special attention to whether there are signs of Yang state runaway in the system.
What is Yang Runaway?
The risk of exponential growth or collapse caused by unconstrained positive feedback.
Typical Scenarios
| Scenario | Mechanism | Risk |
|---|---|---|
| Service Call Volume Surge | Bug or abuse → Resource strain → Queuing and retry storms → Further increase in calls | Resource exhaustion |
| Training Task Self-Replication | Tasks replicate themselves without limit to speed up → Cluster resources are exhausted | System collapse |
Diagnosis Signals
- A metric shows exponential growth
- Lack of slowing mechanisms
- Formation of vicious cycles
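
One way to operationalize the first signal is to test whether a metric keeps multiplying by at least a fixed factor from one sample to the next, which separates exponential growth from linear growth. The `min_ratio` and `min_intervals` parameters and the sample series are assumptions chosen for illustration.

```python
def is_runaway(samples, min_ratio=1.5, min_intervals=4):
    """True if each of the last `min_intervals` intervals grew by >= min_ratio."""
    if len(samples) < min_intervals + 1:
        return False  # not enough history to judge
    recent = samples[-(min_intervals + 1):]
    return all(prev > 0 and curr / prev >= min_ratio
               for prev, curr in zip(recent, recent[1:]))

# Hypothetical requests-per-second samples, one per monitoring interval.
retry_storm = [100, 160, 250, 400, 640, 1000]   # multiplies ~1.6x each step
linear_ramp = [100, 200, 300, 400, 500]         # grows, but ratio keeps falling
noisy_flat = [100, 110, 105, 120, 115, 118]

print(is_runaway(retry_storm))  # True
print(is_runaway(linear_ramp))  # False
print(is_runaway(noisy_flat))   # False
```

A steady linear ramp is not flagged because its growth ratio shrinks over time, whereas a retry storm sustains its ratio and trips the check.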
Response Strategy
| Strategy | Means | Effect |
|---|---|---|
| Establish Hard Limits | Metal’s constraints | Immediate shutdown |
| Introduce Negative Feedback | Earth’s governance (rate limiting, quotas) | Braking and deceleration |
| Break Positive Feedback Chain | Activate emergency plan | Pull back to steady state |
When a metric shows exponential growth and no slowing mechanism is in place, intervene immediately.
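
Rate limiting is named above as one form of negative feedback. A token bucket is one common way to implement it; the source does not specify an algorithm, so the sketch below is an assumption. The clock is passed in explicitly to keep the example deterministic.

```python
class TokenBucket:
    """Token-bucket rate limiter: allows bursts up to `capacity`,
    then at most `rate` requests per second on average."""

    def __init__(self, rate, capacity):
        self.rate = float(rate)          # tokens added per second
        self.capacity = float(capacity)  # maximum tokens that can accumulate
        self.tokens = float(capacity)    # start with a full bucket
        self.last = 0.0                  # timestamp of the previous call

    def allow(self, now):
        """Return True if a request arriving at time `now` (seconds) may proceed."""
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # caller should reject or back off instead of retrying at once

bucket = TokenBucket(rate=1, capacity=3)
burst = [bucket.allow(now=0.0) for _ in range(5)]
print(burst)                  # [True, True, True, False, False]
print(bucket.allow(now=1.0))  # True: one token refilled after one second
```

Rejecting the excess requests only breaks the feedback chain if callers back off; clients that retry rejected calls immediately recreate the retry storm described in the scenarios table.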
Diagnosis Implementation Process
Regular Diagnosis Mechanism
We recommend establishing a periodic diagnosis process:
Diagnosis Meeting Agenda
Fixed session in the weekly operations review meeting:
- Check Five Elements scores for each module
- Review the global Qi flow diagram
- Analyze Yin-Yang dynamics
- Discuss the current Yun
This systematic examination leaves hidden risks nowhere to hide, so that problems can be prevented before they occur.
Diagnosis Action Matrix
| Diagnosis Result | Action Recommendation |
|---|---|
| Five Elements: One Element Too Weak | Concentrate resources to strengthen the weakness |
| Five Elements: One Element Overloaded | Expand capacity or introduce constraints |
| Qi Stagnation at One Stage | Clear bottlenecks, optimize processes |
| Yang Excess Yin Deficiency | Strengthen governance and stability mechanisms |
| Yin Excess Yang Deficiency | Activate innovation and boost vitality |
| Counter-Yun Operation | Adjust strategy and go with the flow |
| Yang Runaway Warning | Immediate intervention, break positive feedback |
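
The matrix can be kept as a lookup table so that each review produces a consistent action list. In the sketch below the finding identifiers are hypothetical, and placing a Yang Runaway warning first reflects the "immediate intervention" wording rather than a rule stated elsewhere in the model.

```python
# Finding identifier (hypothetical) -> recommended action from the matrix.
ACTIONS = {
    "element_weak": "Concentrate resources to strengthen the weakness",
    "element_overloaded": "Expand capacity or introduce constraints",
    "qi_stagnation": "Clear bottlenecks, optimize processes",
    "yang_excess": "Strengthen governance and stability mechanisms",
    "yin_excess": "Activate innovation and boost vitality",
    "counter_yun": "Adjust strategy and go with the flow",
    "yang_runaway": "Immediate intervention, break positive feedback",
}

def plan(findings):
    """Map findings to actions, de-duplicated, with a runaway warning first."""
    ordered = sorted(set(findings), key=lambda f: (f != "yang_runaway", f))
    return [(finding, ACTIONS[finding]) for finding in ordered]

for finding, action in plan(["qi_stagnation", "yang_runaway", "qi_stagnation"]):
    print(f"{finding}: {action}")
```

An unknown finding raises `KeyError`, which is deliberate here: a diagnosis result without a row in the matrix should be discussed rather than silently ignored.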
Summary
With these diagnosis principles, architects and operations teams can periodically take the pulse of their infrastructure, much as a TCM practitioner reads a patient's pulse.
When a diagnosis indicates an imbalance, prescribe the remedy the model suggests right away: replenish what is deficient and reduce what is in excess.
Applied consistently over the long term, this practice helps keep the system on a healthy evolutionary trajectory.