Responding to Data Center Failures Through CT-Guided Insights

June 18, 2026

Authors

Keith Beers, Ph.D., P.E., CFEI, CQE, CVFI

Practice Director and Principal Engineer

Materials Science and Electrochemistry

Menlo Park

Daniel Vasquez, Ph.D., CFEI

Director of Asia Offices and Principal

Electrical Engineering and Computer Science

Menlo Park

Yash Bhargava, Ph.D., P.E.

Principal Engineer

Metallurgical and Corrosion Engineering

Menlo Park

Alex Hudgins, Ph.D., P.E.

Principal Engineer

Metallurgical and Corrosion Engineering

Menlo Park

Garrett Grocke, Ph.D.

Senior Associate

Materials Science and Electrochemistry

Menlo Park

Executive Summary

As data center infrastructure scales to meet the demands of AI workloads and cloud computing, the hardware systems powering that growth are becoming more complex, more interconnected, and increasingly vulnerable to failures that propagate rapidly across cooling, electronics, and power components. Standard approaches like destructive analysis and surface-level inspections are often insufficient to accurately investigate and characterize the internal conditions that drive performance degradation and failure.

Computed tomography (CT) provides non-destructive, volumetric visibility into complex assemblies, preserving critical components and evidence while guiding future mitigations. But realizing its value hinges on expert analysis to rigorously interpret complex data, determine root causes, and deliver precise, actionable guidance.

How can non-destructive analysis support root cause investigations and future improvements?

Global data center capacity is projected to grow, propelled by AI workloads, cloud computing, and the accelerating expansion of hyperscale and edge infrastructure. That growth is not simply a matter of adding more hardware; it means integrating increasingly complex systems across compressed timelines, global supply chains, and multiple vendors while preserving the reliability and performance that operators and customers depend on.

As infrastructure expands at this breakneck pace, the sheer complexity and scale of data center operations introduce new, hidden failure modes. Internal degradation, manufacturing defects, and operational damage can develop within fully assembled systems undetected and cascade across installations. Computed tomography (CT) enables teams to non-destructively evaluate specific components and their construction while also guiding subsequent destructive analysis to maximize insight from each investigation.

CT, especially cutting-edge systems with high accelerating voltage and submicron detectability, provides a way to see inside complex hardware without destroying or altering critical evidence. However, realizing its value across rapidly scaling data center infrastructure requires the ability to interpret complex data, determine root causes, and provide precise, actionable guidance. When failures occur inside sealed, fully assembled systems, CT provides the non-destructive insight needed to determine what happened and why, turning isolated failures into actionable intelligence that hardens future systems as they rapidly scale.

What does it take to scale data centers that can be trusted?

Cooling systems

Modern cooling architectures — particularly liquid cooling — are indispensable for supporting higher compute densities, but they introduce risks that are difficult to assess once systems are assembled and deployed. Leaks, internal and external corrosion, erosion, and manufacturing variability in evolving liquid cooling designs may remain hidden until a field failure occurs. In hyperscale environments, where cooling systems operate continuously under high demand, a single undetected defect has the potential to escalate into widespread thermal events across interconnected infrastructure.

CT enables non-destructive visualization of internal flow paths, fittings, and interfaces, allowing engineers to localize defect pathways or signs of internal degradation without disassembly. This insight helps teams rapidly narrow the scope of a potential investigation while preserving what's needed for follow-up evaluation.

High-resolution imaging, however, is only as effective as the analysis behind it. Apparent voids or material changes may not be causative, requiring the expertise to distinguish meaningful defects from incidental manufacturing features. By integrating CT findings with materials analysis, insights from thermal and materials sciences, and a deep system- and component-level understanding of data center infrastructure, teams can resolve active failures while paying those lessons forward — improving design and operational robustness, evaluating vendors, and hardening cooling systems ahead of future deployment cycles.

Electronics and printed circuit boards

Data center electronics often rely on large, densely populated printed circuit boards with tightly coupled failure modes. Internal cracks, solder joint defects, dendrites, or thermally induced damage can result in electrical faults including open circuits, short circuits, or high-leakage failures. Such failures may be invisible through external inspection or conventional 2D imaging, particularly in multi-layer boards or complex system-in-package designs that high-performance compute platforms demand. System-in-package designs may contain passive and active components that are not visible through external inspection.

Advanced CT imaging provides three-dimensional insight into these assemblies, helping identify internal features and guide electrical fault isolation or targeted destructive analysis. This is particularly valuable for large boards where blind sectioning can risk missing the true failure site — a costly mistake when replacement lead times are measured in weeks. In post-incident scenarios, CT provides the non-destructive insight needed to triage failed hardware, narrowing the problem space before more invasive techniques are applied.

Specialized expertise is critical. Many features exist at the edge of detectability, and not every anomaly is a root cause. Experienced interpretation helps teams avoid chasing secondary damage or benign manufacturing artifacts. Expert, CT-guided analysis accelerates diagnosis while maintaining the technical rigor and defensibility that high-stakes failures — and the disputes that may follow — require.
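
To illustrate why features on large boards can sit at the edge of detectability, the sketch below applies the standard cone-beam CT geometry relations: magnification is the source-to-detector distance divided by the source-to-object distance, and voxel size is the detector pixel pitch divided by that magnification. It is a simplified illustration, not a description of any particular scanner or workflow. The function names and all numeric values are hypothetical, and the sketch ignores focal spot size, offset-detector or stitched scans, and region-of-interest scanning. Voxel size is also only a rough proxy for what is actually detectable.

```python
def voxel_size_um(pixel_pitch_um: float,
                  source_to_detector_mm: float,
                  source_to_object_mm: float) -> float:
    """Voxel edge length for a cone-beam scan (hypothetical helper).

    Geometric magnification M = source-to-detector / source-to-object.
    Voxel size = detector pixel pitch / M.
    """
    if min(pixel_pitch_um, source_to_detector_mm, source_to_object_mm) <= 0:
        raise ValueError("all inputs must be positive")
    magnification = source_to_detector_mm / source_to_object_mm
    return pixel_pitch_um / magnification


def min_voxel_size_um(object_width_mm: float, detector_pixels: int) -> float:
    """Smallest voxel edge when the whole object must fit in one field of view.

    Assumption: a single, centered scan with no stitching, so the object
    width is spread across the detector's pixel count.
    """
    if object_width_mm <= 0 or detector_pixels <= 0:
        raise ValueError("all inputs must be positive")
    return object_width_mm * 1000.0 / detector_pixels


# Hypothetical numbers: a 400 mm wide board versus a 20 mm region of interest,
# both imaged on a detector that is 4000 pixels across.
full_board = min_voxel_size_um(400.0, 4000)
small_region = min_voxel_size_um(20.0, 4000)
print(f"full board: {full_board:.1f} um, small region: {small_region:.1f} um")
```

Under these assumed values, scanning the full board in one pass gives voxels twenty times coarser than scanning the small region, which is one reason a fine crack or solder defect can be missed in an overview scan and only resolved once the region of interest has been narrowed.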

Batteries and backup power systems

Backup power systems present distinct safety and reliability challenges, especially as battery energy storage systems grow in size and energy density to support increasingly power-hungry data center loads. Internal defects or degradation mechanisms may progress unnoticed until a field incident occurs, with consequences ranging from unexpected downtime to thermal runaway and fires.

CT enables non-destructive assessment of internal battery structures, supporting evaluation of assembly quality. As with electronics, CT serves as a critical non-destructive entry point for failure analysis, particularly in post-incident investigations where maintaining the integrity of the cell or module is essential to understanding root cause. However, navigating the dense materials of large-scale cells and packs requires the highest possible imaging detectability and, crucially, the specialized knowledge to accurately interpret the results. Combining CT findings with electrical and materials analysis provides the comprehensive data required to confirm the failure mode and mitigate future recurrence.

Infrastructure, cabling, and connectors

At data center scale, mechanical and structural integrity is foundational to operational reliability. Weld defects in racks, connector misalignment under sustained load, and deformation caused by heavy cabling — increasingly required by high-density power distribution — may be difficult or impossible to evaluate without compromising hardware. During rapid capacity expansion, small mechanical issues have the potential to quickly multiply across installations.

When a mechanical failure occurs, CT provides the non-destructive entry point needed to localize the problem, preserve critical evidence, and guide subsequent analysis — whether targeted destructive sectioning or materials characterization. This is particularly valuable for connector and weld failures where the causal feature may be buried deep within an assembly, and where blind sectioning risks destroying the very evidence needed to establish root cause.

Recurring defect signatures, vendor-specific failure patterns, and construction anomalies identified through failure analysis can inform procurement decisions and flag systemic risks before they propagate across installations. A weld defect that caused a single rack failure may reflect a manufacturing process issue present across hundreds of units in the field — understanding it fully is what makes that distinction actionable.

Can non-destructive failure analysis drive operational resilience?

Hyperscale and edge facilities are often in remote or constrained environments where rapid intervention is difficult and redundancy is limited. Failures that might be manageable in isolation frequently carry consequences far beyond the data centers themselves.

As AI workloads drive data center infrastructure to scales and densities that would have been unthinkable a decade ago, the systems underpinning them — and the industries, services, and critical functions those systems support — leave increasingly little margin for unresolved failures. In that environment, the failure patterns, defect signatures, and root causes surfaced through rigorous CT-guided analysis don't stay contained to a single incident: they become the intelligence that reduces risk across every installation that follows.

What Clients are Talking About

Multi-vendor environments are among the most complex failure scenarios to navigate. When failures occur at interfaces between systems — between a cooling supplier's hardware and a rack manufacturer's enclosure, for example — each party may attribute root cause to the other.
Rigorous, independent failure analysis that preserves evidence from the earliest stage, characterizes each component's condition non-destructively, and reconstructs the sequence of failure is essential for establishing what actually happened. Objective, well-documented findings grounded in physical evidence, including CT data, are the most effective basis for resolving disputes and preventing recurrence.
Higher rack densities — AI infrastructure often exceeds 100 kW per rack, with some configurations surpassing 200 kW — introduce failure mechanisms that were marginal or absent at lower power levels. Thermal gradients across boards and connectors become more severe, accelerating fatigue and degradation at solder joints and interfaces.
Power distribution components face higher current densities, increasing susceptibility to resistive heating and electrochemical degradation. Structural elements experience greater mechanical loads from heavier cabling and cooling hardware. Each of these mechanisms often progresses silently before manifesting as a field failure.
Power quality — including voltage transients, harmonic distortion, and frequency variation — is an underappreciated driver of hardware degradation, particularly for sensitive electronics. At hyperscale campuses with mature grid connections and robust conditioning infrastructure, these effects are manageable.
At edge sites, where grid infrastructure may be less stable, generator-dependent, or subject to load fluctuations from co-located facilities, the exposure is meaningfully higher. Over time, recurring power anomalies may accelerate degradation of capacitors, transformers, and PCB components in ways that are difficult to attribute without detailed materials analysis and historical context.
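
The rack-density point above can be made concrete with a rough calculation of resistive heating at a single power connection. The sketch uses only two basic relations: current equals power divided by voltage, and heat dissipated equals current squared times resistance. It is a deliberately simplified illustration. The function name, the 400 V bus voltage, and the 0.5 milliohm contact resistance are all hypothetical, and the sketch assumes the full rack load passes through one contact, which real power distribution designs avoid by splitting the load across many conductors and phases.

```python
def contact_heating_w(rack_power_kw: float,
                      bus_voltage_v: float,
                      contact_resistance_mohm: float) -> float:
    """Heat dissipated at one contact carrying the full rack load.

    Hypothetical helper: current I = P / V, heating = I^2 * R.
    """
    if rack_power_kw < 0 or bus_voltage_v <= 0 or contact_resistance_mohm < 0:
        raise ValueError("power and resistance must be non-negative, voltage positive")
    current_a = rack_power_kw * 1000.0 / bus_voltage_v
    return current_a ** 2 * contact_resistance_mohm / 1000.0


# Assumed values for illustration only: 400 V bus, 0.5 milliohm contact.
for rack_kw in (50.0, 100.0, 200.0):
    heat = contact_heating_w(rack_kw, bus_voltage_v=400.0, contact_resistance_mohm=0.5)
    print(f"{rack_kw:.0f} kW rack -> {heat:.2f} W at the contact")
```

With these assumed values, the 100 kW case dissipates 31.25 W at the contact and the 200 kW case dissipates 125 W. Doubling rack power at a fixed voltage quadruples the heating, which is why a marginal connection that was tolerable at lower densities can become a degradation site at higher ones.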
