Over recent decades, nearly unimaginable improvements in computing performance and efficiency have unfolded, driven by Moore's Law and supported by scale-out commodity hardware and loosely coupled software. This combination has delivered online services to billions of people worldwide and put a vast share of human knowledge within easy reach.
Yet, the approaching computing revolution demands more. To realize AI’s potential, a leap in capabilities beyond internet-era advancements is essential. Achieving this necessitates revisiting the foundational elements that fueled past changes and collaboratively innovating to rethink the entire technology stack. Let’s delve into the forces driving this transformation and outline what this new architecture must entail.
For years, the dominant trend in computing has been democratization through scale-out architectures built from nearly identical, commodity servers. This standardization enabled flexible workload placement and efficient resource use. However, the dense, predictable numerical operations over massive datasets that gen AI requires are reversing this trend.
Now, there’s a significant shift towards specialized hardware, like ASICs, GPUs, and TPUs, offering substantial performance improvements per dollar and watt compared to general-purpose CPUs. This expansion of domain-specific compute units, optimized for narrower tasks, is essential for ongoing AI advancements.
These specialized systems often necessitate “all-to-all” communication, with terabit-per-second bandwidth and nanosecond latencies nearing local memory speeds. Today’s networks, mostly based on commodity Ethernet switches and TCP/IP protocols, cannot meet these extreme demands.
Consequently, scaling gen AI workloads across large clusters of specialized accelerators has driven the emergence of specialized interconnects such as ICI for TPUs and NVLink for GPUs. These purpose-built networks favor direct memory-to-memory transfers and use dedicated hardware to accelerate information sharing among processors, bypassing the overhead of traditional networking stacks.
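To make this concrete, here is a minimal JAX sketch of a cross-device all-reduce. The arrays and axis name are illustrative choices, not reference code from any vendor; the point is that a collective such as jax.lax.psum is compiled to run over the accelerator interconnect (ICI on TPU pods, NVLink on connected GPUs), moving data from device memory to device memory rather than through the host's TCP/IP stack.

```python
import jax
import jax.numpy as jnp

def sharded_step(local_shard):
    # Each device works on its own shard of the data...
    partial = jnp.sum(local_shard ** 2)
    # ...then all devices combine their partial results in a single
    # all-reduce over the accelerator interconnect.
    return jax.lax.psum(partial, axis_name="d")

n_dev = jax.local_device_count()
step = jax.pmap(sharded_step, axis_name="d")

# One shard of input per local accelerator.
shards = jnp.arange(n_dev * 4, dtype=jnp.float32).reshape(n_dev, 4)
print(step(shards))  # every device ends up holding the same reduced value
```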
This shift towards tightly integrated, compute-centric networking is crucial for solving communication bottlenecks and efficiently scaling future AI generations.
For decades, computational performance gains have outstripped memory bandwidth growth. Techniques like caching and stacked SRAM have mitigated this somewhat, but the data-intensive nature of AI exacerbates the issue.
The demand for ever more powerful compute units has driven the adoption of high bandwidth memory (HBM), which stacks DRAM directly on the processor package to boost bandwidth and reduce latency. Yet even HBM faces limits: the chip's perimeter constrains how much data can flow on and off it, and moving huge datasets at terabit speeds poses serious energy challenges.
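A back-of-the-envelope roofline calculation illustrates the imbalance. The peak-compute and bandwidth figures below are assumed round numbers, not any particular chip's specifications, but the conclusion holds broadly: operations with low arithmetic intensity leave the compute units waiting on memory.

```python
# Illustrative, assumed numbers (not a specific chip's specs).
PEAK_FLOPS = 1.0e15   # assumed peak compute: 1 PFLOP/s
HBM_BW     = 3.0e12   # assumed HBM bandwidth: 3 TB/s

# FLOPs per byte needed just to keep the compute units busy:
print(f"balance point: {PEAK_FLOPS / HBM_BW:.0f} FLOPs per byte")

# Elementwise add of two 1 GB fp16 tensors (0.5e9 elements each):
bytes_moved = 3e9                    # read both operands, write the result
flops       = 0.5e9                  # one add per element pair
t_memory    = bytes_moved / HBM_BW   # ~1 ms spent moving data
t_compute   = flops / PEAK_FLOPS     # ~0.5 microseconds of actual math
print(f"memory-bound by roughly {t_memory / t_compute:.0f}x")  # ~2000x
```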
These constraints highlight the critical need for higher-bandwidth connectivity and underscore the urgency of breakthroughs in processing and memory architecture. Without such innovations, our powerful compute resources will remain idle, awaiting data, drastically limiting efficiency and scalability.
Modern machine learning (ML) models depend on meticulously coordinated calculations across vast arrays of identical compute elements, consuming enormous amounts of power. This tight coordination, with fine-grained synchronization at the microsecond level, imposes new demands. Unlike systems that embrace heterogeneity, ML computations require homogeneous elements; mixing hardware generations would bottleneck the faster units. Communication pathways must be prearranged and highly efficient, because a delay in a single element can stall the entire process.
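A toy model (a sketch of the failure mode, not production code) shows why homogeneity matters: because every device must finish before the collective exchange completes, step time is set by the slowest participant, so one older or slower unit drags down the whole array.

```python
# Synchronous training step: all devices must finish and exchange results
# before the next step begins, so step time is the maximum over devices.
def step_time(per_device_times):
    return max(per_device_times)

N = 1024
homogeneous_fleet = [1.0] * N                    # identical units
mixed_fleet = [1.0] * (N - 1) + [2.0]            # one older, 2x-slower unit

print(step_time(homogeneous_fleet))  # 1.0
print(step_time(mixed_fleet))        # 2.0 -> 1,023 fast devices now idle half the time
```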
These extreme coordination and power demands necessitate unprecedented compute density. Reducing the physical space between processors is critical for minimizing latency and power consumption, leading to a new class of ultra-dense AI systems.
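Simple physics makes the case for density. Assuming signals propagate at roughly two-thirds the speed of light in copper or fiber, about 5 ns per meter, physical distance alone sets a latency floor:

```python
# Rough physics, not a system spec: propagation delay versus distance.
NS_PER_METER = 1e9 / 2e8   # ~5 ns per meter at roughly 0.67c

for meters in (1, 10, 30, 100):
    print(f"{meters:>3} m  ->  ~{meters * NS_PER_METER:.0f} ns one-way")
# With synchronization budgets measured in microseconds, tens of meters of
# cabling already consume a noticeable fraction of every step.
```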
This quest for extreme density and coordinated computation fundamentally changes infrastructure design, necessitating a radical rethinking of physical layouts and dynamic power management to maximize efficiency.
Traditional fault tolerance relies on redundancy among loosely connected systems to achieve high uptime. ML computing requires a different approach.
First, the sheer scale of the computation makes over-provisioning prohibitively expensive. Second, model training is tightly synchronized, so a single failure can stall thousands of processors. Finally, advanced ML hardware often pushes technological boundaries, raising the likelihood of failures.
The emerging strategy involves frequent checkpointing—saving computation state—along with real-time monitoring, rapid resource allocation, and quick restarts. The underlying hardware and network design must enable swift failure detection and seamless component replacement to maintain performance.
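The pattern looks roughly like the sketch below. The checkpoint interval, file layout, and state contents are hypothetical stand-ins rather than any specific framework's API, but they capture the save-and-restart loop described above.

```python
import pickle
import pathlib

CKPT_DIR = pathlib.Path("/tmp/ckpts")      # hypothetical checkpoint location
CKPT_DIR.mkdir(parents=True, exist_ok=True)
CKPT_EVERY = 100                           # steps between checkpoints (assumed)

def save_checkpoint(step, state):
    path = CKPT_DIR / f"step_{step:08d}.pkl"
    tmp = path.with_suffix(".tmp")
    with open(tmp, "wb") as f:             # write-then-rename keeps checkpoints
        pickle.dump(state, f)              # usable even if the job dies mid-write
    tmp.rename(path)

def latest_checkpoint():
    ckpts = sorted(CKPT_DIR.glob("step_*.pkl"))
    if not ckpts:
        return 0, {"params": 0.0}          # fresh start
    with open(ckpts[-1], "rb") as f:
        return int(ckpts[-1].stem.split("_")[1]), pickle.load(f)

# Restart-friendly loop: on failure, a replacement worker resumes from the
# most recent checkpoint instead of recomputing from scratch.
step, state = latest_checkpoint()
while step < 1000:
    state["params"] += 0.01                # stand-in for one training step
    step += 1
    if step % CKPT_EVERY == 0:
        save_checkpoint(step, state)
```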
Today, and for the foreseeable future, access to power is a critical constraint on scaling AI compute. While traditional system design focuses on maximum performance per chip, what is needed is a shift to end-to-end design focused on performance per watt at scale. This approach is crucial because sustaining performance depends on the interaction of every system component: compute, network, memory, power delivery, cooling, and fault tolerance. Optimizing components in isolation severely limits overall system efficiency.
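A rough calculation with assumed numbers shows why the end-to-end view matters: delivered performance per watt depends on sustained utilization and on every watt the facility draws, not just the accelerator's own rating.

```python
# All figures below are illustrative assumptions, not measurements.
chip_peak_flops = 1.0e15   # per accelerator
chip_power_w    = 1000     # accelerator power draw
overhead_w      = 500      # per-accelerator share of hosts, network, memory
pue             = 1.3      # facility overhead: cooling and power delivery
utilization     = 0.4      # fraction of peak actually sustained end to end

delivered   = chip_peak_flops * utilization
total_watts = (chip_power_w + overhead_w) * pue

print(f"chip-only  : {chip_peak_flops / chip_power_w / 1e9:.0f} GFLOPs/W")
print(f"end-to-end : {delivered / total_watts / 1e9:.0f} GFLOPs/W")
# The end-to-end figure comes out roughly 5x lower than the chip-only figure,
# which is why optimizing any one component in isolation misses most of the picture.
```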
As we push for higher performance, individual chips draw more power, often exceeding what traditional air-cooled data centers can dissipate. This necessitates a shift towards liquid cooling solutions, which demand greater up-front investment but remove heat far more efficiently, along with a fundamental redesign of data center cooling infrastructure.
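The physics behind the shift is straightforward; a quick calculation with textbook heat-capacity values (nothing facility-specific) shows why liquid is favored at these power densities.

```python
# Volumetric heat capacity: specific heat (J per kg per K) times density (kg per m^3).
air_heat_capacity   = 1.0e3 * 1.2     # ~1.2 kJ per m^3 per K
water_heat_capacity = 4.18e3 * 1000   # ~4,180 kJ per m^3 per K

ratio = water_heat_capacity / air_heat_capacity
print(f"water carries ~{ratio:.0f}x more heat per unit volume than air")  # ~3,500x
```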
Beyond cooling, conventional redundant power sources, such as dual utility feeds and diesel generators, impose substantial costs and slow the delivery of new capacity. Instead, what is needed is a combination of diverse power sources and storage at multi-gigawatt scale, managed by real-time microgrid controllers. Exploiting the flexibility and geographic distribution of AI workloads delivers more capacity without expensive backup systems that sit idle most of the time.
This evolving power model enables real-time responses to power availability, from shutting down computations during shortages to advanced techniques such as frequency scaling.
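A toy controller sketch illustrates the idea; the power signals, thresholds, and actions below are hypothetical and do not reflect the interface of any real microgrid controller.

```python
def respond_to_power(available_mw, demand_mw):
    """Choose a workload action based on how much power the site can draw."""
    headroom = available_mw / demand_mw
    if headroom >= 1.0:
        return "run at full frequency"
    if headroom >= 0.8:
        return "scale down clock frequency (dynamic power falls roughly cubically)"
    if headroom >= 0.5:
        return "checkpoint and pause lower-priority training jobs"
    return "checkpoint and shut down compute until supply recovers"

for available in (120, 95, 70, 40):
    print(f"{available:>3} MW available: {respond_to_power(available, 100)}")
```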