
© Zeal News Africa

AI Titans Meta & Oracle Tap NVIDIA for Next-Gen Data Centers

Published 1 week ago · 6 minute read
Uche Emeka

Leading technology giants Meta and Oracle are significantly upgrading their AI data centers by integrating NVIDIA’s Spectrum-X Ethernet networking switches. This advanced technology is specifically engineered to address the escalating demands of large-scale artificial intelligence systems, facilitating improved AI training efficiency and accelerating deployment across extensive compute clusters. Jensen Huang, NVIDIA’s founder and CEO, highlighted the transformative impact of trillion-parameter models, which are converting traditional data centers into “giga-scale AI factories.” He likened Spectrum-X to the “nervous system” essential for connecting millions of GPUs to train the most complex AI models ever conceived.

Oracle is set to leverage Spectrum-X Ethernet in conjunction with its Vera Rubin architecture to construct these large-scale AI factories. Mahesh Thiagarajan, Oracle Cloud Infrastructure’s executive vice president, emphasized that this new setup will enable the company to connect millions of GPUs more efficiently, thereby empowering customers to train and deploy novel AI models at an accelerated pace. Concurrently, Meta is enhancing its AI infrastructure by incorporating Spectrum-X Ethernet switches into its proprietary Facebook Open Switching System (FBOSS), a platform designed for managing network switches at scale. According to Gaya Nagarajan, Meta’s vice president of networking engineering, the company’s next-generation network architecture must embody openness and efficiency to adequately support increasingly larger AI models and deliver services to billions of global users.

The increasing complexity of data centers demands flexibility, a principle underscored by Joe DeLaere, who leads NVIDIA’s Accelerated Computing Solution Portfolio for Data Center. DeLaere explained that NVIDIA’s MGX system offers a modular, building-block design, giving partners the flexibility to combine different CPUs, GPUs, storage, and networking components as required. The system also supports interoperability, allowing organizations to keep a consistent design across multiple hardware generations, ensuring “flexibility, faster time to market, and future readiness.”

As AI models grow in size, power efficiency has emerged as a critical challenge for data centers. NVIDIA is tackling this issue through a comprehensive “from chip to grid” strategy aimed at enhancing energy utilization and scalability. This involves close collaboration with power and cooling vendors to maximize performance per watt. Notable advancements include the transition to 800-volt DC power delivery, which significantly reduces heat loss and boosts efficiency. Furthermore, the company is introducing power-smoothing technology to mitigate spikes on the electrical grid, an innovation capable of reducing maximum power needs by up to 30 percent, consequently allowing for greater compute capacity within the same physical footprint.
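The peak-shaving idea behind power smoothing can be illustrated with a toy calculation. This is not NVIDIA's actual mechanism; it is a minimal sketch assuming a simple scheme where draw above a cap is covered by local energy storage, and the power trace values are invented:

```python
# Toy illustration (not NVIDIA's algorithm): capping facility power draw
# during training spikes, with stored energy covering the shortfall.

def smooth_power(draw_watts, cap_watts):
    """Clip each sample to the cap; return the smoothed series and the
    total energy (per-sample units) that storage would have to supply."""
    smoothed, buffered = [], 0.0
    for w in draw_watts:
        if w > cap_watts:
            buffered += w - cap_watts  # shortfall supplied by storage
            smoothed.append(cap_watts)
        else:
            smoothed.append(w)
    return smoothed, buffered

# Hypothetical rack power trace (kW) with a spike during gradient sync
trace = [70, 72, 100, 95, 71, 69]
smoothed, buffered = smooth_power(trace, cap_watts=70)
peak_cut = 1 - max(smoothed) / max(trace)
print(f"Peak reduced by {peak_cut:.0%}")  # prints "Peak reduced by 30%"
```

In this contrived trace the cap cuts the peak from 100 kW to 70 kW, the same 30 percent scale the article cites; the grid then only needs to be provisioned for the capped peak, freeing headroom for more compute.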

NVIDIA’s MGX system is also instrumental in how data centers are scaled. Gilad Shainer, NVIDIA’s senior vice president of networking, explained that MGX racks integrate both compute and switching components, supporting NVLink for scale-up connectivity and Spectrum-X Ethernet for scale-out growth. He further noted that MGX can unify multiple AI data centers into a single, cohesive system—a capability vital for companies like Meta that operate massive distributed AI training operations. Depending on geographical distance, these sites can be linked via dark fiber or additional MGX-based switches, ensuring high-speed connections across various regions.

Meta’s adoption of Spectrum-X exemplifies the growing importance of open networking. Shainer stated that while Meta will use FBOSS as its network operating system, Spectrum-X is also compatible with other leading network operating systems, including Cumulus, SONiC, and Cisco’s NOS, through strategic partnerships. This flexibility allows hyperscalers and enterprises to standardize their infrastructure on the systems best suited to their environments. NVIDIA envisions Spectrum-X as a catalyst for making AI infrastructure more efficient and widely accessible across diverse scales. The Ethernet platform was purpose-built for AI workloads such as training and inference, achieving up to 95 percent effective bandwidth and significantly outperforming traditional Ethernet. NVIDIA’s collaborations with key industry players like Cisco, xAI, Meta, and Oracle Cloud Infrastructure are pivotal in expanding Spectrum-X’s reach to a broader range of environments, from hyperscalers to individual enterprises.

Looking ahead, NVIDIA’s forthcoming Vera Rubin architecture is expected to be commercially available in the second half of 2026, with the Rubin CPX product arriving by year’s end. Both will work alongside Spectrum-X networking and MGX systems to underpin the next generation of AI factories. DeLaere clarified that Spectrum-X and its XGS variant share the same core hardware but use different algorithms for different distances: Spectrum-X is optimized for connectivity within a data center, while XGS handles communication between data centers. This approach minimizes latency and enables multiple geographically dispersed sites to function as a single, powerful AI supercomputer.

To facilitate the transition to 800-volt DC, NVIDIA is collaborating across the entire power chain, from chip design to grid integration. The company is partnering with Onsemi and Infineon for power components; Delta, Flex, and Lite-On at the rack level; and Schneider Electric and Siemens for overall data center designs. A technical white paper detailing this approach will be presented at the OCP Summit. DeLaere characterized it as a “holistic design from silicon to power delivery,” ensuring that all systems integrate and operate seamlessly in the high-density AI environments typical of Meta’s and Oracle’s operations.

Spectrum-X Ethernet, specifically engineered for distributed computing and AI workloads, offers significant performance advantages for hyperscalers. Shainer elaborated that it features adaptive routing and telemetry-based congestion control, which effectively eliminate network hotspots and guarantee stable performance. These capabilities enable higher training and inference speeds while allowing multiple workloads to execute concurrently without interference. He affirmed that Spectrum-X stands as the sole Ethernet technology validated to scale at extreme levels, thereby assisting organizations in maximizing performance and return on their GPU investments. For hyperscalers like Meta, this unparalleled scalability is crucial for managing growing AI training demands and maintaining highly efficient infrastructure.
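The general idea behind adaptive routing can be sketched in a few lines: among several equal-cost paths, each packet (or group of packets) is steered onto whichever path telemetry reports as least loaded, which is how hotspots get drained. This is a generic textbook illustration, not NVIDIA's actual Spectrum-X algorithm, and the queue-depth numbers are invented:

```python
# Illustrative sketch of adaptive routing: among equal-cost paths, forward
# traffic on the least-loaded path as reported by switch telemetry.
# Generic concept only; not NVIDIA's proprietary implementation.

def pick_path(queue_depths):
    """Return the index of the least-congested path."""
    return min(range(len(queue_depths)), key=lambda i: queue_depths[i])

# Hypothetical per-path queue depths (packets) from telemetry
telemetry = [12, 3, 27, 8]
print(pick_path(telemetry))  # prints 1: path 1 has the shortest queue
```

The key contrast with static hashing is that a static scheme can pin two heavy flows onto the same path regardless of load, while load-aware selection keeps traffic spread evenly as conditions change.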

While NVIDIA is renowned for its hardware innovations, DeLaere stressed the equally critical role of software optimization. The company continually improves performance through co-design, aligning hardware and software development to achieve maximum efficiency for AI systems. NVIDIA is actively investing in FP4 kernels, frameworks such as Dynamo and TensorRT-LLM, and algorithms like speculative decoding to improve throughput and overall AI model performance. These continuous updates, he noted, ensure that systems like Blackwell keep delivering better results over time for hyperscalers such as Meta, who depend on consistent AI performance.

The Spectrum-X platform, comprising Ethernet switches and SuperNICs, is NVIDIA’s first Ethernet system purpose-built for AI workloads. It is designed to link millions of GPUs efficiently while maintaining predictable performance across AI data centers. With congestion-control technology achieving up to 95 percent data throughput, Spectrum-X marks a major leap over standard Ethernet, which typically attains only about 60 percent due to flow collisions. Its XGS technology extends these capabilities to long-distance AI data center links, connecting facilities across regions into unified “AI super factories.” By integrating NVIDIA’s complete stack, including GPUs, CPUs, NVLink, and software, Spectrum-X provides the consistent performance needed to support trillion-parameter models and power the next wave of generative AI workloads.
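The practical impact of the 95 percent versus 60 percent effective-bandwidth figures quoted above is easy to quantify. The line rate below is an assumed, illustrative per-port speed, not a figure from the article:

```python
# Back-of-envelope comparison using the effective-bandwidth figures quoted
# in the article: ~95% for Spectrum-X vs ~60% for standard Ethernet.

LINK_GBPS = 800  # assumed per-port line rate; purely illustrative

def effective_gbps(line_rate_gbps, efficiency):
    """Usable bandwidth after protocol/congestion losses."""
    return line_rate_gbps * efficiency

spectrum_x = effective_gbps(LINK_GBPS, 0.95)  # 760.0 Gb/s usable
standard = effective_gbps(LINK_GBPS, 0.60)    # 480.0 Gb/s usable
speedup = spectrum_x / standard
print(f"{speedup:.2f}x more usable bandwidth per link")  # prints "1.58x ..."
```

Since the ratio 0.95 / 0.60 is independent of line rate, the roughly 1.6x gain in usable bandwidth holds at any port speed, which is why the efficiency figure, rather than the raw link rate, is the headline number.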

