Once Again, Meta Buys Rather Than Builds A Supercomputer






For a company that has been so enthusiastic about designing and building its own infrastructure and datacenters, Meta Platforms, the parent company of Facebook as well as WhatsApp and Instagram and one of the champions of the metaverse virtual reality a lot of us first read about in Snow Crash, sure has not been building its own AI supercomputers lately. And it is perplexing.

Back in January, Meta Platforms announced that it was acquiring a complete machine from Nvidia, called the Research SuperCluster, or RSC for short, that would be composed of 2,000 DGX A100 nodes, each with a pair of AMD “Rome” 64-core Epyc 7742 processors (for a total of 4,000 CPUs) and eight Nvidia “Ampere” A100 GPU accelerators (for a total of 16,000 GPUs). The initial 768 nodes went in already, and the rest are expected to be installed by October – just in time to run the High Performance Linpack benchmark for the fall Top 500 supercomputer rankings. Each DGX A100 has eight 200 Gb/sec Quantum InfiniBand network interfaces, and the nodes are interconnected in a two-tier Clos fabric topology.

With the 768 nodes in phase one, the peak theoretical performance of this chunk of the RSC machine would be rated at 59.6 petaflops with the FP64 units and 119.8 petaflops with the 64-bit processing on the Tensor Core units across the 6,144 GPUs in this phase. If A100s are used in every node on the machine – neither Meta Platforms nor Nvidia has said what the GPUs in phase two would be – the RSC system would be rated at around 155.2 petaflops using the FP64 units and 312 petaflops using the Tensor Core units on the GPUs (which have 2X the 64-bit throughput). This is a respectable machine, even in the dawning of the exascale era. At FP16 or BF16 precision, that is just under 5 exaflops of “AI performance” as Nvidia puts it, and that maps to what Meta Platforms said the machine would have when it is finished.
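The arithmetic behind these peak ratings is simply GPU count times per-GPU peak throughput. Here is a minimal Python sketch of it. It assumes Nvidia's published A100 peaks of 9.7 teraflops on the FP64 units, 19.5 teraflops for FP64 on the Tensor Cores, and 312 teraflops for dense FP16/BF16 on the Tensor Cores; these per-GPU figures are not stated above, and the helper name is ours.

```python
# Per-GPU peak throughput in teraflops. These are assumed values taken from
# Nvidia's published A100 specifications, not figures stated in the article.
A100_TFLOPS = {"fp64": 9.7, "fp64_tensor": 19.5, "fp16_tensor": 312.0}

GPUS_PER_NODE = 8  # each DGX A100 node carries eight GPUs


def peak_petaflops(nodes: int, tflops_per_gpu: float) -> float:
    """Peak theoretical petaflops for `nodes` eight-GPU nodes."""
    return nodes * GPUS_PER_NODE * tflops_per_gpu / 1000.0


for label, nodes in (("phase one", 768), ("full RSC", 2000)):
    fp64 = peak_petaflops(nodes, A100_TFLOPS["fp64"])
    fp64_tc = peak_petaflops(nodes, A100_TFLOPS["fp64_tensor"])
    fp16 = peak_petaflops(nodes, A100_TFLOPS["fp16_tensor"])
    print(f"{label}: {fp64:.1f} PF FP64, {fp64_tc:.1f} PF FP64 Tensor Core, "
          f"{fp16 / 1000:.2f} EF FP16/BF16")
```

These are peak theoretical numbers only; sustained HPL results would come in lower.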

So we know the RSC machine as announced in January has no Hopper H100 GPU accelerators in it. But if we were Meta Platforms, with the Hopper GPUs announced, we would go back and ask for a modification.

Here’s why. If the remaining 9,856 GPUs in phase two of the RSC buildout are based on the “Hopper” H100 GPU accelerators, then the RSC machine will be considerably more powerful by October. The additional 1,232 nodes of phase two equipped with H100s would be rated at 295.7 petaflops on their FP64 units and 591.4 petaflops on the Tensor Core units using 64-bit data. If this could come to pass, then RSC would weigh in at 355.3 petaflops at FP64 and 711.2 petaflops using the Tensor Cores. If HPL were running on the Tensor Cores, RSC would be among the fastest supercomputers in the world on the November 2022 list – even ahead of the current top ranking of 537.2 petaflops peak (442 petaflops sustained) from the “Fugaku” supercomputer at RIKEN Lab in Japan.
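The hybrid scenario can be checked the same way. The sketch below derives the phase-two GPU count as 16,000 minus the 6,144 A100s of phase one, and it assumes per-GPU peaks of 9.7 and 19.5 teraflops for the A100 and 30 and 60 teraflops for the H100 (FP64 units and FP64 on Tensor Cores, respectively). Those per-GPU rates are Nvidia's announced figures, not numbers given above, and the function names are illustrative.

```python
# Assumed per-GPU peaks in teraflops (Nvidia's published or announced figures
# for the SXM parts; they are not stated in the article itself).
PEAK_TFLOPS = {
    "A100": {"fp64": 9.7, "fp64_tensor": 19.5},
    "H100": {"fp64": 30.0, "fp64_tensor": 60.0},
}

TOTAL_GPUS = 16_000
PHASE_ONE_GPUS = 768 * 8                      # A100s already installed
PHASE_TWO_GPUS = TOTAL_GPUS - PHASE_ONE_GPUS  # GPUs still to be installed
FUGAKU_PEAK_PF = 537.2                        # peak figure quoted in the article


def petaflops(gpus: int, gpu: str, mode: str) -> float:
    """Peak petaflops for `gpus` accelerators of type `gpu` in `mode`."""
    return gpus * PEAK_TFLOPS[gpu][mode] / 1000.0


def hybrid_rsc(mode: str) -> float:
    """A100s in phase one plus hypothetical H100s in phase two."""
    return (petaflops(PHASE_ONE_GPUS, "A100", mode)
            + petaflops(PHASE_TWO_GPUS, "H100", mode))


for mode in ("fp64", "fp64_tensor"):
    total = hybrid_rsc(mode)
    verdict = "above" if total > FUGAKU_PEAK_PF else "below"
    print(f"{mode}: {total:.1f} PF peak, {verdict} Fugaku's peak")
```

Only the Tensor Core figure clears Fugaku's peak, which is why the comparison hinges on HPL running on the Tensor Cores.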

Where the RSC machine actually ranks depends on how many exascale machines are installed between now and November, and it will rank much lower than it could if it had Hopper instead of Ampere GPUs. It’s only May. It is a long time until October. This can change.

As we said back in January when RSC was announced by Meta Platforms, the acquisition of the RSC machine, rather than having Facebook design, procure, and build it, was done out of necessity. Nvidia does not support Facebook’s Open Accelerator Module (OAM) form factor for Ampere or Hopper accelerators and the two vendors that do – AMD with its “Aldebaran” Instinct MI250 and Intel with its “Ponte Vecchio” Xe HPC – aren’t shipping in volume, and whatever volumes they have are going into their respective “Frontier” system at Oak Ridge National Laboratory and “Aurora” system at Argonne National Laboratory.

Looking around for even more GPUs to run its AI workloads, Meta Platforms cast a gaze upon the one hyperscaler and cloud builder that doesn’t directly compete with it in the advertising market – that would be Microsoft – and has partnered with the company’s Azure cloud division to use a dedicated Azure cluster that has 5,400 A100 GPUs delivered using the NDm A100 v4-series instances on the Microsoft cloud.

These NDm A100 v4-series instances, which just went into preview yesterday, have a pair of 48-core AMD “Milan” Epyc 7V13 processors, 1.85 TB of accessible main memory for the virtual machine, and eight A100 GPU accelerators with 80 GB of HBM2e memory that are all hooked together using NVLink 3.0 interconnects. The node has a 200 Gb/sec HDR InfiniBand adapter from Nvidia for each GPU, delivering 1.6 Tb/sec of aggregate bandwidth into the interconnect. Microsoft says that it can scale “up to thousands of GPUs” within a region, and that is precisely what Meta Platforms is doing with its supercomputer for permanent rent that is being announced this week.
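Because bits and bytes are easy to mix up in interconnect figures, here is a small sketch of the per-node network arithmetic. It assumes the nominal HDR InfiniBand rate of 200 gigabits per second per port and one port per GPU; the variable names are ours.

```python
GPUS_PER_NODE = 8
HDR_GBITS_PER_PORT = 200  # nominal HDR InfiniBand rate per port, gigabits/sec

# One adapter per GPU, so aggregate injection bandwidth scales with GPU count.
aggregate_gbits = GPUS_PER_NODE * HDR_GBITS_PER_PORT
aggregate_tbits = aggregate_gbits / 1000   # terabits/sec
aggregate_gbytes = aggregate_gbits / 8     # gigabytes/sec (8 bits per byte)

print(f"{aggregate_tbits:.1f} Tb/sec per node, "
      f"or {aggregate_gbytes:.0f} GB/sec per node")
```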

At 52.4 petaflops at FP64 across the 675 nodes in the system – which is almost certainly an HGX system with components sourced from Nvidia and built by one of the big ODMs and not actual DGX A100 systems from Nvidia itself – and 105.3 petaflops using Tensor Cores to drive FP64 math, this machine in the cloud has just a little less oomph than the first phase of the RSC machine outlined above.
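For a rough sense of how the rented cluster compares with phase one of RSC, the sketch below rates both from their node counts. It assumes eight A100s per node and per-GPU peaks of 9.7 teraflops (FP64 units) and 19.5 teraflops (FP64 on Tensor Cores), which are Nvidia's published figures rather than numbers from Microsoft or Meta Platforms.

```python
# Assumed per-GPU A100 peaks in teraflops (Nvidia's published figures).
A100_FP64_TFLOPS = 9.7
A100_FP64_TENSOR_TFLOPS = 19.5
GPUS_PER_NODE = 8

AZURE_NODES = 675          # 5,400 GPUs in the dedicated Azure cluster
RSC_PHASE_ONE_NODES = 768  # 6,144 GPUs in the first phase of RSC


def cluster_petaflops(nodes: int, tflops_per_gpu: float) -> float:
    """Peak theoretical petaflops for a cluster of eight-GPU nodes."""
    return nodes * GPUS_PER_NODE * tflops_per_gpu / 1000.0


azure_fp64 = cluster_petaflops(AZURE_NODES, A100_FP64_TFLOPS)
azure_tensor = cluster_petaflops(AZURE_NODES, A100_FP64_TENSOR_TFLOPS)
rsc_fp64 = cluster_petaflops(RSC_PHASE_ONE_NODES, A100_FP64_TFLOPS)

# With identical GPUs, relative size reduces to the ratio of node counts.
relative_size = azure_fp64 / rsc_fp64

print(f"Azure cluster: {azure_fp64:.1f} PF FP64, "
      f"{azure_tensor:.1f} PF FP64 Tensor Core")
print(f"Azure cluster is {relative_size:.0%} the size of RSC phase one")
```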

The word on the street is that Microsoft will probably not move to 400 Gb/sec NDR Quantum 2 InfiniBand until next year, and we suspect that it will deploy this interconnect on HPC-style clusters in Azure that have the Hopper GPUs.

It would be funny – and illustrative – if in the future Meta Platforms can rent higher performance Nvidia GPUs and interconnects from Microsoft than it can get into its own datacenters. . . .

It will be funnier still if Meta Platforms keeps coming under fire on so many fronts, sees user growth continue to stagnate, feels IT costs under pressure, and Microsoft decides to acquire it or merge with it.

It is hard to say what Meta Platforms would cost, with a market capitalization of $490.6 billion as we go to press, while Microsoft has a market capitalization of $1.94 trillion. Microsoft has $130.6 billion in cash and investments, and while an acquisition by Microsoft would require huge amounts of cash beyond this, a merger would not. It might require a lot of lawyers to argue with antitrust authorities. But it is not beyond possibility, although such a deal would dwarf the inflation-adjusted $297.7 billion that Vodafone paid for Mannesmann in 1999, the $286.4 billion that AOL paid for Time Warner in 2000, and the $151.2 billion that Verizon paid Vodafone for its stake in Verizon Wireless in 2013.

Strange thought, isn’t it, having the two key Open Compute Project contributors under the same corporate umbrella?

Anyway, Meta Platforms has been renting capacity on the Azure cloud to train AI models since last year, and Microsoft is touting the fact that the interconnect bandwidth between its Azure servers is 4X that of its peers on the clouds that sell Nvidia GPU capacity and that this allows for faster training of larger models, such as Meta Platforms’ OPT-175B natural language model.

Under the expanded partnership, Microsoft is going to continue to provide enterprise-grade support for the PyTorch machine learning framework for Python that Facebook has helped create and the two companies are going to collaborate on scaling PyTorch on hyperscale infrastructure and improving the workflow of the creation and testing of AI models on that framework.



