Meta’s massive new AI supercomputer will be ‘world’s fastest’

Fresh off its rebrand last October, Meta (formerly Facebook) is powering its vision of a metaversal future with a massive new AI supercomputer called the AI Research SuperCluster (RSC). Meta says RSC will be used to help build new AI models, develop augmented reality tools, seamlessly analyze multimedia data, and more. The first phase of the supercomputer is already operational, and the system is expected to be fully built out by the middle of the year. HPCwire estimates that the final system will deliver more than 220 Linpack petaflops.

RSC as currently built. Image courtesy of Meta.

About the system

The first phase of RSC, already built and operational, consists of 760 Nvidia DGX A100 compute nodes, totaling some 6,080 Nvidia A100 GPUs, all connected via Nvidia’s Quantum 200Gb/s InfiniBand. For storage, the system is equipped with 175 PB of Pure Storage FlashArray, 10 PB of Pure Storage FlashBlade, and 46 PB of cache storage housed in Penguin Computing Altus servers. Meta says it believes that, with just this first phase, RSC is already one of the fastest AI supercomputers currently running.

With the completion of the second phase around July, Meta says, RSC will feature a total of 16,000 GPUs (presumably via an additional 1,240 DGX A100 nodes, which Nvidia believes will be the largest customer installation of DGX A100 systems) and a full exabyte of storage capable of serving 16 TB/s of training data. Meta indicated that 16,000 GPUs will be the maximum configuration of the system. “This is due to the network configuration to reduce the hop count, to ensure we provide a 1:1 oversubscription,” a Meta spokesperson told us.
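For a rough sense of scale, the sketch below totals the phase-one storage tiers listed above and estimates how long a single pass over a full exabyte would take at 16 TB/s. It assumes decimal units (1 PB = 1,000 TB, 1 EB = 1,000 PB) and sustained peak throughput, neither of which Meta has specified. The variable names are illustrative only.

```python
# Back-of-envelope storage arithmetic using the figures quoted in the article.
# Assumptions (not from Meta): decimal units and sustained peak throughput.

TB_PER_PB = 1_000
PB_PER_EB = 1_000

# Phase-one storage tiers, in petabytes.
phase_one_pb = {
    "FlashArray": 175,
    "FlashBlade": 10,
    "cache (Penguin Altus)": 46,
}
phase_one_total_pb = sum(phase_one_pb.values())

# Phase-two targets: one exabyte of capacity, 16 TB/s of training data.
phase_two_capacity_pb = 1 * PB_PER_EB
training_bandwidth_tb_s = 16

# Time to stream the whole exabyte once at the quoted bandwidth.
full_pass_seconds = phase_two_capacity_pb * TB_PER_PB / training_bandwidth_tb_s
full_pass_hours = full_pass_seconds / 3600

print(f"Phase-one storage: {phase_one_total_pb} PB")
print(f"Growth to one exabyte: {phase_two_capacity_pb / phase_one_total_pb:.1f}x")
print(f"One pass over 1 EB at 16 TB/s: {full_pass_hours:.1f} hours")
```

Under these assumptions, one full read of an exabyte-scale dataset would take well under a day, which gives some intuition for why the bandwidth target matters as much as the capacity.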

Meta says this second phase will boost RSC’s AI training performance by more than 2.5× (following the 2.63× increase in GPUs), making it the fastest AI supercomputer in the world.
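The node and GPU counts in this section can be checked with a few lines of arithmetic. The sketch below assumes, as the article itself only presumes, that the full build-out consists entirely of eight-GPU DGX A100 nodes. The variable names are illustrative.

```python
# Node and GPU counts implied by the article's figures.
# Assumption: the full system uses only DGX A100 nodes with eight GPUs each.

GPUS_PER_DGX_A100 = 8

phase_one_nodes = 760
phase_one_gpus = phase_one_nodes * GPUS_PER_DGX_A100

full_gpus = 16_000
full_nodes = full_gpus // GPUS_PER_DGX_A100
additional_nodes = full_nodes - phase_one_nodes

# Ratio of GPUs in the full system to GPUs in phase one.
gpu_ratio = full_gpus / phase_one_gpus

print(f"Phase one: {phase_one_gpus} GPUs")
print(f"Full system: {full_nodes} nodes ({additional_nodes} additional)")
print(f"GPU increase: {gpu_ratio:.2f}x")
```

Meta's claimed training speedup of "more than 2.5×" sits just under the GPU ratio computed here, which is consistent with near-linear but not perfect scaling.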

Unlike previous systems, RSC is intended for use not only with open-source and public datasets, but also with real, internal production data from Meta. To that end, Meta says, they designed the system to be isolated from the Internet, with all connections going through Meta’s own data centers. User-generated data, once checked for anonymization, is encrypted along the entire path from the storage systems to the GPUs and is only decrypted in memory immediately prior to use in model training.

Meta also developed a storage service (called AI Research Store or AIRStore) to handle RSC’s growing bandwidth and capacity requirements. AIRStore processes training data for AI models and is designed to optimize transfer rates.

In RSC’s announcement, Meta also quietly described the first generation of its AI research supercomputing hardware, launched in 2017. The unnamed cluster, Meta says, has 22,000 Nvidia V100 GPUs and runs 35,000 training tasks per day. Meta says that compared to this previous system, RSC’s early benchmarks show a 20x improvement in computer vision workflows and a 3x improvement in large-scale NLP model training (which, Meta says, translates into weeks of time saved).

So far, Meta has worked with a consistent list of partners for these systems: Penguin Computing for architecture and managed services; Nvidia for systems, GPUs, networking and software stack components; and Pure Storage for most storage functionality.

Image courtesy of Meta.

The fastest AI supercomputer(s)

In terms of flops, Meta estimates that RSC will deliver nearly five exaflops of mixed-precision AI computing power. Using Nvidia’s Selene supercomputer (also built from eight-GPU Nvidia DGX A100 nodes) as a benchmark, HPCwire estimates that, if Meta ran the HPL benchmark, the full iteration of RSC could deliver about 227 Linpack petaflops (versus perhaps 86 petaflops at present), although optimizations Nvidia has made in the meantime may mean these numbers are underestimates.
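One plausible way to arrive at figures in this range is simple linear scaling by GPU count. The sketch below is a reconstruction under stated assumptions, not HPCwire's or Meta's actual method: it assumes Selene's 63.4 Linpack petaflops were achieved on 4,480 A100 GPUs (a count not given in this article), that Linpack performance scales linearly with GPU count, and that "mixed precision" refers to the A100's 312-teraflop dense FP16/BF16 tensor-core peak.

```python
# Rough reconstruction of the performance figures quoted above.
# Assumptions (not stated in the article): Selene's GPU count, the
# per-GPU mixed-precision peak used, and linear scaling with GPU count.

A100_MIXED_PRECISION_TFLOPS = 312  # dense FP16/BF16 tensor-core peak per A100
SELENE_LINPACK_PFLOPS = 63.4       # figure quoted in the article
SELENE_GPUS = 4_480                # assumption: 560 DGX A100 nodes x 8 GPUs

RSC_PHASE_ONE_GPUS = 6_080
RSC_FULL_GPUS = 16_000


def mixed_precision_exaflops(gpus: int) -> float:
    """Theoretical peak in exaflops (1 exaflop = 1,000,000 teraflops)."""
    return gpus * A100_MIXED_PRECISION_TFLOPS / 1_000_000


def linpack_estimate_pflops(gpus: int) -> float:
    """Scale Selene's measured Linpack result linearly by GPU count."""
    return SELENE_LINPACK_PFLOPS / SELENE_GPUS * gpus


print(f"Full RSC mixed-precision peak: {mixed_precision_exaflops(RSC_FULL_GPUS):.2f} exaflops")
print(f"Full RSC Linpack estimate: {linpack_estimate_pflops(RSC_FULL_GPUS):.0f} petaflops")
print(f"Phase-one Linpack estimate: {linpack_estimate_pflops(RSC_PHASE_ONE_GPUS):.0f} petaflops")
```

With these inputs the results land close to the "nearly five exaflops," 227-petaflop and 86-petaflop figures cited above. Small differences are expected from rounding in the Selene figure, and real systems rarely scale perfectly linearly.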

That’s a powerful system for sure (RSC’s first phase would likely place fourth on November’s Top500 list, and its full form would probably come in second), but the race for the “fastest AI supercomputer” is a crowded one. While RSC will almost certainly beat current peers like Selene (63.4 Linpack petaflops) and the similar A100-based Perlmutter system at NERSC (70.9 Linpack petaflops), the near future presents much stronger challengers.

Perhaps the closest comparison is EuroHPC’s future Leonardo system, a pre-exascale Atos-built supercomputer that will also be powered by Nvidia A100s (about 14,000 of them, compared to RSC’s planned 16,000). CINECA, which will launch Leonardo’s GPU-powered booster module this month, expects that module alone to deliver 240.5 Linpack petaflops, and Nvidia has billed the forthcoming system as (you guessed it) the “world’s fastest AI supercomputer,” with an estimated ten exaflops of FP16 AI performance.

Tesla, too, is publicly building a huge AI supercomputer, called Dojo, which it intends to use to train models for autonomous vehicle development. Tesla currently has an A100-based precursor system that HPCwire previously estimated at about 82 Linpack petaflops, but Dojo itself will be powered by Tesla’s own “D1” chip. Due to the non-traditional hardware and other uncertainties, it is more difficult to estimate Dojo’s future Linpack performance, but Tesla says that when Dojo launches (at an unspecified date) it will be “the fastest AI training computer.”

Two notes. First, HPCwire also estimates that RSC’s V100-based precursor system is likely to deliver around 135 Linpack petaflops and would likely place third on the current Top500, well ahead of competing AI systems like Selene and Perlmutter. This would make it, at least in Top500 terms, the fastest AI supercomputer in the world today. Second, Meta (under the name Facebook) submitted a 3.3 Linpack petaflops system to the Top500 in early 2017 (it currently ranks 139th). While that system uses Penguin servers, the specs list Nvidia Tesla P100s and Quadro GP100s rather than V100s, so it may not be part of the predecessor system.

Only time (and benchmarks) will tell who comes out on top.

Image courtesy of Meta.

In the metaverse

The first phase of RSC is already being used for applications such as large-scale training for natural language processing (NLP) and computer vision. But the long-term goal is the metaverse, the vaguely defined virtual world that Meta (named after the metaverse) clearly believes will be another digital revolution.

Meta has an ambitious vision for RSC’s role in the metaverse, highlighting, for example, how RSC could train models for real-time speech translation among large groups of people, enabling individuals speaking different languages to collaborate on work or gameplay without a language barrier.

“The experiences we are building for the metaverse require massive computational power (trillions of operations/second!) and RSC will enable new AI models that can learn from trillions of examples, understand hundreds of languages and more,” said Mark Zuckerberg, CEO of Meta.

Building RSC during a pandemic

Meta traces the ideas behind RSC all the way back to the creation of the Facebook AI Research lab in 2013, but says the project’s real start was in early 2020, when they decided a new system was needed to take advantage of advances in GPU and network fabric technologies. Its main goal: a system that can train models with more than a trillion parameters on data sets as large as an exabyte.

Rack delivery for RSC. Image courtesy of Meta.

Covid naturally hindered the development of such a system. Meta says RSC started out as a completely remote project, and the supply chain challenges that emerged later in the pandemic put further roadblocks in its path. Meta explained that supply chain disruptions made it difficult to source components ranging from chips to GPUs.

“You don’t just buy and switch on a supercomputer,” said George Niznik, sourcing manager at Meta. “RSC was designed and executed under extremely compressed timelines without the benefit of a traditional product release cycle. In addition, the pandemic and a severe shortage of chips in the industry hit at just the wrong time. We had to make full use of all our collective skills and experiences to overcome these difficult limitations.”

Nevertheless, a year and a half later the team had delivered a functioning cluster. Meta told HPCwire that the team has been able to mitigate supply chain issues for phase one and that phased construction is proceeding as planned.

“I think I’m most proud to be doing this completely remotely with the team,” said Shubho Sengupta, an AI researcher at Meta. “I mean, it’s insane that you can do this without ever meeting anyone.”

An image of RSC’s otherwise undisclosed location, courtesy of Meta’s video announcing the system. Anyone care to guess?
