Meta (AKA Facebook) is ready to enter the global competition to build the world’s largest, most powerful computers with the “AI Research SuperCluster,” or RSC. Once fully operational, it will likely rank among the ten fastest supercomputers in the world, and it will be used for the massive number-crunching required for language and computer vision modeling. Large AI models like OpenAI’s GPT-3 are not built on laptops and PCs; they are the product of weeks or months of calculation by supercomputers that far exceed even the most cutting-edge gaming hardware.
The quicker you can complete a model’s training, the faster you can test it and build a new, better one; that matters a great deal when training times are measured in months. RSC is now operational, and the company’s researchers are already putting it to use… It should be noted that user-generated data is part of the training material, though Meta was careful to point out that it remains encrypted until training time and that the entire facility is isolated from the wider internet.
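To make that data-handling pattern concrete, here is a minimal sketch of the general idea: examples held encrypted at rest and decrypted only when training begins. It uses Python’s cryptography library with Fernet symmetric encryption purely as an illustration; Meta hasn’t said which tools or key-management scheme RSC actually uses, and the helper functions here are hypothetical.

```python
# Illustrative sketch only: data stays encrypted at rest and is
# decrypted just before it enters training. Not Meta's actual pipeline.
from cryptography.fernet import Fernet

key = Fernet.generate_key()  # in practice, keys would come from a key-management service
fernet = Fernet(key)

def encrypt_example(raw_bytes: bytes) -> bytes:
    """Encrypt a training example before it is written to shared storage."""
    return fernet.encrypt(raw_bytes)

def decrypt_for_training(token: bytes) -> bytes:
    """Decrypt an example only at training time, inside the isolated cluster."""
    return fernet.decrypt(token)

# Round trip: what is stored is ciphertext; plaintext exists only at training time.
stored = encrypt_example(b"user-generated training example")
assert decrypt_for_training(stored) == b"user-generated training example"
```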
Supercomputers are surprisingly physical constructs, with basic considerations like heat, cabling, and interconnect affecting performance and design; the team behind RSC is understandably proud of having pulled this off almost entirely remotely. Exabytes of storage sound like plenty in the abstract, but that storage must also exist physically, on-site, and be accessible at a microsecond’s notice. (Pure Storage is also pleased with the setup it devised.) RSC currently comprises 760 Nvidia DGX A100 systems with 6,080 GPUs, which Meta says puts it roughly in competition with Perlmutter at Lawrence Berkeley National Lab.
According to the long-running ranking site Top 500, that is the fifth most powerful supercomputer currently in operation. (In case you are wondering, Fugaku, in Japan, is by far the most powerful.) This could change as the company continues to build the system out: Meta hopes to eventually make RSC three times more powerful, which would put it in contention for third place.
There is, of course, a caveat in there. Systems like the second-place Summit at Oak Ridge National Laboratory are used for research where precision is critical: when simulating the molecules in a region of the Earth’s atmosphere at unprecedented levels of detail, every calculation must be carried out to a large number of decimal places, which makes the work far more time-consuming. AI applications, according to Meta, don’t require the same level of precision, because the results don’t hinge on a thousandth of a percent: inference operations produce things like “90 percent certainty this is a cat,” and the difference between 89 percent and 91 percent wouldn’t matter much.
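To see why a point or two of confidence rarely matters, here is a small sketch using NumPy with made-up logits: the “cat” probability computed in half precision lands within a tiny fraction of a percent of the full-precision result. The numbers are illustrative, not Meta’s.

```python
import numpy as np

def softmax(logits):
    """Turn raw classifier scores into probabilities."""
    shifted = logits - logits.max()  # subtract max for numerical stability
    exps = np.exp(shifted)
    return exps / exps.sum()

# Made-up logits for a three-class "cat / dog / other" classifier
logits = np.array([4.1, 1.3, 0.2])

p_full = softmax(logits.astype(np.float64))  # high precision
p_half = softmax(logits.astype(np.float16))  # reduced precision

print(f"cat confidence, float64: {p_full[0]:.4%}")
print(f"cat confidence, float16: {p_half[0]:.4%}")
# The gap between the two is far smaller than the 89%-vs-91% spread
# mentioned above, which is the point: inference tolerates lower-precision math.
```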
Achieving 90 percent certainty for a million objects or phrases is, of course, harder than doing so for a hundred. It is an oversimplification, but the upshot is that RSC can get more FLOP/s (floating-point operations per second) per core by using Nvidia’s TensorFloat-32 math mode than other, more precision-oriented systems can. In this scenario it reaches up to 1,895,000 teraFLOP/s (1.9 exaFLOP/s), which is more than four Fugakus. Does that matter? And if so, to whom? If anyone, the Top 500 people might be interested; I have asked whether they have any thoughts on it. Either way, it does not change the fact that RSC will be one of the world’s fastest computers, and possibly the fastest ever built by a private firm for its own use.
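As a back-of-the-envelope check on those figures, the sketch below multiplies Meta’s stated GPU count by Nvidia’s published per-A100 TF32 peak (312 teraFLOP/s with sparsity) and compares the result to Fugaku’s roughly 442 petaFLOP/s Linpack score. The comparison is deliberately apples-to-oranges, since Linpack is a double-precision benchmark; treat this as arithmetic behind the headline numbers, not a ranking.

```python
# Rough arithmetic behind the headline figures; per-GPU and Fugaku numbers
# are published specs/benchmarks, not measurements of RSC itself.
dgx_systems = 760
gpus_per_dgx = 8                          # each Nvidia DGX A100 holds eight A100 GPUs
total_gpus = dgx_systems * gpus_per_dgx   # 6,080 GPUs

tf32_per_gpu_tflops = 312                 # Nvidia's A100 TF32 peak, with sparsity
rsc_tf32_peak_tflops = total_gpus * tf32_per_gpu_tflops

fugaku_hpl_tflops = 442_000               # ~442 petaFLOP/s, FP64 Linpack (Top 500)

print(f"GPUs: {total_gpus:,}")
print(f"RSC TF32 peak: ~{rsc_tf32_peak_tflops:,} teraFLOP/s "
      f"(~{rsc_tf32_peak_tflops / 1_000_000:.1f} exaFLOP/s)")
print(f"Roughly {rsc_tf32_peak_tflops / fugaku_hpl_tflops:.1f}x Fugaku's FP64 Linpack figure")
```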