This week, Microsoft and Nvidia announced that they have trained what they claim is one of the largest and most capable AI language models to date: Megatron-Turing Natural Language Generation (MT-NLG). MT-NLG contains 530 billion parameters – the parts of the model learned from historical data – and achieves leading accuracy on a broad range of tasks, including reading comprehension and natural language inference.
But it wasn't cheap to build. Training took place across 560 Nvidia DGX A100 servers, each containing eight Nvidia A100 80GB GPUs. Experts peg the cost in the millions of dollars.
Like other large AI systems, MT-NLG raises questions about the accessibility of cutting-edge machine learning research. The cost of AI training dropped 100-fold between 2017 and 2019, but the totals still exceed the compute budgets of most startups, governments, nonprofits, and colleges. The inequity favors corporations and world superpowers with extraordinary access to resources at the expense of smaller players, cementing incumbent advantages.
For example, in early October, Alibaba researchers detailed M6-10T, a language model containing 10 trillion parameters (roughly 57 times the size of OpenAI's GPT-3) that was trained on 512 Nvidia V100 GPUs for 10 days. The cheapest V100 plan available through Google Cloud Platform costs $2.28 per GPU per hour, which would equate to roughly $280,000 ($2.28 per hour times 24 hours times 10 days times 512 GPUs) – more than most research teams can stretch to.
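The back-of-the-envelope estimate above is simply rate times time times GPU count. A minimal sketch, using the figures from the article (the per-GPU-hour billing model is an assumption about how the quoted Google Cloud rate applies):

```python
# Rough training-cost estimate for the M6-10T run described above.
# Figures come from the article; billing per GPU-hour is an assumption.

rate_per_gpu_hour = 2.28   # USD, cheapest V100 plan on Google Cloud Platform
hours = 24 * 10            # 10 days of continuous training
gpus = 512

cost = rate_per_gpu_hour * hours * gpus
print(f"${cost:,.0f}")     # $280,166
```

This omits storage, networking, and the inevitable failed runs, so real-world totals would be higher.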
Google's subsidiary DeepMind is estimated to have spent $35 million training a system to learn the Chinese board game Go. And when the company's researchers designed a model to play StarCraft II, they purposefully didn't try multiple ways of architecting a key component because the training cost would have been too high. Likewise, OpenAI didn't fix a mistake when it implemented GPT-3, because the cost of training made retraining the model impractical.
Ways forward
It is important to keep in mind that training costs can be inflated by factors other than an algorithm's technical aspects. As Yoav Shoham, professor emeritus at Stanford University and cofounder of AI startup AI21 Labs, recently told Synced, personal and organizational considerations often contribute to a model's final price tag.
“[A] researcher may be anxious about waiting three weeks to do a thorough analysis, and their organization may not be able or willing to pay for it,” he said. “So for the same task, you could spend $100,000 or $1 million.”
Yet the growing cost of training – and storing – algorithms like Huawei's PanGu-Alpha, Naver's HyperCLOVA, and the Beijing Academy of Artificial Intelligence's Wu Dao 2.0 is spawning a cottage industry of startups aiming to “optimize” models without degrading accuracy. This week, former Intel executive Naveen Rao launched a new company, MosaicML, to offer tools, services, and training methods that improve AI system accuracy while lowering costs and saving time. MosaicML – which has raised $37 million in venture capital – competes with Codeplay Software, OctoML, Neural Magic, Deci, CoCoPie, and NeuReality in a market expected to grow exponentially in the years to come.
In better news, the cost of basic machine learning operations has been falling over the past few years. A 2020 OpenAI survey found that since 2012, the amount of compute needed to train a model to the same image classification performance on a popular benchmark – ImageNet – has been halving every 16 months.
Approaches such as pruning a network before training could yield further gains. Research has shown that parameters pruned after training – a process that shrinks the model – could have been pruned before training without any effect on the network's ability to learn. Called the “lottery ticket hypothesis,” the idea is that the initial values a model's parameters receive are crucial in determining whether they matter. Parameters kept after pruning received “lucky” initial values; the network can train successfully with only those parameters present.
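The retroactive recipe the hypothesis describes can be sketched in a few lines: save the initial weights, train, drop the smallest-magnitude weights, and rewind the survivors to their initial values. This is a toy illustration with NumPy, not a real training loop – the "training" step is a stand-in update:

```python
import numpy as np

# Toy sketch of "lottery ticket"-style magnitude pruning.
# The training step is simulated; a real setup would use an optimizer.
rng = np.random.default_rng(0)

init_weights = rng.normal(size=(100, 100))  # the initial "ticket"
trained = init_weights + rng.normal(scale=0.1, size=(100, 100))  # stand-in for training

# Prune the 90% of parameters with the smallest trained magnitude.
threshold = np.quantile(np.abs(trained), 0.90)
mask = np.abs(trained) >= threshold         # keep only the top 10%

# Rewind: surviving parameters restart from their *initial* values.
winning_ticket = init_weights * mask

sparsity = 1.0 - mask.mean()
print(f"sparsity: {sparsity:.2f}")          # sparsity: 0.90
```

In the hypothesis's framing, retraining from `winning_ticket` alone can match the full network's accuracy; the open problem, as the next paragraph notes, is finding the mask without training the full network first.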
However, pruning is far from a solved science. New pruning methods that work before or early in training will need to be developed, as most current methods apply only retroactively. And when parameters are pruned, the resulting structures aren't always a fit for the training hardware (e.g., GPUs), meaning that pruning 90% of a model's parameters won't necessarily reduce the cost of training it by 90%.
Whether through pruning, new AI accelerator hardware, or techniques like meta-learning and neural architecture search, the need for alternatives to incredibly large models is quickly becoming obvious. A University of Massachusetts Amherst study showed that using 2019-era approaches, training an image recognition model with a 5% error rate would cost $100 billion and produce as much carbon emissions as New York City does in a month. As the IEEE Spectrum editorial team wrote in a recent piece, “we must either adapt how we do deep learning or face a future of much slower progress.”
For AI coverage, send news tips to Kyle Wiggers – and be sure to subscribe to the AI Weekly newsletter and bookmark our AI channel, The Machine.
Thanks for reading,
AI Staff Writer