Supercomputers for language models: “Go large”

Today, machine learning models can exceed human-level accuracy on a variety of everyday tasks, and they are getting much bigger and more powerful at a rate that is hard to comprehend.

In the past, language systems were designed and trained for a specific task. Today, the internet giants are focussed on creating general-purpose language models which are able to perform multiple tasks with a single system.

In their race to the top, tech companies are creating ever bigger systems to beat the benchmarks. In May 2020, OpenAI, a competitor of Google’s DeepMind, released GPT-3, a language model which was 10 times bigger than any previous system.

Training the system is estimated to have cost between 4 and 12 million dollars, and it cannot be run on an ordinary computer. Microsoft had to build a bespoke supercomputer for OpenAI, which Microsoft says ranks among the top 5 supercomputers in the world.

General purpose language models: “One Model to Rule them All”

Currently, there is a big push towards creating general-purpose language models by companies like Microsoft, Google, and OpenAI. These systems can perform tasks such as question answering, summarisation, translation and reading comprehension. This is an interesting development, as it brings us closer to the concept of general intelligence.

These larger NLP models, of which Google’s Text-to-Text Transfer Transformer (T5) is one example, are extremely powerful. The largest of them are able to learn new tasks from only a few examples, a concept called ‘few-shot learning’. This removes much of the task-specific training and configuration, which results in rapid deployment, huge cost savings, reduced power consumption and a lower CO2 footprint.
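To make few-shot learning more concrete, the sketch below shows one way such a prompt could be assembled: a task description, a handful of worked examples, and the new input the model is asked to complete. The function name, the "Input:/Output:" layout and the translation examples are illustrative assumptions, not the format of any particular model or API.

```python
def build_few_shot_prompt(task_description, examples, query):
    """Assemble a plain-text few-shot prompt (hypothetical layout).

    examples is a list of (input, output) pairs shown to the model
    before the new query; no model weights are updated.
    """
    lines = [task_description, ""]
    for source, target in examples:
        lines.append(f"Input: {source}")
        lines.append(f"Output: {target}")
        lines.append("")
    # The final block is left open so the model can fill in the answer.
    lines.append(f"Input: {query}")
    lines.append("Output:")
    return "\n".join(lines)


# Two made-up examples are enough to describe the task.
examples = [("cheese", "fromage"), ("house", "maison")]
prompt = build_few_shot_prompt("Translate English to French.", examples, "book")
print(prompt)
```

The resulting text would be sent to a general-purpose model as-is. Switching to another task would only mean changing the description and the examples, not retraining the system.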

Super-sized Models

The architecture of language models has hardly changed over the past year. Still, performance has been increasing steadily. What is the reason?

Neural networks for language tasks consist of Lego-like blocks. Researchers discovered that simply adding more of these blocks yields higher performance. Combined with more training data, this has resulted in exponentially bigger models. Language models now grow by a factor of 10 each year.
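How stacking blocks drives model size can be sketched with a common rule of thumb for transformer-style networks: the parameter count is roughly 12 x (number of blocks) x (block width squared), ignoring embeddings. The formula is an approximation brought in here for illustration, not something stated in this article, and the function name is hypothetical. The 96 blocks and width of 12,288 are the values reported for the largest GPT-3 model.

```python
def approx_transformer_params(n_layers, d_model):
    """Rough parameter count of a stack of transformer blocks.

    Rule of thumb (an assumption, ignores embeddings and biases):
    each block holds about 12 * d_model**2 weights.
    """
    return 12 * n_layers * d_model ** 2


# Adding blocks grows the model linearly; widening them grows it quadratically.
for n_layers, d_model in [(12, 768), (24, 1024), (96, 12288)]:
    params = approx_transformer_params(n_layers, d_model)
    print(f"{n_layers:3d} blocks, width {d_model:6d}: ~{params / 1e9:.2f} billion parameters")
```

With 96 blocks of width 12,288, the estimate comes out at roughly 174 billion parameters, close to the 175 billion usually quoted for GPT-3.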

Figure 1: Model size over the past two years

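Moore’s law states that the number of transistors on a chip doubles roughly every 2 years, which corresponds to a growth factor of about 1.4 per year. Language models are growing by a factor of 10 each year, so their computational requirements far outpace the rate at which hardware improves.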
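The gap between the two growth rates can be made explicit with a few lines of arithmetic. The sketch below assumes the figures quoted in this article (models growing tenfold per year, transistor counts doubling every two years) and treats both as smooth exponential trends, which is a simplification.

```python
MODEL_GROWTH_PER_YEAR = 10.0       # factor quoted in this article
MOORE_DOUBLING_PERIOD_YEARS = 2.0  # Moore's law doubling period


def model_growth(years):
    """Factor by which model size grows after the given number of years."""
    return MODEL_GROWTH_PER_YEAR ** years


def hardware_growth(years):
    """Factor by which transistor count grows under Moore's law."""
    return 2.0 ** (years / MOORE_DOUBLING_PERIOD_YEARS)


def gap(years):
    """How many times faster models have grown than a single chip."""
    return model_growth(years) / hardware_growth(years)


for years in (1, 2, 3):
    print(f"after {years} year(s): models x{model_growth(years):.0f}, "
          f"hardware x{hardware_growth(years):.2f}, gap x{gap(years):.0f}")
```

Under these assumptions, after two years models are 100 times larger while a single chip is only twice as capable, a gap of 50 that has to be closed by using more chips in parallel.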

Larger models therefore need more hardware. For example, Microsoft has developed a supercomputer for OpenAI with more than 285,000 CPU cores and 10,000 GPUs. OpenAI is estimated to have spent between 4 and 12 million dollars on cloud compute to train GPT-3. By the time its researchers found some mistakes in GPT-3’s training, so much money had already been spent that there was no budget left to rerun the training without the bugs.

The flipside

The models are growing so fast that the biggest systems cannot be run efficiently on a PC. This is advantageous to the small number of companies which have released these systems, as they can simply make more money with their cloud services. The “Machine Learning as a Service” market has been growing by over 40% a year, and this trend is not expected to stop soon.

In spite of the impressive flexibility of the general-purpose language models, they are not yet capable of outperforming their fine-tuned counterparts.

The main question is what effect these super-sized models will have on the research community. Researchers in academia have driven many of the innovations within language processing. However, we have now reached a point at which they cannot join the battle for general-purpose language models. They simply lack the computational resources to compete with and challenge this small set of companies.

Within the current AI community, there is a huge push towards bigger language models which achieve state-of-the-art performance. This trend is enabled by a combination of hardware and software developments, and the use of supercomputers.

The future will show whether these general-purpose language models can outperform the state of the art and be widely adopted by the industry.