NLP Breakthrough: Pre-Training

In July of 2023 I will start a new job as a software engineer in the natural language processing (NLP) field, and I could not be more excited. As part of entering this space I’ve been reviewing as much of the field’s history and literature as I can, and the recent developments are absolutely fascinating. In this post I’ll discuss one of the key breakthroughs that has led to the rapid improvements in NLP models we’re witnessing now: “pre-training”, a.k.a. the ability for AI to read.

(Image made with DALL·E given the following prompt: “A robot reading a book in a dimly lit coffee shop in an impressionist style.”)

In NLP, “pre-training” is where an AI model is trained on a generic dataset and task in order to build up the model’s general capability. These tasks are usually very simple; one example is language modeling, where the objective is to predict the next word given a string of text [1]. While this task sounds simple, it turns out that in order to predict the next word you need to learn a lot about language. In fact, if you want to get really good at predicting the next word, it helps to understand the meaning of the preceding text. This is the basis for training the now popular GPT series of AI models: the G stands for Generative (creating, or predicting, text), the P stands for, wait for it… Pre-trained, and the T for Transformer, the underlying model architecture.
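To make the task concrete, here is a minimal sketch of next-word prediction. This is just my own illustration (not anything from the GPT papers themselves), and it assumes you have the Hugging Face transformers library and PyTorch installed so we can borrow the small, publicly available GPT-2 model:

```python
# Ask a small pre-trained language model (GPT-2) which word most likely comes next.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

text = "The robot walked into the coffee shop and ordered a"
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits        # a score for every vocabulary token, at every position

next_token_id = logits[0, -1].argmax().item()   # highest-scoring token after the final word
print(tokenizer.decode(next_token_id))          # the model's guess for the next word
```

That single “guess the next word” step, repeated over billions of words, is the whole pre-training objective.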

One crucial detail about this simple language modeling task is that once an AI is shown how to do it, it no longer needs a human to help it learn. If we give the AI a book, all it has to do is grab a chunk of text, block out the last word (it’s pretty disciplined about not peeking) and try to guess what that last word is based on the previous text. In AI this is referred to as a form of self-supervised learning, and it is an amazing thing. Now that models could effectively train themselves, the natural thing to do was 1) give them more data to learn from (more books to read) and 2) make the brain bigger.
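Here is a toy sketch of how those self-supervised examples could be carved out of a book. Again, this is my own illustration rather than any particular model’s actual training code, but it shows why no human labeling is needed:

```python
# A toy illustration of self-supervised example construction:
# slide a window over the text, hide the last word, and use it as the label.
# No human labeling required -- the "answer" is already in the book.

def make_training_pairs(text, context_size=5):
    words = text.split()
    pairs = []
    for i in range(context_size, len(words)):
        context = words[i - context_size:i]  # the words the model gets to see
        target = words[i]                    # the blocked-out word it has to guess
        pairs.append((context, target))
    return pairs

book_excerpt = "It was the best of times it was the worst of times"
for context, target in make_training_pairs(book_excerpt):
    print(" ".join(context), "->", target)
```

Every sentence in every book becomes free training data, which is exactly why the next two levers were more data and bigger models.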

(Image made with DALL·E given the following prompt: “A brain messily eating a library with books spilling everywhere in an impressionist style.” I’ll admit it didn’t turn out as expected.)

Today’s language models are enormous, going from ~117 million parameters in the original GPT in 2018 to 540 billion parameters in PaLM in 2022, a roughly 4,600-times increase for those who haven’t already grabbed their calculators. So brains are getting bigger. What about the amount that models are reading?

The original GPT was no slouch and an avid reader: its training data included “over 7,000 unique unpublished books from a variety of genres including Adventure, Fantasy, and Romance” [2], or roughly 1 billion words. Today’s language models are on another level altogether. For Google’s LaMDA language model the training corpus consisted of 1,560 billion words, a 1,560-times increase (no calculator needed). What is interesting about the datasets for today’s models is that they consume any and all text available to train on, and to an increasing degree that text comes from social media (yikes). With LaMDA, for example, about 50% of the total dataset comes from social media.

To sum up before this gets too long: a large part of the success of today’s AI language models is their ability to read and consume information in a self-supervised way. It will be interesting to see how this scales in the future. Will models just continue to get bigger? Will datasets continue to grow (for example, by transcribing social media videos and feeding that text into models)? Either way, with recent developments like ChatGPT, I’m excited for what’s to come!
