Transformers Who?
This last week has seen what used to be a subdued research battle between top tech companies like Google and Microsoft go public in a big way. By now everyone and their cousin has heard of ChatGPT and how it will change everything. We’ve seen launch presentations from Microsoft and Google about how they are building and using the best of AI for search. The outcomes of these new technologies are incredibly important for these companies, as highlighted by the 100 billion dollar drop in Alphabet’s market value after Google’s Bard chatbot gave a wrong answer in its demo… With this backdrop, I found one portion of Google’s presentation interesting. They splashed a 2017 research paper on the screen, basically claiming that ‘we were the ones that revolutionized this whole AI thing’. What was that paper, and what are the ‘Transformers’ they talked about?
To find out, we are going to need to go on a tour of the greatest hits of neural network architectures. I know, can’t wait, right? I promise it’s not as bad as it sounds… We have four to get through and I’ll go quickly:
- Basic Neural Networks
- Convolutional Neural Networks (CNNs)
- Recurrent Neural Networks (RNNs)
- Transformers!
Let’s start with a basic neural net. This is likely the picture you’ve seen if you have found yourself searching YouTube late at night for “what is a neural network anyway?”. It’s defined by input nodes connected to output nodes through some layers in between for fun, with every node in one layer connected to every node in the next. Not altogether too different from a Subway sandwich.
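If it helps to see the sandwich in code, here is a minimal sketch of a forward pass through a tiny fully connected network in plain Python. The layer sizes, the hand-picked weights, and the function names (`dense_layer`, `relu`, `forward`) are all made up for illustration; a real network would learn its weights during training.

```python
def dense_layer(inputs, weights, biases):
    """Fully connected layer: every input feeds every output node."""
    outputs = []
    for node_weights, bias in zip(weights, biases):
        total = sum(w * x for w, x in zip(node_weights, inputs)) + bias
        outputs.append(total)
    return outputs


def relu(values):
    """Simple non-linearity: negative values become zero."""
    return [max(0.0, v) for v in values]


def forward(inputs, layers):
    """Pass the inputs through each (weights, biases) layer in order."""
    activations = inputs
    for index, (weights, biases) in enumerate(layers):
        activations = dense_layer(activations, weights, biases)
        # Apply the non-linearity on the middle layers, not on the output.
        if index < len(layers) - 1:
            activations = relu(activations)
    return activations


# Hypothetical weights: 2 inputs -> 3 middle nodes -> 1 output.
layers = [
    ([[1.0, -1.0], [0.5, 0.5], [-1.0, 1.0]], [0.0, 0.0, 0.0]),
    ([[1.0, 2.0, 1.0]], [0.5]),
]
result = forward([2.0, 1.0], layers)
print(result)
```

The bread is the input and output, and each `(weights, biases)` pair is one layer of filling.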
Ok, one down. How about CNNs? Well, those are similar to our basic footlong sandwich neural network, but the input is handled a little differently. Instead of naively connecting every input to the middle sandwich layers, these networks more intelligently look at groups of neighboring inputs. Computer vision is a common area for these. Instead of just reading the value of each pixel independently, the network considers a patch of pixels all at once, like how you focus on one particular area of a picture. So CNNs are great where groups of inputs have meaning, like an area of a picture.
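Here is a rough sketch of that patch idea in plain Python: a small grid of weights (the kernel) slides across the image and scores one patch at a time. The tiny image, the edge-detecting kernel, and the name `convolve2d` are my own illustrative choices, and this version skips padding, strides, and multiple channels.

```python
def convolve2d(image, kernel):
    """Slide the kernel over the image (no padding, stride of 1)."""
    k_rows, k_cols = len(kernel), len(kernel[0])
    out_rows = len(image) - k_rows + 1
    out_cols = len(image[0]) - k_cols + 1
    output = []
    for r in range(out_rows):
        row = []
        for c in range(out_cols):
            # Score one patch: multiply it by the kernel and add it all up.
            patch_sum = 0.0
            for i in range(k_rows):
                for j in range(k_cols):
                    patch_sum += image[r + i][c + j] * kernel[i][j]
            row.append(patch_sum)
        output.append(row)
    return output


# A toy 3x4 "picture": dark on the left, bright on the right.
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
# A hand-made kernel that responds where brightness jumps left to right.
kernel = [
    [-1, 1],
    [-1, 1],
]
feature_map = convolve2d(image, kernel)
print(feature_map)
```

The output lights up only in the column where the dark and bright halves meet, which is the sense in which a group of pixels carries meaning that a single pixel does not.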
Over halfway now. RNNs are all about their first word, “Recurrent”. An RNN is basically one layer that continually calls itself, feeding its outputs back into its inputs. Why do this? Well, it turns out this is a great strategy for any sequential data, like a time series. Take stock prices: the previous value goes into the network layer, and the output is the next predicted stock price. Just rinse and repeat for the series. Or with text, the last word is the input, and the next word is the output. The great thing about these networks is they are easily expandable, like a slinky. Short text? 6-inch NN sandwich. Long text? Footlong NN sandwich. You get the idea, or maybe this is not making any sense, but I commend you for reading this far anyway.
RNN Reference: http://karpathy.github.io/2015/05/21/rnn-effectiveness/
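To make the recurrence concrete, here is a bare-bones sketch in plain Python where the same step function is reused for every item in the sequence. I am assuming a single-number hidden state and made-up weights (`w_in`, `w_hidden`, `bias`) to keep it readable; real RNNs use vectors and learned weight matrices.

```python
import math


def rnn_step(x, hidden, w_in, w_hidden, bias):
    """One recurrent step: mix the new input with the previous state."""
    return math.tanh(w_in * x + w_hidden * hidden + bias)


def run_rnn(sequence, w_in=0.5, w_hidden=0.8, bias=0.0):
    """Apply the same step to every item, carrying the state forward."""
    hidden = 0.0
    states = []
    for x in sequence:
        hidden = rnn_step(x, hidden, w_in, w_hidden, bias)
        states.append(hidden)
    return states


# The same layer handles a short sequence or a long one.
short_states = run_rnn([1.0, 2.0])
long_states = run_rnn([1.0, 2.0, 3.0, 4.0, 5.0])
print(len(short_states), len(long_states))
```

Nothing about the layer changes between the two calls; the loop just runs more times for the longer input, which is the slinky part.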
And finally… Transformers! As Google was keen to point out, these were developed by Google in 2017 and they’ve taken over as the neural network architecture of choice for cutting edge AI in many different areas. So what are they? Well, Transformers rely on a basic mechanism called ‘Attention’. Details can be found in the aptly named ‘Attention Is All You Need’ research article. (Quick aside: InceptionNet gets my gold star award for the best named neural network, from the paper ‘Going Deeper with Convolutions’, also by Google.) So what attention does is, instead of looking at portions of the input or reading the input recurrently word by word, it just reads in the whole input all at once. I know what you’re thinking: isn’t that where we started with the plain Subway sandwich network?
Yes, you are absolutely correct. Except one thing the attention mechanism does is learn where to focus its ‘attention’ within that input. Some of its internal sandwich layers are dedicated to learning which parts of the input are important for collecting certain types of information. This is somewhat similar to how a CNN will look at certain areas of an image at a time, except transformers look at the whole thing and dynamically learn what to focus on within that image or string of text. Basically, as a transformer network trains, it gets better and better at recognizing what types of input are coming in and how much emphasis to place on a given input. In practice you get many of these attention mechanisms, called heads, working together, each trying to answer a specific question about the input. For example, one might be looking at what the topic of a sentence is, while another is looking at what event is taking place. All of this results in the very cool sounding multi-head attention at the heart of a Transformer neural network.
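For the curious, here is a stripped-down sketch of the attention calculation in plain Python. Every token scores every other token, the scores become weights that sum to 1, and each output is a weighted blend of the whole input. Big assumption: a real Transformer first passes the tokens through learned query, key, and value projections, and gives each head its own set. I skip all of that here, use the raw token vectors directly, and fake the heads by slicing each vector, so treat this as an illustration of the idea rather than the paper's implementation.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attention(queries, keys, values):
    """Scaled dot-product attention over the whole sequence at once."""
    scale = math.sqrt(len(keys[0]))
    outputs, all_weights = [], []
    for q in queries:
        # How much should this token focus on each token in the input?
        weights = softmax([dot(q, k) / scale for k in keys])
        # Blend the values using those focus weights.
        mixed = [sum(w * v[d] for w, v in zip(weights, values))
                 for d in range(len(values[0]))]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights


def multi_head_attention(tokens, num_heads):
    """Toy heads: each head attends over its own slice of every vector."""
    head_size = len(tokens[0]) // num_heads
    per_head = []
    for h in range(num_heads):
        piece = [t[h * head_size:(h + 1) * head_size] for t in tokens]
        head_out, _ = attention(piece, piece, piece)
        per_head.append(head_out)
    # Glue each token's head outputs back into one vector.
    return [sum((head[i] for head in per_head), [])
            for i in range(len(tokens))]


# Three made-up token vectors standing in for a three-word input.
tokens = [
    [1.0, 0.0, 0.0, 1.0],
    [0.0, 1.0, 1.0, 0.0],
    [1.0, 0.0, 1.0, 0.0],
]
_, weights = attention(tokens, tokens, tokens)
combined = multi_head_attention(tokens, num_heads=2)
print(weights[0])
```

In `weights[0]`, the first token puts the most weight on itself, some on the third token it partly resembles, and the least on the second. Training is what turns this kind of raw similarity into useful questions about the input.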
That’s it. Sorry, I lied about the whole quick thing, but we had a lot of ground to cover. Long story short, the basic architectures of neural networks continue to evolve. Transformers hold the current title of latest and greatest, until the next sandwich comes along.