Intro to Transformers for a Beginner
Ideal for the Non-Technical Background
Hi my name is Fazal and I am a freshman at UC Berkeley. The following document is a collection of my learnings/understanding of transformer models. Follow along as I explain the novel architecture and build my own from scratch to really understand the inner workings. This document is meant for people of all levels of expertise, and should only require a curious, thoughtful mind.
Introduction
With the rise of AI like ChatGPT and Google’s Bard, the race to build the world’s first fully sentient AI is underway. We have to recognize that to train a ChatGPT-level chatbot, we have to take a system from zero knowledge of a language to one that is “smart” enough to convey ideas thoughtfully. Several major companies are trying their luck at constructing and training their own models with unique datasets, parameters, frameworks, etc. While each of these models is constructed differently, they all share the same underlying architecture that allows them to work so well. This novel architecture is known as the transformer.
Transformers are a kind of machine learning model used for sequence-to-sequence modeling. The simplest explanation of a transformer is a model that can take an input sequence and generate an output sequence. Some examples include language translation or summarization. In both cases, the model expects a collection of words (the input sequence) and outputs a new collection of words (the output sequence).
With so much innovation happening on this front, I found myself wondering how these transformers actually work. How can I ask these systems a question about any random topic and get an instantaneous response that is not only accurate, but also given to me in a conversational manner, as if I were talking to another person? Let’s dive in.
In the remainder of this document, I will be referring to a transformer used for text-to-text generation.
What Makes Transformers Better Than Other Models?
In our daily lives, whether we are talking to a friend, solving a math problem, or even microwaving something, we often use our past experiences or memories to help us make informed decisions in the present. We do this without even thinking for the most part; even in linear algebra, we still remember how to factor because we learned how to years ago. When we are deciding what to eat out for dinner, we take into consideration that just a week ago we already had Chinese food. While this idea of extracting relevant thoughts from the past and using them to make judgements is second nature for us, it is not so easy for an AI. How can we tell our model which “memories” it should be using as well as how relevant they even are? Transformers ultimately aim to solve this problem and it is why they are so good at tasks like question answering or summarization. It is the idea that they are able to actually understand and learn from context.
This idea I just described is what is known as attention. In essence, attention allows the model to focus on specific parts of its input sequence when generating the output sequence, which significantly enhances its ability to understand context and generate accurate results.
To understand, let’s use an example of a transformer whose job is to predict the next word in a sentence. Say the input sentence is “I love the colors red and ” and we would like the output sequence to be just “blue.” Attention basically allows the model to look at the sentence it was given and find links between different words. For example, our transformer may learn that “red” is what is known as a “color” and that “and” means the output should include another “color” to complete the thought. While the actual computations are a bit more complex, this simple explanation accurately represents how a transformer learns and logically deduces what the next word may be.
With this example, you may realize how the idea of attention can be powerful especially when given a much longer input sequence. If your transformer receives a paragraph as input, it is able to create more links between words, effectively increasing its understanding of the overall input. This concept is specifically known as self-attention as it involves the transformer looking solely at its input sequence.
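To make this concrete, here is a tiny, hand-wavy sketch in PyTorch. The 2-dimensional vectors below are completely made up for illustration; real models learn much larger ones, as we will see in the section on embeddings. Each word is represented by a small list of numbers, every word is compared to “and” with a dot product, and a softmax turns those scores into percentages that say where the model should focus.

```python
import torch
import torch.nn.functional as F

# Made-up 2-dimensional vectors for the words in "I love the colors red and"
words = ["I", "love", "the", "colors", "red", "and"]
vectors = torch.tensor([
    [0.1, 0.3],   # I
    [0.5, 0.2],   # love
    [0.0, 0.1],   # the
    [0.9, 0.8],   # colors
    [0.8, 0.9],   # red
    [0.7, 0.7],   # and
])

# Compare the last word ("and") to every word with a dot product
scores = vectors @ vectors[-1]          # one similarity score per word
weights = F.softmax(scores, dim=0)      # turn the scores into percentages

for word, w in zip(words, weights):
    print(f"{word:>7}: {w.item():.2f}") # "colors" and "red" get the largest weights here
```

With these toy numbers, “colors” and “red” end up with the highest attention weights, which matches the intuition that they are the most useful clues for predicting “blue.”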
Transformer Architecture
This diagram, which shows the inner workings of a transformer, is by far the most straightforward way to understand how the model works, at least once you know what all the components do. At first glance we can see that it contains 2 separate blocks. The one on the left is the encoder and the one on the right is the decoder.
In this document, I will focus more on the decoder and only touch briefly on the encoder. The reason is that transformers don’t always need an encoder; in fact, decoder-only transformers excel at generating coherent sequences based on contextual information. Many state-of-the-art models, such as OpenAI’s GPT-4, happen to be decoder-only transformers.
Let’s unpack this diagram and understand each individual piece one at a time. Remember that all my examples and explanations are in the context of a transformer that uses sentences as input/output sequences. In other tutorials or transformer explanations, you may often see the word “token.” For our purposes, “tokens” and “words” are interchangeable; they are simply the pieces that make up the sequences that go in and out of the model.
Parts of the Transformer
Embeddings and Positional Encoding
Before we get started, though, there is a big problem we haven’t addressed. A computer is really good at understanding numbers; however, it has no way of understanding human language. We need a way to represent letters and words as numbers so that a machine can actually read them. We also need to make sure that these numbers maintain the words’ meanings and can eventually be turned back into words. This is where embeddings come in.
An embedding is a way of representing words as vectors of real numbers. The goal is to create these vectors in such a way that words with similar meanings have vectors that are close together in this vector space. This way, we’re not only representing the words in a way that the model can understand, but we’re also preserving the semantic relationships between words. For example, the words ‘king’ and ‘queen’ might be represented by vectors that are close together in this space, reflecting their related meanings.
While the meanings of the words themselves are important, there’s another piece of information that’s crucial for understanding the meaning of a sentence: the order of the words.
Consider the sentences: “The cat chased the mouse” and “The mouse chased the cat.” The only difference between these sentences is the order of the words, but that difference changes the entire meaning of the sentence.
To ensure our model takes into account the order of the words, we use something called positional encoding. This is an additional vector that we add to the embedding for each word, and it represents the position of the word in the sentence. This way, the model has access to information about both the meaning of the word (from the embedding) and its position in the sentence (from the positional encoding).
With both the embeddings and positional encodings, we’re now ready to pass our sentence into the transformer, where it can begin the complex task of understanding and generating language.
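As a rough sketch of how this looks in code (this assumes PyTorch and uses learned position embeddings for simplicity, rather than the sine/cosine formula from the original paper), the vector that actually enters the transformer is just the word embedding plus the position embedding:

```python
import torch
import torch.nn as nn

vocab_size = 10000   # how many distinct words/tokens we know
embed_dim  = 64      # size of each word vector
max_len    = 128     # longest sentence we expect

token_embedding    = nn.Embedding(vocab_size, embed_dim)   # word id -> vector
position_embedding = nn.Embedding(max_len, embed_dim)      # position -> vector

# A toy "sentence" of 5 token ids (in a real model these come from a tokenizer)
tokens = torch.tensor([[4, 17, 902, 33, 8]])                # shape (batch=1, length=5)
positions = torch.arange(tokens.shape[1])                   # [0, 1, 2, 3, 4]

x = token_embedding(tokens) + position_embedding(positions) # shape (1, 5, 64)
print(x.shape)
```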
Multi-Head Attention
We talked earlier about self-attention and why it is so powerful. If we implement self-attention with just one “head,” each word gets a single set of attention scores that blends together everything it needs to know about the other words. For a sentence like “I like to eat apples,” the word “apples” would get exactly one blended view of “I”, “like”, “to”, and “eat.” The issue with this approach is that one set of scores can only capture one kind of relationship at a time; if that head is busy tracking the grammar of the sentence, it cannot also track which words carry the meaning or refer to the same thing.
This is the problem that multi-head attention solves. If we used multi-head attention with 4 heads in the example above, we would run four separate attention computations in parallel, each with its own learned way of comparing the words, and then combine their results. One head might focus on grammar, another on meaning, another on word order, and so on. You can see that by increasing the number of independent comparisons we take into account, we increase the amount of context our model has.
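As a hedged sketch of the idea, PyTorch’s built-in nn.MultiheadAttention module runs several attention heads in parallel, each with its own learned way of comparing words, and combines their results; the sizes below are arbitrary:

```python
import torch
import torch.nn as nn

embed_dim, num_heads = 64, 4
attention = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)

# A batch of 1 "sentence" with 5 word vectors (random numbers as placeholders)
x = torch.randn(1, 5, embed_dim)

# Self-attention: the sentence acts as its own queries, keys, and values.
# Each of the 4 heads learns its own way of comparing the words.
output, weights = attention(x, x, x)
print(output.shape)   # (1, 5, 64) -- one updated vector per word
print(weights.shape)  # (1, 5, 5)  -- attention scores between every pair of words
```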
Masked Multi-Head Attention
In the diagram above we see two kinds of attention: masked and regular.
To explain why this is helpful we need to delve further into how an input sentence actually gets processed. For the sentence “I like to eat apples” we actually get a total of 4 training samples for the model. Remember that the overall goal of a transformer is to predict the next word in a sentence. Therefore we can train the model to predict the next word 4 different times for a sentence with 5 words:
“I” → “like”
“I like” → “to”
“I like to” → “eat”
“I like to eat ” → “apples”
But the issue is that if we use traditional (unmasked) self-attention, we can only extract 1 training example instead of the 4 we see above. In machine learning, the more data the merrier, and throwing away these examples significantly hinders our model’s ability to learn. In order to maximize the training examples we can get out of a single sentence, we have to implement masked self-attention.
When we use masked self-attention in the decoder, we prevent the model from seeing future words in the sequence. This is essential during training because we want the model to predict the next word based on the previous words only, not future ones. If we give it the entire sentence, it will obviously be able to guess the next word because it has already seen it. This means that for masked self-attention for the second example above: “I like” → “to,” the only words that we compare to each other are “I” and “like.” Intuitively, if we also compared “to” (next word in sequence) to “I” and “like,” then the model would already know that “to” is the next word in the sequence and no learning would actually occur. It’s as if we give a student a homework problem but they just copy the answer down and submit it; no learning occurs.
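Below is a minimal sketch of a single masked self-attention head, using the key/query/value formulation (covered more in the code linked at the end of this document). The important part is the torch.tril lower-triangular mask, which hides future words from each position:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskedSelfAttentionHead(nn.Module):
    """One attention head that only lets each word look at earlier words."""

    def __init__(self, embed_dim, head_dim, max_len=128):
        super().__init__()
        self.key   = nn.Linear(embed_dim, head_dim, bias=False)
        self.query = nn.Linear(embed_dim, head_dim, bias=False)
        self.value = nn.Linear(embed_dim, head_dim, bias=False)
        # Lower-triangular matrix: row i has 1s only for positions <= i
        self.register_buffer("mask", torch.tril(torch.ones(max_len, max_len)))

    def forward(self, x):                       # x: (batch, length, embed_dim)
        B, T, _ = x.shape
        k, q, v = self.key(x), self.query(x), self.value(x)
        scores = q @ k.transpose(-2, -1) / k.shape[-1] ** 0.5   # (B, T, T)
        # Hide the future: positions after the current word become -infinity
        scores = scores.masked_fill(self.mask[:T, :T] == 0, float("-inf"))
        weights = F.softmax(scores, dim=-1)
        return weights @ v                       # (B, T, head_dim)

head = MaskedSelfAttentionHead(embed_dim=64, head_dim=16)
out = head(torch.randn(1, 5, 64))
print(out.shape)  # (1, 5, 16)
```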
Feed Forward
After we pass through a multi-head attention layer, we add a feed forward layer which is essentially just a simple neural network with a few layers. But why do we even need this?
We can think of the attention layer as a way of gathering all the clues we need to solve the mystery; however, even if we have all the clues, we still need one final layer of logical reasoning to put them all together. In our case, this final layer of reasoning is the feed forward layer. The more technical explanation is that the attention step only ever mixes word vectors together as weighted sums, which is a linear operation. In machine learning, we add non-linear functions to give the system the depth it needs to pick up on patterns that may not be so obvious.
If we were to take out this layer, it may end up with a very basic understanding of the English language but it would certainly not be on par with a human’s understanding. This feed forward layer is what allows the transformer to learn when certain words should be emphasized, when to rearrange a sentence to make it more concise, or how to use pronouns/antecedents correctly.
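For reference, here is a sketch of what this feed-forward layer usually looks like (the 4x expansion follows the original paper, but the exact sizes are a design choice):

```python
import torch
import torch.nn as nn

class FeedForward(nn.Module):
    """A small per-word neural network applied after the attention layer."""

    def __init__(self, embed_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(embed_dim, 4 * embed_dim),  # expand
            nn.ReLU(),                            # the non-linearity that adds depth
            nn.Linear(4 * embed_dim, embed_dim),  # project back down
        )

    def forward(self, x):
        return self.net(x)

ff = FeedForward(embed_dim=64)
print(ff(torch.randn(1, 5, 64)).shape)  # (1, 5, 64), same shape in and out
```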
Layer Normalization
We aren’t going to cover layer normalization in depth because it is hard to appreciate why we actually need it without more background. To get a tiny grasp of why it may be useful, just remember that we are trying to give our model the most stable learning environment we can. Let’s use an example to understand.
Imagine you’re baking a cake. When you mix the ingredients together, you want each part of the batter to be consistent. You don’t want some parts to be too wet and others too dry. In a way, neural networks are like recipes for solving problems. Layer normalization is a technique we use to make sure each “ingredient” (or neuron) in the network is working well together.
Now, in the world of neural networks, we have these layers of neurons that process information. Sometimes, these layers can get a bit too excited or not excited enough. It’s like having some ingredients in your cake batter too hot and others too cold. This can make the network’s job harder because the information it’s working with is not consistent.
Layer normalization steps in like a chef checking the temperature of all ingredients. It makes sure that the values coming out of each neuron are just right, not too hot or too cold. This helps the neural network learn more efficiently and makes the whole process smoother, just like baking a perfect cake.
In simpler terms, layer normalization is like making sure all the cooking ingredients in your recipe are at the right temperature, so your neural network (or recipe) can work better and produce a more accurate result.
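If you want to see the “temperature check” in code, here is a tiny sketch: layer normalization rescales each word’s vector so its values have a mean of roughly 0 and a spread (standard deviation) of roughly 1.

```python
import torch
import torch.nn as nn

layer_norm = nn.LayerNorm(4)

x = torch.tensor([[100.0, 2.0, -50.0, 7.0]])   # values on wildly different scales
y = layer_norm(x)

print(y)                               # values now on a comparable scale
print(y.mean().item())                 # roughly 0
print(y.std(unbiased=False).item())    # roughly 1
```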
How the Blocks Fit Together
There are 2 ways we can build a transformer. The first includes the encoder and the second leaves it out. Let’s start with the encoder and decoder strategy.
Encoder and Decoder
The simplest way to think of the encoder and decoder framework is through the example of text translation. Let’s say we want to make a transformer that can translate text from French to English. In this case, it would be the encoder’s job to extract meaning from the French sentence and then pass that to the decoder. The decoder would then be tasked with actually translating the sentence word by word while maintaining a similar meaning to the original French input.
We start by passing in a positionally encoded embedding (input sequence) into the encoder block. We can see above that the encoder block (left) contains 2 main pieces: the multi-head attention layer and a feed forward layer. Once the inputs are passed through this block a few times (denoted by N in the diagram), the output is a new sequence that is a much more meaningful representation of the input sequence. This representation has information about the relationships between different words and is the final output of the encoder that is eventually passed into the decoder.
Since the job of the encoder is done, let’s talk about the decoder. In the context of our example, the decoder starts by positionally encoding the output sequence (the English sentence) and passing that into a masked self-attention layer. After going through this layer, the model essentially learns how each word in the English sentence relates to the words before it. Remember that in masked attention, we want to maximize learning by using one input sequence to create multiple training samples for our model. So when we pass the English sentence into the decoder, the model learns how different English words associate with each other, much like the encoder did for the French sentence.
After the first masked multi-head self-attention layer comes another multi-head attention layer, and this is where the encoder and decoder meet. The job of this layer is for the model to draw connections between French and English words. It compares words of the English sentence to words of the French sentence and learns how they relate to each other in an actual sentence. The final part of the decoder is a feed forward layer that allows the model to use what it has figured out about French and English words to draw deeper connections between the two languages. Remember that the feed forward layer is what gives the model the extra depth it needs to really figure out the intricacies of both languages and how they compare.
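Here is a hedged sketch of this encoder-decoder attention (often called cross-attention): the queries come from the English words the decoder is working on, while the keys and values come from the encoder’s French representation. The shapes below are purely illustrative.

```python
import torch
import torch.nn as nn

embed_dim, num_heads = 64, 4
cross_attention = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)

french_encoding = torch.randn(1, 7, embed_dim)   # encoder output: 7 French words
english_so_far  = torch.randn(1, 4, embed_dim)   # decoder state: 4 English words

# Queries come from the English words; keys and values come from the French ones,
# so each English word effectively asks "which French words matter for me?"
output, weights = cross_attention(english_so_far, french_encoding, french_encoding)
print(output.shape)   # (1, 4, 64)
print(weights.shape)  # (1, 4, 7) -- English words attending to French words
```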
So, the transformer’s job is to act as a bridge between languages, taking a sentence in one language, encapsulating its meaning in a universal ‘machine language’, and then rendering that meaning in another language. It’s as if the decoder is reading the encoder’s mind and translating its thoughts into English.
Now that we understand the encoder/decoder architecture, we can see how it can be useful for tasks such as text translation or text to code. Essentially any problem that involves converting a sequence into something new can take advantage of the encoder/decoder architecture.
Decoder Only
Now let’s talk about how the decoder-only architecture works. In a decoder-only transformer, the focus is solely on continuing an input sequence: the model reads what it has been given and generates the output one word at a time. This type of architecture is quite powerful and forms the basis for models like GPT, which generate human-like text.
Let’s understand this using an example. Suppose we’re training a transformer to generate a piece of news text based on a given headline. The input would be the headline, and the output would be the generated news text. The process starts with a positionally encoded embedding of the input sequence (the headline).
The encoded input then passes through a masked multi-head attention layer. Remember that the ‘attention’ mechanism allows the transformer to focus on different parts of the input sequence at different times, understanding the context and relationships between the words. Also recall that the ‘masking’ ensures that the model only considers the current and previous words, not the future ones, when predicting the next word. This way, the model learns to generate text that logically follows from the given input.
Next, the model goes through a feed-forward layer. This is a simple neural network that adds depth to the model’s understanding, allowing it to capture more complex patterns and relationships in the data.
The output of the feed-forward network is then used to predict the next word in the output sequence. The process repeats, with each new word being added to the input for the next prediction, until a full news text is generated from the given headline.
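Here is a rough sketch of that loop, assuming a hypothetical model that returns a score for every possible next word and a hypothetical tokenizer that converts between words and ids (neither is defined here):

```python
import torch

def generate(model, tokenizer, headline, max_new_words=50):
    """Repeatedly predict the next word and append it to the sequence."""
    tokens = tokenizer.encode(headline)              # words -> list of ids (assumed)
    for _ in range(max_new_words):
        x = torch.tensor([tokens])                   # shape (1, current_length)
        scores = model(x)[0, -1]                     # scores for the *next* word only
        next_token = torch.argmax(scores).item()     # pick the most likely word
        tokens.append(next_token)                    # feed it back in and repeat
    return tokenizer.decode(tokens)                  # ids -> words (assumed)
```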
So, in summary, a decoder-only transformer architecture takes an input sequence, pays ‘attention’ to its context, learns from it using a feed-forward network, and generates a relevant output sequence based on what it’s learned. This makes it ideal for tasks like text generation, where the goal is to produce coherent and contextually relevant text based on a given input.
Coding my Own Transformer
After I had learned how the internals of the transformer worked, I set out to build my own. I followed along with Andrej Karpathy’s tutorials and annotated each line of code to make sure I understood the point behind each line, layer, and matrix operation. This implementation uses PyTorch, an excellent machine learning library that I found very beginner friendly (this was my first time using it). The goal of the transformer I trained was to write in a style similar to Shakespeare’s. I had a file that contained all of Shakespeare’s works and randomly sampled input sequences from it as training data.
You can find the code here:
(Currently only the decoder implementation works; encoder still in the works)
While the explanation above is very beginner friendly, the code goes a little more in depth and explains the internals of the transformer. Specifically, it implements the key/query/value formulation of self-attention that we only sketched briefly above. It also implements the layer normalization that we didn’t cover in depth.
I recommend going over the code and trying to understand how its structure translates to the diagram we saw above. After coding the transformer, I have definitely gained a deeper understanding of the architecture and of how intuitive operations like self-attention really are.
Conclusion
And that’s it! You now have a solid understanding of how a transformer works and a good starting point for exploring this architecture further.