VIDEO TRANSCRIPT
Looking back, what is the most beautiful or surprising idea in deep learning or AI in general that you've come across? You've seen this field explode and grow in interesting ways. What cool ideas, like, made you sit back and go, hmm, big or small? Well, the one that I've been thinking about recently the most is probably the transformer architecture. So basically, in neural networks a lot of architectures that were trendy have come and gone for different sensory modalities, like for vision, audio, text. You would process them with different-looking neural nets. And recently, we've seen this convergence towards one architecture, the transformer.
And you can feed it video or you can feed it images or speech or text, and it just gobbles it up. And it's kind of like a bit of a general purpose computer that is also trainable and very efficient to run on our hardware. And so this paper came out in 2017, I want to say. Attention is all you need. Attention is all you need. You criticized the paper title in retrospect, that it didn't foresee the bigness of the impact that it was going to have. Yeah. I'm not sure if the authors were aware of the impact that that paper would go on to have. Probably they weren't.
But I think they were aware of some of the motivations and design decisions behind the paper, and they chose not to, I think, expand on it in that way in the paper. And so I think they had an idea that there was more than just the surface of just like, oh, we're just doing translation and here's a better architecture. You're not just doing translation. This is like a really cool, differentiable, optimizable, efficient computer that you've proposed. And maybe they didn't have all of that foresight, but I think it's really interesting. Isn't it funny, sorry to interrupt, that that title is memeable, that they went for such a profound idea.
They went with a, I don't think anyone used that kind of title before, right? Attention is all you need. Yeah, it's like a meme or something. Yeah. Isn't that funny? That one, like maybe if it was a more serious title, it wouldn't have the impact. Honestly, yeah, there is an element of me that honestly agrees with you and prefers it this way. Yes. If it was too grand, it would over promise and then under deliver potentially. So you want to just meme your way to greatness. That should be a t-shirt. So you tweeted, the transformer is a magnificent neural network architecture because it is a general purpose, differentiable computer.
It is simultaneously expressive in the forward pass, optimizable via backpropagation and gradient descent, and efficient as a high-parallelism compute graph. Can you discuss some of those details, expressive, optimizable, efficient, from memory or in general, whatever comes to your heart? You want to have a general purpose computer that you can train on arbitrary problems, like say the task of next word prediction, or detecting if there's a cat in an image or something like that. You want to train this computer, so you want to set its weights. I think there's a number of design criteria that overlap in the transformer simultaneously that made it very successful. I think the authors were deliberately trying to make this really powerful architecture.
Basically it's very powerful in the forward pass because it's able to express very general computation as something that looks like message passing. You have nodes and they all store vectors. These nodes get to basically look at each other's vectors and they get to communicate. Basically nodes get to broadcast, hey, I'm looking for certain things. Then other nodes get to broadcast, hey, these are the things I have. Those are the keys and the values. So it's not just attention. Yeah, exactly. Transformers are much more than just the attention component. There are many architectural pieces that went into it. The residual connections, the way it's arranged.
There's a multi-layer perceptron in there, the way it's stacked and so on. But basically there's a message passing scheme where nodes get to look at each other, decide what's interesting and then update each other. So I think when you get to the details of it, I think it's a very expressive function so it can express lots of different types of algorithms in the forward pass. Not only that, but the way it's designed with the residual connections, layer normalizations, the softmax attention and everything, it's also optimizable.
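(As a rough illustration of the message-passing picture described here: a minimal sketch of the attention step in PyTorch, where each node's vector produces a query, a key, and a value. The names and shapes are illustrative assumptions, not any particular implementation.)

```python
import torch
import torch.nn.functional as F

def attend(x, w_q, w_k, w_v):
    # x: (num_nodes, d_model) -- every node (token) stores a vector
    q = x @ w_q  # "hey, I'm looking for certain things" (queries)
    k = x @ w_k  # "hey, these are the things I have" (keys)
    v = x @ w_v  # the information each node will actually share (values)
    # nodes look at each other: affinity between queries and keys
    scores = q @ k.T / (k.shape[-1] ** 0.5)
    weights = F.softmax(scores, dim=-1)
    # each node updates itself with a weighted mix of the other nodes' values
    return weights @ v
```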
This is a really big deal because there's lots of computers that are powerful that you can't optimize, or they're not easy to optimize using the techniques that we have, which is backpropagation and gradient descent. These are first order methods, very simple optimizers really. And so you also need it to be optimizable. And then lastly, you want it to run efficiently on our hardware. Our hardware is a massive throughput machine, like GPUs. They prefer lots of parallelism. So you don't want to do lots of sequential operations. You want to do a lot of operations in parallel. And the transformer is designed with that in mind as well.
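(To make the parallelism point concrete, a small sketch contrasting a sequential recurrent-style loop with the transformer-style batched matrix multiply; the sizes are arbitrary and the recurrence is only a stand-in, not a claim about any specific model.)

```python
import torch

T, d = 1024, 512
x = torch.randn(T, d)
w = torch.randn(d, d)

# sequential: step t depends on step t-1, so the T steps cannot run in parallel
h = torch.zeros(d)
for t in range(T):
    h = torch.tanh(x[t] @ w + h)

# transformer-style: one big matrix multiply over all positions at once,
# exactly the kind of high-throughput work GPUs are built for
scores = (x @ w) @ x.T / d ** 0.5
out = torch.softmax(scores, dim=-1) @ x
```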
And so it's designed for our hardware, and it's designed to both be very expressive in the forward pass, but also very optimizable in the backward pass. And you said that the residual connections support a kind of ability to learn short algorithms fast and first, and then gradually extend them longer during training. What's the idea of learning short algorithms? Right. Think of it as a... So basically a transformer is a series of blocks, right? And these blocks have attention and a little multilayer perceptron. And so you go off into a block and you come back to this residual pathway. And then you go off and you come back.
And then you have a number of layers arranged sequentially. And so the way to look at it, I think, is because of the residual pathway in the backward pass, the gradients sort of flow along it uninterrupted, because addition distributes the gradient equally to all of its branches. So the gradient from the supervision at the top just flows directly to the first layer. And all the residual connections are arranged so that in the beginning, during initialization, they contribute nothing to the residual pathway. So what it kind of looks like is, imagine the transformer is kind of like a Python function, like a def. And you get to do various lines of code.
Say you have a hundred layers deep transformer. Typically, they would be much shorter, say 20. So you have 20 lines of code and you can do something in them. And so think of during the optimization, basically what it looks like is first you optimize the first line of code and then the second line of code can kick in and the third line of code can kick in. And I kind of feel like because of the residual pathway and the dynamics of the optimization, you can sort of learn a very short algorithm that gets the approximate answer, but then the other layers can sort of kick in and start to create a contribution.
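(A tiny numerical sketch, assuming PyTorch, of the point that addition distributes the gradient equally to its branches: the supervision at the top reaches the input directly through the residual pathway, even when the branch contributes nothing at initialization.)

```python
import torch

x = torch.ones(3, requires_grad=True)

def branch(h):
    # stand-in for an attention/MLP block that contributes ~nothing at init
    return 0.0 * h

y = x + branch(x)   # residual connection: output = input + branch(input)
y.sum().backward()  # "supervision at the top"

print(x.grad)       # tensor([1., 1., 1.]) -- the gradient flows through uninterrupted
```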
And at the end of it, you're optimizing over an algorithm that is 20 lines of code, except these lines of code are very complex because it's an entire block of a transformer. You can do a lot in there. What's really interesting is that this transformer architecture actually has been remarkably resilient. Basically, the transformer that came out in 2017 is the transformer you would use today, except you reshuffle some of the layer norms. The layer normalizations have been reshuffled to a pre-norm formulation. And so it's been remarkably stable, but there's a lot of bells and whistles that people have attached to it and tried to improve it.
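(For the pre-norm reshuffle mentioned above, a block might look roughly like the sketch below: the layer norms sit inside each branch, so the residual pathway itself remains a plain sum. This is a hypothetical minimal module built from standard PyTorch layers, not a reference implementation.)

```python
import torch.nn as nn

class PreNormBlock(nn.Module):
    def __init__(self, d_model, n_heads, d_ff):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model)
        )

    def forward(self, x):
        # "go off into a block and come back to the residual pathway"
        h = self.ln1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        x = x + self.mlp(self.ln2(x))
        return x
```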
I do think that basically it's a big step in simultaneously optimizing for lots of properties of a desirable neural network architecture. And I think people have been trying to change it, but it's proven remarkably resilient. But I do think that there should be even better architectures potentially. But you admire the resilience here. There's something profound about this architecture that's at least resilient. Maybe everything can be turned into a problem that transformers can solve. Currently, it definitely looks like the transformer is taking over AI and you can feed basically arbitrary problems into it. And it's a general differentiable computer, and it's extremely powerful. And this convergence in AI has been really interesting to watch for me personally.
What else do you think could be discovered here about transformers? What surprising thing, or is it a stable... I want to say, stable place. Is there something interesting we might discover about transformers? Like aha moments, maybe it has to do with memory, maybe knowledge representation, that kind of stuff? Definitely the zeitgeist today is just pushing... Basically right now the zeitgeist is: do not touch the transformer, touch everything else. So people are scaling up the data sets, making them much, much bigger. They're working on the evaluation, making the evaluation much, much bigger. And they're basically keeping the architecture unchanged. And that's the last five years of progress in AI, kind of.