“Attention” and “Transformer” Architectures
James Hays
Recap – Semantic Segmentation
Outline
• Context and Receptive Field
• Going Beyond Convolutions in…
  • Text
  • Point Clouds
  • Images
Language understanding
… serve …
… great serve from Djokovic …
… be right back after I serve these salads …
The word “serve” takes on entirely different meanings depending on the surrounding context, so a model needs a receptive field wide enough to see that context.
So how do we fix these problems?
Slide Credit: Frank Dellaert https://dellaert.github.io/19F-4476/resources/receptiveField.pdf
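A quick way to make the receptive-field discussion concrete is to compute it directly. The sketch below is an illustration added here (not from the slides); `receptive_field` is a hypothetical helper that applies the standard recurrence for stacked convolutions.

```python
# Receptive-field growth through a stack of conv layers, using the
# standard recurrence:
#   r_out = r_in + (k - 1) * j_in,   j_out = j_in * s
# where r is the receptive field, j the cumulative stride ("jump"),
# k the kernel size, and s the stride of each layer.

def receptive_field(layers):
    """layers: list of (kernel_size, stride) pairs, input to output."""
    r, j = 1, 1  # one input pixel sees itself; unit jump
    for k, s in layers:
        r += (k - 1) * j
        j *= s
    return r

# Three stride-1 3x3 convs only see a 7x7 window: context grows slowly,
# which is the problem dilation and attention both try to address.
print(receptive_field([(3, 1), (3, 1), (3, 1)]))  # -> 7
```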
Dilated Convolution
Figure source: https://github.com/vdumoulin/conv_arithmetic
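Dilated (atrous) convolution spreads the kernel taps apart so the receptive field grows faster without adding parameters. A minimal PyTorch sketch, with shapes chosen here purely for illustration:

```python
import torch
import torch.nn as nn

# Both layers have 3x3 = 9 weights, but with dilation=2 the taps are
# spread out so each output sees a 5x5 input window instead of 3x3.
dense   = nn.Conv2d(1, 1, kernel_size=3, padding=1, dilation=1)
dilated = nn.Conv2d(1, 1, kernel_size=3, padding=2, dilation=2)

x = torch.randn(1, 1, 32, 32)
print(dense(x).shape, dilated(x).shape)  # both (1, 1, 32, 32)
```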
Sequence-to-sequence models in language
Source: https://lilianweng.github.io/lil-log/2018/06/24/attention-attention.html
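The key addition to these encoder-decoder models is an attention step that lets the decoder look back at every encoder state instead of relying on a single bottleneck vector. A minimal sketch (illustrative NumPy, not code from the linked post):

```python
import numpy as np

def seq2seq_attention(dec_state, enc_states):
    """Dot-product attention of one decoder state over all encoder states.

    dec_state: (d,); enc_states: (n, d). Returns a context vector mixing
    encoder states by relevance, so the decoder can draw on any input
    position rather than a single fixed summary vector.
    """
    scores = enc_states @ dec_state        # (n,) similarity per position
    w = np.exp(scores - scores.max())
    w /= w.sum()                           # softmax attention weights
    return w @ enc_states                  # (d,) context vector

rng = np.random.default_rng(0)
ctx = seq2seq_attention(rng.normal(size=32), rng.normal(size=(7, 32)))
print(ctx.shape)  # (32,)
```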
Outline
• Context and Receptive Field
• Going Beyond Convolutions in…
  • Text
  • Point Clouds
  • Images
From https://medium.com/lsc-psd/introduction-of-self-attention-layer-in-transformer-fc7bff63f3bc
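In self-attention, the queries, keys, and values all come from the same sequence, so every token can gather context from every other token in a single step. A minimal sketch of the scaled dot-product form (illustrative shapes and names, not the article's code):

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention over one sequence.

    X: (n, d) token embeddings; Wq, Wk, Wv: (d, d_k) projections.
    Each output row is a weighted average of all value vectors, so any
    token can attend to any other regardless of distance.
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[1])            # (n, n) similarities
    w = np.exp(scores - scores.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)                 # row-wise softmax
    return w @ V                                      # (n, d_k)

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 16))
Wq, Wk, Wv = (rng.normal(size=(16, 8)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)  # (5, 8)
```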
![Page 25: “Attention” and “Transformer” Architectures](https://reader030.fdocuments.in/reader030/viewer/2022012708/61a8804d7de7c3430d39ee7b/html5/thumbnails/25.jpg)
![Page 26: “Attention” and “Transformer” Architectures](https://reader030.fdocuments.in/reader030/viewer/2022012708/61a8804d7de7c3430d39ee7b/html5/thumbnails/26.jpg)
![Page 27: “Attention” and “Transformer” Architectures](https://reader030.fdocuments.in/reader030/viewer/2022012708/61a8804d7de7c3430d39ee7b/html5/thumbnails/27.jpg)
Transformer Architecture
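A Transformer is essentially a stack of blocks that alternate multi-head self-attention with a position-wise MLP, each wrapped in a residual connection and layer norm. A minimal pre-norm sketch in PyTorch, with illustrative sizes rather than the exact configuration from the slides:

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    """One pre-norm Transformer encoder block: multi-head self-attention
    followed by a position-wise MLP, each with a residual connection."""

    def __init__(self, d_model=64, n_heads=4, d_ff=256):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):                  # x: (batch, seq_len, d_model)
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        return x + self.mlp(self.norm2(x))

block = EncoderBlock()
print(block(torch.randn(2, 10, 64)).shape)  # torch.Size([2, 10, 64])
```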
Outline
• Context and Receptive Field
• Going Beyond Convolutions in…
  • Text
  • Point Clouds
  • Images
Point Transformer. Hengshuang Zhao, Li Jiang, Jiaya Jia, Philip Torr, Vladlen Koltun
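Point clouds are unordered sets, which is a natural fit for attention: each point can attend to its spatial neighbors. The rough sketch below captures that general idea with plain scalar attention over k-nearest neighbors; the actual Point Transformer layer is more elaborate (vector attention with a learned relative positional encoding).

```python
import numpy as np

def knn_point_attention(points, feats, W, k=8):
    """Rough sketch: each point attends to its k nearest neighbors.

    This is the general idea in Point Transformer -- attention over local
    point neighborhoods -- not the paper's exact layer.

    points: (n, 3) xyz coordinates; feats: (n, d) features; W: (d, d).
    """
    n = points.shape[0]
    out = np.zeros_like(feats)
    for i in range(n):
        d2 = ((points - points[i]) ** 2).sum(axis=1)
        nbrs = np.argsort(d2)[:k]                 # indices of k nearest
        scores = feats[nbrs] @ (W @ feats[i])     # scalar attention scores
        w = np.exp(scores - scores.max())
        w /= w.sum()                              # softmax over neighbors
        out[i] = w @ feats[nbrs]                  # weighted neighbor average
    return out

rng = np.random.default_rng(0)
pts, f = rng.normal(size=(100, 3)), rng.normal(size=(100, 16))
print(knn_point_attention(pts, f, rng.normal(size=(16, 16))).shape)  # (100, 16)
```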
Outline
• Context and Receptive Field
• Going Beyond Convolutions in…
  • Text
  • Point Clouds
  • Images
When trained on mid-sized datasets such as ImageNet, such models yield modest accuracies of a few percentage points below ResNets of comparable size. This seemingly discouraging outcome may be expected: Transformers lack some of the inductive biases inherent to CNNs, such as translation equivariance and locality, and therefore do not generalize well when trained on insufficient amounts of data.
However, the picture changes if the models are trained on larger datasets (14M-300M images). We find that large scale training trumps inductive bias.
— Dosovitskiy et al., “An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale”
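The Vision Transformer (ViT) quoted above makes images digestible by a standard Transformer by cutting them into fixed-size patches and linearly projecting each one into a token. A sketch of that patch-embedding step, with common ViT-style sizes used purely for illustration:

```python
import torch
import torch.nn as nn

# A strided conv is equivalent to linearly projecting each flattened
# patch. Sizes (224px image, 16px patches, 768-dim tokens) follow common
# ViT configurations but are illustrative here.
img_size, patch, d_model = 224, 16, 768
to_tokens = nn.Conv2d(3, d_model, kernel_size=patch, stride=patch)

x = torch.randn(1, 3, img_size, img_size)         # one RGB image
tokens = to_tokens(x).flatten(2).transpose(1, 2)  # (1, 196, 768)
print(tokens.shape)  # 14 x 14 = 196 patch tokens, each 768-dimensional
```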
Summary
• “Attention” models outperform recurrent models and convolutional models for sequence processing and point processing. They allow long-range interactions.
• Surprisingly, they also seem to outperform convolutional networks for image processing tasks. Again, long-range interactions might be more important than we realized.
• Naïve attention mechanisms have quadratic complexity in the number of input tokens, but there are often workarounds, as sketched below.
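To see where that quadratic cost comes from: the attention score matrix has one entry per pair of tokens, so doubling the sequence length quadruples it. A quick illustration, with a note on one common family of workarounds:

```python
# Naive attention materializes an (n x n) score matrix: one entry per
# pair of tokens. Doubling n quadruples the entries (and the float32
# memory shown below).
for n in (1_000, 2_000, 4_000):
    mb = n * n * 4 / 1e6  # float32 bytes -> megabytes
    print(f"{n:>5} tokens -> {n * n:>12,} entries ({mb:8.1f} MB)")

# A common workaround restricts each token to a window of w neighbors,
# dropping the cost from O(n^2) to O(n * w), as in sliding-window
# attention schemes (e.g. Longformer).
```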
Reminder
• This is the final lecture. We won’t use the reading period or the final exam slot.
• Project 5 is out and due Friday.
• Project 6 is optional. It is due May 5th.
• The problem set will go out this week.
Thank you for making this semester work!