1. Transformers Theory
1.1 Earlier
- Use existing libraries
- Just a few function calls
- But no understanding of how transformers work
1.2 Section Outline
- It’s all theoretical / conceptual
- Next section will look at implementations
- This section already has everything you need to implement
- Code agnostic: use TensorFlow, PyTorch etc.
1.3 Preparation
In one sentence: understand CNNs and RNNs
- Convolution / finding features / dot products
- Stacking convolutional filters to make a convolution layer
Advanced features: ResNet, skip connections, batch norm layer, pooling layer
- RNNs for NLP: text classification, token classification, seq2seq
Understanding shapes: inputs, embeddings, hidden states
- If you can do this, then understanding transformers is simple
1.4 Outline
- Basic self-attention
- Scaled dot-product attention
- Attention mask
- Multi-head attention
- More layers –> transformer block
- Encoders (e.g. BERT)
- Decoders (e.g. GPT)
- Seq2seq Encoder-Decoders
- Specific models: BERT, GPT, GPT-2, GPT-3
2. Basic Self-Attention
- First review how attention works in seq2seq RNNs (very briefly)
- See beginner’s corner for more detail
- The main thing we’re trying to compute with the attention in RNN is the context vector
- Context vector is the input to the Decoder RNN
- Context vector is weighted sum of the hidden states computed on the input sequence
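- c(t) = sum over t’ of alpha(t,t’) x h(t’), where alpha(t,t’) are the attention weights and h(t’) are the hidden states of the input sequence
- alpha(t,t’) = softmax over t’ of score(t,t’), with score(t,t’) = ANN(s(t-1), h(t’))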
- In self-attention: sequence of feature vectors (x1, x2, x3, x4, x5) –> sequence of context vectors (A1, A2, A3, A4, A5), each a weighted sum of these feature vectors
- x_i = embedding vector (match each word to a dense embedding vector)
- A_i = context-aware embedding vector
- Attention Weights
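- dot product is like cosine similarity, except that cosine similarity normalizes away the magnitudes of the vectors and the dot product does not
- Pearson correlation is a related measure: cosine similarity of mean-centered vectors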
- dot product in convolution:
- bright spot in the image where there is a match between the filter and the image itself
- feature finders
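The basic self-attention idea above (dot products between embeddings, turned into weights, then used for a weighted sum) can be sketched in plain Python. This is a minimal illustration, not a reference implementation: the helper names `dot`, `softmax`, and `basic_self_attention` and the toy embedding values are assumptions made for the example, and there are no learned parameters yet.

```python
import math

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def basic_self_attention(xs):
    """For each x_i, return A_i = sum_j w_ij * x_j with w_i = softmax_j(x_i . x_j)."""
    outputs = []
    for x_i in xs:
        scores = [dot(x_i, x_j) for x_j in xs]   # similarity of x_i to every x_j
        weights = softmax(scores)                # attention weights for token i
        dim = len(x_i)
        a_i = [sum(w * x_j[d] for w, x_j in zip(weights, xs)) for d in range(dim)]
        outputs.append(a_i)
    return outputs

# Toy embeddings (made-up numbers): 3 tokens, 2 dimensions each
xs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
A = basic_self_attention(xs)
```

Each A_i has the same shape as x_i, so the output sequence has the same length and dimensionality as the input sequence.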
3. Self-Attention & Scaled Dot-Product Attention
- queries, keys and values
- Attention(Q,K,V) = softmax(QK^T / sqrt(d_k)) V
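The formula above can be written out directly. The sketch below assumes Q, K, and V are given as lists of row vectors (one row per token) and applies the softmax separately to each query's row of scores; the function name `scaled_dot_product_attention` and the toy inputs are illustrative choices, not part of any particular library.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(Q, K, V):
    """Compute softmax(Q K^T / sqrt(d_k)) V, one query row at a time."""
    d_k = len(K[0])  # dimensionality of the key vectors
    d_v = len(V[0])  # dimensionality of the value vectors
    output = []
    for q in Q:
        # One scaled score per key: (q . k) / sqrt(d_k)
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        # Weighted sum of the value vectors
        output.append([sum(w * v[j] for w, v in zip(weights, V)) for j in range(d_v)])
    return output

# Toy example (made-up numbers): one query, two key/value pairs
Q = [[0.0, 0.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[2.0, 0.0], [0.0, 4.0]]
result = scaled_dot_product_attention(Q, K, V)
```

With an all-zero query every score is zero, so the weights are uniform and the result is simply the average of the value vectors.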
3.2 Database Inspiration
Examples of key-value stores: Redis, Memcached, DynamoDB
my_dict = {'key1' : 'value1', 'key2' : 'value2'}
Converting x into queries, keys, and values
- query vector .dot(key vectors) = attention scores –> softmax –> attention weights
- attention vector is weighted sum of the value vectors
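One common way to convert each x into a query, key, and value is to multiply it by three separate weight matrices. The sketch below assumes that approach; the matrices `W_q`, `W_k`, `W_v` are hypothetical fixed values chosen for readability (in a trained model they would be learned), and the scaling by sqrt(d_k) is left out to follow the steps exactly as listed above.

```python
import math

def project(x, W):
    """Multiply row vector x (length d_in) by matrix W (d_in x d_out)."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Toy embeddings (made-up numbers): 3 tokens, 2 dimensions each
xs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Hypothetical projection matrices; learned in a real model
W_q = [[1.0, 0.0], [0.0, 1.0]]
W_k = [[0.5, 0.0], [0.0, 0.5]]
W_v = [[0.0, 1.0], [1.0, 0.0]]

queries = [project(x, W_q) for x in xs]
keys = [project(x, W_k) for x in xs]
values = [project(x, W_v) for x in xs]

# Attention for the first token: scores -> softmax -> weighted sum of values
q = queries[0]
scores = [sum(a * b for a, b in zip(q, k)) for k in keys]
weights = softmax(scores)
attention_vector = [
    sum(w * v[d] for w, v in zip(weights, values)) for d in range(len(values[0]))
]
```

Note that the same x produces all three vectors; only the projection differs, which is what lets a word act as a query in one comparison and as a key in another.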
3.3 Understanding match between query vector and key vector in the vector space
- vector space where all key vectors live: one key vector for each word in the input sentence
- same word can be both the query and the key
- match is the key vector which yields the largest dot product with query vector
- key_vector.dot(query_vector) = attention score –> softmax –> attention weight
- attention_weight * value_vector (of the match)
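To make the "match" idea concrete, the sketch below compares an exact dictionary lookup with the soft lookup that attention performs. The key and value vectors, the query, and the names `key_vectors`, `value_vectors`, and `best_match` are all made up for illustration.

```python
import math

# Hypothetical key and value vectors, stored like a small key-value database
key_vectors = {"key1": [1.0, 0.0], "key2": [0.0, 1.0]}
value_vectors = {"key1": [10.0, 0.0], "key2": [0.0, 10.0]}

query = [2.0, 0.5]  # made-up query vector

# Attention score for each key: dot product with the query
scores = {
    name: sum(q * k for q, k in zip(query, key))
    for name, key in key_vectors.items()
}

# The "match" is the key with the largest dot product
best_match = max(scores, key=scores.get)

# Softmax turns the scores into attention weights that sum to 1
m = max(scores.values())
exps = {name: math.exp(s - m) for name, s in scores.items()}
total = sum(exps.values())
weights = {name: e / total for name, e in exps.items()}

# Soft lookup: every value contributes, but the match contributes the most
output = [
    sum(weights[name] * value_vectors[name][d] for name in key_vectors)
    for d in range(2)
]
```

Unlike `my_dict['key1']`, which returns exactly one value, the soft lookup blends all values and weights the best-matching key most heavily.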
3.4 Why self-attention?
- Different from seq2seq RNNs, where for each output token we wanted to know which input token to pay attention to; in self-attention, each token attends to the tokens of its own sequence
- Example why it’s not weird: word sense disambiguation
- Ambiguous: “check” by itself
- Disambiguate by looking at context (“bank”, “cashed”)