1. Transformers Theory

1.1 Earlier

  • Use existing libraries
  • Just a few function calls
  • But no understanding of how transformers work

1.2 Section Outline

  • It’s all theoretical / conceptual
  • Next section will look at implementations
  • This section already has everything you need to implement
  • Code agnostic: use TensorFlow, PyTorch, etc.

1.3 Preparation

  • In one sentence: understand CNNs and RNNs

  • Convolution / finding features / dot products
  • Stacking convolutional filters to make a convolution layer
  • Advanced features: ResNet, skip connections, batch norm layer, pooling layer

  • RNNs for NLP: text classification, token classification, seq2seq
  • Understanding shapes: inputs, embeddings, hidden states (see the shape sketch at the end of this section)

  • If you can do this, then understanding transformers is simple
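
A minimal sketch (NumPy, with made-up sizes) of the shapes mentioned above: integer token IDs, dense embeddings, and per-time-step hidden states. All names and numbers are illustrative.

```python
import numpy as np

# Illustrative sizes only: batch, sequence length, vocab size, embedding dim, hidden dim.
N, T, V, D, M = 8, 20, 10000, 64, 128

token_ids = np.random.randint(0, V, size=(N, T))   # integer inputs: shape (N, T)
embedding_table = np.random.randn(V, D)            # one dense vector per vocabulary word
x = embedding_table[token_ids]                     # embeddings: shape (N, T, D)

# An RNN scans the T axis and produces one hidden state per time step.
h = np.zeros((N, T, M))                            # hidden states: shape (N, T, M)

print(token_ids.shape, x.shape, h.shape)           # (8, 20) (8, 20, 64) (8, 20, 128)
```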

1.4 Outline

  • Basic self-attention
  • Scaled dot-product attention
  • Attention mask
  • Multi-head attention
  • More layers → transformer block
  • Encoders (e.g. BERT)
  • Decoders (e.g. GPT)
  • Seq2seq Encoder-Decoders
  • Specific models: BERT, GPT, GPT-2, GPT-3

2. Basic Self-Attention

  • First review how attention works in seq2seq RNNs (very briefly)
  • See beginner’s corner for more detail

2.1 Review: Attention in Seq2Seq RNNs

  • The main thing we’re trying to compute with attention in a seq2seq RNN is the context vector
  • The context vector is an input to the decoder RNN at each output step
  • The context vector is a weighted sum of the encoder hidden states computed on the input sequence
  • c(t) = Σ_t’ α(t, t’) h(t’): the hidden states weighted by the attention weights
  • score(t, t’) = ANN(s(t-1), h(t’)); the attention weights α(t, t’) come from a softmax of these scores over t’ (a small numerical sketch follows this list)
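
A minimal NumPy sketch of the computation above, with made-up sizes and a random stand-in for the small scoring network ANN; it only illustrates the shapes and the weighted sum, not a trained model.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

T_in, M, A = 5, 8, 16                # input length, hidden size, scoring-net width (illustrative)
h = np.random.randn(T_in, M)         # encoder hidden states h(t') over the input sequence
s_prev = np.random.randn(M)          # previous decoder state s(t-1)

# score(t, t') = ANN(s(t-1), h(t')): here a tiny one-hidden-layer net on [s(t-1); h(t')]
W1 = np.random.randn(2 * M, A)
w2 = np.random.randn(A)
scores = np.array([np.tanh(np.concatenate([s_prev, h[t]]) @ W1) @ w2 for t in range(T_in)])

alpha = softmax(scores)              # attention weights over the input positions t'
c = alpha @ h                        # context vector c(t): weighted sum of the hidden states
print(alpha.shape, c.shape)          # (5,) (8,)
```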

2.2 Context-Aware Embeddings and Dot Products

  • Sequence of feature vectors (x1, x2, x3, x4, x5) → context-aware vectors (A1, A2, A3, A4, A5), where each A_i is a weighted sum of the feature vectors (see the sketch after this list)
  • x_i = embedding vector (each word is mapped to a dense embedding vector)
  • A_i = context-aware embedding vector
  • Attention weights: computed from dot products between the vectors
  • The dot product is like cosine similarity, except that we do not normalize away the magnitudes of the vectors
  • Related: Pearson correlation (a normalized dot product of mean-centered data)
  • Dot product in convolution:
    • bright spot in the image where there is a match between the filter and the image itself
    • filters act as feature finders
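
A minimal sketch of basic self-attention in NumPy: attention weights come straight from dot products between the word embeddings themselves (no learned query/key/value projections yet). Sizes are illustrative.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

T, D = 5, 8                        # 5 words, embedding dimension 8 (illustrative)
x = np.random.randn(T, D)          # x_1 .. x_5: one dense embedding per word

scores = x @ x.T                   # pairwise dot products: how strongly word i matches word j
weights = softmax(scores, axis=1)  # each row sums to 1: attention weights for one output position
A = weights @ x                    # A_1 .. A_5: context-aware embeddings, weighted sums of the x_j
print(A.shape)                     # (5, 8)
```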

3. Self-Attention & Scaled Dot-Product Attention

3.1 The Scaled Dot-Product Attention Formula

  • Queries, keys, and values
  • Attention(Q, K, V) = softmax(QK^T / √d_k) V (a NumPy sketch follows)
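
A sketch of the formula in NumPy. The function name and the random inputs are just for illustration; frameworks like PyTorch and TensorFlow ship their own versions of this operation.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V"""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # (T_q, T_k) attention scores
    weights = softmax(scores, axis=-1)  # each query's weights over the keys sum to 1
    return weights @ V                  # weighted sums of the value vectors

T, d_k, d_v = 5, 16, 16                 # illustrative sizes
Q = np.random.randn(T, d_k)
K = np.random.randn(T, d_k)
V = np.random.randn(T, d_v)
print(scaled_dot_product_attention(Q, K, V).shape)   # (5, 16)
```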

3.2 Database Inspiration

Example key-value stores: Redis, Memcached, DynamoDB

my_dict = {'key1': 'value1', 'key2': 'value2'}

Converting x into queries, keys, and values

  • query_vector.dot(key_vectors) = attention scores → softmax → attention weights
  • The attention output vector is the weighted sum of the value vectors (see the sketch below)
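
A sketch of how the same input embeddings x are turned into queries, keys, and values via three (normally learned) projection matrices; here the weights are random stand-ins and the sizes are made up.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

T, D, d_k = 5, 8, 16                 # 5 words, embedding dim 8, query/key/value dim 16 (illustrative)
x = np.random.randn(T, D)            # input embeddings

# Three projection matrices (learned in a real model, random stand-ins here).
W_q = np.random.randn(D, d_k)
W_k = np.random.randn(D, d_k)
W_v = np.random.randn(D, d_k)

Q, K, V = x @ W_q, x @ W_k, x @ W_v  # queries, keys, values: each of shape (T, d_k)

scores = Q @ K.T / np.sqrt(d_k)      # query . key -> attention scores
weights = softmax(scores, axis=-1)   # softmax -> attention weights
A = weights @ V                      # each output is a weighted sum of the value vectors
print(A.shape)                       # (5, 16)
```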

3.3 Understanding the Match Between Query and Key Vectors in Vector Space

  • Vector space where all the key vectors live: one key vector for each word in the input sentence
  • The same word can serve as both the query and a key
  • The match is the key vector that yields the largest dot product with the query vector
  • key_vector.dot(query_vector) = attention score → softmax → attention weight
  • Output ≈ attention_weight × value_vector of the match (in reality, a weighted sum over all value vectors; see the sketch below)
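
A small sketch contrasting a hard, dictionary-style lookup (take only the value of the best-matching key) with the soft lookup attention actually performs (every value contributes, weighted by softmax). All arrays are random and purely illustrative.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

T, d_k = 5, 16                      # 5 keys/values in a 16-dimensional space (illustrative)
query = np.random.randn(d_k)
keys = np.random.randn(T, d_k)
values = np.random.randn(T, d_k)

scores = keys @ query               # dot product of the query with every key

# Hard lookup: return only the value of the single best-matching key (like a Python dict).
best = np.argmax(scores)
hard_result = values[best]

# Soft lookup: softmax turns the scores into weights, and every value contributes.
weights = softmax(scores)
soft_result = weights @ values

print(best, weights.round(2), hard_result.shape, soft_result.shape)
```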

3.4 Why Self-Attention?

  • Different from seq2seq RNNs, where for each output token we wanted to know which input tokens to pay attention to
  • Example of why this is not strange: word sense disambiguation
  • Ambiguous: “check” by itself
  • Disambiguate by looking at context (“bank”, “cashed”)
