1. Transformers Theory
1.1 Earlier
- Use existing libraries
- Just a few function calls
- But no understanding of how transformers work
1.2 Section Outline
- It’s all theoretical / conceptual
- Next section will look at implementations
- This section already has everything you need to implement
- Code agnostic: use TensorFlow, PyTorch, etc.
1.3 Preparation
- In one sentence: understand CNNs and RNNs
- Convolution / finding features / dot products
- Stacking convolutional filters to make a convolution layer
- Advanced features: ResNet, skip connections, batch norm layer, pooling layer
- RNNs for NLP: text classification, token classification, seq2seq
- Understanding shapes: inputs, embeddings, hidden states
- If you can do this, then understanding transformers is simple
1.4 Outline
- Basic self-attention
- Scaled dot-product attention
- Attention mask
- Multi-head attention
- More layers –> transformer block
- Encoders (e.g. BERT)
- Decoders (e.g. GPT)
- Seq2seq Encoder-Decoders
- Specific models: BERT, GPT, GPT-2, GPT-3
2. Basic Self-Attention
- First review how attention works in seq2seq RNNs (very briefly)
- See beginner’s corner for more detail
2.1 Attention in Seq2Seq RNNs
- The main thing we're trying to compute with attention in a seq2seq RNN is the context vector
- Context vector is the input to the Decoder RNN
- Context vector is weighted sum of the hidden states computed on the input sequence
- c(t) = sum over t' of α(t, t') * h(t')  (weighted sum of the encoder hidden states, weighted by the attention weights)
- score(t, t') = ANN(s(t-1), h(t')), then a softmax over t' turns the scores into the attention weights α(t, t') (see the sketch after this list)
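A minimal NumPy sketch of this computation (not from the lecture; the shapes and the tiny one-hidden-layer scoring network are illustrative stand-ins for the ANN above):

```python
import numpy as np

def softmax(x):
    # numerically stable softmax over the last axis
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Illustrative shapes: Tp input steps, hidden size d
Tp, d = 5, 8
h = np.random.randn(Tp, d)       # encoder hidden states h(t') for the input sequence
s_prev = np.random.randn(d)      # previous decoder state s(t-1)

# score(t, t') = ANN(s(t-1), h(t')) -- a tiny one-hidden-layer net as a stand-in
W1 = np.random.randn(2 * d, d)
w2 = np.random.randn(d)
scores = np.array([np.tanh(np.concatenate([s_prev, h_t]) @ W1) @ w2 for h_t in h])

alpha = softmax(scores)          # attention weights alpha(t, t'), sum to 1
c_t = alpha @ h                  # context vector c(t) = sum_t' alpha(t, t') h(t')
print(c_t.shape)                 # (8,) -- same size as one hidden state
```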
2.2 From Embeddings to Context-Aware Embeddings
- sequence of feature vectors (x1, x2, x3, x4, x5) –> context-aware vectors (A1, A2, A3, A4, A5), each a weighted sum of these feature vectors (sketched after this list)
- x_i = embedding vector (match each word to a dense embedding vector)
- A_i = context-aware embedding vector
- Attention Weights
- the dot product is like cosine similarity, except that cosine similarity normalizes away the magnitudes of the vectors
- Pearson correlation (another related measure of similarity)
- dot product in convolution: a bright spot appears in the output where there is a match between the filter and the image
- feature finders
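A minimal NumPy sketch of basic self-attention as described above (shapes are illustrative): each embedding is dotted with every other embedding, the scores are softmaxed into weights, and each output A_i is a weighted sum of all the x_j:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Illustrative: 5 tokens (x1..x5), each a dense embedding of size 4
T, d = 5, 4
X = np.random.randn(T, d)        # embedding vectors x_i, one row per token

scores = X @ X.T                 # pairwise dot products: how well each pair matches
weights = softmax(scores)        # attention weights, each row sums to 1
A = weights @ X                  # A_i = weighted sum of all x_j -> context-aware embeddings
print(A.shape)                   # (5, 4): one context-aware vector per token
```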
3. Self-Attention & Scaled Dot-Product Attention
3.1 Queries, Keys, and Values
- queries, keys and values
- Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V
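A minimal NumPy sketch of this formula (shapes are illustrative; Q, K, V are just random placeholders here):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)   # scaling keeps the dot products from growing with d_k
    weights = softmax(scores)         # one row of attention weights per query
    return weights @ V                # weighted sum of the value vectors

# Illustrative shapes: 5 tokens, d_k = d_v = 4
Q = np.random.randn(5, 4)
K = np.random.randn(5, 4)
V = np.random.randn(5, 4)
print(scaled_dot_product_attention(Q, K, V).shape)   # (5, 4)
```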
3.2 Database Inspiration
example key-value stores: Redis, Memcached, DynamoDB
my_dict = {'key1' : 'value1', 'key2' : 'value2'}
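To make the analogy concrete, here is a hedged sketch (all numbers are made up) contrasting the "hard" dictionary lookup above with the "soft" lookup that attention performs: every key is compared against the query, and the result is a blend of all the values:

```python
import numpy as np

# Hard lookup: an exact key match returns exactly one value
my_dict = {'key1': 'value1', 'key2': 'value2'}
print(my_dict['key1'])            # 'value1'

# Soft lookup (what attention does): compare the query against every key,
# turn the similarities into weights, and return a weighted blend of ALL values.
def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

query  = np.array([1.0, 0.0])
keys   = np.array([[0.9, 0.1],    # similar to the query
                   [0.0, 1.0]])   # not similar
values = np.array([[10.0, 0.0],
                   [0.0, 10.0]])

weights = softmax(keys @ query)   # similarity scores -> weights that sum to 1
print(weights @ values)           # mostly the first value, a little of the second
```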
Converting x into queries, keys, and values
- query_vector.dot(key_vectors) = attention scores –> softmax –> attention weights
- the attention output vector is a weighted sum of the value vectors (sketched below)
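A minimal sketch of converting x into queries, keys, and values (the projection matrices are random stand-ins here; in a real model they are learned), followed by the attention computation from 3.1:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# 5 token embeddings of size d_model; projection sizes are illustrative
T, d_model, d_k, d_v = 5, 8, 4, 4
X = np.random.randn(T, d_model)

# Projection matrices (random stand-ins; trained in practice)
W_q = np.random.randn(d_model, d_k)
W_k = np.random.randn(d_model, d_k)
W_v = np.random.randn(d_model, d_v)

Q, K, V = X @ W_q, X @ W_k, X @ W_v   # the same X becomes queries, keys, and values

scores = Q @ K.T / np.sqrt(d_k)       # attention scores
weights = softmax(scores)             # attention weights
A = weights @ V                       # each output is a weighted sum of value vectors
print(A.shape)                        # (5, 4)
```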
3.3 Understanding the match between the query vector and the key vectors in vector space
- vector space where all the key vectors live: one key vector for each word in the input sentence
- same word can be both the query and the key
- the match is the key vector that yields the largest dot product with the query vector
- key_vector.dot(query_vector) = attention score –> softmax –> attention weight
- attention_weight * value_vector, summed over all words (the value vector of the match gets most of the weight; see the sketch below)
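A small sketch of this "match" idea with made-up vectors: one query is scored against every key, and the key with the largest dot product dominates after the softmax:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# One query vector and the key/value vectors for each word in the input sentence
query = np.array([0.0, 1.0, 0.0])
keys = np.array([[1.0, 0.1, 0.0],    # word 1
                 [0.0, 2.0, 0.0],    # word 2: points in the same direction as the query
                 [0.0, 0.0, 1.0]])   # word 3
values = np.random.randn(3, 4)

scores = keys @ query                # attention scores (dot products)
weights = softmax(scores)            # attention weights
print(scores.argmax())               # 1 -> word 2 is the best match
print(weights)                       # word 2 gets most of the weight
output = weights @ values            # weighted sum of value vectors, dominated by the match
```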
3.4 Why self-attention?
- Different from seq2seq RNNs: there, for each output token we wanted to know which input token to pay attention to
- Example why it’s not weird: word sense disambiguation
- Ambiguous: “check” by itself
- Disambiguate by looking at context (“bank”, “cashed”)