1. Welcome
“If you can’t implement it, then you don’t understand it.” “What I cannot create, I do not understand.”
1.1 Introduction
- What is this course about?
- Transformers : State-of-the-art NLP model
- But Transformers have also made significant contributions to computer vision and computational biology
- They’re the best models for:
- Translation
- Question answering
- Generating human-level text
- NLP - ChatGPT
- Computer Vision - DALL-E2
- Molecular Biology - DeepMind’s AlphaFold 2
1.2 Who should take this course?
- Beginners - Apply API
- Intermediate - Fine Tune Parameters
- Advanced - Build your own transformer
Note:
- Transformers aren’t one thing
- Multiple kinds of Transformers
- Like BERT & GPT
- Which to choose for your task?
1.3 Outline
- Beginners - Apply API
- State-of-the-art model in just 1 or 2 lines of code
- Practical tasks:
- Generate text
- Sentiment analysis
- NLP : Named entity recognition, text summarization, neural machine translation
- masked language model (article spinner) - a black hat technique
- question answering
- zero-shot classification - no training required; provide the text and a set of possible labels
- Intermediate - Fine Tune Parameters
- transfer learning
- massive amount of data, millions of $$$ of training time
- text classification - spam detection, sentiment analysis
- classifying each word - named entity recognition (NER), parts-of-speech tagging
- complex applications - machine translation, question answering
- Advanced - Build your own transformer
- how transformers actually work
- how self-attention mechanism works
- multi-head attention
- encoder-decoder
- BERT & GPT (excel at different things)
- built from scratch using TensorFlow and PyTorch
2. Getting Set Up
3. Beginner’s Corner
3.1 Beginner’s Corner Section Introduction
- Hugging Face library - powerful models through a simple & universal interface / API
- NLP preprocessing (tokenization, converting tokens into integers, converting integers into embedding vectors)
3.1.1 Section Outline
- How we get from RNNs to Transformers
- Each application will have 2 parts: 1) what we are doing 2) Python demo
- No training required! Pretrained models work on any text
- Why? “Language is language”, it’s universal!
- Tasks:
- Sentiment Analysis - NOT interesting
- Embeddings and nearest neighbor search - pre-train a neural network and take the embedding vector from the second-to-last layer
- Text Generation (autoregressive, similar to Markov models)
- Masked Language Modeling (i.e. “article spinning”, bidirectional)
- Named Entity Recognition (many-to-many)
- Text Summarization (sequence-to-sequence)
- Neural Machine Translation (also useful for building intuition for attention)
- Question Answering
- input = (question, context), answer = selection from context
- impressively, transformers can parse this input
- Zero-Shot Classification
- classify text given an arbitrary set of labels
3.2 From RNNs to Attention and Transformers - Intuition
3.2.1 A Brief History of Attention and Transformers
- How we got from RNNs (recurrent neural networks) to transformers
- Optional - not needed for code
- Useful for those who understand deep learning already, and plan to learn more about transformers in-depth later
3.2.2 Main Points
- 1) The attention mechanism allows neural networks to learn very long-range dependencies in sequences
- a) Longer range than LSTMs, a type of RNN
- b) Attention was created for RNNs, but transformers use attention only, doing away with the recurrent part
- 2) Transformers are big and slow
- a) But computations can be done in parallel (unlike RNNs!)
3.2.3 Many Tasks
- 1) many-to-one
- spam detection
- 2) many-to-many
- parts of speech tagging
- 3) language translation
- Problem 1) Input sequence length != Target sequence length
- Problem 2) Each output (y_hat(t)) depends only on h(t)
- Seq2Seq (sequence-to-sequence) : Encoder Decoder RNN Model
- Embedding Vector –> Encoder RNN unit –> Thought Vector –> Decoder RNN unit
- Encoder RNN Unit : Convert sequence of words into a thought vector
- Decoder RNN Unit : Convert thought vector into an output
- Attention in Seq2Seq
- Attention : for each OUTPUT token, we want to know which INPUT token(s) to pay attention to
3.2.4 Attention Is All You Need
- Keeps attention, gets rid of the RNN
- RNNs are slow, since every output must be computed sequentially
- Cannot be parallelized
- Vanishing gradients - LSTMs and GRUs are meant to mitigate this problem, but they only work up to a point
- With attention, even for very long sequences, every input is connected to every output (and can be computed in parallel)
- Downside: for input / output length N, we have O(N²) weights
- Transformer: stack of attention layers (many details omitted)
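A minimal NumPy sketch of scaled dot-product self-attention (my own illustration under simplifying assumptions, not the full transformer): it shows the N x N score matrix behind the O(N²) cost, and that all positions are processed at once rather than sequentially.
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # Q, K, V: arrays of shape (N, d_k), one row per token
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # (N, N): every position scores every other position
    scores -= scores.max(axis=-1, keepdims=True)    # for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over each row
    return weights @ V                              # weighted sum of the value vectors

# Toy example: N = 4 tokens with d_k = 8 dimensional representations
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
out = scaled_dot_product_attention(x, x, x)  # self-attention: Q, K, V all come from the same sequence
print(out.shape)                             # (4, 8), computed for all positions in one shot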
3.3 Sentiment Analysis
- Part 1) Review of what sentiment analysis is
- Part 2) How to perform sentiment analysis with Hugging Face Transformers (just a few lines of code)
- Sentiment Analysis :
- Positive, Negative
- Positive, Negative, Neutral
- Positive, Very Positive, Negative, Very Negative, Neutral
- Usefulness of Sentiment Analysis
- How is sentiment analysis used to make money?
- Reputation management
- Report sentiment statistics on Twitter of competitors and yourself
- Customer support sentiment
- Stock price prediction
- Sequential models (CNNs, RNNs) can help
- Recursive neural networks (trees) can help
- Transformers / attention can help
from transformers import pipeline
# Create your pipeline (includes tokenization, etc.)
classifier = pipeline("sentiment-analysis")
# No need to convert the input into a PyTorch tensor, NumPy array,
# TensorFlow tensor, etc.
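Calling the pipeline on a string (or a list of strings) returns a list of dicts with 'label' and 'score' keys; the exact score below is just illustrative.
result = classifier("This movie was surprisingly good!")
print(result)
# one dict per input, e.g. [{'label': 'POSITIVE', 'score': 0.99...}]

# A list of strings also works, giving one result per string
results = classifier(["I loved it", "Worst purchase I ever made"])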
3.4 Text Generation
- More intuitive to start with time series - we want to predict the future
- Stock price
- Demand / Sales
- COVID cases (hospital admissions)
- Autoregressive Time Series Models
- ARIMA is a linear version of this
- This “structure” can be applied to any model
- e.g. - random forest, RNNs (LSTM), transformers
- Autoregressive Language Models
- Language is a “time series” (i.e. a sequence) of categorical objects
- An autoregressive language model is one where we find the conditional distribution of the next word given past words
- p(x(t+1) | x(t), x(t-1), x(t-2), …)
- Markov chain models
- Markov assumption: x(t+1) depends only on x(t)
- Convenient, but very strong assumption
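A toy illustration of the Markov assumption (my own sketch, not course code): a bigram model where the next word is sampled given only the current word.
import random
from collections import defaultdict

corpus = "the cat sat on the mat and the cat slept".split()

# Count which words follow which: an empirical p(x(t+1) | x(t))
transitions = defaultdict(list)
for current_word, next_word in zip(corpus, corpus[1:]):
    transitions[current_word].append(next_word)

# Generate by repeatedly sampling the next word given only the current word
word = "the"
generated = [word]
for _ in range(5):
    followers = transitions.get(word)
    if not followers:                    # dead end: word never appears mid-sequence
        break
    word = random.choice(followers)      # Markov assumption: only x(t) matters
    generated.append(word)

print(" ".join(generated))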
- Uses of Autoregressive Language Models
- We’ve used them to generate poetry
- But difficult to generate coherent text (even today)
- To this day, it remains one of the most important tasks in NLP
- Used to train the largest / most popular model today : GPT-3
- Even simpler language models have already been used in industry:
- Predictive text / text completion
- Use cases: help writing emails / creative writing
- Use cases: Github Copilot (can generate working code from text prompt)
- Build full website designs (actual code), compose music, medical queries
- Transformers (attention-mechanism) have been a key technology (long-range dependencies)
- Interesting Thought:
- Recall: OpenAI was hesitant to share their pretrained GPT models
- One interesting “use-case”; unethical website owners / marketers may use these models to fill their websites with incoherent, machine-generated text
- But language models are trained from text on the Internet
- This makes a loop (will we require a new training objective in order to “improve”?)
from transformers import pipeline
gen = pipeline("text-generation") # uses GPT-2
prompt = "Neural networks with attention have been used with great success"
gen(prompt)
# Generate multiple possible continuations
gen(prompt, num_return_sequences=3)
# Control length of continuation
gen(prompt, max_length=30)
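Continuing the snippet above: the pipeline returns a list of dicts with a 'generated_text' key, one per returned sequence.
out = gen(prompt, max_length=30)
print(out[0]['generated_text'])  # the prompt followed by the model's continuation

# With num_return_sequences=3, you get one dict per candidate continuation
for o in gen(prompt, num_return_sequences=3):
    print(o['generated_text'])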
Recap
1) Import the pipeline function
2) Load up a pretrained model
3) Use the pretrained model (simply pass in a string / list of strings)
3.5 Masked Language Modeling (Article Spinner)
- depends on past as well as future
- p(x(t) | x(t-1), x(t-2), …, x(t+1), x(t+2), …)
3.5.1 Transformers Are Trained As Language Models
- Different transformers use different types of pretraining
- Autoregressive language modeling is used by the GPT family
- E.g. - GPT-2 is default for text generation pipeline
- Masked language modeling is used by BERT
- BERT = “bidirectional encoder representations from transformers”
- Beginners normally have a tough time grasping unsupervised learning
3.5.2 Why is Unsupervised Learning Difficult to Understand?
- Mostly due to lack of patience, “quick fix” mentality
- Supervised learning delivers a quick fix
- For advanced applications, it requires open-mindedness and creativity
- Be the type of data scientist that builds things no one has ever seen before!
3.5.3 Application : Article Spinning
- Widely used by black hat marketers and SEOs
- Helps if you have experience building a website / blog / online business
- SEO = techniques to improve search engine rankings
- Example: create content with keywords that match users’ queries
- An unethical person might copy (a.k.a. steal) articles written by others
- Article spinning idea: change enough words (while keeping the article coherent) such that it doesn’t match the original
3.5.4 Code Preparation
from transformers import pipeline
mlm = pipeline("fill-mask")
mlm("The cat <mask> over the box")
3.5.5 Autoencoding Language Model
- Helpful (but not necessary) if you’ve seen unsupervised deep learning
- Autoencoders are neural nets that try to produce their input
- Applications : recommender systems, pretraining
- one variation: denoising autoencoder
- input is corrupted image, and output is a restoration (trying to make it close to the original)
- image is corrupted by “noise” (can be Gaussian, setting pixels to zero, …)
- The “mask” in our language model is a corruption too!
3.6 Named Entity Recognition (NER)
- Named entity recognition (NER) allows us to identify (i.e. tag) all the people, places, and companies in a document
- Example - Steve Jobs was the CEO of Apple, headquartered in the state of California
- Steve Jobs - Person
- Apple - Organization
- California - Location
- How does it work?
- Exactly the same as parts-of-speech tagging! (many-to-many)
Steve : B-PER
Jobs : I-PER
Apple : B-ORG
Silicon : B-LOC
Valley : I-LOC
NER in Python
from transformers import pipeline
ner = pipeline("ner", aggregation_strategy='simple', device=0)
ner("Steve Jobs was the CEO of Apple, headquartered in California.")
OUTPUT:
- entity : 'PER', word : 'Steve Jobs'
- entity : 'ORG', word : 'Apple'
- entity : 'LOC', word : 'California'
3.7 Text Summarization
- Why should we summarize text?
- We already do this all the time!
- Scientific paper abstracts
- Executive summaries in professional documents
- It can be useful in our own lives
3.7.1 Why Text Summarization Helps AI
- Paraphrase / Summarize
- Summarization is a way for learning systems to demonstrate understanding of a concept
3.7.2 Two Types of Summarization
- Extractive vs. Abstractive
- Extractive summaries consist of text taken from the original document
- Abstractive summaries can contain novel sequences of text not necessarily taken from the input
from transformers import pipeline

summarizer = pipeline("summarization")
summarizer(my_long_text)
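Continuing the snippet above: summary length can be controlled with the usual generation arguments, and the result is a list of dicts with a 'summary_text' key (my_long_text is a placeholder string).
result = summarizer(my_long_text, max_length=130, min_length=30)
print(result[0]['summary_text'])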
3.8 Neural Machine Translation
- Convert phrases from one language to another
- Why is it useful? Communication, Internet, books, TV shows, online courses
- Sequence-to-sequence task
- Text to summary
- Neural translation
from transformers import pipeline

# English to Spanish
translator = pipeline('translation', model='Helsinki-NLP/opus-mt-en-es')
translator("I like eggs and ham")
3.8.1 Translation Evaluation
- Many valid translations
- BLEU score is the most popular metric
- The best correlated with human judgement
- Prediction is compared with multiple reference texts
- Is a value between 0 and 1
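A quick sketch of computing BLEU against multiple references, assuming NLTK is installed (bigram BLEU here so the toy example stays meaningful); this is my own illustration, not course code.
from nltk.translate.bleu_score import sentence_bleu

references = [
    "the cat is on the mat".split(),
    "there is a cat on the mat".split(),
]
candidate = "the cat sat on the mat".split()

# Compared against multiple reference translations; the score is between 0 and 1
print(sentence_bleu(references, candidate, weights=(0.5, 0.5)))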
3.9 Question Answering
- The ultimate goal would be to have an AI that can answer any question
- Such a system could replace doctors, teachers, etc. in some instances
- SQuAD [Stanford Question Answering Dataset]
- From Stanford’s famous NLP department (Chris Manning, Dan Jurafsky)
- It is an extractive question answering dataset
- The answer is contained in the input, and the model simply “extracts” the portion which makes up the answer
- [CLS] question tokens [SEP] context tokens
from transformers import pipeline
qa = pipeline("question-answering")
ctx = "Today, I made a peanut butter sandwich"
q = "What did I put in my sandwich?"
qa(context=ctx, question=q)
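Continuing the snippet above: the result is a dict with the extracted answer span, its score, and its character positions in the context (values depend on the model).
result = qa(context=ctx, question=q)
print(result)
# a dict like {'score': ..., 'start': ..., 'end': ..., 'answer': 'peanut butter'}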
3.10 Zero-Shot Classification
- Classification without labels
- NLP Examples:
- Text: Wikipedia page on Albert Einstein
- Classify the document : Classes[scientist, painter]
- Text: Wikipedia page on mitochondria
- Classify the document : Classes[biology, math, psychology]
- Now consider: how would you build a model that can do this?
- Architecture is not like a typical softmax neural network
- In that case, we have one final dense layer with # outputs = # classes
- Zero-shot model cannot work this way because it must use whatever classes you give it
from transformers import pipeline
clf = pipeline("zero-shot-classification", device=0)
clf("This is a great movie", candidate_labels=["positive", "negative"])
3.11 Beginner’s Corner Section Summary
- Sentiment Analysis
- Embeddings and nearest neighbor search
- Text Generation
- Masked Language Modeling
- Named Entity Recognition
- Text Summarization
- Neural Machine Translation
- Question Answering
- Zero-Shot Classification