1. Welcome
“If you can’t implement it, then you don’t understand it.” “What I cannot create, I do not understand.”
1.1 Introduction
- What is this course about?
- Transformers : State-of-the-art NLP model
- But Transformers have also made significant contributions to computer vision and computational biology
- They’re the best models for:
- Translation
- Question answering
- Generating human-level text
- NLP - ChatGPT
- Computer Vision - DALL-E2
- Molecular Biology - DeepMind’s AlphaFold 2
1.2 Who should take this course?
- Beginners - Apply API
- Intermediate - Fine Tune Parameters
- Advanced - Build your own transformer
Note:
- Transformers aren’t one thing
- Multiple kinds of Transformers
- Like BERT & GPT
- Which to choose for your task?
1.3 Outline
- Beginners - Apply API
- State-of-the-art model in just 1 or 2 lines of code
- Practical tasks:
- Generate text
- Sentiment analysis
- NLP : Named entity recognition, text summarization, neural machine translation
- masked language model (article spinner) - a black hat technique
- question answering
- zero-shot classification - no training required; provide the text and a set of possible labels
- Intermediate - Fine Tune Parameters
- transfer learning
- massive amount of data, millions of $$$ of training time
- text classification - spam detection, sentiment analysis
- classifying each word - named entity recognition (NER), parts-of-speech tagging
- complex applications - machine translation, question answering
- Advanced - Build your own transformer
- how transformers actually work
- how self-attention mechanism works
- multi-head attention
- encoder-decoder
- BERT & GPT (excel at different things)
- built from scratch using TensorFlow and PyTorch
2. Getting Set Up
3. Beginner’s Corner
3.1 Beginner’s Corner Section Introduction
- Hugging Face library - powerful models through a simple & universal interface / API
- NLP preprocessing (tokenization, converting tokens into integers, converting integers into embedding vectors)
3.1.1 Section Outline
- How we get from RNNs to Transformers
- Each application will have 2 parts: 1) what we are doing 2) Python demo
- No training required! Pretrained models work on any text
- Why? “Language is language”, it’s universal!
- Tasks:
- Sentiment Analysis - NOT interesting
- Embeddings and nearest neighbor search - pre-train a neural network and take the embedding vector from the second-to-last layer
- Text Generation (autoregressive, similar to Markov models)
- Masked Language Modeling (i.e. “article spinning”, bidirectional)
- Named Entity Recognition (many-to-many)
- Text Summarization (sequence-to-sequence)
- Neural Machine Translation (also useful for building intuition for attention)
- Question Answering
- input = (question, context), answer = selection from context
- impressively, transformers can parse this input
- Zero-Shot Classification
- classify text given an arbitrary set of labels
3.2 From RNNs to Attention and Transformers - Intuition
3.2.1 A Brief History of Attention and Transformers
- How we got from RNNs (recurrent neural networks) to transformers
- Optional - not needed for code
- Useful for those who understand deep learning already, and plan to learn more about transformers in-depth later
3.2.2 Main Points
- 1) The attention mechanism allows neural networks to learn very long-range dependencies in sequences
- a) Longer range than LSTMs, a type of RNN
- b) Attention was created for RNNs, but transformers use attention only, doing away with the recurrent part
- 2) Transformers are big and slow
- a) But computations can be done in parallel (unlike RNNs!)
3.2.3 Many Tasks
- 1) many-to-one
- spam detection
- 2) many-to-many
- parts of speech tagging
- 3) language translation
- Problem 1) Input sequence length != Target sequence length
- Problem 2) Each output (y_hat(t)) depends only on h(t)
- Seq2Seq (sequence-to-sequence) : Encoder Decoder RNN Model
- Embedding Vector –> Encoder RNN unit –> Thought Vector –> Decoder RNN unit
- Encoder RNN Unit : Convert sequence of words into a thought vector
- Decoder RNN Unit : Convert thought vector into an output
- Attention in Seq2Seq
- Attention : for each OUTPUT token, we want to know which INPUT token(s) to pay attention to
3.2.4 Attention Is All You Need
- Keeps attention, gets rid of the RNN
- RNNs are slow, since every output must be computed sequentially
- Cannot be parallelized
- Vanishing gradients - LSTMs and GRUs are meant to mitigate this problem, but they only work up to a point
- With attention, even for very long sequences, every input is connected to every output (and can be computed in parallel)
- Downside: for input / output length N, we have O(N²) weights
- Transformer: stack of attention layers (many details omitted)
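A minimal NumPy sketch of scaled dot-product self-attention (my own illustration under simplifying assumptions, not the full transformer): it shows the N x N score matrix behind the O(N²) cost, and that all positions are processed at once rather than sequentially.
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # Q, K, V: arrays of shape (N, d_k), one row per token
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # (N, N): every position scores every other position
    scores -= scores.max(axis=-1, keepdims=True)    # for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over each row
    return weights @ V                              # weighted sum of the value vectors

# Toy example: N = 4 tokens with d_k = 8 dimensional representations
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
out = scaled_dot_product_attention(x, x, x)  # self-attention: Q, K, V all come from the same sequence
print(out.shape)                             # (4, 8), computed for all positions in one shot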
3.3 Sentiment Analysis
- Part 1) Review of what sentiment analysis is
- Part 2) How to perform sentiment analysis with Hugging Face Transformers (just a few lines of code)
- Sentiment Analysis :
- Positive, Negative
- Positive, Negative, Neutral
- Positive, Very Positive, Negative, Very Negative, Neutral
- Usefulness of Sentiment Analysis
- How is sentiment analysis used to make money?
- Reputation management
- Report sentiment statistics on Twitter of competitors and yourself
- Customer support sentiment
- Stock price prediction
- Sequential models (CNNs, RNNs) can help
- Recursive neural networks (trees) can help
- Transformers / attention can help
from transformers import pipeline
# Create your pipeline (includes tokenization, etc.)
classifier = pipeline("sentiment-analysis")
# No need to convert the input into a PyTorch tensor, NumPy array,
# TensorFlow tensor, etc.
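Calling the pipeline on a string (or a list of strings) returns a list of dicts with 'label' and 'score' keys; the exact score below is just illustrative.
result = classifier("This movie was surprisingly good!")
print(result)
# one dict per input, e.g. [{'label': 'POSITIVE', 'score': 0.99...}]

# A list of strings also works, giving one result per string
results = classifier(["I loved it", "Worst purchase I ever made"])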
3.4 Text Generation
- More intuitive to start with time series - we want to predict the future
- Stock price
- Demand / Sales
- COVID cases (hospital admissions)
- Autoregressive Time Series Models
- ARIMA is a linear version of this
- This “structure” can be applied to any model
- e.g. - random forest, RNNs (LSTM), transformers
- Autoregressive Language Models
- Language is a “time series” (i.e. a sequence) of categorical objects
- An autoregressive language model is one where we find the conditional distribution of the next word given past words
- p(x(t+1) | x(t), x(t-1), x(t-2), …)
- Markov chain models
- Markov assumption: x(t+1) depends only on x(t)
- Convenient, but very strong assumption
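A toy illustration of the Markov assumption (my own sketch, not course code): a bigram model where the next word is sampled given only the current word.
import random
from collections import defaultdict

corpus = "the cat sat on the mat and the cat slept".split()

# Count which words follow which: an empirical p(x(t+1) | x(t))
transitions = defaultdict(list)
for current_word, next_word in zip(corpus, corpus[1:]):
    transitions[current_word].append(next_word)

# Generate by repeatedly sampling the next word given only the current word
word = "the"
generated = [word]
for _ in range(5):
    followers = transitions.get(word)
    if not followers:                    # dead end: word never appears mid-sequence
        break
    word = random.choice(followers)      # Markov assumption: only x(t) matters
    generated.append(word)

print(" ".join(generated))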
- Uses of Autoregressive Language Models
- We’ve used them to generate poetry
- But difficult to generate coherent text (even today)
- To this day, it remains one of the most important tasks in NLP
- Used to train the largest / most popular model today : GPT-3
- Even simpler language models have already been used in industry:
- Predictive text / text completion
- Use cases: help writing emails / creative writing
- Use cases: Github Copilot (can generate working code from text prompt)
- Build full website designs (actual code), compose music, medical queries
- Transformers (attention-mechanism) have been a key technology (long-range dependencies)
- Interesting Thought:
- Recall: OpenAI was hesitant to share their pretrained GPT models
- One interesting “use-case”; unethical website owners / marketers may use these models to fill their websites with incoherent, machine-generated text
- But language models are trained from text on the Internet
- This makes a loop (will we require a new training objective in order to “improve”?)
from transformers import pipeline
gen = pipeline("text-generation") # uses GPT-2
prompt = "Neural networks with attention have been used with great success"
gen(prompt)
# Generate multiple possible continuations
gen(prompt, num_return_sequences=3)
# Control length of continuation
gen(prompt, max_length=30)
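Continuing the snippet above: the pipeline returns a list of dicts with a 'generated_text' key, one per returned sequence.
out = gen(prompt, max_length=30)
print(out[0]['generated_text'])  # the prompt followed by the model's continuation

# With num_return_sequences=3, you get one dict per candidate continuation
for o in gen(prompt, num_return_sequences=3):
    print(o['generated_text'])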
Recap
1) Import the pipeline function
2) Load up a pretrained model
3) Use the pretrained model (simply pass in a string / list of strings)
3.5 Masked Language Modeling (Article Spinner)
- depends on past as well as future
- p(x(t) | x(t-1), x(t-2), …, x(t+1), x(t+2), …)
3.5.1 Transformers Are Trained As Language Models
- Different transformers use different types of pretraining
- Autoregressive language modeling is used by the GPT family
- E.g. - GPT-2 is default for text generation pipeline
- Masked language modeling is used by BERT
- BERT = “bidirectional encoder representations from transformers”
- Beginners normally have a tough time grasping unsupervised learning
3.5.2 Why is Unsupervised Learning Difficult to Understand?
- Mostly due to lack of patience, “quick fix” mentality
- Supervised learning delivers a quick fix
- For advanced applications, it requires open-mindedness and creativity
- Be the type of data scientist that builds things no one has ever seen before!
3.5.3 Application : Article Spinning
- Widely used by black hat marketers and SEOs
- Helps if you have experience building a website / blog / online business
- SEO = techniques to improve search engine rankings
- Example: create content with keywords that match users’ queries
- An unethical person might copy (a.k.a. steal) articles written by others
- Article spinning idea: change enough words (while keeping the article coherent) such that it doesn’t match the original
3.5.4 Code Preparation
from transformers import pipeline
mlm = pipeline("fill-mask")
mlm("The cat <mask> over the box")
3.5.5 Autoencoding Language Model
- Helpful (but not necessary) if you’ve seen unsupervised deep learning
- Autoencoders are neural nets that try to produce their input
- Applications : recommender systems, pretraining
- one variation: denoising autoencoder
- input is corrupted image, and output is a restoration (trying to make it close to the original)
- image is corrupted by “noise” (can be Gaussian, setting pixels to zero, …)
- The “mask” in our language model is a corruption too!
3.6 Named Entity Recognition (NER)
- Named entity recognition (NER) allows us to identify (i.e. tag) all the people, places, and companies in a document
- Example - Steve Jobs was the CEO of Apple, headquartered in the state of California
- Steve Jobs - Person
- Apple - Organization
- California - Location
- How does it work?
- Exactly the same as parts-of-speech tagging! (many-to-many)
Steve : B-PER
Jobs : I-PER
Apple : B-ORG
Silicon : B-LOC
Valley : I-LOC
NER in Python
from transformers import pipeline
ner = pipeline("ner", aggregation_strategy='simple', device=0)
ner("Steve Jobs was the CEO of Apple, headquartered in California.")
OUTPUT:
- entity : 'PER', word : 'Steve Jobs'
- entity : 'ORG', word : 'Apple'
- entity : 'LOC', word : 'California'
3.7 Text Summarization
- Why should we summarize text?
- We already do this all the time!
- Scientific paper abstracts
- Executive summaries in professional documents
- It can be useful in our own lives
3.7.1 Why Text Summarization Helps AI
- Paraphrase / Summarize
- Summarization is a way for learning systems to demonstrate understanding of a concept
3.7.2 Two Types of Summarization
- Extractive vs. Abstractive
- Extractive summaries consist of text taken from the original document
- Abstractive summaries can contain novel sequences of text not necessarily taken from the input
from transformers import pipeline

summarizer = pipeline("summarization")
summarizer(my_long_text)
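Continuing the snippet above: summary length can be controlled with the usual generation arguments, and the result is a list of dicts with a 'summary_text' key (my_long_text is a placeholder string).
result = summarizer(my_long_text, max_length=130, min_length=30)
print(result[0]['summary_text'])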
3.8 Neural Machine Translation
- Convert phrases from one language to another
- Why is it useful? Communication, Internet, books, TV shows, online courses
- Sequence-to-sequence task
- Text to summary
- Neural translation
from transformers import pipeline

# English to Spanish
translator = pipeline('translation', model='Helsinki-NLP/opus-mt-en-es')
translator("I like eggs and ham")
3.8.1 Translation Evaluation
- Many valid translations
- BLEU score is the most popular metric
- The best correlated with human judgement
- Prediction is compared with multiple reference texts
- Is a value between 0 and 1
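A quick sketch of computing BLEU against multiple references, assuming NLTK is installed (bigram BLEU here so the toy example stays meaningful); this is my own illustration, not course code.
from nltk.translate.bleu_score import sentence_bleu

references = [
    "the cat is on the mat".split(),
    "there is a cat on the mat".split(),
]
candidate = "the cat sat on the mat".split()

# Compared against multiple reference translations; the score is between 0 and 1
print(sentence_bleu(references, candidate, weights=(0.5, 0.5)))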
3.9 Question Answering
- The ultimate goal would be to have an AI that can answer any question
- Such a system could replace doctors, teachers, etc. in some instances
- SQuAD [Stanford Question Answering Dataset]
- From Stanford’s famous NLP department (Chris Manning, Dan Jurafsky)
- It is an extractive question answering dataset
- The answer is contained in the input, and the model simply “extracts” the portion which makes up the answer
- [CLS] question tokens [SEP] context tokens
from transformers import pipeline
qa = pipeline("question-answering")
ctx = "Today, I made a peanut butter sandwich"
q = "What did I put in my sandwich?"
qa(context=ctx, question=q)
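Continuing the snippet above: the result is a dict with the extracted answer span, its score, and its character positions in the context (values depend on the model).
result = qa(context=ctx, question=q)
print(result)
# a dict like {'score': ..., 'start': ..., 'end': ..., 'answer': 'peanut butter'}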
3.10 Zero-Shot Classification
- Classification without labels
- NLP Examples:
- Text: Wikipedia page on Albert Einstein
- Classify the document : Classes[scientist, painter]
- Text: Wikipedia page on mitochondria
- Classify the document : Classes[biology, math, psychology]
- Now consider: how would you build a model that can do this?
- Architecture is not like a typical softmax neural network
- In that case, we have one final dense layer with # outputs = # classes
- Zero-shot model cannot work this way because it must use whatever classes you give it
from transformers import pipeline
clf = pipeline("zero-shot-classification", device=0)
clf("This is a great movie", candidate_labels=["positive", "negative"])
3.11 Beginner’s Corner Section Summary
- Sentiment Analysis
- Embeddings and nearest neighbor search
- Text Generation
- Masked Language Modeling
- Named Entity Recognition
- Text Summarization
- Neural Machine Translation
- Question Answering
- Zero-Shot Classification