1. Fine-Tuning (Intermediate)
1.1 Fine-Tuning Section Introduction
- Previous section was easy: we only used models that others have trained
- i.e. We only used the model for inference / prediction
- Very simple thanks to pipeline - same interface for all tasks
1.1.1 Section Outline
- Review text preprocessing (convert text into numbers)
- Tokenization, token to integer mapping, padding
- Pipeline (two parts : tokenizer + model)
- Text -> Tokenizer -> Tokenized Data -> Model -> Predictions
Examples:
- Example 1 : sentiment analysis on built-in dataset
- Example 2 : text classification on custom dataset
- Example 3 : text classification with multiple input sentences
1.2 Text Preprocessing Review
The steps:
- Tokenize (string split, split words + punctuation, characters, subwords…)
- Map tokens to integers (e.g. {"i", "like", "cats"} -> {420, 650, 103})
- Padding / truncation (to process batches of different length sequences)
1.2.1 Tokenization
- word-level tokenization
- char-level tokenization
- string split
- punctuation
- stemming / lemmatization
- subword tokenization
- (e.g. run-ING, run-S, run-ER share the root "run")
- ("wouldn't" is a contraction of "would not")
- Different models use different tokenization schemes (see the sketch below)
- Each model has its own tokenizer
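A minimal sketch contrasting these schemes, using the transformers library and the bert-base-uncased checkpoint as an assumed example; the exact subword splits depend on the tokenizer's vocabulary:
from transformers import AutoTokenizer

text = "I wouldn't say tokenization is hard."

# word-level : naive string split
print(text.split())

# char-level : every character becomes a token
print(list(text))

# subword-level : a pretrained Hugging Face tokenizer
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
print(tokenizer.tokenize(text))  # rare words are split into pieces such as 'token', '##ization'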
1.2.2 Mapping Tokens to Integers
- assign a unique ID to each unique token (see the sketch below)
- vocabulary = {token1 : id1, token2 : id2}
- idx2token = {v : k for k, v in token2idx.items()}
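A minimal pure-Python sketch of building this mapping (real tokenizers also reserve IDs for special tokens such as padding; ID 0 is kept free for that here):
docs = [["i", "like", "cats"], ["do", "you", "like", "cats", "too"]]

# assign a unique ID to each unique token (0 is reserved for padding)
token2idx = {}
for doc in docs:
    for token in doc:
        if token not in token2idx:
            token2idx[token] = len(token2idx) + 1

# reverse mapping, as above
idx2token = {v: k for k, v in token2idx.items()}

# map a document to integers
print([token2idx[t] for t in docs[0]])  # [1, 2, 3]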
1.2.3 Padding
- In deep learning, we want to process batches of inputs at a time
- But what if the documents in a batch have different lengths?
- Generally the pad token is 0, but it could be something else
- Remove outliers first (e.g. if most docs are ~1K tokens but one doc is 100K tokens, everything would be padded to 100K); see the padding sketch below
- RNN: h(t) = f(W_x x(t) + W_h h(t-1) + b)
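A minimal pure-Python padding sketch (the token IDs are made up for illustration, and 0 is assumed to be the pad ID):
batch = [
    [420, 650, 103],        # "i like cats"
    [7, 8, 650, 103, 11],   # "do you like cats too"
]

# pad every sequence to the length of the longest one in the batch
max_len = max(len(seq) for seq in batch)
padded = [seq + [0] * (max_len - len(seq)) for seq in batch]
print(padded)  # [[420, 650, 103, 0, 0], [7, 8, 650, 103, 11]]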
1.2.4 Truncation
- Padding makes some documents longer
- Truncation makes some documents shorter
- Transformers have a maximum sequence length
- e.g. BERT limit = 512 tokens, GPT-2 limit = 1024 tokens
- (attention has quadratic complexity in the sequence length)
- For text classification (a single prediction per document), truncation doesn't matter so much
- For NER or machine translation (the output depends on every token), it does; see the sketch below
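Jumping ahead slightly to the Hugging Face tokenizer covered in the next section, a sketch of letting it handle truncation (truncation and max_length are real tokenizer arguments; bert-base-uncased is assumed, with its 512-token limit):
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
long_text = "very long document " * 1000

# truncation=True cuts the sequence down to the model's maximum length
inputs = tokenizer(long_text, truncation=True)
print(len(inputs['input_ids']))  # 512

# or set an explicit, smaller limit
inputs = tokenizer(long_text, truncation=True, max_length=128)
print(len(inputs['input_ids']))  # 128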
1.3 What Does a Pipeline Actually Do?
- Step 1
- Text processing : convert input text into numbers
- Tokenizer + standard text-preprocessing
- Universal interface to BertTokenizer, DistilBertTokenizer, GPT2Tokenizer
- Hugging Face Model is a wrapper around TF / PyTorch model with a few added bells and whistles (universal interface to BertModel, GPT2Model…)
- Step 2
- Model : convert the input into predictions (e.g. negative/positive)
- Step 3
- Post processing : make prediction human-readable (e.g. translation)
- Hard example : NER (tokens are subwords, but labels correspond to words!)
from transformers import AutoTokenizer
checkpoint = 'bert-base-uncased'
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
tokenizer("hello world")
OUTPUT :
{
'input_ids': [101, 7592, 2088, 102],
'token_type_ids': [0, 0, 0, 0], # will show up for BERT, but not for DistilBERT
'attention_mask': [1, 1, 1, 1]
}
[CLS] = classification token, [SEP] = sentence separator (see the sketch below)
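A quick way to see these special tokens, using the tokenizer defined above and its convert_ids_to_tokens / decode methods (IDs taken from the output above):
print(tokenizer.convert_ids_to_tokens([101, 7592, 2088, 102]))
# ['[CLS]', 'hello', 'world', '[SEP]']

print(tokenizer.decode([101, 7592, 2088, 102]))
# '[CLS] hello world [SEP]'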
1.3.1 Why Do All These Methods Exist?
- This is a new library, things are always changing!
- A few related methods (not shown) are deprecated
- Don't be surprised if things change overnight
- Corollary : don't be alarmed if/when they do
- The goal of this course isn't to memorize one set of syntax, it's to become skilled enough to handle any syntax!
tokenizer("hello world", return_tensors='pt')
'pt' = PyTorch, 'tf' = TensorFlow, 'np' = NumPy arrays
OUTPUT :
{
'input_ids': tensor([[101, 7592, 2088, 102]]),
'token_type_ids': tensor([[0, 0, 0, 0]]), # will show up for BERT, but not for DistilBERT
'attention_mask': tensor([[1, 1, 1, 1]])
}
1.3.2 Tokenizing a Batch of Sentences
data = [
"I like cats.",
"Do you like cats too?",
]
model_inputs = tokenizer(data) # OK, but the model won't accept plain lists
model_inputs = tokenizer(data, return_tensors='pt') # ERROR: the sequences have different lengths
You'll need to pass two additional arguments:
model_inputs = tokenizer(data, padding=True, truncation=True, return_tensors='pt')
attention_mask = which tokens the model should pay attention to (e.g. ignore padded tokens, which exist only to make the tensors equal-sized); see the sketch below
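Printing the batch makes the masking visible (the exact token IDs depend on the tokenizer, but the shorter sentence is padded out with pad IDs and gets matching 0s in its attention_mask):
print(model_inputs['input_ids'])       # one row per sentence; the shorter row ends in pad IDs
print(model_inputs['attention_mask'])  # 1 = real token, 0 = padding (ignored by the model)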
1.3.3 Using the Model
from transformers import AutoModelForSequenceClassification
# should be the same checkpoint as tokenizer
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
OUTPUT: Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference
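This warning is expected: the base checkpoint has no trained classification head, so a new one is randomly initialized. A sketch of sizing that head explicitly (num_labels is a real from_pretrained argument; 2 classes for negative/positive is an assumption here):
from transformers import AutoModelForSequenceClassification

# same checkpoint as the tokenizer, with a 2-class classification head
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)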
1.3.4 Making Predictions
model_inputs = tokenizer(
data, padding=True, truncation=True, return_tensors='pt')
outputs = model(**model_inputs)
Double Asterisk
- Recall : ** converts a dictionary into named arguments; a sketch applying this to model_inputs follows the example below
def my_function(name, email, password):
# do some stuff
# normal function call
my_function("Alice", "alice@emal.com", "12345")
# function call with explicit arguments
my_function(name="Alice", email="alice@email.com", password="12345")
# function call with dictionary - equivalent to above
d = {'name': 'Alice', 'email': 'alice@email.com', 'password': '12345'}
my_function(**d)
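Applied to the model call above, ** simply unpacks everything the tokenizer returned; a sketch of the equivalence (for BERT the dict also contains token_type_ids, which ** passes along too):
# these two calls are equivalent (model_inputs is the dict returned by the tokenizer)
outputs = model(**model_inputs)
outputs = model(
    input_ids=model_inputs['input_ids'],
    attention_mask=model_inputs['attention_mask'],
    token_type_ids=model_inputs['token_type_ids'],  # present for BERT, absent for some models
)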
Reminder: PyTorch Model
class MyModel(torch.nn.Module):
def __init__(self):
super().__init__()
# ...
def forward(self, input_ids, attention_mask, ...):
# do the computation
return output
# usage
model = MyModel()
model(input_ids=some_data, attention_mask=other_data, ...)
Model Outputs
- Document Classification - K classes
- If you pass in N documents, you will get back an N x K output
- If you pass in a single document, you still get a batch dimension : a 1 x K output
- The outputs are logits (the values before applying softmax)
- To get the class prediction, just take the argmax (see the sketch below)
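A minimal sketch of turning the logits above into class predictions (and, optionally, probabilities), using the outputs from the model call earlier:
import torch
import torch.nn.functional as F

logits = outputs.logits                      # shape N x K
predictions = torch.argmax(logits, dim=-1)   # predicted class index per document
probabilities = F.softmax(logits, dim=-1)    # optional : normalized scores per class
print(predictions)
print(probabilities)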