1. Fine-Tuning (Intermediate)

1.1 Fine-Tuning Section Introduction

  • The previous section was easy: we only used models that others had already trained
  • i.e. we only used the model for inference / prediction
  • Very simple thanks to pipeline - same interface for all tasks
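
As a quick reminder, a minimal pipeline call (using the library's default checkpoint for the task) looks like this:

from transformers import pipeline

# same interface regardless of the underlying model
classifier = pipeline("sentiment-analysis")
print(classifier("This course is great!"))
# e.g. [{'label': 'POSITIVE', 'score': 0.99...}]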

1.1.1 Section Outline

  • Review text preprocessing (convert text into numbers)
  • Tokenization, token to integer mapping, padding
  • Pipeline (two parts: tokenizer + model)
    • Text -> Tokenizer -> Tokenized Data -> Model -> Predictions

Examples:

  • Example 1: sentiment analysis on a built-in dataset
  • Example 2: text classification on a custom dataset
  • Example 3: text classification with multiple input sentences

1.2 Text Preprocessing Review

The steps:

  • Tokenize (string split, split words + punctuation, characters, subwords…)
  • Map tokens to integers (e.g. {"i", "like", "cats"} -> {420, 650, 103})
  • Padding / truncation (to process batches of different length sequences)

1.2.1 Tokenization

  • word-level tokenization
  • char-level tokenization
  • string split
  • punctuation
  • stemming / lemmatization
  • subword tokenization
    • e.g. "running" -> run + ##ing, "runs" -> run + ##s, "runner" -> run + ##er
    • e.g. "wouldn't" (a contraction of "would not") can be split into multiple subword tokens
  • Different models use different tokenization schemes
  • Each model has its own tokenizer
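
For instance, a BERT-style WordPiece tokenizer splits words that are not in its vocabulary into subword pieces (the exact splits depend on the checkpoint's vocabulary):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
print(tokenizer.tokenize("Tokenization is fun"))
# subword pieces are marked with '##', e.g. something like ['token', '##ization', 'is', 'fun']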

1.2.2 Mapping Tokens to Integers

  • assign unique id to each unique token
  • vocabulary = {token1: id1, token2: id2, ...}
  • idx2token = {v : k for k, v in token2idx.items()}
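
A minimal sketch of this mapping in plain Python, using a toy vocabulary (a real model ships with its own, much larger vocabulary):

tokens = ["i", "like", "cats"]

# assign a unique id to each unique token
token2idx = {tok: i for i, tok in enumerate(sorted(set(tokens)))}
idx2token = {v: k for k, v in token2idx.items()}

ids = [token2idx[t] for t in tokens]       # e.g. [1, 2, 0]
recovered = [idx2token[i] for i in ids]    # back to ["i", "like", "cats"]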

Padding

  • In deep learning, we want to process batches of inputs at a time
  • But what if the documents in a batch have different lengths?
  • Generally the pad token id is 0, but it could be something else
  • Remove outliers (e.g. most docs are ~1K tokens while one doc is 100K tokens)
  • RNN: h(t) = f(W_x x(t) + W_h h(t-1) + b)
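
A toy sketch of padding a batch to equal length with pad id 0 (not the Hugging Face implementation, just the idea):

batch = [[101, 7592, 102], [101, 7592, 2088, 2088, 102]]

max_len = max(len(seq) for seq in batch)
padded = [seq + [0] * (max_len - len(seq)) for seq in batch]
# [[101, 7592, 102, 0, 0], [101, 7592, 2088, 2088, 102]]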

Truncation

  • Padding makes some documents longer
  • Truncation makes some documents shorter
  • Transformers have a maximum sequence length
    • e.g. BERT limit = 512 tokens, GPT-2 limit = 1024 tokens
    • quadratic complexity
  • For text classification (a single prediction per document), truncation doesn't matter so much
  • For NER or machine translation, it does
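
With a Hugging Face tokenizer, truncation is just a flag (max_length is passed explicitly here for illustration; by default the model's own limit is used):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
inputs = tokenizer("a very long document " * 1000, truncation=True, max_length=512)
print(len(inputs['input_ids']))  # 512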

1.3 What Does a Pipeline Actually Do?

  • Step 1
    • Text processing : convert input text into numbers
    • Tokenizer + standard text-preprocessing
    • Universal interface to BertTokenizer, DistilBertTokenizer, GPT2Tokenizer
    • Hugging Face Model is a wrapper around TF / PyTorch model with a few added bells and whistles (universal interface to BertModel, GPT2Model…)
  • Step 2
    • Model : convert the input into predictions (e.g. negative/positive)
  • Step 3
    • Post processing : make prediction human-readable (e.g. translation)
  • Hard example : NER (tokens are subwords, but labels correspond to words!)

from transformers import AutoTokenizer

checkpoint = 'bert-base-uncased'

tokenizer = AutoTokenizer.from_pretrained(checkpoint)

tokenizer("hello world")

OUTPUT :
{
    'input_ids': [101, 7592, 2088, 102],
    'token_type_ids': [0, 0, 0, 0], # will show up for BERT, but not for DistilBERT
    'attention_mask': [1, 1, 1, 1]
}

  • [CLS] - classification token
  • [SEP] - sentence separator
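
You can map the ids back to tokens to see these special tokens in place (using the tokenizer created above):

print(tokenizer.convert_ids_to_tokens([101, 7592, 2088, 102]))
# ['[CLS]', 'hello', 'world', '[SEP]']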

1.3.1 Why do all these methods exist?

  • This is a new library, things are always changing!
  • A few related methods (not shown) are deprecated
  • Don't be surprised if things change overnight
  • Corollary: don't be alarmed if/when they do
  • The goal of this course isn't to memorize one set of syntax, it's to become skilled enough to handle any syntax!

tokenizer("hello world", return_tensors='pt')

  • 'pt' = PyTorch tensors
  • 'tf' = TensorFlow tensors
  • 'np' = NumPy arrays

OUTPUT :
{
    'input_ids': tensor([[101, 7592, 2088, 102]]),
    'token_type_ids': tensor([[0, 0, 0, 0]]), # will show up for BERT, but not for DistilBERT
    'attention_mask': tensor([[1, 1, 1, 1]])
}

1.3.2 Tokenizing Multiple Sentences

data = [
 "I like cats.",
 "Do you like cats too?",
]
model_inputs = tokenizer(data) # OK, but model won't accept
model_inputs = tokenizer(data, return_tensors='pt') # ERROR

You'll need to pass two additional arguments:

model_inputs = tokenizer(data, padding=True, truncation=True, return_tensors='pt')

attention_mask = where to pay attention (e.g. don't attend to padded tokens, which exist only to make the tensors equal-sized)
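
For example, printing the attention mask of the two-sentence batch above shows 0s where the shorter sentence was padded (exact lengths depend on the tokenizer):

print(model_inputs['attention_mask'])
# roughly: tensor([[1, 1, 1, 1, 1, 1, 0, 0],
#                  [1, 1, 1, 1, 1, 1, 1, 1]])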

1.3.3 Using the Model

from transformers import AutoModelForSequenceClassification

# should be the same checkpoint as tokenizer
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

OUTPUT: Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']

You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference

1.3.4 Making Predictions

model_inputs = tokenizer(
    data, padding=True, truncation=True, return_tensors='pt')

outputs = model(**model_inputs)

Double Asterisk (**)

  • Recall: ** unpacks a dictionary into named (keyword) arguments
def my_function(name, email, password):
    # do some stuff
    pass

# normal function call
my_function("Alice", "alice@email.com", "12345")

# function call with explicit (named) arguments
my_function(name="Alice", email="alice@email.com", password="12345")

# function call with dictionary - equivalent to above
d = {'name': 'Alice', 'email': 'alice@email.com', 'password': '12345'}
my_function(**d)

Reminder: PyTorch Model


class MyModel(torch.nn.Module):

    def __init__(self):
        super().__init__()
        # ...

    def forward(self, input_ids, attention_mask, ...):
        # do the computation
        return output


# usage
model = MyModel()
model(input_ids=some_data, attention_mask=other_data, ...)

Model Outputs

  • Document Classification - K classes
  • If you pass in N documents, you will get back an N x K output
  • If you pass in a single document, you will get back a K-sized output
  • The outputs are logits (values before applying softmax)
  • To get the class prediction, just take the argmax (see the sketch below)
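
A short sketch of that last step, continuing from the outputs computed above:

import torch
import torch.nn.functional as F

logits = outputs.logits                # shape (N, K)
probs = F.softmax(logits, dim=-1)      # optional: convert logits to probabilities
preds = torch.argmax(logits, dim=-1)   # shape (N,) - predicted class per document
print(preds)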
