1. Fine-Tuning (Intermediate)

1.1 Fine-Tuning Section Introduction

  • The previous section was easy: we only used models that others had already trained
  • i.e. we only used the model for inference / prediction
  • Very simple thanks to pipeline - same interface for all tasks
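
As a quick reminder, a minimal pipeline call (using the library's default checkpoint for the task) looks like this:

from transformers import pipeline

# same interface regardless of the underlying model
classifier = pipeline("sentiment-analysis")
print(classifier("This course is great!"))
# e.g. [{'label': 'POSITIVE', 'score': 0.99...}]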

1.1.1 Section Outline

  • Review text preprocessing (convert text into numbers)
  • Tokenization, token to integer mapping, padding
  • Pipeline (two parts: tokenizer + model)
    • Text -> Tokenizer -> Tokenized Data -> Model -> Predictions

Examples:

  • Example 1: sentiment analysis on a built-in dataset
  • Example 2: text classification on a custom dataset
  • Example 3: text classification with multiple input sentences

1.2 Text Preprocessing Review

The steps:

  • Tokenize (string split, split words + punctuation, characters, subwords…)
  • Map tokens to integers (e.g. {"i", "like", "cats"} -> {420, 650, 103})
  • Padding / truncation (to process batches of different length sequences)

1.2.1 Tokenization

  • word-level tokenization
  • char-level tokenization
  • string split
  • punctuation
  • stemming / lemmatization
  • subword tokenization
    • e.g. "running" -> run + ##ing, "runs" -> run + ##s, "runner" -> run + ##er
    • e.g. "wouldn't" (a contraction of "would not") can be split into multiple subword tokens
  • Different models use different tokenization schemes
  • Each model has its own tokenizer
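
For instance, a BERT-style WordPiece tokenizer splits words that are not in its vocabulary into subword pieces (the exact splits depend on the checkpoint's vocabulary):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
print(tokenizer.tokenize("Tokenization is fun"))
# subword pieces are marked with '##', e.g. something like ['token', '##ization', 'is', 'fun']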

1.2.2 Mapping Tokens to Integers

  • assign unique id to each unique token
  • vocabulary = {token1: id1, token2: id2, ...}
  • idx2token = {v : k for k, v in token2idx.items()}
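
A minimal sketch of this mapping in plain Python, using a toy vocabulary (a real model ships with its own, much larger vocabulary):

tokens = ["i", "like", "cats"]

# assign a unique id to each unique token
token2idx = {tok: i for i, tok in enumerate(sorted(set(tokens)))}
idx2token = {v: k for k, v in token2idx.items()}

ids = [token2idx[t] for t in tokens]       # e.g. [1, 2, 0]
recovered = [idx2token[i] for i in ids]    # back to ["i", "like", "cats"]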

Padding

  • In deep learning, we want to process batches of inputs at a time
  • But what if the documents in a batch have different lengths?
  • Generally the pad token id is 0, but it could be something else
  • Remove outliers (e.g. most docs are ~1K tokens while one doc is 100K tokens)
  • RNN: h(t) = f(W_x x(t) + W_h h(t-1) + b)
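
A toy sketch of padding a batch to equal length with pad id 0 (not the Hugging Face implementation, just the idea):

batch = [[101, 7592, 102], [101, 7592, 2088, 2088, 102]]

max_len = max(len(seq) for seq in batch)
padded = [seq + [0] * (max_len - len(seq)) for seq in batch]
# [[101, 7592, 102, 0, 0], [101, 7592, 2088, 2088, 102]]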

Truncation

  • Padding makes some documents longer
  • Truncation makes some documents shorter
  • Transformers have a maximum sequence length
    • e.g. BERT limit = 512 tokens, GPT-2 limit = 1024 tokens
    • quadratic complexity
  • For text classification (a single prediction per document), truncation doesn't matter so much
  • For NER or machine translation, it does
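
With a Hugging Face tokenizer, truncation is just a flag (max_length is passed explicitly here for illustration; by default the model's own limit is used):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
inputs = tokenizer("a very long document " * 1000, truncation=True, max_length=512)
print(len(inputs['input_ids']))  # 512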

1.3 What Does a Pipeline Actually Do?

  • Step 1
    • Text processing : convert input text into numbers
    • Tokenizer + standard text-preprocessing
    • Universal interface to BertTokenizer, DistilBertTokenizer, GPT2Tokenizer
    • Hugging Face Model is a wrapper around TF / PyTorch model with a few added bells and whistles (universal interface to BertModel, GPT2Model…)
  • Step 2
    • Model : convert the input into predictions (e.g. negative/positive)
  • Step 3
    • Post processing : make prediction human-readable (e.g. translation)
  • Hard example : NER (tokens are subwords, but labels correspond to words!)

from transformers import AutoTokenizer

checkpoint = 'bert-base-uncased'

tokenizer = AutoTokenizer.from_pretrained(checkpoint)

tokenizer("hello world")

OUTPUT :
{
    'input_ids': [101, 7592, 2088, 102],
    'token_type_ids': [0, 0, 0, 0], # will show up for BERT, but not for DistilBERT
    'attention_mask': [1, 1, 1, 1]
}

  • [CLS] - classification token
  • [SEP] - sentence separator
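
You can map the ids back to tokens to see these special tokens in place (using the tokenizer created above):

print(tokenizer.convert_ids_to_tokens([101, 7592, 2088, 102]))
# ['[CLS]', 'hello', 'world', '[SEP]']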

1.3.1 Why do all these methods exist?

  • This is a new library, things are always changing!
  • A few related methods (not shown) are deprecated
  • Don't be surprised if things change overnight
  • Corollary: don't be alarmed if/when they do
  • The goal of this course isn't to memorize one set of syntax, it's to become skilled enough to handle any syntax!

tokenizer("hello world", return_tensors='pt')

  • 'pt' = PyTorch tensors
  • 'tf' = TensorFlow tensors
  • 'np' = NumPy arrays

OUTPUT :
{
    'input_ids': tensor([[101, 7592, 2088, 102]]),
    'token_type_ids': tensor([[0, 0, 0, 0]]), # will show up for BERT, but not for DistilBERT
    'attention_mask': tensor([[1, 1, 1, 1]])
}

1.3.2 Tokenizing Multiple Sentences

data = [
 "I like cats.",
 "Do you like cats too?",
]
model_inputs = tokenizer(data) # OK, but model won't accept
model_inputs = tokenizer(data, return_tensors='pt') # ERROR

You'll need to pass two additional arguments:

model_inputs = tokenizer(data, padding=True, truncation=True, return_tensors='pt')

attention_mask = where to pay attention (e.g. don't attend to padded tokens, which exist only to make the tensors equal-sized)
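
For example, printing the attention mask of the two-sentence batch above shows 0s where the shorter sentence was padded (exact lengths depend on the tokenizer):

print(model_inputs['attention_mask'])
# roughly: tensor([[1, 1, 1, 1, 1, 1, 0, 0],
#                  [1, 1, 1, 1, 1, 1, 1, 1]])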

1.3.3 Using the Model

from transformers import AutoModelForSequenceClassification

# should be the same checkpoint as tokenizer
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

OUTPUT: Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']

You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference

1.3.4 Making Predictions

model_inputs = tokenizer(
    data, padding=True, truncation=True, return_tensors='pt')

outputs = model(**model_inputs)

Double Asterisk (**)

  • Recall: ** unpacks a dictionary into named (keyword) arguments
def my_function(name, email, password):
    # do some stuff
    pass

# normal function call
my_function("Alice", "alice@email.com", "12345")

# function call with explicit (named) arguments
my_function(name="Alice", email="alice@email.com", password="12345")

# function call with dictionary - equivalent to above
d = {'name': 'Alice', 'email': 'alice@email.com', 'password': '12345'}
my_function(**d)

Reminder: PyTorch Model


class MyModel(torch.nn.Module):

    def __init__(self):
        super().__init__()
        # ...

    def forward(self, input_ids, attention_mask, ...):
        # do the computation
        return output


# usage
model = MyModel()
model(input_ids=some_data, attention_mask=other_data, ...)

Model Outputs

  • Document Classification - K classes
  • If you pass in N documents, you will get back an N x K output
  • If you pass in a single document, you will get back a K-sized output
  • The outputs are logits (values before applying softmax)
  • To get the class prediction, just take the argmax (see the sketch below)
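
A short sketch of that last step, continuing from the outputs computed above:

import torch
import torch.nn.functional as F

logits = outputs.logits                # shape (N, K)
probs = F.softmax(logits, dim=-1)      # optional: convert logits to probabilities
preds = torch.argmax(logits, dim=-1)   # shape (N,) - predicted class per document
print(preds)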
