Hugging Face Transformers 🤗


For each ML / AI task below: notes, the accompanying notebook, and the universal interface / API (the pipeline) with example code and its output.
Sentiment Analysis
* Labels: Positive, Neutral, Negative (sometimes Very Positive, Very Negative)
* Use case: reputation management
* Use case: report sentiment statistics on Twitter for competitors and for yourself (see the batch sketch after the output below)
* Use case: customer support sentiment
* Use case: stock price prediction
Notebook: Pipeline_Sentiment_Analysis.ipynb

CODE
> from transformers import pipeline
> classifier = pipeline("sentiment-analysis")
> classifier("This is such a great movie!")

OUTPUT
[{'label': 'POSITIVE', 'score': 0.9998759031295776}]
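The pipeline also accepts a list of strings, which is the basis for the reporting use cases above. A minimal sketch (the example texts are made up):

> texts = ["I love this product!", "Worst support experience ever.", "Great value for the price."]
> results = classifier(texts)  # one {'label', 'score'} dict per input
> sum(r['label'] == 'POSITIVE' for r in results) / len(results)  # fraction positive, e.g. for a daily report
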
Text Generation
* Autoregressive language models
* Language is a "time series" (i.e. a sequence) of categorical objects
* An autoregressive language model is one that models the conditional distribution of the next word given the past words (formalized just below this list)
* Transformers (the attention mechanism) have been a key technology here, thanks to their ability to model long-range dependencies
* Use cases: we've used them to generate poetry
* Use cases: help writing emails / creative writing
* Use cases: GitHub Copilot (can generate working code from a text prompt)
* Use cases: build full website designs (actual code), compose music, answer medical queries
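The "conditional distribution" bullet above can be written in one line: the joint probability of a sequence factorizes into next-word conditionals,

p(w_1, \dots, w_T) = \prod_{t=1}^{T} p(w_t \mid w_1, \dots, w_{t-1})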
Notebook: Pipeline_Text_Generation.ipynb

CODE
> !wget -nc https://raw.githubusercontent.com/lazyprogrammer/machine_learning_examples/master/hmm_class/robert_frost.txt
> from transformers import pipeline, set_seed
> lines = [line.rstrip() for line in open('robert_frost.txt')]
> gen = pipeline("text-generation")
> gen(lines[0])

OUTPUT
[{'generated_text': 'Two roads diverged in a yellow wood, which they had left behind a few yards
from where they had cut from. At the end of the road
stood a tall red pole and, just out of view, the white-lipped man could see'}]
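set_seed is imported above but unused in the snippet; a small sketch of reproducible sampling with a couple of common generation arguments (the argument values are illustrative):

> set_seed(42)  # fix the RNG so sampled generations are reproducible
> gen(lines[0], max_length=30, num_return_sequences=3)  # three alternative continuations of the same prompt
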
Masked Language Modeling
* Example: article spinning (create content with keywords that match users' queries)
* Article spinning idea: change enough words (while keeping the article coherent) so that the result no longer matches the original
* Article spinning is a black hat SEO technique (a rough sketch follows the output below)
Notebook: Pipeline_Masked_Language_Modeling.ipynb

CODE
> !wget -nc https://lazyprogrammer.me/course_files/nlp/bbc_text_cls.csv
> from transformers import pipeline
> mlm = pipeline('fill-mask')
> mlm('Bombardier chief to leave <mask>')

OUTPUT
[{'score': 0.06950818747282028, 'sequence': 'Bombardier chief to leave job', 'token': 633, 'token_str': ' job'},
{'score': 0.06693071871995926, 'sequence': 'Bombardier chief to leave France', 'token': 1470, 'token_str': ' France'},
{'score': 0.052735257893800735, 'sequence': 'Bombardier chief to leave office', 'token': 558, 'token_str': ' office'},
{'score': 0.025823095813393593, 'sequence': 'Bombardier chief to leave Paris', 'token': 2201, 'token_str': ' Paris'},
{'score': 0.021368568763136864, 'sequence': 'Bombardier chief to leave Canada', 'token': 896, 'token_str': ' Canada'}]
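A rough article-spinning sketch built on the mlm pipeline above (not from the original notebook; it assumes whitespace tokenization and the default model's <mask> token, and ignores subtleties like punctuation and input length limits):

> import random
> def spin(text, p=0.2):
      # replace a random ~p fraction of the words with the model's top fill-mask prediction
      words = text.split()
      for i in range(len(words)):
          if random.random() < p:
              masked = ' '.join(words[:i] + ['<mask>'] + words[i + 1:])
              words[i] = mlm(masked)[0]['token_str'].strip()
      return ' '.join(words)
> spin('Bombardier chief to leave company after poor results')
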
Named Entity Recognition
* Named entity recognition (NER) allows us to identify (i.e. tag) all the people, places, and companies in a document (a small helper follows the output below)
Notebook: Pipeline_NER.ipynb

CODE
> from transformers import pipeline
> ner = pipeline("ner", aggregation_strategy='simple', device=0)
> inputs[9]  # 'inputs' is a list of tokenized sentences loaded earlier in the notebook (shown under OUTPUT below)
> from nltk.tokenize.treebank import TreebankWordDetokenizer
> detokenizer = TreebankWordDetokenizer()
> ner(detokenizer.detokenize(inputs[9]))  # detokenize back to a plain string before tagging

OUTPUT
NER Input = ['He', 'was', 'well', 'backed', 'by', 'England', 'hopeful', 'Mark', 'Butcher', 'who', 'made', '70', 'as', 'Surrey',
'closed', 'on', '429', 'for', 'seven', ',', 'a', 'lead', 'of', '234', '.']

NER Result = [{'entity_group': 'LOC', 'score': 0.99967515, 'word': 'England', 'start': 22, 'end': 29},
{'entity_group': 'PER', 'score': 0.99974275, 'word': 'Mark Butcher', 'start': 38, 'end': 50},
{'entity_group': 'ORG', 'score': 0.9996264, 'word': 'Surrey', 'start': 66, 'end': 72}]
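A small usage sketch on top of the ner pipeline above (the helper name is hypothetical): collect every person mentioned in a document.

> def extract_people(text):
      # keep only entities the aggregated pipeline tagged as persons
      return [e['word'] for e in ner(text) if e['entity_group'] == 'PER']
> extract_people("Mark Butcher made 70 as Surrey closed on 429 for seven.")
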
Text Summarization
* We already do this all the time: scientific paper abstracts, executive summaries
* Paraphrase / summarize
* Summarization is a way for learning systems to demonstrate understanding of a concept
* Extractive vs. abstractive
* Extractive summaries consist of text taken from the original document
* Abstractive summaries can contain novel sequences of text not necessarily taken from the input
Notebook: Pipeline_Text_Summarization.ipynb

CODE
> import pandas as pd
> df = pd.read_csv('bbc_text_cls.csv')  # downloaded in the masked-LM example above; assumes columns 'text' and 'labels'
> doc = df[df.labels == 'business']['text']
> from transformers import pipeline
> summarizer = pipeline("summarization")
> summarizer(doc.iloc[0].split("\n", 1)[1])  # drop the headline (first line) and summarize the body

OUTPUT
[{'summary_text': ' Retail sales dropped by 1% on the month in December, after a 0.6% rise in November .
Clothing retailers and non-specialist stores were the worst hit with only internet retailers showing any significant growth .
The last time retailers endured a tougher Christmas was 23 years ago, when sales plunged 1.7% .'}]
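Generation arguments pass through to the underlying model, so the summary length can be bounded; a sketch with illustrative values:

> summarizer(doc.iloc[0].split("\n", 1)[1], max_length=60, min_length=20)  # bound the summary to roughly 20-60 tokens
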
Neural Machine Translation
* Convert phrases from one language to another
* A sequence-to-sequence task
* There are many valid translations for a given input
* BLEU score is the most popular metric (a small BLEU sketch follows the output below)
Notebook: Pipeline_Neural_Machine_Translation.ipynb

CODE
> from transformers import pipeline
> translator = pipeline("translation", model='Helsinki-NLP/opus-mt-en-es', device=0)
> translator("I like eggs and ham")

OUTPUT
[{'translation_text': 'Me gustan los huevos y el jamón.'}]
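The BLEU bullet above can be made concrete with NLTK (this snippet is not from the original notebook; sentence-level BLEU on short sentences is noisy and is shown only to illustrate the API):

> from nltk.translate.bleu_score import sentence_bleu
> reference = [['me', 'gustan', 'los', 'huevos', 'y', 'el', 'jamón', '.']]  # one or more reference token lists
> candidate = ['me', 'gustan', 'los', 'huevos', 'y', 'el', 'jamón', '.']
> sentence_bleu(reference, candidate)  # 1.0 for an exact match
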
Question Answering
* SQuAD (the Stanford Question Answering Dataset)
* It is an extractive question answering dataset
* The answer is contained in the input, and the model simply "extracts" the portion which makes up the answer (demonstrated after the output below)
* Input format: [CLS] question tokens [SEP] context tokens
Notebook: Pipeline_Question_Answering.ipynb

CODE
> from transformers import pipeline
> qa = pipeline("question-answering")
> context = "Today I went to the store to purchase a carton of milk."
> question = "What did I buy?"
> qa(context=context, question=question)

OUTPUT
{'answer': 'a carton of milk', 'end': 54, 'score': 0.5626223683357239, 'start': 38}
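To see the "extractive" point concretely: the start/end fields index into the original context, so the answer is literally a slice of the input.

> result = qa(context=context, question=question)
> context[result['start']:result['end']]  # 'a carton of milk' -- a span of the context, not newly generated text
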
Zero-Shot Classification
* Classification without labeled training data: candidate labels are supplied at inference time (a multi-label variant is sketched after the output below)
Notebook: Pipeline_Zero_Shot_Classification.ipynb

CODE
> from transformers import pipeline
> classifier = pipeline("zero-shot-classification", device=0)
> text = "Due to the presence of isoforms of its components, there are 12 " + \
    "versions of AMPK in mammals, each of which can have different tissue " + \
    "localizations, and different functions under different conditions. " + \
    "AMPK is regulated allosterically and by post-translational " + \
    "modification, which work together."
> classifier(text, candidate_labels=["biology", "math", "geology"])

OUTPUT
{'labels': ['biology', 'math', 'geology'],
'scores': [0.8908600807189941, 0.06606598943471909, 0.04307396709918976],
'sequence': 'Due to the presence of isoforms of its components, there are 12
versions of AMPK in mammals, each of which can have different tissue
localizations, and different functions under different conditions.
AMPK is regulated allosterically and by post-translational
modification, which work together.'}
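The pipeline also supports a multi-label mode, where each candidate label is scored independently rather than as a distribution over the set (so scores no longer sum to 1); a sketch with the same text:

> classifier(text, candidate_labels=["biology", "math", "geology"], multi_label=True)  # independent score per label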