Hugging Face Transformers 🤗


For each ML / AI task below: notes, the accompanying notebook, and the universal interface / API (the pipeline) with example code and its output.
Sentiment Analysis
* Labels: Positive, Neutral, Negative (sometimes Very Positive, Very Negative)
* Use case: reputation management
* Use case: report sentiment statistics on Twitter for competitors and for yourself (see the batch sketch after the output below)
* Use case: customer support sentiment
* Use case: stock price prediction
Notebook: Pipeline_Sentiment_Analysis.ipynb

CODE
> from transformers import pipeline
> classifier = pipeline("sentiment-analysis")
> classifier("This is such a great movie!")

OUTPUT
[{'label': 'POSITIVE', 'score': 0.9998759031295776}]
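The pipeline also accepts a list of strings, which is the basis for the reporting use cases above. A minimal sketch (the example texts are made up):

> texts = ["I love this product!", "Worst support experience ever.", "Great value for the price."]
> results = classifier(texts)  # one {'label', 'score'} dict per input
> sum(r['label'] == 'POSITIVE' for r in results) / len(results)  # fraction positive, e.g. for a daily report
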
Text Generation
* Autoregressive language models
* Language is a "time series" (i.e. a sequence) of categorical objects
* An autoregressive language model is one that models the conditional distribution of the next word given the past words (formalized just below this list)
* Transformers (the attention mechanism) have been a key technology here, thanks to their ability to model long-range dependencies
* Use cases: we've used them to generate poetry
* Use cases: help writing emails / creative writing
* Use cases: GitHub Copilot (can generate working code from a text prompt)
* Use cases: build full website designs (actual code), compose music, answer medical queries
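The "conditional distribution" bullet above can be written in one line: the joint probability of a sequence factorizes into next-word conditionals,

p(w_1, \dots, w_T) = \prod_{t=1}^{T} p(w_t \mid w_1, \dots, w_{t-1})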
Notebook: Pipeline_Text_Generation.ipynb

CODE
> !wget -nc https://raw.githubusercontent.com/lazyprogrammer/machine_learning_examples/master/hmm_class/robert_frost.txt
> from transformers import pipeline, set_seed
> lines = [line.rstrip() for line in open('robert_frost.txt')]
> gen = pipeline("text-generation")
> gen(lines[0])

OUTPUT
[{'generated_text': 'Two roads diverged in a yellow wood, which they had left behind a few yards
from where they had cut from. At the end of the road
stood a tall red pole and, just out of view, the white-lipped man could see'}]
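set_seed is imported above but unused in the snippet; a small sketch of reproducible sampling with a couple of common generation arguments (the argument values are illustrative):

> set_seed(42)  # fix the RNG so sampled generations are reproducible
> gen(lines[0], max_length=30, num_return_sequences=3)  # three alternative continuations of the same prompt
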
Masked Language Modeling
* Example: article spinning (create content with keywords that match users' queries)
* Article spinning idea: change enough words (while keeping the article coherent) so that the result no longer matches the original
* Article spinning is a black hat SEO technique (a rough sketch follows the output below)
Notebook: Pipeline_Masked_Language_Modeling.ipynb

CODE
> !wget -nc https://lazyprogrammer.me/course_files/nlp/bbc_text_cls.csv
> from transformers import pipeline
> mlm = pipeline('fill-mask')
> mlm('Bombardier chief to leave <mask>')

OUTPUT
[{'score': 0.06950818747282028, 'sequence': 'Bombardier chief to leave job', 'token': 633, 'token_str': ' job'},
{'score': 0.06693071871995926, 'sequence': 'Bombardier chief to leave France', 'token': 1470, 'token_str': ' France'},
{'score': 0.052735257893800735, 'sequence': 'Bombardier chief to leave office', 'token': 558, 'token_str': ' office'},
{'score': 0.025823095813393593, 'sequence': 'Bombardier chief to leave Paris', 'token': 2201, 'token_str': ' Paris'},
{'score': 0.021368568763136864, 'sequence': 'Bombardier chief to leave Canada', 'token': 896, 'token_str': ' Canada'}]
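A rough article-spinning sketch built on the mlm pipeline above (not from the original notebook; it assumes whitespace tokenization and the default model's <mask> token, and ignores subtleties like punctuation and input length limits):

> import random
> def spin(text, p=0.2):
      # replace a random ~p fraction of the words with the model's top fill-mask prediction
      words = text.split()
      for i in range(len(words)):
          if random.random() < p:
              masked = ' '.join(words[:i] + ['<mask>'] + words[i + 1:])
              words[i] = mlm(masked)[0]['token_str'].strip()
      return ' '.join(words)
> spin('Bombardier chief to leave company after poor results')
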
Named Entity Recognition
* Named entity recognition (NER) allows us to identify (i.e. tag) all the people, places, and companies in a document (a small helper follows the output below)
Notebook: Pipeline_NER.ipynb

CODE
> from transformers import pipeline
> ner = pipeline("ner", aggregation_strategy='simple', device=0)
> inputs[9]  # 'inputs' is a list of tokenized sentences loaded earlier in the notebook (shown under OUTPUT below)
> from nltk.tokenize.treebank import TreebankWordDetokenizer
> detokenizer = TreebankWordDetokenizer()
> ner(detokenizer.detokenize(inputs[9]))  # detokenize back to a plain string before tagging

OUTPUT
NER Input = ['He', 'was', 'well', 'backed', 'by', 'England', 'hopeful', 'Mark', 'Butcher', 'who', 'made', '70', 'as', 'Surrey',
'closed', 'on', '429', 'for', 'seven', ',', 'a', 'lead', 'of', '234', '.']

NER Result = [{'entity_group': 'LOC', 'score': 0.99967515, 'word': 'England', 'start': 22, 'end': 29},
{'entity_group': 'PER', 'score': 0.99974275, 'word': 'Mark Butcher', 'start': 38, 'end': 50},
{'entity_group': 'ORG', 'score': 0.9996264, 'word': 'Surrey', 'start': 66, 'end': 72}]
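A small usage sketch on top of the ner pipeline above (the helper name is hypothetical): collect every person mentioned in a document.

> def extract_people(text):
      # keep only entities the aggregated pipeline tagged as persons
      return [e['word'] for e in ner(text) if e['entity_group'] == 'PER']
> extract_people("Mark Butcher made 70 as Surrey closed on 429 for seven.")
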
Text Summarization
* We already do this all the time: scientific paper abstracts, executive summaries
* Paraphrase / summarize
* Summarization is a way for learning systems to demonstrate understanding of a concept
* Extractive vs. abstractive
* Extractive summaries consist of text taken from the original document
* Abstractive summaries can contain novel sequences of text not necessarily taken from the input
Notebook: Pipeline_Text_Summarization.ipynb

CODE
> import pandas as pd
> df = pd.read_csv('bbc_text_cls.csv')  # downloaded in the masked-LM example above; assumes columns 'text' and 'labels'
> doc = df[df.labels == 'business']['text']
> from transformers import pipeline
> summarizer = pipeline("summarization")
> summarizer(doc.iloc[0].split("\n", 1)[1])  # drop the headline (first line) and summarize the body

OUTPUT
[{'summary_text': ' Retail sales dropped by 1% on the month in December, after a 0.6% rise in November .
Clothing retailers and non-specialist stores were the worst hit with only internet retailers showing any significant growth .
The last time retailers endured a tougher Christmas was 23 years ago, when sales plunged 1.7% .'}]
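Generation arguments pass through to the underlying model, so the summary length can be bounded; a sketch with illustrative values:

> summarizer(doc.iloc[0].split("\n", 1)[1], max_length=60, min_length=20)  # bound the summary to roughly 20-60 tokens
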
Neural Machine Translation
* Convert phrases from one language to another
* A sequence-to-sequence task
* There are many valid translations for a given input
* BLEU score is the most popular metric (a small BLEU sketch follows the output below)
Notebook: Pipeline_Neural_Machine_Translation.ipynb

CODE
> from transformers import pipeline
> translator = pipeline("translation", model='Helsinki-NLP/opus-mt-en-es', device=0)
> translator("I like eggs and ham")

OUTPUT
[{'translation_text': 'Me gustan los huevos y el jamón.'}]
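The BLEU bullet above can be made concrete with NLTK (this snippet is not from the original notebook; sentence-level BLEU on short sentences is noisy and is shown only to illustrate the API):

> from nltk.translate.bleu_score import sentence_bleu
> reference = [['me', 'gustan', 'los', 'huevos', 'y', 'el', 'jamón', '.']]  # one or more reference token lists
> candidate = ['me', 'gustan', 'los', 'huevos', 'y', 'el', 'jamón', '.']
> sentence_bleu(reference, candidate)  # 1.0 for an exact match
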
Question Answering
* SQuAD (the Stanford Question Answering Dataset)
* It is an extractive question answering dataset
* The answer is contained in the input, and the model simply "extracts" the portion which makes up the answer (demonstrated after the output below)
* Input format: [CLS] question tokens [SEP] context tokens
Notebook: Pipeline_Question_Answering.ipynb

CODE
> from transformers import pipeline
> qa = pipeline("question-answering")
> context = "Today I went to the store to purchase a carton of milk."
> question = "What did I buy?"
> qa(context=context, question=question)

OUTPUT
{'answer': 'a carton of milk', 'end': 54, 'score': 0.5626223683357239, 'start': 38}
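To see the "extractive" point concretely: the start/end fields index into the original context, so the answer is literally a slice of the input.

> result = qa(context=context, question=question)
> context[result['start']:result['end']]  # 'a carton of milk' -- a span of the context, not newly generated text
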
Zero-Shot Classification
* Classification without labeled training data: candidate labels are supplied at inference time (a multi-label variant is sketched after the output below)
Notebook: Pipeline_Zero_Shot_Classification.ipynb

CODE
> from transformers import pipeline
> classifier = pipeline("zero-shot-classification", device=0)
> text = "Due to the presence of isoforms of its components, there are 12 " + \
    "versions of AMPK in mammals, each of which can have different tissue " + \
    "localizations, and different functions under different conditions. " + \
    "AMPK is regulated allosterically and by post-translational " + \
    "modification, which work together."
> classifier(text, candidate_labels=["biology", "math", "geology"])

OUTPUT
{'labels': ['biology', 'math', 'geology'],
'scores': [0.8908600807189941, 0.06606598943471909, 0.04307396709918976],
'sequence': 'Due to the presence of isoforms of its components, there are 12
versions of AMPK in mammals, each of which can have different tissue
localizations, and different functions under different conditions.
AMPK is regulated allosterically and by post-translational
modification, which work together.'}
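The pipeline also supports a multi-label mode, where each candidate label is scored independently rather than as a distribution over the set (so scores no longer sum to 1); a sketch with the same text:

> classifier(text, candidate_labels=["biology", "math", "geology"], multi_label=True)  # independent score per label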