Introduction

It is one of those times when you stumble upon an excellent dataset on Kaggle for a really interesting data mining problem - sarcasm detection in text - and cannot resist taking a stab at it. I had looked for labelled datasets for this problem earlier but couldn’t find a reasonably clean corpus with sufficient instances.

But this JSON file offers a class-balanced dataset with ~27K news headlines labelled as sarcastic or non-sarcastic. Kaggle Link to Dataset

This weekend data mining endeavour has been a good exercise in making fun discoveries about what makes a news headline sarcastic. Some discoveries were quite specific to this dataset. For example - I was surprised when the words ‘Area’ and ‘Man’ appeared in my top 10 features for identifying sarcasm in news headlines. But then I found out ‘Area Man’ is a sarcastic slang used as a recurring joke on theonion.com.

raw_df[tokenDF_Final.area == 1][['article_link','headline_feature','is_sarcastic']].head(3).reset_index(drop=True)
article_link headline_feature is_sarcastic
https://local.theonion.com/area-woman-said-sorry-118-times-yesterday-1819576089 area woman said ‘sorry’ 118 times yesterday 1
https://www.theonion.com/area-insurance-salesman-celebrates-14th-year-of-quoting-1819565058 area insurance salesman celebrates 14th year of quoting fletch 1
https://local.theonion.com/is-area-man-going-to-finish-those-fries-1819565422 is area man going to finish those fries? 1

A few discoveries, on the other hand, generalize - they appear in sarcastic text everywhere and even corroborate my personal experience. For example - ‘Clearly’ popped up in the top 10 features. And if you think about it, people do tend to use the word frequently in sarcastic remarks.

article_link headline_feature is_sarcastic
https://www.theonion.com/jealous-gps-clearly-wants-man-to-back-over-wife-1819589581 jealous gps clearly wants man to back over wife 1
https://politics.theonion.com/new-job-posting-on-craigslist-clearly-for-secretary-of-1819568699 new job posting on craigslist clearly for secretary of the interior 1
https://www.theonion.com/elementary-schooler-clearly-just-learned-to-swear-1819566113 elementary schooler clearly just learned to swear 1

Cool… now, let’s work through the problem step by step - Data Cleaning & Exploration, Feature Engineering, and Model Training/Testing.

Sample Data Exhibit

# Reading the JSON File
import pandas as pd
raw_df = pd.read_json('Sarcasm_Headlines_Dataset.json', lines=True)

# Extracting the Hostname from URL using regular expressions
raw_df['website_name'] = raw_df['article_link'].str.extract('(https://.*?[.]comhttp/'
                                                            '|https://.*?[.]com)', expand=True)
raw_df['website_name'] = raw_df['website_name'].str.replace('https://','').str.replace('/','').str.replace('comhttp','com')
raw_df = raw_df.drop(['article_link'], axis=1)
raw_df.head(3)
headline is_sarcastic website_name
former versace store clerk sues over secret ‘black code’ for minority shoppers 0 www.huffingtonpost.com
the ‘roseanne’ revival catches up to our thorny political mood, for better and worse 0 www.huffingtonpost.com
mom starting to fear son’s web series closest thing she will have to grandchild 1 local.theonion.com

The news articles from theonion.com are all sarcastic, whereas the ones from huffingtonpost.com are all non-sarcastic. Since the aim is to understand the linguistic features - vocabulary or semantics - that help us identify sarcasm, rather than to build a 100% accurate model using just website_name as a feature, we’ll not use this variable for modelling.

pd.pivot_table(raw_df, values=['is_sarcastic'], index=['website_name'], aggfunc=('sum','count'), fill_value=0)
website_name is_sarcastic - count is_sarcastic - sum
entertainment.theonion.com 1194 1194
local.theonion.com 2852 2852
politics.theonion.com 1767 1767
sports.theonion.com 100 100
www.huffingtonpost.com 14985 0
www.theonion.com 5811 5811
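
As an aside, the hostname extraction above could also be done with the standard library’s urllib.parse instead of a hand-rolled regex. A minimal sketch - it would have to run before article_link is dropped, and unlike the regex clean-up it does not repair the few malformed ‘comhttp’ URLs:

# Alternative hostname extraction (sketch) using urllib.parse
from urllib.parse import urlparse

# netloc gives the hostname part, e.g. 'local.theonion.com' or 'www.huffingtonpost.com'
raw_df['website_name'] = raw_df['article_link'].apply(lambda url: urlparse(url).netloc)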

Data Cleaning

For NLP algorithms like Bag of Words or Doc2Vec, we’ll first need a clean set of tokens.

Note - Creating lists with list comprehensions is more concise and generally faster than writing functions with for loops, because it avoids calling the list’s append method in each iteration. Hence, I have used list comprehensions everywhere.
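
To make the note concrete, here is a minimal illustrative comparison of the two equivalent forms:

# For-loop version: looks up and calls lowered.append on every iteration
lowered = []
for w in ['Former', 'Versace', 'Store']:
    lowered.append(w.lower())

# List-comprehension version: same result, more concise and typically faster
lowered = [w.lower() for w in ['Former', 'Versace', 'Store']]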

Ok, let’s start with some standard data cleaning steps for working with text:

1. Tokenizing

# Split into Words
from nltk.tokenize import word_tokenize
# The raw column is 'headline'; rename it to match the name used in the rest of the post
raw_df = raw_df.rename(columns={'headline': 'headline_feature'})
raw_df['tokens'] = raw_df['headline_feature'].apply(word_tokenize)

2. Normalizing Case

# Convert to lower case
lower_case_tokens = lambda x : [w.lower() for w in x]
raw_df['tokens'] = raw_df['tokens'].apply(lower_case_tokens)

3. Removing Punctuation

# Filter Out Punctuation
import string
punctuation_dict = str.maketrans(dict.fromkeys(string.punctuation))
# This creates a dictionary mapping of every character from string.punctuation to None

punctuation_remover = lambda x : [w.translate(punctuation_dict) for w in x]
raw_df['tokens'] = raw_df['tokens'].apply(punctuation_remover)

4. Removing Non-alphabetic Tokens

# Remove remaining tokens that are not alphabetic
nonalphabet_remover = lambda x : [w for w in x if w.isalpha()]
raw_df['tokens'] = raw_df['tokens'].apply(nonalphabet_remover)

5. Filtering out Stop Words

# Filter out Stop Words
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))
stopwords_remover = lambda x : [w for w in x if w not in stop_words]
raw_df['tokens'] = raw_df['tokens'].apply(stopwords_remover)

6. Stemming / Lemmatizing the Tokens

# Stem / Lemmatize the Words
from nltk.stem.wordnet import WordNetLemmatizer
lmtzr = WordNetLemmatizer()
word_lemmatizer = lambda x : [lmtzr.lemmatize(w) for w in x]
raw_df['tokens'] = raw_df['tokens'].apply(word_lemmatizer)
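
As a quick sanity check, the first headline from the sample exhibit above reduces to the token list below after these six steps (the same cleaned tokens reappear in the Bag of Words sample further down):

# Inspect the cleaned tokens for the first headline
print(raw_df['tokens'].iloc[0])
# ['former', 'versace', 'store', 'clerk', 'sue', 'secret', 'black', 'code', 'minority', 'shopper']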

Feature Engineering

Bag of Words

We can extract term frequencies to build a Bag of Words model, or use the TF-IDF statistic (which discounts words that are too common across documents). Instead of calculating TF-IDF, though, I chose to simply remove terms that are too common or too rare, because dropping redundant features altogether seems preferable for avoiding the over-fitting that comes with high dimensionality.
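
For reference, the TF-IDF route would be a near drop-in replacement via sklearn’s TfidfVectorizer; a minimal sketch (not used in the rest of this post), where min_df and max_df prune the very rare and very common terms:

# TF-IDF alternative (sketch): down-weights terms that are common across headlines
import sklearn.feature_extraction.text as sfText
tfidf_vect = sfText.TfidfVectorizer(min_df = 10, max_df = 0.5)
X_tfidf = tfidf_vect.fit_transform(raw_df['tokens'].apply(' '.join))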

Note - The earlier cleaning steps related to Stop Words removal and Non-Alphabetic tokens removal also addressed redundant dimensions.

sentence_creator = lambda x : ' '.join(x)
raw_df['sentence_feature'] = raw_df['tokens'].apply(sentence_creator)
import sklearn.feature_extraction.text as sfText

vect = sfText.CountVectorizer()#(ngram_range = (1, 2))
vect.fit(raw_df['sentence_feature'])

X = vect.transform(raw_df['sentence_feature'])
tokenDataFrame = pd.DataFrame(X.A, columns = vect.get_feature_names())

# Drop terms that occur fewer than 10 times across the corpus
token_sums = tokenDataFrame.sum(axis=0)
tokens_redundant = token_sums[token_sums < 10].index
tokenDF_Final = tokenDataFrame.drop(tokens_redundant, axis = 1)

Sample Term Frequency dataset using two instances to demonstrate

headline                        better black catch clerk code former minority mood political revival roseanne secret shopper store sue thorny versace worse
former versace store clerk sue secret black code minority shopper 0 1 0 1 1 1 1 0 0 0 0 1 1 1 1 0 1 0
roseanne revival catch thorny political mood better worse 1 0 1 0 0 0 0 1 1 1 1 0 0 0 0 1 0 1

The unigram BoW model ignores word order and context. We could resort to n-grams to retain some context, but that leads to sparse, high-dimensional feature vectors (I have therefore commented out the n-gram option in the code chunk above).

Semantics

A better way to capture context and the latent relationships between different words - including synonyms, antonyms, analogies etc. - is to use word2vec. The unsupervised word2vec models learn context through vector representations of words called ‘word embeddings’, based on conditional probabilities of word occurrences around other words. Surprisingly, even with low dimensionality (in the hundreds), word2vec embeddings can learn really meaningful relationships. Once we have vectors for words, we can take the mean or sum of all the words in a document to represent the whole document as a single vector. There are also alternative doc2vec methods which directly learn vector representations for documents. I chose to average the word2vec embeddings because, going by many posts in the Stack Overflow community, they tend to perform better than doc2vec on small- to medium-sized corpora.

import gensim
# Note - this uses the pre-4.0 gensim API; in gensim >= 4.0, 'size' becomes
# 'vector_size', 'index2word' becomes 'index_to_key' and 'syn0' becomes 'vectors'
w2v_size = 100
model = gensim.models.Word2Vec(raw_df['tokens'], size = w2v_size)
w2v = dict(zip(model.wv.index2word, model.wv.syn0))

# Word2Vec of Words
import numpy as np
fetch_w2v = lambda x : [w2v[w] for w in x if w in w2v]
mean_w2v = lambda x : np.sum(x, axis=0)/len(x)
raw_df['fetch_w2v'] = raw_df['tokens'].apply(fetch_w2v)
raw_df['mean_w2v'] = raw_df['fetch_w2v'].apply(mean_w2v)

There is some more data wrangling after this to transform the array of vectors into a DataFrame with one vector per row; the corresponding code is in the Jupyter Notebook.
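
That wrangling is not shown here, but a minimal sketch could look like the following - assuming the embedding columns are named c0…c99, which is consistent with the c42, c94, … feature names that appear in the feature importance table later:

# Expand the averaged 100-d vectors into one column per dimension (c0..c99)
w2v_col_names = ['c' + str(i) for i in range(w2v_size)]
w2v_DF = pd.DataFrame(raw_df['mean_w2v'].tolist(), columns = w2v_col_names)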

Sample 2-D representation of the word embeddings to show the model works:

# Principal Component Analysis to represent word embeddings in 2-D
from sklearn.decomposition import PCA
from matplotlib import pyplot
pyplot.rcParams['figure.figsize'] = [50, 100]  # set the figure size before the plot is created

X = model[model.wv.vocab]  # all word vectors (pre-4.0 gensim API)
pca = PCA(n_components=2)
result = pca.fit_transform(X)
# create a scatter plot of the projection
pyplot.scatter(result[:, 0], result[:, 1])
words = list(model.wv.vocab)
for i, word in enumerate(words):
    pyplot.annotate(word, xy=(result[i, 0], result[i, 1]))

pyplot.show()

[Figure: 2-D PCA projection of the word2vec embeddings]

The behind-the-scenes magic of word2vec worked amazingly well - the model groups words by meaningful relationships, evident even in this 2-D representation of the embeddings (a quick most_similar spot-check is sketched after the list):

  • Hillary-Clinton-Donald-Trump
  • Obama-President-White-House-Campaign
  • Area-Man
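
The spot-check below is only a sketch - the query word is illustrative and the exact neighbours depend on the training run:

# Spot-check nearest neighbours in the embedding space (illustrative query word)
if 'trump' in model.wv:
    print(model.wv.most_similar('trump', topn=5))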

Model Training & Testing

Instead of choosing one or the other, we can combine the Bag of Words and word2vec features to build our model, and subsequently see which ones are more informative for this problem.

# Combine the original columns, the averaged word2vec features and the BoW features
raw_df2 = pd.concat([raw_df, w2v_DF, tokenDF_Final], axis=1)

Also, as a next step, clearly state the training features and the target variable.

redundant = ['is_sarcastic','article_link','headline_feature','sentence_feature','tokens','fetch_w2v','mean_w2v','website_name']
features = list(set(raw_df2.columns) - set(redundant))
target_var = ['is_sarcastic']

Now, let’s build a Logistic Regression model (rather than a black-box model) as we are interested in the feature importance (i.e. weights).

Train/Test split: 80/20

from sklearn.model_selection import train_test_split
train_data, test_data, train_target, test_target = train_test_split(raw_df2[features].values, raw_df2[target_var], train_size = .8, random_state = 100)

Accuracy on the training dataset (model tuned via 5-fold cross-validation): 83.7%

from sklearn import linear_model

clf = linear_model.LogisticRegressionCV(cv = 5, random_state=0, solver='lbfgs',multi_class='ovr',penalty='l2').fit(train_data, train_target.values.ravel())
clf.score(train_data, train_target.values.ravel())

Accuracy on Test Dataset: 78.4%

Note - Sarcasm detection is a challenging problem due to nuances in meaning, so this accuracy is impressive (I wasn’t expecting ~80% accuracy when I set out), but it is specific to this news headlines dataset from theonion.com & huffingtonpost.com.

from sklearn.metrics import accuracy_score
y_true = test_target.values.ravel()
y_pred = clf.predict(test_data)
accuracy_score(y_true, y_pred)

Confusion Matrix (tn, fp, fn, tp): (2539, 467, 688, 1648)

from sklearn.metrics import confusion_matrix
tn, fp, fn, tp  = confusion_matrix(y_true, y_pred).ravel()
(tn, fp, fn, tp)

Feature Importance

The magnitude of the coefficients in logistic regression can be loosely interpreted as feature importance, provided the features are on comparable scales (or exponentiate a weight, exp(w), to read it as an odds ratio). Other ways of assessing feature importance or parameter influence include p-values, drop-column feature importance and permutation feature importance.
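
As an aside, the permutation feature importance option could be sketched with sklearn’s permutation_importance on the already-fitted clf and the held-out test split (illustrative only; the rest of this section sticks to the coefficient magnitudes):

# Permutation importance (sketch): drop in accuracy when each feature is shuffled
from sklearn.inspection import permutation_importance
perm = permutation_importance(clf, test_data, test_target.values.ravel(),
                              n_repeats=5, random_state=0)
perm_importance = pd.Series(perm.importances_mean, index=features).sort_values(ascending=False)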

weights = clf.coef_
feature_weights = weights[0]
feature_abs_weights = np.abs(weights[0])
feature_names = np.array(raw_df2[features].columns)
feature_importance = pd.DataFrame({'Features':feature_names, 'Weights':feature_weights, 'Weights Absolute':feature_abs_weights}).sort_values(by='Weights Absolute', ascending=False).reset_index(drop = True)
feature_importance.head(12)
  Features Weights Weights Absolute
0 area 2.860869 2.860869
1 nation 2.499772 2.499772
2 introduces 2.109260 2.109260
3 shit 2.069392 2.069392
4 local 1.957934 1.957934
5 self 1.894267 1.894267
6 fucking 1.885271 1.885271
7 report 1.877206 1.877206
8 study 1.812988 1.812988
9 clearly 1.657876 1.657876
10 man 1.657209 1.657209
11 announces 1.611359 1.611359

We discussed in the beginning why the words - ‘area’, ‘man’ & ‘clearly’ show up as top features to identify sarcasm.

Other top features like ‘introduces’, ‘report’, ‘study’, ‘announces’, ‘unveils’ have to do with sarcastic emphasis around false claims.

‘introduces’

article_link headline_feature is_sarcastic
https://www.theonion.com/3m-introduces-new-line-of-protective-foam-eye-plugs-1822590036 3m introduces new line of protective foam eye plugs 1
https://local.theonion.com/burger-king-introduces-new-thing-to-throw-in-front-of-k-1819573136 burger king introduces new thing to throw in front of kids after another hellish day at work 1

‘study’

article_link headline_feature is_sarcastic
https://www.theonion.com/study-more-couples-delaying-divorce-until-kids-old-eno-1819576618 study: more couples delaying divorce until kids old enough to remember every painful detail 1
https://www.theonion.com/study-universe-actually-shrunk-by-about-19-inches-last-1819589814 study: universe actually shrunk by about 19 inches last year 1

whereas ‘fucking’ is used for sarcastic exaggeration to invoke humor

article_link headline_feature is_sarcastic
https://www.theonion.com/frontier-airlines-tells-customers-to-just-fucking-deal-1819580035 frontier airlines tells customers to just fucking deal with it 1
https://www.theonion.com/girls-scouts-announces-they-ll-never-ever-let-gross-fuc-1825752568 girls scouts announces they’ll never ever let gross fucking boys in 1

and ‘self’ sarcastically calls out bad judgement, vanity, etc.

article_link headline_feature is_sarcastic
https://local.theonion.com/just-take-it-slow-and-you-ll-be-fine-drunk-driver-a-1820399426 ‘just take it slow, and you’ll be fine,’ drunk driver assures self while speeding away in stolen police car 1
https://www.theonion.com/narcissist-mentally-undresses-self-1819567215 narcissist mentally undresses self 1

Also, note that the latent features we learnt using word2vec appeared quite low in importance.

feature_importance[feature_importance.Features.isin(w2v_col_names)]
  Features Weights Weights Absolute
700 c42 0.588654 0.588654
1062 c94 0.471734 0.471734
1152 c78 0.451407 0.451407
1170 c85 -0.447847 0.447847
1310 c61 0.412665 0.412665

It probably has to do with the nature of the articles on theonion.com, and with the relatively small size of the corpus, that the Bag of Words features turned out to be more informative for our model. In other settings, as the corpus grows, the large vocabulary leads to sparse, high-dimensional feature vectors, and there the low-dimensional, dense feature vectors from word2vec will likely serve us better.

Appendix

BOW & TF-IDF

  • Limitations - sparse representations, large vector size, don’t capture semantics

Word2Vec

  • Take a proxy (“fake”) prediction problem (CBOW or Skip-gram)
  • Solve it with a shallow neural network
  • You get word embeddings as a side effect
  • Two methods (a gensim toggle is sketched after this list)
    • 1. CBOW : Continuous Bag of Words - Given the context words, predict the target word
    • 2. Skip-gram - Given the target word, predict the context words
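
In gensim, switching between the two methods is a single flag; a minimal sketch reusing the token lists from earlier:

# sg=0 (default) trains CBOW; sg=1 trains Skip-gram
import gensim
cbow_model = gensim.models.Word2Vec(raw_df['tokens'], sg=0)
skipgram_model = gensim.models.Word2Vec(raw_df['tokens'], sg=1)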

GloVe : Global Vectors for Word Representation

  • Similar to Word2Vec

BERT : Bidirectional Encoder Representations from Transformers

  • Based on transformer architecture
