Introduction

Spam detection is one of the most basic applications of lexical processing. In this notebook, we will use the SMS Spam Collection dataset downloaded from

https://www.kaggle.com/uciml/sms-spam-collection-dataset

We will look at text cleaning using NLTK. We will then use bag-of-words (BoW) and TF-IDF features with Naive Bayes to classify a message as "spam" or "ham".

In [3]:

import numpy as np 
import pandas as pd 


import os


import matplotlib.pyplot as plt
import seaborn as sns

pd.set_option('display.max_colwidth', None)  # None disables truncation (pandas >= 1.0; older versions used -1)
pd.set_option('display.max_columns', None)  


Analysis/Modelling

In [4]:
data=pd.read_csv("spam.csv",encoding='latin-1')
data.shape
Out[4]:
(5572, 5)
In [5]:
data.head()
Out[5]:

We will only consider the column v1, which indicates whether the message is spam or not, and v2, which is the message itself. We will also rename the columns.

In [6]:
data=data[['v1','v2']]
data.columns=['label','message']
data.head()
Out[6]:

ham indicates non-spam messages. Let us get the distribution of spam and ham messages

In [7]:
data['label'].value_counts()
Out[7]:
ham     4825
spam    747 
Name: label, dtype: int64
In [8]:
## getting the distribution as a percentage
(data['label'].value_counts()/data.shape[0])*100
Out[8]:
ham     86.593683
spam    13.406317
Name: label, dtype: float64

Only 13% of the data is spam, so this is an imbalanced classification problem. Before we build a model, let us explore the data a little more.
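
One practical consequence of the imbalance is that plain accuracy can look good without the model learning anything. A minimal sketch, using only the class counts printed above: a hypothetical "classifier" that always answers with the majority class is already right about 87% of the time, which is the baseline any real model has to beat.

```python
# Class counts taken from the value_counts() output above
counts = {"ham": 4825, "spam": 747}
total = sum(counts.values())

# A trivial baseline that always predicts the most frequent class
majority_label = max(counts, key=counts.get)
baseline_accuracy = counts[majority_label] / total

print(majority_label)                      # ham
print(round(baseline_accuracy * 100, 2))   # 86.59
```

This is why the per-class precision and recall for spam are more informative here than overall accuracy.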

Is there a difference in the length of spam and non-spam messages?

In [9]:
data['len_message']=data['message'].apply(lambda x:len(x.split()))

In [10]:
data.head()
Out[10]:
In [11]:
data['len_message'].describe()
Out[11]:
count    5572.000000
mean     15.494436  
std      11.329427  
min      1.000000   
25%      7.000000   
50%      12.000000  
75%      23.000000  
max      171.000000 
Name: len_message, dtype: float64
In [12]:
sns.kdeplot(data['len_message']).set_title(" Distribution of Length of Message")
Out[12]:
Text(0.5,1,' Distribution of Length of Message')
Notebook Image

Does the length of message vary for ham and spam messages?

In [13]:
data.loc[data['label']=="ham","len_message"].describe()
Out[13]:
count    4825.000000
mean     14.200622  
std      11.424511  
min      1.000000   
25%      7.000000   
50%      11.000000  
75%      19.000000  
max      171.000000 
Name: len_message, dtype: float64
In [14]:
data.loc[data['label']=="spam","len_message"].describe()
Out[14]:
count    747.000000
mean     23.851406 
std      5.811898  
min      2.000000  
25%      22.000000 
50%      25.000000 
75%      28.000000 
max      35.000000 
Name: len_message, dtype: float64
In [15]:
sns.kdeplot(data.loc[data['label']=="ham","len_message"],label='ham');
sns.kdeplot(data.loc[data['label']=="spam","len_message"],label='spam');

# beautifying the labels
plt.xlabel('Length of Message')
plt.ylabel('density')
plt.legend()  # needed in newer seaborn versions to show the ham/spam labels
plt.show()
Notebook Image

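Spam messages are, on average, longer than non-spam messages (a mean of about 24 words versus about 14), and their lengths are less spread out (a standard deviation of about 6 versus about 11).

There is also one very long non-spam message, at 171 words.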
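
The two describe() calls above can also be collapsed into a single grouped summary. The sketch below uses a small toy DataFrame (the messages are invented for illustration, not taken from the dataset) so that it runs on its own; on the real data the same groupby call would be applied to data.

```python
import pandas as pd

# Toy stand-in for the real dataset; messages are made up for illustration
toy = pd.DataFrame({
    "label": ["ham", "ham", "spam", "spam"],
    "message": [
        "Ok see you soon",
        "Call me later",
        "WIN a FREE prize now call 08001234567 today",
        "URGENT claim your free cash reward text WIN to 80000",
    ],
})

# Length of each message in whitespace-separated words
toy["len_message"] = toy["message"].str.split().str.len()

# One row per class, one column per statistic
summary = toy.groupby("label")["len_message"].agg(["count", "mean", "median", "max"])
print(summary)
```

Having both classes side by side in one table makes the comparison of means and spreads easier to read than two separate outputs.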

Let us now look at the messages themselves.

Let us first tokenise the messages and remove stopwords and punctuation.

In [16]:
data['message']
Out[16]:
0       Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...                                                                                                                                                                  
1       Ok lar... Joking wif u oni...                                                                                                                                                                                                                                                    
2       Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's                                                                                                                      
3       U dun say so early hor... U c already then say...                                                                                                                                                                                                                                
4       Nah I don't think he goes to usf, he lives around here though                                                                                                                                                                                                                    
5       FreeMsg Hey there darling it's been 3 week's now and no word back! I'd like some fun you up for it still? Tb ok! XxX std chgs to send, å£1.50 to rcv                                                                                                                             
6       Even my brother is not like to speak with me. They treat me like aids patent.                                                                                                                                                                                                    
7       As per your request 'Melle Melle (Oru Minnaminunginte Nurungu Vettam)' has been set as your callertune for all Callers. Press *9 to copy your friends Callertune                                                                                                                 
8       WINNER!! As a valued network customer you have been selected to receivea å£900 prize reward! To claim call 09061701461. Claim code KL341. Valid 12 hours only.                                                                                                                   
9       Had your mobile 11 months or more? U R entitled to Update to the latest colour mobiles with camera for Free! Call The Mobile Update Co FREE on 08002986030                                                                                                                       
10      I'm gonna be home soon and i don't want to talk about this stuff anymore tonight, k? I've cried enough today.                                                                                                                                                                    
11      SIX chances to win CASH! From 100 to 20,000 pounds txt> CSH11 and send to 87575. Cost 150p/day, 6days, 16+ TsandCs apply Reply HL 4 info                                                                                                                                         
12      URGENT! You have won a 1 week FREE membership in our å£100,000 Prize Jackpot! Txt the word: CLAIM to No: 81010 T&C www.dbuk.net LCCLTD POBOX 4403LDNW1A7RW18                                                                                                                     
13      I've been searching for the right words to thank you for this breather. I promise i wont take your help for granted and will fulfil my promise. You have been wonderful and a blessing at all times.                                                                             
14      I HAVE A DATE ON SUNDAY WITH WILL!!                                                                                                                                                                                                                                              
15      XXXMobileMovieClub: To use your credit, click the WAP link in the next txt message or click here>> http://wap. xxxmobilemovieclub.com?n=QJKGIGHJJGCBL                                                                                                                            
16      Oh k...i'm watching here:)                                                                                                                                                                                                                                                       
17      Eh u remember how 2 spell his name... Yes i did. He v naughty make until i v wet.                                                                                                                                                                                                
18      Fine if thatåÕs the way u feel. ThatåÕs the way its gota b                                                                                                                                                                                                                       
19      England v Macedonia - dont miss the goals/team news. Txt ur national team to 87077 eg ENGLAND to 87077 Try:WALES, SCOTLAND 4txt/̼1.20 POBOXox36504W45WQ 16+                                                                                                                     
20      Is that seriously how you spell his name?                                                                                                                                                                                                                                        
21      I‰Û÷m going to try for 2 months ha ha only joking                                                                                                                                                                                                                                
22      So Ì_ pay first lar... Then when is da stock comin...                                                                                                                                                                                                                            
23      Aft i finish my lunch then i go str down lor. Ard 3 smth lor. U finish ur lunch already?                                                                                                                                                                                         
24      Ffffffffff. Alright no way I can meet up with you sooner?                                                                                                                                                                                                                        
25      Just forced myself to eat a slice. I'm really not hungry tho. This sucks. Mark is getting worried. He knows I'm sick when I turn down pizza. Lol                                                                                                                                 
26      Lol your always so convincing.                                                                                                                                                                                                                                                   
27      Did you catch the bus ? Are you frying an egg ? Did you make a tea? Are you eating your mom's left over dinner ? Do you feel my Love ?                                                                                                                                           
28      I'm back & we're packing the car now, I'll let you know if there's room                                                                                                                                                                                                      
29      Ahhh. Work. I vaguely remember that! What does it feel like? Lol                                                                                                                                                                                                                 
                                      ...                                                                                                                                                                                                                                                
5542    Armand says get your ass over to epsilon                                                                                                                                                                                                                                         
5543    U still havent got urself a jacket ah?                                                                                                                                                                                                                                           
5544    I'm taking derek & taylor to walmart, if I'm not back by the time you're done just leave the mouse on my desk and I'll text you when priscilla's ready                                                                                                                       
5545    Hi its in durban are you still on this number                                                                                                                                                                                                                                    
5546    Ic. There are a lotta childporn cars then.                                                                                                                                                                                                                                       
5547    Had your contract mobile 11 Mnths? Latest Motorola, Nokia etc. all FREE! Double Mins & Text on Orange tariffs. TEXT YES for callback, no to remove from records.                                                                                                                 
5548    No, I was trying it all weekend ;V                                                                                                                                                                                                                                               
5549    You know, wot people wear. T shirts, jumpers, hat, belt, is all we know. We r at Cribbs                                                                                                                                                                                          
5550    Cool, what time you think you can get here?                                                                                                                                                                                                                                      
5551    Wen did you get so spiritual and deep. That's great                                                                                                                                                                                                                              
5552    Have a safe trip to Nigeria. Wish you happiness and very soon company to share moments with                                                                                                                                                                                      
5553    Hahaha..use your brain dear                                                                                                                                                                                                                                                      
5554    Well keep in mind I've only got enough gas for one more round trip barring a sudden influx of cash                                                                                                                                                                               
5555    Yeh. Indians was nice. Tho it did kane me off a bit he he. We shud go out 4 a drink sometime soon. Mite hav 2 go 2 da works 4 a laugh soon. Love Pete x x                                                                                                                        
5556    Yes i have. So that's why u texted. Pshew...missing you so much                                                                                                                                                                                                                  
5557    No. I meant the calculation is the same. That  <#> units at  <#> . This school is really expensive. Have you started practicing your accent. Because its important. And have you decided if you are doing 4years of dental school or if you'll just do the nmde exam.
5558    Sorry, I'll call later                                                                                                                                                                                                                                                           
5559    if you aren't here in the next  <#>  hours imma flip my shit                                                                                                                                                                                                               
5560    Anything lor. Juz both of us lor.                                                                                                                                                                                                                                                
5561    Get me out of this dump heap. My mom decided to come to lowes. BORING.                                                                                                                                                                                                           
5562    Ok lor... Sony ericsson salesman... I ask shuhui then she say quite gd 2 use so i considering...                                                                                                                                                                                 
5563    Ard 6 like dat lor.                                                                                                                                                                                                                                                              
5564    Why don't you wait 'til at least wednesday to see if you get your .                                                                                                                                                                                                              
5565    Huh y lei...                                                                                                                                                                                                                                                                     
5566    REMINDER FROM O2: To get 2.50 pounds free call credit and details of great offers pls reply 2 this text with your valid name, house no and postcode                                                                                                                              
5567    This is the 2nd time we have tried 2 contact u. U have won the å£750 Pound prize. 2 claim is easy, call 087187272008 NOW1! Only 10p per minute. BT-national-rate.                                                                                                                
5568    Will Ì_ b going to esplanade fr home?                                                                                                                                                                                                                                            
5569    Pity, * was in mood for that. So...any other suggestions?                                                                                                                                                                                                                        
5570    The guy did some bitching but I acted like i'd be interested in buying something else next week and he gave it to us for free                                                                                                                                                    
5571    Rofl. Its true to its name                                                                                                                                                                                                                                                       
Name: message, Length: 5572, dtype: object
In [17]:
import string
from nltk.corpus import stopwords
from nltk import PorterStemmer as Stemmer
In [18]:
STOPWORDS=stopwords.words("english")  # needs the NLTK stopwords corpus; run nltk.download('stopwords') once if it is missing
In [19]:
STOPWORDS
Out[19]:
['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 "you're",
 "you've",
 "you'll",
 "you'd",
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his',
 'himself',
 'she',
 "she's",
 'her',
 'hers',
 'herself',
 'it',
 "it's",
 'its',
 'itself',
 'they',
 'them',
 'their',
 'theirs',
 'themselves',
 'what',
 'which',
 'who',
 'whom',
 'this',
 'that',
 "that'll",
 'these',
 'those',
 'am',
 'is',
 'are',
 'was',
 'were',
 'be',
 'been',
 'being',
 'have',
 'has',
 'had',
 'having',
 'do',
 'does',
 'did',
 'doing',
 'a',
 'an',
 'the',
 'and',
 'but',
 'if',
 'or',
 'because',
 'as',
 'until',
 'while',
 'of',
 'at',
 'by',
 'for',
 'with',
 'about',
 'against',
 'between',
 'into',
 'through',
 'during',
 'before',
 'after',
 'above',
 'below',
 'to',
 'from',
 'up',
 'down',
 'in',
 'out',
 'on',
 'off',
 'over',
 'under',
 'again',
 'further',
 'then',
 'once',
 'here',
 'there',
 'when',
 'where',
 'why',
 'how',
 'all',
 'any',
 'both',
 'each',
 'few',
 'more',
 'most',
 'other',
 'some',
 'such',
 'no',
 'nor',
 'not',
 'only',
 'own',
 'same',
 'so',
 'than',
 'too',
 'very',
 's',
 't',
 'can',
 'will',
 'just',
 'don',
 "don't",
 'should',
 "should've",
 'now',
 'd',
 'll',
 'm',
 'o',
 're',
 've',
 'y',
 'ain',
 'aren',
 "aren't",
 'couldn',
 "couldn't",
 'didn',
 "didn't",
 'doesn',
 "doesn't",
 'hadn',
 "hadn't",
 'hasn',
 "hasn't",
 'haven',
 "haven't",
 'isn',
 "isn't",
 'ma',
 'mightn',
 "mightn't",
 'mustn',
 "mustn't",
 'needn',
 "needn't",
 'shan',
 "shan't",
 'shouldn',
 "shouldn't",
 'wasn',
 "wasn't",
 'weren',
 "weren't",
 'won',
 "won't",
 'wouldn',
 "wouldn't"]

Let us do it for one message and then create a function to do all the preprocessing steps

In [20]:
test_doc="Ok lar... Joking wif u oni..."

## Remove punctuation
test_doc_cleaned="".join([x for x in test_doc if x not in string.punctuation])
test_doc_cleaned
Out[20]:
'Ok lar Joking wif u oni'
In [21]:
## Lower case all words
test_doc_cleaned=test_doc_cleaned.lower()
test_doc_cleaned
Out[21]:
'ok lar joking wif u oni'
In [22]:
## Let us remove the stopwords
test_tokens=test_doc_cleaned.split(" ")
test_tokens=[token for token in test_tokens if token not in STOPWORDS]
test_tokens
Out[22]:
['ok', 'lar', 'joking', 'wif', 'u', 'oni']
In [23]:
## Stem the words
from nltk.stem import PorterStemmer
ps = PorterStemmer() 
test_doc_cleaned=" ".join([ps.stem(token) for token in test_tokens])
test_doc_cleaned
Out[23]:
'ok lar joke wif u oni'

Let us now create a function to do all the above steps

In [24]:
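import string   
import re

def clean_text(text):
    ps=PorterStemmer()
    # pad every punctuation mark with spaces so that words joined by punctuation get separated
    text = text.translate(str.maketrans({key: " {0} ".format(key) for key in string.punctuation}))
    # drop the punctuation marks themselves
    text_cleaned="".join([x for x in text if x not in string.punctuation])
    # remove extra white space and lower-case the text
    text_cleaned=re.sub(' +', ' ', text_cleaned)
    text_cleaned=text_cleaned.lower()
    # split() without arguments also discards empty tokens from leading/trailing spaces
    tokens=text_cleaned.split()
    # remove stopwords, then stem the remaining tokens
    tokens=[token for token in tokens if token not in STOPWORDS]
    text_cleaned=" ".join([ps.stem(token) for token in tokens])
    return text_cleaned


print(clean_text(test_doc))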
ok lar joke wif u oni

Let us clean all the messages in the dataset...

In [25]:
data['cleaned_messages']=data['message'].apply(lambda x:clean_text(x))
data.head()
Out[25]:

Let us look at the most common words in spam and ham messages. You can either plot a frequency bar plot or build a word cloud.

In [26]:
from wordcloud import WordCloud
wordcloud = WordCloud(height=2000, width=2000, stopwords=set(stopwords.words('english')), background_color='white')
wordcloud = wordcloud.generate(' '.join(data.loc[data['label']=='spam','cleaned_messages'].tolist()))
plt.imshow(wordcloud)
plt.title("Most common words in spam SMS")
plt.axis('off')
plt.show()
Notebook Image
In [27]:
wordcloud = WordCloud(height=2000, width=2000, stopwords=set(stopwords.words('english')), background_color='white')
wordcloud = wordcloud.generate(' '.join(data.loc[data['label']=='ham','cleaned_messages'].tolist()))
plt.imshow(wordcloud)
plt.title("Most common words in Ham SMS")
plt.axis('off')
plt.show()
Notebook Image

Spam messages read like marketing messages, with words such as free, call and text. Even a simple word cloud shows how different the words in ham and spam messages are.
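
The frequency-based alternative to a word cloud can be sketched with a plain counter. The helper name top_terms and the toy messages below are assumptions made for illustration; on the real data the input would be the cleaned_messages column filtered by label.

```python
from collections import Counter

def top_terms(cleaned_messages, n=3):
    """Return the n most frequent whitespace-separated terms."""
    counts = Counter()
    for message in cleaned_messages:
        counts.update(message.split())
    return counts.most_common(n)

# Toy cleaned messages, invented for illustration
toy_spam = ["free entri win prize", "free call claim prize", "call free txt"]
top = top_terms(toy_spam)
print(top)
```

The resulting (term, count) pairs could then be passed to a bar plot, which makes the relative frequencies easier to compare than font sizes in a word cloud.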

Let us build a BoW model and then use Naive Bayes to predict whether a message is spam or ham.

For building our model, we will use only the text, but as practice, try creating more features, such as the length of the message, to separate spam from ham.

In [28]:
from sklearn.feature_extraction.text import CountVectorizer

In [29]:
bow=CountVectorizer()
bow_data = bow.fit_transform(data['cleaned_messages'])
In [30]:
len(bow.vocabulary_) # Get number of words in vocabulary
Out[30]:
7219
In [31]:
## Let us take an SMS and see how the BoW model has transformed it

text=data.iloc[2]['cleaned_messages']
text
Out[31]:
'free entri 2 wkli comp win fa cup final tkt 21st may 2005 text fa 87121 receiv entri question std txt rate c appli 08452810075over18'
In [32]:
text_transform=bow.transform([text])
print(text_transform)
  (0, 77)	1
  (0, 401)	1
  (0, 410)	1
  (0, 780)	1
  (0, 1099)	1
  (0, 1929)	1
  (0, 2094)	1
  (0, 2548)	2
  (0, 2668)	2
  (0, 2764)	1
  (0, 2879)	1
  (0, 4176)	1
  (0, 5220)	1
  (0, 5265)	1
  (0, 5304)	1
  (0, 6030)	1
  (0, 6328)	1
  (0, 6442)	1
  (0, 6601)	1
  (0, 6986)	1
  (0, 7018)	1

To understand better

In [33]:
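j = bow.transform([text]).toarray()[0]
feature_names = bow.get_feature_names_out()  # use get_feature_names() on scikit-learn versions older than 1.0

print('index\tterm\tcount')
for i in range(len(j)):
    if j[i] != 0:
        print(i, feature_names[i],j[i],sep='\t')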
index	term	count
77	08452810075over18	1
401	2005	1
410	21st	1
780	87121	1
1099	appli	1
1929	comp	1
2094	cup	1
2548	entri	2
2668	fa	2
2764	final	1
2879	free	1
4176	may	1
5220	question	1
5265	rate	1
5304	receiv	1
6030	std	1
6328	text	1
6442	tkt	1
6601	txt	1
6986	win	1
7018	wkli	1
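
To see what the vectoriser is doing, here is a hand-rolled sketch of the same idea on three toy documents (invented for illustration). It is not how CountVectorizer is implemented: it tokenises on whitespace only, whereas CountVectorizer's default token pattern keeps only tokens of two or more word characters, which is why the single-character tokens 2 and c from the cleaned message do not appear in the output above.

```python
from collections import Counter

# Toy cleaned documents, invented for illustration
docs = ["free entri win", "ok lar joke", "free free txt"]

# Vocabulary: every distinct term, sorted, mapped to a column index
terms = sorted({term for doc in docs for term in doc.split()})
vocabulary = {term: index for index, term in enumerate(terms)}

def to_counts(doc):
    """Turn one document into a list of term counts, one slot per vocabulary term."""
    row = [0] * len(vocabulary)
    for term, count in Counter(doc.split()).items():
        if term in vocabulary:  # terms not seen when building the vocabulary are ignored
            row[vocabulary[term]] = count
    return row

print(vocabulary)
print(to_counts("free free txt"))
```

Each document becomes a fixed-length vector of counts, and word order is discarded; that is the "bag" in bag-of-words.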

Modelling Using Naive Bayes

Let us convert the BoW matrix to a dataframe, with the feature names as the column names.

In [34]:
bow_df=pd.DataFrame(bow_data.toarray(),columns=bow.get_feature_names_out())  # get_feature_names() on scikit-learn < 1.0

In [35]:
bow_df['is_spam']=data['label']

Let us now apply Naive Bayes.

Before modelling, we need to split the data into train and test sets.

In [36]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split



In [37]:
# stratify keeps the ham/spam proportions the same in the train and test sets
x_train, x_test, y_train, y_test = train_test_split(bow_df.drop(columns='is_spam'), bow_df['is_spam'], test_size=0.20, random_state=42, stratify=bow_df['is_spam'])
In [38]:
spamFilter_nb=MultinomialNB()
spamFilter_nb.fit(x_train,y_train)
Out[38]:
MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)
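
To make the fitted model less of a black box, the sketch below reimplements the core of multinomial Naive Bayes with add-one (alpha=1.0) smoothing and priors estimated from the data, matching the parameters shown above. The function names train_nb and predict_nb and the toy messages are assumptions for illustration; this is a teaching sketch, not scikit-learn's implementation.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels, alpha=1.0):
    """Estimate log priors and smoothed log likelihoods from tokenised text."""
    vocab = sorted({term for doc in docs for term in doc.split()})
    class_sizes = Counter(labels)
    term_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        term_counts[label].update(doc.split())

    log_prior = {c: math.log(n / len(docs)) for c, n in class_sizes.items()}
    log_likelihood = {}
    for c in class_sizes:
        # Smoothed denominator: total term count in the class plus alpha per vocabulary term
        denominator = sum(term_counts[c].values()) + alpha * len(vocab)
        log_likelihood[c] = {
            term: math.log((term_counts[c][term] + alpha) / denominator)
            for term in vocab
        }
    return log_prior, log_likelihood

def predict_nb(doc, log_prior, log_likelihood):
    """Pick the class with the highest log posterior; unseen terms are skipped."""
    scores = {}
    for c in log_prior:
        scores[c] = log_prior[c] + sum(
            log_likelihood[c][term] for term in doc.split() if term in log_likelihood[c]
        )
    return max(scores, key=scores.get)

# Toy training data, invented for illustration
docs = ["free prize call", "free txt win", "ok see later", "call later ok"]
labels = ["spam", "spam", "ham", "ham"]
log_prior, log_likelihood = train_nb(docs, labels)

print(predict_nb("free win prize", log_prior, log_likelihood))
print(predict_nb("ok later", log_prior, log_likelihood))
```

The smoothing term is what keeps a word that never occurred in one class from driving that class's probability to zero.
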
In [39]:
predictions = spamFilter_nb.predict(x_test)
In [40]:
from sklearn.metrics import classification_report
In [41]:
print(classification_report(y_test, predictions))  # argument order is (y_true, y_pred)

              precision    recall  f1-score   support

         ham       0.99      0.99      0.99       966
        spam       0.95      0.95      0.95       149

   micro avg       0.99      0.99      0.99      1115
   macro avg       0.97      0.97      0.97      1115
weighted avg       0.99      0.99      0.99      1115
In [42]:
from sklearn.metrics import confusion_matrix
confusion_matrix(y_test, predictions)
Out[42]:
array([[959,   7],
       [  7, 142]])
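
The numbers in the classification report can be recovered by hand from this matrix. In scikit-learn's confusion_matrix, rows are the true labels and columns are the predicted labels, here in the sorted order ham, spam. The sketch below treats spam as the positive class; the variable names are chosen for illustration.

```python
# Confusion matrix from the output above: rows = true, columns = predicted, order [ham, spam]
cm = [[959, 7],
      [7, 142]]

tn, fp = cm[0]  # true ham: predicted ham, predicted spam
fn, tp = cm[1]  # true spam: predicted ham, predicted spam

precision_spam = tp / (tp + fp)  # of the messages flagged as spam, how many really are spam
recall_spam = tp / (tp + fn)     # of the real spam messages, how many were caught
f1_spam = 2 * precision_spam * recall_spam / (precision_spam + recall_spam)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(round(precision_spam, 2))  # 0.95
print(round(recall_spam, 2))     # 0.95
print(round(accuracy, 2))        # 0.99
```

The seven false positives are legitimate messages that would land in a spam folder, and the seven false negatives are spam messages that would reach the inbox.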

Let us look at the cases where "ham" has been predicted as "spam".

In [43]:
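test=x_test.copy()  # work on a copy so that adding columns does not modify x_test or raise SettingWithCopyWarning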
test['is_spam']=y_test
test['bow_prediction']=predictions
wrong_index=test[(test['is_spam']=='ham') & (test['bow_prediction']=="spam")].index
wrong_index
Out[43]:
Int64Index([2635, 2569, 3888, 1742, 1234, 4860, 5044], dtype='int64')
In [44]:
bow_misclassified=data.loc[wrong_index]  # wrong_index holds index labels, so select with .loc
bow_misclassified
Out[44]:

What are the cases where "spam" has been classified as "ham"?

In [45]:
wrong_index=test[(test['is_spam']=='spam') & (test['bow_prediction']=="ham")].index
wrong_index
Out[45]:
Int64Index([855, 3358, 5449, 1939, 2821, 750, 2246], dtype='int64')
In [46]:
bow_misclassified=data.loc[wrong_index]  # wrong_index holds index labels, so select with .loc
bow_misclassified
Out[46]:

A quick look at these messages shows that a message containing a phone number is more likely to be spam. We can use features like this as well to separate spam from ham.
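
One way such a feature could be built is with a regular expression. The sketch below rests on a loud assumption: a "phone number" is approximated as any run of ten or more consecutive digits, which catches the long numbers seen in the spam samples but not five-digit short codes. The helper name has_phone_number is hypothetical.

```python
import re

# Assumption: a phone number is approximated as 10 or more consecutive digits
PHONE_PATTERN = re.compile(r"\d{10,}")

def has_phone_number(message):
    """Return True if the raw message contains a long run of digits."""
    return bool(PHONE_PATTERN.search(message))

# Messages taken from the dataset preview earlier in the notebook
print(has_phone_number("WINNER!! As a valued network customer you have been selected. To claim call 09061701461."))
print(has_phone_number("Ok lar... Joking wif u oni..."))
```

The feature should be computed on the raw message rather than on the cleaned text; on the real data it could be added as a column with data['message'].apply(has_phone_number).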

As an exercise, use TfidfVectorizer instead of CountVectorizer to build the model on TF-IDF features. Also, try to incorporate other features into the model.
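
A possible starting point for this exercise is sketched below. It chains TfidfVectorizer and MultinomialNB in a scikit-learn pipeline and fits it on a handful of toy messages invented for illustration; it is not the notebook author's solution. On the real data the pipeline would be fitted on the training split of cleaned_messages instead.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy cleaned messages and labels, invented for illustration
toy_messages = [
    "free prize call now",
    "win free cash txt now",
    "claim free prize call",
    "ok see you later",
    "call me later ok",
    "see you at home",
]
toy_labels = ["spam", "spam", "spam", "ham", "ham", "ham"]

# The pipeline learns the TF-IDF vocabulary and the classifier in one fit call
tfidf_nb = make_pipeline(TfidfVectorizer(), MultinomialNB())
tfidf_nb.fit(toy_messages, toy_labels)

toy_predictions = tfidf_nb.predict(["free cash prize", "ok see you"])
print(toy_predictions)
```

Keeping the vectoriser inside the pipeline means the vocabulary and IDF weights are learned from the training messages only, which avoids leaking information from the test set.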
