Spam detection is one of the most basic applications of lexical processing. In this notebook, we will use the Spam SMS dataset downloaded from
https://www.kaggle.com/uciml/sms-spam-collection-dataset
We will look at text cleaning using NLTK, and use Bag-of-Words (BoW) and TF-IDF features with Naive Bayes to classify a message as "spam" or "ham".
import numpy as np
import pandas as pd
import os
import matplotlib.pyplot as plt
import seaborn as sns
pd.set_option('display.max_colwidth', None)  # show full message text; -1 is deprecated in newer pandas
pd.set_option('display.max_columns', None)
data=pd.read_csv("spam.csv",encoding='latin-1')
data.shape
(5572, 5)
data.head()
data=data[['v1','v2']]
data.columns=['label','message']
data.head()
data['label'].value_counts()
ham 4825
spam 747
Name: label, dtype: int64
## getting the distribution in percentage
(data['label'].value_counts()/data.shape[0])*100
ham 86.593683
spam 13.406317
Name: label, dtype: float64
Only 13% of the data is spam, so this is an imbalanced classification problem. Before we build a model, let us explore the data a little more.
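Because of this imbalance, plain accuracy is a weak yardstick: a model that always predicts "ham" is already about 87% accurate. A quick sketch of that majority-class baseline:
# Trivial baseline: always predict the majority class ("ham").
# Any useful model must beat this on spam recall, not just overall accuracy.
baseline_accuracy = (data['label'] == 'ham').mean()
baseline_accuracy  # ~0.866, matching the 86.59% computed above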
data['len_message']=data['message'].apply(lambda x:len(x.split()))
data.head()
data['len_message'].describe()
count 5572.000000
mean 15.494436
std 11.329427
min 1.000000
25% 7.000000
50% 12.000000
75% 23.000000
max 171.000000
Name: len_message, dtype: float64
sns.kdeplot(data['len_message']).set_title(" Distribution of Length of Message")
Text(0.5,1,' Distribution of Length of Message')
data.loc[data['label']=="ham","len_message"].describe()
count 4825.000000
mean 14.200622
std 11.424511
min 1.000000
25% 7.000000
50% 11.000000
75% 19.000000
max 171.000000
Name: len_message, dtype: float64
data.loc[data['label']=="spam","len_message"].describe()
count 747.000000
mean 23.851406
std 5.811898
min 2.000000
25% 22.000000
50% 25.000000
75% 28.000000
max 35.000000
Name: len_message, dtype: float64
sns.kdeplot(data.loc[data['label']=="ham","len_message"],label='ham');
sns.kdeplot(data.loc[data['label']=="spam","len_message"],label='spam');
# beautifying the labels
plt.xlabel('Length of Message')
plt.ylabel('density')
plt.show()
Spam messages are, on average, longer than non-spam messages.
There is also one very long non-spam message.
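The two per-class describe() calls above can also be collapsed into a single groupby for a side-by-side comparison:
data.groupby('label')['len_message'].describe()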
Let us first tokenise the words and remove stopwords and punctuation.
data['message']
0 Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...
1 Ok lar... Joking wif u oni...
2 Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's
3 U dun say so early hor... U c already then say...
4 Nah I don't think he goes to usf, he lives around here though
5 FreeMsg Hey there darling it's been 3 week's now and no word back! I'd like some fun you up for it still? Tb ok! XxX std chgs to send, å£1.50 to rcv
6 Even my brother is not like to speak with me. They treat me like aids patent.
7 As per your request 'Melle Melle (Oru Minnaminunginte Nurungu Vettam)' has been set as your callertune for all Callers. Press *9 to copy your friends Callertune
8 WINNER!! As a valued network customer you have been selected to receivea å£900 prize reward! To claim call 09061701461. Claim code KL341. Valid 12 hours only.
9 Had your mobile 11 months or more? U R entitled to Update to the latest colour mobiles with camera for Free! Call The Mobile Update Co FREE on 08002986030
10 I'm gonna be home soon and i don't want to talk about this stuff anymore tonight, k? I've cried enough today.
11 SIX chances to win CASH! From 100 to 20,000 pounds txt> CSH11 and send to 87575. Cost 150p/day, 6days, 16+ TsandCs apply Reply HL 4 info
12 URGENT! You have won a 1 week FREE membership in our å£100,000 Prize Jackpot! Txt the word: CLAIM to No: 81010 T&C www.dbuk.net LCCLTD POBOX 4403LDNW1A7RW18
13 I've been searching for the right words to thank you for this breather. I promise i wont take your help for granted and will fulfil my promise. You have been wonderful and a blessing at all times.
14 I HAVE A DATE ON SUNDAY WITH WILL!!
15 XXXMobileMovieClub: To use your credit, click the WAP link in the next txt message or click here>> http://wap. xxxmobilemovieclub.com?n=QJKGIGHJJGCBL
16 Oh k...i'm watching here:)
17 Eh u remember how 2 spell his name... Yes i did. He v naughty make until i v wet.
18 Fine if thatåÕs the way u feel. ThatåÕs the way its gota b
19 England v Macedonia - dont miss the goals/team news. Txt ur national team to 87077 eg ENGLAND to 87077 Try:WALES, SCOTLAND 4txt/̼1.20 POBOXox36504W45WQ 16+
20 Is that seriously how you spell his name?
21 IÛ÷m going to try for 2 months ha ha only joking
22 So Ì_ pay first lar... Then when is da stock comin...
23 Aft i finish my lunch then i go str down lor. Ard 3 smth lor. U finish ur lunch already?
24 Ffffffffff. Alright no way I can meet up with you sooner?
25 Just forced myself to eat a slice. I'm really not hungry tho. This sucks. Mark is getting worried. He knows I'm sick when I turn down pizza. Lol
26 Lol your always so convincing.
27 Did you catch the bus ? Are you frying an egg ? Did you make a tea? Are you eating your mom's left over dinner ? Do you feel my Love ?
28 I'm back & we're packing the car now, I'll let you know if there's room
29 Ahhh. Work. I vaguely remember that! What does it feel like? Lol
...
5542 Armand says get your ass over to epsilon
5543 U still havent got urself a jacket ah?
5544 I'm taking derek & taylor to walmart, if I'm not back by the time you're done just leave the mouse on my desk and I'll text you when priscilla's ready
5545 Hi its in durban are you still on this number
5546 Ic. There are a lotta childporn cars then.
5547 Had your contract mobile 11 Mnths? Latest Motorola, Nokia etc. all FREE! Double Mins & Text on Orange tariffs. TEXT YES for callback, no to remove from records.
5548 No, I was trying it all weekend ;V
5549 You know, wot people wear. T shirts, jumpers, hat, belt, is all we know. We r at Cribbs
5550 Cool, what time you think you can get here?
5551 Wen did you get so spiritual and deep. That's great
5552 Have a safe trip to Nigeria. Wish you happiness and very soon company to share moments with
5553 Hahaha..use your brain dear
5554 Well keep in mind I've only got enough gas for one more round trip barring a sudden influx of cash
5555 Yeh. Indians was nice. Tho it did kane me off a bit he he. We shud go out 4 a drink sometime soon. Mite hav 2 go 2 da works 4 a laugh soon. Love Pete x x
5556 Yes i have. So that's why u texted. Pshew...missing you so much
5557 No. I meant the calculation is the same. That <#> units at <#> . This school is really expensive. Have you started practicing your accent. Because its important. And have you decided if you are doing 4years of dental school or if you'll just do the nmde exam.
5558 Sorry, I'll call later
5559 if you aren't here in the next <#> hours imma flip my shit
5560 Anything lor. Juz both of us lor.
5561 Get me out of this dump heap. My mom decided to come to lowes. BORING.
5562 Ok lor... Sony ericsson salesman... I ask shuhui then she say quite gd 2 use so i considering...
5563 Ard 6 like dat lor.
5564 Why don't you wait 'til at least wednesday to see if you get your .
5565 Huh y lei...
5566 REMINDER FROM O2: To get 2.50 pounds free call credit and details of great offers pls reply 2 this text with your valid name, house no and postcode
5567 This is the 2nd time we have tried 2 contact u. U have won the å£750 Pound prize. 2 claim is easy, call 087187272008 NOW1! Only 10p per minute. BT-national-rate.
5568 Will Ì_ b going to esplanade fr home?
5569 Pity, * was in mood for that. So...any other suggestions?
5570 The guy did some bitching but I acted like i'd be interested in buying something else next week and he gave it to us for free
5571 Rofl. Its true to its name
Name: message, Length: 5572, dtype: object
import string
from nltk.corpus import stopwords
from nltk import PorterStemmer as Stemmer
STOPWORDS=stopwords.words("english")
STOPWORDS
['i',
'me',
'my',
'myself',
'we',
'our',
'ours',
'ourselves',
'you',
"you're",
"you've",
"you'll",
"you'd",
'your',
'yours',
'yourself',
'yourselves',
'he',
'him',
'his',
'himself',
'she',
"she's",
'her',
'hers',
'herself',
'it',
"it's",
'its',
'itself',
'they',
'them',
'their',
'theirs',
'themselves',
'what',
'which',
'who',
'whom',
'this',
'that',
"that'll",
'these',
'those',
'am',
'is',
'are',
'was',
'were',
'be',
'been',
'being',
'have',
'has',
'had',
'having',
'do',
'does',
'did',
'doing',
'a',
'an',
'the',
'and',
'but',
'if',
'or',
'because',
'as',
'until',
'while',
'of',
'at',
'by',
'for',
'with',
'about',
'against',
'between',
'into',
'through',
'during',
'before',
'after',
'above',
'below',
'to',
'from',
'up',
'down',
'in',
'out',
'on',
'off',
'over',
'under',
'again',
'further',
'then',
'once',
'here',
'there',
'when',
'where',
'why',
'how',
'all',
'any',
'both',
'each',
'few',
'more',
'most',
'other',
'some',
'such',
'no',
'nor',
'not',
'only',
'own',
'same',
'so',
'than',
'too',
'very',
's',
't',
'can',
'will',
'just',
'don',
"don't",
'should',
"should've",
'now',
'd',
'll',
'm',
'o',
're',
've',
'y',
'ain',
'aren',
"aren't",
'couldn',
"couldn't",
'didn',
"didn't",
'doesn',
"doesn't",
'hadn',
"hadn't",
'hasn',
"hasn't",
'haven',
"haven't",
'isn',
"isn't",
'ma',
'mightn',
"mightn't",
'mustn',
"mustn't",
'needn',
"needn't",
'shan',
"shan't",
'shouldn',
"shouldn't",
'wasn',
"wasn't",
'weren',
"weren't",
'won',
"won't",
'wouldn',
"wouldn't"]
test_doc="Ok lar... Joking wif u oni..."
## Remove punctuation
test_doc_cleaned="".join([x for x in test_doc if x not in string.punctuation])
test_doc_cleaned
'Ok lar Joking wif u oni'
## Lower case all words
test_doc_cleaned=test_doc_cleaned.lower()
test_doc_cleaned
'ok lar joking wif u oni'
## Let us remove the stopwords
test_tokens=test_doc_cleaned.split(" ")
test_tokens=[token for token in test_tokens if token not in STOPWORDS]
test_tokens
['ok', 'lar', 'joking', 'wif', 'u', 'oni']
## Stem the words
from nltk.stem import PorterStemmer
ps = PorterStemmer()
test_doc_cleaned=" ".join([ps.stem(token) for token in test_tokens])
test_doc_cleaned
'ok lar joke wif u oni'
import string
import re
def clean_text(text):
    ps = PorterStemmer()
    # pad punctuation with spaces so it separates cleanly from adjacent words
    text = text.translate(str.maketrans({key: " {0} ".format(key) for key in string.punctuation}))
    # drop the punctuation characters themselves
    text_cleaned = "".join([x for x in text if x not in string.punctuation])
    # collapse the extra whitespace left behind
    text_cleaned = re.sub(' +', ' ', text_cleaned)
    # lower-case, tokenise, remove stopwords, and stem
    text_cleaned = text_cleaned.lower()
    tokens = text_cleaned.split(" ")
    tokens = [token for token in tokens if token not in STOPWORDS]
    text_cleaned = " ".join([ps.stem(token) for token in tokens])
    return text_cleaned
print(clean_text(test_doc))
ok lar joke wif u oni
data['cleaned_messages']=data['message'].apply(lambda x:clean_text(x))
data.head()
from wordcloud import WordCloud
wordcloud = WordCloud(height=2000, width=2000, stopwords=set(stopwords.words('english')), background_color='white')
wordcloud = wordcloud.generate(' '.join(data.loc[data['label']=='spam','cleaned_messages'].tolist()))
plt.imshow(wordcloud)
plt.title("Most common words in spam SMS")
plt.axis('off')
plt.show()
wordcloud = WordCloud(height=2000, width=2000, stopwords=set(stopwords.words('english')), background_color='white')
wordcloud = wordcloud.generate(' '.join(data.loc[data['label']=='ham','cleaned_messages'].tolist()))
plt.imshow(wordcloud)
plt.title("Most common words in Ham SMS")
plt.axis('off')
plt.show()
Spam messages read like marketing copy, with words like free, call, and text. Even a simple word cloud shows how differently words are distributed across ham and spam messages.
For building our model we will use only the text, but as an exercise, try creating more features, such as the length of the message, to separate spam from ham (a sketch follows the BoW section below).
from sklearn.feature_extraction.text import CountVectorizer
bow=CountVectorizer()
bow_data = bow.fit_transform(data['cleaned_messages'])
len(bow.vocabulary_) # Get number of words in vocabulary
7219
## Let us take an SMS and see how the BoW model has transformed it
text=data.iloc[2]['cleaned_messages']
text
'free entri 2 wkli comp win fa cup final tkt 21st may 2005 text fa 87121 receiv entri question std txt rate c appli 08452810075over18'
text_transform=bow.transform([text])
print(text_transform)
(0, 77) 1
(0, 401) 1
(0, 410) 1
(0, 780) 1
(0, 1099) 1
(0, 1929) 1
(0, 2094) 1
(0, 2548) 2
(0, 2668) 2
(0, 2764) 1
(0, 2879) 1
(0, 4176) 1
(0, 5220) 1
(0, 5265) 1
(0, 5304) 1
(0, 6030) 1
(0, 6328) 1
(0, 6442) 1
(0, 6601) 1
(0, 6986) 1
(0, 7018) 1
To understand this better, map each non-zero index back to its vocabulary term:
j = bow.transform([text]).toarray()[0]
print('index\tterm\tcount')
for i in range(len(j)):
    if j[i] != 0:
        print(i, bow.get_feature_names()[i], j[i], sep='\t')
index term count
77 08452810075over18 1
401 2005 1
410 21st 1
780 87121 1
1099 appli 1
1929 comp 1
2094 cup 1
2548 entri 2
2668 fa 2
2764 final 1
2879 free 1
4176 may 1
5220 question 1
5265 rate 1
5304 receiv 1
6030 std 1
6328 text 1
6442 tkt 1
6601 txt 1
6986 win 1
7018 wkli 1
bow_df=pd.DataFrame(bow_data.toarray(),columns= bow.get_feature_names())
bow_df['is_spam']=data['label']
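The dense bow_df above materialises the full 5572 x 7219 array, which is fine at this size but wasteful at scale; MultinomialNB also accepts the sparse bow_data directly. Picking up the earlier suggestion about extra features, here is a hypothetical sketch that stacks the message-length column onto the sparse BoW counts:
from scipy.sparse import hstack, csr_matrix
# Hypothetical: append len_message as one extra (non-negative) column,
# keeping everything sparse rather than going through the dense DataFrame
extra_features = csr_matrix(data[['len_message']].values)
bow_plus_len = hstack([bow_data, extra_features])
bow_plus_len.shape  # (5572, 7220)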
Before modelling, we need to split the data into train and test sets.
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(bow_df[bow.get_feature_names()], bow_df['is_spam'], test_size=0.20, random_state=42, stratify=bow_df['is_spam'])  # stratified split keeps the ham/spam distribution the same in train and test
spamFilter_nb=MultinomialNB()
spamFilter_nb.fit(x_train,y_train)
MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)
predictions = spamFilter_nb.predict(x_test)
from sklearn.metrics import classification_report
print(classification_report(y_test, predictions))
              precision    recall  f1-score   support

         ham       0.99      0.99      0.99       966
        spam       0.95      0.95      0.95       149

   micro avg       0.99      0.99      0.99      1115
   macro avg       0.97      0.97      0.97      1115
weighted avg       0.99      0.99      0.99      1115
from sklearn.metrics import confusion_matrix
confusion_matrix(y_test, predictions)
array([[959, 7],
[ 7, 142]])
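The spam row of the classification report follows directly from this matrix: 7 ham messages were flagged as spam (false positives) and 7 spam messages slipped through (false negatives), so spam precision and recall are both 142/149 ≈ 0.95. A quick cross-check:
cm = confusion_matrix(y_test, predictions)  # rows: true (ham, spam); columns: predicted
precision_spam = cm[1, 1] / cm[:, 1].sum()  # 142 / 149 ≈ 0.953
recall_spam = cm[1, 1] / cm[1, :].sum()     # 142 / 149 ≈ 0.953
print(precision_spam, recall_spam)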
test = x_test.copy()  # copy so we do not mutate x_test in place
test['is_spam'] = y_test
test['bow_prediction'] = predictions
wrong_index=test[(test['is_spam']=='ham') & (test['bow_prediction']=="spam")].index
wrong_index
Int64Index([2635, 2569, 3888, 1742, 1234, 4860, 5044], dtype='int64')
bow_misclassified = data.loc[wrong_index]
bow_misclassified
wrong_index=test[(test['is_spam']=='spam') & (test['bow_prediction']=="ham")].index
wrong_index
Int64Index([855, 3358, 5449, 1939, 2821, 750, 2246], dtype='int64')
bow_misclassified = data.loc[wrong_index]
bow_misclassified
A quick look at these shows that a message containing a phone number is more likely to be spam. We can use features like this as well to separate spam from ham.
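As a sketch, that phone-number cue can be turned into a feature; the regex below is a hypothetical choice, loosely matched to the shortcodes and numbers in the examples above:
# Hypothetical feature: flag messages containing a run of 5+ digits
# (phone numbers and shortcodes such as 87121 in the spam examples)
data['has_number'] = data['message'].str.contains(r'\d{5,}').astype(int)
data.groupby('label')['has_number'].mean()  # expect a far higher rate for spam
Finally, the introduction also mentioned TF-IDF. A minimal sketch of that variant keeps the same split and metrics and simply swaps CountVectorizer for TfidfVectorizer:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer()
tfidf_data = tfidf.fit_transform(data['cleaned_messages'])
x_train_t, x_test_t, y_train_t, y_test_t = train_test_split(
    tfidf_data, data['label'], test_size=0.20, random_state=42, stratify=data['label'])
spamFilter_tfidf = MultinomialNB()
spamFilter_tfidf.fit(x_train_t, y_train_t)
print(classification_report(y_test_t, spamFilter_tfidf.predict(x_test_t)))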