
Text to Numeric using sklearn feature extraction

Ref: https://github.com/justmarkham/pycon-2016-tutorial/blob/master/tutorial.ipynb

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import CountVectorizer,TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score,classification_report,confusion_matrix
import scikitplot as skp

# Suppress Warning 
import warnings
warnings.filterwarnings("ignore")
Sample dataset

Each element is treated as a document

In [2]:
sample_data = ["This is test1","This is test2","This is another line with test3","Yet another line with test4",
               "yet again another line with test5"]
sample_data
Out[2]:
['This is test1',
 'This is test2',
 'This is another line with test3',
 'Yet another line with test4',
 'yet again another line with test5']

Using CountVectorizer()

In [3]:
vect1 = CountVectorizer() # Using default options
vect1
Out[3]:
CountVectorizer(analyzer='word', binary=False, decode_error='strict',
                dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
                lowercase=True, max_df=1.0, max_features=None, min_df=1,
                ngram_range=(1, 1), preprocessor=None, stop_words=None,
                strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
                tokenizer=None, vocabulary=None)
In [4]:
vect1.fit(sample_data)
Out[4]:
CountVectorizer(analyzer='word', binary=False, decode_error='strict',
                dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
                lowercase=True, max_df=1.0, max_features=None, min_df=1,
                ngram_range=(1, 1), preprocessor=None, stop_words=None,
                strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
                tokenizer=None, vocabulary=None)
Get all the feature names after fit
In [5]:
print(vect1.get_feature_names())
['again', 'another', 'is', 'line', 'test1', 'test2', 'test3', 'test4', 'test5', 'this', 'with', 'yet']
Filter 1-gram words
In [6]:
[wd for wd in vect1.get_feature_names() if len(wd.split(" ")) == 1]
Out[6]:
['again',
 'another',
 'is',
 'line',
 'test1',
 'test2',
 'test3',
 'test4',
 'test5',
 'this',
 'with',
 'yet']
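Since the default ngram_range is (1, 1), every feature above is a single word. A minimal sketch of also extracting bigrams, assuming the same sample_data (vect_ng is just an illustrative name):

vect_ng = CountVectorizer(ngram_range=(1, 2)) # unigrams and bigrams
vect_ng.fit(sample_data)
# reuse the split-based filter above, this time keeping only the 2-gram features
print([wd for wd in vect_ng.get_feature_names() if len(wd.split(" ")) == 2])
# bigrams such as 'this is', 'another line' and 'line with' show up here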
Fit again with stop words
In [7]:
vect2 = CountVectorizer(stop_words="english") # With stop words
vect2
Out[7]:
CountVectorizer(analyzer='word', binary=False, decode_error='strict',
                dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
                lowercase=True, max_df=1.0, max_features=None, min_df=1,
                ngram_range=(1, 1), preprocessor=None, stop_words='english',
                strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
                tokenizer=None, vocabulary=None)
In [8]:
vect2.fit(sample_data)
Out[8]:
CountVectorizer(analyzer='word', binary=False, decode_error='strict',
                dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
                lowercase=True, max_df=1.0, max_features=None, min_df=1,
                ngram_range=(1, 1), preprocessor=None, stop_words='english',
                strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
                tokenizer=None, vocabulary=None)
Get all the feature names after fit

Comparing vect1 and vect2 below clearly shows that a lot of the stop words are removed while fitting

In [9]:
print(vect1.get_feature_names())
print(vect2.get_feature_names())
['again', 'another', 'is', 'line', 'test1', 'test2', 'test3', 'test4', 'test5', 'this', 'with', 'yet']
['line', 'test1', 'test2', 'test3', 'test4', 'test5']
This is the set of stop words used for English

To reduce the dimension of the feature space, we should remove stop words

In [10]:
print(vect2.get_stop_words())
frozenset({'back', 'detail', 'within', 'get', 'latter', 'must', 'both', 'may', 'nevertheless', 'off', 'ours', 'become', 'per', 'fifteen', 'un', 'a', 'very', 'another', 'yourselves', 'towards', 'amongst', 'this', 'when', 'moreover', 'hereafter', 'its', 'see', 'behind', 'once', 'amoungst', 'ltd', 'then', 'my', 'anyway', 'whereas', 'several', 'wherever', 'before', 'side', 'cannot', 'yourself', 'eight', 'our', 'into', 'now', 'therein', 'sometimes', 'about', 'hasnt', 'me', 'since', 'nine', 'he', 'other', 'whenever', 'four', 'mine', 'him', 'also', 'throughout', 'thereby', 'top', 'perhaps', 'same', 'bottom', 'call', 'first', 're', 'enough', 'most', 'there', 'go', 'had', 'below', 'whether', 'everything', 'still', 'too', 'together', 'of', 'on', 'otherwise', 'take', 'at', 'again', 'am', 'twelve', 'those', 'until', 'with', 'it', 'although', 'found', 'during', 'ourselves', 'sometime', 'herself', 'almost', 'everywhere', 'itself', 'while', 'anything', 'etc', 'something', 'they', 'without', 'anyone', 'alone', 'either', 'less', 'out', 'myself', 'would', 'thin', 'always', 'than', 'above', 'please', 'forty', 'cant', 'ever', 'noone', 'nor', 'more', 'de', 'full', 'serious', 'though', 'whole', 'cry', 'because', 'none', 'indeed', 'mill', 'across', 'you', 'eleven', 'two', 'his', 'over', 'whither', 'onto', 'themselves', 'empty', 'should', 'somewhere', 'she', 'up', 'former', 'ie', 'fifty', 'toward', 'could', 'couldnt', 'under', 'them', 'next', 'often', 'every', 'been', 'except', 'thus', 'show', 'her', 'becoming', 'has', 'elsewhere', 'system', 'seeming', 'against', 'so', 'hundred', 'six', 'himself', 'each', 'any', 'hers', 'via', 'but', 'find', 'never', 'whereupon', 'who', 'seem', 'anywhere', 'keep', 'afterwards', 'how', 'whereby', 'after', 'what', 'anyhow', 'are', 'nowhere', 'for', 'eg', 'front', 'wherein', 'meanwhile', 'hence', 'last', 'neither', 'seemed', 'the', 'thru', 'co', 'these', 'sixty', 'much', 'we', 'bill', 'even', 'interest', 'seems', 'many', 'somehow', 'here', 'and', 'is', 'thence', 'some', 'however', 'along', 'five', 'ten', 'rather', 'formerly', 'why', 'few', 'three', 'your', 'thereafter', 'yet', 'were', 'third', 'by', 'fire', 'around', 'whom', 'part', 'if', 'con', 'hereupon', 'an', 'beside', 'done', 'hereby', 'beforehand', 'latterly', 'us', 'down', 'least', 'therefore', 'might', 'name', 'between', 'give', 'that', 'upon', 'do', 'as', 'becomes', 'was', 'to', 'their', 'from', 'everyone', 'such', 'thick', 'in', 'move', 'sincere', 'whereafter', 'among', 'all', 'well', 'inc', 'became', 'made', 'beyond', 'already', 'through', 'own', 'put', 'only', 'whence', 'besides', 'be', 'which', 'yours', 'nobody', 'i', 'describe', 'further', 'thereupon', 'no', 'one', 'amount', 'herein', 'namely', 'else', 'whose', 'fill', 'where', 'mostly', 'can', 'due', 'will', 'someone', 'twenty', 'whoever', 'not', 'or', 'whatever', 'nothing', 'being', 'others', 'have'})
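Besides the built-in English list, a custom list of stop words can be passed as well. A minimal sketch on the same sample_data (the stop words chosen here are purely illustrative):

vect3 = CountVectorizer(stop_words=["this", "is", "with"]) # custom stop-word list
vect3.fit(sample_data)
print(vect3.get_feature_names())
# expected: ['again', 'another', 'line', 'test1', 'test2', 'test3', 'test4', 'test5', 'yet']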
In [11]:
sm = vect2.transform(sample_data) # Output is Sparse matrix
sm
Out[11]:
<5x6 sparse matrix of type '<class 'numpy.int64'>'
	with 8 stored elements in Compressed Sparse Row format>
In [13]:
# # Or, we can use the fit_transform function to do both steps in one line
# sm = vect2.fit_transform(sample_data)
# sm.toarray()
Decoding the above output:

5x6 sparse matrix:

  • 5 is the number of documents in the sample data, i.e. len(sample_data)
  • 6 is the number of unique words (features), i.e. len(vect2.get_feature_names())
  • "with 8 stored elements in Compressed Sparse Row format" means only the (row, column) positions where nonzero counts are present are stored; that is what makes it a compressed sparse matrix (see the sketch below)
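A minimal sketch of reading those stored entries directly off the CSR matrix, assuming the sm object from above (for a matrix fresh from transform, nonzero() walks the stored entries in the same order as sm.data):

rows, cols = sm.nonzero() # (row, column) positions of the stored entries
for r, c, v in zip(rows, cols, sm.data):
    print((r, c), v) # prints 8 entries, matching the summary above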
Type of the above output
In [14]:
type(sm)
Out[14]:
scipy.sparse.csr.csr_matrix
Check the actual sparse matrix array
In [62]:
sm.toarray()
Out[62]:
array([[0, 1, 0, 0, 0, 0],
       [0, 0, 1, 0, 0, 0],
       [1, 0, 0, 1, 0, 0],
       [1, 0, 0, 0, 1, 0],
       [1, 0, 0, 0, 0, 1]], dtype=int64)
Get back the tokens present in each document
In [66]:
vect2.inverse_transform(sm) # inverse_transform takes the document-term matrix, not the raw text
Out[66]:
[array(['test1'], dtype='<U5'),
 array(['test2'], dtype='<U5'),
 array(['line', 'test3'], dtype='<U5'),
 array(['line', 'test4'], dtype='<U5'),
 array(['line', 'test5'], dtype='<U5')]
Visualize as heatmap
In [56]:
plt.figure(figsize=(10,4))
sns.heatmap(sm.toarray(),xticklabels=vect2.get_feature_names(),yticklabels=["document_"+str(i) for i in range(5)],
            annot=True,vmin=0,vmax=1,
           linewidths=1)
plt.yticks(rotation=0)
plt.show()
[Notebook image: heatmap of the 5x6 document-term matrix, documents on the y-axis and features on the x-axis]
If we want to see the above array as a pandas DataFrame, with the unique words (features) as column names, take a look below
In [16]:
print(sample_data)
['This is test1', 'This is test2', 'This is another line with test3', 'Yet another line with test4', 'yet again another line with test5']
In [17]:
pd.DataFrame(sm.toarray(),columns=vect2.get_feature_names(),index = ["document_"+str(i) for i in range(5)])
Out[17]:
            line  test1  test2  test3  test4  test5
document_0     0      1      0      0      0      0
document_1     0      0      1      0      0      0
document_2     1      0      0      1      0      0
document_3     1      0      0      0      1      0
document_4     1      0      0      0      0      1
Check the compressed form of sparse matrix

As the output above shows, only the (row, column) positions where 1's are present are stored; that is the compressed sparse form

Convert to dense format if required
In [18]:
print(sm.todense())
[[0 1 0 0 0 0]
 [0 0 1 0 0 0]
 [1 0 0 1 0 0]
 [1 0 0 0 1 0]
 [1 0 0 0 0 1]]
Convert to co-ordinate format if required
In [19]:
print(sm.tocoo())
  (0, 1)	1
  (1, 2)	1
  (2, 0)	1
  (2, 3)	1
  (3, 0)	1
  (3, 4)	1
  (4, 0)	1
  (4, 5)	1
In [20]:
test_data = ["I will write another line with test5 and test1"]
op = vect2.transform(test_data)
In [21]:
pd.DataFrame(op.toarray(), columns=vect2.get_feature_names())
Out[21]:
   line  test1  test2  test3  test4  test5
0     1      1      0      0      0      1
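Note that tokens the vectorizer never saw during fit ('write', 'will', etc.) are silently dropped at transform time, since only words in the fitted vocabulary have columns. A minimal check, assuming vect2 and test_data from above:

print([w for w in test_data[0].lower().split() if w in vect2.vocabulary_])
# expected: ['line', 'test5', 'test1']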

Using TfIdf vectorizer

In [22]:
sample_data = ["This is test1","This is test2","This is another line with test3","Yet another line with test4",
               "yet again another line with test5"]
sample_data
Out[22]:
['This is test1',
 'This is test2',
 'This is another line with test3',
 'Yet another line with test4',
 'yet again another line with test5']
In [23]:
tfv = TfidfVectorizer() # Using default parameters
In [24]:
tfv1 = TfidfVectorizer(stop_words="english") # Using stop words
In [25]:
tfv.fit(sample_data)
Out[25]:
TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
                dtype=<class 'numpy.float64'>, encoding='utf-8',
                input='content', lowercase=True, max_df=1.0, max_features=None,
                min_df=1, ngram_range=(1, 1), norm='l2', preprocessor=None,
                smooth_idf=True, stop_words=None, strip_accents=None,
                sublinear_tf=False, token_pattern='(?u)\\b\\w\\w+\\b',
                tokenizer=None, use_idf=True, vocabulary=None)
In [26]:
tfv1.fit(sample_data)
Out[26]:
TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
                dtype=<class 'numpy.float64'>, encoding='utf-8',
                input='content', lowercase=True, max_df=1.0, max_features=None,
                min_df=1, ngram_range=(1, 1), norm='l2', preprocessor=None,
                smooth_idf=True, stop_words='english', strip_accents=None,
                sublinear_tf=False, token_pattern='(?u)\\b\\w\\w+\\b',
                tokenizer=None, use_idf=True, vocabulary=None)
In [27]:
print(tfv.get_feature_names())
print(vect2.get_feature_names())
['again', 'another', 'is', 'line', 'test1', 'test2', 'test3', 'test4', 'test5', 'this', 'with', 'yet']
['line', 'test1', 'test2', 'test3', 'test4', 'test5']
In [28]:
print(tfv.vocabulary_)
{'this': 9, 'is': 2, 'test1': 4, 'test2': 5, 'another': 1, 'line': 3, 'with': 10, 'test3': 6, 'yet': 11, 'test4': 7, 'again': 0, 'test5': 8}
In [29]:
print(tfv1.vocabulary_)
{'test1': 1, 'test2': 2, 'line': 0, 'test3': 3, 'test4': 4, 'test5': 5}
In [30]:
tfv_sm = tfv1.transform(sample_data)
tfv_sm
Out[30]:
<5x6 sparse matrix of type '<class 'numpy.float64'>'
	with 8 stored elements in Compressed Sparse Row format>
In [31]:
sm.toarray()
Out[31]:
array([[0, 1, 0, 0, 0, 0],
       [0, 0, 1, 0, 0, 0],
       [1, 0, 0, 1, 0, 0],
       [1, 0, 0, 0, 1, 0],
       [1, 0, 0, 0, 0, 1]], dtype=int64)
In [32]:
tfv_sm.toarray()
Out[32]:
array([[0.        , 1.        , 0.        , 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 1.        , 0.        , 0.        ,
        0.        ],
       [0.55645052, 0.        , 0.        , 0.83088075, 0.        ,
        0.        ],
       [0.55645052, 0.        , 0.        , 0.        , 0.83088075,
        0.        ],
       [0.55645052, 0.        , 0.        , 0.        , 0.        ,
        0.83088075]])
In [33]:
from sklearn.feature_extraction.text import CountVectorizer
corpus = [
    'This the',
    'This is ',
    'And this is ',
    'Is this the first ',
]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names())
print(vectorizer.vocabulary_)

print(X.toarray())
['and', 'first', 'is', 'the', 'this']
{'this': 4, 'the': 3, 'is': 2, 'and': 0, 'first': 1}
[[0 0 0 1 1]
 [0 0 1 0 1]
 [1 0 1 0 1]
 [0 1 1 1 1]]
In [34]:
from sklearn.feature_extraction.text import TfidfVectorizer
corpus = [
    'This the',
    'This is ',
    'And this is ',
    'Is this the first ',
]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names())

X.toarray()
['and', 'first', 'is', 'the', 'this']
Out[34]:
array([[0.        , 0.        , 0.        , 0.83388421, 0.55193942],
       [0.        , 0.        , 0.77419109, 0.        , 0.63295194],
       [0.77157901, 0.        , 0.49248889, 0.        , 0.40264194],
       [0.        , 0.65919112, 0.42075315, 0.51971385, 0.34399327]])

Let's see how this is calculated:

Each entry is \(tf(t, d) \cdot idf(t)\); with the default smooth_idf=True, \(idf(t) = ln(\dfrac{1+n}{1+df(t)}) + 1\), and each row is then l2-normalized (norm='l2')

where ln = natural log; tf(t, d) = frequency of term t in document d; n = number of documents; df(t) = number of documents where term t is present
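A minimal numeric check of the first row ('This the') under these defaults, assuming the corpus and X from the cells above ('this' occurs in all 4 documents, 'the' in 2):

n = 4 # number of documents
idf_the = np.log((1 + n) / (1 + 2)) + 1 # ~1.5108
idf_this = np.log((1 + n) / (1 + 4)) + 1 # 1.0
row = np.array([idf_the * 1, idf_this * 1]) # tf = 1 for both terms
row = row / np.linalg.norm(row) # l2 normalization
print(row) # ~[0.8339, 0.5519], matching the 'the' and 'this' columns of row 0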

Tokenization using build_tokenizer()

In [35]:
cvbt = CountVectorizer(stop_words="english").build_tokenizer() # This only tokenizes; the stop_words setting has no effect here!
cvbt("This is line1 and used as demo")
Out[35]:
['This', 'is', 'line1', 'and', 'used', 'as', 'demo']
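By contrast, build_analyzer() returns the full preprocessing pipeline (lowercasing, tokenization and stop-word removal), so the stop_words setting does take effect. A minimal sketch on the same sentence:

cvba = CountVectorizer(stop_words="english").build_analyzer()
print(cvba("This is line1 and used as demo"))
# expected: ['line1', 'used', 'demo']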
In [36]:
cvbt = CountVectorizer()

cvbt.fit(["This is line1 and used as demo"])
cvbt.get_feature_names()
Out[36]:
['and', 'as', 'demo', 'is', 'line1', 'this', 'used']
In [37]:
cvbt = CountVectorizer(stop_words="english")

cvbt.fit(["This is line1 and used as demo"])
cvbt.get_feature_names()
Out[37]:
['demo', 'line1', 'used']

Using NMF (Non-Negative Matrix Factorization)

TBD
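For context: NMF factorizes a non-negative matrix X (documents x features) into two non-negative factors, \(X \approx WH\), where W (documents x n_components) is what fit_transform returns and H (n_components x features) is stored in components_. With n_components=2, each document is described by 2 latent "topics".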

In [38]:
from sklearn.decomposition import NMF
In [39]:
nmf = NMF(n_components=2)
In [40]:
tmp = sm.toarray()
In [41]:
nmf.fit_transform(X)
Out[41]:
array([[0.        , 0.83424253],
       [0.69226605, 0.08114503],
       [0.72327066, 0.        ],
       [0.17654209, 0.70417581]])
In [42]:
H = nmf.fit_transform(tmp)
H
Out[42]:
array([[0.00000000e+00, 0.00000000e+00],
       [0.00000000e+00, 0.00000000e+00],
       [8.56215763e-01, 0.00000000e+00],
       [8.56215763e-01, 0.00000000e+00],
       [4.55631805e-05, 1.20964065e+00]])
In [43]:
cvbt.get_feature_names()
Out[43]:
['demo', 'line1', 'used']
In [44]:
idx_to_word = np.array(vectorizer.get_feature_names())
idx_to_word
Out[44]:
array(['and', 'first', 'is', 'the', 'this'], dtype='<U5')
In [45]:
for i, topic in enumerate(H):
    print(i,"::",topic)
    print("Topic {}: {}".format(i + 1, ",".join([str(x) for x in idx_to_word[topic.argsort()[-10:]]])))
0 :: [0. 0.]
Topic 1: and,first
1 :: [0. 0.]
Topic 2: and,first
2 :: [0.85621576 0.        ]
Topic 3: first,and
3 :: [0.85621576 0.        ]
Topic 4: first,and
4 :: [4.55631805e-05 1.20964065e+00]
Topic 5: and,first
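In the loop above, H is really the document-topic matrix returned by fit_transform, so indexing idx_to_word (which comes from the other corpus anyway) with its rows prints the same words for every document. A minimal sketch of the usual approach, using the topic-term matrix nmf.components_ instead (assuming nmf was last fitted on tmp, whose 6 columns correspond to vect2's features):

feature_names = np.array(vect2.get_feature_names())
for i, comp in enumerate(nmf.components_): # one row per topic
    top = feature_names[comp.argsort()[::-1][:3]] # 3 highest-weight terms
    print("Topic {}: {}".format(i + 1, ", ".join(top)))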