Natural Language Processing using spaCy

  • Introduction to the spaCy library
  • Load the English language model
  • Find out stop words
  • Create an nlp object from a given document (sentence)
  • Count the frequency of each word using hash values (using count_by(ORTH) and nlp.vocab.strings)
  • Print each word count using a dictionary comprehension
  • Print the index of each token
  • Print various token attributes (e.g. is_alpha, shape_, is_stop, pos_, tag_)
  • Stemming (using nltk)
    • using PorterStemmer()
    • using SnowballStemmer()
  • Lemmatization
  • Display the dependency tree of words using displacy.render()
  • Get the meaning of any label or tag used by spaCy with explain()
  • Find named entities (NER: Named Entity Recognition) in a given doc
  • Display named entities in the doc using displacy.render()
  • Remove stop words/punctuation using the is_stop & is_punct attributes
  • Create a list of words after removing stop words, then rebuild the sentence
  • Sentence and Word Tokenization
  • Pipelining:
    • Get all the factory pipelining options available
    • How to disable preloaded pipeline components to speed up processing
    • Adding custom pipelines
  • Reading a file and displaying entities
  • Chunking
  • Computing word similarity
  • n-grams (using nltk and sklearn-CountVectorizer())
    • bi-grams
    • tri-grams
    • n-grams
In [ ]:
import spacy as sp
from spacy import displacy # used for data visualization
from spacy.lang.en.stop_words import STOP_WORDS
from spacy.attrs import ORTH # to be used for word count
In [ ]:
nlp = sp.load("en_core_web_sm") # ref: https://spacy.io/models/en
To download the English model (run once before loading):

!python -m spacy download en_core_web_sm

In [ ]:
txt = """Commercial writers know that most people don’t want to read 1,000 
words of closely-spaced text in order to see what they are writing about, so 
they also like to keep sentences and paragraphs short. 
They’ll even use lots of sub-headers so you can see what each paragraph is about 
before you read it."""
In [ ]:
obj = nlp(txt)
How to get all the tokens from the text
In [ ]:
for wd in obj:
    print(wd.text)
Commercial writers know that most people do n’t want to read 1,000 words of closely - spaced text in order to see what they are writing about , so they also like to keep sentences and paragraphs short . They ’ll even use lots of sub - headers so you can see what each paragraph is about before you read it .
Find out stop words
In [ ]:
for wd in obj:
    print((wd.text,wd.is_stop))
('Commercial', False) ('writers', False) ('know', False) ('that', True) ('most', True) ('people', False) ('do', True) ('n’t', True) ('want', False) ('to', True) ('read', False) ('1,000', False) ('\n', False) ('words', False) ('of', True) ('closely', False) ('-', False) ('spaced', False) ('text', False) ('in', True) ('order', False) ('to', True) ('see', True) ('what', True) ('they', True) ('are', True) ('writing', False) ('about', True) (',', False) ('so', True) ('\n', False) ('they', True) ('also', True) ('like', False) ('to', True) ('keep', True) ('sentences', False) ('and', True) ('paragraphs', False) ('short', False) ('.', False) ('\n', False) ('They', True) ('’ll', True) ('even', True) ('use', False) ('lots', False) ('of', True) ('sub', False) ('-', False) ('headers', False) ('so', True) ('you', True) ('can', True) ('see', True) ('what', True) ('each', True) ('paragraph', False) ('is', True) ('about', True) ('\n', False) ('before', True) ('you', True) ('read', False) ('it', True) ('.', False)
Split the nlp document into sentences
In [ ]:
for sent in obj.sents:
    print(sent.text)
Commercial writers know that most people don’t want to read 1,000 words of closely-spaced text in order to see what they are writing about, so they also like to keep sentences and paragraphs short. They’ll even use lots of sub-headers so you can see what each paragraph is about before you read it.
To get the individual words from each sentence
In [ ]:
for sent in obj.sents:
    for wd in sent:
        print(wd.text)
Commercial writers know that most people do n’t want to read 1,000 words of closely - spaced text in order to see what they are writing about , so they also like to keep sentences and paragraphs short . They ’ll even use lots of sub - headers so you can see what each paragraph is about before you read it .
In [ ]:
for sent in obj.sents:
    print(sent.text.split(" "))
['Commercial', 'writers', 'know', 'that', 'most', 'people', 'don’t', 'want', 'to', 'read', '1,000', '\nwords', 'of', 'closely-spaced', 'text', 'in', 'order', 'to', 'see', 'what', 'they', 'are', 'writing', 'about,', 'so', '\nthey', 'also', 'like', 'to', 'keep', 'sentences', 'and', 'paragraphs', 'short.', '\n'] ['They’ll', 'even', 'use', 'lots', 'of', 'sub-headers'] ['so', 'you', 'can', 'see', 'what', 'each', 'paragraph', 'is', 'about', '\nbefore', 'you', 'read', 'it.']
Count frequency of each word using hash values (using count_by(ORTH) and nlp.vocab.strings)
In [ ]:
obj.count_by(ORTH)
Out[0]:
{6679199052911211715: 1,
 357501887436434592: 1,
 7743033266031195906: 1,
 4380130941430378203: 1,
 11104729984170784471: 1,
 7593739049417968140: 1,
 2158845516055552166: 1,
 16712971838599463365: 1,
 7597692042947428029: 1,
 3791531372978436496: 3,
 11792590063656742891: 2,
 18254674181385630108: 1,
 962983613142996970: 4,
 10289140944597012527: 1,
 886050111519832510: 2,
 9696970313201087903: 1,
 9153284864653046197: 2,
 16159022834684645410: 1,
 15099781594404091470: 1,
 3002984154512732771: 1,
 13136985495629980461: 1,
 11925638236994514241: 2,
 5865838185239622912: 2,
 16875582379069451158: 2,
 5012629990875267006: 1,
 9147119992364589469: 1,
 942632335873952620: 2,
 2593208677638477497: 1,
 9781598966686434415: 2,
 12084876542534825196: 1,
 18194338103975822726: 1,
 9099225972875567996: 1,
 5257340109698985342: 1,
 2283656566040971221: 1,
 12626284911390218812: 1,
 3563698965725164461: 1,
 12646065887601541794: 2,
 14947529218328092544: 1,
 17092777669037358890: 1,
 17339226045912991082: 1,
 6873750497785110593: 1,
 17842523177576739921: 1,
 144868287865513341: 1,
 18375123465971211096: 1,
 7624161793554793053: 2,
 6635067063807956629: 1,
 5379624210385286023: 1,
 9194963477161408182: 1,
 3411606890003347522: 1,
 11320251846592927908: 1,
 10239237003504588839: 1}
In [ ]:
for k,v in obj.count_by(ORTH).items():
    print((nlp.vocab.strings[k],v))
('Commercial', 1) ('writers', 1) ('know', 1) ('that', 1) ('most', 1) ('people', 1) ('do', 1) ('n’t', 1) ('want', 1) ('to', 3) ('read', 2) ('1,000', 1) ('\n', 4) ('words', 1) ('of', 2) ('closely', 1) ('-', 2) ('spaced', 1) ('text', 1) ('in', 1) ('order', 1) ('see', 2) ('what', 2) ('they', 2) ('are', 1) ('writing', 1) ('about', 2) (',', 1) ('so', 2) ('also', 1) ('like', 1) ('keep', 1) ('sentences', 1) ('and', 1) ('paragraphs', 1) ('short', 1) ('.', 2) ('They', 1) ('’ll', 1) ('even', 1) ('use', 1) ('lots', 1) ('sub', 1) ('headers', 1) ('you', 2) ('can', 1) ('each', 1) ('paragraph', 1) ('is', 1) ('before', 1) ('it', 1)
Print each word count using a dictionary comprehension
In [ ]:
for wd in obj:
    print(wd.text)
    break
Commercial
In [ ]:
print({wd: txt.count(wd.text) for wd in obj})  # note: txt.count() matches substrings in the raw text, so some counts are inflated (e.g. "in" also matches "writing")
{Commercial: 1, writers: 1, know: 1, that: 1, most: 1, people: 1, do: 1, n’t: 1, want: 1, to: 3, read: 2, 1,000: 1, : 4, words: 1, of: 2, closely: 1, -: 2, spaced: 1, text: 1, in: 2, order: 1, to: 3, see: 2, what: 2, they: 2, are: 1, writing: 1, about: 2, ,: 2, so: 3, : 4, they: 2, also: 1, like: 1, to: 3, keep: 1, sentences: 1, and: 1, paragraphs: 1, short: 1, .: 2, : 4, They: 1, ’ll: 1, even: 1, use: 1, lots: 1, of: 2, sub: 1, -: 2, headers: 1, so: 3, you: 2, can: 1, see: 2, what: 2, each: 1, paragraph: 2, is: 1, about: 2, : 4, before: 1, you: 2, read: 2, it: 3, .: 2}
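Because txt.count() matches substrings, a token-based count is usually more reliable. A minimal sketch (an addition, not part of the original notebook) using collections.Counter over the tokens of the same obj:
In [ ]:
from collections import Counter

# count each token's surface form directly from the spaCy tokens, skipping whitespace tokens
word_counts = Counter(wd.text for wd in obj if not wd.is_space)
print(word_counts.most_common(10))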
Print the character offset (idx) of each token
In [ ]:
print({wd.text:wd.idx for wd in obj}) # idx : index of the given word
{'Commercial': 0, 'writers': 11, 'know': 19, 'that': 24, 'most': 29, 'people': 34, 'do': 41, 'n’t': 43, 'want': 47, 'to': 160, 'read': 294, '1,000': 60, '\n': 282, 'words': 67, 'of': 223, 'closely': 76, '-': 229, 'spaced': 84, 'text': 91, 'in': 96, 'order': 99, 'see': 249, 'what': 253, 'they': 145, 'are': 122, 'writing': 126, 'about': 276, ',': 139, 'so': 238, 'also': 150, 'like': 155, 'keep': 163, 'sentences': 168, 'and': 178, 'paragraphs': 182, 'short': 193, '.': 301, 'They': 201, '’ll': 205, 'even': 209, 'use': 214, 'lots': 218, 'sub': 226, 'headers': 230, 'you': 290, 'can': 245, 'each': 258, 'paragraph': 263, 'is': 273, 'before': 283, 'it': 299}
Print various token attributes (e.g. is_alpha, shape_, is_stop, pos_, tag_, is_punct)
In [ ]:
obj2 = nlp("CommercIAl writers")
for wd in obj2:
    print((wd.text,wd.is_alpha,wd.shape_))
('CommercIAl', True, 'XxxxxXXx') ('writers', True, 'xxxx')
In [ ]:
for wd in obj:
    print((wd.text,wd.is_alpha,wd.shape_,wd.pos_,wd.tag_,wd.is_punct))
('Commercial', True, 'Xxxxx', 'ADJ', 'JJ', False) ('writers', True, 'xxxx', 'NOUN', 'NNS', False) ('know', True, 'xxxx', 'VERB', 'VBP', False) ('that', True, 'xxxx', 'SCONJ', 'IN', False) ('most', True, 'xxxx', 'ADJ', 'JJS', False) ('people', True, 'xxxx', 'NOUN', 'NNS', False) ('do', True, 'xx', 'AUX', 'VBP', False) ('n’t', False, 'x’x', 'PART', 'RB', False) ('want', True, 'xxxx', 'VERB', 'VB', False) ('to', True, 'xx', 'PART', 'TO', False) ('read', True, 'xxxx', 'VERB', 'VB', False) ('1,000', False, 'd,ddd', 'NUM', 'CD', False) ('\n', False, '\n', 'SPACE', '_SP', False) ('words', True, 'xxxx', 'NOUN', 'NNS', False) ('of', True, 'xx', 'ADP', 'IN', False) ('closely', True, 'xxxx', 'ADV', 'RB', False) ('-', False, '-', 'PUNCT', 'HYPH', True) ('spaced', True, 'xxxx', 'VERB', 'VBN', False) ('text', True, 'xxxx', 'NOUN', 'NN', False) ('in', True, 'xx', 'ADP', 'IN', False) ('order', True, 'xxxx', 'NOUN', 'NN', False) ('to', True, 'xx', 'PART', 'TO', False) ('see', True, 'xxx', 'VERB', 'VB', False) ('what', True, 'xxxx', 'PRON', 'WP', False) ('they', True, 'xxxx', 'PRON', 'PRP', False) ('are', True, 'xxx', 'AUX', 'VBP', False) ('writing', True, 'xxxx', 'VERB', 'VBG', False) ('about', True, 'xxxx', 'ADP', 'IN', False) (',', False, ',', 'PUNCT', ',', True) ('so', True, 'xx', 'ADV', 'RB', False) ('\n', False, '\n', 'SPACE', '_SP', False) ('they', True, 'xxxx', 'PRON', 'PRP', False) ('also', True, 'xxxx', 'ADV', 'RB', False) ('like', True, 'xxxx', 'VERB', 'VBP', False) ('to', True, 'xx', 'PART', 'TO', False) ('keep', True, 'xxxx', 'VERB', 'VB', False) ('sentences', True, 'xxxx', 'NOUN', 'NNS', False) ('and', True, 'xxx', 'CCONJ', 'CC', False) ('paragraphs', True, 'xxxx', 'NOUN', 'NNS', False) ('short', True, 'xxxx', 'ADJ', 'JJ', False) ('.', False, '.', 'PUNCT', '.', True) ('\n', False, '\n', 'SPACE', '_SP', False) ('They', True, 'Xxxx', 'PRON', 'PRP', False) ('’ll', False, '’xx', 'AUX', 'MD', False) ('even', True, 'xxxx', 'ADV', 'RB', False) ('use', True, 'xxx', 'VERB', 'VB', False) ('lots', True, 'xxxx', 'NOUN', 'NNS', False) ('of', True, 'xx', 'ADP', 'IN', False) ('sub', True, 'xxx', 'NOUN', 'NN', False) ('-', False, '-', 'NOUN', 'NNS', True) ('headers', True, 'xxxx', 'NOUN', 'NNS', False) ('so', True, 'xx', 'SCONJ', 'IN', False) ('you', True, 'xxx', 'PRON', 'PRP', False) ('can', True, 'xxx', 'AUX', 'MD', False) ('see', True, 'xxx', 'VERB', 'VB', False) ('what', True, 'xxxx', 'PRON', 'WP', False) ('each', True, 'xxxx', 'DET', 'DT', False) ('paragraph', True, 'xxxx', 'NOUN', 'NN', False) ('is', True, 'xx', 'AUX', 'VBZ', False) ('about', True, 'xxxx', 'ADP', 'IN', False) ('\n', False, '\n', 'SPACE', '_SP', False) ('before', True, 'xxxx', 'ADP', 'IN', False) ('you', True, 'xxx', 'PRON', 'PRP', False) ('read', True, 'xxxx', 'VERB', 'VBP', False) ('it', True, 'xx', 'PRON', 'PRP', False) ('.', False, '.', 'PUNCT', '.', True)
Exercise: Filter stop words from the given text using list comprehension
In [ ]:
print([wd for wd in obj if wd.is_stop])
[that, most, do, n’t, to, of, in, to, see, what, they, are, about, so, they, also, to, keep, and, They, ’ll, even, of, so, you, can, see, what, each, is, about, before, you, it]
Exercise: Filter words excluding stop words from the given text using list comprehension
In [ ]:
print([wd for wd in obj if not wd.is_stop])
[Commercial, writers, know, people, want, read, 1,000, , words, closely, -, spaced, text, order, writing, ,, , like, sentences, paragraphs, short, ., , use, lots, sub, -, headers, paragraph, , read, .]
Stemming (using nltk)

Using PorterStemmer() and SnowballStemmer()

In [ ]:
from nltk.stem import SnowballStemmer
from nltk.stem import PorterStemmer
In [ ]:
sn_stemmer = SnowballStemmer("english")
po_stemmer = PorterStemmer()
In [ ]:
sent = nlp("give gave given giving gives giving")
for wd in sent:
    #print(wd.text)
    print((wd.text,sn_stemmer.stem(wd.text)))
    
('give', 'give') ('gave', 'gave') ('given', 'given') ('giving', 'give') ('gives', 'give') ('giving', 'give')
In [ ]:
sent = nlp("play plays played playing playable")
for wd in sent:
    #print(wd.text)
    print((wd.text,sn_stemmer.stem(wd.text)))
    
('play', 'play') ('plays', 'play') ('played', 'play') ('playing', 'play') ('playable', 'playabl')
Using PorterStemmer()
In [ ]:
sent = nlp("play plays played playing playable")
for wd in sent:
    #print(wd.text)
    print((wd.text,po_stemmer.stem(wd.text)))
    
('play', 'play') ('plays', 'play') ('played', 'play') ('playing', 'play') ('playable', 'playabl')
Lemmatization
In [ ]:
sent = nlp("play plays played playing playable")
# sent = nlp("give gave given giving gives giving")
# sent = nlp("go went gone")

for wd in sent:
    #print(wd.text)
    print((wd.text,wd.lemma_))
    
('play', 'play') ('plays', 'play') ('played', 'play') ('playing', 'play') ('playable', 'playable')
Display the dependency tree of words using displacy.render()
In [ ]:
obj3 = nlp("This is line1 and used for displacy purpose")
In [ ]:
displacy.render(obj3,jupyter=True)
How to get the meaning of any label or tag used by spaCy with explain()
In [ ]:
sp.explain("nsubj")
Out[0]:
'nominal subject'
In [ ]:
sp.explain("ADP")
Out[0]:
'adposition'
In [ ]:
sp.explain("prep")
Out[0]:
'prepositional modifier'
How to find named entities (NER: Named Entity Recognition) in a given doc
In [ ]:
obj4 = nlp("""The show's name has been variously translated as Chat with Beauties,[1] Chatting Beauties,[2] Beauties's Chatterbox, or Misuda (a shortened version of its Korean name).[3]

The show was hosted by Nam Hui-seok, a television personality and comedian. Later on, announcer Eom Ji-in joined as co-host, and eventually Lee Yun-seok and Seo Gyeong-seok became the final hosts. The song "Bring It All Back" by S Club 7 is played after the opening cut to the studio floor that follows the playing of the opening intro and the viewer advisory that it is a rated "15" program. Unlike most talk shows, Global Talk Show does not have a live studio audience and instead uses audience laughter and applause tracks as well as on-screen text and sound effects.

In 2009 the program came under attack, receiving widespread criticism by internet users after a student panelist labeled short men (men under 180cm) as "losers".[3] The program suffered a decline in popularity thereafter and was later cancelled.[citation needed] Nevertheless, the popularity of the program gave celebrity status within South Korea to some of the panelists.[4] A portion of the program was also published as a book featuring the same subject.""")
In [ ]:
for wd in obj4.ents:
    print((wd.text,wd.label_))
('Misuda', 'GPE') ('Korean', 'NORP') ('Nam Hui-seok', 'PERSON') ('Eom Ji', 'PERSON') ('Lee Yun-seok', 'PERSON') ('Seo Gyeong-seok', 'PERSON') ('Bring It All Back', 'WORK_OF_ART') ('S Club 7', 'ORG') ('15', 'CARDINAL') ('Global Talk Show', 'WORK_OF_ART') ('2009', 'DATE') ('180cm', 'QUANTITY') ('South Korea', 'GPE')
In [ ]:
[wd for wd in obj4.ents if wd.label_ == "PERSON"]
Out[0]:
[Nam Hui-seok, Eom Ji, Lee Yun-seok, Seo Gyeong-seok]
In [ ]:
len([wd for wd in obj4.ents if wd.label_ == "PERSON"])
Out[0]:
4
Display named entities in the doc using displacy.render()
In [ ]:
displacy.render(obj4,style="ent",jupyter=True)
In [ ]:
sp.explain("GPE")
Out[0]:
'Countries, cities, states'
In [ ]:
sp.explain("NORP")
Out[0]:
'Nationalities or religious or political groups'

Reading a file and displaying entities

In [ ]:
fh = open("obama_speech.txt")  # note: the handle stays open here; a with-block would close it automatically
In [ ]:
obj5 = nlp(fh.read())
In [ ]:
displacy.render(obj5,style="ent")
In [ ]:
# displacy.render(obj5,jupyter=True)
Remove stop words/punctuation using the is_stop & is_punct attributes
Already covered above.
Create a list of words after removing stop words, then rebuild the sentence
The stop-word filtering was covered above; a sketch that also rebuilds the sentence follows.
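A minimal sketch (an addition, reusing the obj doc from above) that drops stop words, punctuation and whitespace tokens, then joins the remaining tokens back into a sentence:
In [ ]:
filtered = [wd.text for wd in obj if not (wd.is_stop or wd.is_punct or wd.is_space)]
print(" ".join(filtered))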
Sentence and Word Tokenization
Already covered above.

Pipelining:

  • Get all the factory pipelining options available
  • How to disable preloaded pipeline components to speed up processing
  • Adding custom pipelines
In [ ]:
nlp.pipe_names
Out[0]:
['tagger', 'parser', 'ner']
In [ ]:
nlp.pipeline
Out[0]:
[('tagger', <spacy.pipeline.pipes.Tagger at 0x1a2d067d68>),
 ('parser', <spacy.pipeline.pipes.DependencyParser at 0x1a2daf52e8>),
 ('ner', <spacy.pipeline.pipes.EntityRecognizer at 0x1a2daf5348>)]
In [ ]:
nlp.factories
Out[0]:
{'tokenizer': <function spacy.language.Language.<lambda>(nlp)>,
 'tensorizer': <function spacy.language.Language.<lambda>(nlp, **cfg)>,
 'tagger': <function spacy.language.Language.<lambda>(nlp, **cfg)>,
 'morphologizer': <function spacy.language.Language.<lambda>(nlp, **cfg)>,
 'parser': <function spacy.language.Language.<lambda>(nlp, **cfg)>,
 'ner': <function spacy.language.Language.<lambda>(nlp, **cfg)>,
 'entity_linker': <function spacy.language.Language.<lambda>(nlp, **cfg)>,
 'similarity': <function spacy.language.Language.<lambda>(nlp, **cfg)>,
 'textcat': <function spacy.language.Language.<lambda>(nlp, **cfg)>,
 'sentencizer': <function spacy.language.Language.<lambda>(nlp, **cfg)>,
 'merge_noun_chunks': <function spacy.language.Language.<lambda>(nlp, **cfg)>,
 'merge_entities': <function spacy.language.Language.<lambda>(nlp, **cfg)>,
 'merge_subtokens': <function spacy.language.Language.<lambda>(nlp, **cfg)>,
 'entity_ruler': <function spacy.language.Language.<lambda>(nlp, **cfg)>}
In [ ]:
ner_obj = nlp.disable_pipes("ner")
ner_obj
Out[0]:
[('ner', <spacy.pipeline.pipes.EntityRecognizer at 0x1a2daf5348>)]
In [ ]:
nlp.pipe_names
Out[0]:
['tagger', 'parser']
In [ ]:
ner_obj.restore()
In [ ]:
nlp.pipe_names
Out[0]:
['tagger', 'parser', 'ner']
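Components that are never needed can also be excluded when the model is loaded, so they never run at all. A minimal sketch (an addition; the disable argument of spacy.load is a real option, the nlp_small name is just illustrative):
In [ ]:
nlp_small = sp.load("en_core_web_sm", disable=["parser", "ner"])  # only the tagger (plus the tokenizer) will run
nlp_small.pipe_names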
Adding a custom pipeline component
In [ ]:
def Upperizer(sentence):
    # custom component: print the upper-cased text, then return the Doc
    # so that any component placed after it still receives a Doc object
    print(sentence.text.upper())
    return sentence
In [ ]:
nlp.remove_pipe("Upperizer")  # remove the component if it was added in an earlier run (avoids a duplicate-name error)
Out[0]:
('Upperizer', <function __main__.Upperizer(sentence)>)
In [ ]:
nlp.add_pipe(Upperizer)
In [ ]:
nlp.pipe_names
Out[0]:
['tagger', 'parser', 'ner', 'Upperizer']
In [ ]:
tst = nlp("This is test line for my function.")
THIS IS TEST LINE FOR MY FUNCTION.

Chunking

In [ ]:
txt
Out[0]:
'Commercial writers know that most people don’t want to read 1,000 \nwords of closely-spaced text in order to see what they are writing about, so \nthey also like to keep sentences and paragraphs short. \nThey’ll even use lots of sub-headers so you can see what each paragraph is about \nbefore you read it.'
In [ ]:
for wd in obj.noun_chunks:
    print((wd.text,wd.root.text))
('Commercial writers', 'writers') ('most people', 'people') ('1,000 \nwords', 'words') ('closely-spaced text', 'text') ('order', 'order') ('what', 'what') ('they', 'they') ('they', 'they') ('sentences', 'sentences') ('They', 'They') ('lots', 'lots') ('sub-headers', 'headers') ('you', 'you') ('what', 'what') ('each paragraph', 'paragraph') ('you', 'you') ('it', 'it')
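Each noun chunk also exposes the syntactic role of its root token; a small sketch (an addition, using the same obj) that prints the root's dependency label and the head word it attaches to:
In [ ]:
for chunk in obj.noun_chunks:
    # chunk.root.dep_ is the dependency label of the chunk's root, chunk.root.head is the token it attaches to
    print((chunk.text, chunk.root.dep_, chunk.root.head.text))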

Computing word similarity

In [ ]:
from nltk.corpus import wordnet as wn
In [ ]:
wn.synsets("like")
Out[0]:
[Synset('like.n.01'),
 Synset('like.n.02'),
 Synset('wish.v.02'),
 Synset('like.v.02'),
 Synset('like.v.03'),
 Synset('like.v.04'),
 Synset('like.v.05'),
 Synset('like.a.01'),
 Synset('like.a.02'),
 Synset('alike.a.01'),
 Synset('comparable.s.02')]
In [ ]:
w1 = wn.synset("good.n.01")
w2 = wn.synset("good.n.01")
w1.wup_similarity(w2)
Out[0]:
1.0
In [ ]:
w1 = wn.synset("good.n.01")
w2 = wn.synset("better.n.01")
w1.wup_similarity(w2)
Out[0]:
0.6153846153846154
In [ ]:
w1 = wn.synset("dog.n.01")
w2 = wn.synset("cat.n.01")
In [ ]:
print(w1.wup_similarity(w2)*100)
85.71428571428571
In [ ]:
txt1 = "dog cat lion elephant"
In [ ]:
low1 = txt1.split(" ")
for wd1 in low1:
    w1 = wn.synsets(wd1)[0].name()   # name of the word's first synset, e.g. 'dog.n.01'
    ss1 = wn.synset(w1)              # look the Synset object up by that name
    for wd2 in low1:
        w2 = wn.synsets(wd2)[0].name()
        ss2 = wn.synset(w2)
        # Wu-Palmer similarity between the two synsets, scaled to a percentage
        print("Word similarity:", (wd1, wd2, ss1.wup_similarity(ss2)*100))
Note: the code above works because each word's first synset is looked up (wn.synsets(word)[0]) and the resulting Synset object is used for the comparison; an earlier version skipped that step. It prints output like this:

Word similarity: ('dog', 'dog', 92.85714285714286)
Word similarity: ('dog', 'cat', 85.71428571428571)
Word similarity: ('dog', 'lion', 82.75862068965517)
Word similarity: ('dog', 'elephant', 81.48148148148148)
Word similarity: ('cat', 'dog', 85.71428571428571)
Word similarity: ('cat', 'cat', 100.0)
Word similarity: ('cat', 'lion', 89.65517241379311)
Word similarity: ('cat', 'elephant', 81.48148148148148)
Word similarity: ('lion', 'dog', 82.75862068965517)
Word similarity: ('lion', 'cat', 89.65517241379311)
Word similarity: ('lion', 'lion', 100.0)
Word similarity: ('lion', 'elephant', 78.57142857142857)
Word similarity: ('elephant', 'dog', 81.48148148148148)
Word similarity: ('elephant', 'cat', 81.48148148148148)
Word similarity: ('elephant', 'lion', 78.57142857142857)
Word similarity: ('elephant', 'elephant', 100.0)

n-grams (using nltk and sklearn-CountVectorizer())

  • bi-grams
  • tri-grams
  • n-grams
In [ ]:
# Example documents for sentiment-style features:
# "I like food"        -> positive
# "I don't like food"  -> negative
In [ ]:
# With ngram_range=(1, 2) the features of "I like food" are
# the unigrams and bigrams: I, like, food, I like, like food
In [ ]:
# ngram_range=(1, 1)  -> I, like, food
# ngram_range=(1, 2)  -> I, like, food, I like, like food
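The outline also mentions sklearn's CountVectorizer; a minimal sketch (an addition, with a toy corpus taken from the comments above) showing how ngram_range controls which n-grams become features:
In [ ]:
from sklearn.feature_extraction.text import CountVectorizer

corpus = ["I like food", "I don't like food"]   # toy corpus
cv = CountVectorizer(ngram_range=(1, 2))        # build unigram and bigram features
X = cv.fit_transform(corpus)
print(cv.get_feature_names())                   # learned vocabulary (get_feature_names_out() in newer scikit-learn)
print(X.toarray())                              # per-document n-gram counts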
In [ ]:
from nltk import bigrams,trigrams,ngrams
In [ ]:
sent1 = "I don't like food"
sow = sent1.split(" ")
sow
Out[0]:
['I', "don't", 'like', 'food']

bi-grams

In [ ]:
list(bigrams(sow))
Out[0]:
[('I', "don't"), ("don't", 'like'), ('like', 'food')]

How to rebuild phrases from the bi-grams

In [ ]:
for grams in list(bigrams(sow)):
    print(" ".join(grams))
I don't don't like like food

tri-grams

In [ ]:
list(trigrams(sow))
Out[0]:
[('I', "don't", 'like'), ("don't", 'like', 'food')]
In [ ]:
for grams in list(trigrams(sow)):
    print(" ".join(grams))
I don't like don't like food

n-grams

In [ ]:
list(ngrams(sow,4))
Out[0]:
[('I', "don't", 'like', 'food')]
In [ ]:
for grams in list(ngrams(sow,4)):
    print(" ".join(grams))
I don't like food
In [ ]:
for grams in list(ngrams(sow,1)):
    print(" ".join(grams))
I don't like food
In [ ]:
for grams in list(ngrams(sow,2)):
    print(" ".join(grams))
I don't don't like like food
In [ ]:
for grams in list(ngrams(sow,3)):
    print(" ".join(grams))
I don't like don't like food