Natural Language Processing using spaCy

  • Introduction to the spaCy library
  • Load the English model
  • Find out stop words
  • Create an nlp object from a given document (sentence)
  • Count the frequency of each word using hash values (using count_by(ORTH) and nlp.vocab.strings)
  • Print each word count using a dictionary comprehension
  • Print the index of each token
  • Print various attributes of the nlp object (i.e. is_alpha, tok.shape_, is_stop, tok.pos_, tok.tag_)
  • Stemming (using nltk)
    • using PorterStemmer()
    • using SnowballStemmer()
  • Lemmatization
  • Display a dependency tree view of words using displacy.render()
  • How to get the meaning of any tag or label using spacy.explain()
  • How to find named entities (NER: Named Entity Recognition) in a given doc
  • Display named entities in a doc using displacy.render
  • Remove stop words/punctuation using the is_stop & is_punct attributes
  • Create a list of words after removing stop words, then rebuild the sentence
  • Sentence and Word Tokenization
  • Pipelining:
    • Get all the available factory pipeline components
    • How to disable preloaded pipeline components to speed up processing
    • Adding custom pipeline components
  • Reading a file and displaying entities
  • Chunking
  • Computing word similarity
  • n-grams (using nltk and sklearn's CountVectorizer())
    • bi-grams
    • tri-grams
    • n-grams
In [1]:
import spacy as sp
from spacy import displacy # used for data visualization
from spacy.lang.en.stop_words import STOP_WORDS
from spacy.attrs import ORTH # to be used for word count
In [2]:
nlp = sp.load("en_core_web_sm") # ref: https://spacy.io/models/en
To load the English model, first download it:

!python -m spacy download en_core_web_sm

In [3]:
txt = """Commercial writers know that most people don’t want to read 1,000 
words of closely-spaced text in order to see what they are writing about, so 
they also like to keep sentences and paragraphs short. 
They’ll even use lots of sub-headers so you can see what each paragraph is about 
before you read it."""
In [4]:
obj = nlp(txt)
How to get all the words from the text
In [9]:
for wd in obj:
    print(wd.text)
Commercial writers know that most people do n’t want to read 1,000 words of closely - spaced text in order to see what they are writing about , so they also like to keep sentences and paragraphs short . They ’ll even use lots of sub - headers so you can see what each paragraph is about before you read it .
Find out stop words
In [11]:
for wd in obj:
    print((wd.text,wd.is_stop))
('Commercial', False) ('writers', False) ('know', False) ('that', True) ('most', True) ('people', False) ('do', True) ('n’t', True) ('want', False) ('to', True) ('read', False) ('1,000', False) ('\n', False) ('words', False) ('of', True) ('closely', False) ('-', False) ('spaced', False) ('text', False) ('in', True) ('order', False) ('to', True) ('see', True) ('what', True) ('they', True) ('are', True) ('writing', False) ('about', True) (',', False) ('so', True) ('\n', False) ('they', True) ('also', True) ('like', False) ('to', True) ('keep', True) ('sentences', False) ('and', True) ('paragraphs', False) ('short', False) ('.', False) ('\n', False) ('They', True) ('’ll', True) ('even', True) ('use', False) ('lots', False) ('of', True) ('sub', False) ('-', False) ('headers', False) ('so', True) ('you', True) ('can', True) ('see', True) ('what', True) ('each', True) ('paragraph', False) ('is', True) ('about', True) ('\n', False) ('before', True) ('you', True) ('read', False) ('it', True) ('.', False)
Get each sentence from the nlp object
In [16]:
for sent in obj.sents:
    print(sent.text)
Commercial writers know that most people don’t want to read 1,000 words of closely-spaced text in order to see what they are writing about, so they also like to keep sentences and paragraphs short. They’ll even use lots of sub-headers so you can see what each paragraph is about before you read it.
To split each sentence into separate words
In [23]:
for sent in obj.sents:
    for wd in sent:
        print(wd.text)
Commercial writers know that most people do n’t want to read 1,000 words of closely - spaced text in order to see what they are writing about , so they also like to keep sentences and paragraphs short . They ’ll even use lots of sub - headers so you can see what each paragraph is about before you read it .
Compare this with a naive str.split(" "): punctuation and newlines stay attached to the words.
In [26]:
for sent in obj.sents:
    print(sent.text.split(" "))
['Commercial', 'writers', 'know', 'that', 'most', 'people', 'don’t', 'want', 'to', 'read', '1,000', '\nwords', 'of', 'closely-spaced', 'text', 'in', 'order', 'to', 'see', 'what', 'they', 'are', 'writing', 'about,', 'so', '\nthey', 'also', 'like', 'to', 'keep', 'sentences', 'and', 'paragraphs', 'short.', '\n'] ['They’ll', 'even', 'use', 'lots', 'of', 'sub-headers'] ['so', 'you', 'can', 'see', 'what', 'each', 'paragraph', 'is', 'about', '\nbefore', 'you', 'read', 'it.']
Count the frequency of each word using hash values (using count_by(ORTH) and nlp.vocab.strings)
In [27]:
obj.count_by(ORTH)
Out[27]:
{6679199052911211715: 1,
 357501887436434592: 1,
 7743033266031195906: 1,
 4380130941430378203: 1,
 11104729984170784471: 1,
 7593739049417968140: 1,
 2158845516055552166: 1,
 16712971838599463365: 1,
 7597692042947428029: 1,
 3791531372978436496: 3,
 11792590063656742891: 2,
 18254674181385630108: 1,
 962983613142996970: 4,
 10289140944597012527: 1,
 886050111519832510: 2,
 9696970313201087903: 1,
 9153284864653046197: 2,
 16159022834684645410: 1,
 15099781594404091470: 1,
 3002984154512732771: 1,
 13136985495629980461: 1,
 11925638236994514241: 2,
 5865838185239622912: 2,
 16875582379069451158: 2,
 5012629990875267006: 1,
 9147119992364589469: 1,
 942632335873952620: 2,
 2593208677638477497: 1,
 9781598966686434415: 2,
 12084876542534825196: 1,
 18194338103975822726: 1,
 9099225972875567996: 1,
 5257340109698985342: 1,
 2283656566040971221: 1,
 12626284911390218812: 1,
 3563698965725164461: 1,
 12646065887601541794: 2,
 14947529218328092544: 1,
 17092777669037358890: 1,
 17339226045912991082: 1,
 6873750497785110593: 1,
 17842523177576739921: 1,
 144868287865513341: 1,
 18375123465971211096: 1,
 7624161793554793053: 2,
 6635067063807956629: 1,
 5379624210385286023: 1,
 9194963477161408182: 1,
 3411606890003347522: 1,
 11320251846592927908: 1,
 10239237003504588839: 1}
In [45]:
for k,v in obj.count_by(ORTH).items():
    print((nlp.vocab.strings[k],v))
('Commercial', 1) ('writers', 1) ('know', 1) ('that', 1) ('most', 1) ('people', 1) ('do', 1) ('n’t', 1) ('want', 1) ('to', 3) ('read', 2) ('1,000', 1) ('\n', 4) ('words', 1) ('of', 2) ('closely', 1) ('-', 2) ('spaced', 1) ('text', 1) ('in', 1) ('order', 1) ('see', 2) ('what', 2) ('they', 2) ('are', 1) ('writing', 1) ('about', 2) (',', 1) ('so', 2) ('also', 1) ('like', 1) ('keep', 1) ('sentences', 1) ('and', 1) ('paragraphs', 1) ('short', 1) ('.', 2) ('They', 1) ('’ll', 1) ('even', 1) ('use', 1) ('lots', 1) ('sub', 1) ('headers', 1) ('you', 2) ('can', 1) ('each', 1) ('paragraph', 1) ('is', 1) ('before', 1) ('it', 1)
Print each word count using a dictionary comprehension
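A minimal sketch (reusing obj and nlp from above): the same frequency table as the previous cell, resolved to readable strings in a single dictionary comprehension.

word_freq = {nlp.vocab.strings[k]: v for k, v in obj.count_by(ORTH).items()}
print(word_freq)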
Print the index of each token
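A short sketch: every spaCy token carries its position in the doc in the .i attribute.

for tok in obj:
    print(tok.text, tok.i)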
Print various attributes of the nlp object (i.e. is_alpha, tok.shape_, is_stop, tok.pos_, tok.tag_)
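A sketch printing one row per token, using the listed token attributes:

for tok in obj:
    # text, alphabetic?, orthographic shape, stop word?, coarse POS, fine-grained tag
    print(tok.text, tok.is_alpha, tok.shape_, tok.is_stop, tok.pos_, tok.tag_)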
Stemming (using nltk)

using PorterStemmer() and using SnowballStemmer()
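spaCy itself ships no stemmer, so a minimal nltk sketch (assuming nltk is installed; the sample words are illustrative):

from nltk.stem import PorterStemmer, SnowballStemmer

porter = PorterStemmer()
snowball = SnowballStemmer("english")

for word in ["writers", "writing", "sentences", "closely"]:
    # Porter is the classic algorithm; Snowball ("Porter2") is its refinement
    print(word, porter.stem(word), snowball.stem(word))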

Lemmatization
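Unlike stemming, lemmatization is part of spaCy's default pipeline; a sketch using the .lemma_ attribute (the example sentence is illustrative):

for tok in nlp("They are writing sentences and paragraphs"):
    print(tok.text, tok.lemma_)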
Display a dependency tree view of words using displacy.render()
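A sketch of the dependency-tree view (jupyter=True renders inline in a notebook; the sentence is illustrative):

displacy.render(nlp("Commercial writers keep sentences short"), style="dep", jupyter=True)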
How to get the meaning of any tag or label using spacy.explain()
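A sketch: explain() expands part-of-speech tags, dependency labels, and entity labels into plain English (the example labels are illustrative; sp is the spacy alias imported above).

print(sp.explain("PROPN"))  # 'proper noun'
print(sp.explain("nsubj"))  # 'nominal subject'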
How to find named entities (NER: Named Entity Recognition) in a given doc
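A sketch using the doc.ents property (the sample sentence is illustrative):

doc = nlp("Apple is looking at buying a U.K. startup for $1 billion")
for ent in doc.ents:
    # entity text, its label, and the label's plain-English meaning
    print(ent.text, ent.label_, sp.explain(ent.label_))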
Display named entities in the doc using displacy.render
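A sketch: the same render() call with style="ent" highlights the entities inline (reusing doc from the sketch above):

displacy.render(doc, style="ent", jupyter=True)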
Remove stop words/punctuation using the is_stop & is_punct attributes
Create a list of words after removing stop words, then rebuild the sentence
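A sketch covering both steps (reusing obj from above): keep only tokens that are neither stop words nor punctuation, then join the survivors back into a reduced sentence.

kept = [tok.text for tok in obj if not tok.is_stop and not tok.is_punct]
print(" ".join(kept))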
Sentence and Word Tokenization
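A combined sketch (reusing obj): obj.sents yields sentence spans, and iterating a span yields its word tokens.

for i, sent in enumerate(obj.sents):
    print(i, sent.text.strip())
    print([tok.text for tok in sent])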