In [1]:
import yake
import pke
from nltk.corpus import stopwords

References

Next Steps

Two approaches are discussed in Text Analytics with Python and its code:

  1. Collocations
  2. Weighted tag-based phrase extraction (extract noun-phrase chunks using shallow parsing, compute tf-idf weights for each chunk, and return the top-weighted phrases)
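
As a rough illustration of the first approach, the sketch below ranks adjacent word pairs by pointwise mutual information (PMI) using only the standard library. The regex tokenizer, the tiny stopword set, and the `min_count` cutoff are assumptions made for this example; it is not the book's implementation.

```python
import math
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption, not NLTK's list)
STOPWORDS = frozenset({"the", "a", "of", "is", "and", "in", "to"})

def pmi_collocations(text, min_count=2, stopwords=STOPWORDS):
    """Rank adjacent word pairs by pointwise mutual information."""
    tokens = re.findall(r"[a-z]+", text.lower())
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    total = len(tokens)
    scored = []
    for (w1, w2), count in bigrams.items():
        # skip rare pairs and pairs that contain a stopword
        if count < min_count or w1 in stopwords or w2 in stopwords:
            continue
        # PMI = log2( P(w1, w2) / (P(w1) * P(w2)) )
        p_pair = count / (total - 1)
        p_w1 = unigrams[w1] / total
        p_w2 = unigrams[w2] / total
        scored.append(((w1, w2), math.log2(p_pair / (p_w1 * p_w2))))
    return sorted(scored, key=lambda item: item[1], reverse=True)

sample = ("machine learning is fun and machine learning is useful "
          "in data science because data science needs machine learning")
collocations = pmi_collocations(sample)
```

PMI favours pairs whose words rarely appear apart, so a frequency floor such as `min_count` is usually needed to keep one-off pairs from dominating the ranking.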

Applied Text Analysis with Python also covers context-aware text analysis: https://learning.oreilly.com/library/view/Applied+Text+Analysis+with+Python/97814919630

In [2]:
#!python -m nltk.downloader stopwords
#!python -m nltk.downloader universal_tagset
#!python -m spacy download en # download the english model
In [3]:
text_content = """

	Sources tell us that Google is acquiring Kaggle, a platform that hosts data science and machine learning
	competitions. Details about the transaction remain somewhat vague , but given that Google is hosting
	its Cloud Next conference in San Francisco this week, the official announcement could come as early
	as tomorrow.  Reached by phone, Kaggle co-founder CEO Anthony Goldbloom declined to deny that the
	acquisition is happening. Google itself declined 'to comment on rumors'.
"""
In [4]:
blog_content = """
Red Hat decided to invest in Podman because it’s an easy change for anyone used to the Docker command line, but doesn’t require a daemon to be running when one isn’t needed. Also, Podman shares many of the same underlying components with other container engines, including CRI-O, providing a proving ground for new and different container runtimes, and other experiments in underlying technology like CRIU, Udica, Kata Containers, gVisor, etc.

"""

Here we use the Python keyphrase extraction module (pke) to demonstrate extracting keyphrases from a document.

Python Keyphrase Extraction

Unsupervised keyphrase extraction (KPE) methods are either corpus-dependent or corpus-independent.

Corpus-independent methods require no input other than the document itself. They are typically graph-based (e.g. SingleRank), with exceptions such as KeyCluster (???) and TopicRank (written by the pke author).

Graph-based methods represent the words of the document as nodes, with edges representing co-occurrence. Edges may be weighted by the number of co-occurrences. The nodes are then scored with a ranking metric such as PageRank. Scores for multi-word phrases are computed by aggregating the individual word scores. Finally, the multi-word phrases that match a given pattern of POS tags are kept as candidate phrases and ranked by their scores.
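
The graph-based pipeline described above can be sketched in plain Python. This is a simplified illustration, not pke's implementation: the function names, the word list, the `window` convention, and the damping factor of 0.85 are all assumptions made for the example.

```python
from collections import defaultdict

def build_graph(words, window=2):
    """Undirected co-occurrence graph; edge weight = number of co-occurrences
    of two different words within `window` consecutive words."""
    graph = defaultdict(lambda: defaultdict(float))
    for i, w1 in enumerate(words):
        for w2 in words[i + 1:i + window]:
            if w1 != w2:
                graph[w1][w2] += 1.0
                graph[w2][w1] += 1.0
    return graph

def pagerank(graph, damping=0.85, iterations=50):
    """Weighted PageRank by power iteration over the co-occurrence graph."""
    nodes = list(graph)
    scores = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        new_scores = {}
        for n in nodes:
            # each neighbour m passes on a share of its score proportional
            # to the weight of the edge (m, n)
            rank = sum(scores[m] * graph[m][n] / sum(graph[m].values())
                       for m in graph[n])
            new_scores[n] = (1 - damping) / len(nodes) + damping * rank
        scores = new_scores
    return scores

def score_phrase(phrase, word_scores):
    """Aggregate word scores into a phrase score by summing them."""
    return sum(word_scores.get(w, 0.0) for w in phrase.split())

# Hand-picked content words (assumed already filtered by POS tag)
words = ["google", "kaggle", "platform", "data", "science", "machine",
         "learning", "competitions", "google", "cloud", "conference"]
graph = build_graph(words, window=2)
word_scores = pagerank(graph)
phrase_score = score_phrase("cloud conference", word_scores)
```

Summing word scores favours longer phrases, which is one reason a long candidate can outrank shorter, more frequent ones.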

Graph Based - TopicRank

In [5]:
#initialize keyphrase extraction model
extractor = pke.unsupervised.TopicRank()
In [6]:
# load the content of the document (string or file)
extractor.load_document(text_content)
In [7]:
# keyphrase candidate selection using heuristics (e.g. sequences of nouns and adjectives)
extractor.candidate_selection()
In [8]:
# rank the candidates (TopicRank uses a random walk algorithm)
extractor.candidate_weighting()
In [9]:
# keyphrase formation from the top ranked scored candidates
keyphrases = extractor.get_n_best(n=10)
In [10]:
keyphrases
Out[10]:
[('google', 0.10549899364530257),
 ('kaggle', 0.08646825869098751),
 ('competitions', 0.06307172703237668),
 ('san francisco', 0.06268047843373921),
 ('details', 0.06252563436526277),
 ('machine', 0.061955230945746176),
 ('science', 0.05896549771261056),
 ('week', 0.056938286219298764),
 ('transaction', 0.05679456706887484),
 ('cloud next conference', 0.05616106987604108)]

Graph Based - Single Rank

In [11]:
# define the set of valid POS
pos = {'NOUN', 'PROPN', 'ADJ'}
In [12]:
# create a SingleRank extractor
extractor = pke.unsupervised.SingleRank()
In [14]:
# load the content of the document
extractor.load_document(input=text_content, language='en', normalization=None)
In [15]:
# candidate selection: select the longest sequences of nouns and adjectives as candidates
extractor.candidate_selection(pos=pos)
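
# The sketch below illustrates the "longest sequence" heuristic with only the
# standard library. It is not pke's code: the helper name is hypothetical and
# the POS tags are hand-assigned for the example (in practice they would come
# from a tagger).

```python
from itertools import groupby

VALID_POS = {"NOUN", "PROPN", "ADJ"}

def longest_pos_sequences(tagged_tokens, valid_pos=VALID_POS):
    """Return maximal runs of consecutive tokens whose tag is in valid_pos."""
    candidates = []
    # groupby splits the token stream into alternating valid/invalid runs
    for is_valid, group in groupby(tagged_tokens,
                                   key=lambda pair: pair[1] in valid_pos):
        if is_valid:
            candidates.append(" ".join(word.lower() for word, _ in group))
    return candidates

# Hand-tagged tokens (assumed tags, for illustration only)
tagged = [("the", "DET"), ("official", "ADJ"), ("announcement", "NOUN"),
          ("could", "AUX"), ("come", "VERB"), ("tomorrow", "NOUN")]
candidates = longest_pos_sequences(tagged)
```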
In [16]:
# weight the candidates using the sum of their words' scores, which are
#    computed using a random walk. In the graph, nodes are words of
#    certain parts of speech (nouns and adjectives) that are connected if
#    they occur in a window of 10 words.
extractor.candidate_weighting(window=10, pos=pos)
In [17]:
# get the 10-highest scored candidates as phrases
keyphrases = extractor.get_n_best(n=10)
In [18]:
keyphrases
Out[18]:
[('- founder ceo anthony goldbloom', 0.19387528038050295),
 ('cloud next conference', 0.11718267397129922),
 ('san francisco', 0.0825604116847842),
 ('official announcement', 0.07275616993773729),
 ('kaggle', 0.06799465834471034),
 ('google', 0.0675311728471354),
 ('phone', 0.039603826387358404),
 ('competitions', 0.039540893204963644),
 ('week', 0.03864227310165734),
 ('machine', 0.03856957359641699)]

SingleRank performs better for shorter documents.

Statistical models

KPMiner

In [69]:
#create KPMiner extractor
extractor = pke.unsupervised.KPMiner()
In [71]:
# load the content of document
extractor.load_document(blog_content)
In [72]:
# select {1-5}-grams that do not contain punctuation marks or stopwords as
#    keyphrase candidates. Set the least allowable seen frequency (lasf) to 5
#    and the cutoff (the number of words after which candidates are filtered
#    out) to 200.
lasf = 5
cutoff = 200
extractor.candidate_selection(lasf=lasf, cutoff=cutoff)

The weighting step below requires corpus-dependent document frequencies.

In [74]:
# weight the candidate using KPMiner weighting function
#df = pke.load_document_frequency_file(input_file='path/to/df.tsv.gz')
#alpha = 2.3
#sigma = 3.0
#extractor.candidate_weighting(df=df, alpha=alpha, sigma=sigma)
# get the 10 highest-scored candidates as keyphrases
#keyphrases = extractor.get_n_best(n=10)
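
To show what "corpus-dependent document frequency" means, here is a minimal standard-library sketch that counts in how many documents each term occurs and uses the counts for a tf-idf weight. The three-document corpus, the tokenizer, and the function names are assumptions for illustration; this is neither pke's file format nor the KPMiner weighting function.

```python
import math
import re
from collections import Counter

def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())

def document_frequency(corpus):
    """Count, for each term, the number of documents that contain it."""
    df = Counter()
    for doc in corpus:
        df.update(set(tokenize(doc)))  # set(): count each term once per doc
    return df

def tf_idf(doc, df, n_docs):
    """Weight each term of doc by term frequency * log2(n_docs / df)."""
    tf = Counter(tokenize(doc))
    return {term: count * math.log2(n_docs / df[term])
            for term, count in tf.items() if df[term] > 0}

# Toy corpus (made up for the example)
corpus = ["Podman runs containers without a daemon",
          "Docker runs containers with a daemon",
          "Kaggle hosts competitions"]
df = document_frequency(corpus)
weights = tf_idf(corpus[0], df, n_docs=len(corpus))
```

Terms that appear in many documents of the corpus get a low weight, which is why this family of methods cannot run on a single document alone.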

Yake

In [63]:
extractor = pke.unsupervised.YAKE()
In [64]:
extractor.load_document(blog_content)
In [65]:
stoplist = stopwords.words('english')
extractor.candidate_selection(n=3, stoplist=stoplist)
In [66]:
# weight the candidates using the YAKE weighting scheme; a window (in words) for computing left/right contexts can be specified
window = 2
use_stems = False #use stems instead of words for weighting
extractor.candidate_weighting(window=window, stoplist=stoplist, use_stems=use_stems)
In [67]:
# get the 10 highest-scored candidates as keyphrases
# redundant keyphrases are removed from the output using Levenshtein distance and a threshold
threshold = 0.8
keyphrases = extractor.get_n_best(n=10, threshold=threshold)
In [68]:
keyphrases
Out[68]:
[('red hat decided', 0.0008888723918774053),
 ('docker command line', 0.0017485528499129259),
 ('red hat', 0.005022644514962233),
 ('hat decided', 0.014417621538266964),
 ('docker command', 0.014417621538266964),
 ('command line', 0.020822996440012925),
 ('easy change', 0.03080944329280461),
 ('anyone used', 0.03080944329280461),
 ('kata containers', 0.034859532506968334),
 ('technology like criu', 0.0442574269450327)]
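
The redundancy filter mentioned in the cell above can be approximated as follows. This is a sketch under stated assumptions, not pke's code: it normalizes the Levenshtein distance by the longer string's length and drops a phrase when its similarity to an already kept phrase reaches the threshold. The helper names are hypothetical.

```python
def levenshtein(a, b):
    """Edit distance between two strings (dynamic programming, two rows)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

def similarity(a, b):
    """Similarity in [0, 1]; 1.0 means the strings are identical."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest

def deduplicate(ranked_phrases, threshold=0.8):
    """Keep a phrase only if it is not too similar to a better-ranked one."""
    kept = []
    for phrase in ranked_phrases:
        if all(similarity(phrase, other) < threshold for other in kept):
            kept.append(phrase)
    return kept

deduped = deduplicate(["docker command line", "docker command lines", "red hat"])
```

With this normalization, 'red hat decided' and 'red hat' have a similarity of about 0.47, so both would survive a 0.8 threshold, which is consistent with the output above.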
In [14]:
simple_kwextractor = yake.KeywordExtractor()
In [15]:
blog_keywords = simple_kwextractor.extract_keywords(blog_content)
In [16]:
blog_keywords
Out[16]:
[(0.01907888198684912, 'docker command line'),
 (0.01957252934780936, 'red hat decided'),
 (0.038999087844447805, 'red hat'),
 (0.06771843728775587, 'docker command'),
 (0.06874297250030004, 'hat decided'),
 (0.10290092596722104, 'command line'),
 (0.13316928160841723, 'udica'),
 (0.1498981946279202, 'hat'),
 (0.15600219041082652, 'n’t needed'),
 (0.1702198818373236, 'docker'),
 (0.18235640488829719, 'podman'),
 (0.19092908375150364, 'kata containers'),
 (0.22483474847238993, 'gvisor'),
 (0.2256808233632596, 'decided to invest'),
 (0.2256808233632596, 'easy change'),
 (0.2256808233632596, 'require a daemon'),
 (0.2452126575417378, 'kata'),
 (0.2485787634248192, 'including cri-o'),
 (0.25040493333313885, 'red'),
 (0.25040493333313885, 'line')]
In [12]:
custom_kwextractor.extract_keywords(blog_content)
Out[12]:
[(0.02828100634115418, 'similar low level'),
 (0.037840350767863545, 'red hat'),
 (0.07666479321354663, 'hat'),
 (0.08301061655304112, 'red'),
 (0.08587055055075608, 'low level research'),
 (0.08731194256472048, 'organizations have committed'),
 (0.08731194256472048, 'committed to similar'),
 (0.08731194256472048, 'similar low'),
 (0.08731194256472048, 'low level'),
 (0.10841773955745555, 'chrome web browser'),
 (0.11338895979911398, 'chrome web'),
 (0.1553826923536995, 'chrome'),
 (0.16360757338108564, 'google'),
 (0.19046322782305794, 'communities'),
 (0.19481132490025377, 'container'),
 (0.19828091918987265, 'web'),
 (0.20980449848411628, 'development'),
 (0.2178590850198492, 'lead'),
 (0.21843682213059992, 'native computing foundation'),
 (0.2285620361564212, 'research')]
In [4]:
keywords = simple_kwextractor.extract_keywords(text_content)
In [5]:
keywords
Out[5]:
[(0.04208883962734637, 'machine learning competitions'),
 (0.0887384781074878, 'learning competitions'),
 (0.08977421437281216, 'hosts data science'),
 (0.137173406118932, 'acquiring kaggle'),
 (0.1530583423439245, 'google'),
 (0.15371265551846858, 'platform that hosts'),
 (0.15371265551846858, 'machine learning'),
 (0.1625230334351792, 'san francisco'),
 (0.17665908177224074, 'kaggle'),
 (0.1796477364349524, 'hosts data'),
 (0.1796477364349524, 'data science'),
 (0.1796477364349524, 'science and machine'),
 (0.18244401210589162, 'ceo anthony goldbloom'),
 (0.2387322772174652, 'sources'),
 (0.2387322772174652, 'competitions'),
 (0.24212495657951347, 'francisco this week'),
 (0.2474315198387865, 'kaggle co-founder ceo'),
 (0.27627102189890745, 'ceo anthony'),
 (0.2885044790076837, 'google is acquiring'),
 (0.34141080647242195, 'acquiring')]
In [6]:
# specifying parameters
custom_kwextractor = yake.KeywordExtractor(lan="en", n=3, dedupLim=0.9, dedupFunc='seqm', windowsSize=1, top=20, features=None)
In [7]:
keywords = custom_kwextractor.extract_keywords(text_content)
In [8]:
keywords
Out[8]:
[(0.018859382212529502, 'machine learning competitions'),
 (0.030832095331371854, 'hosts data science'),
 (0.05754330797217759, 'learning competitions'),
 (0.09222227644002941, 'platform that hosts'),
 (0.09222227644002941, 'hosts data'),
 (0.09222227644002941, 'data science'),
 (0.09222227644002941, 'science and machine'),
 (0.09222227644002941, 'machine learning'),
 (0.09384707176893009, 'ceo anthony goldbloom'),
 (0.09509185681479737, 'acquiring kaggle'),
 (0.09992928718426312, 'google'),
 (0.10574223667473541, 'san francisco'),
 (0.13879597333090907, 'kaggle co-founder ceo'),
 (0.149417125928561, 'kaggle'),
 (0.15542345042439917, 'google is acquiring'),
 (0.1593647044964056, 'francisco this week'),
 (0.18460659509977959, 'ceo anthony'),
 (0.18460659509977959, 'anthony goldbloom'),
 (0.18725536044688476, 'sources'),
 (0.18725536044688476, 'competitions')]
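
When reading the yake output above, note that a lower score means a more relevant keyword, which is why the lists are in ascending order. The hypothetical helper below picks the top keywords from `(score, keyword)` pairs shaped like the output above; the sample scores are rounded copies of three rows from Out[8].

```python
def top_keywords(scored, k=5):
    """Return the k most relevant keywords from (score, keyword) pairs.

    Lower YAKE scores are better, so sort in ascending order.
    """
    return [keyword for _, keyword in sorted(scored, key=lambda pair: pair[0])[:k]]

# Rounded rows from the output above, deliberately shuffled
sample = [(0.0999, "google"),
          (0.0189, "machine learning competitions"),
          (0.0308, "hosts data science")]
best = top_keywords(sample, k=2)
```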