In [1]:
import yake
import pke
from nltk.corpus import stopwords


Next Steps

Two approaches are discussed in Text Analytics with Python, with code:

  1. Collocations
  2. Weighted tag-based phrase extraction (extract noun-phrase chunks using shallow parsing, compute tf-idf weights for each chunk, and return the top-weighted phrases)
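As a minimal sketch of the first approach, NLTK's collocation finder scores word pairs that co-occur frequently. The sample sentence below is a made-up stand-in; any tokenized document works:

```python
# Sketch: collocation extraction with NLTK's bigram finder.
# The sample text is a stand-in for a real document.
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

tokens = ("data science and machine learning competitions reward "
          "strong machine learning and data science skills").split()

finder = BigramCollocationFinder.from_words(tokens)
# rank bigrams by raw frequency (PMI and other measures are also available)
top_bigrams = finder.nbest(BigramAssocMeasures.raw_freq, 3)
print(top_bigrams)
```

Repeated pairs such as "data science" and "machine learning" surface at the top.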

The following also provides context-aware text analysis.

In [2]:
#!python -m nltk.downloader stopwords
#!python -m nltk.downloader universal_tagset
#!python -m spacy download en # download the english model
In [3]:
text_content = """

	Sources tell us that Google is acquiring Kaggle, a platform that hosts data science and machine learning
	competitions. Details about the transaction remain somewhat vague , but given that Google is hosting
	its Cloud Next conference in San Francisco this week, the official announcement could come as early
	as tomorrow.  Reached by phone, Kaggle co-founder CEO Anthony Goldbloom declined to deny that the
	acquisition is happening. Google itself declined 'to comment on rumors'.
"""
In [4]:
blog_content = """
Red Hat decided to invest in Podman because it’s an easy change for anyone used to the Docker command line, but doesn’t require a daemon to be running when one isn’t needed. Also, Podman shares many of the same underlying components with other container engines, including CRI-O, providing a proving ground for new and different container runtimes, and other experiments in underlying technology like CRIU, Udica, Kata Containers, gVisor, etc.
"""


Here we use the python keyphrase extraction module (pke) to demonstrate extracting keyphrases from a document.

Python Keyphrase Extraction

Unsupervised KPE methods fall into two groups: corpus-dependent and corpus-independent.

Corpus-independent methods require no input other than the document itself. They are typically graph-based (e.g. SingleRank), with exceptions such as KeyCluster and TopicRank (written by the pke author).

Graph-based methods represent words from the document as nodes, with edges representing co-occurrence; edges may be weighted by the number of co-occurrences. The nodes are then scored using a ranking metric such as PageRank. Multi-word phrase scores are computed by aggregating the individual word scores. Finally, the multi-word phrases that fit a certain pattern of POS tags are taken as candidate phrases and ranked by their scores.
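As a toy illustration of this pipeline (not pke's implementation), the sketch below builds a weighted co-occurrence graph from a stand-in word list and scores nodes with power-iteration PageRank. In a real extractor the words would first be filtered by POS:

```python
from collections import defaultdict

def cooccurrence_graph(words, window=2):
    """Undirected graph; edge weight = co-occurrence count within the window."""
    graph = defaultdict(lambda: defaultdict(int))
    for i, w in enumerate(words):
        for v in words[i + 1:i + 1 + window]:
            if v != w:
                graph[w][v] += 1
                graph[v][w] += 1
    return graph

def pagerank(graph, damping=0.85, iters=50):
    """Weighted PageRank via simple power iteration."""
    n = len(graph)
    scores = {node: 1.0 / n for node in graph}
    for _ in range(iters):
        scores = {
            node: (1 - damping) / n + damping * sum(
                scores[nb] * w / sum(graph[nb].values())
                for nb, w in graph[node].items())
            for node in graph
        }
    return scores

# Stand-in "document": the best-connected word ends up ranked highest.
words = "google acquires kaggle kaggle hosts machine learning competitions".split()
scores = pagerank(cooccurrence_graph(words))
print(max(scores, key=scores.get))
```

Phrase scores would then be formed by summing the scores of each candidate's words.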

Graph Based - TopicRank

In [5]:
#initialize keyphrase extraction model
extractor = pke.unsupervised.TopicRank()
In [6]:
# load the content of the document (string or file)
extractor.load_document(input=text_content, language='en')
In [7]:
# keyphrase candidate selection using heuristics (e.g. sequences of nouns and adjectives)
extractor.candidate_selection()
In [8]:
# rank the candidates (TopicRank: random walk over a topic graph)
extractor.candidate_weighting()
In [9]:
# keyphrase formation from the top-ranked candidates
keyphrases = extractor.get_n_best(n=10)
In [10]:
[('google', 0.10549899364530257),
 ('kaggle', 0.08646825869098751),
 ('competitions', 0.06307172703237668),
 ('san francisco', 0.06268047843373921),
 ('details', 0.06252563436526277),
 ('machine', 0.061955230945746176),
 ('science', 0.05896549771261056),
 ('week', 0.056938286219298764),
 ('transaction', 0.05679456706887484),
 ('cloud next conference', 0.05616106987604108)]

Graph Based - Single Rank

In [11]:
# define the set of valid POS
pos = {'NOUN', 'PROPN', 'ADJ'}
In [12]:
# create a SingleRank extractor
extractor = pke.unsupervised.SingleRank()
In [14]:
# load the content of the document
extractor.load_document(input=text_content, language='en', normalization=None)
In [15]:
# candidate selection - select the longest sequences of nouns and adjectives as candidates
extractor.candidate_selection(pos=pos)
In [16]:
# weight the candidates using the sum of their word's scores that are
#    computed using random walk. In the graph, nodes are words of
#    certain part-of-speech (nouns and adjectives) that are connected if
#    they occur in a window of 10 words.
extractor.candidate_weighting(window=10, pos=pos)
In [17]:
# get the 10-highest scored candidates as phrases
keyphrases = extractor.get_n_best(n=10)
In [18]:
[('- founder ceo anthony goldbloom', 0.19387528038050295),
 ('cloud next conference', 0.11718267397129922),
 ('san francisco', 0.0825604116847842),
 ('official announcement', 0.07275616993773729),
 ('kaggle', 0.06799465834471034),
 ('google', 0.0675311728471354),
 ('phone', 0.039603826387358404),
 ('competitions', 0.039540893204963644),
 ('week', 0.03864227310165734),
 ('machine', 0.03856957359641699)]

SingleRank performs better for shorter documents.

Statistical models


In [69]:
# create a KPMiner extractor
extractor = pke.unsupervised.KPMiner()
In [71]:
# load the content of the document
extractor.load_document(input=text_content, language='en')
In [72]:
# select {1-5}-grams that do not contain punctuation marks or stopwords as
#    keyphrase candidates. Set the least allowable seen frequency to 5 and
#    the number of words after which candidates are filtered out to 200.
lasf = 5
cutoff = 200
extractor.candidate_selection(lasf=lasf, cutoff=cutoff)

The steps below require corpus-dependent document frequencies, so they are left commented out.

In [74]:
# weight the candidates using the KPMiner weighting function
#df = pke.load_document_frequency_file(input_file='path/to/df.tsv.gz')
#alpha = 2.3
#sigma = 3.0
#extractor.candidate_weighting(df=df, alpha=alpha, sigma=sigma)
# 5. get the 10-highest scored candidates as keyphrases
#keyphrases = extractor.get_n_best(n=10)
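KPMiner's weighting needs document frequencies computed over a background corpus (the df.tsv.gz file above). The counts themselves are simple to produce; the three-document corpus below is a toy stand-in:

```python
from collections import Counter

# Toy background corpus; in practice this would be many documents.
corpus = [
    "google is acquiring kaggle",
    "kaggle hosts machine learning competitions",
    "google hosts a cloud conference",
]

# Document frequency: the number of documents each term appears in.
# set() ensures a term is counted at most once per document.
df = Counter()
for doc in corpus:
    df.update(set(doc.split()))

print(df["google"], df["kaggle"], df["hosts"])  # 2 2 2
```

pke expects these counts serialized to a tsv file, but the underlying quantity is just this per-document term count.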


In [63]:
extractor = pke.unsupervised.YAKE()
In [64]:
# load the content of the document
extractor.load_document(input=blog_content, language='en')
In [65]:
stoplist = stopwords.words('english')
extractor.candidate_selection(n=3, stoplist=stoplist)
In [66]:
# weight the candidates using the YAKE weighting scheme; a window (in words)
#    for computing left/right contexts can be specified
window = 2
use_stems = False  # use stems instead of words for weighting
extractor.candidate_weighting(window=window, stoplist=stoplist, use_stems=use_stems)
In [67]:
# get the 10-highest scored candidates as keyphrases
# redundant keyphrases are removed from the output using Levenshtein distance and a threshold
threshold = 0.8
keyphrases = extractor.get_n_best(n=10, threshold=threshold)
In [68]:
[('red hat decided', 0.0008888723918774053),
 ('docker command line', 0.0017485528499129259),
 ('red hat', 0.005022644514962233),
 ('hat decided', 0.014417621538266964),
 ('docker command', 0.014417621538266964),
 ('command line', 0.020822996440012925),
 ('easy change', 0.03080944329280461),
 ('anyone used', 0.03080944329280461),
 ('kata containers', 0.034859532506968334),
 ('technology like criu', 0.0442574269450327)]
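The redundancy filter above can be pictured as follows: a candidate is dropped when its normalized Levenshtein similarity to an already-kept phrase exceeds the threshold. This is a simplified sketch of the idea, not pke's exact code:

```python
def levenshtein(a, b):
    """Classic edit distance via dynamic programming (two-row variant)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def dedupe(phrases, threshold=0.8):
    """Keep a phrase only if it is not too similar to one already kept."""
    kept = []
    for p in phrases:
        sim = lambda q: 1 - levenshtein(p, q) / max(len(p), len(q))
        if all(sim(q) < threshold for q in kept):
            kept.append(p)
    return kept

print(dedupe(["red hat", "red hats", "docker command line"]))
```

Here "red hats" is discarded because its similarity to "red hat" (7/8 = 0.875) exceeds the 0.8 threshold, while "docker command line" survives.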
In [14]:
simple_kwextractor = yake.KeywordExtractor()
In [15]:
blog_keywords = simple_kwextractor.extract_keywords(blog_content)
In [16]:
[(0.01907888198684912, 'docker command line'),
 (0.01957252934780936, 'red hat decided'),
 (0.038999087844447805, 'red hat'),
 (0.06771843728775587, 'docker command'),
 (0.06874297250030004, 'hat decided'),
 (0.10290092596722104, 'command line'),
 (0.13316928160841723, 'udica'),
 (0.1498981946279202, 'hat'),
 (0.15600219041082652, 'n’t needed'),
 (0.1702198818373236, 'docker'),
 (0.18235640488829719, 'podman'),
 (0.19092908375150364, 'kata containers'),
 (0.22483474847238993, 'gvisor'),
 (0.2256808233632596, 'decided to invest'),
 (0.2256808233632596, 'easy change'),
 (0.2256808233632596, 'require a daemon'),
 (0.2452126575417378, 'kata'),
 (0.2485787634248192, 'including cri-o'),
 (0.25040493333313885, 'red'),
 (0.25040493333313885, 'line')]
In [4]:
keywords = simple_kwextractor.extract_keywords(text_content)
In [5]:
[(0.04208883962734637, 'machine learning competitions'),
 (0.0887384781074878, 'learning competitions'),
 (0.08977421437281216, 'hosts data science'),
 (0.137173406118932, 'acquiring kaggle'),
 (0.1530583423439245, 'google'),
 (0.15371265551846858, 'platform that hosts'),
 (0.15371265551846858, 'machine learning'),
 (0.1625230334351792, 'san francisco'),
 (0.17665908177224074, 'kaggle'),
 (0.1796477364349524, 'hosts data'),
 (0.1796477364349524, 'data science'),
 (0.1796477364349524, 'science and machine'),
 (0.18244401210589162, 'ceo anthony goldbloom'),
 (0.2387322772174652, 'sources'),
 (0.2387322772174652, 'competitions'),
 (0.24212495657951347, 'francisco this week'),
 (0.2474315198387865, 'kaggle co-founder ceo'),
 (0.27627102189890745, 'ceo anthony'),
 (0.2885044790076837, 'google is acquiring'),
 (0.34141080647242195, 'acquiring')]
In [6]:
# specifying parameters
custom_kwextractor = yake.KeywordExtractor(lan="en", n=3, dedupLim=0.9, dedupFunc='seqm', windowsSize=1, top=20, features=None)
In [7]:
keywords = custom_kwextractor.extract_keywords(text_content)
In [8]:
[(0.018859382212529502, 'machine learning competitions'),
 (0.030832095331371854, 'hosts data science'),
 (0.05754330797217759, 'learning competitions'),
 (0.09222227644002941, 'platform that hosts'),
 (0.09222227644002941, 'hosts data'),
 (0.09222227644002941, 'data science'),
 (0.09222227644002941, 'science and machine'),
 (0.09222227644002941, 'machine learning'),
 (0.09384707176893009, 'ceo anthony goldbloom'),
 (0.09509185681479737, 'acquiring kaggle'),
 (0.09992928718426312, 'google'),
 (0.10574223667473541, 'san francisco'),
 (0.13879597333090907, 'kaggle co-founder ceo'),
 (0.149417125928561, 'kaggle'),
 (0.15542345042439917, 'google is acquiring'),
 (0.1593647044964056, 'francisco this week'),
 (0.18460659509977959, 'ceo anthony'),
 (0.18460659509977959, 'anthony goldbloom'),
 (0.18725536044688476, 'sources'),
 (0.18725536044688476, 'competitions')]