
Similar App Discovery Based on Text Similarity

1. Main Idea:

Use the vector representation of an app's text description as the app's feature vector, and measure the similarity between two apps by the cosine of the angle between their feature vectors.
Optimization: computing pairwise similarities over all apps is too expensive, so similarities are only computed between pairs of apps within the same category.
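
As a minimal sketch of this idea (toy vectors standing in for real TF-IDF features; the numbers are made up for illustration), the cosine similarity of two l2-normalized vectors is just their dot product:

import numpy as np

# toy feature vectors standing in for two apps' TF-IDF rows (illustrative values only)
a = np.array([0.2, 0.0, 0.8, 0.5])
b = np.array([0.1, 0.4, 0.7, 0.0])

# after l2 normalization, the dot product equals the cosine of the angle between them
a_norm = a / np.linalg.norm(a)
b_norm = b / np.linalg.norm(b)
print(a_norm @ b_norm)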

2. Main Workflow:

  • Tokenize every app's description and remove stop words and punctuation.
  • Compute each app's TF-IDF vector and apply l2 normalization.
  • For each app, compute its cosine similarity with every other app, sort the scores, and keep the top-k apps as its similar apps (a minimal end-to-end sketch follows this list).
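
The sketch below runs this flow end to end on a few hypothetical descriptions (the strings, like the whole snippet, are illustrative; the real data is loaded from node_app.csv in section 3.1):

import jieba
from sklearn.feature_extraction.text import TfidfVectorizer

# hypothetical toy descriptions
docs = ["一款好用的拍照美颜相机", "专业的照片编辑与美颜工具", "轻松记账的个人理财助手"]

# tokenize with jieba and join with spaces so TfidfVectorizer can split on whitespace
tokenized = [" ".join(jieba.cut(d)) for d in docs]

# TfidfVectorizer l2-normalizes rows by default, so X.dot(X.T) gives cosine similarities
X = TfidfVectorizer().fit_transform(tokenized)
sims = X.dot(X.T).toarray()

# for each description, rank the others by similarity (a tiny "top-k")
for i in range(len(docs)):
    order = [j for j in sims[i].argsort()[::-1] if j != i]
    print(i, order)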

3. Implementation Steps

The main steps are:

  • Prepare the environment and data
  • Tokenize the description texts
  • Compute TF-IDF vectors and apply l2 normalization, so cosine distances reduce to dot products
  • Compute pairwise cosine distances between apps within the same category and keep the top-k

3.1 Environment and Data Preparation

In [2]:
import re
import jieba
import pickle
import multiprocessing

import pandas as pd
import numpy as np

from tqdm import tqdm
from sklearn.feature_extraction.text import TfidfVectorizer
from zhon.hanzi import punctuation
In [4]:
# app_raw_data: the raw app data (node_app data)
app_raw_data = pd.read_csv('../../data/kgdata/node_app.csv')

# keep the two columns we need: the app's unique id and its brief description
app_ids = list(app_raw_data['app_id:ID(app_id)'].values)
app_briefs = list(app_raw_data['soft_brief'].values)
print("number of app_id:",len(app_ids))
print("number of app_brief:",len(app_briefs))
/home/LAB/zhangcw/anaconda3/lib/python3.6/site-packages/IPython/core/interactiveshell.py:2785: DtypeWarning: Columns (0,6,8,12,14,22,23,24) have mixed types. Specify dtype option on import or set low_memory=False.
  interactivity=interactivity, compiler=compiler, result=result)
number of app_id: 1881732
number of app_brief: 1881732
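
The DtypeWarning above is caused by mixed-type columns in the CSV; since only the id and brief columns are used downstream, one way to silence it (an optional tweak, not part of the original run) is to force string dtypes:

# optional: read all columns as strings to avoid the mixed-type warning
app_raw_data = pd.read_csv('../../data/kgdata/node_app.csv', dtype=str, low_memory=False)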

3.2 Text Tokenization

In [5]:
# Chinese stop words (collected from the web)
stop_words = [line.strip() for line in open("ChineseStopWords.txt",'r').readlines()]
print("number of stop words:",len(stop_words))
number of stop words: 1893
In [6]:
# split words and delete stop words
def sentence_process(sentence):
    global stop_words
    s_id, s = sentence
    # strip Chinese punctuation before tokenizing
    sentence = re.sub(r"[%s]+" % punctuation, "", str(s))
    # tokenize with jieba and drop stop words
    words = []
    for word in jieba.cut(sentence):
        if word not in stop_words:
            words.append(word)
    return s_id, ' '.join(words)
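
A quick sanity check of the helper on a single hypothetical description (the tuple mirrors what enumerate(app_briefs) yields; the exact tokens depend on jieba's dictionary and the stop-word list):

# hypothetical example input
print(sentence_process((0, "一款简单好用的拍照应用,支持一键美颜。")))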
In [7]:
# parallel processing
pool = multiprocessing.Pool()
results = pool.map(sentence_process,tqdm(list(enumerate(app_briefs))))
pool.close()
pool.join()
100%|██████████| 1881732/1881732 [13:55<00:00, 2253.55it/s]
In [8]:
# reduce
def get_briefs(results):
    dic = {}
    for s_id,sentence in results:
        dic[s_id] = sentence
    briefs = []
    for i in range(len(dic)):
        briefs.append(dic[i])
    return briefs
briefs = get_briefs(results)
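
Since multiprocessing.Pool.map already returns results in input order, the intermediate dict is not strictly needed; an equivalent one-liner would be:

# pool.map preserves input order, so the ids can simply be dropped
briefs = [sentence for _, sentence in results]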

3.3 TF-IDF Computation with l2 Normalization (so cosine distance reduces to a dot product)

In [10]:
# tf-idf
vectorizer = TfidfVectorizer().fit(briefs)
app_vectors = vectorizer.transform(briefs)
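
TfidfVectorizer applies l2 normalization by default (norm='l2'), which is exactly what lets the dot products in section 3.4 act as cosine similarities; a quick spot check on the first few rows (the slice is arbitrary):

from scipy.sparse.linalg import norm as sparse_norm

# every row should have (approximately) unit l2 norm
print(sparse_norm(app_vectors[:5], axis=1))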

3.4 Computing Pairwise Cosine Distances Within Each Category and Taking the Top-k

In [11]:
# load cluster results
with open("all_app_cluster_result.pkl",'rb') as f:
    app_clusters = pickle.load(f)

# map each app_id (as a string) to its row index in app_vectors
app_id_index = dict([str(i),index] for index,i in enumerate(app_ids))

def similar_app(apps):
    global app_id_index,app_vectors,app_ids
    n = len(apps) # number of apps in this category
    indexs = [app_id_index[str(i)] for i in apps] # row indices of these app_ids

    local_app_vectors = app_vectors[indexs] # select the corresponding TF-IDF rows
    scores = local_app_vectors.dot(local_app_vectors.transpose()) # pairwise cosine similarities (rows are l2-normalized)
    def similar_top(i):
        # every other app in this category except app i itself
        local_index = list(range(i))+list(range(i+1,n))
        # argsort is ascending, so the last 100 entries are the 100 most similar apps
        top_idx = scores[i,local_index].toarray()[0].argsort()[-100:]
        # map back to app_ids and reverse so the most similar app comes first
        return [app_ids[indexs[local_index[j]]] for j in top_idx][::-1]
    return dict([app_ids[indexs[i]],similar_top(i)] for i in range(n))
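
Before launching the full parallel run, the function can be spot-checked on a single category (the cluster picked here is arbitrary):

# hypothetical spot check: take one cluster and look at the first app's closest neighbours
one_cluster = next(iter(app_clusters.values()))
sample = similar_app(one_cluster)
first_app = next(iter(sample))
print(first_app, sample[first_app][:5])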
In [12]:
# parallel processing
parameters = list(app_clusters.values())
pool = multiprocessing.Pool()
results = pool.map(similar_app,tqdm(parameters))
pool.close()
pool.join()
top100apps = {}
for dic in tqdm(results):
    top100apps.update(dic)
100%|██████████| 405/405 [05:08<00:00, 1.31it/s]
100%|██████████| 405/405 [00:02<00:00, 153.50it/s]
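
As a possible final step (the output file name is an assumption, not taken from the original notebook), the per-app top-100 lists can be persisted for downstream use:

# assumption: save the per-app top-100 similar-app lists with pickle
with open("top100_similar_apps.pkl", "wb") as f:
    pickle.dump(top100apps, f)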
In [13]:
import jovian
In [14]:
jovian.commit()
[jovian] Saving notebook..
[jovian] Failed to detect notebook filename. Please provide the notebook filename (including .ipynb extension) as the "nb_filename" argument to "jovian.commit".