Result Analysis by Bill Yuan

@钱宇欣

@曾义夫
@2019/5/3 16:52:02

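Until now everyone has used the last 20% of the data as the validation set. To keep the validation set as similar to the test set as possible, so that local validation better approximates a model's performance on the test set, and to make the analysis group's work easier, we now all use the last 7 days of the training set as the validation set.
Please keep your earlier local validation results, and send the results of the worthwhile models, sorted and packaged, to the teammates responsible for data analysis, who will:

  • analyze how the existing models actually perform, including their strengths and weaknesses,
  • and how the models differ from one another
    • e.g. accuracy on the different modes,
    • false positives for each mode,
    • the distribution of false negatives,
    • whether the models' capabilities are complementary or similar, especially for models with similar scores;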
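
The 7-day validation split described above could be sketched as follows. This is a minimal sketch under assumptions: the timestamp column name `req_time` and the helper `split_last_days` are illustrative and are not defined anywhere in this notebook.

```python
import pandas as pd

def split_last_days(df, time_col="req_time", days=7):
    """Return (train_part, valid_part); valid_part covers the last `days` calendar days.

    `time_col` is an assumed column name holding the query timestamp.
    """
    times = pd.to_datetime(df[time_col])
    # Midnight at the start of the earliest day that belongs to the validation window.
    cutoff = times.max().normalize() - pd.Timedelta(days=days - 1)
    is_valid = times >= cutoff
    return df[~is_valid], df[is_valid]

# Toy data: one query per day for 11 days.
demo = pd.DataFrame({
    "sid": range(11),
    "req_time": pd.date_range("2018-11-20 12:00", periods=11, freq="D").astype(str),
})
train_part, valid_part = split_last_days(demo)
print(len(train_part), len(valid_part))
```

Splitting by calendar day rather than by row fraction keeps the validation window aligned with the way the test set follows the training period in time.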
In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
In [2]:
filename = 'eva5_4__17.csv'
result = pd.read_csv(filename)
result.head()
Out[2]:

0  1112456  5
1  1413458  2
2  1243160  2
3  2040494  2
4  1448779  2

In [3]:
y_train = result.label
y_train_pred = result.recommend_mode

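Multiclass confusion matrix (TP, TN, FP, FN)

  • True positive (TP): positive samples that the classifier correctly labels as positive
  • True negative (TN): negative samples that the classifier correctly labels as negative
  • False positive (FP): negative samples that are wrongly labeled as positive
  • False negative (FN): positive samples that are wrongly labeled as negative

Rows are the true classes and columns are the predicted classes; the diagonal holds the correctly classified samples (the TP of each class).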
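
In the multiclass case these four counts are usually read off per class in a one-vs-rest fashion. The sketch below shows one way to do that with NumPy; the helper name `one_vs_rest_counts` is an assumption for illustration, and it expects the same layout as above (true classes on rows, predicted classes on columns).

```python
import numpy as np

def one_vs_rest_counts(cm):
    """Return per-class (tp, fp, fn, tn) arrays from a square confusion matrix."""
    cm = np.asarray(cm)
    tp = np.diag(cm)                  # correctly classified samples of each class
    fp = cm.sum(axis=0) - tp          # predicted as class k, but truly another class
    fn = cm.sum(axis=1) - tp          # truly class k, but predicted as another class
    tn = cm.sum() - tp - fp - fn      # everything that involves class k in neither role
    return tp, fp, fn, tn

# Toy 3-class matrix, not taken from the notebook's data.
toy = [[5, 1, 0],
       [2, 3, 1],
       [0, 0, 4]]
tp, fp, fn, tn = one_vs_rest_counts(toy)
print(tp, fp, fn, tn)
```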

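We want a plot that shows only the errors, so each value in the confusion matrix is divided by the total number of samples in the corresponding true class.
This compares error rates rather than absolute error counts, which would otherwise make the classes with a lot of data look disproportionately bad.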
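
A minimal sketch of that normalization, assuming the matrix has true classes on rows as stated above. The helper name `error_rate_matrix` is hypothetical; zeroing the diagonal is what leaves only the errors visible when the result is plotted.

```python
import numpy as np

def error_rate_matrix(cm):
    """Row-normalize a confusion matrix and blank out the diagonal."""
    cm = np.asarray(cm, dtype=float)
    row_sums = cm.sum(axis=1, keepdims=True)
    # Divide each row by its class total; rows without samples stay at zero.
    norm = np.divide(cm, row_sums, out=np.zeros_like(cm), where=row_sums != 0)
    np.fill_diagonal(norm, 0.0)       # keep only the misclassification rates
    return norm

# Toy 2-class matrix, not taken from the notebook's data.
print(error_rate_matrix([[8, 2], [1, 9]]))
```

The result can be passed to the `plot_confusion_matrix` function in the next cell in place of the raw counts.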

In [36]:
from sklearn.metrics import confusion_matrix
import itertools
from sklearn import metrics
def plot_confusion_matrix(cm, classes,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    """
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=0)
    plt.yticks(tick_marks, classes)

    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, cm[i, j],
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')

# Compute confusion matrix
cnf_matrix = confusion_matrix(y_train,y_train_pred)
np.set_printoptions(precision=2)

print('Classification accuracy: ', metrics.accuracy_score(y_train, y_train_pred))
# recall_score defaults to average='binary', which raises
# "ValueError: Target is multiclass but average='binary'" on these 11 classes,
# so an averaging mode must be chosen explicitly; 'macro' weights every mode equally.
print("Macro-averaged recall on the validation set: ", metrics.recall_score(y_train, y_train_pred, average='macro'))

# Plot non-normalized confusion matrix
class_names = range(1,12)
plt.figure(figsize=(16,9))
plot_confusion_matrix(cnf_matrix
                      , classes=class_names
                      , title='Confusion matrix')
# plt.savefig('conf_mx_5_4_17.png', bbox_inches='tight')
plt.savefig('twx_valPredict0508.png', bbox_inches='tight')
plt.show()
Classification accuracy:  0.7817691496533961

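Accuracy for each travel mode

  • Judging from the results, the prediction accuracy of each mode

Count the samples of each mode in the validation set, look at what was predicted for the ids of that mode, and express this as proportions (while at it, also look at how the misclassified samples of that mode are spread over the other modes).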
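
One way to get both the per-mode counts and the distribution of predictions within each true mode is `pd.crosstab` with `normalize="index"`. This is a sketch rather than the notebook's own method; the helper name `per_mode_distribution` is an assumption.

```python
import pandas as pd

def per_mode_distribution(y_true, y_pred):
    """Return (counts, shares): samples per true mode, and for each true mode
    the share of its samples assigned to every predicted mode."""
    true = pd.Series(list(y_true), name="true")
    pred = pd.Series(list(y_pred), name="predicted")
    counts = true.value_counts().sort_index()
    # Each row sums to 1; the diagonal entry is that mode's recall.
    shares = pd.crosstab(true, pred, normalize="index")
    return counts, shares

# Toy labels, not taken from the notebook's data.
counts, shares = per_mode_distribution([1, 1, 1, 1, 2, 2], [1, 1, 2, 2, 2, 2])
print(counts)
print(shares)
```

With the notebook's variables this would be called as `per_mode_distribution(y_train, y_train_pred)`.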

In [44]:
classification = []
for i in range(1,12):
    a = 'mode'+str(i)
    classification.append(a)
In [43]:
print(metrics.classification_report(y_train,y_train_pred,target_names=classification))
              precision    recall  f1-score   support

       mode1       0.68      0.78      0.72      9167
       mode2       0.89      0.96      0.92     19861
       mode3       0.49      0.06      0.11      2827
       mode4       0.39      0.01      0.02      1550
       mode5       0.85      0.92      0.88      6185
       mode6       0.39      0.10      0.16      1261
       mode7       0.76      0.90      0.82     11278
       mode8       0.34      0.25      0.28       289
       mode9       0.64      0.46      0.54      3165
      mode10       0.52      0.66      0.58      1805
      mode11       0.54      0.46      0.49       459

   micro avg       0.78      0.78      0.78     57847
   macro avg       0.59      0.50      0.50     57847
weighted avg       0.75      0.78      0.75     57847
In [41]:
print(classification)
['mode1', 'mode2', 'mode3', 'mode4', 'mode5', 'mode6', 'mode7', 'mode8', 'mode9', 'mode10', 'mode11']
In [35]:
for i in range(1,12):
    # Column i-1 holds everything predicted as mode i; row i-1 holds everything that truly is mode i.
    print('mode %d: precision %f, recall %f'%(i,np.divide(cnf_matrix[i-1,i-1],np.sum(cnf_matrix[:,i-1]))
                                              ,np.divide(cnf_matrix[i-1,i-1],np.sum(cnf_matrix[i-1,:]))))
    print('\tfalse positives %d, false negatives %d'%(np.subtract(np.sum(cnf_matrix[:,i-1]),cnf_matrix[i-1,i-1])
                                                      ,np.subtract(np.sum(cnf_matrix[i-1,:]),cnf_matrix[i-1,i-1])))
mode 1: precision 0.677340, recall 0.777681
    false positives 3396, false negatives 2038
mode 2: precision 0.891286, recall 0.958914
    false positives 2323, false negatives 816
mode 3: precision 0.492958, recall 0.061903
    false positives 180, false negatives 2652
mode 4: precision 0.387755, recall 0.012258
    false positives 30, false negatives 1531
mode 5: precision 0.846784, recall 0.919483
    false positives 1029, false negatives 498
mode 6: precision 0.387195, recall 0.100714
    false positives 201, false negatives 1134
mode 7: precision 0.758351, recall 0.895726
    false positives 3219, false negatives 1176
mode 8: precision 0.336493, recall 0.245675
    false positives 140, false negatives 218
mode 9: precision 0.636600, recall 0.463823
    false positives 838, false negatives 1697
mode 10: precision 0.521911, recall 0.659834
    false positives 1091, false negatives 614
mode 11: precision 0.541451, recall 0.455338
    false positives 177, false negatives 250
In [28]:
# Diagonal of the confusion matrix: number of correctly classified samples per mode
for i in range(1,12):
    print(cnf_matrix[i-1,i-1])

7129
19045
175
19
5687
127
10102
71
1468
1191
209

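Plotting the per-mode distributions

  • Plot the distributions (the probability that a mode is misclassified, and which modes it is misclassified into)
    • subplot
    • for each mode,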
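
A possible sketch of such a plot, one subplot per mode, is given below. It is an illustration under assumptions, not the author's implementation: both helper names are hypothetical, and the matrix is expected to have true classes on rows. `matplotlib` is imported inside the plotting function so that the numeric helper can be used on its own.

```python
import numpy as np

def misclassification_profile(cm):
    """Return (miss_rate, shares): per true mode, the fraction of samples that
    were misclassified, and how those errors split across the predicted modes."""
    cm = np.asarray(cm, dtype=float)
    errors = cm.copy()
    np.fill_diagonal(errors, 0.0)                     # keep only the off-diagonal errors
    n_errors = errors.sum(axis=1, keepdims=True)
    totals = cm.sum(axis=1)
    miss_rate = np.divide(n_errors[:, 0], totals,
                          out=np.zeros_like(totals), where=totals != 0)
    shares = np.divide(errors, n_errors,
                       out=np.zeros_like(errors), where=n_errors != 0)
    return miss_rate, shares

def plot_misclassification(cm, class_names, ncols=4):
    """Draw one bar chart per true mode showing where its errors end up."""
    import matplotlib.pyplot as plt
    miss_rate, shares = misclassification_profile(cm)
    labels = [str(c) for c in class_names]
    n = len(labels)
    nrows = -(-n // ncols)                            # ceiling division
    fig, axes = plt.subplots(nrows, ncols, figsize=(4 * ncols, 3 * nrows), squeeze=False)
    for k, ax in enumerate(axes.ravel()):
        if k >= n:
            ax.axis("off")                            # hide unused grid cells
            continue
        ax.bar(range(n), shares[k])
        ax.set_xticks(range(n))
        ax.set_xticklabels(labels)
        ax.set_title("mode %s: %.1f%% misclassified" % (labels[k], 100 * miss_rate[k]))
    fig.tight_layout()
    return fig
```

With the variables defined earlier, `plot_misclassification(cnf_matrix, range(1, 12))` would give an 11-panel figure.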
In [45]:
# import matplotlib.pyplot as plt
# import numpy as np

# plt.rcParams['font.sans-serif'] = ['SimHei'] # (replace the sans-serif font)
# plt.rcParams['axes.unicode_minus'] = False   # (fix the display of minus signs on the axes)


# TrueData = cnf_matrix[i,range(12)]
# PredDate = cnf_matrix[range(12),i]
# labels =["mode0","mode1","mode2","mode3","mode4","mode5","mode6","mode7","mode8","mode9","mode10","mode11"]


# #ax2 = fig.add_subplot(222) # second subplot in a 2x2 grid
# #bar(left, height, width, color, align, yerr): draws a bar chart.
# # left is the sequence of x positions, usually generated with arange;
# # height is the sequence of y values, i.e. the bar heights, usually the data to display;
# # width is the bar width, usually just set to 1; color is the fill color of the bars;
# # align sets how the bars are aligned to the x positions used by plt.xticks();
# # yerr adds error bars on top of the bars.
# # color sets the bar color
# # alpha sets the opacity of the bar fill color, greater than 0 and at most 1
# # linewidth is the line width

# # set the parameters
# xlocation =  np.linspace(1, len(job_data) * 0.6, len(job_data)) # one position per data series
# print(xlocation)
# height01 = job_data
# height02 = how_many
# width = 0.2
# color01='darkgoldenrod'
# color02 = 'seagreen'

# # draw the bar chart
# ax1 = plt.figure('Top 10 cities by job postings',figsize=(10,6)) # sets the figure name and the canvas size
# ax1.tight_layout()
# # ax1 = fig.add_subplot(221) # first subplot in a 2x2 grid
# plt.title('Top 10 cities by job postings', fontsize=15) # add the figure title
# # plot
# rects01 = plt.bar(xlocation, height01, width = 0.2, color=color01,linewidth=1,alpha=0.8)
# rects02 = plt.bar(xlocation+0.2,height02 ,width = 0.2, color=color02,linewidth=1,alpha=0.8)
# # add the x-axis tick labels
# plt.xticks(xlocation+0.15,labels, fontsize=12 ,rotation = 20)  # x-axis tick labels; rotation is the label rotation angle
# # what the x and y axes represent
# plt.xlabel(u'Location', fontsize=15, labelpad=10)
# plt.ylabel(u'Number of positions', fontsize=15, labelpad=10)
# # legend
# plt.legend((rects01,rects02),( u'Number of positions',u'Number of hires'), fontsize=15)  # legend
# # add data labels
# for r1,r2 ,amount01,amount02 in zip(rects01, rects02,job_data,how_many):
#         h01 = r1.get_height()
#         h02 = r2.get_height()
#         plt.text(r1.get_x(), h01, amount01, fontsize=13, va='bottom')  # add the number-of-positions label
#         plt.text(r2.get_x(), h02 , amount02, fontsize=13, va='bottom')  # add the number-of-hires label

# plt.show()
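  • Complement sets between different results, in preparation for model ensembling later
    • i.e. how the correct results are distributed across the models
    • which modes model A gets right, and for which ids; which modes model B gets right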
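
The comparison of correct sets between two models could look like the sketch below. It assumes each model's validation output is a DataFrame with an id column plus `true` and `predict` columns. `true` and `predict` match the columns used for `twx_eval` later in this notebook, while the `sid` key and the helper name `compare_correct_sets` are assumptions.

```python
import pandas as pd

def compare_correct_sets(pred_a, pred_b, key="sid", true_col="true", pred_col="predict"):
    """Split the shared ids into those both models, only one model,
    or neither model predicts correctly."""
    merged = pred_a.merge(pred_b[[key, pred_col]], on=key, suffixes=("_a", "_b"))
    a_ok = merged[pred_col + "_a"] == merged[true_col]
    b_ok = merged[pred_col + "_b"] == merged[true_col]
    return {
        "both": set(merged.loc[a_ok & b_ok, key]),
        "only_a": set(merged.loc[a_ok & ~b_ok, key]),   # where A complements B
        "only_b": set(merged.loc[~a_ok & b_ok, key]),   # where B complements A
        "neither": set(merged.loc[~a_ok & ~b_ok, key]),
    }

# Toy predictions for four queries, not taken from the notebook's data.
model_a = pd.DataFrame({"sid": [1, 2, 3, 4], "true": [1, 2, 3, 4], "predict": [1, 2, 0, 0]})
model_b = pd.DataFrame({"sid": [1, 2, 3, 4], "true": [1, 2, 3, 4], "predict": [1, 0, 3, 0]})
groups = compare_correct_sets(model_a, model_b)
print({name: sorted(ids) for name, ids in groups.items()})
```

Large `only_a` and `only_b` sets suggest the two models are complementary and worth ensembling; if both are small the models are largely redundant.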
In [ ]:
 

Saved results

@钱宇欣 11-class baseline

Offline score 0.670264

5_4_17

mode 0 accuracy: 0.000000
mode 1 accuracy: 0.772591
mode 2 accuracy: 0.957966
mode 3 accuracy: 0.056965
mode 4 accuracy: 0.009070
mode 5 accuracy: 0.921448
mode 6 accuracy: 0.103575
mode 7 accuracy: 0.897545
mode 8 accuracy: 0.091603
mode 9 accuracy: 0.507319
mode 10 accuracy: 0.653871
mode 11 accuracy: 0.468672

@钱宇欣 12-class baseline

Offline score 0.670868

5_4_19

mode 0 accuracy: 0.012929
mode 1 accuracy: 0.772841
mode 2 accuracy: 0.956734
mode 3 accuracy: 0.052807
mode 4 accuracy: 0.008314
mode 5 accuracy: 0.919034
mode 6 accuracy: 0.090742
mode 7 accuracy: 0.897545
mode 8 accuracy: 0.091603
mode 9 accuracy: 0.502678
mode 10 accuracy: 0.650032
mode 11 accuracy: 0.446115
In [36]:
# Build the list of profile column names.
# list.extend() mutates the list in place and returns None, so assigning its result back
# raised "AttributeError: 'NoneType' object has no attribute 'extend'" on the second pass;
# it would also have split each name into single characters. append() adds the whole name.
profile1 = []
for i in range(67):
    string='p'+str(i-1)
    profile1.append(string)
In [34]:
profile1

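EDA on the test dataset

  • User-related data such as pid
    • 31,447 rows of test_queries have a NaN pid, a share of 0.333 (i.e. profile has no data for these users)
    • Overlap between the pids appearing in test_queries and those in train_queries:
      • 55,025 test queries have a pid that also appears in train_queries, a share of 0.583 of all 94,357 test queries (NaN pids are excluded from the count), or about 0.875 of the 62,910 queries with a non-NaN pid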
In [2]:
test_queries = pd.read_csv('test_queries.csv')
In [11]:
train_queries = pd.read_csv('train_queries.csv')
In [34]:
55025/94357
Out[34]:
0.5831575823733268
In [23]:
test_pid = list(test_queries['pid'][test_queries['pid'].notna()])
In [24]:
train_pid = list(train_queries['pid'][train_queries['pid'].notna()])
In [31]:
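# Count the test queries whose pid also appears in train_queries.
# A set gives constant-time membership tests; looking each pid up in the
# train_pid list scans the whole list for every test query.
train_pid_set = set(train_pid)
i = sum(pid in train_pid_set for pid in test_pid)

print(i)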
In [9]:
test_queries['pid'].isna().sum()
# test_queries['pid'].count()
# len(test_queries['pid'])
Out[9]:
31447

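Finding the complement sets between different results, based on the submitted datasets

  • Group the sids and pids by the final predicted mode
  • Then compare the hit-rate differences between models, mode by mode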
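
A sketch of the mode-by-mode comparison, assuming every model's validation file has `true` and `predict` columns like `twx_eval` below. The helper names and the model labels "A" and "B" are hypothetical. Grouping by the predicted mode means the hit rate here is the share of each model's predictions for that mode that are correct.

```python
import pandas as pd

def hit_rate_by_predicted_mode(df, true_col="true", pred_col="predict"):
    """For each predicted mode, the share of those predictions that are correct."""
    hit = df[true_col] == df[pred_col]
    return hit.groupby(df[pred_col]).mean()

def compare_models(models, true_col="true", pred_col="predict"):
    """Build a table with one row per mode and one column per model.
    Modes a model never predicts show up as NaN in its column."""
    return pd.DataFrame({
        name: hit_rate_by_predicted_mode(df, true_col, pred_col)
        for name, df in models.items()
    })

# Toy predictions, not taken from the notebook's data.
model_a = pd.DataFrame({"true": [1, 1, 2, 2], "predict": [1, 2, 2, 2]})
model_b = pd.DataFrame({"true": [1, 1, 2, 2], "predict": [1, 1, 1, 2]})
table = compare_models({"A": model_a, "B": model_b})
table["A_minus_B"] = table["A"] - table["B"]
print(table)
```

The sids behind each predicted mode can be collected in the same spirit with `df.groupby("predict")["sid"].apply(list)`, assuming the file also carries a `sid` column.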
In [2]:
twx_eval = pd.read_csv('twx_valPredict0508.csv')
twx_eval.head()
Out[2]:
In [6]:
twx_eval.predict.unique()
Out[6]:
array([ 7.,  2.,  1.,  5.,  9., 10.,  8.,  3., 11.,  6.,  4.])
In [3]:
y_train = twx_eval.true
y_train_pred = twx_eval.predict
In [46]:
!pip install jovian --upgrade
In [ ]:
import jovian