@钱宇欣
@曾义夫
@2019/5/3 16:52:02
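Previously everyone used the last 20% of the data as the validation set. To make the validation set as similar to the test set as possible, so that local validation more closely reflects how a model will perform on the test set, and to make things easier for the data analysis group, we are now standardizing on the last 7 days of the training set as the validation set.
Please keep your earlier local validation results. Package the results of the valuable models by category and send them to the teammates handling the detailed data analysis.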
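A minimal sketch of the 7-day split described above, assuming the training data is a pandas DataFrame with a timestamp column. The column name `req_time`, the helper `split_last_days`, and the toy data are assumptions for illustration and do not come from the thread. "Last 7 days" is read here as the final 7 calendar days present in the data.

```python
import pandas as pd


def split_last_days(df, time_col="req_time", days=7):
    """Return (train, valid), where valid covers the final `days` calendar days.

    `time_col` is a hypothetical column name; adjust it to the real data.
    """
    times = pd.to_datetime(df[time_col])
    # Midnight of the first day that belongs to the validation window.
    cutoff = times.max().normalize() - pd.Timedelta(days=days - 1)
    is_valid = times >= cutoff
    return df[~is_valid], df[is_valid]


# Toy data: one record per day for 14 days.
toy = pd.DataFrame({
    "sid": range(14),
    "req_time": pd.date_range("2018-11-01 12:00", periods=14, freq="D"),
})
train_part, valid_part = split_last_days(toy)
print(len(train_part), len(valid_part))
```

Splitting by time rather than by row fraction keeps the validation window the same length regardless of how many records each day contains.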
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
filename = 'eva5_4__17.csv'
result = pd.read_csv(filename)
result.head()
0 1112456 5
1 1413458 2
2 1243160 2
3 2040494 2
4 1448779 2
y_train = result.label
y_train_pred = result.recommend_mode
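Rows are the true classes and columns are the predicted classes; the diagonal holds the correctly classified samples (true label equals predicted label).
To make the plot show only the errors, divide each value in the confusion matrix by the total number of samples in the corresponding true class.
This allows comparing error rates rather than absolute error counts. Pay attention to the classes with a large amount of data.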
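The normalization described above can be sketched as follows. The helper name `error_rate_matrix` and the toy matrix are assumptions for illustration. Each row is divided by its row sum (the number of true samples of that class), and the diagonal is zeroed so that only the errors remain.

```python
import numpy as np


def error_rate_matrix(cm):
    """Row-normalize a confusion matrix and zero the diagonal."""
    cm = np.asarray(cm, dtype=float)
    row_sums = cm.sum(axis=1, keepdims=True)
    # Rows with no samples stay at zero instead of dividing by zero.
    rates = np.divide(cm, row_sums, out=np.zeros_like(cm), where=row_sums > 0)
    np.fill_diagonal(rates, 0.0)
    return rates


# Toy 3-class confusion matrix (rows: true class, columns: predicted class).
toy_cm = np.array([[8, 2, 0],
                   [1, 6, 3],
                   [0, 0, 5]])
err = error_rate_matrix(toy_cm)
print(err)
```

The result could be passed to `plot_confusion_matrix` in place of `cnf_matrix`, although the cell labels would then probably need rounding.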
from sklearn.metrics import confusion_matrix
import itertools
from sklearn import metrics
def plot_confusion_matrix(cm, classes,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    """
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=0)
    plt.yticks(tick_marks, classes)
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, cm[i, j],
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")
    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')
# Compute confusion matrix
cnf_matrix = confusion_matrix(y_train,y_train_pred)
np.set_printoptions(precision=2)
print('Classification accuracy: ',metrics.accuracy_score(y_train, y_train_pred))
print("Recall metric in the testing dataset: ", metrics.recall_score(y_train,y_train_pred))
# Plot non-normalized confusion matrix
class_names = range(1,12)
plt.figure(figsize=(16,9))
plot_confusion_matrix(cnf_matrix
                      , classes=class_names
                      , title='Confusion matrix')
# plt.savefig('conf_mx_5_4_17.png', bbox_inches='tight')
plt.savefig('twx_valPredict0508.png', bbox_inches='tight')
plt.show()
Classification accuracy: 0.7817691496533961
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-36-2f38f00a340a> in <module>
30
31 print('Classification accuracy: ',metrics.accuracy_score(y_train, y_train_pred))
---> 32 print("Recall metric in the testing dataset: ", metrics.recall_score(y_train,y_train_pred))
33
34 # Plot non-normalized confusion matrix
C:\ProgramData\Anaconda3\lib\site-packages\sklearn\metrics\classification.py in recall_score(y_true, y_pred, labels, pos_label, average, sample_weight)
1365 average=average,
1366 warn_for=('recall',),
-> 1367 sample_weight=sample_weight)
1368 return r
1369
C:\ProgramData\Anaconda3\lib\site-packages\sklearn\metrics\classification.py in precision_recall_fscore_support(y_true, y_pred, beta, labels, pos_label, average, warn_for, sample_weight)
1045 else:
1046 raise ValueError("Target is %s but average='binary'. Please "
-> 1047 "choose another average setting." % y_type)
1048 elif pos_label not in (None, 1):
1049 warnings.warn("Note that pos_label (set to %r) is ignored when "
ValueError: Target is multiclass but average='binary'. Please choose another average setting.
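The traceback above arises because `recall_score` defaults to `average='binary'`, which only applies to two-class targets. A minimal sketch of the multiclass call on toy labels follows (the labels are made up for illustration). Which averaging mode suits this task is a judgment call that the source does not settle.

```python
from sklearn.metrics import recall_score

# Toy multiclass labels, for illustration only.
y_true = [1, 2, 3, 3, 2, 1]
y_pred = [1, 2, 2, 3, 2, 3]

# 'macro' averages per-class recall with equal weight per class,
# 'weighted' weights each class by its support,
# None returns one recall value per class.
macro = recall_score(y_true, y_pred, average="macro")
weighted = recall_score(y_true, y_pred, average="weighted")
per_class = recall_score(y_true, y_pred, average=None)
print(macro, weighted, per_class)
```

With imbalanced classes, `macro` and `weighted` can differ noticeably, so it may be worth reporting both.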
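Count the samples of each mode in the validation set and the predictions made on the ids belonging to each mode, then look at the proportions (and, for each mode, the distribution of the modes it gets misclassified as).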
classification = []
for i in range(1,12):
    a = 'mode'+str(i)
    classification.append(a)
print(metrics.classification_report(y_train,y_train_pred,target_names=classification))
              precision    recall  f1-score   support
       mode1       0.68      0.78      0.72      9167
       mode2       0.89      0.96      0.92     19861
       mode3       0.49      0.06      0.11      2827
       mode4       0.39      0.01      0.02      1550
       mode5       0.85      0.92      0.88      6185
       mode6       0.39      0.10      0.16      1261
       mode7       0.76      0.90      0.82     11278
       mode8       0.34      0.25      0.28       289
       mode9       0.64      0.46      0.54      3165
      mode10       0.52      0.66      0.58      1805
      mode11       0.54      0.46      0.49       459
   micro avg       0.78      0.78      0.78     57847
   macro avg       0.59      0.50      0.50     57847
weighted avg       0.75      0.78      0.75     57847
print(classification)
['mode1', 'mode2', 'mode3', 'mode4', 'mode5', 'mode6', 'mode7', 'mode8', 'mode9', 'mode10', 'mode11']
for i in range(1,12):
    print('mode %d precision: %f, recall: %f'%(i,np.divide(cnf_matrix[i-1,i-1],np.sum(cnf_matrix[0:12,i-1]))
          ,np.divide(cnf_matrix[i-1,i-1],np.sum(cnf_matrix[i-1,0:12]))))
    print('\tfalse positives: %d, false negatives: %d'%(np.subtract(np.sum(cnf_matrix[0:12,i-1]),cnf_matrix[i-1,i-1])
          ,np.subtract(np.sum(cnf_matrix[i-1,0:12]),cnf_matrix[i-1,i-1])))
mode 1 precision: 0.677340, recall: 0.777681
false positives: 3396, false negatives: 2038
mode 2 precision: 0.891286, recall: 0.958914
false positives: 2323, false negatives: 816
mode 3 precision: 0.492958, recall: 0.061903
false positives: 180, false negatives: 2652
mode 4 precision: 0.387755, recall: 0.012258
false positives: 30, false negatives: 1531
mode 5 precision: 0.846784, recall: 0.919483
false positives: 1029, false negatives: 498
mode 6 precision: 0.387195, recall: 0.100714
false positives: 201, false negatives: 1134
mode 7 precision: 0.758351, recall: 0.895726
false positives: 3219, false negatives: 1176
mode 8 precision: 0.336493, recall: 0.245675
false positives: 140, false negatives: 218
mode 9 precision: 0.636600, recall: 0.463823
false positives: 838, false negatives: 1697
mode 10 precision: 0.521911, recall: 0.659834
false positives: 1091, false negatives: 614
mode 11 precision: 0.541451, recall: 0.455338
false positives: 177, false negatives: 250
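The per-mode loop above can also be written without explicit indexing. The following is a minimal vectorized sketch, assuming rows are true classes and columns are predicted classes. The helper name `per_class_precision_recall` and the toy matrix are assumptions for illustration.

```python
import numpy as np


def per_class_precision_recall(cm):
    """Per-class precision and recall from a square confusion matrix."""
    cm = np.asarray(cm, dtype=float)
    tp = np.diag(cm)              # correctly classified samples per class
    predicted = cm.sum(axis=0)    # column sums: times each class was predicted
    actual = cm.sum(axis=1)       # row sums: true samples of each class
    # Classes that are never predicted (or never occur) get 0 instead of NaN.
    precision = np.divide(tp, predicted, out=np.zeros_like(tp), where=predicted > 0)
    recall = np.divide(tp, actual, out=np.zeros_like(tp), where=actual > 0)
    return precision, recall


# Toy 3-class confusion matrix.
toy_cm = np.array([[8, 2, 0],
                   [1, 6, 3],
                   [0, 0, 5]])
precision, recall = per_class_precision_recall(toy_cm)
print(precision, recall)
```

In this layout the false positives are `predicted - tp` and the false negatives are `actual - tp`, matching the two counts printed by the loop.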
for i in range(1,12):
    # TrueData =
    print(cnf_matrix[i-1,i-1])
    # PredDate =
    # cnf_matrix[range(12),i]
7129
19045
175
19
5687
127
10102
71
1468
1191
209
# import matplotlib.pyplot as plt
# import numpy as np
# plt.rcParams['font.sans-serif'] = ['SimHei'] # (replace the sans-serif font)
# plt.rcParams['axes.unicode_minus'] = False # (fix the display of the minus sign on the axes)
# TrueData = cnf_matrix[i,range(12)]
# PredDate = cnf_matrix[range(12),i]
# labels =["mode0","mode1","mode2","mode3","mode4","mode5","mode6","mode7","mode8","mode9","mode10","mode11"]
# #ax2 = fig.add_subplot(222) # a subplot in a 2x2 grid
# #bar(left, height, width, color, align, yerr): draws a bar chart.
# # left is the sequence of x positions, usually generated with arange;
# # height is the sequence of y values, i.e. the bar heights, usually the data to display;
# # width is the bar width, usually just set to 1; color is the bar fill color;
# # align sets how the bars are aligned to their x positions;
# # yerr adds error bars to the bars.
# # color sets the bar color
# # alpha sets the opacity of the bar fill, greater than 0 and at most 1
# # linewidth is the line width
# # set the parameters
# xlocation = np.linspace(1, len(job_data) * 0.6, len(job_data)) # one position per data item
# print(xlocation)
# height01 = job_data
# height02 = how_many
# width = 0.2
# color01='darkgoldenrod'
# color02 = 'seagreen'
# # draw the bar chart
# ax1 = plt.figure('Top 10 cities by job postings',figsize=(10,6)) # sets the figure name and canvas size
# ax1.tight_layout()
# # ax1 = fig.add_subplot(221) # the first subplot in a 2x2 grid
# plt.title('Top 10 cities by job postings', fontsize=15) # add the chart title
# # plot
# rects01 = plt.bar(xlocation, height01, width = 0.2, color=color01,linewidth=1,alpha=0.8)
# rects02 = plt.bar(xlocation+0.2,height02 ,width = 0.2, color=color02,linewidth=1,alpha=0.8)
# # add the x-axis labels
# plt.xticks(xlocation+0.15,labels, fontsize=12 ,rotation = 20) # x-axis tick labels; rotation is the label rotation angle
# # what the x and y axes represent
# plt.xlabel(u'Location', fontsize=15, labelpad=10)
# plt.ylabel(u'Job postings', fontsize=15, labelpad=10)
# # legend
# plt.legend((rects01,rects02),( u'Job postings',u'Headcount'), fontsize=15) # legend
# # add data labels
# for r1,r2 ,amount01,amount02 in zip(rects01, rects02,job_data,how_many):
#     h01 = r1.get_height()
#     h02 = r2.get_height()
#     plt.text(r1.get_x(), h01, amount01, fontsize=13, va='bottom') # add the job postings label
#     plt.text(r2.get_x(), h02 , amount02, fontsize=13, va='bottom') # add the headcount label
# plt.show()
Offline score: 0.670264
mode 0 accuracy: 0.000000
mode 1 accuracy: 0.772591
mode 2 accuracy: 0.957966
mode 3 accuracy: 0.056965
mode 4 accuracy: 0.009070
mode 5 accuracy: 0.921448
mode 6 accuracy: 0.103575
mode 7 accuracy: 0.897545
mode 8 accuracy: 0.091603
mode 9 accuracy: 0.507319
mode 10 accuracy: 0.653871
mode 11 accuracy: 0.468672
Offline score: 0.670868
mode 0 accuracy: 0.012929
mode 1 accuracy: 0.772841
mode 2 accuracy: 0.956734
mode 3 accuracy: 0.052807
mode 4 accuracy: 0.008314
mode 5 accuracy: 0.919034
mode 6 accuracy: 0.090742
mode 7 accuracy: 0.897545
mode 8 accuracy: 0.091603
mode 9 accuracy: 0.502678
mode 10 accuracy: 0.650032
mode 11 accuracy: 0.446115
# import
profile1 = []
for i in range(67):
    string='p'+str(i-1)
    profile1=profile1.extend(string)
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-36-d04c607b4b73> in <module>
3 for i in range(67):
4 string='p'+str(i-1)
----> 5 profile1=profile1.extend(string)
AttributeError: 'NoneType' object has no attribute 'extend'
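The `AttributeError` above comes from `list.extend` modifying the list in place and returning `None`: after the first iteration `profile1` is `None`, so the second call fails. In addition, `extend` with a string adds its characters one at a time, whereas `append` adds the whole string. The sketch below is one possible repair. It keeps the original naming `'p' + str(i - 1)`, so the first entry is `p-1`; whether that offset is intended is not stated in the source.

```python
# Build the list of profile column names without reassigning the list.
profile1 = []
for i in range(67):
    # append() adds the whole string as one element and returns None,
    # so its result must not be assigned back to profile1.
    profile1.append('p' + str(i - 1))

print(len(profile1), profile1[:3], profile1[-1])
```

An equivalent one-liner would be `profile1 = ['p' + str(i - 1) for i in range(67)]`.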
profile1
test_queries = pd.read_csv('test_queries.csv')
train_queries = pd.read_csv('train_queries.csv')
55025/94357
0.5831575823733268
test_pid = list(test_queries['pid'][test_queries['pid'].notna()])
train_pid = list(train_queries['pid'][train_queries['pid'].notna()])
i = 0
for pid in test_pid:
    if pid in train_pid:
        i+=1
print(i)
test_queries['pid'].isna().sum()
# test_queries['pid'].count()
# len(test_queries['pid'])
31447
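The loop above tests membership against a Python list, which scans the list on every lookup and can be slow for large inputs. A minimal sketch of the same count using a set follows. The helper name `count_seen_pids` and the toy values are assumptions for illustration; like the original loop, it counts every non-null test `pid` that also occurs in the training data, duplicates included.

```python
import pandas as pd


def count_seen_pids(test_pids, train_pids):
    """Count non-null test pids that also appear among the train pids."""
    train_set = set(pd.Series(train_pids).dropna())
    test_series = pd.Series(test_pids).dropna()
    # isin() does hash-based lookups instead of scanning a list each time.
    return int(test_series.isin(train_set).sum())


# Toy pid columns with missing values.
toy_test = [1.0, 2.0, 2.0, None, 5.0]
toy_train = [2.0, 3.0, 1.0, None]
seen = count_seen_pids(toy_test, toy_train)
print(seen)
```

With the real data this would be called as `count_seen_pids(test_queries['pid'], train_queries['pid'])`.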
twx_eval = pd.read_csv('twx_valPredict0508.csv')
twx_eval.head()
twx_eval.predict.unique()
array([ 7., 2., 1., 5., 9., 10., 8., 3., 11., 6., 4.])
y_train = twx_eval.true
y_train_pred = twx_eval.predict
!pip install jovian --upgrade
import jovian