Python Text Classification Series (1): Naive Bayes and SVM Multiclass Classifiers

## Multiclass Text Classification

This post covers three approaches: 1) naive Bayes, 2) one-vs-one SVM, 3) one-vs-rest SVM.

1. Naive Bayes

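Naive Bayes is easy to implement and fast to run, so it is widely used for multiclass text classification. The two most common variants are the multinomial model and the Bernoulli model (the former works at the granularity of words, the latter at the granularity of documents).
The Python code in this post uses the multinomial model.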

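How should this model be understood? Informally: how much evidence does word tk provide that document d belongs to class c? That evidence is expressed as:

p(tk|c) = (number of occurrences of word tk summed over all documents of class c + 1) / (total number of words in class c + |V|)

V is the vocabulary of the training samples (every distinct word is counted once, no matter how often it occurs), and |V| is the number of distinct words in the training samples. The +1 in the numerator and the |V| in the denominator are add-one (Laplace) smoothing, which keeps a word that never occurs in class c from getting probability zero.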
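To make the estimate concrete, here is a minimal sketch that evaluates the formula on a tiny hand-made corpus. The corpus, the class names, and the helper `word_likelihood` are hypothetical and only serve the illustration; scikit-learn's `MultinomialNB` does the equivalent bookkeeping internally.

```python
from collections import Counter

# Hypothetical toy training set: tokenized documents grouped by class.
docs_by_class = {
    "hardware": [["disk", "drive", "disk"], ["drive", "card"]],
    "religion": [["faith", "church"], ["church", "faith", "faith"]],
}

# V: every distinct word seen anywhere in the training data.
vocabulary = {w for docs in docs_by_class.values() for doc in docs for w in doc}

def word_likelihood(word, cls):
    """Add-one smoothed estimate of p(word | cls) under the multinomial model."""
    counts = Counter(w for doc in docs_by_class[cls] for w in doc)
    total_words = sum(counts.values())  # all word occurrences in class cls
    return (counts[word] + 1) / (total_words + len(vocabulary))

p_disk_hw = word_likelihood("disk", "hardware")    # (2 + 1) / (5 + 5)
p_faith_hw = word_likelihood("faith", "hardware")  # (0 + 1) / (5 + 5)
print(p_disk_hw, p_faith_hw)
```

Because of the smoothing, "faith" still gets a small non-zero probability under the "hardware" class even though it never occurs there, and the probabilities of all vocabulary words within one class sum to 1.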

Python code:

Naive Bayes

from sklearn.datasets import fetch_20newsgroups
categories = ['comp.sys.ibm.pc.hardware','comp.sys.mac.hardware','misc.forsale','soc.religion.christian']
twenty_train = fetch_20newsgroups(subset='train', categories=categories, shuffle=True, random_state=42)
twenty_test = fetch_20newsgroups(subset='test', categories=categories, shuffle=True, random_state=42)

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
text_clf_NB = Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('clf', MultinomialNB()),
])

The Pipeline binds the preprocessing steps and the final classifier together, so they can be fitted and applied as a single estimator.

text_clf_NB = text_clf_NB.fit(twenty_train.data, twenty_train.target)

predicted = text_clf_NB.predict(twenty_test.data)

from sklearn import metrics
precision_nb=metrics.precision_score(twenty_test.target, predicted, average='weighted')
recall_nb=metrics.recall_score(twenty_test.target, predicted, average='weighted')
accuracy_nb=metrics.accuracy_score(twenty_test.target, predicted)
matrix_nb=metrics.confusion_matrix(twenty_test.target, predicted)
print ("accuracy: %f\nrecall: %f\nprecision:%f\n"%(accuracy_nb,recall_nb,precision_nb))
print("confusion matrix: \n%s"%matrix_nb)

accuracy: 0.890096
recall: 0.890096
precision:0.894808

confusion matrix: 
[[355  23   4  10]
 [ 33 330   5  17]
 [ 40  22 313  15]
 [  2   1   0 395]]

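Evaluation uses the metrics module. For precision and recall, the average parameter has to be set explicitly ('weighted' here); the default, 'binary', only applies to two-class problems.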
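As a rough illustration of what average='weighted' computes, the following sketch calculates weighted precision by hand: precision is computed per class and then averaged with weights proportional to how often each class appears in the true labels. The function name `weighted_precision` and the toy labels are made up for this example.

```python
from collections import Counter

def weighted_precision(y_true, y_pred):
    """Per-class precision, averaged with weights equal to each class's share of y_true."""
    support = Counter(y_true)  # number of true samples per class
    total = len(y_true)
    score = 0.0
    for cls, n_true in support.items():
        predicted_as_cls = sum(1 for p in y_pred if p == cls)
        correct = sum(1 for t, p in zip(y_true, y_pred) if t == p == cls)
        # A class that is never predicted contributes a precision of 0.
        precision = correct / predicted_as_cls if predicted_as_cls else 0.0
        score += precision * n_true / total
    return score

y_true = [0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 1, 1, 1, 2]
result = weighted_precision(y_true, y_pred)
print(result)
```

With the same weighting, recall reduces to the fraction of correctly classified samples, which is why the weighted recall and the accuracy in the printed results are identical.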

2. One-vs-One SVM

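Approach:
For every pair of labels, train a classifier on the data of those two labels only and run it on the whole test set; each prediction counts as one vote for the corresponding label. The votes are accumulated in the dictionary features, whose keys are the label names: the 0/1 prediction array is added to the entry of the pair's second label, and its complement to the entry of the first.
In the code, 1-predict_temp is the complement of predict_temp.
One more thing to watch out for: dictionaries are unordered (before Python 3.7 the order is not guaranteed), so the arrays returned by features.values() may come in any order, which makes it hard to tie a row to a class index. The code therefore builds the list explicitly in the order of categories.
The last step uses a very handy numpy function, argmax (before I found it, I tried many ways to implement this myself, and all of them failed).
It returns, for every test document, the index of the label with the highest count, giving an array that can be compared directly with target.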
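The voting step can be sketched in isolation as follows. The label names and the pairwise predictions below are hypothetical stand-ins for the output of the pairwise classifiers; only the vote counting and the argmax are the point of the example.

```python
import numpy as np

labels = ["a", "b", "c"]  # hypothetical class names; list position = class index
n_docs = 4

# Hypothetical 0/1 outputs of the three pairwise classifiers on four documents:
# 0 is a vote for the first label of the pair, 1 a vote for the second.
pair_predictions = {
    ("a", "b"): np.array([0, 1, 0, 1]),
    ("a", "c"): np.array([0, 0, 1, 1]),
    ("b", "c"): np.array([0, 0, 1, 1]),
}

votes = {label: np.zeros(n_docs, dtype=int) for label in labels}
for (first, second), pred in pair_predictions.items():
    votes[first] += 1 - pred  # complement: documents voted into the first label
    votes[second] += pred     # documents voted into the second label

# Stack the rows in the fixed order of `labels` so that row i belongs to class i.
vote_matrix = np.array([votes[label] for label in labels])
winners = vote_matrix.argmax(axis=0)  # per document: index of the most-voted class
print(vote_matrix)
print(winners)
```

Note that argmax returns the first maximal index, so a tie between two labels is always resolved in favour of the lower class index.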

Python code:

One VS One SVM

from sklearn.svm import LinearSVC
text_clf_svm = Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('clf', LinearSVC()),
])

param = [['comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware'],
         ['comp.sys.ibm.pc.hardware', 'misc.forsale'],
         ['comp.sys.ibm.pc.hardware', 'soc.religion.christian'],
         ['comp.sys.mac.hardware', 'misc.forsale'],
         ['comp.sys.mac.hardware', 'soc.religion.christian'],
         ['misc.forsale', 'soc.religion.christian']]

import numpy as np
features = {
         'comp.sys.ibm.pc.hardware':0,
         'comp.sys.mac.hardware':0,
         'misc.forsale':0,
         'soc.religion.christian':0
        }
for p in param:
    twenty_train_temp = fetch_20newsgroups(subset='train', categories=p, shuffle=True, random_state=42)
    text_clf_svm = text_clf_svm.fit(twenty_train_temp.data, twenty_train_temp.target)
    predict_temp = text_clf_svm.predict(twenty_test.data)  
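    # Labels follow the alphabetical order of the category names, which is also
    # the order inside each pair: 0 means p[0], 1 means p[1].
    features[p[0]] = features[p[0]] + (1 - predict_temp)  # votes for the first category
    features[p[1]] = features[p[1]] + predict_temp        # votes for the second category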
    
t = [features[categories[0]],features[categories[1]],features[categories[2]],features[categories[3]]]
arr = np.array(t)
predict2 = arr.argmax(axis=0)

precision_svc1=metrics.precision_score(twenty_test.target, predict2, average='weighted')
recall_svc1=metrics.recall_score(twenty_test.target, predict2, average='weighted')
accuracy_svc1=metrics.accuracy_score(twenty_test.target, predict2)
matrix_svc1=metrics.confusion_matrix(twenty_test.target, predict2)
print ("accuracy: %f\nrecall: %f\nprecision:%f\n"%(accuracy_svc1,recall_svc1,precision_svc1))
print("confusion matrix: \n%s"%matrix_svc1)

accuracy: 0.915655
recall: 0.915655
precision:0.916187

confusion matrix: 
[[338  35  19   0]
 [ 26 342  16   1]
 [ 16   5 368   1]
 [  9   1   3 385]]

3. One-vs-Rest SVM

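This one is much simpler: just replace the training target with a binary new_train_target (built here with map() and a lambda; besides map(), filter() and reduce() are the other two functions commonly combined with lambda to process the elements of a list).

Then train and predict as usual.

Each prediction is a boolean. Multiply it by the class index and add the results over all classes to get the list of predicted indices. This only works cleanly when exactly one classifier fires for a document: if none fires, the sum is 0 and the document is assigned to class 0.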
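The sketch below illustrates the summing scheme on made-up classifier scores and contrasts it with a common alternative, which is not the method used in this post: taking the argmax over the signed scores of the binary classifiers (for example the values returned by `decision_function`). The score matrix is hypothetical.

```python
import numpy as np

# Hypothetical signed scores of three one-vs-rest classifiers on four documents.
# Row i belongs to the "class i vs. the rest" classifier; score > 0 means "fires".
scores = np.array([
    [ 1.2, -0.3, -0.8, -0.4],
    [-0.5, -0.1, -0.2,  0.9],
    [-0.9, -0.6,  0.7,  0.3],
])
fired = scores > 0  # boolean predictions, as described in the text

# Scheme from the text: sum of (boolean prediction * class index).
class_index = np.arange(scores.shape[0]).reshape(-1, 1)
by_sum = (fired * class_index).sum(axis=0)

# Alternative (an assumption, not the post's method): pick the most confident classifier.
by_argmax = scores.argmax(axis=0)
print(by_sum, by_argmax)
```

The second document fires no classifier, so the sum falls back to class 0, and the fourth fires two classifiers, so the sum yields 3, which is not a valid class index for three classes. The argmax variant always returns exactly one valid class per document.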

Python code:

One VS All SVM

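predict3 = np.zeros(len(twenty_test.target), dtype=int)
for tar in range(len(categories)):
    # Binary target: True for documents of class tar, False for all others.
    # list() is needed on Python 3, where map() returns a lazy iterator.
    new_train_target = list(map(lambda x: x == tar, twenty_train.target))
    text_clf_svm = text_clf_svm.fit(twenty_train.data, new_train_target)
    predict_temp = text_clf_svm.predict(twenty_test.data)
    # Boolean prediction times class index, accumulated over all classes.
    predict3 = predict3 + predict_temp * tar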

precision_svc2=metrics.precision_score(twenty_test.target, predict3, average='weighted')
recall_svc2=metrics.recall_score(twenty_test.target, predict3, average='weighted')
accuracy_svc2=metrics.accuracy_score(twenty_test.target, predict3)
matrix_svc2=metrics.confusion_matrix(twenty_test.target, predict3)
print ("accuracy: %f\nrecall: %f\nprecision:%f\n"%(accuracy_svc2,recall_svc2,precision_svc2))
print("confusion matrix: \n%s"%matrix_svc2)

accuracy: 0.890735
recall: 0.890735
precision:0.907103

confusion matrix: 
[[371  12   9   0]
 [ 83 296   5   1]
 [ 37   2 345   6]
 [ 15   0   1 382]]