Support Vector Machines
In this exercise, we will use support vector machines (SVMs) to build a spam classifier. We will start by applying SVMs to some simple 2D datasets to see how they work. Then we will do some preprocessing on a set of raw emails and use an SVM to build a classifier on the processed emails that determines whether each one is spam.
The first thing we will do is look at a simple two-dimensional dataset and see how a linear SVM behaves on it for different values of C. C controls the penalty on margin violations; it plays a role similar to the regularization term in linear/logistic regression, except that in scikit-learn a larger C means *weaker* regularization.
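For reference, here is a minimal statement of the soft-margin objective that a hinge-loss linear SVM (such as scikit-learn's LinearSVC used below) minimizes; this is standard background, not something computed in this exercise:

$$
\min_{w,\,b}\ \frac{1}{2}\lVert w\rVert^{2} \;+\; C\sum_{i=1}^{m}\max\bigl(0,\ 1 - y_i\,(w^{\top}x_i + b)\bigr), \qquad y_i \in \{-1, +1\}
$$

A large C weights margin violations heavily, so the fit tolerates few of them; a small C tolerates more violations and acts like stronger regularization. The data file encodes classes as 0/1, which scikit-learn maps to ±1 internally.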
Example 1: Linear SVM on linearly separable data
1.1 Loading the data
```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb

import warnings
warnings.simplefilter("ignore")
```
We visualize the data as a scatter plot, where the class label is indicated by the marker (x for the positive class, o for the negative class).
```python
data1 = pd.read_csv('data/svmdata1.csv')
data1.head()
```
|   | X1     | X2     | y |
|---|--------|--------|---|
| 0 | 1.9643 | 4.5957 | 1 |
| 1 | 2.2753 | 3.8589 | 1 |
| 2 | 2.9781 | 4.5651 | 1 |
| 3 | 2.9320 | 3.5519 | 1 |
| 4 | 3.5772 | 2.8560 | 1 |
```python
positive = data1[data1["y"].isin([1])]
negative = data1[data1["y"].isin([0])]
negative
```
|    | X1      | X2     | y |
|----|---------|--------|---|
| 20 | 1.58410 | 3.3575 | 0 |
| 21 | 2.01030 | 3.2039 | 0 |
| 22 | 1.95270 | 2.7843 | 0 |
| 23 | 2.27530 | 2.7127 | 0 |
| 24 | 2.30990 | 2.9584 | 0 |
| 25 | 2.82830 | 2.6309 | 0 |
| 26 | 3.04730 | 2.2931 | 0 |
| 27 | 2.48270 | 2.0373 | 0 |
| 28 | 2.50570 | 2.3853 | 0 |
| 29 | 1.87210 | 2.0577 | 0 |
| 30 | 2.01030 | 2.3546 | 0 |
| 31 | 1.22690 | 2.3239 | 0 |
| 32 | 1.89510 | 2.9174 | 0 |
| 33 | 1.56100 | 3.0709 | 0 |
| 34 | 1.54950 | 2.6923 | 0 |
| 35 | 1.68780 | 2.4057 | 0 |
| 36 | 1.49190 | 2.0271 | 0 |
| 37 | 0.96200 | 2.6820 | 0 |
| 38 | 1.16930 | 2.9276 | 0 |
| 39 | 0.81220 | 2.9992 | 0 |
| 40 | 0.97350 | 3.3881 | 0 |
| 41 | 1.25000 | 3.1937 | 0 |
| 42 | 1.31910 | 3.5109 | 0 |
| 43 | 2.22920 | 2.2010 | 0 |
| 44 | 2.44820 | 2.6411 | 0 |
| 45 | 2.79380 | 1.9656 | 0 |
| 46 | 2.09100 | 1.6177 | 0 |
| 47 | 2.54030 | 2.8867 | 0 |
| 48 | 0.90440 | 3.0198 | 0 |
| 49 | 0.76615 | 2.5899 | 0 |
```python
positive = data1[data1['y'].isin([1])]
negative = data1[data1['y'].isin([0])]

fig, ax = plt.subplots(figsize=(6, 4))
ax.scatter(positive['X1'], positive['X2'], s=50, marker='x', label='Positive')
ax.scatter(negative['X1'], negative['X2'], s=50, marker='o', label='Negative')
ax.legend()
plt.show()
```
Note that there is one outlying positive example that sits apart from the others.
The classes are still linearly separable, but the margin is very tight. We want to train a linear support vector machine to learn the class boundary. In this exercise we are not asked to implement an SVM from scratch, so we use scikit-learn.
1.2 Preparing the training data
Here we do not set aside a test set; we train on all of the data, and after training we inspect the confidence with which each point is assigned to its class.
```python
X_train = data1[["X1", "X2"]].values
y_train = data1["y"].values
```
1.3 Instantiating a linear support vector machine
```python
# Build the first SVM object, with C=1
from sklearn import svm

svc1 = svm.LinearSVC(C=1, loss="hinge", max_iter=1000)
svc1.fit(X_train, y_train)
svc1.score(X_train, y_train)
```
0.9803921568627451
```python
from sklearn.model_selection import cross_val_score

cross_val_score(svc1, X_train, y_train, cv=5).mean()
```
0.9800000000000001
Let's see what happens with a larger value of C. Since C scales the penalty on margin violations, a larger C should push the classifier to fit the training points, including the outlier, more aggressively.
```python
# Build the second SVM object, with C=100
svc2 = svm.LinearSVC(C=100, loss="hinge", max_iter=1000)
svc2.fit(X_train, y_train)
svc2.score(X_train, y_train)
```
0.9411764705882353
```python
from sklearn.model_selection import cross_val_score

cross_val_score(svc2, X_train, y_train, cv=5).mean()
```
0.96
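One point worth noting: the training accuracy with C=100 (about 0.941) is actually *lower* than with C=1. A plausible explanation is that LinearSVC's iterative solver does not fully converge within max_iter=1000 when C is large; the resulting ConvergenceWarning is hidden by the warnings.simplefilter("ignore") call at the top. Increasing max_iter or standardizing the features usually makes the large-C fit behave as expected.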
```python
X_train.shape
```

(51, 2)

```python
svc1.decision_function(X_train).shape
```

(51,)
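For a linear SVM, the decision function is just the signed score w·x + b: its sign gives the predicted class, and its magnitude grows with the distance from the separating hyperplane. As a quick sanity check (a sketch, assuming the svc1 fitted above), it can be recomputed from the learned coefficients:

```python
# decision_function(X) is equivalent to X @ w.T + b for a linear model,
# where w = svc1.coef_ (shape (1, 2)) and b = svc1.intercept_ (shape (1,)).
manual = X_train @ svc1.coef_.T + svc1.intercept_
print(np.allclose(manual.ravel(), svc1.decision_function(X_train)))  # expect True
```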
```python
# Store both SVMs' decision-function values as new columns
data1["SV1 decision function"] = svc1.decision_function(X_train)
data1["SV2 decision function"] = svc2.decision_function(X_train)

data1
```
|    | X1       | X2     | y | SV1 decision function | SV2 decision function |
|----|----------|--------|---|-----------------------|-----------------------|
| 0  | 1.964300 | 4.5957 | 1 | 0.798413  | 4.490754  |
| 1  | 2.275300 | 3.8589 | 1 | 0.380809  | 2.544578  |
| 2  | 2.978100 | 4.5651 | 1 | 1.373025  | 5.668147  |
| 3  | 2.932000 | 3.5519 | 1 | 0.518562  | 2.396315  |
| 4  | 3.577200 | 2.8560 | 1 | 0.332007  | 1.000000  |
| 5  | 4.015000 | 3.1937 | 1 | 0.866642  | 2.621549  |
| 6  | 3.381400 | 3.4291 | 1 | 0.684095  | 2.571736  |
| 7  | 3.911300 | 4.1761 | 1 | 1.607362  | 5.607368  |
| 8  | 2.782200 | 4.0431 | 1 | 0.830991  | 3.766091  |
| 9  | 2.551800 | 4.6162 | 1 | 1.162616  | 5.294331  |
| 10 | 3.369800 | 3.9101 | 1 | 1.069933  | 4.082890  |
| 11 | 3.104800 | 3.0709 | 1 | 0.228063  | 1.087807  |
| 12 | 1.918200 | 4.0534 | 1 | 0.328403  | 2.712621  |
| 13 | 2.263800 | 4.3706 | 1 | 0.791771  | 4.153238  |
| 14 | 2.655500 | 3.5008 | 1 | 0.313312  | 1.886635  |
| 15 | 3.185500 | 4.2888 | 1 | 1.270111  | 5.052445  |
| 16 | 3.657900 | 3.8692 | 1 | 1.206933  | 4.315328  |
| 17 | 3.911300 | 3.4291 | 1 | 0.997496  | 3.237878  |
| 18 | 3.600200 | 3.1221 | 1 | 0.562860  | 1.872985  |
| 19 | 3.035700 | 3.3165 | 1 | 0.387708  | 1.779986  |
| 20 | 1.584100 | 3.3575 | 0 | -0.437342 | 0.085220  |
| 21 | 2.010300 | 3.2039 | 0 | -0.310676 | 0.133779  |
| 22 | 1.952700 | 2.7843 | 0 | -0.687313 | -1.269605 |
| 23 | 2.275300 | 2.7127 | 0 | -0.554972 | -1.091178 |
| 24 | 2.309900 | 2.9584 | 0 | -0.333914 | -0.268319 |
| 25 | 2.828300 | 2.6309 | 0 | -0.294693 | -0.655467 |
| 26 | 3.047300 | 2.2931 | 0 | -0.440957 | -1.451665 |
| 27 | 2.482700 | 2.0373 | 0 | -0.983720 | -2.972828 |
| 28 | 2.505700 | 2.3853 | 0 | -0.686002 | -1.840056 |
| 29 | 1.872100 | 2.0577 | 0 | -1.328194 | -3.675710 |
| 30 | 2.010300 | 2.3546 | 0 | -1.004062 | -2.560208 |
| 31 | 1.226900 | 2.3239 | 0 | -1.492455 | -3.642407 |
| 32 | 1.895100 | 2.9174 | 0 | -0.612714 | -0.919820 |
| 33 | 1.561000 | 3.0709 | 0 | -0.684991 | -0.852917 |
| 34 | 1.549500 | 2.6923 | 0 | -1.000889 | -2.068296 |
| 35 | 1.687800 | 2.4057 | 0 | -1.153080 | -2.803536 |
| 36 | 1.491900 | 2.0271 | 0 | -1.578039 | -4.250726 |
| 37 | 0.962000 | 2.6820 | 0 | -1.356765 | -2.839519 |
| 38 | 1.169300 | 2.9276 | 0 | -1.033648 | -1.799875 |
| 39 | 0.812200 | 2.9992 | 0 | -1.186393 | -2.021672 |
| 40 | 0.973500 | 3.3881 | 0 | -0.773489 | -0.585307 |
| 41 | 1.250000 | 3.1937 | 0 | -0.768670 | -0.854355 |
| 42 | 1.319100 | 3.5109 | 0 | -0.468833 | 0.238673  |
| 43 | 2.229200 | 2.2010 | 0 | -1.000000 | -2.772247 |
| 44 | 2.448200 | 2.6411 | 0 | -0.511169 | -1.100940 |
| 45 | 2.793800 | 1.9656 | 0 | -0.858263 | -2.809175 |
| 46 | 2.091000 | 1.6177 | 0 | -1.557954 | -4.796212 |
| 47 | 2.540300 | 2.8867 | 0 | -0.256185 | -0.206115 |
| 48 | 0.904400 | 3.0198 | 0 | -1.115044 | -1.840424 |
| 49 | 0.766150 | 2.5899 | 0 | -1.547789 | -3.377865 |
| 50 | 0.086405 | 4.1045 | 1 | -0.713261 | 0.571946  |
Using the decision-function value as the color of each point, we can visualize the confidence of each prediction and compare the results produced by the two SVMs.
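A minimal sketch of that comparison plot, assuming the data1 columns created above (the colormap choice is arbitrary):

```python
# Color each training point by its decision-function value and show
# the two classifiers side by side for comparison.
fig, axes = plt.subplots(1, 2, figsize=(12, 4))
for ax, col, title in zip(axes,
                          ["SV1 decision function", "SV2 decision function"],
                          ["LinearSVC (C=1)", "LinearSVC (C=100)"]):
    sc = ax.scatter(data1["X1"], data1["X2"], s=50, c=data1[col], cmap="seismic")
    ax.set_title(title)
    fig.colorbar(sc, ax=ax)
plt.show()
```

Points whose value is near zero lie close to the decision boundary. As the table shows, the C=100 model produces much larger magnitudes (roughly ±5.7 versus ±1.6), i.e. a steeper, less regularized decision function.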