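Homework 2
2.1 Minimum-risk Bayes decision: a worked classification
1. Give the solution steps for the following problem, showing the calculation step by step.
The given conditions are: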
P(w_1) = 0.9
P(w_2)=0.1
p(x|w_1)=0.2
p(x|w_2)=0.4
$\lambda_{11}=0$, $\lambda_{12}=6$
$\lambda_{21}=1$, $\lambda_{22}=0$
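Using the following decision table (rows are actions, columns are true classes), classify x by the minimum-risk Bayes decision rule.
Action | w_1 | w_2 |
---|---|---|
a_1 | 0 | 6 |
a_2 | 1 | 0 |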
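Solution:
Under the minimum-risk Bayes decision rule we choose the action whose conditional risk, that is, the loss weighted by the posterior probabilities, is smallest. We therefore first compute the posterior probabilities and then the conditional risk of each action.
By Bayes' formula, for the given observation $x$ the posterior probabilities are:
$P(w_1|x) = \frac{p(x|w_1)P(w_1)}{p(x)}$
$P(w_2|x) = \frac{p(x|w_2)P(w_2)}{p(x)}$
where the denominator is the normalizing constant:
$p(x)=p(x|w_1)P(w_1)+p(x|w_2)P(w_2)$
Substituting the given values:
$ P(w_1)=0.9\\P(w_2)=0.1\\p(x|w_1)=0.2\\p(x|w_2)=0.4 $
we obtain, for the observed $x$,
$$\begin{aligned} &p(x) = 0.9\times 0.2 + 0.1 \times 0.4 = 0.22\\ &P(w_1|x) = \frac{0.2\times 0.9}{0.22} \approx 0.818\\ &P(w_2|x) = \frac{0.4\times 0.1}{0.22} \approx 0.182\\ \end{aligned}$$
Next we compute the conditional risk of each action. The decision table gives the loss $\lambda_{ij}$ of taking action $a_i$ when the true class is $w_j$, so each risk is the sum of the losses in that row weighted by the corresponding posterior probabilities:
$ R(a_1|x) =\lambda_{11} P(w_1|x)+\lambda_{12} P(w_2|x) = 0\times 0.818 + 6\times 0.182 \approx 1.091 $
$ R(a_2|x) =\lambda_{21} P(w_1|x)+\lambda_{22} P(w_2|x) = 1\times 0.818 + 0\times 0.182 = 0.818 $
The minimum-risk rule selects the action with the smallest conditional risk. Since $R(a_2|x) < R(a_1|x)$, we choose $a_2$, that is, $x$ is assigned to class $w_2$, even though $w_1$ has the larger posterior probability.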
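The arithmetic above can be checked with a few lines of Python. This is only a minimal sketch: the variable names (`priors`, `likelihoods`, `loss`, and so on) are chosen here for illustration and are not part of the problem statement.

```python
# Priors, class-conditional densities at the observed x, and the loss table
priors = [0.9, 0.1]         # P(w_1), P(w_2)
likelihoods = [0.2, 0.4]    # p(x|w_1), p(x|w_2)
loss = [[0, 6],             # lambda_11, lambda_12 (action a_1)
        [1, 0]]             # lambda_21, lambda_22 (action a_2)

# Normalizing constant p(x) and the posteriors P(w_j|x)
evidence = sum(p * l for p, l in zip(priors, likelihoods))
posteriors = [p * l / evidence for p, l in zip(priors, likelihoods)]

# Conditional risk R(a_i|x) = sum_j lambda_ij * P(w_j|x)
risks = [sum(loss[i][j] * posteriors[j] for j in range(2)) for i in range(2)]

# 1-based index of the action with the smallest risk
best_action = min(range(2), key=lambda i: risks[i]) + 1
print(posteriors, risks, best_action)
```

Working with the unrounded posteriors 9/11 and 2/11 gives risks of 12/11 and 9/11, so the decision does not depend on how the intermediate values are rounded.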
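2.2 Differences between the minimum-risk and the minimum-error-rate Bayes decision
Both are standard Bayes decision rules. They differ as follows:
(1) Objective: the minimum-risk rule minimizes the overall expected loss, taking into account the cost of every possible kind of decision; the minimum-error-rate rule minimizes the probability of misclassification.
(2) Decision rule: the minimum-risk rule computes the conditional risk (expected loss) of each action from the loss table and the posterior probabilities, and picks the action with the smallest risk; the minimum-error-rate rule picks the class with the largest posterior probability.
(3) Required information: both rules need the class-conditional distributions and the prior probabilities; the minimum-risk rule additionally needs the loss function. With a 0-1 loss the two rules give the same decision.
(4) Use cases: if the costs or gains of the different decisions are known and can be quantified, the minimum-risk rule is appropriate; if only classification accuracy matters, or the costs cannot be quantified reliably, the minimum-error-rate rule is appropriate.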
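The contrast between the two rules can be made concrete with the posteriors from 2.1. The sketch below is illustrative only; the helper `min_risk_action` and the variable names are hypothetical, introduced here for the example.

```python
def min_risk_action(posteriors, loss):
    """Return the 0-based index of the action with the smallest conditional risk."""
    risks = [sum(l * p for l, p in zip(row, posteriors)) for row in loss]
    return min(range(len(risks)), key=risks.__getitem__)

posteriors = [9 / 11, 2 / 11]     # P(w_1|x), P(w_2|x) from 2.1

zero_one = [[0, 1], [1, 0]]       # 0-1 loss: every error costs the same
asymmetric = [[0, 6], [1, 0]]     # loss table from 2.1

# Minimum-error-rate rule: pick the class with the largest posterior
min_error_choice = max(range(2), key=posteriors.__getitem__)

print(min_error_choice)                          # index 0, i.e. w_1
print(min_risk_action(posteriors, zero_one))     # same decision under 0-1 loss
print(min_risk_action(posteriors, asymmetric))   # index 1, i.e. a_2
```

Under the 0-1 loss the risk of each action equals one minus the posterior of the matching class, which is why minimizing the risk and maximizing the posterior agree.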
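Homework 3
3.1 Implementing the minimum-risk Bayes decision
Write a Python program for the following task:
Two normal distributions have means $\mu_1=[1, 1]^T$ and $\mu_2=[1.5, 1.5]^T$, both with covariance matrix $0.2I$, and equal prior probabilities; the decision table (loss matrix) is given below. Generate 1000 two-dimensional samples from each distribution. Use 800 samples of each class to estimate the distribution parameters by maximum likelihood, apply the minimum-risk Bayes decision to the remaining 200 samples of each class, and compute the recognition rate.
$$\begin{bmatrix} 0 & 1 \\ 0.5 & 0 \end{bmatrix}$$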
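For a normal distribution the maximum-likelihood estimates are the sample mean and the sample covariance with divisor N, whereas `np.cov` divides by N-1 unless `bias=True` is passed. The following sketch illustrates the difference; the seed and the variable names are assumptions made for this example only.

```python
import numpy as np

rng = np.random.default_rng(0)   # seed chosen arbitrarily for reproducibility
samples = rng.multivariate_normal([1.0, 1.0], 0.2 * np.eye(2), size=800)

# Maximum-likelihood estimates: sample mean and covariance with divisor N
mu_hat = samples.mean(axis=0)
centered = samples - mu_hat
sigma_mle = centered.T @ centered / len(samples)

# np.cov uses divisor N-1 by default (the unbiased estimate)
sigma_unbiased = np.cov(samples.T)

print(mu_hat)
print(sigma_mle)
print(sigma_unbiased)
```

With 800 samples the two covariance estimates differ only by the factor 800/799, so the choice has little practical effect here, but `bias=True` is the one that matches the maximum-likelihood formula.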
import numpy as np
from scipy.stats import multivariate_normal

# Generate the data set
def generate_data(mu1, mu2, cov, num_samples):
    # Draw samples from two normal distributions that share one covariance matrix
    data1 = np.random.multivariate_normal(mu1, cov, num_samples)
    data2 = np.random.multivariate_normal(mu2, cov, num_samples)
    return data1, data2

# Estimate the distribution parameters by maximum likelihood
def estimate_parameters(data):
    # Sample mean
    mean = np.mean(data, axis=0)
    # Sample covariance; bias=True divides by N, which is the ML estimate
    cov = np.cov(data.T, bias=True)
    return mean, cov

# Minimum-risk Bayes decision
def minimum_risk_bayesian_decision(data, means, covs, prior_probs, loss):
    num_samples = data.shape[0]
    num_classes = len(means)
    decisions = np.zeros(num_samples)
    for i in range(num_samples):
        # Posterior probability of each class for this sample
        joint = np.zeros(num_classes)
        for j in range(num_classes):
            joint[j] = multivariate_normal.pdf(data[i], means[j], covs[j]) * prior_probs[j]
        posterior_probs = joint / joint.sum()
        # Conditional risk R(a_i|x) = sum_j loss[i, j] * P(w_j|x)
        risks = loss @ posterior_probs
        # Choose the action with the smallest risk
        decisions[i] = np.argmin(risks)
    return decisions

# Compute the recognition rate
def compute_accuracy(true_labels, predicted_labels):
    num_samples = true_labels.shape[0]
    num_correct = np.sum(true_labels == predicted_labels)
    accuracy = num_correct / num_samples
    return accuracy

# Parameters
mu1 = np.array([1, 1])
mu2 = np.array([1.5, 1.5])
cov = 0.2 * np.eye(2)                 # covariance matrix 0.2I
loss = np.array([[0, 1], [0.5, 0]])   # decision table: rows are actions, columns are true classes
prior_probs = np.array([0.5, 0.5])

# Generate the data set
num_samples = 1000
data1, data2 = generate_data(mu1, mu2, cov, num_samples)

# Estimate the parameters from the first 800 samples of each class
mean1, cov1 = estimate_parameters(data1[:800])
mean2, cov2 = estimate_parameters(data2[:800])
means = [mean1, mean2]
covs = [cov1, cov2]

# Minimum-risk Bayes decision on the remaining 200 samples of each class
test_data = np.concatenate((data1[800:], data2[800:]), axis=0)
test_labels = np.concatenate((np.zeros(200), np.ones(200)))
decisions = minimum_risk_bayesian_decision(test_data, means, covs, prior_probs, loss)

# Compute the recognition rate
accuracy = compute_accuracy(test_labels, decisions)
print("Recognition rate:", accuracy)
Recognition rate: 0.695
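3.2 Handwritten digit recognition
Use Python to implement handwritten digit recognition with a naive Bayes classifier, and report the test accuracy. The training and test images are in the train and test subfolders of the Handwriting folder. Each image file is named like "1_2.png", which means the 2nd image of the digit 1.
Data set download: Handwriting.rar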
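The naming rule means the class label can be read directly from the file name. A minimal sketch of that parsing step follows; the helper name `label_from_filename` is hypothetical and only illustrates the rule stated above.

```python
import os

def label_from_filename(filename):
    """Extract the digit label from a name such as '1_2.png' (digit 1, image 2)."""
    stem, _ext = os.path.splitext(os.path.basename(filename))
    digit, _sep, _index = stem.partition('_')
    return int(digit)

print(label_from_filename("1_2.png"))
print(label_from_filename("Handwriting/train/7_15.png"))
```

Splitting on the underscore rather than taking the first character keeps the parsing tied to the stated rule instead of to the length of the label.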
import os
import numpy as np
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score
from skimage.io import imread

# Load every image in a folder together with its label
def load_data(data_folder):
    X = []
    y = []
    for label in os.listdir(data_folder):
        # Skip hidden files such as .DS_Store
        if label.startswith('.'):
            continue
        image_path = os.path.join(data_folder, label)
        image = imread(image_path, as_gray=True)
        # Flatten the image into a feature vector of pixel intensities
        X.append(image.flatten())
        # "1_2.png" -> label 1 (the part before the underscore)
        y.append(int(label.split('_')[0]))
    return np.array(X), np.array(y)

# Load the training and test data
train_folder = "Handwriting/train"
test_folder = "Handwriting/test"
X_train, y_train = load_data(train_folder)
X_test, y_test = load_data(test_folder)

# Train the model
clf = MultinomialNB()
clf.fit(X_train, y_train)

# Predict
y_pred = clf.predict(X_test)

# Compute the accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy: {:.2%}".format(accuracy))
Accuracy: 53.33%
Homework 4
Implementing and evaluating the nearest-neighbor algorithm in Python
Data set download: svmguide1.rar
import numpy as np
import random
from collections import Counter

######## Example code for reading a machine-learning data set (LIBSVM format)
def load_svmfile(filename):
    X = []
    Y = []
    with open(filename, 'r') as f:
        filelines = f.readlines()
        for fileline in filelines:
            fileline = fileline.strip().split(' ')
            #print(fileline)
            # The first field is the label, the rest are "index:value" pairs
            Y.append(int(fileline[0]))
            tmp = []
            for t in fileline[1:]:
                if len(t)==0:
                    continue
                tmp.append(float(t.split(':')[1]))
            X.append(tmp)
    return np.array(X), np.array(Y)

######## Download the data set from: https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html#svmguide1
######## Save the data set in the current directory
######## Read the data set
dataset = 'svmguide1'
print('Start loading dataset {}'.format(dataset))
X, Y = load_svmfile('{}'.format(dataset)) # training set (the file without the .t suffix)
X_test, Y_test = load_svmfile('{}.t'.format(dataset)) # test set
print('trainset X shape {}, train label Y shape {}'.format(X.shape, Y.shape))
Start loading dataset svmguide1
trainset X shape (4000, 4), train label Y shape (4000,)
######## Implement a KNN classifier; the required parts are train, test and _calculate_distances
class KNN_model():
    def __init__(self, k=1):
        self.k = k

    def train(self, x_train, y_train):
        """Implement the training code for KNN
        Input:
            x_train: Training instances of size (N, D), where N denotes the number of instances and D denotes the feature dimension
            y_train: Training labels of size (N, )
        """
        # KNN is a lazy learner: training only stores the data
        self.x_train = x_train
        self.y_train = y_train

    def test(self, x_test):
        """
        Input: Test instances of size (N, D), where N denotes the number of instances and D denotes the feature dimension
        Return: Predicted labels of size (N, )
        """
        N = x_test.shape[0]  # number of test instances
        y_pred = []  # list of predicted labels
        for i in range(N):  # loop over the test instances
            distances = self._calculate_distances(x_test[i])  # distances from this test point to all training points
            nn_indices = np.argsort(distances)[:self.k]  # indices of the k nearest points, sorted by increasing distance
            nn_labels = self.y_train[nn_indices]  # labels of the k nearest points
            counts = np.bincount(nn_labels)  # number of neighbors in each class (labels must be non-negative integers)
            y_pred.append(np.argmax(counts))  # the majority class is the prediction
        return np.array(y_pred)

    def _calculate_distances(self, point):
        """Calculate the euclidean distance between a test instance and all points in the training set x_train
        Input: a single point of size (D, )
        Return: distance vector of size (N, )
        """
        return np.sqrt(np.sum((self.x_train - point) ** 2, axis=1))
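The per-point distance computation in `_calculate_distances` can also be done for all test points at once with NumPy broadcasting. The sketch below is one possible variant, not part of the assignment; the function name `pairwise_euclidean` and the toy arrays are made up for illustration.

```python
import numpy as np

def pairwise_euclidean(x_test, x_train):
    """Return an (M, N) matrix of Euclidean distances between test and training points."""
    # Broadcasting: (M, 1, D) - (1, N, D) -> (M, N, D)
    diff = x_test[:, None, :] - x_train[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=2))

x_train = np.array([[0.0, 0.0], [3.0, 4.0]])
x_test = np.array([[0.0, 0.0], [3.0, 0.0]])
dists = pairwise_euclidean(x_test, x_train)
print(dists)
```

The intermediate array has M x N x D entries, so for large data sets it may be preferable to process the test points in batches.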
######### Split the original training set into two parts: training and validation
random.seed(777777)  # fix the random seed
N = X.shape[0]
valid_frac = 0.2  # use 20% of the data for validation
valid_size = int(N*valid_frac)
# For simplicity, split by a random shuffle of the indices
shuffle_index = [i for i in range(N)]
random.shuffle(shuffle_index)
valid_index, train_index = shuffle_index[:valid_size], shuffle_index[valid_size:]
X_valid, Y_valid = X[valid_index], Y[valid_index]
X_train, Y_train = X[train_index], Y[train_index]
print('trainset X_train shape {}, validset X_valid shape {}'.format(X_train.shape, X_valid.shape))
trainset X_train shape (3200, 4), validset X_valid shape (800, 4)
######### Implement the accuracy function. The expected output is a percentage: an accuracy of 0.95 should be reported as 95
def cal_accuracy(y_pred, y_gt):
    '''
    y_pred: predicted labels (N,)
    y_gt: ground truth labels (N,)
    Return: Accuracy (%)
    '''
    accuracy = (y_pred == y_gt).mean()
    return accuracy * 100
# assert abs(cal_accuracy(np.zeros(Y.shape[0]), Y)-100*1089.0/3089.0)<1e-3
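##### Use the validation set to choose the hyperparameter
possible_k_list = [1,3,5,7,9,11]  # candidate values of k in this experiment
accs = []  # validation accuracy for each candidate k
for k in possible_k_list:
    ##### Set the hyperparameter of the model to k
    model = KNN_model(k=k)
    ##### Train on the training set (hint: model.train())
    model.train(X_train, Y_train)
    ##### Predict Y_pred_valid on the validation set X_valid (hint: model.test())
    Y_pred_valid = model.test(X_valid)
    ##### Compute the accuracy on the validation set
    acc_k = cal_accuracy(Y_pred_valid, Y_valid)
    ##### Append the validation accuracy of this k to the list
    accs.append(acc_k)
    print('k={}, accuracy on validation={}%'.format(k, acc_k))
import matplotlib.pyplot as plt
plt.plot(possible_k_list, accs)  # plot the validation accuracy for each k
plt.xlabel('k')
plt.ylabel('validation accuracy (%)')
plt.show()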
k=1, accuracy on validation=96.5%
k=3, accuracy on validation=97.0%
k=5, accuracy on validation=96.875%
k=7, accuracy on validation=97.25%
k=9, accuracy on validation=97.0%
k=11, accuracy on validation=97.5%
##### Based on the results above, pick the k that is best on the validation set and use it for the final test
##### Define the model with the best k
best_k = possible_k_list[np.argmax(accs)]
model = KNN_model(k=best_k)
##### Train on the training set; all of the training data can be used here
model.train(X, Y)
##### Predict Y_pred_test on the test set
Y_pred_test = model.test(X_test)
print('Test Accuracy={}%'.format(cal_accuracy(Y_pred_test, Y_test)))
Test Accuracy=97.05%
##### Implement 5-fold cross-validation; the earlier train/validation split can serve as a reference
folds = 5
# Shuffle once, then cut the indices into 5 disjoint folds that together cover the data
random.shuffle(shuffle_index)
fold_indices = np.array_split(np.array(shuffle_index), folds)
cv_mean_accs = []  # mean cross-validation accuracy for each candidate k
for k in possible_k_list:  # loop over all candidate k
    print('******k={}******'.format(k))
    valid_accs = []
    for i in range(folds):  # experiment on fold i
        ##### Build the training set X_train_i, Y_train_i and the validation set X_valid_i, Y_valid_i of fold i
        valid_index_i = fold_indices[i]
        train_index_i = np.concatenate([fold_indices[j] for j in range(folds) if j != i])
        X_valid_i, Y_valid_i = X[valid_index_i], Y[valid_index_i]
        X_train_i, Y_train_i = X[train_index_i], Y[train_index_i]
        ##### Define the model with hyperparameter k
        model = KNN_model(k=k)
        ##### Train on fold i
        model.train(X_train_i, Y_train_i)
        ##### Predict Y_pred_valid_i on the validation set X_valid_i of fold i
        Y_pred_valid_i = model.test(X_valid_i)
        acc = cal_accuracy(Y_pred_valid_i, Y_valid_i)
        valid_accs.append(acc)
        print('Valid Accuracy on Fold-{}: {}%'.format(i+1, acc))
    cv_mean_accs.append(np.mean(valid_accs))
    print('k={}, Accuracy {}+-{}%'.format(k, np.mean(valid_accs), np.std(valid_accs)))
******k=1******
Valid Accuracy on Fold-1: 95.625%
Valid Accuracy on Fold-2: 95.375%
Valid Accuracy on Fold-3: 96.0%
Valid Accuracy on Fold-4: 97.0%
Valid Accuracy on Fold-5: 96.0%
k=1, Accuracy 96.0+-0.5533985905294664%
******k=3******
Valid Accuracy on Fold-1: 96.5%
Valid Accuracy on Fold-2: 97.0%
Valid Accuracy on Fold-3: 96.625%
Valid Accuracy on Fold-4: 96.5%
Valid Accuracy on Fold-5: 95.625%
k=3, Accuracy 96.45+-0.4513867521316947%
******k=5******
Valid Accuracy on Fold-1: 96.625%
Valid Accuracy on Fold-2: 97.125%
Valid Accuracy on Fold-3: 96.75%
Valid Accuracy on Fold-4: 96.5%
Valid Accuracy on Fold-5: 97.0%
k=5, Accuracy 96.8+-0.2318404623873926%
******k=7******
Valid Accuracy on Fold-1: 96.375%
Valid Accuracy on Fold-2: 96.5%
Valid Accuracy on Fold-3: 97.375%
Valid Accuracy on Fold-4: 95.875%
Valid Accuracy on Fold-5: 96.125%
k=7, Accuracy 96.45+-0.5099019513592785%
******k=9******
Valid Accuracy on Fold-1: 96.875%
Valid Accuracy on Fold-2: 96.75%
Valid Accuracy on Fold-3: 97.125%
Valid Accuracy on Fold-4: 96.875%
Valid Accuracy on Fold-5: 95.875%
k=9, Accuracy 96.7+-0.4301162633521313%
******k=11******
Valid Accuracy on Fold-1: 97.25%
Valid Accuracy on Fold-2: 97.0%
Valid Accuracy on Fold-3: 96.625%
Valid Accuracy on Fold-4: 96.875%
Valid Accuracy on Fold-5: 96.625%
k=11, Accuracy 96.875+-0.23717082451262844%
##### Based on cross-validation, pick the k with the best mean validation accuracy and use it for the final test
##### Define the model with the best k
best_k = possible_k_list[int(np.argmax(cv_mean_accs))]
model = KNN_model(k=best_k)
##### Train on the training set; all of the training data can be used here
model.train(X, Y)
##### Predict Y_pred_test on the test set
Y_pred_test = model.test(X_test)
print('Test Accuracy choosing k using cross-validation={}%'.format(cal_accuracy(Y_pred_test, Y_test)))
Test Accuracy choosing k using cross-validation=100.0%
##### How should a model be evaluated when the training/test set is imbalanced?
##### Build an imbalanced test set. In the example data set all samples with label 1 come last, so for convenience we simply truncate the test set
N_test = int(X_test.shape[0]*0.7)
X_test, Y_test = X_test[:N_test], Y_test[:N_test]
print(Counter(Y_test))  # label distribution of the new test set
model = KNN_model(k=best_k)  # use the best k found by cross-validation
model.train(X, Y)
Y_pred_test = model.test(X_test)
# Implement a function that computes precision, recall and F1 score
from sklearn.metrics import precision_recall_fscore_support
def cal_prec_recall_f1(Y_pred, Y_gt):
    '''
    Input: predicted labels Y_pred, ground truth labels Y_gt
    Return: precision, recall, and F1 score
    '''
    precision, recall, f1, _ = precision_recall_fscore_support(Y_gt, Y_pred, average='binary')
    return precision, recall, f1
print(cal_prec_recall_f1(Y_pred_test, Y_test))
Counter({0: 2000, 1: 800})
(1.0, 1.0, 1.0)
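The three metrics can also be computed directly from the counts of true positives, false positives and false negatives, which makes their definitions explicit. The following is a sketch under the assumption that label 1 is the positive class; the function name and the toy labels are invented for this illustration.

```python
import numpy as np

def precision_recall_f1(y_pred, y_true, positive=1):
    """Compute precision, recall and F1 for the given positive class from raw counts."""
    y_pred = np.asarray(y_pred)
    y_true = np.asarray(y_true)
    tp = np.sum((y_pred == positive) & (y_true == positive))  # true positives
    fp = np.sum((y_pred == positive) & (y_true != positive))  # false positives
    fn = np.sum((y_pred != positive) & (y_true == positive))  # false negatives
    precision = tp / (tp + fp) if tp + fp > 0 else 0.0
    recall = tp / (tp + fn) if tp + fn > 0 else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall > 0 else 0.0
    return float(precision), float(recall), float(f1)

# Toy example: 3 positives and 5 negatives
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
result = precision_recall_f1(y_pred, y_true)
print(result)
```

In this toy example the accuracy is 6/8 = 75% while precision and recall are both 2/3, which shows how accuracy alone can look better than the performance on the minority class.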
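Homework 5
Implement the following in Python:
Given the positive points $x_1=(1,2)^T, x_2=(2,3)^T, x_3=(3,3)^T$ and the negative points $x_4=(2,1)^T, x_5=(3,2)^T$, use the SVC class of sklearn.svm to find the maximum-margin separating hyperplane and the classification decision function, and use matplotlib to plot the separating hyperplane, the margin boundaries and the support vectors. Output the Python code.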
import numpy as np
import matplotlib.pyplot as plt
from sklearn.svm import SVC

# Positive points
x1 = np.array([1, 2])
x2 = np.array([2, 3])
x3 = np.array([3, 3])
# Negative points
x4 = np.array([2, 1])
x5 = np.array([3, 2])

# Data set
X = np.array([x1, x2, x3, x4, x5])
y = np.array([1, 1, 1, -1, -1])

# Create an SVC model with a linear kernel. A large C approximates the
# hard-margin (maximum-margin) SVM; the default C=1.0 gives a soft margin.
model = SVC(kernel='linear', C=1e6)

# Fit the model
model.fit(X, y)

# Coefficients of the separating hyperplane w.x + b = 0
w = model.coef_[0]
b = model.intercept_[0]
print('w =', w, 'b =', b)
print('decision function: f(x) = sign({:.3f}*x1 + {:.3f}*x2 + {:.3f})'.format(w[0], w[1], b))

x_min = min(X[:, 0]) - 1
x_max = max(X[:, 0]) + 1
y_min = min(X[:, 1]) - 1
y_max = max(X[:, 1]) + 1
xx = np.linspace(x_min, x_max, 100)
yy = -(w[0] * xx + b) / w[1]          # w.x + b = 0
yy_pos = -(w[0] * xx + b - 1) / w[1]  # w.x + b = +1
yy_neg = -(w[0] * xx + b + 1) / w[1]  # w.x + b = -1

# Plot the separating hyperplane, the margin boundaries and the support vectors
plt.plot(xx, yy, 'k-', label='Separating Hyperplane')
plt.plot(xx, yy_pos, 'k--', label='Margin Boundary')
plt.plot(xx, yy_neg, 'k--')
plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.Paired, label='Points')
plt.scatter(model.support_vectors_[:, 0], model.support_vectors_[:, 1], s=100, facecolors='none', edgecolors='k', label='Support Vectors')
plt.ylim(y_min, y_max)
plt.xlabel('X')
plt.ylabel('Y')
plt.legend()
plt.savefig('5.png',dpi=100)
plt.show()
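For this small data set the maximum-margin solution can also be worked out by hand, which gives something to compare the fitted model against. The sketch below checks a hand-derived candidate, w = (-1, 2), b = -2 with multipliers alpha = (0.5, 0, 2, 0, 2.5), against the KKT conditions of the hard-margin SVM using NumPy only. These values are derived here for illustration and do not appear in the original write-up.

```python
import numpy as np

X = np.array([[1, 2], [2, 3], [3, 3], [2, 1], [3, 2]], dtype=float)
y = np.array([1, 1, 1, -1, -1], dtype=float)

# Hand-derived candidate solution (an assumption to be verified, not library output)
w = np.array([-1.0, 2.0])
b = -2.0
alpha = np.array([0.5, 0.0, 2.0, 0.0, 2.5])   # Lagrange multipliers

# Primal feasibility: y_i (w.x_i + b) >= 1 for every point
margins = y * (X @ w + b)
print(margins)

# Stationarity: w = sum_i alpha_i y_i x_i, and sum_i alpha_i y_i = 0
w_from_alpha = (alpha * y) @ X
print(w_from_alpha, np.sum(alpha * y))

# Complementary slackness: alpha_i > 0 only where the margin equals 1
print(alpha * (margins - 1))

# Geometric margin 1 / ||w||
print(1 / np.linalg.norm(w))
```

The points with nonzero multipliers, x_1, x_3 and x_5, are the support vectors. Two of the multipliers exceed 1, which is one reason a soft-margin fit with a small C would not be expected to reproduce this hyperplane.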
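Homework 6
Like the MNIST handwritten digit data set, CIFAR-10 has 10 classes. It contains 60000 images: 50000 in the training set and 10000 in the test set. Unlike MNIST, the CIFAR-10 images are in color. Each image has size 32x32x3, where 3 stands for the three RGB channels; the color of each pixel is determined by its three RGB values, each in the range 0 to 255. Following the MNIST handwritten digit recognition example, implement a convolutional neural network in PyTorch and run a classification experiment on CIFAR-10.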
import torch
import torch.nn as nn
import torch.optim as optim
import torchvision
import torchvision.transforms as transforms
import torch.nn.functional as F

# Load the CIFAR-10 data set; pixel values are scaled to [0, 1] and then normalized to [-1, 1]
transform = transforms.Compose(
    [transforms.ToTensor(),
     transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))])

# On the first run the data set is downloaded automatically; afterwards download can be set to False
data_path = "data"
trainset = torchvision.datasets.CIFAR10(root=data_path, train=True,
                                        download=True, transform=transform)
trainloader = torch.utils.data.DataLoader(trainset, batch_size=4,
                                          shuffle=True, num_workers=2)
testset = torchvision.datasets.CIFAR10(root=data_path, train=False,
                                       download=True, transform=transform)
testloader = torch.utils.data.DataLoader(testset, batch_size=4,
                                         shuffle=False, num_workers=2)
classes = ('plane', 'car', 'bird', 'cat',
           'deer', 'dog', 'frog', 'horse', 'ship', 'truck')

# Define the convolutional neural network
class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(3, 6, 5)
        self.pool = nn.MaxPool2d(2, 2)
        self.conv2 = nn.Conv2d(6, 16, 5)
        self.fc1 = nn.Linear(16 * 5 * 5, 120)
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        # Flatten the 16 feature maps of size 5x5 into one vector per image
        x = x.view(-1, 16 * 5 * 5)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x

net = Net()

# Define the loss function and the optimizer
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(net.parameters(), lr=0.001, momentum=0.9)

for epoch in range(2):  # two epochs, as in the training log recorded below
    running_loss = 0.0
    for i, data in enumerate(trainloader, 0):
        inputs, labels = data
        optimizer.zero_grad()
        # Forward pass
        outputs = net(inputs)
        loss = criterion(outputs, labels)
        # Backward pass and parameter update
        loss.backward()
        optimizer.step()
        running_loss += loss.item()
        # Print the average loss every 2000 mini-batches
        if i % 2000 == 1999:
            print('[%d, %5d] loss: %.3f' %
                  (epoch + 1, i + 1, running_loss / 2000))
            running_loss = 0.0
print('Finished Training')
[1, 2000] loss: 2.166
[1, 4000] loss: 1.783
[1, 6000] loss: 1.656
[1, 8000] loss: 1.580
[1, 10000] loss: 1.509
[1, 12000] loss: 1.424
[2, 2000] loss: 1.374
[2, 4000] loss: 1.373
[2, 6000] loss: 1.319
[2, 8000] loss: 1.287
[2, 10000] loss: 1.275
[2, 12000] loss: 1.246
Finished Training
# Evaluate on the test set
correct = 0
total = 0
with torch.no_grad():
    for data in testloader:
        images, labels = data
        outputs = net(images)
        # The predicted class is the one with the largest output score
        _, predicted = torch.max(outputs, 1)
        total += labels.size(0)
        correct += (predicted == labels).sum().item()
print('Accuracy of the network on the 10000 test images: %d %%' % (
    100 * correct / total))
Accuracy of the network on the 10000 test images: 55 %
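The input size 16 * 5 * 5 of the first fully connected layer follows from how each layer changes the spatial size of a 32x32 image. The sketch below traces that arithmetic in plain Python; the helper names `conv_out` and `pool_out` are hypothetical and assume no padding and stride 1 for the convolutions, which matches the `nn.Conv2d(…, 5)` calls above.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Spatial output size of a convolution along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1

def pool_out(size, kernel, stride):
    """Spatial output size of a max-pooling layer along one dimension."""
    return (size - kernel) // stride + 1

size = 32                       # CIFAR-10 images are 32x32
size = conv_out(size, 5)        # conv1, 5x5 kernel: 32 -> 28
size = pool_out(size, 2, 2)     # 2x2 max pooling:   28 -> 14
size = conv_out(size, 5)        # conv2, 5x5 kernel: 14 -> 10
size = pool_out(size, 2, 2)     # 2x2 max pooling:   10 -> 5

flat_features = 16 * size * size  # 16 output channels of conv2
print(size, flat_features)
```

If the kernel sizes, padding or input resolution are changed, the same arithmetic gives the new value to use in `nn.Linear`.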