We can also look at the distribution of the data by computing its mean and variance:
# Standardize using the training-set statistics
X_scaler = StandardScaler()
X_trainT = X_scaler.fit_transform(X_train)
X_valT = X_scaler.transform(X_val)

# Compute the mean and variance of the standardized training and validation sets
X_trainT_mean = X_trainT.mean(axis=0)
X_trainT_var = X_trainT.var(axis=0)
X_valT_mean = X_valT.mean(axis=0)
X_valT_var = X_valT.var(axis=0)
In [42]: X_trainT_mean
Out[42]: array([ 0.00000000e+00, -7.49863135e-16,  4.25585493e-16,  2.57201667e-16])
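Because the scaler was fit on the training set, the transformed training features have (numerically) zero mean and unit variance, while the validation set, transformed with the training-set statistics, is only approximately standardized. You can verify this with a quick check (this check is my addition):

print(X_trainT_var)              # ~1 for every feature
print(X_valT_mean, X_valT_var)   # only approximately 0 and 1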
Then we can run a simple training. Although there are three classes [0, 1, 2], we only do binary classification: class 0 versus everything else (labeled 1).
# ----- simple classification training ------------
C = 1        # regularization coefficient
alpha = 0.1  # learning rate
y = data.target  # labels 0, 1, 2
y[y == 2] = 1    # only separate class 0 from the rest, i.e. binary classification
# Append a column of ones to the feature matrix, i.e. add a bias term
X = np.hstack([data.data, np.ones_like(y).reshape(len(y), 1)])
# Random weight initialization
np.random.seed(42)
omega = np.random.rand(X.shape[1]).reshape(5, 1)

for epoch in range(10):
    y_hat = 1 / (1 + np.exp(-X.dot(omega)))  # sigmoid
    # Hand-derived gradient: X^T · C · (y - y_hat), plus omega from the regularization term
    Loss = X.T.dot(C*(y.reshape(-1, 1) - y_hat)) + omega
    omega += alpha*Loss
    print(omega)

plt.plot(1/(1+np.exp(-X.dot(omega))))  # visualize the predictions
plt.show()
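A quick note on the comment inside the loop: what is stored in Loss is really the hand-derived gradient rather than a loss value. Written out, the update performed at each epoch is

\[
\omega \leftarrow \omega + \alpha\left( X^{\top} C\,(y - \hat{y}) + \omega \right),
\qquad \hat{y} = \frac{1}{1 + e^{-X\omega}}
\]

where the extra ω term corresponds to the regularization mentioned in the comment, exactly as the code implements it.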
Printing the weights at each epoch gives:
[[-24.57023345] [-16.06360265] [ -6.48818978] [ -0.56811685] [ -4.81871533]]
[[35.5927432 ] [11.05003709] [41.92299124] [16.13507147] [ 4.69941314]]
[[14.12201752] [-4.9849592 ] [38.80529036] [16.51857861] [ 0.16935445]]
[[ -9.49578072] [-22.62345512] [ 35.3758194 ] [ 16.94043648] [ -4.8137101 ]]
[[-10.44535864] [-24.88580056] [ 38.91340144] [ 18.63448016] [ -5.29508108]]
[[-11.48989446] [-27.37438059] [ 42.8047416 ] [ 20.49792818] [ -5.82458918]]
[[-12.6388839 ] [-30.11181865] [ 47.08521577] [ 22.547721  ] [ -6.4070481 ]]
[[-13.90277229] [-33.12300051] [ 51.79373734] [ 24.8024931 ] [ -7.04775291]]
[[-15.29304952] [-36.43530057] [ 56.97311108] [ 27.28274241] [ -7.7525282 ]]
[[-16.82235447] [-40.07883062] [ 62.67042219] [ 30.01101665] [ -8.52778102]]
From the plot we can see that the first 50 samples are predicted as class 0 and the remaining samples as non-0 (class 1).
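If you want a number rather than a plot, you can also threshold the sigmoid output at 0.5 and compare it with the binarized labels; this check is my addition, not part of the original snippet:

y_pred = (1 / (1 + np.exp(-X.dot(omega))) >= 0.5).astype(int).ravel()  # hard 0/1 predictions
print((y_pred == y).mean())  # fraction of correctly classified samples (y already has class 2 mapped to 1)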
Full code:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

data = load_iris()
# data.data           # dataset features, shape (150, 4)
# data.target         # labels 0, 1, 2, length 150, aligned with the data above
# data.target_names   # class names ['setosa' 'versicolor' 'virginica']
# data.feature_names  # feature names ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
'''
data.data:
[[5.1 3.5 1.4 0.2]
 [4.9 3.  1.4 0.2]
 [4.7 3.2 1.3 0.2]
 [4.6 3.1 1.5 0.2]
 [5.  3.6 1.4 0.2]
 [5.4 3.9 1.7 0.4]
 [4.6 3.4 1.4 0.3]
 [5.  3.4 1.5 0.2]
 [4.4 2.9 1.4 0.2]
 [4.9 3.1 1.5 0.1]......
'''
# Structure the data with DataFrame()
df = pd.DataFrame(data.data)
'''
df = pd.DataFrame(data.data)
The columns are the 4 features; the rows are the samples.
       0    1    2    3
0    5.1  3.5  1.4  0.2
1    4.9  3.0  1.4  0.2
2    4.7  3.2  1.3  0.2
3    4.6  3.1  1.5  0.2
4    5.0  3.6  1.4  0.2
..   ...  ...  ...  ...
145  6.7  3.0  5.2  2.3
146  6.3  2.5  5.0  1.9
147  6.5  3.0  5.2  2.0
148  6.2  3.4  5.4  2.3
149  5.9  3.0  5.1  1.8
'''
# Attach the dataset's names to the columns and rows of df
df.columns = data.feature_names  # attach the feature names to the columns
df['Species'] = [data['target_names'][x] for x in data.target]
df.head()
'''
     sepal length (cm)  sepal width (cm)  ...  petal width (cm)    Species
0                  5.1               3.5  ...               0.2     setosa
1                  4.9               3.0  ...               0.2     setosa
2                  4.7               3.2  ...               0.2     setosa
3                  4.6               3.1  ...               0.2     setosa
4                  5.0               3.6  ...               0.2     setosa
..                 ...               ...  ...               ...        ...
145                6.7               3.0  ...               2.3  virginica
146                6.3               2.5  ...               1.9  virginica
147                6.5               3.0  ...               2.0  virginica
148                6.2               3.4  ...               2.3  virginica
149                5.9               3.0  ...               1.8  virginica
'''
# Visualize the class distribution to check whether it is balanced
df_count = df['Species'].value_counts().reset_index()
#sns.barplot(data=df_count, x='index', y='Species')
#plt.show()

df_train, df_val = train_test_split(df, train_size=0.8, random_state=0)  # split into training and validation sets
'''
df_train:
     sepal length (cm)  sepal width (cm)  ...  petal width (cm)     Species
137                6.4               3.1  ...               1.8   virginica
84                 5.4               3.0  ...               1.5  versicolor
27                 5.2               3.5  ...               0.2      setosa
127                6.1               3.0  ...               1.8   virginica
132                6.4               2.8  ...               2.2   virginica
..                 ...               ...  ...               ...         ...
9                  4.9               3.1  ...               0.1      setosa
103                6.3               2.9  ...               1.8   virginica
67                 5.8               2.7  ...               1.0  versicolor
117                7.7               3.8  ...               2.2   virginica
47                 4.6               3.2  ...               0.2      setosa
'''
# Extract the features (without the class column)
X_train = df_train.drop(['Species'], axis=1)
X_val = df_val.drop(['Species'], axis=1)

X_train0 = X_train.copy()
X_train0 = X_train0.reset_index().drop(['index'], axis=1)
# Visualize the feature distribution
#plt.plot(X_train0["sepal length (cm)"], ".", label="sepal length (cm)")
#plt.legend()
#plt.show()

# Extract the classes
Y_train = df_train['Species']
Y_val = df_val['Species']

# Standardize using the training-set statistics
X_scaler = StandardScaler()
X_trainT = X_scaler.fit_transform(X_train)
X_valT = X_scaler.transform(X_val)
# Compute the mean and variance
X_trainT_mean = X_trainT.mean(axis=0)
X_trainT_var = X_trainT.var(axis=0)
X_valT_mean = X_valT.mean(axis=0)
X_valT_var = X_valT.var(axis=0)

# ----- simple classification training ------------
C = 1        # regularization coefficient
alpha = 0.1  # learning rate
y = data.target  # labels 0, 1, 2
y[y == 2] = 1    # only separate class 0 from the rest, i.e. binary classification
# Append a column of ones to the feature matrix, i.e. add a bias term
X = np.hstack([data.data, np.ones_like(y).reshape(len(y), 1)])
# Random weight initialization
np.random.seed(42)
omega = np.random.rand(X.shape[1]).reshape(5, 1)

for epoch in range(10):
    y_hat = 1 / (1 + np.exp(-X.dot(omega)))  # sigmoid
    # Hand-derived gradient: X^T · C · (y - y_hat), plus omega from the regularization term
    Loss = X.T.dot(C*(y.reshape(-1, 1) - y_hat)) + omega
    omega += alpha*Loss
    print(omega)

plt.plot(1/(1+np.exp(-X.dot(omega))))  # visualize the predictions
plt.show()
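Note that the training loop above works on the raw data.data rather than the standardized training split. If you would rather train on X_trainT, a minimal variant could look like this (my sketch: y_train_bin, X_bias and w are names I introduce here, and the labels are binarized as setosa versus the rest):

# Train the same update rule on the standardized training split instead of the raw full dataset
y_train_bin = (Y_train != 'setosa').astype(int).to_numpy().reshape(-1, 1)  # 0 = setosa, 1 = everything else
X_bias = np.hstack([X_trainT, np.ones((len(X_trainT), 1))])                # append the bias column
np.random.seed(42)
w = np.random.rand(X_bias.shape[1]).reshape(-1, 1)
for epoch in range(10):
    p = 1 / (1 + np.exp(-X_bias.dot(w)))                     # sigmoid predictions
    w += alpha * (X_bias.T.dot(C * (y_train_bin - p)) + w)   # same hand-derived update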
Now we can try a similar dataset of our own: store the class names and feature values as plain text in a txt file and see how to read it. [The dataset below mimics the format of a dataset a classmate of mine working in optics used; the feature values are made up, and it is only meant to show how to read this kind of dataset.]
For example, the txt file of my dataset looks like this, where TC1 and TC2 are class names and the numbers after them are feature values.
Dataset format:
TC1 1.25644448
TC1 1.45556654
TC2 3.45646845
TC2 3.45678645
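If you want to reproduce this locally, a few lines like the following will create such a file (./test.txt is simply the path used by the reading code below):

# Write the four example lines above into ./test.txt
sample = "TC1 1.25644448\nTC1 1.45556654\nTC2 3.45646845\nTC2 3.45678645\n"
with open('./test.txt', 'w', encoding='utf-8') as f:
    f.write(sample)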
files = open(r'./test.txt', encoding='utf-8')
lines = files.readlines()
We open the txt file with open() and read its contents into a list with readlines(), which gives:
['TC1 1.25644448\n', 'TC1 1.45556654\n', 'TC2 3.45646845\n', 'TC2 3.45678645\n']
Each sample sits on its own line, so we can loop over the list above and split each line with split():
dataset = []
for i in lines:
    dataset.append(i.split())  # split each line into [class, feature]
dataset = np.array(dataset)
The resulting dataset array is:
[['TC1' '1.25644448']
['TC1' '1.45556654']
['TC2' '3.45646845']
['TC2' '3.45678645']]
Then we build a dictionary, storing the classes under 'classes' and the feature values under 'feature':
datasetes = {}
datasetes['classes'] = dataset[:, 0]
datasetes['feature'] = dataset[:, 1]
{'classes': array(['TC1', 'TC1', 'TC2', 'TC2'], dtype='<U10'), 'feature': array(['1.25644448', '1.45556654', '3.45646845', '3.45678645'], dtype='<U10')}
Next, following the same approach as for the iris data, we structure the data with pandas:
df = pd.DataFrame(datasetes)
The structured dataset looks like this:
  classes     feature
0     TC1  1.25644448
1     TC1  1.45556654
2     TC2  3.45646845
3     TC2  3.45678645
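One detail worth noting: because the array was read from text, the feature column is still stored as strings; converting it to float (my addition, not in the original flow) makes it directly usable for training later:

df['feature'] = df['feature'].astype(float)  # '1.25644448' -> 1.25644448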
Likewise, we use train_test_split() from sklearn to split it into training and validation sets:
df_train, df_val = train_test_split(df, train_size=0.8)  # split into training and validation sets

In [5]: df_train
Out[5]:
  classes     feature
3     TC2  3.45678645
1     TC1  1.45556654
0     TC1  1.25644448
Similarly, extract the feature part:
# Keep only the features
X_train = df_train.drop(['classes'], axis=1)
X_val = df_val.drop(['classes'], axis=1)
      feature
3  3.45678645
1  1.45556654
0  1.25644448
And extract the label part:
# Take out the labels
Y_train = df_train['classes']
Y_val = df_val['classes']
3    TC2
1    TC1
0    TC1
Name: classes, dtype: object
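If you later want numeric targets like the 0/1 labels used in the iris example, a small mapping (my sketch; class_to_id is a name I introduce here) converts the class names to integers:

class_to_id = {name: idx for idx, name in enumerate(sorted(df['classes'].unique()))}  # {'TC1': 0, 'TC2': 1}
y_train = Y_train.map(class_to_id).to_numpy()
y_val = Y_val.map(class_to_id).to_numpy()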
With that, a dataset that stores classes and feature values directly in a txt file has been read in (I have seen this kind of data in some competitions and in signal-processing datasets). From here you can build whatever network you need and train on it.
[The next part counts the per-class distribution of a VOC dataset. It is unrelated to the above; skip it if you don't need it.]
One more aside: above we were able to count how many samples each class has for this kind of dataset, but how do we count the per-class distribution for a txt file generated from VOC? VOC carries more information: the generated txt contains the image path, the bounding-box information and the class information. Following the code above, I wrote a small script that counts the per-class distribution of a VOC-style txt, so you can see how many samples each class actually has.
import pandas as pd

with open(r'./2007_train.txt', encoding='utf-8') as f:
    lines = f.readlines()

line = []
for i in lines:
    line.append(i.split())

image_path = []
box = []
for i in range(len(line)):
    # line[i][0]  # image path
    # line[i][1]  # first bounding-box entry
    image_path.append(line[i][0])
    box.append(line[i][1])

# Take the class id as the last character of the first box on each line
target = [x[-1] for x in box]
img = [x for x in image_path]

my_df = pd.DataFrame(target)
my_df_count = my_df.value_counts()
print(my_df_count)  # number of samples per class
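Note that the loop above only looks at the first bounding box on each line and reads the class id as the last character of that box string, which works when class ids are single digits. If you want to count every box and support multi-digit class ids, a variant like the following should be closer (my sketch, assuming the common "image_path x1,y1,x2,y2,cls x1,y1,x2,y2,cls ..." line format):

from collections import Counter

class_counts = Counter()
for parts in line:                       # 'line' is the list of split lines built above
    for box_str in parts[1:]:            # every box on the line, not just the first
        cls_id = box_str.split(',')[-1]  # class id is the last comma-separated field
        class_counts[cls_id] += 1
print(class_counts)                      # number of boxes per class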