Predicting Horse Mortality from Colic Symptoms with Logistic Regression - Alibaba Cloud Developer Community


2024-06-11


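Predicting the mortality of horses with colic

(1) Collect data: the data files are provided.

(2) Prepare data: parse the text files with Python and fill in missing values.

(3) Analyze data: visualize and inspect the data.

(4) Train the algorithm: use an optimization algorithm to find the best coefficients.

(5) Test the algorithm: measure the error rate.

(6) Use the algorithm: make predictions.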

Preparing the data: handling missing values

from google.colab import drive
drive.mount("/content/drive")

Mounted at /content/drive
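
The post does not show the missing-value step itself, so the following is only a minimal sketch of one common approach. It assumes that missing entries are marked with "?" and replaces them with 0. That choice suits logistic regression because a feature value of 0 leaves the corresponding coefficient unchanged in the update weights + alpha * error * dataMatrix[randIndex]. The function name fill_missing, the marker, the fill value, and the sample rows are all assumptions made for illustration.

```python
import numpy as np

def fill_missing(rows, missing_token="?", fill_value=0.0):
  """Convert rows of string fields to a float array, replacing missing entries.

  missing_token and fill_value are assumptions, not taken from the post.
  """
  cleaned = []
  for fields in rows:
    cleaned.append([fill_value if f == missing_token else float(f) for f in fields])
  return np.array(cleaned)

# Hypothetical sample: two records with three features each
raw_rows = [["2", "?", "38.5"], ["1", "1", "?"]]
features = fill_missing(raw_rows)
print(features)
```

A record whose class label is missing cannot be repaired this way; dropping such a record is the usual alternative.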

Testing the algorithm: classifying with logistic regression

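from numpy import *  # provides exp, ones, shape, array, random and an array-aware sum

def sigmoid(inX):
  # Logistic function: maps any real input into the range 0 to 1
  return 1.0/(1+exp(-inX))

def stocGradAscent1(dataMatrix, classLabels, numIter=150):
  m, n = shape(dataMatrix)
  weights = ones(n)  # start from an all-ones coefficient vector
  for j in range(numIter):
    dataIndex = list(range(m))  # samples not yet used in this pass
    for i in range(m):
      # Step size shrinks as training proceeds but never falls below 0.01
      alpha = 4/(1+j+i)+0.01
      randIndex = int(random.uniform(0, len(dataIndex)))
      sampleIndex = dataIndex[randIndex]  # map the position back to a row of dataMatrix
      h = sigmoid(sum(dataMatrix[sampleIndex] * weights))
      error = classLabels[sampleIndex] - h
      weights = weights + alpha * error * dataMatrix[sampleIndex]
      del(dataIndex[randIndex])  # each sample is used once per pass
  return weights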

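This function updates the regression coefficients weights with stochastic gradient ascent. Its argument dataMatrix is an m-by-n matrix holding n feature values for each of m samples; classLabels is a vector of length m holding each sample's class label; numIter is the number of passes over the data and defaults to 150.

The function first reads the row count m and column count n of dataMatrix and initializes weights as an all-ones vector of length n.

It then updates the coefficients iteratively, with numIter setting the number of passes. In the inner loop, each step draws a random position randIndex within dataIndex, evaluates the sigmoid for the selected sample, and computes error as the label minus that value. The coefficients weights are then updated with the gradient ascent rule. The step size alpha is computed as 4/(1+j+i)+0.01, where j and i are the counters of the outer and inner loops, so it shrinks as training proceeds but never falls below 0.01.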
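
Written out for a single randomly chosen sample, the update described above takes the following form. The symbols are introduced here for illustration and do not appear in the original post: x_i is the feature vector of the sample, y_i its label, w the vector weights, and sigma the sigmoid.

```latex
h_i = \sigma(\mathbf{w}^\top \mathbf{x}_i) = \frac{1}{1 + e^{-\mathbf{w}^\top \mathbf{x}_i}}, \qquad
\mathbf{w} \leftarrow \mathbf{w} + \alpha \,(y_i - h_i)\, \mathbf{x}_i, \qquad
\alpha = \frac{4}{1 + j + i} + 0.01
```

The term (y_i - h_i) x_i is the gradient of the log-likelihood of sample i with respect to w, which is why the step is added rather than subtracted.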

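Finally, it returns the updated coefficients weights.

In short, the function uses stochastic gradient ascent to fit weights to the given samples and labels. Each update uses a single, randomly chosen sample, which keeps every step cheap compared with an update over the full data set. The resulting coefficients can be used to predict the class label of new samples.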
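
To see the procedure on something small, here is a self-contained sketch that re-implements the same update on a toy, linearly separable data set. The names stoc_grad_ascent and predict, the toy data, and the seeded random generator are assumptions made for this illustration; each sample is drawn without replacement within a pass.

```python
import numpy as np

def sigmoid(z):
  return 1.0 / (1.0 + np.exp(-z))

def stoc_grad_ascent(data, labels, num_iter=50, seed=0):
  """Stochastic gradient ascent for logistic regression (illustrative sketch)."""
  rng = np.random.default_rng(seed)  # seeded only to make the demo repeatable
  m, n = data.shape
  weights = np.ones(n)
  for j in range(num_iter):
    remaining = list(range(m))  # rows not yet visited in this pass
    for i in range(m):
      alpha = 4 / (1 + j + i) + 0.01  # decaying step size with a 0.01 floor
      pos = int(rng.integers(len(remaining)))
      row = remaining.pop(pos)  # draw a row without replacement
      error = labels[row] - sigmoid(np.dot(data[row], weights))
      weights = weights + alpha * error * data[row]
  return weights

def predict(x, weights):
  # Same decision rule as classifyVector: class 1 when the sigmoid exceeds 0.5
  return 1 if sigmoid(np.dot(x, weights)) > 0.5 else 0

# Toy data: the first column is a constant bias term, the second is the feature
toy_data = np.array([[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]])
toy_labels = np.array([0.0, 0.0, 1.0, 1.0])

toy_weights = stoc_grad_ascent(toy_data, toy_labels)
toy_predictions = [predict(x, toy_weights) for x in toy_data]
print(toy_weights, toy_predictions)
```

On this toy set the coefficient of the feature column only ever grows, so the four training points end up on the correct side of the decision boundary.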

def classifyVector(inX, weights):
  prob = sigmoid(sum(inX * weights))
  if prob > 0.5:
    return 1
  else:
    return 0

def colicTest():
  # These paths point at the author's Google Drive; adjust them to your own copy of the data
  trainPath = '/content/drive/MyDrive/Colab Notebooks/MachineLearning/《机器学习实战》/Logistic回归/从疝气病症预测病马的死亡率/horseColicTraining.txt'
  testPath = '/content/drive/MyDrive/Colab Notebooks/MachineLearning/《机器学习实战》/Logistic回归/从疝气病症预测病马的死亡率/horseColicTest.txt'
  trainingSet = []
  trainingLabels = []
  with open(trainPath) as frTrain:  # the with block closes the file when done
    for line in frTrain:
      currLine = line.strip().split('\t')
      # Columns 0-20 are the features, column 21 is the class label
      lineArr = []
      for i in range(21):
        lineArr.append(float(currLine[i]))
      trainingSet.append(lineArr)
      trainingLabels.append(float(currLine[21]))
  trainWeights = stocGradAscent1(array(trainingSet), trainingLabels, 1000)
  errorCount = 0
  numTestVec = 0.0
  with open(testPath) as frTest:
    for line in frTest:
      numTestVec += 1.0
      currLine = line.strip().split('\t')
      lineArr = []
      for i in range(21):
        lineArr.append(float(currLine[i]))
      # int(float(...)) also accepts labels written as "1.000000"
      if int(classifyVector(array(lineArr), trainWeights)) != int(float(currLine[21])):
        errorCount += 1
  errorRate = (float(errorCount)/numTestVec)
  print("the error rate of this test is: %f" % errorRate)
  return errorRate

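This code is a prediction model for horses with colic. It trains a logistic regression model and then measures its error rate on a test set.

First, the code opens two files, one for training and one for testing. It loops over the training file, converts the 21 feature values on each line to floats and appends them to the training set list, and converts the label on each line to a float and appends it to the training label list.

Next, it trains the model with stochastic gradient ascent (stocGradAscent1), which makes 1000 passes over the training set and labels, updating the weights to increase the likelihood of the observed labels.

It then initializes the error count and the test vector count. Looping over the test file, it converts the feature values on each line to floats, predicts the label with the trained weights (classifyVector returns 1 when the sigmoid output exceeds 0.5 and 0 otherwise), and compares the prediction with the actual label. Each wrong prediction increments the error count.

Finally, the code computes and prints the error rate, which is the fraction of test samples that were misclassified (that is, 1 minus the accuracy).

Overall, this function trains a colic prediction model and evaluates it on the test set.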
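
The parsing logic described above can be factored into a small helper. The sketch below is an illustration under stated assumptions rather than the author's code: load_dataset is a hypothetical name, it reads from any iterable of lines, and the two in-memory records (with 3 features instead of 21) stand in for the real files.

```python
import io
import numpy as np

def load_dataset(lines, num_features):
  """Parse tab-separated lines into a feature array and a label list."""
  features, labels = [], []
  for line in lines:
    fields = line.strip().split('\t')
    if len(fields) <= num_features:
      continue  # skip blank or truncated lines
    features.append([float(v) for v in fields[:num_features]])
    labels.append(float(fields[num_features]))
  return np.array(features), labels

# Hypothetical stand-in for a data file: 3 features followed by the label
sample_file = io.StringIO("2.0\t1.0\t38.5\t1.0\n1.0\t0.0\t39.2\t0.0\n")
sample_features, sample_labels = load_dataset(sample_file, num_features=3)
print(sample_features.shape, sample_labels)
```

With the real data, the same call would take an open file object and num_features=21.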

def multiTest():
  numTests = 10
  errorSum = 0
  for k in range(numTests):
    errorSum += colicTest()
  print("after %d iterations the average error rate is: %f" % (numTests, errorSum/float(numTests)))

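This code defines a function named multiTest, which runs colicTest() 10 times and computes the average error rate across the runs.

In each loop iteration, errorSum accumulates the value returned by colicTest(), which is the error rate of that run. Because stocGradAscent1 picks samples at random, each run can produce different weights and a different error rate, so averaging over several runs gives a steadier estimate. After the loop, the function prints "after 10 iterations the average error rate is: X", where X is the average error rate over all runs.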

multiTest()

<ipython-input-5-cc83312199bc>:4: RuntimeWarning: overflow encountered in exp
  return 1.0/(1+exp(-inX))


the error rate of this test is: 0.283582
the error rate of this test is: 0.313433
the error rate of this test is: 0.313433
the error rate of this test is: 0.358209
the error rate of this test is: 0.253731
the error rate of this test is: 0.358209
the error rate of this test is: 0.373134
the error rate of this test is: 0.358209
the error rate of this test is: 0.343284
the error rate of this test is: 0.402985
after 10 iterations the average error rate is: 0.335821
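
The RuntimeWarning in the output comes from exp(-inX) overflowing when inX is a large negative number. The expression still evaluates to 0.0 in that case, so the classification is unaffected, but the warning can be avoided. One possible approach, sketched here under the hypothetical name stable_sigmoid, is to exponentiate only non-positive arguments.

```python
import numpy as np

def stable_sigmoid(z):
  """Sigmoid for a scalar input that never exponentiates a positive number."""
  if z >= 0:
    return 1.0 / (1.0 + np.exp(-z))
  e = np.exp(z)  # z < 0, so e is at most 1 and cannot overflow
  return e / (1.0 + e)

print(stable_sigmoid(-1000.0), stable_sigmoid(0.0), stable_sigmoid(1000.0))
```

Both branches compute the same function; they differ only in which form stays within floating-point range.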
