CS231n Assignment 1 (kNN): Complete Code with Comments

The following is the code for the CS231n assignment 1 kNN exercise.
Assignment page: http://cs231n.github.io/assignments2017/assignment1/

Q1: k-Nearest Neighbor classifier (20 points)
The IPython Notebook knn.ipynb will walk you through implementing the kNN classifier.

Below is the notebook code together with its outputs (the kNN algorithm implementation itself is at the bottom of this post).

import sys 
print(sys.version)
# k-Nearest Neighbor (kNN) exercise

*Complete and hand in this completed worksheet (including its outputs and any supporting code outside of the worksheet) with your assignment submission. For more details see the [assignments page](http://vision.stanford.edu/teaching/cs231n/assignments.html) on the course website.*

The kNN classifier consists of two stages:

- During training, the classifier takes the training data and simply remembers it
- During testing, kNN classifies every test image by comparing to all training images and transferring the labels of the k most similar training examples (a minimal sketch of these two stages follows this list)
- The value of k is cross-validated
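As a preview, here is a minimal NumPy sketch of those two stages for a 1-nearest-neighbor classifier; the toy arrays are made up purely for illustration:

import numpy as np

# "Training": just remember the data.
X_train = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])
y_train = np.array([0, 0, 1])

# "Testing": compare a test point to every training point (L2 distance)
# and copy the label of the nearest one.
x_test = np.array([4.0, 4.5])
dists = np.sqrt(np.sum((X_train - x_test) ** 2, axis=1))
print(y_train[np.argmin(dists)])  # -> 1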


In this exercise you will implement these steps and understand the basic Image Classification pipeline, cross-validation, and gain proficiency in writing efficient, vectorized code.
# Initialization

import random
import numpy as np
from cs231n.data_utils import load_CIFAR10
import matplotlib.pyplot as plt

from __future__ import print_function

# The %matplotlib inline magic below renders figures inside this page instead of opening a new window
%matplotlib inline
plt.rcParams['figure.figsize'] = (10.0, 8.0) # set default size of plots
plt.rcParams['image.interpolation'] = 'nearest'
plt.rcParams['image.cmap'] = 'gray'

# Some more magic so that the notebook will reload external python modules;
# see http://stackoverflow.com/questions/1907993/autoreload-of-modules-in-ipython
%load_ext autoreload
%autoreload 2
# Load the raw CIFAR-10 data.
cifar10_dir = 'cs231n/datasets/cifar-10-batches-py'
X_train, y_train, X_test, y_test = load_CIFAR10(cifar10_dir)

# As a sanity check, we print out the size of the training and test data.
print('Training data shape: ', X_train.shape)
print('Training labels shape: ', y_train.shape)
print('Test data shape: ', X_test.shape)
print('Test labels shape: ', y_test.shape)

Training data shape: (50000, 32, 32, 3)
Training labels shape: (50000,)
Test data shape: (10000, 32, 32, 3)
Test labels shape: (10000,)

# Visualize some examples from the dataset.
# We show a few examples of training images from each class.
classes = ['plane', 'car', 'bird', 'cat', 'deer', 'dog', 'frog', 'horse', 'ship', 'truck']
num_classes = len(classes)
samples_per_class = 7
for y, cls in enumerate(classes):
    idxs = np.flatnonzero(y_train == y)
    idxs = np.random.choice(idxs, samples_per_class, replace=False)
    for i, idx in enumerate(idxs):
        plt_idx = i * num_classes + y + 1
        plt.subplot(samples_per_class, num_classes, plt_idx)
        plt.imshow(X_train[idx].astype('uint8'))
        plt.axis('off')
        if i == 0:
            plt.title(cls)
plt.show()

[Figure: a few example training images from each of the 10 CIFAR-10 classes]

# Subsample the data (take 5000 of the 50000 training examples) so the code runs faster
num_training = 5000
mask = list(range(num_training))
X_train = X_train[mask]
y_train = y_train[mask]

num_test = 500
mask = list(range(num_test))
X_test = X_test[mask]
y_test = y_test[mask]
# Flatten each image's pixels into a single row vector
X_train = np.reshape(X_train, (X_train.shape[0], -1))
X_test = np.reshape(X_test, (X_test.shape[0], -1))
print(X_train.shape, X_test.shape)

(5000, 3072) (500, 3072)
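As a quick aside, np.reshape with -1 infers the remaining dimension, so each 32x32x3 image becomes one row of 3072 values; a tiny sketch with made-up data:

import numpy as np

imgs = np.zeros((2, 32, 32, 3))               # two dummy images
rows = np.reshape(imgs, (imgs.shape[0], -1))  # -1 infers 32*32*3 = 3072
print(rows.shape)                             # (2, 3072)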

from cs231n.classifiers import KNearestNeighbor
# (the implementation of this class is at the bottom of this post)

# Create a kNN classifier instance. 
# Remember that training a kNN classifier is a noop: 
# the Classifier simply remembers the data and does no further processing 
# (the distance computations further below do take a while to run)
classifier = KNearestNeighbor()
classifier.train(X_train, y_train)
# Open cs231n/classifiers/k_nearest_neighbor.py and implement
# compute_distances_two_loops.

# Test your implementation:
dists = classifier.compute_distances_two_loops(X_test)
print(dists.shape)

(500, 5000)

# We can visualize the distance matrix: each row is a single test example and
# its distances to training examples
plt.imshow(dists, interpolation='none')
plt.show()

[Figure: the 500 x 5000 distance matrix visualized as an image]

# Now implement the function predict_labels and run the code below:
# We use k = 1 (which is Nearest Neighbor).
y_test_pred = classifier.predict_labels(dists, k=1)

# Compute and print the fraction of correctly predicted examples
num_correct = np.sum(y_test_pred == y_test)
accuracy = float(num_correct) / num_test
print('Got %d / %d correct => accuracy: %f' % (num_correct, num_test, accuracy))

Got 137 / 500 correct => accuracy: 0.274000

You should expect to see approximately 27% accuracy. Now let's try out a larger k, say k = 5:
y_test_pred = classifier.predict_labels(dists, k=5)
num_correct = np.sum(y_test_pred == y_test)
accuracy = float(num_correct) / num_test
print('Got %d / %d correct => accuracy: %f' % (num_correct, num_test, accuracy))

Got 144 / 500 correct => accuracy: 0.288000

You should expect to see a slightly better performance than with k = 1.
# Now let's speed up distance matrix computation by using partial vectorization
# with one loop. Implement the function compute_distances_one_loop and run the
# code below:
dists_one = classifier.compute_distances_one_loop(X_test)

# To ensure that our vectorized implementation is correct, we make sure that it
# agrees with the naive implementation. There are many ways to decide whether
# two matrices are similar; one of the simplest is the Frobenius norm. In case
# you haven't seen it before, the Frobenius norm of two matrices is the square
# root of the squared sum of differences of all elements; in other words, reshape
# the matrices into vectors and compute the Euclidean distance between them.
difference = np.linalg.norm(dists - dists_one, ord='fro')
print('Difference was: %f' % (difference, ))
if difference < 0.001:
    print('Good! The distance matrices are the same')
else:
    print('Uh-oh! The distance matrices are different')

Difference was: 0.000000
Good! The distance matrices are the same
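As an aside, the Frobenius norm used above really is just the Euclidean distance between the flattened matrices; a small sketch with arbitrary array sizes:

import numpy as np

A = np.random.randn(5, 4)
B = np.random.randn(5, 4)

fro = np.linalg.norm(A - B, ord='fro')  # Frobenius norm of the difference
flat = np.sqrt(np.sum((A - B) ** 2))    # sqrt of the summed squared differences
print(np.allclose(fro, flat))           # True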

# Now implement the fully vectorized version inside compute_distances_no_loops
# and run the code
dists_two = classifier.compute_distances_no_loops(X_test)

# check that the distance matrix agrees with the one we computed before:
difference = np.linalg.norm(dists - dists_two, ord='fro')
print('Difference was: %f' % (difference, ))
if difference < 0.001:
    print('Good! The distance matrices are the same')
else:
    print('Uh-oh! The distance matrices are different')

Difference was: 0.000000
Good! The distance matrices are the same

# Let's compare how fast the implementations are
def time_function(f, *args):
    """
    Call a function f with args and return the time (in seconds) that it took to execute.
    """
    import time
    tic = time.time()
    f(*args)
    toc = time.time()
    return toc - tic

two_loop_time = time_function(classifier.compute_distances_two_loops, X_test)
print('Two loop version took %f seconds' % two_loop_time)

one_loop_time = time_function(classifier.compute_distances_one_loop, X_test)
print('One loop version took %f seconds' % one_loop_time)

no_loop_time = time_function(classifier.compute_distances_no_loops, X_test)
print('No loop version took %f seconds' % no_loop_time)

# you should see significantly faster performance with the fully vectorized implementation

Two loop version took 56.785069 seconds
One loop version took 136.449761 seconds
No loop version took 0.591535 seconds
(Matrix multiplication really is fast!)
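The no-loop speedup comes from expanding the squared L2 distance, ||x - y||^2 = ||x||^2 + ||y||^2 - 2 x·y, so a single matrix multiplication yields all test/train inner products at once. A small sketch verifying the expansion against the naive loop (array sizes are arbitrary):

import numpy as np

A = np.random.randn(4, 3)  # stand-in "test" points
B = np.random.randn(6, 3)  # stand-in "train" points

naive = np.array([[np.sqrt(np.sum((a - b) ** 2)) for b in B] for a in A])
sq = (np.sum(A ** 2, axis=1).reshape(-1, 1)  # ||x||^2 as a column
      + np.sum(B ** 2, axis=1)               # ||y||^2 as a row (broadcast)
      - 2 * A.dot(B.T))                      # inner products, all pairs
vectorized = np.sqrt(np.maximum(sq, 0))      # clamp tiny negatives from round-off
print(np.allclose(naive, vectorized))        # True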

### Cross-validation

We have implemented the k-Nearest Neighbor classifier but we set the value k = 5 arbitrarily. We will now determine the best value of this hyperparameter with cross-validation.
num_folds = 5
k_choices = [1, 3, 5, 8, 10, 12, 15, 20, 50, 100]

X_train_folds = []
y_train_folds = []
################################################################################
# TODO:
# Split up the training data into folds. After splitting, X_train_folds and    #
# y_train_folds should each be lists of length num_folds, where                #
# y_train_folds[i] is the label vector for the points in X_train_folds[i].     #
# Hint: Look up the numpy array_split function.                                #
################################################################################
X_train_folds = np.array_split(X_train, num_folds, axis=0) # list
y_train_folds = np.array_split(y_train, num_folds, axis=0) # list
################################################################################
#                                 END OF YOUR CODE                             #
################################################################################
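If np.array_split is unfamiliar, a quick illustration (the numbers are arbitrary):

import numpy as np

# Unlike np.split, array_split tolerates sizes that don't divide evenly
# and returns the chunks as a list of arrays.
print(np.array_split(np.arange(10), 3))
# [array([0, 1, 2, 3]), array([4, 5, 6]), array([7, 8, 9])]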

# A dictionary holding the accuracies for different values of k that we find
# when running cross-validation. After running cross-validation,
# k_to_accuracies[k] should be a list of length num_folds giving the different
# accuracy values that we found when using that value of k.
k_to_accuracies = {}


################################################################################
# TODO:                                                                        #
# Perform k-fold cross validation to find the best value of k. For each        #
# possible value of k, run the k-nearest-neighbor algorithm num_folds times,   #
# where in each case you use all but one of the folds as training data and the #
# last fold as a validation set. Store the accuracies for all folds and all    #
# values of k in the k_to_accuracies dictionary.                               #
################################################################################
for i in range(num_folds):
    # Training / validation split (80% / 20%): fold 0 is always used for
    # validation, and the swap below rotates a different fold into
    # position 0 on each iteration.
    X_train_batch = np.concatenate(X_train_folds[1:num_folds])
    y_train_batch = np.concatenate(y_train_folds[1:num_folds])
    X_valid_batch = X_train_folds[0]
    y_valid_batch = y_train_folds[0]

    # Swap folds (for the next iteration)
    if i < num_folds - 1:
        tmp = X_train_folds[0]
        X_train_folds[0] = X_train_folds[i+1]
        X_train_folds[i+1] = tmp
        tmp = y_train_folds[0]
        y_train_folds[0] = y_train_folds[i+1]
        y_train_folds[i+1] = tmp

    # Train the model
    model = KNearestNeighbor()
    model.train(X_train_batch, y_train_batch)
    dists = model.compute_distances_no_loops(X_valid_batch)

    # Compute the accuracy for each k
    for k in k_choices:
        y_valid_pred = model.predict_labels(dists, k=k)

        # Compute the validation accuracy
        num_correct = np.sum(y_valid_pred == y_valid_batch)
        accuracy = float(num_correct) / y_valid_batch.shape[0]

        # Store the accuracy in the dictionary
        if i == 0:
            k_to_accuracies[k] = []
        k_to_accuracies[k].append(accuracy)
################################################################################
#                                 END OF YOUR CODE                             #
################################################################################
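The fold rotation above works, but an equivalent and arguably clearer pattern picks out fold i directly on each iteration instead of mutating the fold lists; a sketch using the same variable names:

for i in range(num_folds):
    # Use fold i for validation, concatenate the remaining folds for training.
    X_train_batch = np.concatenate(X_train_folds[:i] + X_train_folds[i+1:])
    y_train_batch = np.concatenate(y_train_folds[:i] + y_train_folds[i+1:])
    X_valid_batch = X_train_folds[i]
    y_valid_batch = y_train_folds[i]
    # ... train and evaluate exactly as above ...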

# Print out the computed accuracies
for k in sorted(k_to_accuracies):
    for accuracy in k_to_accuracies[k]:
        print('k = %d, accuracy = %f' % (k, accuracy))

...
k = 8, accuracy = 0.278000
k = 8, accuracy = 0.289000
k = 8, accuracy = 0.285000
k = 10, accuracy = 0.273000
k = 10, accuracy = 0.296000
k = 10, accuracy = 0.277000
k = 10, accuracy = 0.293000
k = 10, accuracy = 0.285000
k = 12, accuracy = 0.269000
k = 12, accuracy = 0.304000
k = 12, accuracy = 0.286000
k = 12, accuracy = 0.283000
k = 12, accuracy = 0.278000
k = 15, accuracy = 0.259000
k = 15, accuracy = 0.308000
k = 15, accuracy = 0.287000
k = 15, accuracy = 0.287000
k = 15, accuracy = 0.276000
k = 20, accuracy = 0.265000
k = 20, accuracy = 0.287000
k = 20, accuracy = 0.288000
k = 20, accuracy = 0.286000
k = 20, accuracy = 0.283000
k = 50, accuracy = 0.273000
k = 50, accuracy = 0.287000
k = 50, accuracy = 0.279000
k = 50, accuracy = 0.269000
k = 50, accuracy = 0.272000
k = 100, accuracy = 0.258000
k = 100, accuracy = 0.273000
k = 100, accuracy = 0.264000
k = 100, accuracy = 0.260000
k = 100, accuracy = 0.266000

# plot the raw observations
for k in k_choices:
    accuracies = k_to_accuracies[k]
    plt.scatter([k] * len(accuracies), accuracies)

# plot the trend line with error bars that correspond to standard deviation
accuracies_mean = np.array([np.mean(v) for k,v in sorted(k_to_accuracies.items())])
accuracies_std = np.array([np.std(v) for k,v in sorted(k_to_accuracies.items())])
plt.errorbar(k_choices, accuracies_mean, yerr=accuracies_std)
plt.title('Cross-validation on k')
plt.xlabel('k')
plt.ylabel('Cross-validation accuracy')
plt.show()

[Figure: scatter of cross-validation accuracies per k, with mean and standard-deviation error bars]

# Based on the cross-validation results above, choose the best value for k,   
# retrain the classifier using all the training data, and test it on the test
# data. You should be able to get above 28% accuracy on the test data.
best_k = 10

classifier = KNearestNeighbor()
classifier.train(X_train, y_train)
y_test_pred = classifier.predict(X_test, k=best_k)

# Compute and display the accuracy
num_correct = np.sum(y_test_pred == y_test)
accuracy = float(num_correct) / num_test
print('Got %d / %d correct => accuracy: %f' % (num_correct, num_test, accuracy))

Got 144 / 500 correct => accuracy: 0.288000

Below is the kNN classifier implementation (the no-loops distance trick is explained at http://blog.csdn.net/iamoldpan/article/details/78359195).

import numpy as np

# Python 3 removed xrange (range is now lazy, like Python 2's xrange),
# so alias it to keep the Python 2 style loops below working
xrange = range

class KNearestNeighbor(object):
    """ a kNN classifier with L2 distance """

    def __init__(self):
        pass

    def train(self, X, y):
        """
        Train the classifier. For k-nearest neighbors this is just
        memorizing the training data.

        Inputs:
        - X: A numpy array of shape (num_train, D) containing the training data
        consisting of num_train samples each of dimension D.
        - y: A numpy array of shape (num_train,) containing the training labels,
          where y[i] is the label for X[i].
        """

        self.X_train = X
        self.y_train = y

    def predict(self, X, k=1, num_loops=0):
        """
        Predict labels for test data using this classifier.

        Inputs:
        - X: A numpy array of shape (num_test, D) containing test data consisting
             of num_test samples each of dimension D.
        - k: The number of nearest neighbors that vote for the predicted labels.
        - num_loops: Determines which implementation to use to compute distances
          between training points and testing points.

        Returns:
        - y: A numpy array of shape (num_test,) containing predicted labels for the
          test data, where y[i] is the predicted label for the test point X[i].
        """
        if num_loops == 0:
            dists = self.compute_distances_no_loops(X)
        elif num_loops == 1:
            dists = self.compute_distances_one_loop(X)
        elif num_loops == 2:
            dists = self.compute_distances_two_loops(X)
        else:
            raise ValueError('Invalid value %d for num_loops' % num_loops)

        return self.predict_labels(dists, k=k)

    def compute_distances_two_loops(self, X):
        """
        Compute the distance between each test point in X and each training point
        in self.X_train using a nested loop over both the training data and the
        test data.

        Inputs:
        - X: A numpy array of shape (num_test, D) containing test data.

        Returns:
        - dists: A numpy array of shape (num_test, num_train) where dists[i, j]
          is the Euclidean distance between the ith test point and the jth training
          point.
        """
        num_test = X.shape[0]              # number of test examples (500 here)
        num_train = self.X_train.shape[0]  # number of training examples (5000 here)
        dists = np.zeros((num_test, num_train))  # 500 x 5000 zero matrix: one row per test example, one column per training example
        for i in xrange(num_test):
            for j in xrange(num_train):
                #####################################################################
                # TODO:                                                             #
                # Compute the l2 distance between the ith test point and the jth    #
                # training point, and store the result in dists[i, j]. You should   #
                # not use a loop over dimension.                                    #
                #####################################################################
                # np.dot of two 1-D vectors is their inner product, so this is
                # sqrt(sum((X[i] - X_train[j])**2)): the L2 distance as a scalar
                dists[i, j] = np.sqrt(np.dot(X[i] - self.X_train[j], X[i] - self.X_train[j]))
            #####################################################################
            #                       END OF YOUR CODE                            #
            #####################################################################
        return dists

    def compute_distances_one_loop(self, X):
        """
    Compute the distance between each test point in X and each training point
    in self.X_train using a single loop over the test data.

    Input / Output: Same as compute_distances_two_loops
    """
        num_test = X.shape[0]
        num_train = self.X_train.shape[0]
        dists = np.zeros((num_test, num_train))
        #######################################################################
        # TODO:                                                               #
        # Compute the l2 distance between the ith test point and all training #
        # points, and store the result in dists[i, :].                        #
        #######################################################################
        for i in xrange(num_test):
            # Broadcast X[i, :] against every training row, square, sum over
            # the feature axis, and take the square root: one full row of dists
            dists[i, :] = np.sqrt(np.sum((self.X_train - X[i, :]) ** 2, axis=1))
        #######################################################################
        #                         END OF YOUR CODE                            #
        #######################################################################
        return dists

    def compute_distances_no_loops(self, X):
        """
        Compute the distance between each test point in X and each training point
        in self.X_train using no explicit loops.

        Input / Output: Same as compute_distances_two_loops
        """
        num_test = X.shape[0]
        num_train = self.X_train.shape[0]
        dists = np.zeros((num_test, num_train))

        #########################################################################
        # TODO:                                                                 #
        # Compute the l2 distance between all test points and all training      #
        # points without using any explicit loops, and store the result in      #
        # dists.                                                                #
        #                                                                       #
        # You should implement this function using only basic array operations; #
        # in particular you should not use functions from scipy.                #
        #                                                                       #
        # HINT: Try to formulate the l2 distance using matrix multiplication    #
        #       and two broadcast sums.                                         #
        #########################################################################
        test_sum = np.sum(np.square(X), axis=1)              # shape (num_test,)
        train_sum = np.sum(np.square(self.X_train), axis=1)  # shape (num_train,)
        inner_product = np.dot(X, self.X_train.T)            # shape (num_test, num_train)
        # ||x - y||^2 = ||x||^2 + ||y||^2 - 2 x.y, broadcast over all pairs;
        # see http://blog.csdn.net/iamoldpan/article/details/78359195 for a derivation
        dists = np.sqrt(-2 * inner_product + test_sum.reshape(-1, 1) + train_sum)
        # Equivalent one-liner:
        # dists = np.sqrt((-2 * np.dot(X, self.X_train.T)) + np.sum(X ** 2, axis=1, keepdims=True)
        #                 + np.sum(self.X_train ** 2, axis=1))
        #########################################################################
        #                         END OF YOUR CODE                              #
        #########################################################################
        return dists

    def predict_labels(self, dists, k=1):

        """
        Given a matrix of distances between test points and training points,
        predict a label for each test point.

        Inputs:
        - dists: A numpy array of shape (num_test, num_train) where dists[i, j]
          gives the distance betwen the ith test point and the jth training point.

        Returns:
        - y: A numpy array of shape (num_test,) containing predicted labels for the
          test data, where y[i] is the predicted label for the test point X[i].
        """

        num_test = dists.shape[0]  # number of test points (500 here)
        y_pred = np.zeros(num_test)
        for i in xrange(num_test):
            # A list of length k storing the labels of the k nearest neighbors to
            # the ith test point.
            closest_y = []
            #########################################################################
            # TODO:                                                                 #
            # Use the distance matrix to find the k nearest neighbors of the ith    #
            # testing point, and use self.y_train to find the labels of these       #
            # neighbors. Store these labels in closest_y.                           #
            # Hint: Look up the function numpy.argsort.                             #
            #########################################################################
            # Indices of all training points, sorted by distance to test point i
            sorted_idx = np.argsort(dists[i])
            closest_y = self.y_train[sorted_idx[:k]]
            #########################################################################
            # TODO:                                                                 #
            # Now that you have found the labels of the k nearest neighbors, you    #
            # need to find the most common label in the list closest_y of labels.   #
            # Store this label in y_pred[i]. Break ties by choosing the smaller     #
            # label.                                                                #
            #########################################################################
            count = {}
            label = -1
            num = 0
            for c in closest_y:
                count[c] = count.get(c, 0) + 1
                # Keep the label with the most votes; on a tie, keep the smaller label
                if count[c] > num or (count[c] == num and c < label):
                    label = c
                    num = count[c]
            y_pred[i] = label

        #########################################################################
        #                           END OF YOUR CODE                            #
        #########################################################################

        return y_pred
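
A quick sanity check of the finished class on toy data (the arrays below are made up for illustration):

import numpy as np

# Two well-separated clusters with labels 0 and 1.
X_train = np.array([[0., 0.], [0., 1.], [10., 10.], [10., 11.]])
y_train = np.array([0, 0, 1, 1])
X_test = np.array([[0., 0.5], [10., 10.5]])

clf = KNearestNeighbor()
clf.train(X_train, y_train)
print(clf.predict(X_test, k=2, num_loops=0))  # expected: [0. 1.]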