Getting to Know TensorFlow

I won't go into what TensorFlow is for here; you can visit the official website for the details.

https://www.tensorflow.org/

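Basic Concepts

• Tensors

• Operations

• Graphs

• Sessions

Tensors: a tensor is an array of arbitrary dimensionality

A scalar is a zero-dimensional array (a rank-0 tensor). For example, 'Howdy' or 5

A vector is a one-dimensional array (a rank-1 tensor). For example, [2, 3, 5, 7, 11] or [5]

A matrix is a two-dimensional array (a rank-2 tensor). For example, [[3.1, 8.2, 5.9], [4.3, -2.7, 6.5]]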
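As a quick check of the ranks listed above, here is a minimal sketch that uses NumPy rather than TensorFlow (an assumption made for illustration; the idea of rank and shape carries over, but this is not TensorFlow code).

```python
import numpy as np

# The three examples from the text, wrapped as NumPy arrays.
scalar = np.array(5)                      # rank 0: a single value
vector = np.array([2, 3, 5, 7, 11])       # rank 1: a list of values
matrix = np.array([[3.1, 8.2, 5.9],
                   [4.3, -2.7, 6.5]])     # rank 2: rows and columns

for name, value in [("scalar", scalar), ("vector", vector), ("matrix", matrix)]:
    # ndim is the rank; shape gives the size along each dimension.
    print(name, "rank:", value.ndim, "shape:", value.shape)
```

The matrix has two rows of three values each, so its shape is (2, 3).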

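Operations: operations create, destroy, and manipulate tensors

Most TensorFlow code consists of operations.

Graphs (also called computational graphs or dataflow graphs): a graph data structure

The nodes of a graph are operations, and its edges are tensors. Tensors flow through the graph and are manipulated by an operation at each node. The output tensor of one operation often becomes the input tensor of a later operation.

Tensors can be stored in the graph as constants or variables. A constant is an operation that always returns the same tensor value. A variable is an operation that returns whichever tensor has been assigned to it.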
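To make the idea of "nodes are operations, edges are tensors" more concrete, here is a toy dataflow graph in plain Python. The classes Constant, Variable, and Add are hypothetical names invented for this sketch; they only mimic the behavior described above and are not how TensorFlow is implemented.

```python
class Constant:
    """A node that always returns the same value."""
    def __init__(self, value):
        self.value = value

    def evaluate(self):
        return self.value


class Variable:
    """A node that returns whatever value was last assigned to it."""
    def __init__(self, initial_value):
        self.value = initial_value

    def assign(self, new_value):
        self.value = new_value

    def evaluate(self):
        return self.value


class Add:
    """An operation node: its inputs are the outputs of other nodes."""
    def __init__(self, left, right):
        self.left = left
        self.right = right

    def evaluate(self):
        return self.left.evaluate() + self.right.evaluate()


# Building the graph computes nothing yet; it only wires nodes together.
x = Constant(5.2)
y = Variable(0.0)
total = Add(x, y)

first = total.evaluate()   # uses y's initial value
y.assign(5.0)
second = total.evaluate()  # same graph, new variable value
print(first, second)
```

The same graph gives a different result after the assignment, because only the variable's stored value changed, not the wiring.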

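Define a constant:

x = tf.constant([5.2])

Define a variable:

y = tf.Variable([5])

Assign a value to a variable:

y = tf.Variable([0])
y = y.assign([5])

Sessions: a graph must be run inside a session

A session holds the state of the graph it runs.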

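with tf.Session() as sess:
  # Run the initializer so that every variable holds its initial value.
  initialization = tf.global_variables_initializer()
  sess.run(initialization)
  # Evaluating y runs the assignment and prints the assigned value.
  print(y.eval())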
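The statement that a session stores the state of the graph can be illustrated with a small plain-Python model. ToyVariable and ToySession are hypothetical names made up for this sketch; it is only an analogy for how variable values live in the session rather than in the graph description, not TensorFlow's actual mechanism.

```python
class ToyVariable:
    """Describes a variable in the graph; holds no value itself."""
    def __init__(self, name, initial_value):
        self.name = name
        self.initial_value = initial_value


class ToySession:
    """Holds the current values of variables for one run of the graph."""
    def __init__(self):
        self.state = {}

    def initialize(self, variables):
        for variable in variables:
            self.state[variable.name] = variable.initial_value

    def assign(self, variable, value):
        self.state[variable.name] = value

    def read(self, variable):
        if variable.name not in self.state:
            raise RuntimeError("variable %r is not initialized" % variable.name)
        return self.state[variable.name]


y = ToyVariable("y", [0])

session_a = ToySession()
session_a.initialize([y])
session_a.assign(y, [5])

session_b = ToySession()
session_b.initialize([y])

# Each session keeps its own copy of the state.
print(session_a.read(y), session_b.read(y))
```

Reading a variable from a session that never initialized it raises an error in this sketch, which is the toy counterpart of forgetting to run the initializer.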

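TensorFlow programming in summary:

Assemble constants, variables, and operations into a graph.
Evaluate those constants, variables, and operations within a session.

Hands-On: Predicting House Prices

Setting up a TensorFlow environment is fairly involved, so we use Google Colab here. Note that the code in this article uses TensorFlow 1.x APIs such as tf.Session, tf.logging, and tf.contrib.

Loading the required libraries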

import math
# The display module controls how output is rendered in the notebook.
from IPython import display
# matplotlib is a 2D plotting library for Python.
# cm provides colormaps.
from matplotlib import cm
# GridSpec lets you customize subplot placement.
from matplotlib import gridspec
# pyplot offers a MATLAB-like plotting API for quickly drawing 2D charts.
from matplotlib import pyplot as plt
# numpy is Python's scientific computing package, with many advanced numerical tools.
import numpy as np
# pandas is a data analysis package built on numpy.
import pandas as pd
# sklearn (scikit-learn) is a machine learning library covering many kinds of methods:
# *   Classification
# *   Regression
# *   Clustering (unsupervised grouping)
# *   Dimensionality reduction
# *   Model selection
# *   Preprocessing
# metrics provides functions for scoring predictions, such as mean squared error.
from sklearn import metrics
# tensorflow is Google's machine learning framework.
import tensorflow as tf
# Dataset is TensorFlow's API for building input pipelines.
from tensorflow.python.data import Dataset
# Only log errors.
tf.logging.set_verbosity(tf.logging.ERROR)
# To keep the output readable, display at most 10 rows.
pd.options.display.max_rows = 10
pd.options.display.float_format = '{:.1f}'.format

Loading the data

The data is based on 1990 census data from California.

housing_dataframe = pd.read_csv("https://download.mlcc.google.cn/mledu-datasets/california_housing_train.csv", sep=",")
housing_dataframe.describe()

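Shuffle the rows so that the order in which they appear in the file cannot affect stochastic gradient descent, scale median_house_value to units of thousands, and look at the summary statistics again.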

housing_dataframe = housing_dataframe.reindex(
    np.random.permutation(housing_dataframe.index))
housing_dataframe["median_house_value"] /= 1000.0
housing_dataframe.describe()
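The two steps above, shuffling by a random permutation of the index and rescaling to thousands, can be sketched on a tiny array. The values below are hypothetical and the fixed seed is only there to make the run repeatable; the article's code does the same thing on the full DataFrame.

```python
import numpy as np

# Hypothetical house values in dollars, sorted as they might be in a file.
values = np.array([100000.0, 150000.0, 200000.0, 250000.0])

rng = np.random.RandomState(0)           # fixed seed so the run is repeatable
order = rng.permutation(len(values))     # a random ordering of the row indices
shuffled = values[order]                 # same rows, new order
scaled = shuffled / 1000.0               # dollars -> thousands of dollars

print(order, scaled)
```

Shuffling changes only the order of the rows, so summary statistics such as the mean are unaffected; the scaling divides every statistic by 1000.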

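Building the model

We want to predict median_house_value, so that is our label (also called the target). total_rooms is the total number of rooms in a city block, and we use it as the input feature.

We build the model with LinearRegressor from the TensorFlow Estimator API, a high-level API built on top of TensorFlow.

1. Define the features and configure feature columns

We need to specify the type of data each feature contains. In practice there are two main types:

Categorical data: data that is textual.

Numerical data: data that is a number (integer or float), or that can be treated as a number.

In TensorFlow, a construct called a "feature column" describes a feature's data type. A feature column stores only a description of the feature data; it does not contain the data itself.

Extract the total_rooms data and define its feature column with numeric_column, which marks the data as numerical.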

# Define the input feature: total_rooms.
my_feature = housing_dataframe[["total_rooms"]]
# Configure a numeric feature column for total_rooms.
feature_columns = [tf.feature_column.numeric_column("total_rooms")]

2. Define the label

Extract the label from housing_dataframe.

# Define the label.
targets = housing_dataframe["median_house_value"]

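3. Create the regression model

Configure a linear regression model with LinearRegressor and train it with GradientDescentOptimizer, which implements mini-batch stochastic gradient descent (SGD). The learning_rate argument controls the size of the gradient step.

To be safe, we also apply gradient clipping to the optimizer via clip_gradients_by_norm. Gradient clipping keeps the magnitude of the gradients from becoming too large during training, which can cause gradient descent to fail.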

# Use gradient descent as the optimizer for training the model.
# Set a learning rate of 0.0000001 for Gradient Descent.
my_optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.0000001)
my_optimizer = tf.contrib.estimator.clip_gradients_by_norm(my_optimizer, 5.0)
# Configure the linear regression model with our feature columns and optimizer.
linear_regressor = tf.estimator.LinearRegressor(
    feature_columns=feature_columns,
    optimizer=my_optimizer
)
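Gradient clipping by norm is easy to picture with a few lines of plain Python. The helper clip_by_norm below is a hypothetical function written for this sketch and shows only the general idea (shrink the gradient when its L2 norm exceeds a limit, keeping its direction); it is not the code TensorFlow runs.

```python
import math

def clip_by_norm(gradient, clip_norm):
    """Rescale `gradient` so that its L2 norm is at most `clip_norm`."""
    norm = math.sqrt(sum(g * g for g in gradient))
    if norm <= clip_norm:
        return list(gradient)             # small enough: leave unchanged
    scale = clip_norm / norm
    return [g * scale for g in gradient]  # same direction, shorter length

large = clip_by_norm([30.0, 40.0], 5.0)   # norm 50, gets scaled down
small = clip_by_norm([0.3, 0.4], 5.0)     # norm 0.5, left as is
print(large, small)
```

With a limit of 5.0, as in the optimizer above, a gradient of norm 50 is shrunk by a factor of ten while a small gradient passes through untouched.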

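Gradient: the vector of partial derivatives with respect to all of the independent variables. In machine learning, the gradient is the vector of partial derivatives of the model function. The gradient points in the direction of steepest ascent.

Gradient descent: a technique for minimizing loss by computing the gradient of the loss with respect to the model's parameters, conditioned on the training data. Informally, gradient descent adjusts the parameters iteratively, gradually finding the combination of weights and bias that minimizes the loss.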
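Here is a minimal sketch of gradient descent for a one-feature linear model, written in plain Python. It assumes mean squared error as the loss and uses the whole data set on every step (full-batch, unlike the mini-batch SGD used in this article); fit_line and the sample points are made up for illustration.

```python
def fit_line(xs, ys, learning_rate, steps):
    """Fit y = weight * x + bias by gradient descent on mean squared error."""
    weight, bias = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        errors = [weight * x + bias - y for x, y in zip(xs, ys)]
        # Partial derivatives of the mean squared error.
        grad_weight = 2.0 / n * sum(e * x for e, x in zip(errors, xs))
        grad_bias = 2.0 / n * sum(errors)
        # Step against the gradient, i.e. downhill.
        weight -= learning_rate * grad_weight
        bias -= learning_rate * grad_bias
    return weight, bias

xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2.0 * x + 1.0 for x in xs]          # points on the line y = 2x + 1
weight, bias = fit_line(xs, ys, learning_rate=0.05, steps=2000)
print(round(weight, 3), round(bias, 3))
```

Because the sample points lie exactly on a line, the fitted weight and bias approach the slope and intercept of that line.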

Several other machine learning concepts come up here, such as learning rate and gradient clipping; interested readers can look them up.

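4. Define the input function

The input function tells TensorFlow how to preprocess the data, and how to batch, shuffle, and repeat it during model training.

First, convert the pandas feature data into a dict of NumPy arrays. Then use the TensorFlow Dataset API to build a Dataset object, split the data into batches of size batch_size, and repeat it for the specified number of epochs (num_epochs).
If num_epochs is set to None, the input data repeats indefinitely.
Next, setting shuffle to True shuffles the data so that it is passed to the model in random order during training. The buffer_size argument specifies the size of the dataset from which shuffle randomly samples.
Finally, the input function builds an iterator over the dataset and returns the next batch of data to the model.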

def my_input_fn(features, targets, batch_size=1, shuffle=True, num_epochs=None):
    # Convert pandas data into a dict of np arrays.
    features = {key: np.array(value) for key, value in dict(features).items()}
    # Construct a dataset, and configure batching/repeating.
    ds = Dataset.from_tensor_slices((features, targets))  # warning: 2GB limit
    ds = ds.batch(batch_size).repeat(num_epochs)
    # Shuffle the data, if specified.
    if shuffle:
        ds = ds.shuffle(buffer_size=10000)
    # Return the next batch of data.
    features, labels = ds.make_one_shot_iterator().get_next()
    return features, labels

For more detailed documentation on the Dataset API, see the TensorFlow Programmer's Guide.
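The roles of batch_size, shuffle, and num_epochs can be tried out without TensorFlow. The generator below is a hypothetical stand-in written for this sketch; unlike my_input_fn, it shuffles individual examples before batching, so it illustrates the parameters rather than reproducing the exact pipeline.

```python
import itertools
import random

def batches(examples, batch_size=1, shuffle=True, num_epochs=None, seed=None):
    """Yield lists of up to `batch_size` examples, epoch after epoch."""
    rng = random.Random(seed)
    epochs = itertools.count() if num_epochs is None else range(num_epochs)
    for _ in epochs:
        order = list(examples)
        if shuffle:
            rng.shuffle(order)           # new random order every epoch
        for start in range(0, len(order), batch_size):
            yield order[start:start + batch_size]

# One pass, no shuffling: every example appears exactly once.
one_epoch = list(batches(range(5), batch_size=2, shuffle=False, num_epochs=1))
print(one_epoch)

# num_epochs=None never stops, so take only the first four batches.
endless = list(itertools.islice(
    batches(range(3), batch_size=2, shuffle=False), 4))
print(endless)
```

With num_epochs=None the generator keeps cycling through the data, which is why training is bounded by the steps argument rather than by the size of the data set.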

5. Train the model

Now call train() on linear_regressor to train the model.

_ = linear_regressor.train(
    input_fn = lambda:my_input_fn(my_feature, targets),
    steps=100
)

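Assigning the result to _ signals that train() returns a value that we do not need.

6. Evaluate the model

Let's make predictions on the training data to see how well the model fit it during training.

Training error measures how well the model fits the training data, but it does not measure how well the model generalizes to new data.

Generalization ability is a machine learning algorithm's capacity to adapt to previously unseen samples. The goal of learning is to capture the regularities underlying the data, so that a trained model also produces suitable outputs for data outside the training set that follows the same regularities. This capacity is called generalization ability.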

# Create an input function for predictions.
# Note: Since we're making just one prediction for each example, we don't 
# need to repeat or shuffle the data here.
prediction_input_fn = lambda: my_input_fn(my_feature, targets, num_epochs=1, shuffle=False)
# Call predict() on the linear_regressor to make predictions.
predictions = linear_regressor.predict(input_fn=prediction_input_fn)
# Format predictions as a NumPy array, so we can calculate error metrics.
predictions = np.array([item['predictions'][0] for item in predictions])
# Print Mean Squared Error and Root Mean Squared Error.
mean_squared_error = metrics.mean_squared_error(predictions, targets)
root_mean_squared_error = math.sqrt(mean_squared_error)
print("Mean Squared Error (on training data): %0.3f" % mean_squared_error)
print("Root Mean Squared Error (on training data): %0.3f" % root_mean_squared_error)

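The results are as follows. The values shown are in dollars, so they appear to come from a run without the division by 1000 above:

Mean Squared Error (on training data): 56425052440.760

Root Mean Squared Error (on training data): 237539.581

We usually judge the size of the error by the root mean squared error (RMSE), because it is expressed in the same units as the target.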
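For reference, here is how the two metrics are computed by hand in plain Python. The prediction and target values are hypothetical; the article itself uses metrics.mean_squared_error from scikit-learn.

```python
import math

def mean_squared_error(predictions, targets):
    """Average of the squared differences between predictions and targets."""
    squared = [(p - t) ** 2 for p, t in zip(predictions, targets)]
    return sum(squared) / len(squared)

# Hypothetical values in thousands of dollars.
predictions = [150.0, 210.0, 90.0]
targets = [160.0, 200.0, 120.0]

mse = mean_squared_error(predictions, targets)
rmse = math.sqrt(mse)
print(mse, rmse)
```

The MSE is in squared units, while taking the square root brings the RMSE back to the units of the targets, which makes it easier to interpret.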

Compare the RMSE with the difference between the maximum and minimum target values.

min_house_value = housing_dataframe["median_house_value"].min()
max_house_value = housing_dataframe["median_house_value"].max()
min_max_difference = max_house_value - min_house_value
print("Min. Median House Value: %0.3f" % min_house_value)
print("Max. Median House Value: %0.3f" % max_house_value)
print("Difference between Min. and Max.: %0.3f" % min_max_difference)
print("Root Mean Squared Error: %0.3f" % root_mean_squared_error)

Result:

Min. Median House Value: 14999.000

Max. Median House Value: 500001.000

Difference between Min. and Max.: 485002.000

Root Mean Squared Error: 237539.581
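One simple way to read these numbers is to express the RMSE as a fraction of the target range. The short calculation below reuses the figures printed above; the variable rmse_fraction is introduced here only for illustration.

```python
# Values copied from the output above.
min_house_value = 14999.000
max_house_value = 500001.000
root_mean_squared_error = 237539.581

value_range = max_house_value - min_house_value
rmse_fraction = root_mean_squared_error / value_range
print("RMSE as a fraction of the range: %0.2f" % rmse_fraction)
```

The typical error is close to half the full range of target values, which suggests the model is still far from a useful fit.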

Next, look at the overall summary statistics to see how well the predictions match the targets.

calibration_data = pd.DataFrame()
calibration_data["predictions"] = pd.Series(predictions)
calibration_data["targets"] = pd.Series(targets)
calibration_data.describe()

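Clearly the predictions are not good. We need to keep adjusting the parameters and train for longer to get a better model.

7. Hyperparameter tuning

Define a function with adjustable parameters so that hyperparameters are easy to tune, and plot the training progress.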

def train_model(learning_rate, steps, batch_size, input_feature="total_rooms"):
  periods = 10
  # Integer division keeps the step count per period a whole number.
  steps_per_period = steps // periods
  my_feature = input_feature
  my_feature_data = housing_dataframe[[my_feature]]
  my_label = "median_house_value"
  targets = housing_dataframe[my_label]
  # Create feature columns.
  feature_columns = [tf.feature_column.numeric_column(my_feature)]
  # Create input functions.
  training_input_fn = lambda: my_input_fn(my_feature_data, targets, batch_size=batch_size)
  prediction_input_fn = lambda: my_input_fn(my_feature_data, targets, num_epochs=1, shuffle=False)
  # Create a linear regressor object.
  my_optimizer = tf.train.GradientDescentOptimizer(learning_rate=learning_rate)
  my_optimizer = tf.contrib.estimator.clip_gradients_by_norm(my_optimizer, 5.0)
  linear_regressor = tf.estimator.LinearRegressor(
      feature_columns=feature_columns,
      optimizer=my_optimizer,
      # Note: with a fixed model_dir, a later call may resume from checkpoints already saved in 'out'.
      config=tf.estimator.RunConfig(model_dir='out')
  )
  # Set up to plot the state of our model's line each period.
  plt.figure(figsize=(15, 6))
  plt.subplot(1, 2, 1)
  plt.title("Learned Line by Period")
  plt.ylabel(my_label)
  plt.xlabel(my_feature)
  sample = housing_dataframe.sample(n=300)
  plt.scatter(sample[my_feature], sample[my_label])
  colors = [cm.coolwarm(x) for x in np.linspace(-1, 1, periods)]
  # Train the model, but do so inside a loop so that we can periodically assess
  # loss metrics.
  print("Training model...")
  print("RMSE (on training data):")
  root_mean_squared_errors = []
  for period in range (0, periods):
    # Train the model, starting from the prior state.
    linear_regressor.train(
        input_fn=training_input_fn,
        steps=steps_per_period
    )
    # Take a break and compute predictions.
    predictions = linear_regressor.predict(input_fn=prediction_input_fn)
    predictions = np.array([item['predictions'][0] for item in predictions])
    # Compute loss.
    root_mean_squared_error = math.sqrt(
        metrics.mean_squared_error(predictions, targets))
    # Occasionally print the current loss.
    print("  period %02d : %0.2f" % (period, root_mean_squared_error))
    # Add the loss metrics from this period to our list.
    root_mean_squared_errors.append(root_mean_squared_error)
    # Finally, track the weights and biases over time.
    # Apply some math to ensure that the data and line are plotted neatly.
    y_extents = np.array([0, sample[my_label].max()])
    weight = linear_regressor.get_variable_value('linear/linear_model/%s/weights' % input_feature)[0]
    bias = linear_regressor.get_variable_value('linear/linear_model/bias_weights')
    x_extents = (y_extents - bias) / weight
    x_extents = np.maximum(np.minimum(x_extents,
                                      sample[my_feature].max()),
                           sample[my_feature].min())
    y_extents = weight * x_extents + bias
    plt.plot(x_extents, y_extents, color=colors[period]) 
  print("Model training finished.")
  # Output a graph of loss metrics over periods.
  plt.subplot(1, 2, 2)
  plt.ylabel('RMSE')
  plt.xlabel('Periods')
  plt.title("Root Mean Squared Error vs. Periods")
  plt.tight_layout()
  plt.plot(root_mean_squared_errors)
  # Output a table with calibration data.
  calibration_data = pd.DataFrame()
  calibration_data["predictions"] = pd.Series(predictions)
  calibration_data["targets"] = pd.Series(targets)
  display.display(calibration_data.describe())
  print("Final RMSE (on training data): %0.2f" % root_mean_squared_error)

Training run 01

train_model(
    learning_rate=0.00001,
    steps=500,
    batch_size=5
)

Training run 02

train_model(
    learning_rate=0.5,
    steps=500,
    batch_size=5
)

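As you can see, adjusting the learning rate, the number of training steps, and the batch size can gradually bring the model's predictions closer to the true values.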
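Why the learning rate matters so much can be seen on a toy problem. The sketch below runs gradient descent on the hypothetical function f(w) = (w - 3)**2 with three learning rates; it is unrelated to the housing model and only illustrates the general behavior of step size.

```python
def minimize(learning_rate, steps=50, start=0.0):
    """Run gradient descent on f(w) = (w - 3)**2, whose minimum is at w = 3."""
    w = start
    for _ in range(steps):
        gradient = 2.0 * (w - 3.0)
        w -= learning_rate * gradient
    return w

results = {rate: minimize(rate) for rate in (0.01, 0.1, 1.1)}
for rate, w in results.items():
    print("learning_rate=%s -> w=%.4f" % (rate, w))
```

In this toy setting the smallest rate is still well short of the minimum after 50 steps, the middle rate lands very close to it, and the largest rate overshoots further on every step and diverges.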

8. Switch the feature

Now switch the feature from total_rooms to population and train again to see the resulting predictions.

train_model(
    learning_rate=0.00002,
    steps=1000,
    batch_size=5,
    input_feature="population"
)
