Deep Learning in Python with TensorFlow (4): TensorFlow Data Processing and Data Visualization - Alibaba Cloud Developer Community

2024-05-09

Building a tf.data dataset

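A tf.data.Dataset represents a sequence of elements, where each element consists of one or more Tensor objects. In an image pipeline, for example, an element can be a single training sample: a pair made up of a tensor holding the image data and a label. There are two distinct ways to build a dataset: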

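Create it directly from tensors (for example with Dataset.from_tensor_slices()). NumPy arrays work as well, because TensorFlow converts them to tensors automatically.
Create it by applying a transformation (for example Dataset.batch()) to one or more existing tf.data.Dataset objects.

These two approaches can be further broken down into seven reading methods, listed here: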

Data source	Reading method	Notes
NumPy arrays	tf.data.Dataset.from_tensor_slices	Suitable when the data is small
Python generator	tf.data.Dataset.from_generator
Text files	tf.data.TextLineDataset
CSV files	tf.data.experimental.CsvDataset
TFRecord files	tf.data.TFRecordDataset	TFRecord is TensorFlow's own binary format, convenient for storing large datasets. When the data does not fit in memory, it can be written to TFRecord files and then read back record by record.
Fixed-length binary files	tf.data.FixedLengthRecordDataset
A set of files	tf.data.Dataset.list_files()
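
As a rough illustration of the two construction styles and of the first two rows of the table, the sketch below builds a dataset from NumPy arrays, derives a batched dataset from it, and reads from a Python generator. The arrays, the generator name count_up, and the batch size are made up for illustration and are not part of the original article.

```python
import numpy as np
import tensorflow as tf

# In-memory NumPy data: 6 samples with 2 features each, plus integer labels.
features = np.arange(12, dtype=np.float32).reshape(6, 2)
labels = np.array([0, 1, 0, 1, 0, 1], dtype=np.int64)

# Style 1: create a dataset directly from tensors / NumPy arrays.
# Each element is one (feature_row, label) pair.
sliced = tf.data.Dataset.from_tensor_slices((features, labels))

# Style 2: derive a new dataset by transforming an existing one.
batched = sliced.batch(4)
batch_shapes = [tuple(x.shape) for x, _ in batched]


def count_up(n):
    """Hypothetical generator yielding (i, i squared) pairs."""
    for i in range(n):
        yield i, i * i


# A Python generator can also be a source; output_signature declares
# the dtype and shape of every yielded element.
generated = tf.data.Dataset.from_generator(
    lambda: count_up(3),
    output_signature=(
        tf.TensorSpec(shape=(), dtype=tf.int32),
        tf.TensorSpec(shape=(), dtype=tf.int32),
    ),
)
squares = [(int(a), int(b)) for a, b in generated]

print(batch_shapes)  # [(4, 2), (2, 2)]
print(squares)       # [(0, 0), (1, 1), (2, 4)]
```

The last batch is smaller because 6 is not a multiple of 4; passing drop_remainder=True to batch() would discard it instead.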
Everything here except TFRecord is fairly self-explanatory, so the rest of this section focuses on TFRecord datasets.

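TFRecord is a binary file format that ships with TensorFlow. It is the storage format officially recommended by TensorFlow, and it makes storing and reading data more efficient. Its advantages can be summarized as follows:

1) it supports multiple data types;

2) it makes better use of memory and is easy to copy and move;

3) it stores the binary data and the labels in the same file.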

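The TFRecord layout is well suited to storing training data. Internally it uses the Protocol Buffer binary encoding: each record is typically a serialized tf.train.Example message, and the records are stored back to back in a single binary file, so the data can be loaded by reading that one file sequentially. This is simple and fast, and especially friendly to large training sets. When the training set is large, it can also be split across several TFRecord files to improve processing efficiency.

Suppose there are 10,000 images. They can be saved as, say, 5 .tfrecords files (the exact number depends on the file size), so that reading the data only requires opening 5 files. If each of the 10,000 images were saved as a separate NumPy file instead, 10,000 separate reads would be needed. TFRecord files are read with the tf.data.TFRecordDataset class.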
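
To make the idea of splitting data across several TFRecord files concrete, here is a minimal sketch that writes ten integer records round-robin into five shard files and reads them back through a single tf.data.TFRecordDataset. The helper write_shards, the file naming pattern, and the feature name "value" are assumptions made for this illustration; real image data would be written the same way with bytes features.

```python
import os
import tempfile

import tensorflow as tf


def write_shards(values, out_dir, num_shards):
    """Write integer values round-robin into num_shards TFRecord files."""
    paths = [
        os.path.join(out_dir, f"part-{i:05d}.tfrecords") for i in range(num_shards)
    ]
    writers = [tf.io.TFRecordWriter(p) for p in paths]
    for idx, v in enumerate(values):
        example = tf.train.Example(features=tf.train.Features(feature={
            "value": tf.train.Feature(int64_list=tf.train.Int64List(value=[v]))
        }))
        writers[idx % num_shards].write(example.SerializeToString())
    for w in writers:
        w.close()
    return paths


out_dir = tempfile.mkdtemp()
shard_paths = write_shards(range(10), out_dir, num_shards=5)

# TFRecordDataset accepts a list of file names and reads them one after another.
dataset = tf.data.TFRecordDataset(shard_paths)
spec = {"value": tf.io.FixedLenFeature([], tf.int64)}
values = sorted(int(tf.io.parse_single_example(r, spec)["value"]) for r in dataset)

print(len(shard_paths), values)
```

Because the files are read one after another, the records come back grouped by shard rather than in their original order, which is why the values are sorted before printing.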

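TFRecord internals

When converting data, Example is the core of TFRecord: a TFRecord file contains a series of Examples, and each Example can be thought of as one sample. Example is a TensorFlow message type, available as tf.train.Example. A feature corresponds to one dimension (field) of the input sample stored in an Example: the sample has as many features as it has such fields.

The relationship between TFRecord, Example, and features is as follows: a TFRecord file holds a sequence of tf.train.Example messages; each Example holds one tf.train.Features message; Features maps feature names to tf.train.Feature values; and each Feature wraps exactly one of BytesList, FloatList, or Int64List.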
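
The nesting of Example, Features, and Feature can be inspected directly in Python. The following sketch builds one Example by hand, serializes it, and walks the restored message. The feature names "image" and "label" and the placeholder bytes are chosen only for illustration.

```python
import tensorflow as tf

# One Example = one sample; it wraps a Features message,
# which maps feature names to Feature values.
example = tf.train.Example(features=tf.train.Features(feature={
    "image": tf.train.Feature(bytes_list=tf.train.BytesList(value=[b"raw-bytes"])),
    "label": tf.train.Feature(int64_list=tf.train.Int64List(value=[1])),
}))

# A TFRecord file stores the serialized form of each Example.
serialized = example.SerializeToString()

# Restore the message and walk the nesting: Example -> Features -> Feature.
restored = tf.train.Example.FromString(serialized)
feature_names = sorted(restored.features.feature.keys())
label_value = restored.features.feature["label"].int64_list.value[0]
image_value = restored.features.feature["image"].bytes_list.value[0]
# Each Feature holds exactly one of bytes_list, float_list or int64_list.
label_kind = restored.features.feature["label"].WhichOneof("kind")

print(feature_names, label_value, image_value, label_kind)
```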

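Generating TFRecord files

This example uses the cat-dog dataset, which can be downloaded here:

Link: https://pan.baidu.com/s/1e9skHjPAzy9Bfd5Z7Xl70A?pwd=zynb
Extraction code: zynb

After downloading, extract it into the ./data folder under the current directory. Then collect the file paths and labels, and finally write the samples into a TFRecord file one by one.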

import tensorflow as tf
import os
## Set the file locations and labels
data_dir = "./data/cat-dog"
train_cat_dir = data_dir + '/train/cats/'
train_dog_dir = data_dir + "/train/dogs/"
test_cat_dir = data_dir + "/test/cats/"
test_dog_dir = data_dir + "/test/dogs/"
train_cat_filenames = [train_cat_dir + filename for filename in os.listdir(train_cat_dir)]
train_dog_filenames = [train_dog_dir + filename for filename in os.listdir(train_dog_dir)]
train_filenames = train_cat_filenames + train_dog_filenames
# Label 0 = cat, label 1 = dog
train_labels = [0]*len(train_cat_filenames) + [1]*len(train_dog_filenames)
test_cat_filenames = [test_cat_dir + filename for filename in os.listdir(test_cat_dir)]
test_dog_filenames = [test_dog_dir + filename for filename in os.listdir(test_dog_dir)]
test_filenames = test_cat_filenames + test_dog_filenames
test_labels = [0]*len(test_cat_filenames) + [1]*len(test_dog_filenames)
## Function that writes a TFRecord dataset
def encoder(filenames, labels, tfrecord_file):
    with tf.io.TFRecordWriter(tfrecord_file) as writer:
        for filename, label in zip(filenames, labels):
            with open(filename, 'rb') as f:
                image = f.read()
                ## Wrap the image and the label as Feature objects. The values must be
                ## plain Python values, not NumPy arrays or tensors, so it is best to
                ## read the image file as raw bytes. np.array(Image.open(filename)) is
                ## fast, but array.tolist() is slow, which slows down the whole
                ## conversion. The bytes are decoded later with tf.io.decode_jpeg
                ## when the TFRecord data is read.
                image_feature = tf.train.Feature(bytes_list=tf.train.BytesList(value=[image]))
                label_feature = tf.train.Feature(int64_list=tf.train.Int64List(value=[label]))
                
                ## Build the feature dictionary
                feature = {
                    'image': image_feature,
                    'label': label_feature
                }
                # Create an Example from the dictionary; it wraps the label and image data
                example = tf.train.Example(features=tf.train.Features(feature=feature))
                # Serialize the Example and write it to the TFRecord file
                writer.write(example.SerializeToString())
## Create the TFRecord files
encoder(train_filenames, train_labels, 'train.tfrecords')
encoder(test_filenames, test_labels, 'test.tfrecords')

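Note that only three value types are available: int64, float, and bytes. The values must be plain Python values; NumPy arrays and TensorFlow tensors cannot be passed directly.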

tf.train.Feature(bytes_list=tf.train.BytesList(value=[*]))
tf.train.Feature(int64_list=tf.train.Int64List(value=[*]))
tf.train.Feature(float_list=tf.train.FloatList(value=[*]))
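
One common way to work within these three types is to wrap them in small helper functions and to convert NumPy data before storing it. The sketch below is one possible approach, not the author's code: the helper names bytes_feature, float_feature, and int64_feature and the feature names "flat", "shape", and "raw" are assumptions. It stores the same array twice, once as a flat float list and once as bytes produced by tf.io.serialize_tensor.

```python
import numpy as np
import tensorflow as tf


def bytes_feature(value):
    """Wrap a bytes object (or an eager string tensor) as a Feature."""
    if isinstance(value, tf.Tensor):
        value = value.numpy()  # an eager string tensor yields Python bytes
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))


def float_feature(values):
    """Wrap an iterable of numbers as a float Feature."""
    return tf.train.Feature(
        float_list=tf.train.FloatList(value=[float(v) for v in values]))


def int64_feature(values):
    """Wrap an iterable of integers as an int64 Feature."""
    return tf.train.Feature(
        int64_list=tf.train.Int64List(value=[int(v) for v in values]))


array = np.array([[1.5, 2.5], [3.5, 4.5]], dtype=np.float32)
example = tf.train.Example(features=tf.train.Features(feature={
    "flat": float_feature(array.ravel()),               # flattened values
    "shape": int64_feature(array.shape),                # shape stored separately
    "raw": bytes_feature(tf.io.serialize_tensor(array)),  # whole tensor as bytes
}))

parsed = tf.io.parse_single_example(example.SerializeToString(), {
    "flat": tf.io.FixedLenFeature([4], tf.float32),
    "shape": tf.io.FixedLenFeature([2], tf.int64),
    "raw": tf.io.FixedLenFeature([], tf.string),
})
restored = tf.io.parse_tensor(parsed["raw"], out_type=tf.float32).numpy()
print(restored)
```

Storing the serialized tensor keeps shape and dtype together with the data, at the cost of needing tf.io.parse_tensor with the correct out_type when reading.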

After running the encoder, two .tfrecords files appear in the current directory.

Reading TFRecord files

def decoder(tfrecord_file, is_train_dataset=False):
    # Build the dataset from the TFRecord file
    dataset = tf.data.TFRecordDataset(tfrecord_file)
    # Describe each feature so that every Example can be parsed
    feature_description = {
        'image': tf.io.FixedLenFeature([], tf.string),
        'label': tf.io.FixedLenFeature([], tf.int64)
    }
    def _parse_example(example_string):  # Decode a single Example
        # Parse the serialized Example into a dict of tensors
        feature_dic = tf.io.parse_single_example(example_string, feature_description)
        # channels=3 forces RGB so that all images in a batch have the same shape
        feature_dic['image'] = tf.io.decode_jpeg(feature_dic['image'], channels=3)
        # Resize the image and scale pixel values to [0, 1]; this is a preprocessing step
        feature_dic['image'] = tf.image.resize(feature_dic['image'], [256, 256]) / 255.0
        return feature_dic['image'], feature_dic['label']
    batch_size = 4
    if is_train_dataset:
        # Shuffle the serialized records before decoding, so the shuffle buffer holds
        # compact JPEG bytes rather than decoded images.
        # tf.data.AUTOTUNE lets tf.data tune the prefetch buffer size at runtime.
        dataset = dataset.shuffle(buffer_size=2000).map(_parse_example).batch(batch_size).prefetch(tf.data.AUTOTUNE)
    else:
        dataset = dataset.map(_parse_example)
        dataset = dataset.batch(batch_size)
    return dataset
train_data = decoder('train.tfrecords', is_train_dataset=True)
test_data = decoder('test.tfrecords')
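
The write and read steps can be tried end to end without downloading the dataset. The sketch below is a self-contained round trip on synthetic data, under the assumption that random RGB images encoded with tf.io.encode_jpeg are an acceptable stand-in for the cat and dog photos. The helper make_jpeg, the file name synthetic.tfrecords, and the image sizes are hypothetical.

```python
import os
import tempfile

import numpy as np
import tensorflow as tf


def make_jpeg(height, width, seed):
    """Encode a random RGB image as JPEG bytes (synthetic stand-in data)."""
    rng = np.random.default_rng(seed)
    pixels = rng.integers(0, 256, size=(height, width, 3), dtype=np.uint8)
    return tf.io.encode_jpeg(pixels).numpy()


# Write six images of different heights, with alternating labels.
path = os.path.join(tempfile.mkdtemp(), "synthetic.tfrecords")
with tf.io.TFRecordWriter(path) as writer:
    for i in range(6):
        feature = {
            "image": tf.train.Feature(
                bytes_list=tf.train.BytesList(value=[make_jpeg(32 + i, 48, i)])),
            "label": tf.train.Feature(
                int64_list=tf.train.Int64List(value=[i % 2])),
        }
        example = tf.train.Example(features=tf.train.Features(feature=feature))
        writer.write(example.SerializeToString())

feature_description = {
    "image": tf.io.FixedLenFeature([], tf.string),
    "label": tf.io.FixedLenFeature([], tf.int64),
}


def parse(record):
    """Parse one serialized Example, then decode, resize and rescale the image."""
    parsed = tf.io.parse_single_example(record, feature_description)
    image = tf.io.decode_jpeg(parsed["image"], channels=3)
    image = tf.image.resize(image, [256, 256]) / 255.0
    return image, parsed["label"]


dataset = tf.data.TFRecordDataset(path).map(parse).batch(4)
shapes = [(tuple(images.shape), tuple(labels.shape)) for images, labels in dataset]
all_labels = [int(v) for _, labels in dataset for v in labels]
print(shapes)
```

Resizing before batching is what allows images of different original sizes to be stacked into one batch tensor.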

Finally, take one batch of four samples from train_data and display it:

import matplotlib.pyplot as plt
def plot_img_label(elements):
    imgs, labels = elements
    num_imgs = labels.shape[0]
    for i in range(num_imgs):
        plt.subplot(1,num_imgs,i+1)
        plt.axis('off')
        plt.title(labels[i].numpy())
        plt.imshow(imgs[i].numpy())
    plt.show()
plot_img_label(train_data.take(1).get_single_element())

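Image augmentation

def augment(image_bytes):
    # Decode the JPEG bytes, convert to float32 in [0, 1], and resize the image
    image = tf.image.decode_jpeg(image_bytes, channels=3)
    image = tf.image.convert_image_dtype(image, dtype=tf.float32)
    image = tf.image.resize(image, (156, 156))
    # Randomly flip left-right and up-down, randomly adjust the brightness,
    # and finally rotate by 90 degrees counter-clockwise
    image = tf.image.random_flip_left_right(image)
    image = tf.image.random_flip_up_down(image)
    image = tf.image.random_brightness(image, max_delta=1)
    image = tf.image.rot90(image, k=1)
    return image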
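
Augmentation is usually applied inside the input pipeline with Dataset.map, so that each epoch sees differently transformed images. The sketch below shows one way to do this on synthetic float images. The function name augment_pair, the milder max_delta of 0.2, and the clipping step are choices made for this illustration rather than values from the article.

```python
import numpy as np
import tensorflow as tf


def augment_pair(image, label):
    """Random flips plus a mild brightness shift; values stay in [0, 1]."""
    image = tf.image.random_flip_left_right(image)
    image = tf.image.random_flip_up_down(image)
    image = tf.image.random_brightness(image, max_delta=0.2)
    # Brightness shifts can push values outside [0, 1], so clip them back.
    image = tf.clip_by_value(image, 0.0, 1.0)
    return image, label


# Synthetic stand-in data: 8 float images of size 64x48 with 3 channels.
images = np.random.default_rng(0).random((8, 64, 48, 3)).astype(np.float32)
labels = np.arange(8, dtype=np.int64)

dataset = (
    tf.data.Dataset.from_tensor_slices((images, labels))
    .map(augment_pair, num_parallel_calls=tf.data.AUTOTUNE)
    .batch(4)
)
first_images, first_labels = next(iter(dataset))
print(first_images.shape, first_labels.numpy())
```

The label passes through unchanged, which is appropriate for classification; tasks with spatial labels such as bounding boxes would need the same geometric transform applied to the label.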

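Data visualization

The libraries most commonly used for data visualization are matplotlib and TensorBoard.

For more visualization techniques, see the column "数据可视化 Python" (Data Visualization with Python) on Bigcrab__'s CSDN blog.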

Displaying Chinese text in matplotlib

# Use only the settings that match your operating system
# Windows
plt.rcParams['font.sans-serif'] = ['SimHei']
plt.rcParams['axes.unicode_minus'] = False
# macOS
plt.rcParams['font.sans-serif'] = ['Arial Unicode MS']
plt.rcParams['axes.unicode_minus'] = False
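
If the same notebook has to run on several operating systems, the font can be chosen at runtime instead of being hard-coded. The following is a minimal sketch of that idea: it checks which of a few candidate fonts matplotlib can find and puts those first in the sans-serif list. The two Linux font names in the candidate list are assumptions about what might be installed, and the list should be adapted to the fonts actually present.

```python
from matplotlib import font_manager
import matplotlib.pyplot as plt

# Candidate fonts with Chinese glyphs; the last two are typical Linux fonts
# and are only an assumption about what might be installed.
candidates = ["SimHei", "Arial Unicode MS", "Noto Sans CJK SC", "WenQuanYi Micro Hei"]

# Names of all fonts that matplotlib has discovered on this machine.
installed = {font.name for font in font_manager.fontManager.ttflist}
available = [name for name in candidates if name in installed]

if available:
    # Put the usable Chinese fonts in front of the existing fallback list.
    plt.rcParams["font.sans-serif"] = available + list(plt.rcParams["font.sans-serif"])
# Render the minus sign with an ASCII hyphen, which every font provides.
plt.rcParams["axes.unicode_minus"] = False

print(available)
```

If the printed list is empty, none of the candidates is installed and Chinese labels will still render as empty boxes until a suitable font is added.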

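TensorBoard

import datetime
import os
import tensorflow as tf
# model, x_train, y_train, x_test and y_test are assumed to be defined already
# Write the logs of each run into its own timestamped directory
logdir = os.path.join("logs", datetime.datetime.now().strftime("%Y%m%d-%H%M%S"))
# histogram_freq=1 records weight histograms once per epoch
tensorboard_callback = tf.keras.callbacks.TensorBoard(logdir, histogram_freq=1)
model.fit(x=x_train, 
    y=y_train, 
    epochs=5, 
    validation_data=(x_test, y_test), 
    callbacks=[tensorboard_callback])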
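
For readers who want to try the callback without a prepared model, here is a small self-contained sketch. The random data, the two-layer model, and the temporary log directory are all made up for illustration; only the TensorBoard callback usage mirrors the snippet above.

```python
import datetime
import os
import tempfile

import numpy as np
import tensorflow as tf

# Synthetic binary classification data (stand-in for a real dataset).
rng = np.random.default_rng(0)
x_train = rng.random((64, 4)).astype("float32")
y_train = (x_train.sum(axis=1) > 2.0).astype("float32")
x_test, y_test = x_train[:16], y_train[:16]

# A deliberately tiny model, just enough to produce logs.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(8, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Log into a timestamped subdirectory of a temporary folder.
logdir = os.path.join(
    tempfile.mkdtemp(), "logs", datetime.datetime.now().strftime("%Y%m%d-%H%M%S"))
tensorboard_callback = tf.keras.callbacks.TensorBoard(logdir, histogram_freq=1)

history = model.fit(
    x_train, y_train,
    epochs=2,
    validation_data=(x_test, y_test),
    callbacks=[tensorboard_callback],
    verbose=0,
)

# The callback writes event files that the TensorBoard UI reads.
event_files = [f for _, _, files in os.walk(logdir) for f in files if "tfevents" in f]
print(logdir, len(event_files))
```

Pointing tensorboard --logdir at the printed directory should then show the loss and accuracy curves for the two epochs.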

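Start the TensorBoard service from the Windows command line and point it at the log directory. On Linux, adjust the value of logdir to match your environment.

tensorboard --logdir="C:\Users\wumg\jupyter-ipynb\tensorflow2-book\char-05\logs"

In Jupyter, run the following code: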

%load_ext tensorboard
%tensorboard --logdir logs
from tensorboard import notebook
notebook.list() # View open TensorBoard instances
