Text classification with TensorFlow Hub: Movie reviews

This notebook classifies movie reviews as positive or negative using the text of the review. This is an example of binary—or two-class—classification, an important and widely applicable kind of machine learning problem.

The tutorial demonstrates the basic application of transfer learning with TensorFlow Hub and Keras.

It uses the IMDB dataset that contains the text of 50,000 movie reviews from the Internet Movie Database. These are split into 25,000 reviews for training and 25,000 reviews for testing. The training and testing sets are balanced, meaning they contain an equal number of positive and negative reviews.

This notebook uses tf.keras, a high-level API to build and train models in TensorFlow, and tensorflow_hub, a library for loading trained models from TFHub in a single line of code. For a more advanced text classification tutorial using tf.keras, see the MLCC Text Classification Guide.

import os
import numpy as np

import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_datasets as tfds

print("Version: ", tf.__version__)
print("Eager mode: ", tf.executing_eagerly())
print("Hub version: ", hub.__version__)
print("GPU is", "available" if tf.config.list_physical_devices("GPU") else "NOT AVAILABLE")

Download the IMDB dataset

The IMDB dataset is available as imdb_reviews on TensorFlow Datasets. The following code downloads the IMDB dataset to your machine (or the Colab runtime):

# Split the training set into 60% and 40% to end up with 15,000 examples
# for training, 10,000 examples for validation and 25,000 examples for testing.
train_data, validation_data, test_data = tfds.load(
    name="imdb_reviews", 
    split=('train[:60%]', 'train[60%:]', 'test'),
    as_supervised=True)
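As a quick optional sanity check, you can confirm the 15,000/10,000/25,000 split. This is a minimal sketch: recent versions of TFDS record the split sizes, so the cardinality is statically known; if it isn't, these calls return tf.data.UNKNOWN_CARDINALITY instead of a count.

# Print the number of examples in each split (if statically known).
print(train_data.cardinality().numpy())       # 15000
print(validation_data.cardinality().numpy())  # 10000
print(test_data.cardinality().numpy())        # 25000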

Explore the data

Let's take a moment to understand the format of the data. Each example is a sentence representing the movie review and a corresponding label. The sentence is not preprocessed in any way. The label is an integer value of either 0 or 1, where 0 is a negative review, and 1 is a positive review.
Let's print the first 10 examples:

train_examples_batch, train_labels_batch = next(iter(train_data.batch(10)))
train_examples_batch
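Let's also print the first 10 labels:

train_labels_batch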

Build the model

The neural network is created by stacking layers—this requires three main architectural decisions:

  • How to represent the text?
  • How many layers to use in the model?
  • How many hidden units to use for each layer?

In this example, the input data consists of sentences. The labels to predict are either 0 or 1.

One way to represent the text is to convert sentences into embedding vectors. Use a pre-trained text embedding as the first layer, which will have three advantages:

  • You don't have to worry about text preprocessing.
  • You can benefit from transfer learning.
  • The embedding has a fixed size, so it's simpler to process.

For this example you use a pre-trained text embedding model from TensorFlow Hub called google/nnlm-en-dim50/2.

There are many other pre-trained text embeddings from TFHub that could be used in this tutorial; you can find more text embedding models on TFHub.

Let's first create a Keras layer that uses a TensorFlow Hub model to embed the sentences, and try it out on a couple of input examples. Note that no matter the length of the input text, the output shape of the embeddings is: (num_examples, embedding_dimension).

embedding = "https://tfhub.dev/google/nnlm-en-dim50/2"
hub_layer = hub.KerasLayer(embedding, input_shape=[], 
                           dtype=tf.string, trainable=True)
hub_layer(train_examples_batch[:3])

Let's now build the full model:

model = tf.keras.Sequential()
model.add(hub_layer)
model.add(tf.keras.layers.Dense(16, activation='relu'))
model.add(tf.keras.layers.Dense(1))

model.summary()

The layers are stacked sequentially to build the classifier:

  1. The first layer is a TensorFlow Hub layer. This layer uses a pre-trained Saved Model to map a sentence into its embedding vector. The pre-trained text embedding model that you are using (google/nnlm-en-dim50/2) splits the sentence into tokens, embeds each token and then combines the embedding. The resulting dimensions are: (num_examples, embedding_dimension). For this NNLM model, the embedding_dimension is 50.
  2. This fixed-length output vector is piped through a fully-connected (Dense) layer with 16 hidden units.
  3. The last layer is densely connected with a single output node; it produces a single raw logit (a probability conversion is sketched below).
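Since there is no activation on that final layer, its output is a raw logit rather than a probability. As a quick sketch (not required for training), you can map logits to probabilities with a sigmoid:

# The model outputs one logit per example; sigmoid maps it into [0, 1].
logits = model(train_examples_batch[:3])
probabilities = tf.sigmoid(logits)
print(probabilities)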

Let's compile the model.

Loss function and optimizer

A model needs a loss function and an optimizer for training. Since this is a binary classification problem and the model outputs logits (a single-unit layer with a linear activation), you'll use the binary_crossentropy loss function.

This isn't the only choice for a loss function; you could, for instance, choose mean_squared_error. But, generally, binary_crossentropy is better for dealing with probabilities—it measures the "distance" between probability distributions, or in our case, between the ground-truth distribution and the predictions.
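To make the from_logits behavior concrete, here is a small illustrative check (the printed values are approximate): a confident correct prediction yields a small loss, while a confident wrong prediction is penalized heavily.

bce = tf.keras.losses.BinaryCrossentropy(from_logits=True)
# True label 1 with a strongly positive logit: small loss.
print(bce([[1.0]], [[3.0]]).numpy())   # ~0.049
# True label 1 with a strongly negative logit: large loss.
print(bce([[1.0]], [[-3.0]]).numpy())  # ~3.049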

Later, when you are exploring regression problems (say, to predict the price of a house), you'll see how to use another loss function called mean squared error.

Now, configure the model to use an optimizer and a loss function:

model.compile(optimizer='adam',
              loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
              metrics=['accuracy'])

Train the model

Train the model for 10 epochs in mini-batches of 512 samples. That is 10 iterations over all the samples in the training data. While training, monitor the model's loss and accuracy on the 10,000 samples from the validation set:

history = model.fit(train_data.shuffle(10000).batch(512),
                    epochs=10,
                    validation_data=validation_data.batch(512),
                    verbose=1)
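The History object returned by model.fit records the loss and accuracy for each epoch. If you want to visualize the training curves, a minimal sketch (assuming matplotlib is available) is:

import matplotlib.pyplot as plt

# history.history holds per-epoch values for 'loss', 'accuracy',
# 'val_loss', and 'val_accuracy'.
history_dict = history.history
plt.plot(history_dict['loss'], label='training loss')
plt.plot(history_dict['val_loss'], label='validation loss')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.legend()
plt.show()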

Evaluate the model

Let's see how the model performs. Two values will be returned: loss (a number that represents the error; lower values are better) and accuracy.

results = model.evaluate(test_data.batch(512), verbose=2)

for name, value in zip(model.metrics_names, results):
  print("%s: %.3f" % (name, value))