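Hands-On Lab: Natural Language Analysis with TensorFlow and TensorBoard

In this article you will see how to use TensorFlow, a powerful tool, in a natural language processing project, and get a first look at some basic TensorBoard usage along the way. I will explain parts of the code as we go; the more basic material is left for you to fill in on your own.

This project uses data from Kaggle (https://www.kaggle.com/c/word2vec-nlp-tutorial). The dataset includes 25,000 labeled movie reviews and 50,000 unlabeled training reviews.

Here are the Python packages we will use: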

import pandas as pd
import numpy as np
import tensorflow as tf
import nltk, re, time
from nltk.corpus import stopwords
from collections import defaultdict
from tqdm import tqdm
from sklearn.model_selection import train_test_split
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from collections import namedtuple

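You may not be familiar with some of these packages, but that is fine: I will explain how they are used, and you can also find material about them online. The data comes as .tsv files, so we first load it and specify the tab delimiter.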

train = pd.read_csv("labeledTrainData.tsv", delimiter="\t")
test = pd.read_csv("testData.tsv", delimiter="\t")

Here is a sample of the data:

# Here's the first review as an example
With all this stuff going down at the moment with MJ i've started listening to his music, watching the odd documentary here and there, watched The Wiz and watched Moonwalker again. Maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent. Moonwalker is part biography, part feature film which i remember going to see at the cinema when it was originally released. Some of it has subtle messages about MJ's feeling towards the press and also the obvious message of drugs are bad m'kay.<br /><br />Visually impressive but of course this is all about Michael Jackson so unless you remotely like MJ in anyway then you are going to hate this and find it boring. Some may call MJ an egotist for consenting to the making of this movie BUT MJ and most of his fans would say that he made it for the fans which if true is really nice of him.<br /><br />The actual feature film bit when it finally starts is only on for 20 minutes or so excluding the Smooth Criminal sequence and Joe Pesci is convincing as a psychopathic all powerful drug lord. Why he wants MJ dead so bad is beyond me. Because MJ overheard his plans? Nah, Joe Pesci's character ranted that he wanted people to know it is he who is supplying drugs etc so i dunno, maybe he just hates MJ's music.<br /><br />Lots of cool things in this like MJ turning into a car and a robot and the whole Speed Demon sequence. Also, the director must have had the patience of a saint when it came to filming the kiddy Bad sequence as usually directors hate working with one kid let alone a whole bunch of them performing a complex dance scene.<br /><br />Bottom line, this movie is for people who like MJ on one level or another (which i think is most people). If not, then stay away. It does try and give off a wholesome message and ironically MJ's bestest buddy in this movie is a girl! 
Michael Jackson is truly one of the most talented people ever to grace this planet but is he guilty? Well, with all the attention i've gave this subject....hmmm well i don't know because people can be different behind closed doors, i know this for a fact. He is either an extremely nice but stupid guy or one of the most sickest liars. I hope he is not the latter.

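To improve performance, we need to preprocess the raw data. For example, the <br /> tags in the text contribute nothing to training, so we need to clean out this kind of noise.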

def clean_text(text, remove_stopwords=True):
    '''Clean the text, with the option to remove stopwords'''
    
    # Convert words to lower case and split them
    text = text.lower().split()

    # Optionally, remove stop words
    if remove_stopwords:
        stops = set(stopwords.words("english"))
        text = [w for w in text if not w in stops]
    
    text = " ".join(text)

    # Clean the text
    text = re.sub(r"<br />", " ", text)
    text = re.sub(r"[^a-z]", " ", text)
    text = re.sub(r"   ", " ", text) # Remove any extra spaces
    text = re.sub(r"  ", " ", text)
    
    # Return the cleaned text as a single string
    return text

The cleaning falls into two parts: stop words and regular expressions.

# stop words
if remove_stopwords:
    stops = set(stopwords.words("english"))
    text = [w for w in text if not w in stops]

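Stop words are words that carry little real meaning (a, the, just, and so on). They are also noise in the data, and we clean them out before training. The stop-word list I used is NLTK's English list, loaded with stopwords.words("english").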

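Note that the stop-word list may need adjusting from project to project. The list used here includes some pronouns, for example, which may be meaningful in another project and should not simply be treated as stop words.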
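As a rough illustration of adjusting the list, the sketch below removes a few pronouns from a stop-word set before filtering. The small base_stops set is a hand-written stand-in for NLTK's English list, and the keep set is a hypothetical choice, not something taken from the original project.

```python
# A tiny hand-written set stands in for stopwords.words("english") here,
# so the example runs without downloading the NLTK corpus.
base_stops = {"a", "the", "just", "he", "she", "is", "was"}

# Hypothetical adjustment: treat these pronouns as meaningful words.
keep = {"he", "she"}
stops = base_stops - keep

words = "she was just the best".split()
filtered = [w for w in words if w not in stops]
print(filtered)  # ['she', 'best']
```

With the real NLTK list, the same set subtraction applies to set(stopwords.words("english")).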

# re
text = re.sub(r"<br />", " ", text)
text = re.sub(r"[^a-z]", " ", text)
text = re.sub(r"   ", " ", text) # Remove any extra spaces
text = re.sub(r"  ", " ", text)

re is Python's regular expression module. A regular expression is a compact pattern that describes a set of strings.
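To see the substitutions in action on a short string, here is a minimal sketch. The helper name strip_markup is hypothetical, and it collapses whitespace with a single \s+ pattern, a small variation on the two fixed-width patterns used in clean_text.

```python
import re

def strip_markup(text):
    """Lowercase, drop <br /> tags, keep only letters, collapse whitespace."""
    text = text.lower()
    text = re.sub(r"<br\s*/?>", " ", text)  # remove HTML line breaks
    text = re.sub(r"[^a-z]", " ", text)     # replace anything that is not a-z
    text = re.sub(r"\s+", " ", text)        # collapse runs of whitespace
    return text.strip()

sample = "Visually impressive!<br /><br />Joe Pesci is convincing."
cleaned = strip_markup(sample)
print(cleaned)  # visually impressive joe pesci is convincing
```

The train_clean and test_clean lists used in the tokenizing step are presumably built by applying clean_text to every review in the same way; the article does not show that step.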

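In clean_text, the first substitution replaces each <br /> tag with a space, which removes the tag from the text. The next step is to tokenize the words in the reviews.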

# Tokenize the reviews
all_reviews = train_clean + test_clean
tokenizer = Tokenizer()
tokenizer.fit_on_texts(all_reviews)
print("Fitting is complete.")

train_seq = tokenizer.texts_to_sequences(train_clean)
print("train_seq is complete.")

test_seq = tokenizer.texts_to_sequences(test_clean)
print("test_seq is complete")

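Tokenizing converts each word into a unique number. For example, the sentence ["The", "cat", "went", "to", "the", "zoo", "."] would be tokenized as [1, 2, 3, 4, 1, 5, 6].

There are several ways to tokenize text; I prefer Keras's approach.

The vocabulary in this project is not very large: 99,426 words in total. The line below retrieves the word-to-index mapping, and its length gives that number.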
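The mapping in that example can be sketched in a few lines of plain Python. This is only an illustration that numbers words in order of first appearance; the Keras Tokenizer instead ranks words by frequency and strips punctuation by default, so its ids will differ.

```python
def tokenize(words):
    """Assign each distinct lowercased word an id, in order of first appearance."""
    word_index = {}
    sequence = []
    for w in words:
        w = w.lower()
        if w not in word_index:
            # Ids start at 1, leaving 0 free for padding
            word_index[w] = len(word_index) + 1
        sequence.append(word_index[w])
    return sequence, word_index

sequence, demo_index = tokenize(["The", "cat", "went", "to", "the", "zoo", "."])
print(sequence)  # [1, 2, 3, 4, 1, 5, 6]
```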

word_index = tokenizer.word_index

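You can also do a first pass of filtering on the vocabulary. For example, keeping only the more common words brings the vocabulary down to about 80,000; going a step further, you could keep only the words that appear at least 5 times. This kind of trimming makes the program faster. For example, 'Goldfinger' (the classic James Bond film) appears only once in the data, and a word like that does little to gauge a review's sentiment. Common words such as 'good' or 'bad', by contrast, say much more about sentiment and so help the model predict.

Here is a review after tokenization.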
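One way to apply a minimum-count filter is sketched below with collections.Counter. The three toy reviews and the threshold of 2 are made up for illustration; the article's own suggestion is a threshold of 5 on the full dataset.

```python
from collections import Counter

# Toy data standing in for the cleaned reviews
reviews = ["good movie good cast", "bad movie", "goldfinger was good"]

counts = Counter(w for r in reviews for w in r.split())

min_count = 2  # illustrative threshold
vocab = {w for w, c in counts.items() if c >= min_count}

# Drop every word that falls below the threshold
filtered = [[w for w in r.split() if w in vocab] for r in reviews]
print(sorted(vocab))  # ['good', 'movie']
print(filtered)       # [['good', 'movie', 'good'], ['movie'], ['good']]
```

Recent versions of the Keras Tokenizer also accept a num_words argument that keeps only the most frequent words when converting texts to sequences.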

[445, 86, 489, 10939, 8, 61, 583, 2603, 120, 68, 957, 560, 53, 212, 24485, 212, 17247, 219, 193, 97, 20, 695, 2565, 124, 109, 15, 520, 3954, 193, 27, 246, 654, 2352, 1261, 17247, 90, 4782, 90, 712, 3, 305, 86, 16, 358, 1846, 542, 1219, 3592, 10939, 1, 485, 871, 3538, 23, 526, 673, 1414, 19, 63, 5305, 2089, 1118, 185, 413, 1523, 817, 2583, 7, 10939, 477, 86, 665, 85, 272, 114, 578, 10939, 34480, 29662, 148, 2, 10939, 381, 13, 59, 26, 381, 210, 15, 252, 178, 10, 751, 712, 3, 142, 341, 464, 145, 16427, 4121, 1718, 635, 876, 10547, 1018, 12089, 890, 1067, 1652, 416, 10939, 265, 19, 596, 141, 10939, 18336, 2302, 15821, 876, 10547, 1, 34, 38190, 388, 21, 49, 17539, 1414, 434, 9821, 193, 4238, 10939, 1, 120, 669, 520, 96, 7, 10939, 1555, 444, 2271, 138, 2137, 2383, 635, 23, 72, 117, 4750, 5364, 307, 1326, 31136, 19, 635, 556, 888, 665, 697, 6, 452, 195, 547, 138, 689, 3386, 1234, 790, 56, 1239, 268, 2, 21, 7, 10939, 6, 580, 78, 476, 32, 21, 245, 706, 158, 276, 113, 7674, 673, 3526, 10939, 1, 37925, 1690, 2, 159, 413, 1523, 294, 6, 956, 21, 51, 1500, 1226, 2352, 17, 612, 8, 61, 442, 724, 7184, 17, 25, 4, 49, 21, 199, 443, 3912, 3484, 49, 110, 270, 495, 252, 289, 124, 6, 19622, 19910, 363, 1502]

Next, we adjust the reviews so that every one has the same length.

max_review_length = 200

train_pad = pad_sequences(train_seq, maxlen = max_review_length)
print("train_pad is complete.")

test_pad = pad_sequences(test_seq, maxlen = max_review_length)
print("test_pad is complete.")

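Using more of each review as training data can improve accuracy. Here I set the review length to 200 in order to speed up training.

Before choosing this value, check the length of each review to find a suitable setting. I used numpy's percentile function to decide on the number.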

np.percentile(lengths.counts, 80)
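The lengths object is not defined in the snippets shown here. A minimal sketch of the same idea, assuming a plain list of token sequences (the five toy sequences below are made up), could look like this:

```python
import numpy as np

# Toy token sequences of different lengths, standing in for train_seq
seqs = [[1] * 50, [1] * 120, [1] * 180, [1] * 260, [1] * 400]

lengths = np.array([len(s) for s in seqs])
p80 = np.percentile(lengths, 80)

# Fraction of sequences that fit entirely within the chosen length
coverage = np.mean(lengths <= p80)
print(p80, coverage)
```

The value returned for the 80th percentile is a candidate for max_review_length: about 80% of the sequences are no longer than it.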

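With the maximum length set to 200, roughly 80% of the reviews keep all of their words. Reviews longer than 200 tokens have the excess cut off, and shorter ones are filled with padding tokens. How to pad and truncate more effectively is worth thinking about, and you may come up with better approaches of your own.

At this point we can split the data into training and validation sets.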
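The effect of padding and truncating can be sketched in plain Python. The helper pad_or_truncate is hypothetical; it mimics the default behaviour of Keras's pad_sequences, which pads with zeros at the front and, for long sequences, drops tokens from the front.

```python
def pad_or_truncate(seq, maxlen, value=0):
    """Return seq with exactly maxlen items (maxlen must be positive)."""
    seq = list(seq)[-maxlen:]                    # keep the last maxlen tokens
    return [value] * (maxlen - len(seq)) + seq   # pad at the front

short_review = pad_or_truncate([5, 6, 7], 5)
long_review = pad_or_truncate([1, 2, 3, 4, 5, 6, 7], 5)
print(short_review)  # [0, 0, 5, 6, 7]
print(long_review)   # [3, 4, 5, 6, 7]
```

Truncating from the front keeps the end of a review; whether the beginning or the end carries more sentiment is one of the choices worth experimenting with.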

x_train, x_valid, y_train, y_valid = train_test_split(train_pad, train.sentiment, test_size = 0.15, random_state = 2)

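I usually split data into training, validation, and testing sets. Because this data comes from a Kaggle competition, which already provides testing data, we can use that as the testing set directly. Before building the model, we need a couple of functions to create batches.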

def get_batches(x, y, batch_size):
    '''Create the batches for the training and validation data'''
    n_batches = len(x)//batch_size
    x, y = x[:n_batches*batch_size], y[:n_batches*batch_size]
    for ii in range(0, len(x), batch_size):
        yield x[ii:ii+batch_size], y[ii:ii+batch_size]


def get_test_batches(x, batch_size):
    '''Create the batches for the testing data'''
    n_batches = len(x)//batch_size
    x = x[:n_batches*batch_size]
    for ii in range(0, len(x), batch_size):
        yield x[ii:ii+batch_size]

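These functions split the data into batches of equal size. Note that if the last batch has fewer than batch_size items, it is dropped. batch_size should therefore divide the length of your dataset evenly; otherwise you may run into problems later when submitting predictions. Now let's build the recurrent neural network step by step.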
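The dropped final batch is easy to see on a small array. The sketch below repeats get_batches from above so that it runs on its own, and uses made-up data of length 10 with a batch size of 4.

```python
import numpy as np

def get_batches(x, y, batch_size):
    '''Create the batches for the training and validation data'''
    n_batches = len(x) // batch_size
    x, y = x[:n_batches * batch_size], y[:n_batches * batch_size]
    for ii in range(0, len(x), batch_size):
        yield x[ii:ii + batch_size], y[ii:ii + batch_size]

x = np.arange(10)
y = np.arange(10) % 2

sizes = [len(bx) for bx, _ in get_batches(x, y, 4)]
covered = sum(sizes)
print(sizes, covered)  # [4, 4] 8
```

Two of the ten items never reach the model, which is why the batch size should divide the dataset length.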

def build_rnn(n_words, embed_size, batch_size, lstm_size, num_layers, dropout, learning_rate, multiple_fc, fc_units):
    '''Build the Recurrent Neural Network'''

    tf.reset_default_graph()

    # Declare placeholders we'll feed into the graph
    with tf.name_scope('inputs'):
        inputs = tf.placeholder(tf.int32, [None, None], name='inputs')

    with tf.name_scope('labels'):
        labels = tf.placeholder(tf.int32, [None, None], name='labels')

    keep_prob = tf.placeholder(tf.float32, name='keep_prob')

    # Create the embeddings
    with tf.name_scope("embeddings"):
        embedding = tf.Variable(tf.random_uniform((n_words, 
                                    embed_size), -1, 1))
        embed = tf.nn.embedding_lookup(embedding, inputs)

    # Build the RNN layers
    with tf.name_scope("RNN_layers"):
        lstm = tf.contrib.rnn.BasicLSTMCell(lstm_size)
        drop = tf.contrib.rnn.DropoutWrapper(lstm, 
                                         output_keep_prob=keep_prob)
        cell = tf.contrib.rnn.MultiRNNCell([drop] * num_layers)
    
    # Set the initial state
    with tf.name_scope("RNN_init_state"):
        initial_state = cell.zero_state(batch_size, tf.float32)

    # Run the data through the RNN layers
    with tf.name_scope("RNN_forward"):
        outputs, final_state = tf.nn.dynamic_rnn(
                                        cell,         
                                        embed,
                                        initial_state=initial_state)    
    
    # Create the fully connected layers
    with tf.name_scope("fully_connected"):
        
        # Initialize the weights and biases
        weights = tf.truncated_normal_initializer(stddev=0.1)
        biases = tf.zeros_initializer()
        
        dense = tf.contrib.layers.fully_connected(outputs[:, -1],
                    num_outputs = fc_units,
                    activation_fn = tf.sigmoid,
                    weights_initializer = weights,
                    biases_initializer = biases)
        
        dense = tf.contrib.layers.dropout(dense, keep_prob)
        
        # Depending on the iteration, use a second fully connected layer
        if multiple_fc == True:
            dense = tf.contrib.layers.fully_connected(dense,
                        num_outputs = fc_units,
                        activation_fn = tf.sigmoid,
                        weights_initializer = weights,
                        biases_initializer = biases)
            
            dense = tf.contrib.layers.dropout(dense, keep_prob)
    
    # Make the predictions
    with tf.name_scope('predictions'):
        predictions = tf.contrib.layers.fully_connected(dense, 
                          num_outputs = 1, 
                          activation_fn=tf.sigmoid,
                          weights_initializer = weights,
                          biases_initializer = biases)
        
        tf.summary.histogram('predictions', predictions)
    
    # Calculate the cost
    with tf.name_scope('cost'):
        cost = tf.losses.mean_squared_error(labels, predictions)
        tf.summary.scalar('cost', cost)
    
    # Train the model
    with tf.name_scope('train'):    
        optimizer = tf.train.AdamOptimizer(learning_rate).minimize(cost)

    # Determine the accuracy
    with tf.name_scope("accuracy"):
        correct_pred = tf.equal(tf.cast(tf.round(predictions), 
                                        tf.int32), 
                                        labels)
        accuracy = tf.reduce_mean(tf.cast(correct_pred, tf.float32))
        tf.summary.scalar('accuracy', accuracy)
    
    # Merge all of the summaries
    merged = tf.summary.merge_all()    

    # Export the nodes 
    export_nodes = ['inputs', 'labels', 'keep_prob','initial_state',        
                    'final_state','accuracy', 'predictions', 'cost', 
                    'optimizer', 'merged']
    Graph = namedtuple('Graph', export_nodes)
    local_dict = locals()
    graph = Graph(*[local_dict[each] for each in export_nodes])
    
    return graph

If you have not used TensorBoard before, I suggest looking up Siraj Raval's tutorial videos online.

tf.reset_default_graph()

Before building and training a model, we reset the default graph so that each model starts from a clean graph.

# Declare placeholders we'll feed into the graph
with tf.name_scope('inputs'):
    inputs = tf.placeholder(tf.int32, [None, None], name='inputs')

with tf.name_scope('labels'):
    labels = tf.placeholder(tf.int32, [None, None], name='labels')

keep_prob = tf.placeholder(tf.float32, name='keep_prob')

These are the placeholders for our data. tf.name_scope() labels specific parts of the graph, which makes them easy to identify once the graph is visualized in TensorBoard.

# Create the embeddings
with tf.name_scope("embeddings"):
    embedding = tf.Variable(tf.random_uniform((n_words, embed_size), -1, 1))
    embed = tf.nn.embedding_lookup(embedding, inputs)

The embedding turns each word in the vocabulary into a vector, and embed_size is the dimension of that vector. (There is a more detailed discussion here: https://www.quora.com/What-does-the-word-embedding-mean-in-the-context-of-Machine-Learning)
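What the lookup does to the shape of the data can be shown with NumPy alone. This is an illustration rather than the TensorFlow code path, and the sizes are deliberately tiny.

```python
import numpy as np

rng = np.random.default_rng(0)
n_words, embed_size = 10, 4

# One row per word id, initialized uniformly in [-1, 1)
embedding = rng.uniform(-1, 1, size=(n_words, embed_size))

# A batch of 2 padded reviews, 3 token ids each (0 is the padding id)
inputs = np.array([[1, 2, 3],
                   [4, 0, 0]])

# Integer indexing plays the role of tf.nn.embedding_lookup
embed = embedding[inputs]
print(embed.shape)  # (2, 3, 4)
```

Each token id is replaced by its row of the embedding matrix, so a batch of shape (batch, length) becomes (batch, length, embed_size). Every id in the batch must be a valid row index.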

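Although I used a random uniform distribution here, there are many other ways to initialize the embeddings. A truncated normal distribution with a small standard deviation is also a good choice; the code would look like this:

embedding = tf.Variable(tf.truncated_normal((n_words, embed_size), mean=0.0, stddev=0.1))

Try it out for yourself.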

# Build the RNN layers
with tf.name_scope("RNN_layers"):
    lstm = tf.contrib.rnn.BasicLSTMCell(lstm_size)
    drop = tf.contrib.rnn.DropoutWrapper(lstm, 
                                         output_keep_prob=keep_prob)
    cell = tf.contrib.rnn.MultiRNNCell([drop] * num_layers)

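The code above is the core of the recurrent neural network. With the default hyperparameters listed later in this article, it is a two-layer network with 50% dropout.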

# Set the initial state
with tf.name_scope("RNN_init_state"):
    initial_state = cell.zero_state(batch_size, tf.float32)

This step sets the initial state of the RNN cells to zeros.

# Run the data through the RNN layers
with tf.name_scope("RNN_forward"):
    outputs, final_state = tf.nn.dynamic_rnn(cell, embed,
                                        initial_state=initial_state)

This is the forward pass of the model. tf.nn.dynamic_rnn can handle batches whose maximum sequence length differs from one batch to the next. For details on this function, see: https://www.tensorflow.org/versions/master/api_docs/python/tf/nn/dynamic_rnn

# Create the fully connected layers
with tf.name_scope("fully_connected"):
        
    # Initialize the weights and biases
    weights = tf.truncated_normal_initializer(stddev=0.1)
    biases = tf.zeros_initializer()
        
    dense = tf.contrib.layers.fully_connected(outputs[:, -1],
                num_outputs = fc_units,
                activation_fn = tf.sigmoid,
                weights_initializer = weights,
                biases_initializer = biases)
        
    dense = tf.contrib.layers.dropout(dense, keep_prob)
        
    # Depending on the iteration, use a second fully connected layer
    if multiple_fc == True:
        dense = tf.contrib.layers.fully_connected(dense,
                    num_outputs = fc_units,
                    activation_fn = tf.sigmoid,
                    weights_initializer = weights,
                    biases_initializer = biases)
            
        dense = tf.contrib.layers.dropout(dense, keep_prob)

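This step adds the first fully connected layer and, optionally, a second one. The weights are initialized from a truncated normal distribution, like the alternative mentioned above for the embeddings, and the biases are initialized to zero. multiple_fc is a parameter of the function that lets us test different architectures for the model. In the same way you can test other choices, such as how the weights and biases are initialized, or whether to use LSTM (Long Short-Term Memory) or GRU (Gated Recurrent Unit) cells.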

# Make the predictions
with tf.name_scope('predictions'):
    predictions = tf.contrib.layers.fully_connected(dense, 
                      num_outputs = 1, 
                      activation_fn = tf.sigmoid,
                      weights_initializer = weights,
                      biases_initializer = biases)
        
    tf.summary.histogram('predictions', predictions)

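There is a single output, the sentiment of the review, between 0 and 1. The sigmoid function maps the output of the final fully connected layer into that range. tf.summary.histogram() records the predictions and displays them as a histogram in TensorBoard, so we can see clearly how the distribution of predictions changes during training and how the training and validation sets compare.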

# Calculate the cost
with tf.name_scope('cost'):
    cost = tf.losses.mean_squared_error(labels, predictions)
    tf.summary.scalar('cost', cost)

This computes the cost during training. We record it with tf.summary.scalar() because the cost is a single scalar value rather than a range of values.
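For intuition, the mean squared error can be worked out by hand in NumPy. The three predictions and labels below are made up for the example.

```python
import numpy as np

# Sigmoid outputs for three reviews and their 0/1 sentiment labels
predictions = np.array([[0.9], [0.2], [0.6]])
labels = np.array([[1], [0], [0]])

# Squared errors are 0.01, 0.04 and 0.36; the cost is their mean
cost = np.mean((labels - predictions) ** 2)
print(round(float(cost), 4))  # 0.1367
```

A confident wrong prediction (0.6 for a negative review) dominates the cost, which is the behaviour the optimizer works to reduce.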

# Train the model
with tf.name_scope('train'):    
    optimizer = tf.train.AdamOptimizer(learning_rate).minimize(cost)

Adam is a general-purpose optimizer that makes training the model more efficient. You can also use other algorithms.

# Determine the accuracy
with tf.name_scope("accuracy"):
    correct_pred = tf.equal(tf.cast(tf.round(predictions), 
                                        tf.int32), 
                                        labels)
    accuracy = tf.reduce_mean(tf.cast(correct_pred, tf.float32))
    tf.summary.scalar('accuracy', accuracy)

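The predictions are numbers between 0 and 1, so we round them before comparing them with the labels. tf.reduce_mean() then averages the results of that comparison, which gives the fraction of correct predictions.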
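The same rounding and averaging can be reproduced in NumPy, again with made-up values:

```python
import numpy as np

predictions = np.array([[0.9], [0.2], [0.6]])
labels = np.array([[1], [0], [0]])

# Round to 0 or 1, then compare with the labels element by element
correct_pred = np.round(predictions).astype(np.int32) == labels

# The mean of the 0/1 results is the fraction of correct predictions
accuracy = np.mean(correct_pred.astype(np.float32))
print(correct_pred.ravel().tolist())  # [True, True, False]
```

Two of the three predictions match their labels, so the accuracy is two thirds.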

# Merge all of the summaries
merged = tf.summary.merge_all()

Here we merge the summaries, which simplifies saving them.

# Export the nodes 
export_nodes = ['inputs', 'labels', 'keep_prob','initial_state',        
                'final_state','accuracy', 'predictions', 'cost', 
                'optimizer', 'merged']
Graph = namedtuple('Graph', export_nodes)
local_dict = locals()
graph = Graph(*[local_dict[each] for each in export_nodes])

This exports all the nodes so the training function can use them. The recurrent neural network is now complete, and we can move on to training it.
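The export pattern is independent of TensorFlow and can be tried in isolation. In the sketch below the function build and its string values are placeholders for the real graph nodes.

```python
from collections import namedtuple

def build():
    # Strings stand in for TensorFlow nodes in this illustration
    inputs = "inputs-node"
    cost = "cost-node"
    scratch = "not exported"

    export_nodes = ['inputs', 'cost']
    Graph = namedtuple('Graph', export_nodes)
    local_dict = locals()  # snapshot of the function's local variables
    return Graph(*[local_dict[each] for each in export_nodes])

graph = build()
print(graph.cost)  # cost-node
```

Only the names listed in export_nodes become attributes of the returned tuple, which is what lets the training function refer to model.cost, model.inputs, and so on.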

def train(model, epochs, log_string):
    '''Train the RNN'''

    saver = tf.train.Saver()
    
    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())

        # Used to determine when to stop the training early
        valid_loss_summary = []
        stop_early = 0
        
        # Keep track of which batch iteration is being trained
        iteration = 0

        print()
        print("Training Model: {}".format(log_string))

        train_writer = tf.summary.FileWriter('./logs/3/train/{}'.format(log_string), sess.graph)
        valid_writer = tf.summary.FileWriter('./logs/3/valid/{}'.format(log_string))

        for e in range(epochs):
            state = sess.run(model.initial_state)
            
            # Record progress with each epoch
            train_loss = []
            train_acc = []
            val_acc = []
            val_loss = []

            with tqdm(total=len(x_train)) as pbar:
                for x, y in get_batches(x_train, y_train, batch_size):
                    feed = {model.inputs: x,
                            model.labels: y[:, None],
                            model.keep_prob: dropout,
                            model.initial_state: state}
                    summary, loss, acc, state, _ = sess.run(
                        [model.merged,
                         model.cost,
                         model.accuracy,
                         model.final_state,
                         model.optimizer],
                        feed_dict=feed)

                    # Record the loss and accuracy of each training batch
                    
                    train_loss.append(loss)
                    train_acc.append(acc)
                    
                    # Record the progress of training
                    train_writer.add_summary(summary, iteration)
                    
                    iteration += 1
                    pbar.update(batch_size)
            
            # Average the training loss and accuracy of each epoch
            avg_train_loss = np.mean(train_loss)
            avg_train_acc = np.mean(train_acc) 

            val_state = sess.run(model.initial_state)
            with tqdm(total=len(x_valid)) as pbar:
                for x, y in get_batches(x_valid,y_valid,batch_size):
                    feed = {model.inputs: x,
                            model.labels: y[:, None],
                            model.keep_prob: 1,
                            model.initial_state: val_state}
                    summary, batch_loss, batch_acc, val_state = sess.run(
                        [model.merged,
                         model.cost,
                         model.accuracy,
                         model.final_state],
                        feed_dict=feed)

                    # Record the validation loss and accuracy of each epoch
                    
                    val_loss.append(batch_loss)
                    val_acc.append(batch_acc)
                    pbar.update(batch_size)
            
            # Average the validation loss and accuracy of each epoch
            avg_valid_loss = np.mean(val_loss)    
            avg_valid_acc = np.mean(val_acc)
            valid_loss_summary.append(avg_valid_loss)
            
            # Record the validation data's progress
            valid_writer.add_summary(summary, iteration)

            # Print the progress of each epoch
            print("Epoch: {}/{}".format(e, epochs),
                  "Train Loss: {:.3f}".format(avg_train_loss),
                  "Train Acc: {:.3f}".format(avg_train_acc),
                  "Valid Loss: {:.3f}".format(avg_valid_loss),
                  "Valid Acc: {:.3f}".format(avg_valid_acc))

            # Stop training if the validation loss does not decrease after 3 epochs
            
            if avg_valid_loss > min(valid_loss_summary):
                print("No Improvement.")
                stop_early += 1
                if stop_early == 3:
                    break   
            
            # Reset stop_early if the validation loss finds a new low
            # Save a checkpoint of the model
            else:
                print("New Record!")
                stop_early = 0
                checkpoint ="./sentiment_{}.ckpt".format(log_string)
                saver.save(sess, checkpoint)

Let's go through the code above step by step.

saver = tf.train.Saver()

This is used to save checkpoints.

train_writer = tf.summary.FileWriter('./logs/3/train/{}'.format(log_string), sess.graph)
valid_writer = tf.summary.FileWriter('./logs/3/valid/{}'.format(log_string))

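This code writes the training summaries to disk as logs. I recommend keeping them under the same logs folder, with the training and validation summaries in separate subfolders, which makes them easier to compare in TensorBoard.

tqdm() (https://pypi.python.org/pypi/tqdm) tracks the progress of each loop, so I can see roughly how much time remains in each epoch.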

# Record the validation data's progress
valid_writer.add_summary(summary, iteration)

The code above records a summary of each validation pass. It is not required, but it provides a good deal of useful information.

# Reset stop_early if the validation loss finds a new low
# Save a checkpoint of the model
else:
    print("New Record!")
    stop_early = 0
    checkpoint = "./sentiment_{}.ckpt".format(log_string)
    saver.save(sess, checkpoint)

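I strongly recommend saving checkpoints as early as possible, which can save a great deal of training time. Decide how many checkpoints you need based on your own situation. Here I save only the best iteration of the model, which saves some disk space; adjust this to suit your project.

Here are the default hyperparameters I used: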
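The early-stopping and checkpointing logic can be followed without TensorFlow. The sketch below is a simplified stand-in for the loop in train(): the function name, the patience parameter, and the list of validation losses are all made up for illustration.

```python
def run_early_stopping(valid_losses, patience=3):
    """Return (last epoch run, epochs at which a checkpoint would be saved)."""
    best = float("inf")
    stop_early = 0
    saved = []
    for epoch, loss in enumerate(valid_losses):
        if loss <= best:
            # New low (or a tie): reset the counter and save a checkpoint
            best = loss
            stop_early = 0
            saved.append(epoch)
        else:
            stop_early += 1
            if stop_early == patience:
                return epoch, saved
    return len(valid_losses) - 1, saved

losses = [0.30, 0.25, 0.26, 0.24, 0.27, 0.28, 0.29, 0.20]
last_epoch, saved = run_early_stopping(losses)
print(last_epoch, saved)  # 6 [0, 1, 3]
```

Training stops after three epochs in a row without a new low, so the final loss of 0.20 is never reached; the checkpoint kept on disk is the one from epoch 3.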

n_words = len(word_index) + 1  # Keras word ids start at 1, so the largest id is len(word_index)
embed_size = 300
batch_size = 250
lstm_size = 128
num_layers = 2
dropout = 0.5
learning_rate = 0.001
epochs = 100
multiple_fc = False
fc_units = 256

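Hyperparameters are a good way to tune the model's performance. Here is how I set them:

· batch_size: 250 is the number of reviews in each batch. Had I used 256, I would have run into trouble when submitting the predictions, because the last batch would be too short and would not be predicted.

· epochs: I often set this to 13, because I would rather stop early than keep running after many more iterations. With a larger value (for example, 100), we can be sure the model has been fully trained.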
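The batch_size point is simple arithmetic. The check below assumes 25,000 testing reviews, which is the size of Kaggle's testData.tsv for this competition; the article itself does not state the number.

```python
n_test = 25000  # assumed size of the testing set

def dropped(n_items, batch_size):
    """Number of items lost when the last partial batch is discarded."""
    return n_items - (n_items // batch_size) * batch_size

print(dropped(n_test, 250))  # 0
print(dropped(n_test, 256))  # 168
```

With a batch size of 256, 168 reviews would get no prediction, and the submission file would be incomplete.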

# Train the model with the desired tuning parameters
for lstm_size in [64,128]:
    for multiple_fc in [True, False]:
        for fc_units in [128, 256]:
            log_string = 'ru={},fcl={},fcu={}'.format(lstm_size,
                                                      multiple_fc,
                                                      fc_units)
            model = build_rnn(n_words = n_words, 
                              embed_size = embed_size,
                              batch_size = batch_size,
                              lstm_size = lstm_size,
                              num_layers = num_layers,
                              dropout = dropout,
                              learning_rate = learning_rate,
                              multiple_fc = multiple_fc,
                              fc_units = fc_units)            
            train(model, epochs, log_string)

Here I can vary the lstm_size, multiple_fc, and fc_units values. With this structure you can set up any values you want to tune. Note that the log string should record these values.
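The nested loops can also be written with itertools.product, which some readers may find easier to extend. This is an alternative sketch that only builds the log strings; the grid dictionary is a hypothetical name, and the calls to build_rnn and train are left out so that it runs on its own.

```python
from itertools import product

# Values to try for each tuned hyperparameter
grid = {
    "lstm_size": [64, 128],
    "multiple_fc": [True, False],
    "fc_units": [128, 256],
}

runs = []
for lstm_size, multiple_fc, fc_units in product(*grid.values()):
    # Same naming scheme as the training loop above
    log_string = 'ru={},fcl={},fcu={}'.format(lstm_size, multiple_fc, fc_units)
    runs.append(log_string)

print(len(runs))  # 8
print(runs[0])    # ru=64,fcl=True,fcu=128
```

Each log string names one TensorBoard run, so the eight combinations can be compared side by side.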

def make_predictions(lstm_size, multiple_fc, fc_units, checkpoint):
    '''Predict the sentiment of the testing data'''
    
    # Record all of the predictions
    all_preds = []

    model = build_rnn(n_words = n_words, 
                      embed_size = embed_size,
                      batch_size = batch_size,
                      lstm_size = lstm_size,
                      num_layers = num_layers,
                      dropout = dropout,
                      learning_rate = learning_rate,
                      multiple_fc = multiple_fc,
                      fc_units = fc_units) 
    
    with tf.Session() as sess:
        saver = tf.train.Saver()
        # Load the model
        saver.restore(sess, checkpoint)
        test_state = sess.run(model.initial_state)
        for _, x in enumerate(get_test_batches(x_test, 
                                               batch_size), 1):
            feed = {model.inputs: x,
                    model.keep_prob: 1,
                    model.initial_state: test_state}
            predictions = sess.run(model.predictions,feed_dict=feed)
            for pred in predictions:
                all_preds.append(float(pred))
                
    return all_preds

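This function returns the predictions for the testing data. Note that x_test is not defined in the snippets shown here; it is presumably the padded testing data (test_pad above). Make sure the arguments match the settings of the model you tuned; otherwise the predictions may be made with the default parameter values, and the results will fall short of what you expect.

That is the project in outline. The source code is on GitHub (https://github.com/Currie32/Movie-Reviews-Sentiment). If you have any questions about the code or suggestions for improving it, I am happy to discuss them.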

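The above is a translation.

This article was recommended by @爱可可-爱生活 of Beijing University of Posts and Telecommunications, and the translation was organized by the Alibaba Cloud Yunqi Community.

Original title: "Predicting Movie Review Sentiment with TensorFlow and TensorBoard". Author: Dave Currie. Translator: friday012. Reviewers: 海棠, 阿福.

About the author: Dave Currie is a software engineer working on machine learning (with a focus on natural language processing) and data science.

Linkedin: https://www.linkedin.com/in/davidcurrie32

This is an abridged translation; for more detail, please see the original article.

The attachment is a PDF of the original article.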
