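It is also worth noting that the stop-word list may need adjusting from project to project. The list used in this project, for example, includes some pronouns; in another project those words may carry meaning and cannot simply be treated as stop words.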
[445, 86, 489, 10939, 8, 61, 583, 2603, 120, 68, 957, 560, 53, 212, 24485, 212, 17247, 219, 193, 97, 20, 695, 2565, 124, 109, 15, 520, 3954, 193, 27, 246, 654, 2352, 1261, 17247, 90, 4782, 90, 712, 3, 305, 86, 16, 358, 1846, 542, 1219, 3592, 10939, 1, 485, 871, 3538, 23, 526, 673, 1414, 19, 63, 5305, 2089, 1118, 185, 413, 1523, 817, 2583, 7, 10939, 477, 86, 665, 85, 272, 114, 578, 10939, 34480, 29662, 148, 2, 10939, 381, 13, 59, 26, 381, 210, 15, 252, 178, 10, 751, 712, 3, 142, 341, 464, 145, 16427, 4121, 1718, 635, 876, 10547, 1018, 12089, 890, 1067, 1652, 416, 10939, 265, 19, 596, 141, 10939, 18336, 2302, 15821, 876, 10547, 1, 34, 38190, 388, 21, 49, 17539, 1414, 434, 9821, 193, 4238, 10939, 1, 120, 669, 520, 96, 7, 10939, 1555, 444, 2271, 138, 2137, 2383, 635, 23, 72, 117, 4750, 5364, 307, 1326, 31136, 19, 635, 556, 888, 665, 697, 6, 452, 195, 547, 138, 689, 3386, 1234, 790, 56, 1239, 268, 2, 21, 7, 10939, 6, 580, 78, 476, 32, 21, 245, 706, 158, 276, 113, 7674, 673, 3526, 10939, 1, 37925, 1690, 2, 159, 413, 1523, 294, 6, 956, 21, 51, 1500, 1226, 2352, 17, 612, 8, 61, 442, 724, 7184, 17, 25, 4, 49, 21, 199, 443, 3912, 3484, 49, 110, 270, 495, 252, 289, 124, 6, 19622, 19910, 363, 1502]
Next, let's adjust the length of the reviews so that every sequence has the same length.
max_review_length = 200
train_pad = pad_sequences(train_seq, maxlen = max_review_length)
print("train_pad is complete.")
test_pad = pad_sequences(test_seq, maxlen = max_review_length)
print("test_pad is complete.")
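More training data (here, longer reviews) can improve the model's accuracy, but I set the maximum review length to 200 in order to speed up training.
Before choosing this value, check the length of each review to find a sensible cutoff. I used NumPy's percentile function to decide on the number.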
np.percentile(lengths.counts, 80)
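As a minimal sketch of how a percentile can guide the cutoff, the example below uses a small made-up array of review lengths (`toy_lengths` is hypothetical, not the article's data) and a hypothetical helper `coverage` that reports which fraction of reviews fit within a candidate length.

```python
import numpy as np

# Hypothetical review lengths in tokens; the real values come from the dataset.
toy_lengths = np.array([50, 120, 180, 200, 340, 90, 150, 410, 60, 130])

# Length below which 80% of the toy reviews fall (linear interpolation by default).
p80 = np.percentile(toy_lengths, 80)

def coverage(lengths, max_len):
    """Fraction of reviews whose full text fits within max_len tokens."""
    return float(np.mean(np.asarray(lengths) <= max_len))

print(p80, coverage(toy_lengths, 200))
```

Checking the coverage of a round number such as 200 alongside the raw percentile makes it easier to pick a cutoff that is both convenient and close to the target share of reviews.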
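With a maximum length of 200, roughly 80% of the reviews keep all of their words. Reviews longer than 200 have the excess truncated, while shorter reviews are filled with padding tokens up to the full length. How to pad and truncate the data more effectively is worth considering; you may want to think about better approaches of your own.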
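To make the padding and truncation concrete, here is a small NumPy sketch of what happens to a single sequence. It assumes the Keras `pad_sequences` defaults (`padding='pre'`, `truncating='pre'`), which are what the calls above use since no other options are passed; `pad_or_truncate` is a hypothetical helper written for illustration, not part of Keras.

```python
import numpy as np

def pad_or_truncate(seq, maxlen, pad_value=0):
    """Return seq as an array of exactly maxlen tokens (maxlen must be > 0).

    Longer sequences keep their LAST maxlen tokens; shorter ones are
    left-padded with pad_value. This mirrors 'pre' truncation and padding.
    """
    seq = list(seq)[-maxlen:]                      # drop tokens from the front
    padding = [pad_value] * (maxlen - len(seq))    # empty when already full
    return np.array(padding + seq, dtype=np.int32)

print(pad_or_truncate([1, 2, 3], 5))           # short review: zeros in front
print(pad_or_truncate([1, 2, 3, 4, 5, 6], 4))  # long review: front is cut off
```

One alternative worth trying is `truncating='post'`, which keeps the beginning of a long review instead of its end.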
At this point we can split the data into a training set and a validation set.
x_train, x_valid, y_train, y_valid = train_test_split(train_pad, train.sentiment, test_size = 0.15, random_state = 2)
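Normally I would split the data into training, validation, and test sets. Because our data comes from a Kaggle competition, which already provides its own test data, we can use that directly as the test set. Before building the model, we first create a few functions for batching the data.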
def get_batches(x, y, batch_size):
    '''Create the batches for the training and validation data'''
    n_batches = len(x)//batch_size
    x, y = x[:n_batches*batch_size], y[:n_batches*batch_size]
    for ii in range(0, len(x), batch_size):
        yield x[ii:ii+batch_size], y[ii:ii+batch_size]

def get_test_batches(x, batch_size):
    '''Create the batches for the testing data'''
    n_batches = len(x)//batch_size
    x = x[:n_batches*batch_size]
    for ii in range(0, len(x), batch_size):
        yield x[ii:ii+batch_size]
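The functions above split the data into batches of equal size. Note that if the last batch holds fewer than batch_size examples, it is dropped. The length of your dataset should therefore be an integer multiple of batch_size; otherwise you may run into problems later when submitting predictions. Now let's build the recurrent neural network step by step.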
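The effect of dropping the last incomplete batch is easy to check on toy data. The sketch below repeats the batching generator so that it runs on its own, then feeds it 10 made-up examples with a batch size of 4.

```python
import numpy as np

def get_batches(x, y, batch_size):
    """Yield equally sized (x, y) batches; leftover examples are dropped."""
    n_batches = len(x) // batch_size
    x, y = x[:n_batches * batch_size], y[:n_batches * batch_size]
    for ii in range(0, len(x), batch_size):
        yield x[ii:ii + batch_size], y[ii:ii + batch_size]

x = np.arange(10)          # 10 toy examples
y = np.arange(10) % 2      # toy labels
batches = list(get_batches(x, y, batch_size=4))
n_seen = sum(len(batch_x) for batch_x, _ in batches)

# Two full batches are produced; the last two examples are never yielded.
print(len(batches), n_seen)
```

For training data, losing a handful of examples is harmless. For test data, it means some rows receive no prediction at all.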
def build_rnn(n_words, embed_size, batch_size, lstm_size, num_layers, dropout, learning_rate, multiple_fc, fc_units):
    '''Build the Recurrent Neural Network'''
    tf.reset_default_graph()

    # Declare placeholders we'll feed into the graph
    with tf.name_scope('inputs'):
        inputs = tf.placeholder(tf.int32, [None, None], name='inputs')
    with tf.name_scope('labels'):
        labels = tf.placeholder(tf.int32, [None, None], name='labels')
    keep_prob = tf.placeholder(tf.float32, name='keep_prob')

    # Create the embeddings
    with tf.name_scope("embeddings"):
        embedding = tf.Variable(tf.random_uniform((n_words,
                                                   embed_size), -1, 1))
        embed = tf.nn.embedding_lookup(embedding, inputs)

    # Build the RNN layers
    with tf.name_scope("RNN_layers"):
        lstm = tf.contrib.rnn.BasicLSTMCell(lstm_size)
        drop = tf.contrib.rnn.DropoutWrapper(lstm,
                                             output_keep_prob=keep_prob)
        cell = tf.contrib.rnn.MultiRNNCell([drop] * num_layers)

    # Set the initial state
    with tf.name_scope("RNN_init_state"):
        initial_state = cell.zero_state(batch_size, tf.float32)

    # Run the data through the RNN layers
    with tf.name_scope("RNN_forward"):
        outputs, final_state = tf.nn.dynamic_rnn(
            cell,
            embed,
            initial_state=initial_state)

    # Create the fully connected layers
    with tf.name_scope("fully_connected"):
        # Initialize the weights and biases
        weights = tf.truncated_normal_initializer(stddev=0.1)
        biases = tf.zeros_initializer()
        dense = tf.contrib.layers.fully_connected(outputs[:, -1],
                                                  num_outputs = fc_units,
                                                  activation_fn = tf.sigmoid,
                                                  weights_initializer = weights,
                                                  biases_initializer = biases)
        dense = tf.contrib.layers.dropout(dense, keep_prob)

        # Depending on the iteration, use a second fully connected layer
        if multiple_fc == True:
            dense = tf.contrib.layers.fully_connected(dense,
                                                      num_outputs = fc_units,
                                                      activation_fn = tf.sigmoid,
                                                      weights_initializer = weights,
                                                      biases_initializer = biases)
            dense = tf.contrib.layers.dropout(dense, keep_prob)

    # Make the predictions
    with tf.name_scope('predictions'):
        predictions = tf.contrib.layers.fully_connected(dense,
                                                        num_outputs = 1,
                                                        activation_fn=tf.sigmoid,
                                                        weights_initializer = weights,
                                                        biases_initializer = biases)
        tf.summary.histogram('predictions', predictions)

    # Calculate the cost
    with tf.name_scope('cost'):
        cost = tf.losses.mean_squared_error(labels, predictions)
        tf.summary.scalar('cost', cost)

    # Train the model
    with tf.name_scope('train'):
        optimizer = tf.train.AdamOptimizer(learning_rate).minimize(cost)

    # Determine the accuracy
    with tf.name_scope("accuracy"):
        correct_pred = tf.equal(tf.cast(tf.round(predictions),
                                        tf.int32),
                                labels)
        accuracy = tf.reduce_mean(tf.cast(correct_pred, tf.float32))
        tf.summary.scalar('accuracy', accuracy)

    # Merge all of the summaries
    merged = tf.summary.merge_all()

    # Export the nodes
    export_nodes = ['inputs', 'labels', 'keep_prob','initial_state',
                    'final_state','accuracy', 'predictions', 'cost',
                    'optimizer', 'merged']
    Graph = namedtuple('Graph', export_nodes)
    local_dict = locals()
    graph = Graph(*[local_dict[each] for each in export_nodes])

    return graph
If you have not used TensorBoard before, I suggest looking up Siraj Raval's tutorial videos online.
tf.reset_default_graph()
Before we start training, we reset the graph so that nothing left over from a previous build carries into this one.
# Declare placeholders we'll feed into the graph
with tf.name_scope('inputs'):
    inputs = tf.placeholder(tf.int32, [None, None],
                            name='inputs')
with tf.name_scope('labels'):
    labels = tf.placeholder(tf.int32, [None, None],
                            name='labels')
keep_prob = tf.placeholder(tf.float32, name='keep_prob')
These are the placeholders for our data. tf.name_scope() labels specific parts of the graph, which helps when we visualize it in TensorBoard.
# Create the embeddings
with tf.name_scope("embeddings"):
    embedding = tf.Variable(tf.random_uniform((n_words,
                                               embed_size), -1, 1))
    embed = tf.nn.embedding_lookup(embedding, inputs)
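The embedding turns each word in our vocabulary into a vector; embed_size is the dimensionality of that vector. (A more detailed discussion: https://www.quora.com/What-does-the-word-embedding-mean-in-the-context-of-Machine-Learning)
Although I used a random uniform distribution here, there are many other ways to initialize the embeddings. A truncated normal distribution with a small standard deviation is also a good choice; the code would look like this:
embedding = tf.Variable(tf.truncated_normal((n_words, embed_size), mean=0.0, stddev=0.1))
Try it yourself.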
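Conceptually, an embedding lookup is just row indexing into a matrix. The NumPy sketch below illustrates the shapes involved; the sizes are toy values chosen for readability, not the article's hyperparameters, and NumPy stands in for TensorFlow here.

```python
import numpy as np

rng = np.random.default_rng(0)
n_words, embed_size = 6, 4             # toy sizes, far smaller than the real model
embedding = rng.uniform(-1, 1, size=(n_words, embed_size))

# A batch of 2 reviews with 3 token ids each.
inputs = np.array([[0, 2, 5],
                   [1, 1, 3]])

# Each token id selects one row of the matrix.
embed = embedding[inputs]
print(embed.shape)   # (batch, sequence length, embed_size)
```

In the real model the matrix is a trainable variable, so the vectors are adjusted during training rather than staying at their random initial values.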
# Build the RNN layers
with tf.name_scope("RNN_layers"):
    lstm = tf.contrib.rnn.BasicLSTMCell(lstm_size)
    drop = tf.contrib.rnn.DropoutWrapper(lstm,
                                         output_keep_prob=keep_prob)
    cell = tf.contrib.rnn.MultiRNNCell([drop] * num_layers)
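The code above is the core of the recurrent neural network. As the default hyperparameters (num_layers = 2, dropout = 0.5) show, we use a two-layer network with 50% dropout.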
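For readers unfamiliar with what a keep probability does, here is a minimal NumPy sketch of dropout. It assumes the usual "inverted dropout" convention, in which surviving activations are scaled by 1/keep_prob so that their expected value is unchanged; `apply_dropout` is a hypothetical helper for illustration, not the TensorFlow implementation.

```python
import numpy as np

def apply_dropout(x, keep_prob, rng):
    """Zero each element with probability 1 - keep_prob; rescale the survivors."""
    mask = rng.random(x.shape) < keep_prob
    return np.where(mask, x / keep_prob, 0.0)

rng = np.random.default_rng(0)
activations = np.ones(1000)

dropped = apply_dropout(activations, keep_prob=0.5, rng=rng)    # training-time setting
kept_all = apply_dropout(activations, keep_prob=1.0, rng=rng)   # evaluation-time setting
print(np.unique(dropped), dropped.mean())
```

This is also why the validation and prediction code feeds keep_prob = 1: dropout should only be active while training.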
# Set the initial state
with tf.name_scope("RNN_init_state"):
    initial_state = cell.zero_state(batch_size, tf.float32)
This step initializes the state of the RNN (to zeros).
# Create the fully connected layers
with tf.name_scope("fully_connected"):
    # Initialize the weights and biases
    weights = tf.truncated_normal_initializer(stddev=0.1)
    biases = tf.zeros_initializer()
    dense = tf.contrib.layers.fully_connected(outputs[:, -1],
                                              num_outputs = fc_units,
                                              activation_fn = tf.sigmoid,
                                              weights_initializer = weights,
                                              biases_initializer = biases)
    dense = tf.contrib.layers.dropout(dense, keep_prob)
    # Depending on the iteration, use a second fully connected layer
    if multiple_fc == True:
        dense = tf.contrib.layers.fully_connected(dense,
                                                  num_outputs = fc_units,
                                                  activation_fn = tf.sigmoid,
                                                  weights_initializer = weights,
                                                  biases_initializer = biases)
        dense = tf.contrib.layers.dropout(dense, keep_prob)
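This is where we add the first fully connected layer and, optionally, a second one. Their weights are initialized from a truncated normal distribution with a small standard deviation (the same idea as the alternative embedding initialization mentioned earlier), and their biases are initialized to zero. multiple_fc is a parameter of the function that lets us test different architectures of the model. With this approach you can experiment with many things, such as how the weights and biases are initialized, or whether to use LSTM (Long Short Term Memory) or GRU (Gated Recurrent Unit) cells.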
# Make the predictions
with tf.name_scope('predictions'):
    predictions = tf.contrib.layers.fully_connected(dense,
                                                    num_outputs = 1,
                                                    activation_fn = tf.sigmoid,
                                                    weights_initializer = weights,
                                                    biases_initializer = biases)
    tf.summary.histogram('predictions', predictions)
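We have a single output for the sentiment of a review, a value between 0 and 1. The sigmoid function maps the output of the final fully connected layer onto this range. tf.summary.histogram() records our predictions and displays them as a histogram in TensorBoard. This lets us see clearly how the distribution of predictions changes during training, and also how the training and validation sets compare.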
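As a quick illustration of how the sigmoid squashes any real number into the interval between 0 and 1, here is a small NumPy sketch. The input values are arbitrary examples, not outputs of the model.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

logits = np.array([-4.0, 0.0, 4.0])   # arbitrary pre-activation values
probs = sigmoid(logits)
print(probs)   # strongly negative -> near 0, zero -> 0.5, strongly positive -> near 1
```

A prediction close to 1 can then be read as a positive review and one close to 0 as a negative review.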
# Calculate the cost
with tf.name_scope('cost'):
    cost = tf.losses.mean_squared_error(labels, predictions)
    tf.summary.scalar('cost', cost)
This computes the cost during training. We use tf.summary.scalar() rather than a histogram because the cost is a single scalar value, not a range of values.
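The mean squared error itself is simple to reproduce by hand. The NumPy sketch below computes it for three made-up label and prediction pairs. It assumes the loss is averaged over all elements, which is how the TensorFlow function behaves with its default settings as far as I know.

```python
import numpy as np

def mean_squared_error(labels, predictions):
    """Average of the squared differences between labels and predictions."""
    labels = np.asarray(labels, dtype=np.float64)
    predictions = np.asarray(predictions, dtype=np.float64)
    return float(np.mean((labels - predictions) ** 2))

# Made-up values: squared errors are 0.01, 0.04 and 0.36.
cost = mean_squared_error([1, 0, 1], [0.9, 0.2, 0.4])
print(cost)
```

The confident, correct predictions contribute very little, while the third prediction (0.4 for a positive review) dominates the cost.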
# Train the model
with tf.name_scope('train'):
    optimizer = tf.train.AdamOptimizer(learning_rate).minimize(cost)
Adam is a general-purpose optimizer that makes training the model more efficient. You can use other algorithms as well.
# Determine the accuracy
with tf.name_scope("accuracy"):
    correct_pred = tf.equal(tf.cast(tf.round(predictions),
                                    tf.int32),
                            labels)
    accuracy = tf.reduce_mean(tf.cast(correct_pred, tf.float32))
    tf.summary.scalar('accuracy', accuracy)
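Our predictions are numbers between 0 and 1, so we round them before comparing them with the labels. tf.reduce_mean() then averages the correct predictions to give the accuracy.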
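The same rounding and averaging can be sketched in NumPy. The labels and predictions below are made up for illustration.

```python
import numpy as np

def accuracy(labels, predictions):
    """Share of predictions that match the labels after rounding to 0 or 1."""
    rounded = np.round(predictions).astype(np.int32)
    correct = rounded == np.asarray(labels)
    return float(np.mean(correct))

# Rounded predictions are [1, 0, 1, 0]; the third one disagrees with its label.
acc = accuracy([1, 0, 0, 0], [0.9, 0.2, 0.6, 0.4])
print(acc)
```

Averaging a vector of 0/1 "correct" flags is exactly the fraction of correct predictions, which is why a mean is used here.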
# Merge all of the summaries
merged = tf.summary.merge_all()
Here we merge all the summaries, which simplifies saving them.
# Export the nodes
export_nodes = ['inputs', 'labels', 'keep_prob','initial_state',
'final_state','accuracy', 'predictions', 'cost',
'optimizer', 'merged']
Graph = namedtuple('Graph', export_nodes)
local_dict = locals()
graph = Graph(*[local_dict[each] for each in export_nodes])
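We export all the nodes so that the training function can use them. With that, the recurrent neural network is built, and we can move on to training it.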
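The export pattern (a namedtuple filled from locals()) can be tried in isolation. In this sketch, `build_toy` and the string "nodes" are hypothetical stand-ins for the real graph-building function and its TensorFlow tensors.

```python
from collections import namedtuple

def build_toy():
    # Stand-ins for graph nodes; in the real function these are tensors and ops.
    inputs = "inputs-node"
    cost = "cost-node"
    scratch = "not exported"

    # Only the names listed here become fields of the returned tuple.
    export_nodes = ['inputs', 'cost']
    Graph = namedtuple('Graph', export_nodes)
    local_dict = locals()
    return Graph(*[local_dict[each] for each in export_nodes])

toy = build_toy()
print(toy.inputs, toy.cost)
```

The benefit is that callers can write `model.cost` or `model.inputs` instead of remembering positions in a plain tuple, and anything not listed stays private to the function.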
def train(model, epochs, log_string):
    '''Train the RNN'''
    saver = tf.train.Saver()

    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())

        # Used to determine when to stop the training early
        valid_loss_summary = []

        # Keep track of which batch iteration is being trained
        iteration = 0

        print()
        print("Training Model: {}".format(log_string))

        train_writer = tf.summary.FileWriter('./logs/3/train/{}'.format(log_string), sess.graph)
        valid_writer = tf.summary.FileWriter('./logs/3/valid/{}'.format(log_string))

        for e in range(epochs):
            state = sess.run(model.initial_state)

            # Record progress with each epoch
            train_loss = []
            train_acc = []
            val_acc = []
            val_loss = []

            with tqdm(total=len(x_train)) as pbar:
                for _, (x, y) in enumerate(get_batches(x_train,
                                                       y_train,
                                                       batch_size), 1):
                    feed = {model.inputs: x,
                            model.labels: y[:, None],
                            model.keep_prob: dropout,
                            model.initial_state: state}
                    summary, loss, acc, state, _ = sess.run([model.merged,
                                                             model.cost,
                                                             model.accuracy,
                                                             model.final_state,
                                                             model.optimizer],
                                                            feed_dict=feed)

                    # Record the loss and accuracy of each training batch
                    train_loss.append(loss)
                    train_acc.append(acc)

                    # Record the progress of training
                    train_writer.add_summary(summary, iteration)

                    iteration += 1
                    pbar.update(batch_size)

            # Average the training loss and accuracy of each epoch
            avg_train_loss = np.mean(train_loss)
            avg_train_acc = np.mean(train_acc)

            val_state = sess.run(model.initial_state)
            with tqdm(total=len(x_valid)) as pbar:
                for x, y in get_batches(x_valid, y_valid, batch_size):
                    feed = {model.inputs: x,
                            model.labels: y[:, None],
                            model.keep_prob: 1,
                            model.initial_state: val_state}
                    summary, batch_loss, batch_acc, val_state = sess.run([model.merged,
                                                                          model.cost,
                                                                          model.accuracy,
                                                                          model.final_state],
                                                                         feed_dict=feed)

                    # Record the validation loss and accuracy of each epoch
                    val_loss.append(batch_loss)
                    val_acc.append(batch_acc)
                    pbar.update(batch_size)

            # Average the validation loss and accuracy of each epoch
            avg_valid_loss = np.mean(val_loss)
            avg_valid_acc = np.mean(val_acc)
            valid_loss_summary.append(avg_valid_loss)

            # Record the validation data's progress
            valid_writer.add_summary(summary, iteration)

            # Print the progress of each epoch
            print("Epoch: {}/{}".format(e, epochs),
                  "Train Loss: {:.3f}".format(avg_train_loss),
                  "Train Acc: {:.3f}".format(avg_train_acc),
                  "Valid Loss: {:.3f}".format(avg_valid_loss),
                  "Valid Acc: {:.3f}".format(avg_valid_acc))

            # Stop training if the validation loss does not decrease after 3 epochs
            if avg_valid_loss > min(valid_loss_summary):
                print("No Improvement.")
                stop_early += 1
                if stop_early == 3:
                    break

            # Reset stop_early if the validation loss finds a new low
            # Save a checkpoint of the model
            else:
                print("New Record!")
                stop_early = 0
                checkpoint = "./sentiment_{}.ckpt".format(log_string)
                saver.save(sess, checkpoint)
Let's walk through the code above step by step.
saver = tf.train.Saver()
This is used to save each checkpoint.
train_writer = tf.summary.FileWriter('./logs/3/train/{}'.format(log_string), sess.graph)
valid_writer = tf.summary.FileWriter('./logs/3/valid/{}'.format(log_string))
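This code writes the training summaries to local log files. I recommend keeping them under one logs folder, with the training and validation summaries in separate subfolders, which makes them easier to compare in TensorBoard.
tqdm() (https://pypi.python.org/pypi/tqdm) tracks the progress of each loop, so I can see roughly how much time remains.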
# Record the validation data's progress
valid_writer.add_summary(summary, iteration)
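The code above records a summary of the validation data for each epoch. It is not strictly necessary, but it provides a lot of useful information.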
# Reset stop_early if the validation loss finds a new low
# Save a checkpoint of the model
else:
    print("New Record!")
    stop_early = 0
    checkpoint = "./sentiment_{}.ckpt".format(log_string)
    saver.save(sess, checkpoint)
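I strongly recommend saving checkpoints as early as possible; it can save a great deal of training time. Decide for yourself roughly how many checkpoints you need. Here I only save a checkpoint for the best iteration of each model, which saves some disk space. You can of course adjust this for your own project.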
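The early-stopping rule used during training can be isolated into a few lines of plain Python. The sketch below replays a made-up list of validation losses; `run_early_stopping` and its `patience` parameter are hypothetical names, and the point where a checkpoint would be saved is only marked by a comment.

```python
def run_early_stopping(valid_losses, patience=3):
    """Replay per-epoch validation losses; return (epochs_run, best_epoch)."""
    best_loss = float('inf')
    best_epoch = None
    stop_early = 0
    epochs_run = 0
    for epoch, loss in enumerate(valid_losses):
        epochs_run += 1
        if loss <= best_loss:
            # New record: this is where the model checkpoint would be saved.
            best_loss, best_epoch = loss, epoch
            stop_early = 0
        else:
            # No improvement: give up after `patience` such epochs in a row.
            stop_early += 1
            if stop_early == patience:
                break
    return epochs_run, best_epoch

# Made-up losses: the best value is at epoch 1, then three epochs without a new low.
print(run_early_stopping([0.5, 0.4, 0.45, 0.41, 0.42, 0.3]))
```

Note that the final loss of 0.3 is never reached: training stops after the third epoch without improvement, and the saved checkpoint is the one from the best epoch.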
Here are the default hyperparameters I used:
n_words = len(word_index)
embed_size = 300
batch_size = 250
lstm_size = 128
num_layers = 2
dropout = 0.5
learning_rate = 0.001
epochs = 100
multiple_fc = False
fc_units = 256
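Tuning the hyperparameters is a good way to improve the model's performance. Here are my notes on two of them:
· batch_size: 250 is the number of reviews in each batch. Had I used 256, I would have run into trouble when submitting the predictions, because the final batch would be too short and would receive no predictions.
· epochs: I often set this to 13, because I would rather stop training early than only after many more iterations have run. With a larger value (for example, 100), we can be sure the model has been trained fully, and early stopping ends the run once the validation loss stops improving.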
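The batch size constraint comes down to simple divisibility. The sketch below assumes a test set of 25,000 reviews (the size of the Kaggle test file as I understand it; the article itself does not state the number) and uses a hypothetical helper, `batch_sizes_without_remainder`, to filter candidate batch sizes.

```python
def batch_sizes_without_remainder(n_examples, candidates):
    """Keep only the batch sizes that divide the dataset evenly."""
    return [b for b in candidates if n_examples % b == 0]

n_test = 25000   # ASSUMPTION: number of test reviews, not stated in the article

usable = batch_sizes_without_remainder(n_test, [64, 128, 200, 250, 256, 500])
leftover_256 = n_test % 256   # reviews that would get no prediction with 256

print(usable, leftover_256)
```

Under that assumption, a batch size of 250 covers every review, whereas 256 would leave the tail of the test set without predictions.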
# Train the model with the desired tuning parameters
for lstm_size in [64,128]:
    for multiple_fc in [True, False]:
        for fc_units in [128, 256]:
            log_string = 'ru={},fcl={},fcu={}'.format(lstm_size,
                                                      multiple_fc,
                                                      fc_units)
            model = build_rnn(n_words = n_words,
                              embed_size = embed_size,
                              batch_size = batch_size,
                              lstm_size = lstm_size,
                              num_layers = num_layers,
                              dropout = dropout,
                              learning_rate = learning_rate,
                              multiple_fc = multiple_fc,
                              fc_units = fc_units)
            train(model, epochs, log_string)
Here I vary the values of lstm_size, multiple_fc, and fc_units. With this structure you can tune any values you like. Note that the log string should record these values.
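If the grid grows beyond three parameters, nested loops become unwieldy. One possible alternative, sketched below, enumerates the combinations with itertools.product. The names `grid`, `iter_configs`, and `make_log_string` are hypothetical; the log string format matches the one used above.

```python
from itertools import product

# The same search space as the nested loops, written as data.
grid = {
    'lstm_size': [64, 128],
    'multiple_fc': [True, False],
    'fc_units': [128, 256],
}

def iter_configs(grid):
    """Yield one dict per combination of hyperparameter values."""
    keys = list(grid)
    for values in product(*(grid[k] for k in keys)):
        yield dict(zip(keys, values))

def make_log_string(cfg):
    """Encode the tuned values so each run gets its own log folder and checkpoint."""
    return 'ru={},fcl={},fcu={}'.format(cfg['lstm_size'],
                                        cfg['multiple_fc'],
                                        cfg['fc_units'])

log_strings = [make_log_string(cfg) for cfg in iter_configs(grid)]
print(len(log_strings), log_strings[0])
```

Because every combination produces a distinct log string, the runs do not overwrite each other's TensorBoard logs or checkpoints.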
def make_predictions(lstm_size, multiple_fc, fc_units, checkpoint):
    '''Predict the sentiment of the testing data'''

    # Record all of the predictions
    all_preds = []

    model = build_rnn(n_words = n_words,
                      embed_size = embed_size,
                      batch_size = batch_size,
                      lstm_size = lstm_size,
                      num_layers = num_layers,
                      dropout = dropout,
                      learning_rate = learning_rate,
                      multiple_fc = multiple_fc,
                      fc_units = fc_units)

    with tf.Session() as sess:
        saver = tf.train.Saver()
        # Load the model
        saver.restore(sess, checkpoint)
        test_state = sess.run(model.initial_state)
        for _, x in enumerate(get_test_batches(x_test,
                                               batch_size), 1):
            feed = {model.inputs: x,
                    model.keep_prob: 1,
                    model.initial_state: test_state}
            predictions = sess.run(model.predictions, feed_dict=feed)
            for pred in predictions:
                all_preds.append(float(pred))

    return all_preds
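This function produces the predictions for the test data. Note that the arguments must match the settings of the model you tuned; otherwise the predictions may be built with the default parameter values, and the results will fall short of what you expect.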
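One way to keep the prediction settings in step with the trained model is to recover them from the log string that names the checkpoint. The sketch below is only an illustration: `parse_log_string` is a hypothetical helper, and it assumes the 'ru=…,fcl=…,fcu=…' format used in the training loop.

```python
def parse_log_string(log_string):
    """Turn 'ru=128,fcl=False,fcu=256' back into keyword arguments."""
    fields = dict(part.split('=', 1) for part in log_string.split(','))
    return {
        'lstm_size': int(fields['ru']),
        'multiple_fc': fields['fcl'] == 'True',
        'fc_units': int(fields['fcu']),
    }

log_string = 'ru=128,fcl=False,fcu=256'
params = parse_log_string(log_string)
checkpoint = './sentiment_{}.ckpt'.format(log_string)

# These values could then be passed on, for example:
# make_predictions(checkpoint=checkpoint, **params)
print(params, checkpoint)
```

Deriving both the checkpoint path and the arguments from the same string removes the chance of restoring one model's weights into a graph built with another model's sizes.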
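That is the project in outline. The source code is on GitHub (https://github.com/Currie32/Movie-Reviews-Sentiment). If you have questions about the code or suggestions for improving it, I am happy to discuss them further.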
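The above is a translation.
This article was recommended by @爱可可-爱生活 of Beijing University of Posts and Telecommunications, and its translation was organized by the Alibaba Cloud Yunqi Community.
Original title: "Predicting Movie Review Sentiment with TensorFlow and TensorBoard". Author: Dave Currie. Translator: friday012. Reviewers: 海棠, 阿福.
About the author: Dave Currie is a software engineer focused on machine learning (natural language processing) and data science.
LinkedIn: https://www.linkedin.com/in/davidcurrie32
This is an abridged translation; for more detail, please see the original article.
A PDF version of the original article is attached.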