
Neural networks: "Make Your Own Neural Network" explains how artificial neural networks work in very accessible terms and walks through a code implementation; the experiments come out remarkably well.

Recurrent neural networks and LSTM: Christopher Olah, http://colah.github.io/posts/2015-08-Understanding-LSTMs/ .

The seq2seq model is a sequence-to-sequence model built on recurrent neural networks. Any sequence-to-sequence scenario, such as language translation or automatic question answering, can use seq2seq; the principle behind building a chatbot with seq2seq is described at http://suriyadeepan.github.io/2016-06-28-easy-seq2seq/ .

The attention model addresses the information loss caused by the seq2seq decoder receiving only the encoder's final output, which is far removed from the earlier outputs. An answer is generally based on key positions in the question, the places where attention concentrates; see http://www.wildml.com/2016/01/attention-and-memory-in-deep-learning-and-nlp/ .

Making a chatbot with TensorFlow seq2seq. The key TensorFlow interface is https://www.tensorflow.org/api_docs/python/tf/contrib/legacy_seq2seq/embedding_attention_seq2seq :

embedding_attention_seq2seq(
    encoder_inputs,
    decoder_inputs,
    cell,
    num_encoder_symbols,
    num_decoder_symbols,
    embedding_size,
    num_heads=1,
    output_projection=None,
    feed_previous=False,
    dtype=None,
    scope=None,
    initial_state_attention=False
)

The parameter encoder_inputs is a list; each item is a 1D Tensor of shape [batch_size] whose entries are integers, something like:

[array([0, 0, 0, 0], dtype=int32),
 array([0, 0, 0, 0], dtype=int32),
 array([8, 3, 5, 3], dtype=int32),
 array([7, 8, 2, 1], dtype=int32),
 array([6, 2, 10, 9], dtype=int32)]

Five arrays mean the sentence is five words long. Each array holds four numbers, so the batch is 4: four samples in total. The first sample is [[0],[0],[8],[7],[6]], the second [[0],[0],[3],[8],[2]]. The numbers are ids that distinguish different words, usually assigned from a frequency count; one id stands for one word.

The parameter decoder_inputs has the same structure as encoder_inputs.

The parameter cell is a recurrent unit of type tf.nn.rnn_cell.RNNCell; tf.contrib.rnn.BasicLSTMCell or tf.contrib.rnn.GRUCell can be used.

The parameter num_encoder_symbols is an integer, the number of distinct word ids in encoder_inputs; num_decoder_symbols is the number of distinct word ids in decoder_inputs.

embedding_size is the dimensionality of the vector each word is internally embedded into; here it needs to equal the RNNCell size.

num_heads is the number of attention heads reading attention_states.

output_projection is a (W, B) tuple: W is a weight matrix of shape [output_size x num_decoder_symbols], B a bias vector of shape [num_decoder_symbols]. Each RNNCell output is mapped through WX+B into a num_decoder_symbols-dimensional vector whose values indicate the likelihood of each decoder symbol (a softmax).

feed_previous says whether decoder_inputs are fed directly from the training data, or taken from the previous RNNCell's output: if feed_previous is True, the previous cell's output, mapped through WX+B, is used.

dtype is the data type of the RNN state, tf.float32 by default. scope is the subgraph name, "embedding_attention_seq2seq" by default. initial_state_attention says whether to initialize the attentions; the default is False, meaning they all start at zero.

The return value is an (outputs, state) tuple. outputs is a list whose length is the sentence length (the word count, the same as the length of the encoder_inputs list); each item is a 2D tf.float32 Tensor whose first dimension is the number of samples (so four samples give four rows), and each per-sample vector has length embedding_size. In other words, outputs describes 5×4×8 floats: 5 the sentence length, 4 the number of samples, 8 the word-vector dimensionality.

The returned state is a big tuple of num_layers LSTMStateTuples, where num_layers is the cell-initialization parameter saying how many layers the unit has; think of an encoder-decoder multi-layer recurrent network made of 3 LSTM layers. encoder_inputs enter the encoder's first LSTM layer; that layer's output feeds the second LSTM layer, the second layer's output feeds the third; the encoder's first-layer output state is passed to the decoder's first LSTM layer, and so on.

An LSTMStateTuple is a tuple of two Tensors. The first, named c, consists of 4 vectors of 8 dimensions (4 is the batch, 8 is state_size, the word-vector dimensionality); the second, named h, likewise consists of 4 vectors of 8 dimensions. c is the stored data handed to the next time step; h is the hidden output. The TensorFlow implementation:

concat = _linear([inputs, h], 4 * self._num_units, True)
i, j, f, o = array_ops.split(value=concat, num_or_size_splits=4, axis=1)
new_c = (c * sigmoid(f + self._forget_bias) + sigmoid(i) * self._activation(j))
new_h = self._activation(new_c) * sigmoid(o)

When training directly with embedding_attention_seq2seq, the returned state is generally not needed.
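What that cell update does can be sketched in a few lines of numpy (an illustration, not the TF source: the weight matrix w and bias b stand in for the variables that _linear creates internally; BasicLSTMCell's default activation is tanh and its default forget bias is 1.0):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def basic_lstm_step(x, h, c, w, b, forget_bias=1.0):
    # x: [batch, input_dim]; h, c: [batch, num_units]
    # w: [input_dim + num_units, 4 * num_units]; b: [4 * num_units]
    concat = np.dot(np.hstack([x, h]), w) + b      # _linear([inputs, h], 4 * num_units)
    i, j, f, o = np.split(concat, 4, axis=1)       # input gate, new input, forget gate, output gate
    new_c = c * sigmoid(f + forget_bias) + sigmoid(i) * np.tanh(j)
    new_h = np.tanh(new_c) * sigmoid(o)
    return new_h, new_c

# one step with batch=4, num_units=8, matching the example dimensions above
batch, input_dim, num_units = 4, 8, 8
rng = np.random.RandomState(0)
x = rng.randn(batch, input_dim)
h = np.zeros((batch, num_units))
c = np.zeros((batch, num_units))
w = rng.randn(input_dim + num_units, 4 * num_units) * 0.1
b = np.zeros(4 * num_units)
new_h, new_c = basic_lstm_step(x, h, c, w, b)
print new_h.shape, new_c.shape   # (4, 8) (4, 8)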
Now construct the input parameters and train a seq2seq model. Build samples from the odd-number sequence 1, 3, 5, 7, 9, ...; for example, two samples are [[1,3,5],[7,9,11]] and [[3,5,7],[9,11,13]]:

train_set = [[[1, 3, 5], [7, 9, 11]], [[3, 5, 7], [9, 11, 13]]]

To accommodate sequences of different lengths, the training sequence length must exceed the sample length; set it to 5:

input_seq_len = 5
output_seq_len = 5

Samples shorter than the training length are padded with 0:

PAD_ID = 0

The first sample's encoder_input:

encoder_input_0 = [PAD_ID] * (input_seq_len - len(train_set[0][0])) + train_set[0][0]

The second sample's encoder_input:

encoder_input_1 = [PAD_ID] * (input_seq_len - len(train_set[1][0])) + train_set[1][0]

decoder_input starts with GO_ID, then the sample sequence, padded with PAD_ID at the end:

GO_ID = 1
decoder_input_0 = [GO_ID] + train_set[0][1] + [PAD_ID] * (output_seq_len - len(train_set[0][1]) - 1)
decoder_input_1 = [GO_ID] + train_set[1][1] + [PAD_ID] * (output_seq_len - len(train_set[1][1]) - 1)

Convert these into the encoder_inputs/decoder_inputs format that embedding_attention_seq2seq expects:

encoder_inputs = []
decoder_inputs = []
for length_idx in xrange(input_seq_len):
    encoder_inputs.append(np.array([encoder_input_0[length_idx], encoder_input_1[length_idx]], dtype=np.int32))
for length_idx in xrange(output_seq_len):
    decoder_inputs.append(np.array([decoder_input_0[length_idx], decoder_input_1[length_idx]], dtype=np.int32))

As a standalone function:

# coding:utf-8
import numpy as np

# input sequence length
input_seq_len = 5
# output sequence length
output_seq_len = 5
# padding value
PAD_ID = 0
# start-of-output marker
GO_ID = 1

def get_samples():
    """Build sample data.
    :return:
        encoder_inputs: [array([0, 0], dtype=int32), array([0, 0], dtype=int32),
                         array([1, 3], dtype=int32), array([3, 5], dtype=int32),
                         array([5, 7], dtype=int32)]
        decoder_inputs: [array([1, 1], dtype=int32), array([7, 9], dtype=int32),
                         array([ 9, 11], dtype=int32), array([11, 13], dtype=int32),
                         array([0, 0], dtype=int32)]
    """
    train_set = [[[1, 3, 5], [7, 9, 11]], [[3, 5, 7], [9, 11, 13]]]
    encoder_input_0 = [PAD_ID] * (input_seq_len - len(train_set[0][0])) + train_set[0][0]
    encoder_input_1 = [PAD_ID] * (input_seq_len - len(train_set[1][0])) + train_set[1][0]
    decoder_input_0 = [GO_ID] + train_set[0][1] + [PAD_ID] * (output_seq_len - len(train_set[0][1]) - 1)
    decoder_input_1 = [GO_ID] + train_set[1][1] + [PAD_ID] * (output_seq_len - len(train_set[1][1]) - 1)

    encoder_inputs = []
    decoder_inputs = []
    for length_idx in xrange(input_seq_len):
        encoder_inputs.append(np.array([encoder_input_0[length_idx], encoder_input_1[length_idx]], dtype=np.int32))
    for length_idx in xrange(output_seq_len):
        decoder_inputs.append(np.array([decoder_input_0[length_idx], decoder_input_1[length_idx]], dtype=np.int32))
    return encoder_inputs, decoder_inputs

Next, build the model. TensorFlow works by first constructing a graph and then pushing data through it, so building the model really means building a graph. First create the placeholders for encoder_inputs and decoder_inputs:

import tensorflow as tf

encoder_inputs = []
decoder_inputs = []
for i in xrange(input_seq_len):
    encoder_inputs.append(tf.placeholder(tf.int32, shape=[None], name="encoder{0}".format(i)))
for i in xrange(output_seq_len):
    decoder_inputs.append(tf.placeholder(tf.int32, shape=[None], name="decoder{0}".format(i)))

Create an LSTM unit with size=8 memory cells:

size = 8
cell = tf.contrib.rnn.BasicLSTMCell(size)

For the odd-number training sequences, the input ids stay below 10 and the output ids below 16:

num_encoder_symbols = 10
num_decoder_symbols = 16

Pass the parameters to embedding_attention_seq2seq to get the outputs:

from tensorflow.contrib.legacy_seq2seq.python.ops import seq2seq
outputs, _ = seq2seq.embedding_attention_seq2seq(
    encoder_inputs,
    decoder_inputs[:output_seq_len],
    cell,
    num_encoder_symbols=num_encoder_symbols,
    num_decoder_symbols=num_decoder_symbols,
    embedding_size=size,
    output_projection=None,
    feed_previous=False,
    dtype=tf.float32)

Put the model construction into its own function:

def get_model():
    """Build the model."""
    encoder_inputs = []
    decoder_inputs = []
    for i in xrange(input_seq_len):
        encoder_inputs.append(tf.placeholder(tf.int32, shape=[None], name="encoder{0}".format(i)))
    for i in xrange(output_seq_len):
        decoder_inputs.append(tf.placeholder(tf.int32, shape=[None], name="decoder{0}".format(i)))
    cell = tf.contrib.rnn.BasicLSTMCell(size)
    # we do not need the returned state here
    outputs, _ = seq2seq.embedding_attention_seq2seq(
        encoder_inputs, decoder_inputs, cell,
        num_encoder_symbols=num_encoder_symbols,
        num_decoder_symbols=num_decoder_symbols,
        embedding_size=size,
        output_projection=None,
        feed_previous=False,
        dtype=tf.float32)
    return encoder_inputs, decoder_inputs, outputs
Create a runtime session and feed in the sample data:

with tf.Session() as sess:
    sample_encoder_inputs, sample_decoder_inputs = get_samples()
    encoder_inputs, decoder_inputs, outputs = get_model()
    input_feed = {}
    for l in xrange(input_seq_len):
        input_feed[encoder_inputs[l].name] = sample_encoder_inputs[l]
    for l in xrange(output_seq_len):
        input_feed[decoder_inputs[l].name] = sample_decoder_inputs[l]
    sess.run(tf.global_variables_initializer())
    outputs = sess.run(outputs, input_feed)
    print outputs

The printed outputs is a list of 5 arrays (5 is the sequence length); each array consists of two lists of size 16 (2 for the two samples, 16 for the 16 output symbols). These outputs correspond to the decoder outputs W, X, Y, Z, EOS of the seq2seq picture, that is, to decoder_inputs[1:], which in our samples are [7, 9, 11] and [9, 11, 13]. The decoder_inputs structure:

[array([1, 1], dtype=int32), array([ 7, 9], dtype=int32), array([ 9, 11], dtype=int32),
 array([11, 13], dtype=int32), array([0, 0], dtype=int32)]

The loss function is documented at https://www.tensorflow.org/api_docs/python/tf/contrib/legacy_seq2seq/sequence_loss :

sequence_loss(
    logits,
    targets,
    weights,
    average_across_timesteps=True,
    average_across_batch=True,
    softmax_loss_function=None,
    name=None
)

This loss minimizes the average negative log probability of the target words. logits is a list of 2D Tensors of shape [batch x num_decoder_symbols] (batch is 2, num_decoder_symbols is 16); the list holds output_seq_len Tensors. targets is a list of the same length (output_seq_len); each item is a 1D Tensor of integers with shape [batch] and dtype tf.int32, structured like decoder_inputs[1:], i.e. the W, X, Y, Z, EOS part. weights has the same structure as targets, with dtype tf.float32.

To compute the weighted cross-entropy loss, the weights need their own placeholders:

target_weights = []
for i in xrange(output_seq_len):
    target_weights.append(tf.placeholder(tf.float32, shape=[None], name="weight{0}".format(i)))

Compute the loss value:

targets = [decoder_inputs[i + 1] for i in xrange(len(decoder_inputs) - 1)]
loss = seq2seq.sequence_loss(outputs, targets, target_weights)

targets is one item shorter than decoder_inputs; to keep the lengths consistent, decoder_inputs is initialized with one extra placeholder. Because this is a weighted cross-entropy loss, meaningful positions carry large weights and meaningless ones small weights: positions where targets has a real value get weight 1, positions without get 0.
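To make the loss concrete, here is a rough numpy equivalent of what sequence_loss computes with its default averaging flags (a sketch of the semantics, not the TF source: per-example weighted average of the per-step cross-entropy, then averaged over the batch):

import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def sequence_loss_np(logits, targets, weights):
    # logits: list (length T) of [batch, num_symbols] arrays
    # targets, weights: lists (length T) of [batch] arrays
    batch = logits[0].shape[0]
    xent_sum = np.zeros(batch)
    weight_sum = np.zeros(batch)
    for logit, target, weight in zip(logits, targets, weights):
        probs = softmax(logit)
        # negative log probability of the target id at this time step
        xent = -np.log(probs[np.arange(batch), target])
        xent_sum += weight * xent
        weight_sum += weight
    # per-example average over time steps, then average over the batch
    return np.mean(xent_sum / (weight_sum + 1e-12))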
The complete program so far:

# coding:utf-8
import numpy as np
import tensorflow as tf
from tensorflow.contrib.legacy_seq2seq.python.ops import seq2seq

# input sequence length
input_seq_len = 5
# output sequence length
output_seq_len = 5
# padding value
PAD_ID = 0
# start-of-output marker
GO_ID = 1
# LSTM cell size
size = 8
# max number of input symbols
num_encoder_symbols = 10
# max number of output symbols
num_decoder_symbols = 16

def get_samples():
    """Build sample data.
    :return:
        encoder_inputs: [array([0, 0], dtype=int32), array([0, 0], dtype=int32),
                         array([1, 3], dtype=int32), array([3, 5], dtype=int32),
                         array([5, 7], dtype=int32)]
        decoder_inputs: [array([1, 1], dtype=int32), array([7, 9], dtype=int32),
                         array([ 9, 11], dtype=int32), array([11, 13], dtype=int32),
                         array([0, 0], dtype=int32)]
    """
    train_set = [[[1, 3, 5], [7, 9, 11]], [[3, 5, 7], [9, 11, 13]]]
    encoder_input_0 = [PAD_ID] * (input_seq_len - len(train_set[0][0])) + train_set[0][0]
    encoder_input_1 = [PAD_ID] * (input_seq_len - len(train_set[1][0])) + train_set[1][0]
    decoder_input_0 = [GO_ID] + train_set[0][1] + [PAD_ID] * (output_seq_len - len(train_set[0][1]) - 1)
    decoder_input_1 = [GO_ID] + train_set[1][1] + [PAD_ID] * (output_seq_len - len(train_set[1][1]) - 1)

    encoder_inputs = []
    decoder_inputs = []
    target_weights = []
    for length_idx in xrange(input_seq_len):
        encoder_inputs.append(np.array([encoder_input_0[length_idx], encoder_input_1[length_idx]], dtype=np.int32))
    for length_idx in xrange(output_seq_len):
        decoder_inputs.append(np.array([decoder_input_0[length_idx], decoder_input_1[length_idx]], dtype=np.int32))
        target_weights.append(np.array([
            0.0 if length_idx == output_seq_len - 1 or decoder_input_0[length_idx] == PAD_ID else 1.0,
            0.0 if length_idx == output_seq_len - 1 or decoder_input_1[length_idx] == PAD_ID else 1.0,
        ], dtype=np.float32))
    return encoder_inputs, decoder_inputs, target_weights

def get_model():
    """Build the model."""
    encoder_inputs = []
    decoder_inputs = []
    target_weights = []
    for i in xrange(input_seq_len):
        encoder_inputs.append(tf.placeholder(tf.int32, shape=[None], name="encoder{0}".format(i)))
    for i in xrange(output_seq_len + 1):
        decoder_inputs.append(tf.placeholder(tf.int32, shape=[None], name="decoder{0}".format(i)))
    for i in xrange(output_seq_len):
        target_weights.append(tf.placeholder(tf.float32, shape=[None], name="weight{0}".format(i)))

    # decoder_inputs shifted left by one time step are the targets
    targets = [decoder_inputs[i + 1] for i in xrange(output_seq_len)]

    cell = tf.contrib.rnn.BasicLSTMCell(size)

    # we do not need the returned state here
    outputs, _ = seq2seq.embedding_attention_seq2seq(
        encoder_inputs,
        decoder_inputs[:output_seq_len],
        cell,
        num_encoder_symbols=num_encoder_symbols,
        num_decoder_symbols=num_decoder_symbols,
        embedding_size=size,
        output_projection=None,
        feed_previous=False,
        dtype=tf.float32)

    # weighted cross-entropy loss
    loss = seq2seq.sequence_loss(outputs, targets, target_weights)
    return encoder_inputs, decoder_inputs, target_weights, outputs, loss

def main():
    with tf.Session() as sess:
        sample_encoder_inputs, sample_decoder_inputs, sample_target_weights = get_samples()
        encoder_inputs, decoder_inputs, target_weights, outputs, loss = get_model()

        input_feed = {}
        for l in xrange(input_seq_len):
            input_feed[encoder_inputs[l].name] = sample_encoder_inputs[l]
        for l in xrange(output_seq_len):
            input_feed[decoder_inputs[l].name] = sample_decoder_inputs[l]
            input_feed[target_weights[l].name] = sample_target_weights[l]
        input_feed[decoder_inputs[output_seq_len].name] = np.zeros([2], dtype=np.int32)

        sess.run(tf.global_variables_initializer())
        loss = sess.run(loss, input_feed)
        print loss

if __name__ == "__main__":
    main()

To train the model, we iterate many rounds so the loss becomes very small, updating the parameters with gradient descent. TensorFlow provides a gradient descent class: https://www.tensorflow.org/api_docs/python/tf/train/GradientDescentOptimizer .

The constructor of class GradientDescentOptimizer:

__init__(
    learning_rate,
    use_locking=False,
    name='GradientDescent'
)

The key argument is the first one, the learning rate. The gradient-computation method:

compute_gradients(
    loss,
    var_list=None,
    gate_gradients=GATE_OP,
    aggregation_method=None,
    colocate_gradients_with_ops=False,
    grad_loss=None
)

The key parameter loss is the error value passed in; the return value is a list of (gradient, variable) pairs. The parameter-update method:

apply_gradients(
    grads_and_vars,
    global_step=None,
    name=None
)

grads_and_vars is what compute_gradients returns. So updating the parameters from the loss gradient looks like:

learning_rate = 0.1
opt = tf.train.GradientDescentOptimizer(learning_rate)
update = opt.apply_gradients(opt.compute_gradients(loss))
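Incidentally, apply_gradients(compute_gradients(loss)) is exactly what the optimizer's minimize() convenience method expands to, so the same update op can also be written as:

learning_rate = 0.1
opt = tf.train.GradientDescentOptimizer(learning_rate)
update = opt.minimize(loss)   # same as opt.apply_gradients(opt.compute_gradients(loss))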
Add an iteration loop to main:

def main():
    with tf.Session() as sess:
        sample_encoder_inputs, sample_decoder_inputs, sample_target_weights = get_samples()
        encoder_inputs, decoder_inputs, target_weights, outputs, loss, update = get_model()

        input_feed = {}
        for l in xrange(input_seq_len):
            input_feed[encoder_inputs[l].name] = sample_encoder_inputs[l]
        for l in xrange(output_seq_len):
            input_feed[decoder_inputs[l].name] = sample_decoder_inputs[l]
            input_feed[target_weights[l].name] = sample_target_weights[l]
        input_feed[decoder_inputs[output_seq_len].name] = np.zeros([2], dtype=np.int32)

        sess.run(tf.global_variables_initializer())
        while True:
            [loss_ret, _] = sess.run([loss, update], input_feed)
            print loss_ret

Next, implement the prediction logic: feed only the sample's encoder_input and predict the decoder_input automatically. Save the trained model so it can be loaded when prediction restarts:

def get_model():
    ...
    saver = tf.train.Saver(tf.global_variables())
    return ..., saver

After training finishes, execute

saver.save(sess, './model/demo')

to store the model under the ./model directory in files prefixed with demo. To load it, call:

saver.restore(sess, './model/demo')

When predicting, in principle there is no decoder_inputs input: at execution time each decoder_input is taken from the previous time step's output. That is the feed_previous parameter of embedding_attention_seq2seq: when True, every decoder step is fed from the previous step's output.

get_model takes a parameter so that training and prediction get different feed_previous settings; main also differs for prediction, so split it into two functions, train and predict:

# coding:utf-8
import sys
import numpy as np
import tensorflow as tf
from tensorflow.contrib.legacy_seq2seq.python.ops import seq2seq

# input sequence length
input_seq_len = 5
# output sequence length
output_seq_len = 5
# padding value
PAD_ID = 0
# start-of-output marker
GO_ID = 1
# end-of-sequence marker
EOS_ID = 2
# LSTM cell size
size = 8
# max number of input symbols
num_encoder_symbols = 10
# max number of output symbols
num_decoder_symbols = 16
# learning rate
learning_rate = 0.1

def get_samples():
    """Build sample data.
    :return:
        encoder_inputs: [array([0, 0], dtype=int32), array([0, 0], dtype=int32),
                         array([5, 7], dtype=int32), array([7, 9], dtype=int32),
                         array([9, 11], dtype=int32)]
        decoder_inputs: [array([1, 1], dtype=int32), array([11, 13], dtype=int32),
                         array([13, 15], dtype=int32), array([15, 17], dtype=int32),
                         array([2, 2], dtype=int32)]
    """
    train_set = [[[5, 7, 9], [11, 13, 15, EOS_ID]], [[7, 9, 11], [13, 15, 17, EOS_ID]]]
    raw_encoder_input = []
    raw_decoder_input = []
    for sample in train_set:
        raw_encoder_input.append([PAD_ID] * (input_seq_len - len(sample[0])) + sample[0])
        raw_decoder_input.append([GO_ID] + sample[1] + [PAD_ID] * (output_seq_len - len(sample[1]) - 1))

    encoder_inputs = []
    decoder_inputs = []
    target_weights = []
    for length_idx in xrange(input_seq_len):
        encoder_inputs.append(np.array([encoder_input[length_idx] for encoder_input in raw_encoder_input], dtype=np.int32))
    for length_idx in xrange(output_seq_len):
        decoder_inputs.append(np.array([decoder_input[length_idx] for decoder_input in raw_decoder_input], dtype=np.int32))
        target_weights.append(np.array([
            0.0 if length_idx == output_seq_len - 1 or decoder_input[length_idx] == PAD_ID else 1.0
            for decoder_input in raw_decoder_input
        ], dtype=np.float32))
    return encoder_inputs, decoder_inputs, target_weights

def get_model(feed_previous=False):
    """Build the model."""
    encoder_inputs = []
    decoder_inputs = []
    target_weights = []
    for i in xrange(input_seq_len):
        encoder_inputs.append(tf.placeholder(tf.int32, shape=[None], name="encoder{0}".format(i)))
    for i in xrange(output_seq_len + 1):
        decoder_inputs.append(tf.placeholder(tf.int32, shape=[None], name="decoder{0}".format(i)))
    for i in xrange(output_seq_len):
        target_weights.append(tf.placeholder(tf.float32, shape=[None], name="weight{0}".format(i)))

    # decoder_inputs shifted left by one time step are the targets
    targets = [decoder_inputs[i + 1] for i in xrange(output_seq_len)]

    cell = tf.contrib.rnn.BasicLSTMCell(size)

    # we do not need the returned state here
    outputs, _ = seq2seq.embedding_attention_seq2seq(
        encoder_inputs,
        decoder_inputs[:output_seq_len],
        cell,
        num_encoder_symbols=num_encoder_symbols,
        num_decoder_symbols=num_decoder_symbols,
        embedding_size=size,
        output_projection=None,
        feed_previous=feed_previous,
        dtype=tf.float32)

    # weighted cross-entropy loss
    loss = seq2seq.sequence_loss(outputs, targets, target_weights)
    # gradient descent optimizer
    opt = tf.train.GradientDescentOptimizer(learning_rate)
    # optimization target: minimize the loss
    update = opt.apply_gradients(opt.compute_gradients(loss))
    # model persistence
    saver = tf.train.Saver(tf.global_variables())

    return encoder_inputs, decoder_inputs, target_weights, outputs, loss, update, saver, targets
def train():
    """Training."""
    with tf.Session() as sess:
        sample_encoder_inputs, sample_decoder_inputs, sample_target_weights = get_samples()
        encoder_inputs, decoder_inputs, target_weights, outputs, loss, update, saver, targets = get_model()

        input_feed = {}
        for l in xrange(input_seq_len):
            input_feed[encoder_inputs[l].name] = sample_encoder_inputs[l]
        for l in xrange(output_seq_len):
            input_feed[decoder_inputs[l].name] = sample_decoder_inputs[l]
            input_feed[target_weights[l].name] = sample_target_weights[l]
        input_feed[decoder_inputs[output_seq_len].name] = np.zeros([2], dtype=np.int32)

        # initialize all variables
        sess.run(tf.global_variables_initializer())

        # run 200 training iterations, printing the loss every 10
        for step in xrange(200):
            [loss_ret, _] = sess.run([loss, update], input_feed)
            if step % 10 == 0:
                print 'step=', step, 'loss=', loss_ret

        # persist the model
        saver.save(sess, './model/demo')

def predict():
    """Prediction."""
    with tf.Session() as sess:
        sample_encoder_inputs, sample_decoder_inputs, sample_target_weights = get_samples()
        encoder_inputs, decoder_inputs, target_weights, outputs, loss, update, saver, targets = get_model(feed_previous=True)

        # restore the model from file
        saver.restore(sess, './model/demo')

        input_feed = {}
        for l in xrange(input_seq_len):
            input_feed[encoder_inputs[l].name] = sample_encoder_inputs[l]
        for l in xrange(output_seq_len):
            input_feed[decoder_inputs[l].name] = sample_decoder_inputs[l]
            input_feed[target_weights[l].name] = sample_target_weights[l]
        input_feed[decoder_inputs[output_seq_len].name] = np.zeros([2], dtype=np.int32)

        # predicted outputs
        outputs = sess.run(outputs, input_feed)
        # there are 2 test samples, so loop over both
        for sample_index in xrange(2):
            # each output is num_decoder_symbols-dimensional, so the predicted
            # id is the position with the largest value: that is what argmax does
            outputs_seq = [int(np.argmax(logit[sample_index], axis=0)) for logit in outputs]
            # stop printing at the end-of-sequence marker
            if EOS_ID in outputs_seq:
                outputs_seq = outputs_seq[:outputs_seq.index(EOS_ID)]
            outputs_seq = [str(v) for v in outputs_seq]
            print " ".join(outputs_seq)

if __name__ == "__main__":
    if sys.argv[1] == 'train':
        train()
    else:
        predict()

Name the file demo.py; run ./demo.py train to train the model and ./demo.py predict to predict.

Prediction above still computes over the complete encoder_inputs and decoder_inputs. Let's improve predict so a string of digits can be typed by hand (only the encoder part). First, a function that turns a space-separated string of digit ids into the encoder, decoder and target_weight inputs needed for prediction:

def seq_to_encoder(input_seq):
    """Turn a space-separated string of digit ids into the
    encoder_inputs, decoder_inputs and target_weights used for prediction."""
    input_seq_array = [int(v) for v in input_seq.split()]
    encoder_input = [PAD_ID] * (input_seq_len - len(input_seq_array)) + input_seq_array
    decoder_input = [GO_ID] + [PAD_ID] * (output_seq_len - 1)
    encoder_inputs = [np.array([v], dtype=np.int32) for v in encoder_input]
    decoder_inputs = [np.array([v], dtype=np.int32) for v in decoder_input]
    target_weights = [np.array([1.0], dtype=np.float32)] * output_seq_len
    return encoder_inputs, decoder_inputs, target_weights
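For example, the input string '5 7 9' produces (reading the function above):

encoder_inputs, decoder_inputs, target_weights = seq_to_encoder('5 7 9')
# encoder_inputs: [array([0]), array([0]), array([5]), array([7]), array([9])]
# decoder_inputs: [array([1]), array([0]), array([0]), array([0]), array([0])]  (GO_ID, then PADs)
# target_weights: [array([ 1.], dtype=float32)] * 5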
Then rewrite predict as follows:

def predict():
    """Prediction."""
    with tf.Session() as sess:
        encoder_inputs, decoder_inputs, target_weights, outputs, loss, update, saver, targets = get_model(feed_previous=True)
        saver.restore(sess, './model/demo')
        sys.stdout.write("> ")
        sys.stdout.flush()
        input_seq = sys.stdin.readline()
        while input_seq:
            input_seq = input_seq.strip()
            sample_encoder_inputs, sample_decoder_inputs, sample_target_weights = seq_to_encoder(input_seq)

            input_feed = {}
            for l in xrange(input_seq_len):
                input_feed[encoder_inputs[l].name] = sample_encoder_inputs[l]
            for l in xrange(output_seq_len):
                input_feed[decoder_inputs[l].name] = sample_decoder_inputs[l]
                input_feed[target_weights[l].name] = sample_target_weights[l]
            input_feed[decoder_inputs[output_seq_len].name] = np.zeros([2], dtype=np.int32)

            # predicted outputs
            outputs_seq = sess.run(outputs, input_feed)
            # each output is num_decoder_symbols-dimensional;
            # argmax picks the id with the largest value
            outputs_seq = [int(np.argmax(logit[0], axis=0)) for logit in outputs_seq]
            # stop printing at the end-of-sequence marker
            if EOS_ID in outputs_seq:
                outputs_seq = outputs_seq[:outputs_seq.index(EOS_ID)]
            outputs_seq = [str(v) for v in outputs_seq]
            print " ".join(outputs_seq)
            sys.stdout.write("> ")
            sys.stdout.flush()
            input_seq = sys.stdin.readline()

Run ./demo.py predict.

With num_encoder_symbols = 10, the id 11 cannot be represented, so raise both parameters and add samples:

# max number of input symbols
num_encoder_symbols = 32
# max number of output symbols
num_decoder_symbols = 32
……
train_set = [
    [[5, 7, 9], [11, 13, 15, EOS_ID]],
    [[7, 9, 11], [13, 15, 17, EOS_ID]],
    [[15, 17, 19], [21, 23, 25, EOS_ID]]
]
……

and raise the iteration count to 10000.

Feed in a training sample and the prediction is very good; feed anything else and the model still just finds the closest training output and uses it as the prediction. It does not think and has no intelligence, so a model like this is better suited to classification than to inference.

For a real chatbot, training converts Chinese words into id numbers, and prediction converts the predicted ids back into Chinese. Create a word_token.py file with a WordToken class: load_file_list loads the samples and builds the word2id_dict and id2word_dict dictionaries, word2id converts a word into an id, and id2word converts an id back into a word:

# coding:utf-8
import sys
import jieba

class WordToken(object):
    def __init__(self):
        # lowest usable id; smaller ids are reserved for the special markers
        self.START_ID = 4
        self.word2id_dict = {}
        self.id2word_dict = {}

    def load_file_list(self, file_list):
        """Load the sample files, segment every line, count word frequencies,
        sort by frequency descending, number the words in order,
        and store them in self.word2id_dict and self.id2word_dict."""
        words_count = {}
        for file in file_list:
            with open(file, 'r') as file_object:
                for line in file_object.readlines():
                    line = line.strip()
                    seg_list = jieba.cut(line)
                    for str in seg_list:
                        if str in words_count:
                            words_count[str] = words_count[str] + 1
                        else:
                            words_count[str] = 1

        sorted_list = [[v[1], v[0]] for v in words_count.items()]
        sorted_list.sort(reverse=True)
        for index, item in enumerate(sorted_list):
            word = item[1]
            self.word2id_dict[word] = self.START_ID + index
            self.id2word_dict[self.START_ID + index] = word
        # return the highest index assigned, so callers can size the symbol tables
        return index

    def word2id(self, word):
        if not isinstance(word, unicode):
            print "Exception: error word not unicode"
            sys.exit(1)
        if word in self.word2id_dict:
            return self.word2id_dict[word]
        else:
            return None

    def id2word(self, id):
        id = int(id)
        if id in self.id2word_dict:
            return self.id2word_dict[id]
        else:
            return None
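A minimal round trip with this class might look like the following sketch (assumptions: the sample files below exist, and u'什么' is a placeholder word that happens to occur in them):

# coding:utf-8
import word_token

wordToken = word_token.WordToken()
max_token_id = wordToken.load_file_list(['./samples/question', './samples/answer'])
print 'max_token_id =', max_token_id
word_id = wordToken.word2id(u'什么')   # the word must be passed as unicode
if word_id:
    print word_id, wordToken.id2word(word_id)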
In demo.py, modify get_train_set:

def get_train_set():
    global num_encoder_symbols, num_decoder_symbols
    train_set = []
    with open('./samples/question', 'r') as question_file:
        with open('./samples/answer', 'r') as answer_file:
            while True:
                question = question_file.readline()
                answer = answer_file.readline()
                if question and answer:
                    question = question.strip()
                    answer = answer.strip()
                    question_id_list = get_id_list_from(question)
                    answer_id_list = get_id_list_from(answer)
                    answer_id_list.append(EOS_ID)
                    train_set.append([question_id_list, answer_id_list])
                else:
                    break
    return train_set

get_id_list_from is implemented as:

def get_id_list_from(sentence):
    sentence_id_list = []
    seg_list = jieba.cut(sentence)
    for str in seg_list:
        id = wordToken.word2id(str)
        if id:
            sentence_id_list.append(wordToken.word2id(str))
    return sentence_id_list

wordToken:

import word_token
import jieba
wordToken = word_token.WordToken()

# kept at module level so num_encoder_symbols and num_decoder_symbols
# can be computed dynamically
max_token_id = wordToken.load_file_list(['./samples/question', './samples/answer'])
num_encoder_symbols = max_token_id + 5
num_decoder_symbols = max_token_id + 5

The training code:

# train for many iterations, printing the loss every 10;
# watch it and stop with ctrl+c whenever satisfied
for step in xrange(100000):
    [loss_ret, _] = sess.run([loss, update], input_feed)
    if step % 10 == 0:
        print 'step=', step, 'loss=', loss_ret
        # persist the model
        saver.save(sess, './model/demo')

The prediction code changes to:

def predict():
    """Prediction."""
    with tf.Session() as sess:
        encoder_inputs, decoder_inputs, target_weights, outputs, loss, update, saver, targets = get_model(feed_previous=True)
        saver.restore(sess, './model/demo')
        sys.stdout.write("> ")
        sys.stdout.flush()
        input_seq = sys.stdin.readline()
        while input_seq:
            input_seq = input_seq.strip()
            input_id_list = get_id_list_from(input_seq)
            if (len(input_id_list)):
                sample_encoder_inputs, sample_decoder_inputs, sample_target_weights = seq_to_encoder(' '.join([str(v) for v in input_id_list]))

                input_feed = {}
                for l in xrange(input_seq_len):
                    input_feed[encoder_inputs[l].name] = sample_encoder_inputs[l]
                for l in xrange(output_seq_len):
                    input_feed[decoder_inputs[l].name] = sample_decoder_inputs[l]
                    input_feed[target_weights[l].name] = sample_target_weights[l]
                input_feed[decoder_inputs[output_seq_len].name] = np.zeros([2], dtype=np.int32)

                # predicted outputs
                outputs_seq = sess.run(outputs, input_feed)
                # each output is num_decoder_symbols-dimensional;
                # argmax picks the id with the largest value
                outputs_seq = [int(np.argmax(logit[0], axis=0)) for logit in outputs_seq]
                # stop printing at the end-of-sequence marker
                if EOS_ID in outputs_seq:
                    outputs_seq = outputs_seq[:outputs_seq.index(EOS_ID)]
                outputs_seq = [wordToken.id2word(v) for v in outputs_seq]
                print " ".join(outputs_seq)
            else:
                print "WARN:词汇不在服务区"

            sys.stdout.write("> ")
            sys.stdout.flush()
            input_seq = sys.stdin.readline()

Train with the 1000 dialogue samples stored in ['./samples/question', './samples/answer'] until the printed loss converges below some level (say 1.0):

python demo.py train

Once it drops below 1.0, stop manually with ctrl+c; the model is stored every 10 steps.

The model converges very slowly with the learning rate fixed at 0.1. A better scheme: start with a larger rate, and whenever the loss rebounds (grows) compared with the previous steps, lower it. Instead of using learning_rate directly, initialize a starting rate:

init_learning_rate = 1

In get_model, create a variable initialized from it:

learning_rate = tf.Variable(float(init_learning_rate), trainable=False, dtype=tf.float32)

and create an operation that takes 10% off the learning rate when run:

learning_rate_decay_op = learning_rate.assign(learning_rate * 0.9)

The training code becomes:

# train for many iterations, printing the loss every 10;
# watch it and stop with ctrl+c whenever satisfied
previous_losses = []
for step in xrange(100000):
    [loss_ret, _] = sess.run([loss, update], input_feed)
    if step % 10 == 0:
        print 'step=', step, 'loss=', loss_ret, 'learning_rate=', learning_rate.eval()
        if len(previous_losses) > 5 and loss_ret > max(previous_losses[-5:]):
            sess.run(learning_rate_decay_op)
        previous_losses.append(loss_ret)
        # persist the model
        saver.save(sess, './model/demo')

With this, training converges quickly.

References:
http://colah.github.io/posts/2015-08-Understanding-LSTMs/
http://suriyadeepan.github.io/2016-06-28-easy-seq2seq/
http://www.wildml.com/2016/01/attention-and-memory-in-deep-learning-and-nlp/
https://arxiv.org/abs/1406.1078
https://arxiv.org/abs/1409.3215
https://arxiv.org/abs/1409.0473

Loading the samples in full and training on a large sample set blows up memory: it keeps hitting Out of memory. The fix is to switch from full loading to batched loading, so however large the sample set grows, memory stays bounded.

https://github.com/warmheartli/ChatBotCourse/tree/master/chatbotv5

Memory grows with the sample count; at tens of thousands of samples it reaches 10 GB. Every iteration loads all samples into memory and trains over all of them before updating the model, and the vocabulary is generated from the samples with no limits, so big samples mean a big vocabulary, a big model, and even more memory.

The optimization: change full loading into batched loading by modifying the train() function.

# train for many iterations, printing the loss every 10;
# watch it and stop with ctrl+c whenever satisfied
previous_losses = []
for step in xrange(20000):
    sample_encoder_inputs, sample_decoder_inputs, sample_target_weights = get_samples(train_set, 1000)
    input_feed = {}
    for l in xrange(input_seq_len):
        input_feed[encoder_inputs[l].name] = sample_encoder_inputs[l]
    for l in xrange(output_seq_len):
        input_feed[decoder_inputs[l].name] = sample_decoder_inputs[l]
        input_feed[target_weights[l].name] = sample_target_weights[l]
    input_feed[decoder_inputs[output_seq_len].name] = np.zeros([len(sample_decoder_inputs[0])], dtype=np.int32)
    [loss_ret, _] = sess.run([loss, update], input_feed)
    if step % 10 == 0:
        print 'step=', step, 'loss=', loss_ret, 'learning_rate=', learning_rate.eval()
        if len(previous_losses) > 5 and loss_ret > max(previous_losses[-5:]):
            sess.run(learning_rate_decay_op)
        previous_losses.append(loss_ret)
        # persist the model
        saver.save(sess, './model/demo')

get_samples(train_set, 1000) fetches samples in batches, 1000 per call:

if batch_num >= len(train_set):
    batch_train_set = train_set
else:
    random_start = random.randint(0, len(train_set) - batch_num)
    batch_train_set = train_set[random_start:random_start + batch_num]
for sample in batch_train_set:
    raw_encoder_input.append([PAD_ID] * (input_seq_len - len(sample[0])) + sample[0])
    raw_decoder_input.append([GO_ID] + sample[1] + [PAD_ID] * (output_seq_len - len(sample[1]) - 1))

Each call picks 1000 contiguous samples starting at a random position in the full sample set.
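Assembled into a complete function, the batched get_samples might look like this sketch (the constants and the time-major packing loop are the ones already defined in demo.py):

import random

def get_samples(train_set, batch_num):
    """Randomly slice batch_num contiguous samples and pack them time-major."""
    if batch_num >= len(train_set):
        batch_train_set = train_set
    else:
        random_start = random.randint(0, len(train_set) - batch_num)
        batch_train_set = train_set[random_start:random_start + batch_num]

    raw_encoder_input = []
    raw_decoder_input = []
    for sample in batch_train_set:
        raw_encoder_input.append([PAD_ID] * (input_seq_len - len(sample[0])) + sample[0])
        raw_decoder_input.append([GO_ID] + sample[1] + [PAD_ID] * (output_seq_len - len(sample[1]) - 1))

    encoder_inputs = []
    decoder_inputs = []
    target_weights = []
    for length_idx in xrange(input_seq_len):
        encoder_inputs.append(np.array([e[length_idx] for e in raw_encoder_input], dtype=np.int32))
    for length_idx in xrange(output_seq_len):
        decoder_inputs.append(np.array([d[length_idx] for d in raw_decoder_input], dtype=np.int32))
        target_weights.append(np.array([
            0.0 if length_idx == output_seq_len - 1 or d[length_idx] == PAD_ID else 1.0
            for d in raw_decoder_input], dtype=np.float32))
    return encoder_inputs, decoder_inputs, target_weights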
When loading the sample vocabulary, also apply a minimum word-frequency limit:

def load_file_list(self, file_list, min_freq):
    ......
    for index, item in enumerate(sorted_list):
        word = item[1]
        if item[0] < min_freq:
            break
        self.word2id_dict[word] = self.START_ID + index
        self.id2word_dict[self.START_ID + index] = word
    return index

https://github.com/warmheartli/ChatBotCourse/tree/master/chatbotv5

References: 《Python 自然语言处理》; 《NLTK基础教程 用NLTK和Python库构建机器学习应用》; http://www.shareditor.com/blogshow?blogId=136 ; http://www.shareditor.com/blogshow?blogId=137
TensorFlow is a graph-based deep learning framework; internally, a session mediates between the graph and the computation kernels.

Basic TensorFlow math operations:

import tensorflow as tf

sess = tf.Session()
a = tf.placeholder("float")
b = tf.placeholder("float")
c = tf.constant(6.0)
d = tf.mul(a, b)
y = tf.mul(d, c)
print sess.run(y, feed_dict={a: 3, b: 3})

A = [[1.1, 2.3], [3.4, 4.1]]
Y = tf.matrix_inverse(A)
print sess.run(Y)
sess.close()

The main numeric operations:

tf.add  tf.sub  tf.mul  tf.div  tf.mod  tf.abs  tf.neg  tf.sign  tf.inv  tf.square
tf.round  tf.sqrt  tf.pow  tf.exp  tf.log  tf.maximum  tf.minimum  tf.cos  tf.sin

The main matrix operations:

tf.diag                 # build a diagonal matrix
tf.transpose
tf.matmul
tf.matrix_determinant   # determinant of a matrix
tf.matrix_inverse       # inverse of a matrix
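A quick sanity check of a few of these operations in a session (the names are from the pre-1.0 API used throughout this post; in TensorFlow 1.0+ tf.mul, tf.sub and tf.neg were renamed tf.multiply, tf.subtract and tf.negative):

import tensorflow as tf

sess = tf.Session()
x = tf.constant([[2.0, 0.0], [0.0, 4.0]])
print sess.run(tf.add(x, x))               # element-wise: [[4, 0], [0, 8]]
print sess.run(tf.matmul(x, x))            # matrix product: [[4, 0], [0, 16]]
print sess.run(tf.matrix_determinant(x))   # 8.0
print sess.run(tf.matrix_inverse(x))       # [[0.5, 0], [0, 0.25]]
sess.close()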
Using tensorboard. TensorFlow code first builds a graph and then executes it, which makes the intermediate steps inconvenient to debug, so the tensorboard tool is provided. Training prints that an event file is written to a directory (/tmp/tflearn_logs/11U8M4/). Run

tensorboard --logdir=/tmp/tflearn_logs/11U8M4/

and open http://192.168.1.101:6006 to see the tensorboard interface.

Graph and Session:

import tensorflow as tf

with tf.Graph().as_default() as g:
    with g.name_scope("myscope") as scope:
        # with this scope, op names below get a prefix like myscope/Placeholder
        sess = tf.Session(target='', graph=g, config=None)  # target is the tf execution engine to connect to
        print "graph version:", g.version  # 0
        a = tf.placeholder("float")
        print a.op  # prints the whole operation, same as what g.get_operations returns below
        print "graph version:", g.version  # 1
        b = tf.placeholder("float")
        print "graph version:", g.version  # 2
        c = tf.placeholder("float")
        print "graph version:", g.version  # 3
        y1 = tf.mul(a, b)  # can also be written a * b
        print "graph version:", g.version  # 4
        y2 = tf.mul(y1, c)  # can also be written y1 * c
        print "graph version:", g.version  # 5
        operations = g.get_operations()
        for (i, op) in enumerate(operations):
            print "============ operation", i + 1, "==========="
            print op  # a structure with name, op, attr, input etc., different per op
        assert y1.graph is g
        assert sess.graph is g
        print "================ graph object address ================"
        print sess.graph
        print "================ graph define ================"
        print sess.graph_def
        print "================ sess str ================"
        print sess.sess_str
        print sess.run(y1, feed_dict={a: 3, b: 3})  # 9.0; feed_dict maps graph elements to values
        print sess.run(fetches=[b, y1], feed_dict={a: 3, b: 3}, options=None, run_metadata=None)  # the return value has the same shape as fetches
        print sess.run({'ret_name': y1}, feed_dict={a: 3, b: 3})  # {'ret_name': 9.0}
        assert tf.get_default_session() is not sess
        with sess.as_default():  # make sess the default session, so tf.get_default_session() is sess
            assert tf.get_default_session() is sess
        h = sess.partial_run_setup([y1, y2], [a, b, c])  # staged running: declare the fetches and the feeds
        res = sess.partial_run(h, y1, feed_dict={a: 3, b: 4})  # 12; run the first stage
        res = sess.partial_run(h, y2, feed_dict={c: res})  # 144.0; run the second stage, using the first stage's result
        print "partial_run res:", res
        sess.close()

A TensorFlow Session is the go-between for the Graph and the executor. Session.run() serializes the graph, fetches and feed_dict into byte arrays and calls tf_session.TF_Run (see /usr/local/lib/python2.7/site-packages/tensorflow/python/client/session.py). tf_session.TF_Run calls into the dynamic library _pywrap_tensorflow.so through the _pywrap_tensorflow.TF_Run interface (see /usr/local/lib/python2.7/site-packages/tensorflow/python/pywrap_tensorflow.py). The dynamic library is TensorFlow's multi-language binding layer for Python: _pywrap_tensorflow.so and pywrap_tensorflow.py are generated automatically by the SWIG tool, TensorFlow's core being C/C++, with SWIG generating interfaces for the various scripting languages.

Linear regression in 10 key lines. Solving linear regression with gradient descent is the simplest introductory TensorFlow example (10 key lines):

# -*- coding: utf-8 -*-
import numpy as np
import tensorflow as tf

# randomly generate 1000 points around the line y = 0.1x + 0.3
num_points = 1000
vectors_set = []
for i in xrange(num_points):
    x1 = np.random.normal(0.0, 0.55)
    y1 = x1 * 0.1 + 0.3 + np.random.normal(0.0, 0.03)
    vectors_set.append([x1, y1])

# the samples
x_data = [v[0] for v in vectors_set]
y_data = [v[1] for v in vectors_set]

# 1-dimensional W, initialized uniformly in [-1, 1]
W = tf.Variable(tf.random_uniform([1], -1.0, 1.0), name='W')
# 1-dimensional b, initialized to 0
b = tf.Variable(tf.zeros([1]), name='b')
# the estimate y
y = W * x_data + b

# loss: mean squared error between the estimate y and the actual y_data
loss = tf.reduce_mean(tf.square(y - y_data), name='loss')
# optimize the parameters with gradient descent
optimizer = tf.train.GradientDescentOptimizer(0.5)
# training means minimizing this error value
train = optimizer.minimize(loss, name='train')

sess = tf.Session()
# print the graph structure
#print sess.graph_def
init = tf.initialize_all_variables()
sess.run(init)

# the initial W and b
print "W =", sess.run(W), "b =", sess.run(b), "loss =", sess.run(loss)
# run 20 training steps
for step in xrange(20):
    sess.run(train)
    # print the trained W and b
    print "W =", sess.run(W), "b =", sess.run(b), "loss =", sess.run(loss)

# write the summary file for tensorboard
writer = tf.train.SummaryWriter("./tmp", sess.graph)

One picture of how linear regression works. Running the code creates a local tmp directory with the data tensorboard reads; run

tensorboard --logdir=./tmp/

open http://localhost:6006/ , choose GRAPHS, and expand the key nodes. The picture is the graph the code generated: it describes the whole process of solving linear regression by gradient descent, each node standing for one step of the code.

Analyzing the linear-regression graph in detail: W and b. The code performs three kinds of operations on W: Assign, read and train. Assign assigns from random_uniform:

W = tf.Variable(tf.random_uniform([1], -1.0, 1.0), name='W')

(the tf.random_uniform subgraph). read corresponds to:

y = W * x_data + b

train corresponds to the gradient-descent training operations.

b likewise has three operations, Assign, read and train, with zeros as its initializer. Through gradient descent, update_W and update_b are computed and used to refresh W and b. Each of update_W and update_b is computed from three inputs: the learning rate learning_rate, the current value of W (or b), and the gradients. The crucial part is the gradient descent itself:

loss = tf.reduce_mean(tf.square(y - y_data), name='loss')

Taking y - y_data as input, the x in the gradient subgraph is not x_data but a temporary constant 2: 2(y - y_data), clearly the derivative of (y - y_data)^2. From 2(y - y_data), various processing steps finally produce the increment update_b for b; update_W for W traces back through add_grad (based on y - y_data), W and y. The detailed computation is explained at http://stackoverflow.com/questions/39580427/how-does-tensorflow-calculate-the-gradients-for-the-tf-train-gradientdescentopti . One simple operation is turned by TensorFlow into a graph of many nodes; the fine-grained nodes are not worth analyzing one by one, they merely express the operations as a graph.
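In formulas, what this graph computes is (N is the number of samples, and the learning rate here is $\eta = 0.5$):

$$loss = \frac{1}{N}\sum_{i=1}^{N}\big(W x_i + b - y_i\big)^2$$

$$\frac{\partial\, loss}{\partial W} = \frac{2}{N}\sum_{i=1}^{N}\big(W x_i + b - y_i\big)\,x_i, \qquad \frac{\partial\, loss}{\partial b} = \frac{2}{N}\sum_{i=1}^{N}\big(W x_i + b - y_i\big)$$

$$W \leftarrow W - \eta\,\frac{\partial\, loss}{\partial W}, \qquad b \leftarrow b - \eta\,\frac{\partial\, loss}{\partial b}$$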
TensorFlow's bundled seq2seq model is based on one-hot word embedding: each word is represented by a single number, which cannot express the relations between words. word2vec embeds words as multi-dimensional vectors that can express such relations. So let's implement a model with the seq2seq idea but multi-dimensional word vectors, which should give higher accuracy.

The seq2seq model: see the paper "Sequence to Sequence Learning with Neural Networks". The core idea: ABC is the input sentence, WXYZ the output sentence, EOS marks the end of a sentence, and the training unit is an LSTM, whose long short-term memory lets it determine the following words from several input words (on LSTM, see http://deeplearning.net/tutorial/lstm.html ). In the paper the encoder and decoder share one LSTM layer, sharing parameters; they are separated in https://github.com/farizrahman4u/seq2seq : green is the encoder, yellow the decoder, and the orange arrow carries the LSTM layer's state (the memory), the only state the encoder passes to the decoder. At each decoder time step the input is the previous step's output: fed the time steps "How are you <EOS>", the model outputs word by word "W I am fine <EOS>", where W is a special marker, the encoder's final output, acting as the decoder's trigger signal. During training we instead force the decoder's per-step inputs to be "W I am fine", carried over from the training input X, while Y remains the predicted output "W I am fine <EOS>"; a model trained this way is the encoder-decoder model. Using the trained model for prediction, decoding feeds each step's output back as the next step's input, and it outputs "W I am fine <EOS>".

Corpus preparation: at least 3 million chat exchanges for word-vector and seq2seq training; the richer the corpus, the better the trained word vectors. Segment the words:

python word_segment.py ./corpus.raw ./corpus.segment

Turn the segmented file into "|"-separated question/answer pairs:

cat ./corpus.segment | awk '{if(last!="")print last"|"$0;last=$0}' | sed 's/| /|/g' > ./corpus.segment.pair

Train the word vectors with Google word2vec:

word2vec -train ./corpus.segment -output vectors.bin -cbow 1 -size 200 -window 8 -negative 25 -hs 0 -sample 1e-5 -threads 20 -binary 1 -iter 15

corpus.raw is the raw corpus data; vectors.bin is the generated binary word-vector file. A loader for the binary word vectors is at https://github.com/warmheartli/ChatBotCourse/blob/master/word_vectors_loader.py .

Creating the model, with the tensorflow + tflearn libraries:

# First, declare variable space for the input samples. self.max_seq_len is the maximum
# number of words in a segmented sentence, self.word_vec_dim the word-vector dimension.
# The shape says: an unspecified number of samples, each holding at most max_seq_len*2
# words, each word a word_vec_dim-dimensional float vector. It is 2x max_seq_len because
# the training input X holds both the question sentence and the answer sentence
input_data = tflearn.input_data(shape=[None, self.max_seq_len*2, self.word_vec_dim], dtype=tf.float32, name="XY")
# slice off the first max_seq_len words of every sample, the question part,
# as the encoder input
encoder_inputs = tf.slice(input_data, [0, 0, 0], [-1, self.max_seq_len, self.word_vec_dim], name="enc_in")
# then take the following max_seq_len-1 words, the answer part, as the decoder input.
# Only max_seq_len-1 because a GO marker must be prepended to tell the decoder that
# decoding starts; go_inputs is prepended below to form the final decoder_inputs
decoder_inputs_tmp = tf.slice(input_data, [0, self.max_seq_len, 0], [-1, self.max_seq_len-1, self.word_vec_dim], name="dec_in_tmp")
go_inputs = tf.ones_like(decoder_inputs_tmp)
go_inputs = tf.slice(go_inputs, [0, 0, 0], [-1, 1, self.word_vec_dim])
decoder_inputs = tf.concat(1, [go_inputs, decoder_inputs_tmp], name="dec_in")
# encode: the returned encoder_output_tensor is expanded into the (?, 1, 200) shape
# that tflearn.regression recognizes; the returned states are passed to the decoder
(encoder_output_tensor, states) = tflearn.lstm(encoder_inputs, self.word_vec_dim, return_state=True, scope='encoder_lstm')
encoder_output_sequence = tf.pack([encoder_output_tensor], axis=1)
# take the first decoder input word, i.e. GO
first_dec_input = tf.slice(decoder_inputs, [0, 0, 0], [-1, 1, self.word_vec_dim])
# feed it into the decoder, whose initial state is the encoder's states;
# scope='decoder_lstm' so the same decoder can be reused below
decoder_output_tensor = tflearn.lstm(first_dec_input, self.word_vec_dim, initial_state=states, return_seq=False, reuse=False, scope='decoder_lstm')
# stash the decoder's first output in decoder_output_sequence_list to emit at the end
decoder_output_sequence_single = tf.pack([decoder_output_tensor], axis=1)
decoder_output_sequence_list = [decoder_output_tensor]
# loop max_seq_len-1 times, feeding the decoder_inputs word vectors one by one as the
# next decoder input and appending results to decoder_output_sequence_list;
# reuse=True with scope='decoder_lstm' means the very same lstm layer as the
# first decode step above
for i in range(self.max_seq_len-1):
    next_dec_input = tf.slice(decoder_inputs, [0, i+1, 0], [-1, 1, self.word_vec_dim])
    decoder_output_tensor = tflearn.lstm(next_dec_input, self.word_vec_dim, return_seq=False, reuse=True, scope='decoder_lstm')
    decoder_output_sequence_single = tf.pack([decoder_output_tensor], axis=1)
    decoder_output_sequence_list.append(decoder_output_tensor)
# concatenate the encoder's first output with all decoder outputs,
# as the input to tflearn.regression
decoder_output_sequence = tf.pack(decoder_output_sequence_list, axis=1)
real_output_sequence = tf.concat(1, [encoder_output_sequence, decoder_output_sequence])
net = tflearn.regression(real_output_sequence, optimizer='sgd', learning_rate=0.1, loss='mean_square')
model = tflearn.DNN(net)

The model is complete. To summarize the idea:
1) the training inputs X and Y are the encoder/decoder inputs and the predicted outputs;
2) X is split in half, the first half the encoder input, the second half the decoder input;
3) the encoder-decoder outputs are regression-trained against the true values Y;
4) training feeds the true values as decoder inputs; real prediction has no WXYZ part, so each step's output becomes the next step's input.

Training the model: instantiate it and feed in the data:

model = self.model()
model.fit(trainXY, trainY, n_epoch=1000, snapshot_epoch=False, batch_size=1)
model.load('./model/model')

trainXY and trainY are filled by loading the corpus. Load the word vectors into word_vector_dict, then read the corpus file, look each word up in word_vector_dict, and assign the vectors to question_seq and answer_seq:

def init_seq(input_file):
    """Read the word-segmented text file and load all word sequences."""
    file_object = open(input_file, 'r')
    vocab_dict = {}
    while True:
        question_seq = []
        answer_seq = []
        line = file_object.readline()
        if line:
            line_pair = line.split('|')
            line_question = line_pair[0]
            line_answer = line_pair[1]
            for word in line_question.decode('utf-8').split(' '):
                if word_vector_dict.has_key(word):
                    question_seq.append(word_vector_dict[word])
            for word in line_answer.decode('utf-8').split(' '):
                if word_vector_dict.has_key(word):
                    answer_seq.append(word_vector_dict[word])
        else:
            break
        question_seqs.append(question_seq)
        answer_seqs.append(answer_seq)
    file_object.close()
With question_seq and answer_seq, construct trainXY and trainY:

def generate_trainig_data(self):
    xy_data = []
    y_data = []
    for i in range(len(question_seqs)):
        question_seq = question_seqs[i]
        answer_seq = answer_seqs[i]
        if len(question_seq) < self.max_seq_len and len(answer_seq) < self.max_seq_len:
            sequence_xy = [np.zeros(self.word_vec_dim)] * (self.max_seq_len - len(question_seq)) + list(reversed(question_seq))
            sequence_y = answer_seq + [np.zeros(self.word_vec_dim)] * (self.max_seq_len - len(answer_seq))
            sequence_xy = sequence_xy + sequence_y
            sequence_y = [np.ones(self.word_vec_dim)] + sequence_y
            xy_data.append(sequence_xy)
            y_data.append(sequence_y)
    return np.array(xy_data), np.array(y_data)

Construct the training data, create the model, and train:

python my_seq2seq_v2.py train

which finally produces the model file ./model/model.

Prediction. With the trained model, input a sentence and predict the answer:

predict = model.predict(testXY)

There is only a question, no answer, so testXY has no Y part; each step's output is used as the next step's input:

for i in range(self.max_seq_len-1):
    # next_dec_input = tf.slice(decoder_inputs, [0, i+1, 0], [-1, 1, self.word_vec_dim]) becomes:
    next_dec_input = decoder_output_sequence_single
    decoder_output_tensor = tflearn.lstm(next_dec_input, self.word_vec_dim, return_seq=False, reuse=True, scope='decoder_lstm')
    decoder_output_sequence_single = tf.pack([decoder_output_tensor], axis=1)
    decoder_output_sequence_list.append(decoder_output_tensor)

Word vectors are multi-dimensional floats, so a predicted vector is matched to a word by cosine similarity:

def vector2word(vector):
    max_cos = -10000
    match_word = ''
    for word in word_vector_dict:
        v = word_vector_dict[word]
        cosine = vector_cosine(vector, v)
        if cosine > max_cos:
            max_cos = cosine
            match_word = word
    return (match_word, max_cos)

where vector_cosine is implemented as:

def vector_cosine(v1, v2):
    if len(v1) != len(v2):
        sys.exit(1)
    sqrtlen1 = vector_sqrtlen(v1)
    sqrtlen2 = vector_sqrtlen(v2)
    value = 0
    for item1, item2 in zip(v1, v2):
        value += item1 * item2
    return value / (sqrtlen1 * sqrtlen2)

def vector_sqrtlen(vector):
    len = 0
    for item in vector:
        len += item * item
    len = math.sqrt(len)
    return len

Predict:

python my_seq2seq_v2.py test test.data

The first output column is the word produced at each time step, the second the cosine similarity between the predicted vector and its nearest word vector, the third the Euclidean distance of the predicted vector. Since max_seq_len is a fixed 8, the output sequence ends with some extra words; set a threshold on the cosine similarity (or another metric) and truncate there. Full code: https://github.com/warmheartli/ChatBotCourse/blob/master/chatbotv2/my_seq2seq_v2.py .

References: 《Python 自然语言处理》; 《NLTK基础教程 用NLTK和Python库构建机器学习应用》; http://www.shareditor.com/blogshow?blogId=119 ; http://www.shareditor.com/blogshow?blogId=120 ; http://www.shareditor.com/blogshow?blogId=121
The most practical way to truly master an algorithm is to write it out completely by hand.

LSTM (Long Short-Term Memory) is a special recurrent neural network whose neurons keep a historical memory, solving the problem that statistical NLP methods can only consider the most recent n words and ignore anything older. Uses: word representation (embedding), sequence-to-sequence learning (predicting a sentence from an input sentence), machine translation, speech recognition and more.

A 100-odd-line LSTM-style binary adder in raw Python: https://iamtrask.github.io/2015/11/15/anyone-can-code-lstm/ (Chinese translation: http://blog.csdn.net/zzukun/article/details/49968129 ):

import copy, numpy as np
np.random.seed(0)

First import numpy for the matrix operations.

def sigmoid(x):
    output = 1/(1+np.exp(-x))
    return output

Declare the sigmoid activation function, a neural-network basic; common activation functions are sigmoid, tanh and relu, sigmoid having range [0, 1] and tanh range [-1, 1]. x is a vector and the returned output is a vector.

def sigmoid_output_to_derivative(output):
    return output*(1-output)

Declare the sigmoid derivative function (it takes the sigmoid output as its argument). The adder idea: binary addition adds bit by bit, remembering the carry when a position overflows. Training draws random c = a + b samples: feeding a and b and producing c is the whole LSTM prediction process, and training learns the transformation matrices and weights from the binary of a and b to c. A neural network.

int2binary = {}

Declare the integer-to-binary lookup dict, so conversions are stored once up front and read quickly instead of recomputed on the fly.

binary_dim = 8
largest_number = pow(2,binary_dim)

Declare the number of binary digits, 8; the bound on the representable integers is 2^8 = 256, largest_number.

binary = np.unpackbits(
    np.array([range(largest_number)],dtype=np.uint8).T,axis=1)
for i in range(largest_number):
    int2binary[i] = binary[i]

Pre-store the integer-to-binary conversion dict.

alpha = 0.1
input_dim = 2
hidden_dim = 16
output_dim = 1

Set the parameters: alpha is the learning rate; input_dim is the input-layer dimension, 2, because a and b are fed together; hidden_dim is the hidden-layer dimension, the number of hidden neurons; output_dim is the output dimension, 1, because a single c is produced. The input-to-hidden weight matrix is therefore 2x16, the hidden-to-output matrix 16x1, and the hidden-to-hidden matrix 16x16:

synapse_0 = 2*np.random.random((input_dim,hidden_dim)) - 1
synapse_1 = 2*np.random.random((hidden_dim,output_dim)) - 1
synapse_h = 2*np.random.random((hidden_dim,hidden_dim)) - 1

On 2x-1: np.random.random generates floats in [0, 1), and 2x-1 moves the range to [-1, 1].

synapse_0_update = np.zeros_like(synapse_0)
synapse_1_update = np.zeros_like(synapse_1)
synapse_h_update = np.zeros_like(synapse_h)

Declare the three matrix-update (Delta) buffers.

for j in range(10000):

Run 10000 iterations.

    a_int = np.random.randint(largest_number/2)
    a = int2binary[a_int]
    b_int = np.random.randint(largest_number/2)
    b = int2binary[b_int]
    c_int = a_int + b_int
    c = int2binary[c_int]

Randomly generate a sample: binary a, b, c with c = a + b; a_int, b_int, c_int are their integer forms.

    d = np.zeros_like(c)

d stores the model's prediction of c.

    overallError = 0

The global error, used to watch model quality.

    layer_2_deltas = list()

stores the residuals of layer 2 (the output layer); the output-layer residual formula is derived at http://deeplearning.stanford.edu/wiki/index.php/%E5%8F%8D%E5%90%91%E4%BC%A0%E5%AF%BC%E7%AE%97%E6%B3%95 .

    layer_1_values = list()
    layer_1_values.append(np.zeros(hidden_dim))

stores the layer-1 (hidden) outputs, appending a zero vector as the "previous" time step's value.

    for position in range(binary_dim):

iterate over the binary positions.

        X = np.array([[a[binary_dim - position - 1],b[binary_dim - position - 1]]])
        y = np.array([[c[binary_dim - position - 1]]]).T

X and y are the sample's input and output binary values at position; X holds two values per sample, a's and b's bit at that position. Each sample is split into its binary positions for training; binary addition, with its carry marker, is exactly suited to LSTM-style long short-term memory, and one sample's 8 binary positions form one time sequence.

        layer_1 = sigmoid(np.dot(X,synapse_0) + np.dot(layer_1_values[-1],synapse_h))

the formula C_t = sigma(W_0·X_t + W_h·C_{t-1}).

        layer_2 = sigmoid(np.dot(layer_1,synapse_1))

here the formula is C_2 = sigma(W_1·C_1).

        layer_2_error = y - layer_2

compute the error between prediction and truth.

        layer_2_deltas.append((layer_2_error)*sigmoid_output_to_derivative(layer_2))

backpropagation: compute the delta and append it to layer_2_deltas.

        overallError += np.abs(layer_2_error[0])

accumulate the total error, for display and observation.

        d[binary_dim - position - 1] = np.round(layer_2[0][0])

store the predicted output bit at position.

        layer_1_values.append(copy.deepcopy(layer_1))

store the intermediate hidden-layer value.

    future_layer_1_delta = np.zeros(hidden_dim)

stores the next time step's hidden-layer memory delta, initially zeros.

    for position in range(binary_dim):

iterate over the binary positions again.

        X = np.array([[a[position],b[position]]])

take the X value, starting from the highest position: backpropagation runs backwards through the time sequence, one step at a time.

        layer_1 = layer_1_values[-position-1]

take the hidden output for this position.

        prev_layer_1 = layer_1_values[-position-2]

take the hidden output of the previous time step.

        layer_2_delta = layer_2_deltas[-position-1]

take the output-layer delta for this position.

        layer_1_delta = (future_layer_1_delta.dot(synapse_h.T) + layer_2_delta.dot(synapse_1.T)) * sigmoid_output_to_derivative(layer_1)

the backpropagation formula, combining the next time step's hidden-layer delta with this step's output-layer delta.
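Written out, the two delta computations in this loop are the standard backpropagation-through-time formulas, with $\sigma'$ applied to the stored layer outputs (which is why sigmoid_output_to_derivative takes the output as its argument):

$$\sigma'(z) = z\,(1-z)$$

$$\delta_2(t) = \big(y(t) - \hat{y}(t)\big)\odot\sigma'\big(\mathrm{layer}_2(t)\big)$$

$$\delta_1(t) = \Big(\delta_1(t+1)\,W_h^{\top} + \delta_2(t)\,W_1^{\top}\Big)\odot\sigma'\big(\mathrm{layer}_1(t)\big)$$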
        synapse_1_update += np.atleast_2d(layer_1).T.dot(layer_2_delta)

accumulate the weight-matrix update: the gradient with respect to a weight matrix is this layer's output (transposed) times the next layer's delta.

        synapse_h_update += np.atleast_2d(prev_layer_1).T.dot(layer_1_delta)

the hidden-to-hidden weight update: the previous time step's hidden output times this step's delta.

        synapse_0_update += X.T.dot(layer_1_delta)

the input-layer weight update.

        future_layer_1_delta = layer_1_delta

remember this time step's hidden delta.

    synapse_0 += synapse_0_update * alpha
    synapse_1 += synapse_1_update * alpha
    synapse_h += synapse_h_update * alpha

apply the weight updates.

    synapse_0_update *= 0
    synapse_1_update *= 0
    synapse_h_update *= 0

zero the update buffers.

    if(j % 1000 == 0):
        print "Error:" + str(overallError)
        print "Pred:" + str(d)
        print "True:" + str(c)
        out = 0
        for index,x in enumerate(reversed(d)):
            out += x*pow(2,index)
        print str(a_int) + " + " + str(b_int) + " = " + str(out)
        print "------------"

print the total error every 1000 samples, to watch convergence while it runs. This is the simplest possible LSTM-style implementation: no bias variables, and only two layers of neurons.

A complete LSTM implementation in Python, following the "great intro paper" exactly. The code is from https://github.com/nicodjimenez/lstm , the author's explanation is at http://nicodjimenez.github.io/2014/08/08/lstm.html , and the pictures at http://colah.github.io/posts/2015-08-Understanding-LSTMs/ illustrate the details.

import random
import numpy as np
import math

def sigmoid(x):
    return 1. / (1 + np.exp(-x))

Declare the sigmoid function.

def rand_arr(a, b, *args):
    np.random.seed(0)
    return np.random.rand(*args) * (b - a) + a

Generate a random matrix with values in [a, b); the shape is given by args.

class LstmParam:
    def __init__(self, mem_cell_ct, x_dim):
        self.mem_cell_ct = mem_cell_ct
        self.x_dim = x_dim
        concat_len = x_dim + mem_cell_ct
        # weight matrices
        self.wg = rand_arr(-0.1, 0.1, mem_cell_ct, concat_len)
        self.wi = rand_arr(-0.1, 0.1, mem_cell_ct, concat_len)
        self.wf = rand_arr(-0.1, 0.1, mem_cell_ct, concat_len)
        self.wo = rand_arr(-0.1, 0.1, mem_cell_ct, concat_len)
        # bias terms
        self.bg = rand_arr(-0.1, 0.1, mem_cell_ct)
        self.bi = rand_arr(-0.1, 0.1, mem_cell_ct)
        self.bf = rand_arr(-0.1, 0.1, mem_cell_ct)
        self.bo = rand_arr(-0.1, 0.1, mem_cell_ct)
        # diffs (derivative of loss function w.r.t. all parameters)
        self.wg_diff = np.zeros((mem_cell_ct, concat_len))
        self.wi_diff = np.zeros((mem_cell_ct, concat_len))
        self.wf_diff = np.zeros((mem_cell_ct, concat_len))
        self.wo_diff = np.zeros((mem_cell_ct, concat_len))
        self.bg_diff = np.zeros(mem_cell_ct)
        self.bi_diff = np.zeros(mem_cell_ct)
        self.bf_diff = np.zeros(mem_cell_ct)
        self.bo_diff = np.zeros(mem_cell_ct)
LstmParam carries the parameters: mem_cell_ct is the number of LSTM memory cells, x_dim the input-data dimension, concat_len the sum of mem_cell_ct and x_dim. wg is the input node's weight matrix, wi the input gate's, wf the forget gate's, wo the output gate's; bg, bi, bf, bo are the biases of the input node, input gate, forget gate and output gate; wg_diff, wi_diff, wf_diff, wo_diff and bg_diff, bi_diff, bf_diff, bo_diff are the corresponding weight and bias gradients. Initialization sizes everything by the matrix dimensions and zeroes the diff matrices.

    def apply_diff(self, lr = 1):
        self.wg -= lr * self.wg_diff
        self.wi -= lr * self.wi_diff
        self.wf -= lr * self.wf_diff
        self.wo -= lr * self.wo_diff
        self.bg -= lr * self.bg_diff
        self.bi -= lr * self.bi_diff
        self.bf -= lr * self.bf_diff
        self.bo -= lr * self.bo_diff
        # reset diffs to zero
        self.wg_diff = np.zeros_like(self.wg)
        self.wi_diff = np.zeros_like(self.wi)
        self.wf_diff = np.zeros_like(self.wf)
        self.wo_diff = np.zeros_like(self.wo)
        self.bg_diff = np.zeros_like(self.bg)
        self.bi_diff = np.zeros_like(self.bi)
        self.bf_diff = np.zeros_like(self.bf)
        self.bo_diff = np.zeros_like(self.bo)

Defines the weight update: first subtract the gradients, then zero the diff matrices.

class LstmState:
    def __init__(self, mem_cell_ct, x_dim):
        self.g = np.zeros(mem_cell_ct)
        self.i = np.zeros(mem_cell_ct)
        self.f = np.zeros(mem_cell_ct)
        self.o = np.zeros(mem_cell_ct)
        self.s = np.zeros(mem_cell_ct)
        self.h = np.zeros(mem_cell_ct)
        self.bottom_diff_h = np.zeros_like(self.h)
        self.bottom_diff_s = np.zeros_like(self.s)
        self.bottom_diff_x = np.zeros(x_dim)

LstmState stores the LSTM cell's state: g, i, f, o, s, h, where s is the internal state matrix (the memory) and h the hidden output matrix.

class LstmNode:
    def __init__(self, lstm_param, lstm_state):
        # store reference to parameters and to activations
        self.state = lstm_state
        self.param = lstm_param
        # non-recurrent input to node
        self.x = None
        # non-recurrent input concatenated with recurrent input
        self.xc = None

LstmNode corresponds to one sample input; x is the input sample, and xc is x hstack-concatenated with the recurrent input (hstack concatenates horizontally, vstack vertically).

    def bottom_data_is(self, x, s_prev = None, h_prev = None):
        # if this is the first lstm node in the network
        if s_prev is None: s_prev = np.zeros_like(self.state.s)
        if h_prev is None: h_prev = np.zeros_like(self.state.h)
        # save data for use in backprop
        self.s_prev = s_prev
        self.h_prev = h_prev

        # concatenate x(t) and h(t-1)
        xc = np.hstack((x, h_prev))
        self.state.g = np.tanh(np.dot(self.param.wg, xc) + self.param.bg)
        self.state.i = sigmoid(np.dot(self.param.wi, xc) + self.param.bi)
        self.state.f = sigmoid(np.dot(self.param.wf, xc) + self.param.bf)
        self.state.o = sigmoid(np.dot(self.param.wo, xc) + self.param.bo)
        self.state.s = self.state.g * self.state.i + s_prev * self.state.f
        self.state.h = self.state.s * self.state.o
        self.x = x
        self.xc = xc

bottom and top are two directions: input samples enter at the bottom, and backpropagation flows from top to bottom. bottom_data_is is the input pass: concatenate x with the previous output, then compute g, i, f, o with the w·x + b pattern and the tanh and sigmoid activations. Each time step has four network layers (activation functions): leftmost the forget gate, acting directly on the memory C; second the input gate, which depends on the input sample and influences the memory C in a certain "proportion"; that proportion comes from the third layer (tanh), whose range [-1, 1] allows both positive and negative influence; and last the output gate, where each time step's output depends on the input sample x and the previous output as well as on the memory C. The design mimics the memory behaviour of a biological neuron.
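In formulas, bottom_data_is computes the familiar LSTM forward pass:

$$xc(t) = \big[x(t);\, h(t-1)\big]$$

$$g(t) = \tanh\big(W_g\,xc(t) + b_g\big)$$

$$i(t) = \sigma\big(W_i\,xc(t) + b_i\big),\quad f(t) = \sigma\big(W_f\,xc(t) + b_f\big),\quad o(t) = \sigma\big(W_o\,xc(t) + b_o\big)$$

$$s(t) = g(t)\odot i(t) + s(t-1)\odot f(t), \qquad h(t) = s(t)\odot o(t)$$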
    def top_diff_is(self, top_diff_h, top_diff_s):
        # notice that top_diff_s is carried along the constant error carousel
        ds = self.state.o * top_diff_h + top_diff_s
        do = self.state.s * top_diff_h
        di = self.state.g * ds
        dg = self.state.i * ds
        df = self.s_prev * ds

        # diffs w.r.t. vector inside sigma / tanh function
        di_input = (1. - self.state.i) * self.state.i * di
        df_input = (1. - self.state.f) * self.state.f * df
        do_input = (1. - self.state.o) * self.state.o * do
        dg_input = (1. - self.state.g ** 2) * dg

        # diffs w.r.t. inputs
        self.param.wi_diff += np.outer(di_input, self.xc)
        self.param.wf_diff += np.outer(df_input, self.xc)
        self.param.wo_diff += np.outer(do_input, self.xc)
        self.param.wg_diff += np.outer(dg_input, self.xc)
        self.param.bi_diff += di_input
        self.param.bf_diff += df_input
        self.param.bo_diff += do_input
        self.param.bg_diff += dg_input

        # compute bottom diff
        dxc = np.zeros_like(self.xc)
        dxc += np.dot(self.param.wi.T, di_input)
        dxc += np.dot(self.param.wf.T, df_input)
        dxc += np.dot(self.param.wo.T, do_input)
        dxc += np.dot(self.param.wg.T, dg_input)

        # save bottom diffs
        self.state.bottom_diff_s = ds * self.state.f
        self.state.bottom_diff_x = dxc[:self.param.x_dim]
        self.state.bottom_diff_h = dxc[self.param.x_dim:]

Backpropagation, the core of the whole training process. Suppose at time t the LSTM outputs the prediction h(t) while the true output is y(t); their difference is the loss, with loss function l(t) = f(h(t), y(t)) = ||h(t) - y(t)||^2, the Euclidean distance. The overall loss is L = Σ l(t) with t running from 1 to T, where T is the maximum length of the event sequence. The goal is to minimize L by gradient descent: find the optimal weights w such that L no longer changes when w changes slightly, the local optimum where the gradient dL/dw is 0. dL/dw says how much L changes per unit change of w; dh(t)/dw how much h(t) changes per unit change of w; dL/dh(t) how much L changes per unit change of h(t). So (dL/dh(t)) · (dh(t)/dw) is how much L changes when the i-th memory cell's w changes by one unit at time t, and summing over all i from 1 to M and all t from 1 to T gives the overall dL/dw.

For the i-th memory cell, the effect of a unit change of h(t), accumulated over the local losses l, is dL/dh(t); but h(t) only affects the local losses l from t through T. Define L(t) as the loss sum from t to T, L(t) = Σ_{s=t..T} l(s). What we need is the derivative of L with respect to h(t). Since L(t) = l(t) + L(t+1), we have dL(t)/dh(t) = dl(t)/dh(t) + dL(t+1)/dh(t): the current time step's derivative is obtained from the next time step's, computing the derivative at time T first and working backwards; at time T, dL(T)/dh(T) = dl(T)/dh(T).
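Collected into formulas, the recursion that y_list_is below implements is:

$$l(t) = \lVert h(t) - y(t)\rVert^2, \qquad L = \sum_{t=1}^{T} l(t), \qquad L(t) = \sum_{s=t}^{T} l(s)$$

$$L(t) = l(t) + L(t+1) \;\Rightarrow\; \frac{\partial L(t)}{\partial h(t)} = \frac{\partial l(t)}{\partial h(t)} + \frac{\partial L(t+1)}{\partial h(t)}, \qquad \text{seeded at } \frac{\partial L(T)}{\partial h(T)} = \frac{\partial l(T)}{\partial h(T)}$$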
class LstmNetwork():
    def __init__(self, lstm_param):
        self.lstm_param = lstm_param
        self.lstm_node_list = []
        # input sequence
        self.x_list = []

    def y_list_is(self, y_list, loss_layer):
        """
        Updates diffs by setting target sequence
        with corresponding loss layer.
        Will *NOT* update parameters.  To update parameters,
        call self.lstm_param.apply_diff()
        """
        assert len(y_list) == len(self.x_list)
        idx = len(self.x_list) - 1
        # first node only gets diffs from label ...
        loss = loss_layer.loss(self.lstm_node_list[idx].state.h, y_list[idx])
        diff_h = loss_layer.bottom_diff(self.lstm_node_list[idx].state.h, y_list[idx])
        # here s is not affecting loss due to h(t+1), hence we set equal to zero
        diff_s = np.zeros(self.lstm_param.mem_cell_ct)
        self.lstm_node_list[idx].top_diff_is(diff_h, diff_s)
        idx -= 1

        ### ... following nodes also get diffs from next nodes, hence we add diffs to diff_h
        ### we also propagate error along constant error carousel using diff_s
        while idx >= 0:
            loss += loss_layer.loss(self.lstm_node_list[idx].state.h, y_list[idx])
            diff_h = loss_layer.bottom_diff(self.lstm_node_list[idx].state.h, y_list[idx])
            diff_h += self.lstm_node_list[idx + 1].state.bottom_diff_h
            diff_s = self.lstm_node_list[idx + 1].state.bottom_diff_s
            self.lstm_node_list[idx].top_diff_is(diff_h, diff_s)
            idx -= 1
        return loss

diff_h (the numeric dL(t)/dh(t): how much the loss L changes per unit change of the prediction) is computed with idx running from T back to 1: loss_layer.bottom_diff plus the next time step's bottom_diff_h (the first pass, at T, adds no bottom_diff_h). loss_layer.bottom_diff:

def bottom_diff(self, pred, label):
    diff = np.zeros_like(pred)
    diff[0] = 2 * (pred[0] - label)
    return diff

which is the derivative l'(t) = 2(h(t) - y(t)) of l(t) = ||h(t) - y(t)||^2.

When s(t) changes, L(t) changes because s(t) affects both h(t) and h(t+1), and both affect L(t); h(t+1) does not affect l(t). One part is (dL(t)/dh(t)) · (dh(t)/ds(t)), and the rest is carried back from t+1, so dL(t)/ds(t) is reconstructed level by level from t+1 down to t. In the cell, self.state.h = self.state.s * self.state.o, i.e. h(t) = s(t) ⊙ o(t), so dh(t)/ds(t) = o(t), while dL(t)/dh(t) is top_diff_h.

On top_diff_is: "Bottom means input to the layer, top means output of the layer. Caffe also uses this terminology." So bottom is the layer's input, top the layer's output, the same notion as in caffe. In def top_diff_is(self, top_diff_h, top_diff_s), top_diff_h is the current time step's dL(t)/dh(t), and top_diff_s is the error carried back from the t+1 memory cell, the carousel part of dL(t)/ds(t).

ds = self.state.o * top_diff_h + top_diff_s
do = self.state.s * top_diff_h
di = self.state.g * ds
dg = self.state.i * ds
df = self.s_prev * ds

The d prefix means the derivative of the loss L with respect to that quantity. ds computes the current time step's dL(t)/ds(t) by the formula just derived. do computes dL(t)/do(t): from h(t) = s(t) ⊙ o(t), dh(t)/do(t) = s(t), so dL(t)/do(t) = (dL(t)/dh(t)) · (dh(t)/do(t)) = top_diff_h ⊙ s(t). di computes dL(t)/di(t): from s(t) = f(t) ⊙ s(t-1) + i(t) ⊙ g(t), dL(t)/di(t) = (dL(t)/ds(t)) · (ds(t)/di(t)) = ds ⊙ g(t). dg computes dL(t)/dg(t) = (dL(t)/ds(t)) · (ds(t)/dg(t)) = ds ⊙ i(t). df computes dL(t)/df(t) = (dL(t)/ds(t)) · (ds(t)/df(t)) = ds ⊙ s(t-1).

di_input = (1. - self.state.i) * self.state.i * di
df_input = (1. - self.state.f) * self.state.f * df
do_input = (1. - self.state.o) * self.state.o * do
dg_input = (1. - self.state.g ** 2) * dg

These use the sigmoid derivative σ'(z) = σ(z)(1 - σ(z)) and the tanh derivative tanh'(z) = 1 - tanh(z)^2. In di_input, (1. - self.state.i) * self.state.i is the sigmoid derivative, how much the i neuron's output changes per unit change of its input; multiplying by di gives how much the loss L(t) changes per unit change of the i neuron's input, dL(t)/d i_input(t).

self.param.wi_diff += np.outer(di_input, self.xc)
self.param.wf_diff += np.outer(df_input, self.xc)
self.param.wo_diff += np.outer(do_input, self.xc)
self.param.wg_diff += np.outer(dg_input, self.xc)
self.param.bi_diff += di_input
self.param.bf_diff += df_input
self.param.bo_diff += do_input
self.param.bg_diff += dg_input

w_diff accumulates the weight-matrix gradients and b_diff the bias gradients, used later for the update.

dxc = np.zeros_like(self.xc)
dxc += np.dot(self.param.wi.T, di_input)
dxc += np.dot(self.param.wf.T, df_input)
dxc += np.dot(self.param.wo.T, do_input)
dxc += np.dot(self.param.wg.T, dg_input)

accumulates the diff of the input xc: x acts in four places, so the four diffs are summed into the x diff.

self.state.bottom_diff_s = ds * self.state.f
self.state.bottom_diff_x = dxc[:self.param.x_dim]
self.state.bottom_diff_h = dxc[self.param.x_dim:]

bottom_diff_s expresses that a unit change of s at t-1 changes s at t by a factor of f. dxc is the horizontal concatenation of the x and h diffs, so the two parts are split out as bottom_diff_x and bottom_diff_h.

def x_list_clear(self):
    self.x_list = []

def x_list_add(self, x):
    self.x_list.append(x)
    if len(self.x_list) > len(self.lstm_node_list):
        # need to add new lstm node, create new state mem
        lstm_state = LstmState(self.lstm_param.mem_cell_ct, self.lstm_param.x_dim)
        self.lstm_node_list.append(LstmNode(self.lstm_param, lstm_state))

    # get index of most recent x input
    idx = len(self.x_list) - 1
    if idx == 0:
        # no recurrent inputs yet
        self.lstm_node_list[idx].bottom_data_is(x)
    else:
        s_prev = self.lstm_node_list[idx - 1].state.s
        h_prev = self.lstm_node_list[idx - 1].state.h
        self.lstm_node_list[idx].bottom_data_is(x, s_prev, h_prev)

Adds a training sample, the input x data.

def example_0():
    # learns to repeat simple sequence from random inputs
    np.random.seed(0)

    # parameters for input data dimension and lstm cell count
    mem_cell_ct = 100
    x_dim = 50
    concat_len = x_dim + mem_cell_ct
    lstm_param = LstmParam(mem_cell_ct, x_dim)
    lstm_net = LstmNetwork(lstm_param)
    y_list = [-0.5, 0.2, 0.1, -0.5]
    input_val_arr = [np.random.random(x_dim) for _ in y_list]

    for cur_iter in range(100):
        print "cur iter: ", cur_iter
        for ind in range(len(y_list)):
            lstm_net.x_list_add(input_val_arr[ind])
            print "y_pred[%d] : %f" % (ind, lstm_net.lstm_node_list[ind].state.h[0])

        loss = lstm_net.y_list_is(y_list, ToyLossLayer)
        print "loss: ", loss
        lstm_param.apply_diff(lr=0.1)
        lstm_net.x_list_clear()

This initializes LstmParam with 100 memory cells and input dimension x_dim = 50, initializes the LstmNetwork training model, generates 4 groups of 50 random numbers, and trains with [-0.5, 0.2, 0.1, -0.5] as the four target y values, each add feeding 50 random numbers and one y, for 100 iterations.

Now a small test: have the LSTM take a run of consecutive primes and estimate the next one. Generate the primes below 100, repeatedly take a window of 50 of them as x and the 51st as y, collect 10 such samples, and train 10000 iterations; the mean squared error falls from 0.17973 to 1.05172e-06, almost exactly right:
import numpy as np
import sys

from lstm import LstmParam, LstmNetwork

class ToyLossLayer:
    """
    Computes square loss with first element of hidden layer array.
    """
    @classmethod
    def loss(self, pred, label):
        return (pred[0] - label) ** 2

    @classmethod
    def bottom_diff(self, pred, label):
        diff = np.zeros_like(pred)
        diff[0] = 2 * (pred[0] - label)
        return diff

class Primes:
    def __init__(self):
        self.primes = list()
        for i in range(2, 100):
            is_prime = True
            for j in range(2, i-1):
                if i % j == 0:
                    is_prime = False
            if is_prime:
                self.primes.append(i)
        self.primes_count = len(self.primes)

    def get_sample(self, x_dim, y_dim, index):
        result = np.zeros((x_dim+y_dim))
        for i in range(index, index + x_dim + y_dim):
            result[i-index] = self.primes[i%self.primes_count]/100.0
        return result

def example_0():
    mem_cell_ct = 100
    x_dim = 50
    concat_len = x_dim + mem_cell_ct
    lstm_param = LstmParam(mem_cell_ct, x_dim)
    lstm_net = LstmNetwork(lstm_param)

    primes = Primes()
    x_list = []
    y_list = []
    for i in range(0, 10):
        sample = primes.get_sample(x_dim, 1, i)
        x = sample[0:x_dim]
        y = sample[x_dim:x_dim+1].tolist()[0]
        x_list.append(x)
        y_list.append(y)

    for cur_iter in range(10000):
        if cur_iter % 1000 == 0:
            print "y_list=", y_list
        for ind in range(len(y_list)):
            lstm_net.x_list_add(x_list[ind])
            if cur_iter % 1000 == 0:
                print "y_pred[%d] : %f" % (ind, lstm_net.lstm_node_list[ind].state.h[0])

        loss = lstm_net.y_list_is(y_list, ToyLossLayer)
        if cur_iter % 1000 == 0:
            print "loss: ", loss
        lstm_param.apply_diff(lr=0.01)
        lstm_net.x_list_clear()

if __name__ == "__main__":
    example_0()

The primes are all divided by 100 because this code requires the training values to be below 1.

torch is a deep learning framework. 1) tensorflow, pushed by Google, currently the hottest, suited to both small experiments and large computations, Python-based; drawbacks: relatively hard to pick up, average speed. 2) torch, pushed by Facebook, used for small experiments, with many open-source applications, Lua-based and quick to pick up, well documented online; drawback: Lua is a relatively niche language. 3) mxnet, pushed by Amazon, mainly for large computations, based on Python and R; drawback: fewer open-source projects. 4) caffe, from Berkeley (BVLC), used for large computations, based on C++ and Python; drawback: development is not very convenient. 5) theano, average speed, Python-based, very well regarded.

There are quite a few LSTM implementations for torch on github.

Installing torch on a mac: https://github.com/torch/torch7/wiki/Cheatsheet#installing-and-running-torch .

git clone https://github.com/torch/distro.git ~/torch --recursive
cd ~/torch; bash install-deps; ./install.sh

If qt fails to install, install it separately:

brew install cartr/qt4/qt

After installation, add this line to ~/.bash_profile by hand:

. ~/torch/install/bin/torch-activate

After source ~/.bash_profile, run th to use torch. To install itorch, first install the dependencies:

brew install zeromq
brew install openssl
luarocks install luacrypto OPENSSL_DIR=/usr/local/opt/openssl/

then:

git clone https://github.com/facebook/iTorch.git
cd iTorch
luarocks make

Image recognition with a convolutional neural network. Create pattern_recognition.lua:
    trainset.data[{ {}, {i}, {}, {} }]:div(stdv[i]) -- std scaling
end

net = nn.Sequential()
net:add(nn.SpatialConvolution(3, 6, 5, 5)) -- 3 input image channels, 6 output channels, 5x5 convolution kernel
net:add(nn.ReLU()) -- non-linearity
net:add(nn.SpatialMaxPooling(2,2,2,2)) -- A max-pooling operation that looks at 2x2 windows and finds the max.
net:add(nn.SpatialConvolution(6, 16, 5, 5))
net:add(nn.ReLU()) -- non-linearity
net:add(nn.SpatialMaxPooling(2,2,2,2))
net:add(nn.View(16*5*5)) -- reshapes from a 3D tensor of 16x5x5 into 1D tensor of 16*5*5
net:add(nn.Linear(16*5*5, 120)) -- fully connected layer (matrix multiplication between input and weights)
net:add(nn.ReLU()) -- non-linearity
net:add(nn.Linear(120, 84))
net:add(nn.ReLU()) -- non-linearity
net:add(nn.Linear(84, 10)) -- 10 is the number of outputs of the network (in this case, 10 digits)
net:add(nn.LogSoftMax()) -- converts the output to a log-probability. Useful for classification problems

criterion = nn.ClassNLLCriterion()
trainer = nn.StochasticGradient(net, criterion)
trainer.learningRate = 0.001
trainer.maxIteration = 5
trainer:train(trainset)

testset.data = testset.data:double() -- convert from Byte tensor to Double tensor
for i=1,3 do -- over each image channel
    testset.data[{ {}, {i}, {}, {} }]:add(-mean[i]) -- mean subtraction
    testset.data[{ {}, {i}, {}, {} }]:div(stdv[i]) -- std scaling
end

predicted = net:forward(testset.data[100])
print(classes[testset.label[100]])
print(predicted:exp())
for i=1,predicted:size(1) do
    print(classes[i], predicted[i])
end

correct = 0
for i=1,10000 do
    local groundtruth = testset.label[i]
    local prediction = net:forward(testset.data[i])
    local confidences, indices = torch.sort(prediction, true) -- true means sort in descending order
    if groundtruth == indices[1] then
        correct = correct + 1
    end
end
print(correct, 100*correct/10000 .. ' % ')

class_performance = {0, 0, 0, 0, 0, 0, 0, 0, 0, 0}
for i=1,10000 do
    local groundtruth = testset.label[i]
    local prediction = net:forward(testset.data[i])
    local confidences, indices = torch.sort(prediction, true) -- true means sort in descending order
    if groundtruth == indices[1] then
        class_performance[groundtruth] = class_performance[groundtruth] + 1
    end
end
for i=1,#classes do
    print(classes[i], 100*class_performance[i]/1000 .. ' %')
end
执行 th pattern_recognition.lua。

首先下载cifar10torchsmall.zip样本,其中有50000张训练用图片、10000张测试用图片,都已标注,包括airplane、automobile等10种分类;对trainset绑定__index和size方法,以兼容nn.Sequential使用(绑定函数可看lua教程:http://tylerneylon.com/a/learn-lua/ );再对trainset数据做归一化,转成均值为0、方差为1的double类型张量。然后初始化卷积神经网络模型,包括两层卷积、两层池化、一个全连接以及一个softmax层,进行训练,学习率为0.001,迭代5次。模型训练好后对测试集第100号图片做预测,并打印出整体正确率以及每种分类的准确率。教程参考:https://github.com/soumith/cvpr2015/blob/master/Deep%20Learning%20with%20Torch.ipynb 。

torch可以方便地支持gpu计算,但需要对代码做一些修改。

比较流行的seq2seq基本都用lstm组成编码器解码器模型实现,开源实现大都基于one-hot embedding(不如词向量表达的信息量大)。下面尝试把word2vec词向量用进来,先做一个只有一个lstm单元的机器人。

下载《甄嬛传》小说原文。上网随便百度"甄嬛传 txt",下载下来,把文件转码成utf-8编码,并把windows回车符都替换成\n,以便后续处理。

对甄嬛传切词。切词工具word_segment.py到github下载,地址在 https://github.com/warmheartli/ChatBotCourse/blob/master/word_segment.py 。

python ./word_segment.py zhenhuanzhuan.txt zhenhuanzhuan.segment

生成词向量。用word2vec,word2vec源码在 https://github.com/warmheartli/ChatBotCourse/tree/master/word2vec ,make编译即可执行。

./word2vec -train ./zhenhuanzhuan.segment -output vectors.bin -cbow 1 -size 200 -window 8 -negative 25 -hs 0 -sample 1e-4 -threads 20 -binary 1 -iter 15

生成一个vectors.bin文件,这就是基于甄嬛传原文生成的词向量文件。

训练代码(Python 2):

# -*- coding: utf-8 -*-
import sys
import math
import tflearn
import chardet
import numpy as np
import struct

seq = []
max_w = 50
float_size = 4
word_vector_dict = {}

def load_vectors(input):
    """从vectors.bin加载词向量,返回一个word_vector_dict的词典,key是词,value是200维的向量
    """
    print "begin load vectors"
    input_file = open(input, "rb")

    # 获取词表数目及向量维度
    words_and_size = input_file.readline()
    words_and_size = words_and_size.strip()
    words = long(words_and_size.split(' ')[0])
    size = long(words_and_size.split(' ')[1])
    print "words =", words
    print "size =", size

    for b in range(0, words):
        a = 0
        word = ''
        # 读取一个词
        while True:
            c = input_file.read(1)
            word = word + c
            if False == c or c == ' ':
                break
            if a < max_w and c != '\n':
                a = a + 1
        word = word.strip()

        vector = []
        for index in range(0, size):
            m = input_file.read(float_size)
            (weight,) = struct.unpack('f', m)
            vector.append(weight)

        # 将词及其对应的向量存到dict中
        word_vector_dict[word.decode('utf-8')] = vector

    input_file.close()
    print "load vectors finish"

def init_seq():
    """读取切好词的文本文件,加载全部词序列
    """
    file_object = open('zhenhuanzhuan.segment', 'r')
    vocab_dict = {}
    while True:
        line = file_object.readline()
        if line:
            for word in line.decode('utf-8').split(' '):
                if word_vector_dict.has_key(word):
                    seq.append(word_vector_dict[word])
        else:
            break
    file_object.close()

def vector_sqrtlen(vector):
    len = 0
    for item in vector:
        len += item * item
    len = math.sqrt(len)
    return len

def vector_cosine(v1, v2):
    if len(v1) != len(v2):
        sys.exit(1)
    sqrtlen1 = vector_sqrtlen(v1)
    sqrtlen2 = vector_sqrtlen(v2)
    value = 0
    for item1, item2 in zip(v1, v2):
        value += item1 * item2
    return value / (sqrtlen1*sqrtlen2)

def vector2word(vector):
    max_cos = -10000
    match_word = ''
    for word in word_vector_dict:
        v = word_vector_dict[word]
        cosine = vector_cosine(vector, v)
        if cosine > max_cos:
            max_cos = cosine
            match_word = word
    return (match_word, max_cos)

def main():
    load_vectors("./vectors.bin")
    init_seq()
    xlist = []
    ylist = []
    test_X = None
    #for i in range(len(seq)-100):
    for i in range(10):
        sequence = seq[i:i+20]
        xlist.append(sequence)
        ylist.append(seq[i+20])
        if test_X is None:
            test_X = np.array(sequence)
            (match_word, max_cos) = vector2word(seq[i+20])
            print "right answer=", match_word, max_cos

    X = np.array(xlist)
    Y = np.array(ylist)
    net = tflearn.input_data([None, 20, 200])
    net = tflearn.lstm(net, 200)
    net = tflearn.fully_connected(net, 200, activation='linear')
    net = tflearn.regression(net, optimizer='sgd', learning_rate=0.1, loss='mean_square')
    model = tflearn.DNN(net)
    model.fit(X, Y, n_epoch=500, batch_size=10, snapshot_epoch=False, show_metric=True)
    model.save("model")
    predict = model.predict([test_X])
    #print predict
    #for v in test_X:
    #    print vector2word(v)
    (match_word, max_cos) = vector2word(predict[0])
    print "predict=", match_word, max_cos

main()

load_vectors从vectors.bin加载词向量,init_seq加载甄嬛传切词文本并存到一个序列里,vector2word求距离某向量最近的词,模型只有一个lstm单元。经过500个epoch训练,均方损失降到0.33673,以0.941794432002的余弦相似度预测出下一个字。如果有强大的gpu,可以调大参数、用整篇文章训练,并修改predict部分,不断把预测出的向量当作输入接着预测下一个字,就能自动吐出"甄嬛体"(见下面的示意代码)。
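一个假设性的连续生成草图,沿用上面脚本里的model、seq、vector2word等名字,思路是滑动窗口:把每次预测出的向量接到输入末尾再预测,仅为示意,未实测:

# 以前20个词向量作种子,反复predict,连续生成文本
window = [np.array(v) for v in seq[:20]]
for step in range(50):
    pred = model.predict([np.array(window)])[0]
    (match_word, max_cos) = vector2word(pred)
    print "%s %f" % (match_word, max_cos)
    window = window[1:] + [np.array(pred)]  # 丢掉最旧的词向量,接上新预测的向量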
这里基于tflearn实现;tflearn官方文档examples里的seq2seq直接调用tensorflow中的tensorflow/python/ops/seq2seq.py,基于one-hot embedding方法,效果通常不如词向量表达好。

参考资料:
《Python 自然语言处理》
http://www.shareditor.com/blogshow?blogId=116
http://www.shareditor.com/blogshow?blogId=117
http://www.shareditor.com/blogshow?blogId=118

欢迎推荐上海机器学习工作机会,我的微信:qingxingfengzi
影视剧字幕聊天语料库特点,把影视剧说话内容一句一句以回车换行罗列三千多万条中国话,相邻第二句很可能是第一句最好回答。一个问句有很多种回答,可以根据相关程度以及历史聊天记录所有回答排序,找到最优,是一个搜索排序过程。 lucene+ik。lucene开源免费搜索引擎库,java语言开发。ik IKAnalyzer,开源中文切词工具。语料库切词建索引,文本搜索做文本相关性检索,把下一句取出作答案候选集,答案排序,问题分析。 建索引。eclipse创建maven工程,maven自动生成pom.xml文件,配置包依赖信息,dependencies标签中添加依赖: <dependency> <groupId>org.apache.lucene</groupId> <artifactId>lucene-core</artifactId> <version>4.10.4</version> </dependency> <dependency> <groupId>org.apache.lucene</groupId> <artifactId>lucene-queryparser</artifactId> <version>4.10.4</version> </dependency> <dependency> <groupId>org.apache.lucene</groupId> <artifactId>lucene-analyzers-common</artifactId> <version>4.10.4</version> </dependency> <dependency> <groupId>io.netty</groupId> <artifactId>netty-all</artifactId> <version>5.0.0.Alpha2</version> </dependency> <dependency> <groupId>com.alibaba</groupId> <artifactId>fastjson</artifactId> <version>1.1.41</version> </dependency> project标签增加配置,依赖jar包自动拷贝lib目录: <build> <plugins> <plugin> <groupId>org.apache.maven.plugins</groupId> <artifactId>maven-dependency-plugin</artifactId> <executions> <execution> <id>copy-dependencies</id> <phase>prepare-package</phase> <goals> <goal>copy-dependencies</goal> </goals> <configuration> <outputDirectory>${project.build.directory}/lib</outputDirectory> <overWriteReleases>false</overWriteReleases> <overWriteSnapshots>false</overWriteSnapshots> <overWriteIfNewer>true</overWriteIfNewer> </configuration> </execution> </executions> </plugin> <plugin> <groupId>org.apache.maven.plugins</groupId> <artifactId>maven-jar-plugin</artifactId> <configuration> <archive> <manifest> <addClasspath>true</addClasspath> <classpathPrefix>lib/</classpathPrefix> <mainClass>theMainClass</mainClass> </manifest> </archive> </configuration> </plugin> </plugins> </build> https://storage.googleapis.com/google-code-archive-downloads/v2/code.google.com/ik-analyzer/IK%20Analyzer%202012FF_hf1_source.rar 下载ik源代码把src/org目录拷到chatbotv1工程src/main/java下,刷新maven工程。 com.shareditor.chatbotv1包下maven自动生成App.java,改成Indexer.java: Analyzer analyzer = new IKAnalyzer(true); IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_4_9, analyzer); iwc.setOpenMode(OpenMode.CREATE); iwc.setUseCompoundFile(true); IndexWriter indexWriter = new IndexWriter(FSDirectory.open(new File(indexPath)), iwc); BufferedReader br = new BufferedReader(new InputStreamReader( new FileInputStream(corpusPath), "UTF-8")); String line = ""; String last = ""; long lineNum = 0; while ((line = br.readLine()) != null) { line = line.trim(); if (0 == line.length()) { continue; } if (!last.equals("")) { Document doc = new Document(); doc.add(new TextField("question", last, Store.YES)); doc.add(new StoredField("answer", line)); indexWriter.addDocument(doc); } last = line; lineNum++; if (lineNum % 100000 == 0) { System.out.println("add doc " + lineNum); } } br.close(); indexWriter.forceMerge(1); indexWriter.close(); 编译拷贝src/main/resources所有文件到target目录,target目录执行 java -cp $CLASSPATH:./lib/:./chatbotv1-0.0.1-SNAPSHOT.jar com.shareditor.chatbotv1.Indexer ../../subtitle/raw_subtitles/subtitle.corpus ./index 生成索引目录index通过lukeall-4.9.0.jar查看。 检索服务。netty创建http服务server,代码在https://github.com/warmheartli/ChatBotCourse的chatbotv1目录: Analyzer analyzer = new IKAnalyzer(true); QueryParser qp = new QueryParser(Version.LUCENE_4_9, "question", analyzer); if (topDocs.totalHits == 0) { qp.setDefaultOperator(Operator.AND); query = qp.parse(q); System.out.println(query.toString()); indexSearcher.search(query, collector); topDocs = collector.topDocs(); } if (topDocs.totalHits == 
0) { qp.setDefaultOperator(Operator.OR); query = qp.parse(q); System.out.println(query.toString()); indexSearcher.search(query, collector); topDocs = collector.topDocs(); } ret.put("total", topDocs.totalHits); ret.put("q", q); JSONArray result = new JSONArray(); for (ScoreDoc d : topDocs.scoreDocs) { Document doc = indexSearcher.doc(d.doc); String question = doc.get("question"); String answer = doc.get("answer"); JSONObject item = new JSONObject(); item.put("question", question); item.put("answer", answer); item.put("score", d.score); item.put("doc", d.doc); result.add(item); } ret.put("result", result); 查询索引,query词做切词拼lucene query,检索索引question字段,匹配返回answer字段值作候选集,挑出候选集一条作答案。server通过http访问,如http://127.0.0.1:8765/?q=hello 。中文需转urlcode发送,java端读取按urlcode解析,server启动方法: java -cp $CLASSPATH:./lib/:./chatbotv1-0.0.1-SNAPSHOT.jar com.shareditor.chatbotv1.Searcher 聊天界面。一个展示聊天内容框框,选择ckeditor,支持html格式内容展示,一个输入框和发送按钮,html代码: <div class="col-sm-4 col-xs-10"> <div class="row"> <textarea id="chatarea"> <div style='color: blue; text-align: left; padding: 5px;'>机器人: 喂,大哥您好,您终于肯跟我聊天了,来侃侃呗,我来者不拒!</div> <div style='color: blue; text-align: left; padding: 5px;'>机器人: 啥?你问我怎么这么聪明会聊天?因为我刚刚吃了一堆影视剧字幕!</div> </textarea> </div> <br /> <div class="row"> <div class="input-group"> <input type="text" id="input" class="form-control" autofocus="autofocus" onkeydown="submitByEnter()" /> <span class="input-group-btn"> <button class="btn btn-default" type="button" onclick="submit()">发送</button> </span> </div> </div> </div> <script type="text/javascript"> CKEDITOR.replace('chatarea', { readOnly: true, toolbar: ['Source'], height: 500, removePlugins: 'elementspath', resize_enabled: false, allowedContent: true }); </script> 调用聊天server,要一个发送请求获取结果控制器: public function queryAction(Request $request) { $q = $request->get('input'); $opts = array( 'http'=>array( 'method'=>"GET", 'timeout'=>60, ) ); $context = stream_context_create($opts); $clientIp = $request->getClientIp(); $response = file_get_contents('http://127.0.0.1:8765/?q=' . urlencode($q) . '&clientIp=' . 
$clientIp, false, $context); $res = json_decode($response, true); $total = $res['total']; $result = ''; if ($total > 0) { $result = $res['result'][0]['answer']; } return new Response($result); } 控制器路由配置: chatbot_query: path: /chatbot/query defaults: { _controller: AppBundle:ChatBot:query } 聊天server响应时间比较长,不导致web界面卡住,执行submit时异步发请求和收结果: var xmlHttp; function submit() { if (window.ActiveXObject) { xmlHttp = new ActiveXObject("Microsoft.XMLHTTP"); } else if (window.XMLHttpRequest) { xmlHttp = new XMLHttpRequest(); } var input = $("#input").val().trim(); if (input == '') { jQuery('#input').val(''); return; } addText(input, false); jQuery('#input').val(''); var datastr = "input=" + input; datastr = encodeURI(datastr); var url = "/chatbot/query"; xmlHttp.open("POST", url, true); xmlHttp.onreadystatechange = callback; xmlHttp.setRequestHeader("Content-type", "application/x-www-form-urlencoded"); xmlHttp.send(datastr); } function callback() { if (xmlHttp.readyState == 4 && xmlHttp.status == 200) { var responseText = xmlHttp.responseText; addText(responseText, true); } } addText往ckeditor添加一段文本: function addText(text, is_response) { var oldText = CKEDITOR.instances.chatarea.getData(); var prefix = ''; if (is_response) { prefix = "<div style='color: blue; text-align: left; padding: 5px;'>机器人: " } else { prefix = "<div style='color: darkgreen; text-align: right; padding: 5px;'>我: " } CKEDITOR.instances.chatarea.setData(oldText + "" + prefix + text + "</div>"); } 代码:https://github.com/warmheartli/ChatBotCoursehttps://github.com/warmheartli/shareditor.com 效果演示:http://www.shareditor.com/chatbot/ 导流。统计网站流量情况。cnzz统计看最近半个月受访页面流量情况,用户访问集中页面。增加图库动态按钮。吸引用户点击,在每个页面右下角放置动态小图标,页面滚动它不动,用户点了直接跳到想要引流的页面。搜客服漂浮代码。创建js文件,lrtk.js : $(function() { var tophtml="<a href=\"http://www.shareditor.com/chatbot/\" target=\"_blank\"><div id=\"izl_rmenu\" class=\"izl-rmenu\"><div class=\"btn btn-phone\"></div><div class=\"btn btn-top\"></div></div></a>"; $("#top").html(tophtml); $("#izl_rmenu").each(function() { $(this).find(".btn-phone").mouseenter(function() { $(this).find(".phone").fadeIn("fast"); }); $(this).find(".btn-phone").mouseleave(function() { $(this).find(".phone").fadeOut("fast"); }); $(this).find(".btn-top").click(function() { $("html, body").animate({ "scroll-top":0 },"fast"); }); }); var lastRmenuStatus=false; $(window).scroll(function() { var _top=$(window).scrollTop(); if(_top>=0) { $("#izl_rmenu").data("expanded",true); } else { $("#izl_rmenu").data("expanded",false); } if($("#izl_rmenu").data("expanded")!=lastRmenuStatus) { lastRmenuStatus=$("#izl_rmenu").data("expanded"); if(lastRmenuStatus) { $("#izl_rmenu .btn-top").slideDown(); } else { $("#izl_rmenu .btn-top").slideUp(); } } }); }); 上半部分定义id=top的div标签内容。一个id为izl_rmenu的div,css格式定义在另一个文件lrtk.css里: .izl-rmenu{position:fixed;left:85%;bottom:10px;padding-bottom:73px;z-index:999;} .izl-rmenu .btn{width:72px;height:73px;margin-bottom:1px;cursor:pointer;position:relative;} .izl-rmenu .btn-top{background:url(http://www.shareditor.com/uploads/media/default/0001/01/thumb_416_default_big.png) 0px 0px no-repeat;background-size: 70px 70px;display:none;} 下半部分当页面滚动时div展开。 在所有页面公共代码部分增加 <div id="top"></div> 庞大语料库运用,LSTM-RNN训练,中文语料转成算法识别向量形式,最强大word embedding工具word2vec。 word2vec输入切词文本文件,影视剧字幕语料库回车换行分隔完整句子,所以我们先对其做切词,word_segment.py文件: # coding:utf-8 import sys import importlib importlib.reload(sys) import jieba from jieba import analyse def segment(input, output): input_file = open(input, "r") output_file = open(output, "w") while True: line = input_file.readline() if line: 
line = line.strip() seg_list = jieba.cut(line) segments = "" for str in seg_list: segments = segments + " " + str segments = segments + "\n" output_file.write(segments) else: break input_file.close() output_file.close() if __name__ == '__main__': if 3 != len(sys.argv): print("Usage: ", sys.argv[0], "input output") sys.exit(-1) segment(sys.argv[1], sys.argv[2]); 使用: python word_segment.py subtitle/raw_subtitles/subtitle.corpus segment_result word2vec生成词向量。word2vec可从https://github.com/warmheartli/ChatBotCourse/tree/master/word2vec获取,make编译生成二进制文件。执行: ./word2vec -train ../segment_result -output vectors.bin -cbow 1 -size 200 -window 8 -negative 25 -hs 0 -sample 1e-4 -threads 20 -binary 1 -iter 15 生成vectors.bin词向量,二进制格式,word2vec自带distance工具来验证: ./distance vectors.bin 词向量二进制文件格式加载。word2vec生成词向量二进制格式:词数目(空格)向量维度。加载词向量二进制文件python脚本: # coding:utf-8 import sys import struct import math import numpy as np reload(sys) sys.setdefaultencoding( "utf-8" ) max_w = 50 float_size = 4 def load_vectors(input): print "begin load vectors" input_file = open(input, "rb") # 获取词表数目及向量维度 words_and_size = input_file.readline() words_and_size = words_and_size.strip() words = long(words_and_size.split(' ')[0]) size = long(words_and_size.split(' ')[1]) print "words =", words print "size =", size word_vector = {} for b in range(0, words): a = 0 word = '' # 读取一个词 while True: c = input_file.read(1) word = word + c if False == c or c == ' ': break if a < max_w and c != '\n': a = a + 1 word = word.strip() # 读取词向量 vector = np.empty([200]) for index in range(0, size): m = input_file.read(float_size) (weight,) = struct.unpack('f', m) vector[index] = weight # 将词及其对应的向量存到dict中 word_vector[word.decode('utf-8')] = vector input_file.close() print "load vectors finish" return word_vector if __name__ == '__main__': if 2 != len(sys.argv): print "Usage: ", sys.argv[0], "vectors.bin" sys.exit(-1) d = load_vectors(sys.argv[1]) print d[u'真的'] 运行方式如下: python word_vectors_loader.py vectors.bin 参考资料: 《Python 自然语言处理》 http://www.shareditor.com/blogshow?blogId=113 http://www.shareditor.com/blogshow?blogId=114 http://www.shareditor.com/blogshow?blogId=115 欢迎推荐上海机器学习工作机会
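上面是按字节手工解析vectors.bin;如果不想手工解析,也可以用现成的gensim库直接加载这种word2vec二进制格式(假设已pip install gensim,Python 3环境):

# 用gensim加载word2vec二进制词向量,作用等价于上面的load_vectors
from gensim.models import KeyedVectors

vectors = KeyedVectors.load_word2vec_format('vectors.bin', binary=True)
print(vectors[u'真的'])                       # 某个词的200维向量,对应word_vector[u'真的']
print(vectors.most_similar(u'真的', topn=5))  # 余弦最相近的词,作用类似./distance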
递归神经网络是可存储记忆的神经网络,LSTM是其中一种,在NLP领域应用效果不错。

递归神经网络(RNN)分两类:时间递归神经网络(recurrent neural network)和结构递归神经网络(recursive neural network)。时间递归神经网络的神经元间连接构成有向图;结构递归神经网络利用相似的神经网络结构递归构造更复杂的深度网络。两者的训练属同一算法的变体。

时间递归神经网络。传统神经网络叫FNN(Feed-Forward Neural Networks),前向反馈神经网络;RNN引入了定向循环,神经元作为节点组成有向环,可表达前后关联关系。隐藏层节点间构成全连接,一个隐藏层节点的输出可以作为另一个隐藏层节点或自己的输入。U、V、W是变换概率矩阵,x是输入,o是输出。RNN的关键是隐藏层:隐藏层捕捉序列信息,具有记忆能力。RNN中U、V、W参数共享,每一步都在做相同的事情,只是输入不同,从而降低了参数个数和计算量。RNN在NLP中应用较多:语言模型在已知已出现词的情况下预测下一个词的概率,是时序模型,下一个词的出现取决于前几个词,正对应RNN隐藏层间的内部连接。

RNN的训练方法。用BP误差反向传播算法更新训练参数。从输入到输出经过的步骤不确定,利用时序方式做前向计算。假设x表示输入值,s表示输入x经过U矩阵变换后的值,h表示隐藏层激活值,o表示输出层值,f表示隐藏层激活函数,g表示输出层激活函数。当t=0时,输入为x0,隐藏层为h0。当t=1时,输入为x1,s1 = Ux1 + Wh0,h1 = f(s1),o1 = g(Vh1)。当t=2时,s2 = Ux2 + Wh1,h2 = f(s2),o2 = g(Vh2)。一般地,st = Uxt + Wh(t-1),ht = f(st),ot = g(Vht)。h = f(现有的输入 + 过去记忆的总结),正是RNN记忆能力的体现;整个依赖链条是:输出(隐藏(变换(输入、前一隐藏)))。反向修正参数时,用每一步输出o和实际o值的误差反向推导,链式求导求每层梯度,更新参数。

LSTM(Long Short-Term Memory networks,长短期记忆网络)。RNN存在长期依赖(Long-Term Dependencies)问题:下一个词的出现概率可能和非常久远之前的词有关,但考虑到计算量,通常要限制依赖长度。参考 http://colah.github.io/posts/2015-08-Understanding-LSTMs 。传统RNN示意图中只包含一个隐藏层,以tanh为激活函数,"记忆"体现在t的滑动窗口上,有多少个t就有多少记忆。

LSTM的设计由两类元素组成:神经网络层(权重系数和激活函数,σ表示sigmoid激活函数,tanh表示tanh激活函数)和矩阵运算(矩阵乘或矩阵加)。历史信息的传递和记忆靠调大小的阀门(乘以一个0到1之间的系数):第一个sigmoid层计算输出0到1之间的系数,作用到×门,表达上一阶段传递过来的记忆保留多少、忘掉多少;忘掉多少记忆取决于上一隐藏层输出h{t-1}和本层的输入x{t}。接着由h{t-1}和x{t}得出新信息,存到记忆中:计算输出值的tanh神经元和计算比例系数的sigmoid神经元配合(sigmoid取值范围是[0,1],作比例系数;tanh取值范围是[-1,1],作输出值)。隐藏层输出h的计算考虑当前的全部信息(上一时序隐藏层输出、本层输入x和当前整体记忆信息):本单元状态部分C通过tanh激活,并做一个过滤(过滤系数由上一时序输出值和当前输入值通过sigmoid激活得到)。一句话中的词是不同时序的输入x,在某一时间t出现词A的概率可用LSTM计算:词A的出现概率取决于前面出现过的词,而取决于前面多少个词不确定,LSTM存储着记忆信息C,可以得出较接近的概率。

聊天机器人是范问答系统。

语料库获取。范问答系统一般从互联网收集语料信息,比如百度、谷歌,构建问答对组成语料库。语料库分成训练集、开发集、测试集。问答系统训练的是"在一堆答案里找一个正确答案"的模型。训练过程不把所有答案都放到一个向量空间,而是做分组:在语料库里采集样本,为每一个问题收集对应的500个答案集合,500个里面有正向样本,再随机选一些负向样本,以突出正向样本的作用。

基于CNN的系统设计,利用sparse interaction(稀疏交互)、parameter sharing(参数共享)、equivalent representation(等价表示)这些特性,适合自动问答系统中答案选择模型的训练。

通用训练方法。训练时获取问题的词向量Vq(词向量可用google word2vec训练)、一个正向答案的词向量Va+和一个负向答案的词向量Va-,比较问题和两个答案的相似度,若两个相似度的差值大于一个阈值m就更新模型参数,然后在候选池里选答案;小于m则不更新模型。参数更新用梯度下降、链式求导。在测试数据上,计算问题和候选答案的cos距离,相似度最大的预测为正确答案。

神经网络结构设计。HL是hide layer隐藏层,激活函数z = tanh(Wx+B);CNN是卷积层;P是池化层,池化步长为1;T是tanh层;P+T的输出是向量表示,最终输出两个向量的cos相似度。HL或CNN连起来表示共享相同权重。CNN的输出维数取决于做多少个卷积特征。论文《Applying Deep Learning To Answer Selection: A Study And An Open Task》。

深度学习运用到聊天机器人中的要点:1. 神经网络结构的选择、组合、优化。2. 自然语言处理需要机器能识别的词向量。3. 相似或匹配关系考虑相似度计算,典型方法是cos距离。4. 文本序列全局信息用CNN或LSTM。5. 精度不高可加层。6.
计算量过大,参数共享和池化。 聊天机器人学习,需要海量聊天语料库。美剧字幕。外文电影或电视剧字幕文件是天然聊天语料,对话比较多美剧最佳。字幕库网站www.zimuku.net。 自动抓取字幕。抓取器代码(https://github.com/warmheartli/ChatBotCourse)。在subtitle下创建目录result,scrapy.Request方法调用时增加传参 dont_filter=True: # coding:utf-8 import sys import importlib importlib.reload(sys) import scrapy from subtitle_crawler.items import SubtitleCrawlerItem class SubTitleSpider(scrapy.Spider): name = "subtitle" allowed_domains = ["zimuku.net"] start_urls = [ "http://www.zimuku.net/search?q=&t=onlyst&ad=1&p=20", "http://www.zimuku.net/search?q=&t=onlyst&ad=1&p=21", "http://www.zimuku.net/search?q=&t=onlyst&ad=1&p=22", ] def parse(self, response): hrefs = response.selector.xpath('//div[contains(@class, "persub")]/h1/a/@href').extract() for href in hrefs: url = response.urljoin(href) request = scrapy.Request(url, callback=self.parse_detail, dont_filter=True) yield request def parse_detail(self, response): url = response.selector.xpath('//li[contains(@class, "dlsub")]/div/a/@href').extract()[0] print("processing: ", url) request = scrapy.Request(url, callback=self.parse_file, dont_filter=True) yield request def parse_file(self, response): body = response.body item = SubtitleCrawlerItem() item['url'] = response.url item['body'] = body return item # -*- coding: utf-8 -*- class SubtitleCrawlerPipeline(object): def process_item(self, item, spider): url = item['url'] file_name = url.replace('/','_').replace(':','_')+'.rar' fp = open('result/'+file_name, 'wb+') fp.write(item['body']) fp.close() return item ls result/|head -1 , ls result/|wc -l , du -hs result/ 。 字幕文件解压,linux直接执行unzip file.zip。linux解压rar文件,http://www.rarlab.com/download.htm 。wget http://www.rarlab.com/rar/rarlinux-x64-5.4.0.tar.gz 。tar zxvf rarlinux-x64-5.4.0.tar.gz./rar/unrar 。解压命令,unrar x file.rar 。linux解压7z文件,http://downloads.sourceforge.net/project/p7zip 下载源文件,解压执行make编译 bin/7za可用,用法 bin/7za x file.7z。 程序和脚本在https://github.com/warmheartli/ChatBotCourse 。第一步:爬取影视剧字幕。第二步:压缩格式分类。文件多无法ls、文件名带特殊字符、文件名重名误覆盖、扩展名千奇百怪,python脚本mv_zip.py: import glob import os import fnmatch import shutil import sys def iterfindfiles(path, fnexp): for root, dirs, files in os.walk(path): for filename in fnmatch.filter(files, fnexp): yield os.path.join(root, filename) i=0 for filename in iterfindfiles(r"./input/", "*.ZIP"): i=i+1 newfilename = "zip/" + str(i) + "_" + os.path.basename(filename) print(filename + " <===> " + newfilename) shutil.move(filename, newfilename) #sys.exit(-1) 扩展名根据压缩文件修改.rar、.RAR、.zip、.ZIP。第三步:解压。根据操作系统下载不同解压工具,建议unrar和unzip,脚本来实现批量解压: i=0; for file in `ls`; do mkdir output/${i}; echo "unzip $file -d output/${i}";unzip -P abc $file -d output/${i} > /dev/null; ((i++)); done i=0; for file in `ls`; do mkdir output/${i}; echo "${i} unrar x $file output/${i}";unrar x $file output/${i} > /dev/null; ((i++)); done 第四步:srt、ass、ssa字幕文件分类整理。字幕文件类型srt、lrc、ass、ssa、sup、idx、str、vtt。第五步:清理目录。自动清理空目录脚本clear_empty_dir.py : import glob import os import fnmatch import shutil import sys def iterfindfiles(path, fnexp): for root, dirs, files in os.walk(path): if 0 == len(files) and len(dirs) == 0: print(root) os.rmdir(root) iterfindfiles(r"./input/", "*.srt") 第六步:清理非字幕文件。批量删除脚本del_file.py : import glob import os import fnmatch import shutil import sys def iterfindfiles(path, fnexp): for root, dirs, files in os.walk(path): for filename in fnmatch.filter(files, fnexp): yield os.path.join(root, filename) for suffix in ("*.mp4", "*.txt", "*.JPG", "*.htm", "*.doc", "*.docx", "*.nfo", "*.sub", "*.idx"): for filename in iterfindfiles(r"./input/", suffix): print(filename) 
os.remove(filename)

第七步:多层解压缩。第八步:舍弃剩余少量文件。无扩展名、特殊扩展名、少量压缩文件,总体不超过50M。第九步:编码识别与转码。字幕文件编码有utf-8、utf-16、gbk、unicode、iso8859等,统一转成utf-8,get_charset_and_conv.py(以二进制读出后用chardet识别编码,非utf系编码按gb18030解码后重写为utf-8):

import chardet
import sys
import os

if __name__ == '__main__':
    if len(sys.argv) == 2:
        for root, dirs, files in os.walk(sys.argv[1]):
            for file in files:
                file_path = root + "/" + file
                f = open(file_path, 'rb')
                data = f.read()
                f.close()
                encoding = chardet.detect(data)["encoding"]
                if encoding not in ("UTF-8-SIG", "UTF-16LE", "utf-8", "ascii"):
                    try:
                        gb_content = data.decode("gb18030")
                        f = open(file_path, 'w', encoding='utf-8')
                        f.write(gb_content)
                        f.close()
                    except:
                        print("except:", file_path)

第十步:筛选中文。extract_sentence_srt.py(用unicode区间正则区分中文和日文假名,只保留含中文、不含假名的短句):

# coding:utf-8
import chardet
import os
import re

pattern_cn = re.compile(r"([\u4e00-\u9fa5]+)")
pattern_jp1 = re.compile(r"([\u3040-\u309F]+)")
pattern_jp2 = re.compile(r"([\u30A0-\u30FF]+)")

for root, dirs, files in os.walk("./srt"):
    file_count = len(files)
    if file_count > 0:
        for index, file in enumerate(files):
            f = open(root + "/" + file, "rb")
            content = f.read()
            f.close()
            encoding = chardet.detect(content)["encoding"]
            try:
                for sentence in content.decode(encoding).split('\n'):
                    if len(sentence) > 0:
                        match_cn = pattern_cn.findall(sentence)
                        match_jp1 = pattern_jp1.findall(sentence)
                        match_jp2 = pattern_jp2.findall(sentence)
                        sentence = sentence.strip()
                        if len(match_cn) > 0 and len(match_jp1) == 0 and len(match_jp2) == 0 and len(sentence) > 1 and len(sentence.split(' ')) < 10:
                            print(sentence)
            except:
                continue

第十一步:字幕中句子提取。ssa/ass格式的对白都在Dialogue行,并需要去掉{}内的样式标签和\N换行转义:

# coding:utf-8
import chardet
import os
import re

pattern_cn = re.compile(r"([\u4e00-\u9fa5]+)")
pattern_jp1 = re.compile(r"([\u3040-\u309F]+)")
pattern_jp2 = re.compile(r"([\u30A0-\u30FF]+)")

for root, dirs, files in os.walk("./ssa"):
    file_count = len(files)
    if file_count > 0:
        for index, file in enumerate(files):
            f = open(root + "/" + file, "rb")
            content = f.read()
            f.close()
            encoding = chardet.detect(content)["encoding"]
            try:
                for line in content.decode(encoding).split('\n'):
                    if line.find('Dialogue') == 0 and len(line) < 500:
                        fields = line.split(',')
                        sentence = fields[len(fields)-1]
                        tag_fields = sentence.split('}')
                        if len(tag_fields) > 1:
                            sentence = tag_fields[len(tag_fields)-1]
                        match_cn = pattern_cn.findall(sentence)
                        match_jp1 = pattern_jp1.findall(sentence)
                        match_jp2 = pattern_jp2.findall(sentence)
                        sentence = sentence.strip()
                        if len(match_cn) > 0 and len(match_jp1) == 0 and len(match_jp2) == 0 and len(sentence) > 1 and len(sentence.split(' ')) < 10:
                            sentence = sentence.replace('\\N', '')
                            print(sentence)
            except:
                continue

第十二步:内容过滤。过滤特殊unicode字符、关键词,去除字幕样式标签、html标签、连续特殊字符、转义字符、剧集信息:

# coding:utf-8
import sys
import re

pattern_illegals = [re.compile(r"([\u2000-\u2010]+)"), re.compile(r"([\u0090-\u0099]+)")]
filters = ["字幕", "时间轴:", "校对:", "翻译:", "后期:", "监制:",
           "禁止用作任何商业盈利行为", "http"]
htmltagregex = re.compile(r'<[^>]+>', re.S)
brace_regex = re.compile(r'\{.*\}', re.S)
slash_regex = re.compile(r'\\\w', re.S)
repeat_regex = re.compile(r'[-=]{10}', re.S)

f = open("./corpus/all.out", "rb")
count = 0
while True:
    line = f.readline()
    if line:
        line = line.strip()
        # 编码识别,不是utf-8就过滤
        gb_content = ''
        try:
            gb_content = line.decode("utf-8")
        except Exception:
            sys.stderr.write("decode error: %s\n" % line)
            continue
        line = gb_content  # 之后统一用解码后的字符串处理
        # 中文识别,不是中文就过滤
        need_continue = False
        for pattern_illegal in pattern_illegals:
            match_illegal = pattern_illegal.findall(gb_content)
            if len(match_illegal) > 0:
                sys.stderr.write("match_illegal error: %s\n" % line)
                need_continue = True
                break
        if need_continue:
            continue
        # 关键词过滤
        need_continue = False
        for filter in filters:
            try:
                line.index(filter)
                sys.stderr.write("filter keyword of %s %s\n" % (filter, line))
                need_continue = True
                break
            except:
                pass
        if need_continue:
            continue
        # 去掉剧集信息
        if re.match('.*第.*季.*', line):
            sys.stderr.write("filter copora %s\n" % line)
            continue
        if re.match('.*第.*集.*', line):
            sys.stderr.write("filter copora %s\n" % line)
            continue
        if re.match('.*第.*帧.*', line):
            sys.stderr.write("filter copora %s\n" % line)
            continue
        # 去html标签
        line = htmltagregex.sub('', line)
        # 去花括号修饰
        line = brace_regex.sub('', line)
        # 去转义
        line = slash_regex.sub('', line)
        # 去重复
        new_line = repeat_regex.sub('', line)
        if len(new_line) != len(line):
            continue
        # 去特殊字符
        line = line.replace('-', '').strip()
        if len(line) > 0:
            sys.stdout.write("%s\n" % line)
            count += 1
    else:
        break
f.close()

参考资料:
《Python 自然语言处理》
http://www.shareditor.com/blogshow?blogId=103
http://www.shareditor.com/blogshow?blogId=104
http://www.shareditor.com/blogshow?blogId=105
http://www.shareditor.com/blogshow?blogId=112
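清洗得到的语料每行一个完整句子,相邻两行近似一问一答,前面lucene建索引一章正是把上一句作question、下一句作answer。下面是把清洗结果组装成问答对的小草图(文件名corpus.clean是假设的):

# 相邻两行组成(question, answer)对,供建索引或后续训练使用
def make_pairs(path):
    pairs = []
    last = ''
    with open(path, 'r', encoding='utf-8') as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            if last:
                pairs.append((last, line))
            last = line
    return pairs

pairs = make_pairs('corpus.clean')
print(len(pairs), pairs[0])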
人工神经网络,借鉴生物神经网络工作原理数学模型。 由n个输入特征得出与输入特征几乎相同的n个结果,训练隐藏层得到意想不到信息。信息检索领域,模型训练合理排序模型,输入特征,文档质量、文档点击历史、文档前链数目、文档锚文本信息,为找特征隐藏信息,隐藏层神经元数目设置少于输入特征数目,经大量样本训练能还原原始特征模型,相当用少于输入特征数目信息还原全部特征,压缩,可发现某些特征之间存在隐含相关性,或者有某种特殊关系。让隐藏层神经元数目多余输入特征数目,训练模型可展示特征之间某种细节关联。输出输入一致,自编码算法。 人工神经网络模型,多层神经元结构建立,每一层抽象一种思维过程,经多层思考,得出结论。神经网络每一层有每一层专做事情,每一层神经元添加特殊约束条件。多层提取特定特征做机器学习是深度学习。 卷积,在一定范围内做平移并求平均值。卷积积分公式,对τ积分,对固定x,找x附近所有变量,求两个函数乘积,并求和。神经网络里面,每个神经元计算输出卷积公式,神经网络每一层输出一种更高级特征。自然语言,较近上下文词语之间存在一定相关性,标点、特殊词等分隔使、传统自然语言处理脱离词与词之间关联,丢失部分重要信息,利用卷积神经网络可以做多元(n-gram)计算,不损失自然语言临近词相关性信息。 自动问答系统深度学习应用RNN,利用时序建模。 卷积神经网络(Convolutional Neural Network,CNN),二维离散卷积运算和人工神经网络结合深度神经网络。自动提取特征。 手写数字识别。http://yann.lecun.com/exdb/mnist/手写数据集,文件是二进制像素单位保存几万张图片文件,https://github.com/warmheartli/ChatBotCourse。 多层卷积网络,第一层一个卷积和一个max pooling,卷积运算“视野”5×5像素范围,卷积使用1步长、0边距模板(保证输入输出同一个大小),1个输入通道(图片灰度,单色),32个输出通道(32个特征)。每张图片28×28像素,第一次卷积输出28×28大小。max pooling采用2×2大小模板,池化后输出尺寸14×14,一共有32个通道,一张图片输出是14×14×32=6272像素。第二层一个卷积和一个max pooling,输入通道32个(对应第一层32个特征),输出通道64个(输出64个特征),输入每张大小14×14,卷积层输出14×14,经过max pooling,输出大小7×7,输出像素7×7×64=3136。第三层一个密集连接层,一个有1024个神经元全连接层,第二层输出7×7×64个值作1024个神经元输入。神经元激活函数为ReLu函数,平滑版Softplus g(x)=log(1+e^x))。最终输出层,第三层1024个输出为输入,设计一个softmax层,输出10个概率值。 # coding:utf-8 import sys import importlib importlib.reload(sys) from tensorflow.examples.tutorials.mnist import input_data import tensorflow as tf flags = tf.app.flags FLAGS = flags.FLAGS flags.DEFINE_string('data_dir', './', 'Directory for storing data') mnist = input_data.read_data_sets(FLAGS.data_dir, one_hot=True) # 初始化生成随机的权重(变量),避免神经元输出恒为0 def weight_variable(shape): # 以正态分布生成随机值 initial = tf.truncated_normal(shape, stddev=0.1) return tf.Variable(initial) # 初始化生成随机的偏置项(常量),避免神经元输出恒为0 def bias_variable(shape): initial = tf.constant(0.1, shape=shape) return tf.Variable(initial) # 卷积采用1步长,0边距,保证输入输出大小相同 def conv2d(x, W): return tf.nn.conv2d(x, W, strides=[1, 1, 1, 1], padding='SAME') # 池化采用2×2模板 def max_pool_2x2(x): return tf.nn.max_pool(x, ksize=[1, 2, 2, 1], strides=[1, 2, 2, 1], padding='SAME') # 28*28=784 x = tf.placeholder(tf.float32, [None, 784]) # 输出类别共10个:0-9 y_ = tf.placeholder("float", [None,10]) # 第一层卷积权重,视野是5*5,输入通道1个,输出通道32个 W_conv1 = weight_variable([5, 5, 1, 32]) # 第一层卷积偏置项有32个 b_conv1 = bias_variable([32]) # 把x变成4d向量,第二维和第三维是图像尺寸,第四维是颜色通道数1 x_image = tf.reshape(x, [-1,28,28,1]) h_conv1 = tf.nn.relu(conv2d(x_image, W_conv1) + b_conv1) h_pool1 = max_pool_2x2(h_conv1) # 第二层卷积权重,视野是5*5,输入通道32个,输出通道64个 W_conv2 = weight_variable([5, 5, 32, 64]) # 第二层卷积偏置项有64个 b_conv2 = bias_variable([64]) h_conv2 = tf.nn.relu(conv2d(h_pool1, W_conv2) + b_conv2) h_pool2 = max_pool_2x2(h_conv2) # 第二层池化后尺寸编程7*7,第三层是全连接,输入是64个通道,输出是1024个神经元 W_fc1 = weight_variable([7 * 7 * 64, 1024]) # 第三层全连接偏置项有1024个 b_fc1 = bias_variable([1024]) h_pool2_flat = tf.reshape(h_pool2, [-1, 7*7*64]) h_fc1 = tf.nn.relu(tf.matmul(h_pool2_flat, W_fc1) + b_fc1) # 按float做dropout,以减少过拟合 keep_prob = tf.placeholder("float") h_fc1_drop = tf.nn.dropout(h_fc1, keep_prob) # 最后的softmax层生成10种分类 W_fc2 = weight_variable([1024, 10]) b_fc2 = bias_variable([10]) y_conv=tf.nn.softmax(tf.matmul(h_fc1_drop, W_fc2) + b_fc2) cross_entropy = -tf.reduce_sum(y_*tf.log(y_conv)) # Adam优化器来做梯度最速下降 train_step = tf.train.AdamOptimizer(1e-4).minimize(cross_entropy) correct_prediction = tf.equal(tf.argmax(y_conv,1), tf.argmax(y_,1)) accuracy = tf.reduce_mean(tf.cast(correct_prediction, "float")) sess = tf.InteractiveSession() sess.run(tf.global_variables_initializer()) for i in range(20000): batch = mnist.train.next_batch(50) if i%100 == 0: train_accuracy = 
accuracy.eval(feed_dict={ x:batch[0], y_: batch[1], keep_prob: 1.0}) print("step %d, training accuracy %g"%(i, train_accuracy)) train_step.run(feed_dict={x: batch[0], y_: batch[1], keep_prob: 0.5}) print("test accuracy %g"%accuracy.eval(feed_dict={ x: mnist.test.images, y_: mnist.test.labels, keep_prob: 1.0}))

词向量。自然语言需要数学化才能被计算机认识和计算。只为每个词分配一个编号,不能表示词与词的关系;让每一个词对应一个向量,词义相近的词,词向量距离越近(欧氏距离、夹角余弦)。词向量维度一般较低,一般是50维或100维,可避免维度灾难,也更容易做深度学习。

语言模型表达的是已知前n-1个词的前提下,预测第n个词的概率。词向量训练是无监督学习,没有标注数据,给n篇文章即可训练出词向量。基于三层神经网络构建n-gram语言模型:最下面的w是词,上面的C(w)是词向量,词向量一层是神经网络输入层(第一层),输入层是一个(n-1)×m的矩阵,n-1是词向量数目,m是词向量维度。第二层(隐藏层)是普通神经网络,H为权重,tanh为激活函数。第三层(输出层)有|V|个节点,|V|是词表大小,输出U为权重,softmax作激活函数实现归一化,最终输出某个词的概率。增加一个从输入层到输出层的直连边(线性变换),可提升模型效果,变换矩阵设为W。假设C(w)是输入x,y的计算公式是y = b + Wx + Utanh(d+Hx)。模型训练变量是C、H、U、W。梯度下降法训练后得出的C就是生成词向量所用矩阵,C(w)就是所需词向量。

词向量应用。找同义词:案例是google的word2vec工具,训练好词向量后,指定一个词,返回cos距离最相近的词并排序。词性标注和语义角色标注任务:词向量作神经网络输入层,通过前馈网络和卷积网络完成。句法分析和情感分析任务:词向量作递归神经网络输入。命名实体识别和短语识别:词向量作扩展特征使用。词向量还有线性运算性质:C(king)-C(queen)≈C(man)-C(woman),减法是向量逐维相减,与C(king)-C(man)+C(woman)最相近的向量是C(queen),体现了语义空间的线性关系。

词向量是深度学习应用于NLP的根基,word2vec是使用最广泛、最简单有效的词向量训练工具。

一个记忆单元识别一个事物,叫localist representation。几个记忆单元分别识别基础信息,通过这几个记忆单元的输出表示所有事物,叫distributed representation,即词向量。localist representation是稀疏表达,one hot vector,每一类型用向量的一维来表示;distributed representation是分布式表达,增加表达只需要增加一个或很少几个特征维度。

word embedding,词嵌入,源自范畴论的morphism(态射),态射表示两个数学结构中保持结构的过程的抽象,即一个域和另一个域之间的关系。范畴论中嵌入(态射)保持结构,word embedding表示"降维"的嵌入,通过降维避免维度灾难,降低计算复杂度,更易于深度学习应用。

word2vec的本质:通过distributed representation表达方式表示词,通过降维的word embedding减少计算量。

word2vec训练神经概率语言模型,有CBOW和Skip-gram两种模型。CBOW模型,Continuous Bag-of-Words Model,已知当前词的上下文预测当前词。CBOW模型神经网络结构:输入层是词w上下文2c个词的词向量;投影层把输入层2c个向量求和累加;输出层是霍夫曼树,叶子节点是语料中出现过的词,权重是出现次数。相比上面的神经网络语言模型:首尾相接改成求和累加,减少了维度;去掉隐藏层,减少了计算量;输出层softmax归一化运算改成霍夫曼树。

基于霍夫曼树的Hierarchical Softmax技术,基于训练语料得到每一个可能的w的概率。霍夫曼树中,非根节点的θ表示待训练的参数向量:当投影层产出新向量x,用逻辑回归公式σ(xᵀθ) = 1/(1+e^(-xᵀθ)),可得每一层被分到左节点(1)还是右节点(0)的概率分别是p(d|x,θ) = 1-σ(xᵀθ)和p(d|x,θ) = σ(xᵀθ)。以对数似然函数为优化目标,假设两个求和符号部分记作L(w, j),可推出θ的更新公式和x的梯度公式;x是多个v的累加,word2vec中对每个v做同样的更新。Skip-gram模型,Continuous Skip-gram Model,已知当前词预测上下文。Skip-gram模型神经网络结构:输入层是w的词向量v(w);投影层仍是v(w);输出层是霍夫曼树。θ和v(w)的更新公式同CBOW,只是符号名从x改成v(w)。

word2vec实战:下载源码(https://github.com/warmheartli/ChatBotCourse/tree/master/word2vec),执行make编译(mac系统需把代码中所有 #include <malloc.h> 替换成 #include <sys/malloc.h>)。编译生成word2vec、word2phrase、word-analogy、distance、compute-accuracy二进制文件。训练语料是已切好词(空格分隔)的文本。执行

./word2vec -train train.txt -output vectors.bin -cbow 0 -size 200 -window 5 -negative 0 -hs 1 -sample 1e-3 -thread 12 -binary 1

生成vectors.bin文件,即训练好的词向量二进制文件,就可以求近义词了,执行

./distance vectors.bin

参考资料:
《Python 自然语言处理》
http://www.shareditor.com/blogshow?blogId=92
http://www.shareditor.com/blogshow?blogId=97
http://www.shareditor.com/blogshow?blogId=99
http://www.shareditor.com/blogshow?blogId=100

欢迎推荐上海机器学习工作机会,我的微信:qingxingfengzi
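上面的线性关系可以用gensim直接验证(假设已pip install gensim,且用英文语料训练出vectors.bin;中文语料换成相应的词即可)。most_similar的positive/negative参数正好对应向量加减:

# 用词向量加减法验证 C(king)-C(man)+C(woman) ≈ C(queen) 这类类比关系
from gensim.models import KeyedVectors

vectors = KeyedVectors.load_word2vec_format('vectors.bin', binary=True)
print(vectors.most_similar(positive=['king', 'woman'], negative=['man'], topn=3))
print(vectors.most_similar('good', topn=5))   # 等价于用./distance找近义词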
中文分词把文本切分成词语,还可以反过来,把该拼一起的词再拼到一起,找到命名实体。 概率图模型条件随机场适用观测值条件下决定随机变量有有限个取值情况。给定观察序列X,某个特定标记序列Y概率,指数函数 exp(∑λt+∑μs)。符合最大熵原理。基于条件随机场命名实体识别方法属于有监督学习方法,利用已标注大规模语料库训练。 命名实体的放射性。命名实体的前后词。 特征模板,当前位置前后n个位置字/词/字母/数字/标点作为特征,基于已经标注好语料,词性、词形已知。特征模板选择和具体识别实体类别有关。 命名实体,人名(政治家、艺人等)、地名(城市、州、国家、建筑等)、组织机构名、时间、数字、专有名词(电影名、书名、项目名、电话号码等)。命名性指称、名词性指称和代词性指称。 词形上下文训练模型,给定词形上下文语境中产生实体概率。词性上下文训练模型,给定词性上下文语境中产生实体概率。给定实体词形串作为实体概率。给定实体词性串作为实体概率。 词性,名、动、形、数、量、代、副、介、连、助、叹、拟声。自然语言处理词性,区别词、方位词、成语、习用语、机构团体、时间词,多达100多种。汉语词性标注最大困难“兼类”,一个词在不同语境中有不同词性,很难从形式上识别。 词性标注过程。标注,根据规则或统计方法做词性标注。校验,一致性检查和自动校对方法修正。 统计模型词性标注方法。大量已标注语料库训练,选择合适训练用数学模型,概率图隐马尔科夫模型(HMM)适合词性标注基于观察序列标注情形。 隐马尔可夫模型参数初始化。模型参数初始化,在利用语料库前用最小成本和最接近最优解目标设定初值。HMM,基于条件概率生成式模型,模型参数生成概率,假设每个词生成概率是所有可能词性个数倒数,计算最简单最有可能接近最优解生成概率。每个词所有可能词性,已有词表标记,词表生成方法简单,已标注语料库,很好统计。生成概率初值设置0。 规则词性标注方法。既定搭配关系上下文语境规则,判断实际语境按照规则标注词性。适合既有规则,对兼词词性识别效果好,不适合网络新词层出不穷、网络用语新规则。机器学习自动提取规则,初始标注器标注结果和人工标注结果差距,生成修正标注转换规则,错误驱动学习方法。经过人工校总结大量有用信息补充调整规则库。 统计方法、规则方法相结合词性标注方法。规则排歧,统计标注,最后校对,得到正确标注结果。首选统计方法标注,同时计算计算置信度或错误率,判断结果是否可疑,在可疑情况下采用规则方法歧义消解,达到最佳效果。 词性标注校验。校验确定正确性,修正结果。检查词性标注一致性。一致性,所有标注结果,相同语境同一个词标注相同。兼类词,被标记不同词性。非兼类词,人工校验或其他原因导致标记不同词性。词数目多,词性多,一致性指标无法计算公式求得,基于聚类和分类方法,根据欧式距离定义一致性指标,设定阈值,保证一致性在阈值范围内。词性标注自动校对。不需要人参与,直接找出错误标注修正,适用一个词词性标注通篇全错,数据挖掘和规则学习方法判断相对准确。大规模训练语料生成词性校对决策表,找通篇全错词性标注自动修正。 句法分析树生成。把一句话按照句法逻辑组织成一棵树。 句法分析分句法结构分析和依存关系分析。句法结构分析是短语结构分析,提取出句子名词短语、动词短语等。分基于规则的分析方法和基于统计分析方法。基于规则方法存在很多局限性。基于统计方法,基于概率上下文无关文法(PCFG),终结符集合、非终结符集合、规则集。 先展示简单例子,感受计算过程,再叙述理论。 终结符集合,表示有哪些字可作句法分析树叶子节点。非终结符集合,表示树非页子节点,连接多个节点表达关系节点,句法规则符号。规则集,句法规则符号,模型训练概率值左部相同的概率和一定是1。 一句话句法结构树可能有多种,只选择概率最大作句子最佳结构。 设W={ω1ω2ω3……}表示一个句子,其中ω表示一个词(word),利用动态规划算法计算非终结符A推导出W中子串ωiωi+1ωi+2……ωj的概率,假设概率为αij(A),递归公式,αij(A)=P(A->ωi),αij(A)=∑∑P(A->BC)αik(B)α(k+1)j(C)。 句法规则提取方法与PCFG的概率参数估计。大量的树库,训练数据。树库中句法规则提取生成结构形式,进行合并、归纳等处理,得到终结符集合∑、非终结符集合N、规则集R。概率参数计算方法,给定参数一个随机初始值,采用EM迭代算法,不断训练数据,计算每条规则使用次数作为最大似然计算得到概率估值,不断迭代更新概率,最终得出概率符合最大似然估计精确值。 参考资料: 《Python 自然语言处理》 http://www.shareditor.com/blogshow?blogId=82http://www.shareditor.com/blogshow?blogId=86http://www.shareditor.com/blogshow?blogId=87 欢迎推荐上海机器学习工作机会,我的微信:qingxingfengzi
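上一段的递归公式可以直接写成自底向上的动态规划。下面是按αij(A)=P(A->ωi)、αij(A)=∑∑P(A->BC)αik(B)α(k+1)j(C)计算整句概率的极简草图(文法规则和概率都是虚构的示例,左部相同的规则概率和为1):

import collections

words = ['我', '爱', '她']
lex = {('N', '我'): 0.5, ('V', '爱'): 1.0, ('N', '她'): 0.5}   # 一元规则 P(A->wi)
rules = {('S', 'N', 'VP'): 1.0, ('VP', 'V', 'N'): 1.0}         # 二元规则 P(A->BC)

n = len(words)
alpha = collections.defaultdict(float)        # (i, j, A) -> α_ij(A)
for i, w in enumerate(words):                 # 初始情形:α_ii(A) = P(A->wi)
    for (A, token), p in lex.items():
        if token == w:
            alpha[(i, i, A)] = p
for span in range(2, n + 1):                  # 递推:α_ij(A) = ΣΣ P(A->BC) α_ik(B) α_(k+1)j(C)
    for i in range(0, n - span + 1):
        j = i + span - 1
        for k in range(i, j):
            for (A, B, C), p in rules.items():
                alpha[(i, j, A)] += p * alpha[(i, k, B)] * alpha[(k + 1, j, C)]

print(alpha[(0, n - 1, 'S')])                 # 整句由S推导出来的概率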
关键词提取。pynlpir库实现关键词提取。 # coding:utf-8 import sys import importlib importlib.reload(sys) import pynlpir pynlpir.open() s = '怎么才能把电脑里的垃圾文件删除' key_words = pynlpir.get_key_words(s, weighted=True) for key_word in key_words: print(key_word[0], 't', key_word[1]) pynlpir.close() 百度接口:https://www.baidu.com/s?wd=机器学习 数据挖掘 信息检索 安装scrapy pip install scrapy。创建scrapy工程 scrapy startproject baidu_search。做抓取器,创建baidu_search/baidu_search/spiders/baidu_search.py文件。 # coding:utf-8 import sys import importlib importlib.reload(sys) import scrapy class BaiduSearchSpider(scrapy.Spider): name = "baidu_search" allowed_domains = ["baidu.com"] start_urls = [ "https://www.baidu.com/s?wd=电脑 垃圾 文件 删除" ] def parse(self, response): filename = "result.html" with open(filename, 'wb') as f: f.write(response.body) 修改settings.py文件,ROBOTSTXT_OBEY = False,USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36' ,DOWNLOAD_TIMEOUT = 5 , 进入baidu_search/baidu_search/目录,scrapy crawl baidu_search 。生成result.html,正确抓取网页。 语料提取。搜索结果只是索引。真正内容需进入链接。分析抓取结果,链接嵌在class=c-container Div h3 a标签 href属性。url添加到抓取队列抓取。提取正文,去掉标签,保存摘要。提取url时,提取标题和摘要,scrapy.Request meta传递到处理函数parse_url,抓取完成后能接到这两个值,提取content。完整数据:url、title、abstract、content。 # coding:utf-8 import sys import importlib importlib.reload(sys) import scrapy from scrapy.utils.markup import remove_tags class BaiduSearchSpider(scrapy.Spider): name = "baidu_search" allowed_domains = ["baidu.com"] start_urls = [ "https://www.baidu.com/s?wd=电脑 垃圾 文件 删除" ] def parse(self, response): # filename = "result.html" # with open(filename, 'wb') as f: # f.write(response.body) hrefs = response.selector.xpath('//div[contains(@class, "c-container")]/h3/a/@href').extract() # for href in hrefs: # print(href) # yield scrapy.Request(href, callback=self.parse_url) containers = response.selector.xpath('//div[contains(@class, "c-container")]') for container in containers: href = container.xpath('h3/a/@href').extract()[0] title = remove_tags(container.xpath('h3/a').extract()[0]) c_abstract = container.xpath('div/div/div[contains(@class, "c-abstract")]').extract() abstract = "" if len(c_abstract) > 0: abstract = remove_tags(c_abstract[0]) request = scrapy.Request(href, callback=self.parse_url) request.meta['title'] = title request.meta['abstract'] = abstract yield request def parse_url(self, response): print(len(response.body)) print("url:", response.url) print("title:", response.meta['title']) print("abstract:", response.meta['abstract']) content = remove_tags(response.selector.xpath('//body').extract()[0]) print("content_len:", len(content)) 参考资料: 《Python 自然语言处理》 http://www.shareditor.com/blogshow/?blogId=43 http://www.shareditor.com/blogshow?blogId=76 欢迎推荐上海机器学习工作机会,我的微信:qingxingfengzi
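把本章两部分串起来:先用pynlpir对用户问句提取关键词,再拼成百度检索URL交给上面的抓取器。下面是一个小草图(假设pynlpir授权可用,URL参数拼法沿用上文):

# 由用户问句生成检索URL:关键词之间用空格连接
import pynlpir
from urllib.parse import quote

pynlpir.open()
question = '怎么才能把电脑里的垃圾文件删除'
key_words = [word for word, weight in pynlpir.get_key_words(question, weighted=True)]
url = 'https://www.baidu.com/s?wd=' + quote(' '.join(key_words))
print(url)   # 可作为BaiduSearchSpider里start_urls的一项
pynlpir.close()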
聊天机器人,提问、检索、回答。 提问,查询关键词生成、答案类型确定、句法和语义分析。查询关键词生成,提问提取关键词,中心词关联扩展词。答案类型确定,确定提问类型。句法和语义分析,问题深层含义剖析。检索,搜索,根据查询关键词信息检索,返回句子或段落。答案抽取,分析和推理检索句子或段落,抽取提问一致实体,根据概率最大对候选答案排序。 海量文本知识表示,网络文本资源获取、机器学习方法、大规模语义计算和推理、知识表示体系、知识库构建。问句解析,中文分词、词性标注、实体标注、概念类别标注、句法分析、语义分析、逻辑结构标注、指代消解、关联关系标注、问句分类、答案类别确定。答案生成过滤,候选答案抽取、关系推演、吻哈程度判断、噪声过滤。 聊天机器人技术类型。基于检索技术,信息检索,简单易实现,无法从句法关系和语义关系给出答案,无法推理问题。基于模式匹配技术,把问题往梳理好的模式匹配,推理简单,模式涵盖不全。基于自然语言理解技术,把浅层分析加句法分析、语义分析。基于统计翻译模型技术,把问句疑问词留出来,和候选答案资源匹配。 问句解析。哈工大LTP(语言技术平台)、博森科技、jieba分词、中科院张华平博士NLPIR汉语分词系统。 NLPIR,http://pynlpir.readthedocs.io/en/latest/。安装 pip install pynlpir 。下载授权文件 https://github.com/NLPIR-team/NLPIR/blob/master/License/license%20for%20a%20month/NLPIR-ICTCLAS分词系统授权/NLPIR.user,替换pynlpir/Data目录的已过期文件。 # coding:utf-8 import sys import importlib importlib.reload(sys) import pynlpir pynlpir.open() # s = '聊天机器人到底该怎么做呢?' s = '海洋是如何形成的' # 分词 分析功能全打开 不使用英文 segments = pynlpir.segment(s, pos_names='all', pos_english=False) for segment in segments: print(segment[0], 't', segment[1]) # 关键词提取 key_words = pynlpir.get_key_words(s, weighted=True) for key_word in key_words: print(key_word[0], 't', key_word[1]) pynlpir.close() segment 切词,返回tuple(token, pos),token切词,pos 语言属性。调用segment方法,指定pos_names参数'all' 、'child' 、'parent',默认parent 表示获取词性最顶级词性。child 表示获取词性最具体信息。all 表示获取词性相关所有词性信息,从顶级词性到该词性路径。 词性分类表。nlpir 源代码 /pynlpir/pos_map.py,全部词性分类及其子类别: POS_MAP = { 'n': ('名词', 'noun', { 'nr': ('人名', 'personal name', { 'nr1': ('汉语姓氏', 'Chinese surname'), 'nr2': ('汉语名字', 'Chinese given name'), 'nrj': ('日语人名', 'Japanese personal name'), 'nrf': ('音译人名', 'transcribed personal name') }), 'ns': ('地名', 'toponym', { 'nsf': ('音译地名', 'transcribed toponym'), }), 'nt': ('机构团体名', 'organization/group name'), 'nz': ('其它专名', 'other proper noun'), 'nl': ('名词性惯用语', 'noun phrase'), 'ng': ('名词性语素', 'noun morpheme'), }), 't': ('时间词', 'time word', { 'tg': ('时间词性语素', 'time morpheme'), }), 's': ('处所词', 'locative word'), 'f': ('方位词', 'noun of locality'), 'v': ('动词', 'verb', { 'vd': ('副动词', 'auxiliary verb'), 'vn': ('名动词', 'noun-verb'), 'vshi': ('动词"是"', 'verb 是'), 'vyou': ('动词"有"', 'verb 有'), 'vf': ('趋向动词', 'directional verb'), 'vx': ('行事动词', 'performative verb'), 'vi': ('不及物动词', 'intransitive verb'), 'vl': ('动词性惯用语', 'verb phrase'), 'vg': ('动词性语素', 'verb morpheme'), }), 'a': ('形容词', 'adjective', { 'ad': ('副形词', 'auxiliary adjective'), 'an': ('名形词', 'noun-adjective'), 'ag': ('形容词性语素', 'adjective morpheme'), 'al': ('形容词性惯用语', 'adjective phrase'), }), 'b': ('区别词', 'distinguishing word', { 'bl': ('区别词性惯用语', 'distinguishing phrase'), }), 'z': ('状态词', 'status word'), 'r': ('代词', 'pronoun', { 'rr': ('人称代词', 'personal pronoun'), 'rz': ('指示代词', 'demonstrative pronoun', { 'rzt': ('时间指示代词', 'temporal demonstrative pronoun'), 'rzs': ('处所指示代词', 'locative demonstrative pronoun'), 'rzv': ('谓词性指示代词', 'predicate demonstrative pronoun'), }), 'ry': ('疑问代词', 'interrogative pronoun', { 'ryt': ('时间疑问代词', 'temporal interrogative pronoun'), 'rys': ('处所疑问代词', 'locative interrogative pronoun'), 'ryv': ('谓词性疑问代词', 'predicate interrogative pronoun'), }), 'rg': ('代词性语素', 'pronoun morpheme'), }), 'm': ('数词', 'numeral', { 'mq': ('数量词', 'numeral-plus-classifier compound'), }), 'q': ('量词', 'classifier', { 'qv': ('动量词', 'verbal classifier'), 'qt': ('时量词', 'temporal classifier'), }), 'd': ('副词', 'adverb'), 'p': ('介词', 'preposition', { 'pba': ('介词“把”', 'preposition 把'), 'pbei': ('介词“被”', 'preposition 被'), }), 'c': ('连词', 'conjunction', { 'cc': ('并列连词', 'coordinating conjunction'), }), 'u': ('助词', 'particle', { 'uzhe': ('着', 'particle 着'), 'ule': ('了/喽', 
'particle 了/喽'), 'uguo': ('过', 'particle 过'), 'ude1': ('的/底', 'particle 的/底'), 'ude2': ('地', 'particle 地'), 'ude3': ('得', 'particle 得'), 'usuo': ('所', 'particle 所'), 'udeng': ('等/等等/云云', 'particle 等/等等/云云'), 'uyy': ('一样/一般/似的/般', 'particle 一样/一般/似的/般'), 'udh': ('的话', 'particle 的话'), 'uls': ('来讲/来说/而言/说来', 'particle 来讲/来说/而言/说来'), 'uzhi': ('之', 'particle 之'), 'ulian': ('连', 'particle 连'), }), 'e': ('叹词', 'interjection'), 'y': ('语气词', 'modal particle'), 'o': ('拟声词', 'onomatopoeia'), 'h': ('前缀', 'prefix'), 'k': ('后缀', 'suffix'), 'x': ('字符串', 'string', { 'xe': ('Email字符串', 'email address'), 'xs': ('微博会话分隔符', 'hashtag'), 'xm': ('表情符合', 'emoticon'), 'xu': ('网址URL', 'URL'), 'xx': ('非语素字', 'non-morpheme character'), }), 'w': ('标点符号', 'punctuation mark', { 'wkz': ('左括号', 'left parenthesis/bracket'), 'wky': ('右括号', 'right parenthesis/bracket'), 'wyz': ('左引号', 'left quotation mark'), 'wyy': ('右引号', 'right quotation mark'), 'wj': ('句号', 'period'), 'ww': ('问号', 'question mark'), 'wt': ('叹号', 'exclamation mark'), 'wd': ('逗号', 'comma'), 'wf': ('分号', 'semicolon'), 'wn': ('顿号', 'enumeration comma'), 'wm': ('冒号', 'colon'), 'ws': ('省略号', 'ellipsis'), 'wp': ('破折号', 'dash'), 'wb': ('百分号千分号', 'percent/per mille sign'), 'wh': ('单位符号', 'unit of measure sign'), }), } 参考资料: 《Python 自然语言处理》 http://www.shareditor.com/blogshow?blogId=73 http://www.shareditor.com/blogshow?blogId=74 欢迎推荐上海机器学习工作机会,我的微信:qingxingfengzi
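有了这张词性分类表,就可以按顶级词性筛选切词结果,比如只留名词、动词作候选关键词。草图如下(pos_names='parent'取顶级词性):

import pynlpir

pynlpir.open()
s = '海洋是如何形成的'
segments = pynlpir.segment(s, pos_names='parent', pos_english=False)
# 只保留顶级词性为名词/动词的token
candidates = [token for token, pos in segments if pos in ('名词', '动词')]
print(candidates)
pynlpir.close()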
分块,根据句子的词和词性,按照规则组织合分块,分块代表实体。常见实体,组织、人员、地点、日期、时间。名词短语分块(NP-chunking),通过词性标记、规则识别,通过机器学习方法识别。介词短语(PP)、动词短语(VP)、句子(S)。 分块标记,IOB标记,I(inside,内部)、O(outside,外部)、B(begin,开始)。树结构存储分块。多级分块,多重分块方法。级联分块。 关系抽取,找出实体间关系。实体识别认知事物,关系识别掌握真相。三元组(X,a,Y),X、Y实体,a表达关系字符串。通过正则识别。from nltk.corpus import conll2000,print(conll2000.chunked_sents('train.txt')[99]) 。 文法,潜在无限句子集合紧凑特性。形式化模型,覆盖所有结构句子。符合多种文法句子有歧义。只能用特征方法处理。 文法特征结构,单词最后字母、词性标签、文法类别、正字拼写、指示物、关系、施事角色、受事角色。文法特征是键值对,特征结构存储形式是字典。句法协议、属性、约束、术语。import nltk,fs1 = nltk.FeatStruct(TENSE='past', NUM='sg') ,fs2 = nltk.FeatStruct(POS='N', AGR=fs1) 。nltk产生式文法描述 /nltk_data/grammars/book_grammars 。sql0.fcfg,查找国家城市sql语句文法: % start S S[SEM=(?np + WHERE + ?vp)] -> NP[SEM=?np] VP[SEM=?vp] VP[SEM=(?v + ?pp)] -> IV[SEM=?v] PP[SEM=?pp] VP[SEM=(?v + ?ap)] -> IV[SEM=?v] AP[SEM=?ap] NP[SEM=(?det + ?n)] -> Det[SEM=?det] N[SEM=?n] PP[SEM=(?p + ?np)] -> P[SEM=?p] NP[SEM=?np] AP[SEM=?pp] -> A[SEM=?a] PP[SEM=?pp] NP[SEM='Country="greece"'] -> 'Greece' NP[SEM='Country="china"'] -> 'China' Det[SEM='SELECT'] -> 'Which' | 'What' N[SEM='City FROM city_table'] -> 'cities' IV[SEM=''] -> 'are' A[SEM=''] -> 'located' P[SEM=''] -> 'in' 加载文法描述 import nltk from nltk import load_parser cp = load_parser('grammars/book_grammars/sql0.fcfg') query = 'What cities are located in China' tokens = query.split() for tree in cp.parse(tokens): print(tree) 参考资料: 《Python 自然语言处理》 http://www.shareditor.com/blogshow?blogId=70 http://www.shareditor.com/blogshow?blogId=71 欢迎推荐上海机器学习工作机会,我的微信:qingxingfengzi
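名词短语分块用规则方法在NLTK里几行就能跑通,顺带可以看到树结构和IOB标记两种表示(英文示例,分块规则只是示意):

import nltk
from nltk.chunk import tree2conlltags

sentence = [('the', 'DT'), ('little', 'JJ'), ('dog', 'NN'),
            ('barked', 'VBD'), ('at', 'IN'), ('the', 'DT'), ('cat', 'NN')]
grammar = 'NP: {<DT>?<JJ>*<NN>}'   # 规则:限定词 + 若干形容词 + 名词
cp = nltk.RegexpParser(grammar)
tree = cp.parse(sentence)
print(tree)                         # 树结构存储分块
print(tree2conlltags(tree))         # 每个词的(词, 词性, IOB标记)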
英文词干提取器,import nltk,porter = nltk.PorterStemmer(),porter.stem('lying') 。 词性标注器,pos_tag处理词序列,根据句子动态判断,import nltk,text = nltk.word_tokenize("And now for something completely different”),nltk.pos_tag(text) 。CC 连接词,RB 副词,IN 介词,NN 名次,JJ 形容词。 标注自定义词性标注语料库,tagged_token = nltk.tag.str2tuple('fly/NN') 。字符串转成二元组。布朗语料库标注 nltk.corpus.brown.tagged_words() 。 nltk中文语料库,nltk.download()。下载 Corpora sinica_treebank,台湾中国研究院。 # coding:utf-8 import sys import importlib importlib.reload(sys) import nltk for word in nltk.corpus.sinica_treebank.tagged_words(): print(word[0], word[1]) jieba切词,https://github.com/fxsjy/jieba,自定义语料中文切词,自动词性标注。 词性自动标注。默认标注器 DefaultTagger,标注为频率最高词性。 # coding:utf-8 import sys import importlib importlib.reload(sys) import nltk default_tagger = nltk.DefaultTagger('NN') raw = '我 好 想 你' tokens = nltk.word_tokenize(raw) tags = default_tagger.tag(tokens) print(tags) 正则表达式标注器,RegexpTagge,满足特定正则表达式词性。 # coding:utf-8 import sys import importlib importlib.reload(sys) import nltk pattern = [(r'.*们$','PRO')] tagger = nltk.RegexpTagger(pattern) print(tagger.tag(nltk.word_tokenize('我们 一起 去 你们 和 他们 去过 的 地方'))) 查询标注器,多个最频繁词和词性,查找语料库,匹配标注,剩余词用默认标注器(回退)。 一元标注,已标注语料库训练,模型标注新语料。 # coding:utf-8 import sys import importlib importlib.reload(sys) import nltk tagged_sents = [[(u'我', u'PRO'), (u'小兔', u'NN')]] unigram_tagger = nltk.UnigramTagger(tagged_sents) sents = [[u'我', u'你', u'小兔']] # brown_tagged_sents = nltk.corpus.brown.tagged_sents(categories='news') # unigram_tagger = nltk.UnigramTagger(brown_tagged_sents) # sents = nltk.corpus.brown.sents(categories='news') tags = unigram_tagger.tag(sents[0]) print(tags) 二元标注、多元标注,一元标注 UnigramTagger 只考虑当前词,不考虑上下文。二元标注器 BigramTagger 考虑前面词。三元标注 TrigramTagger。 组合标注器,提高精度和覆盖率,多种标注器组合。 标注器存储,训练好持久化,存储硬盘。加载。 # coding:utf-8 import sys import importlib importlib.reload(sys) import nltk train_sents = [[(u'我', u'PRO'), (u'小兔', u'NN')]] t0 = nltk.DefaultTagger('NN') t1 = nltk.UnigramTagger(train_sents, backoff=t0) t2 = nltk.BigramTagger(train_sents, backoff=t1) sents = [[u'我', u'你', u'小兔']] tags = t2.tag(sents[0]) print(tags) from pickle import dump print(t2) output = open('t2.pkl', 'wb') dump(t2, output, -1) output.close() from pickle import load input = open('t2.pkl', 'rb') tagger = load(input) input.close() print(tagger) 机器学习,训练模型,已知数据统计学习;使用模型,统计学习模型计算未知数据。有监督,训练样本数据有确定判断,断定新数据。无监督,训练样本数据没有判断,自发生成结论。最难是选算法。 贝叶斯,概率论,随机事件条件概率。公式:P(B|A)=P(A|B)P(B)/P(A)。已知P(A|B)、P(A)、P(B),计算P(B|A)。贝叶斯分类器: # coding:utf-8 import sys import importlib importlib.reload(sys) import nltk my_train_set = [ ({'feature1':u'a'},'1'), ({'feature1':u'a'},'2'), ({'feature1':u'a'},'3'), ({'feature1':u'a'},'3'), ({'feature1':u'b'},'2'), ({'feature1':u'b'},'2'), ({'feature1':u'b'},'2'), ({'feature1':u'b'},'2'), ({'feature1':u'b'},'2'), ({'feature1':u'b'},'2'), ] classifier = nltk.NaiveBayesClassifier.train(my_train_set) print(classifier.classify({'feature1':u'a'})) print(classifier.classify({'feature1':u'b'})) 分类,最重要知道哪些特征最能反映分类特点,特征选取。文档分类,最能代表分类词。特征提取,找到最优信息量特征: # coding:utf-8 import sys import importlib importlib.reload(sys) import nltk from nltk.corpus import movie_reviews import random documents =[(list(movie_reviews.words(fileid)),category)for category in movie_reviews.categories()for fileid in movie_reviews.fileids(category)] random.shuffle(documents) all_words = nltk.FreqDist(w.lower() for w in movie_reviews.words()) word_features = [word for (word, freq) in all_words.most_common(2000)] def document_features(document): document_words = set(document) features = {} for word in word_features: 
features['contains(%s)' % word] = (word in document_words) return features featuresets = [(document_features(d), c) for (d,c) in documents] # classifier = nltk.NaiveBayesClassifier.train(featuresets) # classifier.classify(document_features(d)) train_set, test_set = featuresets[100:], featuresets[:100] classifier = nltk.NaiveBayesClassifier.train(train_set) print(nltk.classify.accuracy(classifier, test_set)) classifier.show_most_informative_features(5) 词性标注,上下文语境文本分类。句子分割,标点符号分类,选取单独句子标识符合并链表、数据特征。识别对话行为,问候、问题、回答、断言、说明。识别文字蕴含,句子能否得出另一句子结论,真假标签。 参考资料:http://www.shareditor.com/blogshow?blogId=67http://www.shareditor.com/blogshow?blogId=69https://www.jianshu.com/p/6e5ace051c1e《Python 自然语言处理》 欢迎推荐上海机器学习工作机会,我的微信:qingxingfengzi
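上面提到"句子分割是标点符号的分类问题",按这个思路,特征提取函数可以写成下面的草图(仿NLTK书的做法;tokens是词序列,i是候选标点的位置,boundaries是已标注的句子边界集合,均为假设的名字):

def punct_features(tokens, i):
    """把"这个标点是否句子边界"当作二分类问题时的特征"""
    return {
        'punct': tokens[i],                                   # 标点本身
        'next-word-capitalized': tokens[i + 1][0].isupper(),  # 下一个词是否大写开头
        'prev-word': tokens[i - 1].lower(),                   # 前一个词
        'prev-word-is-one-char': len(tokens[i - 1]) == 1,     # 前一个词是否只有一个字符
    }

# 配合标注好的boundaries构造训练集,再用nltk.NaiveBayesClassifier.train训练:
# featuresets = [(punct_features(tokens, i), (i in boundaries))
#                for i in range(1, len(tokens) - 1) if tokens[i] in '.?!']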
聊天机器人知识主要是自然语言处理。包括语言分析和理解、语言生成、机器学习、人机对话、信息检索、信息传输与信息存储、文本分类、自动文摘、数学方法、语言资源、系统评测。 NLTK库安装,pip install nltk 。执行python。下载书籍,import nltk,nltk.download(),选择book,点Download。下载完,加载书籍,from nltk.book import 。输入text书籍节点,输出书籍标题。搜索文本,text1.concordance("former”) 。搜索相关词,text1.similar("ship") 。查看词在文章的位置,text4.dispersion_plot(["citizens", "democracy", "freedom", "duties", "America"]) ,可以按Ctr+Z退出。继续尝试其他函数需要重新执行python,重新加载书籍。词统计,总字数 len(text1),文本所有词集合 set(text1),文本总词数 len(set(text4)),单词出现总次数 text4.count("is") ,统计文章词频从大到小排序到列表 FreqDist(text1),统计词频输出累计图 fdist1 = FreqDist(text1);fdist1.plot(50, cumulative=True),只出现一次的词 fdist1.hapaxes(),频繁双联词 text4.collocations() 。 自然语言处理关键点,词意理解、自动生成语言,机器翻译、人机对话(图灵测试,5分钟内回答提出问题的30%)。基于规则,完全从语法句法出发,照语言规则分析、理解。基于统计,收集大量语料数据,统计学习理解语言,得益于硬件(GPU)、大数据、深度学习的发展。 NLTK语料库,Gutenberg,nltk.corpus.gutenberg.fileids()。Gutenberg语料库文件标识符,import nltk,nltk.corpus.gutenberg.fileids()。Gutenberg语料库阅读器 nltk.corpus.gutenberg。输出文章原始内容 nltk.corpus.gutenberg.raw('chesterton-brown.txt') 。输出文章单词列表 nltk.corpus.gutenberg.words('chesterton-brown.txt') 。输出文章句子列表 nltk.corpus.gutenberg.sents('chesterton-brown.txt') 。网络文本语料库,网络和聊天文本,from nltk.corpus import webtext 。布朗语料库,按照文本分类好500个不同来源文本,from nltk.corpus import brown 。路透社语料库,1万多个新闻文档,from nltk.corpus import reuters 。就职演说语料库,55个总统的演说,from nltk.corpus import inaugural 。 语料库组织结构,散养式(孤立多篇文章)、分类式(按照类别组织,但没有交集)、交叉式(文章属多个类)、渐变式(语法随时间发生变化)。 语料库通用接口,文件 fileids(),分类 categories(),原始内容 raw(),词汇 words(),句子 sents(),指定文件磁盘位置 abspath(),文件流 open()。 加载自定义语料库,from nltk.corpus import PlaintextCorpusReader ,corpus_root = '/Users/libinggen/Documents/workspace/Python/robot/txt' ,wordlists = PlaintextCorpusReader(corpus_root, '.*') ,wordlists.fileids() 。 格式转换GBK2UTF8,iconv -f GBK -t UTF-8 安娜·卡列尼娜.txt > 安娜·卡列尼娜utf8.txt 。 条件分布,在一定条件下事件概率颁上。条件频率分布,指定条件下事件频率分布。 输出布朗语料库每个类别条件每个词概率: # coding:utf-8 import sys import importlib importlib.reload(sys) import nltk from nltk.corpus import brown # 链表推导式,genre是brown语料库里的所有类别列表,word是这个类别中的词汇列表 # (genre, word)就是类别加词汇对 genre_word = [(genre, word) for genre in brown.categories() for word in brown.words(categories=genre) ] # 创建条件频率分布 cfd = nltk.ConditionalFreqDist(genre_word) # 指定条件和样本作图 # cfd.tabulate(conditions=['news','adventure'], samples=[u'stock', u'sunbonnet', u'Elevated', u'narcotic', u'four', u'woods', u'railing', u'Until', u'aggression', u'marching', u'looking', u'eligible', u'electricity', u'$25-a-plate', u'consulate', u'Casey', u'all-county', u'Belgians', u'Western', u'1959-60', u'Duhagon', u'sinking', u'1,119', u'co-operation', u'Famed', u'regional', u'Charitable', u'appropriation', u'yellow', u'uncertain', u'Heights', u'bringing', u'prize', u'Loen', u'Publique', u'wooden', u'Loeb', u'963', u'specialties', u'Sands', u'succession', u'Paul', u'Phyfe']) cfd.plot(conditions=['news','adventure'], samples=[u'stock', u'sunbonnet', u'Elevated', u'narcotic', u'four', u'woods', u'railing', u'Until', u'aggression', u'marching', u'looking', u'eligible', u'electricity', u'$25-a-plate', u'consulate', u'Casey', u'all-county', u'Belgians', u'Western', u'1959-60', u'Duhagon', u'sinking', u'1,119', u'co-operation', u'Famed', u'regional', u'Charitable', u'appropriation', u'yellow', u'uncertain', u'Heights', u'bringing', u'prize', u'Loen', u'Publique', u'wooden', u'Loeb', u'963', u'specialties', u'Sands', u'succession', u'Paul', u'Phyfe']) 利用条件频率分布,按照最大条件概率生成双连词,生成随机文本: # coding:utf-8 import sys import importlib importlib.reload(sys) import nltk # 循环10次,从cfdist中取当前单词最大概率的连词,并打印出来 def generate_model(cfdist, word, num=10): for i in range(num): print(word), word 
= cfdist[word].max() # 加载语料库 text = nltk.corpus.genesis.words('english-kjv.txt') # 生成双连词 bigrams = nltk.bigrams(text) # 生成条件频率分布 cfd = nltk.ConditionalFreqDist(bigrams) # 以the开头,生成随机串 generate_model(cfd, 'the') 词典资源,词或短语集合:词汇列表语料库,所有英文单词,识别语法错误 nltk.corpus.words.words 。停用词语料库,识别最频繁出现没有意义词 nltk.corpus.stopwords.words 。发音词典,输出英文单词发音 nltk.corpus.cmudict.dict 。比较词表,多种语言核心200多个词对照,语言翻译基础 nltk.corpus.swadesh 。同义词集,面向语义英语词典,同义词集网络 WordNet 。 参考资料: http://www.shareditor.com/blogshow/?blogId=63 http://www.shareditor.com/blogshow?blogId=64 http://www.shareditor.com/blogshow?blogId=65 欢迎推荐上海机器学习工作机会,我的微信:qingxingfengzi
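词典资源可以和前面的词频统计配合使用,比如用停用词表过滤FreqDist,只看有实际意义的高频词。草图:

import nltk
from nltk.corpus import stopwords
from nltk.book import text1

stop = set(stopwords.words('english'))
# 过滤掉停用词和非字母token后再统计词频
fdist = nltk.FreqDist(w.lower() for w in text1
                      if w.isalpha() and w.lower() not in stop)
print(fdist.most_common(20))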
概率和信息论。

概率论,表示不确定性声明的数学框架,提供量化不确定性的方法,也提供导出新的不确定性声明(statement)的公理。人工智能领域用概率法则做AI系统推理,设计算法计算概率论导出的表达式;用概率和统计理论分析AI系统的行为。概率论提出不确定声明,在不确定性存在的情况下推理;信息论量化概率分布中不确定性的总量。Jaynes(2003)。机器学习经常处理不确定量,有时还要处理随机(非确定性)量。20世纪80年代,研究人员对用概率论量化不确定性提出了令人信服的论据。Pearl(1988)。

不确定性有三种来源:被建模系统内在的随机性;不完全观测,确定性系统在不能观测到所有驱动系统行为的变量时,也呈现随机性;不完全建模,模型舍弃了观测信息,导致预测的不确定性。简单而不确定的规则比复杂而确定的规则更实用,即使真正的规则是确定的,并且建模系统精确到足以容纳复杂规则。

概率论最初用于分析事件发生的频率,这类事件可以重复:说结果发生的概率为p,意思是反复试验无限次,有p的比例会导致这个结果。概率也可以表示信任度(degree of belief)。直接与事件发生频率相联系的,是频率派概率(frequentist probability);涉及确定性水平的,是贝叶斯概率(Bayesian probability)。对不确定性做常识推理,列出若干条期望的性质,满足这些性质的唯一方法是让贝叶斯概率和频率派概率的行为等同。Ramsey(1926)。概率可以看作处理不确定性的逻辑扩展:逻辑提供形式化规则,在给定命题真假时判断另一些命题的真假;概率论提供形式化规则,在给定命题似然时计算其他命题为真的似然。

随机变量(random variable),随机取不同值的变量。用无格式字体(plain typeface)小写字母表示随机变量,手写体小写字母表示随机变量的取值。随机变量只是对可能状态的描述,必须伴随一个概率分布来指定每个状态的可能性。随机变量可以离散或连续:离散随机变量有有限或可数无限多个状态(状态不一定是数值);连续随机变量伴随实数值。

概率分布(probability distribution),描述随机变量或一簇随机变量取每个状态的可能性大小。描述方式取决于随机变量是离散还是连续。

离散型变量和概率质量函数。离散型变量的概率分布用概率质量函数(probability mass function,PMF)描述,用大写字母P表示。每个随机变量有各自的概率质量函数,根据随机变量推断所用的PMF。概率质量函数将随机变量的每个状态映射到取该状态的概率:x=x的概率用P(x)表示,概率为1表示x=x确定,概率为0表示x=x不可能发生;也可以明确写出随机变量名称P(x=x);定义随机变量时,用~符号说明它遵循的分布,x~P(x)。概率质量函数可以同时作用于多个随机变量,多个变量的概率分布称联合概率分布(joint probability distribution):P(x=x, y=y)表示x=x和y=y同时发生的概率,可简写为P(x,y)。函数P作为随机变量x的PMF,必须满足:P的定义域是x所有可能状态的集合;∀x∈X,0≤P(x)≤1,不可能发生的事件概率为0,不存在概率更低的状态,一定发生的事件概率为1,不存在概率更高的状态;∑(x∈X) P(x)=1,即归一化(normalized)。

例:离散型随机变量x有k个不同状态,让x均匀分布(uniform distribution),每个状态均等可能,PMF为P(x=xᵢ)=1/k,对所有i成立。k是正整数,1/k为正;且∑ᵢ P(x=xᵢ)=∑ᵢ 1/k=k/k=1,分布满足归一化条件。

连续型变量和概率密度函数。连续型随机变量用概率密度函数(probability density function,PDF)p描述概率分布,p必须满足:定义域是x所有可能状态的集合;∀x∈X,p(x)≥0,但不要求p(x)≤1;∫p(x)dx=1。概率密度函数p(x)给出的是落在面积为δx的无限小区域内的概率p(x)δx。对概率密度函数求积分,才获得点集的真实概率质量:x落在集合S中的概率是p(x)对该集合的积分;单变量时,x落在区间[a,b]的概率是∫[a,b] p(x)dx。

例:实数区间上的均匀分布,函数u(x;a,b),a和b是区间端点,满足b>a,符号";"表示"以什么为参数":x是函数自变量,a和b是定义函数的参数。为确保区间外没有概率,对所有x∉[a,b],令u(x;a,b)=0;在[a,b]内,u(x;a,b)=1/(b-a)。任何一点都非负,且积分为1。x~U(a,b)表示x在[a,b]上均匀分布。

边缘概率。定义在子集上的概率分布称边缘概率分布(marginal probability distribution)。已知离散型随机变量x和y的P(x,y),用求和法则(sum rule)计算P(x):∀x∈X,P(x=x)=∑y P(x=x, y=y)。"边缘概率"的名称来源于手算边缘概率的过程:P(x,y)的每个值被写在每行表示不同x值、每列表示不同y值的网格中,对网格每行求和,结果P(x)写在每行右边的纸边缘处。连续型变量用积分替代求和:p(x)=∫p(x,y)dy。

条件概率,某个事件在给定其他事件发生时出现的概率。给定x=x时y=y发生的条件概率记作P(y=y|x=x),P(y=y|x=x)=P(y=y, x=x)/P(x=x)。条件概率只在P(x=x)>0时有定义,不能计算以永远不会发生的事件为条件的条件概率。注意不要把条件概率和"采用某个动作后会发生什么"的计算相混淆。

条件概率的链式法则。任何多维随机变量的联合概率分布,都可以分解成只有一个变量的条件概率相乘的形式:P(x(1),…,x(n))=P(x(1))∏(i=2..n) P(x(i)|x(1),…,x(i-1)),称为概率的链式法则(chain rule)或乘法法则(product rule),从条件概率的定义可直接得到。使用两次定义:P(a,b,c)=P(a|b,c)P(b,c),P(b,c)=P(b|c)P(c),故P(a,b,c)=P(a|b,c)P(b|c)P(c)。

独立性和条件独立性。两个随机变量x和y,若概率分布可以表示成两个因子乘积的形式,一个因子只包含x,另一个因子只包含y,则两个随机变量相互独立(independent):∀x∈X, y∈Y,p(x=x, y=y)=p(x=x)p(y=y)。若x和y的条件概率分布对于z的每一个值都能写成乘积形式,则x和y在给定z时条件独立(conditionally independent):∀x∈X, y∈Y, z∈Z,p(x=x, y=y|z=z)=p(x=x|z=z)p(y=y|z=z)。可用简化形式表示独立性和条件独立性:x⊥y表示x和y相互独立,x⊥y|z表示x和y在给定z时条件独立。

期望、方差和协方差。函数f(x)关于某分布P(x)的期望(expectation)或期望值(expected value),是当x由P产生、f作用于x时,f(x)的平均值。离散型随机变量通过求和得到:E(x~P)[f(x)]=∑x P(x)f(x);连续型随机变量通过求积分得到:E(x~p)[f(x)]=∫p(x)f(x)dx。概率分布在上下文中已指明时,可以只写出期望作用的随机变量名称来简化,如Ex[f(x)];期望作用的随机变量明确时,可以不写脚标,E[f(x)]。默认E[·]表示对方括号内所有随机变量的值求平均;没有歧义时,还可以省略方括号。期望是线性的:Ex[af(x)+bg(x)]=aEx[f(x)]+bEx[g(x)],其中a和b不依赖x。

方差(variance)衡量x依据概率分布采样时,随机变量x的函数值呈现多大差异:Var(f(x))=E[(f(x)-E[f(x)])²]。方差很小时,f(x)的值形成的簇比较接近期望值。方差的平方根称标准差(standard deviation)。
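这些定义可以用一张很小的联合分布表做数值验证。下面用numpy检验归一化、求和法则、条件概率、链式法则和期望、方差(表中数值是虚构的):

import numpy as np

# 2x2联合分布表:P[i, j] = P(x=i, y=j)
P = np.array([[0.1, 0.2],
              [0.3, 0.4]])
assert abs(P.sum() - 1.0) < 1e-12                 # 归一化:ΣΣP(x,y)=1

Px = P.sum(axis=1)                                # 求和法则:P(x)=Σy P(x,y)
Py_given_x = P / Px[:, None]                      # 条件概率:P(y|x)=P(x,y)/P(x)
assert np.allclose(Py_given_x * Px[:, None], P)   # 链式法则:P(x,y)=P(y|x)P(x)

f = np.array([1.0, 3.0])                          # f(x)在x=0,1上的取值
Ef = (Px * f).sum()                               # 期望:E[f(x)]=Σx P(x)f(x)
Varf = (Px * (f - Ef) ** 2).sum()                 # 方差:E[(f(x)-E[f(x)])²]
print(Ef, Varf)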
协方差(covariance),给出两个变量线性相关性的强度及变量的尺度:Cov(f(x),g(y))=E[(f(x)−E[f(x)])(g(y)−E[g(y)])]。协方差绝对值很大,意味着变量值变化很大,且同时距离各自的均值很远。协方差为正,两个变量倾向于同时取得相对较大的值;协方差为负,一个变量倾向于取较大值时,另一个变量倾向于取较小值。其他衡量指标如相关系数(correlation),将每个变量的贡献归一化,只衡量变量的相关性,不受各个变量尺度大小的影响。
协方差和相关性有联系,但是不同的概念。联系:两个变量相互独立,协方差为零;两个变量协方差不为零,两者一定相关。但独立性是比零协方差更强的要求:两个变量协方差为零,只说明它们之间没有线性相关,相互依赖但协方差为零是可能的。例:从区间[−1,1]的均匀分布采样一个实数x,再对一个随机变量s采样,s以1/2概率取值1,否则取−1,令y=sx生成随机变量y。x和y不相互独立,x完全决定y的尺度,但Cov(x,y)=0。
随机向量x∈ℝⁿ的协方差矩阵(covariance matrix)是n×n矩阵,满足Cov(x)₍i,j₎=Cov(xᵢ,xⱼ)。协方差矩阵的对角元是方差:Cov(xᵢ,xᵢ)=Var(xᵢ)。
参考资料:
《深度学习》
欢迎推荐上海机器学习工作机会,我的微信:qingxingfengzi
我有一个微信群,欢迎一起学深度学习。
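上面"相互依赖但协方差为零"的例子可以直接用 numpy 做数值验证:
import numpy as np
rng = np.random.default_rng(0)
n = 1000000
x = rng.uniform(-1.0, 1.0, size=n)   # x 从 [-1,1] 均匀分布采样
s = rng.choice([-1.0, 1.0], size=n)  # s 以 1/2 概率取 1,否则取 -1
y = s * x                            # y=sx:x 完全决定 y 的尺度,但两者不独立
print(np.cov(x, y)[0, 1])            # 接近 0:协方差为零
print(np.corrcoef(np.abs(x), np.abs(y))[0, 1])  # 等于 1:|y| 恒等于 |x|,可见依赖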
Moore-Penrose伪逆(pseudoinverse)。
非方矩阵,逆矩阵没有定义。希望通过矩阵A的左逆B求解线性方程Ax=y:两边左乘左逆B,得x=By。可能无法设计唯一的映射将A映射到B:矩阵A行数大于列数,方程可能无解;矩阵A行数小于列数,方程可能有多个解。
矩阵A的伪逆定义为A⁺=lim₍α→0₎(A⫟A+αI)⁻¹A⫟。实际计算伪逆用公式A⁺=VD⁺U⫟,矩阵U、D、V是矩阵A奇异值分解得到的矩阵,对角矩阵D的伪逆D⁺是其非零元素取倒数后再转置。矩阵A列数多于行数时,用伪逆求解线性方程是众多可能解法中的一种:x=A⁺y是方程所有可行解中欧几里得范数||x||₂最小的一个。矩阵A行数多于列数时,可能没有解,伪逆得到的x使Ax和y的欧几里得距离||Ax−y||₂最小。
迹运算。
返回矩阵对角元素的和:Tr(A)=Σᵢ A₍i,i₎。通过矩阵乘法和迹运算符号可以清楚地表示一些矩阵运算,如矩阵的Frobenius范数:||A||F=sqrt(Tr(AA⫟))。迹运算在转置运算下不变:Tr(A)=Tr(A⫟)。多个矩阵相乘得到的方阵的迹,和将最后一个矩阵挪到最前面相乘的迹相同,需保证挪动后矩阵乘积仍定义良好:Tr(ABC)=Tr(CAB)=Tr(BCA),更一般地,Tr(Π₍i=1..n₎F⁽ⁱ⁾)=Tr(F⁽ⁿ⁾Π₍i=1..n−1₎F⁽ⁱ⁾)。即使循环置换后矩阵乘积的形状变了,迹运算的结果依然不变:矩阵A∈ℝ⁽m*n⁾、矩阵B∈ℝ⁽n*m⁾,有Tr(AB)=Tr(BA),尽管AB∈ℝ⁽m*m⁾而BA∈ℝ⁽n*n⁾。标量在迹运算后仍是自己:a=Tr(a)。
行列式。
det(A),将方阵A映射到实数的函数。行列式等于矩阵特征值的乘积。行列式的绝对值衡量矩阵参与矩阵乘法后空间扩大或缩小了多少:行列式是0,空间沿着某一维完全收缩,失去所有体积;行列式是1,转换保持空间体积不变。
主成分分析(principal components analysis,PCA)。
简单的机器学习算法,只用基础线性代数知识即可推导。ℝⁿ空间中有m个点{x⁽¹⁾,…,x⁽ᵐ⁾},对其有损压缩:用更少的内存、可损失一些精度地存储,并希望损失的精度尽可能少。低维表示:每个点x⁽ⁱ⁾∈ℝⁿ对应一个编码向量c⁽ⁱ⁾∈ℝˡ(l比n小),解码器g(c)=Dc,D∈ℝ⁽n*l⁾。如果按比例缩小所有点对应的编码向量,只需按比例放大D₍:,i₎即可保持结果不变,为使问题有唯一解,限制D所有列向量有单位范数。为让计算最优编码容易些,PCA限制D的列向量彼此正交(除非l=n,严格意义上D不是正交矩阵)。
把想法变成算法:首先明确如何根据每一个输入x得到一个最优编码c*。
最小化原始输入向量x和重构向量g(c*)之间的距离,用范数衡量距离。PCA算法用L²范数:c*=argmin_c ||x−g(c)||₂。可用平方L²范数替代L²范数,两者在相同的c上取得最小值,因为L²范数非负,且平方运算在非负值上单调递增:c*=argmin_c ||x−g(c)||₂²。把最小化的函数展开为(x−g(c))⫟(x−g(c)):由L²范数定义,=x⫟x−x⫟g(c)−g(c)⫟x+g(c)⫟g(c);由分配律及标量g(c)⫟x的转置等于自己,=x⫟x−2x⫟g(c)+g(c)⫟g(c)。第一项x⫟x不依赖c,可忽略,优化目标变为:c*=argmin_c −2x⫟g(c)+g(c)⫟g(c)。代入g(c)=Dc:c*=argmin_c −2x⫟Dc+c⫟D⫟Dc=argmin_c −2x⫟Dc+c⫟I_l c,由矩阵D的正交性和单位范数约束,=argmin_c −2x⫟Dc+c⫟c。
用向量微积分求解最优化问题:∇_c(−2x⫟Dc+c⫟c)=0,−2D⫟x+2c=0,c=D⫟x。算法高效:得到最优编码只需一个矩阵-向量乘法操作。编码函数f(x)=D⫟x。再用一次矩阵乘法,定义PCA重构操作:r(x)=g(f(x))=DD⫟x。
接下来挑选编码矩阵D。用相同的矩阵D对所有点解码,不能再孤立地看待每个点,需最小化所有维数和所有点上误差矩阵的Frobenius范数:D*=argmin_D sqrt(Σ₍i,j₎(x⁽ⁱ⁾ⱼ−r(x⁽ⁱ⁾)ⱼ)²) subject to D⫟D=I_l。从l=1开始推导寻求D*的算法,此时D是单一向量d,问题简化为:d*=argmin_d Σᵢ||x⁽ⁱ⁾−dd⫟x⁽ⁱ⁾||₂² subject to ||d||₂=1。把标量d⫟x⁽ⁱ⁾放在向量d右边不是最美观的书写方式,标量系数放在左边更传统:d*=argmin_d Σᵢ||x⁽ⁱ⁾−d⫟x⁽ⁱ⁾d||₂² subject to ||d||₂=1;又因标量的转置和自身相等:d*=argmin_d Σᵢ||x⁽ⁱ⁾−x⁽ⁱ⁾⫟dd||₂² subject to ||d||₂=1。
用单一矩阵重述问题,符号更紧凑:将各点向量堆叠成矩阵,记X∈ℝ⁽m*n⁾,X₍i,:₎=x⁽ⁱ⁾⫟。问题重新表述为:d*=argmin_d ||X−Xdd⫟||F² subject to d⫟d=1。先不考虑约束,化简Frobenius范数:
argmin_d ||X−Xdd⫟||F²
=argmin_d Tr((X−Xdd⫟)⫟(X−Xdd⫟))
=argmin_d Tr(X⫟X−X⫟Xdd⫟−dd⫟X⫟X+dd⫟X⫟Xdd⫟)
=argmin_d Tr(X⫟X)−Tr(X⫟Xdd⫟)−Tr(dd⫟X⫟X)+Tr(dd⫟X⫟Xdd⫟)
=argmin_d −Tr(X⫟Xdd⫟)−Tr(dd⫟X⫟X)+Tr(dd⫟X⫟Xdd⫟)(与d无关的项不影响argmin)
=argmin_d −2Tr(X⫟Xdd⫟)+Tr(dd⫟X⫟Xdd⫟)
=argmin_d −2Tr(X⫟Xdd⫟)+Tr(X⫟Xdd⫟dd⫟)(循环改变迹运算中相乘矩阵的顺序不影响结果)。
再考虑约束条件d⫟d=1:
argmin_d −2Tr(X⫟Xdd⫟)+Tr(X⫟Xdd⫟dd⫟) subject to d⫟d=1
=argmin_d −2Tr(X⫟Xdd⫟)+Tr(X⫟Xdd⫟) subject to d⫟d=1
=argmin_d −Tr(X⫟Xdd⫟) subject to d⫟d=1
=argmax_d Tr(X⫟Xdd⫟) subject to d⫟d=1
=argmax_d Tr(d⫟X⫟Xd) subject to d⫟d=1。
这个优化问题可用特征分解求解:最优的d是X⫟X最大特征值对应的特征向量。
以上推导特定于l=1的情况,仅得到第一个主成分。要得到主成分的基时,矩阵D由前l个最大特征值对应的特征向量组成,可用归纳法证明。
参考资料:
《深度学习》
欢迎推荐上海机器学习工作机会,我的微信:qingxingfengzi
我有一个微信群,欢迎一起学深度学习。
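用 numpy 对伪逆和 l=1 时的 PCA 结论做数值验证(数据为随机生成的假设样本):
import numpy as np
rng = np.random.default_rng(0)
# 伪逆:A 行数多于列数时,x = A⁺y 使 ||Ax-y||₂ 最小
A = rng.normal(size=(6, 3))
y = rng.normal(size=6)
x = np.linalg.pinv(A) @ y
print(np.allclose(A.T @ (A @ x - y), 0))    # 残差与 A 的列空间正交(正规方程成立)
# PCA(l=1):最优 d 是 X⫟X 最大特征值对应的特征向量
X = rng.normal(size=(200, 5))
X = X - X.mean(axis=0)
eigvals, eigvecs = np.linalg.eigh(X.T @ X)  # eigh 按特征值升序返回
d = eigvecs[:, -1]                          # 最大特征值对应的特征向量
X_rec = X @ np.outer(d, d)                  # 按行批量计算重构 r(x)=dd⫟x
print(np.linalg.norm(X - X_rec, 'fro'))     # 该 d 使 Frobenius 重构误差最小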
线性相关、生成子空间。
逆矩阵A⁽-1⁾存在时,Ax=b对每个向量b恰好存在一个解。方程组对向量b的某些值,可能不存在解,或者存在无限多个解:若x、y都是方程组的解,则z=αx+(1−α)y(α取任意实数)也是方程组的解。
A的列向量看作从原点(origin,元素都是零的向量)出发的不同方向,确定有多少种方法到达向量b。向量x的每个元素表示沿对应方向走多远:xi表示沿第i个向量的方向走多远,Ax=Σᵢxi A:,i。这称线性组合(linear combination):一组向量的线性组合,是每个向量乘以对应标量系数之和,Σᵢci v⁽ⁱ⁾。一组向量的生成子空间(span)是原始向量线性组合后能抵达的点的集合。确定Ax=b是否有解,相当于确定向量b是否在A列向量的生成子空间中,这个子空间称A的列空间(column space)或A的值域(range)。方程Ax=b对任意向量b∈ℝ⁽m⁾都存在解,要求A的列空间构成整个ℝ⁽m⁾;ℝ⁽m⁾中的点不在A的列空间中时,对应的b使方程没有解。矩阵A的列空间是整个ℝ⁽m⁾的要求,意味着A至少有m列,即n>=m;否则,A列空间的维数小于m。
列向量冗余称线性相关(linear dependence)。一组向量中任意一个向量都不能表示成其他向量的线性组合,称线性无关(linearly independent):某个向量是一组向量中某些向量的线性组合,把这个向量加入这组向量不会增加这组向量的生成子空间。一个矩阵的列空间要涵盖整个ℝ⁽m⁾,矩阵必须包含一组m个线性无关的向量,这是Ax=b对每个向量b取值都有解的充分必要条件。注意是向量集恰好有m个线性无关的列向量,而不是至少m个:不存在一个m维向量集合有多于m个彼此线性无关的列向量,但一个有多于m个列向量的矩阵有可能有不止一个大小为m的线性无关向量集。
矩阵可逆,还需要保证Ax=b对每个b值至多有一个解,即确保矩阵至多有m个列向量。综上,矩阵必须是一个方阵(square),即m=n,且所有列向量线性无关。列向量线性相关的方阵称奇异的(singular)。矩阵不是方阵或是奇异方阵时,方程仍可能有解,但不能用矩阵逆求解。逆矩阵也可以右乘:AA⁽-1⁾=I,左逆、右逆相等。
范数(norm)。
衡量向量大小。L⁽p⁾范数:||x||p=(Σᵢ|xi|⁽p⁾)⁽1/p⁾,p∈ℝ,p>=1。范数(L⁽p⁾范数)是将向量映射到非负值的函数:向量x的范数衡量从原点到点x的距离。范数满足性质:f(x)=0=>x=0;f(x+y)<=f(x)+f(y),三角不等式(triangle inequality);∀α∈ℝ,f(αx)=|α|f(x)。
p=2时,L⁽2⁾范数称欧几里得范数(Euclidean norm),表示从原点出发到向量x确定的点的欧几里得距离,常简化记||x||,略去下标2。平方L⁽2⁾范数也经常用来衡量向量大小,可通过点积x⫟x计算。平方L⁽2⁾范数在数学、计算上比L⁽2⁾范数更方便:平方L⁽2⁾范数对x中每个元素的导数只取决于对应元素,而L⁽2⁾范数对每个元素的导数和整个向量相关。但平方L⁽2⁾范数在原点附近增长缓慢,区分恰好是零和非零但值很小的元素时不理想。
L⁽1⁾范数,在各个位置斜率相同,且保持简单的数学形式:||x||1=Σᵢ|xi|。机器学习问题中零和非零元素的差异很重要时,用L⁽1⁾范数:当x中某个元素从0增加∊,对应的L⁽1⁾范数也增加∊。(向量的非零元素数目不是范数,因为向量缩放α倍不会改变非零元素的数目;L⁽1⁾范数经常作为表示非零元素数目的替代函数。)
L⁽∞⁾范数,最大范数(max norm),表示向量中具有最大幅值的元素的绝对值:||x||₍∞₎=maxᵢ|xi|。
Frobenius范数(Frobenius norm),衡量矩阵大小:||A||F=sqrt(Σ₍i,j₎A⁽2⁾₍i,j₎)。
两个向量的点积可用范数表示:x⫟y=||x||2||y||2cosθ,θ表示x、y间的夹角。
特殊类型矩阵、向量。
对角矩阵(diagonal matrix),只在主对角线上有非零元素,其他位置都是零:矩阵D是对角矩阵,当且仅当对所有i != j,Di,j=0。单位矩阵是对角元素全部是1的对角矩阵。
diag(v)表示对角元素由向量v中元素给定的一个对角方阵。对角矩阵的乘法计算高效:计算diag(v)x,只需把x中每个元素xi放大vi倍,diag(v)x=v⊙x。计算对角方阵的逆矩阵也很高效:对角方阵的逆矩阵存在,当且仅当对角元素都是非零值,此时diag(v)⁽-1⁾=diag([1/v1,…,1/vn]⫟)。根据任意矩阵导出的通用机器学习算法,通过将矩阵限制为对角矩阵,可以得到计算代价较低(也简单扼要)的算法。
并非所有对角矩阵都是方阵,长方形矩阵也有可能是对角矩阵。非方阵的对角矩阵没有逆矩阵,但仍有高效的乘法计算:长方形对角矩阵D,乘法Dx涉及x每个元素的缩放,D是瘦长型矩阵时,缩放后在末尾添加零;D是胖宽型矩阵时,缩放后去掉最后的元素。
对称(symmetric)矩阵,转置和自己相等的矩阵:A=A⫟。由不依赖参数顺序的双参数函数生成元素时,对称矩阵常出现:如A是距离度量矩阵,Ai,j表示点i到点j的距离,则Ai,j=Aj,i,因为距离函数对称。
单位向量(unit vector),具有单位范数(unit norm)的向量:||x||2=1。
x⫟y=0时,向量x和向量y互相正交(orthogonal)。两个向量都有非零范数时,两个向量间的夹角是90°。ℝⁿ中至多有n个范数非零的向量互相正交。向量不但互相正交,且范数都为1,称标准正交(orthonormal)。
正交矩阵(orthogonal matrix),行向量和列向量分别标准正交的方阵:A⫟A=AA⫟=I,即A⁽-1⁾=A⫟,因此正交矩阵求逆计算代价小。注意正交矩阵的行向量不仅是正交的,还是标准正交的。行向量或列向量互相正交但不标准正交的矩阵,没有对应的专有术语。
参考资料:《深度学习》
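各范数可用 numpy 直接验证:
import numpy as np
x = np.array([3.0, -4.0, 0.0])
print(np.linalg.norm(x, 1))       # L1 范数:|3|+|-4|+|0|=7
print(np.linalg.norm(x, 2))       # L2(欧几里得)范数:5
print(np.linalg.norm(x, np.inf))  # 最大范数:4
A = np.array([[1.0, 2.0], [3.0, 4.0]])
print(np.linalg.norm(A, 'fro'))   # Frobenius 范数
# 正交矩阵:Q⫟Q=I,求逆只需转置
Q, _ = np.linalg.qr(np.random.default_rng(0).normal(size=(3, 3)))
print(np.allclose(Q.T @ Q, np.eye(3)))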
线性代数,面向连续数学,而非离散数学。《The Matrix Cookbook》,Petersen and Pedersen,2006。Shilov(1977)。
标量、向量、矩阵、张量。
标量(scalar)。一个标量是一个单独的数,而其他大部分对象是多个数的数组。斜体表示标量,用小写变量名称。定义标量时明确它是哪种类型的数:实数标量,令s∊ℝ表示一条线的斜率;自然数标量,令n∊ℕ表示元素的数目。
向量(vector)。一个向量是一列数,有序排列,通过次序中的索引确定每个单独的数。粗体小写变量名称表示向量,向量元素用带脚标的斜体表示。需注明存储在向量中的元素类型:如果每个元素都属于ℝ,且向量有n个元素,则向量属于实数集ℝ的n次笛卡儿乘积构成的集合,记ℝⁿ。明确表示向量元素时,将元素排列成一个方括号包围的纵列。向量可以看作空间中的点,每个元素是不同坐标轴上的坐标。索引向量中的一些元素时,定义包含这些元素索引的集合,把集合写在脚标处;用符号−表示集合补集中的索引。
矩阵(matrix)。一个二维数组,每个元素由两个索引确定。粗体大写变量名称表示矩阵。实数矩阵高度为m、宽度为n,记A∊ℝ⁽m*n⁾。表示矩阵元素时,用不加粗的斜体形式名称,索引用逗号间隔:A1,1表示A左上的元素,Am,n表示A右下的元素。用“:”表示一个坐标轴上的所有元素:Ai,:表示A中垂直坐标i上的一横排元素,即A的第i行(row);A:,i表示A的第i列(column)。明确表示矩阵元素时,用方括号括起数组。矩阵值表达式的索引,在表达式后接下标:f(A)i,j表示函数f作用在A上输出的矩阵的第i行第j列元素。
张量(tensor)。超过两维的数组,元素分布在若干维坐标的规则网格中。A表示张量“A”,张量A中坐标(i,j,k)的元素记Ai,j,k。
转置(transpose)。矩阵的转置,是以对角线为轴的镜像,这条从左上角到右下角的对角线称主对角线(main diagonal)。A的转置记A⫟,(A⫟)i,j=Aj,i。向量可看作只有一列的矩阵,向量的转置是只有一行的矩阵。可以把向量元素作为行矩阵写在文本行中,再用转置操作变为标准列向量来定义一个向量:x=[x1,x2,x3]⫟。标量可看作只有一个元素的矩阵,标量的转置等于它本身:a=a⫟。
矩阵形状一样,可以相加:对应位置元素相加,C=A+B,Ci,j=Ai,j+Bi,j。标量和矩阵相乘或相加,是和矩阵的每个元素相乘或相加:D=aB+c,Di,j=aBi,j+c。
深度学习中,允许矩阵和向量相加,产生另一个矩阵:C=A+b,Ci,j=Ai,j+bj,即向量b和矩阵A的每一行相加。无须在加法操作前定义将向量b复制到每一行而生成的矩阵,这种隐式复制向量b到很多位置的方式,称广播(broadcasting)。
矩阵、向量相乘。
两个矩阵A、B的矩阵乘积(matrix product)是第三个矩阵C。矩阵A的列数必须和矩阵B的行数相等:如果矩阵A的形状是m*n,矩阵B的形状是n*p,则矩阵C的形状是m*p。两个或多个矩阵并列放置书写矩阵乘法:C=AB,Ci,j=Σk(Ai,k Bk,j),即A的第i行乘B的第j列。两个矩阵对应元素的乘积称元素对应乘积(element-wise product)或Hadamard乘积(Hadamard product),记A⊙B。两个相同维数的向量x、y的点积(dot product),可看作矩阵乘积x⫟y。矩阵乘积C=AB中计算Ci,j的步骤,可看作A的第i行和B的第j列间的点积。矩阵乘积服从分配律(A(B+C)=AB+AC)、结合律(A(BC)=(AB)C),不满足交换律(AB=BA不总成立);两个向量的点积满足交换律:x⫟y=y⫟x。矩阵乘积的转置:(AB)⫟=B⫟A⫟。两个向量点积的结果是标量,标量的转置是自身,故x⫟y=(x⫟y)⫟=y⫟x。
线性方程组Ax=b:A∊ℝ⁽m*n⁾是已知矩阵,b∊ℝ⁽m⁾是已知向量,x∊ℝⁿ是待求解的未知向量,向量x的每个元素xi都未知。矩阵A的每一行和b中对应元素构成一个约束。
单位矩阵、逆矩阵。
矩阵逆(matrix inversion)。单位矩阵(identity matrix):任意向量和单位矩阵相乘,都不会改变。保持n维向量不变的单位矩阵记In,In∊ℝ⁽n*n⁾,∀x∊ℝⁿ,Inx=x。单位矩阵结构简单:所有沿主对角线的元素都是1,其他位置的所有元素都是0。矩阵A的矩阵逆记A⁽-1⁾,满足A⁽-1⁾A=In。求解式Ax=b:A⁽-1⁾Ax=A⁽-1⁾b,Inx=A⁽-1⁾b,x=A⁽-1⁾b。当逆矩阵A⁽-1⁾存在,能找到闭式解,且相同的逆矩阵可用于多次求解不同向量b的方程。但逆矩阵A⁽-1⁾在数字计算机上只能表现出有限的精度,实际中能有效使用向量b的算法可以得到更精确的x,逆矩阵A⁽-1⁾主要作为理论工具。
参考资料:《深度学习》
欢迎推荐上海机器学习工作机会,我的微信:qingxingfengzi
我有一个微信群,欢迎一起学深度学习。
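逆矩阵作为理论工具与实际数值求解的差别,可用 numpy 演示(示例方程为虚构数据):
import numpy as np
A = np.array([[2.0, 1.0], [1.0, 3.0]])
b = np.array([3.0, 5.0])
x_inv = np.linalg.inv(A) @ b      # x=A⁽-1⁾b,闭式解
x_solve = np.linalg.solve(A, b)   # 数值上更稳健的求解方式
print(np.allclose(x_inv, x_solve), np.allclose(A @ x_solve, b))
# 矩阵乘积的转置:(AB)⫟=B⫟A⫟;AB 与 BA 一般不相等
B = np.array([[0.0, 1.0], [1.0, 0.0]])
print(np.allclose((A @ B).T, B.T @ A.T), np.allclose(A @ B, B @ A))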
TensorFlow对Android、iOS、树莓派都提供移动端支持。
移动端应用原理。移动端、嵌入式设备应用深度学习,有两种方式:一是模型运行在云端服务器,移动端向服务器发送请求,接收服务器响应;二是在本地运行模型,在PC训练模型,放到移动端预测。向服务端请求数据可行性差,且移动端资源稀缺,本地运行实时性更好,但需要加速计算,做内存空间和速度的优化:精简模型,节省内存空间,加快计算速度;加快框架执行速度,优化模型复杂度和每步计算速度。精简模型,用更低的权重精度,如量化(quantization)、权重剪枝(weight pruning,剪小权重连接,把所有低于阈值的权值连接从网络移除)。加速框架执行,优化矩阵通用乘法(GEMM)运算,它影响卷积层(先对数据做im2col重排,再做GEMM运算)和全连接层。im2col,把索引图像块重排列为矩阵列:先将大矩阵重叠划分为多个子矩阵,每个子矩阵序列化成向量,得到另一个矩阵。
量化(quantization)。《How to Quantize Neural Networks with TensorFlow》https://www.tensorflow.org/performance/quantization 。即离散化,用比32位浮点数更少的空间存储、运行模型,TensorFlow的量化实现屏蔽了存储、运行细节。神经网络预测时,浮点运算影响速度,量化可加快速度,并保持较高精度,还能减小模型文件大小。存储模型用8位整数,加载模型运算时转换回32位浮点数,降低预测过程的计算资源。神经网络对噪声的健壮性强,量化的精度损失不会危害整体准确度。训练时反向传播需要计算梯度,不能用低精度格式直接训练,所以在PC训练浮点数模型,再转8位,移动端用8位模型预测。
量化示例。将GoogleNet模型转为8位模型的例子。下载训练好的GoogleNet模型,http://download.tensorflow.org/models/image/imagenet/inception-2015-12-05.tgz 。
bazel build tensorflow/tools/quantization:quantization_graph
bazel-bin/tensorflow/tools/quantization/quantization_graph \
--input=/tmp/classify_image_graph_def.pb \
--output_node_names="softmax" --output=/tmp/quantized_graph.pb \
--mode=eightbit
生成量化后的模型,大小只有原来的1/4。执行:
bazel build tensorflow/examples/label_image:label_image
bazel-bin/tensorflow/examples/label_image/label_image \
--image=/tmp/cropped_panda.jpg \
--graph=/tmp/quantized_graph.pb \
--labels=/tmp/imagenet_synset_to_human_label_map.txt \
--input_width=299 \
--input_height=299 \
--input_mean=128 \
--input_std=128 \
--input_layer="Mul:0" \
--output_layer="softmax:0"
量化过程的实现。把预测操作转换成等价的8位版本操作来实现:原始Relu操作的输入、输出是浮点数;量化Relu操作先根据输入浮点数计算最大值、最小值,再进入量化(Quantize)操作把输入数据转换为8位。为保证输出层输入数据的准确性,需要反量化(Dequantize)操作,把权重转回32位精度,保证预测准确性:整个模型前向传播用8位整数运行,在最后一层之前加反量化层,把8位转回32位作为输出层输入。每个量化操作后都执行反量化操作。
量化数据表示。浮点数转8位表示,本质是一个压缩问题。权重、经过激活函数处理的上层输出,是分布在一个范围内的值。量化过程:找出最大值、最小值,把区间内的浮点数线性映射到0~255,做线性扩展。
优化矩阵乘法运算。谷歌开源了小型独立的低精度通用矩阵乘法(General Matrix to Matrix Multiplication,GEMM)库 gemmlowp,https://github.com/google/gemmlowp 。
iOS系统实践。
环境准备。操作系统Mac OS X,集成开发工具Xcode 7.3以上版本。编译TensorFlow核心静态库:tensorflow/contrib/makefile/download_dependencies.sh,依赖库下载到tensorflow/contrib/makefile/downloads目录:eigen #C++开源矩阵计算工具;gemmlowp #小型独立低精度通用矩阵乘法(GEMM)库;googletest #谷歌开源C++测试框架;protobuf #谷歌开源数据交换格式协议;re2 #谷歌开源正则表达式库。
编译演示程序并运行:tensorflow/contrib/makefile/build_all_ios.sh。编译生成静态库,位于tensorflow/contrib/makefile/gen/lib:ios_ARM64、ios_ARMV7、ios_ARMV7S、ios_I386、ios_X86_64、libtensorflow-core.a。在Xcode模拟器或iOS设备运行APP预测示例。TensorFlow iOS示例:https://github.com/tensorflow/tensorflow/tree/master/tensorflow/examples/ios/ ,含3个目录:benchmark目录是预测基准示例;simple目录是图片预测示例;camera目录是视频流实时预测示例。下载Inception V1模型,能识别1000类图片,https://storage.googleapis.com/download.tensorflow.org/models/inception5h.zip 。解压模型,复制到benchmark、simple、camera的data目录。运行目录下的xcodeproj文件,选择iPhone 7 Plus模拟器,点击运行标志,编译完成后点击Run Model按钮,预测结果见Xcode控制台。
自定义模型的编译、运行。https://github.com/tensorflow/tensorflow/blob/15b1cf025da5c6ac2bcf4d4878ee222fca3aec4a/tensorflow/docs_src/tutorials/image_retraining.md 。下载花卉数据 http://download.tensorflow.org/example_images/flower_photos.tgz :郁金香(tulips)、玫瑰(roses)、蒲公英(dandelion)、向日葵(sunflowers)、雏菊(daisy)5种花卉的文件目录,各800张图片。
训练原始模型。下载预训练的Inception V3模型 http://download.tensorflow.org/models/image/imagenet/inception-2015-12-05.tgz 。
python tensorflow/examples/image_retraining/retrain.py \
--bottleneck_dir=/tmp/bottlenecks/ \
--how_many_training_steps 10 \
--model_dir=/tmp/inception \
--output_graph=/tmp/retrained_graph.pb \
--output_labels=/tmp/retrained_labels.txt \
--image_dir /tmp/flower_photos
训练完成,/tmp目录下有模型文件retrained_graph.pb、标签文件retrained_labels.txt。“瓶颈”(bottlenecks)文件,描述实际分类的最终输出层前一层(倒数第二层):倒数第二层已被训练得很好,瓶颈值是有意义的紧凑图像摘要,包含足够信息供分类器做出选择。第一次训练时,retrain.py的代码先分析所有图片,计算每张图片的瓶颈值并存储下来,因为每张图片会被使用多次,这样不必重复计算。
编译iOS支持的模型。https://petewarden.com/2016/09/27/tensorflow-for-mobile-poets/ 。从原始模型到iOS模型:先去掉iOS系统不支持的操作并优化模型,再将模型量化,把权重变为8位常数,缩小模型,最后对模型做内存映射。
去掉iOS系统不支持的操作,优化模型。iOS版本TensorFlow仅支持预测阶段常见的、没有大的外部依赖关系的操作。支持操作列表:https://github.com/tensorflow/tensorflow/blob/master/tensorflow/contrib/makefile/tf_op_files.txt 。DecodeJpeg不被支持:JPEG格式图片解码依赖libjpeg。从摄像头实时识别花卉种类时,直接处理相机图像缓冲区,不会先存JPEG文件再解码。预训练模型Inception V3从图片数据集训练而来,包含DecodeJpeg操作;将输入数据直接提供(feed)给Decode之后的Mul操作,绕过Decode操作。为优化加速预测,将显式批处理规范化(explicit batch normalization)操作合并到卷积权重中,减少计算次数。
bazel build tensorflow/python/tools:optimize_for_inference
bazel-bin/tensorflow/python/tools/optimize_for_inference \
--input=/tmp/retrained_graph.pb \
--output=/tmp/optimized_graph.pb \
--input_names=Mul \
--output_names=final_result
label_image命令预测:
bazel-bin/tensorflow/examples/label_image/label_image \
--output_layer=final_result \
--labels=/tmp/retrained_labels.txt \
--image=/tmp/flower_photos/daisy/5547758_eea9edfd54_n.jpg \
--graph=/tmp/optimized_graph.pb \
--input_layer=Mul \
--input_mean=128 \
--input_std=128
量化模型。苹果系统用.ipa包分发应用程序,所有应用程序资源都用zip压缩。模型权重从浮点数转为整数(范围0~255),准确度损失小于1%。
bazel build tensorflow/tools/quantization:quantization_graph
bazel-bin/tensorflow/tools/quantization/quantization_graph \
--input=/tmp/optimized_graph.pb \
--output=/tmp/rounded_graph.pb \
--output_node_names=final_result \
--mode=weights_rounded
内存映射(memory mapping)。物理内存映射到进程地址空间内,应用程序直接用输入/输出的地址空间,提高读写效率。模型全部一次性加载到内存缓冲区,会对iOS的RAM施加过大压力,操作系统会杀死内存占用过多的程序。模型权值缓冲区只读,可以映射到内存。重新排列模型,权重分部分逐块从主GraphDef加载到内存。
bazel build tensorflow/contrib/util:convert_graphdef_memmapped_format
bazel-bin/tensorflow/contrib/util/convert_graphdef_memmapped_format \
--in_graph=/tmp/rounded_graph.pb \
--out_graph=/tmp/mmapped_graph.pb
生成iOS工程文件运行。以视频流实时预测演示程序为例:https://github.com/tensorflow/tensorflow/tree/master/tensorflow/examples/ios/camera 。模型文件、标记文件复制到data目录。修改CameraExampleViewController.mm,更改加载的模型文件名称、输入图片尺寸、操作节点名字、缩放像素大小。
#import <AssertMacros.h>
#import <AssetsLibrary/AssetsLibrary.h>
#import <CoreImage/CoreImage.h>
#import <ImageIO/ImageIO.h>
#import "CameraExampleViewController.h"
#include <sys/time.h>
#include "tensorflow_utils.h"
// If you have your own model, modify this to the file name, and make sure
// you've added the file to your app resources too.
static NSString* model_file_name = @"tensorflow_inception_graph";
static NSString* model_file_type = @"pb";
// This controls whether we'll be loading a plain GraphDef proto, or a
// file created by the convert_graphdef_memmapped_format utility that wraps a
// GraphDef and parameter file that can be mapped into memory from file to
// reduce overall memory usage.
const bool model_uses_memory_mapping = false;
// If you have your own model, point this to the labels file.
static NSString* labels_file_name = @"imagenet_comp_graph_label_strings";
static NSString* labels_file_type = @"txt";
// These dimensions need to match those the model was trained with.
// 以下尺寸需要和模型训练时相匹配 const int wanted_input_width =299;// 224; const int wanted_input_height = 299;//224; const int wanted_input_channels = 3; const float input_mean = 128.0f;//117.0f; const float input_std = 128.0f;//1.0f; const std::string input_layer_name = "Mul";//"input"; const std::string output_layer_name = "final_result";//"softmax1"; static void *AVCaptureStillImageIsCapturingStillImageContext = &AVCaptureStillImageIsCapturingStillImageContext; @interface CameraExampleViewController (InternalMethods) - (void)setupAVCapture; - (void)teardownAVCapture; @end @implementation CameraExampleViewController - (void)setupAVCapture { NSError *error = nil; session = [AVCaptureSession new]; if ([[UIDevice currentDevice] userInterfaceIdiom] == UIUserInterfaceIdiomPhone) [session setSessionPreset:AVCaptureSessionPreset640x480]; else [session setSessionPreset:AVCaptureSessionPresetPhoto]; AVCaptureDevice *device = [AVCaptureDevice defaultDeviceWithMediaType:AVMediaTypeVideo]; AVCaptureDeviceInput *deviceInput = [AVCaptureDeviceInput deviceInputWithDevice:device error:&error]; assert(error == nil); isUsingFrontFacingCamera = NO; if ([session canAddInput:deviceInput]) [session addInput:deviceInput]; stillImageOutput = [AVCaptureStillImageOutput new]; [stillImageOutput addObserver:self forKeyPath:@"capturingStillImage" options:NSKeyValueObservingOptionNew context:(void *)(AVCaptureStillImageIsCapturingStillImageContext)]; if ([session canAddOutput:stillImageOutput]) [session addOutput:stillImageOutput]; videoDataOutput = [AVCaptureVideoDataOutput new]; NSDictionary *rgbOutputSettings = [NSDictionary dictionaryWithObject:[NSNumber numberWithInt:kCMPixelFormat_32BGRA] forKey:(id)kCVPixelBufferPixelFormatTypeKey]; [videoDataOutput setVideoSettings:rgbOutputSettings]; [videoDataOutput setAlwaysDiscardsLateVideoFrames:YES]; videoDataOutputQueue = dispatch_queue_create("VideoDataOutputQueue", DISPATCH_QUEUE_SERIAL); [videoDataOutput setSampleBufferDelegate:self queue:videoDataOutputQueue]; if ([session canAddOutput:videoDataOutput]) [session addOutput:videoDataOutput]; [[videoDataOutput connectionWithMediaType:AVMediaTypeVideo] setEnabled:YES]; previewLayer = [[AVCaptureVideoPreviewLayer alloc] initWithSession:session]; [previewLayer setBackgroundColor:[[UIColor blackColor] CGColor]]; [previewLayer setVideoGravity:AVLayerVideoGravityResizeAspect]; CALayer *rootLayer = [previewView layer]; [rootLayer setMasksToBounds:YES]; [previewLayer setFrame:[rootLayer bounds]]; [rootLayer addSublayer:previewLayer]; [session startRunning]; if (error) { NSString *title = [NSString stringWithFormat:@"Failed with error %d", (int)[error code]]; UIAlertController *alertController = [UIAlertController alertControllerWithTitle:title message:[error localizedDescription] preferredStyle:UIAlertControllerStyleAlert]; UIAlertAction *dismiss = [UIAlertAction actionWithTitle:@"Dismiss" style:UIAlertActionStyleDefault handler:nil]; [alertController addAction:dismiss]; [self presentViewController:alertController animated:YES completion:nil]; [self teardownAVCapture]; } } - (void)teardownAVCapture { [stillImageOutput removeObserver:self forKeyPath:@"isCapturingStillImage"]; [previewLayer removeFromSuperlayer]; } - (void)observeValueForKeyPath:(NSString *)keyPath ofObject:(id)object change:(NSDictionary *)change context:(void *)context { if (context == AVCaptureStillImageIsCapturingStillImageContext) { BOOL isCapturingStillImage = [[change objectForKey:NSKeyValueChangeNewKey] boolValue]; if (isCapturingStillImage) { // do flash bulb like 
animation flashView = [[UIView alloc] initWithFrame:[previewView frame]]; [flashView setBackgroundColor:[UIColor whiteColor]]; [flashView setAlpha:0.f]; [[[self view] window] addSubview:flashView]; [UIView animateWithDuration:.4f animations:^{ [flashView setAlpha:1.f]; }]; } else { [UIView animateWithDuration:.4f animations:^{ [flashView setAlpha:0.f]; } completion:^(BOOL finished) { [flashView removeFromSuperview]; flashView = nil; }]; } } } - (AVCaptureVideoOrientation)avOrientationForDeviceOrientation: (UIDeviceOrientation)deviceOrientation { AVCaptureVideoOrientation result = (AVCaptureVideoOrientation)(deviceOrientation); if (deviceOrientation == UIDeviceOrientationLandscapeLeft) result = AVCaptureVideoOrientationLandscapeRight; else if (deviceOrientation == UIDeviceOrientationLandscapeRight) result = AVCaptureVideoOrientationLandscapeLeft; return result; } - (IBAction)takePicture:(id)sender { if ([session isRunning]) { [session stopRunning]; [sender setTitle:@"Continue" forState:UIControlStateNormal]; flashView = [[UIView alloc] initWithFrame:[previewView frame]]; [flashView setBackgroundColor:[UIColor whiteColor]]; [flashView setAlpha:0.f]; [[[self view] window] addSubview:flashView]; [UIView animateWithDuration:.2f animations:^{ [flashView setAlpha:1.f]; } completion:^(BOOL finished) { [UIView animateWithDuration:.2f animations:^{ [flashView setAlpha:0.f]; } completion:^(BOOL finished) { [flashView removeFromSuperview]; flashView = nil; }]; }]; } else { [session startRunning]; [sender setTitle:@"Freeze Frame" forState:UIControlStateNormal]; } } + (CGRect)videoPreviewBoxForGravity:(NSString *)gravity frameSize:(CGSize)frameSize apertureSize:(CGSize)apertureSize { CGFloat apertureRatio = apertureSize.height / apertureSize.width; CGFloat viewRatio = frameSize.width / frameSize.height; CGSize size = CGSizeZero; if ([gravity isEqualToString:AVLayerVideoGravityResizeAspectFill]) { if (viewRatio > apertureRatio) { size.width = frameSize.width; size.height = apertureSize.width * (frameSize.width / apertureSize.height); } else { size.width = apertureSize.height * (frameSize.height / apertureSize.width); size.height = frameSize.height; } } else if ([gravity isEqualToString:AVLayerVideoGravityResizeAspect]) { if (viewRatio > apertureRatio) { size.width = apertureSize.height * (frameSize.height / apertureSize.width); size.height = frameSize.height; } else { size.width = frameSize.width; size.height = apertureSize.width * (frameSize.width / apertureSize.height); } } else if ([gravity isEqualToString:AVLayerVideoGravityResize]) { size.width = frameSize.width; size.height = frameSize.height; } CGRect videoBox; videoBox.size = size; if (size.width < frameSize.width) videoBox.origin.x = (frameSize.width - size.width) / 2; else videoBox.origin.x = (size.width - frameSize.width) / 2; if (size.height < frameSize.height) videoBox.origin.y = (frameSize.height - size.height) / 2; else videoBox.origin.y = (size.height - frameSize.height) / 2; return videoBox; } - (void)captureOutput:(AVCaptureOutput *)captureOutput didOutputSampleBuffer:(CMSampleBufferRef)sampleBuffer fromConnection:(AVCaptureConnection *)connection { CVPixelBufferRef pixelBuffer = CMSampleBufferGetImageBuffer(sampleBuffer); CFRetain(pixelBuffer); [self runCNNOnFrame:pixelBuffer]; CFRelease(pixelBuffer); } - (void)runCNNOnFrame:(CVPixelBufferRef)pixelBuffer { assert(pixelBuffer != NULL); OSType sourcePixelFormat = CVPixelBufferGetPixelFormatType(pixelBuffer); int doReverseChannels; if (kCVPixelFormatType_32ARGB == sourcePixelFormat) { 
doReverseChannels = 1; } else if (kCVPixelFormatType_32BGRA == sourcePixelFormat) { doReverseChannels = 0; } else { assert(false); // Unknown source format } const int sourceRowBytes = (int)CVPixelBufferGetBytesPerRow(pixelBuffer); const int image_width = (int)CVPixelBufferGetWidth(pixelBuffer); const int fullHeight = (int)CVPixelBufferGetHeight(pixelBuffer); CVPixelBufferLockFlags unlockFlags = kNilOptions; CVPixelBufferLockBaseAddress(pixelBuffer, unlockFlags); unsigned char *sourceBaseAddr = (unsigned char *)(CVPixelBufferGetBaseAddress(pixelBuffer)); int image_height; unsigned char *sourceStartAddr; if (fullHeight <= image_width) { image_height = fullHeight; sourceStartAddr = sourceBaseAddr; } else { image_height = image_width; const int marginY = ((fullHeight - image_width) / 2); sourceStartAddr = (sourceBaseAddr + (marginY * sourceRowBytes)); } const int image_channels = 4; assert(image_channels >= wanted_input_channels); tensorflow::Tensor image_tensor( tensorflow::DT_FLOAT, tensorflow::TensorShape( {1, wanted_input_height, wanted_input_width, wanted_input_channels})); auto image_tensor_mapped = image_tensor.tensor<float, 4>(); tensorflow::uint8 *in = sourceStartAddr; float *out = image_tensor_mapped.data(); for (int y = 0; y < wanted_input_height; ++y) { float *out_row = out + (y * wanted_input_width * wanted_input_channels); for (int x = 0; x < wanted_input_width; ++x) { const int in_x = (y * image_width) / wanted_input_width; const int in_y = (x * image_height) / wanted_input_height; tensorflow::uint8 *in_pixel = in + (in_y * image_width * image_channels) + (in_x * image_channels); float *out_pixel = out_row + (x * wanted_input_channels); for (int c = 0; c < wanted_input_channels; ++c) { out_pixel[c] = (in_pixel[c] - input_mean) / input_std; } } } CVPixelBufferUnlockBaseAddress(pixelBuffer, unlockFlags); if (tf_session.get()) { std::vector<tensorflow::Tensor> outputs; tensorflow::Status run_status = tf_session->Run( {{input_layer_name, image_tensor}}, {output_layer_name}, {}, &outputs); if (!run_status.ok()) { LOG(ERROR) << "Running model failed:" << run_status; } else { tensorflow::Tensor *output = &outputs[0]; auto predictions = output->flat<float>(); NSMutableDictionary *newValues = [NSMutableDictionary dictionary]; for (int index = 0; index < predictions.size(); index += 1) { const float predictionValue = predictions(index); if (predictionValue > 0.05f) { std::string label = labels[index % predictions.size()]; NSString *labelObject = [NSString stringWithUTF8String:label.c_str()]; NSNumber *valueObject = [NSNumber numberWithFloat:predictionValue]; [newValues setObject:valueObject forKey:labelObject]; } } dispatch_async(dispatch_get_main_queue(), ^(void) { [self setPredictionValues:newValues]; }); } } CVPixelBufferUnlockBaseAddress(pixelBuffer, 0); } - (void)dealloc { [self teardownAVCapture]; } // use front/back camera - (IBAction)switchCameras:(id)sender { AVCaptureDevicePosition desiredPosition; if (isUsingFrontFacingCamera) desiredPosition = AVCaptureDevicePositionBack; else desiredPosition = AVCaptureDevicePositionFront; for (AVCaptureDevice *d in [AVCaptureDevice devicesWithMediaType:AVMediaTypeVideo]) { if ([d position] == desiredPosition) { [[previewLayer session] beginConfiguration]; AVCaptureDeviceInput *input = [AVCaptureDeviceInput deviceInputWithDevice:d error:nil]; for (AVCaptureInput *oldInput in [[previewLayer session] inputs]) { [[previewLayer session] removeInput:oldInput]; } [[previewLayer session] addInput:input]; [[previewLayer session] 
commitConfiguration]; break; } } isUsingFrontFacingCamera = !isUsingFrontFacingCamera; } - (void)didReceiveMemoryWarning { [super didReceiveMemoryWarning]; } - (void)viewDidLoad { [super viewDidLoad]; square = [UIImage imageNamed:@"squarePNG"]; synth = [[AVSpeechSynthesizer alloc] init]; labelLayers = [[NSMutableArray alloc] init]; oldPredictionValues = [[NSMutableDictionary alloc] init]; tensorflow::Status load_status; if (model_uses_memory_mapping) { load_status = LoadMemoryMappedModel( model_file_name, model_file_type, &tf_session, &tf_memmapped_env); } else { load_status = LoadModel(model_file_name, model_file_type, &tf_session); } if (!load_status.ok()) { LOG(FATAL) << "Couldn't load model: " << load_status; } tensorflow::Status labels_status = LoadLabels(labels_file_name, labels_file_type, &labels); if (!labels_status.ok()) { LOG(FATAL) << "Couldn't load labels: " << labels_status; } [self setupAVCapture]; } - (void)viewDidUnload { [super viewDidUnload]; } - (void)viewWillAppear:(BOOL)animated { [super viewWillAppear:animated]; } - (void)viewDidAppear:(BOOL)animated { [super viewDidAppear:animated]; } - (void)viewWillDisappear:(BOOL)animated { [super viewWillDisappear:animated]; } - (void)viewDidDisappear:(BOOL)animated { [super viewDidDisappear:animated]; } - (BOOL)shouldAutorotateToInterfaceOrientation: (UIInterfaceOrientation)interfaceOrientation { return (interfaceOrientation == UIInterfaceOrientationPortrait); } - (BOOL)prefersStatusBarHidden { return YES; } - (void)setPredictionValues:(NSDictionary *)newValues { const float decayValue = 0.75f; const float updateValue = 0.25f; const float minimumThreshold = 0.01f; NSMutableDictionary *decayedPredictionValues = [[NSMutableDictionary alloc] init]; for (NSString *label in oldPredictionValues) { NSNumber *oldPredictionValueObject = [oldPredictionValues objectForKey:label]; const float oldPredictionValue = [oldPredictionValueObject floatValue]; const float decayedPredictionValue = (oldPredictionValue * decayValue); if (decayedPredictionValue > minimumThreshold) { NSNumber *decayedPredictionValueObject = [NSNumber numberWithFloat:decayedPredictionValue]; [decayedPredictionValues setObject:decayedPredictionValueObject forKey:label]; } } oldPredictionValues = decayedPredictionValues; for (NSString *label in newValues) { NSNumber *newPredictionValueObject = [newValues objectForKey:label]; NSNumber *oldPredictionValueObject = [oldPredictionValues objectForKey:label]; if (!oldPredictionValueObject) { oldPredictionValueObject = [NSNumber numberWithFloat:0.0f]; } const float newPredictionValue = [newPredictionValueObject floatValue]; const float oldPredictionValue = [oldPredictionValueObject floatValue]; const float updatedPredictionValue = (oldPredictionValue + (newPredictionValue * updateValue)); NSNumber *updatedPredictionValueObject = [NSNumber numberWithFloat:updatedPredictionValue]; [oldPredictionValues setObject:updatedPredictionValueObject forKey:label]; } NSArray *candidateLabels = [NSMutableArray array]; for (NSString *label in oldPredictionValues) { NSNumber *oldPredictionValueObject = [oldPredictionValues objectForKey:label]; const float oldPredictionValue = [oldPredictionValueObject floatValue]; if (oldPredictionValue > 0.05f) { NSDictionary *entry = @{ @"label" : label, @"value" : oldPredictionValueObject }; candidateLabels = [candidateLabels arrayByAddingObject:entry]; } } NSSortDescriptor *sort = [NSSortDescriptor sortDescriptorWithKey:@"value" ascending:NO]; NSArray *sortedLabels = [candidateLabels 
sortedArrayUsingDescriptors:[NSArray arrayWithObject:sort]]; const float leftMargin = 10.0f; const float topMargin = 10.0f; const float valueWidth = 48.0f; const float valueHeight = 26.0f; const float labelWidth = 246.0f; const float labelHeight = 26.0f; const float labelMarginX = 5.0f; const float labelMarginY = 5.0f; [self removeAllLabelLayers]; int labelCount = 0; for (NSDictionary *entry in sortedLabels) { NSString *label = [entry objectForKey:@"label"]; NSNumber *valueObject = [entry objectForKey:@"value"]; const float value = [valueObject floatValue]; const float originY = (topMargin + ((labelHeight + labelMarginY) * labelCount)); const int valuePercentage = (int)roundf(value * 100.0f); const float valueOriginX = leftMargin; NSString *valueText = [NSString stringWithFormat:@"%d%%", valuePercentage]; [self addLabelLayerWithText:valueText originX:valueOriginX originY:originY width:valueWidth height:valueHeight alignment:kCAAlignmentRight]; const float labelOriginX = (leftMargin + valueWidth + labelMarginX); [self addLabelLayerWithText:[label capitalizedString] originX:labelOriginX originY:originY width:labelWidth height:labelHeight alignment:kCAAlignmentLeft]; if ((labelCount == 0) && (value > 0.5f)) { [self speak:[label capitalizedString]]; } labelCount += 1; if (labelCount > 4) { break; } } } - (void)removeAllLabelLayers { for (CATextLayer *layer in labelLayers) { [layer removeFromSuperlayer]; } [labelLayers removeAllObjects]; } - (void)addLabelLayerWithText:(NSString *)text originX:(float)originX originY:(float)originY width:(float)width height:(float)height alignment:(NSString *)alignment { CFTypeRef font = (CFTypeRef) @"Menlo-Regular"; const float fontSize = 20.0f; const float marginSizeX = 5.0f; const float marginSizeY = 2.0f; const CGRect backgroundBounds = CGRectMake(originX, originY, width, height); const CGRect textBounds = CGRectMake((originX + marginSizeX), (originY + marginSizeY), (width - (marginSizeX * 2)), (height - (marginSizeY * 2))); CATextLayer *background = [CATextLayer layer]; [background setBackgroundColor:[UIColor blackColor].CGColor]; [background setOpacity:0.5f]; [background setFrame:backgroundBounds]; background.cornerRadius = 5.0f; [[self.view layer] addSublayer:background]; [labelLayers addObject:background]; CATextLayer *layer = [CATextLayer layer]; [layer setForegroundColor:[UIColor whiteColor].CGColor]; [layer setFrame:textBounds]; [layer setAlignmentMode:alignment]; [layer setWrapped:YES]; [layer setFont:font]; [layer setFontSize:fontSize]; layer.contentsScale = [[UIScreen mainScreen] scale]; [layer setString:text]; [[self.view layer] addSublayer:layer]; [labelLayers addObject:layer]; } - (void)setPredictionText:(NSString *)text withDuration:(float)duration { if (duration > 0.0) { CABasicAnimation *colorAnimation = [CABasicAnimation animationWithKeyPath:@"foregroundColor"]; colorAnimation.duration = duration; colorAnimation.fillMode = kCAFillModeForwards; colorAnimation.removedOnCompletion = NO; colorAnimation.fromValue = (id)[UIColor darkGrayColor].CGColor; colorAnimation.toValue = (id)[UIColor whiteColor].CGColor; colorAnimation.timingFunction = [CAMediaTimingFunction functionWithName:kCAMediaTimingFunctionLinear]; [self.predictionTextLayer addAnimation:colorAnimation forKey:@"colorAnimation"]; } else { self.predictionTextLayer.foregroundColor = [UIColor whiteColor].CGColor; } [self.predictionTextLayer removeFromSuperlayer]; [[self.view layer] addSublayer:self.predictionTextLayer]; [self.predictionTextLayer setString:text]; } - (void)speak:(NSString 
*)words { if ([synth isSpeaking]) { return; } AVSpeechUtterance *utterance = [AVSpeechUtterance speechUtteranceWithString:words]; utterance.voice = [AVSpeechSynthesisVoice voiceWithLanguage:@"en-US"]; utterance.rate = 0.75 * AVSpeechUtteranceDefaultSpeechRate; [synth speakUtterance:utterance]; } @end
连上iPhone手机,双击tensorflow/contrib/ios_examples/camera/camera_example.xcodeproj编译运行。手机安装好APP后打开,将摄像头对准玫瑰花即可识别。训练迭代10000次后,识别率在99%以上。模拟器打包生成的工程文件位于/Users/libinggen/Library/Developer/Xcode/DerivedData/camera_example-dhfdsdfesfmrwtfb1fpfkfjsdfhdskf/Build/Products/Debug-iphoneos。打开CameraExample.app,内有可执行文件CameraExample、资源文件(模型文件mmapped_graph.pb、标记文件retrained_labels.txt)。
Android系统实践。
环境准备。MacBook Pro。Oracle官网下载JDK 1.8版本,http://www.oracle.com/technetwork/java/javase/downloads/jdk8-downloads-2133151.html ,jdk-8u111-macosx-x64.dmg,双击安装。设置Java环境变量:
export JAVA_HOME=$(/usr/libexec/java_home)
搭建Android SDK环境。Android官网下载Android SDK,https://developer.android.com ,25.0.2版本,android-sdk_r25.0.2-macosx.zip,解压到~/Library/Android/sdk目录,包含build-tools、extras、patcher、platform-tools(各版本SDK,根据API Level划分SDK版本)、platforms、sources、system-images、temp(临时文件夹,SDK更新安装时用到)、tools(各版本通用SDK工具,有adb、aapt、aidl、dx等文件)。
搭建Android NDK环境。Android官网下载Android NDK Mac OS X版本,https://developer.android.com/ndk/downloads/index.html ,android-ndk-r13b-darwin-x86_64.zip文件,解压后包含CHANGELOG.md、build、ndk-build、ndk-depends、ndk-gdb、ndk-stack、ndk-which、platforms、prebuilt、python-packages、shader-tools、simpleperf、source.properties、sources、toolchains。
搭建Bazel。用brew安装bazel:
brew install bazel
更新bazel:
brew upgrade bazel
编译演示程序并运行。修改tensorflow-1.1.0根目录下的WORKSPACE文件,android_sdk_repository、android_ndk_repository配置改为用户自己的安装目录、版本。
android_sdk_repository(
    name = "androidsdk",
    api_level = 25,
    build_tools_version = "25.0.2",
    # Replace with path to Android SDK on your system
    path = "~/Library/Android/sdk"
)
android_ndk_repository(
    name = "androidndk",
    api_level = 23,
    path = "~/Downloads/android-ndk-r13b"
)
在根目录用bazel构建:
bazel build //tensorflow/examples/android:tensorflow_demo
编译成功后,默认在tensorflow-1.1.0/bazel-bin/tensorflow/examples/android目录生成TensorFlow演示程序。运行:把生成的apk文件传输到手机,用手机摄像头看效果。Android 6.0.1,开启“开发者模式”,手机用数据线与计算机相连,进入SDK所在目录下的platform-tools文件夹,找到adb命令,执行:
./adb install tensorflow-1.1.0/bazel-bin/tensorflow/examples/android/tensorflow_demo.apk
tensorflow_demo.apk自动安装到手机。打开TF Detect App,App调起手机摄像头,对摄像头返回的数据流实时检测。
自定义模型的编译运行,分三步:训练原始模型、编译Android系统支持的模型、生成Android apk文件运行。训练原始模型、编译Android系统支持的模型:用项目根目录下的tensorflow/python/tools/optimize_for_inference.py、tensorflow/tools/quantization/quantize_graph.py、tensorflow/contrib/util/convert_graphdef_memmapped_format.cc对模型优化。将第一步生成的原始模型文件retrained_graph.pb、标记文件retrained_labels.txt放在tensorflow/examples/android/assets目录。修改tensorflow/examples/android/src/org/tensorflow/demo/TensorFlowImageClassifier.java中要加载的模型文件名称、输入图片尺寸、操作节点名字、缩放像素大小。
package org.tensorflow.demo;
import android.content.res.AssetManager;
import android.graphics.Bitmap;
import android.os.Trace;
import android.util.Log;
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;
import java.util.Vector;
import org.tensorflow.Operation;
import org.tensorflow.contrib.android.TensorFlowInferenceInterface;
/** A classifier specialized to label images using TensorFlow.
*/ public class TensorFlowImageClassifier implements Classifier { private static final String TAG = "TensorFlowImageClassifier"; // Only return this many results with at least this confidence. private static final int MAX_RESULTS = 3; private static final float THRESHOLD = 0.1f; // Config values. private String inputName; private String outputName; private int inputSize; private int imageMean; private float imageStd; // Pre-allocated buffers. private Vector<String> labels = new Vector<String>(); private int[] intValues; private float[] floatValues; private float[] outputs; private String[] outputNames; private boolean logStats = false; private TensorFlowInferenceInterface inferenceInterface; private TensorFlowImageClassifier() {} /** * Initializes a native TensorFlow session for classifying images. * * @param assetManager The asset manager to be used to load assets. * @param modelFilename The filepath of the model GraphDef protocol buffer. * @param labelFilename The filepath of label file for classes. * @param inputSize The input size. A square image of inputSize x inputSize is assumed. * @param imageMean The assumed mean of the image values. * @param imageStd The assumed std of the image values. * @param inputName The label of the image input node. * @param outputName The label of the output node. * @throws IOException */ public static Classifier create( AssetManager assetManager, String modelFilename, String labelFilename, int inputSize, int imageMean, float imageStd, String inputName, String outputName) { TensorFlowImageClassifier c = new TensorFlowImageClassifier(); c.inputName = inputName; c.outputName = outputName; // Read the label names into memory. // TODO(andrewharp): make this handle non-assets. String actualFilename = labelFilename.split("file:///android_asset/")[1]; Log.i(TAG, "Reading labels from: " + actualFilename); BufferedReader br = null; try { br = new BufferedReader(new InputStreamReader(assetManager.open(actualFilename))); String line; while ((line = br.readLine()) != null) { c.labels.add(line); } br.close(); } catch (IOException e) { throw new RuntimeException("Problem reading label file!" , e); } c.inferenceInterface = new TensorFlowInferenceInterface(assetManager, modelFilename); // The shape of the output is [N, NUM_CLASSES], where N is the batch size. final Operation operation = c.inferenceInterface.graphOperation(outputName); final int numClasses = (int) operation.output(0).shape().size(1); Log.i(TAG, "Read " + c.labels.size() + " labels, output layer size is " + numClasses); // Ideally, inputSize could have been retrieved from the shape of the input operation. Alas, // the placeholder node for input in the graphdef typically used does not specify a shape, so it // must be passed in as a parameter. c.inputSize = inputSize; c.imageMean = imageMean; c.imageStd = imageStd; // Pre-allocate buffers. c.outputNames = new String[] {outputName}; c.intValues = new int[inputSize * inputSize]; c.floatValues = new float[inputSize * inputSize * 3]; c.outputs = new float[numClasses]; return c; } @Override public List<Recognition> recognizeImage(final Bitmap bitmap) { // Log this method so that it can be analyzed with systrace. Trace.beginSection("recognizeImage"); Trace.beginSection("preprocessBitmap"); // Preprocess the image data from 0-255 int to normalized float based // on the provided parameters. 
bitmap.getPixels(intValues, 0, bitmap.getWidth(), 0, 0, bitmap.getWidth(), bitmap.getHeight()); for (int i = 0; i < intValues.length; ++i) { final int val = intValues[i]; floatValues[i * 3 + 0] = (((val >> 16) & 0xFF) - imageMean) / imageStd; floatValues[i * 3 + 1] = (((val >> 8) & 0xFF) - imageMean) / imageStd; floatValues[i * 3 + 2] = ((val & 0xFF) - imageMean) / imageStd; } Trace.endSection(); // Copy the input data into TensorFlow. Trace.beginSection("feed"); inferenceInterface.feed(inputName, floatValues, 1, inputSize, inputSize, 3); Trace.endSection(); // Run the inference call. Trace.beginSection("run"); inferenceInterface.run(outputNames, logStats); Trace.endSection(); // Copy the output Tensor back into the output array. Trace.beginSection("fetch"); inferenceInterface.fetch(outputName, outputs); Trace.endSection(); // Find the best classifications. PriorityQueue<Recognition> pq = new PriorityQueue<Recognition>( 3, new Comparator<Recognition>() { @Override public int compare(Recognition lhs, Recognition rhs) { // Intentionally reversed to put high confidence at the head of the queue. return Float.compare(rhs.getConfidence(), lhs.getConfidence()); } }); for (int i = 0; i < outputs.length; ++i) { if (outputs[i] > THRESHOLD) { pq.add( new Recognition( "" + i, labels.size() > i ? labels.get(i) : "unknown", outputs[i], null)); } } final ArrayList<Recognition> recognitions = new ArrayList<Recognition>(); int recognitionsSize = Math.min(pq.size(), MAX_RESULTS); for (int i = 0; i < recognitionsSize; ++i) { recognitions.add(pq.poll()); } Trace.endSection(); // "recognizeImage" return recognitions; } @Override public void enableStatLogging(boolean logStats) { this.logStats = logStats; } @Override public String getStatString() { return inferenceInterface.getStatString(); } @Override public void close() { inferenceInterface.close(); } }
重新编译apk,连接Android手机,安装apk:
bazel build //tensorflow/examples/android:tensorflow_demo
adb install -r -g bazel-bin/tensorflow/examples/android/tensorflow_demo.apk
树莓派实践。
TensorFlow可以在树莓派(Raspberry Pi)上运行。树莓派是只有信用卡大小的微型电脑,系统基于Linux,有音频、视频功能。应用设想:输入1万张自己的面部图片,在树莓派上训练人脸识别模型,教会它认识你,你进入家门后,它帮你开灯、播放音乐等。树莓派上的编译方法和直接在Linux环境上编译相似。
参考资料:《TensorFlow技术解析与实战》
欢迎推荐上海机器学习工作机会,我的微信:qingxingfengzi
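回看前文的量化原理(找出权重的最大值、最小值,再线性映射到 0~255),用 numpy 写一个最小示意(仅演示思路,不是 TensorFlow 的量化实现):
import numpy as np
def quantize_uint8(x):
    # 找最大值、最小值,线性映射到 [0, 255]
    lo, hi = float(x.min()), float(x.max())
    scale = (hi - lo) / 255.0 if hi > lo else 1.0
    q = np.round((x - lo) / scale).astype(np.uint8)
    return q, lo, scale
def dequantize(q, lo, scale):
    # 加载时反量化回 32 位浮点数
    return q.astype(np.float32) * scale + lo
w = np.random.default_rng(0).normal(size=1000).astype(np.float32)
q, lo, scale = quantize_uint8(w)
print(np.abs(w - dequantize(q, lo, scale)).max())  # 量化误差约不超过 scale/2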
Hadoop生态大数据系统分为YARN、HDFS、MapReduce计算框架。TensorFlow分布式相当于MapReduce计算框架,Kubernetes相当于YARN调度系统。TensorFlowOnSpark利用远程直接内存访问(Remote Direct Memory Access,RDMA)解决存储功能和调度,实现深度学习和大数据的融合。TensorFlowOnSpark(TFoS),雅虎开源项目,https://github.com/yahoo/TensorFlowOnSpark 。支持Apache Spark集群分布式TensorFlow训练、预测。TensorFlowOnSpark提供桥接程序,每个Spark Executor启动一个对应的TensorFlow进程,通过远程进程通信(RPC)交互。
TensorFlowOnSpark架构。TensorFlow训练程序用Spark集群运行,管理Spark集群的步骤:预留,在Executor为每个TensorFlow进程保留一个端口,启动数据消息监听器;启动,在Executor启动TensorFlow主函数;数据获取,方式一是TensorFlow Readers和QueueRunners机制直接读取HDFS数据文件,Spark不访问数据,方式二是Feeding,Spark RDD数据发送到TensorFlow节点,通过feed_dict机制传入TensorFlow计算图;关闭,关闭Executor的TensorFlow计算节点、参数服务节点。架构:Spark Driver->Spark Executor->参数服务器->TensorFlow Core->gRPC、RDMA->HDFS数据集。http://yahoohadoop.tumblr.com/post/157196317141/open-sourcing-tensorflowonspark-distributed-deep 。
TensorFlowOnSpark MNIST。https://github.com/yahoo/TensorFlowOnSpark/wiki/GetStarted_standalone 。Standalone模式Spark集群,一台计算机。安装Spark、Hadoop,部署Java 1.8.0 JDK。下载Spark 2.1.0版 http://spark.apache.org/downloads.html ,下载Hadoop 2.7.3版 http://hadoop.apache.org/#Download+Hadoop 。TensorFlow 0.12.1版本支持较好。修改配置文件,设置环境变量,启动Hadoop:$HADOOP_HOME/sbin/start-all.sh。
检出TensorFlowOnSpark源代码:
git clone --recurse-submodules https://github.com/yahoo/TensorFlowOnSpark.git
cd TensorFlowOnSpark
git submodule init
git submodule update --force
git submodule foreach --recursive git clean -dfx
源代码打包,供提交任务时使用:
cd TensorflowOnSpark/src
zip -r ../tfspark.zip *
设置TensorFlowOnSpark根目录环境变量:
cd TensorFlowOnSpark
export TFoS_HOME=$(pwd)
启动Spark主节点(master):
${SPARK_HOME}/sbin/start-master.sh
配置两个工作节点(worker)实例,用master-spark-URL连接主节点:
export MASTER=spark://$(hostname):7077
export SPARK_WORKER_INSTANCES=2
export CORES_PER_WORKER=1
export TOTAL_CORES=$((${CORES_PER_WORKER}*${SPARK_WORKER_INSTANCES}))
${SPARK_HOME}/sbin/start-slave.sh -c $CORES_PER_WORKER -m 3G ${MASTER}
提交任务,把MNIST zip文件转换为HDFS RDD数据集:
${SPARK_HOME}/bin/spark-submit \
--master ${MASTER} --conf spark.ui.port=4048 --verbose \
${TFoS_HOME}/examples/mnist/mnist_data_setup.py \
--output examples/mnist/csv \
--format csv
查看处理过的数据集:
hadoop fs -ls hdfs://localhost:9000/user/libinggen/examples/mnist/csv
查看保存的图片、标记向量:
hadoop fs -ls hdfs://localhost:9000/user/libinggen/examples/mnist/csv/train/labels
把训练集、测试集分别保存为RDD数据。https://github.com/yahoo/TensorFlowOnSpark/blob/master/examples/mnist/mnist_data_setup.py 。
from __future__ import absolute_import from __future__ import division from __future__ import print_function import numpy import tensorflow as tf from array import array from tensorflow.contrib.learn.python.learn.datasets import mnist def toTFExample(image, label): """Serializes an image/label as a TFExample byte string""" example = tf.train.Example( features = tf.train.Features( feature = { 'label': tf.train.Feature(int64_list=tf.train.Int64List(value=label.astype("int64"))), 'image': tf.train.Feature(int64_list=tf.train.Int64List(value=image.astype("int64"))) } ) ) return example.SerializeToString() def fromTFExample(bytestr): """Deserializes a TFExample from a byte string""" example = tf.train.Example() example.ParseFromString(bytestr) return example def toCSV(vec): """Converts a vector/array into a CSV string""" return ','.join([str(i) for i in vec]) def fromCSV(s): """Converts a CSV string to a vector/array""" return [float(x) for x in s.split(',') if len(s) > 0] def writeMNIST(sc, input_images, input_labels, output, format, num_partitions): """Writes MNIST image/label vectors into parallelized files on HDFS""" # load MNIST gzip into memory # MNIST图像、标记向量写入HDFS with
open(input_images, 'rb') as f: images = numpy.array(mnist.extract_images(f)) with open(input_labels, 'rb') as f: if format == "csv2": labels = numpy.array(mnist.extract_labels(f, one_hot=False)) else: labels = numpy.array(mnist.extract_labels(f, one_hot=True)) shape = images.shape print("images.shape: {0}".format(shape)) # 60000 x 28 x 28 print("labels.shape: {0}".format(labels.shape)) # 60000 x 10 # create RDDs of vectors imageRDD = sc.parallelize(images.reshape(shape[0], shape[1] * shape[2]), num_partitions) labelRDD = sc.parallelize(labels, num_partitions) output_images = output + "/images" output_labels = output + "/labels" # save RDDs as specific format # RDDs保存特定格式 if format == "pickle": imageRDD.saveAsPickleFile(output_images) labelRDD.saveAsPickleFile(output_labels) elif format == "csv": imageRDD.map(toCSV).saveAsTextFile(output_images) labelRDD.map(toCSV).saveAsTextFile(output_labels) elif format == "csv2": imageRDD.map(toCSV).zip(labelRDD).map(lambda x: str(x[1]) + "|" + x[0]).saveAsTextFile(output) else: # format == "tfr": tfRDD = imageRDD.zip(labelRDD).map(lambda x: (bytearray(toTFExample(x[0], x[1])), None)) # requires: --jars tensorflow-hadoop-1.0-SNAPSHOT.jar tfRDD.saveAsNewAPIHadoopFile(output, "org.tensorflow.hadoop.io.TFRecordFileOutputFormat", keyClass="org.apache.hadoop.io.BytesWritable", valueClass="org.apache.hadoop.io.NullWritable") # Note: this creates TFRecord files w/o requiring a custom Input/Output format # else: # format == "tfr": # def writeTFRecords(index, iter): # output_path = "{0}/part-{1:05d}".format(output, index) # writer = tf.python_io.TFRecordWriter(output_path) # for example in iter: # writer.write(example) # return [output_path] # tfRDD = imageRDD.zip(labelRDD).map(lambda x: toTFExample(x[0], x[1])) # tfRDD.mapPartitionsWithIndex(writeTFRecords).collect() def readMNIST(sc, output, format): """Reads/verifies previously created output""" output_images = output + "/images" output_labels = output + "/labels" imageRDD = None labelRDD = None if format == "pickle": imageRDD = sc.pickleFile(output_images) labelRDD = sc.pickleFile(output_labels) elif format == "csv": imageRDD = sc.textFile(output_images).map(fromCSV) labelRDD = sc.textFile(output_labels).map(fromCSV) else: # format.startswith("tf"): # requires: --jars tensorflow-hadoop-1.0-SNAPSHOT.jar tfRDD = sc.newAPIHadoopFile(output, "org.tensorflow.hadoop.io.TFRecordFileInputFormat", keyClass="org.apache.hadoop.io.BytesWritable", valueClass="org.apache.hadoop.io.NullWritable") imageRDD = tfRDD.map(lambda x: fromTFExample(str(x[0]))) num_images = imageRDD.count() num_labels = labelRDD.count() if labelRDD is not None else num_images samples = imageRDD.take(10) print("num_images: ", num_images) print("num_labels: ", num_labels) print("samples: ", samples) if __name__ == "__main__": import argparse from pyspark.context import SparkContext from pyspark.conf import SparkConf parser = argparse.ArgumentParser() parser.add_argument("-f", "--format", help="output format", choices=["csv","csv2","pickle","tf","tfr"], default="csv") parser.add_argument("-n", "--num-partitions", help="Number of output partitions", type=int, default=10) parser.add_argument("-o", "--output", help="HDFS directory to save examples in parallelized format", default="mnist_data") parser.add_argument("-r", "--read", help="read previously saved examples", action="store_true") parser.add_argument("-v", "--verify", help="verify saved examples after writing", action="store_true") args = parser.parse_args() print("args:",args) sc = 
SparkContext(conf=SparkConf().setAppName("mnist_parallelize")) if not args.read: # Note: these files are inside the mnist.zip file writeMNIST(sc, "mnist/train-images-idx3-ubyte.gz", "mnist/train-labels-idx1-ubyte.gz", args.output + "/train", args.format, args.num_partitions) writeMNIST(sc, "mnist/t10k-images-idx3-ubyte.gz", "mnist/t10k-labels-idx1-ubyte.gz", args.output + "/test", args.format, args.num_partitions) if args.read or args.verify: readMNIST(sc, args.output + "/train", args.format) 提交训练任务,开始训练,在HDFS生成mnist_model,命令: ${SPARK_HOME}/bin/spark-submit \ --master ${MASTER} \ --py-files ${TFoS_HOME}/examples/mnist/spark/mnist_dist.py \ --conf spark.cores.max=${TOTAL_CORES} \ --conf spark.task.cpus=${CORES_PER_WORKER} \ --conf spark.executorEnv.JAVA_HOME="$JAVA_HOME" \ ${TFoS_HOME}/examples/mnist/spark/mnist_spark.py \ --cluster_size ${SPARK_WORKER_INSTANCES} \ --images examples/mnist/csv/train/images \ --labels examples/mnist/csv/train/labels \ --format csv \ --mode train \ --model mnist_model mnist_dist.py 构建TensorFlow 分布式任务,定义分布式任务主函数,启动TensorFlow主函数map_fun,数据获取方式Feeding。获取TensorFlow集群和服务器实例: cluster, server = TFNode.start_cluster_server(ctx, 1, args.rdma) TFNode调用tfspark.zip TFNode.py文件。 mnist_spark.py文件是训练主程序,TensorFlowOnSpark部署步骤: from __future__ import absolute_import from __future__ import division from __future__ import print_function from pyspark.context import SparkContext from pyspark.conf import SparkConf import argparse import os import numpy import sys import tensorflow as tf import threading import time from datetime import datetime from tensorflowonspark import TFCluster import mnist_dist sc = SparkContext(conf=SparkConf().setAppName("mnist_spark")) executors = sc._conf.get("spark.executor.instances") num_executors = int(executors) if executors is not None else 1 num_ps = 1 parser = argparse.ArgumentParser() parser.add_argument("-b", "--batch_size", help="number of records per batch", type=int, default=100) parser.add_argument("-e", "--epochs", help="number of epochs", type=int, default=1) parser.add_argument("-f", "--format", help="example format: (csv|pickle|tfr)", choices=["csv","pickle","tfr"], default="csv") parser.add_argument("-i", "--images", help="HDFS path to MNIST images in parallelized format") parser.add_argument("-l", "--labels", help="HDFS path to MNIST labels in parallelized format") parser.add_argument("-m", "--model", help="HDFS path to save/load model during train/inference", default="mnist_model") parser.add_argument("-n", "--cluster_size", help="number of nodes in the cluster", type=int, default=num_executors) parser.add_argument("-o", "--output", help="HDFS path to save test/inference output", default="predictions") parser.add_argument("-r", "--readers", help="number of reader/enqueue threads", type=int, default=1) parser.add_argument("-s", "--steps", help="maximum number of steps", type=int, default=1000) parser.add_argument("-tb", "--tensorboard", help="launch tensorboard process", action="store_true") parser.add_argument("-X", "--mode", help="train|inference", default="train") parser.add_argument("-c", "--rdma", help="use rdma connection", default=False) args = parser.parse_args() print("args:",args) print("{0} ===== Start".format(datetime.now().isoformat())) if args.format == "tfr": images = sc.newAPIHadoopFile(args.images, "org.tensorflow.hadoop.io.TFRecordFileInputFormat", keyClass="org.apache.hadoop.io.BytesWritable", valueClass="org.apache.hadoop.io.NullWritable") def toNumpy(bytestr): example = tf.train.Example() 
example.ParseFromString(bytestr) features = example.features.feature image = numpy.array(features['image'].int64_list.value) label = numpy.array(features['label'].int64_list.value) return (image, label) dataRDD = images.map(lambda x: toNumpy(str(x[0]))) else: if args.format == "csv": images = sc.textFile(args.images).map(lambda ln: [int(x) for x in ln.split(',')]) labels = sc.textFile(args.labels).map(lambda ln: [float(x) for x in ln.split(',')]) else: # args.format == "pickle": images = sc.pickleFile(args.images) labels = sc.pickleFile(args.labels) print("zipping images and labels") dataRDD = images.zip(labels) #1.为在Executor执行每个TensorFlow进程保留一个端口 cluster = TFCluster.run(sc, mnist_dist.map_fun, args, args.cluster_size, num_ps, args.tensorboard, TFCluster.InputMode.SPARK) #2.启动Tensorflow主函数 cluster.start(mnist_dist.map_fun, args) if args.mode == "train": #3.训练 cluster.train(dataRDD, args.epochs) else: #3.预测 labelRDD = cluster.inference(dataRDD) labelRDD.saveAsTextFile(args.output) #4.关闭Executor TensorFlow计算节点、参数服务节点 cluster.shutdown() print("{0} ===== Stop".format(datetime.now().isoformat())) 预测命令: ${SPARK_HOME}/bin/spark-submit \ --master ${MASTER} \ --py-files ${TFoS_HOME}/examples/mnist/spark/mnist_dist.py \ --conf spark.cores.max=${TOTAL_CORES} \ --conf spark.task.cpus=${CORES_PER_WORKER} \ --conf spark.executorEnv.JAVA_HOME="$JAVA_HOME" \ ${TFoS_HOME}/examples/mnist/spark/mnist_spark.py \ --cluster_size ${SPARK_WORKER_INSTANCES} \ --images examples/mnist/csv/test/images \ --labels examples/mnist/csv/test/labels \ --mode inference \ --format csv \ --model mnist_model \ --output predictions 还可以Amazon EC2运行及在Hadoop集群采用YARN模式运行。 参考资料:《TensorFlow技术解析与实战》 欢迎推荐上海机器学习工作机会,我的微信:qingxingfengzi
TensorFlow Debugger(tfdbg),TensorFlow专用调试器。用断点和图形化展现的实时数据流,可视化运行中TensorFlow图的内部结构、状态,有助于在训练、推理时调试模型错误。https://www.tensorflow.org/programmers_guide/debugger 。
常见错误类型:非数字(nan)、无限值(inf)。tfdbg提供命令行界面(command line interface,CLI)。
Debugger示例。以一次错误的MNIST训练为例,通过TensorFlow Debugger找到出错的地方并改正。https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/debug/examples/debug_mnist.py 。
先直接执行:
python -m tensorflow.python.debug.examples.debug_mnist
准确率在第一次训练后就不再上升,一直保持较低水平。用TensorFlow Debugger,在每次调用run()前后基于终端用户界面(UI)控制执行、检查图的内部状态。
from tensorflow.python import debug as tf_debug
sess = tf_debug.LocalCLIDebugWrapperSession(sess)
sess.add_tensor_filter("has_inf_or_nan", tf_debug.has_inf_or_nan)
注册张量值过滤器has_inf_or_nan,判断图的中间张量是否有nan、inf值。开启调试模式(debug):
python -m tensorflow.python.debug.examples.debug_mnist --debug
或
python debug_mnist.py --debug=True
进入运行开始UI(run-start UI),在tfdbg>后输入交互式命令,run()后进入运行结束UI(run-end UI)。连续运行10次:
tfdbg> run -t 10
找出图中第一个nan或inf值:
tfdbg> run -f has_inf_or_nan
第一行灰底字表示tfdbg在调用run()后立即停止,生成了命中指定过滤器has_inf_or_nan的中间张量。第4次调用run()时,36个中间张量包含inf或nan值,首次出现在cross_entropy/Log:0。单击图中cross_entropy/Log:0,再单击带下划线的node_info菜单项,查看节点输入张量是否有0值:
tfdbg> pt softmax/Softmax:0
用ni命令的-t标志追溯:
ni -t cross_entropy/Log
问题代码:
diff = -(y_ * tf.log(y))
修改:对tf.log的输入值裁剪
diff = -(y_ * tf.log(tf.clip_by_value(y, 1e-8, 1.0)))
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import argparse
import sys
import tensorflow as tf
from tensorflow.examples.tutorials.mnist import input_data
from tensorflow.python import debug as tf_debug
IMAGE_SIZE = 28
HIDDEN_SIZE = 500
NUM_LABELS = 10
RAND_SEED = 42
def main(_):
  # Import data
  mnist = input_data.read_data_sets(FLAGS.data_dir,
                                    one_hot=True,
                                    fake_data=FLAGS.fake_data)
  def feed_dict(train):
    if train or FLAGS.fake_data:
      xs, ys = mnist.train.next_batch(FLAGS.train_batch_size,
                                      fake_data=FLAGS.fake_data)
    else:
      xs, ys = mnist.test.images, mnist.test.labels
    return {x: xs, y_: ys}
  sess = tf.InteractiveSession()
  # Create the MNIST neural network graph.
  # Input placeholders.
  with tf.name_scope("input"):
    x = tf.placeholder(
        tf.float32, [None, IMAGE_SIZE * IMAGE_SIZE], name="x-input")
    y_ = tf.placeholder(tf.float32, [None, NUM_LABELS], name="y-input")
  def weight_variable(shape):
    """Create a weight variable with appropriate initialization."""
    initial = tf.truncated_normal(shape, stddev=0.1, seed=RAND_SEED)
    return tf.Variable(initial)
  def bias_variable(shape):
    """Create a bias variable with appropriate initialization."""
    initial = tf.constant(0.1, shape=shape)
    return tf.Variable(initial)
  def nn_layer(input_tensor, input_dim, output_dim, layer_name, act=tf.nn.relu):
    """Reusable code for making a simple neural net layer."""
    # Adding a name scope ensures logical grouping of the layers in the graph.
    with tf.name_scope(layer_name):
      # This Variable will hold the state of the weights for the layer
      with tf.name_scope("weights"):
        weights = weight_variable([input_dim, output_dim])
      with tf.name_scope("biases"):
        biases = bias_variable([output_dim])
      with tf.name_scope("Wx_plus_b"):
        preactivate = tf.matmul(input_tensor, weights) + biases
      activations = act(preactivate)
      return activations
  hidden = nn_layer(x, IMAGE_SIZE**2, HIDDEN_SIZE, "hidden")
  logits = nn_layer(hidden, HIDDEN_SIZE, NUM_LABELS, "output", tf.identity)
  y = tf.nn.softmax(logits)
  with tf.name_scope("cross_entropy"):
    # The following line is the culprit of the bad numerical values that appear
    # during training of this graph.
    # Log of zero gives inf, which is first seen
    # in the intermediate tensor "cross_entropy/Log:0" during the 4th run()
    # call. A multiplication of the inf values with zeros leads to nans,
    # which is first seen in "cross_entropy/mul:0".
    #
    # You can use the built-in, numerically-stable implementation to fix this
    # issue:
    #   diff = tf.nn.softmax_cross_entropy_with_logits(labels=y_, logits=logits)
    diff = -(y_ * tf.log(y))
    with tf.name_scope("total"):
      cross_entropy = tf.reduce_mean(diff)

  with tf.name_scope("train"):
    train_step = tf.train.AdamOptimizer(FLAGS.learning_rate).minimize(
        cross_entropy)

  with tf.name_scope("accuracy"):
    with tf.name_scope("correct_prediction"):
      correct_prediction = tf.equal(tf.argmax(y, 1), tf.argmax(y_, 1))
    with tf.name_scope("accuracy"):
      accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))

  sess.run(tf.global_variables_initializer())

  if FLAGS.debug:
    sess = tf_debug.LocalCLIDebugWrapperSession(sess, ui_type=FLAGS.ui_type)

  # At this point, sess is a debug wrapper around the actual Session if
  # FLAGS.debug is true. In that case, calling run() will launch the CLI.
  for i in range(FLAGS.max_steps):
    acc = sess.run(accuracy, feed_dict=feed_dict(False))
    print("Accuracy at step %d: %s" % (i, acc))
    sess.run(train_step, feed_dict=feed_dict(True))

if __name__ == "__main__":
  parser = argparse.ArgumentParser()
  parser.register("type", "bool", lambda v: v.lower() == "true")
  parser.add_argument(
      "--max_steps",
      type=int,
      default=10,
      help="Number of steps to run trainer.")
  parser.add_argument(
      "--train_batch_size",
      type=int,
      default=100,
      help="Batch size used during training.")
  parser.add_argument(
      "--learning_rate",
      type=float,
      default=0.025,
      help="Initial learning rate.")
  parser.add_argument(
      "--data_dir",
      type=str,
      default="/tmp/mnist_data",
      help="Directory for storing data")
  parser.add_argument(
      "--ui_type",
      type=str,
      default="curses",
      help="Command-line user interface type (curses | readline)")
  parser.add_argument(
      "--fake_data",
      type="bool",
      nargs="?",
      const=True,
      default=False,
      help="Use fake MNIST data for unit testing")
  parser.add_argument(
      "--debug",
      type="bool",
      nargs="?",
      const=True,
      default=False,
      help="Use debugger to track down bad values during training")
  FLAGS, unparsed = parser.parse_known_args()
  tf.app.run(main=main, argv=[sys.argv[0]] + unparsed)

Remote debugging uses tfdbg offline_analyzer. Set up a shared directory that both the local and the remote machine can access. The debug_utils.watch_graph function sets run-time options; when session.run() executes, intermediate tensors and the runtime graph are dumped to the shared directory. A local terminal then loads and inspects the dumped data with tfdbg offline_analyzer:

python -m tensorflow.python.debug.cli.offline_analyzer --dump_dir=/home/somebody/tfdbg_dumps_1

Source:

from tensorflow.python.debug.lib import debug_utils

# build the graph and create the session object (omitted)
run_options = tf.RunOptions()
debug_utils.watch_graph(
    run_options,
    sess.graph,
    # location of the shared directory;
    # if several clients call run(), use a different shared directory for each
    debug_urls=["file:///home/somebody/tfdbg_dumps_1"])
session.run(fetches, feed_dict=feeds, options=run_options)

Or use the session wrapper DumpingDebugWrapperSession to accumulate dump files from training in the shared directory:

from tensorflow.python import debug as tf_debug

sess = tf_debug.DumpingDebugWrapperSession(sess, "/home/somebody/tfdbg_dumps_1", watch_fn=my_watch_fn)

Reference: 《TensorFlow技术解析与实战》
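An aside on tensor filters: besides the built-in tf_debug.has_inf_or_nan, you can register your own filter with the same (datum, tensor) signature via add_tensor_filter. A minimal sketch; has_large_value is a hypothetical filter (not part of TensorFlow) that flags tensors containing entries above a threshold:

import numpy as np
import tensorflow as tf
from tensorflow.python import debug as tf_debug

def has_large_value(datum, tensor):
  """datum carries the tensor's metadata; tensor is its value as a numpy array."""
  if tensor is None:
    return False
  t = np.asarray(tensor)
  if not np.issubdtype(t.dtype, np.floating):
    return False  # skip non-float tensors (e.g. string or int tensors)
  return bool(np.any(np.abs(t) > 1e3))

sess = tf_debug.LocalCLIDebugWrapperSession(tf.Session())
sess.add_tensor_filter("has_large_value", has_large_value)
# then, inside the CLI:  run -f has_large_value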
XLA (Accelerated Linear Algebra) is a domain-specific compiler for linear algebra that optimizes TensorFlow computations. It can be used through just-in-time (JIT) compilation or ahead-of-time (AOT) compilation, and helps with hardware acceleration. XLA is still experimental. https://www.tensorflow.org/versions/master/experimental/xla/ .

Advantages of XLA. As a domain-specific compiler for linear algebra it improves: execution speed (compiling subgraphs reduces the execution time of short-lived ops; fusing and pipelining ops reduces memory overhead), memory usage (analyzing and planning memory requirements eliminates many intermediate result buffers), reliance on custom ops (automatically fused low-level ops can match the performance of hand-fused custom ops), mobile footprint (AOT-compiling subgraphs reduces the TensorFlow runtime footprint; the emitted object/header file pair can be linked directly into other programs), and portability (writing a new back end for new hardware lets TensorFlow run on that hardware without large code changes).

How XLA works. LLVM is a compiler framework written in C++ that optimizes the compile time, link time, run time and idle time of programs written in arbitrary languages. A front end parses, validates and diagnoses errors in the input code, then translates it into the LLVM intermediate representation (IR). The IR is analyzed and optimized to improve the code, then sent to a code generator that produces native machine code. This is LLVM's three-phase design, whose key piece is the LLVM IR, the compiler's representation of code: C -> Clang C/C++/ObjC front end, Fortran -> llvm-gcc front end, Haskell -> GHC front end => LLVM IR -> LLVM optimizer -> LLVM IR => LLVM X86 back end -> X86, LLVM PowerPC back end -> PowerPC, LLVM ARM back end -> ARM. http://www.aosabook.org/en/llvm.html . XLA's input language is HLO IR; XLA compiles graphs defined in HLO into machine instructions for various architectures. The pipeline: XLA HLO -> target-independent optimizations and analyses -> XLA HLO -> XLA back end -> target-dependent optimizations and analyses -> target-specific code generation. XLA first runs target-independent passes (common subexpression elimination, CSE; target-independent op fusion; buffer analysis for runtime memory allocation), then sends the HLO computation to a back end. The back end performs further HLO-level optimizations, now with target-specific knowledge; the XLA GPU back end, for example, performs op fusion beneficial to the GPU programming model and decides how to partition the computation into streams. Finally, target-specific code is generated. The XLA CPU and GPU back ends use LLVM for IR, optimization and code generation, representing the XLA HLO computation in LLVM IR. XLA supports JIT compilation on x86-64 and NVIDIA GPUs, and AOT compilation for x86-64 and ARM. AOT best suits mobile and embedded deep learning.

JIT compilation. XLA compiles and runs parts of a TensorFlow graph, fusing multiple ops (kernels) into a small number of compiled kernels; fused ops reduce memory bandwidth demands and improve performance. There are two ways to run TensorFlow computations through XLA: turn on JIT compilation for CPU/GPU devices, or place operators on the XLA_CPU / XLA_GPU devices. Turning JIT on at the session level compiles all possible operators into XLA computations:

config = tf.ConfigProto()
config.graph_options.optimizer_options.global_jit_level = tf.OptimizerOptions.ON_1
sess = tf.Session(config=config)

JIT compilation can also be turned on manually for one or more operators; the attribute _XlaCompile=true marks an operator for compilation:

jit_scope = tf.contrib.compiler.jit.experimental_jit_scope

x = tf.placeholder(np.float32)
with jit_scope():
    y = tf.add(x, x)

Or place operators on an XLA device (valid devices are XLA_CPU and XLA_GPU):

with tf.device("/job:localhost/replica:0/task:0/device:XLA_GPU:0"):
    output = tf.add(input1, input2)

JIT compilation on the MNIST example: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/examples/tutorials/mnist/mnist_softmax_xla.py . Run without XLA first:

python mnist_softmax_xla.py --xla=false

The run produces a timeline file timeline.ctf.json; open it with the Chrome trace-event profiler at chrome://tracing to render the timeline. The left side lists the GPUs, and you can see the time consumed by each operator. Then train the model with XLA:

TF_XLA_FLAGS=--xla_generate_hlo_graph=.* python mnist_softmax_xla.py

The XLA framework is still experimental; AOT mainly targets memory-constrained embedded devices, phones and the Raspberry Pi.

from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import argparse
import sys

import tensorflow as tf

from tensorflow.examples.tutorials.mnist import input_data
from tensorflow.python.client import timeline

FLAGS = None

def main(_):
  # Import data
  mnist = input_data.read_data_sets(FLAGS.data_dir, one_hot=True)

  # Create the model
  x = tf.placeholder(tf.float32, [None, 784])
  w = tf.Variable(tf.zeros([784, 10]))
  b = tf.Variable(tf.zeros([10]))
  y = tf.matmul(x, w) + b

  # Define loss and optimizer
  y_ = tf.placeholder(tf.float32, [None, 10])

  # The raw formulation of cross-entropy,
  #
  #   tf.reduce_mean(-tf.reduce_sum(y_ * tf.log(tf.nn.softmax(y)),
  #                                 reduction_indices=[1]))
  #
  # can be numerically unstable.
  #
  # So here we use tf.nn.softmax_cross_entropy_with_logits on the raw
  # outputs of 'y', and then average across the batch.
  cross_entropy = tf.reduce_mean(
      tf.nn.softmax_cross_entropy_with_logits(labels=y_, logits=y))
  train_step = tf.train.GradientDescentOptimizer(0.5).minimize(cross_entropy)

  config = tf.ConfigProto()
  jit_level = 0
  if FLAGS.xla:
    # Turns on XLA JIT compilation.
    jit_level = tf.OptimizerOptions.ON_1

  config.graph_options.optimizer_options.global_jit_level = jit_level
  run_metadata = tf.RunMetadata()
  sess = tf.Session(config=config)
  tf.global_variables_initializer().run(session=sess)

  # Train
  train_loops = 1000
  for i in range(train_loops):
    batch_xs, batch_ys = mnist.train.next_batch(100)

    # Create a timeline for the last loop and export to json to view with
    # chrome://tracing/.
    if i == train_loops - 1:
      sess.run(train_step,
               feed_dict={x: batch_xs, y_: batch_ys},
               options=tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE),
               run_metadata=run_metadata)
      trace = timeline.Timeline(step_stats=run_metadata.step_stats)
      with open('timeline.ctf.json', 'w') as trace_file:
        trace_file.write(trace.generate_chrome_trace_format())
    else:
      sess.run(train_step, feed_dict={x: batch_xs, y_: batch_ys})

  # Test trained model
  correct_prediction = tf.equal(tf.argmax(y, 1), tf.argmax(y_, 1))
  accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
  print(sess.run(accuracy,
                 feed_dict={x: mnist.test.images,
                            y_: mnist.test.labels}))
  sess.close()

if __name__ == '__main__':
  parser = argparse.ArgumentParser()
  parser.add_argument(
      '--data_dir',
      type=str,
      default='/tmp/tensorflow/mnist/input_data',
      help='Directory for storing input data')
  parser.add_argument(
      '--xla', type=bool, default=True, help='Turn xla via JIT on')
  FLAGS, unparsed = parser.parse_known_args()
  tf.app.run(main=main, argv=[sys.argv[0]] + unparsed)

Reference: 《TensorFlow技术解析与实战》
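An aside: timeline.ctf.json is plain Chrome-trace JSON, so the per-operator times shown in chrome://tracing can also be summarized from the command line. A minimal sketch, assuming the file was produced by the run above (complete events, "ph" == "X", carry durations in microseconds):

import json
from collections import defaultdict

with open('timeline.ctf.json') as f:
    trace = json.load(f)

op_time_us = defaultdict(int)
for event in trace['traceEvents']:
    if event.get('ph') == 'X':  # complete event with a duration
        op_time_us[event.get('name', 'unknown')] += event.get('dur', 0)

# print the ten most expensive ops
for name, dur in sorted(op_time_us.items(), key=lambda kv: -kv[1])[:10]:
    print('%8d us  %s' % (dur, name))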
Distributed TensorFlow is backed by the high-performance gRPC library. Martin Abadi, Ashish Agarwal, Paul Barham et al., 《TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems》.

How distribution works. A distributed cluster consists of several server processes and client processes. Deployment modes: single machine with multiple GPUs, or distributed (multiple machines with multiple GPUs).

Single machine, multiple GPUs: one server holding several GPU cards. On a single machine with a single GPU, training consumes the data one batch at a time. With multiple GPUs, several batches are processed at once, one batch per GPU. The variables (parameters) are kept on the CPU; the CPU distributes the data to the GPUs; each GPU computes the gradients for its batch; once the CPU has collected the gradients from all GPUs, it averages them and updates the parameters, then gradient computation continues. The processing speed is bounded by the slowest GPU.

Distributed: training runs on multiple worker nodes. A worker node is a unit of computation: a server with one card counts as one worker; a server with several cards can be divided into several workers, one per GPU. When the data volume exceeds one machine's capability, the distributed mode is required.

The low-level communication of distributed TensorFlow is gRPC (google remote procedure call), Google's open-source high-performance cross-language RPC framework. RPC, the remote procedure call protocol, requests a service from a program on a remote computer over the network.

Distributed deployment. A distributed run consists of several computation units (worker nodes); a back-end server can be deployed with a single worker node or with several.

Single-worker deployment: each server runs one worker node; when the server has several GPUs, that one worker node can access all of the GPU cards, and tf.device() pins operations to devices in code. Advantage: communication between the GPUs of one machine is efficient. Disadvantage: devices must be assigned manually in code.

Multi-worker deployment: one server runs several worker nodes. Set the CUDA_VISIBLE_DEVICES environment variable when launching each process, so that each worker node sees exactly one GPU, and pin specific GPUs with tf.device(). Advantages: simpler code and higher GPU utilization. Disadvantages: the worker nodes must communicate, and several worker nodes have to be deployed. https://github.com/tobegit3hub/tensorflow_examples/tree/master/distributed_tensorflow .

CUDA_VISIBLE_DEVICES='' python ./distributed_supervisor.py --ps_hosts=127.0.0.1:2222,127.0.0.1:2223 --worker_hosts=127.0.0.1:2224,127.0.0.1:2225 --job_name=ps --task_index=0
CUDA_VISIBLE_DEVICES='' python ./distributed_supervisor.py --ps_hosts=127.0.0.1:2222,127.0.0.1:2223 --worker_hosts=127.0.0.1:2224,127.0.0.1:2225 --job_name=ps --task_index=1
CUDA_VISIBLE_DEVICES='0' python ./distributed_supervisor.py --ps_hosts=127.0.0.1:2222,127.0.0.1:2223 --worker_hosts=127.0.0.1:2224,127.0.0.1:2225 --job_name=worker --task_index=0
CUDA_VISIBLE_DEVICES='1' python ./distributed_supervisor.py --ps_hosts=127.0.0.1:2222,127.0.0.1:2223 --worker_hosts=127.0.0.1:2224,127.0.0.1:2225 --job_name=worker --task_index=1

Distributed architecture. https://www.tensorflow.org/extend/architecture . There are clients and servers; a server consists of a master node and worker nodes.

Relationship between client, master and workers. In TensorFlow, the client's session talks to the master, and the actual work is carried out by the workers; each worker occupies one device (TensorFlow's hardware abstraction for computation: a CPU or a GPU). In single-machine mode, the client, master and workers sit on the same server; in distributed mode they can sit on different servers. Client -> master -> worker /job:worker/task:0 -> /job:ps/task:0. The client builds the TensorFlow computation graph and sets up the session layer that interacts with the cluster; client code contains Session(). One client can connect to several servers at the same time, and one server can serve several clients. A server runs a tf.train.Server instance process and is part of a TensorFlow cluster that executes tasks. A server provides a master service and a worker service. At run time, one master process and several worker processes communicate through interfaces. Single-machine multi-GPU and distributed share the same structure; switching between them only requires changing the implementation of the communication interface. The master service implements the tensorflow::Session interface; it connects to workers through RPC and communicates with the worker services' task processes; on a TensorFlow server, it is the job whose task_index is 0. The worker service implements the worker_service.proto interface and computes part of the graph on local devices; all worker nodes of a TensorFlow server contain the worker-service logic; each worker node manages one or more devices. Worker nodes can be separate processes on different local ports, or multiple processes across several servers.

A distributed TensorFlow run executes a task set of one or more jobs; each job consists of one or more tasks that serve the same purpose, and each task is executed by one worker process. A job is a set of tasks; a cluster is a set of jobs. In a distributed machine-learning framework, jobs divide into parameter jobs and worker jobs. Servers running parameter jobs are parameter servers (PS); they manage parameter storage and updates. Worker jobs are stateless and mainly perform computation. As the model grows and its parameters multiply, parameter updates can exceed one machine's capacity, so the parameters have to be sharded across machines for storage and update. The parameter service is a cluster of machines, similar to a distributed storage architecture, with data synchronization and consistency concerns; the parameters are stored as key-value pairs: in effect a distributed in-memory key-value database plus parameter-update operations. 李沐, 《Parameter Server for Distributed Machine Learning》 http://www.cs.cmu.edu/~muli/file/ps.pdf . Parameters are stored and updated in the parameter job; model computation happens in the worker job. TensorFlow's distributed implementation moves data between the jobs: the forward pass goes from the parameter job to the worker job, the backward pass from the worker job to the parameter job. A task is an independent process of a specific TensorFlow server with its own index within a job; one task corresponds to one worker node. Cluster -> job -> task -> worker node.

Interaction among client, master and workers. Single machine with several cards: client -> session run -> master -> execute subgraph -> worker -> GPU0, GPU1. Distributed: client -> session run -> master process -> execute subgraph 1 -> worker process 1 -> GPU0, GPU1. 《TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems》 https://arxiv.org/abs/1603.04467v1 .

Distributed modes.

Data parallelism. https://www.tensorflow.org/tutorials/deep_cnn . The CPU is responsible for averaging gradients and updating parameters; the GPUs train model replicas, each on its own subset of training examples, so the replicas are independent. Steps: define the same network structure on every GPU; each GPU reads a different block from the data pipeline, runs the forward pass, computes the loss, and computes the gradients of the current variables; the gradients from all GPUs are moved to the CPU, averaged, and the model variables are updated; repeat until the model variables converge. Data parallelism raises SGD efficiency: the SGD mini-batch is cut into slices, the model is replicated, and the slices are computed on several replicas at the same time (a gradient-averaging sketch follows below). Because the replicas compute at different speeds, the CPU can update the variables under two schemes: synchronous or asynchronous.
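A sketch of the CPU-side gradient averaging in the data-parallel scheme just described, modeled on the TensorFlow CIFAR-10 multi-GPU tutorial (build_tower and next_batch in the commented loop are hypothetical placeholders, not functions from this document):

import tensorflow as tf

def average_gradients(tower_grads):
  """tower_grads: one list of (gradient, variable) pairs per GPU tower."""
  average_grads = []
  for grad_and_vars in zip(*tower_grads):
    # grad_and_vars is ((grad_gpu0, var), (grad_gpu1, var), ...):
    # the same variable paired with each tower's gradient for it.
    grads = [tf.expand_dims(g, 0) for g, _ in grad_and_vars]
    grad = tf.reduce_mean(tf.concat(grads, 0), 0)
    average_grads.append((grad, grad_and_vars[0][1]))
  return average_grads

# Tower loop (sketch): one model replica per GPU, variables kept on the CPU.
# opt = tf.train.GradientDescentOptimizer(0.1)
# tower_grads = []
# for i in range(num_gpus):
#   with tf.device("/gpu:%d" % i):
#     loss = build_tower(next_batch())          # hypothetical model/input fns
#     tower_grads.append(opt.compute_gradients(loss))
# train_op = opt.apply_gradients(average_gradients(tower_grads))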
Synchronous and asynchronous updates. In distributed SGD, the model parameters are stored across distributed parameter servers while the worker nodes train on data in parallel, communicating with the parameter servers to obtain the parameters. Synchronous SGD (Sync-SGD, synchronous update/training): in each step, every worker task reads the shared parameters and computes its gradients in parallel; the update waits until all worker nodes have finished their local gradients, merges and accumulates all of them, and applies them to the model parameters in one shot; in the next batch, all workers train with the freshly updated parameters. Advantage: every training batch takes all workers into account, so the loss decreases steadily. Disadvantage: performance is bottlenecked by the slowest worker; on heterogeneous devices, where worker performance differs, the disadvantage is pronounced. Asynchronous SGD (Async-SGD, asynchronous update/training): each worker task computes its local gradients independently and applies them to the model parameters asynchronously, with no coordination or waiting. Advantage: no performance bottleneck. Disadvantage: the gradient updates each worker sends back to the parameter servers conflict with one another, which hurts the algorithm's convergence speed and makes the loss curve jitter. Synchronous and asynchronous updating differ only in the parameter servers' update strategy. With small data and fairly balanced nodes, use the synchronous model; with large data and machines of uneven performance, use the asynchronous mode. Sync-SGD with backup: Jianmin Chen, Xinghao Pan, Rajat Monga, Samy Bengio, Rafal Jozefowicz, 《Revisiting Distributed Synchronous SGD》 https://arxiv.org/abs/1604.00981 . Adding worker nodes mitigates slow workers: run n + n*5% workers, n being the cluster's worker count, and let the parameter servers apply the update as soon as parameters from n workers have arrived, then move on to the next training batch; the results of the slowest workers are simply discarded.

Synchronous and asynchronous updates both come in an in-graph pattern and a between-graph pattern; the sync/async choice is independent of the in-graph/between-graph distinction. In-graph replication: all operations live in one graph; a single client builds the graph and assigns the operations to all parameter servers and worker nodes of the cluster. In-graph replication resembles single-machine multi-GPU training extended to many machines and cards; the data is still dispatched from the single client node. Advantage: the compute nodes only need to call join() and wait for work, so the client can submit data and train at any moment. Disadvantage: the training data is dispatched from one node to all the worker nodes, which severely limits concurrent training speed. Between-graph replication: every worker node creates its own graph; the training parameters are kept on the parameter servers; the data is not dispatched: each worker node computes independently and, when finished, tells the parameter servers which parameters to update. Advantages: no data dispatching; every worker node builds its own graph and reads its own data for training. Disadvantage: each worker node is both graph creator and compute-task executor, so a crashed worker affects the whole cluster. For deep learning on large data, the between-graph pattern is recommended.

Model parallelism: split the model and execute its different parts on different devices, so one batch of samples can execute on several devices simultaneously. TensorFlow tries to keep adjacent computations on the same device to save network overhead. Martin Abadi, Ashish Agarwal, Paul Barham et al., 《TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems》 https://arxiv.org/abs/1603.04467v1 .

Model parallelism versus data parallelism: in TensorFlow, computation can be partitioned and parameters can be partitioned; compute nodes can be placed per device together with the parameters they use, keeping computation and parameters together.

Distributed API. https://www.tensorflow.org/deploy/distributed . To create a cluster, start one service per task (a worker service or a master service). Tasks can be spread over different machines, or several tasks can start on one machine, running on different GPUs. Each task does the following: create a tf.train.ClusterSpec describing all tasks of the cluster, identical for every task; create a tf.train.Server that runs the service for its job's compute task. The distributed development API: tf.train.ClusterSpec({"ps": ps_hosts, "worker": worker_hosts}) creates the TensorFlow cluster description; ps and worker are job names, ps_hosts and worker_hosts the node addresses of each job's tasks. tf.train.ClusterSpec takes the mapping between jobs and tasks, tasks being identified by IP address and port.

The structure tf.train.ClusterSpec({"local": ["localhost:2222", "localhost:2223"]}) provides the tasks /job:local/task:0 and /job:local/task:1.

The structure tf.train.ClusterSpec({"worker": ["worker0.example.com:2222", "worker1.example.com:2222", "worker2.example.com:2222"], "ps": ["ps0.example.com:2222", "ps1.example.com:2222"]}) provides the tasks /job:worker/task:0, /job:worker/task:1, /job:worker/task:2, /job:ps/task:0, /job:ps/task:1.

tf.train.Server(cluster, job_name, task_index) creates a service (master service or worker service) and runs the job's compute task, started on the machine given by task_index.

# task 0
cluster = tf.train.ClusterSpec({"local": ["localhost:2222", "localhost:2223"]})
server = tf.train.Server(cluster, job_name="local", task_index=0)
# task 1
cluster = tf.train.ClusterSpec({"local": ["localhost:2222", "localhost:2223"]})
server = tf.train.Server(cluster, job_name="local", task_index=1)

For automatic node management and monitoring, use a cluster-management tool such as Kubernetes. tf.device(device_name_or_function) pins tensor computations to a specified device (CPU or GPU):

# run these Tensor ops on the machine hosting task 0
with tf.device("/job:ps/task:0"):
    weights_1 = tf.Variable(…)
    biases_1 = tf.Variable(…)

Skeleton of distributed training code: create the TensorFlow server cluster, then compute the data-flow graph distributed across that cluster. https://github.com/tensorflow/tensorflow/blob/master/tensorflow/docs_src/deploy/distributed.md .

import argparse
import sys

import tensorflow as tf

FLAGS = None

def main(_):
  # Step 1: parse the command-line flags to get the cluster information
  # (ps_hosts, worker_hosts) and this node's role (job_name, task_index)
  ps_hosts = FLAGS.ps_hosts.split(",")
  worker_hosts = FLAGS.worker_hosts.split(",")

  # Step 2: create this task's server
  # Create a cluster from the parameter server and worker hosts.
  cluster = tf.train.ClusterSpec({"ps": ps_hosts, "worker": worker_hosts})

  # Create and start a server for the local task.
  server = tf.train.Server(cluster,
                           job_name=FLAGS.job_name,
                           task_index=FLAGS.task_index)

  # Step 3: if this node is a parameter server, call server.join() and wait
  # forever; if it is a worker, go on to step 4
  if FLAGS.job_name == "ps":
    server.join()
  # Step 4: build the model to train, i.e. the computation graph
  elif FLAGS.job_name == "worker":
    # Assigns ops to the local worker by default.
    with tf.device(tf.train.replica_device_setter(
        worker_device="/job:worker/task:%d" % FLAGS.task_index,
        cluster=cluster)):

      # Build model...
      loss = ...
      global_step = tf.contrib.framework.get_or_create_global_step()

      train_op = tf.train.AdagradOptimizer(0.01).minimize(
          loss, global_step=global_step)

    # The StopAtStepHook handles stopping after running given steps.
    # Step 5: manage the model-training process
    hooks = [tf.train.StopAtStepHook(last_step=1000000)]

    # The MonitoredTrainingSession takes care of session initialization,
    # restoring from a checkpoint, saving to a checkpoint, and closing when done
    # or an error occurs.
    with tf.train.MonitoredTrainingSession(master=server.target,
                                           is_chief=(FLAGS.task_index == 0),
                                           checkpoint_dir="/tmp/train_logs",
                                           hooks=hooks) as mon_sess:
      while not mon_sess.should_stop():
        # Run a training step asynchronously.
        # See `tf.train.SyncReplicasOptimizer` for additional details on how to
        # perform *synchronous* training.
        # mon_sess.run handles AbortedError in case of preempted PS.
        mon_sess.run(train_op)

if __name__ == "__main__":
  parser = argparse.ArgumentParser()
  parser.register("type", "bool", lambda v: v.lower() == "true")
  # Flags for defining the tf.train.ClusterSpec
  parser.add_argument(
      "--ps_hosts",
      type=str,
      default="",
      help="Comma-separated list of hostname:port pairs"
  )
  parser.add_argument(
      "--worker_hosts",
      type=str,
      default="",
      help="Comma-separated list of hostname:port pairs"
  )
  parser.add_argument(
      "--job_name",
      type=str,
      default="",
      help="One of 'ps', 'worker'"
  )
  # Flags for defining the tf.train.Server
  parser.add_argument(
      "--task_index",
      type=int,
      default=0,
      help="Index of task within the job"
  )
  FLAGS, unparsed = parser.parse_known_args()
  tf.app.run(main=main, argv=[sys.argv[0]] + unparsed)

Distributed best practice. https://github.com/tensorflow/tensorflow/blob/master/tensorflow/tools/dist_test/python/mnist_replica.py : distributed training on the MNIST dataset. Open three ports for the multi-worker deployment: port 2222 for the parameter server, port 2223 for worker 0, port 2224 for worker 1. The parameter server performs the parameter updates; worker 0 and worker 1 execute the graph's training computation. Parameter server /job:ps/task:0 localhost:2222, worker /job:worker/task:0 localhost:2223, worker /job:worker/task:1 localhost:2224. Run the code:

python mnist_replica.py --job_name="ps" --task_index=0
python mnist_replica.py --job_name="worker" --task_index=0
python mnist_replica.py --job_name="worker" --task_index=1

from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import math
import sys
import tempfile
import time

import tensorflow as tf
from tensorflow.examples.tutorials.mnist import input_data

# constants used to build the data-flow graph
flags = tf.app.flags
flags.DEFINE_string("data_dir", "/tmp/mnist-data",
                    "Directory for storing mnist data")
# only download the data, do nothing else
flags.DEFINE_boolean("download_only", False,
                     "Only perform downloading of data; Do not proceed to "
                     "session preparation, model definition or training")
# task_index starts at 0; task 0 is the one that initializes the variables
flags.DEFINE_integer("task_index", None,
                     "Worker task index, should be >= 0. task_index=0 is "
                     "the master worker task that performs the variable "
                     "initialization")
# number of GPUs per machine; 0 if the machine has none
flags.DEFINE_integer("num_gpus", 1,
                     "Total number of gpus for each machine."
"If you don't use GPU, please set it to '0'") # 同步训练模型下,设置收集工作节点数量。默认工作节点总数 flags.DEFINE_integer("replicas_to_aggregate", None, "Number of replicas to aggregate before parameter update" "is applied (For sync_replicas mode only; default: " "num_workers)") flags.DEFINE_integer("hidden_units", 100, "Number of units in the hidden layer of the NN") # 训练次数 flags.DEFINE_integer("train_steps", 200, "Number of (global) training steps to perform") flags.DEFINE_integer("batch_size", 100, "Training batch size") flags.DEFINE_float("learning_rate", 0.01, "Learning rate") # 使用同步训练、异步训练 flags.DEFINE_boolean("sync_replicas", False, "Use the sync_replicas (synchronized replicas) mode, " "wherein the parameter updates from workers are aggregated " "before applied to avoid stale gradients") # 如果服务器已经存在,采用gRPC协议通信;如果不存在,采用进程间通信 flags.DEFINE_boolean( "existing_servers", False, "Whether servers already exists. If True, " "will use the worker hosts via their GRPC URLs (one client process " "per worker host). Otherwise, will create an in-process TensorFlow " "server.") # 参数服务器主机 flags.DEFINE_string("ps_hosts","localhost:2222", "Comma-separated list of hostname:port pairs") # 工作节点主机 flags.DEFINE_string("worker_hosts", "localhost:2223,localhost:2224", "Comma-separated list of hostname:port pairs") # 本作业是工作节点还是参数服务器 flags.DEFINE_string("job_name", None,"job name: worker or ps") FLAGS = flags.FLAGS IMAGE_PIXELS = 28 def main(unused_argv): mnist = input_data.read_data_sets(FLAGS.data_dir, one_hot=True) if FLAGS.download_only: sys.exit(0) if FLAGS.job_name is None or FLAGS.job_name == "": raise ValueError("Must specify an explicit `job_name`") if FLAGS.task_index is None or FLAGS.task_index =="": raise ValueError("Must specify an explicit `task_index`") print("job name = %s" % FLAGS.job_name) print("task index = %d" % FLAGS.task_index) #Construct the cluster and start the server # 读取集群描述信息 ps_spec = FLAGS.ps_hosts.split(",") worker_spec = FLAGS.worker_hosts.split(",") # Get the number of workers. num_workers = len(worker_spec) # 创建TensorFlow集群描述对象 cluster = tf.train.ClusterSpec({ "ps": ps_spec, "worker": worker_spec}) # 为本地执行任务创建TensorFlow Server对象。 if not FLAGS.existing_servers: # Not using existing servers. Create an in-process server. # 创建本地Sever对象,从tf.train.Server这个定义开始,每个节点开始不同 # 根据执行的命令的参数(作业名字)不同,决定这个任务是哪个任务 # 如果作业名字是ps,进程就加入这里,作为参数更新的服务,等待其他工作节点给它提交参数更新的数据 # 如果作业名字是worker,就执行后面的计算任务 server = tf.train.Server( cluster, job_name=FLAGS.job_name, task_index=FLAGS.task_index) # 如果是参数服务器,直接启动即可。这里,进程就会阻塞在这里 # 下面的tf.train.replica_device_setter代码会将参数批定给ps_server保管 if FLAGS.job_name == "ps": server.join() # 处理工作节点 # 找出worker的主节点,即task_index为0的点 is_chief = (FLAGS.task_index == 0) # 如果使用gpu if FLAGS.num_gpus > 0: # Avoid gpu allocation conflict: now allocate task_num -> #gpu # for each worker in the corresponding machine gpu = (FLAGS.task_index % FLAGS.num_gpus) # 分配worker到指定gpu上运行 worker_device = "/job:worker/task:%d/gpu:%d" % (FLAGS.task_index, gpu) # 如果使用cpu elif FLAGS.num_gpus == 0: # Just allocate the CPU to worker server # 把cpu分配给worker cpu = 0 worker_device = "/job:worker/task:%d/cpu:%d" % (FLAGS.task_index, cpu) # The device setter will automatically place Variables ops on separate # parameter servers (ps). The non-Variable ops will be placed on the workers. 
  # The ps use CPU and workers use corresponding GPU
  # tf.train.replica_device_setter assigns variable ops to the parameter
  # servers, which use the CPU, and non-variable ops to the workers, which use
  # the worker_device value chosen above.
  # Parameters defined under this with block are automatically placed on the
  # parameter servers; with several parameter servers, they are assigned
  # round-robin.
  with tf.device(
      tf.train.replica_device_setter(
          worker_device=worker_device,
          ps_device="/job:ps/cpu:0",
          cluster=cluster)):
    # global step, default value 0
    global_step = tf.Variable(0, name="global_step", trainable=False)

    # Variables of the hidden layer
    # parameters of the hidden layer of this fully connected network
    hid_w = tf.Variable(
        tf.truncated_normal(
            [IMAGE_PIXELS * IMAGE_PIXELS, FLAGS.hidden_units],
            stddev=1.0 / IMAGE_PIXELS),
        name="hid_w")
    hid_b = tf.Variable(tf.zeros([FLAGS.hidden_units]), name="hid_b")

    # Variables of the softmax layer
    # parameters of the softmax regression layer
    sm_w = tf.Variable(
        tf.truncated_normal(
            [FLAGS.hidden_units, 10],
            stddev=1.0 / math.sqrt(FLAGS.hidden_units)),
        name="sm_w")
    sm_b = tf.Variable(tf.zeros([10]), name="sm_b")

    # Ops: located on the worker specified with FLAGS.task_index
    # model input placeholders
    x = tf.placeholder(tf.float32, [None, IMAGE_PIXELS * IMAGE_PIXELS])
    y_ = tf.placeholder(tf.float32, [None, 10])

    # build the hidden layer
    hid_lin = tf.nn.xw_plus_b(x, hid_w, hid_b)
    hid = tf.nn.relu(hid_lin)

    # build the loss function and the optimizer
    y = tf.nn.softmax(tf.nn.xw_plus_b(hid, sm_w, sm_b))
    cross_entropy = -tf.reduce_sum(y_ * tf.log(tf.clip_by_value(y, 1e-10, 1.0)))

    # asynchronous mode: each replica updates the parameters as soon as its
    # own gradients are computed, without coordinating with the other replicas
    opt = tf.train.AdamOptimizer(FLAGS.learning_rate)

    # synchronous mode
    if FLAGS.sync_replicas:
      if FLAGS.replicas_to_aggregate is None:
        replicas_to_aggregate = num_workers
      else:
        replicas_to_aggregate = FLAGS.replicas_to_aggregate
      # use SyncReplicasOptimizer as the optimizer, under between-graph
      # replication; it averages all gradients the way in-graph replication would
      opt = tf.train.SyncReplicasOptimizer(
          opt,
          replicas_to_aggregate=replicas_to_aggregate,
          total_num_replicas=num_workers,
          name="mnist_sync_replicas")

    train_step = opt.minimize(cross_entropy, global_step=global_step)

    if FLAGS.sync_replicas:
      local_init_op = opt.local_step_init_op
      if is_chief:
        # among all computing workers, one is the chief worker,
        # responsible for initializing parameters, saving the model and
        # saving summaries
        local_init_op = opt.chief_init_op
      ready_for_local_init_op = opt.ready_for_local_init_op
      # Initial token and chief queue runners required by the sync_replicas mode
      chief_queue_runner = opt.get_chief_queue_runner()
      sync_init_op = opt.get_init_tokens_op()

    init_op = tf.global_variables_initializer()
    train_dir = tempfile.mkdtemp()

    if FLAGS.sync_replicas:
      # create a Supervisor to track the training process;
      # logdir is the path for saving and loading the model: at startup the
      # Supervisor looks in logdir for checkpoint files and loads one if
      # present, otherwise it initializes the parameters with init_op.
      # The chief worker performs the parameter initialization; the other
      # workers wait until it finishes, then all start training together.
      # global_step is shared by all compute nodes; it is incremented
      # automatically whenever the loss-minimization step executes, so it
      # counts the total number of steps across all compute nodes.
      sv = tf.train.Supervisor(
          is_chief=is_chief,
          logdir=train_dir,
          init_op=init_op,
          local_init_op=local_init_op,
          ready_for_local_init_op=ready_for_local_init_op,
          recovery_wait_secs=1,
          global_step=global_step)
    else:
      sv = tf.train.Supervisor(
          is_chief=is_chief,
          logdir=train_dir,
          init_op=init_op,
          recovery_wait_secs=1,
          global_step=global_step)

    # create the session config with allow_soft_placement=True:
    # ops run on their assigned device (e.g. a GPU) by default, and fall back
    # to the CPU automatically when an op has no GPU implementation
    sess_config = tf.ConfigProto(
        allow_soft_placement=True,
        log_device_placement=False,
        device_filters=["/job:ps",
                        "/job:worker/task:%d" % FLAGS.task_index])

    # The chief worker (task_index==0) session will prepare the session,
    # while the remaining workers will wait for the preparation to complete.
    if is_chief:
      print("Worker %d: Initializing session..." % FLAGS.task_index)
    else:
      print("Worker %d: Waiting for session to be initialized..."
            % FLAGS.task_index)

    if FLAGS.existing_servers:
      server_grpc_url = "grpc://" + worker_spec[FLAGS.task_index]
      print("Using existing server at: %s" % server_grpc_url)
      # create the TensorFlow session object used to execute the graph;
      # prepare_or_wait_for_session waits until the parameters are initialized
      # and the chief is ready before training starts
      sess = sv.prepare_or_wait_for_session(server_grpc_url,
                                            config=sess_config)
    else:
      sess = sv.prepare_or_wait_for_session(server.target, config=sess_config)

    print("Worker %d: Session initialization complete." % FLAGS.task_index)

    if FLAGS.sync_replicas and is_chief:
      # Chief worker will start the chief queue runner and call the init op.
      sess.run(sync_init_op)
      sv.start_queue_runners(sess, [chief_queue_runner])

    # Perform training
    # run the distributed model training
    time_begin = time.time()
    print("Training begins @ %f" % time_begin)

    local_step = 0
    while True:
      # Training feed
      # read MNIST training data, by default 100 images per batch
      batch_xs, batch_ys = mnist.train.next_batch(FLAGS.batch_size)
      train_feed = {x: batch_xs, y_: batch_ys}

      _, step = sess.run([train_step, global_step], feed_dict=train_feed)
      local_step += 1

      now = time.time()
      print("%f: Worker %d: training step %d done (global step: %d)" %
            (now, FLAGS.task_index, local_step, step))

      if step >= FLAGS.train_steps:
        break

    time_end = time.time()
    print("Training ends @ %f" % time_end)
    training_time = time_end - time_begin
    print("Training elapsed time: %f s" % training_time)

    # Validation feed
    # read the MNIST validation data and compute the validation cross entropy
    val_feed = {x: mnist.validation.images, y_: mnist.validation.labels}
    val_xent = sess.run(cross_entropy, feed_dict=val_feed)
    print("After %d training step(s), validation cross entropy = %g" %
          (FLAGS.train_steps, val_xent))

if __name__ == "__main__":
  tf.app.run()

Reference: 《TensorFlow技术解析与实战》
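The ClusterSpec/Server API described above can be tried out on one machine, even inside one process. A minimal sketch: two tasks of a job named local, one op pinned to each task, and a session connected through task 0's gRPC target:

import tensorflow as tf

cluster = tf.train.ClusterSpec({"local": ["localhost:2222", "localhost:2223"]})
server0 = tf.train.Server(cluster, job_name="local", task_index=0)
server1 = tf.train.Server(cluster, job_name="local", task_index=1)

with tf.device("/job:local/task:0"):
  a = tf.constant(2.0)
with tf.device("/job:local/task:1"):
  b = a * 3.0  # runs on task 1, reading a from task 0

with tf.Session(server0.target) as sess:
  print(sess.run(b))  # 6.0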
Professor Fei-Fei Li of the Stanford AI Lab: realizing artificial intelligence takes three elements, syntax, semantics and inference, in both language and vision. Feed models syntax (parsing language grammar, parsing three-dimensional visual structure) and semantics (the meaning of language; the meaning of objects and actions in vision) as training data, and you obtain inference: the learned ability transfers to real work, drawing conclusions from new data. 《The Syntax, Semantics and Inference Mechanism in Natural Language》 http://www.aaai.org/Papers/Symposia/Fall/1996/FS-96-04/FS96-04-010.pdf .

Image captioning ("show and tell"). Given an image, produce a natural-language description of its content, telling its story: a translation between image information and text information. https://github.com/tensorflow/models/tree/master/research/im2txt .

Principle: an encoder-decoder framework; the image is encoded into a fixed-length intermediate vector and decoded into a natural-language description. The encoder is the Inception V3 image-recognition model; the decoder is an LSTM network. With caption words {s0, s1, …, sN-1} and corresponding word embeddings {wes0, wes1, …, wesN-1}, the LSTM outputs {p1, p2, …, pN}, the probability distribution over the sentence's next word at each step, and {log p1(s1), log p2(s2), …, log pN(sN)}, the log-likelihood of the correct word at each step; the negative of their sum is the model's minimization objective.

Best practice. Microsoft COCO Caption dataset http://mscoco.org/ , Microsoft Common Objects in Context (COCO): over 300,000 images with 2 million labeled object instances. For the 330,000 images of the original COCO dataset, Amazon Mechanical Turk workers wrote at least 5 captions per image, over 1.5 million captions in total. There are 2014 and 2015 releases; the 2014 release has 82,783 training images, 40,504 validation images and 40,775 test images. TensorFlow-Slim image classification library: https://github.com/tensorflow/models/tree/master/research/inception/inception/slim .

Building the model: show_and_tell_model.py.

from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import tensorflow as tf

from im2txt.ops import image_embedding
from im2txt.ops import image_processing
from im2txt.ops import inputs as input_ops

class ShowAndTellModel(object):
  """Image-to-text implementation based on http://arxiv.org/abs/1411.4555.

  "Show and Tell: A Neural Image Caption Generator"
  Oriol Vinyals, Alexander Toshev, Samy Bengio, Dumitru Erhan
  """

  def __init__(self, config, mode, train_inception=False):
    """Basic setup.

    Args:
      config: Object containing configuration parameters.
      mode: "train", "eval" or "inference".
      train_inception: Whether the inception submodel variables are trainable.
    """
    assert mode in ["train", "eval", "inference"]
    self.config = config
    self.mode = mode
    self.train_inception = train_inception

    # Reader for the input data.
    self.reader = tf.TFRecordReader()

    # To match the "Show and Tell" paper we initialize all variables with a
    # random uniform initializer.
    self.initializer = tf.random_uniform_initializer(
        minval=-self.config.initializer_scale,
        maxval=self.config.initializer_scale)

    # A float32 Tensor with shape [batch_size, height, width, channels].
    self.images = None

    # An int32 Tensor with shape [batch_size, padded_length].
    self.input_seqs = None

    # An int32 Tensor with shape [batch_size, padded_length].
    self.target_seqs = None

    # An int32 0/1 Tensor with shape [batch_size, padded_length].
    self.input_mask = None

    # A float32 Tensor with shape [batch_size, embedding_size].
    self.image_embeddings = None

    # A float32 Tensor with shape [batch_size, padded_length, embedding_size].
    self.seq_embeddings = None

    # A float32 scalar Tensor; the total loss for the trainer to optimize.
    self.total_loss = None

    # A float32 Tensor with shape [batch_size * padded_length].
    self.target_cross_entropy_losses = None

    # A float32 Tensor with shape [batch_size * padded_length].
    self.target_cross_entropy_loss_weights = None

    # Collection of variables from the inception submodel.
    self.inception_variables = []

    # Function to restore the inception submodel from checkpoint.
    self.init_fn = None

    # Global step Tensor.
    self.global_step = None

  def is_training(self):
    """Returns true if the model is built for training mode."""
    return self.mode == "train"

  def process_image(self, encoded_image, thread_id=0):
    """Decodes and processes an image string.

    Args:
      encoded_image: A scalar string Tensor; the encoded image.
      thread_id: Preprocessing thread id used to select the ordering of color
        distortions.
    Returns:
      A float32 Tensor of shape [height, width, 3]; the processed image.
    """
    return image_processing.process_image(encoded_image,
                                          is_training=self.is_training(),
                                          height=self.config.image_height,
                                          width=self.config.image_width,
                                          thread_id=thread_id,
                                          image_format=self.config.image_format)

  def build_inputs(self):
    """Input prefetching, preprocessing and batching.

    Outputs:
      self.images
      self.input_seqs
      self.target_seqs (training and eval only)
      self.input_mask (training and eval only)
    """
    if self.mode == "inference":
      # In inference mode, images and inputs are fed via placeholders.
      image_feed = tf.placeholder(dtype=tf.string, shape=[], name="image_feed")
      input_feed = tf.placeholder(dtype=tf.int64,
                                  shape=[None],  # batch_size
                                  name="input_feed")

      # Process image and insert batch dimensions.
      images = tf.expand_dims(self.process_image(image_feed), 0)
      input_seqs = tf.expand_dims(input_feed, 1)

      # No target sequences or input mask in inference mode.
      target_seqs = None
      input_mask = None
    else:
      # Prefetch serialized SequenceExample protos.
      input_queue = input_ops.prefetch_input_data(
          self.reader,
          self.config.input_file_pattern,
          is_training=self.is_training(),
          batch_size=self.config.batch_size,
          values_per_shard=self.config.values_per_input_shard,
          input_queue_capacity_factor=self.config.input_queue_capacity_factor,
          num_reader_threads=self.config.num_input_reader_threads)

      # Image processing and random distortion. Split across multiple threads
      # with each thread applying a slightly different distortion.
      assert self.config.num_preprocess_threads % 2 == 0
      images_and_captions = []
      for thread_id in range(self.config.num_preprocess_threads):
        serialized_sequence_example = input_queue.dequeue()
        encoded_image, caption = input_ops.parse_sequence_example(
            serialized_sequence_example,
            image_feature=self.config.image_feature_name,
            caption_feature=self.config.caption_feature_name)
        image = self.process_image(encoded_image, thread_id=thread_id)
        images_and_captions.append([image, caption])

      # Batch inputs.
      queue_capacity = (2 * self.config.num_preprocess_threads *
                        self.config.batch_size)
      images, input_seqs, target_seqs, input_mask = (
          input_ops.batch_with_dynamic_pad(images_and_captions,
                                           batch_size=self.config.batch_size,
                                           queue_capacity=queue_capacity))

    self.images = images
    self.input_seqs = input_seqs
    self.target_seqs = target_seqs
    self.input_mask = input_mask

  def build_image_embeddings(self):
    """Builds the image model subgraph and generates image embeddings.

    Inputs:
      self.images

    Outputs:
      self.image_embeddings
    """
    inception_output = image_embedding.inception_v3(
        self.images,
        trainable=self.train_inception,
        is_training=self.is_training())
    self.inception_variables = tf.get_collection(
        tf.GraphKeys.GLOBAL_VARIABLES, scope="InceptionV3")

    # Map inception output into embedding space.
    with tf.variable_scope("image_embedding") as scope:
      image_embeddings = tf.contrib.layers.fully_connected(
          inputs=inception_output,
          num_outputs=self.config.embedding_size,
          activation_fn=None,
          weights_initializer=self.initializer,
          biases_initializer=None,
          scope=scope)

    # Save the embedding size in the graph.
    tf.constant(self.config.embedding_size, name="embedding_size")

    self.image_embeddings = image_embeddings

  def build_seq_embeddings(self):
    """Builds the input sequence embeddings.
    Inputs:
      self.input_seqs

    Outputs:
      self.seq_embeddings
    """
    with tf.variable_scope("seq_embedding"), tf.device("/cpu:0"):
      embedding_map = tf.get_variable(
          name="map",
          shape=[self.config.vocab_size, self.config.embedding_size],
          initializer=self.initializer)
      seq_embeddings = tf.nn.embedding_lookup(embedding_map, self.input_seqs)

    self.seq_embeddings = seq_embeddings

  def build_model(self):
    """Builds the model.

    Inputs:
      self.image_embeddings
      self.seq_embeddings
      self.target_seqs (training and eval only)
      self.input_mask (training and eval only)

    Outputs:
      self.total_loss (training and eval only)
      self.target_cross_entropy_losses (training and eval only)
      self.target_cross_entropy_loss_weights (training and eval only)
    """
    # This LSTM cell has biases and outputs tanh(new_c) * sigmoid(o), but the
    # modified LSTM in the "Show and Tell" paper has no biases and outputs
    # new_c * sigmoid(o).
    lstm_cell = tf.contrib.rnn.BasicLSTMCell(
        num_units=self.config.num_lstm_units, state_is_tuple=True)
    if self.mode == "train":
      lstm_cell = tf.contrib.rnn.DropoutWrapper(
          lstm_cell,
          input_keep_prob=self.config.lstm_dropout_keep_prob,
          output_keep_prob=self.config.lstm_dropout_keep_prob)

    with tf.variable_scope("lstm", initializer=self.initializer) as lstm_scope:
      # Feed the image embeddings to set the initial LSTM state.
      zero_state = lstm_cell.zero_state(
          batch_size=self.image_embeddings.get_shape()[0], dtype=tf.float32)
      _, initial_state = lstm_cell(self.image_embeddings, zero_state)

      # Allow the LSTM variables to be reused.
      lstm_scope.reuse_variables()

      if self.mode == "inference":
        # In inference mode, use concatenated states for convenient feeding and
        # fetching.
        tf.concat(axis=1, values=initial_state, name="initial_state")

        # Placeholder for feeding a batch of concatenated states.
        state_feed = tf.placeholder(dtype=tf.float32,
                                    shape=[None, sum(lstm_cell.state_size)],
                                    name="state_feed")
        state_tuple = tf.split(value=state_feed, num_or_size_splits=2, axis=1)

        # Run a single LSTM step.
        lstm_outputs, state_tuple = lstm_cell(
            inputs=tf.squeeze(self.seq_embeddings, axis=[1]),
            state=state_tuple)

        # Concatentate the resulting state.
        tf.concat(axis=1, values=state_tuple, name="state")
      else:
        # Run the batch of sequence embeddings through the LSTM.
        sequence_length = tf.reduce_sum(self.input_mask, 1)
        lstm_outputs, _ = tf.nn.dynamic_rnn(cell=lstm_cell,
                                            inputs=self.seq_embeddings,
                                            sequence_length=sequence_length,
                                            initial_state=initial_state,
                                            dtype=tf.float32,
                                            scope=lstm_scope)

    # Stack batches vertically.
    lstm_outputs = tf.reshape(lstm_outputs, [-1, lstm_cell.output_size])

    with tf.variable_scope("logits") as logits_scope:
      logits = tf.contrib.layers.fully_connected(
          inputs=lstm_outputs,
          num_outputs=self.config.vocab_size,
          activation_fn=None,
          weights_initializer=self.initializer,
          scope=logits_scope)

    if self.mode == "inference":
      tf.nn.softmax(logits, name="softmax")
    else:
      targets = tf.reshape(self.target_seqs, [-1])
      weights = tf.to_float(tf.reshape(self.input_mask, [-1]))

      # Compute losses.
      losses = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=targets,
                                                              logits=logits)
      batch_loss = tf.div(tf.reduce_sum(tf.multiply(losses, weights)),
                          tf.reduce_sum(weights),
                          name="batch_loss")
      tf.losses.add_loss(batch_loss)
      total_loss = tf.losses.get_total_loss()

      # Add summaries.
      tf.summary.scalar("losses/batch_loss", batch_loss)
      tf.summary.scalar("losses/total_loss", total_loss)
      for var in tf.trainable_variables():
        tf.summary.histogram("parameters/" + var.op.name, var)

      self.total_loss = total_loss
      self.target_cross_entropy_losses = losses  # Used in evaluation.
      self.target_cross_entropy_loss_weights = weights  # Used in evaluation.

  def setup_inception_initializer(self):
    """Sets up the function to restore inception variables from checkpoint."""
    if self.mode != "inference":
      # Restore inception variables only.
      saver = tf.train.Saver(self.inception_variables)

      def restore_fn(sess):
        tf.logging.info("Restoring Inception variables from checkpoint file %s",
                        self.config.inception_checkpoint_file)
        saver.restore(sess, self.config.inception_checkpoint_file)

      self.init_fn = restore_fn

  def setup_global_step(self):
    """Sets up the global step Tensor."""
    global_step = tf.Variable(
        initial_value=0,
        name="global_step",
        trainable=False,
        collections=[tf.GraphKeys.GLOBAL_STEP, tf.GraphKeys.GLOBAL_VARIABLES])
    self.global_step = global_step

  def build(self):
    """Creates all ops for training and evaluation."""
    # build the model
    self.build_inputs()                 # build the input data
    self.build_image_embeddings()       # image model: Inception V3, outputs image embedding vectors
    self.build_seq_embeddings()         # build the input-sequence embeddings
    self.build_model()                  # chain the CNN and the LSTM into the full model
    self.setup_inception_initializer()  # load the pretrained Inception V3 model
    self.setup_global_step()            # track the global iteration count

Training the model: train.py.

from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import tensorflow as tf

from im2txt import configuration
from im2txt import show_and_tell_model

FLAGS = tf.app.flags.FLAGS

tf.flags.DEFINE_string("input_file_pattern", "",
                       "File pattern of sharded TFRecord input files.")
tf.flags.DEFINE_string("inception_checkpoint_file", "",
                       "Path to a pretrained inception_v3 model.")
tf.flags.DEFINE_string("train_dir", "",
                       "Directory for saving and loading model checkpoints.")
tf.flags.DEFINE_boolean("train_inception", False,
                        "Whether to train inception submodel variables.")
tf.flags.DEFINE_integer("number_of_steps", 1000000, "Number of training steps.")
tf.flags.DEFINE_integer("log_every_n_steps", 1,
                        "Frequency at which loss and global step are logged.")

tf.logging.set_verbosity(tf.logging.INFO)

def main(unused_argv):
  assert FLAGS.input_file_pattern, "--input_file_pattern is required"
  assert FLAGS.train_dir, "--train_dir is required"

  model_config = configuration.ModelConfig()
  model_config.input_file_pattern = FLAGS.input_file_pattern
  model_config.inception_checkpoint_file = FLAGS.inception_checkpoint_file
  training_config = configuration.TrainingConfig()

  # Create training directory.
  train_dir = FLAGS.train_dir
  if not tf.gfile.IsDirectory(train_dir):
    tf.logging.info("Creating training directory: %s", train_dir)
    tf.gfile.MakeDirs(train_dir)

  # Build the TensorFlow graph.
  g = tf.Graph()
  with g.as_default():
    # Build the model.
    model = show_and_tell_model.ShowAndTellModel(
        model_config, mode="train", train_inception=FLAGS.train_inception)
    model.build()

    # Set up the learning rate.
    learning_rate_decay_fn = None
    if FLAGS.train_inception:
      learning_rate = tf.constant(training_config.train_inception_learning_rate)
    else:
      learning_rate = tf.constant(training_config.initial_learning_rate)
      if training_config.learning_rate_decay_factor > 0:
        num_batches_per_epoch = (training_config.num_examples_per_epoch /
                                 model_config.batch_size)
        decay_steps = int(num_batches_per_epoch *
                          training_config.num_epochs_per_decay)

        def _learning_rate_decay_fn(learning_rate, global_step):
          return tf.train.exponential_decay(
              learning_rate,
              global_step,
              decay_steps=decay_steps,
              decay_rate=training_config.learning_rate_decay_factor,
              staircase=True)

        learning_rate_decay_fn = _learning_rate_decay_fn

    # Set up the training ops.
    train_op = tf.contrib.layers.optimize_loss(
        loss=model.total_loss,
        global_step=model.global_step,
        learning_rate=learning_rate,
        optimizer=training_config.optimizer,
        clip_gradients=training_config.clip_gradients,
        learning_rate_decay_fn=learning_rate_decay_fn)

    # Set up the Saver for saving and restoring model checkpoints.
    saver = tf.train.Saver(max_to_keep=training_config.max_checkpoints_to_keep)

  # Run training.
  tf.contrib.slim.learning.train(
      train_op,
      train_dir,
      log_every_n_steps=FLAGS.log_every_n_steps,
      graph=g,
      global_step=model.global_step,
      number_of_steps=FLAGS.number_of_steps,
      init_fn=model.init_fn,
      saver=saver)

if __name__ == "__main__":
  tf.app.run()

Generating captions with the trained model: run_inference.py.

from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import math
import os

import tensorflow as tf

from im2txt import configuration
from im2txt import inference_wrapper
from im2txt.inference_utils import caption_generator
from im2txt.inference_utils import vocabulary

FLAGS = tf.flags.FLAGS

tf.flags.DEFINE_string("checkpoint_path", "",
                       "Model checkpoint file or directory containing a "
                       "model checkpoint file.")
tf.flags.DEFINE_string("vocab_file", "", "Text file containing the vocabulary.")
tf.flags.DEFINE_string("input_files", "",
                       "File pattern or comma-separated list of file patterns "
                       "of image files.")

tf.logging.set_verbosity(tf.logging.INFO)

def main(_):
  # Build the inference graph.
  g = tf.Graph()
  with g.as_default():
    model = inference_wrapper.InferenceWrapper()
    restore_fn = model.build_graph_from_config(configuration.ModelConfig(),
                                               FLAGS.checkpoint_path)
  g.finalize()

  # Create the vocabulary.
  vocab = vocabulary.Vocabulary(FLAGS.vocab_file)

  filenames = []
  for file_pattern in FLAGS.input_files.split(","):
    filenames.extend(tf.gfile.Glob(file_pattern))
  tf.logging.info("Running caption generation on %d files matching %s",
                  len(filenames), FLAGS.input_files)

  with tf.Session(graph=g) as sess:
    # Load the model from checkpoint.
    restore_fn(sess)

    # Prepare the caption generator. Here we are implicitly using the default
    # beam search parameters. See caption_generator.py for a description of the
    # available beam search parameters.
    generator = caption_generator.CaptionGenerator(model, vocab)

    for filename in filenames:
      with tf.gfile.GFile(filename, "r") as f:
        image = f.read()
      captions = generator.beam_search(sess, image)
      print("Captions for image %s:" % os.path.basename(filename))
      for i, caption in enumerate(captions):
        # Ignore begin and end words.
        sentence = [vocab.id_to_word(w) for w in caption.sentence[1:-1]]
        sentence = " ".join(sentence)
        print("  %d) %s (p=%f)" % (i, sentence, math.exp(caption.logprob)))

if __name__ == "__main__":
  tf.app.run()

Reference: 《TensorFlow技术解析与实战》
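An aside on the loss described earlier ({log p1(s1), …, log pN(sN)}, negated and summed): the toy recomputation below mirrors the weighted batch_loss of build_model with made-up numbers, four flattened time steps, a 3-word vocabulary, and the last step masked out as padding:

import tensorflow as tf

logits = tf.constant([[2.0, 0.5, 0.1],
                      [0.3, 1.9, 0.2],
                      [0.1, 0.4, 2.2],
                      [0.0, 0.0, 0.0]])       # last row is a padding step
targets = tf.constant([0, 1, 2, 0])
weights = tf.constant([1.0, 1.0, 1.0, 0.0])   # mask: padding gets weight 0

# -log p(correct word) per step, then a weighted average over the real steps
losses = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=targets,
                                                        logits=logits)
batch_loss = tf.div(tf.reduce_sum(tf.multiply(losses, weights)),
                    tf.reduce_sum(weights))

with tf.Session() as sess:
  print(sess.run(batch_loss))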
Natural language processing covers speech processing and text processing. Speech recognition lets computers "understand" human speech and extract the textual information it carries.

Japan's Fukoku Mutual Life Insurance spent 1.7 million dollars installing an AI system that converts customers' speech to text and analyzes whether the words are positive or negative. Intelligent customer service is a research focus of AI companies. The tool: recurrent neural network (RNN) models.

Choosing a model. In the usual diagram, each rectangle is a vector and each arrow a function; the bottom row holds input vectors, the top row output vectors, the middle row the RNN state. One-to-one: no RNN needed, e.g. the Vanilla model, fixed-size input to fixed-size output (image classification). One-to-many: sequence output, image captioning: one image in, a word sequence out; CNN and RNN combined, vision plus language. Many-to-one: sequence input, sentiment analysis: classify a text as positive or negative, e.g. classifying Taobao product reviews, typically with an LSTM. Many-to-many (asynchronous): sequence input and sequence output, machine translation: an RNN reads an English sentence and emits it in French. Many-to-many (synchronous): video classification, labelling every frame. In the middle, the RNN state part is fixed and can be applied repeatedly, so there is no need to constrain the sequence length in advance. Andrej Karpathy, 《The Unreasonable Effectiveness of Recurrent Neural Networks》 http://karpathy.github.io/2015/05/21/rnn-effectiveness/ . NLP spans speech synthesis (text to speech), speech recognition, voiceprint recognition (voiceprint authentication) and text processing (word segmentation, sentiment analysis, text mining).

Spoken-digit recognition in English. https://github.com/pannous/tensorflow-speech-recognition/blob/master/speech2text-tflearn.py : a super-simple speech recognizer in 20 lines of Python, an LSTM recurrent network trained with TFLearn on an English spoken-digit dataset. The spoken numbers pcm dataset http://pannous.net/spoken_numbers.tar holds audio of the digits 0-9 read in English by several speakers, male and female; each clip (wav file) contains the sound of exactly one digit. Naming scheme: {digit}_{speaker}_xxx.

Define and preprocess the input data: turn the speech into matrix form, Mel-frequency cepstral coefficient (MFCC) feature vectors. Frame the speech, take logarithms, invert; the resulting MFCCs represent the speech features.

Define the network model: an LSTM.

Train the model and save it.

Predict: feed any audio file to the model for prediction.

Speech recognition can be applied to smart input methods, fast meeting transcription, voice control systems and smart homes.

#!/usr/bin/env python
# -*- coding: utf-8 -*-
from __future__ import division, print_function, absolute_import

import tflearn
import speech_data

learning_rate = 0.0001
training_iters = 300000  # steps
batch_size = 64

width = 20    # mfcc features
height = 80   # (max) length of utterance
classes = 10  # digits

# generate batches of MFCC features
batch = word_batch = speech_data.mfcc_batch_generator(batch_size)
X, Y = next(batch)
trainX, trainY = X, Y
testX, testY = X, Y  # overfit for now

# Data preprocessing
# Sequence padding
# trainX = pad_sequences(trainX, maxlen=100, value=0.)
# testX = pad_sequences(testX, maxlen=100, value=0.)
# Converting labels to binary vectors
# trainY = to_categorical(trainY, nb_classes=2)
# testY = to_categorical(testY, nb_classes=2)

# Network building
# the LSTM model
net = tflearn.input_data([None, width, height])
# net = tflearn.embedding(net, input_dim=10000, output_dim=128)
net = tflearn.lstm(net, 128, dropout=0.8)
net = tflearn.fully_connected(net, classes, activation='softmax')
net = tflearn.regression(net, optimizer='adam', learning_rate=learning_rate,
                         loss='categorical_crossentropy')

# Training
model = tflearn.DNN(net, tensorboard_verbose=0)
model.load("tflearn.lstm.model")  # load a previously saved model (comment out on the first run)
while 1:  # training_iters
    model.fit(trainX, trainY, n_epoch=100,
              validation_set=(testX, testY),
              show_metric=True, batch_size=batch_size)
    _y = model.predict(X)
    model.save("tflearn.lstm.model")
    print(_y)
    print(Y)

Intelligent chatbots. The future direction is natural-language human-machine interaction: Apple Siri, Microsoft Cortana and XiaoIce, Google Now, Baidu Duer, Alexa (the voice assistant built into Amazon's Echo speaker), Facebook's voice assistant M. By talking with a "voice robot", users are guided to the right service; going forward, this will be embedded in smart hardware and smart homes. Three generations of chatbot technology: first generation, feature engineering with large amounts of hand-written logic; second generation, a retrieval library: given a question or chat turn, find the best-matching existing answer in the library; third generation, deep learning: a seq2seq+Attention model that, after extensive training, generates the output from the input.

The seq2seq+Attention model: principle and construction. A translation model translates one sequence into another: two RNN language models, one acting as encoder, one as decoder, form an RNN encoder-decoder. Text processing commonly uses this encoder-decoder framework: input -> encoder -> semantic code C -> decoder -> output. It is a generic model for generating a target from a context. For a sentence pair, given input sentence X the framework generates target sentence Y. X and Y can be different languages (machine translation), a question and its answer (chatbot), or a picture and its description (image captioning). X is a word sequence x1, x2, …, and Y a word sequence y1, y2, …. The encoder encodes the input X into an intermediate semantic code C; the decoder decodes C and, at each step i, combines the already-generated history y1, y2, …, yi-1 to produce yi. Every generated word uses the same intermediate code C, so short sentences come out apt while long sentences drift away from the meaning. Practical chat systems implement encoder and decoder with RNN or LSTM models; once sentences exceed roughly 30 words, LSTM quality drops sharply, and the Attention model is introduced to improve long sentences. The Attention mechanism mirrors human attention: when doing one thing, focus on it and ignore the surroundings; the key positions in the source sentence that matter for the word being generated get higher weight, producing more accurate replies. The encoder-decoder framework with Attention: input -> encoder -> semantic codes C1, C2, C3 -> decoder -> outputs Y1, Y2, Y3; the intermediate semantic code Ci keeps changing, producing more accurate Yi.

Best practice. https://github.com/suriyadeepan/easy_seq2seq , which depends on a TensorFlow 0.12.1 environment. Cornell Movie Dialogs Corpus http://www.cs.cornell.edu/~cristian/Cornell_Movie-Dialogs_Corpus.html : dialogue from 600 movies.

Preparing the chat data. First split the dataset into "question" and "answer" files, producing .enc (question) and .dec (answer) files: test.dec (test-set answers), test.enc (test-set questions), train.dec (training-set answers), train.enc (training-set questions). Then build vocabularies and convert the questions and answers into id form; each vocabulary file holds 20,000 words: vocab20000.dec (answer vocabulary), vocab20000.enc (question vocabulary). _GO, _EOS, _UNK and _PAD are special seq2seq markers used to pad the dialogue: _GO marks the start of decoder input, _EOS marks the end, _UNK replaces characters missing from the vocabulary (rare words), and _PAD pads sequences so that all sequences of a batch have the same length. The converted id files are test.enc.ids20000, train.dec.ids20000 and train.enc.ids20000; each line is one question or answer, and each id on a line is the word at that position.

Train with the encoder-decoder framework.

Define the training parameters: seq2seq.ini.

[strings]
# Mode : train, test, serve
mode = train
train_enc = data/train.enc
train_dec = data/train.dec
test_enc = data/test.enc
test_dec = data/test.dec
# folder where checkpoints, vocabulary, temporary data will be stored
working_directory = working_dir/
[ints]
# vocabulary size
# 20,000 is a reasonable size
enc_vocab_size = 20000
dec_vocab_size = 20000
# number of LSTM layers : 1/2/3
num_layers = 3
# typical options : 128, 256, 512, 1024
layer_size = 256
# dataset size limit; typically none : no limit
max_train_data_size = 0
batch_size = 64
# steps per checkpoint
# Note : At a checkpoint, models parameters are saved, model is evaluated
# and results are printed
steps_per_checkpoint = 300
[floats]
learning_rate = 0.5
learning_rate_decay_factor = 0.99
max_gradient_norm = 5.0

Defining the network model: seq2seq_model.py, for TensorFlow 0.12. It defines the seq2seq+Attention model class with three functions, following 《Grammar as a Foreign Language》 http://arxiv.org/abs/1412.7499 : model initialization (__init__), one training step (step), and fetching the next training batch (get_batch).

from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import random

import numpy as np
from six.moves import xrange  # pylint: disable=redefined-builtin
import tensorflow as tf

from tensorflow.models.rnn.translate import data_utils

class Seq2SeqModel(object):

  def __init__(self, source_vocab_size, target_vocab_size, buckets, size,
               num_layers, max_gradient_norm, batch_size, learning_rate,
               learning_rate_decay_factor, use_lstm=False,
               num_samples=512, forward_only=False):
    """Build the model.

    Args:
      source_vocab_size: size of the source (question) vocabulary.
      target_vocab_size: size of the target (answer) vocabulary.
      buckets: a list of pairs (I, O), where I specifies maximum input length
        that will be processed in that bucket, and O specifies maximum output
        length. Training instances that have inputs longer than I or outputs
        longer than O will be pushed to the next bucket and padded accordingly.
        We assume that the list is sorted, e.g., [(2, 4), (8, 16)].
      size: number of units in each layer of the model.
      num_layers: number of layers in the model.
      max_gradient_norm: gradients will be clipped to maximally this norm.
      batch_size: the size of the batches used during training;
        the model construction is independent of batch_size, so it can be
        changed after initialization if this is convenient, e.g., for decoding.
      learning_rate: learning rate to start with.
      learning_rate_decay_factor: decay learning rate by this much when needed.
      use_lstm: if true, we use LSTM cells instead of GRU cells.
      num_samples: number of samples for sampled softmax.
      forward_only: if set, we do not construct the backward pass in the model.
    """
    self.source_vocab_size = source_vocab_size
    self.target_vocab_size = target_vocab_size
    self.buckets = buckets
    self.batch_size = batch_size
    self.learning_rate = tf.Variable(float(learning_rate), trainable=False)
    self.learning_rate_decay_op = self.learning_rate.assign(
        self.learning_rate * learning_rate_decay_factor)
    self.global_step = tf.Variable(0, trainable=False)

    # If we use sampled softmax, we need an output projection.
    output_projection = None
    softmax_loss_function = None
    # Sampled softmax only makes sense if we sample less than vocabulary size.
    if num_samples > 0 and num_samples < self.target_vocab_size:
      w = tf.get_variable("proj_w", [size, self.target_vocab_size])
      w_t = tf.transpose(w)
      b = tf.get_variable("proj_b", [self.target_vocab_size])
      output_projection = (w, b)

      def sampled_loss(inputs, labels):
        labels = tf.reshape(labels, [-1, 1])
        return tf.nn.sampled_softmax_loss(w_t, b, inputs, labels, num_samples,
                                          self.target_vocab_size)
      softmax_loss_function = sampled_loss

    # Create the internal multi-layer cell for our RNN.
    single_cell = tf.nn.rnn_cell.GRUCell(size)
    if use_lstm:
      single_cell = tf.nn.rnn_cell.BasicLSTMCell(size)
    cell = single_cell
    cell = tf.nn.rnn_cell.DropoutWrapper(cell, output_keep_prob=0.5)
    if num_layers > 1:
      cell = tf.nn.rnn_cell.MultiRNNCell([single_cell] * num_layers)

    # The seq2seq function: we use embedding for the input and attention.
    def seq2seq_f(encoder_inputs, decoder_inputs, do_decode):
      return tf.nn.seq2seq.embedding_attention_seq2seq(
          encoder_inputs,
          decoder_inputs,
          cell,
          num_encoder_symbols=source_vocab_size,
          num_decoder_symbols=target_vocab_size,
          embedding_size=size,
          output_projection=output_projection,
          feed_previous=do_decode)

    # Feeds for inputs.
    self.encoder_inputs = []
    self.decoder_inputs = []
    self.target_weights = []
    for i in xrange(buckets[-1][0]):  # Last bucket is the biggest one.
      self.encoder_inputs.append(tf.placeholder(tf.int32, shape=[None],
                                                name="encoder{0}".format(i)))
    for i in xrange(buckets[-1][1] + 1):
      self.decoder_inputs.append(tf.placeholder(tf.int32, shape=[None],
                                                name="decoder{0}".format(i)))
      self.target_weights.append(tf.placeholder(tf.float32, shape=[None],
                                                name="weight{0}".format(i)))

    # Our targets are decoder inputs shifted by one.
    targets = [self.decoder_inputs[i + 1]
               for i in xrange(len(self.decoder_inputs) - 1)]

    # Training outputs and losses.
    if forward_only:
      self.outputs, self.losses = tf.nn.seq2seq.model_with_buckets(
          self.encoder_inputs, self.decoder_inputs, targets,
          self.target_weights, buckets, lambda x, y: seq2seq_f(x, y, True),
          softmax_loss_function=softmax_loss_function)
      # If we use output projection, we need to project outputs for decoding.
      if output_projection is not None:
        for b in xrange(len(buckets)):
          self.outputs[b] = [
              tf.matmul(output, output_projection[0]) + output_projection[1]
              for output in self.outputs[b]
          ]
    else:
      self.outputs, self.losses = tf.nn.seq2seq.model_with_buckets(
          self.encoder_inputs, self.decoder_inputs, targets,
          self.target_weights, buckets,
          lambda x, y: seq2seq_f(x, y, False),
          softmax_loss_function=softmax_loss_function)

    # Gradients and SGD update operation for training the model.
    params = tf.trainable_variables()
    if not forward_only:
      self.gradient_norms = []
      self.updates = []
      opt = tf.train.AdamOptimizer()
      for b in xrange(len(buckets)):
        gradients = tf.gradients(self.losses[b], params)
        clipped_gradients, norm = tf.clip_by_global_norm(gradients,
                                                         max_gradient_norm)
        self.gradient_norms.append(norm)
        self.updates.append(opt.apply_gradients(
            zip(clipped_gradients, params), global_step=self.global_step))

    self.saver = tf.train.Saver(tf.global_variables())

  def step(self, session, encoder_inputs, decoder_inputs, target_weights,
           bucket_id, forward_only):
    """Run a step of the model feeding the given inputs.

    Args:
      session: tensorflow session to use.
      encoder_inputs: list of numpy int vectors (the question) to feed as
        encoder inputs.
      decoder_inputs: list of numpy int vectors (the answer) to feed as
        decoder inputs.
      target_weights: list of numpy float vectors to feed as target weights.
      bucket_id: which bucket of the model to use.
      forward_only: whether to do the backward step or only forward.

    Returns:
      A triple consisting of gradient norm (or None if we did not do backward),
      average perplexity, and the outputs.

    Raises:
      ValueError: if length of encoder_inputs, decoder_inputs, or
        target_weights disagrees with bucket size for the specified bucket_id.
    """
    # Check if the sizes match.
    encoder_size, decoder_size = self.buckets[bucket_id]
    if len(encoder_inputs) != encoder_size:
      raise ValueError("Encoder length must be equal to the one in bucket,"
                       " %d != %d." % (len(encoder_inputs), encoder_size))
    if len(decoder_inputs) != decoder_size:
      raise ValueError("Decoder length must be equal to the one in bucket,"
                       " %d != %d." % (len(decoder_inputs), decoder_size))
    if len(target_weights) != decoder_size:
      raise ValueError("Weights length must be equal to the one in bucket,"
                       " %d != %d." % (len(target_weights), decoder_size))

    # Input feed: encoder inputs, decoder inputs, target_weights, as provided.
    input_feed = {}
    for l in xrange(encoder_size):
      input_feed[self.encoder_inputs[l].name] = encoder_inputs[l]
    for l in xrange(decoder_size):
      input_feed[self.decoder_inputs[l].name] = decoder_inputs[l]
      input_feed[self.target_weights[l].name] = target_weights[l]

    # Since our targets are decoder inputs shifted by one, we need one more.
    last_target = self.decoder_inputs[decoder_size].name
    input_feed[last_target] = np.zeros([self.batch_size], dtype=np.int32)

    # Output feed: depends on whether we do a backward step or not.
    if not forward_only:
      output_feed = [self.updates[bucket_id],         # Update Op that does SGD.
                     self.gradient_norms[bucket_id],  # Gradient norm.
                     self.losses[bucket_id]]          # Loss for this batch.
    else:
      output_feed = [self.losses[bucket_id]]  # Loss for this batch.
      for l in xrange(decoder_size):          # Output logits.
        output_feed.append(self.outputs[bucket_id][l])

    outputs = session.run(output_feed, input_feed)
    if not forward_only:
      # With the backward pass: gradient norm, loss, no outputs.
      return outputs[1], outputs[2], None
    else:
      # Forward only: no gradient norm, loss, outputs.
      return None, outputs[0], outputs[1:]

  def get_batch(self, data, bucket_id):
    """Get a random batch of data from the specified bucket, used at each
    training step.

    Args:
      data: a tuple of size len(self.buckets) in which each element contains
        lists of pairs of input and output data that we use to create a batch.
      bucket_id: integer, which bucket to get the batch for.

    Returns:
      The triple (encoder_inputs, decoder_inputs, target_weights) for
      the constructed batch that has the proper format to call step(...) later.
    """
    encoder_size, decoder_size = self.buckets[bucket_id]
    encoder_inputs, decoder_inputs = [], []

    # Get a random batch of encoder and decoder inputs from data,
    # pad them if needed, reverse encoder inputs and add GO to decoder.
    for _ in xrange(self.batch_size):
      encoder_input, decoder_input = random.choice(data[bucket_id])

      # Encoder inputs are padded and then reversed.
      encoder_pad = [data_utils.PAD_ID] * (encoder_size - len(encoder_input))
      encoder_inputs.append(list(reversed(encoder_input + encoder_pad)))

      # Decoder inputs get an extra "GO" symbol, and are padded then.
      decoder_pad_size = decoder_size - len(decoder_input) - 1
      decoder_inputs.append([data_utils.GO_ID] + decoder_input +
                            [data_utils.PAD_ID] * decoder_pad_size)

    # Now we create batch-major vectors from the data selected above.
    batch_encoder_inputs, batch_decoder_inputs, batch_weights = [], [], []

    # Batch encoder inputs are just re-indexed encoder_inputs.
    for length_idx in xrange(encoder_size):
      batch_encoder_inputs.append(
          np.array([encoder_inputs[batch_idx][length_idx]
                    for batch_idx in xrange(self.batch_size)], dtype=np.int32))

    # Batch decoder inputs are re-indexed decoder_inputs, we create weights.
    for length_idx in xrange(decoder_size):
      batch_decoder_inputs.append(
          np.array([decoder_inputs[batch_idx][length_idx]
                    for batch_idx in xrange(self.batch_size)], dtype=np.int32))

      # Create target_weights to be 0 for targets that are padding.
      batch_weight = np.ones(self.batch_size, dtype=np.float32)
      for batch_idx in xrange(self.batch_size):
        # We set weight to 0 if the corresponding target is a PAD symbol.
        # The corresponding target is decoder_input shifted by 1 forward.
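        # A worked illustration (added here, not from the original source):
        # with bucket (5, 10), PAD_ID=0, GO_ID=1 and a sample pair
        # ([4, 6], [8, 10, 12]) (ignoring the EOS id appended by read_data):
        #   encoder input  -> reversed([4, 6] + [0, 0, 0]) = [0, 0, 0, 6, 4]
        #   decoder input  -> [GO, 8, 10, 12, 0, 0, 0, 0, 0, 0]
        #   targets (decoder inputs shifted by one) -> [8, 10, 12, 0, ...]
        #   target_weights -> [1, 1, 1, 0, ...]: padded positions contribute
        #   nothing to the loss.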
      if length_idx < decoder_size - 1:
        target = decoder_inputs[batch_idx][length_idx + 1]
      if length_idx == decoder_size - 1 or target == data_utils.PAD_ID:
        batch_weight[batch_idx] = 0.0
    batch_weights.append(batch_weight)
  return batch_encoder_inputs, batch_decoder_inputs, batch_weights

Training the model: set mode to "train" in seq2seq.ini and run execute.py. Validating the model: set mode to "test" in seq2seq.ini and run execute.py again.
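For reference, get_config() in execute.py (below) reads [strings], [ints] and [floats] sections from the ini file. The following is an illustrative seq2seq.ini: the keys match what the code looks up in gConfig, but the values and paths are placeholders, not taken from the source (the optional pretrained_model key is omitted):

[strings]
mode = train
train_enc = data/train.enc
train_dec = data/train.dec
test_enc = data/test.enc
test_dec = data/test.dec
working_directory = working_dir/

[ints]
enc_vocab_size = 20000
dec_vocab_size = 20000
num_layers = 3
layer_size = 256
max_train_data_size = 0
batch_size = 64
steps_per_checkpoint = 300

[floats]
learning_rate = 0.5
learning_rate_decay_factor = 0.99
max_gradient_norm = 5.0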
""" data_set = [[] for _ in _buckets] with tf.gfile.GFile(source_path, mode="r") as source_file: with tf.gfile.GFile(target_path, mode="r") as target_file: source, target = source_file.readline(), target_file.readline() counter = 0 while source and target and (not max_size or counter < max_size): counter += 1 if counter % 100000 == 0: print(" reading data line %d" % counter) sys.stdout.flush() source_ids = [int(x) for x in source.split()] target_ids = [int(x) for x in target.split()] target_ids.append(data_utils.EOS_ID) for bucket_id, (source_size, target_size) in enumerate(_buckets): if len(source_ids) < source_size and len(target_ids) < target_size: data_set[bucket_id].append([source_ids, target_ids]) break source, target = source_file.readline(), target_file.readline() return data_set def create_model(session, forward_only): """Create model and initialize or load parameters""" model = seq2seq_model.Seq2SeqModel( gConfig['enc_vocab_size'], gConfig['dec_vocab_size'], _buckets, gConfig['layer_size'], gConfig['num_layers'], gConfig['max_gradient_norm'], gConfig['batch_size'], gConfig['learning_rate'], gConfig['learning_rate_decay_factor'], forward_only=forward_only) if 'pretrained_model' in gConfig: model.saver.restore(session,gConfig['pretrained_model']) return model ckpt = tf.train.get_checkpoint_state(gConfig['working_directory']) if ckpt and ckpt.model_checkpoint_path: print("Reading model parameters from %s" % ckpt.model_checkpoint_path) model.saver.restore(session, ckpt.model_checkpoint_path) else: print("Created model with fresh parameters.") session.run(tf.global_variables_initializer()) return model def train(): # prepare dataset # 准备数据集 print("Preparing data in %s" % gConfig['working_directory']) enc_train, dec_train, enc_dev, dec_dev, _, _ = data_utils.prepare_custom_data(gConfig['working_directory'],gConfig['train_enc'],gConfig['train_dec'],gConfig['test_enc'],gConfig['test_dec'],gConfig['enc_vocab_size'],gConfig['dec_vocab_size']) # setup config to use BFC allocator config = tf.ConfigProto() config.gpu_options.allocator_type = 'BFC' with tf.Session(config=config) as sess: # Create model. # 构建模型 print("Creating %d layers of %d units." % (gConfig['num_layers'], gConfig['layer_size'])) model = create_model(sess, False) # Read data into buckets and compute their sizes. # 把数据读入桶(bucket)中,计算桶大小 print ("Reading development and training data (limit: %d)." % gConfig['max_train_data_size']) dev_set = read_data(enc_dev, dec_dev) train_set = read_data(enc_train, dec_train, gConfig['max_train_data_size']) train_bucket_sizes = [len(train_set[b]) for b in xrange(len(_buckets))] train_total_size = float(sum(train_bucket_sizes)) # A bucket scale is a list of increasing numbers from 0 to 1 that we'll use # to select a bucket. Length of [scale[i], scale[i+1]] is proportional to # the size if i-th training bucket, as used later. train_buckets_scale = [sum(train_bucket_sizes[:i + 1]) / train_total_size for i in xrange(len(train_bucket_sizes))] # This is the training loop. # 开始训练循环 step_time, loss = 0.0, 0.0 current_step = 0 previous_losses = [] while True: # Choose a bucket according to data distribution. We pick a random number # in [0, 1] and use the corresponding interval in train_buckets_scale. # 随机生成一个0-1数,在生成bucket_id中使用 random_number_01 = np.random.random_sample() bucket_id = min([i for i in xrange(len(train_buckets_scale)) if train_buckets_scale[i] > random_number_01]) # Get a batch and make a step. 
def train():
  # Prepare the dataset.
  print("Preparing data in %s" % gConfig['working_directory'])
  enc_train, dec_train, enc_dev, dec_dev, _, _ = data_utils.prepare_custom_data(
      gConfig['working_directory'], gConfig['train_enc'], gConfig['train_dec'],
      gConfig['test_enc'], gConfig['test_dec'],
      gConfig['enc_vocab_size'], gConfig['dec_vocab_size'])

  # Setup config to use the BFC allocator.
  config = tf.ConfigProto()
  config.gpu_options.allocator_type = 'BFC'

  with tf.Session(config=config) as sess:
    # Create model.
    print("Creating %d layers of %d units."
          % (gConfig['num_layers'], gConfig['layer_size']))
    model = create_model(sess, False)

    # Read data into buckets and compute their sizes.
    print("Reading development and training data (limit: %d)."
          % gConfig['max_train_data_size'])
    dev_set = read_data(enc_dev, dec_dev)
    train_set = read_data(enc_train, dec_train, gConfig['max_train_data_size'])
    train_bucket_sizes = [len(train_set[b]) for b in xrange(len(_buckets))]
    train_total_size = float(sum(train_bucket_sizes))

    # A bucket scale is a list of increasing numbers from 0 to 1 that we'll use
    # to select a bucket. Length of [scale[i], scale[i+1]] is proportional to
    # the size of the i-th training bucket, as used later.
    train_buckets_scale = [sum(train_bucket_sizes[:i + 1]) / train_total_size
                           for i in xrange(len(train_bucket_sizes))]

    # This is the training loop.
    step_time, loss = 0.0, 0.0
    current_step = 0
    previous_losses = []
    while True:
      # Choose a bucket according to data distribution. We pick a random number
      # in [0, 1] and use the corresponding interval in train_buckets_scale.
      random_number_01 = np.random.random_sample()
      bucket_id = min([i for i in xrange(len(train_buckets_scale))
                       if train_buckets_scale[i] > random_number_01])

      # Get a batch and make a step.
      start_time = time.time()
      encoder_inputs, decoder_inputs, target_weights = model.get_batch(
          train_set, bucket_id)
      _, step_loss, _ = model.step(sess, encoder_inputs, decoder_inputs,
                                   target_weights, bucket_id, False)
      step_time += (time.time() - start_time) / gConfig['steps_per_checkpoint']
      loss += step_loss / gConfig['steps_per_checkpoint']
      current_step += 1

      # Once in a while, we save checkpoint, print statistics, and run evals.
      if current_step % gConfig['steps_per_checkpoint'] == 0:
        # Print statistics for the previous epoch.
        perplexity = math.exp(loss) if loss < 300 else float('inf')
        print("global step %d learning rate %.4f step-time %.2f perplexity "
              "%.2f" % (model.global_step.eval(), model.learning_rate.eval(),
                        step_time, perplexity))
        # Decrease learning rate if no improvement was seen over last 3 times.
        if len(previous_losses) > 2 and loss > max(previous_losses[-3:]):
          sess.run(model.learning_rate_decay_op)
        previous_losses.append(loss)
        # Save checkpoint and zero timer and loss.
        checkpoint_path = os.path.join(gConfig['working_directory'],
                                       "seq2seq.ckpt")
        model.saver.save(sess, checkpoint_path, global_step=model.global_step)
        step_time, loss = 0.0, 0.0
        # Run evals on the development set and print their perplexity.
        for bucket_id in xrange(len(_buckets)):
          if len(dev_set[bucket_id]) == 0:
            print("  eval: empty bucket %d" % (bucket_id))
            continue
          encoder_inputs, decoder_inputs, target_weights = model.get_batch(
              dev_set, bucket_id)
          _, eval_loss, _ = model.step(sess, encoder_inputs, decoder_inputs,
                                       target_weights, bucket_id, True)
          eval_ppx = math.exp(eval_loss) if eval_loss < 300 else float('inf')
          print("  eval: bucket %d perplexity %.2f" % (bucket_id, eval_ppx))
        sys.stdout.flush()

def decode():
  with tf.Session() as sess:
    # Create model and load parameters.
    model = create_model(sess, True)
    model.batch_size = 1  # We decode one sentence at a time.

    # Load vocabularies.
    enc_vocab_path = os.path.join(gConfig['working_directory'],
                                  "vocab%d.enc" % gConfig['enc_vocab_size'])
    dec_vocab_path = os.path.join(gConfig['working_directory'],
                                  "vocab%d.dec" % gConfig['dec_vocab_size'])
    enc_vocab, _ = data_utils.initialize_vocabulary(enc_vocab_path)
    _, rev_dec_vocab = data_utils.initialize_vocabulary(dec_vocab_path)

    # Decode from standard input.
    sys.stdout.write("> ")
    sys.stdout.flush()
    sentence = sys.stdin.readline()
    while sentence:
      # Get token-ids for the input sentence.
      token_ids = data_utils.sentence_to_token_ids(
          tf.compat.as_bytes(sentence), enc_vocab)
      # Which bucket does it belong to?
      bucket_id = min([b for b in xrange(len(_buckets))
                       if _buckets[b][0] > len(token_ids)])
      # Get a 1-element batch to feed the sentence to the model.
      encoder_inputs, decoder_inputs, target_weights = model.get_batch(
          {bucket_id: [(token_ids, [])]}, bucket_id)
      # Get output logits for the sentence.
      _, _, output_logits = model.step(sess, encoder_inputs, decoder_inputs,
                                       target_weights, bucket_id, True)
      # This is a greedy decoder - outputs are just argmaxes of output_logits.
      outputs = [int(np.argmax(logit, axis=1)) for logit in output_logits]
      # If there is an EOS symbol in outputs, cut them at that point.
      if data_utils.EOS_ID in outputs:
        outputs = outputs[:outputs.index(data_utils.EOS_ID)]
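The bucket-sampling arithmetic in the training loop deserves a worked example. Suppose the four buckets hold 100, 200, 300 and 400 training pairs (made-up numbers): train_buckets_scale is then [0.1, 0.3, 0.6, 1.0], and a random draw of 0.45 falls in the interval (0.3, 0.6], selecting bucket 2, so buckets are sampled in proportion to their size:

# Illustrative only: bucket selection with made-up bucket sizes.
train_bucket_sizes = [100, 200, 300, 400]
train_total_size = float(sum(train_bucket_sizes))  # 1000.0
train_buckets_scale = [sum(train_bucket_sizes[:i + 1]) / train_total_size
                       for i in range(len(train_bucket_sizes))]
print(train_buckets_scale)  # [0.1, 0.3, 0.6, 1.0]

random_number_01 = 0.45  # a fixed draw instead of np.random.random_sample()
bucket_id = min([i for i in range(len(train_buckets_scale))
                 if train_buckets_scale[i] > random_number_01])
print(bucket_id)  # 2 -> the (20, 25) bucket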
      # Print out the decoded reply corresponding to outputs.
      print(" ".join([tf.compat.as_str(rev_dec_vocab[output])
                      for output in outputs]))
      print("> ", end="")
      sys.stdout.flush()
      sentence = sys.stdin.readline()

def self_test():
  """Test the translation model."""
  with tf.Session() as sess:
    print("Self-test for neural translation model.")
    # Create model with vocabularies of 10, 2 small buckets, 2 layers of 32.
    model = seq2seq_model.Seq2SeqModel(10, 10, [(3, 3), (6, 6)], 32, 2,
                                       5.0, 32, 0.3, 0.99, num_samples=8)
    sess.run(tf.global_variables_initializer())

    # Fake data set for both the (3, 3) and (6, 6) bucket.
    data_set = ([([1, 1], [2, 2]), ([3, 3], [4]), ([5], [6])],
                [([1, 1, 1, 1, 1], [2, 2, 2, 2, 2]), ([3, 3, 3], [5, 6])])
    for _ in xrange(5):  # Train the fake model for 5 steps.
      bucket_id = random.choice([0, 1])
      encoder_inputs, decoder_inputs, target_weights = model.get_batch(
          data_set, bucket_id)
      model.step(sess, encoder_inputs, decoder_inputs, target_weights,
                 bucket_id, False)

def init_session(sess, conf='seq2seq.ini'):
  global gConfig
  gConfig = get_config(conf)

  # Create model and load parameters.
  model = create_model(sess, True)
  model.batch_size = 1  # We decode one sentence at a time.

  # Load vocabularies.
  enc_vocab_path = os.path.join(gConfig['working_directory'],
                                "vocab%d.enc" % gConfig['enc_vocab_size'])
  dec_vocab_path = os.path.join(gConfig['working_directory'],
                                "vocab%d.dec" % gConfig['dec_vocab_size'])
  enc_vocab, _ = data_utils.initialize_vocabulary(enc_vocab_path)
  _, rev_dec_vocab = data_utils.initialize_vocabulary(dec_vocab_path)

  return sess, model, enc_vocab, rev_dec_vocab

def decode_line(sess, model, enc_vocab, rev_dec_vocab, sentence):
  # Get token-ids for the input sentence.
  token_ids = data_utils.sentence_to_token_ids(
      tf.compat.as_bytes(sentence), enc_vocab)
  # Which bucket does it belong to?
  bucket_id = min([b for b in xrange(len(_buckets))
                   if _buckets[b][0] > len(token_ids)])
  # Get a 1-element batch to feed the sentence to the model.
  encoder_inputs, decoder_inputs, target_weights = model.get_batch(
      {bucket_id: [(token_ids, [])]}, bucket_id)
  # Get output logits for the sentence.
  _, _, output_logits = model.step(sess, encoder_inputs, decoder_inputs,
                                   target_weights, bucket_id, True)
  # This is a greedy decoder - outputs are just argmaxes of output_logits.
  outputs = [int(np.argmax(logit, axis=1)) for logit in output_logits]
  # If there is an EOS symbol in outputs, cut them at that point.
  if data_utils.EOS_ID in outputs:
    outputs = outputs[:outputs.index(data_utils.EOS_ID)]
  return " ".join([tf.compat.as_str(rev_dec_vocab[output])
                   for output in outputs])

if __name__ == '__main__':
  if len(sys.argv) > 1:
    gConfig = get_config(sys.argv[1])
  else:
    # Get configuration from seq2seq.ini.
    gConfig = get_config()

  print('\n>> Mode : %s\n' % (gConfig['mode']))

  if gConfig['mode'] == 'train':
    # Start training.
    train()
  elif gConfig['mode'] == 'test':
    # Interactive decoding.
    decode()
  else:
    # "serve" mode is not run from here.
    # Use: >> python ui/app.py (uses seq2seq_serve.ini as conf file).
    print('Serve Usage : >> python ui/app.py')
    print('# uses seq2seq_serve.ini as conf file')

Combining a text-based chatbot with speech recognition yields a robot you can talk to directly. The system architecture is: human -> speech recognition (ASR) -> natural language understanding (NLU) -> dialogue management -> natural language generation (NLG) -> speech synthesis (TTS) -> human; a minimal sketch of this loop follows at the end of this article. See 《中国人工智能学会通讯》, 2016, Vol. 6, No. 1.

Turing Robot (图灵机器人) works on improving dialogue and semantic accuracy, raising the intelligence of chatbots in Chinese-language contexts. Emotibot (竹间智能科技) researches emotional robots with memory and self-learning, aiming for robots that genuinely understand multi-modal, multi-channel information and respond in a highly human-like way, approaching the ideal mode of natural-language communication. Tencent holds social conversation data: WeChat is the largest corpus of natural-language exchanges, and by combining that massive real-world data with mini programs it could become the entry point for all services.

Reference: 《TensorFlow技术解析与实战》
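As a rough illustration of the spoken-dialogue architecture above, the sketch below chains the stages together. asr_transcribe and tts_speak are hypothetical stubs, not functions from any code in this article; only decode_line comes from execute.py and could serve as the NLU + dialogue management + NLG core:

# A minimal, hypothetical sketch of the ASR -> NLU -> DM -> NLG -> TTS loop.
# asr_transcribe and tts_speak are stand-ins for real speech components.

def asr_transcribe(audio):
    """Speech recognition: audio in, text out (stub)."""
    raise NotImplementedError

def tts_speak(text):
    """Speech synthesis: text in, audio out (stub)."""
    raise NotImplementedError

def spoken_dialogue_turn(audio, sess, model, enc_vocab, rev_dec_vocab):
    user_text = asr_transcribe(audio)                   # ASR
    reply_text = decode_line(sess, model, enc_vocab,    # NLU + dialogue
                             rev_dec_vocab, user_text)  # management + NLG
    return tts_speak(reply_text)                        # TTS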