# Memory Network (MemNN)

• Me: Can you find me some restaurants?
• Assistant: I found a few places within 0.25 miles. The first one is Caffé Opera on Lincoln Street. The second one is …
• Me: Can you make a reservation at the first restaurant?
• Assistant: OK. Let's make a reservation for the Sushi Tom restaurant on First Street.

Without any memory of the earlier turns, the assistant books the wrong restaurant. A dialog system needs a way to store the conversation context and recall it later; this is the problem memory networks address.

The description of memory networks (MemNN) below is based on *Memory Networks* by Jason Weston et al. Consider the following story:

1. Joe went to the kitchen.
2. Fred went to the kitchen.
3. Joe picked up the milk.
4. Joe traveled to the office.
5. Joe left the milk.
6. Joe went to the bathroom.

## Storing information in memory

| Memory slot $$m_i$$ | Sentence |
|---|---|
| 1 | Joe went to the kitchen. |
| 2 | Fred went to the kitchen. |
| 3 | Joe picked up the milk. |
| 4 | Joe traveled to the office. |
| 5 | Joe left the milk. |
| 6 | Joe went to the bathroom. |

Answering the question $$q$$ = "Where is the milk now?" takes two hops. The first hop selects the memory that best matches the question:

$$o_1 = \mathop{\arg\max}_{i=1,...,N}s_0(q, m_i)$$

The second hop scores each memory against the question combined with the first supporting memory:

$$o_2 = \mathop{\arg\max}_{i=1,...,N}s_0([q, m_{01}], m_i)$$

The combined evidence is:

$$o = [q, m_{01}, m_{02}] = [\text{"where is the milk now"}, \text{"Joe left the milk."}, \text{"Joe travelled to the office."}]$$

Finally, the answer is the vocabulary word $$w \in W$$ that scores highest against this evidence:

$$r = \mathop{\arg\max}_{w \in W}s_r([q, m_{01}, m_{02}], w)$$
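The two-hop retrieval above can be sketched in a few lines. The learned scoring function $$s_0$$ is replaced here by a simple word-overlap count purely for illustration, so the toy score need not pick the same memories as a trained model:

```python
# Minimal sketch of two-hop MemNN inference. The word-overlap "score"
# is a hypothetical stand-in for the learned bilinear function s_0.
memories = [
    "Joe went to the kitchen.",
    "Fred went to the kitchen.",
    "Joe picked up the milk.",
    "Joe traveled to the office.",
    "Joe left the milk.",
    "Joe went to the bathroom.",
]

def score(query_words, memory):
    # stand-in for s_0(q, m_i): number of shared words
    return len(query_words & set(memory.lower().strip(".").split()))

q = "where is the milk now"
q_words = set(q.split())

# Hop 1: best-matching memory given the question alone
o1 = max(range(len(memories)), key=lambda i: score(q_words, memories[i]))

# Hop 2: re-score with the question plus the first supporting memory
q2_words = q_words | set(memories[o1].lower().strip(".").split())
o2 = max((i for i in range(len(memories)) if i != o1),
         key=lambda i: score(q2_words, memories[i]))
```

Even this crude score lands on milk-related memories for both hops, because each hop folds the retrieved memory back into the query before scoring again.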

## Encoding the input

A sentence is encoded as a bag-of-words vector over the vocabulary:

| | ... | is | Joe | left | milk | now | office | the | to | travelled | where | ... |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| "where is the milk now" | ... | 1 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 1 | ... |

$$\text{"where is the milk now"} = (..., 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, ...)$$

$$q$$: Where is the milk now?
$$m_{01}$$: Joe left the milk.
$$m_{02}$$: Joe travelled to the office.

To keep the three sources distinguishable, the vocabulary is tripled: one copy for words from the question, one for the first supporting memory, and one for the second. For the example above:

| | ... | Joe_1 | milk_1 | ... | Joe_2 | milk_2 | ... | Joe_3 | milk_3 | ... |
|---|---|---|---|---|---|---|---|---|---|---|
| encoding | ... | 0 | 1 | ... | 1 | 1 | ... | 1 | 0 | ... |

## Computing the scoring function

Both scoring functions share the same bilinear form, with separate embedding matrices $$U_0$$ for memory retrieval and $$U_r$$ for response selection:

$$s_0(x, y) = \Phi_x(x)^TU_0^TU_0\Phi_y(y)$$

$$s_r(x, y) = \Phi_x(x)^TU_r^TU_r\Phi_y(y)$$
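Concretely, $$U$$ projects both sparse bag-of-words vectors into a dense $$n$$-dimensional embedding space, and the score is their dot product there. A numpy sketch with illustrative (assumed) dimensions:

```python
import numpy as np

# Bilinear score s(x, y) = Phi_x(x)^T U^T U Phi_y(y).
# D and n are illustrative; U would be learned, here it is random.
rng = np.random.default_rng(0)
D = 30   # input dimension (3 * vocabulary size)
n = 20   # embedding dimension

U = rng.normal(scale=0.1, size=(n, D))

def s(phi_x, phi_y):
    # project both inputs with U, then take the dot product
    return (U @ phi_x) @ (U @ phi_y)

phi_x = rng.integers(0, 2, D).astype(float)
phi_y = rng.integers(0, 2, D).astype(float)
score = s(phi_x, phi_y)
```

Note that with a single shared $$U$$ the score is symmetric in its arguments; the model learns separate matrices $$U_0$$ and $$U_r$$ because retrieval and response selection are different tasks.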

## Margin loss function

Training minimizes a margin ranking loss: each of the three selections (first hop, second hop, response) must outscore every incorrect candidate by a margin $$\gamma$$:

$$\sum\limits_{\overline{f}\not=m_{01}}\max(0, \gamma - s_0(x, m_{01}) + s_0(x, \overline{f})) +$$

$$\sum\limits_{\overline{f'}\not=m_{02}}\max(0, \gamma - s_0(\left[x, m_{01}\right], m_{02}) + s_0(\left[x, m_{01}\right], \overline{f'})) +$$

$$\sum\limits_{\overline{r}\not=r}\max(0, \gamma - s_r(\left[x, m_{01}, m_{02}\right], r) + s_r(\left[x, m_{01}, m_{02}\right], \overline{r}))$$
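Each term is a standard hinge: no loss is paid once the correct choice beats a distractor by at least $$\gamma$$. A minimal sketch of one such term, with made-up scores:

```python
# One term of the margin ranking loss: the correct memory's score
# s_true must exceed each wrong score by at least gamma, otherwise
# the hinge max(0, gamma - s_true + s_wrong) contributes loss.
def hinge_term(s_true, s_wrong_list, gamma=0.1):
    return sum(max(0.0, gamma - s_true + s_w) for s_w in s_wrong_list)

# True memory scores well above all distractors: the loss is zero.
loss = hinge_term(s_true=2.0, s_wrong_list=[0.5, 0.3, 1.0])
```

When a distractor ties the correct choice, the term contributes exactly $$\gamma$$, pushing the scores apart during training.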

# End-to-End Memory Network (MemN2N)

The description, as well as the diagrams, of end-to-end memory networks (MemN2N) is based on *End-To-End Memory Networks* by Sainbayar Sukhbaatar et al.

$$u = embedding_B(q)$$

$$m_i = embedding_A(x_i)$$

$$p_i = softmax(u^Tm_i)$$

$$c_i = embedding_C(x_i)$$

$$o = \sum\limits_{i}p_ic_i$$

$$\hat a = softmax(W(o + u))$$
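One hop of these equations can be traced end to end in numpy. The embedding matrices would be learned; here they are random, and all sizes are illustrative:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# One MemN2N hop with random matrices standing in for the learned
# embeddings A, B, C and the answer projection W; sizes are assumed.
rng = np.random.default_rng(0)
V, d, n_mem = 50, 8, 6          # vocab size, embedding dim, memory slots

A = rng.normal(scale=0.1, size=(V, d))   # memory (input) embedding
B = rng.normal(scale=0.1, size=(V, d))   # question embedding
C = rng.normal(scale=0.1, size=(V, d))   # memory (output) embedding
W = rng.normal(scale=0.1, size=(V, d))   # final answer projection

x = rng.integers(0, V, size=n_mem)       # one word id per memory slot
q = rng.integers(0, V, size=1)           # a one-word "question"

u = B[q].sum(axis=0)           # u = embedding_B(q)
m = A[x]                       # m_i = embedding_A(x_i), shape (n_mem, d)
c = C[x]                       # c_i = embedding_C(x_i)
p = softmax(m @ u)             # p_i = softmax(u^T m_i), attention over memory
o = p @ c                      # o = sum_i p_i c_i
a_hat = softmax(W @ (o + u))   # a_hat = softmax(W(o + u))
```

The attention weights `p` sum to 1 over the memory slots, so `o` is a convex combination of the output embeddings; the whole pipeline is differentiable, which is what makes the model trainable end to end.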

## Multiple layers

The model is stacked for multiple hops; the output of hop $$k$$ is added to the internal state to form the input of the next hop:

$$u^{k + 1} = u^k + o^k$$

## Language model

| Memory slot $$m_i$$ | Word |
|---|---|
| 1 | We |
| 2 | hold |
| 3 | these |
| 4 | truths |
| 5 | to |
| 6 | be |
| 7 | ... |

1. There is no question: we predict the next word rather than answer a query. Hence the embedding matrix $$B$$ is not needed; $$u$$ is simply filled with the constant 0.1.
2. Multiple layers are used with layer-wise (RNN-like) weight tying: the embedding matrices $$A$$ and $$C$$ are shared across all layers.
3. A temporal encoding term is added to the embeddings to record the order of the words in memory (Section 4.1).
4. Before adding $$o$$, $$u$$ is multiplied by a linear mapping $$H$$.
5. To aid training, ReLU is applied to half of the units in each layer (Section 5).

    import math
    import os
    import random
    import numpy as np
    import tensorflow as tf

    def build_memory(self):
        self.global_step = tf.Variable(0, name="global_step", trainable=False)

        self.A = tf.Variable(tf.random_normal([self.nwords, self.edim], stddev=self.init_std))  # Embedding A for sentences
        self.C = tf.Variable(tf.random_normal([self.nwords, self.edim], stddev=self.init_std))  # Embedding C for sentences
        self.H = tf.Variable(tf.random_normal([self.edim, self.edim], stddev=self.init_std))    # Multiply it with u before adding to o

        # Sec 4.1: Temporal Encoding to capture the time order of the sentences.
        self.T_A = tf.Variable(tf.random_normal([self.mem_size, self.edim], stddev=self.init_std))
        self.T_C = tf.Variable(tf.random_normal([self.mem_size, self.edim], stddev=self.init_std))

        # Sec 2: layer-wise (RNN-like) tying - the embeddings are shared across all layers.
        # (N, 100, 150) m_i = sum A_ij * x_ij + T_A_i
        m_a = tf.nn.embedding_lookup(self.A, self.sentences)
        m_t = tf.nn.embedding_lookup(self.T_A, self.T)
        m = tf.add(m_a, m_t)

        # (N, 100, 150) c_i = sum C_ij * x_ij + T_C_i
        c_a = tf.nn.embedding_lookup(self.C, self.sentences)
        c_t = tf.nn.embedding_lookup(self.T_C, self.T)
        c = tf.add(c_a, c_t)

        # For each layer (hop)
        for h in range(self.nhop):
            u = tf.reshape(self.u_s[-1], [-1, 1, self.edim])    # (N, 1, 150)
            scores = tf.matmul(u, m, adjoint_b=True)            # (N, 1, 100) u^T m_i
            scores = tf.reshape(scores, [-1, self.mem_size])    # (N, 100)

            P = tf.nn.softmax(scores)                  # (N, 100) p_i = softmax(u^T m_i)
            P = tf.reshape(P, [-1, 1, self.mem_size])  # (N, 1, 100)

            o = tf.matmul(P, c)                  # (N, 1, 150) o = sum_i p_i c_i
            o = tf.reshape(o, [-1, self.edim])   # (N, 150)

            # Section 2: We are using layer-wise (RNN-like) tying, so we multiply u with H.
            uh = tf.matmul(self.u_s[-1], self.H)
            next_u = tf.add(uh, o)               # u_{k+1} = H u_k + o_k

            # Section 5: To aid training, we apply ReLU operations to half of the units in each layer.
            F = tf.slice(next_u, [0, 0], [self.batch_size, self.lindim])                       # linear half
            G = tf.slice(next_u, [0, self.lindim], [self.batch_size, self.edim - self.lindim])
            K = tf.nn.relu(G)                                                                  # ReLU half
            self.u_s.append(tf.concat(axis=1, values=[F, K]))

        self.W = tf.Variable(tf.random_normal([self.edim, self.nwords], stddev=self.init_std))
        z = tf.matmul(self.u_s[-1], self.W)   # (N, 10000) logits over the vocabulary

        self.loss = tf.nn.softmax_cross_entropy_with_logits(logits=z, labels=self.target)

        self.lr = tf.Variable(self.current_lr)
        self.opt = tf.train.GradientDescentOptimizer(self.lr)

        params = [self.A, self.C, self.H, self.T_A, self.T_C, self.W]
        grads_and_vars = self.opt.compute_gradients(self.loss, params)
        # self.max_grad_norm is assumed to come from the config (e.g. 50).
        clipped_grads_and_vars = [(tf.clip_by_norm(gv[0], self.max_grad_norm), gv[1])
                                  for gv in grads_and_vars]

        inc = self.global_step.assign_add(1)
        with tf.control_dependencies([inc]):
            self.optim = self.opt.apply_gradients(clipped_grads_and_vars)

    def train(self, data):
        n_batch = int(math.ceil(len(data) / self.batch_size))
        cost = 0

        u = np.ndarray([self.batch_size, self.edim], dtype=np.float32)    # (N, 150) will fill with 0.1
        T = np.ndarray([self.batch_size, self.mem_size], dtype=np.int32)  # (N, 100) will fill with 0..99
        target = np.zeros([self.batch_size, self.nwords])                 # one-hot encoded
        sentences = np.ndarray([self.batch_size, self.mem_size])

        u.fill(self.init_u)   # (N, 150) fill with 0.1 since we do not need a query in the language model.
        for t in range(self.mem_size):   # (N, 100) 100 memory cells with time indices 0 to 99.
            T[:, t].fill(t)

        for idx in range(n_batch):
            target.fill(0)   # (128, 10000)
            for b in range(self.batch_size):
                # Randomly pick a word in the data and use it as the word the language model must predict.
                m = random.randrange(self.mem_size, len(data))
                target[b][data[m]] = 1   # set the one-hot entry for the target word

                # (N, 100). Say we pick the word at position 1000; we fill the memory with words 900 ... 999.
                # Each memory slot Xi holds one single word, following the word order in data.
                sentences[b] = data[m - self.mem_size:m]

            _, loss, self.step = self.sess.run([self.optim,
                                                self.loss,
                                                self.global_step],
                                               feed_dict={
                                                   self.u: u,
                                                   self.T: T,
                                                   self.target: target,
                                                   self.sentences: sentences})
            cost += np.sum(loss)

        return cost / n_batch / self.batch_size

    class MemN2N(object):
        def __init__(self, config, sess):
            self.nwords = config.nwords         # 10,000
            self.init_u = config.init_u         # 0.1 (no query in the language model, so u is filled with 0.1)
            self.init_std = config.init_std     # 0.05
            self.batch_size = config.batch_size # 128
            self.nepoch = config.nepoch         # 100
            self.nhop = config.nhop             # 6
            self.edim = config.edim             # 150
            self.mem_size = config.mem_size     # 100
            self.lindim = config.lindim         # 75

            self.show = config.show
            self.is_test = config.is_test
            self.checkpoint_dir = config.checkpoint_dir

            if not os.path.isdir(self.checkpoint_dir):
                os.makedirs(self.checkpoint_dir)

            # (?, 150) Unlike Q&A, the language model does not need a query (or care what its value is).
            # So we bypass q and fill u directly with 0.1 later.
            self.u = tf.placeholder(tf.float32, [None, self.edim], name="u")

            # (?, 100) Sec 4.1: we add temporal encoding to capture the time order of the memories Xi.
            self.T = tf.placeholder(tf.int32, [None, self.mem_size], name="T")

            # (N, 10000) The answer word we want (the next word, for the language model).
            self.target = tf.placeholder(tf.float32, [self.batch_size, self.nwords], name="target")

            # (N, 100) The memory Xi. Each "sentence" here contains one single word only.
            self.sentences = tf.placeholder(tf.int32, [self.batch_size, self.mem_size], name="sentences")

            # Store the value of u at each layer.
            self.u_s = []
            self.u_s.append(self.u)

            self.sess = sess

# Dynamic Memory Network

## Question module

The question module encodes the question by feeding its word embeddings $$v_t$$ through a GRU; the final hidden state serves as the question representation:

$$q_t = GRU(v_t, q_{t-1})$$

## Episodic memory module

For each sentence representation $$s_i$$, an attention gate $$g_i^t$$ controls how much the episode state is updated on pass $$t$$; when the gate is closed, the previous state is carried forward unchanged:

$$h_i^t = g_i^tGRU(s_i, h_{i-1}^t) + (1 - g_i^t)h_{i-1}^t$$
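The gating behavior can be sketched directly from the update equation. The GRU itself is stubbed out with a random linear map purely for illustration (it is not a real GRU cell), and the dimensions are assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
Wh = rng.normal(scale=0.1, size=(d, d))
Ws = rng.normal(scale=0.1, size=(d, d))

def gru_stub(s, h_prev):
    # hypothetical placeholder for GRU(s_i, h_{i-1}), not a real GRU
    return np.tanh(Ws @ s + Wh @ h_prev)

def episodic_update(g, s, h_prev):
    # h_i = g * GRU(s_i, h_prev) + (1 - g) * h_prev
    return g * gru_stub(s, h_prev) + (1 - g) * h_prev

s = rng.normal(size=d)
h_prev = rng.normal(size=d)
h_ignored = episodic_update(0.0, s, h_prev)   # gate closed: state copied forward
h_updated = episodic_update(1.0, s, h_prev)   # gate open: full GRU step
```

When the gate is zero, the sentence leaves the episode state untouched; this is how the attention mechanism lets the module skip irrelevant sentences on each pass.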
