# Recommender Systems

### Measuring Similarity

Pearson correlation similarity, where $\langle x, y \rangle$ denotes the inner product:

$$\mathrm{sim}(x, y) = \frac{\langle x - \bar{x},\; y - \bar{y} \rangle}{\lVert x - \bar{x} \rVert \, \lVert y - \bar{y} \rVert}$$

The advantage of Pearson similarity is that it is insensitive to rating scale: even a large difference in scale between two vectors has little effect on the score.
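
As a quick sketch of that scale-insensitivity, Pearson similarity can be computed with NumPy by mean-centering both vectors first (the two rating vectors here are invented for illustration):

```python
import numpy as np

def pearson_sim(x, y):
    """Pearson correlation: cosine similarity of the mean-centered vectors."""
    x_c = x - x.mean()
    y_c = y - y.mean()
    return float(np.dot(x_c, y_c) / (np.linalg.norm(x_c) * np.linalg.norm(y_c)))

# Two users rate the same items on shifted scales; the rating *pattern*
# is identical, so Pearson similarity is still maximal.
a = np.array([1.0, 2.0, 3.0, 4.0])
b = np.array([3.0, 4.0, 5.0, 6.0])  # same pattern, offset by 2
print(pearson_sim(a, b))  # close to 1.0
```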

### Recommendation Algorithms

#### Collaborative Filtering

The neighborhood-based algorithm recommends from nearest neighbors. Collaborative filtering comes in two forms: user-based collaborative filtering and item-based collaborative filtering.

##### user-based CF

```python
import numpy as np


def cos_sim(x, y):
    """Cosine similarity between two 1 x n row matrices."""
    numerator = x * y.T
    # both vector norms are needed in the denominator
    denominator = np.sqrt(x * x.T) * np.sqrt(y * y.T)
    return (numerator / denominator)[0, 0]


def similarity(data):
    """Pairwise cosine similarity between the rows of data (m x n matrix)."""
    m = np.shape(data)[0]
    w = np.mat(np.zeros((m, m)))
    for i in range(m):
        for j in range(i, m):
            if j != i:
                w[i, j] = cos_sim(data[i, ], data[j, ])
                w[j, i] = w[i, j]  # the matrix is symmetric
            else:
                w[i, j] = 0  # ignore self-similarity
    return w


def top_k(predict, k):
    """Keep only the k highest-scored candidates."""
    top_recom = []
    len_result = len(predict)
    if k >= len_result:
        top_recom = predict
    else:
        for i in range(k):
            top_recom.append(predict[i])
    return top_recom
```


```python
def user_based_recommend(data, w, user):
    """data: m x n user-item matrix (rows = users), w: user-user similarity."""
    m, n = np.shape(data)
    interaction = data[user, ]
    not_inter = []
    for i in range(n):
        if interaction[0, i] == 0:  # items the user has not rated yet
            not_inter.append(i)
    predict = {}
    for x in not_inter:
        item = np.copy(data[:, x])
        for i in range(m):
            if item[i, 0] != 0:
                # weight each neighbour's rating by its similarity to the user
                if x not in predict:
                    predict[x] = w[user, i] * item[i, 0]
                else:
                    predict[x] = predict[x] + w[user, i] * item[i, 0]
    return sorted(predict.items(), key=lambda d: d[1], reverse=True)
```
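
A minimal end-to-end sketch of the user-based pipeline: score a user's unseen items by similarity-weighted ratings from other users. The rating matrix is invented, and the inline loop mirrors the similarity/recommend routines above rather than calling them:

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two 1-D rating vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# rows = users, cols = items; 0 means "not rated"
data = np.array([
    [4.0, 3.0, 0.0, 5.0, 0.0],
    [5.0, 0.0, 4.0, 4.0, 0.0],
    [4.0, 0.0, 5.0, 0.0, 3.0],
])

m, n = data.shape
user = 0
scores = {}
for item in range(n):
    if data[user, item] != 0:        # only score unseen items
        continue
    s = 0.0
    for other in range(m):
        if other != user and data[other, item] != 0:
            s += cosine(data[user], data[other]) * data[other, item]
    if s > 0:
        scores[item] = s

ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
print(ranked)  # user 0's unseen items (2 and 4), best candidate first
```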


##### item-based CF

```python
def item_based_recommend(data, w, user):
    """data: m x n item-user matrix (rows = items), w: item-item similarity."""
    m, n = np.shape(data)
    interaction = data[:, user].T  # 1 x m row of this user's item ratings
    not_inter = []
    for i in range(m):
        if interaction[0, i] == 0:  # items the user has not interacted with
            not_inter.append(i)

    predict = {}
    for x in not_inter:
        item = np.copy(interaction)
        for j in range(m):
            if item[0, j] != 0:
                # weight each rated item by its similarity to candidate x
                if x not in predict:
                    predict[x] = w[x, j] * item[0, j]
                else:
                    predict[x] = predict[x] + w[x, j] * item[0, j]
    return sorted(predict.items(), key=lambda d: d[1], reverse=True)
```
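
In the item-based variant the similarity matrix is between items, so one way to reuse the row-wise similarity idea is to transpose the user-item matrix first. A self-contained sketch with an invented matrix:

```python
import numpy as np

# rows = users, cols = items; transpose so rows become items
data = np.array([
    [4.0, 3.0, 0.0, 5.0],
    [5.0, 0.0, 4.0, 4.0],
    [4.0, 0.0, 5.0, 0.0],
])
items = data.T                      # shape: (n_items, n_users)

def cosine(u, v):
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / denom) if denom else 0.0

n_items = items.shape[0]
w = np.zeros((n_items, n_items))    # item-item similarity matrix
for i in range(n_items):
    for j in range(n_items):
        if i != j:
            w[i, j] = cosine(items[i], items[j])

# score user 0's unseen items by similarity to the items they did rate
user = 0
scores = {}
for x in range(n_items):
    if data[user, x] == 0:
        scores[x] = sum(w[x, j] * data[user, j]
                        for j in range(n_items) if data[user, j] != 0)
print(sorted(scores.items(), key=lambda kv: kv[1], reverse=True))
```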


### Recommendation Based on Latent Factors

In the latent-factor network, N is the number of users on the input side, d is the dimension of the hidden (latent) layer, and M is the size of the output, which equals the number of items, because the final output is the user's predicted rating for every item.
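
A sketch of those shapes, assuming a one-hot user vector as input (all sizes and the random matrices here are invented): an N×d user-embedding matrix followed by a d×M item matrix maps one user to M predicted ratings.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, M = 4, 3, 5            # users, latent dimension, items (toy sizes)
P = rng.random((N, d))       # user embeddings, N x d
Q = rng.random((d, M))       # item factors, d x M

user = np.zeros(N)
user[1] = 1.0                # one-hot input selecting user 1
ratings = user @ P @ Q       # predicted scores for all M items
print(ratings.shape)         # (5,)
```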

Here $V_{m \times n} \approx W_{m \times k} H_{k \times n}$, where k is the rank of the factorization. The larger k is, the more likely the model is to overfit, so regularization can be added here as well.

### The KL Loss Function

#### Implementation

```python
def train(V, r, maxCycles, e):
    """Factorize V (m x n) into W (m x r) * H (r x n) by multiplicative updates."""
    m, n = np.shape(V)
    W = np.mat(np.random.random((m, r)))
    H = np.mat(np.random.random((r, n)))

    for step in range(maxCycles):
        V_pre = W * H
        E = V - V_pre
        err = 0.0
        for i in range(m):
            for j in range(n):
                err += E[i, j] * E[i, j]  # squared reconstruction error
        if err < e:
            break
        if step % 1000 == 0:
            print("Iteration: ", step, " Loss: ", err)
        # multiplicative update for H
        a = W.T * V
        b = W.T * W * H
        for i in range(r):
            for j in range(n):
                if b[i, j] != 0:
                    H[i, j] = H[i, j] * a[i, j] / b[i, j]
        # multiplicative update for W
        c = V * H.T
        d = W * H * H.T
        for i in range(m):
            for j in range(r):
                if d[i, j] != 0:
                    W[i, j] = W[i, j] * c[i, j] / d[i, j]
    return W, H
```
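
A quick check of those updates on a tiny made-up matrix (vectorized with arrays instead of `np.mat`, and with a small epsilon in place of the zero-division guard): after enough iterations, W @ H approximates V while both factors stay non-negative.

```python
import numpy as np

# same multiplicative updates as train() above, on a tiny made-up matrix
np.random.seed(0)
V = np.array([[5.0, 3.0, 1.0],
              [4.0, 0.0, 1.0],
              [1.0, 1.0, 5.0]])
m, n, r = 3, 3, 2
W = np.random.random((m, r))
H = np.random.random((r, n))

for _ in range(2000):
    H *= (W.T @ V) / (W.T @ W @ H + 1e-9)   # epsilon avoids division by zero
    W *= (V @ H.T) / (W @ H @ H.T + 1e-9)

err = np.linalg.norm(V - W @ H)
print(round(err, 3))  # reconstruction error shrinks toward a local minimum
```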



### PageRank Algorithm

Assume every page jumps to each of the pages it links to with equal probability; this defines a transition matrix $T$, where $T_{ij}$ is the probability of moving from page $j$ to page $i$. From this matrix we can compute the probability that a surfer visits each page by iterating $p_{t+1} = T\, p_t$.

#### Conditions on the Matrix T

① $T$ must be a stochastic matrix: all entries are non-negative and every column sums to 1.
② $T$ must be irreducible, i.e. the graph is strongly connected: every node can reach every other node. If some node cannot reach any other node (the column for that node is all zeros), or only links back to itself (a self-loop), we get the dead ends and spider traps discussed later.
③ $T$ must be aperiodic, so that the Markov chain does not fall into a periodic cycle.
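
Assuming a small column-stochastic $T$ that satisfies all three conditions, power iteration converges to the stationary visit probabilities. The 3-page web below is made up for illustration:

```python
import numpy as np

# a made-up 3-page web: column j holds page j's outgoing probabilities
# (page 0 links to 1 and 2; page 1 links to 0 and 2; page 2 links only to 0)
T = np.array([[0.0, 0.5, 1.0],
              [0.5, 0.0, 0.0],
              [0.5, 0.5, 0.0]])
assert np.allclose(T.sum(axis=0), 1.0)   # column-stochastic

p = np.full(3, 1.0 / 3)                  # start from the uniform distribution
for _ in range(100):
    p = T @ p                            # one step of the random surfer
print(np.round(p, 3))                    # ≈ [0.444, 0.222, 0.333]
```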

#### Summary

##### Formulating PageRank

Each page that links to A passes its PageRank along its out-links. If page B has more than one out-link, say B has two outgoing paths, then B passes half of its PR value to A; iterating this distribution repeatedly is all that is needed. So the iterative formula for a page $i$ is:

$$PR(i) = \frac{1-\alpha}{N} + \alpha \sum_{j \in in(i)} \frac{PR(j)}{|out(j)|}$$

where $N$ is the total number of pages, $in(i)$ is the set of pages linking to page $i$, and $out(j)$ is the set of pages that page $j$ links to. The implementation below follows this formula.

#### Implementation

```python
from pygraph.classes.digraph import digraph


class pageRank(object):

    def __init__(self, dg):
        self.alpha = 0.85        # damping factor
        self.maxCycles = 200
        self.min_delta = 0.0001  # convergence threshold
        self.graph = dg

    def page_rank(self):
        # a node with no out-links first gets an edge to every other node,
        # so the random surfer never gets stuck at a dead end
        for node in self.graph.nodes():
            if len(self.graph.neighbors(node)) == 0:
                for node2 in self.graph.nodes():
                    if node2 != node:
                        self.graph.add_edge((node, node2))

        nodes = self.graph.nodes()
        graphs_size = len(nodes)

        if graphs_size == 0:
            return 'nodes set is empty!'

        page_rank = dict.fromkeys(nodes, 1.0 / graphs_size)  # uniform start
        runAway = (1.0 - self.alpha) / graphs_size           # teleport term
        for i in range(self.maxCycles):
            change = 0
            for node in nodes:
                rank = 0
                for incident_page in self.graph.incidents(node):
                    rank += self.alpha * (page_rank[incident_page] / len(self.graph.neighbors(incident_page)))
                rank += runAway
                change += abs(page_rank[node] - rank)
                page_rank[node] = rank

            print("NO.%s iteration" % (i + 1))
            print(page_rank)

            if change < self.min_delta:
                break
        return page_rank


if __name__ == '__main__':
    dg = digraph()
    # a small sample graph (the original example graph was omitted here);
    # node D has no out-links, so the dead-end handling above kicks in
    dg.add_nodes(["A", "B", "C", "D"])
    dg.add_edge(("A", "B"))
    dg.add_edge(("B", "C"))
    dg.add_edge(("C", "A"))
    dg.add_edge(("A", "D"))

    pr = pageRank(dg)
    page_ranks = pr.page_rank()

    print("The final page rank is\n", page_ranks)
```


### PersonalRank Algorithm

The PersonalRank algorithm is a variant of PageRank, used to compute how important each item node is relative to a given user node u. Starting from node u, we walk on the user-item bipartite graph; on arriving at any node, just as in PageRank, we either stop with a certain probability or continue walking, until the visit probability of every node no longer changes.

A bipartite graph splits the nodes into two subsets (users and items) whose intersection is empty, with edges only between the two sides. In the PersonalRank algorithm itself there is no need to distinguish user nodes from item nodes: a node's converged PR value directly measures how interesting that node is to the user.

① Initialization: the PR value of the starting user node $u$ is set to 1, and every other node is set to 0.
② Walk on the graph: each step starts from the nodes whose PR value is non-zero, moves forward along an edge with probability $\alpha$, and stops at the current node with probability $1-\alpha$.
③ Start from U1: if U1 connects to three item nodes, the walk reaches each of them with probability $\alpha/3$, so each of those items gets PR value $\alpha/3$, and U1's PR value becomes $1-\alpha$.
④ The nodes with non-zero PR values now include those items, so the same process is repeated from them until convergence. The update formula is

$$PR(v) = \alpha \sum_{v' \in in(v)} \frac{PR(v')}{|out(v')|}$$

plus an extra stopping probability $1-\alpha$ added when $v$ is the starting node.

#### Implementation

```python
import numpy as np


def generate_dict(dataTmp):
    """Turn an m x n user-item matrix into a bipartite adjacency dict."""
    m, n = np.shape(dataTmp)
    data_dict = {}
    for i in range(m):
        tmp_dict = {}
        for j in range(n):
            if dataTmp[i, j] != 0:
                tmp_dict["D_" + str(j)] = dataTmp[i, j]
        data_dict["U_" + str(i)] = tmp_dict
    for j in range(n):
        tmp_dict = {}
        for i in range(m):
            if dataTmp[i, j] != 0:
                tmp_dict["U_" + str(i)] = dataTmp[i, j]
        data_dict["D_" + str(j)] = tmp_dict
    return data_dict
```



```python
def PersonalRank(data_dict, alpha, user, maxCycles):
    rank = {}
    for x in data_dict.keys():
        rank[x] = 0
    rank[user] = 1  # the walk starts from the target user
    step = 0
    while step < maxCycles:
        tmp = {}
        for x in data_dict.keys():
            tmp[x] = 0
        for i, ri in data_dict.items():
            for j in ri.keys():
                if j not in tmp:
                    tmp[j] = 0
                tmp[j] += alpha * rank[i] / (1.0 * len(ri))  # continue the walk
        tmp[user] += (1 - alpha)  # restart mass goes to the start user once
        check = []
        for k in tmp.keys():
            check.append(abs(tmp[k] - rank[k]))  # absolute change per node
        if sum(check) <= 0.0001:
            break
        rank = tmp
        if step % 20 == 0:
            print('NO: ', step)
        step += 1
    return rank
```
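
The same update can be seen end-to-end on a tiny hand-built bipartite dict (the graph below is invented, in the same `U_*`/`D_*` dict-of-dicts shape that `generate_dict` produces):

```python
# a tiny made-up user-item bipartite graph
graph = {
    'U_0': {'D_0': 1, 'D_1': 1},
    'U_1': {'D_1': 1, 'D_2': 1},
    'D_0': {'U_0': 1},
    'D_1': {'U_0': 1, 'U_1': 1},
    'D_2': {'U_1': 1},
}

alpha, user = 0.85, 'U_0'
rank = {node: 0.0 for node in graph}
rank[user] = 1.0
for _ in range(100):
    tmp = {node: 0.0 for node in graph}
    for i, out in graph.items():
        for j in out:
            tmp[j] += alpha * rank[i] / len(out)   # continue the walk
    tmp[user] += 1 - alpha                         # restart at the start user
    rank = tmp

# items the user has not interacted with, ranked by PR value
candidates = {k: v for k, v in rank.items()
              if k.startswith('D_') and k not in graph[user]}
print(sorted(candidates.items(), key=lambda kv: kv[1], reverse=True))
```

Note that the total PR mass stays 1 on every iteration: each node redistributes an $\alpha$ fraction of its mass, and the remaining $1-\alpha$ is re-injected at the start user.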



#### Full code on GitHub: https://github.com/GreenArrow2017/MachineLearning/tree/master/MachineLearning/Recmmended%20System
