开发者社区> 吞吞吐吐的> 正文

【笔记4】用pandas实现条目数据格式的推荐算法 (基于用户的协同)

简介:
+关注继续查看
'''
基于用户的协同推荐

条目数据
'''

import pandas as pd
from io import StringIO
import json

#数据类型一:条目(用户、商品、打分)(避免巨型稀疏矩阵)
csv_txt = '''"Angelica","Blues Traveler",3.5
"Angelica","Broken Bells",2.0
"Angelica","Norah Jones",4.5
"Angelica","Phoenix",5.0
"Angelica","Slightly Stoopid",1.5
"Angelica","The Strokes",2.5
"Angelica","Vampire Weekend",2.0
"Bill","Blues Traveler",2.0
"Bill","Broken Bells",3.5
"Bill","Deadmau5",4.0
"Bill","Phoenix",2.0
"Bill","Slightly Stoopid",3.5
"Bill","Vampire Weekend",3.0
"Chan","Blues Traveler",5.0
"Chan","Broken Bells",1.0
"Chan","Deadmau5",1.0
"Chan","Norah Jones",3.0
"Chan","Phoenix",5,
"Chan","Slightly Stoopid",1.0
"Dan","Blues Traveler",3.0
"Dan","Broken Bells",4.0
"Dan","Deadmau5",4.5
"Dan","Phoenix",3.0
"Dan","Slightly Stoopid",4.5
"Dan","The Strokes",4.0
"Dan","Vampire Weekend",2.0
"Hailey","Broken Bells",4.0
"Hailey","Deadmau5",1.0
"Hailey","Norah Jones",4.0
"Hailey","The Strokes",4.0
"Hailey","Vampire Weekend",1.0
"Jordyn","Broken Bells",4.5
"Jordyn","Deadmau5",4.0
"Jordyn","Norah Jones",5.0
"Jordyn","Phoenix",5.0
"Jordyn","Slightly Stoopid",4.5
"Jordyn","The Strokes",4.0
"Jordyn","Vampire Weekend",4.0
"Sam","Blues Traveler",5.0
"Sam","Broken Bells",2.0
"Sam","Norah Jones",3.0
"Sam","Phoenix",5.0
"Sam","Slightly Stoopid",4.0
"Sam","The Strokes",5.0
"Veronica","Blues Traveler",3.0
"Veronica","Norah Jones",5.0
"Veronica","Phoenix",4.0
"Veronica","Slightly Stoopid",2.5
"Veronica","The Strokes",3.0'''

#数据类型二:json数据(用户、商品、打分)
json_txt = '''{"Angelica": {"Blues Traveler": 3.5, "Broken Bells": 2.0,
                      "Norah Jones": 4.5, "Phoenix": 5.0,
                      "Slightly Stoopid": 1.5,
                      "The Strokes": 2.5, "Vampire Weekend": 2.0},
         
         "Bill":{"Blues Traveler": 2.0, "Broken Bells": 3.5,
                 "Deadmau5": 4.0, "Phoenix": 2.0,
                 "Slightly Stoopid": 3.5, "Vampire Weekend": 3.0},
         
         "Chan": {"Blues Traveler": 5.0, "Broken Bells": 1.0,
                  "Deadmau5": 1.0, "Norah Jones": 3.0, "Phoenix": 5,
                  "Slightly Stoopid": 1.0},
         
         "Dan": {"Blues Traveler": 3.0, "Broken Bells": 4.0,
                 "Deadmau5": 4.5, "Phoenix": 3.0,
                 "Slightly Stoopid": 4.5, "The Strokes": 4.0,
                 "Vampire Weekend": 2.0},
         
         "Hailey": {"Broken Bells": 4.0, "Deadmau5": 1.0,
                    "Norah Jones": 4.0, "The Strokes": 4.0,
                    "Vampire Weekend": 1.0},
         
         "Jordyn":  {"Broken Bells": 4.5, "Deadmau5": 4.0,
                     "Norah Jones": 5.0, "Phoenix": 5.0,
                     "Slightly Stoopid": 4.5, "The Strokes": 4.0,
                     "Vampire Weekend": 4.0},
         
         "Sam": {"Blues Traveler": 5.0, "Broken Bells": 2.0,
                 "Norah Jones": 3.0, "Phoenix": 5.0,
                 "Slightly Stoopid": 4.0, "The Strokes": 5.0},
         
         "Veronica": {"Blues Traveler": 3.0, "Norah Jones": 5.0,
                      "Phoenix": 4.0, "Slightly Stoopid": 2.5,
                      "The Strokes": 3.0}
}'''


df = None

#方式一:加载csv数据
def load_csv_txt():
    global df
    df = pd.read_csv(StringIO(csv_txt), header=None, names=['user','goods','rate'])

#方式二:加载json数据(把json读成条目)
def load_json_txt():
    global df
    #由json数据得到字典
    users = json.loads(json_txt)
    
    #遍历字典,得到条目
    csv_txt_ = ''
    for user in users:
        for goods in users[user]:
            csv_txt_ += '{},{},{}\n'.format(user, goods, users[user][goods])
    
    df = pd.read_csv(StringIO(csv_txt_), header=None, names=['user','goods','rate'])


print('测试:读取数据')
#load_csv_txt()
load_json_txt()



def build_xy(user_name1, user_name2):
    df1 = df.ix[df['user'] == user_name1, ['goods','rate']]
    df2 = df.ix[df['user'] == user_name2, ['goods','rate']]
    
    df3 = pd.merge(df1, df2, on='goods', how='inner') #只保留两人都有评分的商品的评分
    
    return df3['rate_x'], df3['rate_y'] #merge之后默认的列名:rate_x,rate_y
    



#曼哈顿距离
def manhattan(user_name1, user_name2):
    x, y = build_xy(user_name1, user_name2)
    return sum(abs(x - y))
    
#欧几里德距离
def euclidean(user_name1, user_name2):
    x, y = build_xy(user_name1, user_name2)
    return sum((x - y)**2)**0.5
    
#闵可夫斯基距离
def minkowski(user_name1, user_name2, r):
    x, y = build_xy(user_name1, user_name2)
    return sum(abs(x - y)**r)**(1/r)
    
#皮尔逊相关系数
def pearson(user_name1, user_name2):
    x, y = build_xy(user_name1, user_name2)
    mean1, mean2 = x.mean(), y.mean()
    #分母
    denominator = (sum((x-mean1)**2)*sum((y-mean2)**2))**0.5
    return [sum((x-mean1)*(y-mean2))/denominator, 0][denominator == 0]

#余弦相似度(数据的稀疏性问题,在文本挖掘中应用得较多)
def cosine(user_name1, user_name2):
    x, y = build_xy(user_name1, user_name2)
    #分母
    denominator = (sum(x*x)*sum(y*y))**0.5
    return [sum(x*y)/denominator, 0][denominator == 0]
    
metric_funcs = {
    'manhattan': manhattan,
    'euclidean': euclidean,
    'minkowski': minkowski,
    'pearson': pearson,
    'cosine': cosine
}


print('\n测试:计算Angelica与Bill的曼哈顿距离')
print(manhattan('Angelica','Bill'))


#计算最近的邻居(返回:pd.Series)
def computeNearestNeighbor(user_name, metric='pearson', k=3, r=2):
    '''
    metric: 度量函数
    k:      返回k个邻居
    r:      闵可夫斯基距离专用
    
    返回:pd.Series,其中index是邻居名称,values是距离
    '''
    array = df[df['user'] != user_name]['user'].unique()
    if metric in ['manhattan', 'euclidean']:
        return pd.Series(array, index=array.tolist()).apply(metric_funcs[metric], args=(user_name,)).nsmallest(k)
    elif metric in ['minkowski']:
        return pd.Series(array, index=array.tolist()).apply(metric_funcs[metric], args=(user_name, r,)).nsmallest(k)
    elif metric in ['pearson', 'cosine']:
        return pd.Series(array, index=array.tolist()).apply(metric_funcs[metric], args=(user_name,)).nlargest(k)
    
    
print('\n测试:计算Hailey的最近邻居')
print(computeNearestNeighbor('Hailey'))


#向给定用户推荐(返回:pd.DataFrame)
def recommend(user_name):
    """返回推荐结果列表"""
    # 找到距离最近的用户名
    nearest_username = computeNearestNeighbor(user_name).index[0]
    
    # 找出这位用户评价过、但自己未曾评价的乐队
    df1 = df.ix[df['user'] == user_name, ['goods', 'rate']]
    df2 = df.ix[df['user'] == nearest_username, ['goods', 'rate']]
    
    df3 = pd.merge(df1, df2, on='goods', how='outer')
    
    return df3.ix[(df3['rate_x'].isnull()) & (df3['rate_y'].notnull()), ['goods', 'rate_y']].sort_values(by='rate_y')

    
print('\n测试:为Hailey做推荐')
print(recommend('Hailey'))


#向给定用户推荐(返回:pd.Series)
def recommend2(user_name, metric='pearson', k=3, n=5, r=2):
    '''
    metric: 度量函数
    k:      根据k个最近邻居,协同推荐
    r:      闵可夫斯基距离专用
    n:      推荐的商品数目
    
    返回:pd.Series,其中index是商品名称,values是加权评分
    '''
    # 找到距离最近的k个邻居
    nearest_neighbors = computeNearestNeighbor(user_name, metric='pearson', k=k, r=r)
    
    # 计算权值
    if metric in ['manhattan', 'euclidean', 'minkowski']: # 距离越小,越类似
        nearest_neighbors = 1 / nearest_neighbors # 所以,取倒数(或者别的减函数,如:y=2**-x)
    elif metric in ['pearson', 'cosine']:                 # 距离越大,越类似
        pass
        
    nearest_neighbors = nearest_neighbors / nearest_neighbors.sum() #已经变为权值
    
    # 逐个邻居找出其评价过、但自己未曾评价的乐队(或商品)的评分,并乘以权值
    neighbors_rate_with_weight = []
    for neighbor_name in nearest_neighbors.index:
        # 每个结果:pd.Series,其中index是商品名称,values是评分(已乘权值)
        df1 = df.ix[df['user'] == user_name, ['goods', 'rate']]
        df2 = df.ix[df['user'] == neighbor_name, ['goods', 'rate']]
        
        df3 = pd.merge(df1, df2, on='goods', how='outer')
        
        df4 =  df3.ix[(df3['rate_x'].isnull()) & (df3['rate_y'].notnull()), ['goods', 'rate_y']]
        
        #注意这中间有一个转化为pd.Series的操作!
        neighbors_rate_with_weight.append(pd.Series(df4['rate_y'].tolist(), index=df4['goods']) * nearest_neighbors[neighbor_name])

    # 把邻居们的加权评分拼接成pd.DataFrame,按列累加,取最大的前n个商品的评分
    return pd.concat(neighbors_rate_with_weight, axis=1).sum(axis=1, skipna=True).nlargest(n) # 黑科技!
    
print('\n测试:为Hailey做推荐')
print(recommend2('Hailey', metric='manhattan', k=3, n=5))

print('\n测试:为Hailey做推荐')
print(recommend2('Hailey', metric='euclidean', k=3, n=5, r=2))

print('\n测试:为Hailey做推荐')
print(recommend2('Hailey', metric='pearson', k=1, n=5))
本文转自罗兵博客园博客,原文链接:http://www.cnblogs.com/hhh5460/p/6121899.html,如需转载请自行联系原作者

版权声明:本文内容由阿里云实名注册用户自发贡献,版权归原作者所有,阿里云开发者社区不拥有其著作权,亦不承担相应法律责任。具体规则请查看《阿里云开发者社区用户服务协议》和《阿里云开发者社区知识产权保护指引》。如果您发现本社区中有涉嫌抄袭的内容,填写侵权投诉表单进行举报,一经查实,本社区将立刻删除涉嫌侵权内容。

相关文章
TikTok「魔力」推荐算法坐收一亿用户,不含算法的TikTok一文不值
TikTok的收购再添疑云,有个重要原因就是核心算法无法出售了。美国的竞购方在探讨自己攻关的可能性,但专家经过深入分析后表示,无法复制出TikTok的算法魔力。
53 0
ML之K-means:基于(完整的)手写数字图片识别数据集利用K-means算法实现图片聚类
ML之K-means:基于(完整的)手写数字图片识别数据集利用K-means算法实现图片聚类
145 0
js基本搜索算法实现与170万条数据下的性能测试
今天让我们来继续聊一聊js算法,通过接下来的讲解,我们可以了解到搜索算法的基本实现以及各种实现方法的性能,进而发现for循环,forEach,While的性能差异,我们还会了解到如何通过web worker做算法分片,极大的提高算法的性能。 同时我还会简单介绍一下经典的二分算法,哈希表查找算法,但这些不是本章的重点,之后我会推出相应的文章详细介绍这些高级算法,感兴趣的朋友可以关注我的专栏,或一起探讨。
16 0
《数据结构和算法基础(Java语言实现)》学习笔记
《数据结构和算法基础(Java语言实现)》一书由北京大学出版社出版,已经于近日上市。拿到了样书,第一时间希望与读者朋友们分享下这本书里面的内容。
46 0
阿里云服务器如何登录?阿里云服务器的三种登录方法
购买阿里云ECS云服务器后如何登录?场景不同,大概有三种登录方式:
8742 0
ML之CatboostC:基于titanic泰坦尼克数据集利用catboost算法实现二分类
ML之CatboostC:基于titanic泰坦尼克数据集利用catboost算法实现二分类
82 0
阿里云服务器怎么设置密码?怎么停机?怎么重启服务器?
如果在创建实例时没有设置密码,或者密码丢失,您可以在控制台上重新设置实例的登录密码。本文仅描述如何在 ECS 管理控制台上修改实例登录密码。
19395 0
ML之H-Clusters:基于H-Clusters算法利用电影数据集实现对top 100电影进行文档分类
ML之H-Clusters:基于H-Clusters算法利用电影数据集实现对top 100电影进行文档分类
54 0
阿里云服务器端口号设置
阿里云服务器初级使用者可能面临的问题之一. 使用tomcat或者其他服务器软件设置端口号后,比如 一些不是默认的, mysql的 3306, mssql的1433,有时候打不开网页, 原因是没有在ecs安全组去设置这个端口号. 解决: 点击ecs下网络和安全下的安全组 在弹出的安全组中,如果没有就新建安全组,然后点击配置规则 最后如上图点击添加...或快速创建.   have fun!  将编程看作是一门艺术,而不单单是个技术。
17859 0
4852
文章
0
问答
文章排行榜
最热
最新
相关电子书
更多
OceanBase 入门到实战教程
立即下载
阿里云图数据库GDB,加速开启“图智”未来.ppt
立即下载
实时数仓Hologres技术实战一本通2.0版(下)
立即下载