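Our goal is to convert the Netflix dataset into the format: user id, item id, rating, timestamp. Before converting, download the Netflix dataset: netflix-prize-data. Click "Download" to get the file archive.zip, then unzip it.
We use only the four combined_data files, so the working directory should contain combined_data_1.txt through combined_data_4.txt.
Now let's start the conversion: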
1. Import the required packages

    from datetime import datetime
    import pandas as pd
    import numpy as np
2. Read in the data from combined_data_1 through combined_data_4

    # Only file 1 is loaded here; uncomment the df2-df4 lines to load the other files.
    df1 = pd.read_csv('./combined_data_1.txt', header=None, names=['user_id', 'rating', 'timestamp'], usecols=[0, 1, 2])  # read combined_data_1
    # df2 = pd.read_csv('./combined_data_2.txt', header=None, names=['user_id', 'rating', 'timestamp'], usecols=[0, 1, 2])  # read combined_data_2
    # df3 = pd.read_csv('./combined_data_3.txt', header=None, names=['user_id', 'rating', 'timestamp'], usecols=[0, 1, 2])  # read combined_data_3
    # df4 = pd.read_csv('./combined_data_4.txt', header=None, names=['user_id', 'rating', 'timestamp'], usecols=[0, 1, 2])  # read combined_data_4

    df1['rating'] = df1['rating'].astype(float)
    # df2['rating'] = df2['rating'].astype(float)
    # df3['rating'] = df3['rating'].astype(float)
    # df4['rating'] = df4['rating'].astype(float)

    print('Dataset 1 shape: {}'.format(df1.shape))
    # print('Dataset 2 shape: {}'.format(df2.shape))
    # print('Dataset 3 shape: {}'.format(df3.shape))
    # print('Dataset 4 shape: {}'.format(df4.shape))
    print('-Dataset examples-')
    print(df1.iloc[::5000000, :])  # show every 5,000,000th row
Output:

    Dataset 1 shape: (24058263, 3)
    -Dataset examples-
              user_id  rating   timestamp
    0              1:     NaN         NaN
    5000000   2560324     4.0  2005-12-06
    10000000  2271935     2.0  2005-04-11
    15000000  1921803     2.0  2005-01-31
    20000000  1933327     3.0  2004-11-10
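To see why row 0 above shows `1:` with NaN in the other columns, here is a minimal sketch on a tiny in-memory sample. The three sample lines are an illustrative stand-in for the real file, and the name `toy` is made up for this example. A marker line such as `1:` has only one field, so pandas fills the `rating` and `timestamp` columns with NaN for that row.

```python
import io
import pandas as pd

# Illustrative excerpt in the combined_data layout (an assumption for this sketch):
# a "movie_id:" marker line followed by "user_id,rating,date" lines.
sample = io.StringIO("1:\n1488844,3,2005-09-06\n822109,5,2005-05-13\n")

toy = pd.read_csv(sample, header=None,
                  names=['user_id', 'rating', 'timestamp'],
                  usecols=[0, 1, 2])

# The marker row has NaN in 'rating'; the real rating rows do not.
print(toy)
print(toy['rating'].isnull().tolist())
```

These NaN rows are what the later steps use to work out which movie each rating belongs to.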
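3. Merge the four datasets and rebuild the index

    df = df1
    # To include the other files, concatenate them and assign the result.
    # (DataFrame.append returned a new frame instead of modifying df,
    # and it was removed in pandas 2.0, so pd.concat is used here.)
    # df = pd.concat([df1, df2, df3, df4])
    df.index = np.arange(0, len(df))  # rebuild a continuous 0..n-1 index

    print('Full dataset shape: {}'.format(df.shape))
    print('-Dataset examples-')
    print(df.iloc[::5000000, :])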
Output:

    Full dataset shape: (24058263, 3)
    -Dataset examples-
              user_id  rating   timestamp
    0              1:     NaN         NaN
    5000000   2560324     4.0  2005-12-06
    10000000  2271935     2.0  2005-04-11
    15000000  1921803     2.0  2005-01-31
    20000000  1933327     3.0  2004-11-10
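If you do load more than one file, the merge can be sketched as follows. The two small frames `part1` and `part2` are made-up stand-ins for the real `df1` and `df2`; the point is only that `pd.concat` returns a new frame, and `ignore_index=True` rebuilds the index in the same call.

```python
import pandas as pd

# Hypothetical stand-ins for two loaded files (marker row + one rating row each).
part1 = pd.DataFrame({'user_id': ['1:', '10'], 'rating': [None, 4.0]})
part2 = pd.DataFrame({'user_id': ['2:', '20'], 'rating': [None, 5.0]})

# pd.concat returns a new DataFrame, so the result must be assigned.
# ignore_index=True replaces the repeated 0,1,0,1 index with 0,1,2,3.
merged = pd.concat([part1, part2], ignore_index=True)

print(merged)
print('Merged shape: {}'.format(merged.shape))
```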
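4. Data cleaning: find the rows whose rating is NaN (the movie-id marker rows such as "1:") and build an item id for every rating row

    df_nan = pd.DataFrame(pd.isnull(df.rating))
    df_nan = df_nan[df_nan['rating'] == True]  # keep only the marker rows
    df_nan = df_nan.reset_index()              # marker positions are now in the 'index' column

    item_np = []
    item_id = 1
    # i is the position of the next marker, j the position of the current one,
    # so i - j - 1 rating rows belong to the current item_id.
    for i, j in zip(df_nan['index'][1:], df_nan['index'][:-1]):
        # use numpy
        temp = np.full((1, i - j - 1), item_id)
        item_np = np.append(item_np, temp)
        item_id += 1

    # handle the last marker: its ratings run to the end of the dataframe
    # use numpy
    last_record = np.full((1, len(df) - df_nan.iloc[-1, 0] - 1), item_id)
    item_np = np.append(item_np, last_record)

    print('Item numpy: {}'.format(item_np))
    print('Length: {}'.format(len(item_np)))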
Output:

    Item numpy: [1.000e+00 1.000e+00 1.000e+00 ... 4.499e+03 4.499e+03 4.499e+03]
    Length: 24053764
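The same item ids can also be derived without an explicit loop. The sketch below is an alternative, not the method used in this post: it counts the marker rows cumulatively, which gives the same result under the same assumption the loop makes, namely that movie ids are sequential and start at 1. The frame `toy` and its values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the loaded data: two movies, three ratings.
toy = pd.DataFrame({
    'user_id': ['1:', '11', '12', '2:', '21'],
    'rating': [np.nan, 3.0, 5.0, np.nan, 4.0],
})

# True on marker rows ("1:", "2:"), False on rating rows.
is_marker = toy['rating'].isnull()

# Running count of markers seen so far = sequential item id of each row.
item_id = is_marker.cumsum()

# Keep the ids of the rating rows only, as a numpy array.
item_np = item_id[~is_marker].to_numpy()
print(item_np)
```

Because the cumulative sum is computed in one pass, this avoids the repeated `np.append` calls, which copy the growing array on every iteration.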
5. Add the item (movie) id to the dataframe

    def time2stamp(cmnttime):  # convert a date string to a Unix timestamp
        cmnttime = datetime.strptime(cmnttime, '%Y-%m-%d')
        # the datetime is naive, so timestamp() interprets it in the local time zone
        stamp = int(datetime.timestamp(cmnttime))
        return stamp

    df = df[pd.notnull(df['rating'])].copy()  # drop the marker rows (rating is NaN)
    df['item_id'] = item_np.astype(int)
    df['user_id'] = df['user_id'].astype(int)
    df = df.loc[:, ['user_id', 'item_id', 'rating', 'timestamp']]  # reorder the columns
    df['timestamp'] = df['timestamp'].astype(str).apply(time2stamp)  # convert dates to timestamps

    print('-Dataset examples-')
    print(df.iloc[::5000000, :])
Output:

    -Dataset examples-
              user_id  item_id  rating   timestamp
    1         1488844        1     3.0  1125936000
    5000996    501954      996     2.0  1093449600
    10001962   404654     1962     5.0  1125244800
    15002876   886608     2876     2.0  1127059200
    20003825  1193835     3825     2.0  1060704000
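One thing to be aware of: `time2stamp` parses a naive datetime, so the resulting numbers depend on the time zone of the machine that runs the script. The value 1125936000 above corresponds to midnight on 2005-09-06 in UTC+8. If you want the same timestamps on every machine, one option is to pin the date to UTC. This is a minimal sketch; the function name `time2stamp_utc` is made up for this example.

```python
from datetime import datetime, timezone

def time2stamp_utc(date_str):
    """Convert a 'YYYY-MM-DD' string to a Unix timestamp at midnight UTC."""
    parsed = datetime.strptime(date_str, '%Y-%m-%d')
    # Attach UTC explicitly so the result does not depend on the local time zone.
    return int(parsed.replace(tzinfo=timezone.utc).timestamp())

print(time2stamp_utc('2005-09-06'))  # 1125964800
```

The UTC value is 28800 seconds (8 hours) larger than the one in the output above, which is exactly the UTC+8 offset.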
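6. Save the dataframe

    # Optional: sort by user id first, then by timestamp.
    # sort_values returns a new frame, so the result must be assigned.
    # df = df.sort_values(by=["user_id", "timestamp"], ascending=[True, True])
    df.to_csv('./ratings.dat', sep=',', index=False, header=False)  # no index column, no header row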
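As a quick sanity check of the output format, the sketch below sorts a small made-up frame, writes it with the same `to_csv` options, and reads it back. It writes to an in-memory buffer instead of `./ratings.dat` so it can be run without touching any files; the frame `toy` and its values are assumptions for illustration only.

```python
import io
import pandas as pd

# Hypothetical converted data in the target column order.
toy = pd.DataFrame({
    'user_id': [7, 3, 7],
    'item_id': [2, 1, 1],
    'rating': [4.0, 5.0, 3.0],
    'timestamp': [300, 100, 200],
})

# Sort by user id, then by timestamp, and keep the sorted result.
toy = toy.sort_values(by=['user_id', 'timestamp'], ascending=[True, True])

# Write without index or header, exactly as in the step above.
buffer = io.StringIO()
toy.to_csv(buffer, sep=',', index=False, header=False)

# Since the file has no header row, the column names are supplied when loading.
buffer.seek(0)
loaded = pd.read_csv(buffer, header=None,
                     names=['user_id', 'item_id', 'rating', 'timestamp'])
print(loaded)
```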
Done.