Spark Machine Learning 2 · Preparing Data (pyspark)

Introduction: ![](http://img3.douban.com/lpic/s28277325.jpg) [Spark机器学习](http://book.douban.com/subject/26593179/)


Spark机器学习

Setting up the environment

anaconda

nano ~/.zshrc
export PATH=$PATH:/anaconda/bin
source ~/.zshrc
echo $HOME
echo $PATH

ipython

conda update conda && conda update ipython ipython-notebook ipython-qtconsole
conda install scipy

PYTHONPATH

export SPARK_HOME=/Users/erichan/garden/spark-1.5.1-bin-hadoop2.6
export PYTHONPATH=${SPARK_HOME}/python/:${SPARK_HOME}/python/lib/py4j-0.8.2.1-src.zip
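
To confirm the exports took effect, PySpark's modules should be importable from a plain Python shell. A minimal check, assuming the paths above:

```python
# should succeed silently if PYTHONPATH points at the Spark distribution
from pyspark import SparkContext
```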

Runtime environment

cd $SPARK_HOME

IPYTHON=1 IPYTHON_OPTS="--pylab" ./bin/pyspark
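
The shell creates a SparkContext named `sc` on startup. A quick smoke test to verify the context works:

```python
# runs a trivial job on the local cluster; the sum of 0..99 is 4950
sc.parallelize(range(100)).sum()
```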

Data

1. Getting the raw data

PATH = "/Users/erichan/sourcecode/book/Spark机器学习"
user_data = sc.textFile("%s/ml-100k/u.user" % PATH)
user_fields = user_data.map(lambda line: line.split("|"))
movie_data = sc.textFile("%s/ml-100k/u.item" % PATH)
movie_fields = movie_data.map(lambda lines: lines.split("|"))
rating_data_raw = sc.textFile("%s/ml-100k/u.data" % PATH)
rating_data = rating_data_raw.map(lambda line: line.split("\t"))
num_movies = movie_data.count()
print num_movies

1682

user_data.first()

u'1|24|M|technician|85711'

movie_data.first()

u'1|Toy Story (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?Toy%20Story%20(1995)|0|0|0|1|1|1|0|0|0|0|0|0|0|0|0|0|0|0|0'

rating_data_raw.first()

u'196\t242\t3\t881250949'

2. Exploring the data

2.1. Per-column statistics
num_users = user_fields.map(lambda fields: fields[0]).count()
num_genders = user_fields.map(lambda fields: fields[2]).distinct().count()
num_occupations = user_fields.map(lambda fields: fields[3]).distinct().count()
num_zipcodes = user_fields.map(lambda fields: fields[4]).distinct().count()

ratings = rating_data.map(lambda fields: int(fields[2]))
num_ratings = ratings.count()
max_rating = ratings.reduce(lambda x, y: max(x, y))
min_rating = ratings.reduce(lambda x, y: min(x, y))
mean_rating = ratings.reduce(lambda x, y: x + y) / float(num_ratings)
median_rating = np.median(ratings.collect())
ratings_per_user = num_ratings / num_users
ratings_per_movie = num_ratings / num_movies
print "Users: %d, genders: %d, occupations: %d, ZIP codes: %d" % (num_users, num_genders, num_occupations, num_zipcodes)

Users: 943, genders: 2, occupations: 21, ZIP codes: 795

print "Min rating: %d" % min_rating

Min rating: 1

print "Max rating: %d" % max_rating

Max rating: 5

print "Average rating: %2.2f" % mean_rating

Average rating: 3.53

print "Median rating: %d" % median_rating

Median rating: 4

print "Average # of ratings per user: %2.2f" % ratings_per_user

Average # of ratings per user: 106.00

print "Average # of ratings per movie: %2.2f" % ratings_per_movie

Average # of ratings per movie: 59.00

ratings.stats()

(count: 100000, mean: 3.52986, stdev: 1.12566797076, max: 5, min: 1)
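
Note that `np.median(ratings.collect())` pulls every rating onto the driver. For larger datasets the median can be derived from `countByValue`, since only the per-value counts need to come back to the driver. A minimal sketch of that approach, reusing `ratings` and `num_ratings` from above:

```python
# walk the cumulative counts of the sorted rating values up to the midpoint
counts = sorted(ratings.countByValue().items())
cumulative, midpoint = 0, num_ratings / 2
for value, count in counts:
    cumulative += count
    if cumulative > midpoint:
        print "Median rating: %d" % value   # prints 4 on ml-100k
        break
```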

2.2. Plotting histograms with matplotlib's hist function
ages = user_fields.map(lambda x: int(x[1])).collect()
hist(ages, bins=20, color='lightblue', normed=True)
fig = matplotlib.pyplot.gcf()
fig.set_size_inches(16, 10)

(figure 3_2: histogram of user ages)

count_by_rating = ratings.countByValue()
x_axis = np.array(count_by_rating.keys())
y_axis = np.array([float(c) for c in count_by_rating.values()])
# we normalize the y-axis here to percentages
y_axis_normed = y_axis / y_axis.sum()

pos = np.arange(len(x_axis))
width = 1.0

ax = plt.axes()
ax.set_xticks(pos + (width / 2))
ax.set_xticklabels(x_axis)

plt.bar(pos, y_axis_normed, width, color='lightblue')
plt.xticks(rotation=30)
fig = matplotlib.pyplot.gcf()
fig.set_size_inches(16, 10)

(figure 3_5: bar chart of normalized rating counts)

count_by_occupation = user_fields.map(lambda fields: (fields[3], 1)).reduceByKey(lambda x, y: x + y).collect()
x_axis1 = np.array([c[0] for c in count_by_occupation])
y_axis1 = np.array([c[1] for c in count_by_occupation])
x_axis = x_axis1[np.argsort(y_axis1)]
y_axis = y_axis1[np.argsort(y_axis1)]

pos = np.arange(len(x_axis))
width = 1.0

ax = plt.axes()
ax.set_xticks(pos + (width / 2))
ax.set_xticklabels(x_axis)

plt.bar(pos, y_axis, width, color='lightblue')
plt.xticks(rotation=30)
fig = matplotlib.pyplot.gcf()
fig.set_size_inches(16, 10)

(figure 3_3: bar chart of user counts per occupation)

2.3. Counting with the countByValue function
count_by_occupation2 = user_fields.map(lambda fields: fields[3]).countByValue()
print "Map-reduce approach:"
print dict(count_by_occupation2)

{u'administrator': 79, u'retired': 14, u'lawyer': 12, u'healthcare': 16, u'marketing': 26, u'executive': 32, u'scientist': 31, u'student': 196, u'technician': 27, u'librarian': 51, u'programmer': 66, u'salesman': 12, u'homemaker': 7, u'engineer': 67, u'none': 9, u'doctor': 7, u'writer': 45, u'entertainment': 18, u'other': 105, u'educator': 95, u'artist': 28}

print ""
print "countByValue approach:"
print dict(count_by_occupation)

{u'administrator': 79, u'writer': 45, u'retired': 14, u'lawyer': 12, u'doctor': 7, u'marketing': 26, u'executive': 32, u'none': 9, u'entertainment': 18, u'healthcare': 16, u'scientist': 31, u'student': 196, u'educator': 95, u'technician': 27, u'librarian': 51, u'programmer': 66, u'artist': 28, u'salesman': 12, u'other': 105, u'homemaker': 7, u'engineer': 67}

2.4. Using the filter transformation
def convert_year(x):
    try:
        return int(x[-4:])
    except:
        # there is a 'bad' data point with a blank year,
        # which we flag with 1900 and filter out below
        return 1900

years = movie_fields.map(lambda fields: fields[2]).map(lambda x: convert_year(x))
years_filtered = years.filter(lambda x: x != 1900)
movie_ages = years_filtered.map(lambda yr: 1998-yr).countByValue()
values = movie_ages.values()
bins = movie_ages.keys()
hist(values, bins=bins, color='lightblue', normed=True)

(array([ 0.        ,  0.07575758,  0.09090909,  0.09090909,  0.18181818,
         0.18181818,  0.04545455,  0.07575758,  0.07575758,  0.03030303,
         0.        ,  0.01515152,  0.01515152,  0.03030303,  0.        ,
         0.03030303,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        ,  0.        ,  0.01515152,  0.        ,  0.01515152,
         0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        ,  0.        ,  0.01515152,  0.        ,  0.        ,
         0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         0.01515152,  0.        ,  0.        ,  0.        ,  0.        ]),
 array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
        17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33,
        34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50,
        51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67,
        68, 72, 76]),
 <a list of 70 Patch objects>)

fig = matplotlib.pyplot.gcf()
fig.set_size_inches(16,10)

(figure 3_4: histogram of movie ages)

2.5. Grouping with groupByKey
# to compute the distribution of ratings per user, we first group the ratings by user id
user_ratings_grouped = rating_data.map(lambda fields: (int(fields[0]), int(fields[2]))).groupByKey()
# then, for each key (user id), we find the size of the set of ratings, which gives us the # ratings for that user
user_ratings_byuser = user_ratings_grouped.map(lambda (k, v): (k, len(v)))
user_ratings_byuser.take(5)

[(2, 62), (4, 24), (6, 211), (8, 59), (10, 184)]
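
`groupByKey` shuffles every rating value just so we can take the length of each group. When only the counts are needed, a `reduceByKey`-based count avoids materializing the grouped lists. A sketch of the equivalent computation (the name `user_ratings_counts` is illustrative):

```python
# count ratings per user without building the per-user value lists
user_ratings_counts = rating_data.map(lambda fields: (int(fields[0]), 1)) \
                                 .reduceByKey(lambda x, y: x + y)
user_ratings_counts.take(5)
```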

user_ratings_byuser_local = user_ratings_byuser.map(lambda (k, v): v).collect()
hist(user_ratings_byuser_local, bins=200, color='lightblue', normed=True)
fig = matplotlib.pyplot.gcf()
fig.set_size_inches(16,10)

(figure 3_6: histogram of ratings per user)

3. Processing and transforming the data

3.1. Filling in missing values
years_pre_processed = movie_fields.map(lambda fields: fields[2]).map(lambda x: convert_year(x)).collect()
years_pre_processed_array = np.array(years_pre_processed)
# first we compute the mean and median year of release, without the 'bad' data point
mean_year = np.mean(years_pre_processed_array[years_pre_processed_array!=1900])
median_year = np.median(years_pre_processed_array[years_pre_processed_array!=1900])
idx_bad_data = np.where(years_pre_processed_array==1900)[0]
years_pre_processed_array[idx_bad_data] = median_year
print "Mean year of release: %d" % mean_year

Mean year of release: 1989

print "Median year of release: %d" % median_year

Median year of release: 1995

print "Index of '1900' after assigning median: %s" % np.where(years_pre_processed_array == 1900)[0]

Index of '1900' after assigning median: []
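
The replacement above happens in a local NumPy array. The same fill can stay inside the RDD pipeline, which scales to data that does not fit on the driver. A minimal sketch, reusing `years` from section 2.4 and `median_year` from above (the name `years_filled` is illustrative):

```python
# replace the 1900 sentinel with the median year inside the RDD
years_filled = years.map(lambda yr: int(median_year) if yr == 1900 else yr)
```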

4. Extracting features

4.1. Categorical features (nominal and ordinal variables)
all_occupations = user_fields.map(lambda fields: fields[3]).distinct().collect()
all_occupations.sort()
# create a new dictionary to hold the occupations, and assign the "1-of-k" indexes
idx = 0
all_occupations_dict = {}
for o in all_occupations:
    all_occupations_dict[o] = idx
    idx += 1

# try a few examples to see what "1-of-k" encoding is assigned
print "Encoding of 'doctor': %d" % all_occupations_dict['doctor']
print "Encoding of 'programmer': %d" % all_occupations_dict['programmer']

Encoding of 'doctor': 2
Encoding of 'programmer': 14
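
The index-assignment loop can also be written with `enumerate`, which builds the same dictionary in one line:

```python
# same 1-of-k index assignment as the loop above
all_occupations_dict = dict((o, i) for i, o in enumerate(all_occupations))
```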

NumPy's zeros function
K = len(all_occupations_dict)
binary_x = np.zeros(K)
k_programmer = all_occupations_dict['programmer']
binary_x[k_programmer] = 1
print "Binary feature vector: %s" % binary_x
print "Length of binary vector: %d" % K

Binary feature vector: [ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  1.  0.  0.  0.
  0.  0.  0.]
Length of binary vector: 21
4.2. Derived features
Converting timestamps into categorical features
def extract_datetime(ts):
    import datetime
    return datetime.datetime.fromtimestamp(ts)

def assign_tod(hr):
    times_of_day = {
        'morning' : range(7, 12),
        'lunch' : range(12, 15),
        'afternoon' : range(15, 18),
        'evening' : range(18, 23),
        'night' : [23, 0, 1, 2, 3, 4, 5, 6]  # hours wrap past midnight; 7 already belongs to 'morning'
    }
    for k, v in times_of_day.iteritems():
        if hr in v:
            return k

timestamps = rating_data.map(lambda fields: int(fields[3]))
hour_of_day = timestamps.map(lambda ts: extract_datetime(ts).hour)
# now apply the "time of day" function to the "hour of day" RDD
time_of_day = hour_of_day.map(lambda hr: assign_tod(hr))
timestamps.take(5)

[881250949, 891717742, 878887116, 880606923, 886397596]

hour_of_day.take(5)

[23, 3, 15, 13, 13]

time_of_day.take(5)

['night', 'night', 'afternoon', 'lunch', 'lunch']
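
The resulting `time_of_day` categories are themselves nominal variables, so they can be 1-of-k encoded with the same dictionary trick used for occupations. A sketch (the names `tod_dict` and `time_of_day_indexed` are illustrative, and the index order is arbitrary since it depends on `distinct`'s output order):

```python
# assign each time-of-day category a numeric index
tod_dict = dict((tod, i) for i, tod in enumerate(time_of_day.distinct().collect()))
time_of_day_indexed = time_of_day.map(lambda tod: tod_dict[tod])
time_of_day_indexed.take(5)
```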

4.3. Text features
def extract_title(raw):
    import re
    grps = re.search(r"\((\w+)\)", raw)
    if grps:
        return raw[:grps.start()].strip()
    else:
        return raw

raw_titles = movie_fields.map(lambda fields: fields[1])
for raw_title in raw_titles.take(5):
    print extract_title(raw_title)

Toy Story
GoldenEye
Four Rooms
Get Shorty
Copycat

movie_titles = raw_titles.map(lambda m: extract_title(m))
# next we tokenize the titles into terms. We'll use simple whitespace tokenization
title_terms = movie_titles.map(lambda t: t.split(" "))
print title_terms.take(5)

[[u'Toy', u'Story'], [u'GoldenEye'], [u'Four', u'Rooms'], [u'Get', u'Shorty'], [u'Copycat']]

flatMap
all_terms = title_terms.flatMap(lambda x: x).distinct().collect()
# create a new dictionary to hold the terms, and assign the "1-of-k" indexes
idx = 0
all_terms_dict = {}
for term in all_terms:
    all_terms_dict[term] = idx
    idx += 1

num_terms = len(all_terms_dict)
print "Total number of terms: %d" % num_terms

Total number of terms: 2645

print "Index of term 'Dead': %d" % all_terms_dict['Dead']

Index of term 'Dead': 147

print "Index of term 'Rooms': %d" % all_terms_dict['Rooms']

Index of term 'Rooms': 1963

zipWithIndex
all_terms_dict2 = title_terms.flatMap(lambda x: x).distinct().zipWithIndex().collectAsMap()
print "Index of term 'Dead': %d" % all_terms_dict2['Dead']
print "Index of term 'Rooms': %d" % all_terms_dict2['Rooms']

Index of term 'Dead': 147
Index of term 'Rooms': 1963

Creating sparse vectors / broadcast variables

Note: scipy must be importable by the workers, i.e. available on the $PYTHONPATH set up earlier.

def create_vector(terms, term_dict):
    from scipy import sparse as sp
    x = sp.csc_matrix((1, num_terms))
    for t in terms:
        if t in term_dict:
            idx = term_dict[t]
            x[0, idx] = 1
    return x

all_terms_bcast = sc.broadcast(all_terms_dict)
term_vectors = title_terms.map(lambda terms: create_vector(terms, all_terms_bcast.value))
term_vectors.take(5)

[<1x2645 sparse matrix of type '<type 'numpy.float64'>'
	with 1 stored elements in Compressed Sparse Column format>,
 <1x2645 sparse matrix of type '<type 'numpy.float64'>'
	with 1 stored elements in Compressed Sparse Column format>,
 <1x2645 sparse matrix of type '<type 'numpy.float64'>'
	with 1 stored elements in Compressed Sparse Column format>,
 <1x2645 sparse matrix of type '<type 'numpy.float64'>'
	with 1 stored elements in Compressed Sparse Column format>,
 <1x2645 sparse matrix of type '<type 'numpy.float64'>'
	with 1 stored elements in Compressed Sparse Column format>]
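
Element-wise assignment into a `csc_matrix` triggers scipy's `SparseEfficiencyWarning`, because CSC is not designed for incremental construction. A variant of the same function using `lil_matrix`, the format intended for exactly this pattern (the name `create_vector_lil` is illustrative):

```python
def create_vector_lil(terms, term_dict):
    from scipy import sparse as sp
    # lil_matrix supports cheap per-element assignment;
    # convert to CSC once the vector is fully built
    x = sp.lil_matrix((1, num_terms))
    for t in terms:
        if t in term_dict:
            x[0, term_dict[t]] = 1
    return x.tocsc()
```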
4.4. Normalizing features
np.random.seed(42)
x = np.random.randn(10)
norm_x_2 = np.linalg.norm(x)
normalized_x = x / norm_x_2
print "x:\n%s" % x
print "2-Norm of x: %2.4f" % norm_x_2
print "Normalized x:\n%s" % normalized_x
print "2-Norm of normalized_x: %2.4f" % np.linalg.norm(normalized_x)

x:
[ 0.49671415 -0.1382643 0.64768854 1.52302986 -0.23415337 -0.23413696
1.57921282 0.76743473 -0.46947439 0.54256004]
2-Norm of x: 2.5908
Normalized x:
[ 0.19172213 -0.05336737 0.24999534 0.58786029 -0.09037871 -0.09037237
0.60954584 0.29621508 -0.1812081 0.20941776]
2-Norm of normalized_x: 1.0000

from pyspark.mllib.feature import Normalizer
normalizer = Normalizer()
vector = sc.parallelize([x])
normalized_x_mllib = normalizer.transform(vector).first().toArray()

print "x:\n%s" % x
print "2-Norm of x: %2.4f" % norm_x_2
print "Normalized x MLlib:\n%s" % normalized_x_mllib
print "2-Norm of normalized_x_mllib: %2.4f" % np.linalg.norm(normalized_x_mllib)

x:
[ 0.49671415 -0.1382643 0.64768854 1.52302986 -0.23415337 -0.23413696
1.57921282 0.76743473 -0.46947439 0.54256004]
2-Norm of x: 2.5908
Normalized x MLlib:
[ 0.19172213 -0.05336737 0.24999534 0.58786029 -0.09037871 -0.09037237
0.60954584 0.29621508 -0.1812081 0.20941776]
2-Norm of normalized_x_mllib: 1.0000
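
`Normalizer` defaults to the 2-norm, but other p-norms are available through its constructor. A short usage sketch, reusing `vector` from above (the name `l1_normalizer` is illustrative):

```python
from pyspark.mllib.feature import Normalizer

# p=1.0 scales each vector by its L1 norm instead of the default L2
l1_normalizer = Normalizer(p=1.0)
normalized_x_l1 = l1_normalizer.transform(vector).first().toArray()
```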
