Python for Data Analysis (PYDA), 3rd Edition, Part 6 (3)

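13. Data Analysis Examples

Original: wesmckinney.com/book/data-analysis-examples

Translator: 飞龙

License: CC BY-NC-SA 4.0

This open-access web version of Python for Data Analysis, 3rd Edition is now available as a companion to the print and digital editions. If you find any errata, please report them on the original site. Note that some aspects of this site, which is generated by Quarto, are formatted differently from O'Reilly's print and e-book versions.

If you find the online edition of the book useful, please consider ordering a paper copy or a DRM-free e-book to support the author. The content of this website may not be copied or reproduced. The code examples are MIT licensed and can be found on GitHub or Gitee.

Now that we have reached the final chapter of this book, we will look at a number of real-world datasets. For each dataset, we will use the techniques presented in this book to extract meaning from the raw data. The techniques demonstrated can be applied to all kinds of other datasets. This chapter contains a collection of miscellaneous example datasets that you can use to practice with the tools in this book.

The example datasets can be found in the book's accompanying GitHub repository. If you cannot access GitHub, you can also get them from the repository mirror on Gitee.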

13.1 Bitly Data from 1.USA.gov

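In 2011, the URL shortening service Bitly partnered with the US government website USA.gov to provide a feed of anonymous data gathered from users who shorten links ending with .gov or .mil. In 2011, a live feed as well as hourly snapshots were available as downloadable text files. At the time of writing (2022), this service has been shut down, but one of the data files has been preserved for the book's examples.

In the hourly snapshots, each line in each file contains a common form of web data known as JSON, which stands for JavaScript Object Notation. For example, if we read just the first line of a file, we may see something like this: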

In [5]: path = "datasets/bitly_usagov/example.txt"
In [6]: with open(path) as f:
 ...:     print(f.readline())
 ...:
{ "a": "Mozilla\\/5.0 (Windows NT 6.1; WOW64) AppleWebKit\\/535.11
(KHTML, like Gecko) Chrome\\/17.0.963.78 Safari\\/535.11", "c": "US", "nk": 1,
"tz": "America\\/New_York", "gr": "MA", "g": "A6qOVH", "h": "wfLQtf", "l":
"orofrog", "al": "en-US,en;q=0.8", "hh": "1.usa.gov", "r":
"http:\\/\\/www.facebook.com\\/l\\/7AQEFzjSi\\/1.usa.gov\\/wfLQtf", "u":
"http:\\/\\/www.ncbi.nlm.nih.gov\\/pubmed\\/22415991", "t": 1331923247, "hc":
1331822918, "cy": "Danvers", "ll": [ 42.576698, -70.954903 ] }

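Python has both built-in and third-party libraries for converting a JSON string into a Python dictionary. Here we will use the json module and its loads function, invoked on each line of the sample file we downloaded:

import json

with open(path) as f:
    records = [json.loads(line) for line in f]

The resulting object records is now a list of Python dictionaries: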

In [18]: records[0]
Out[18]:
{'a': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like Gecko)
Chrome/17.0.963.78 Safari/535.11',
 'al': 'en-US,en;q=0.8',
 'c': 'US',
 'cy': 'Danvers',
 'g': 'A6qOVH',
 'gr': 'MA',
 'h': 'wfLQtf',
 'hc': 1331822918,
 'hh': '1.usa.gov',
 'l': 'orofrog',
 'll': [42.576698, -70.954903],
 'nk': 1,
 'r': 'http://www.facebook.com/l/7AQEFzjSi/1.usa.gov/wfLQtf',
 't': 1331923247,
 'tz': 'America/New_York',
 'u': 'http://www.ncbi.nlm.nih.gov/pubmed/22415991'}
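
As a small self-contained illustration of this step, the sketch below parses a few JSON lines held in memory instead of the book's data file. The sample lines and the name sample_records are invented for illustration and are not taken from the dataset.

```python
import json

# Hypothetical sample lines standing in for the data file (not real records).
lines = [
    '{"tz": "America/New_York", "nk": 1}',
    '{"tz": "", "nk": 0}',
    '{"_heartbeat_": 1331923261}',
]

# json.loads converts each JSON string into a Python dictionary.
sample_records = [json.loads(line) for line in lines]

print(type(sample_records[0]))
print(sample_records[0]["tz"])
```

Note that the third line has no tz key at all, which mirrors the kind of record that matters in the next step.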

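Counting Time Zones in Pure Python

Suppose we are interested in finding the time zones that occur most often in the dataset (the tz field). There are many ways we could do this. First, let's extract a list of time zones, again using a list comprehension: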

In [15]: time_zones = [rec["tz"] for rec in records]
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-15-abdeba901c13> in <module>
----> 1 time_zones = [rec["tz"] for rec in records]
<ipython-input-15-abdeba901c13> in <listcomp>(.0)
----> 1 time_zones = [rec["tz"] for rec in records]
KeyError: 'tz'

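Oops! It turns out that not all of the records have a time zone field. We can handle this by adding the check if "tz" in rec at the end of the list comprehension: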

In [16]: time_zones = [rec["tz"] for rec in records if "tz" in rec]
In [17]: time_zones[:10]
Out[17]: 
['America/New_York',
 'America/Denver',
 'America/New_York',
 'America/Sao_Paulo',
 'America/New_York',
 'America/New_York',
 'Europe/Warsaw',
 '',
 '',
 '']
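
Filtering is not the only way to avoid the KeyError. As a hedged alternative that the text itself does not use, dict.get can supply a default for records that lack the key. The records and the "Missing" label below are hypothetical.

```python
# Hypothetical records; the third one has no "tz" key at all.
sample_records = [
    {"tz": "America/New_York"},
    {"tz": ""},
    {"_heartbeat_": 1331923261},
]

# Filtering, as in the text: records without "tz" are dropped.
filtered = [rec["tz"] for rec in sample_records if "tz" in rec]

# Alternative: dict.get returns a default instead of raising KeyError,
# so every record contributes one entry.
with_default = [rec.get("tz", "Missing") for rec in sample_records]

print(filtered)      # ['America/New_York', '']
print(with_default)  # ['America/New_York', '', 'Missing']
```

The two approaches answer slightly different questions: filtering counts only records that carry a tz field, while the default keeps one entry per record.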

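Just looking at the first 10 time zones, we see that some of them are unknown (empty string). You can filter these out also, but I'll leave them in for now. Next, to produce counts by time zone, I'll show two approaches: a harder way (using just the Python standard library) and a simpler way (using pandas). One way to do the counting is to use a dictionary to store counts while we iterate through the time zones:

def get_counts(sequence):
    counts = {}
    for x in sequence:
        if x in counts:
            counts[x] += 1
        else:
            counts[x] = 1
    return counts

Using more advanced tools in the Python standard library, you can write the same thing more briefly:

from collections import defaultdict

def get_counts2(sequence):
    counts = defaultdict(int)  # values will initialize to 0
    for x in sequence:
        counts[x] += 1
    return counts

I put this logic in a function just to make it more reusable. To use it on the time zones, just pass the time_zones list: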

In [20]: counts = get_counts(time_zones)
In [21]: counts["America/New_York"]
Out[21]: 1251
In [22]: len(time_zones)
Out[22]: 3440

If we want the top 10 time zones and their counts, we can build a list of (count, timezone) tuples and sort it:

def top_counts(count_dict, n=10):
    value_key_pairs = [(count, tz) for tz, count in count_dict.items()]
    value_key_pairs.sort()
    return value_key_pairs[-n:]

We then have:

In [24]: top_counts(counts)
Out[24]: 
[(33, 'America/Sao_Paulo'),
 (35, 'Europe/Madrid'),
 (36, 'Pacific/Honolulu'),
 (37, 'Asia/Tokyo'),
 (74, 'Europe/London'),
 (191, 'America/Denver'),
 (382, 'America/Los_Angeles'),
 (400, 'America/Chicago'),
 (521, ''),
 (1251, 'America/New_York')]
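
The result above is in ascending order because the list is sorted and then sliced from the end. As a sketch of a variation (not the book's function), heapq.nlargest from the standard library returns the largest entries first without sorting the whole list. The name top_counts_desc and the sample counts are made up for illustration.

```python
import heapq

# Hypothetical counts, small enough to check by eye.
sample_counts = {"America/New_York": 3, "": 2, "Europe/London": 1}

def top_counts_desc(count_dict, n=10):
    # Return the n (timezone, count) pairs with the largest counts,
    # largest first.
    return heapq.nlargest(n, count_dict.items(), key=lambda pair: pair[1])

print(top_counts_desc(sample_counts, n=2))
# [('America/New_York', 3), ('', 2)]
```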

If you search the Python standard library, you may find the collections.Counter class, which makes this task even simpler:

In [25]: from collections import Counter
In [26]: counts = Counter(time_zones)
In [27]: counts.most_common(10)
Out[27]: 
[('America/New_York', 1251),
 ('', 521),
 ('America/Chicago', 400),
 ('America/Los_Angeles', 382),
 ('America/Denver', 191),
 ('Europe/London', 74),
 ('Asia/Tokyo', 37),
 ('Pacific/Honolulu', 36),
 ('Europe/Madrid', 35),
 ('America/Sao_Paulo', 33)]

Counting Time Zones with pandas

You can create a DataFrame from the original set of records by passing the list of records to pandas.DataFrame:

In [28]: frame = pd.DataFrame(records)

We can look at some basic information about this new DataFrame, such as column names, inferred column types, and the number of missing values, using frame.info():

In [29]: frame.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3560 entries, 0 to 3559
Data columns (total 18 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   a            3440 non-null   object 
 1   c            2919 non-null   object 
 2   nk           3440 non-null   float64
 3   tz           3440 non-null   object 
 4   gr           2919 non-null   object 
 5   g            3440 non-null   object 
 6   h            3440 non-null   object 
 7   l            3440 non-null   object 
 8   al           3094 non-null   object 
 9   hh           3440 non-null   object 
 10  r            3440 non-null   object 
 11  u            3440 non-null   object 
 12  t            3440 non-null   float64
 13  hc           3440 non-null   float64
 14  cy           2919 non-null   object 
 15  ll           2919 non-null   object 
 16  _heartbeat_  120 non-null    float64
 17  kw           93 non-null     object 
dtypes: float64(4), object(14)
memory usage: 500.8+ KB
In [30]: frame["tz"].head()
Out[30]: 
0     America/New_York
1       America/Denver
2     America/New_York
3    America/Sao_Paulo
4     America/New_York
Name: tz, dtype: object

The output shown for frame is the summary view, which is displayed for large DataFrame objects. We can then use the value_counts method of Series:

In [31]: tz_counts = frame["tz"].value_counts()
In [32]: tz_counts.head()
Out[32]: 
tz
America/New_York       1251
                        521
America/Chicago         400
America/Los_Angeles     382
America/Denver          191
Name: count, dtype: int64

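We can visualize this data using matplotlib. We can make the plots a bit nicer by filling in a substitute value for unknown or missing time zone data in the records. We replace the missing values with the fillna method and use Boolean array indexing for the empty strings: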

In [33]: clean_tz = frame["tz"].fillna("Missing")
In [34]: clean_tz[clean_tz == ""] = "Unknown"
In [35]: tz_counts = clean_tz.value_counts()
In [36]: tz_counts.head()
Out[36]: 
tz
America/New_York       1251
Unknown                 521
America/Chicago         400
America/Los_Angeles     382
America/Denver          191
Name: count, dtype: int64
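
The difference between a missing value and an empty string is easy to overlook, so here is a minimal sketch on a made-up Series (the values and the name sample_tz are hypothetical). It shows that fillna only touches missing entries, while the empty string needs the separate Boolean assignment.

```python
import pandas as pd

# Hypothetical column with one missing value (None) and one empty string.
sample_tz = pd.Series(["America/New_York", None, "", "America/New_York"], name="tz")

clean = sample_tz.fillna("Missing")   # missing values -> "Missing"
clean[clean == ""] = "Unknown"        # empty strings  -> "Unknown"

sample_counts = clean.value_counts()
print(sample_counts)
```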

At this point, we can use the seaborn package to make a horizontal bar plot (see Figure 13.1 for the resulting visualization):

In [38]: import seaborn as sns
In [39]: subset = tz_counts.head()
In [40]: sns.barplot(y=subset.index, x=subset.to_numpy())

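Figure 13.1: Top time zones in the 1.usa.gov sample data

The a field contains information about the browser, device, or application used to perform the URL shortening: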

In [41]: frame["a"][1]
Out[41]: 'GoogleMaps/RochesterNY'
In [42]: frame["a"][50]
Out[42]: 'Mozilla/5.0 (Windows NT 5.1; rv:10.0.2) Gecko/20100101 Firefox/10.0.2'
In [43]: frame["a"][51][:50]  # long line
Out[43]: 'Mozilla/5.0 (Linux; U; Android 2.2.2; en-us; LG-P9'

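Parsing all of the interesting information in these "agent" strings may seem like a daunting task. One possible strategy is to split off the first token in the string (corresponding roughly to the browser capability) and make another summary of the user behavior: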

In [44]: results = pd.Series([x.split()[0] for x in frame["a"].dropna()])
In [45]: results.head(5)
Out[45]: 
0               Mozilla/5.0
1    GoogleMaps/RochesterNY
2               Mozilla/4.0
3               Mozilla/5.0
4               Mozilla/5.0
dtype: object
In [46]: results.value_counts().head(8)
Out[46]: 
Mozilla/5.0                 2594
Mozilla/4.0                  601
GoogleMaps/RochesterNY       121
Opera/9.80                    34
TEST_INTERNET_AGENT           24
GoogleProducer                21
Mozilla/6.0                    5
BlackBerry8520/5.0.0.681       4
Name: count, dtype: int64
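
The same first-token extraction can also be written with the vectorized string methods of Series rather than a list comprehension. The sketch below assumes a small hand-made Series of agent strings (the third non-missing string is invented) and is only one possible way to do this.

```python
import pandas as pd

# Hypothetical agent strings, including one missing value.
sample_agents = pd.Series([
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11",
    "GoogleMaps/RochesterNY",
    None,
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)",
])

# Drop missing agents, split on whitespace, keep the first token of each.
first_tokens = sample_agents.dropna().str.split().str[0]

print(first_tokens.tolist())
# ['Mozilla/5.0', 'GoogleMaps/RochesterNY', 'Mozilla/4.0']
```

Unlike the list comprehension wrapped in pd.Series, this version keeps the original row labels of the non-missing entries.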

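Now, suppose you wanted to decompose the top time zones into Windows and non-Windows users. As a simplification, let's say that a user is on Windows if the string "Windows" is in the agent string. Since some of the agents are missing, we'll exclude these from the data:

In [47]: cframe = frame[frame["a"].notna()].copy()

We then want to compute a value for whether or not each row is Windows: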

In [48]: cframe["os"] = np.where(cframe["a"].str.contains("Windows"),
 ....:                         "Windows", "Not Windows")
In [49]: cframe["os"].head(5)
Out[49]: 
0        Windows
1    Not Windows
2        Windows
3    Not Windows
4        Windows
Name: os, dtype: object
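
To make the labeling step concrete, here is a minimal sketch on three made-up agent strings. It assumes, as the text does, that a case-sensitive substring match on "Windows" is good enough; the sample strings are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical agent strings with no missing values.
sample_agents = pd.Series([
    "Mozilla/5.0 (Windows NT 6.1)",
    "GoogleMaps/RochesterNY",
    "Mozilla/5.0 (Macintosh)",
])

# Boolean mask: True where the agent string contains "Windows".
is_windows = sample_agents.str.contains("Windows")

# np.where picks the first label where the mask is True, else the second.
os_label = np.where(is_windows, "Windows", "Not Windows")

print(os_label.tolist())
# ['Windows', 'Not Windows', 'Not Windows']
```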

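Then, you can group the data by its time zone column and this new list of operating systems:

In [50]: by_tz_os = cframe.groupby(["tz", "os"])

The group counts, analogous to the value_counts function, can be computed with size. This result is then reshaped into a table with unstack: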

In [51]: agg_counts = by_tz_os.size().unstack().fillna(0)
In [52]: agg_counts.head()
Out[52]: 
os                   Not Windows  Windows
tz 
                           245.0    276.0
Africa/Cairo                 0.0      3.0
Africa/Casablanca            0.0      1.0
Africa/Ceuta                 0.0      2.0
Africa/Johannesburg          0.0      1.0
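
The zeros in this table come from fillna(0): a time zone and OS combination that never occurs has no group, so unstack leaves a missing value there. A tiny sketch with invented data (time zones "A" and "B" are placeholders) shows the mechanism.

```python
import pandas as pd

# Hypothetical data: time zone "B" has no "Not Windows" rows.
sample = pd.DataFrame({
    "tz": ["A", "A", "A", "B"],
    "os": ["Windows", "Windows", "Not Windows", "Windows"],
})

# size() counts rows per (tz, os) group; unstack() pivots "os" into columns;
# fillna(0) replaces combinations that never occurred.
sample_table = sample.groupby(["tz", "os"]).size().unstack().fillna(0)

print(sample_table)
```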

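Finally, let's select the top overall time zones. To do so, I construct an indirect index array from the row counts in agg_counts. After computing the row counts with agg_counts.sum("columns"), I can call argsort() to obtain an index array that can be used to sort in ascending order: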

In [53]: indexer = agg_counts.sum("columns").argsort()
In [54]: indexer.values[:10]
Out[54]: array([24, 20, 21, 92, 87, 53, 54, 57, 26, 55])

I use take to select the rows in that order, keeping only the last 10 entries of the indexer (the largest values):

In [55]: count_subset = agg_counts.take(indexer[-10:])
In [56]: count_subset
Out[56]: 
os                   Not Windows  Windows
tz 
America/Sao_Paulo           13.0     20.0
Europe/Madrid               16.0     19.0
Pacific/Honolulu             0.0     36.0
Asia/Tokyo                   2.0     35.0
Europe/London               43.0     31.0
America/Denver             132.0     59.0
America/Los_Angeles        130.0    252.0
America/Chicago            115.0    285.0
                           245.0    276.0
America/New_York           339.0    912.0

pandas has a convenient method called nlargest that does the same thing:

In [57]: agg_counts.sum(axis="columns").nlargest(10)
Out[57]: 
tz
America/New_York       1251.0
                        521.0
America/Chicago         400.0
America/Los_Angeles     382.0
America/Denver          191.0
Europe/London            74.0
Asia/Tokyo               37.0
Pacific/Honolulu         36.0
Europe/Madrid            35.0
America/Sao_Paulo        33.0
dtype: float64
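
The two approaches select the same rows but in opposite order: argsort with take keeps ascending order, while nlargest returns the largest first. The sketch below checks this on a three-row table with invented numbers; it converts the indexer to a NumPy array before slicing, which is an assumption of this sketch rather than what the session above does.

```python
import pandas as pd

# Hypothetical counts table with three time zones.
sample_table = pd.DataFrame(
    {"Not Windows": [1.0, 5.0, 0.0], "Windows": [2.0, 1.0, 1.0]},
    index=pd.Index(["A", "B", "C"], name="tz"),
)

totals = sample_table.sum(axis="columns")   # A: 3.0, B: 6.0, C: 1.0

# argsort gives the positions that would sort the totals in ascending order.
indexer = totals.argsort().to_numpy()       # positions of C, A, B

# Take the rows at the last two positions (the two largest totals).
top2_take = sample_table.take(indexer[-2:])

# nlargest returns the two largest totals, largest first.
top2_nlargest = totals.nlargest(2)

print(list(top2_take.index))      # ['A', 'B']
print(list(top2_nlargest.index))  # ['B', 'A']
```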

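Then, this can be plotted in a grouped bar plot comparing the number of Windows and non-Windows users, using seaborn's barplot function (see Figure 13.2). I first call count_subset.stack() and reset the index to rearrange the data for better compatibility with seaborn: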

In [59]: count_subset = count_subset.stack()
In [60]: count_subset.name = "total"
In [61]: count_subset = count_subset.reset_index()
In [62]: count_subset.head(10)
Out[62]: 
                  tz           os  total
0  America/Sao_Paulo  Not Windows   13.0
1  America/Sao_Paulo      Windows   20.0
2      Europe/Madrid  Not Windows   16.0
3      Europe/Madrid      Windows   19.0
4   Pacific/Honolulu  Not Windows    0.0
5   Pacific/Honolulu      Windows   36.0
6         Asia/Tokyo  Not Windows    2.0
7         Asia/Tokyo      Windows   35.0
8      Europe/London  Not Windows   43.0
9      Europe/London      Windows   31.0
In [63]: sns.barplot(x="total", y="tz", hue="os",  data=count_subset)

Figure 13.2: Top time zones by Windows and non-Windows users

It is hard to see the relative percentage of Windows users in the smaller groups, so let's normalize the group percentages to sum to 1:

def norm_total(group):
    group["normed_total"] = group["total"] / group["total"].sum()
    return group

results = count_subset.groupby("tz").apply(norm_total)

Then plot this in Figure 13.3:

In [66]: sns.barplot(x="normed_total", y="tz", hue="os",  data=results)

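Figure 13.3: Percentage of Windows and non-Windows users in the top occurring time zones

We could have computed the normalized sum more efficiently by using the transform method with groupby: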

In [67]: g = count_subset.groupby("tz")
In [68]: results2 = count_subset["total"] / g["total"].transform("sum")
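
To see why transform works here, note that transform("sum") returns one value per original row (the total of that row's group), so the division lines up element by element. A minimal sketch with invented totals:

```python
import pandas as pd

# Hypothetical long-format counts for two time zones.
sample = pd.DataFrame({
    "tz": ["A", "A", "B", "B"],
    "os": ["Not Windows", "Windows", "Not Windows", "Windows"],
    "total": [1.0, 3.0, 2.0, 2.0],
})

# One group total per row, aligned with the original index: [4, 4, 4, 4].
group_sums = sample.groupby("tz")["total"].transform("sum")

sample["normed_total"] = sample["total"] / group_sums

# Each time zone's shares should now add up to 1.
share_check = sample.groupby("tz")["normed_total"].sum()
print(sample)
print(share_check)
```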

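13.2 MovieLens 1M Dataset

GroupLens Research provides a number of collections of movie ratings data gathered from users of MovieLens in the late 1990s and early 2000s. The data provides movie ratings, movie metadata (genres and year), and demographic data about the users (age, zip code, gender identification, and occupation). Such data is often of interest in the development of recommendation systems based on machine learning algorithms. While we do not explore machine learning techniques in detail in this book, I will show you how to slice and dice datasets like these into the exact form you need.

The MovieLens 1M dataset contains one million ratings collected from six thousand users on four thousand movies. It is spread across three tables: ratings, user information, and movie information. We can load each table into a pandas DataFrame object using pandas.read_table. Run the following code in a Jupyter cell: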

unames = ["user_id", "gender", "age", "occupation", "zip"]
users = pd.read_table("datasets/movielens/users.dat", sep="::",
 header=None, names=unames, engine="python")
rnames = ["user_id", "movie_id", "rating", "timestamp"]
ratings = pd.read_table("datasets/movielens/ratings.dat", sep="::",
 header=None, names=rnames, engine="python")
mnames = ["movie_id", "title", "genres"]
movies = pd.read_table("datasets/movielens/movies.dat", sep="::",
 header=None, names=mnames, engine="python")
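
The engine="python" argument is needed because a separator longer than one character, such as "::", is treated as a regular expression, which pandas's default C parser does not support. The sketch below reads three user rows from an in-memory string instead of the file; the dtype argument for zip is an addition of this sketch, used to make explicit that zip codes should stay strings.

```python
import io

import pandas as pd

# Three rows in the users.dat layout, held in memory for illustration.
raw = "1::F::1::10::48067\n2::M::56::16::70072\n4::M::45::7::02460\n"

unames = ["user_id", "gender", "age", "occupation", "zip"]
sample_users = pd.read_table(
    io.StringIO(raw),
    sep="::",             # multi-character separator
    header=None,
    names=unames,
    engine="python",      # required for the "::" separator
    dtype={"zip": str},   # assumption: keep zip codes as text
)

print(sample_users)
```

Keeping zip as text preserves the leading zero in a code such as 02460.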

You can verify that everything succeeded by looking at each DataFrame:

In [70]: users.head(5)
Out[70]: 
   user_id gender  age  occupation    zip
0        1      F    1          10  48067
1        2      M   56          16  70072
2        3      M   25          15  55117
3        4      M   45           7  02460
4        5      M   25          20  55455
In [71]: ratings.head(5)
Out[71]: 
   user_id  movie_id  rating  timestamp
0        1      1193       5  978300760
1        1       661       3  978302109
2        1       914       3  978301968
3        1      3408       4  978300275
4        1      2355       5  978824291
In [72]: movies.head(5)
Out[72]: 
   movie_id                               title                        genres
0         1                    Toy Story (1995)   Animation|Children's|Comedy
1         2                      Jumanji (1995)  Adventure|Children's|Fantasy
2         3             Grumpier Old Men (1995)                Comedy|Romance
3         4            Waiting to Exhale (1995)                  Comedy|Drama
4         5  Father of the Bride Part II (1995)                        Comedy
In [73]: ratings
Out[73]: 
         user_id  movie_id  rating  timestamp
0              1      1193       5  978300760
1              1       661       3  978302109
2              1       914       3  978301968
3              1      3408       4  978300275
4              1      2355       5  978824291
...          ...       ...     ...        ...
1000204     6040      1091       1  956716541
1000205     6040      1094       5  956704887
1000206     6040       562       5  956704746
1000207     6040      1096       4  956715648
1000208     6040      1097       4  956715569
[1000209 rows x 4 columns]

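Note that ages and occupations are coded as integers indicating groups described in the dataset's README file. Analyzing the data spread across three tables is not a simple task; for example, suppose you wanted to compute mean ratings for a particular movie by gender identity and age. As you will see, this is more convenient to do with all of the data merged together into a single table. Using pandas's merge function, we first merge ratings with users and then merge that result with the movies data. pandas infers which columns to use as the merge (or join) keys based on overlapping names: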

In [74]: data = pd.merge(pd.merge(ratings, users), movies)
In [75]: data
Out[75]: 
         user_id  movie_id  rating  timestamp gender  age  occupation    zip
0              1      1193       5  978300760      F    1          10  48067  \
1              2      1193       5  978298413      M   56          16  70072 
2             12      1193       4  978220179      M   25          12  32793 
3             15      1193       4  978199279      M   25           7  22903 
4             17      1193       5  978158471      M   50           1  95350 
...          ...       ...     ...        ...    ...  ...         ...    ... 
1000204     5949      2198       5  958846401      M   18          17  47901 
1000205     5675      2703       3  976029116      M   35          14  30030 
1000206     5780      2845       1  958153068      M   18          17  92886 
1000207     5851      3607       5  957756608      F   18          20  55410 
1000208     5938      2909       4  957273353      M   25           1  35401 
                                               title                genres
0             One Flew Over the Cuckoo's Nest (1975)                 Drama 
1             One Flew Over the Cuckoo's Nest (1975)                 Drama 
2             One Flew Over the Cuckoo's Nest (1975)                 Drama 
3             One Flew Over the Cuckoo's Nest (1975)                 Drama 
4             One Flew Over the Cuckoo's Nest (1975)                 Drama 
...                                              ...                   ... 
1000204                           Modulations (1998)           Documentary 
1000205                        Broken Vessels (1998)                 Drama 
1000206                            White Boys (1999)                 Drama 
1000207                     One Little Indian (1973)  Comedy|Drama|Western 
1000208  Five Wives, Three Secretaries and Me (1998)           Documentary 
[1000209 rows x 10 columns]
In [76]: data.iloc[0]
Out[76]: 
user_id                                            1
movie_id                                        1193
rating                                             5
timestamp                                  978300760
gender                                             F
age                                                1
occupation                                        10
zip                                            48067
title         One Flew Over the Cuckoo's Nest (1975)
genres                                         Drama
Name: 0, dtype: object
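
Relying on inferred keys is convenient, but it can be clearer to name them. The sketch below uses three tiny invented tables (all names ending in _s are hypothetical) and checks that the inferred merge matches one with explicit on= keys.

```python
import pandas as pd

# Hypothetical miniature versions of the three tables.
ratings_s = pd.DataFrame({
    "user_id": [1, 1, 2],
    "movie_id": [10, 20, 10],
    "rating": [5, 3, 4],
})
users_s = pd.DataFrame({"user_id": [1, 2], "gender": ["F", "M"]})
movies_s = pd.DataFrame({"movie_id": [10, 20], "title": ["Movie A", "Movie B"]})

# Keys inferred from overlapping column names, as in the text.
merged = pd.merge(pd.merge(ratings_s, users_s), movies_s)

# The same joins with the keys spelled out.
explicit = ratings_s.merge(users_s, on="user_id").merge(movies_s, on="movie_id")

print(merged)
print(merged.equals(explicit))
```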

To get mean movie ratings for each film grouped by gender, we can use the pivot_table method:

In [77]: mean_ratings = data.pivot_table("rating", index="title",
 ....:                                 columns="gender", aggfunc="mean")
In [78]: mean_ratings.head(5)
Out[78]: 
gender                                F         M
title 
$1,000,000 Duck (1971)         3.375000  2.761905
'Night Mother (1986)           3.388889  3.352941
'Til There Was You (1997)      2.675676  2.733333
'burbs, The (1989)             2.793478  2.962085
...And Justice for All (1979)  3.828571  3.689024

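This produced another DataFrame containing mean ratings with movie titles as row labels (the "index") and gender as column labels. I first filter down to movies that received at least 250 ratings (an arbitrary number); to do this, I group the data by title, and use size() to get a Series of group sizes for each title: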

In [79]: ratings_by_title = data.groupby("title").size()
In [80]: ratings_by_title.head()
Out[80]: 
title
$1,000,000 Duck (1971)            37
'Night Mother (1986)              70
'Til There Was You (1997)         52
'burbs, The (1989)               303
...And Justice for All (1979)    199
dtype: int64
In [81]: active_titles = ratings_by_title.index[ratings_by_title >= 250]
In [82]: active_titles
Out[82]: 
Index([''burbs, The (1989)', '10 Things I Hate About You (1999)',
 '101 Dalmatians (1961)', '101 Dalmatians (1996)', '12 Angry Men (1957)',
 '13th Warrior, The (1999)', '2 Days in the Valley (1996)',
 '20,000 Leagues Under the Sea (1954)', '2001: A Space Odyssey (1968)',
 '2010 (1984)',
 ...
 'X-Men (2000)', 'Year of Living Dangerously (1982)',
 'Yellow Submarine (1968)', 'You've Got Mail (1998)',
 'Young Frankenstein (1974)', 'Young Guns (1988)',
 'Young Guns II (1990)', 'Young Sherlock Holmes (1985)',
 'Zero Effect (1998)', 'eXistenZ (1999)'],
 dtype='object', name='title', length=1216)

The index of titles receiving at least 250 ratings can then be used to select rows from mean_ratings using .loc:

In [83]: mean_ratings = mean_ratings.loc[active_titles]
In [84]: mean_ratings
Out[84]: 
gender                                    F         M
title 
'burbs, The (1989)                 2.793478  2.962085
10 Things I Hate About You (1999)  3.646552  3.311966
101 Dalmatians (1961)              3.791444  3.500000
101 Dalmatians (1996)              3.240000  2.911215
12 Angry Men (1957)                4.184397  4.328421
...                                     ...       ...
Young Guns (1988)                  3.371795  3.425620
Young Guns II (1990)               2.934783  2.904025
Young Sherlock Holmes (1985)       3.514706  3.363344
Zero Effect (1998)                 3.864407  3.723140
eXistenZ (1999)                    3.098592  3.289086
[1216 rows x 2 columns]

To see the top films among female viewers, we can sort by the F column in descending order:

In [86]: top_female_ratings = mean_ratings.sort_values("F", ascending=False)
In [87]: top_female_ratings.head()
Out[87]: 
gender                                                         F         M
title 
Close Shave, A (1995)                                   4.644444  4.473795
Wrong Trousers, The (1993)                              4.588235  4.478261
Sunset Blvd. (a.k.a. Sunset Boulevard) (1950)           4.572650  4.464589
Wallace & Gromit: The Best of Aardman Animation (1996)  4.563107  4.385075
Schindler's List (1993)                                 4.562602  4.491415

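Measuring Rating Disagreement

Suppose you wanted to find the movies that are most divisive between male and female viewers. One way is to add a column to mean_ratings containing the difference in means, and then sort by that:

In [88]: mean_ratings["diff"] = mean_ratings["M"] - mean_ratings["F"]

Sorting by "diff" yields the movies with the greatest rating difference. Because diff is M minus F and the sort is ascending, the most negative values come first, so we see the movies that were preferred by women: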

In [89]: sorted_by_diff = mean_ratings.sort_values("diff")
In [90]: sorted_by_diff.head()
Out[90]: 
gender                            F         M      diff
title 
Dirty Dancing (1987)       3.790378  2.959596 -0.830782
Jumpin' Jack Flash (1986)  3.254717  2.578358 -0.676359
Grease (1978)              3.975265  3.367041 -0.608224
Little Women (1994)        3.870588  3.321739 -0.548849
Steel Magnolias (1989)     3.901734  3.365957 -0.535777

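Reversing the order of the rows and again taking the first five rows with head(), we get the movies preferred by men that women didn't rate as highly: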

In [91]: sorted_by_diff[::-1].head()
Out[91]: 
gender                                         F         M      diff
title 
Good, The Bad and The Ugly, The (1966)  3.494949  4.221300  0.726351
Kentucky Fried Movie, The (1977)        2.878788  3.555147  0.676359
Dumb & Dumber (1994)                    2.697987  3.336595  0.638608
Longest Day, The (1962)                 3.411765  4.031447  0.619682
Cable Guy, The (1996)                   2.250000  2.863787  0.613787

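Suppose instead you wanted the movies that elicited the most disagreement among viewers, independent of gender identification. Disagreement can be measured by the variance or standard deviation of the ratings. To get this, we first compute the rating standard deviation by title and then filter down to the active titles: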

In [92]: rating_std_by_title = data.groupby("title")["rating"].std()
In [93]: rating_std_by_title = rating_std_by_title.loc[active_titles]
In [94]: rating_std_by_title.head()
Out[94]: 
title
'burbs, The (1989)                   1.107760
10 Things I Hate About You (1999)    0.989815
101 Dalmatians (1961)                0.982103
101 Dalmatians (1996)                1.098717
12 Angry Men (1957)                  0.812731
Name: rating, dtype: float64

Then, we sort in descending order and select the first 10 rows, which are roughly the 10 most divisively rated movies:

In [95]: rating_std_by_title.sort_values(ascending=False)[:10]
Out[95]: 
title
Dumb & Dumber (1994)                     1.321333
Blair Witch Project, The (1999)          1.316368
Natural Born Killers (1994)              1.307198
Tank Girl (1995)                         1.277695
Rocky Horror Picture Show, The (1975)    1.260177
Eyes Wide Shut (1999)                    1.259624
Evita (1996)                             1.253631
Billy Madison (1995)                     1.249970
Fear and Loathing in Las Vegas (1998)    1.246408
Bicentennial Man (1999)                  1.245533
Name: rating, dtype: float64
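
To illustrate why standard deviation captures disagreement where the mean does not, the sketch below uses two invented movies with the same mean rating but very different spreads. Note that the pandas std method uses the sample standard deviation (dividing by n - 1) by default.

```python
import pandas as pd

# Hypothetical ratings: both movies average 3.0, but "Movie B" is polarizing.
data_s = pd.DataFrame({
    "title": ["Movie A"] * 4 + ["Movie B"] * 4,
    "rating": [3, 3, 3, 3, 1, 5, 1, 5],
})

by_title = data_s.groupby("title")["rating"]
mean_s = by_title.mean()
std_s = by_title.std()   # sample standard deviation (ddof=1)

print(mean_s)
print(std_s.sort_values(ascending=False))
```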

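You may have noticed that movie genres are given as a pipe-separated (|) string, since a single movie can belong to multiple genres. To help us group the ratings data by genre, we can use the explode method on DataFrame. Let's take a look at how this works. First, we can split the genres string into a list of genres using the str.split method on the Series: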

In [96]: movies["genres"].head()
Out[96]: 
0     Animation|Children's|Comedy
1    Adventure|Children's|Fantasy
2                  Comedy|Romance
3                    Comedy|Drama
4                          Comedy
Name: genres, dtype: object
In [97]: movies["genres"].head().str.split("|")
Out[97]: 
0     [Animation, Children's, Comedy]
1    [Adventure, Children's, Fantasy]
2                   [Comedy, Romance]
3                     [Comedy, Drama]
4                            [Comedy]
Name: genres, dtype: object
In [98]: movies["genre"] = movies.pop("genres").str.split("|")
In [99]: movies.head()
Out[99]: 
   movie_id                               title
0         1                    Toy Story (1995)  \
1         2                      Jumanji (1995)
2         3             Grumpier Old Men (1995)
3         4            Waiting to Exhale (1995)
4         5  Father of the Bride Part II (1995)
                              genre
0   [Animation, Children's, Comedy]
1  [Adventure, Children's, Fantasy]
2                 [Comedy, Romance]
3                   [Comedy, Drama]
4                          [Comedy]

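Now, calling movies.explode("genre") generates a new DataFrame with one row for each "inner" element in each list of movie genres. For example, if a movie is classified as both a comedy and a romance, then there will be two rows in the result, one with just "Comedy" and the other with just "Romance":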

In [100]: movies_exploded = movies.explode("genre")
In [101]: movies_exploded[:10]
Out[101]: 
   movie_id                     title       genre
0         1          Toy Story (1995)   Animation
0         1          Toy Story (1995)  Children's
0         1          Toy Story (1995)      Comedy
1         2            Jumanji (1995)   Adventure
1         2            Jumanji (1995)  Children's
1         2            Jumanji (1995)     Fantasy
2         3   Grumpier Old Men (1995)      Comedy
2         3   Grumpier Old Men (1995)     Romance
3         4  Waiting to Exhale (1995)      Comedy
3         4  Waiting to Exhale (1995)       Drama
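
The split-then-explode pattern can be tried end to end on a two-row table. The titles below are placeholders; the point of the sketch is that exploded rows repeat the original index label, which is why the output above shows the labels 0, 0, 0, 1, 1, 1 and so on.

```python
import pandas as pd

# Hypothetical movies table with pipe-separated genres.
movies_s = pd.DataFrame({
    "movie_id": [1, 2],
    "title": ["Movie A", "Movie B"],
    "genres": ["Comedy|Romance", "Drama"],
})

# Replace the string column with a column of lists, then explode it.
movies_s["genre"] = movies_s.pop("genres").str.split("|")
exploded_s = movies_s.explode("genre")

print(exploded_s)
```

If a fresh 0..n-1 index is preferred, calling reset_index(drop=True) on the result is one option.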

Now, we can merge all three tables together and group by genre:

In [102]: ratings_with_genre = pd.merge(pd.merge(movies_exploded, ratings), users)
In [103]: ratings_with_genre.iloc[0]
Out[103]: 
movie_id                     1
title         Toy Story (1995)
genre                Animation
user_id                      1
rating                       5
timestamp            978824268
gender                       F
age                          1
occupation                  10
zip                      48067
Name: 0, dtype: object
In [104]: genre_ratings = (ratings_with_genre.groupby(["genre", "age"])
 .....:                  ["rating"].mean()
 .....:                  .unstack("age"))
In [105]: genre_ratings[:10]
Out[105]: 
age                1         18        25        35        45        50 
genre 
Action       3.506385  3.447097  3.453358  3.538107  3.528543  3.611333  \
Adventure    3.449975  3.408525  3.443163  3.515291  3.528963  3.628163 
Animation    3.476113  3.624014  3.701228  3.740545  3.734856  3.780020 
Children's   3.241642  3.294257  3.426873  3.518423  3.527593  3.556555 
Comedy       3.497491  3.460417  3.490385  3.561984  3.591789  3.646868 
Crime        3.710170  3.668054  3.680321  3.733736  3.750661  3.810688 
Documentary  3.730769  3.865865  3.946690  3.953747  3.966521  3.908108 
Drama        3.794735  3.721930  3.726428  3.782512  3.784356  3.878415 
Fantasy      3.317647  3.353778  3.452484  3.482301  3.532468  3.581570 
Film-Noir    4.145455  3.997368  4.058725  4.064910  4.105376  4.175401 
age                56 
genre 
Action       3.610709 
Adventure    3.649064 
Animation    3.756233 
Children's   3.621822 
Comedy       3.650949 
Crime        3.832549 
Documentary  3.961538 
Drama        3.933465 
Fantasy      3.532700 
Film-Noir    4.125932
