A Best-Practice Business Data Analysis Case Study: Tourism Industry Data ⛵

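Summary: This article uses booking records from city hotels and resort hotels to analyze the current state of the tourism industry. It walks through a complete data analysis workflow: loading the data, a first look at the data, preprocessing, descriptive statistics, exploratory data analysis, association analysis, and correlation analysis.
💡 Author: 韩信子 @ ShowMeAI
📘 Data Analysis in Practice series: https://www.showmeai.tech/tutorials/40
📘 This article: https://www.showmeai.tech/article-detail/388
📢 Notice: All rights reserved. To republish, contact the platform and the author and credit the source.
📢 Bookmark ShowMeAI for more content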

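In this article, ShowMeAI analyzes the tourism industry, focusing on hotel booking demand. The dataset contains booking information for city hotels and resort hotels, including when each booking was made, the length of stay, the number of weekend and weekday nights booked, and the number of available parking spaces.

We use the 🏆 hotel booking dataset, which contains 119,390 booking records with 32 fields. You can download it from ShowMeAI's Baidu Netdisk link.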

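🏆 Dataset download (Baidu Netdisk): reply 『实战』 to the WeChat official account 『ShowMeAI研究中心』, or click here to get the hotel booking dataset for this article, [[59] Multi-Dimensional Business Analysis of Tourism Big Data](https://www.showmeai.tech/article-detail/388).

ShowMeAI on GitHub: https://github.com/ShowMeAI-Hub

For the libraries used in the analysis, you can refer to the cheat sheets and tutorials made by ShowMeAI to learn them and get productive quickly.

📘 Data Science Library Cheat Sheets | Pandas Cheat Sheet
📘 Illustrated Data Analysis: From Beginner to Expert tutorial series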

💡 Import Libraries

# Data handling & scientific computing
import pandas as pd
import numpy as np

# Analysis & plotting
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
import plotly.figure_factory as ff

# Silence library warnings to keep the output readable
import warnings
warnings.filterwarnings("ignore")

# Statistics
from scipy.stats import skew, kurtosis
import pylab as py

# Dates and times
import time
from datetime import datetime, date

💡 Load the Data

df = pd.read_csv("hotel_bookings.csv")
df.head()

💡 First Look at the Data

df.info()

💡 Data Preprocessing

💦 Cleaning & Missing Values

First, compute the share of missing values in each field.

df.isnull().sum().sort_values(ascending = False) / len(df)

Next, fill in the fields that have missing values.

# Fill missing values in "agent" and "company" with 0
df["agent"] = df["agent"].fillna(0)
df["company"] = df["company"].fillna(0)

# Fill missing "country" values with the mode (the most frequent value)
df["country"] = df["country"].fillna(df["country"].mode()[0])

# Drop the records whose "children" value is missing
df.dropna(subset = ["children"], axis = 0, inplace = True)
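The three strategies above (constant fill, mode fill, and dropping rows) are easier to check on a small table. The following is a minimal sketch, assuming a hypothetical four-row DataFrame named `toy` that only mimics the affected columns; the values are made up for illustration and are not taken from the hotel dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data that mimics the four fields with missing values
toy = pd.DataFrame({
    "agent":    [9.0, np.nan, 240.0, np.nan],
    "company":  [np.nan, 40.0, np.nan, np.nan],
    "country":  ["PRT", None, "GBR", "PRT"],
    "children": [0.0, 1.0, np.nan, 2.0],
})

# Share of missing values per column, largest first
missing_ratio = toy.isnull().mean().sort_values(ascending=False)

cleaned = toy.copy()
# Constant fill: treat a missing agent/company as 0
cleaned[["agent", "company"]] = cleaned[["agent", "company"]].fillna(0)
# Mode fill: use the most frequent country
cleaned["country"] = cleaned["country"].fillna(cleaned["country"].mode()[0])
# Row drop: discard records with no "children" value
cleaned = cleaned.dropna(subset=["children"])

print(missing_ratio)
print(cleaned)
```

Assigning the result of `fillna` back to the column, rather than calling it with `inplace = True` on a selected column, is a design choice that behaves the same way across pandas versions.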

💦 Field Value Processing

# Replace "Undefined" in the "distribution_channel" column with "TA/TO"
df["distribution_channel"] = df["distribution_channel"].replace("Undefined", "TA/TO")

# Map the meal codes to readable names
df["meal"] = df["meal"].replace(["Undefined", "BB", "FB", "HB", "SC"], ["No Meal", "Breakfast", "Full Board", "Half Board", "No Meal"])

# Convert "is_canceled" from 0 and 1 to "Not Cancelled" and "Cancelled" (1 means the booking was cancelled)
df["is_canceled"] = df["is_canceled"].map({0: "Not Cancelled", 1: "Cancelled"})

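💦 Adjust Data Types

  • Convert the children, agent, and company columns to integers
  • Convert the reservation_status_date column from object to datetime
# Convert to integers (astype returns a new Series, so assign the result back)
df["children"] = df["children"].astype(int)
df["agent"] = df["agent"].astype(int)
df["company"] = df["company"].astype(int)

# Convert to datetime
df["reservation_status_date"] = pd.to_datetime(df["reservation_status_date"])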
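Type conversions in pandas return new objects instead of changing the column in place, so the result has to be assigned back for the conversion to stick. A minimal sketch, assuming a hypothetical three-row `toy` DataFrame:

```python
import pandas as pd

# Hypothetical toy data: a float column and a date stored as text
toy = pd.DataFrame({
    "children": [0.0, 2.0, 1.0],
    "reservation_status_date": ["2015-07-01", "2016-02-14", "2017-08-31"],
})

# Calling astype without assigning the result leaves the column unchanged
toy["children"].astype(int)
dtype_without_assignment = toy["children"].dtype  # still a float dtype

# Assigning the result back is what changes the stored dtype
toy["children"] = toy["children"].astype(int)
toy["reservation_status_date"] = pd.to_datetime(toy["reservation_status_date"])

print(dtype_without_assignment)
print(toy.dtypes)
```

Checking `df.dtypes` (or `df.info()`) right after the conversion is a cheap way to confirm that it took effect.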

💦 Remove Duplicates

df.drop_duplicates(inplace = True)

💦 Build an Aggregate Field

Compute the total number of nights for each booking.

df["total_nights"] = df["stays_in_weekend_nights"] + df["stays_in_week_nights"]
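The two steps above can be sketched on a tiny table. This example assumes a hypothetical `toy` DataFrame with one exact duplicate row; it only illustrates the mechanics. Because the dataset has no booking identifier column, rows that look identical could in principle be separate bookings, so dropping them is a modeling choice rather than a guaranteed correction.

```python
import pandas as pd

# Hypothetical toy data: the first two rows are exact duplicates
toy = pd.DataFrame({
    "hotel": ["City Hotel", "City Hotel", "Resort Hotel"],
    "stays_in_weekend_nights": [2, 2, 0],
    "stays_in_week_nights": [3, 3, 1],
})

rows_before = len(toy)

# Keep the first occurrence of each fully identical row
deduped = toy.drop_duplicates().copy()

# Total nights = weekend nights + weekday nights
deduped["total_nights"] = (
    deduped["stays_in_weekend_nights"] + deduped["stays_in_week_nights"]
)

print(rows_before, len(deduped))
print(deduped)
```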

💡 Descriptive Statistics

Use pandas' built-in summary to get a first view of how the numeric fields are distributed.

df.describe().T
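The import section brings in `skew` and `kurtosis` from `scipy.stats`, but pandas also offers `Series.skew()` and `Series.kurt()`, which can complement `describe()` when a field has a long tail. A minimal sketch, assuming a hypothetical series of lead times in days:

```python
import pandas as pd

# Hypothetical lead times in days, standing in for a column such as df["lead_time"]
lead_time = pd.Series([0, 3, 12, 45, 60, 99, 100, 150, 240, 400], name="lead_time")

summary = lead_time.describe()        # count, mean, std, min, quartiles, max
skewness = lead_time.skew()           # > 0 indicates a longer right tail
excess_kurtosis = lead_time.kurt()    # excess kurtosis (0 for a normal distribution)

print(summary)
print(skewness, excess_kurtosis)
```

A mean well above the median together with positive skewness suggests that a few large values dominate, which is worth keeping in mind when reading the averages later on.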

💡 Exploratory Data Analysis

💦 Hotel Analysis

# Count the bookings for city hotels and resort hotels
labels = ['City Hotel', 'Resort Hotel']
colors = ["#538B8B", "#7AC5CD"]
order = df['hotel'].value_counts().index
plt.figure(figsize = (19, 9))
plt.suptitle('Bookings By Hotels', fontweight = 'heavy', fontsize = '16',
            fontfamily = 'sans-serif', color = "black")
# Pie Chart
plt.subplot(1, 2, 1)
plt.title('Pie Chart', fontweight = 'bold', fontfamily = "sans-serif", color = 'black')
plt.pie(df["hotel"].value_counts(), pctdistance = 0.7, autopct = '%.2f%%', labels = labels,
wedgeprops = dict(alpha = 0.8, edgecolor = "black"), textprops = {'fontsize': 12}, colors = colors)
centre = plt.Circle((0,0), 0.45, fc = "white", edgecolor = "black")
plt.gcf().gca().add_artist(centre)
# Histogram
countplt = plt.subplot(1, 2, 2)
plt.title("Histogram", fontweight = "bold", fontsize = 14, 
        fontfamily = "sans-serif", color = 'black')
ax = sns.countplot(x = "hotel", data = df, order = order, edgecolor = "black", palette = colors)
for rect in ax.patches:
    ax.text(rect.get_x() + rect.get_width()/2, rect.get_height() + 4.25, rect.get_height(),
    horizontalalignment="center", fontsize = 10, bbox = dict(facecolor = "none", edgecolor = "black",
    linewidth = 0.25, boxstyle = "round"))
plt.xlabel("Hotel", fontweight = "bold", fontsize = 11, fontfamily = "sans-serif", color = "black")
plt.ylabel("Number Of Bookings", fontweight = "bold", fontsize = 11, fontfamily = "sans-serif", color = "black")
plt.xticks([0, 1], labels)
plt.grid(axis = "y", alpha = 0.4)

df['hotel'].value_counts()
📢 Conclusion: More than 60% of the bookings are for city hotels.
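Percentage statements like this one can be read straight from the data instead of from the pie chart. A minimal sketch, assuming a hypothetical ten-row series in place of `df["hotel"]`:

```python
import pandas as pd

# Hypothetical toy column standing in for df["hotel"]
hotel = pd.Series(["City Hotel"] * 6 + ["Resort Hotel"] * 4, name="hotel")

counts = hotel.value_counts()                       # absolute bookings per hotel
shares = hotel.value_counts(normalize=True) * 100   # percentage of all bookings

summary = pd.DataFrame({"bookings": counts, "share_pct": shares.round(2)})
print(summary)
```

Passing `counts.index` as the pie labels, instead of a hand-written list, would also keep the labels in the same order as the values if the ranking of categories ever changed.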

💦 Market Segment Analysis

labels = ["Online TA", "Offline TA/TO", "Direct", "Groups", "Corporate", "Complementary", "Aviation"]
order = df['market_segment'].value_counts().index
plt.figure(figsize = (22, 9))
plt.suptitle('Bookings By Market Segment', fontweight = 'heavy', fontsize = '16',
            fontfamily = 'sans-serif', color = "black")
# Pie Chart
plt.subplot(1, 2, 1)
plt.title('Pie Chart', fontweight = 'bold', fontfamily = "sans-serif", color = 'black')
plt.pie(df["market_segment"].value_counts(), pctdistance = 0.7, autopct = '%.2f%%', labels = labels,
wedgeprops = dict(alpha = 0.8, edgecolor = "black"), textprops = {'fontsize': 12})
centre = plt.Circle((0,0), 0.45, fc = "white", edgecolor = "black")
plt.gcf().gca().add_artist(centre)
# Histogram
countplt = plt.subplot(1, 2, 2)
plt.title("Histogram", fontweight = "bold", fontsize = 14, 
        fontfamily = "sans-serif", color = 'black')
ax = sns.countplot(x = "market_segment", data = df, order = order, edgecolor = "black",)
for rect in ax.patches:
    ax.text(rect.get_x() + rect.get_width()/2, rect.get_height() + 4.25, rect.get_height(),
    horizontalalignment="center", fontsize = 10, bbox = dict(facecolor = "none", edgecolor = "black",
    linewidth = 0.25, boxstyle = "round"))
plt.xlabel("Market Segment", fontweight = "bold", fontsize = 11, fontfamily = "sans-serif", color = "black")
plt.ylabel("Number Of Bookings", fontweight = "bold", fontsize = 11, fontfamily = "sans-serif", color = "black")
plt.grid(axis = "y", alpha = 0.4)

df['market_segment'].value_counts()
📢 Conclusion: More than 50% of the bookings were made through online travel agents.

💦 Distribution Channel Analysis

colors = ["#8B7D6B", "#000000", "#CDB79E", "#FFE4C4"]
labels = ["TA/TO", "Direct", "Corporate", "GDS"]
order = df['distribution_channel'].value_counts().index
plt.figure(figsize = (19, 9))
plt.suptitle('Bookings By Distribution Channel', fontweight = 'heavy', fontsize = '16',
            fontfamily = 'sans-serif', color = "black")
# Pie Chart
plt.subplot(1, 2, 1)
plt.title('Pie Chart', fontweight = 'bold', fontfamily = "sans-serif", color = 'black')
plt.pie(df["distribution_channel"].value_counts(), pctdistance = 0.7, autopct = '%.2f%%', labels = labels, colors = colors,
wedgeprops = dict(alpha = 0.8, edgecolor = "black"), textprops = {'fontsize': 12})
centre = plt.Circle((0,0), 0.45, fc = "white", edgecolor = "black")
plt.gcf().gca().add_artist(centre)
# Histogram
countplt = plt.subplot(1, 2, 2)
plt.title("Histogram", fontweight = "bold", fontsize = 14, 
        fontfamily = "sans-serif", color = 'black')
ax = sns.countplot(x = "distribution_channel", data = df, order = order, edgecolor = "black", palette = colors)
for rect in ax.patches:
    ax.text(rect.get_x() + rect.get_width()/2, rect.get_height() + 4.25, rect.get_height(),
    horizontalalignment="center", fontsize = 10, bbox = dict(facecolor = "none", edgecolor = "black",
    linewidth = 0.25, boxstyle = "round"))
plt.xlabel("Distribution Channel", fontweight = "bold", fontsize = 11, fontfamily = "sans-serif", color = "black")
plt.ylabel("Number Of Bookings", fontweight = "bold", fontsize = 11, fontfamily = "sans-serif", color = "black")
plt.grid(axis = "y", alpha = 0.4)

df['distribution_channel'].value_counts()
📢 Conclusion: More than 80% of the bookings were made through travel agents/tour operators.

💦 Meal Analysis

colors = ["#6495ED", "#1874CD", "#009ACD", "#00688B"]
labels = ["Breakfast", "No Meal", "Half Board", "Full Board"]
order = df['meal'].value_counts().index
plt.figure(figsize = (19, 9))
plt.suptitle('Bookings By Meals', fontweight = 'heavy', fontsize = '16',
            fontfamily = 'sans-serif', color = "black")
# Pie Chart
plt.subplot(1, 2, 1)
plt.title('Pie Chart', fontweight = 'bold', fontfamily = "sans-serif", color = 'black')
plt.pie(df["meal"].value_counts(), pctdistance = 0.7, autopct = '%.2f%%', labels = labels, colors = colors,
wedgeprops = dict(alpha = 0.8, edgecolor = "black"), textprops = {'fontsize': 12})
centre = plt.Circle((0,0), 0.45, fc = "white", edgecolor = "black")
plt.gcf().gca().add_artist(centre)
# Histogram
countplt = plt.subplot(1, 2, 2)
plt.title("Histogram", fontweight = "bold", fontsize = 14, 
        fontfamily = "sans-serif", color = 'black')
ax = sns.countplot(x = "meal", data = df, order = order, edgecolor = "black", palette = colors)
for rect in ax.patches:
    ax.text(rect.get_x() + rect.get_width()/2, rect.get_height() + 4.25, rect.get_height(),
    horizontalalignment="center", fontsize = 10, bbox = dict(facecolor = "none", edgecolor = "black",
    linewidth = 0.25, boxstyle = "round"))
plt.xlabel("Meal", fontweight = "bold", fontsize = 11, fontfamily = "sans-serif", color = "black")
plt.ylabel("Number Of Bookings", fontweight = "bold", fontsize = 11, fontfamily = "sans-serif", color = "black")
plt.grid(axis = "y", alpha = 0.4)

df['meal'].value_counts()
📢 Conclusion: More than 70% of guests booked breakfast, and nearly 90% booked some kind of meal plan.

💦 Customer Type Analysis

labels = ["Transient", "Transient-Party", "Contract", "Group"]
order = df['customer_type'].value_counts().index
plt.figure(figsize = (19, 9))
plt.suptitle('Bookings By Customer Type', fontweight = 'heavy', fontsize = '16',
            fontfamily = 'sans-serif', color = "black")
# Pie Chart
plt.subplot(1, 2, 1)
plt.title('Pie Chart', fontweight = 'bold', fontfamily = "sans-serif", color = 'black')
plt.pie(df["customer_type"].value_counts(), pctdistance = 0.7, autopct = '%.2f%%', labels = labels,
wedgeprops = dict(alpha = 0.8, edgecolor = "black"), textprops = {'fontsize': 12})
centre = plt.Circle((0,0), 0.45, fc = "white", edgecolor = "black")
plt.gcf().gca().add_artist(centre)
# Histogram
countplt = plt.subplot(1, 2, 2)
plt.title("Histogram", fontweight = "bold", fontsize = 14, 
        fontfamily = "sans-serif", color = 'black')
ax = sns.countplot(x = "customer_type", data = df, order = order, edgecolor = "black")
for rect in ax.patches:
    ax.text(rect.get_x() + rect.get_width()/2, rect.get_height() + 4.25, rect.get_height(),
    horizontalalignment="center", fontsize = 10, bbox = dict(facecolor = "none", edgecolor = "black",
    linewidth = 0.25, boxstyle = "round"))
plt.xlabel("Customer Type", fontweight = "bold", fontsize = 11, fontfamily = "sans-serif", color = "black")
plt.ylabel("Number Of Bookings", fontweight = "bold", fontsize = 11, fontfamily = "sans-serif", color = "black")
plt.grid(axis = "y", alpha = 0.4)

df['customer_type'].value_counts()
📢 Conclusion: Most guests did not travel as part of a group.

💦 Deposit Analysis

plt.figure(figsize = (19, 12))
order = sorted(df["deposit_type"].unique())
plt.title("Bookings By Deposit Types", fontweight = "bold", fontsize = 14, 
        fontfamily = "sans-serif", color = 'black')
ax = sns.countplot(x = "deposit_type", data = df, hue = "hotel", edgecolor = "black", palette = "bone", order = order)
plt.xlabel("Deposit Type", fontweight = "bold", fontsize = 11, fontfamily = "sans-serif", color = "black")
plt.ylabel("Number Of Bookings", fontweight = "bold", fontsize = 11, fontfamily = "sans-serif", color = "black")
plt.grid(axis = "y", alpha = 0.4)

df["deposit_type"].value_counts()
📢 Conclusion: Most guests did not pay a deposit.

💦 New vs. Repeated Guests

labels = ['New Guest', 'Repeated Guest']
colors = ["#00008B", "#C1CDCD"]
order = df['is_repeated_guest'].value_counts().index
plt.figure(figsize = (19, 9))
plt.suptitle('Bookings By Type Of Guest', fontweight = 'heavy', fontsize = '16',
            fontfamily = 'sans-serif', color = "black")
# Pie Chart
plt.subplot(1, 2, 1)
plt.title('Pie Chart', fontweight = 'bold', fontfamily = "sans-serif", color = 'black')
plt.pie(df["is_repeated_guest"].value_counts(), pctdistance = 0.7, autopct = '%.2f%%', labels = labels,
wedgeprops = dict(alpha = 0.8, edgecolor = "black"), textprops = {'fontsize': 12}, colors = colors)
centre = plt.Circle((0,0), 0.45, fc = "white", edgecolor = "black")
plt.gcf().gca().add_artist(centre)
# Histogram
countplt = plt.subplot(1, 2, 2)
plt.title("Histogram", fontweight = "bold", fontsize = 14, fontfamily = "sans-serif", color = 'black')
ax = sns.countplot(x = "is_repeated_guest", data = df, order = order, edgecolor = "black", palette = colors)
for rect in ax.patches:
    ax.text(rect.get_x() + rect.get_width()/2, rect.get_height() + 4.25, rect.get_height(),
    horizontalalignment="center", fontsize = 10, bbox = dict(facecolor = "none", edgecolor = "black",
    linewidth = 0.25, boxstyle = "round"))
plt.xlabel("Type Of Guest", fontweight = "bold", fontsize = 11, fontfamily = "sans-serif", color = "black")
plt.ylabel("Total", fontweight = "bold", fontsize = 11, fontfamily = "sans-serif", color = "black")
plt.xticks([0, 1], labels)
plt.grid(axis = "y", alpha = 0.4)

df['is_repeated_guest'].value_counts()
📢 Conclusion: Almost all guests are new guests.

💦 Reserved Room Type Analysis

plt.figure(figsize = (19, 12))
order = sorted(df["reserved_room_type"].unique())
plt.title("Bookings By Reserved Room Types", fontweight = "bold", fontsize = 14, 
        fontfamily = "sans-serif", color = 'black')
ax = sns.countplot(x = "reserved_room_type", data = df, hue = "hotel", edgecolor = "black", palette = "bone", order = order)
plt.xlabel("Reserved Room Type", fontweight = "bold", fontsize = 11, fontfamily = "sans-serif", color = "black")
plt.ylabel("Number Of Bookings", fontweight = "bold", fontsize = 11, fontfamily = "sans-serif", color = "black")
plt.grid(axis = "y", alpha = 0.4)

df["reserved_room_type"].value_counts()
📢 Conclusion: Most guests reserved room type A, a smaller number reserved types D and E, and demand for the remaining types is low.

💦 Assigned Room Type Analysis

plt.figure(figsize = (19, 12))
order = sorted(df["assigned_room_type"].unique())
plt.title("Bookings By Assigned Room Types", fontweight = "bold", fontsize = 14, 
        fontfamily = "sans-serif", color = 'black')
ax = sns.countplot(x = "assigned_room_type", data = df, hue = "hotel", edgecolor = "black", palette = "bone", order = order)
plt.xlabel("Assigned Room Type", fontweight = "bold", fontsize = 11, fontfamily = "sans-serif", color = "black")
plt.ylabel("Number Of Bookings", fontweight = "bold", fontsize = 11, fontfamily = "sans-serif", color = "black")
plt.grid(axis = "y", alpha = 0.4)

df["assigned_room_type"].value_counts()
📢 Conclusion: Most guests were assigned room type A, a smaller number were assigned types D and E, and the remaining types are rare.

💦 Reservation Status Analysis

plt.figure(figsize = (19, 12))
order = sorted(df["reservation_status"].unique())
plt.title("Bookings By Reservation Status", fontweight = "bold", fontsize = 14, 
        fontfamily = "sans-serif", color = 'black')
ax = sns.countplot(x = "reservation_status", data = df, hue = "hotel", edgecolor = "black", palette = "ocean", order = order)
plt.xlabel("Reservation Status", fontweight = "bold", fontsize = 11, fontfamily = "sans-serif", color = "black")
plt.ylabel("Number Of Bookings", fontweight = "bold", fontsize = 11, fontfamily = "sans-serif", color = "black")
plt.grid(axis = "y", alpha = 0.4)

df["reservation_status"].value_counts()
📢 Conclusion: Most guests checked in and have already checked out.

💦 Distribution of Total Nights Stayed

plt.figure(figsize = (19, 9))
df2 = df.groupby("total_nights")["total_nights"].count()
df2.sort_values(ascending = False)[: 10].plot(kind = 'bar')
plt.title("Bookings By Total Nights Stayed By Guests", fontweight = "bold", fontsize = 14, fontfamily = "sans-serif", 
color = 'black')
plt.xticks(rotation = 30)
plt.xlabel("Number Of Nights", fontweight = "bold", fontsize = 11, fontfamily = "sans-serif", color = "black")
plt.ylabel("Number Of Bookings", fontweight = "bold", fontsize = 11, fontfamily = "sans-serif", color = "black")
plt.grid(axis = "y", alpha = 0.4)
📢 Conclusion: The most popular length of stay is three nights.

💦 Hotel & Total Nights Stayed

plt.figure(figsize = (19, 12))
order = df.total_nights.value_counts().iloc[:10].index
plt.title("Total Nights Stayed By Guests In Hotel", fontweight = "bold", fontsize = 14, 
        fontfamily = "sans-serif", color = 'black')
ax = sns.countplot(x = "total_nights", data = df, hue = "hotel", edgecolor = "black", palette = "ocean", order = order)
plt.xlabel("Total Nights", fontweight = "bold", fontsize = 11, fontfamily = "sans-serif", color = "black")
plt.ylabel("Number Of Bookings", fontweight = "bold", fontsize = 11, fontfamily = "sans-serif", color = "black")
plt.grid(axis = "y", alpha = 0.4)
📢 Conclusion: At resort hotels, the most popular stays are one, seven, two, three, and four nights. At city hotels, they are three, two, one, and four nights.

💦 Top Countries

plt.figure(figsize = (19, 9))
df2 = df.groupby("country")["country"].count()
df2.sort_values(ascending = False)[: 20].plot(kind = 'bar')
plt.title("Bookings By Top 20 Countries", fontweight = "bold", fontsize = 14, fontfamily = "sans-serif", color = 'black')
plt.xticks(rotation = 30)
plt.xlabel("Country", fontweight = "bold", fontsize = 11, fontfamily = "sans-serif", color = "black")
plt.ylabel("Number Of Bookings", fontweight = "bold", fontsize = 11, fontfamily = "sans-serif", color = "black")
plt.grid(axis = "y", alpha = 0.4)

df["country"].value_counts()
📢 Conclusion: In this dataset, Portugal has more bookings than any other country.

💦 Booking Lead Time

plt.figure(figsize = (16, 6))
plt.title("Bookings By Lead Time", fontweight = "bold", fontsize = 14, fontfamily = 'sans-serif', color = 'black')
sns.histplot(data = df, x = 'lead_time', hue = "hotel", kde = True, color = "#104E8B")
plt.xlabel('Lead Time', fontweight = 'normal', fontsize = 11, fontfamily = 'sans-serif', color = "black")
plt.ylabel('Number Of Bookings', fontweight = 'regular', fontsize = 11, fontfamily = "sans-serif", color = "black")

df["lead_time"].describe().T
📢 Conclusion: Most bookings were made within 100 days before arrival.
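A statement such as "within 100 days" can be backed by a single number: the share of bookings whose lead time is at most 100. A minimal sketch, assuming a hypothetical series of ten lead times in place of `df["lead_time"]`:

```python
import pandas as pd

# Hypothetical lead times in days, standing in for df["lead_time"]
lead_time = pd.Series([0, 3, 12, 45, 60, 99, 100, 150, 240, 400])

# The mean of a boolean mask is the fraction of True values
share_within_100 = (lead_time <= 100).mean()

# Quartiles show where the bulk of the bookings sits
quantiles = lead_time.quantile([0.25, 0.5, 0.75])

print(f"{share_within_100:.0%} of bookings have a lead time of 100 days or less")
print(quantiles)
```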

💡 Association Analysis

💦 Cancellations & Hotel Type

plt.figure(figsize = (19, 12))
plt.title("Number Of Bookings Cancelled By Guests", fontweight = "bold", fontsize = 14, 
        fontfamily = "sans-serif", color = 'black')
ax = sns.countplot(x = "hotel", data = df, hue = "is_canceled", edgecolor = "black", palette = "bone")
for rect in ax.patches:
    ax.text(rect.get_x() + rect.get_width()/2, rect.get_height() + 4.25, rect.get_height(),
    horizontalalignment="center", fontsize = 10, bbox = dict(facecolor = "none", edgecolor = "black",
    linewidth = 0.25, boxstyle = "round"))
plt.xlabel("Hotel", fontweight = "bold", fontsize = 11, fontfamily = "sans-serif", color = "black")
plt.ylabel("Number Of Bookings", fontweight = "bold", fontsize = 11, fontfamily = "sans-serif", color = "black")
plt.grid(axis = "y", alpha = 0.4)
📢 Conclusion: Guests at resort hotels cancel their bookings less often than guests at city hotels.
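The count plot shows absolute numbers, and the two hotel types have different booking volumes, so a per-hotel cancellation rate makes the comparison fairer. A minimal sketch, assuming a hypothetical ten-row `toy` DataFrame with the same two columns:

```python
import pandas as pd

# Hypothetical toy data: 5 bookings per hotel type
toy = pd.DataFrame({
    "hotel": ["City Hotel"] * 5 + ["Resort Hotel"] * 5,
    "is_canceled": [
        "Cancelled", "Cancelled", "Not Cancelled", "Not Cancelled", "Not Cancelled",
        "Cancelled", "Not Cancelled", "Not Cancelled", "Not Cancelled", "Not Cancelled",
    ],
})

# normalize="index" turns each row of counts into shares that sum to 1
rates = pd.crosstab(toy["hotel"], toy["is_canceled"], normalize="index")

print(rates)
```

The same pattern works for any categorical column in place of `hotel`, for example the market segment or the guest type.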

💦 Cancellations & New vs. Repeated Guests

plt.figure(figsize = (19, 12))
plt.title("Number Of Bookings Cancelled By Type Of Guests", fontweight = "bold", fontsize = 14, 
        fontfamily = "sans-serif", color = 'black')
ax = sns.countplot(x = "is_canceled", data = df, hue = "is_repeated_guest", edgecolor = "black", palette = "bone")
for rect in ax.patches:
    ax.text(rect.get_x() + rect.get_width()/2, rect.get_height() + 4.25, rect.get_height(),
    horizontalalignment="center", fontsize = 10, bbox = dict(facecolor = "none", edgecolor = "black",
    linewidth = 0.25, boxstyle = "round"))
plt.xlabel("Cancellation", fontweight = "bold", fontsize = 11, fontfamily = "sans-serif", color = "black")
plt.ylabel("Number Of Bookings", fontweight = "bold", fontsize = 11, fontfamily = "sans-serif", color = "black")
plt.legend(['New Guest', 'Repeated Guest'])
plt.grid(axis = "y", alpha = 0.4)
📢 Conclusion: Repeated guests cancelled fewer bookings than new guests.

💦 Cancellations & Market Segment

plt.figure(figsize = (19, 12))
plt.title("Number Of Bookings Cancelled By Market Segments", fontweight = "bold", fontsize = 14, 
        fontfamily = "sans-serif", color = 'black')
ax = sns.countplot(x = "market_segment", data = df, hue = "is_canceled", edgecolor = "black", palette = "bone")
for rect in ax.patches:
    ax.text(rect.get_x() + rect.get_width()/2, rect.get_height() + 4.25, rect.get_height(),
    horizontalalignment="center", fontsize = 10, bbox = dict(facecolor = "none", edgecolor = "black",
    linewidth = 0.25, boxstyle = "round"))
plt.xlabel("Market Segment", fontweight = "bold", fontsize = 11, fontfamily = "sans-serif", color = "black")
plt.ylabel("Number Of Bookings", fontweight = "bold", fontsize = 11, fontfamily = "sans-serif", color = "black")
plt.grid(axis = "y", alpha = 0.4)
📢 Conclusion: The online travel agent, offline travel agent/tour operator, and direct segments account for more cancelled bookings than the other segments.

💦 Bookings & Year

plt.figure(figsize = (19, 12))
plt.title("Number Of Bookings Per Year", fontweight = "bold", fontsize = 14, 
        fontfamily = "sans-serif", color = 'black')
ax = sns.countplot(x = "arrival_date_year", data = df, hue = "hotel", edgecolor = "black", palette = "cool")
for rect in ax.patches:
    ax.text(rect.get_x() + rect.get_width()/2, rect.get_height() + 4.25, rect.get_height(),
    horizontalalignment="center", fontsize = 10, bbox = dict(facecolor = "none", edgecolor = "black",
    linewidth = 0.25, boxstyle = "round"))
plt.xlabel("Year", fontweight = "bold", fontsize = 11, fontfamily = "sans-serif", color = "black")
plt.ylabel("Number Of Bookings", fontweight = "bold", fontsize = 11, fontfamily = "sans-serif", color = "black")
plt.grid(axis = "y", alpha = 0.4)

df["arrival_date_year"].value_counts()
📢 Conclusion: Both resort and city hotels had their highest booking volume in 2016. City hotels had more bookings than resort hotels in 2017. The two had almost the same number of bookings in 2015.

💦 Bookings & Month

months = ["January", "February", "March", "April", "May", "June", "July", "August", "September", "October",
"November", "December"]

plt.figure(figsize = (19, 12))
plt.title("Number Of Bookings Per Month", fontweight = "bold", fontsize = 14, 
        fontfamily = "sans-serif", color = 'black')
d = df.groupby("arrival_date_month")["arrival_date_month"].count()
sns.barplot(x = d.index, y = d, order = months)
plt.xticks(rotation = 30)
plt.xlabel("Months")
plt.ylabel("Number Of Bookings")

df["arrival_date_month"].value_counts()
📢 Conclusion: November, December, January, and February have the fewest bookings, while July and August are the peak months.

💦 Bookings & Customer Type

plt.figure(figsize = (16, 10))
plt.title("Number Of Bookings Per Customer Type", fontweight = "bold", fontsize = 14, 
        fontfamily = "sans-serif", color = 'black')
ax = sns.countplot(x = "customer_type", data = df, hue = "hotel", edgecolor = "black", palette = "pink")
for rect in ax.patches:
    ax.text(rect.get_x() + rect.get_width()/2, rect.get_height() + 4.25, rect.get_height(),
    horizontalalignment="center", fontsize = 10, bbox = dict(facecolor = "none", edgecolor = "black",
    linewidth = 0.25, boxstyle = "round"))
plt.xlabel("Customer Type", fontweight = "bold", fontsize = 11, fontfamily = "sans-serif", color = "black")
plt.ylabel("Number Of Bookings", fontweight = "bold", fontsize = 11, fontfamily = "sans-serif", color = "black")
plt.grid(axis = "y", alpha = 0.4)
📢 Conclusion: Transient and Transient-Party guests mostly book city hotels, while Group guests book resort and city hotels in almost equal numbers.

💦 Parking Spaces & Bookings

plt.figure(figsize = (19, 12))
plt.title("Number Of Bookings Per Required Car Parking Space", fontweight = "bold", fontsize = 14, 
        fontfamily = "sans-serif", color = 'black')
ax = sns.countplot(x = "required_car_parking_spaces", data = df, hue = "hotel", edgecolor = "black", palette = "cool")
plt.xlabel("Required Car Parking Space", fontweight = "bold", fontsize = 11, fontfamily = "sans-serif", color = "black")
plt.ylabel("Number Of Bookings", fontweight = "bold", fontsize = 11, fontfamily = "sans-serif", color = "black")
plt.grid(axis = "y", alpha = 0.4)

df["required_car_parking_spaces"].value_counts()
📢 Conclusion: Most guests do not need a parking space; only a few do.

💦 Country & Number of Special Requests

df2 = df.groupby("country")["total_of_special_requests"].mean().sort_values(ascending = False)[: 20]
plt.figure(figsize = (18, 8))
sns.barplot(x = df2.index, y = df2)
plt.xticks(rotation = 30)
plt.xlabel("Country")
plt.ylabel("Average Number Of Special Requests")
plt.title("Average Number Of Special Requests Made By Top 20 Countries ", fontweight = "bold", fontsize = 14,fontfamily = "sans-serif", color = 'black')
📢 Conclusion: Among these countries, Botswana has the highest average number of special requests per booking.
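A ranking by group mean can be dominated by groups that contain only a handful of rows, so it helps to look at the number of bookings next to each average. A minimal sketch, assuming a hypothetical `toy` DataFrame with made-up country codes and a made-up threshold `MIN_BOOKINGS`; neither reflects the real dataset.

```python
import pandas as pd

# Hypothetical toy data: "CCC" has a single booking with many requests
toy = pd.DataFrame({
    "country": ["AAA"] * 4 + ["BBB"] * 3 + ["CCC"],
    "total_of_special_requests": [0, 1, 0, 1, 1, 2, 0, 3],
})

# Mean and group size side by side, highest mean first
stats = (
    toy.groupby("country")["total_of_special_requests"]
       .agg(mean_requests="mean", bookings="count")
       .sort_values("mean_requests", ascending=False)
)

MIN_BOOKINGS = 3  # hypothetical threshold, chosen only for this toy example
reliable = stats[stats["bookings"] >= MIN_BOOKINGS]

print(stats)
print(reliable)
```

In the toy table the single-booking country tops the raw ranking and disappears once the threshold is applied, which is the effect to watch for in the real ranking.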

💦 Customer Type & Number of Special Requests

df2 = df.groupby("customer_type")["total_of_special_requests"].mean().sort_values(ascending = False)[: 20]
plt.figure(figsize = (18, 8))
sns.barplot(x = df2.index, y = df2)
plt.xticks(rotation = 30)
plt.xlabel("Customer Type")
plt.ylabel("Average Number Of Special Requests")
plt.title("Average Number Of Special Requests By Customer Type", fontweight = "bold", fontsize = 14, fontfamily = "sans-serif", color = 'black')
📢 Conclusion: Group guests make the most special requests on average, while Transient-Party guests make the fewest.

💦 Month & Number of Special Requests

months = ["January", "February", "March", "April", "May", "June", "July", "August", "September", "October",
"November", "December"]

df2 = df.groupby("arrival_date_month")["total_of_special_requests"].mean().sort_values(ascending = False)[: 20]
plt.figure(figsize = (18, 8))
sns.barplot(x = df2.index, y = df2, order = months)
plt.xticks(rotation = 30)
plt.xlabel("Months")
plt.ylabel("Average Number Of Special Requests")
plt.title("Average Number Of Special Requests By Guests Per Months ", fontweight = "bold", fontsize = 14, fontfamily = "sans-serif", color = 'black')
📢 Conclusion: The average number of special requests is similar across months, with slightly more in August, July, and December.

💦 Hotel Type & Price

# Histogram
fig = plt.figure(figsize = (16, 10))
df.drop(df[df["adr"] == 5400].index, axis = 0, inplace = True)
plt.suptitle("Average Daily Rate Per Hotel", fontweight = "bold", fontsize = 14, 
        fontfamily = "sans-serif", color = 'black')

plot1 = fig.add_subplot(1, 2, 2)
plt.title("Histogram Plot", fontweight = "bold", fontsize = 14, fontfamily = 'sans-serif', color = 'black')
sns.histplot(data = df, x = 'adr', hue = "hotel", kde = True, color = "#104E8B")
plt.xlabel('Average Daily Rate', fontweight = 'normal', fontsize = 11, fontfamily = 'sans-serif', color = "black")
plt.ylabel('Count', fontweight = 'regular', fontsize = 11, fontfamily = "sans-serif", color = "black")
# Box Plot
plot2 = fig.add_subplot(1, 2, 1)
plt.title("Box Plot", fontweight = "bold", fontsize = 14, fontfamily = 'sans-serif', color = 'black')
sns.boxplot(data = df, x = "hotel", y = 'adr', color = "#104E8B")
plt.ylabel('Average Daily Rate', fontweight = 'regular', fontsize = 11, fontfamily = 'sans-serif', color = "black")
plt.show()

df["adr"].describe()
📢 Conclusion: The average daily rate of resort hotels is more spread out than that of city hotels.
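Spread can also be quantified rather than read off the box plot, for example with the standard deviation and the interquartile range per hotel type. A minimal sketch, assuming a hypothetical `toy` DataFrame and a small helper named `iqr` defined here for illustration:

```python
import pandas as pd

# Hypothetical toy data: same mean rate, different spread
toy = pd.DataFrame({
    "hotel": ["City Hotel"] * 5 + ["Resort Hotel"] * 5,
    "adr": [90, 95, 100, 105, 110, 40, 70, 100, 130, 160],
})

def iqr(values):
    """Interquartile range: distance between the 75th and 25th percentiles."""
    return values.quantile(0.75) - values.quantile(0.25)

# One row per hotel type with mean, standard deviation, and IQR of the rate
spread = toy.groupby("hotel")["adr"].agg(["mean", "std", iqr])

print(spread)
```

The interquartile range is less sensitive to single extreme rates than the standard deviation, so reporting both gives a more robust picture.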

💦 Month & Average Daily Rate

months = ["January", "February", "March", "April", "May", "June", "July", "August", "September", "October",
"November", "December"]

df.drop(df[df["adr"] == 5400].index, axis = 0, inplace = True)
d = df.groupby(["hotel", "arrival_date_month"])["adr"].mean().reset_index()
d["arrival_date_month"] = pd.Categorical(d["arrival_date_month"], categories = months, ordered = True)
d.sort_values("arrival_date_month", inplace = True)

fig = plt.figure(figsize = (16, 10))
plt.suptitle("Average Daily Rate Per Month", fontweight = "bold", fontsize = 14, 
        fontfamily = "sans-serif", color = 'black')
sns.lineplot(data = d, y = 'adr', x = "arrival_date_month", hue = "hotel")
plt.ylabel('Average Daily Rate', fontweight = 'normal', fontsize = 11, fontfamily = 'sans-serif', color = "black")
plt.xlabel('Months', fontweight = 'regular', fontsize = 11, fontfamily = "sans-serif", color = "black")
plt.xticks(rotation = 30)
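📢 Conclusion: Both hotel types have higher average daily rates in the middle of the year. Compared with resort hotels, city hotels have higher daily rates at the beginning and end of the year.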
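Month names are plain strings, so a `groupby` returns them in alphabetical order; converting the column to an ordered categorical, as the code above does, restores calendar order before plotting. A minimal sketch, assuming a hypothetical four-row `toy` DataFrame:

```python
import pandas as pd

months = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]

# Hypothetical toy data with three distinct months
toy = pd.DataFrame({
    "arrival_date_month": ["August", "January", "April", "August"],
    "adr": [150.0, 60.0, 90.0, 170.0],
})

monthly = toy.groupby("arrival_date_month")["adr"].mean().reset_index()
alphabetical = monthly["arrival_date_month"].tolist()  # order produced by groupby

# An ordered categorical sorts by position in `months`, not by spelling
monthly["arrival_date_month"] = pd.Categorical(
    monthly["arrival_date_month"], categories=months, ordered=True
)
monthly = monthly.sort_values("arrival_date_month")
calendar_order = monthly["arrival_date_month"].astype(str).tolist()

print(alphabetical)
print(calendar_order)
```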

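💡 Correlation Analysis

💦 Correlation Matrix

Compute the correlation matrix to see how the numeric fields relate to each other.

# Drop the fields that are left out of the correlation analysis
df_sub = df.drop(['arrival_date_week_number', 'arrival_date_day_of_month', 'previous_cancellations','previous_bookings_not_canceled', 'booking_changes', 'reservation_status_date', 'agent', 'company', 'days_in_waiting_list', 'adults', 'babies', 'children'], axis = 1)

# Correlation matrix; numeric_only = True skips text columns such as "hotel"
# ("is_canceled" now holds text labels, so it is skipped as well)
corr_matrix = round(df_sub.corr(numeric_only = True), 3)
print("Correlation Matrix: ")
corr_matrix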

💦 Heatmap

Plot a heatmap to make the correlations between fields easier to read.

plt.rcParams['figure.figsize'] =(12, 6)
sns.heatmap(df_sub.corr(numeric_only = True), annot=True, cmap='Reds', linewidths=5)
plt.suptitle('Correlation Between Variables', fontweight='heavy', x=0.03, y=0.98, ha = "left", fontsize='18', fontfamily='sans-serif', color= "black")
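Because `is_canceled` was converted to text labels during preprocessing, it cannot take part in a numeric correlation unless it is encoded back to 0/1. A minimal sketch, assuming a hypothetical four-row `toy` DataFrame and a helper column named `is_canceled_flag` introduced only for this example:

```python
import pandas as pd

# Hypothetical toy data with text and numeric columns
toy = pd.DataFrame({
    "hotel": ["City Hotel", "Resort Hotel", "City Hotel", "Resort Hotel"],
    "is_canceled": ["Cancelled", "Not Cancelled", "Cancelled", "Not Cancelled"],
    "lead_time": [200, 10, 150, 30],
    "adr": [80.0, 120.0, 95.0, 110.0],
})

# Encode the text label back to 0/1 so it can take part in the correlation
toy["is_canceled_flag"] = (toy["is_canceled"] == "Cancelled").astype(int)

# numeric_only=True restricts the calculation to numeric columns
corr = toy.corr(numeric_only=True).round(3)

print(corr)
```

With the flag included, the heatmap could show how cancellation relates to fields such as lead time, which is often the relationship of most business interest.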
