Python Data Feature Analysis_01 (MOOC Study Time Analysis) (Part 1)

2022-05-26


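Summary: This post analyzes how stable each student's study habits are, based on their recorded study time. The dataset covers 115 students, each with 11 recorded observations.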


I. Distribution Analysis

1. Distribution analysis of quantitative data

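Distribution analysis of quantitative data starts with the range (maximum minus minimum), which then determines how the values are split into bins.

After importing the Excel data and assigning it to x, let x3 be the data for the last week: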

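import numpy as np
import pandas as pd

# x is the DataFrame loaded from the Excel file; x3 is the last week's column
x3=x['20220403']

# 01. Compute the range
R=x3.max()-x3.min()
# Resulting range: 11405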

# 02. Binning
rg=np.ceil(R/600)
# 600 is used as the bin width
# np.ceil returns the smallest integer greater than or equal to R/600

This gives rg=20.0 (11405/600 ≈ 19.01, rounded up), so with a bin width of 600 the data should be split into 20 bins.
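The arithmetic behind the bin count can be checked in isolation. The sketch below plugs in the range reported in this post (11405) and the chosen width (600); the helper name `bin_count` is made up for illustration.

```python
import numpy as np

def bin_count(value_range, width):
    """Number of equal-width bins needed to cover value_range."""
    return int(np.ceil(value_range / width))

R = 11405      # range reported in this post
width = 600    # bin width chosen in this post

n_bins = bin_count(R, width)
edges = np.arange(0, n_bins * width + 1, width)  # 0, 600, ..., 12000

print(n_bins)        # 20
print(len(edges))    # 21 edges delimit 20 bins
print(edges[-1])     # 12000
```

Twenty bins need twenty-one edges, which is why the upper bound passed to np.arange has to be one step past the last edge.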

# Build the intervals by choosing the bin edges
listBins=np.arange(0,12001,600) # 0 to 12000 in steps of 600
listBins

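import itertools # iterator toolkit for building combinations of data as needed
aa=listBins
bb=list(itertools.permutations(aa,2))
print(bb)
# Full permutations in Python: the permutations function
# itertools.permutations(iterable, r=None)
# Returns successive permutations of length r built from the elements of iterable.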
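Note that permutations returns every ordered pair of edges, 21 × 20 = 420 tuples here, and most of them are not bins. If the goal is only to list the 20 (left, right) bin boundaries, pairing each edge with its successor is enough. A minimal sketch, assuming that is the intent; `bin_pairs` is an illustrative name:

```python
import itertools
import numpy as np

listBins = np.arange(0, 12001, 600)

# Every ordered pair of distinct edges: 21 * 20 = 420 tuples.
all_pairs = list(itertools.permutations(listBins, 2))

# Only adjacent edges: one (left, right) tuple per bin.
bin_pairs = [(int(lo), int(hi)) for lo, hi in zip(listBins[:-1], listBins[1:])]

print(len(all_pairs))   # 420
print(len(bin_pairs))   # 20
print(bin_pairs[:3])    # [(0, 600), (600, 1200), (1200, 1800)]
```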

fw=list(listBins)
fenzu=pd.cut(x3.values,fw,right=False)
print(fenzu.codes)
print(fenzu.categories)
# IntervalIndex([[0, 600), [600, 1200), [1200, 1800), [1800, 2400), [2400, 3000) ... [9000, 9600), [9600, 10200), [10200, 10800), [10800, 11400), [11400, 12000)], dtype='interval[int64, left]')
ps=fenzu.value_counts()
ps

ps.plot(kind='bar')
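The right=False argument is what makes each bin closed on the left and open on the right, as the printed IntervalIndex shows. A small sketch on made-up values (not the MOOC data) makes the difference from the default visible at the bin edges:

```python
import pandas as pd

# Made-up study times, chosen to sit on and around the bin edges.
times = pd.Series([0, 599, 600, 1199, 1200])
edges = [0, 600, 1200, 1800]

left_closed = pd.cut(times, edges, right=False)   # [0, 600), [600, 1200), ...
right_closed = pd.cut(times, edges)               # (0, 600], (600, 1200], ...

print(left_closed.cat.codes.tolist())    # [0, 0, 1, 1, 2]
print(right_closed.cat.codes.tolist())   # [-1, 0, 0, 1, 1]
```

With the default right-closed bins, a study time of exactly 0 falls outside every interval (code -1), so left-closed bins are the safer choice when zeros can occur.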

Use the binned table to see how the students are studying:

qujian=pd.cut(x3,fw,right=False)
x['区间']=qujian.values    # '区间' = bin (interval) column
x.groupby('区间').median()    # median per bin
x.groupby('区间').mean()    # mean per bin
# to_frame names the column explicitly, whatever name the counts Series carries
ps_df=ps.to_frame(name='频数')    # '频数' = count
ps_df['频率f']=ps_df['频数']/ps_df['频数'].sum()    # relative frequency
ps_df['频率%']=ps_df['频率f'].map(lambda v:'%.2f%%'%(v*100))
ps_df['累计频率f']=ps_df['频率f'].cumsum() # cumulative frequency
ps_df['累计频率%']=ps_df['累计频率f'].map(lambda v:'%.4f%%'%(v*100))
ps_df
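The same table can be built end to end on a small, made-up sample, which makes the counts easy to verify by hand. This is a sketch under stated assumptions: the function name `frequency_table` and the English column names are illustrative, and the sample values are not the MOOC data.

```python
import pandas as pd

def frequency_table(values, edges):
    """Counts, relative frequency, and cumulative frequency per bin."""
    bins = pd.cut(values, edges, right=False)
    # sort=False keeps the bins in interval order instead of ordering by count.
    counts = pd.Series(bins).value_counts(sort=False)
    table = counts.to_frame(name="count")
    table["freq"] = table["count"] / table["count"].sum()
    table["cum_freq"] = table["freq"].cumsum()
    return table

# Made-up study times, not the MOOC data.
sample = [100, 250, 700, 800, 900, 1300, 1700, 1750]
table = frequency_table(sample, [0, 600, 1200, 1800])
print(table)
```

The cumulative frequency of the last bin should always come out as 1, which is a quick sanity check that no value fell outside the bin edges.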

2. Plotting the frequency chart

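import matplotlib.pyplot as plt

ps_df['频率f'].plot(kind='bar',
                 width=0.8,
                 figsize=(18,9),
                 rot=90,
                 color='g',
                 grid=True,
                 alpha=0.5
                 )
plt.title("Online learning frequency histogram")
n=len(ps_df) # number of bins (named n so the DataFrame x is not overwritten)
y=ps_df['频率f']
m=ps_df['频数']
for i,j,k in zip(range(n),y,m): # zip pairs each bar position with its frequency and count
    plt.text(i-0.2,j-0.01,'%i' % k,color='k') # the offsets place the count label just below the top of each bar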

# In-cell bars, similar to data bars in Excel
ps_df.style.bar(subset=['频率f','累计频率f'], color='green',width=100)

II. Comparative Analysis

1. Absolute value comparison

import seaborn as sns
import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
%matplotlib inline
# The line above is a Jupyter magic; remove it when running as a plain script
plt.rcParams['font.sans-serif']=['Microsoft YaHei'] # render Chinese labels correctly
plt.rcParams['axes.unicode_minus']=False # render the minus sign correctly
from datetime import datetime
# dfddiff and dfdiffv2 are DataFrames prepared beforehand; their construction is not shown in this post
fig,axes = plt.subplots(2,1,figsize = (30,20),sharex=False)
dfddiff.plot(kind='line',style='--.',alpha=0.8,ax=axes[0])
axes[0].legend(loc='best',frameon=True,ncol=3)
dfdiffv2.plot(kind='line',style='--.',alpha=0.8,ax=axes[1])
x=range(len(dfdiffv2))
plt.xticks(x,(dfdiffv2.index.values))
axes[1].set_xticklabels(dfdiffv2.index.values,rotation=90)

Plot the first 10 students' data as a bar chart to compare how they performed in different time periods:

ax=df[0:10].plot(kind='bar',legend=False,figsize=(12,8))
patches,labels=ax.get_legend_handles_labels()
ax.legend(patches,labels,loc='best')
plt.xticks(rotation=30)

2. Difference line chart

# Compare the study-time data of two students
fig=plt.figure(figsize=(10,6))
plt.subplots_adjust(hspace=0.3)
ax1=fig.add_subplot(2,1,1)
x=range(len(df2))
y1=df2['罗梓']
y2=-df2['李裕'] # negated so the second student's bars point downward
plt.bar(x,y1,width=1,facecolor='yellowgreen')
plt.bar(x,y2,width=1,facecolor='lightskyblue')
plt.grid()
plt.xticks(x,(df2.index.values))
ax1.set_xticklabels(df2.index.values,rotation=90)

# Difference line chart comparison
fig=plt.figure(figsize=(10,6))
plt.subplots_adjust(hspace=0.3)
ax1=fig.add_subplot(2,1,1)
x=range(len(df2))
y1=df2['罗梓']
y2=df2['李裕']
y3=y1-y2 # gap between the two students in each period
plt.plot(x,y3,'--^r')
plt.axhline(1000,color='r',linestyle='--') # dashed reference line at 1000
plt.bar(x,y1,width=1,facecolor='yellowgreen')
plt.bar(x,y2,width=1,facecolor='lightskyblue')
plt.grid()
plt.xticks(x,(df2.index.values))
ax1.set_xticklabels(df2.index.values,rotation=90)
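The dashed horizontal line at 1000 reads as a visual threshold for the gap between the two students, and the same comparison can be done numerically. A sketch with made-up numbers, assuming the aim is to find the periods where one student is ahead by more than 1000; the column names `student_a` and `student_b`, the values, and all dates other than 20220403 are placeholders, not the original data:

```python
import pandas as pd

# Placeholder data standing in for df2; not the MOOC records.
df2 = pd.DataFrame(
    {"student_a": [3000, 4200, 1500, 5000],
     "student_b": [2500, 2000, 1800, 3900]},
    index=["20220313", "20220320", "20220327", "20220403"],
)

gap = df2["student_a"] - df2["student_b"]   # same quantity as y3 in the plot
over_threshold = gap[gap > 1000]            # periods above the dashed line

print(gap.tolist())                    # [500, 2200, -300, 1100]
print(over_threshold.index.tolist())   # ['20220320', '20220403']
```

A negative gap means the second student studied longer in that period, which the plot shows as the difference line dipping below zero.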
