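2.3.5 Identifying which features are numerical and which are object-typed
- Features are generally either categorical or numerical, and numerical features are further divided into continuous and discrete ones.
- Categorical features sometimes carry no numerical (ordinal) relationship and sometimes do. For example, for the grades A, B, C and so on in 'grade', whether they are plain categories or whether A is better than the others has to be decided together with the business context.
- Numerical features can be fed to a model directly, but risk-control practitioners often bin them and convert them to WOE encodings, for example to build a standard scorecard. In terms of model performance, binning mainly aims to reduce the complexity of a variable, reduce the influence of its noise on the model, and strengthen the association between the independent and dependent variables, which makes the model more stable.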
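The binning-then-WOE step mentioned above can be sketched roughly as follows. This is a minimal illustration on synthetic data, not the procedure used on the competition dataset: the column names `income` and `is_default`, the choice of five equal-frequency bins, the smoothing constant, and the sign convention (good share over bad share) are all assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Synthetic data (hypothetical): one numeric feature and a binary default flag.
rng = np.random.default_rng(0)
toy = pd.DataFrame({"income": rng.lognormal(mean=10, sigma=0.5, size=1000)})
# Make default more likely at low income, purely for illustration.
p_default = 1 / (1 + np.exp((np.log(toy["income"]) - 10) * 2))
toy["is_default"] = (rng.random(1000) < p_default).astype(int)

# Step 1: equal-frequency binning into 5 bins, labelled 0..4.
toy["income_bin"] = pd.qcut(toy["income"], q=5, labels=False)

# Step 2: count goods (0) and bads (1) per bin.
stats = toy.groupby("income_bin")["is_default"].agg(["sum", "count"])
bad = stats["sum"]
good = stats["count"] - stats["sum"]

# Step 3: WOE per bin, here defined as ln(share of goods / share of bads).
# A small constant avoids division by zero in empty cells (assumed value).
eps = 0.5
good_share = (good + eps) / (good + eps).sum()
bad_share = (bad + eps) / (bad + eps).sum()
woe = np.log(good_share / bad_share)

# Step 4: replace each raw value with the WOE of its bin.
toy["income_woe"] = toy["income_bin"].map(woe)
print(woe)
```

Some scorecard references define WOE with the opposite sign (bad share over good share); either works as long as it is applied consistently.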
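# Numerical features: every column whose dtype is not object
numerical_fea = list(data_train.select_dtypes(exclude=['object']).columns)
# Categorical (object-typed) features: all remaining columns
category_fea = list(filter(lambda x: x not in numerical_fea,list(data_train.columns)))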
numerical_fea
['id', 'loanAmnt', 'term', 'interestRate', 'installment', 'employmentTitle', 'homeOwnership', 'annualIncome', 'verificationStatus', 'isDefault', 'purpose', 'postCode', 'regionCode', 'dti', 'delinquency_2years', 'ficoRangeLow', 'ficoRangeHigh', 'openAcc', 'pubRec', 'pubRecBankruptcies', 'revolBal', 'revolUtil', 'totalAcc', 'initialListStatus', 'applicationType', 'title', 'policyCode', 'n0', 'n1', 'n2', 'n2.1', 'n4', 'n5', 'n6', 'n7', 'n8', 'n9', 'n10', 'n11', 'n12', 'n13', 'n14']
category_fea
['grade', 'subGrade', 'employmentLength', 'issueDate', 'earliesCreditLine']
data_train.grade
0         E
1         D
2         D
3         A
4         C
         ..
799995    C
799996    A
799997    C
799998    A
799999    B
Name: grade, Length: 800000, dtype: object
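Whether 'grade' should be treated as ordered or as plain categories is a business decision, as noted earlier. The sketch below shows both encodings side by side on the first five rows printed above. The ordering A best through G worst is an assumption made for illustration, not something the dataset confirms.

```python
import pandas as pd

# First five grade values from the output above.
grades = pd.Series(["E", "D", "D", "A", "C"], name="grade")

# Option 1: ordinal encoding, assuming A is the best grade and G the worst.
grade_order = ["A", "B", "C", "D", "E", "F", "G"]
ordered = pd.Categorical(grades, categories=grade_order, ordered=True)
grade_ordinal = pd.Series(ordered.codes + 1, name="grade_ordinal")  # A -> 1, ..., G -> 7

# Option 2: one-hot encoding, which assumes no order between grades.
grade_onehot = pd.get_dummies(grades, prefix="grade")

print(grade_ordinal.tolist())
print(list(grade_onehot.columns))
```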
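Numerical variable analysis: numerical features include both continuous and discrete variables, so the next step is to separate the two.
- Split the numerical variables into continuous and discrete ones
# Separate out numerical features that behave like categories
def get_numerical_serial_fea(data,feas):
    numerical_serial_fea = []
    numerical_noserial_fea = []
    for fea in feas:
        temp = data[fea].nunique()
        # Treat a feature with at most 10 distinct values as discrete
        if temp <= 10:
            numerical_noserial_fea.append(fea)
            continue
        numerical_serial_fea.append(fea)
    return numerical_serial_fea,numerical_noserial_fea
numerical_serial_fea,numerical_noserial_fea = get_numerical_serial_fea(data_train,numerical_fea)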
numerical_serial_fea
['id', 'loanAmnt', 'interestRate', 'installment', 'employmentTitle', 'annualIncome', 'purpose', 'postCode', 'regionCode', 'dti', 'delinquency_2years', 'ficoRangeLow', 'ficoRangeHigh', 'openAcc', 'pubRec', 'pubRecBankruptcies', 'revolBal', 'revolUtil', 'totalAcc', 'title', 'n0', 'n1', 'n2', 'n2.1', 'n4', 'n5', 'n6', 'n7', 'n8', 'n9', 'n10', 'n13', 'n14']
numerical_noserial_fea
['term', 'homeOwnership', 'verificationStatus', 'isDefault', 'initialListStatus', 'applicationType', 'policyCode', 'n11', 'n12']
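The same split can be written without an explicit loop. The following is a sketch on a small made-up frame; the function name `split_by_cardinality`, the `max_levels` parameter, and the toy column names are hypothetical. Note that `nunique()` ignores missing values by default, and that the threshold of 10 is only a heuristic: in the output above, columns such as 'id', 'postCode', and 'regionCode' land in the continuous list even though their names suggest identifiers or codes.

```python
import pandas as pd

# Toy frame (hypothetical columns) standing in for data_train.
toy = pd.DataFrame({
    "loan_amount": [1000.0 + 250 * i for i in range(40)],  # 40 distinct values
    "term": [3, 5] * 20,                                   # 2 distinct values
    "flag": [0, 1, 1, 0] * 10,                             # 2 distinct values
})

def split_by_cardinality(data, feas, max_levels=10):
    """Split features into (continuous, discrete) by number of distinct values."""
    counts = data[feas].nunique()  # NaN is not counted by default
    discrete = counts[counts <= max_levels].index.tolist()
    continuous = counts[counts > max_levels].index.tolist()
    return continuous, discrete

continuous, discrete = split_by_cardinality(toy, list(toy.columns))
print(continuous)
print(discrete)
```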
- Analysis of discrete (category-like) numerical variables
data_train['term'].value_counts()  # discrete variable
3    606902
5    193098
Name: term, dtype: int64
data_train['homeOwnership'].value_counts()  # discrete variable
0    395732
1    317660
2     86309
3       185
5        81
4        33
Name: homeOwnership, dtype: int64
data_train['verificationStatus'].value_counts()  # discrete variable
1    309810
2    248968
0    241222
Name: verificationStatus, dtype: int64
data_train['initialListStatus'].value_counts()  # discrete variable
0    466438
1    333562
Name: initialListStatus, dtype: int64
data_train['applicationType'].value_counts()  # discrete variable
0    784586
1     15414
Name: applicationType, dtype: int64
data_train['policyCode'].value_counts()  # discrete variable; useless, every row has the same value
1.0    800000
Name: policyCode, dtype: int64
data_train['n11'].value_counts()  # discrete variable; counts are extremely unbalanced, whether to use it needs further analysis
0.0    729682
1.0       540
2.0        24
4.0         1
3.0         1
Name: n11, dtype: int64
data_train['n12'].value_counts()  # discrete variable; counts are extremely unbalanced, whether to use it needs further analysis
0.0    757315
1.0      2281
2.0       115
3.0        16
4.0         3
Name: n12, dtype: int64
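Instead of reading each `value_counts()` output by eye, constant features (like 'policyCode') and heavily unbalanced ones (like 'n11' and 'n12') can be flagged by the share of rows taken by their most frequent value. The sketch below uses a made-up frame; the helper name `dominant_share`, the toy column names, and the 0.95 cut-off are assumptions for illustration.

```python
import pandas as pd

def dominant_share(data, feas):
    """Share of rows taken by the most frequent value of each feature."""
    shares = {
        # value_counts sorts descending, so the first entry is the dominant value
        fea: data[fea].value_counts(normalize=True, dropna=False).iloc[0]
        for fea in feas
    }
    return pd.Series(shares).sort_values(ascending=False)

# Toy frame (hypothetical columns) standing in for the discrete features above.
toy = pd.DataFrame({
    "policy_code": [1.0] * 100,          # constant
    "rare_flag": [0.0] * 99 + [1.0],     # one value dominates
    "term": [3] * 60 + [5] * 40,         # reasonably balanced
})

shares = dominant_share(toy, toy.columns)
constant = shares[shares == 1.0].index.tolist()
near_constant = shares[(shares >= 0.95) & (shares < 1.0)].index.tolist()
print(shares)
```

A constant feature carries no information and can be dropped. A near-constant one may still matter if the rare values are strongly tied to the target, so it deserves a second look rather than automatic removal.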
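- Analysis of continuous numerical variables
# Visualize the distribution of each continuous numerical feature
# Note: sns.distplot is deprecated since seaborn 0.11; sns.histplot (with kde=True) is the suggested replacement
f = pd.melt(data_train, value_vars=numerical_serial_fea)
g = sns.FacetGrid(f, col="variable", col_wrap=2, sharex=False, sharey=False)
g = g.map(sns.distplot, "value")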
[Figure: distribution plots of the continuous numerical features]
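- Look at the distribution of a single numerical variable and check whether it follows a normal distribution. If it does not, apply a log transform and check again whether the result looks normal.
- If you want to standardize a batch of features in one pass, first set aside the ones that have already been normalized.
- Why normalize: in some cases, making non-normal data closer to normal helps the model converge faster. Some models assume roughly Gaussian data (e.g. GMM), and distance-based models such as KNN are sensitive to heavily skewed feature scales. It is enough to keep the data from being too skewed, since heavy skew may affect the model's predictions.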
# Plot the loanAmnt distribution before and after a log transform
plt.figure(figsize=(16,12))
plt.suptitle('loanAmnt Values Distribution', fontsize=22)
# Left panel: raw values
plt.subplot(221)
sub_plot_1 = sns.distplot(data_train['loanAmnt'])
sub_plot_1.set_title("loanAmnt Distribution", fontsize=18)
sub_plot_1.set_xlabel("")
sub_plot_1.set_ylabel("Probability", fontsize=15)
# Right panel: log-transformed values
plt.subplot(222)
sub_plot_2 = sns.distplot(np.log(data_train['loanAmnt']))
sub_plot_2.set_title("loanAmnt (Log) Distribution", fontsize=18)
sub_plot_2.set_xlabel("")
sub_plot_2.set_ylabel("Probability", fontsize=15)
Text(0, 0.5, 'Probability')
[Figure: loanAmnt distribution (left) and log-transformed loanAmnt distribution (right)]
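Besides comparing the two plots visually, the effect of the log transform can be quantified with the sample skewness. The sketch below uses synthetic, right-skewed data rather than the real 'loanAmnt' column; the variable names and the lognormal parameters are assumptions. `np.log1p` is used instead of `np.log` so that zero values would not produce `-inf`.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed amounts (hypothetical stand-in for a column like loanAmnt).
rng = np.random.default_rng(42)
amount = pd.Series(rng.lognormal(mean=9.5, sigma=0.8, size=10_000), name="amount")

# Skewness near 0 suggests a roughly symmetric distribution;
# a large positive value indicates a long right tail.
skew_raw = amount.skew()
skew_log = np.log1p(amount).skew()

print(f"skewness before log: {skew_raw:.2f}")
print(f"skewness after log:  {skew_log:.2f}")
```

A skewness close to zero does not prove normality, so this is best used alongside the plots rather than in place of them.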
- Analysis of non-numerical categorical variables
category_fea
['grade', 'subGrade', 'employmentLength', 'issueDate', 'earliesCreditLine']
data_train['grade'].value_counts()
B    233690
C    227118
A    139661
D    119453
E     55661
F     19053
G      5364
Name: grade, dtype: int64
data_train['subGrade'].value_counts()
C1    50763
B4    49516
B5    48965
B3    48600
C2    47068
C3    44751
C4    44272
B2    44227
B1    42382
C5    40264
A5    38045
A4    30928
D1    30538
D2    26528
A1    25909
D3    23410
A3    22655
A2    22124
D4    21139
D5    17838
E1    14064
E2    12746
E3    10925
E4     9273
E5     8653
F1     5925
F2     4340
F3     3577
F4     2859
F5     2352
G1     1759
G2     1231
G3      978
G4      751
G5      645
Name: subGrade, dtype: int64
data_train['employmentLength'].value_counts()
10+ years    262753
2 years       72358
< 1 year      64237
3 years       64152
1 year        52489
5 years       50102
4 years       47985
6 years       37254
8 years       36192
7 years       35407
9 years       30272
Name: employmentLength, dtype: int64
data_train['issueDate'].value_counts()
2016-03-01    29066
2015-10-01    25525
2015-07-01    24496
2015-12-01    23245
2014-10-01    21461
              ...
2007-08-01       23
2007-07-01       21
2008-09-01       19
2007-09-01        7
2007-06-01        1
Name: issueDate, Length: 139, dtype: int64
data_train['earliesCreditLine'].value_counts()
Aug-2001    5567
Sep-2003    5403
Aug-2002    5403
Oct-2001    5258
Aug-2000    5246
            ...
May-1960       1
Apr-1958       1
Feb-1960       1
Aug-1946       1
Mar-1958       1
Name: earliesCreditLine, Length: 720, dtype: int64
data_train['isDefault'].value_counts()
0    640390
1    159610
Name: isDefault, dtype: int64
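Summary:
- Above we used functions such as value_counts() to look at the distribution of feature values, but charts are the most convenient way to summarize the raw information.
- Numbers without figures lack intuition.
- The same dataset plotted at different scales can reveal different patterns. Python turns the data into charts, but it is up to you to make sure the conclusions are correct.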
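To illustrate the point about scale, the sketch below draws the 'grade' counts from the value_counts() output above twice: once on a linear y-axis and once on a logarithmic one. The counts are copied from that output; the variable names and the non-interactive `Agg` backend are choices made for this example.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Grade counts copied from the value_counts() output above.
grade_counts = pd.Series(
    {"A": 139661, "B": 233690, "C": 227118, "D": 119453,
     "E": 55661, "F": 19053, "G": 5364}
)

fig, (ax_linear, ax_log) = plt.subplots(1, 2, figsize=(12, 4))

# Linear scale: the rare grades F and G are hard to compare.
ax_linear.bar(grade_counts.index, grade_counts.values)
ax_linear.set_title("grade counts (linear y-axis)")

# Log scale: the rare grades become readable, but differences among
# the common grades look much smaller than they are.
ax_log.bar(grade_counts.index, grade_counts.values)
ax_log.set_yscale("log")
ax_log.set_title("grade counts (log y-axis)")

fig.tight_layout()
```

In a notebook, dropping the `matplotlib.use("Agg")` line and calling `plt.show()` displays both panels.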