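I. Lung Cancer Risk Prediction
1. Background
An effective cancer prediction system lets people estimate their cancer risk at low cost and make appropriate decisions based on that risk. The data were collected from an online lung cancer prediction website.
2. Data description
Total number of fields: 16
Number of instances: 284 (as stated in the dataset description; the CSV loaded in this notebook contains 309 rows)
Field information: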
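1. Gender: M (male), F (female)
2. Age: age of the patient
3. Smoking: YES=2, NO=1
4. Yellow fingers: YES=2, NO=1
5. Anxiety: YES=2, NO=1
6. Peer pressure: YES=2, NO=1
7. Chronic disease: YES=2, NO=1
8. Fatigue: YES=2, NO=1
9. Allergy: YES=2, NO=1
10. Wheezing: YES=2, NO=1
11. Alcohol consumption: YES=2, NO=1
12. Coughing: YES=2, NO=1
13. Shortness of breath: YES=2, NO=1
14. Swallowing difficulty: YES=2, NO=1
15. Chest pain: YES=2, NO=1
16. Lung cancer: YES, NO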
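The coding above (YES=2 / NO=1 for the symptom fields, M/F for gender, YES/NO for the target) can optionally be normalized to 0/1 indicators. The following is a minimal sketch on one hand-written record; the helper name `normalize_record` and the sample record are illustrative assumptions, not part of the dataset or of this notebook's pipeline.

```python
# Hypothetical helper: convert one raw survey record to 0/1 indicators.
BINARY_CODE = {2: 1, 1: 0}          # YES=2 -> 1, NO=1 -> 0
GENDER_CODE = {"M": 1, "F": 0}
TARGET_CODE = {"YES": 1, "NO": 0}

def normalize_record(record):
    """Return a copy of `record` with every coded field mapped to 0/1; AGE is left unchanged."""
    out = {}
    for field, value in record.items():
        if field == "AGE":
            out[field] = value
        elif field == "GENDER":
            out[field] = GENDER_CODE[value]
        elif field == "LUNG_CANCER":
            out[field] = TARGET_CODE[value]
        else:
            out[field] = BINARY_CODE[value]
    return out

# Illustrative record with only a few of the 16 fields.
sample = {"GENDER": "M", "AGE": 69, "SMOKING": 1, "YELLOW_FINGERS": 2, "LUNG_CANCER": "YES"}
normalized = normalize_record(sample)
print(normalized)
```

Using explicit lookup tables means an unexpected code (for example a 0 or a 3) raises a `KeyError` instead of passing through silently.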
3. Data source
II. Data Processing
1. Reading the data
import pandas as pd

df = pd.read_csv("data/data209803/survey_lung_cancer.csv", index_col=None)
df.head()
|   | GENDER | AGE | SMOKING | YELLOW_FINGERS | ANXIETY | PEER_PRESSURE | CHRONIC DISEASE | FATIGUE | ALLERGY | WHEEZING | ALCOHOL CONSUMING | COUGHING | SHORTNESS OF BREATH | SWALLOWING DIFFICULTY | CHEST PAIN | LUNG_CANCER |
| 0 | M | 69 | 1 | 2 | 2 | 1 | 1 | 2 | 1 | 2 | 2 | 2 | 2 | 2 | 2 | YES |
| 1 | M | 74 | 2 | 1 | 1 | 1 | 2 | 2 | 2 | 1 | 1 | 1 | 2 | 2 | 2 | YES |
| 2 | F | 59 | 1 | 1 | 1 | 2 | 1 | 2 | 1 | 2 | 1 | 2 | 2 | 1 | 2 | NO |
| 3 | M | 63 | 2 | 2 | 2 | 1 | 1 | 1 | 1 | 1 | 2 | 1 | 1 | 2 | 2 | NO |
| 4 | F | 63 | 1 | 2 | 1 | 1 | 1 | 1 | 1 | 2 | 1 | 2 | 2 | 1 | 1 | NO |
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 309 entries, 0 to 308
Data columns (total 16 columns):
 #   Column                 Non-Null Count  Dtype
---  ------                 --------------  -----
 0   GENDER                 309 non-null    object
 1   AGE                    309 non-null    int64
 2   SMOKING                309 non-null    int64
 3   YELLOW_FINGERS         309 non-null    int64
 4   ANXIETY                309 non-null    int64
 5   PEER_PRESSURE          309 non-null    int64
 6   CHRONIC DISEASE        309 non-null    int64
 7   FATIGUE                309 non-null    int64
 8   ALLERGY                309 non-null    int64
 9   WHEEZING               309 non-null    int64
 10  ALCOHOL CONSUMING      309 non-null    int64
 11  COUGHING               309 non-null    int64
 12  SHORTNESS OF BREATH    309 non-null    int64
 13  SWALLOWING DIFFICULTY  309 non-null    int64
 14  CHEST PAIN             309 non-null    int64
 15  LUNG_CANCER            309 non-null    object
dtypes: int64(14), object(2)
memory usage: 38.8+ KB
df.isnull().sum()
GENDER                   0
AGE                      0
SMOKING                  0
YELLOW_FINGERS           0
ANXIETY                  0
PEER_PRESSURE            0
CHRONIC DISEASE          0
FATIGUE                  0
ALLERGY                  0
WHEEZING                 0
ALCOHOL CONSUMING        0
COUGHING                 0
SHORTNESS OF BREATH      0
SWALLOWING DIFFICULTY    0
CHEST PAIN               0
LUNG_CANCER              0
dtype: int64
No column contains null values.
2. Encoding the data
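# Map the two text columns to integers: M / YES -> 1, F / NO -> 0.
# Assigning the result back avoids relying on an in-place replace through attribute access.
df["GENDER"] = df["GENDER"].map({"M": 1, "F": 0})
df["LUNG_CANCER"] = df["LUNG_CANCER"].map({"YES": 1, "NO": 0})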
import matplotlib.pyplot as plt
%matplotlib inline
3. Inspecting the data distribution
# One bar chart of value counts per column, laid out on a 4x4 grid.
figure, axes = plt.subplots(nrows=4, ncols=4, figsize=(20, 16))
for i, column in enumerate(df.columns):
    row, col = divmod(i, 4)
    df[column].value_counts().plot(ax=axes[row][col], kind="bar", title=f"{column} value counts")
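The plots show that the dataset contains far more lung cancer cases than non-cases, while the other fields are fairly balanced.
4. Smoking and lung cancer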
smoke_yes = df.loc[df.SMOKING == 2, "LUNG_CANCER"]
smoke_no = df.loc[df.SMOKING == 1, "LUNG_CANCER"]

# value_counts() sorts by frequency, so pin the order to [1, 0]; this keeps the
# "YES"/"NO" labels matched to the data whichever class is more frequent.
share_yes = smoke_yes.value_counts(normalize=True).reindex([1, 0], fill_value=0)
share_no = smoke_no.value_counts(normalize=True).reindex([1, 0], fill_value=0)

fig, (ax1, ax2) = plt.subplots(nrows=1, ncols=2, figsize=(16, 8))
ax1.pie(share_yes, labels=["YES", "NO"], colors=["yellow", "green"], autopct="%1.1f%%", shadow=True)
ax1.set_title("Lung Cancer & Smoking_YES")
ax2.pie(share_no, labels=["YES", "NO"], colors=["red", "green"], autopct="%1.1f%%", shadow=True)
ax2.set_title("Lung Cancer & Smoking_NO")
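The pie charts compare lung cancer rates among smokers and non-smokers; the same comparison can be tabulated numerically. The sketch below uses a handful of made-up rows (not the survey data) and plain Python, and the function name `cancer_rate_by_group` is hypothetical. On the real DataFrame, `pd.crosstab(df.SMOKING, df.LUNG_CANCER, normalize="index")` should produce the equivalent table.

```python
from collections import Counter

# Made-up (SMOKING, LUNG_CANCER) pairs for illustration only.
# SMOKING: 2 = yes, 1 = no; LUNG_CANCER: 1 = yes, 0 = no.
rows = [(2, 1), (2, 1), (2, 0), (2, 1), (1, 1), (1, 0), (1, 0), (1, 1)]

def cancer_rate_by_group(pairs):
    """Return {group value: share of records whose target equals 1}."""
    totals = Counter(group for group, _ in pairs)
    positives = Counter(group for group, target in pairs if target == 1)
    return {group: positives[group] / totals[group] for group in totals}

rates = cancer_rate_by_group(rows)
for group, rate in sorted(rates.items()):
    print(f"SMOKING={group}: {rate:.1%} with lung cancer")
```

Comparing the two rates side by side is easier than comparing slice sizes across two separate pies.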
5. Allergy, coughing, alcohol consumption, swallowing difficulty, wheezing, and chest pain versus lung cancer
import seaborn as sns

# Count plots of LUNG_CANCER split by each symptom. Passing data/x/hue as keywords
# avoids positional-argument errors in recent seaborn releases.
# Note: "ALLERGY " keeps the trailing space used in the original column lookup.
fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(30, 8))
sns.countplot(data=df, x="LUNG_CANCER", hue="ALLERGY ", ax=ax1, palette=["green", "black"])
sns.countplot(data=df, x="LUNG_CANCER", hue="COUGHING", ax=ax2, palette=["green", "black"])
sns.countplot(data=df, x="LUNG_CANCER", hue="ALCOHOL CONSUMING", ax=ax3, palette=["green", "black"])

fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(30, 8))
sns.countplot(data=df, x="LUNG_CANCER", hue="SWALLOWING DIFFICULTY", ax=ax1, palette=["green", "black"])
sns.countplot(data=df, x="LUNG_CANCER", hue="WHEEZING", ax=ax2, palette=["green", "black"])
sns.countplot(data=df, x="LUNG_CANCER", hue="CHEST PAIN", ax=ax3, palette=["green", "black"])
6. Plotting a correlation heatmap
import seaborn as sns

plt.figure(figsize=(16, 10))
sns.heatmap(df.corr(), annot=True, cmap='viridis', vmin=0, vmax=1)
The heatmap shows that gender, age, and smoking status are only weakly correlated with lung cancer in this dataset.
7. Building X and y
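Reading individual cells off a 16x16 heatmap is error-prone, so it can help to rank features by the absolute value of their Pearson correlation with the target. The sketch below uses toy columns (not the survey data) and a hand-written `pearson` function; the names `FEATURE_A` and `FEATURE_B` are placeholders. On the real DataFrame, `df.corr()["LUNG_CANCER"].drop("LUNG_CANCER").abs().sort_values()` should give a comparable ranking.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Toy columns coded 1/2 like the survey fields; `target` plays the role of LUNG_CANCER.
columns = {
    "FEATURE_A": [1, 2, 1, 2, 2, 1],   # follows the target exactly
    "FEATURE_B": [1, 2, 2, 1, 2, 1],   # agrees with the target on 4 of 6 rows
}
target = [0, 1, 0, 1, 1, 0]

# Weakest correlation first, strongest last.
ranking = sorted(
    ((name, pearson(values, target)) for name, values in columns.items()),
    key=lambda item: abs(item[1]),
)
for name, r in ranking:
    print(f"{name}: r = {r:+.2f}")
```

Sorting by absolute value keeps strongly negative correlations from being mistaken for weak ones.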
# Build X (features) and y (target)
X = df.drop(columns=["LUNG_CANCER"])
y = df["LUNG_CANCER"]
y.value_counts()
1    270
0     39
Name: LUNG_CANCER, dtype: int64
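The counts above describe a heavily imbalanced target. A quick calculation, using only the 270 and 39 from that output, makes the degree of imbalance explicit:

```python
# Class counts taken from the y.value_counts() output: 1 = lung cancer, 0 = no lung cancer.
counts = {1: 270, 0: 39}

total = sum(counts.values())
majority = max(counts.values())
minority = min(counts.values())

majority_share = majority / total        # fraction of rows in the larger class
imbalance_ratio = majority / minority    # majority rows per minority row

print(f"total rows: {total}")
print(f"majority share: {majority_share:.1%}")
print(f"imbalance ratio: {imbalance_ratio:.2f} : 1")
```

A classifier that always predicts the majority class would already reach roughly 87% accuracy on this data, which is why accuracy alone is a weak metric here.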
sns.countplot(x=y)
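8. Balancing the classes
After installing the packages, restart the kernel for the change to take effect; otherwise the import raises an error. The installation steps are: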
from IPython.display import clear_output

!pip install imblearn --user
!pip uninstall scipy -y
!pip install scipy --user
clear_output()
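After restarting the kernel, it can be useful to confirm that the packages are visible to the interpreter before importing them. This is a small sketch using the standard library's `importlib.util.find_spec`; the helper name `is_installed` is hypothetical.

```python
import importlib.util

def is_installed(module_name):
    """Return True if the current interpreter can locate `module_name`."""
    return importlib.util.find_spec(module_name) is not None

# Report on the two packages touched by the installation step.
for name in ("imblearn", "scipy"):
    status = "available" if is_installed(name) else "missing (install it, then restart the kernel)"
    print(f"{name}: {status}")
```

`find_spec` only locates a top-level module without importing it, so the check is cheap and has no side effects.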
from imblearn.over_sampling import SMOTE
help(SMOTE)
Help on class SMOTE in module imblearn.over_sampling._smote.base:

class SMOTE(BaseSMOTE)
 |  SMOTE(*, sampling_strategy='auto', random_state=None, k_neighbors=5, n_jobs=None)
 |
 |  Class to perform over-sampling using SMOTE.
 |
 |  This object is an implementation of SMOTE - Synthetic Minority
 |  Over-sampling Technique as presented in [1]_.
 |
 |  Read more in the :ref:`User Guide <smote_adasyn>`.
 |
 |  Parameters
 |  ----------
 |  sampling_strategy : float, str, dict or callable, default='auto'
 |      Sampling information to resample the data set.
 |
 |      - When ``float``, it corresponds to the desired ratio of the number of
 |        samples in the minority class over the number of samples in the
 |        majority class after resampling. Therefore, the ratio is expressed as
 |        :math:`\alpha_{os} = N_{rm} / N_{M}` where :math:`N_{rm}` is the
 |        number of samples in the minority class after resampling and
 |        :math:`N_{M}` is the number of samples in the majority class.
 |
 |          .. warning::
 |             ``float`` is only available for **binary** classification. An
 |             error is raised for multi-class classification.
 |
 |      - When ``str``, specify the class targeted by the resampling. The
 |        number of samples in the different classes will be equalized.
 |        Possible choices are:
 |
 |          ``'minority'``: resample only the minority class;
 |
 |          ``'not minority'``: resample all classes but the minority class;
 |
 |          ``'not majority'``: resample all classes but the majority class;
 |
 |          ``'all'``: resample all classes;
 |
 |          ``'auto'``: equivalent to ``'not majority'``.
 |
 |      - When ``dict``, the keys correspond to the targeted classes. The
 |        values correspond to the desired number of samples for each targeted
 |        class.
 |
 |      - When callable, function taking ``y`` and returns a ``dict``. The keys
 |        correspond to the targeted classes. The values correspond to the
 |        desired number of samples for each class.
 |
 |  random_state : int, RandomState instance, default=None
 |      Control the randomization of the algorithm.
 |
 |      - If int, ``random_state`` is the seed used by the random number
 |        generator;
 |      - If ``RandomState`` instance, random_state is the random number
 |        generator;
 |      - If ``None``, the random number generator is the ``RandomState``
 |        instance used by ``np.random``.
 |
 |  k_neighbors : int or object, default=5
 |      The nearest neighbors used to define the neighborhood of samples to use
 |      to generate the synthetic samples. You can pass:
 |
 |      - an `int` corresponding to the number of neighbors to use. A
 |        `~sklearn.neighbors.NearestNeighbors` instance will be fitted in this
 |        case.
 |      - an instance of a compatible nearest neighbors algorithm that should
 |        implement both methods `kneighbors` and `kneighbors_graph`. For
 |        instance, it could correspond to a
 |        :class:`~sklearn.neighbors.NearestNeighbors` but could be extended to
 |        any compatible class.
 |
 |  n_jobs : int, default=None
 |      Number of CPU cores used during the cross-validation loop.
 |      ``None`` means 1 unless in a :obj:`joblib.parallel_backend` context.
 |      ``-1`` means using all processors. See
 |      `Glossary <https://scikit-learn.org/stable/glossary.html#term-n-jobs>`_
 |      for more details.
 |
 |      .. deprecated:: 0.10
 |         `n_jobs` has been deprecated in 0.10 and will be removed in 0.12.
 |         It was previously used to set `n_jobs` of nearest neighbors
 |         algorithm. From now on, you can pass an estimator where `n_jobs` is
 |         already set instead.
 |
[remaining help output truncated: attributes, references, examples, and inherited methods]
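sampling_strategy accepts the following string values:
- 'minority': resample only the minority class
- 'not minority': resample all classes except the minority class
- 'not majority': resample all classes except the majority class
- 'all': resample all classes
- 'auto': equivalent to 'not majority'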
from imblearn.over_sampling import SMOTE

# Oversample only the minority class until both classes have the same size.
smote = SMOTE(sampling_strategy='minority')
X, y = smote.fit_resample(X, y)
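SMOTE, as used above, creates synthetic minority samples by interpolating between a minority point and one of its nearest minority-class neighbors. The following is a simplified pure-Python sketch of that idea under stated assumptions: it is not imblearn's implementation, the function name `smote_like_oversample` and the toy 2-D points are hypothetical, and `k=2` is chosen only because the toy minority class has three points (imblearn defaults to `k_neighbors=5`).

```python
import math
import random

def smote_like_oversample(minority, n_new, k=2, seed=0):
    """Generate `n_new` synthetic points, each on the segment between a randomly
    chosen minority point and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest neighbours of `base` among the other minority points
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour, in [0, 1)
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic

# Toy data (illustrative only): 6 majority points and 3 minority points.
majority = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (0.3, 0.3), (0.1, 0.4), (0.4, 0.2)]
minority = [(2.0, 2.0), (2.2, 2.1), (2.1, 2.4)]

# Add enough synthetic minority points to match the majority count.
new_points = smote_like_oversample(minority, n_new=len(majority) - len(minority))
balanced_minority = minority + new_points
print(len(majority), len(balanced_minority))
```

Because each synthetic point lies between two existing minority points, the new samples stay inside the region the minority class already occupies. Interpolated values are generally fractional, which is worth keeping in mind for fields coded as 1/2.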
sns.countplot(x=y)