AI-Based Lung Cancer Risk Prediction and Analysis (Part 1)

Overview: AI-based lung cancer risk prediction and analysis

I. Lung Cancer Risk Prediction




1. Background


An effective cancer prediction system helps people understand their cancer risk at low cost and make appropriate decisions based on their individual risk profile. The data were collected from an online lung cancer prediction website.


2. Data Description


Number of fields: 16

Number of instances: 284 (note: the CSV loaded below actually contains 309 rows; see the df.info() output)

Field information:

1. GENDER: M (male), F (female)

2. AGE: the patient's age

3. SMOKING: YES=2, NO=1

4. YELLOW_FINGERS: YES=2, NO=1

5. ANXIETY: YES=2, NO=1

6. PEER_PRESSURE: YES=2, NO=1

7. CHRONIC DISEASE: YES=2, NO=1

8. FATIGUE: YES=2, NO=1

9. ALLERGY: YES=2, NO=1

10. WHEEZING: YES=2, NO=1

11. ALCOHOL CONSUMING: YES=2, NO=1

12. COUGHING: YES=2, NO=1

13. SHORTNESS OF BREATH: YES=2, NO=1

14. SWALLOWING DIFFICULTY: YES=2, NO=1

15. CHEST PAIN: YES=2, NO=1

16. LUNG_CANCER: YES, NO


3. Data Source


www.kaggle.com/datasets/na…


II. Data Processing


1. Reading the Data


import pandas as pd
df=pd.read_csv("data/data209803/survey_lung_cancer.csv", index_col=None)
df.head()


GENDER AGE SMOKING YELLOW_FINGERS ANXIETY PEER_PRESSURE CHRONIC DISEASE FATIGUE ALLERGY WHEEZING ALCOHOL CONSUMING COUGHING SHORTNESS OF BREATH SWALLOWING DIFFICULTY CHEST PAIN LUNG_CANCER
0 M 69 1 2 2 1 1 2 1 2 2 2 2 2 2 YES
1 M 74 2 1 1 1 2 2 2 1 1 1 2 2 2 YES
2 F 59 1 1 1 2 1 2 1 2 1 2 2 1 2 NO
3 M 63 2 2 2 1 1 1 1 1 2 1 1 2 2 NO
4 F 63 1 2 1 1 1 1 1 2 1 2 2 1 1 NO


df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 309 entries, 0 to 308
Data columns (total 16 columns):
 #   Column                 Non-Null Count  Dtype 
---  ------                 --------------  ----- 
 0   GENDER                 309 non-null    object
 1   AGE                    309 non-null    int64 
 2   SMOKING                309 non-null    int64 
 3   YELLOW_FINGERS         309 non-null    int64 
 4   ANXIETY                309 non-null    int64 
 5   PEER_PRESSURE          309 non-null    int64 
 6   CHRONIC DISEASE        309 non-null    int64 
 7   FATIGUE                309 non-null    int64 
 8   ALLERGY                309 non-null    int64 
 9   WHEEZING               309 non-null    int64 
 10  ALCOHOL CONSUMING      309 non-null    int64 
 11  COUGHING               309 non-null    int64 
 12  SHORTNESS OF BREATH    309 non-null    int64 
 13  SWALLOWING DIFFICULTY  309 non-null    int64 
 14  CHEST PAIN             309 non-null    int64 
 15  LUNG_CANCER            309 non-null    object
dtypes: int64(14), object(2)
memory usage: 38.8+ KB


df.isnull().sum()


GENDER                   0
AGE                      0
SMOKING                  0
YELLOW_FINGERS           0
ANXIETY                  0
PEER_PRESSURE            0
CHRONIC DISEASE          0
FATIGUE                  0
ALLERGY                  0
WHEEZING                 0
ALCOHOL CONSUMING        0
COUGHING                 0
SHORTNESS OF BREATH      0
SWALLOWING DIFFICULTY    0
CHEST PAIN               0
LUNG_CANCER              0
dtype: int64

As shown above, there are no missing values.


2. Encoding Categorical Features


# Encode the two string columns as integers: GENDER M/F -> 1/0, LUNG_CANCER YES/NO -> 1/0
df.GENDER.replace({"M":1,"F":0},inplace=True)
df.LUNG_CANCER.replace({"YES":1,"NO":0},inplace=True)
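On recent pandas versions, calling replace(..., inplace=True) on a column accessed this way can emit a chained-assignment warning. An equivalent assignment-style version (a minimal sketch with the same mappings) is:

# Equivalent encoding without in-place replacement on a column view
df["GENDER"] = df["GENDER"].map({"M": 1, "F": 0})
df["LUNG_CANCER"] = df["LUNG_CANCER"].map({"YES": 1, "NO": 0})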


import matplotlib.pyplot as plt
%matplotlib inline


3. Inspecting the Data Distributions


# Plot the value counts of every column as a bar chart in a 4x4 grid
figure,axes=plt.subplots(nrows=4,ncols=4,figsize=(20,16)) 
i=0
for column in df.columns:
    x=int(i/4)   # row position in the grid
    y=i%4        # column position in the grid
    df[column].value_counts().plot(ax=axes[x][y], kind='bar',title=f"{column} distribution")
    i=i+1


[Figure: 4x4 grid of bar charts showing the value counts of each column]

As the bar charts show, the positive lung cancer class heavily outnumbers the negative one, while the remaining variables are fairly balanced.


4. Smoking vs. Lung Cancer


# Split the rows by smoking status (SMOKING: YES=2, NO=1)
smoke_yes=df.loc[df.SMOKING==2,["SMOKING","LUNG_CANCER"]]
smoke_no=df.loc[df.SMOKING==1,["SMOKING","LUNG_CANCER"]]
fig, (ax1, ax2) = plt.subplots(nrows=1, ncols=2,figsize=(16,8))
# reindex([1,0]) pins the slice order so the labels ["YES","NO"] always match the data
ax1.pie(smoke_yes.LUNG_CANCER.value_counts(normalize=True).reindex([1,0]),labels=["YES","NO"],colors=["yellow","green"],autopct='%1.1f%%',shadow=True)
ax1.set_title("Lung Cancer & Smoking_YES")
ax2.pie(smoke_no.LUNG_CANCER.value_counts(normalize=True).reindex([1,0]),labels=["YES","NO"],colors=["red","green"],autopct='%1.1f%%',shadow=True)
ax2.set_title("Lung Cancer & Smoking_NO")


Text(0.5,1,'Lung Cancer & Smoking_NO')

[Figure: pie charts of the lung cancer proportion among smokers (left) and non-smokers (right)]
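The same comparison can also be read numerically with a cross-tabulation; a minimal sketch using only the columns already in df:

# Row-normalized cross-tab: share of lung cancer cases among non-smokers (1) and smokers (2)
pd.crosstab(df.SMOKING, df.LUNG_CANCER, normalize="index")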


5. Allergy, Coughing, Alcohol, Swallowing Difficulty, Wheezing, and Chest Pain vs. Lung Cancer


import seaborn as sns
# First row: ALLERGY, COUGHING and ALCOHOL CONSUMING vs. LUNG_CANCER
# (the "ALLERGY " column name keeps a trailing space, as in the raw CSV header)
fig,(ax1,ax2,ax3)=plt.subplots(1,3,figsize=(30,8))
sns.countplot(df.LUNG_CANCER,hue=df["ALLERGY "],ax=ax1,palette=['green', 'black'])
sns.countplot(df.LUNG_CANCER,hue=df.COUGHING,ax=ax2,palette=['green', 'black'])
sns.countplot(df.LUNG_CANCER,hue=df["ALCOHOL CONSUMING"],ax=ax3,palette=['green', 'black'])
# Second row: SWALLOWING DIFFICULTY, WHEEZING and CHEST PAIN vs. LUNG_CANCER
fig,(ax1,ax2,ax3)=plt.subplots(1,3,figsize=(30,8))
sns.countplot(df.LUNG_CANCER,hue=df["SWALLOWING DIFFICULTY"],ax=ax1,palette=['green', 'black'])
sns.countplot(df.LUNG_CANCER,hue=df.WHEEZING,ax=ax2,palette=['green', 'black'])
sns.countplot(df.LUNG_CANCER,hue=df["CHEST PAIN"],ax=ax3,palette=['green', 'black'])


<matplotlib.axes._subplots.AxesSubplot at 0x7fba81b66350>

[Figure: count plots of LUNG_CANCER split by ALLERGY, COUGHING and ALCOHOL CONSUMING]

[Figure: count plots of LUNG_CANCER split by SWALLOWING DIFFICULTY, WHEEZING and CHEST PAIN]
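To put numbers on these plots, the lung cancer rate within each symptom group can be computed directly; a minimal sketch (since LUNG_CANCER is already 0/1, a group mean is the positive rate):

# Lung cancer rate for each level (1 = NO, 2 = YES) of a few symptom columns
for col in ["ALLERGY ", "ALCOHOL CONSUMING", "SWALLOWING DIFFICULTY", "CHEST PAIN"]:
    print(df.groupby(col)["LUNG_CANCER"].mean(), "\n")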


6. Correlation Heatmap


import seaborn as sns
plt.figure(figsize=(16,10))
# Pearson correlations of all (now numeric) columns; vmin=0 clips any negative values to the bottom of the colour scale
sns.heatmap(df.corr(),annot=True,cmap='viridis',vmin=0, vmax=1)


<matplotlib.axes._subplots.AxesSubplot at 0x7fba83b48d90>

[Figure: correlation heatmap of all features and LUNG_CANCER]

The heatmap suggests that GENDER, AGE and SMOKING are only weakly correlated with LUNG_CANCER.
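For a more direct reading than the heatmap, the correlations with the target can be listed and sorted; a minimal sketch on the existing df:

# Correlation of each feature with the target, strongest first
df.corr()["LUNG_CANCER"].drop("LUNG_CANCER").sort_values(ascending=False)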


7. Building X and y


# Build the feature matrix X and the target vector y
X=df.drop(columns=["LUNG_CANCER"],axis=1)
y=df["LUNG_CANCER"]


y.value_counts()


1    270
0     39
Name: LUNG_CANCER, dtype: int64


sns.countplot(y)


<matplotlib.axes._subplots.AxesSubplot at 0x7fba81a56590>

[Figure: count plot of y showing the class imbalance (270 positive vs. 39 negative)]


8. Balancing the Data


After installing the packages below, restart the kernel for them to take effect; otherwise the subsequent import raises an error.



from IPython.display import clear_output
# Install imbalanced-learn and refresh scipy, then clear the noisy pip output
!pip install imblearn --user
!pip uninstall scipy -y
!pip install scipy --user
clear_output()


from imblearn.over_sampling import SMOTE


help(SMOTE)


Help on class SMOTE in module imblearn.over_sampling._smote.base:
class SMOTE(BaseSMOTE)
 |  SMOTE(*, sampling_strategy='auto', random_state=None, k_neighbors=5, n_jobs=None)
 |  
 |  Class to perform over-sampling using SMOTE.
 |  
 |  This object is an implementation of SMOTE - Synthetic Minority
 |  Over-sampling Technique as presented in [1]_.
 |  
 |  Read more in the :ref:`User Guide <smote_adasyn>`.
 |  
 |  Parameters
 |  ----------
 |  sampling_strategy : float, str, dict or callable, default='auto'
 |      Sampling information to resample the data set.
 |  
 |      - When ``float``, it corresponds to the desired ratio of the number of
 |        samples in the minority class over the number of samples in the
 |        majority class after resampling. Therefore, the ratio is expressed as
 |        :math:`\alpha_{os} = N_{rm} / N_{M}` where :math:`N_{rm}` is the
 |        number of samples in the minority class after resampling and
 |        :math:`N_{M}` is the number of samples in the majority class.
 |  
 |          .. warning::
 |             ``float`` is only available for **binary** classification. An
 |             error is raised for multi-class classification.
 |  
 |      - When ``str``, specify the class targeted by the resampling. The
 |        number of samples in the different classes will be equalized.
 |        Possible choices are:
 |  
 |          ``'minority'``: resample only the minority class;
 |  
 |          ``'not minority'``: resample all classes but the minority class;
 |  
 |          ``'not majority'``: resample all classes but the majority class;
 |  
 |          ``'all'``: resample all classes;
 |  
 |          ``'auto'``: equivalent to ``'not majority'``.
 |  
 |      - When ``dict``, the keys correspond to the targeted classes. The
 |        values correspond to the desired number of samples for each targeted
 |        class.
 |  
 |      - When callable, function taking ``y`` and returns a ``dict``. The keys
 |        correspond to the targeted classes. The values correspond to the
 |        desired number of samples for each class.
 |  
 |  random_state : int, RandomState instance, default=None
 |      Control the randomization of the algorithm.
 |  
 |      - If int, ``random_state`` is the seed used by the random number
 |        generator;
 |      - If ``RandomState`` instance, random_state is the random number
 |        generator;
 |      - If ``None``, the random number generator is the ``RandomState``
 |        instance used by ``np.random``.
 |  
 |  k_neighbors : int or object, default=5
 |      The nearest neighbors used to define the neighborhood of samples to use
 |      to generate the synthetic samples. You can pass:
 |  
 |      - an `int` corresponding to the number of neighbors to use. A
 |        `~sklearn.neighbors.NearestNeighbors` instance will be fitted in this
 |        case.
 |      - an instance of a compatible nearest neighbors algorithm that should
 |        implement both methods `kneighbors` and `kneighbors_graph`. For
 |        instance, it could correspond to a
 |        :class:`~sklearn.neighbors.NearestNeighbors` but could be extended to
 |        any compatible class.
 |  
 |  n_jobs : int, default=None
 |      Number of CPU cores used during the cross-validation loop.
 |      ``None`` means 1 unless in a :obj:`joblib.parallel_backend` context.
 |      ``-1`` means using all processors. See
 |      `Glossary <https://scikit-learn.org/stable/glossary.html#term-n-jobs>`_
 |      for more details.
 |  
 |      .. deprecated:: 0.10
 |         `n_jobs` has been deprecated in 0.10 and will be removed in 0.12.
 |         It was previously used to set `n_jobs` of nearest neighbors
 |         algorithm. From now on, you can pass an estimator where `n_jobs` is
 |         already set instead.
 |  
 |  Attributes
 |  ----------
 |  sampling_strategy_ : dict
 |      Dictionary containing the information to sample the dataset. The keys
 |      corresponds to the class labels from which to sample and the values
 |      are the number of samples to sample.
 |  
 |  nn_k_ : estimator object
 |      Validated k-nearest neighbours created from the `k_neighbors` parameter.
 |  
 |  n_features_in_ : int
 |      Number of features in the input dataset.
 |  
 |      .. versionadded:: 0.9
 |  
 |  feature_names_in_ : ndarray of shape (`n_features_in_`,)
 |      Names of features seen during `fit`. Defined only when `X` has feature
 |      names that are all strings.
 |  
 |      .. versionadded:: 0.10
 |  
 |  See Also
 |  --------
 |  SMOTENC : Over-sample using SMOTE for continuous and categorical features.
 |  
 |  SMOTEN : Over-sample using the SMOTE variant specifically for categorical
 |      features only.
 |  
 |  BorderlineSMOTE : Over-sample using the borderline-SMOTE variant.
 |  
 |  SVMSMOTE : Over-sample using the SVM-SMOTE variant.
 |  
 |  ADASYN : Over-sample using ADASYN.
 |  
 |  KMeansSMOTE : Over-sample applying a clustering before to oversample using
 |      SMOTE.
 |  
 |  Notes
 |  -----
 |  See the original papers: [1]_ for more details.
 |  
 |  Supports multi-class resampling. A one-vs.-rest scheme is used as
 |  originally proposed in [1]_.
 |  
 |  References
 |  ----------
 |  .. [1] N. V. Chawla, K. W. Bowyer, L. O.Hall, W. P. Kegelmeyer, "SMOTE:
 |     synthetic minority over-sampling technique," Journal of artificial
 |     intelligence research, 321-357, 2002.
 |  
 |  Examples
 |  --------
 |  >>> from collections import Counter
 |  >>> from sklearn.datasets import make_classification
 |  >>> from imblearn.over_sampling import SMOTE
 |  >>> X, y = make_classification(n_classes=2, class_sep=2,
 |  ... weights=[0.1, 0.9], n_informative=3, n_redundant=1, flip_y=0,
 |  ... n_features=20, n_clusters_per_class=1, n_samples=1000, random_state=10)
 |  >>> print('Original dataset shape %s' % Counter(y))
 |  Original dataset shape Counter({1: 900, 0: 100})
 |  >>> sm = SMOTE(random_state=42)
 |  >>> X_res, y_res = sm.fit_resample(X, y)
 |  >>> print('Resampled dataset shape %s' % Counter(y_res))
 |  Resampled dataset shape Counter({0: 900, 1: 900})
 |  
 |  Method resolution order:
 |      SMOTE
 |      BaseSMOTE
 |      imblearn.over_sampling.base.BaseOverSampler
 |      imblearn.base.BaseSampler
 |      imblearn.base.SamplerMixin
 |      sklearn.base.BaseEstimator
 |      sklearn.base._OneToOneFeatureMixin
 |      imblearn.base._ParamsValidationMixin
 |      builtins.object
 |  
 |  Methods defined here:
 |  
 |  __init__(self, *, sampling_strategy='auto', random_state=None, k_neighbors=5, n_jobs=None)
 |      Initialize self.  See help(type(self)) for accurate signature.
 |  
 |  ----------------------------------------------------------------------
 |  Data and other attributes defined here:
 |  
 |  __abstractmethods__ = frozenset()
 |  
 |  ----------------------------------------------------------------------
 |  Data and other attributes inherited from BaseSMOTE:
 |  
 |  __annotations__ = {'_parameter_constraints': <class 'dict'>}
 |  
 |  ----------------------------------------------------------------------
 |  Methods inherited from imblearn.base.BaseSampler:
 |  
 |  fit(self, X, y)
 |      Check inputs and statistics of the sampler.
 |      
 |      You should use ``fit_resample`` in all cases.
 |      
 |      Parameters
 |      ----------
 |      X : {array-like, dataframe, sparse matrix} of shape                 (n_samples, n_features)
 |          Data array.
 |      
 |      y : array-like of shape (n_samples,)
 |          Target array.
 |      
 |      Returns
 |      -------
 |      self : object
 |          Return the instance itself.
 |  
 |  fit_resample(self, X, y)
 |      Resample the dataset.
 |      
 |      Parameters
 |      ----------
 |      X : {array-like, dataframe, sparse matrix} of shape                 (n_samples, n_features)
 |          Matrix containing the data which have to be sampled.
 |      
 |      y : array-like of shape (n_samples,)
 |          Corresponding label for each sample in X.
 |      
 |      Returns
 |      -------
 |      X_resampled : {array-like, dataframe, sparse matrix} of shape                 (n_samples_new, n_features)
 |          The array containing the resampled data.
 |      
 |      y_resampled : array-like of shape (n_samples_new,)
 |          The corresponding label of `X_resampled`.
 |  
 |  ----------------------------------------------------------------------
 |  Methods inherited from sklearn.base.BaseEstimator:
 |  
 |  __getstate__(self)
 |  
 |  __repr__(self, N_CHAR_MAX=700)
 |      Return repr(self).
 |  
 |  __setstate__(self, state)
 |  
 |  get_params(self, deep=True)
 |      Get parameters for this estimator.
 |      
 |      Parameters
 |      ----------
 |      deep : bool, default=True
 |          If True, will return the parameters for this estimator and
 |          contained subobjects that are estimators.
 |      
 |      Returns
 |      -------
 |      params : dict
 |          Parameter names mapped to their values.
 |  
 |  set_params(self, **params)
 |      Set the parameters of this estimator.
 |      
 |      The method works on simple estimators as well as on nested objects
 |      (such as :class:`~sklearn.pipeline.Pipeline`). The latter have
 |      parameters of the form ``<component>__<parameter>`` so that it's
 |      possible to update each component of a nested object.
 |      
 |      Parameters
 |      ----------
 |      **params : dict
 |          Estimator parameters.
 |      
 |      Returns
 |      -------
 |      self : estimator instance
 |          Estimator instance.
 |  
 |  ----------------------------------------------------------------------
 |  Data descriptors inherited from sklearn.base.BaseEstimator:
 |  
 |  __dict__
 |      dictionary for instance variables (if defined)
 |  
 |  __weakref__
 |      list of weak references to the object (if defined)
 |  
 |  ----------------------------------------------------------------------
 |  Methods inherited from sklearn.base._OneToOneFeatureMixin:
 |  
 |  get_feature_names_out(self, input_features=None)
 |      Get output feature names for transformation.
 |      
 |      Parameters
 |      ----------
 |      input_features : array-like of str or None, default=None
 |          Input features.
 |      
 |          - If `input_features` is `None`, then `feature_names_in_` is
 |            used as feature names in. If `feature_names_in_` is not defined,
 |            then names are generated: `[x0, x1, ..., x(n_features_in_)]`.
 |          - If `input_features` is an array-like, then `input_features` must
 |            match `feature_names_in_` if `feature_names_in_` is defined.
 |      
 |      Returns
 |      -------
 |      feature_names_out : ndarray of str objects
 |          Same as input features.

sampling_strategy accepts the following string values (float and dict forms also exist; see the sketch after this list):

  • 'minority': resample only the minority class
  • 'not minority': resample all classes except the minority class
  • 'not majority': resample all classes except the majority class
  • 'all': resample all classes
  • 'auto': equivalent to 'not majority'


from imblearn.over_sampling import SMOTE
# Oversample the minority class (LUNG_CANCER == 0) until it matches the majority class
smote=SMOTE(sampling_strategy='minority')
X,y=smote.fit_resample(X,y)
sns.countplot(y)

<matplotlib.axes._subplots.AxesSubplot at 0x7fbd565994d0>

[Figure: count plot of y after SMOTE, with both classes at 270 samples]
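The cell above resamples the full dataset. If X and y are later split into training and test sets, synthetic points generated from test-set neighbours can leak into training and inflate evaluation scores. A common alternative is to split first and let SMOTE see only the training fold; the sketch below assumes scikit-learn is available, uses X_raw and y_raw as placeholder names for the features and target before any resampling, and picks logistic regression purely for illustration:

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline  # imblearn's Pipeline resamples only during fit

# X_raw, y_raw: placeholder names for the un-resampled features and target
X_train, X_test, y_train, y_test = train_test_split(
    X_raw, y_raw, test_size=0.2, stratify=y_raw, random_state=42)

# SMOTE is applied to the training fold only; the test fold stays untouched
model = Pipeline([
    ("smote", SMOTE(sampling_strategy='minority', random_state=42)),
    ("clf", LogisticRegression(max_iter=1000)),
])
model.fit(X_train, y_train)
print(model.score(X_test, y_test))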


