AI-Based Lung Cancer Risk Prediction and Analysis (Part 1)

Summary: AI-based lung cancer risk prediction and analysis.

I. Lung Cancer Risk Prediction




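1. Background


An effective cancer prediction system lets people learn their cancer risk at low cost and make appropriate decisions based on that risk. The data were collected from an online lung cancer prediction website.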


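2. Data description


Number of fields: 16

Number of instances: 284 according to the dataset description (the CSV loaded below contains 309 rows)

Field information:

1. Gender: M (male), F (female)

2. Age: age of the patient

3. Smoking: YES=2, NO=1

4. Yellow fingers: YES=2, NO=1

5. Anxiety: YES=2, NO=1

6. Peer pressure: YES=2, NO=1

7. Chronic disease: YES=2, NO=1

8. Fatigue: YES=2, NO=1

9. Allergy: YES=2, NO=1

10. Wheezing: YES=2, NO=1

11. Alcohol consumption: YES=2, NO=1

12. Coughing: YES=2, NO=1

13. Shortness of breath: YES=2, NO=1

14. Swallowing difficulty: YES=2, NO=1

15. Chest pain: YES=2, NO=1

16. Lung cancer: YES, NO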
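
Since the binary symptom fields use 2 for YES and 1 for NO, it can be convenient to recode them to 1/0 before modeling. The following is only a minimal sketch on a small hand-built frame; the helper name recode_yes_no and the toy values are assumptions for illustration and are not part of the original notebook.

```python
import pandas as pd

def recode_yes_no(frame: pd.DataFrame, columns: list[str]) -> pd.DataFrame:
    """Return a copy in which 2 (YES) becomes 1 and 1 (NO) becomes 0."""
    out = frame.copy()
    out[columns] = out[columns] - 1  # 2 -> 1, 1 -> 0
    return out

# Toy frame (hypothetical values) with one numeric and two binary fields
toy = pd.DataFrame({
    "AGE": [69, 74, 59],
    "SMOKING": [1, 2, 1],
    "YELLOW_FINGERS": [2, 1, 1],
})
recoded = recode_yes_no(toy, ["SMOKING", "YELLOW_FINGERS"])
print(recoded)
```

The function returns a copy, so the original 2/1 coding stays available for the plots that filter on it.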


3. Data source


www.kaggle.com/datasets/na…


II. Data Processing


1. Reading the data


import pandas as pd
df=pd.read_csv("data/data209803/survey_lung_cancer.csv", index_col=None)
df.head()


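  | GENDER | AGE | SMOKING | YELLOW_FINGERS | ANXIETY | PEER_PRESSURE | CHRONIC DISEASE | FATIGUE | ALLERGY | WHEEZING | ALCOHOL CONSUMING | COUGHING | SHORTNESS OF BREATH | SWALLOWING DIFFICULTY | CHEST PAIN | LUNG_CANCER
0 | M | 69 | 1 | 2 | 2 | 1 | 1 | 2 | 1 | 2 | 2 | 2 | 2 | 2 | 2 | YES
1 | M | 74 | 2 | 1 | 1 | 1 | 2 | 2 | 2 | 1 | 1 | 1 | 2 | 2 | 2 | YES
2 | F | 59 | 1 | 1 | 1 | 2 | 1 | 2 | 1 | 2 | 1 | 2 | 2 | 1 | 2 | NO
3 | M | 63 | 2 | 2 | 2 | 1 | 1 | 1 | 1 | 1 | 2 | 1 | 1 | 2 | 2 | NO
4 | F | 63 | 1 | 2 | 1 | 1 | 1 | 1 | 1 | 2 | 1 | 2 | 2 | 1 | 1 | NO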


df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 309 entries, 0 to 308
Data columns (total 16 columns):
 #   Column                 Non-Null Count  Dtype 
---  ------                 --------------  ----- 
 0   GENDER                 309 non-null    object
 1   AGE                    309 non-null    int64 
 2   SMOKING                309 non-null    int64 
 3   YELLOW_FINGERS         309 non-null    int64 
 4   ANXIETY                309 non-null    int64 
 5   PEER_PRESSURE          309 non-null    int64 
 6   CHRONIC DISEASE        309 non-null    int64 
 7   FATIGUE                309 non-null    int64 
 8   ALLERGY                309 non-null    int64 
 9   WHEEZING               309 non-null    int64 
 10  ALCOHOL CONSUMING      309 non-null    int64 
 11  COUGHING               309 non-null    int64 
 12  SHORTNESS OF BREATH    309 non-null    int64 
 13  SWALLOWING DIFFICULTY  309 non-null    int64 
 14  CHEST PAIN             309 non-null    int64 
 15  LUNG_CANCER            309 non-null    object
dtypes: int64(14), object(2)
memory usage: 38.8+ KB


df.isnull().sum()


GENDER                   0
AGE                      0
SMOKING                  0
YELLOW_FINGERS           0
ANXIETY                  0
PEER_PRESSURE            0
CHRONIC DISEASE          0
FATIGUE                  0
ALLERGY                  0
WHEEZING                 0
ALCOHOL CONSUMING        0
COUGHING                 0
SHORTNESS OF BREATH      0
SWALLOWING DIFFICULTY    0
CHEST PAIN               0
LUNG_CANCER              0
dtype: int64

The dataset contains no null values.


2. Encoding the text columns


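# Encode the two text columns as integers (M/YES -> 1, F/NO -> 0).
# Assigning the result back avoids relying on inplace=True through attribute access.
df["GENDER"]=df["GENDER"].map({"M":1,"F":0})
df["LUNG_CANCER"]=df["LUNG_CANCER"].map({"YES":1,"NO":0})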


import matplotlib.pyplot as plt
%matplotlib inline


3. Inspecting the data distribution


figure,axes=plt.subplots(nrows=4,ncols=4,figsize=(20,16))
# One bar chart of value counts per column, laid out on a 4x4 grid
for i,column in enumerate(df.columns):
    row,col=divmod(i,4)
    df[column].value_counts().plot(ax=axes[row][col], kind='bar',title=f"{column} value counts")



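The bar charts show that lung cancer cases far outnumber non-cases in this dataset (270 vs. 39), while the other fields are fairly balanced.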


4. Smoking and lung cancer


# Split by smoking status (2 = yes, 1 = no)
smoke_yes=df.loc[df.SMOKING==2,["SMOKING","LUNG_CANCER"]]
smoke_no=df.loc[df.SMOKING==1,["SMOKING","LUNG_CANCER"]]
fig, (ax1, ax2) = plt.subplots(nrows=1, ncols=2,figsize=(16,8))
# reindex([1,0]) pins the slice order so it always matches labels=["YES","NO"];
# value_counts() alone orders by frequency, which could silently swap the labels
ax1.pie(smoke_yes.LUNG_CANCER.value_counts(normalize=True).reindex([1,0],fill_value=0),labels=["YES","NO"],colors=["yellow","green"],autopct='%1.1f%%',shadow=True)
ax1.set_title("Lung Cancer & Smoking_YES")
ax2.pie(smoke_no.LUNG_CANCER.value_counts(normalize=True).reindex([1,0],fill_value=0),labels=["YES","NO"],colors=["red","green"],autopct='%1.1f%%',shadow=True)
ax2.set_title("Lung Cancer & Smoking_NO")
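
The same comparison can be read off numerically from a row-normalised contingency table. The sketch below uses pd.crosstab on a toy frame with made-up values (an assumption for illustration, not the real dataset); applied to the real df, the two rows would correspond to the two pie charts.

```python
import pandas as pd

# Toy data (hypothetical values).
# SMOKING: 2 = yes, 1 = no; LUNG_CANCER: 1 = yes, 0 = no
toy = pd.DataFrame({
    "SMOKING":     [2, 2, 2, 2, 1, 1, 1, 1],
    "LUNG_CANCER": [1, 1, 1, 0, 1, 1, 0, 0],
})

# normalize="index" makes every row sum to 1, i.e. the share of
# cancer / no-cancer cases within each smoking group
rates = pd.crosstab(toy["SMOKING"], toy["LUNG_CANCER"], normalize="index")
print(rates)
```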




5. Allergy, coughing, alcohol consumption, swallowing difficulty, wheezing, chest pain and lung cancer


import seaborn as sns
# seaborn 0.12+ accepts only `data` positionally, so the x variable is passed by keyword
fig,(ax1,ax2,ax3)=plt.subplots(1,3,figsize=(30,8))
sns.countplot(x=df.LUNG_CANCER,hue=df["ALLERGY "],ax=ax1,palette=['green', 'black'])
sns.countplot(x=df.LUNG_CANCER,hue=df.COUGHING,ax=ax2,palette=['green', 'black'])
sns.countplot(x=df.LUNG_CANCER,hue=df["ALCOHOL CONSUMING"],ax=ax3,palette=['green', 'black'])
fig,(ax1,ax2,ax3)=plt.subplots(1,3,figsize=(30,8))
sns.countplot(x=df.LUNG_CANCER,hue=df["SWALLOWING DIFFICULTY"],ax=ax1,palette=['green', 'black'])
sns.countplot(x=df.LUNG_CANCER,hue=df.WHEEZING,ax=ax2,palette=['green', 'black'])
sns.countplot(x=df.LUNG_CANCER,hue=df["CHEST PAIN"],ax=ax3,palette=['green', 'black'])




6. Plotting a heatmap


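import seaborn as sns
plt.figure(figsize=(16,10))
# Pearson correlations lie in [-1, 1]; with vmin=0 every negative value would get the same color as 0
sns.heatmap(df.corr(),annot=True,cmap='viridis',vmin=-1, vmax=1)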



The heatmap shows that gender, age and smoking status are only weakly correlated with lung cancer in this dataset.
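
To read such a ranking off without scanning the heatmap, the target column of the correlation matrix can be sorted by absolute value. This is a small sketch on a toy frame with made-up values (an assumption for illustration); on the real df the same expression would rank all 15 features.

```python
import pandas as pd

# Toy data (hypothetical values), already numerically encoded
toy = pd.DataFrame({
    "GENDER":      [1, 0, 1, 0, 1, 0],
    "SMOKING":     [2, 1, 2, 1, 2, 2],
    "COUGHING":    [2, 2, 2, 1, 1, 2],
    "LUNG_CANCER": [1, 1, 1, 0, 0, 1],
})

# Pearson correlation of every feature with the target,
# strongest relationship (positive or negative) first
ranked = (
    toy.corr()["LUNG_CANCER"]
    .drop("LUNG_CANCER")
    .sort_values(key=abs, ascending=False)
)
print(ranked)
```

Sorting by absolute value keeps strongly negative correlations near the top instead of burying them at the bottom of the list.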


7. Building X and y


# Build the feature matrix X and the target vector y
X=df.drop(columns=["LUNG_CANCER"])
y=df["LUNG_CANCER"]


y.value_counts()


1    270
0     39
Name: LUNG_CANCER, dtype: int64
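
The counts above can be turned into class proportions and an imbalance ratio, which makes the need for rebalancing concrete. A minimal sketch, assuming only the two counts printed by y.value_counts(); the variable names are illustrative.

```python
import pandas as pd

# Class counts reported by y.value_counts(): 270 positives, 39 negatives
counts = pd.Series({1: 270, 0: 39}, name="LUNG_CANCER")

proportions = counts / counts.sum()
imbalance_ratio = counts.max() / counts.min()
# Accuracy of a trivial model that always predicts the majority class
majority_baseline = proportions.max()

print(proportions.round(3).to_dict())
print(f"imbalance ratio: {imbalance_ratio:.2f}, "
      f"majority baseline accuracy: {majority_baseline:.3f}")
```

A classifier trained on these data has to beat the majority baseline to be of any use, which is why plain accuracy is a weak metric here.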


sns.countplot(x=y)




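8. Balancing the data


After installing the package, restart the notebook kernel so that it takes effect; otherwise the import fails with an error.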


from IPython.display import clear_output
!pip install imblearn --user
!pip uninstall scipy -y
!pip install scipy --user
clear_output()


from imblearn.over_sampling import SMOTE


help(SMOTE)


Help on class SMOTE in module imblearn.over_sampling._smote.base:
class SMOTE(BaseSMOTE)
 |  SMOTE(*, sampling_strategy='auto', random_state=None, k_neighbors=5, n_jobs=None)
 |  
 |  Class to perform over-sampling using SMOTE.
 |  
 |  This object is an implementation of SMOTE - Synthetic Minority
 |  Over-sampling Technique as presented in [1]_.
 |  
 |  Read more in the :ref:`User Guide <smote_adasyn>`.
 |  
 |  Parameters
 |  ----------
 |  sampling_strategy : float, str, dict or callable, default='auto'
 |      Sampling information to resample the data set.
 |  
 |      - When ``float``, it corresponds to the desired ratio of the number of
 |        samples in the minority class over the number of samples in the
 |        majority class after resampling. Therefore, the ratio is expressed as
 |        :math:`\alpha_{os} = N_{rm} / N_{M}` where :math:`N_{rm}` is the
 |        number of samples in the minority class after resampling and
 |        :math:`N_{M}` is the number of samples in the majority class.
 |  
 |          .. warning::
 |             ``float`` is only available for **binary** classification. An
 |             error is raised for multi-class classification.
 |  
 |      - When ``str``, specify the class targeted by the resampling. The
 |        number of samples in the different classes will be equalized.
 |        Possible choices are:
 |  
 |          ``'minority'``: resample only the minority class;
 |  
 |          ``'not minority'``: resample all classes but the minority class;
 |  
 |          ``'not majority'``: resample all classes but the majority class;
 |  
 |          ``'all'``: resample all classes;
 |  
 |          ``'auto'``: equivalent to ``'not majority'``.
 |  
 |      - When ``dict``, the keys correspond to the targeted classes. The
 |        values correspond to the desired number of samples for each targeted
 |        class.
 |  
 |      - When callable, function taking ``y`` and returns a ``dict``. The keys
 |        correspond to the targeted classes. The values correspond to the
 |        desired number of samples for each class.
 |  
 |  random_state : int, RandomState instance, default=None
 |      Control the randomization of the algorithm.
 |  
 |      - If int, ``random_state`` is the seed used by the random number
 |        generator;
 |      - If ``RandomState`` instance, random_state is the random number
 |        generator;
 |      - If ``None``, the random number generator is the ``RandomState``
 |        instance used by ``np.random``.
 |  
 |  k_neighbors : int or object, default=5
 |      The nearest neighbors used to define the neighborhood of samples to use
 |      to generate the synthetic samples. You can pass:
 |  
 |      - an `int` corresponding to the number of neighbors to use. A
 |        `~sklearn.neighbors.NearestNeighbors` instance will be fitted in this
 |        case.
 |      - an instance of a compatible nearest neighbors algorithm that should
 |        implement both methods `kneighbors` and `kneighbors_graph`. For
 |        instance, it could correspond to a
 |        :class:`~sklearn.neighbors.NearestNeighbors` but could be extended to
 |        any compatible class.
 |  
 |  n_jobs : int, default=None
 |      Number of CPU cores used during the cross-validation loop.
 |      ``None`` means 1 unless in a :obj:`joblib.parallel_backend` context.
 |      ``-1`` means using all processors. See
 |      `Glossary <https://scikit-learn.org/stable/glossary.html#term-n-jobs>`_
 |      for more details.
 |  
 |      .. deprecated:: 0.10
 |         `n_jobs` has been deprecated in 0.10 and will be removed in 0.12.
 |         It was previously used to set `n_jobs` of nearest neighbors
 |         algorithm. From now on, you can pass an estimator where `n_jobs` is
 |         already set instead.
 |  
 |  Attributes
 |  ----------
 |  sampling_strategy_ : dict
 |      Dictionary containing the information to sample the dataset. The keys
 |      corresponds to the class labels from which to sample and the values
 |      are the number of samples to sample.
 |  
 |  nn_k_ : estimator object
 |      Validated k-nearest neighbours created from the `k_neighbors` parameter.
 |  
 |  n_features_in_ : int
 |      Number of features in the input dataset.
 |  
 |      .. versionadded:: 0.9
 |  
 |  feature_names_in_ : ndarray of shape (`n_features_in_`,)
 |      Names of features seen during `fit`. Defined only when `X` has feature
 |      names that are all strings.
 |  
 |      .. versionadded:: 0.10
 |  
 |  See Also
 |  --------
 |  SMOTENC : Over-sample using SMOTE for continuous and categorical features.
 |  
 |  SMOTEN : Over-sample using the SMOTE variant specifically for categorical
 |      features only.
 |  
 |  BorderlineSMOTE : Over-sample using the borderline-SMOTE variant.
 |  
 |  SVMSMOTE : Over-sample using the SVM-SMOTE variant.
 |  
 |  ADASYN : Over-sample using ADASYN.
 |  
 |  KMeansSMOTE : Over-sample applying a clustering before to oversample using
 |      SMOTE.
 |  
 |  Notes
 |  -----
 |  See the original papers: [1]_ for more details.
 |  
 |  Supports multi-class resampling. A one-vs.-rest scheme is used as
 |  originally proposed in [1]_.
 |  
 |  References
 |  ----------
 |  .. [1] N. V. Chawla, K. W. Bowyer, L. O.Hall, W. P. Kegelmeyer, "SMOTE:
 |     synthetic minority over-sampling technique," Journal of artificial
 |     intelligence research, 321-357, 2002.
 |  
 |  Examples
 |  --------
 |  >>> from collections import Counter
 |  >>> from sklearn.datasets import make_classification
 |  >>> from imblearn.over_sampling import SMOTE
 |  >>> X, y = make_classification(n_classes=2, class_sep=2,
 |  ... weights=[0.1, 0.9], n_informative=3, n_redundant=1, flip_y=0,
 |  ... n_features=20, n_clusters_per_class=1, n_samples=1000, random_state=10)
 |  >>> print('Original dataset shape %s' % Counter(y))
 |  Original dataset shape Counter({1: 900, 0: 100})
 |  >>> sm = SMOTE(random_state=42)
 |  >>> X_res, y_res = sm.fit_resample(X, y)
 |  >>> print('Resampled dataset shape %s' % Counter(y_res))
 |  Resampled dataset shape Counter({0: 900, 1: 900})
 |  
 |  Method resolution order:
 |      SMOTE
 |      BaseSMOTE
 |      imblearn.over_sampling.base.BaseOverSampler
 |      imblearn.base.BaseSampler
 |      imblearn.base.SamplerMixin
 |      sklearn.base.BaseEstimator
 |      sklearn.base._OneToOneFeatureMixin
 |      imblearn.base._ParamsValidationMixin
 |      builtins.object
 |  
 |  Methods defined here:
 |  
 |  __init__(self, *, sampling_strategy='auto', random_state=None, k_neighbors=5, n_jobs=None)
 |      Initialize self.  See help(type(self)) for accurate signature.
 |  
 |  ----------------------------------------------------------------------
 |  Data and other attributes defined here:
 |  
 |  __abstractmethods__ = frozenset()
 |  
 |  ----------------------------------------------------------------------
 |  Data and other attributes inherited from BaseSMOTE:
 |  
 |  __annotations__ = {'_parameter_constraints': <class 'dict'>}
 |  
 |  ----------------------------------------------------------------------
 |  Methods inherited from imblearn.base.BaseSampler:
 |  
 |  fit(self, X, y)
 |      Check inputs and statistics of the sampler.
 |      
 |      You should use ``fit_resample`` in all cases.
 |      
 |      Parameters
 |      ----------
 |      X : {array-like, dataframe, sparse matrix} of shape                 (n_samples, n_features)
 |          Data array.
 |      
 |      y : array-like of shape (n_samples,)
 |          Target array.
 |      
 |      Returns
 |      -------
 |      self : object
 |          Return the instance itself.
 |  
 |  fit_resample(self, X, y)
 |      Resample the dataset.
 |      
 |      Parameters
 |      ----------
 |      X : {array-like, dataframe, sparse matrix} of shape                 (n_samples, n_features)
 |          Matrix containing the data which have to be sampled.
 |      
 |      y : array-like of shape (n_samples,)
 |          Corresponding label for each sample in X.
 |      
 |      Returns
 |      -------
 |      X_resampled : {array-like, dataframe, sparse matrix} of shape                 (n_samples_new, n_features)
 |          The array containing the resampled data.
 |      
 |      y_resampled : array-like of shape (n_samples_new,)
 |          The corresponding label of `X_resampled`.
 |  
 |  ----------------------------------------------------------------------
 |  Methods inherited from sklearn.base.BaseEstimator:
 |  
 |  __getstate__(self)
 |  
 |  __repr__(self, N_CHAR_MAX=700)
 |      Return repr(self).
 |  
 |  __setstate__(self, state)
 |  
 |  get_params(self, deep=True)
 |      Get parameters for this estimator.
 |      
 |      Parameters
 |      ----------
 |      deep : bool, default=True
 |          If True, will return the parameters for this estimator and
 |          contained subobjects that are estimators.
 |      
 |      Returns
 |      -------
 |      params : dict
 |          Parameter names mapped to their values.
 |  
 |  set_params(self, **params)
 |      Set the parameters of this estimator.
 |      
 |      The method works on simple estimators as well as on nested objects
 |      (such as :class:`~sklearn.pipeline.Pipeline`). The latter have
 |      parameters of the form ``<component>__<parameter>`` so that it's
 |      possible to update each component of a nested object.
 |      
 |      Parameters
 |      ----------
 |      **params : dict
 |          Estimator parameters.
 |      
 |      Returns
 |      -------
 |      self : estimator instance
 |          Estimator instance.
 |  
 |  ----------------------------------------------------------------------
 |  Data descriptors inherited from sklearn.base.BaseEstimator:
 |  
 |  __dict__
 |      dictionary for instance variables (if defined)
 |  
 |  __weakref__
 |      list of weak references to the object (if defined)
 |  
 |  ----------------------------------------------------------------------
 |  Methods inherited from sklearn.base._OneToOneFeatureMixin:
 |  
 |  get_feature_names_out(self, input_features=None)
 |      Get output feature names for transformation.
 |      
 |      Parameters
 |      ----------
 |      input_features : array-like of str or None, default=None
 |          Input features.
 |      
 |          - If `input_features` is `None`, then `feature_names_in_` is
 |            used as feature names in. If `feature_names_in_` is not defined,
 |            then names are generated: `[x0, x1, ..., x(n_features_in_)]`.
 |          - If `input_features` is an array-like, then `input_features` must
 |            match `feature_names_in_` if `feature_names_in_` is defined.
 |      
 |      Returns
 |      -------
 |      feature_names_out : ndarray of str objects
 |          Same as input features.

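sampling_strategy accepts the following string values:

  • 'minority': resample only the minority class
  • 'not minority': resample all classes except the minority class
  • 'not majority': resample all classes except the majority class
  • 'all': resample all classes
  • 'auto': equivalent to 'not majority'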
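
Whichever classes sampling_strategy selects, SMOTE fills them with synthetic points created by interpolating between a sample and one of its nearest neighbours of the same class. The sketch below illustrates that idea from scratch with NumPy; the function smote_like_oversample and its parameters are hypothetical, and this is not imblearn's implementation.

```python
import numpy as np

def smote_like_oversample(X_min: np.ndarray, n_new: int, k: int = 2, seed: int = 0) -> np.ndarray:
    """Create n_new synthetic samples by interpolating each randomly chosen
    base sample towards one of its k nearest neighbours in X_min."""
    rng = np.random.default_rng(seed)
    # Pairwise Euclidean distances between the minority samples
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)                # a sample is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]  # indices of the k nearest neighbours

    base_idx = rng.integers(0, len(X_min), size=n_new)             # random base samples
    nn_idx = neighbours[base_idx, rng.integers(0, k, size=n_new)]  # one random neighbour each
    lam = rng.random((n_new, 1))                                   # interpolation factor in [0, 1)
    return X_min[base_idx] + lam * (X_min[nn_idx] - X_min[base_idx])

# Four minority samples on the corners of the unit square
X_minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
synthetic = smote_like_oversample(X_minority, n_new=6)
print(synthetic.shape)
```

Every synthetic point lies on the line segment between two real samples, so for 1/2-coded columns such as those in this dataset the generated values are generally not integers.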


from imblearn.over_sampling import SMOTE
# Oversample only the minority class (LUNG_CANCER == 0) until both classes have the same size
smote=SMOTE(sampling_strategy='minority')
X,y=smote.fit_resample(X,y)
sns.countplot(x=y)


