AI-Based Lung Cancer Risk Prediction and Analysis (Part 1)

Summary: AI-based lung cancer risk prediction and analysis.

I. Lung Cancer Risk Prediction




1. Background


An effective cancer prediction system helps people assess their cancer risk at low cost and make appropriate decisions based on that risk. The data were collected from an online lung cancer prediction website.


2. Data Description


Number of fields: 16

Number of instances: 284 (the figure given in the dataset description; the CSV loaded below actually contains 309 rows)

Field information:

1. GENDER: M (male), F (female)
2. AGE: age of the patient
3. SMOKING: YES=2, NO=1
4. YELLOW_FINGERS: YES=2, NO=1
5. ANXIETY: YES=2, NO=1
6. PEER_PRESSURE: YES=2, NO=1
7. CHRONIC DISEASE: YES=2, NO=1
8. FATIGUE: YES=2, NO=1
9. ALLERGY: YES=2, NO=1
10. WHEEZING: YES=2, NO=1
11. ALCOHOL CONSUMING: YES=2, NO=1
12. COUGHING: YES=2, NO=1
13. SHORTNESS OF BREATH: YES=2, NO=1
14. SWALLOWING DIFFICULTY: YES=2, NO=1
15. CHEST PAIN: YES=2, NO=1
16. LUNG_CANCER: YES, NO
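Note that the binary symptom fields are coded 1/2 rather than the more common 0/1. The walkthrough below keeps this coding, and later filters such as df.SMOKING == 2 rely on it. If a 0/1 coding is preferred, here is a minimal sketch (assuming df is the DataFrame loaded in the next section, and that the later filters are adapted accordingly):

# Optional: remap the 1/2-coded symptom columns to 0/1 (all int columns except AGE)
binary_cols = df.select_dtypes("int64").columns.drop("AGE")
df[binary_cols] = df[binary_cols] - 1   # 1 -> 0 (NO), 2 -> 1 (YES)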


3. Data Source


www.kaggle.com/datasets/na…


II. Data Processing


1. Load the Data


import pandas as pd
df=pd.read_csv("data/data209803/survey_lung_cancer.csv", index_col=None)
df.head()


GENDER AGE SMOKING YELLOW_FINGERS ANXIETY PEER_PRESSURE CHRONIC DISEASE FATIGUE ALLERGY WHEEZING ALCOHOL CONSUMING COUGHING SHORTNESS OF BREATH SWALLOWING DIFFICULTY CHEST PAIN LUNG_CANCER
0 M 69 1 2 2 1 1 2 1 2 2 2 2 2 2 YES
1 M 74 2 1 1 1 2 2 2 1 1 1 2 2 2 YES
2 F 59 1 1 1 2 1 2 1 2 1 2 2 1 2 NO
3 M 63 2 2 2 1 1 1 1 1 2 1 1 2 2 NO
4 F 63 1 2 1 1 1 1 1 2 1 2 2 1 1 NO


df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 309 entries, 0 to 308
Data columns (total 16 columns):
 #   Column                 Non-Null Count  Dtype 
---  ------                 --------------  ----- 
 0   GENDER                 309 non-null    object
 1   AGE                    309 non-null    int64 
 2   SMOKING                309 non-null    int64 
 3   YELLOW_FINGERS         309 non-null    int64 
 4   ANXIETY                309 non-null    int64 
 5   PEER_PRESSURE          309 non-null    int64 
 6   CHRONIC DISEASE        309 non-null    int64 
 7   FATIGUE                309 non-null    int64 
 8   ALLERGY                309 non-null    int64 
 9   WHEEZING               309 non-null    int64 
 10  ALCOHOL CONSUMING      309 non-null    int64 
 11  COUGHING               309 non-null    int64 
 12  SHORTNESS OF BREATH    309 non-null    int64 
 13  SWALLOWING DIFFICULTY  309 non-null    int64 
 14  CHEST PAIN             309 non-null    int64 
 15  LUNG_CANCER            309 non-null    object
dtypes: int64(14), object(2)
memory usage: 38.8+ KB


df.isnull().sum()


GENDER                   0
AGE                      0
SMOKING                  0
YELLOW_FINGERS           0
ANXIETY                  0
PEER_PRESSURE            0
CHRONIC DISEASE          0
FATIGUE                  0
ALLERGY                  0
WHEEZING                 0
ALCOHOL CONSUMING        0
COUGHING                 0
SHORTNESS OF BREATH      0
SWALLOWING DIFFICULTY    0
CHEST PAIN               0
LUNG_CANCER              0
dtype: int64

As shown above, there are no missing values.
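A complementary check that the original walkthrough does not perform is to look for duplicate survey responses, which survey data of this kind often contains; a small optional sketch:

# Optional: count duplicate rows in the survey data
print(df.duplicated().sum())
# If desired, drop them before modelling:
# df = df.drop_duplicates().reset_index(drop=True)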


2. Encode the Categorical Columns


# Map the text columns to integers: GENDER M/F -> 1/0, LUNG_CANCER YES/NO -> 1/0
df.GENDER.replace({"M":1,"F":0},inplace=True)
df.LUNG_CANCER.replace({"YES":1,"NO":0},inplace=True)
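On recent pandas releases, calling replace(..., inplace=True) on a single column can emit a chained-assignment warning or leave df unchanged. An equivalent, version-safe alternative to the cell above (use one form or the other, not both) is:

# Equivalent mapping that assigns back to the DataFrame instead of modifying a column in place
df["GENDER"] = df["GENDER"].map({"M": 1, "F": 0})
df["LUNG_CANCER"] = df["LUNG_CANCER"].map({"YES": 1, "NO": 0})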


import matplotlib.pyplot as plt
%matplotlib inline


3. Inspect the Data Distributions


# Bar chart of value counts for each of the 16 columns, laid out on a 4x4 grid
figure, axes = plt.subplots(nrows=4, ncols=4, figsize=(20, 16))
for i, column in enumerate(df.columns):
    x = i // 4
    y = i % 4
    df[column].value_counts().plot(ax=axes[x][y], kind='bar', title=f"{column} distribution")


[Figure: value-count bar charts for each of the 16 columns]

The charts show that the target LUNG_CANCER is highly imbalanced (far more positive cases than negative ones), while the other columns are reasonably balanced.
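To put a number on the imbalance visible in the charts, the class proportions of the target can be printed directly (a small sketch; given the value counts shown later, roughly 87% of the records are positive, where 1 means YES after the encoding above):

# Quantify the class imbalance of the target column
print(df.LUNG_CANCER.value_counts())
print(df.LUNG_CANCER.value_counts(normalize=True).round(3))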


4. Smoking vs. Lung Cancer


smoke_yes=df.loc[df.SMOKING==2,["SMOKING","LUNG_CANCER"]]
smoke_no=df.loc[df.SMOKING==1,["SMOKING","LUNG_CANCER"]]
fig, (ax1, ax2) = plt.subplots(nrows=1, ncols=2,figsize=(16,8))
ax1.pie(smoke_yes.LUNG_CANCER.value_counts(normalize=True),labels=["YES","NO"],colors=["yellow","green"],autopct='%1.1f%%',shadow=True,)
ax1.set_title("Lung Cancer & Smoking_YES")
ax2.pie(smoke_no.LUNG_CANCER.value_counts(normalize=True),labels=["YES","NO"],colors=["red","green"],autopct='%1.1f%%',shadow=True,)
ax2.set_title("Lung Cancer & Smoking_NO")


Text(0.5,1,'Lung Cancer & Smoking_NO')

[Figure: pie charts of lung cancer proportions among smokers (left) and non-smokers (right)]
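One caveat about the cell above: the pie labels ["YES", "NO"] are hard-coded, while value_counts() orders its result by frequency, so the labels are only correct as long as YES is the more frequent outcome in both groups (which happens to hold for this dataset). A slightly more defensive sketch derives the labels from the counts themselves:

# Derive the pie labels from the value_counts index so they always match the slices
counts = smoke_yes.LUNG_CANCER.value_counts(normalize=True)
labels = ["YES" if v == 1 else "NO" for v in counts.index]  # LUNG_CANCER is already encoded 1/0
plt.pie(counts, labels=labels, autopct='%1.1f%%')
plt.title("Lung Cancer & Smoking_YES")
plt.show()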


5. Allergy, Coughing, Alcohol, Swallowing Difficulty, Wheezing, and Chest Pain vs. Lung Cancer


import seaborn as sns
# First row: lung cancer outcome split by allergy, coughing, and alcohol consumption
# (on seaborn >= 0.12, pass the first argument explicitly, e.g. sns.countplot(x=df.LUNG_CANCER, ...))
fig,(ax1,ax2,ax3)=plt.subplots(1,3,figsize=(30,8))
sns.countplot(df.LUNG_CANCER,hue=df["ALLERGY "],ax=ax1,palette=['green', 'black'])
sns.countplot(df.LUNG_CANCER,hue=df.COUGHING,ax=ax2,palette=['green', 'black'])
sns.countplot(df.LUNG_CANCER,hue=df["ALCOHOL CONSUMING"],ax=ax3,palette=['green', 'black'])
# Second row: lung cancer outcome split by swallowing difficulty, wheezing, and chest pain
fig,(ax1,ax2,ax3)=plt.subplots(1,3,figsize=(30,8))
sns.countplot(df.LUNG_CANCER,hue=df["SWALLOWING DIFFICULTY"],ax=ax1,palette=['green', 'black'])
sns.countplot(df.LUNG_CANCER,hue=df.WHEEZING,ax=ax2,palette=['green', 'black'])
sns.countplot(df.LUNG_CANCER,hue=df["CHEST PAIN"],ax=ax3,palette=['green', 'black'])


<matplotlib.axes._subplots.AxesSubplot at 0x7fba81b66350>

[Figure: count plots of LUNG_CANCER split by ALLERGY, COUGHING, and ALCOHOL CONSUMING]

[Figure: count plots of LUNG_CANCER split by SWALLOWING DIFFICULTY, WHEEZING, and CHEST PAIN]


6. Correlation Heatmap


import seaborn as sns
plt.figure(figsize=(16,10))
# Note: vmin=0 clips negative correlations to the bottom of the colour scale;
# vmin=-1 would show the full correlation range.
sns.heatmap(df.corr(),annot=True,cmap='viridis',vmin=0, vmax=1)


<matplotlib.axes._subplots.AxesSubplot at 0x7fba83b48d90>

[Figure: correlation heatmap of all features]

The heatmap suggests that GENDER, AGE, and SMOKING are only weakly correlated with LUNG_CANCER.
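To read these relationships off numerically rather than from the colour scale, the correlations against the target can be sorted directly (a small sketch):

# Correlation of every feature with the target, sorted from most negative to most positive
df.corr()["LUNG_CANCER"].drop("LUNG_CANCER").sort_values()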


7. Build X and y


# Build the feature matrix X and the target vector y
X=df.drop(columns=["LUNG_CANCER"])
y=df["LUNG_CANCER"]


y.value_counts()


1    270
0     39
Name: LUNG_CANCER, dtype: int64


sns.countplot(y)


<matplotlib.axes._subplots.AxesSubplot at 0x7fba81a56590>

[Figure: count plot of the LUNG_CANCER classes before resampling]


8. Balance the Classes


The packages below only take effect after the kernel is restarted; without a restart the import fails with an error like the one shown:

[Screenshot: the import error raised when the kernel is not restarted after installation]


from IPython.display import clear_output
# Install imbalanced-learn and reinstall scipy, then restart the kernel (see the note above)
!pip install imblearn --user
!pip uninstall scipy -y
!pip install scipy --user
clear_output()  # hide the lengthy pip logs


from imblearn.over_sampling import SMOTE


help(SMOTE)


Help on class SMOTE in module imblearn.over_sampling._smote.base:
class SMOTE(BaseSMOTE)
 |  SMOTE(*, sampling_strategy='auto', random_state=None, k_neighbors=5, n_jobs=None)
 |  
 |  Class to perform over-sampling using SMOTE.
 |  
 |  This object is an implementation of SMOTE - Synthetic Minority
 |  Over-sampling Technique as presented in [1]_.
 |  
 |  Read more in the :ref:`User Guide <smote_adasyn>`.
 |  
 |  Parameters
 |  ----------
 |  sampling_strategy : float, str, dict or callable, default='auto'
 |      Sampling information to resample the data set.
 |  
 |      - When ``float``, it corresponds to the desired ratio of the number of
 |        samples in the minority class over the number of samples in the
 |        majority class after resampling. Therefore, the ratio is expressed as
 |        :math:`\alpha_{os} = N_{rm} / N_{M}` where :math:`N_{rm}` is the
 |        number of samples in the minority class after resampling and
 |        :math:`N_{M}` is the number of samples in the majority class.
 |  
 |          .. warning::
 |             ``float`` is only available for **binary** classification. An
 |             error is raised for multi-class classification.
 |  
 |      - When ``str``, specify the class targeted by the resampling. The
 |        number of samples in the different classes will be equalized.
 |        Possible choices are:
 |  
 |          ``'minority'``: resample only the minority class;
 |  
 |          ``'not minority'``: resample all classes but the minority class;
 |  
 |          ``'not majority'``: resample all classes but the majority class;
 |  
 |          ``'all'``: resample all classes;
 |  
 |          ``'auto'``: equivalent to ``'not majority'``.
 |  
 |      - When ``dict``, the keys correspond to the targeted classes. The
 |        values correspond to the desired number of samples for each targeted
 |        class.
 |  
 |      - When callable, function taking ``y`` and returns a ``dict``. The keys
 |        correspond to the targeted classes. The values correspond to the
 |        desired number of samples for each class.
 |  
 |  random_state : int, RandomState instance, default=None
 |      Control the randomization of the algorithm.
 |  
 |      - If int, ``random_state`` is the seed used by the random number
 |        generator;
 |      - If ``RandomState`` instance, random_state is the random number
 |        generator;
 |      - If ``None``, the random number generator is the ``RandomState``
 |        instance used by ``np.random``.
 |  
 |  k_neighbors : int or object, default=5
 |      The nearest neighbors used to define the neighborhood of samples to use
 |      to generate the synthetic samples. You can pass:
 |  
 |      - an `int` corresponding to the number of neighbors to use. A
 |        `~sklearn.neighbors.NearestNeighbors` instance will be fitted in this
 |        case.
 |      - an instance of a compatible nearest neighbors algorithm that should
 |        implement both methods `kneighbors` and `kneighbors_graph`. For
 |        instance, it could correspond to a
 |        :class:`~sklearn.neighbors.NearestNeighbors` but could be extended to
 |        any compatible class.
 |  
 |  n_jobs : int, default=None
 |      Number of CPU cores used during the cross-validation loop.
 |      ``None`` means 1 unless in a :obj:`joblib.parallel_backend` context.
 |      ``-1`` means using all processors. See
 |      `Glossary <https://scikit-learn.org/stable/glossary.html#term-n-jobs>`_
 |      for more details.
 |  
 |      .. deprecated:: 0.10
 |         `n_jobs` has been deprecated in 0.10 and will be removed in 0.12.
 |         It was previously used to set `n_jobs` of nearest neighbors
 |         algorithm. From now on, you can pass an estimator where `n_jobs` is
 |         already set instead.
 |  
 |  Attributes
 |  ----------
 |  sampling_strategy_ : dict
 |      Dictionary containing the information to sample the dataset. The keys
 |      corresponds to the class labels from which to sample and the values
 |      are the number of samples to sample.
 |  
 |  nn_k_ : estimator object
 |      Validated k-nearest neighbours created from the `k_neighbors` parameter.
 |  
 |  n_features_in_ : int
 |      Number of features in the input dataset.
 |  
 |      .. versionadded:: 0.9
 |  
 |  feature_names_in_ : ndarray of shape (`n_features_in_`,)
 |      Names of features seen during `fit`. Defined only when `X` has feature
 |      names that are all strings.
 |  
 |      .. versionadded:: 0.10
 |  
 |  See Also
 |  --------
 |  SMOTENC : Over-sample using SMOTE for continuous and categorical features.
 |  
 |  SMOTEN : Over-sample using the SMOTE variant specifically for categorical
 |      features only.
 |  
 |  BorderlineSMOTE : Over-sample using the borderline-SMOTE variant.
 |  
 |  SVMSMOTE : Over-sample using the SVM-SMOTE variant.
 |  
 |  ADASYN : Over-sample using ADASYN.
 |  
 |  KMeansSMOTE : Over-sample applying a clustering before to oversample using
 |      SMOTE.
 |  
 |  Notes
 |  -----
 |  See the original papers: [1]_ for more details.
 |  
 |  Supports multi-class resampling. A one-vs.-rest scheme is used as
 |  originally proposed in [1]_.
 |  
 |  References
 |  ----------
 |  .. [1] N. V. Chawla, K. W. Bowyer, L. O.Hall, W. P. Kegelmeyer, "SMOTE:
 |     synthetic minority over-sampling technique," Journal of artificial
 |     intelligence research, 321-357, 2002.
 |  
 |  Examples
 |  --------
 |  >>> from collections import Counter
 |  >>> from sklearn.datasets import make_classification
 |  >>> from imblearn.over_sampling import SMOTE
 |  >>> X, y = make_classification(n_classes=2, class_sep=2,
 |  ... weights=[0.1, 0.9], n_informative=3, n_redundant=1, flip_y=0,
 |  ... n_features=20, n_clusters_per_class=1, n_samples=1000, random_state=10)
 |  >>> print('Original dataset shape %s' % Counter(y))
 |  Original dataset shape Counter({1: 900, 0: 100})
 |  >>> sm = SMOTE(random_state=42)
 |  >>> X_res, y_res = sm.fit_resample(X, y)
 |  >>> print('Resampled dataset shape %s' % Counter(y_res))
 |  Resampled dataset shape Counter({0: 900, 1: 900})
 |  
 |  Method resolution order:
 |      SMOTE
 |      BaseSMOTE
 |      imblearn.over_sampling.base.BaseOverSampler
 |      imblearn.base.BaseSampler
 |      imblearn.base.SamplerMixin
 |      sklearn.base.BaseEstimator
 |      sklearn.base._OneToOneFeatureMixin
 |      imblearn.base._ParamsValidationMixin
 |      builtins.object
 |  
 |  Methods defined here:
 |  
 |  __init__(self, *, sampling_strategy='auto', random_state=None, k_neighbors=5, n_jobs=None)
 |      Initialize self.  See help(type(self)) for accurate signature.
 |  
 |  ----------------------------------------------------------------------
 |  Data and other attributes defined here:
 |  
 |  __abstractmethods__ = frozenset()
 |  
 |  ----------------------------------------------------------------------
 |  Data and other attributes inherited from BaseSMOTE:
 |  
 |  __annotations__ = {'_parameter_constraints': <class 'dict'>}
 |  
 |  ----------------------------------------------------------------------
 |  Methods inherited from imblearn.base.BaseSampler:
 |  
 |  fit(self, X, y)
 |      Check inputs and statistics of the sampler.
 |      
 |      You should use ``fit_resample`` in all cases.
 |      
 |      Parameters
 |      ----------
 |      X : {array-like, dataframe, sparse matrix} of shape                 (n_samples, n_features)
 |          Data array.
 |      
 |      y : array-like of shape (n_samples,)
 |          Target array.
 |      
 |      Returns
 |      -------
 |      self : object
 |          Return the instance itself.
 |  
 |  fit_resample(self, X, y)
 |      Resample the dataset.
 |      
 |      Parameters
 |      ----------
 |      X : {array-like, dataframe, sparse matrix} of shape                 (n_samples, n_features)
 |          Matrix containing the data which have to be sampled.
 |      
 |      y : array-like of shape (n_samples,)
 |          Corresponding label for each sample in X.
 |      
 |      Returns
 |      -------
 |      X_resampled : {array-like, dataframe, sparse matrix} of shape                 (n_samples_new, n_features)
 |          The array containing the resampled data.
 |      
 |      y_resampled : array-like of shape (n_samples_new,)
 |          The corresponding label of `X_resampled`.
 |  
 |  ----------------------------------------------------------------------
 |  Methods inherited from sklearn.base.BaseEstimator:
 |  
 |  __getstate__(self)
 |  
 |  __repr__(self, N_CHAR_MAX=700)
 |      Return repr(self).
 |  
 |  __setstate__(self, state)
 |  
 |  get_params(self, deep=True)
 |      Get parameters for this estimator.
 |      
 |      Parameters
 |      ----------
 |      deep : bool, default=True
 |          If True, will return the parameters for this estimator and
 |          contained subobjects that are estimators.
 |      
 |      Returns
 |      -------
 |      params : dict
 |          Parameter names mapped to their values.
 |  
 |  set_params(self, **params)
 |      Set the parameters of this estimator.
 |      
 |      The method works on simple estimators as well as on nested objects
 |      (such as :class:`~sklearn.pipeline.Pipeline`). The latter have
 |      parameters of the form ``<component>__<parameter>`` so that it's
 |      possible to update each component of a nested object.
 |      
 |      Parameters
 |      ----------
 |      **params : dict
 |          Estimator parameters.
 |      
 |      Returns
 |      -------
 |      self : estimator instance
 |          Estimator instance.
 |  
 |  ----------------------------------------------------------------------
 |  Data descriptors inherited from sklearn.base.BaseEstimator:
 |  
 |  __dict__
 |      dictionary for instance variables (if defined)
 |  
 |  __weakref__
 |      list of weak references to the object (if defined)
 |  
 |  ----------------------------------------------------------------------
 |  Methods inherited from sklearn.base._OneToOneFeatureMixin:
 |  
 |  get_feature_names_out(self, input_features=None)
 |      Get output feature names for transformation.
 |      
 |      Parameters
 |      ----------
 |      input_features : array-like of str or None, default=None
 |          Input features.
 |      
 |          - If `input_features` is `None`, then `feature_names_in_` is
 |            used as feature names in. If `feature_names_in_` is not defined,
 |            then names are generated: `[x0, x1, ..., x(n_features_in_)]`.
 |          - If `input_features` is an array-like, then `input_features` must
 |            match `feature_names_in_` if `feature_names_in_` is defined.
 |      
 |      Returns
 |      -------
 |      feature_names_out : ndarray of str objects
 |          Same as input features.

sampling_strategy accepts the following string values:

  • 'minority': resample only the minority class
  • 'not minority': resample all classes but the minority class
  • 'not majority': resample all classes but the majority class
  • 'all': resample all classes
  • 'auto': equivalent to 'not majority'
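As the help text above also notes, sampling_strategy can alternatively be a float or a dict. For this dataset (270 positive vs. 39 negative samples), the same balancing target could be expressed as follows; these forms are illustrative only, and the code below simply uses 'minority':

# Illustrative alternatives (not used below) for the same balancing target
smote_ratio = SMOTE(sampling_strategy=1.0)       # minority/majority ratio of 1.0 after resampling
smote_dict  = SMOTE(sampling_strategy={0: 270})  # request 270 samples for class 0 (the minority)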


from imblearn.over_sampling import SMOTE
smote=SMOTE(sampling_strategy='minority')
X,y=smote.fit_resample(X,y)
sns.countplot(y)


<matplotlib.axes._subplots.AxesSubplot at 0x7fbd565994d0>

[Figure: count plot of the LUNG_CANCER classes after SMOTE oversampling, with both classes balanced]
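To confirm the effect beyond the plot, the class counts after resampling can be printed; with sampling_strategy='minority' both classes should end up at 270 samples (a small sketch):

from collections import Counter

# Verify the class counts after SMOTE oversampling
print(Counter(y))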



元宇宙,融合扩展现实、数字孪生和区块链,是虚实相融的互联网新形态,具有同步、开源、永续和闭环经济特点。人工智能则通过模拟人类智能进行复杂任务处理。在元宇宙中,AI创建并管理虚拟环境,生成内容,提供智能交互,如虚拟助手。元宇宙对AI的需求包括大数据处理、智能决策和个性化服务。两者相互促进,AI推动元宇宙体验提升,元宇宙为AI提供应用舞台,共同驱动科技前进。