《Web安全之机器学习入门》 (Introduction to Machine Learning for Web Security), 3.3 Feature Extraction - Alibaba Cloud Developer Community

2017-09-08

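Summary: This section is excerpted from Chapter 3, Section 3.3 of the book 《Web安全之机器学习入门》 (Introduction to Machine Learning for Web Security), published by Huazhang Press (华章出版社) and written by Liu Yan (刘焱). More chapters are available through the Yunqi Community (云栖社区) public account “华章计算机”.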

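3.3 Feature Extraction

In machine learning, feature extraction is widely regarded as manual labor. Some people aptly call it "feature engineering", which hints at how much work is involved. The two most common cases are extracting numeric features and extracting text features.

3.3.1 Numeric Feature Extraction

Numeric values can be used as features directly. In a multidimensional feature vector, however, a feature whose value range is especially large can easily drown out the influence of the other features on the result. In that case the numeric features need to be preprocessed. Common preprocessing methods include the following.
1. Standardization: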

>>> from sklearn import preprocessing
>>> import numpy as np
>>> X = np.array([[ 1., -1.,  2.],
...               [ 2.,  0.,  0.],
...               [ 0.,  1., -1.]])
>>> X_scaled = preprocessing.scale(X)
>>> X_scaled
array([[ 0.  ..., -1.22...,  1.33...],
       [ 1.22...,  0.  ..., -0.26...],
       [-1.22...,  1.22..., -1.06...]])
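As a rough check on what the standardization above computes, the following NumPy-only sketch subtracts each column's mean and divides by its population standard deviation. The helper name `standardize` is made up for illustration and is not part of scikit-learn; the sketch also assumes no column is constant, since a zero standard deviation would cause a division by zero.

```python
import numpy as np

def standardize(X):
    """Column-wise standardization (hypothetical helper, not a scikit-learn API)."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)  # per-feature mean
    std = X.std(axis=0)    # per-feature population standard deviation (ddof=0)
    return (X - mean) / std

X = np.array([[1., -1.,  2.],
              [2.,  0.,  0.],
              [0.,  1., -1.]])
X_std = standardize(X)
print(X_std)
print(X_std.mean(axis=0))  # close to 0 for every column
print(X_std.std(axis=0))   # close to 1 for every column
```

After this transformation every feature has zero mean and unit variance, so no single feature dominates merely because of its scale.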

2. Normalization (scaling each sample to unit L2 norm):

>>> X = [[ 1., -1.,  2.],
...      [ 2.,  0.,  0.],
...      [ 0.,  1., -1.]]
>>> X_normalized = preprocessing.normalize(X, norm='l2')
>>> X_normalized
array([[ 0.40..., -0.40...,  0.81...],
       [ 1.  ...,  0.  ...,  0.  ...],
       [ 0.  ...,  0.70..., -0.70...]])
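The row-wise normalization above can be sketched by hand: each sample (row) is divided by its own Euclidean length. The helper `l2_normalize` is a hypothetical name used only for this illustration, and leaving all-zero rows unchanged is an assumption of the sketch.

```python
import numpy as np

def l2_normalize(X):
    """Scale every row to unit L2 norm (hypothetical helper)."""
    X = np.asarray(X, dtype=float)
    norms = np.linalg.norm(X, axis=1, keepdims=True)  # one norm per row
    norms[norms == 0] = 1.0  # assumption: leave all-zero rows unchanged
    return X / norms

X = [[1., -1.,  2.],
     [2.,  0.,  0.],
     [0.,  1., -1.]]
X_unit = l2_normalize(X)
print(X_unit)
print(np.linalg.norm(X_unit, axis=1))  # every row now has length 1
```

Unlike standardization, which works column by column, this operation works row by row, so it changes the scale of each sample rather than of each feature.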

3. Min-max scaling (rescaling each feature to the range [0, 1]):

>>> X_train = np.array([[ 1., -1.,  2.],
...                     [ 2.,  0.,  0.],
...                     [ 0.,  1., -1.]])
...
>>> min_max_scaler = preprocessing.MinMaxScaler()
>>> X_train_minmax = min_max_scaler.fit_transform(X_train)
>>> X_train_minmax
array([[ 0.5       ,  0.        ,  1.        ],
       [ 1.        ,  0.5       ,  0.33333333],
       [ 0.        ,  1.        ,  0.        ]])
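A minimal NumPy sketch of the same min-max idea follows. It also shows why the minimum and range are kept from the training data: new samples are rescaled with the training statistics, not their own. The array `X_new` is a made-up sample for illustration.

```python
import numpy as np

X_train = np.array([[1., -1.,  2.],
                    [2.,  0.,  0.],
                    [0.,  1., -1.]])

# Statistics learned from the training data only
col_min = X_train.min(axis=0)
col_range = X_train.max(axis=0) - col_min

X_train_scaled = (X_train - col_min) / col_range
print(X_train_scaled)

# A hypothetical new sample, rescaled with the training statistics
X_new = np.array([[-3., -1., 4.]])
X_new_scaled = (X_new - col_min) / col_range
print(X_new_scaled)
```

Values in new data that lie outside the range seen during training end up outside [0, 1], which is expected behavior for this kind of scaling.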

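3.3.2 Text Feature Extraction

Extracting features from text data is considerably more complex than from numeric data. In essence, the text is split into words and each distinct word becomes a new feature. Take a hash (dictionary) structure as an example: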

>>> measurements = [
...     {'city': 'Dubai', 'temperature': 33.},
...     {'city': 'London', 'temperature': 12.},
...     {'city': 'San Fransisco', 'temperature': 18.},
... ]

The key city takes several values, "Dubai", "London", and "San Fransisco", and each value simply becomes a new feature. The key temperature is numeric and can be used as a feature directly.

>>> from sklearn.feature_extraction import DictVectorizer
>>> vec = DictVectorizer()
>>> vec.fit_transform(measurements).toarray()
array([[  1.,   0.,   0.,  33.],
       [  0.,   1.,   0.,  12.],
       [  0.,   0.,   1.,  18.]])
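>>> # get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
>>> vec.get_feature_names_out().tolist()
['city=Dubai', 'city=London', 'city=San Fransisco', 'temperature']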
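To make the mapping concrete, here is a pure-Python sketch of the same idea: string values become one-hot "key=value" features and numeric values pass through unchanged. The function `dict_vectorize` is a hypothetical helper written for this illustration and is not how scikit-learn implements `DictVectorizer` internally.

```python
def dict_vectorize(records):
    """One-hot encode string values, keep numeric values (hypothetical helper)."""
    names = set()
    for rec in records:
        for key, value in rec.items():
            names.add(f"{key}={value}" if isinstance(value, str) else key)
    feature_names = sorted(names)
    index = {name: i for i, name in enumerate(feature_names)}

    rows = []
    for rec in records:
        row = [0.0] * len(feature_names)
        for key, value in rec.items():
            if isinstance(value, str):
                row[index[f"{key}={value}"]] = 1.0  # categorical value: set its flag
            else:
                row[index[key]] = float(value)      # numeric value: copy as-is
        rows.append(row)
    return feature_names, rows

measurements = [
    {'city': 'Dubai', 'temperature': 33.},
    {'city': 'London', 'temperature': 12.},
    {'city': 'San Fransisco', 'temperature': 18.},
]
feature_names, rows = dict_vectorize(measurements)
print(feature_names)
for row in rows:
    print(row)
```

Each distinct city gets its own column, so the encoding does not impose any artificial ordering between cities.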

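Two models are especially important in text feature extraction.
Set-of-words model: a set made up of words. Each element of a set is unique, so each word appears in the word set exactly once.
Bag-of-words model: if a word appears more than once in a document, the number of occurrences (its frequency) is counted as well.
The essential difference is that the bag of words adds a frequency dimension on top of the set of words: the set of words only records whether a word is present, while the bag of words also records how many times it appears.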
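The difference between the two models can be seen in a few lines of standard-library Python. This is only an illustrative sketch; the tokenization rule used here (lowercase, words of at least two characters) is an assumption chosen for the example.

```python
import re
from collections import Counter

text = "This is the second second document."

# Assumed tokenization: lowercase, keep words of two or more characters
tokens = re.findall(r"\b\w\w+\b", text.lower())

set_of_words = set(tokens)      # presence only
bag_of_words = Counter(tokens)  # presence plus counts

print(sorted(set_of_words))
print(bag_of_words["second"])   # the repeated word is counted twice
```

The set keeps one entry for "second", while the counter remembers that it occurred twice.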
Suppose we want to turn an article into features. The most common approach is the bag of words.
Import the relevant library:

>>> from sklearn.feature_extraction.text import CountVectorizer

Instantiate the vectorizer (tokenizing) object:

>>> vectorizer = CountVectorizer(min_df=1)
>>> vectorizer                     
CountVectorizer(analyzer=...'word', binary=False, decode_error=...'strict',
    dtype=<... 'numpy.int64'>, encoding=...'utf-8', input=...'content',
    lowercase=True, max_df=1.0, max_features=None, min_df=1,
    ngram_range=(1, 1), preprocessor=None, stop_words=None,
    strip_accents=None, token_pattern=...'(?u)\\b\\w\\w+\\b',
    tokenizer=None, vocabulary=None)

Apply bag-of-words processing to the text:

>>> corpus = [
...     'This is the first document.',
...     'This is the second second document.',
...     'And the third one.',
...     'Is this the first document?',
... ]
>>> X = vectorizer.fit_transform(corpus)
>>> X                              
<4x9 sparse matrix of type '<... 'numpy.int64'>'
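    with 19 stored elements in Compressed Sparse ... format>

Get the corresponding feature names:

>>> # get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
>>> vectorizer.get_feature_names_out().tolist() == (
...     ['and', 'document', 'first', 'is', 'one',
...      'second', 'the', 'third', 'this'])
True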

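Retrieve the bag-of-words data. At this point the bag-of-words transformation is complete. But how can other text in the program be vectorized with the features of the existing bag of words?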

>>> X.toarray()
array([[0, 1, 1, 1, 0, 0, 1, 0, 1],
       [0, 1, 0, 1, 0, 2, 1, 0, 1],
       [1, 0, 0, 0, 1, 0, 1, 1, 0],
       [0, 1, 1, 1, 0, 0, 1, 0, 1]]...)

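We call the feature space of the bag of words the vocabulary:

>>> vocabulary = vectorizer.vocabulary_

When applying bag-of-words processing to other text, the existing vocabulary can be used directly: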

>>> new_vectorizer = CountVectorizer(min_df=1, vocabulary=vocabulary)
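The effect of reusing a fixed vocabulary can be sketched in plain Python: words already in the vocabulary are counted in their existing columns, and words outside it are ignored. The helper `vectorize_with_vocabulary` and the new sentence are made up for illustration; the vocabulary is written out by hand from the feature names listed earlier.

```python
import re

# Word-to-column mapping, matching the feature names obtained earlier
vocabulary = {'and': 0, 'document': 1, 'first': 2, 'is': 3, 'one': 4,
              'second': 5, 'the': 6, 'third': 7, 'this': 8}

def vectorize_with_vocabulary(doc, vocabulary):
    """Count only the words that already exist in the vocabulary (hypothetical helper)."""
    row = [0] * len(vocabulary)
    for word in re.findall(r"\b\w\w+\b", doc.lower()):
        if word in vocabulary:  # unseen words such as "new" are dropped
            row[vocabulary[word]] += 1
    return row

new_row = vectorize_with_vocabulary("This is a new second document.", vocabulary)
print(new_row)
```

Because the column layout is fixed by the vocabulary, vectors produced for new text line up with the training vectors and can be fed to the same model.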

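TensorFlow has a similar implementation:

# Requires TensorFlow 1.x (tf.contrib was removed in TensorFlow 2.x)
import numpy as np
from tensorflow.contrib import learn

MAX_DOCUMENT_LENGTH = 100
vocab_processor = learn.preprocessing.VocabularyProcessor(MAX_DOCUMENT_LENGTH)
# x_train and x_test are assumed to be iterables of raw text documents
x_train = np.array(list(vocab_processor.fit_transform(x_train)))
x_test = np.array(list(vocab_processor.transform(x_test)))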
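For readers without TensorFlow 1.x at hand, the following pure-Python sketch shows roughly the kind of output such a vocabulary processor produces: each distinct word gets an integer id, and each document becomes a fixed-length id sequence, padded or truncated to a maximum length. Whitespace tokenization, the use of id 0 for both padding and unknown words, and the sample documents are simplifying assumptions of this sketch, not a description of the library's internals.

```python
def fit_vocabulary(docs):
    """Assign an integer id to every distinct word (id 0 is reserved)."""
    vocab = {"<UNK>": 0}
    for doc in docs:
        for word in doc.split():
            vocab.setdefault(word, len(vocab))
    return vocab

def to_ids(doc, vocab, max_len):
    """Map words to ids, then truncate or zero-pad to max_len."""
    ids = [vocab.get(word, 0) for word in doc.split()][:max_len]
    return ids + [0] * (max_len - len(ids))

MAX_LEN = 5  # small value so the padding is easy to see
train_docs = ["select from users", "union select password"]  # made-up samples
vocab = fit_vocabulary(train_docs)

x_train_ids = [to_ids(doc, vocab, MAX_LEN) for doc in train_docs]
x_test_ids = [to_ids("select secret from users", vocab, MAX_LEN)]
print(x_train_ids)
print(x_test_ids)
```

Unlike the bag of words, this representation keeps word order and does not store counts, which is why it suits sequence models.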

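3.3.3 Reading Data

In day-to-day data processing, CSV is the most common format: each line of the file records one vector, and the last column holds the label. TensorFlow provides a very convenient way to read a dataset from a CSV file.
Load the required libraries:

import tensorflow as tf
import numpy as np

Read the data from the CSV file: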

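# Requires TensorFlow 1.x (tf.contrib was removed in TensorFlow 2.x)
training_set = tf.contrib.learn.datasets.base.load_csv_with_header(
    filename="iris_training.csv",
    target_dtype=int,  # np.int was only an alias for the built-in int
    features_dtype=np.float32)
feature_columns = [tf.contrib.layers.real_valued_column("", dimension=4)]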

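The parameters are defined as follows:
filename: the file name;
target_dtype: the data type of the labels;
features_dtype: the data type of the features.
The features and labels of the dataset are accessed as follows: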

x=training_set.data
y=training_set.target
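The same layout (one header line, one vector per row, label in the last column) can also be read without TensorFlow. The sketch below uses the standard `csv` module and NumPy; the function `load_csv_last_column_label`, the column names, and the sample rows are all made up for illustration, and the header is simply skipped.

```python
import csv
import io

import numpy as np

# Made-up sample content standing in for a CSV file on disk
csv_text = """f1,f2,f3,f4,label
6.4,2.8,5.6,2.2,2
5.0,2.3,3.3,1.0,1
4.9,3.1,1.5,0.1,0
"""

def load_csv_last_column_label(file_obj):
    """Return (features, labels); the last column is the label (hypothetical helper)."""
    reader = csv.reader(file_obj)
    next(reader)  # skip the header line
    data, target = [], []
    for row in reader:
        if not row:  # ignore blank lines
            continue
        data.append([float(value) for value in row[:-1]])
        target.append(int(row[-1]))
    return np.asarray(data, dtype=np.float32), np.asarray(target, dtype=int)

x, y = load_csv_last_column_label(io.StringIO(csv_text))
print(x.shape)  # (number of rows, number of feature columns)
print(y)
```

For a real file, `open("path/to/file.csv", newline="")` can replace the `io.StringIO` object.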
