
Spark StringIndexer returns an empty dataset

After transforming one particular column, Apache Spark's StringIndexerModel returns an empty dataset. I am using the Adult dataset: http://mlr.cs.umass.edu/ml/datasets/Adult

Step 1: Create the StringIndexerModel and save it locally

StringIndexerModel model = new StringIndexer().setInputCol(column).setOutputCol("label").setHandleInvalid("skip").setStringOrderType("alphabetAsc").fit(originalDataset);
model.write().save(filelocation);

Step 2: Load the indexer model and transform the new dataset

StringIndexerModel model = StringIndexerModel.read().load(filelocation);
newDataset = model.transform(newDataset).drop(column).withColumnRenamed("label", column);

New dataset:

+---+------------+------------+----------+-------------+------+--------------+-------------------+--------------+----------------+-----+--------------+----+-----------------+
|age|capital gain|capital loss|education |education num|fnlgwt|hours per week|marital status     |native country|occupation      |race |relationship  |sex |workclass        |
+---+------------+------------+----------+-------------+------+--------------+-------------------+--------------+----------------+-----+--------------+----+-----------------+
|39 |2174        |0           | Bachelors|13           |77516 |40            | Never-married     | United-States| Adm-clerical   |White| Not-in-family|Male| State-gov       |
|50 |0           |0           | Bachelors|13           |83311 |13            | Married-civ-spouse| United-States| Exec-managerial|White| Husband      |Male| Self-emp-not-inc|
+---+------------+------------+----------+-------------+------+--------------+-------------------+--------------+----------------+-----+--------------+----+-----------------+

Correct output:

Column: education | File Location: localFolder/stringIndex/education
Labels: [ 10th,  11th,  12th,  1st-4th,  5th-6th,  7th-8th,  9th,  Assoc-acdm,  Assoc-voc,  Bachelors,  Doctorate,  HS-grad,  Masters,  Preschool,  Prof-school,  Some-college]
+---+------------+------------+-------------+------+--------------+-------------------+--------------+----------------+-----+--------------+----+-----------------+---------+
|age|capital gain|capital loss|education num|fnlgwt|hours per week|marital status     |native country|occupation      |race |relationship  |sex |workclass        |education|
+---+------------+------------+-------------+------+--------------+-------------------+--------------+----------------+-----+--------------+----+-----------------+---------+
|39 |2174        |0           |13           |77516 |40            | Never-married     | United-States| Adm-clerical   |White| Not-in-family|Male| State-gov       |9.0      |
|50 |0           |0           |13           |83311 |13            | Married-civ-spouse| United-States| Exec-managerial|White| Husband      |Male| Self-emp-not-inc|9.0      |
+---+------------+------------+-------------+------+--------------+-------------------+--------------+----------------+-----+--------------+----+-----------------+---------+

Column: marital status | File Location: localFolder/stringIndex/marital status
Labels: [ Divorced,  Married-AF-spouse,  Married-civ-spouse,  Married-spouse-absent,  Never-married,  Separated,  Widowed]
+---+------------+------------+-------------+------+--------------+--------------+----------------+-----+--------------+----+-----------------+---------+--------------+
|age|capital gain|capital loss|education num|fnlgwt|hours per week|native country|occupation      |race |relationship  |sex |workclass        |education|marital status|
+---+------------+------------+-------------+------+--------------+--------------+----------------+-----+--------------+----+-----------------+---------+--------------+
|39 |2174        |0           |13           |77516 |40            | United-States| Adm-clerical   |White| Not-in-family|Male| State-gov       |9.0      |4.0           |
|50 |0           |0           |13           |83311 |13            | United-States| Exec-managerial|White| Husband      |Male| Self-emp-not-inc|9.0      |2.0           |
+---+------------+------------+-------------+------+--------------+--------------+----------------+-----+--------------+----+-----------------+---------+--------------+

Column: native country | File Location: localFolder/stringIndex/native country
Labels: [ ?,  Cambodia,  Canada,  China,  Columbia,  Cuba,  Dominican-Republic,  Ecuador,  El-Salvador,  England,  France,  Germany,  Greece,  Guatemala,  Haiti,  Holand-Netherlands,  Honduras,  Hong,  Hungary,  India,  Iran,  Ireland,  Italy,  Jamaica,  Japan,  Laos,  Mexico,  Nicaragua,  Outlying-US(Guam-USVI-etc),  Peru,  Philippines,  Poland,  Portugal,  Puerto-Rico,  Scotland,  South,  Taiwan,  Thailand,  Trinadad&Tobago,  United-States,  Vietnam,  Yugoslavia]
+---+------------+------------+-------------+------+--------------+----------------+-----+--------------+----+-----------------+---------+--------------+--------------+
|age|capital gain|capital loss|education num|fnlgwt|hours per week|occupation      |race |relationship  |sex |workclass        |education|marital status|native country|
+---+------------+------------+-------------+------+--------------+----------------+-----+--------------+----+-----------------+---------+--------------+--------------+
|39 |2174        |0           |13           |77516 |40            | Adm-clerical   |White| Not-in-family|Male| State-gov       |9.0      |4.0           |39.0          |
|50 |0           |0           |13           |83311 |13            | Exec-managerial|White| Husband      |Male| Self-emp-not-inc|9.0      |2.0           |39.0          |
+---+------------+------------+-------------+------+--------------+----------------+-----+--------------+----+-----------------+---------+--------------+--------------+

Column: occupation | File Location: localFolder/stringIndex/occupation
Labels: [ ?,  Adm-clerical,  Armed-Forces,  Craft-repair,  Exec-managerial,  Farming-fishing,  Handlers-cleaners,  Machine-op-inspct,  Other-service,  Priv-house-serv,  Prof-specialty,  Protective-serv,  Sales,  Tech-support,  Transport-moving]
+---+------------+------------+-------------+------+--------------+-----+--------------+----+-----------------+---------+--------------+--------------+----------+
|age|capital gain|capital loss|education num|fnlgwt|hours per week|race |relationship  |sex |workclass        |education|marital status|native country|occupation|
+---+------------+------------+-------------+------+--------------+-----+--------------+----+-----------------+---------+--------------+--------------+----------+
|39 |2174        |0           |13           |77516 |40            |White| Not-in-family|Male| State-gov       |9.0      |4.0           |39.0          |1.0       |
|50 |0           |0           |13           |83311 |13            |White| Husband      |Male| Self-emp-not-inc|9.0      |2.0           |39.0          |4.0       |
+---+------------+------------+-------------+------+--------------+-----+--------------+----+-----------------+---------+--------------+--------------+----------+

Incorrect output: every other model works correctly except this one

Column: race | File Location: localFolder/stringIndex/race
Labels: [ Amer-Indian-Eskimo,  Asian-Pac-Islander,  Black,  Other,  White]
+---+------------+------------+-------------+------+--------------+------------+---+---------+---------+--------------+--------------+----------+----+
|age|capital gain|capital loss|education num|fnlgwt|hours per week|relationship|sex|workclass|education|marital status|native country|occupation|race|
+---+------------+------------+-------------+------+--------------+------------+---+---------+---------+--------------+--------------+----------+----+
+---+------------+------------+-------------+------+--------------+------------+---+---------+---------+--------------+--------------+----------+----+

Asked by 几许相思几点泪, 2019-12-29 19:30:51
1 answer
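  • It turned out that the data in the new dataset was incorrect: the values should have a leading space.

    After adding the space, so that the value reads ' White', I got the correct output.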

    2019-12-29 19:31:03
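
To illustrate why a missing leading space empties the result, here is a minimal plain-Java sketch. It is not Spark's implementation; the class `SkipIndexerSketch` and the method `indexWithSkip` are hypothetical names that only model the documented behaviour of `setHandleInvalid("skip")`, where rows whose value does not exactly match a fitted label are dropped. The label list is copied from the race model output in the question, where every label starts with a space.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical, simplified model of StringIndexerModel.transform with handleInvalid = "skip".
class SkipIndexerSketch {

    // A value is indexed only if it matches a fitted label exactly (whitespace included);
    // any other value causes its row to be dropped from the output.
    static List<Double> indexWithSkip(List<String> labels, List<String> values) {
        List<Double> indexed = new ArrayList<>();
        for (String value : values) {
            int position = labels.indexOf(value);
            if (position >= 0) {
                indexed.add((double) position);
            }
        }
        return indexed;
    }

    public static void main(String[] args) {
        // Labels as printed for the race model; note the leading space in each one.
        List<String> labels = Arrays.asList(
                " Amer-Indian-Eskimo", " Asian-Pac-Islander", " Black", " Other", " White");

        // Values as they appear in the new dataset (no leading space): every row is skipped.
        System.out.println(indexWithSkip(labels, Arrays.asList("White", "White"))); // []

        // Values with the leading space restored: both rows are indexed.
        System.out.println(indexWithSkip(labels, Arrays.asList(" White", " White"))); // [4.0, 4.0]
    }
}
```

In a real pipeline, an alternative to adding the space would be to normalize whitespace on both sides, for example by applying `org.apache.spark.sql.functions.trim` to the column before both `fit` and `transform`, so that the fitted labels and the incoming values agree. Whether that suits the data is an assumption to verify, and a model saved with untrimmed labels would need to be refitted.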