Table of Contents

Text classification on the news dataset with pure statistics, kNN, naive Bayes (Gaussian / multivariate Bernoulli / multinomial), linear discriminant analysis (LDA), and the perceptron

Related Articles

ML之NB: Text classification on the news dataset with pure statistics, kNN, naive Bayes (Gaussian / multivariate Bernoulli / multinomial), linear discriminant analysis (LDA), and the perceptron
ML之NB: Text classification on the news dataset with pure statistics, kNN, naive Bayes (Gaussian / multivariate Bernoulli / multinomial), linear discriminant analysis (LDA), and the perceptron -- implementation

Text Classification on the news Dataset with Pure Statistics, kNN, Naive Bayes (Gaussian / Multivariate Bernoulli / Multinomial), Linear Discriminant Analysis (LDA), and the Perceptron
Design Approach
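Reconstructed from the run log in the Output section, the design proceeds in stages: load news.csv with pandas, keep only Chinese characters, segment each article with jieba using the dict.txt.big prefix dictionary, build a gensim Dictionary and bag-of-words corpus with doc2bow, weight it with TF-IDF, map each article's leading tag to one of six category labels, and feed the resulting vectors to the classifiers compared later. The following is a minimal sketch of the preprocessing stage only; the paths and column names mirror the printed log, while the tokenize helper, the stop-word handling, and the tag parsing are assumptions, not the author's code.

# Sketch of the preprocessing pipeline inferred from the run log; tokenize()
# and the tag parsing are illustrative assumptions, not the article's code.
import ast
import re
import jieba
import pandas as pd
from gensim import corpora, models

data_frame = pd.read_csv('news.csv')               # columns: Unnamed: 0, content, id, tags, time, title
print(data_frame.info())                           # info() prints and returns None, as in the log

jieba.set_dictionary('jieba_dict/dict.txt.big')    # big prefix dictionary, as in the log
chinese_pattern = re.compile(r'[\u4e00-\u9fff]+')  # keep Chinese characters only

def tokenize(text):
    """Extract Chinese runs and segment them with jieba.
    (The article also applies a stop-word list; omitted here.)"""
    runs = chinese_pattern.findall(str(text))
    return [word for run in runs for word in jieba.cut(run)]

data_frame['doc_words'] = data_frame['content'].apply(tokenize)

# Bag-of-words corpus and TF-IDF weighting (the log reports 46351 unique tokens)
dictionary = corpora.Dictionary(data_frame['doc_words'])
data_frame['corpus'] = data_frame['doc_words'].apply(dictionary.doc2bow)
tfidf_model = models.TfidfModel(list(data_frame['corpus']))
data_frame['tfidf'] = data_frame['corpus'].apply(lambda bow: tfidf_model[bow])

# Map the leading tag to one of six categories, matching the mapping in the log
category_index = {'养生': 0, '科技': 1, '财经': 2, '游戏': 3, '育儿': 4, '汽车': 5}
data_frame['keyword_index'] = data_frame['tags'].apply(
    lambda tags: category_index[ast.literal_eval(tags)[0]])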
Output
Dataset used in the code: the news dataset news.csv (a dataset commonly used for NLP work in machine learning), together with the jieba_dict dictionary, stop-word lists, and related files (available as a CSDN download).
F:\Program Files\Python\Python36\lib\site-packages\gensim\utils.py:1209: UserWarning: detected Windows; aliasing chunkize to chunkize_serial
  warnings.warn("detected Windows; aliasing chunkize to chunkize_serial")
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1293 entries, 0 to 1292
Data columns (total 6 columns):
 #   Column      Non-Null Count  Dtype
---  ------      --------------  -----
 0   Unnamed: 0  1293 non-null   int64
 1   content     1292 non-null   object
 2   id          1293 non-null   int64
 3   tags        1293 non-null   object
 4   time        1293 non-null   object
 5   title       1293 non-null   object
dtypes: int64(2), object(4)
memory usage: 60.7+ KB
None

                    id                                  tags  \
0  6428905748545732865  ['财经', '白洋淀', '城市规划', '徐匡迪', '太行山']
1  6428954136200855810  ['财经', '碧桂园', '万科集团', '投资', '广州恒大']
2  6420576443738784002   ['财经', '自行车', '凤凰', '王朝阳', '汽车展览']
3  6429007290541031681  ['财经', '银行', '工商银行', '兴业银行', '交通银行']
4  6397481672254619905    ['财经', '小吃', '装修', '市场营销', '手工艺']

                  time                     title
0  2017-06-07 22:52:55  雄安新区规划“骨架”敲定,方案有望9月底出炉
1  2017-06-08 08:01:13       “红五月”不红 房企资金链压力攀升
2  2017-05-16 12:03:00     凤凰自行车总裁:共享单车把我们打懵了
3  2017-06-08 07:00:00    25家银行分红季派出3536亿“大红包”
4  2017-03-15 07:03:22     五万以下的小本餐饮项目,卷饼赚钱最稳
chinese_pattern re.compile('[\\u4e00-\\u9fff]+')
Building prefix dict from F:\File_Jupyter\实用代码\naive_bayes(简单贝叶斯)\jieba_dict\dict.txt.big ...
Loading model from cache C:\Users\niu\AppData\Local\Temp\jieba.ue3752d4e13420d2dc6b66831a5a4ab13.cache
Loading model cost 1.326 seconds.
Prefix dict has been built succesfully.
dictionary
<class 'gensim.corpora.dictionary.Dictionary'> Dictionary(46351 unique tokens: ['一个', '一个个', '一举一动', '一些', '一体']...)
<class 'method'> <bound method Dictionary.doc2bow of <gensim.corpora.dictionary.Dictionary object at 0x000001BDC62291D0>>
F:\Program Files\Python\Python36\lib\site-packages\numpy\core\_asarray.py:83: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray
  return array(a, dtype, copy=False, order=order)


                                              corpus  \
0  [(0, 6), (1, 1), (2, 1), (3, 3), (4, 2), (5, 2...
1  [(0, 1), (3, 3), (13, 1), (17, 1), (41, 1), (5...
2  [(15, 1), (53, 1), (167, 1), (262, 1), (396, 1...

                                               tfidf
0  [(0, 0.005554342859788116), (1, 0.007470250835...
1  [(0, 0.002081356679198299), (3, 0.012288034179...
2  [(15, 0.057457146244872616), (53, 0.0543395377...
after abs 4.7683716e-07
foo: (1293, 1293)
dis2TSNE_Visual: (1293, 2)
{'养生': 0, '科技': 1, '财经': 2, '游戏': 3, '育儿': 4, '汽车': 5}
data_frame.keyword_index: 1    379
2    287
5    283
4    148
3    141
0     55
Name: keyword_index, dtype: int64

                    id                                  tags  \
0  6428905748545732865  ['财经', '白洋淀', '城市规划', '徐匡迪', '太行山']
1  6428954136200855810  ['财经', '碧桂园', '万科集团', '投资', '广州恒大']
2  6420576443738784002   ['财经', '自行车', '凤凰', '王朝阳', '汽车展览']

                                           doc_words  \
0  [牵动人心, 雄安, 新区, 规划, 细节, 内容, 出台, 时间表, 敲定, 日前, 北京...
1  [去年, 以来, 多个, 城市, 先后, 发布, 多项, 楼市, 调控, 政策, 限购, 限...
2  [今年, 中国, 国际, 自行车, 展上, 上海, 凤凰, 自行车, 总裁, 王, 朝阳, ...

                                              corpus  \
0  [(0, 6), (1, 1), (2, 1), (3, 3), (4, 2), (5, 2...
1  [(0, 1), (3, 3), (13, 1), (17, 1), (41, 1), (5...
2  [(15, 1), (53, 1), (167, 1), (262, 1), (396, 1...

                                               tfidf  visual01  visual02  \
0  [(0, 0.005554342859788116), (1, 0.007470250835...
-65.903542 -14.433964 82. 1 [(0, 0.002081356679198299), (3, 0.012288034179... -29.659267 -14.811647 83. 2 [(15, 0.057457146244872616), (53, 0.0543395377... -22.118195 -48.148167 84. 85. keyword_index 86. 0 2 87. 1 2 88. 2 2 89. Childcare,label_category_ID_pos.tfidf)[:20]: ['孩子', '家长', '教育', '学习', '男孩子', '成绩', '爸爸', '分享', '帮助', '方法', '小学', '数学', '交流', '男孩', '妈妈', '成长', '父母', '懂', '免费', '翼航'] 90. Childcare,label_category_ID_neg.tfidf)[:20]: [] 91. train_index MatrixSimilarity<646 docs, 46329 features> 92. hot_words shape: 6 300 93. {0: {1536, 7681, 17410, 17411, 17415, 6664, 17420, 15886, 4623, 17935, 4625, 5139, 4631, 17916, 17437, 544, 16422, 5671, 1065, 4650, 4651, 4653, 4690, 16943, 4657, 17458, 15921, 51, 7222, 17464, 17465, 10299, 15932, 64, 6209, 66, 17474, 4680, 8264, 8266, 40008, 6730, 8273, 6738, 5203, 5206, 18005, 15958, 597, 15960, 18009, 7258, 4697, 7260, 16989, 3674, 91, 87, 16993, 18020, 616, 4714, 5228, 40044, 1646, 4720, 3185, 15986, 34928, 5236, 113, 34936, 6777, 126, 15999, 127, 4737, 40067, 5252, 643, 4739, 13444, 8840, 1157, 133, 4749, 3219, 10388, 17562, 5278, 46239, 5287, 3751, 167, 680, 6827, 4784, 16048, 16050, 180, 46260, 16054, 6839, 4792, 2743, 4789, 17083, 16060, 4790, 16062, 43200, 5315, 46276, 46279, 17098, 6860, 5836, 16081, 43219, 1237, 1750, 15575, 8921, 2266, 6877, 12511, 12512, 21216, 226, 4834, 6884, 16101, 4838, 742, 2280, 2281, 227, 7915, 6886, 6893, 2798, 6894, 5870, 4849, 242, 1779, 4852, 21215, 44791, 4864, 3329, 258, 4865, 4866, 44805, 4877, 21264, 4882, 274, 8986, 8987, 796, 32029, 4382, 21277, 4896, 1825, 801, 3363, 36644, 1830, 4393, 36138, 303, 815, 4401, 12594, 21299, 7986, 820, 310, 1337, 21307, 4411, 317, 33598, 5953, 17730, 5954, 10050, 17733, 17734, 25927, 21320, 17739, 4939, 21324, 4942, 33615, 6885, 16210, 6071, 18261, 5976, 860, 16740, 16745, 2922, 4969, 17263, 6512, 33649, 16242, 2419, 17775, 373, 1398, 880, 1916, 17276, 16255, 1920, 43394, 3974, 4999, 396, 8080, 16788, 18325, 1942, 16279, 1433, 43418, 36252, 17311, 43425, 16802, 7585, 15959, 7594, 36268, 4525, 7597, 5551, 6063, 36272, 36275, 4533, 16309, 18358, 36280, 1465, 441, 7611, 16825, 16829, 4538, 2488, 2495, 8129, 4545, 4547, 16836, 4549, 7621, 1484, 1997, 11214, 1999, 16846, 16847, 4563, 7636, 14293, 7638, 4567, 16855, 17369, 16861, 478, 16351, 18400, 17377, 993, 9699, 5085, 6111, 7645, 6119, 6124, 17903, 1011, 4597, 6646, 16376, 6138, 16891, 16892, 7165, 4606}, 1: {0, 3, 11785, 2569, 32779, 9227, 526, 21519, 530, 4116, 533, 11805, 2590, 2591, 3105, 7203, 1571, 8740, 1574, 12836, 1062, 1577, 2553, 4654, 1071, 2094, 30257, 51, 30260, 53, 28213, 24633, 1082, 1087, 68, 8779, 78, 12367, 11859, 2647, 91, 13916, 13917, 15455, 608, 9825, 1634, 12387, 13412, 613, 12391, 28267, 12396, 109, 9836, 12399, 11884, 12401, 12400, 12403, 627, 117, 629, 9847, 628, 17020, 637, 9855, 639, 12418, 643, 1668, 133, 3715, 14470, 1160, 12424, 11912, 9867, 33420, 10376, 655, 12433, 148, 150, 3735, 1176, 12440, 154, 21659, 1180, 3742, 10399, 11936, 1185, 31904, 675, 13472, 167, 1704, 7337, 11946, 171, 172, 8876, 8878, 2734, 1200, 1709, 2226, 8877, 180, 1155, 697, 12475, 189, 8894, 1215, 1218, 4291, 708, 709, 3271, 2760, 6354, 2771, 1748, 213, 3798, 727, 730, 20187, 44767, 225, 2786, 2787, 13028, 1765, 1254, 13543, 26344, 740, 11497, 1771, 3819, 13549, 11502, 751, 1775, 752, 242, 21743, 12524, 759, 11511, 2809, 2812, 35581, 257, 8962, 771, 259, 15623, 1288, 3849, 12048, 1810, 786, 788, 3862, 793, 7450, 798, 24862, 7458, 12579, 31524, 31523, 7459, 1322, 810, 25391, 12081, 1329, 820, 3386, 1850, 9023, 319, 
835, 9029, 325, 4424, 330, 12107, 13134, 846, 3409, 3924, 1878, 854, 344, 11609, 5978, 1883, 11612, 343, 11615, 358, 4457, 362, 875, 1385, 1900, 4462, 3439, 12144, 369, 3438, 1396, 38773, 28025, 2428, 13305, 13183, 12161, 12674, 1922, 34690, 2438, 1926, 13193, 907, 9100, 911, 13204, 1431, 10135, 2456, 44956, 925, 413, 32670, 1952, 928, 23455, 5540, 1956, 1447, 12200, 1448, 1452, 8109, 12205, 1965, 9651, 2486, 5559, 1464, 956, 1982, 959, 3522, 12235, 976, 3025, 10194, 1491, 12244, 465, 30675, 5585, 472, 470, 10714, 475, 3027, 478, 1503, 479, 5089, 483, 2532, 995, 9190, 5607, 1512, 1513, 9703, 10728, 494, 1518, 1520, 2545, 1007, 1524, 501, 503, 1017, 1534}, 2: {0, 3, 520, 1547, 12300, 2062, 3599, 1040, 26641, 18, 25616, 2577, 13846, 2583, 4121, 25114, 1051, 1052, 25629, 1054, 1567, 2591, 3105, 3616, 4126, 1060, 4125, 1062, 1063, 26663, 1577, 13863, 1066, 1580, 45, 1071, 51, 3123, 53, 2614, 3125, 1082, 2622, 66, 2627, 11843, 1093, 1606, 1605, 3651, 3146, 1100, 26701, 1614, 1102, 592, 3577, 35410, 2639, 2644, 3159, 25688, 1626, 91, 3162, 1119, 608, 21089, 1634, 102, 2662, 31848, 2665, 11881, 27242, 12907, 1131, 1132, 15388, 2672, 3185, 1138, 627, 43124, 2675, 113, 1657, 2682, 3194, 127, 3715, 1668, 133, 3717, 135, 2696, 3209, 1162, 1158, 1676, 2701, 11916, 1167, 138, 1169, 148, 2710, 1174, 152, 1177, 22167, 26779, 21659, 157, 158, 1183, 30880, 1185, 26784, 2209, 2724, 3232, 672, 167, 4256, 8876, 685, 4269, 1202, 2226, 691, 1205, 3253, 1207, 2231, 2242, 4291, 14026, 27340, 1740, 1231, 14032, 24273, 3284, 1749, 213, 727, 217, 730, 2266, 14044, 1246, 1248, 225, 1254, 742, 745, 3819, 14060, 12013, 750, 1775, 242, 1780, 1268, 759, 760, 249, 33536, 1281, 261, 262, 2311, 1290, 267, 37132, 5902, 1810, 7958, 39191, 280, 793, 43813, 1318, 807, 295, 45354, 1324, 28461, 1838, 28462, 815, 1329, 820, 1333, 317, 2366, 39743, 832, 2365, 45378, 835, 330, 1356, 845, 334, 1359, 4433, 4438, 854, 14168, 1370, 1883, 1372, 1371, 860, 863, 3935, 3937, 1378, 11618, 3426, 870, 358, 3942, 361, 874, 362, 875, 28010, 3438, 2416, 369, 880, 14196, 886, 4472, 1403, 894, 895, 2432, 385, 904, 905, 27528, 907, 909, 911, 1431, 409, 1433, 925, 1950, 415, 928, 413, 13731, 3494, 20902, 937, 1452, 942, 1968, 1973, 1464, 1977, 956, 34240, 3009, 32706, 14278, 3015, 456, 1993, 973, 975, 976, 465, 466, 1491, 14290, 2512, 1494, 472, 475, 480, 3554, 995, 2532, 3048, 1513, 23529, 3564, 494, 498, 500, 501, 503, 1017, 3070}, 3: {1536, 0, 10242, 3, 37889, 1029, 10248, 2569, 9740, 9745, 10770, 17938, 2577, 10257, 9238, 3094, 9752, 9751, 9754, 30235, 9243, 18425, 9246, 2590, 24096, 9249, 9250, 9251, 4643, 10272, 9252, 5666, 3616, 3625, 4133, 4136, 1071, 9264, 4657, 51, 9267, 22583, 10808, 40504, 10304, 6210, 3650, 37444, 68, 9284, 6731, 9293, 31823, 2133, 9303, 601, 91, 43615, 608, 9314, 10338, 25709, 1646, 10349, 6257, 7794, 27763, 11381, 9337, 7801, 637, 3709, 639, 11391, 9345, 7299, 3715, 1668, 41606, 11401, 11402, 4233, 9868, 10893, 142, 5259, 9872, 25744, 25741, 148, 10389, 34455, 3735, 8345, 8857, 154, 10396, 1178, 7839, 10399, 8554, 1704, 10409, 9900, 10412, 2734, 14512, 10416, 7858, 9394, 9904, 6325, 2232, 1721, 38589, 8894, 6336, 1220, 9925, 11461, 3271, 9420, 719, 14544, 2773, 3286, 3287, 214, 20187, 9438, 26335, 6048, 13534, 226, 3811, 19172, 1766, 2280, 36585, 14575, 2801, 9457, 10993, 10485, 23797, 759, 27896, 5882, 8443, 23803, 1790, 767, 8962, 9476, 7433, 6924, 2316, 2318, 3853, 14608, 4371, 9494, 8983, 6425, 793, 362, 6433, 7458, 2339, 810, 1835, 8493, 6447, 1329, 28466, 44855, 9527, 1338, 10044, 317, 3390, 10047, 41280, 31554, 
2372, 9029, 11592, 9547, 3916, 9042, 10066, 3925, 343, 10072, 5978, 860, 8030, 10079, 10593, 9572, 2916, 9061, 3430, 6501, 4969, 10089, 30571, 10603, 11117, 9582, 10607, 6505, 14193, 28529, 14707, 7197, 369, 11639, 23929, 894, 1919, 3459, 11652, 2438, 10631, 907, 10642, 9109, 2454, 14743, 2456, 29594, 11164, 6559, 9631, 3999, 1951, 14754, 14756, 31653, 9638, 31654, 33704, 45984, 3500, 31661, 1453, 1455, 9645, 9649, 41394, 9651, 9652, 10165, 30718, 2999, 31672, 1982, 9662, 44483, 11205, 2505, 5581, 10704, 465, 977, 31699, 9172, 4053, 9174, 31703, 4567, 470, 10714, 475, 5076, 478, 480, 23008, 9186, 30692, 9190, 9703, 10216, 491, 30699, 1005, 2542, 31726, 1007, 494, 25586, 10222, 18417, 10736, 8178, 3064, 1529, 509, 1534}, 4: {0, 5121, 4098, 3, 3078, 7175, 1543, 1545, 22027, 5131, 14, 4623, 4625, 22547, 533, 2588, 2590, 1570, 4643, 2597, 5669, 5159, 6183, 2602, 45, 6702, 18937, 5168, 5169, 48, 4657, 3063, 51, 1590, 12343, 5686, 5689, 2105, 1586, 5175, 5694, 6721, 68, 2630, 29767, 29778, 4692, 2133, 5204, 6740, 601, 7258, 91, 5722, 5214, 4703, 608, 3679, 2143, 101, 6758, 5224, 616, 7277, 2158, 4723, 5236, 6267, 1660, 637, 639, 4737, 4739, 5252, 133, 1668, 4606, 23688, 5768, 17035, 2188, 5772, 38034, 5779, 3220, 6805, 2199, 1688, 5273, 154, 155, 1694, 4767, 5280, 5278, 5284, 1191, 1704, 167, 3754, 5802, 5290, 3751, 3247, 5296, 3257, 5818, 5823, 3265, 708, 5318, 5830, 4294, 1738, 5841, 5330, 4825, 4316, 734, 6369, 5349, 4838, 4326, 2280, 4329, 46315, 6380, 29660, 44269, 5871, 5873, 242, 7927, 759, 760, 2812, 1277, 8448, 3329, 4866, 2304, 4869, 5382, 7430, 3848, 3339, 2318, 782, 3857, 5906, 26513, 788, 2841, 7450, 4382, 1825, 7458, 801, 37156, 4393, 810, 7979, 3886, 815, 4911, 4401, 7986, 1329, 820, 5942, 3896, 8506, 2874, 317, 5441, 835, 5445, 5958, 6578, 5964, 5965, 4942, 8016, 8024, 344, 4952, 860, 1884, 29533, 8545, 8037, 3430, 6504, 7017, 2922, 4457, 362, 5998, 2928, 373, 374, 2935, 1398, 8057, 6011, 6015, 32127, 384, 4994, 8579, 4996, 8072, 396, 6541, 5006, 6540, 5009, 1938, 1427, 7571, 2965, 1942, 6039, 1940, 7574, 2970, 409, 7068, 7575, 8606, 5014, 5018, 7585, 5017, 6561, 7588, 1447, 3497, 6058, 5547, 1965, 6065, 4529, 21939, 4531, 6069, 5043, 5559, 7096, 1465, 6074, 3515, 4533, 6077, 5054, 7103, 448, 6080, 6076, 4547, 8132, 4552, 4555, 1484, 39372, 39374, 4561, 6611, 5078, 470, 1496, 5081, 472, 7131, 4572, 7133, 5598, 5086, 4576, 4577, 6111, 478, 4580, 1508, 480, 1503, 5096, 1506, 4584, 23019, 493, 494, 498, 5108, 18935, 1529, 6138, 7163, 10238, 5119}, 5: {0, 14849, 512, 3, 11266, 14853, 2053, 23047, 1527, 2569, 15370, 14861, 13, 19471, 2577, 11793, 14867, 18423, 533, 15384, 14875, 15388, 11807, 15396, 4132, 1574, 14890, 14893, 14896, 14897, 1586, 51, 1590, 14911, 1088, 15429, 14406, 23111, 16968, 14921, 14925, 16461, 14929, 15442, 8789, 14934, 2647, 3161, 7770, 91, 14940, 9308, 14937, 14943, 608, 6755, 1124, 13924, 14950, 5219, 14947, 9325, 3697, 14961, 11893, 14968, 12408, 15485, 637, 5247, 1668, 1157, 23172, 647, 15492, 15498, 5773, 19087, 13969, 9362, 15506, 1681, 148, 11926, 1176, 2713, 155, 1180, 15517, 1692, 20124, 10401, 19105, 675, 674, 19109, 167, 1704, 11946, 15019, 12458, 1709, 682, 9091, 2224, 15025, 20656, 176, 180, 7858, 12982, 15031, 15543, 41136, 14013, 2239, 1729, 708, 9413, 21700, 712, 15562, 15051, 2765, 15057, 15061, 9942, 15063, 21718, 22747, 15068, 15069, 32475, 13535, 15583, 15074, 227, 19683, 2789, 1766, 13542, 13036, 2799, 752, 3312, 13552, 242, 26867, 1268, 15618, 759, 2809, 763, 28924, 2812, 10495, 2817, 2818, 14083, 769, 259, 15622, 2823, 1288, 8962, 15109, 
19720, 15629, 19213, 3345, 786, 788, 280, 25375, 2337, 15650, 804, 15653, 3366, 807, 2349, 15151, 7984, 1329, 21810, 820, 12602, 1338, 317, 11582, 5953, 2370, 835, 323, 15688, 1864, 15693, 854, 13142, 344, 15705, 4955, 860, 23899, 11615, 863, 15199, 15711, 13155, 15205, 872, 4457, 15722, 362, 15724, 875, 3438, 15215, 369, 883, 19828, 24437, 374, 29179, 9593, 19834, 15227, 894, 19326, 13186, 35203, 2436, 15749, 389, 19847, 15750, 19849, 2438, 1922, 6028, 909, 15752, 2446, 13200, 2448, 409, 21923, 9644, 14766, 22959, 14771, 23989, 12728, 9145, 14778, 14779, 3000, 12733, 7102, 3007, 9665, 14786, 12226, 2498, 14789, 8645, 15301, 15305, 15818, 461, 976, 5585, 977, 1489, 15358, 472, 1496, 42457, 2524, 478, 19422, 480, 15330, 15843, 20452, 26084, 6631, 14827, 492, 15343, 3571, 14836, 15348, 19446, 14839, 11765, 1017, 14843, 14844, 14846}} 94. word_bagNum shape: 6 50 95. {0: [1536, 7681, 17410, 17411, 17415, 6664, 17420, 15886, 4623, 17935, 4625, 5139, 4631, 17916, 17437, 544, 16422, 5671, 1065, 4650, 4651, 4653, 4690, 16943, 4657, 17458, 15921, 51, 7222, 17464, 17465, 10299, 15932, 64, 6209, 66, 17474, 4680, 8264, 8266, 40008, 6730, 8273, 6738, 5203, 5206, 18005, 15958, 597, 15960], 1: [0, 3, 11785, 2569, 32779, 9227, 526, 21519, 530, 4116, 533, 11805, 2590, 2591, 3105, 7203, 1571, 8740, 1574, 12836, 1062, 1577, 2553, 4654, 1071, 2094, 30257, 51, 30260, 53, 28213, 24633, 1082, 1087, 68, 8779, 78, 12367, 11859, 2647, 91, 13916, 13917, 15455, 608, 9825, 1634, 12387, 13412, 613], 2: [0, 3, 520, 1547, 12300, 2062, 3599, 1040, 26641, 18, 25616, 2577, 13846, 2583, 4121, 25114, 1051, 1052, 25629, 1054, 1567, 2591, 3105, 3616, 4126, 1060, 4125, 1062, 1063, 26663, 1577, 13863, 1066, 1580, 45, 1071, 51, 3123, 53, 2614, 3125, 1082, 2622, 66, 2627, 11843, 1093, 1606, 1605, 3651], 3: [1536, 0, 10242, 3, 37889, 1029, 10248, 2569, 9740, 9745, 10770, 17938, 2577, 10257, 9238, 3094, 9752, 9751, 9754, 30235, 9243, 18425, 9246, 2590, 24096, 9249, 9250, 9251, 4643, 10272, 9252, 5666, 3616, 3625, 4133, 4136, 1071, 9264, 4657, 51, 9267, 22583, 10808, 40504, 10304, 6210, 3650, 37444, 68, 9284], 4: [0, 5121, 4098, 3, 3078, 7175, 1543, 1545, 22027, 5131, 14, 4623, 4625, 22547, 533, 2588, 2590, 1570, 4643, 2597, 5669, 5159, 6183, 2602, 45, 6702, 18937, 5168, 5169, 48, 4657, 3063, 51, 1590, 12343, 5686, 5689, 2105, 1586, 5175, 5694, 6721, 68, 2630, 29767, 29778, 4692, 2133, 5204, 6740], 5: [0, 14849, 512, 3, 11266, 14853, 2053, 23047, 1527, 2569, 15370, 14861, 13, 19471, 2577, 11793, 14867, 18423, 533, 15384, 14875, 15388, 11807, 15396, 4132, 1574, 14890, 14893, 14896, 14897, 1586, 51, 1590, 14911, 1088, 15429, 14406, 23111, 16968, 14921, 14925, 16461, 14929, 15442, 8789, 14934, 2647, 3161, 7770, 91]} 96. after all_words, word_bag shape: 6 300 97. 
{0: [1536, 7681, 17410, 17411, 17415, 6664, 17420, 15886, 4623, 17935, 4625, 5139, 4631, 17916, 17437, 544, 16422, 5671, 1065, 4650, 4651, 4653, 4690, 16943, 4657, 17458, 15921, 51, 7222, 17464, 17465, 10299, 15932, 64, 6209, 66, 17474, 4680, 8264, 8266, 40008, 6730, 8273, 6738, 5203, 5206, 18005, 15958, 597, 15960, 0, 3, 11785, 2569, 32779, 9227, 526, 21519, 530, 4116, 533, 11805, 2590, 2591, 3105, 7203, 1571, 8740, 1574, 12836, 1062, 1577, 2553, 4654, 1071, 2094, 30257, 51, 30260, 53, 28213, 24633, 1082, 1087, 68, 8779, 78, 12367, 11859, 2647, 91, 13916, 13917, 15455, 608, 9825, 1634, 12387, 13412, 613, 0, 3, 520, 1547, 12300, 2062, 3599, 1040, 26641, 18, 25616, 2577, 13846, 2583, 4121, 25114, 1051, 1052, 25629, 1054, 1567, 2591, 3105, 3616, 4126, 1060, 4125, 1062, 1063, 26663, 1577, 13863, 1066, 1580, 45, 1071, 51, 3123, 53, 2614, 3125, 1082, 2622, 66, 2627, 11843, 1093, 1606, 1605, 3651, 1536, 0, 10242, 3, 37889, 1029, 10248, 2569, 9740, 9745, 10770, 17938, 2577, 10257, 9238, 3094, 9752, 9751, 9754, 30235, 9243, 18425, 9246, 2590, 24096, 9249, 9250, 9251, 4643, 10272, 9252, 5666, 3616, 3625, 4133, 4136, 1071, 9264, 4657, 51, 9267, 22583, 10808, 40504, 10304, 6210, 3650, 37444, 68, 9284, 0, 5121, 4098, 3, 3078, 7175, 1543, 1545, 22027, 5131, 14, 4623, 4625, 22547, 533, 2588, 2590, 1570, 4643, 2597, 5669, 5159, 6183, 2602, 45, 6702, 18937, 5168, 5169, 48, 4657, 3063, 51, 1590, 12343, 5686, 5689, 2105, 1586, 5175, 5694, 6721, 68, 2630, 29767, 29778, 4692, 2133, 5204, 6740, 0, 14849, 512, 3, 11266, 14853, 2053, 23047, 1527, 2569, 15370, 14861, 13, 19471, 2577, 11793, 14867, 18423, 533, 15384, 14875, 15388, 11807, 15396, 4132, 1574, 14890, 14893, 14896, 14897, 1586, 51, 1590, 14911, 1088, 15429, 14406, 23111, 16968, 14921, 14925, 16461, 14929, 15442, 8789, 14934, 2647, 3161, 7770, 91], 1: [1536, 7681, 17410, 17411, 17415, 6664, 17420, 15886, 4623, 17935, 4625, 5139, 4631, 17916, 17437, 544, 16422, 5671, 1065, 4650, 4651, 4653, 4690, 16943, 4657, 17458, 15921, 51, 7222, 17464, 17465, 10299, 15932, 64, 6209, 66, 17474, 4680, 8264, 8266, 40008, 6730, 8273, 6738, 5203, 5206, 18005, 15958, 597, 15960, 0, 3, 11785, 2569, 32779, 9227, 526, 21519, 530, 4116, 533, 11805, 2590, 2591, 3105, 7203, 1571, 8740, 1574, 12836, 1062, 1577, 2553, 4654, 1071, 2094, 30257, 51, 30260, 53, 28213, 24633, 1082, 1087, 68, 8779, 78, 12367, 11859, 2647, 91, 13916, 13917, 15455, 608, 9825, 1634, 12387, 13412, 613, 0, 3, 520, 1547, 12300, 2062, 3599, 1040, 26641, 18, 25616, 2577, 13846, 2583, 4121, 25114, 1051, 1052, 25629, 1054, 1567, 2591, 3105, 3616, 4126, 1060, 4125, 1062, 1063, 26663, 1577, 13863, 1066, 1580, 45, 1071, 51, 3123, 53, 2614, 3125, 1082, 2622, 66, 2627, 11843, 1093, 1606, 1605, 3651, 1536, 0, 10242, 3, 37889, 1029, 10248, 2569, 9740, 9745, 10770, 17938, 2577, 10257, 9238, 3094, 9752, 9751, 9754, 30235, 9243, 18425, 9246, 2590, 24096, 9249, 9250, 9251, 4643, 10272, 9252, 5666, 3616, 3625, 4133, 4136, 1071, 9264, 4657, 51, 9267, 22583, 10808, 40504, 10304, 6210, 3650, 37444, 68, 9284, 0, 5121, 4098, 3, 3078, 7175, 1543, 1545, 22027, 5131, 14, 4623, 4625, 22547, 533, 2588, 2590, 1570, 4643, 2597, 5669, 5159, 6183, 2602, 45, 6702, 18937, 5168, 5169, 48, 4657, 3063, 51, 1590, 12343, 5686, 5689, 2105, 1586, 5175, 5694, 6721, 68, 2630, 29767, 29778, 4692, 2133, 5204, 6740, 0, 14849, 512, 3, 11266, 14853, 2053, 23047, 1527, 2569, 15370, 14861, 13, 19471, 2577, 11793, 14867, 18423, 533, 15384, 14875, 15388, 11807, 15396, 4132, 1574, 14890, 14893, 14896, 14897, 1586, 51, 1590, 14911, 1088, 15429, 14406, 23111, 
16968, 14921, 14925, 16461, 14929, 15442, 8789, 14934, 2647, 3161, 7770, 91], 2: [1536, 7681, 17410, 17411, 17415, 6664, 17420, 15886, 4623, 17935, 4625, 5139, 4631, 17916, 17437, 544, 16422, 5671, 1065, 4650, 4651, 4653, 4690, 16943, 4657, 17458, 15921, 51, 7222, 17464, 17465, 10299, 15932, 64, 6209, 66, 17474, 4680, 8264, 8266, 40008, 6730, 8273, 6738, 5203, 5206, 18005, 15958, 597, 15960, 0, 3, 11785, 2569, 32779, 9227, 526, 21519, 530, 4116, 533, 11805, 2590, 2591, 3105, 7203, 1571, 8740, 1574, 12836, 1062, 1577, 2553, 4654, 1071, 2094, 30257, 51, 30260, 53, 28213, 24633, 1082, 1087, 68, 8779, 78, 12367, 11859, 2647, 91, 13916, 13917, 15455, 608, 9825, 1634, 12387, 13412, 613, 0, 3, 520, 1547, 12300, 2062, 3599, 1040, 26641, 18, 25616, 2577, 13846, 2583, 4121, 25114, 1051, 1052, 25629, 1054, 1567, 2591, 3105, 3616, 4126, 1060, 4125, 1062, 1063, 26663, 1577, 13863, 1066, 1580, 45, 1071, 51, 3123, 53, 2614, 3125, 1082, 2622, 66, 2627, 11843, 1093, 1606, 1605, 3651, 1536, 0, 10242, 3, 37889, 1029, 10248, 2569, 9740, 9745, 10770, 17938, 2577, 10257, 9238, 3094, 9752, 9751, 9754, 30235, 9243, 18425, 9246, 2590, 24096, 9249, 9250, 9251, 4643, 10272, 9252, 5666, 3616, 3625, 4133, 4136, 1071, 9264, 4657, 51, 9267, 22583, 10808, 40504, 10304, 6210, 3650, 37444, 68, 9284, 0, 5121, 4098, 3, 3078, 7175, 1543, 1545, 22027, 5131, 14, 4623, 4625, 22547, 533, 2588, 2590, 1570, 4643, 2597, 5669, 5159, 6183, 2602, 45, 6702, 18937, 5168, 5169, 48, 4657, 3063, 51, 1590, 12343, 5686, 5689, 2105, 1586, 5175, 5694, 6721, 68, 2630, 29767, 29778, 4692, 2133, 5204, 6740, 0, 14849, 512, 3, 11266, 14853, 2053, 23047, 1527, 2569, 15370, 14861, 13, 19471, 2577, 11793, 14867, 18423, 533, 15384, 14875, 15388, 11807, 15396, 4132, 1574, 14890, 14893, 14896, 14897, 1586, 51, 1590, 14911, 1088, 15429, 14406, 23111, 16968, 14921, 14925, 16461, 14929, 15442, 8789, 14934, 2647, 3161, 7770, 91], 3: [1536, 7681, 17410, 17411, 17415, 6664, 17420, 15886, 4623, 17935, 4625, 5139, 4631, 17916, 17437, 544, 16422, 5671, 1065, 4650, 4651, 4653, 4690, 16943, 4657, 17458, 15921, 51, 7222, 17464, 17465, 10299, 15932, 64, 6209, 66, 17474, 4680, 8264, 8266, 40008, 6730, 8273, 6738, 5203, 5206, 18005, 15958, 597, 15960, 0, 3, 11785, 2569, 32779, 9227, 526, 21519, 530, 4116, 533, 11805, 2590, 2591, 3105, 7203, 1571, 8740, 1574, 12836, 1062, 1577, 2553, 4654, 1071, 2094, 30257, 51, 30260, 53, 28213, 24633, 1082, 1087, 68, 8779, 78, 12367, 11859, 2647, 91, 13916, 13917, 15455, 608, 9825, 1634, 12387, 13412, 613, 0, 3, 520, 1547, 12300, 2062, 3599, 1040, 26641, 18, 25616, 2577, 13846, 2583, 4121, 25114, 1051, 1052, 25629, 1054, 1567, 2591, 3105, 3616, 4126, 1060, 4125, 1062, 1063, 26663, 1577, 13863, 1066, 1580, 45, 1071, 51, 3123, 53, 2614, 3125, 1082, 2622, 66, 2627, 11843, 1093, 1606, 1605, 3651, 1536, 0, 10242, 3, 37889, 1029, 10248, 2569, 9740, 9745, 10770, 17938, 2577, 10257, 9238, 3094, 9752, 9751, 9754, 30235, 9243, 18425, 9246, 2590, 24096, 9249, 9250, 9251, 4643, 10272, 9252, 5666, 3616, 3625, 4133, 4136, 1071, 9264, 4657, 51, 9267, 22583, 10808, 40504, 10304, 6210, 3650, 37444, 68, 9284, 0, 5121, 4098, 3, 3078, 7175, 1543, 1545, 22027, 5131, 14, 4623, 4625, 22547, 533, 2588, 2590, 1570, 4643, 2597, 5669, 5159, 6183, 2602, 45, 6702, 18937, 5168, 5169, 48, 4657, 3063, 51, 1590, 12343, 5686, 5689, 2105, 1586, 5175, 5694, 6721, 68, 2630, 29767, 29778, 4692, 2133, 5204, 6740, 0, 14849, 512, 3, 11266, 14853, 2053, 23047, 1527, 2569, 15370, 14861, 13, 19471, 2577, 11793, 14867, 18423, 533, 15384, 14875, 15388, 11807, 15396, 4132, 1574, 
14890, 14893, 14896, 14897, 1586, 51, 1590, 14911, 1088, 15429, 14406, 23111, 16968, 14921, 14925, 16461, 14929, 15442, 8789, 14934, 2647, 3161, 7770, 91], 4: [1536, 7681, 17410, 17411, 17415, 6664, 17420, 15886, 4623, 17935, 4625, 5139, 4631, 17916, 17437, 544, 16422, 5671, 1065, 4650, 4651, 4653, 4690, 16943, 4657, 17458, 15921, 51, 7222, 17464, 17465, 10299, 15932, 64, 6209, 66, 17474, 4680, 8264, 8266, 40008, 6730, 8273, 6738, 5203, 5206, 18005, 15958, 597, 15960, 0, 3, 11785, 2569, 32779, 9227, 526, 21519, 530, 4116, 533, 11805, 2590, 2591, 3105, 7203, 1571, 8740, 1574, 12836, 1062, 1577, 2553, 4654, 1071, 2094, 30257, 51, 30260, 53, 28213, 24633, 1082, 1087, 68, 8779, 78, 12367, 11859, 2647, 91, 13916, 13917, 15455, 608, 9825, 1634, 12387, 13412, 613, 0, 3, 520, 1547, 12300, 2062, 3599, 1040, 26641, 18, 25616, 2577, 13846, 2583, 4121, 25114, 1051, 1052, 25629, 1054, 1567, 2591, 3105, 3616, 4126, 1060, 4125, 1062, 1063, 26663, 1577, 13863, 1066, 1580, 45, 1071, 51, 3123, 53, 2614, 3125, 1082, 2622, 66, 2627, 11843, 1093, 1606, 1605, 3651, 1536, 0, 10242, 3, 37889, 1029, 10248, 2569, 9740, 9745, 10770, 17938, 2577, 10257, 9238, 3094, 9752, 9751, 9754, 30235, 9243, 18425, 9246, 2590, 24096, 9249, 9250, 9251, 4643, 10272, 9252, 5666, 3616, 3625, 4133, 4136, 1071, 9264, 4657, 51, 9267, 22583, 10808, 40504, 10304, 6210, 3650, 37444, 68, 9284, 0, 5121, 4098, 3, 3078, 7175, 1543, 1545, 22027, 5131, 14, 4623, 4625, 22547, 533, 2588, 2590, 1570, 4643, 2597, 5669, 5159, 6183, 2602, 45, 6702, 18937, 5168, 5169, 48, 4657, 3063, 51, 1590, 12343, 5686, 5689, 2105, 1586, 5175, 5694, 6721, 68, 2630, 29767, 29778, 4692, 2133, 5204, 6740, 0, 14849, 512, 3, 11266, 14853, 2053, 23047, 1527, 2569, 15370, 14861, 13, 19471, 2577, 11793, 14867, 18423, 533, 15384, 14875, 15388, 11807, 15396, 4132, 1574, 14890, 14893, 14896, 14897, 1586, 51, 1590, 14911, 1088, 15429, 14406, 23111, 16968, 14921, 14925, 16461, 14929, 15442, 8789, 14934, 2647, 3161, 7770, 91], 5: [1536, 7681, 17410, 17411, 17415, 6664, 17420, 15886, 4623, 17935, 4625, 5139, 4631, 17916, 17437, 544, 16422, 5671, 1065, 4650, 4651, 4653, 4690, 16943, 4657, 17458, 15921, 51, 7222, 17464, 17465, 10299, 15932, 64, 6209, 66, 17474, 4680, 8264, 8266, 40008, 6730, 8273, 6738, 5203, 5206, 18005, 15958, 597, 15960, 0, 3, 11785, 2569, 32779, 9227, 526, 21519, 530, 4116, 533, 11805, 2590, 2591, 3105, 7203, 1571, 8740, 1574, 12836, 1062, 1577, 2553, 4654, 1071, 2094, 30257, 51, 30260, 53, 28213, 24633, 1082, 1087, 68, 8779, 78, 12367, 11859, 2647, 91, 13916, 13917, 15455, 608, 9825, 1634, 12387, 13412, 613, 0, 3, 520, 1547, 12300, 2062, 3599, 1040, 26641, 18, 25616, 2577, 13846, 2583, 4121, 25114, 1051, 1052, 25629, 1054, 1567, 2591, 3105, 3616, 4126, 1060, 4125, 1062, 1063, 26663, 1577, 13863, 1066, 1580, 45, 1071, 51, 3123, 53, 2614, 3125, 1082, 2622, 66, 2627, 11843, 1093, 1606, 1605, 3651, 1536, 0, 10242, 3, 37889, 1029, 10248, 2569, 9740, 9745, 10770, 17938, 2577, 10257, 9238, 3094, 9752, 9751, 9754, 30235, 9243, 18425, 9246, 2590, 24096, 9249, 9250, 9251, 4643, 10272, 9252, 5666, 3616, 3625, 4133, 4136, 1071, 9264, 4657, 51, 9267, 22583, 10808, 40504, 10304, 6210, 3650, 37444, 68, 9284, 0, 5121, 4098, 3, 3078, 7175, 1543, 1545, 22027, 5131, 14, 4623, 4625, 22547, 533, 2588, 2590, 1570, 4643, 2597, 5669, 5159, 6183, 2602, 45, 6702, 18937, 5168, 5169, 48, 4657, 3063, 51, 1590, 12343, 5686, 5689, 2105, 1586, 5175, 5694, 6721, 68, 2630, 29767, 29778, 4692, 2133, 5204, 6740, 0, 14849, 512, 3, 11266, 14853, 2053, 23047, 1527, 2569, 15370, 14861, 13, 19471, 
2577, 11793, 14867, 18423, 533, 15384, 14875, 15388, 11807, 15396, 4132, 1574, 14890, 14893, 14896, 14897, 1586, 51, 1590, 14911, 1088, 15429, 14406, 23111, 16968, 14921, 14925, 16461, 14929, 15442, 8789, 14934, 2647, 3161, 7770, 91]}
features_data_frame.shape: (6, 255)
0     30
1    185
2    139
3     66
4     69
5    157
class_Proportion:
[0.04643962848297214, 0.28637770897832815, 0.21517027863777088, 0.1021671826625387, 0.10681114551083591, 0.24303405572755418]
test_data_frame.head(2)
     Unnamed: 0                                            content  \
854         854  据Mobileexpose报道,华硕已经正式向媒体发出邀请,定于6月14日在台湾举办记者会,...
101         101  6月6日,王者荣耀猴三棍重做引起王者峡谷一阵轩然大波,毕竟这个强势的猴子已经陪伴我们好几个...

                      id                                   tags  \
854  6429089676803440897  ['科技', '华硕', '华硕ZenFone', '台湾', '手机']
101  6429098400347586818     ['游戏', '猴子', '王者荣耀', '黄忠', '游戏']

                    time                      title  \
854  2017-06-07 10:11:00        华硕ZenFone AR宣布本月发售
101  2017-06-07 10:39:20  猴子重做之后是加强还是削弱?狂到站对面泉水拿双杀

                                             doc_words  \
854  [报道, 华硕, 已经, 正式, 媒体, 发出, 邀请, 定于, 月, 日, 台湾, 举办,...
101  [月, 日, 王者, 荣耀, 猴三棍, 重, 做, 引起, 王者, 峡谷, 一阵, 轩然大波...

                                                corpus  \
854  [(142, 1), (362, 1), (472, 1), (475, 1), (494,...
101  [(0, 2), (68, 3), (133, 1), (184, 1), (226, 1)...

                                                 tfidf   visual01   visual02  \
854  [(142, 0.13953435619531032), (362, 0.046441336...  21.684397 -30.567736
101  [(0, 0.012838015508020575), (68, 0.04742284222...  67.188065  21.183245

     keyword_index
854              1
101              3
print the first sample
Unnamed: 0                                                     854
content          据Mobileexpose报道,华硕已经正式向媒体发出邀请,定于6月14日在台湾举办记者会,...
id                                             6429089676803440897
tags                         ['科技', '华硕', '华硕ZenFone', '台湾', '手机']
time                                           2017-06-07 10:11:00
title                                           华硕ZenFone AR宣布本月发售
doc_words        [报道, 华硕, 已经, 正式, 媒体, 发出, 邀请, 定于, 月, 日, 台湾, 举办,...
corpus           [(142, 1), (362, 1), (472, 1), (475, 1), (494,...
tfidf            [(142, 0.13953435619531032), (362, 0.046441336...
visual01                                                   21.6844
visual02                                                  -30.5677
keyword_index                                                    1
Name: 854, dtype: object
test_data_frame.iloc[0].corpus: [(142, 1), (362, 1), (472, 1), (475, 1), (494, 1), (530, 1), (872, 1), (909, 1), (1254, 1), (1312, 1), (1878, 1), (2577, 1), (2783, 1), (2979, 1), (3697, 1), (5508, 1), (9052, 1), (12204, 1), (12256, 1), (12591, 1), (12936, 1), (12991, 1), (13128, 1), (13194, 1), (13244, 1), (13317, 1), (31670, 1), (31683, 1), (33417, 1)]
[1.45708072e-43 1.78656934e-66 7.12148875e-63 1.71090490e-53
 4.71385662e-54 2.08405934e-64]
[-35.34436300647761, -16.431856044032266, -20.267559000416433, -22.405433968586664, -27.97121661401147, -18.05089965903481]
F:\File_Jupyter\实用代码\naive_bayes(简单贝叶斯)\TextClassPrediction_kNN_NB_LDA_P.py:346: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  test_data_frame['predicted_class'] = test_data_frame['corpus'].apply(predict_text_ByMax)  # predict all test documents

                       id                                   tags  \
854   6429089676803440897  ['科技', '华硕', '华硕ZenFone', '台湾', '手机']
101   6429098400347586818     ['游戏', '猴子', '王者荣耀', '黄忠', '游戏']
738   6413133652368982274    ['科技', '厨卫电器', '榨汁机', '小家电', '硅谷']
511   6428827159980867842    ['科技', '智能家居', '音箱', '苹果公司', '法国']
725   6428841852455354625                  ['科技', '喜马拉雅山', '科技']
...                   ...                                    ...
805   6429151552733069569                          ['财经', '财经']
448   6415852634885341441    ['汽车', 'SUV', '国产车', '概念车', '汽车用品']
782   6428858665063383297   ['科技', '新能源汽车', '电动汽车', '新能源', '经济']
1264  6427822755417194753   ['汽车', '日本汽车', '讴歌汽车', 'SUV', '空调']
1195  6429093420292210945                    ['科技', '乐视', '科技']

                     time                       title  \
854   2017-06-07 10:11:00          华硕ZenFone AR宣布本月发售
101   2017-06-07 10:39:20    猴子重做之后是加强还是削弱?狂到站对面泉水拿双杀
738   2017-04-26 10:41:39               绝!他用一台榨汁机骗了8亿
511   2017-06-08 11:06:00   他的智能音箱一上市,苹果公司就推出了HomePod
725   2017-06-07 18:37:00  喜马拉雅FM推出“付费会员”,当天召集超221万名会员
...                   ...                         ...
805   2017-06-08 14:30:00        盘中近20家龙头白马股集体创下历史新高
448   2017-05-03 18:37:20     别瞎找了!10万左右尺寸最大的SUV都在这里了
782   2017-06-07 19:12:00     倡导移动出行新概念 NEVS两款概念量产车亮相
1264  2017-06-08 09:54:40       居然还有一款车,最低配和中高配看不出差别?
1195  2017-06-08 10:45:00    乐视被爆未及时缴物业费,员工或将被阻止进大楼办公

                                             doc_words  \
854   [报道, 华硕, 已经, 正式, 媒体, 发出, 邀请, 定于, 月, 日, 台湾, 举办,...
101   [月, 日, 王者, 荣耀, 猴三棍, 重, 做, 引起, 王者, 峡谷, 一阵, 轩然大波...
738   [骗子, 往往, 很会, 讲故事, 以下, 硅谷, 骗局, 验血, 公司, 号称, 指尖, ...
511   [专访, 创始人, 孟, 崨, 学校, 最, 调皮, 却, 成绩, 最好, 学生, 老师, ...
725   [据介绍, 喜马拉雅, 会员, 月费, 元, 年度, 会员, 元, 价格, 视频, 网站, ...
...                                                ...
805   [每经, 记者, 王海, 慜, 每经, 编辑, 叶峰, 今日, 盘中, 昨日, 领涨, 中小...
448   [中国, 人买, 喜欢, 房子, 买, 面积, 手机, 买, 屏大, 买车, 自然, 挑选,...
782   [中证网, 讯, 记者, 徐金忠, 月, 日, 国, 电动汽车, 瑞典, 有限公司, 亮相,...
1264  [目前, 日系, 豪华, 品牌, 讴歌, 已经, 开启, 国产, 路, 推出, 车型, 后,...
1195  [近日, 爆料, 称, 乐视, 位于, 北京, 达美, 中心, 办公地, 因未, 及时, 缴...

                                                corpus  \
854   [(142, 1), (362, 1), (472, 1), (475, 1), (494,...
101   [(0, 2), (68, 3), (133, 1), (184, 1), (226, 1)...
738   [(0, 2), (45, 1), (48, 1), (133, 2), (155, 1),...
511   [(0, 10), (13, 2), (14, 2), (20, 1), (45, 1), ...
725   [(30, 1), (102, 1), (142, 1), (154, 1), (189, ...
...                                                ...
805   [(113, 1), (167, 1), (169, 1), (214, 1), (258,...
448   [(4, 2), (8, 1), (14, 1), (51, 6), (53, 2), (6...
782   [(15, 2), (30, 1), (53, 7), (93, 1), (143, 1),...
1264  [(0, 1), (20, 1), (51, 1), (176, 1), (225, 1),...
1195  [(57, 1), (111, 1), (191, 1), (361, 1), (476, ...

                                                 tfidf   visual01   visual02  \
854   [(142, 0.13953435619531032), (362, 0.046441336...  21.684397 -30.567736
101   [(0, 0.012838015508020575), (68, 0.04742284222...  67.188065  21.183245
738   [(0, 0.008984009118453712), (45, 0.01791359767... -22.855194 -11.270862
511   [(0, 0.04361196171462796), (13, 0.028607388065... -22.198786  12.217076
725   [(30, 0.05815947983270004), (102, 0.0450585853...  26.268911  21.240065
...                                                ...        ...        ...
805   [(113, 0.030899018921031703), (167, 0.02103003... -66.232071   0.221611
448   [(4, 0.04071064284477513), (8, 0.0235138776022...  41.836094 -44.539528
782   [(15, 0.03392075672049564), (30, 0.03003603467... -26.810091 -29.602842
1264  [(0, 0.009883726180653873), (20, 0.04080153677...  36.279522 -52.474297
1195  [(57, 0.09668298763559263), (111, 0.1255406499...  -6.373239  16.101738

      keyword_index  predicted_class
854               1                1
101               3                3
738               1                1
511               1                2
725               1                1
...             ...              ...
805               2                2
448               5                5
782               1                1
1264              5                5
1195              1                1

[647 rows x 13 columns]
SModel_CS_acc_score: 0.7047913446676971
300
label_category_ID 2
一个
一些
概念
经营
补贴
股市
增持
成本
乳业
万吨
train_data_frame.corpus[0]
[(0, 6), (1, 1), (2, 1), (3, 3), (4, 2), (5, 2), (6, 1), (7, 1), (8, 2), (9, 1), (10, 3), (11, 1), (12, 2), (13, 2), (14, 2), (15, 1), (16, 1), (17, 2), (18, 1), (19, 1), (20, 2), (21, 1), (22, 2), (23, 2), (24, 1), (25, 1), (26, 1), (27, 1), (28, 1), (29, 2), (30, 3), (31, 4), (32, 3), (33, 1), (34, 1), (35, 1), (36, 7), (37, 1), (38, 1), (39, 2), (40, 3), (41, 1), (42, 1), (43, 1), (44, 1), (45, 2), (46, 1), (47, 1), (48, 1), (49, 2), (50, 4), (51, 21), (52, 3), (53, 7), (54, 1), (55, 2), (56, 1), (57, 4), (58, 2), (59, 1), (60, 5), (61, 1), (62, 1), (63, 1), (64, 2), (65, 1), (66, 3), (67, 1), (68, 2), (69, 2), (70, 1), (71, 1), (72, 1), (73, 1), (74, 2), (75, 1), (76, 1), (77, 1), (78, 1), (79, 2), (80, 1), (81, 1), (82, 1), (83, 4), (84, 7), (85, 2), (86, 3), (87, 1), (88, 9), (89, 1), (90, 1), (91, 8), (92, 3), (93, 1), (94, 4), (95, 1), (96, 2), (97, 1), (98, 7), (99, 1), (100, 2), (101, 1), (102, 1), (103, 1), (104, 1), (105, 1), (106, 1), (107, 1), (108, 1), (109, 2), (110, 1), (111, 2), (112, 1), (113, 1), (114, 1), (115, 1), (116, 1), (117, 1), (118, 1), (119, 1), (120, 1), (121, 2), (122, 1), (123, 1), (124, 1), (125, 1), (126, 5), (127, 1), (128, 4), (129, 1), (130, 1), (131, 1), (132, 2), (133, 2), (134, 1), (135, 5), (136, 1), (137, 1), (138, 3), (139, 1), (140, 1), (141, 1), (142, 1), (143, 1), (144, 1), (145, 2), (146, 1), (147, 1), (148, 2), (149, 4), (150, 1), (151, 1), (152, 2), (153, 2), (154, 1), (155, 3), (156, 1), (157, 1), (158, 1), (159, 1), (160, 1), (161, 2), (162, 1), (163, 1), (164, 1), (165, 2), (166, 1), (167, 3), (168, 1), (169, 1), (170, 3), (171, 3), (172, 1), (173, 2), (174, 1), (175, 1), (176, 2), (177, 5), (178, 1), (179, 1), (180, 1), (181, 1), (182, 1), (183, 1), (184, 4), (185, 1), (186, 1), (187, 1), (188, 1), (189, 3), (190, 1), (191, 14), (192, 2), (193, 2), (194, 2), (195, 1), (196, 3), (197, 1), (198, 1), (199, 11), (200, 6), (201, 1), (202, 1), (203, 2), (204, 1), (205, 8), (206, 2), (207, 2), (208, 2), (209, 1), (210, 1), (211, 1), (212, 1), (213, 1), (214, 1), (215, 1), (216, 3), (217, 1), (218, 1), (219, 2), (220, 2), (221, 1), (222, 1), (223, 1), (224, 1), (225, 17), (226, 1), (227, 1), (228, 1), (229, 1), (230, 1), (231, 1), (232, 2), (233, 1), (234, 1), (235, 3), (236, 1), (237, 1), (238, 2), (239, 1), (240, 1), (241, 1), (242, 1), (243, 2), (244, 2), (245, 1), (246, 1), (247, 2), (248, 2), (249, 2), (250, 1), (251, 1), (252, 2), (253, 1), (254, 1), (255, 1), (256, 1), (257, 1), (258, 3), (259, 3), (260, 1), (261, 3), (262, 2), (263, 1), (264, 1), (265, 6), (266, 1), (267, 3), (268, 1), (269, 1), (270, 3), (271, 2), (272, 1), (273, 2), (274, 1), (275, 1), (276, 5), (277, 1), (278, 4), (279, 4), (280, 25), (281, 2), (282, 2), (283, 2), (284, 7), (285, 1), (286, 1), (287, 2), (288, 2), (289, 1), (290, 1), (291, 1), (292, 1), (293, 3), (294, 2), (295, 1), (296, 3), (297, 1), (298, 3), (299, 2), (300, 1), (301, 1), (302, 1), (303, 2), (304, 1), (305, 1), (306, 1), (307, 2), (308, 2), (309, 1), (310, 1), (311, 1), (312, 1), (313, 1), (314, 1), (315, 1), (316, 7), (317, 2), (318, 2), (319, 1), (320, 1), (321, 1), (322, 1), (323, 1), (324, 1), (325, 4), (326, 1), (327, 2), (328, 1), (329, 1), (330, 3), (331, 3), (332, 1), (333, 2), (334, 2), (335, 1), (336, 1), (337, 2), (338, 1), (339, 1), (340, 1), (341, 1), (342, 1), (343, 1), (344, 2), (345, 1), (346, 1), (347, 2), (348, 1), (349, 2), (350, 5), (351, 2), (352, 3), (353, 1), (354, 4), (355, 1), (356, 1), (357, 2), (358, 4), (359, 2), (360, 2), (361, 1), (362, 9), (363, 2), (364, 2), (365, 
1), (366, 1), (367, 7), (368, 1), (369, 4), (370, 2), (371, 1), (372, 1), (373, 1), (374, 1), (375, 1), (376, 1), (377, 1), (378, 2), (379, 1), (380, 3), (381, 1), (382, 2), (383, 1), (384, 3), (385, 26), (386, 1), (387, 1), (388, 1), (389, 3), (390, 1), (391, 2), (392, 1), (393, 4), (394, 4), (395, 4), (396, 2), (397, 1), (398, 40), (399, 2), (400, 4), (401, 1), (402, 1), (403, 2), (404, 1), (405, 1), (406, 2), (407, 1), (408, 1), (409, 3), (410, 1), (411, 1), (412, 2), (413, 7), (414, 4), (415, 2), (416, 1), (417, 1), (418, 1), (419, 3), (420, 1), (421, 1), (422, 1), (423, 1), (424, 1), (425, 1), (426, 1), (427, 2), (428, 1), (429, 1), (430, 1), (431, 1), (432, 5), (433, 1), (434, 1), (435, 1), (436, 1), (437, 1), (438, 1), (439, 1), (440, 1), (441, 1), (442, 1), (443, 3), (444, 3), (445, 2), (446, 5), (447, 1), (448, 1), (449, 1), (450, 4), (451, 1), (452, 2), (453, 2), (454, 1), (455, 4), (456, 1), (457, 1), (458, 1), (459, 2), (460, 1), (461, 1), (462, 5), (463, 2), (464, 1), (465, 5), (466, 74), (467, 2), (468, 1), (469, 1), (470, 2), (471, 22), (472, 2), (473, 1), (474, 1), (475, 2), (476, 2), (477, 2), (478, 2), (479, 1), (480, 1), (481, 1), (482, 1), (483, 2), (484, 1), (485, 1), (486, 2), (487, 1), (488, 2), (489, 1), (490, 1), (491, 1), (492, 4), (493, 1), (494, 2), (495, 4), (496, 2), (497, 1), (498, 1), (499, 1), (500, 1), (501, 5), (502, 1), (503, 13), (504, 4), (505, 3), (506, 1), (507, 7), (508, 1), (509, 1), (510, 1), (511, 1), (512, 1), (513, 1), (514, 2), (515, 1), (516, 3), (517, 4), (518, 1), (519, 1), (520, 1), (521, 1), (522, 1), (523, 1), (524, 1), (525, 1), (526, 2), (527, 2), (528, 1), (529, 1), (530, 1), (531, 1), (532, 1), (533, 1), (534, 1), (535, 2), (536, 5), (537, 2), (538, 1), (539, 1), (540, 1), (541, 7), (542, 1), (543, 1), (544, 1), (545, 2), (546, 1), (547, 3), (548, 2), (549, 1), (550, 1), (551, 2), (552, 1), (553, 2), (554, 1), (555, 1), (556, 2), (557, 1), (558, 2), (559, 5), (560, 2), (561, 1), (562, 1), (563, 1), (564, 1), (565, 1), (566, 1), (567, 7), (568, 2), (569, 1), (570, 2), (571, 1), (572, 1), (573, 1), (574, 4), (575, 1), (576, 2), (577, 2), (578, 1), (579, 2), (580, 1), (581, 1), (582, 1), (583, 2), (584, 1), (585, 1), (586, 1), (587, 4), (588, 1), (589, 4), (590, 2), (591, 1), (592, 1), (593, 1), (594, 2), (595, 1), (596, 1), (597, 1), (598, 1), (599, 1), (600, 1), (601, 1), (602, 1), (603, 1), (604, 1), (605, 1), (606, 1), (607, 1), (608, 2), (609, 1), (610, 2), (611, 1), (612, 1), (613, 11), (614, 1), (615, 1), (616, 3), (617, 1), (618, 1), (619, 1), (620, 1), (621, 1), (622, 1), (623, 1), (624, 32), (625, 2), (626, 1), (627, 8), (628, 1), (629, 3), (630, 3), (631, 1), (632, 1), (633, 4), (634, 1), (635, 1), (636, 2), (637, 1), (638, 3), (639, 2), (640, 1), (641, 1), (642, 1), (643, 3), (644, 5), (645, 4), (646, 1), (647, 1), (648, 3), (649, 1), (650, 1), (651, 1), (652, 1), (653, 1), (654, 1), (655, 2), (656, 1), (657, 7), (658, 1), (659, 2), (660, 1), (661, 2), (662, 1), (663, 1), (664, 1), (665, 1), (666, 1), (667, 1), (668, 4), (669, 1), (670, 1), (671, 3), (672, 1), (673, 1), (674, 2), (675, 1), (676, 1), (677, 1), (678, 1), (679, 1), (680, 2), (681, 2), (682, 1), (683, 1), (684, 1), (685, 3), (686, 1), (687, 1), (688, 1), (689, 1), (690, 4), (691, 1), (692, 2), (693, 3), (694, 1), (695, 2), (696, 1), (697, 1), (698, 2), (699, 1), (700, 1), (701, 4), (702, 1), (703, 1), (704, 2), (705, 1), (706, 1), (707, 1), (708, 1), (709, 2), (710, 1), (711, 3), (712, 1), (713, 1), (714, 4), (715, 1), (716, 1), (717, 1), (718, 2), (719, 1), 
(720, 1), (721, 2), (722, 1), (723, 1), (724, 4), (725, 1), (726, 1), (727, 1), (728, 1), (729, 2), (730, 12), (731, 2), (732, 1), (733, 2), (734, 3), (735, 1), (736, 26), (737, 1), (738, 5), (739, 1), (740, 2), (741, 5), (742, 2), (743, 3), (744, 3), (745, 2), (746, 1), (747, 3), (748, 2), (749, 2), (750, 2), (751, 1), (752, 1), (753, 2), (754, 1), (755, 1), (756, 1), (757, 1), (758, 1), (759, 4), (760, 1), (761, 1), (762, 1), (763, 1), (764, 1), (765, 2), (766, 1), (767, 1), (768, 1), (769, 2), (770, 8), (771, 2), (772, 4), (773, 1), (774, 8), (775, 3), (776, 1), (777, 1), (778, 3), (779, 1), (780, 1), (781, 1), (782, 5), (783, 2), (784, 2), (785, 1), (786, 4), (787, 1), (788, 1), (789, 1), (790, 1), (791, 1), (792, 1), (793, 4), (794, 1), (795, 1), (796, 1), (797, 5), (798, 3), (799, 5), (800, 3), (801, 1), (802, 1), (803, 1), (804, 1), (805, 2), (806, 2), (807, 2), (808, 1), (809, 1), (810, 1), (811, 1), (812, 1), (813, 1), (814, 1), (815, 3), (816, 1), (817, 2), (818, 1), (819, 1), (820, 11), (821, 1), (822, 1), (823, 2), (824, 3), (825, 1), (826, 1), (827, 1), (828, 1), (829, 1), (830, 3), (831, 4), (832, 46), (833, 1), (834, 1), (835, 2), (836, 2), (837, 1), (838, 1), (839, 2), (840, 2), (841, 1), (842, 1), (843, 2), (844, 2), (845, 2), (846, 1), (847, 1), (848, 2), (849, 1), (850, 1), (851, 1), (852, 3), (853, 1), (854, 1), (855, 6), (856, 1), (857, 1), (858, 1)]
[33. 74. 73. 31. 47. 48.]
<class 'numpy.ndarray'>
SModel_acc_score: 0.8114374034003091
kNNC_acc_score: 0.8160741885625966
GNBC_acc_score: 0.6352395672333848
MNBC_acc_score: 0.6352395672333848
BNBC_acc_score: 0.29675425038639874
LDAC_acc_score: 0.8238021638330757
PerceptronC_acc_score: 0.8222565687789799
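For reference: the two SModel scores above come from the article's own pure-statistics model (per-category hot-word bags scored by predict_text_ByMax), while the remaining six scores are standard scikit-learn classifiers evaluated with accuracy_score. Below is a minimal sketch of that six-way comparison; the random X/y arrays are stand-ins for the article's dense per-document feature vectors (the log shows a 255-feature space over 6 classes, with 646 training and 647 test documents), so the printed numbers will not match the article's.

# Hedged sketch of the comparison behind the *_acc_score lines above;
# X/y are random stand-ins for the article's dense TF-IDF feature vectors.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import Perceptron
from sklearn.metrics import accuracy_score

rng = np.random.RandomState(0)
X_train, X_test = rng.rand(646, 255), rng.rand(647, 255)   # non-negative, as MultinomialNB requires
y_train, y_test = rng.randint(0, 6, 646), rng.randint(0, 6, 647)

classifiers = {
    'kNNC_acc_score': KNeighborsClassifier(),
    'GNBC_acc_score': GaussianNB(),
    'MNBC_acc_score': MultinomialNB(),
    'BNBC_acc_score': BernoulliNB(),
    'LDAC_acc_score': LinearDiscriminantAnalysis(),
    'PerceptronC_acc_score': Perceptron(),
}
for name, clf in classifiers.items():
    clf.fit(X_train, y_train)                               # fit on the training vectors
    print(name, accuracy_score(y_test, clf.predict(X_test)))  # score on the held-out set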
Core Code

The three naive Bayes classifiers used above, excerpted from sklearn.naive_bayes:
class GaussianNB Found at: sklearn.naive_bayes

class GaussianNB(_BaseNB):
    """
    Gaussian Naive Bayes (GaussianNB)

    Can perform online updates to model parameters via :meth:`partial_fit`.
    For details on algorithm used to update feature means and variance online,
    see Stanford CS tech report STAN-CS-79-773 by Chan, Golub, and LeVeque:

        http://i.stanford.edu/pub/cstr/reports/cs/tr/79/773/CS-TR-79-773.pdf

    Read more in the :ref:`User Guide <gaussian_naive_bayes>`.

    Parameters
    ----------
    priors : array-like of shape (n_classes,)
        Prior probabilities of the classes. If specified the priors are not
        adjusted according to the data.

    var_smoothing : float, default=1e-9
        Portion of the largest variance of all features that is added to
        variances for calculation stability.

        .. versionadded:: 0.20

    Attributes
    ----------
    class_count_ : ndarray of shape (n_classes,)
        number of training samples observed in each class.

    class_prior_ : ndarray of shape (n_classes,)
        probability of each class.

    classes_ : ndarray of shape (n_classes,)
        class labels known to the classifier

    epsilon_ : float
        absolute additive value to variances

    sigma_ : ndarray of shape (n_classes, n_features)
        variance of each feature per class

    theta_ : ndarray of shape (n_classes, n_features)
        mean of each feature per class

    Examples
    --------
    >>> import numpy as np
    >>> X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
    >>> Y = np.array([1, 1, 1, 2, 2, 2])
    >>> from sklearn.naive_bayes import GaussianNB
    >>> clf = GaussianNB()
    >>> clf.fit(X, Y)
    GaussianNB()
    >>> print(clf.predict([[-0.8, -1]]))
    [1]
    >>> clf_pf = GaussianNB()
    >>> clf_pf.partial_fit(X, Y, np.unique(Y))
    GaussianNB()
    >>> print(clf_pf.predict([[-0.8, -1]]))
    [1]
    """
    @_deprecate_positional_args
    def __init__(self, *, priors=None, var_smoothing=1e-9):
        self.priors = priors
        self.var_smoothing = var_smoothing

    def fit(self, X, y, sample_weight=None):
        """Fit Gaussian Naive Bayes according to X, y

        Parameters
        ----------
        X : array-like of shape (n_samples, n_features)
            Training vectors, where n_samples is the number of samples
            and n_features is the number of features.

        y : array-like of shape (n_samples,)
            Target values.

        sample_weight : array-like of shape (n_samples,), default=None
            Weights applied to individual samples (1. for unweighted).

            .. versionadded:: 0.17
               Gaussian Naive Bayes supports fitting with *sample_weight*.

        Returns
        -------
        self : object
        """
        X, y = self._validate_data(X, y)
        y = column_or_1d(y, warn=True)
        return self._partial_fit(X, y, np.unique(y), _refit=True,
                                 sample_weight=sample_weight)

    def _check_X(self, X):
        return check_array(X)

    @staticmethod
    def _update_mean_variance(n_past, mu, var, X, sample_weight=None):
        """Compute online update of Gaussian mean and variance.

        Given starting sample count, mean, and variance, a new set of
        points X, and optionally sample weights, return the updated mean and
        variance. (NB - each dimension (column) in X is treated as independent
        -- you get variance, not covariance).

        Can take scalar mean and variance, or vector mean and variance to
        simultaneously update a number of independent Gaussians.

        See Stanford CS tech report STAN-CS-79-773 by Chan, Golub, and
        LeVeque:

        http://i.stanford.edu/pub/cstr/reports/cs/tr/79/773/CS-TR-79-773.pdf

        Parameters
        ----------
        n_past : int
            Number of samples represented in old mean and variance. If sample
            weights were given, this should contain the sum of sample
            weights represented in old mean and variance.

        mu : array-like of shape (number of Gaussians,)
            Means for Gaussians in original set.

        var : array-like of shape (number of Gaussians,)
            Variances for Gaussians in original set.

        sample_weight : array-like of shape (n_samples,), default=None
            Weights applied to individual samples (1. for unweighted).

        Returns
        -------
        total_mu : array-like of shape (number of Gaussians,)
            Updated mean for each Gaussian over the combined set.

        total_var : array-like of shape (number of Gaussians,)
            Updated variance for each Gaussian over the combined set.
        """
        if X.shape[0] == 0:
            return mu, var
        # Compute (potentially weighted) mean and variance of new datapoints
        if sample_weight is not None:
            n_new = float(sample_weight.sum())
            new_mu = np.average(X, axis=0, weights=sample_weight)
            new_var = np.average((X - new_mu) ** 2, axis=0,
                                 weights=sample_weight)
        else:
            n_new = X.shape[0]
            new_var = np.var(X, axis=0)
            new_mu = np.mean(X, axis=0)
        if n_past == 0:
            return new_mu, new_var
        n_total = float(n_past + n_new)
        # Combine mean of old and new data, taking into consideration
        # (weighted) number of observations
        total_mu = (n_new * new_mu + n_past * mu) / n_total
        # Combine variance of old and new data, taking into consideration
        # (weighted) number of observations. This is achieved by combining
        # the sum-of-squared-differences (ssd)
        old_ssd = n_past * var
        new_ssd = n_new * new_var
        total_ssd = old_ssd + new_ssd + (n_new * n_past / n_total) * (mu -
                                                                      new_mu) ** 2
        total_var = total_ssd / n_total
        return total_mu, total_var

    def partial_fit(self, X, y, classes=None, sample_weight=None):
        """Incremental fit on a batch of samples.

        This method is expected to be called several times consecutively
        on different chunks of a dataset so as to implement out-of-core
        or online learning.

        This is especially useful when the whole dataset is too big to fit in
        memory at once.

        This method has some performance and numerical stability overhead,
        hence it is better to call partial_fit on chunks of data that are
        as large as possible (as long as fitting in the memory budget) to
        hide the overhead.

        Parameters
        ----------
        X : array-like of shape (n_samples, n_features)
            Training vectors, where n_samples is the number of samples and
            n_features is the number of features.

        y : array-like of shape (n_samples,)
            Target values.

        classes : array-like of shape (n_classes,), default=None
            List of all the classes that can possibly appear in the y vector.

            Must be provided at the first call to partial_fit, can be omitted
            in subsequent calls.

        sample_weight : array-like of shape (n_samples,), default=None
            Weights applied to individual samples (1. for unweighted).

            .. versionadded:: 0.17

        Returns
        -------
        self : object
        """
        return self._partial_fit(X, y, classes, _refit=False,
                                 sample_weight=sample_weight)

    def _partial_fit(self, X, y, classes=None, _refit=False,
                     sample_weight=None):
        """Actual implementation of Gaussian NB fitting.

        Parameters
        ----------
        X : array-like of shape (n_samples, n_features)
            Training vectors, where n_samples is the number of samples and
            n_features is the number of features.

        y : array-like of shape (n_samples,)
            Target values.

        classes : array-like of shape (n_classes,), default=None
            List of all the classes that can possibly appear in the y vector.

            Must be provided at the first call to partial_fit, can be omitted
            in subsequent calls.

        _refit : bool, default=False
            If true, act as though this were the first time we called
            _partial_fit (ie, throw away any past fitting and start over).

        sample_weight : array-like of shape (n_samples,), default=None
            Weights applied to individual samples (1. for unweighted).

        Returns
        -------
        self : object
        """
        X, y = check_X_y(X, y)
        if sample_weight is not None:
            sample_weight = _check_sample_weight(sample_weight, X)
        # If the ratio of data variance between dimensions is too small, it
        # will cause numerical errors. To address this, we artificially
        # boost the variance by epsilon, a small fraction of the standard
        # deviation of the largest dimension.
        self.epsilon_ = self.var_smoothing * np.var(X, axis=0).max()
        if _refit:
            self.classes_ = None
        if _check_partial_fit_first_call(self, classes):
            # This is the first call to partial_fit:
            # initialize various cumulative counters
            n_features = X.shape[1]
            n_classes = len(self.classes_)
            self.theta_ = np.zeros((n_classes, n_features))
            self.sigma_ = np.zeros((n_classes, n_features))
            self.class_count_ = np.zeros(n_classes, dtype=np.float64)
            # Initialise the class prior
            # Take into account the priors
            if self.priors is not None:
                priors = np.asarray(self.priors)
                # Check that the provide prior match the number of classes
                if len(priors) != n_classes:
                    raise ValueError('Number of priors must match number of'
                                     ' classes.')
                # Check that the sum is 1
                if not np.isclose(priors.sum(), 1.0):
                    raise ValueError('The sum of the priors should be 1.')
                # Check that the prior are non-negative
                if (priors < 0).any():
                    raise ValueError('Priors must be non-negative.')
                self.class_prior_ = priors
            else:
                # Initialize the priors to zeros for each class
                self.class_prior_ = np.zeros(len(self.classes_),
                                             dtype=np.float64)
        else:
            if X.shape[1] != self.theta_.shape[1]:
                msg = "Number of features %d does not match previous data %d."
                raise ValueError(msg % (X.shape[1], self.theta_.shape[1]))
            # Put epsilon back in each time
            self.sigma_[:, :] -= self.epsilon_
        classes = self.classes_
        unique_y = np.unique(y)
        unique_y_in_classes = np.in1d(unique_y, classes)
        if not np.all(unique_y_in_classes):
            raise ValueError("The target label(s) %s in y do not exist in the "
                             "initial classes %s" %
                             (unique_y[~unique_y_in_classes], classes))
        for y_i in unique_y:
            i = classes.searchsorted(y_i)
            X_i = X[y == y_i, :]
            if sample_weight is not None:
                sw_i = sample_weight[y == y_i]
                N_i = sw_i.sum()
            else:
                sw_i = None
                N_i = X_i.shape[0]
            new_theta, new_sigma = self._update_mean_variance(
                self.class_count_[i], self.theta_[i, :], self.sigma_[i, :],
                X_i, sw_i)
            self.theta_[i, :] = new_theta
            self.sigma_[i, :] = new_sigma
            self.class_count_[i] += N_i

        self.sigma_[:, :] += self.epsilon_
        # Update if only no priors is provided
        if self.priors is None:
            # Empirical prior, with sample_weight taken into account
            self.class_prior_ = self.class_count_ / self.class_count_.sum()
        return self

    def _joint_log_likelihood(self, X):
        joint_log_likelihood = []
        for i in range(np.size(self.classes_)):
            jointi = np.log(self.class_prior_[i])
            n_ij = -0.5 * np.sum(np.log(2. * np.pi * self.sigma_[i, :]))
            n_ij -= 0.5 * np.sum(((X - self.theta_[i, :]) ** 2) /
                                 (self.sigma_[i, :]), 1)
            joint_log_likelihood.append(jointi + n_ij)

        joint_log_likelihood = np.array(joint_log_likelihood).T
        return joint_log_likelihood


class MultinomialNB Found at: sklearn.naive_bayes

class MultinomialNB(_BaseDiscreteNB):
    """
    Naive Bayes classifier for multinomial models

    The multinomial Naive Bayes classifier is suitable for classification with
    discrete features (e.g., word counts for text classification). The
    multinomial distribution normally requires integer feature counts. However,
    in practice, fractional counts such as tf-idf may also work.

    Read more in the :ref:`User Guide <multinomial_naive_bayes>`.

    Parameters
    ----------
    alpha : float, default=1.0
        Additive (Laplace/Lidstone) smoothing parameter
        (0 for no smoothing).

    fit_prior : bool, default=True
        Whether to learn class prior probabilities or not.
        If false, a uniform prior will be used.

    class_prior : array-like of shape (n_classes,), default=None
        Prior probabilities of the classes. If specified the priors are not
        adjusted according to the data.

    Attributes
    ----------
    class_count_ : ndarray of shape (n_classes,)
        Number of samples encountered for each class during fitting. This
        value is weighted by the sample weight when provided.

    class_log_prior_ : ndarray of shape (n_classes, )
        Smoothed empirical log probability for each class.

    classes_ : ndarray of shape (n_classes,)
        Class labels known to the classifier

    coef_ : ndarray of shape (n_classes, n_features)
        Mirrors ``feature_log_prob_`` for interpreting MultinomialNB
        as a linear model.

    feature_count_ : ndarray of shape (n_classes, n_features)
        Number of samples encountered for each (class, feature)
        during fitting. This value is weighted by the sample weight when
        provided.

    feature_log_prob_ : ndarray of shape (n_classes, n_features)
        Empirical log probability of features
        given a class, ``P(x_i|y)``.

    intercept_ : ndarray of shape (n_classes, )
        Mirrors ``class_log_prior_`` for interpreting MultinomialNB
        as a linear model.

    n_features_ : int
        Number of features of each sample.

    Examples
    --------
    >>> import numpy as np
    >>> rng = np.random.RandomState(1)
    >>> X = rng.randint(5, size=(6, 100))
    >>> y = np.array([1, 2, 3, 4, 5, 6])
    >>> from sklearn.naive_bayes import MultinomialNB
    >>> clf = MultinomialNB()
    >>> clf.fit(X, y)
    MultinomialNB()
    >>> print(clf.predict(X[2:3]))
    [3]

    Notes
    -----
    For the rationale behind the names `coef_` and `intercept_`, i.e.
    naive Bayes as a linear classifier, see J. Rennie et al. (2003),
    Tackling the poor assumptions of naive Bayes text classifiers, ICML.

    References
    ----------
    C.D. Manning, P. Raghavan and H. Schuetze (2008). Introduction to
    Information Retrieval. Cambridge University Press, pp. 234-265.
    https://nlp.stanford.edu/IR-book/html/htmledition/naive-bayes-text-classification-1.html
    """
    @_deprecate_positional_args
    def __init__(self, *, alpha=1.0, fit_prior=True, class_prior=None):
        self.alpha = alpha
        self.fit_prior = fit_prior
        self.class_prior = class_prior

    def _more_tags(self):
        return {'requires_positive_X': True}

    def _count(self, X, Y):
        """Count and smooth feature occurrences."""
        check_non_negative(X, "MultinomialNB (input X)")
        self.feature_count_ += safe_sparse_dot(Y.T, X)
        self.class_count_ += Y.sum(axis=0)

    def _update_feature_log_prob(self, alpha):
        """Apply smoothing to raw counts and recompute log probabilities"""
        smoothed_fc = self.feature_count_ + alpha
        smoothed_cc = smoothed_fc.sum(axis=1)
        self.feature_log_prob_ = (np.log(smoothed_fc) -
                                  np.log(smoothed_cc.reshape(-1, 1)))

    def _joint_log_likelihood(self, X):
        """Calculate the posterior log probability of the samples X"""
        return safe_sparse_dot(X, self.feature_log_prob_.T) + self.class_log_prior_


class BernoulliNB Found at: sklearn.naive_bayes

class BernoulliNB(_BaseDiscreteNB):
    """Naive Bayes classifier for multivariate Bernoulli models.

    Like MultinomialNB, this classifier is suitable for discrete data. The
    difference is that while MultinomialNB works with occurrence counts,
    BernoulliNB is designed for binary/boolean features.

    Read more in the :ref:`User Guide <bernoulli_naive_bayes>`.

    Parameters
    ----------
    alpha : float, default=1.0
        Additive (Laplace/Lidstone) smoothing parameter
        (0 for no smoothing).

    binarize : float or None, default=0.0
        Threshold for binarizing (mapping to booleans) of sample features.
        If None, input is presumed to already consist of binary vectors.

    fit_prior : bool, default=True
        Whether to learn class prior probabilities or not.
        If false, a uniform prior will be used.

    class_prior : array-like of shape (n_classes,), default=None
        Prior probabilities of the classes. If specified the priors are not
        adjusted according to the data.

    Attributes
    ----------
    class_count_ : ndarray of shape (n_classes)
        Number of samples encountered for each class during fitting. This
        value is weighted by the sample weight when provided.

    class_log_prior_ : ndarray of shape (n_classes)
        Log probability of each class (smoothed).

    classes_ : ndarray of shape (n_classes,)
        Class labels known to the classifier

    feature_count_ : ndarray of shape (n_classes, n_features)
        Number of samples encountered for each (class, feature)
        during fitting. This value is weighted by the sample weight when
        provided.

    feature_log_prob_ : ndarray of shape (n_classes, n_features)
        Empirical log probability of features given a class, P(x_i|y).

    n_features_ : int
        Number of features of each sample.

    Examples
    --------
    >>> import numpy as np
    >>> rng = np.random.RandomState(1)
    >>> X = rng.randint(5, size=(6, 100))
    >>> Y = np.array([1, 2, 3, 4, 4, 5])
    >>> from sklearn.naive_bayes import BernoulliNB
    >>> clf = BernoulliNB()
    >>> clf.fit(X, Y)
    BernoulliNB()
    >>> print(clf.predict(X[2:3]))
    [3]

    References
    ----------
    C.D. Manning, P. Raghavan and H. Schuetze (2008). Introduction to
    Information Retrieval. Cambridge University Press, pp. 234-265.
    https://nlp.stanford.edu/IR-book/html/htmledition/the-bernoulli-model-1.html

    A. McCallum and K. Nigam (1998). A comparison of event models for naive
    Bayes text classification. Proc. AAAI/ICML-98 Workshop on Learning for
    Text Categorization, pp. 41-48.

    V. Metsis, I. Androutsopoulos and G. Paliouras (2006). Spam filtering with
    naive Bayes -- Which naive Bayes? 3rd Conf. on Email and Anti-Spam (CEAS).
    """
    @_deprecate_positional_args
    def __init__(self, *, alpha=1.0, binarize=.0, fit_prior=True,
                 class_prior=None):
        self.alpha = alpha
        self.binarize = binarize
        self.fit_prior = fit_prior
        self.class_prior = class_prior

    def _check_X(self, X):
        X = super()._check_X(X)
        if self.binarize is not None:
            X = binarize(X, threshold=self.binarize)
        return X

    def _check_X_y(self, X, y):
        X, y = super()._check_X_y(X, y)
        if self.binarize is not None:
            X = binarize(X, threshold=self.binarize)
        return X, y

    def _count(self, X, Y):
        """Count and smooth feature occurrences."""
        self.feature_count_ += safe_sparse_dot(Y.T, X)
        self.class_count_ += Y.sum(axis=0)

    def _update_feature_log_prob(self, alpha):
        """Apply smoothing to raw counts and recompute log
        probabilities"""
        smoothed_fc = self.feature_count_ + alpha
        smoothed_cc = self.class_count_ + alpha * 2
        self.feature_log_prob_ = (np.log(smoothed_fc) -
                                  np.log(smoothed_cc.reshape(-1, 1)))

    def _joint_log_likelihood(self, X):
        """Calculate the posterior log probability of the samples X"""
        n_classes, n_features = self.feature_log_prob_.shape
        n_samples, n_features_X = X.shape
        if n_features_X != n_features:
            raise ValueError(
                "Expected input with %d features, got %d instead" %
                (n_features, n_features_X))
        neg_prob = np.log(1 - np.exp(self.feature_log_prob_))
        # Compute  neg_prob · (1 - X).T  as  ∑neg_prob - X · neg_prob
        jll = safe_sparse_dot(X, (self.feature_log_prob_ - neg_prob).T)
        jll += self.class_log_prior_ + neg_prob.sum(axis=1)
        return jll
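A quick way to see the three event models side by side is to fit them on the same integer count matrix, as the docstring examples above do. The toy snippet below is illustrative only and unrelated to the news data. Note that BernoulliNB binarizes its input at threshold 0.0 by default, discarding the count/TF-IDF magnitudes; that is one plausible reason its score (0.2968) trails the other variants in the results above.

# Toy comparison of the three naive Bayes event models on one count matrix,
# mirroring the docstring examples above (not the article's news data).
import numpy as np
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB

rng = np.random.RandomState(1)
X = rng.randint(5, size=(6, 100))   # word-count-like features
y = np.array([1, 2, 3, 4, 5, 6])

for clf in (GaussianNB(), MultinomialNB(), BernoulliNB()):
    clf.fit(X, y)
    print(type(clf).__name__, clf.predict(X[2:3]))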