ML NB: Text-classification prediction on the news dataset with pure statistics, kNN, naive Bayes (Gaussian / multivariate Bernoulli / multinomial), linear discriminant analysis (LDA), and perceptron algorithms

Summary: Text-classification prediction on the news dataset with pure statistics, kNN, naive Bayes (Gaussian / multivariate Bernoulli / multinomial), linear discriminant analysis (LDA), and perceptron algorithms.


Contents

Text-classification prediction on the news dataset with pure statistics, kNN, naive Bayes (Gaussian / multivariate Bernoulli / multinomial), LDA, and the perceptron

Design approach

Output

Core code



Text-classification prediction on the news dataset with pure statistics, kNN, naive Bayes (Gaussian / multivariate Bernoulli / multinomial), LDA, and the perceptron

Design approach

Output

Dataset used in the code: the news dataset (news.csv) from the common NLP datasets for machine learning, together with the jieba_dict dictionary, stop-word lists, and related files (available as a CSDN download).
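The log below was produced by a pipeline along the following lines. As a minimal loading sketch (the file name news.csv comes from the download link above; everything else is inferred from the df.info() printout that follows):

import pandas as pd

# Load the news dataset; the CSV is assumed to sit next to the script.
data_frame = pd.read_csv('news.csv')

# Schema check: 1293 rows with content/id/tags/time/title columns.
print(data_frame.info())
print(data_frame[['id', 'tags', 'time', 'title']].head())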

1. F:\Program Files\Python\Python36\lib\site-packages\gensim\utils.py:1209: UserWarning: detected Windows; aliasing chunkize to chunkize_serial
2.   warnings.warn("detected Windows; aliasing chunkize to chunkize_serial")
3. <class 'pandas.core.frame.DataFrame'>
4. RangeIndex: 1293 entries, 0 to 1292
5. Data columns (total 6 columns):
6. #   Column      Non-Null Count  Dtype 
7. ---  ------      --------------  ----- 
8. 0   Unnamed: 0  1293 non-null   int64 
9. 1   content     1292 non-null   object
10. 2   id          1293 non-null   int64 
11. 3   tags        1293 non-null   object
12. 4   time        1293 non-null   object
13. 5   title       1293 non-null   object
14. dtypes: int64(2), object(4)
15. memory usage: 60.7+ KB
16. None
17. 
18. id                                  tags  \
19. 0  6428905748545732865   ['财经', '白洋淀', '城市规划', '徐匡迪', '太行山']   
20. 1  6428954136200855810   ['财经', '碧桂园', '万科集团', '投资', '广州恒大']   
21. 2  6420576443738784002    ['财经', '自行车', '凤凰', '王朝阳', '汽车展览']   
22. 3  6429007290541031681  ['财经', '银行', '工商银行', '兴业银行', '交通银行']   
23. 4  6397481672254619905     ['财经', '小吃', '装修', '市场营销', '手工艺']   
24. 
25.                   time                   title  
26. 0  2017-06-07 22:52:55  雄安新区规划“骨架”敲定,方案有望9月底出炉  
27. 1  2017-06-08 08:01:13       “红五月”不红 房企资金链压力攀升  
28. 2  2017-05-16 12:03:00      凤凰自行车总裁:共享单车把我们打懵了  
29. 3  2017-06-08 07:00:00    25家银行分红季派出3536亿“大红包”  
30. 4  2017-03-15 07:03:22      五万以下的小本餐饮项目,卷饼赚钱最稳  
31. chinese_pattern re.compile('[\\u4e00-\\u9fff]+')
32. Building prefix dict from F:\File_Jupyter\实用代码\naive_bayes(简单贝叶斯)\jieba_dict\dict.txt.big ...
33. Loading model from cache C:\Users\niu\AppData\Local\Temp\jieba.ue3752d4e13420d2dc6b66831a5a4ab13.cache
34. Loading model cost 1.326 seconds.
35. Prefix dict has been built succesfully.
36. dictionary
37. <class 'gensim.corpora.dictionary.Dictionary'> Dictionary(46351 unique tokens: ['一个', '一个个', '一举一动', '一些', '一体']...)
38. <class 'method'> <bound method Dictionary.doc2bow of <gensim.corpora.dictionary.Dictionary object at 0x000001BDC62291D0>>
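The segmentation and dictionary-building steps reported above can be sketched as follows. This is a hedged reconstruction: the chinese_pattern regex and the dict.txt.big path appear in the log, while cut_words, the stop-word handling, and the column names are assumptions.

import re
import jieba
from gensim import corpora

# Keep only CJK characters before segmenting (the pattern printed in the log).
chinese_pattern = re.compile(r'[\u4e00-\u9fff]+')
jieba.set_dictionary('jieba_dict/dict.txt.big')   # big dictionary from the log

def cut_words(text, stop_words=frozenset()):
    # Segment one article into tokens, dropping stop words.
    chunks = chinese_pattern.findall(str(text))
    return [w for chunk in chunks for w in jieba.cut(chunk)
            if w not in stop_words]

# data_frame as loaded in the sketch above.
data_frame['doc_words'] = data_frame['content'].apply(cut_words)

# Token ids and bag-of-words vectors, as in the Dictionary printout above.
dictionary = corpora.Dictionary(data_frame['doc_words'])
data_frame['corpus'] = data_frame['doc_words'].apply(dictionary.doc2bow)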
39. F:\Program Files\Python\Python36\lib\site-packages\numpy\core\_asarray.py:83: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray
40. return array(a, dtype, copy=False, order=order)
41. 
42. 
43.                                               corpus  \
44. 0  [(0, 6), (1, 1), (2, 1), (3, 3), (4, 2), (5, 2...   
45. 1  [(0, 1), (3, 3), (13, 1), (17, 1), (41, 1), (5...   
46. 2  [(15, 1), (53, 1), (167, 1), (262, 1), (396, 1...   
47. 
48.                                                tfidf  
49. 0  [(0, 0.005554342859788116), (1, 0.007470250835...  
50. 1  [(0, 0.002081356679198299), (3, 0.012288034179...  
51. 2  [(15, 0.057457146244872616), (53, 0.0543395377...  
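The tfidf column above is the TF-IDF re-weighting of each bag-of-words vector; a one-step sketch with gensim (the model variable name is an assumption):

from gensim import models

# Fit TF-IDF on the whole bag-of-words corpus, then re-weight each document.
tfidf_model = models.TfidfModel(list(data_frame['corpus']))
data_frame['tfidf'] = data_frame['corpus'].apply(lambda bow: tfidf_model[bow])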
52. after abs 4.7683716e-07
53. foo: (1293, 1293)
54. dis2TSNE_Visual:  (1293, 2)
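The foo: (1293, 1293) and dis2TSNE_Visual: (1293, 2) lines suggest a document-by-document similarity matrix that is then embedded in two dimensions for plotting. One plausible reading, offered as a sketch only:

import numpy as np
from gensim import similarities
from sklearn.manifold import TSNE

# Pairwise cosine similarities between all 1293 TF-IDF vectors.
index = similarities.MatrixSimilarity(list(data_frame['tfidf']),
                                      num_features=len(dictionary))
sim_matrix = np.array([index[doc] for doc in data_frame['tfidf']])

# Turn similarity into a distance-like quantity (the 'after abs' line) and
# project down to two coordinates for the scatter plot.
distances = np.abs(1.0 - sim_matrix)
coords = TSNE(n_components=2).fit_transform(distances)
data_frame['visual01'], data_frame['visual02'] = coords[:, 0], coords[:, 1]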
55. {'养生': 0, '科技': 1, '财经': 2, '游戏': 3, '育儿': 4, '汽车': 5}
56. data_frame.keyword_index: 1    379
57. 2    287
58. 5    283
59. 4    148
60. 3    141
61. 0     55
62. Name: keyword_index, dtype: int64
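Labels come from the first entry of each article's tag list, mapped through the six-category dictionary printed above. A sketch (parsing the stringified tag lists with ast is an assumption):

import ast

keyword_index = {'养生': 0, '科技': 1, '财经': 2, '游戏': 3, '育儿': 4, '汽车': 5}

# First tag of each article -> integer class label, as counted above.
data_frame['keyword_index'] = data_frame['tags'].apply(
    lambda t: keyword_index[ast.literal_eval(t)[0]])
print(data_frame['keyword_index'].value_counts())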
63. 
64. id                                 tags  \
65. 0  6428905748545732865  ['财经', '白洋淀', '城市规划', '徐匡迪', '太行山']   
66. 1  6428954136200855810  ['财经', '碧桂园', '万科集团', '投资', '广州恒大']   
67. 2  6420576443738784002   ['财经', '自行车', '凤凰', '王朝阳', '汽车展览']   
68. 
69. 
70.                                            doc_words  \
71. 0  [牵动人心, 雄安, 新区, 规划, 细节, 内容, 出台, 时间表, 敲定, 日前, 北京...   
72. 1  [去年, 以来, 多个, 城市, 先后, 发布, 多项, 楼市, 调控, 政策, 限购, 限...   
73. 2  [今年, 中国, 国际, 自行车, 展上, 上海, 凤凰, 自行车, 总裁, 王, 朝阳, ...   
74. 
75.                                               corpus  \
76. 0  [(0, 6), (1, 1), (2, 1), (3, 3), (4, 2), (5, 2...   
77. 1  [(0, 1), (3, 3), (13, 1), (17, 1), (41, 1), (5...   
78. 2  [(15, 1), (53, 1), (167, 1), (262, 1), (396, 1...   
79. 
80.                                                tfidf   visual01   visual02  \
81. 0  [(0, 0.005554342859788116), (1, 0.007470250835... -65.903542 -14.433964
82. 1  [(0, 0.002081356679198299), (3, 0.012288034179... -29.659267 -14.811647
83. 2  [(15, 0.057457146244872616), (53, 0.0543395377... -22.118195 -48.148167
84. 
85.    keyword_index  
86. 0              2
87. 1              2
88. 2              2
89. Childcare,label_category_ID_pos.tfidf)[:20]: ['孩子', '家长', '教育', '学习', '男孩子', '成绩', '爸爸', '分享', '帮助', '方法', '小学', '数学', '交流', '男孩', '妈妈', '成长', '父母', '懂', '免费', '翼航']
90. Childcare,label_category_ID_neg.tfidf)[:20]: []
91. train_index MatrixSimilarity<646 docs, 46329 features>
92. hot_words shape: 6 300
93. {0: {1536, 7681, 17410, 17411, 17415, 6664, 17420, 15886, 4623, 17935, 4625, 5139, 4631, 17916, 17437, 544, 16422, 5671, 1065, 4650, 4651, 4653, 4690, 16943, 4657, 17458, 15921, 51, 7222, 17464, 17465, 10299, 15932, 64, 6209, 66, 17474, 4680, 8264, 8266, 40008, 6730, 8273, 6738, 5203, 5206, 18005, 15958, 597, 15960, 18009, 7258, 4697, 7260, 16989, 3674, 91, 87, 16993, 18020, 616, 4714, 5228, 40044, 1646, 4720, 3185, 15986, 34928, 5236, 113, 34936, 6777, 126, 15999, 127, 4737, 40067, 5252, 643, 4739, 13444, 8840, 1157, 133, 4749, 3219, 10388, 17562, 5278, 46239, 5287, 3751, 167, 680, 6827, 4784, 16048, 16050, 180, 46260, 16054, 6839, 4792, 2743, 4789, 17083, 16060, 4790, 16062, 43200, 5315, 46276, 46279, 17098, 6860, 5836, 16081, 43219, 1237, 1750, 15575, 8921, 2266, 6877, 12511, 12512, 21216, 226, 4834, 6884, 16101, 4838, 742, 2280, 2281, 227, 7915, 6886, 6893, 2798, 6894, 5870, 4849, 242, 1779, 4852, 21215, 44791, 4864, 3329, 258, 4865, 4866, 44805, 4877, 21264, 4882, 274, 8986, 8987, 796, 32029, 4382, 21277, 4896, 1825, 801, 3363, 36644, 1830, 4393, 36138, 303, 815, 4401, 12594, 21299, 7986, 820, 310, 1337, 21307, 4411, 317, 33598, 5953, 17730, 5954, 10050, 17733, 17734, 25927, 21320, 17739, 4939, 21324, 4942, 33615, 6885, 16210, 6071, 18261, 5976, 860, 16740, 16745, 2922, 4969, 17263, 6512, 33649, 16242, 2419, 17775, 373, 1398, 880, 1916, 17276, 16255, 1920, 43394, 3974, 4999, 396, 8080, 16788, 18325, 1942, 16279, 1433, 43418, 36252, 17311, 43425, 16802, 7585, 15959, 7594, 36268, 4525, 7597, 5551, 6063, 36272, 36275, 4533, 16309, 18358, 36280, 1465, 441, 7611, 16825, 16829, 4538, 2488, 2495, 8129, 4545, 4547, 16836, 4549, 7621, 1484, 1997, 11214, 1999, 16846, 16847, 4563, 7636, 14293, 7638, 4567, 16855, 17369, 16861, 478, 16351, 18400, 17377, 993, 9699, 5085, 6111, 7645, 6119, 6124, 17903, 1011, 4597, 6646, 16376, 6138, 16891, 16892, 7165, 4606}, 1: {0, 3, 11785, 2569, 32779, 9227, 526, 21519, 530, 4116, 533, 11805, 2590, 2591, 3105, 7203, 1571, 8740, 1574, 12836, 1062, 1577, 2553, 4654, 1071, 2094, 30257, 51, 30260, 53, 28213, 24633, 1082, 1087, 68, 8779, 78, 12367, 11859, 2647, 91, 13916, 13917, 15455, 608, 9825, 1634, 12387, 13412, 613, 12391, 28267, 12396, 109, 9836, 12399, 11884, 12401, 12400, 12403, 627, 117, 629, 9847, 628, 17020, 637, 9855, 639, 12418, 643, 1668, 133, 3715, 14470, 1160, 12424, 11912, 9867, 33420, 10376, 655, 12433, 148, 150, 3735, 1176, 12440, 154, 21659, 1180, 3742, 10399, 11936, 1185, 31904, 675, 13472, 167, 1704, 7337, 11946, 171, 172, 8876, 8878, 2734, 1200, 1709, 2226, 8877, 180, 1155, 697, 12475, 189, 8894, 1215, 1218, 4291, 708, 709, 3271, 2760, 6354, 2771, 1748, 213, 3798, 727, 730, 20187, 44767, 225, 2786, 2787, 13028, 1765, 1254, 13543, 26344, 740, 11497, 1771, 3819, 13549, 11502, 751, 1775, 752, 242, 21743, 12524, 759, 11511, 2809, 2812, 35581, 257, 8962, 771, 259, 15623, 1288, 3849, 12048, 1810, 786, 788, 3862, 793, 7450, 798, 24862, 7458, 12579, 31524, 31523, 7459, 1322, 810, 25391, 12081, 1329, 820, 3386, 1850, 9023, 319, 835, 9029, 325, 4424, 330, 12107, 13134, 846, 3409, 3924, 1878, 854, 344, 11609, 5978, 1883, 11612, 343, 11615, 358, 4457, 362, 875, 1385, 1900, 4462, 3439, 12144, 369, 3438, 1396, 38773, 28025, 2428, 13305, 13183, 12161, 12674, 1922, 34690, 2438, 1926, 13193, 907, 9100, 911, 13204, 1431, 10135, 2456, 44956, 925, 413, 32670, 1952, 928, 23455, 5540, 1956, 1447, 12200, 1448, 1452, 8109, 12205, 1965, 9651, 2486, 5559, 1464, 956, 1982, 959, 3522, 12235, 976, 3025, 10194, 1491, 12244, 465, 30675, 5585, 472, 470, 10714, 475, 3027, 
478, 1503, 479, 5089, 483, 2532, 995, 9190, 5607, 1512, 1513, 9703, 10728, 494, 1518, 1520, 2545, 1007, 1524, 501, 503, 1017, 1534}, 2: {0, 3, 520, 1547, 12300, 2062, 3599, 1040, 26641, 18, 25616, 2577, 13846, 2583, 4121, 25114, 1051, 1052, 25629, 1054, 1567, 2591, 3105, 3616, 4126, 1060, 4125, 1062, 1063, 26663, 1577, 13863, 1066, 1580, 45, 1071, 51, 3123, 53, 2614, 3125, 1082, 2622, 66, 2627, 11843, 1093, 1606, 1605, 3651, 3146, 1100, 26701, 1614, 1102, 592, 3577, 35410, 2639, 2644, 3159, 25688, 1626, 91, 3162, 1119, 608, 21089, 1634, 102, 2662, 31848, 2665, 11881, 27242, 12907, 1131, 1132, 15388, 2672, 3185, 1138, 627, 43124, 2675, 113, 1657, 2682, 3194, 127, 3715, 1668, 133, 3717, 135, 2696, 3209, 1162, 1158, 1676, 2701, 11916, 1167, 138, 1169, 148, 2710, 1174, 152, 1177, 22167, 26779, 21659, 157, 158, 1183, 30880, 1185, 26784, 2209, 2724, 3232, 672, 167, 4256, 8876, 685, 4269, 1202, 2226, 691, 1205, 3253, 1207, 2231, 2242, 4291, 14026, 27340, 1740, 1231, 14032, 24273, 3284, 1749, 213, 727, 217, 730, 2266, 14044, 1246, 1248, 225, 1254, 742, 745, 3819, 14060, 12013, 750, 1775, 242, 1780, 1268, 759, 760, 249, 33536, 1281, 261, 262, 2311, 1290, 267, 37132, 5902, 1810, 7958, 39191, 280, 793, 43813, 1318, 807, 295, 45354, 1324, 28461, 1838, 28462, 815, 1329, 820, 1333, 317, 2366, 39743, 832, 2365, 45378, 835, 330, 1356, 845, 334, 1359, 4433, 4438, 854, 14168, 1370, 1883, 1372, 1371, 860, 863, 3935, 3937, 1378, 11618, 3426, 870, 358, 3942, 361, 874, 362, 875, 28010, 3438, 2416, 369, 880, 14196, 886, 4472, 1403, 894, 895, 2432, 385, 904, 905, 27528, 907, 909, 911, 1431, 409, 1433, 925, 1950, 415, 928, 413, 13731, 3494, 20902, 937, 1452, 942, 1968, 1973, 1464, 1977, 956, 34240, 3009, 32706, 14278, 3015, 456, 1993, 973, 975, 976, 465, 466, 1491, 14290, 2512, 1494, 472, 475, 480, 3554, 995, 2532, 3048, 1513, 23529, 3564, 494, 498, 500, 501, 503, 1017, 3070}, 3: {1536, 0, 10242, 3, 37889, 1029, 10248, 2569, 9740, 9745, 10770, 17938, 2577, 10257, 9238, 3094, 9752, 9751, 9754, 30235, 9243, 18425, 9246, 2590, 24096, 9249, 9250, 9251, 4643, 10272, 9252, 5666, 3616, 3625, 4133, 4136, 1071, 9264, 4657, 51, 9267, 22583, 10808, 40504, 10304, 6210, 3650, 37444, 68, 9284, 6731, 9293, 31823, 2133, 9303, 601, 91, 43615, 608, 9314, 10338, 25709, 1646, 10349, 6257, 7794, 27763, 11381, 9337, 7801, 637, 3709, 639, 11391, 9345, 7299, 3715, 1668, 41606, 11401, 11402, 4233, 9868, 10893, 142, 5259, 9872, 25744, 25741, 148, 10389, 34455, 3735, 8345, 8857, 154, 10396, 1178, 7839, 10399, 8554, 1704, 10409, 9900, 10412, 2734, 14512, 10416, 7858, 9394, 9904, 6325, 2232, 1721, 38589, 8894, 6336, 1220, 9925, 11461, 3271, 9420, 719, 14544, 2773, 3286, 3287, 214, 20187, 9438, 26335, 6048, 13534, 226, 3811, 19172, 1766, 2280, 36585, 14575, 2801, 9457, 10993, 10485, 23797, 759, 27896, 5882, 8443, 23803, 1790, 767, 8962, 9476, 7433, 6924, 2316, 2318, 3853, 14608, 4371, 9494, 8983, 6425, 793, 362, 6433, 7458, 2339, 810, 1835, 8493, 6447, 1329, 28466, 44855, 9527, 1338, 10044, 317, 3390, 10047, 41280, 31554, 2372, 9029, 11592, 9547, 3916, 9042, 10066, 3925, 343, 10072, 5978, 860, 8030, 10079, 10593, 9572, 2916, 9061, 3430, 6501, 4969, 10089, 30571, 10603, 11117, 9582, 10607, 6505, 14193, 28529, 14707, 7197, 369, 11639, 23929, 894, 1919, 3459, 11652, 2438, 10631, 907, 10642, 9109, 2454, 14743, 2456, 29594, 11164, 6559, 9631, 3999, 1951, 14754, 14756, 31653, 9638, 31654, 33704, 45984, 3500, 31661, 1453, 1455, 9645, 9649, 41394, 9651, 9652, 10165, 30718, 2999, 31672, 1982, 9662, 44483, 11205, 2505, 5581, 10704, 465, 977, 31699, 
9172, 4053, 9174, 31703, 4567, 470, 10714, 475, 5076, 478, 480, 23008, 9186, 30692, 9190, 9703, 10216, 491, 30699, 1005, 2542, 31726, 1007, 494, 25586, 10222, 18417, 10736, 8178, 3064, 1529, 509, 1534}, 4: {0, 5121, 4098, 3, 3078, 7175, 1543, 1545, 22027, 5131, 14, 4623, 4625, 22547, 533, 2588, 2590, 1570, 4643, 2597, 5669, 5159, 6183, 2602, 45, 6702, 18937, 5168, 5169, 48, 4657, 3063, 51, 1590, 12343, 5686, 5689, 2105, 1586, 5175, 5694, 6721, 68, 2630, 29767, 29778, 4692, 2133, 5204, 6740, 601, 7258, 91, 5722, 5214, 4703, 608, 3679, 2143, 101, 6758, 5224, 616, 7277, 2158, 4723, 5236, 6267, 1660, 637, 639, 4737, 4739, 5252, 133, 1668, 4606, 23688, 5768, 17035, 2188, 5772, 38034, 5779, 3220, 6805, 2199, 1688, 5273, 154, 155, 1694, 4767, 5280, 5278, 5284, 1191, 1704, 167, 3754, 5802, 5290, 3751, 3247, 5296, 3257, 5818, 5823, 3265, 708, 5318, 5830, 4294, 1738, 5841, 5330, 4825, 4316, 734, 6369, 5349, 4838, 4326, 2280, 4329, 46315, 6380, 29660, 44269, 5871, 5873, 242, 7927, 759, 760, 2812, 1277, 8448, 3329, 4866, 2304, 4869, 5382, 7430, 3848, 3339, 2318, 782, 3857, 5906, 26513, 788, 2841, 7450, 4382, 1825, 7458, 801, 37156, 4393, 810, 7979, 3886, 815, 4911, 4401, 7986, 1329, 820, 5942, 3896, 8506, 2874, 317, 5441, 835, 5445, 5958, 6578, 5964, 5965, 4942, 8016, 8024, 344, 4952, 860, 1884, 29533, 8545, 8037, 3430, 6504, 7017, 2922, 4457, 362, 5998, 2928, 373, 374, 2935, 1398, 8057, 6011, 6015, 32127, 384, 4994, 8579, 4996, 8072, 396, 6541, 5006, 6540, 5009, 1938, 1427, 7571, 2965, 1942, 6039, 1940, 7574, 2970, 409, 7068, 7575, 8606, 5014, 5018, 7585, 5017, 6561, 7588, 1447, 3497, 6058, 5547, 1965, 6065, 4529, 21939, 4531, 6069, 5043, 5559, 7096, 1465, 6074, 3515, 4533, 6077, 5054, 7103, 448, 6080, 6076, 4547, 8132, 4552, 4555, 1484, 39372, 39374, 4561, 6611, 5078, 470, 1496, 5081, 472, 7131, 4572, 7133, 5598, 5086, 4576, 4577, 6111, 478, 4580, 1508, 480, 1503, 5096, 1506, 4584, 23019, 493, 494, 498, 5108, 18935, 1529, 6138, 7163, 10238, 5119}, 5: {0, 14849, 512, 3, 11266, 14853, 2053, 23047, 1527, 2569, 15370, 14861, 13, 19471, 2577, 11793, 14867, 18423, 533, 15384, 14875, 15388, 11807, 15396, 4132, 1574, 14890, 14893, 14896, 14897, 1586, 51, 1590, 14911, 1088, 15429, 14406, 23111, 16968, 14921, 14925, 16461, 14929, 15442, 8789, 14934, 2647, 3161, 7770, 91, 14940, 9308, 14937, 14943, 608, 6755, 1124, 13924, 14950, 5219, 14947, 9325, 3697, 14961, 11893, 14968, 12408, 15485, 637, 5247, 1668, 1157, 23172, 647, 15492, 15498, 5773, 19087, 13969, 9362, 15506, 1681, 148, 11926, 1176, 2713, 155, 1180, 15517, 1692, 20124, 10401, 19105, 675, 674, 19109, 167, 1704, 11946, 15019, 12458, 1709, 682, 9091, 2224, 15025, 20656, 176, 180, 7858, 12982, 15031, 15543, 41136, 14013, 2239, 1729, 708, 9413, 21700, 712, 15562, 15051, 2765, 15057, 15061, 9942, 15063, 21718, 22747, 15068, 15069, 32475, 13535, 15583, 15074, 227, 19683, 2789, 1766, 13542, 13036, 2799, 752, 3312, 13552, 242, 26867, 1268, 15618, 759, 2809, 763, 28924, 2812, 10495, 2817, 2818, 14083, 769, 259, 15622, 2823, 1288, 8962, 15109, 19720, 15629, 19213, 3345, 786, 788, 280, 25375, 2337, 15650, 804, 15653, 3366, 807, 2349, 15151, 7984, 1329, 21810, 820, 12602, 1338, 317, 11582, 5953, 2370, 835, 323, 15688, 1864, 15693, 854, 13142, 344, 15705, 4955, 860, 23899, 11615, 863, 15199, 15711, 13155, 15205, 872, 4457, 15722, 362, 15724, 875, 3438, 15215, 369, 883, 19828, 24437, 374, 29179, 9593, 19834, 15227, 894, 19326, 13186, 35203, 2436, 15749, 389, 19847, 15750, 19849, 2438, 1922, 6028, 909, 15752, 2446, 13200, 2448, 409, 21923, 9644, 14766, 22959, 14771, 
23989, 12728, 9145, 14778, 14779, 3000, 12733, 7102, 3007, 9665, 14786, 12226, 2498, 14789, 8645, 15301, 15305, 15818, 461, 976, 5585, 977, 1489, 15358, 472, 1496, 42457, 2524, 478, 19422, 480, 15330, 15843, 20452, 26084, 6631, 14827, 492, 15343, 3571, 14836, 15348, 19446, 14839, 11765, 1017, 14843, 14844, 14846}}
94. word_bagNum shape: 6 50
95. {0: [1536, 7681, 17410, 17411, 17415, 6664, 17420, 15886, 4623, 17935, 4625, 5139, 4631, 17916, 17437, 544, 16422, 5671, 1065, 4650, 4651, 4653, 4690, 16943, 4657, 17458, 15921, 51, 7222, 17464, 17465, 10299, 15932, 64, 6209, 66, 17474, 4680, 8264, 8266, 40008, 6730, 8273, 6738, 5203, 5206, 18005, 15958, 597, 15960], 1: [0, 3, 11785, 2569, 32779, 9227, 526, 21519, 530, 4116, 533, 11805, 2590, 2591, 3105, 7203, 1571, 8740, 1574, 12836, 1062, 1577, 2553, 4654, 1071, 2094, 30257, 51, 30260, 53, 28213, 24633, 1082, 1087, 68, 8779, 78, 12367, 11859, 2647, 91, 13916, 13917, 15455, 608, 9825, 1634, 12387, 13412, 613], 2: [0, 3, 520, 1547, 12300, 2062, 3599, 1040, 26641, 18, 25616, 2577, 13846, 2583, 4121, 25114, 1051, 1052, 25629, 1054, 1567, 2591, 3105, 3616, 4126, 1060, 4125, 1062, 1063, 26663, 1577, 13863, 1066, 1580, 45, 1071, 51, 3123, 53, 2614, 3125, 1082, 2622, 66, 2627, 11843, 1093, 1606, 1605, 3651], 3: [1536, 0, 10242, 3, 37889, 1029, 10248, 2569, 9740, 9745, 10770, 17938, 2577, 10257, 9238, 3094, 9752, 9751, 9754, 30235, 9243, 18425, 9246, 2590, 24096, 9249, 9250, 9251, 4643, 10272, 9252, 5666, 3616, 3625, 4133, 4136, 1071, 9264, 4657, 51, 9267, 22583, 10808, 40504, 10304, 6210, 3650, 37444, 68, 9284], 4: [0, 5121, 4098, 3, 3078, 7175, 1543, 1545, 22027, 5131, 14, 4623, 4625, 22547, 533, 2588, 2590, 1570, 4643, 2597, 5669, 5159, 6183, 2602, 45, 6702, 18937, 5168, 5169, 48, 4657, 3063, 51, 1590, 12343, 5686, 5689, 2105, 1586, 5175, 5694, 6721, 68, 2630, 29767, 29778, 4692, 2133, 5204, 6740], 5: [0, 14849, 512, 3, 11266, 14853, 2053, 23047, 1527, 2569, 15370, 14861, 13, 19471, 2577, 11793, 14867, 18423, 533, 15384, 14875, 15388, 11807, 15396, 4132, 1574, 14890, 14893, 14896, 14897, 1586, 51, 1590, 14911, 1088, 15429, 14406, 23111, 16968, 14921, 14925, 16461, 14929, 15442, 8789, 14934, 2647, 3161, 7770, 91]}
96. after all_words, word_bag shape: 6 300
97. {0: [1536, 7681, 17410, 17411, 17415, 6664, 17420, 15886, 4623, 17935, 4625, 5139, 4631, 17916, 17437, 544, 16422, 5671, 1065, 4650, 4651, 4653, 4690, 16943, 4657, 17458, 15921, 51, 7222, 17464, 17465, 10299, 15932, 64, 6209, 66, 17474, 4680, 8264, 8266, 40008, 6730, 8273, 6738, 5203, 5206, 18005, 15958, 597, 15960, 0, 3, 11785, 2569, 32779, 9227, 526, 21519, 530, 4116, 533, 11805, 2590, 2591, 3105, 7203, 1571, 8740, 1574, 12836, 1062, 1577, 2553, 4654, 1071, 2094, 30257, 51, 30260, 53, 28213, 24633, 1082, 1087, 68, 8779, 78, 12367, 11859, 2647, 91, 13916, 13917, 15455, 608, 9825, 1634, 12387, 13412, 613, 0, 3, 520, 1547, 12300, 2062, 3599, 1040, 26641, 18, 25616, 2577, 13846, 2583, 4121, 25114, 1051, 1052, 25629, 1054, 1567, 2591, 3105, 3616, 4126, 1060, 4125, 1062, 1063, 26663, 1577, 13863, 1066, 1580, 45, 1071, 51, 3123, 53, 2614, 3125, 1082, 2622, 66, 2627, 11843, 1093, 1606, 1605, 3651, 1536, 0, 10242, 3, 37889, 1029, 10248, 2569, 9740, 9745, 10770, 17938, 2577, 10257, 9238, 3094, 9752, 9751, 9754, 30235, 9243, 18425, 9246, 2590, 24096, 9249, 9250, 9251, 4643, 10272, 9252, 5666, 3616, 3625, 4133, 4136, 1071, 9264, 4657, 51, 9267, 22583, 10808, 40504, 10304, 6210, 3650, 37444, 68, 9284, 0, 5121, 4098, 3, 3078, 7175, 1543, 1545, 22027, 5131, 14, 4623, 4625, 22547, 533, 2588, 2590, 1570, 4643, 2597, 5669, 5159, 6183, 2602, 45, 6702, 18937, 5168, 5169, 48, 4657, 3063, 51, 1590, 12343, 5686, 5689, 2105, 1586, 5175, 5694, 6721, 68, 2630, 29767, 29778, 4692, 2133, 5204, 6740, 0, 14849, 512, 3, 11266, 14853, 2053, 23047, 1527, 2569, 15370, 14861, 13, 19471, 2577, 11793, 14867, 18423, 533, 15384, 14875, 15388, 11807, 15396, 4132, 1574, 14890, 14893, 14896, 14897, 1586, 51, 1590, 14911, 1088, 15429, 14406, 23111, 16968, 14921, 14925, 16461, 14929, 15442, 8789, 14934, 2647, 3161, 7770, 91], 1: [1536, 7681, 17410, 17411, 17415, 6664, 17420, 15886, 4623, 17935, 4625, 5139, 4631, 17916, 17437, 544, 16422, 5671, 1065, 4650, 4651, 4653, 4690, 16943, 4657, 17458, 15921, 51, 7222, 17464, 17465, 10299, 15932, 64, 6209, 66, 17474, 4680, 8264, 8266, 40008, 6730, 8273, 6738, 5203, 5206, 18005, 15958, 597, 15960, 0, 3, 11785, 2569, 32779, 9227, 526, 21519, 530, 4116, 533, 11805, 2590, 2591, 3105, 7203, 1571, 8740, 1574, 12836, 1062, 1577, 2553, 4654, 1071, 2094, 30257, 51, 30260, 53, 28213, 24633, 1082, 1087, 68, 8779, 78, 12367, 11859, 2647, 91, 13916, 13917, 15455, 608, 9825, 1634, 12387, 13412, 613, 0, 3, 520, 1547, 12300, 2062, 3599, 1040, 26641, 18, 25616, 2577, 13846, 2583, 4121, 25114, 1051, 1052, 25629, 1054, 1567, 2591, 3105, 3616, 4126, 1060, 4125, 1062, 1063, 26663, 1577, 13863, 1066, 1580, 45, 1071, 51, 3123, 53, 2614, 3125, 1082, 2622, 66, 2627, 11843, 1093, 1606, 1605, 3651, 1536, 0, 10242, 3, 37889, 1029, 10248, 2569, 9740, 9745, 10770, 17938, 2577, 10257, 9238, 3094, 9752, 9751, 9754, 30235, 9243, 18425, 9246, 2590, 24096, 9249, 9250, 9251, 4643, 10272, 9252, 5666, 3616, 3625, 4133, 4136, 1071, 9264, 4657, 51, 9267, 22583, 10808, 40504, 10304, 6210, 3650, 37444, 68, 9284, 0, 5121, 4098, 3, 3078, 7175, 1543, 1545, 22027, 5131, 14, 4623, 4625, 22547, 533, 2588, 2590, 1570, 4643, 2597, 5669, 5159, 6183, 2602, 45, 6702, 18937, 5168, 5169, 48, 4657, 3063, 51, 1590, 12343, 5686, 5689, 2105, 1586, 5175, 5694, 6721, 68, 2630, 29767, 29778, 4692, 2133, 5204, 6740, 0, 14849, 512, 3, 11266, 14853, 2053, 23047, 1527, 2569, 15370, 14861, 13, 19471, 2577, 11793, 14867, 18423, 533, 15384, 14875, 15388, 11807, 15396, 4132, 1574, 14890, 14893, 14896, 14897, 1586, 51, 1590, 14911, 1088, 15429, 14406, 
23111, 16968, 14921, 14925, 16461, 14929, 15442, 8789, 14934, 2647, 3161, 7770, 91], 2: [1536, 7681, 17410, 17411, 17415, 6664, 17420, 15886, 4623, 17935, 4625, 5139, 4631, 17916, 17437, 544, 16422, 5671, 1065, 4650, 4651, 4653, 4690, 16943, 4657, 17458, 15921, 51, 7222, 17464, 17465, 10299, 15932, 64, 6209, 66, 17474, 4680, 8264, 8266, 40008, 6730, 8273, 6738, 5203, 5206, 18005, 15958, 597, 15960, 0, 3, 11785, 2569, 32779, 9227, 526, 21519, 530, 4116, 533, 11805, 2590, 2591, 3105, 7203, 1571, 8740, 1574, 12836, 1062, 1577, 2553, 4654, 1071, 2094, 30257, 51, 30260, 53, 28213, 24633, 1082, 1087, 68, 8779, 78, 12367, 11859, 2647, 91, 13916, 13917, 15455, 608, 9825, 1634, 12387, 13412, 613, 0, 3, 520, 1547, 12300, 2062, 3599, 1040, 26641, 18, 25616, 2577, 13846, 2583, 4121, 25114, 1051, 1052, 25629, 1054, 1567, 2591, 3105, 3616, 4126, 1060, 4125, 1062, 1063, 26663, 1577, 13863, 1066, 1580, 45, 1071, 51, 3123, 53, 2614, 3125, 1082, 2622, 66, 2627, 11843, 1093, 1606, 1605, 3651, 1536, 0, 10242, 3, 37889, 1029, 10248, 2569, 9740, 9745, 10770, 17938, 2577, 10257, 9238, 3094, 9752, 9751, 9754, 30235, 9243, 18425, 9246, 2590, 24096, 9249, 9250, 9251, 4643, 10272, 9252, 5666, 3616, 3625, 4133, 4136, 1071, 9264, 4657, 51, 9267, 22583, 10808, 40504, 10304, 6210, 3650, 37444, 68, 9284, 0, 5121, 4098, 3, 3078, 7175, 1543, 1545, 22027, 5131, 14, 4623, 4625, 22547, 533, 2588, 2590, 1570, 4643, 2597, 5669, 5159, 6183, 2602, 45, 6702, 18937, 5168, 5169, 48, 4657, 3063, 51, 1590, 12343, 5686, 5689, 2105, 1586, 5175, 5694, 6721, 68, 2630, 29767, 29778, 4692, 2133, 5204, 6740, 0, 14849, 512, 3, 11266, 14853, 2053, 23047, 1527, 2569, 15370, 14861, 13, 19471, 2577, 11793, 14867, 18423, 533, 15384, 14875, 15388, 11807, 15396, 4132, 1574, 14890, 14893, 14896, 14897, 1586, 51, 1590, 14911, 1088, 15429, 14406, 23111, 16968, 14921, 14925, 16461, 14929, 15442, 8789, 14934, 2647, 3161, 7770, 91], 3: [1536, 7681, 17410, 17411, 17415, 6664, 17420, 15886, 4623, 17935, 4625, 5139, 4631, 17916, 17437, 544, 16422, 5671, 1065, 4650, 4651, 4653, 4690, 16943, 4657, 17458, 15921, 51, 7222, 17464, 17465, 10299, 15932, 64, 6209, 66, 17474, 4680, 8264, 8266, 40008, 6730, 8273, 6738, 5203, 5206, 18005, 15958, 597, 15960, 0, 3, 11785, 2569, 32779, 9227, 526, 21519, 530, 4116, 533, 11805, 2590, 2591, 3105, 7203, 1571, 8740, 1574, 12836, 1062, 1577, 2553, 4654, 1071, 2094, 30257, 51, 30260, 53, 28213, 24633, 1082, 1087, 68, 8779, 78, 12367, 11859, 2647, 91, 13916, 13917, 15455, 608, 9825, 1634, 12387, 13412, 613, 0, 3, 520, 1547, 12300, 2062, 3599, 1040, 26641, 18, 25616, 2577, 13846, 2583, 4121, 25114, 1051, 1052, 25629, 1054, 1567, 2591, 3105, 3616, 4126, 1060, 4125, 1062, 1063, 26663, 1577, 13863, 1066, 1580, 45, 1071, 51, 3123, 53, 2614, 3125, 1082, 2622, 66, 2627, 11843, 1093, 1606, 1605, 3651, 1536, 0, 10242, 3, 37889, 1029, 10248, 2569, 9740, 9745, 10770, 17938, 2577, 10257, 9238, 3094, 9752, 9751, 9754, 30235, 9243, 18425, 9246, 2590, 24096, 9249, 9250, 9251, 4643, 10272, 9252, 5666, 3616, 3625, 4133, 4136, 1071, 9264, 4657, 51, 9267, 22583, 10808, 40504, 10304, 6210, 3650, 37444, 68, 9284, 0, 5121, 4098, 3, 3078, 7175, 1543, 1545, 22027, 5131, 14, 4623, 4625, 22547, 533, 2588, 2590, 1570, 4643, 2597, 5669, 5159, 6183, 2602, 45, 6702, 18937, 5168, 5169, 48, 4657, 3063, 51, 1590, 12343, 5686, 5689, 2105, 1586, 5175, 5694, 6721, 68, 2630, 29767, 29778, 4692, 2133, 5204, 6740, 0, 14849, 512, 3, 11266, 14853, 2053, 23047, 1527, 2569, 15370, 14861, 13, 19471, 2577, 11793, 14867, 18423, 533, 15384, 14875, 15388, 11807, 15396, 4132, 
1574, 14890, 14893, 14896, 14897, 1586, 51, 1590, 14911, 1088, 15429, 14406, 23111, 16968, 14921, 14925, 16461, 14929, 15442, 8789, 14934, 2647, 3161, 7770, 91], 4: [1536, 7681, 17410, 17411, 17415, 6664, 17420, 15886, 4623, 17935, 4625, 5139, 4631, 17916, 17437, 544, 16422, 5671, 1065, 4650, 4651, 4653, 4690, 16943, 4657, 17458, 15921, 51, 7222, 17464, 17465, 10299, 15932, 64, 6209, 66, 17474, 4680, 8264, 8266, 40008, 6730, 8273, 6738, 5203, 5206, 18005, 15958, 597, 15960, 0, 3, 11785, 2569, 32779, 9227, 526, 21519, 530, 4116, 533, 11805, 2590, 2591, 3105, 7203, 1571, 8740, 1574, 12836, 1062, 1577, 2553, 4654, 1071, 2094, 30257, 51, 30260, 53, 28213, 24633, 1082, 1087, 68, 8779, 78, 12367, 11859, 2647, 91, 13916, 13917, 15455, 608, 9825, 1634, 12387, 13412, 613, 0, 3, 520, 1547, 12300, 2062, 3599, 1040, 26641, 18, 25616, 2577, 13846, 2583, 4121, 25114, 1051, 1052, 25629, 1054, 1567, 2591, 3105, 3616, 4126, 1060, 4125, 1062, 1063, 26663, 1577, 13863, 1066, 1580, 45, 1071, 51, 3123, 53, 2614, 3125, 1082, 2622, 66, 2627, 11843, 1093, 1606, 1605, 3651, 1536, 0, 10242, 3, 37889, 1029, 10248, 2569, 9740, 9745, 10770, 17938, 2577, 10257, 9238, 3094, 9752, 9751, 9754, 30235, 9243, 18425, 9246, 2590, 24096, 9249, 9250, 9251, 4643, 10272, 9252, 5666, 3616, 3625, 4133, 4136, 1071, 9264, 4657, 51, 9267, 22583, 10808, 40504, 10304, 6210, 3650, 37444, 68, 9284, 0, 5121, 4098, 3, 3078, 7175, 1543, 1545, 22027, 5131, 14, 4623, 4625, 22547, 533, 2588, 2590, 1570, 4643, 2597, 5669, 5159, 6183, 2602, 45, 6702, 18937, 5168, 5169, 48, 4657, 3063, 51, 1590, 12343, 5686, 5689, 2105, 1586, 5175, 5694, 6721, 68, 2630, 29767, 29778, 4692, 2133, 5204, 6740, 0, 14849, 512, 3, 11266, 14853, 2053, 23047, 1527, 2569, 15370, 14861, 13, 19471, 2577, 11793, 14867, 18423, 533, 15384, 14875, 15388, 11807, 15396, 4132, 1574, 14890, 14893, 14896, 14897, 1586, 51, 1590, 14911, 1088, 15429, 14406, 23111, 16968, 14921, 14925, 16461, 14929, 15442, 8789, 14934, 2647, 3161, 7770, 91], 5: [1536, 7681, 17410, 17411, 17415, 6664, 17420, 15886, 4623, 17935, 4625, 5139, 4631, 17916, 17437, 544, 16422, 5671, 1065, 4650, 4651, 4653, 4690, 16943, 4657, 17458, 15921, 51, 7222, 17464, 17465, 10299, 15932, 64, 6209, 66, 17474, 4680, 8264, 8266, 40008, 6730, 8273, 6738, 5203, 5206, 18005, 15958, 597, 15960, 0, 3, 11785, 2569, 32779, 9227, 526, 21519, 530, 4116, 533, 11805, 2590, 2591, 3105, 7203, 1571, 8740, 1574, 12836, 1062, 1577, 2553, 4654, 1071, 2094, 30257, 51, 30260, 53, 28213, 24633, 1082, 1087, 68, 8779, 78, 12367, 11859, 2647, 91, 13916, 13917, 15455, 608, 9825, 1634, 12387, 13412, 613, 0, 3, 520, 1547, 12300, 2062, 3599, 1040, 26641, 18, 25616, 2577, 13846, 2583, 4121, 25114, 1051, 1052, 25629, 1054, 1567, 2591, 3105, 3616, 4126, 1060, 4125, 1062, 1063, 26663, 1577, 13863, 1066, 1580, 45, 1071, 51, 3123, 53, 2614, 3125, 1082, 2622, 66, 2627, 11843, 1093, 1606, 1605, 3651, 1536, 0, 10242, 3, 37889, 1029, 10248, 2569, 9740, 9745, 10770, 17938, 2577, 10257, 9238, 3094, 9752, 9751, 9754, 30235, 9243, 18425, 9246, 2590, 24096, 9249, 9250, 9251, 4643, 10272, 9252, 5666, 3616, 3625, 4133, 4136, 1071, 9264, 4657, 51, 9267, 22583, 10808, 40504, 10304, 6210, 3650, 37444, 68, 9284, 0, 5121, 4098, 3, 3078, 7175, 1543, 1545, 22027, 5131, 14, 4623, 4625, 22547, 533, 2588, 2590, 1570, 4643, 2597, 5669, 5159, 6183, 2602, 45, 6702, 18937, 5168, 5169, 48, 4657, 3063, 51, 1590, 12343, 5686, 5689, 2105, 1586, 5175, 5694, 6721, 68, 2630, 29767, 29778, 4692, 2133, 5204, 6740, 0, 14849, 512, 3, 11266, 14853, 2053, 23047, 1527, 2569, 15370, 14861, 13, 
19471, 2577, 11793, 14867, 18423, 533, 15384, 14875, 15388, 11807, 15396, 4132, 1574, 14890, 14893, 14896, 14897, 1586, 51, 1590, 14911, 1088, 15429, 14406, 23111, 16968, 14921, 14925, 16461, 14929, 15442, 8789, 14934, 2647, 3161, 7770, 91]}
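hot_words / word_bag above collect, for each of the six classes, the token ids ranked highest by TF-IDF in that class's training documents, then concatenate the six 50-token lists into one shared 300-token vocabulary (note that the per-class 300-lists printed last are identical). A hedged sketch of that construction; train_data_frame and the exact ranking rule are assumptions:

from collections import defaultdict

TOP_N = 50
word_bag = {}
for label, group in train_data_frame.groupby('keyword_index'):
    scores = defaultdict(float)
    for doc in group['tfidf']:            # accumulate TF-IDF mass per token
        for token_id, weight in doc:
            scores[token_id] += weight
    word_bag[label] = sorted(scores, key=scores.get, reverse=True)[:TOP_N]

# Shared vocabulary: the six 50-token lists back to back (6 x 50 = 300).
all_words = [t for label in sorted(word_bag) for t in word_bag[label]]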
98. features_data_frame.shape: (6, 255)
99. 0 30
100. 1 185
101. 2 139
102. 3 66
103. 4 69
104. 5 157
105. class_Proportion: 
106.  [0.04643962848297214, 0.28637770897832815, 0.21517027863777088, 0.1021671826625387, 0.10681114551083591, 0.24303405572755418]
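class_Proportion is simply the empirical class prior over the 646 training documents; the per-class counts printed above divide out to exactly the listed fractions (30/646 = 0.0464..., and so on):

# Empirical priors P(class) from the training labels.
class_counts = train_data_frame['keyword_index'].value_counts().sort_index()
class_proportion = (class_counts / class_counts.sum()).tolist()
print('class_Proportion:\n', class_proportion)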
107. test_data_frame.head(2) 
108.       Unnamed: 0                                            content  \
109. 854         854  据Mobileexpose报道,华硕已经正式向媒体发出邀请,定于6月14日在台湾举办记者会,...   
110. 101         101   6月6日,王者荣耀猴三棍重做引起王者峡谷一阵轩然大波,毕竟这个强势的猴子已经陪伴我们好几个...   
111. 
112. id                                   tags  \
113. 854  6429089676803440897  ['科技', '华硕', '华硕ZenFone', '台湾', '手机']   
114. 101  6429098400347586818       ['游戏', '猴子', '王者荣耀', '黄忠', '游戏']   
115. 
116.                     time                     title  \
117. 854  2017-06-07 10:11:00        华硕ZenFone AR宣布本月发售   
118. 101  2017-06-07 10:39:20  猴子重做之后是加强还是削弱?狂到站对面泉水拿双杀   
119. 
120.                                              doc_words  \
121. 854  [报道, 华硕, 已经, 正式, 媒体, 发出, 邀请, 定于, 月, 日, 台湾, 举办,...   
122. 101  [月, 日, 王者, 荣耀, 猴三棍, 重, 做, 引起, 王者, 峡谷, 一阵, 轩然大波...   
123. 
124.                                                 corpus  \
125. 854  [(142, 1), (362, 1), (472, 1), (475, 1), (494,...   
126. 101  [(0, 2), (68, 3), (133, 1), (184, 1), (226, 1)...   
127. 
128.                                                  tfidf   visual01   visual02  \
129. 854  [(142, 0.13953435619531032), (362, 0.046441336...  21.684397 -30.567736
130. 101  [(0, 0.012838015508020575), (68, 0.04742284222...  67.188065  21.183245
131. 
132.      keyword_index  
133. 854              1
134. 101              3
135. print the first sample 
136.  Unnamed: 0                                                     854
137. content          据Mobileexpose报道,华硕已经正式向媒体发出邀请,定于6月14日在台湾举办记者会,...
138. id                                             6429089676803440897
139. tags                         ['科技', '华硕', '华硕ZenFone', '台湾', '手机']
140. time                                           2017-06-07 10:11:00
141. title                                           华硕ZenFone AR宣布本月发售
142. doc_words        [报道, 华硕, 已经, 正式, 媒体, 发出, 邀请, 定于, 月, 日, 台湾, 举办,...
143. corpus           [(142, 1), (362, 1), (472, 1), (475, 1), (494,...
144. tfidf            [(142, 0.13953435619531032), (362, 0.046441336...
145. visual01                                                   21.6844
146. visual02                                                  -30.5677
147. keyword_index                                                    1
148. Name: 854, dtype: object
149. test_data_frame.iloc[0].corpus:  [(142, 1), (362, 1), (472, 1), (475, 1), (494, 1), (530, 1), (872, 1), (909, 1), (1254, 1), (1312, 1), (1878, 1), (2577, 1), (2783, 1), (2979, 1), (3697, 1), (5508, 1), (9052, 1), (12204, 1), (12256, 1), (12591, 1), (12936, 1), (12991, 1), (13128, 1), (13194, 1), (13244, 1), (13317, 1), (31670, 1), (31683, 1), (33417, 1)]
150. [1.45708072e-43 1.78656934e-66 7.12148875e-63 1.71090490e-53
151. 4.71385662e-54 2.08405934e-64]
152. [-35.34436300647761, -16.431856044032266, -20.267559000416433, -22.405433968586664, -27.97121661401147, -18.05089965903481]
153. F:\File_Jupyter\实用代码\naive_bayes(简单贝叶斯)\TextClassPrediction_kNN_NB_LDA_P.py:346: SettingWithCopyWarning: 
154. A value is trying to be set on a copy of a slice from a DataFrame.
155. Try using .loc[row_indexer,col_indexer] = value instead
156. 
157. See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
158.   test_data_frame['predicted_class'] = test_data_frame['corpus'].apply(predict_text_ByMax)       #预测所有测试文档   predict all test documents
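predict_text_ByMax, referenced in the warning above, scores each class by log prior plus log likelihood over the bag-of-words vector and keeps the argmax (the six log scores for the first test sample are printed a few lines up). A hedged sketch; word_probability, an assumed per-class table of smoothed P(word | class) over the word bag, is not shown in the log:

import numpy as np

def predict_text_ByMax(bow):
    # bow: gensim-style [(token_id, count), ...] for one document.
    scores = []
    for label, prior in enumerate(class_proportion):
        log_score = np.log(prior)
        for token_id, count in bow:
            prob = word_probability[label].get(token_id)
            if prob is not None:          # tokens outside the word bag are ignored
                log_score += count * np.log(prob)
        scores.append(log_score)
    return int(np.argmax(scores))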
159. 
160. id                                   tags  \
161. 854   6429089676803440897  ['科技', '华硕', '华硕ZenFone', '台湾', '手机']   
162. 101   6429098400347586818       ['游戏', '猴子', '王者荣耀', '黄忠', '游戏']   
163. 738   6413133652368982274     ['科技', '厨卫电器', '榨汁机', '小家电', '硅谷']   
164. 511   6428827159980867842     ['科技', '智能家居', '音箱', '苹果公司', '法国']   
165. 725   6428841852455354625                  ['科技', '喜马拉雅山', '科技']   
166. ...                   ...                                    ...   
167. 805   6429151552733069569                           ['财经', '财经']   
168. 448   6415852634885341441    ['汽车', 'SUV', '国产车', '概念车', '汽车用品']   
169. 782   6428858665063383297   ['科技', '新能源汽车', '电动汽车', '新能源', '经济']   
170. 1264  6427822755417194753    ['汽车', '日本汽车', '讴歌汽车', 'SUV', '空调']   
171. 1195  6429093420292210945                     ['科技', '乐视', '科技']   
172. 
173.                      time                        title  \
174. 854   2017-06-07 10:11:00           华硕ZenFone AR宣布本月发售   
175. 101   2017-06-07 10:39:20     猴子重做之后是加强还是削弱?狂到站对面泉水拿双杀   
176. 738   2017-04-26 10:41:39                绝!他用一台榨汁机骗了8亿   
177. 511   2017-06-08 11:06:00    他的智能音箱一上市,苹果公司就推出了HomePod   
178. 725   2017-06-07 18:37:00  喜马拉雅FM推出“付费会员”,当天召集超221万名会员   
179. ...                   ...                          ...   
180. 805   2017-06-08 14:30:00          盘中近20家龙头白马股集体创下历史新高   
181. 448   2017-05-03 18:37:20      别瞎找了!10万左右尺寸最大的SUV都在这里了   
182. 782   2017-06-07 19:12:00      倡导移动出行新概念 NEVS两款概念量产车亮相   
183. 1264  2017-06-08 09:54:40        居然还有一款车,最低配和中高配看不出差别?   
184. 1195  2017-06-08 10:45:00     乐视被爆未及时缴物业费,员工或将被阻止进大楼办公   
185. 
186.                                               doc_words  \
187. 854   [报道, 华硕, 已经, 正式, 媒体, 发出, 邀请, 定于, 月, 日, 台湾, 举办,...   
188. 101   [月, 日, 王者, 荣耀, 猴三棍, 重, 做, 引起, 王者, 峡谷, 一阵, 轩然大波...   
189. 738   [骗子, 往往, 很会, 讲故事, 以下, 硅谷, 骗局, 验血, 公司, 号称, 指尖, ...   
190. 511   [专访, 创始人, 孟, 崨, 学校, 最, 调皮, 却, 成绩, 最好, 学生, 老师, ...   
191. 725   [据介绍, 喜马拉雅, 会员, 月费, 元, 年度, 会员, 元, 价格, 视频, 网站, ...   
192. ...                                                 ...   
193. 805   [每经, 记者, 王海, 慜, 每经, 编辑, 叶峰, 今日, 盘中, 昨日, 领涨, 中小...   
194. 448   [中国, 人买, 喜欢, 房子, 买, 面积, 手机, 买, 屏大, 买车, 自然, 挑选,...   
195. 782   [中证网, 讯, 记者, 徐金忠, 月, 日, 国, 电动汽车, 瑞典, 有限公司, 亮相,...   
196. 1264  [目前, 日系, 豪华, 品牌, 讴歌, 已经, 开启, 国产, 路, 推出, 车型, 后,...   
197. 1195  [近日, 爆料, 称, 乐视, 位于, 北京, 达美, 中心, 办公地, 因未, 及时, 缴...   
198. 
199.                                                  corpus  \
200. 854   [(142, 1), (362, 1), (472, 1), (475, 1), (494,...   
201. 101   [(0, 2), (68, 3), (133, 1), (184, 1), (226, 1)...   
202. 738   [(0, 2), (45, 1), (48, 1), (133, 2), (155, 1),...   
203. 511   [(0, 10), (13, 2), (14, 2), (20, 1), (45, 1), ...   
204. 725   [(30, 1), (102, 1), (142, 1), (154, 1), (189, ...   
205. ...                                                 ...   
206. 805   [(113, 1), (167, 1), (169, 1), (214, 1), (258,...   
207. 448   [(4, 2), (8, 1), (14, 1), (51, 6), (53, 2), (6...   
208. 782   [(15, 2), (30, 1), (53, 7), (93, 1), (143, 1),...   
209. 1264  [(0, 1), (20, 1), (51, 1), (176, 1), (225, 1),...   
210. 1195  [(57, 1), (111, 1), (191, 1), (361, 1), (476, ...   
211. 
212.                                                   tfidf   visual01   visual02  \
213. 854   [(142, 0.13953435619531032), (362, 0.046441336...  21.684397 -30.567736
214. 101   [(0, 0.012838015508020575), (68, 0.04742284222...  67.188065  21.183245
215. 738   [(0, 0.008984009118453712), (45, 0.01791359767... -22.855194 -11.270862
216. 511   [(0, 0.04361196171462796), (13, 0.028607388065... -22.198786  12.217076
217. 725   [(30, 0.05815947983270004), (102, 0.0450585853...  26.268911  21.240065
218. ...                                                 ...        ...        ...   
219. 805   [(113, 0.030899018921031703), (167, 0.02103003... -66.232071   0.221611
220. 448   [(4, 0.04071064284477513), (8, 0.0235138776022...  41.836094 -44.539528
221. 782   [(15, 0.03392075672049564), (30, 0.03003603467... -26.810091 -29.602842
222. 1264  [(0, 0.009883726180653873), (20, 0.04080153677...  36.279522 -52.474297
223. 1195  [(57, 0.09668298763559263), (111, 0.1255406499...  -6.373239  16.101738
224. 
225.       keyword_index  predicted_class  
226. 854               1                1
227. 101               3                3
228. 738               1                1
229. 511               1                2
230. 725               1                1
231. ...             ...              ...  
232. 805               2                2
233. 448               5                5
234. 782               1                1
235. 1264              5                5
236. 1195              1                1
237. 
238. [647 rows x 13 columns]
239. SModel_CS_acc_score: 0.7047913446676971
240. 300
241. label_category_ID 2
242. 一个
243. 一些
244. 概念
245. 经营
246. 补贴
247. 股市
248. 增持
249. 成本
250. 乳业
251. 万吨
252. train_data_frame.corpus[0] 
253.  [(0, 6), (1, 1), (2, 1), (3, 3), (4, 2), (5, 2), (6, 1), (7, 1), (8, 2), (9, 1), (10, 3), (11, 1), (12, 2), (13, 2), (14, 2), (15, 1), (16, 1), (17, 2), (18, 1), (19, 1), (20, 2), (21, 1), (22, 2), (23, 2), (24, 1), (25, 1), (26, 1), (27, 1), (28, 1), (29, 2), (30, 3), (31, 4), (32, 3), (33, 1), (34, 1), (35, 1), (36, 7), (37, 1), (38, 1), (39, 2), (40, 3), (41, 1), (42, 1), (43, 1), (44, 1), (45, 2), (46, 1), (47, 1), (48, 1), (49, 2), (50, 4), (51, 21), (52, 3), (53, 7), (54, 1), (55, 2), (56, 1), (57, 4), (58, 2), (59, 1), (60, 5), (61, 1), (62, 1), (63, 1), (64, 2), (65, 1), (66, 3), (67, 1), (68, 2), (69, 2), (70, 1), (71, 1), (72, 1), (73, 1), (74, 2), (75, 1), (76, 1), (77, 1), (78, 1), (79, 2), (80, 1), (81, 1), (82, 1), (83, 4), (84, 7), (85, 2), (86, 3), (87, 1), (88, 9), (89, 1), (90, 1), (91, 8), (92, 3), (93, 1), (94, 4), (95, 1), (96, 2), (97, 1), (98, 7), (99, 1), (100, 2), (101, 1), (102, 1), (103, 1), (104, 1), (105, 1), (106, 1), (107, 1), (108, 1), (109, 2), (110, 1), (111, 2), (112, 1), (113, 1), (114, 1), (115, 1), (116, 1), (117, 1), (118, 1), (119, 1), (120, 1), (121, 2), (122, 1), (123, 1), (124, 1), (125, 1), (126, 5), (127, 1), (128, 4), (129, 1), (130, 1), (131, 1), (132, 2), (133, 2), (134, 1), (135, 5), (136, 1), (137, 1), (138, 3), (139, 1), (140, 1), (141, 1), (142, 1), (143, 1), (144, 1), (145, 2), (146, 1), (147, 1), (148, 2), (149, 4), (150, 1), (151, 1), (152, 2), (153, 2), (154, 1), (155, 3), (156, 1), (157, 1), (158, 1), (159, 1), (160, 1), (161, 2), (162, 1), (163, 1), (164, 1), (165, 2), (166, 1), (167, 3), (168, 1), (169, 1), (170, 3), (171, 3), (172, 1), (173, 2), (174, 1), (175, 1), (176, 2), (177, 5), (178, 1), (179, 1), (180, 1), (181, 1), (182, 1), (183, 1), (184, 4), (185, 1), (186, 1), (187, 1), (188, 1), (189, 3), (190, 1), (191, 14), (192, 2), (193, 2), (194, 2), (195, 1), (196, 3), (197, 1), (198, 1), (199, 11), (200, 6), (201, 1), (202, 1), (203, 2), (204, 1), (205, 8), (206, 2), (207, 2), (208, 2), (209, 1), (210, 1), (211, 1), (212, 1), (213, 1), (214, 1), (215, 1), (216, 3), (217, 1), (218, 1), (219, 2), (220, 2), (221, 1), (222, 1), (223, 1), (224, 1), (225, 17), (226, 1), (227, 1), (228, 1), (229, 1), (230, 1), (231, 1), (232, 2), (233, 1), (234, 1), (235, 3), (236, 1), (237, 1), (238, 2), (239, 1), (240, 1), (241, 1), (242, 1), (243, 2), (244, 2), (245, 1), (246, 1), (247, 2), (248, 2), (249, 2), (250, 1), (251, 1), (252, 2), (253, 1), (254, 1), (255, 1), (256, 1), (257, 1), (258, 3), (259, 3), (260, 1), (261, 3), (262, 2), (263, 1), (264, 1), (265, 6), (266, 1), (267, 3), (268, 1), (269, 1), (270, 3), (271, 2), (272, 1), (273, 2), (274, 1), (275, 1), (276, 5), (277, 1), (278, 4), (279, 4), (280, 25), (281, 2), (282, 2), (283, 2), (284, 7), (285, 1), (286, 1), (287, 2), (288, 2), (289, 1), (290, 1), (291, 1), (292, 1), (293, 3), (294, 2), (295, 1), (296, 3), (297, 1), (298, 3), (299, 2), (300, 1), (301, 1), (302, 1), (303, 2), (304, 1), (305, 1), (306, 1), (307, 2), (308, 2), (309, 1), (310, 1), (311, 1), (312, 1), (313, 1), (314, 1), (315, 1), (316, 7), (317, 2), (318, 2), (319, 1), (320, 1), (321, 1), (322, 1), (323, 1), (324, 1), (325, 4), (326, 1), (327, 2), (328, 1), (329, 1), (330, 3), (331, 3), (332, 1), (333, 2), (334, 2), (335, 1), (336, 1), (337, 2), (338, 1), (339, 1), (340, 1), (341, 1), (342, 1), (343, 1), (344, 2), (345, 1), (346, 1), (347, 2), (348, 1), (349, 2), (350, 5), (351, 2), (352, 3), (353, 1), (354, 4), (355, 1), (356, 1), (357, 2), (358, 4), (359, 2), (360, 2), (361, 1), (362, 9), (363, 2), (364, 2), 
(365, 1), (366, 1), (367, 7), (368, 1), (369, 4), (370, 2), (371, 1), (372, 1), (373, 1), (374, 1), (375, 1), (376, 1), (377, 1), (378, 2), (379, 1), (380, 3), (381, 1), (382, 2), (383, 1), (384, 3), (385, 26), (386, 1), (387, 1), (388, 1), (389, 3), (390, 1), (391, 2), (392, 1), (393, 4), (394, 4), (395, 4), (396, 2), (397, 1), (398, 40), (399, 2), (400, 4), (401, 1), (402, 1), (403, 2), (404, 1), (405, 1), (406, 2), (407, 1), (408, 1), (409, 3), (410, 1), (411, 1), (412, 2), (413, 7), (414, 4), (415, 2), (416, 1), (417, 1), (418, 1), (419, 3), (420, 1), (421, 1), (422, 1), (423, 1), (424, 1), (425, 1), (426, 1), (427, 2), (428, 1), (429, 1), (430, 1), (431, 1), (432, 5), (433, 1), (434, 1), (435, 1), (436, 1), (437, 1), (438, 1), (439, 1), (440, 1), (441, 1), (442, 1), (443, 3), (444, 3), (445, 2), (446, 5), (447, 1), (448, 1), (449, 1), (450, 4), (451, 1), (452, 2), (453, 2), (454, 1), (455, 4), (456, 1), (457, 1), (458, 1), (459, 2), (460, 1), (461, 1), (462, 5), (463, 2), (464, 1), (465, 5), (466, 74), (467, 2), (468, 1), (469, 1), (470, 2), (471, 22), (472, 2), (473, 1), (474, 1), (475, 2), (476, 2), (477, 2), (478, 2), (479, 1), (480, 1), (481, 1), (482, 1), (483, 2), (484, 1), (485, 1), (486, 2), (487, 1), (488, 2), (489, 1), (490, 1), (491, 1), (492, 4), (493, 1), (494, 2), (495, 4), (496, 2), (497, 1), (498, 1), (499, 1), (500, 1), (501, 5), (502, 1), (503, 13), (504, 4), (505, 3), (506, 1), (507, 7), (508, 1), (509, 1), (510, 1), (511, 1), (512, 1), (513, 1), (514, 2), (515, 1), (516, 3), (517, 4), (518, 1), (519, 1), (520, 1), (521, 1), (522, 1), (523, 1), (524, 1), (525, 1), (526, 2), (527, 2), (528, 1), (529, 1), (530, 1), (531, 1), (532, 1), (533, 1), (534, 1), (535, 2), (536, 5), (537, 2), (538, 1), (539, 1), (540, 1), (541, 7), (542, 1), (543, 1), (544, 1), (545, 2), (546, 1), (547, 3), (548, 2), (549, 1), (550, 1), (551, 2), (552, 1), (553, 2), (554, 1), (555, 1), (556, 2), (557, 1), (558, 2), (559, 5), (560, 2), (561, 1), (562, 1), (563, 1), (564, 1), (565, 1), (566, 1), (567, 7), (568, 2), (569, 1), (570, 2), (571, 1), (572, 1), (573, 1), (574, 4), (575, 1), (576, 2), (577, 2), (578, 1), (579, 2), (580, 1), (581, 1), (582, 1), (583, 2), (584, 1), (585, 1), (586, 1), (587, 4), (588, 1), (589, 4), (590, 2), (591, 1), (592, 1), (593, 1), (594, 2), (595, 1), (596, 1), (597, 1), (598, 1), (599, 1), (600, 1), (601, 1), (602, 1), (603, 1), (604, 1), (605, 1), (606, 1), (607, 1), (608, 2), (609, 1), (610, 2), (611, 1), (612, 1), (613, 11), (614, 1), (615, 1), (616, 3), (617, 1), (618, 1), (619, 1), (620, 1), (621, 1), (622, 1), (623, 1), (624, 32), (625, 2), (626, 1), (627, 8), (628, 1), (629, 3), (630, 3), (631, 1), (632, 1), (633, 4), (634, 1), (635, 1), (636, 2), (637, 1), (638, 3), (639, 2), (640, 1), (641, 1), (642, 1), (643, 3), (644, 5), (645, 4), (646, 1), (647, 1), (648, 3), (649, 1), (650, 1), (651, 1), (652, 1), (653, 1), (654, 1), (655, 2), (656, 1), (657, 7), (658, 1), (659, 2), (660, 1), (661, 2), (662, 1), (663, 1), (664, 1), (665, 1), (666, 1), (667, 1), (668, 4), (669, 1), (670, 1), (671, 3), (672, 1), (673, 1), (674, 2), (675, 1), (676, 1), (677, 1), (678, 1), (679, 1), (680, 2), (681, 2), (682, 1), (683, 1), (684, 1), (685, 3), (686, 1), (687, 1), (688, 1), (689, 1), (690, 4), (691, 1), (692, 2), (693, 3), (694, 1), (695, 2), (696, 1), (697, 1), (698, 2), (699, 1), (700, 1), (701, 4), (702, 1), (703, 1), (704, 2), (705, 1), (706, 1), (707, 1), (708, 1), (709, 2), (710, 1), (711, 3), (712, 1), (713, 1), (714, 4), (715, 1), (716, 1), (717, 1), (718, 2), (719, 
1), (720, 1), (721, 2), (722, 1), (723, 1), (724, 4), (725, 1), (726, 1), (727, 1), (728, 1), (729, 2), (730, 12), (731, 2), (732, 1), (733, 2), (734, 3), (735, 1), (736, 26), (737, 1), (738, 5), (739, 1), (740, 2), (741, 5), (742, 2), (743, 3), (744, 3), (745, 2), (746, 1), (747, 3), (748, 2), (749, 2), (750, 2), (751, 1), (752, 1), (753, 2), (754, 1), (755, 1), (756, 1), (757, 1), (758, 1), (759, 4), (760, 1), (761, 1), (762, 1), (763, 1), (764, 1), (765, 2), (766, 1), (767, 1), (768, 1), (769, 2), (770, 8), (771, 2), (772, 4), (773, 1), (774, 8), (775, 3), (776, 1), (777, 1), (778, 3), (779, 1), (780, 1), (781, 1), (782, 5), (783, 2), (784, 2), (785, 1), (786, 4), (787, 1), (788, 1), (789, 1), (790, 1), (791, 1), (792, 1), (793, 4), (794, 1), (795, 1), (796, 1), (797, 5), (798, 3), (799, 5), (800, 3), (801, 1), (802, 1), (803, 1), (804, 1), (805, 2), (806, 2), (807, 2), (808, 1), (809, 1), (810, 1), (811, 1), (812, 1), (813, 1), (814, 1), (815, 3), (816, 1), (817, 2), (818, 1), (819, 1), (820, 11), (821, 1), (822, 1), (823, 2), (824, 3), (825, 1), (826, 1), (827, 1), (828, 1), (829, 1), (830, 3), (831, 4), (832, 46), (833, 1), (834, 1), (835, 2), (836, 2), (837, 1), (838, 1), (839, 2), (840, 2), (841, 1), (842, 1), (843, 2), (844, 2), (845, 2), (846, 1), (847, 1), (848, 2), (849, 1), (850, 1), (851, 1), (852, 3), (853, 1), (854, 1), (855, 6), (856, 1), (857, 1), (858, 1)]
254. [33. 74. 73. 31. 47. 48.]
255. <class 'numpy.ndarray'>
256. SModel_acc_score: 0.8114374034003091
257. kNNC_acc_score: 0.8160741885625966
258. GNBC_acc_score: 0.6352395672333848
259. MNBC_acc_score: 0.6352395672333848
260. BNBC_acc_score: 0.29675425038639874
261. LDAC_acc_score: 0.8238021638330757
262. PerceptronC_acc_score: 0.8222565687789799
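The six *_acc_score figures correspond to scikit-learn's stock implementations run on the same features. A comparison sketch, assuming X_train/X_test are the dense per-document count (or TF-IDF) matrices over the word-bag features and y_train/y_test the keyword_index labels; GaussianNB and LinearDiscriminantAnalysis require dense input:

from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import Perceptron
from sklearn.metrics import accuracy_score

models = {
    'kNNC': KNeighborsClassifier(),
    'GNBC': GaussianNB(),
    'MNBC': MultinomialNB(),
    'BNBC': BernoulliNB(),
    'LDAC': LinearDiscriminantAnalysis(),
    'PerceptronC': Perceptron(),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    print(f'{name}_acc_score:', accuracy_score(y_test, model.predict(X_test)))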

Core code

1. class GaussianNB Found at: sklearn.naive_bayes
2. 
3. class GaussianNB(_BaseNB):
4. """
5.     Gaussian Naive Bayes (GaussianNB)
6.     
7.     Can perform online updates to model parameters via :meth:`partial_fit`.
8.     For details on algorithm used to update feature means and variance online,
9.     see Stanford CS tech report STAN-CS-79-773 by Chan, Golub, and LeVeque:
10.     
11.     http://i.stanford.edu/pub/cstr/reports/cs/tr/79/773/CS-TR-79-773.pdf
12.     
13.     Read more in the :ref:`User Guide <gaussian_naive_bayes>`.
14.     
15.     Parameters
16.     ----------
17.     priors : array-like of shape (n_classes,)
18.     Prior probabilities of the classes. If specified the priors are not
19.     adjusted according to the data.
20.     
21.     var_smoothing : float, default=1e-9
22.     Portion of the largest variance of all features that is added to
23.     variances for calculation stability.
24.     
25.     .. versionadded:: 0.20
26.     
27.     Attributes
28.     ----------
29.     class_count_ : ndarray of shape (n_classes,)
30.     number of training samples observed in each class.
31.     
32.     class_prior_ : ndarray of shape (n_classes,)
33.     probability of each class.
34.     
35.     classes_ : ndarray of shape (n_classes,)
36.     class labels known to the classifier
37.     
38.     epsilon_ : float
39.     absolute additive value to variances
40.     
41.     sigma_ : ndarray of shape (n_classes, n_features)
42.     variance of each feature per class
43.     
44.     theta_ : ndarray of shape (n_classes, n_features)
45.     mean of each feature per class
46.     
47.     Examples
48.     --------
49.     >>> import numpy as np
50.     >>> X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
51.     >>> Y = np.array([1, 1, 1, 2, 2, 2])
52.     >>> from sklearn.naive_bayes import GaussianNB
53.     >>> clf = GaussianNB()
54.     >>> clf.fit(X, Y)
55.     GaussianNB()
56.     >>> print(clf.predict([[-0.8, -1]]))
57.     [1]
58.     >>> clf_pf = GaussianNB()
59.     >>> clf_pf.partial_fit(X, Y, np.unique(Y))
60.     GaussianNB()
61.     >>> print(clf_pf.predict([[-0.8, -1]]))
62.     [1]
63.     """
64.     @_deprecate_positional_args
65. def __init__(self, *, priors=None, var_smoothing=1e-9):
66.         self.priors = priors
67.         self.var_smoothing = var_smoothing
68. 
69. def fit(self, X, y, sample_weight=None):
70. """Fit Gaussian Naive Bayes according to X, y
71. 
72.         Parameters
73.         ----------
74.         X : array-like of shape (n_samples, n_features)
75.             Training vectors, where n_samples is the number of samples
76.             and n_features is the number of features.
77. 
78.         y : array-like of shape (n_samples,)
79.             Target values.
80. 
81.         sample_weight : array-like of shape (n_samples,), default=None
82.             Weights applied to individual samples (1. for unweighted).
83. 
84.             .. versionadded:: 0.17
85.                Gaussian Naive Bayes supports fitting with *sample_weight*.
86. 
87.         Returns
88.         -------
89.         self : object
90.         """
91.         X, y = self._validate_data(X, y)
92.         y = column_or_1d(y, warn=True)
93. return self._partial_fit(X, y, np.unique(y), _refit=True, 
94.             sample_weight=sample_weight)
95. 
96. def _check_X(self, X):
97. return check_array(X)
98. 
99.     @staticmethod
100. def _update_mean_variance(n_past, mu, var, X, sample_weight=None):
101. """Compute online update of Gaussian mean and variance.
102. 
103.         Given starting sample count, mean, and variance, a new set of
104.         points X, and optionally sample weights, return the updated mean and
105.         variance. (NB - each dimension (column) in X is treated as independent
106.         -- you get variance, not covariance).
107. 
108.         Can take scalar mean and variance, or vector mean and variance to
109.         simultaneously update a number of independent Gaussians.
110. 
111.         See Stanford CS tech report STAN-CS-79-773 by Chan, Golub, and 
112.          LeVeque:
113. 
114.         http://i.stanford.edu/pub/cstr/reports/cs/tr/79/773/CS-TR-79-773.pdf
115. 
116.         Parameters
117.         ----------
118.         n_past : int
119.             Number of samples represented in old mean and variance. If sample
120.             weights were given, this should contain the sum of sample
121.             weights represented in old mean and variance.
122. 
123.         mu : array-like of shape (number of Gaussians,)
124.             Means for Gaussians in original set.
125. 
126.         var : array-like of shape (number of Gaussians,)
127.             Variances for Gaussians in original set.
128. 
129.         sample_weight : array-like of shape (n_samples,), default=None
130.             Weights applied to individual samples (1. for unweighted).
131. 
132.         Returns
133.         -------
134.         total_mu : array-like of shape (number of Gaussians,)
135.             Updated mean for each Gaussian over the combined set.
136. 
137.         total_var : array-like of shape (number of Gaussians,)
138.             Updated variance for each Gaussian over the combined set.
139.         """
140. if X.shape[0] == 0:
141. return mu, var
142. # Compute (potentially weighted) mean and variance of new datapoints
143. if sample_weight is not None:
144.             n_new = float(sample_weight.sum())
145.             new_mu = np.average(X, axis=0, weights=sample_weight)
146.             new_var = np.average((X - new_mu) ** 2, axis=0, 
147.              weights=sample_weight)
148. else:
149.             n_new = X.shape[0]
150.             new_var = np.var(X, axis=0)
151.             new_mu = np.mean(X, axis=0)
152. if n_past == 0:
153. return new_mu, new_var
154.         n_total = float(n_past + n_new)
155. # Combine mean of old and new data, taking into consideration
156. # (weighted) number of observations
157.         total_mu = (n_new * new_mu + n_past * mu) / n_total
158. # Combine variance of old and new data, taking into consideration
159. # (weighted) number of observations. This is achieved by combining
160. # the sum-of-squared-differences (ssd)
161.         old_ssd = n_past * var
162.         new_ssd = n_new * new_var
163.         total_ssd = old_ssd + new_ssd + (n_new * n_past / n_total) * (mu - 
164.          new_mu) ** 2
165.         total_var = total_ssd / n_total
166. return total_mu, total_var
167. 
168. def partial_fit(self, X, y, classes=None, sample_weight=None):
169. """Incremental fit on a batch of samples.
170. 
171.         This method is expected to be called several times consecutively
172.         on different chunks of a dataset so as to implement out-of-core
173.         or online learning.
174. 
175.         This is especially useful when the whole dataset is too big to fit in
176.         memory at once.
177. 
178.         This method has some performance and numerical stability overhead,
179.         hence it is better to call partial_fit on chunks of data that are
180.         as large as possible (as long as fitting in the memory budget) to
181.         hide the overhead.
182. 
183.         Parameters
184.         ----------
185.         X : array-like of shape (n_samples, n_features)
186.             Training vectors, where n_samples is the number of samples and
187.             n_features is the number of features.
188. 
189.         y : array-like of shape (n_samples,)
190.             Target values.
191. 
192.         classes : array-like of shape (n_classes,), default=None
193.             List of all the classes that can possibly appear in the y vector.
194. 
195.             Must be provided at the first call to partial_fit, can be omitted
196.             in subsequent calls.
197. 
198.         sample_weight : array-like of shape (n_samples,), default=None
199.             Weights applied to individual samples (1. for unweighted).
200. 
201.             .. versionadded:: 0.17
202. 
203.         Returns
204.         -------
205.         self : object
206.         """
207. return self._partial_fit(X, y, classes, _refit=False, 
208.             sample_weight=sample_weight)
209. 
210. def _partial_fit(self, X, y, classes=None, _refit=False, 
211.         sample_weight=None):
212. """Actual implementation of Gaussian NB fitting.
213. 
214.         Parameters
215.         ----------
216.         X : array-like of shape (n_samples, n_features)
217.             Training vectors, where n_samples is the number of samples and
218.             n_features is the number of features.
219. 
220.         y : array-like of shape (n_samples,)
221.             Target values.
222. 
223.         classes : array-like of shape (n_classes,), default=None
224.             List of all the classes that can possibly appear in the y vector.
225. 
226.             Must be provided at the first call to partial_fit, can be omitted
227.             in subsequent calls.
228. 
229.         _refit : bool, default=False
230.             If true, act as though this were the first time we called
231.             _partial_fit (ie, throw away any past fitting and start over).
232. 
233.         sample_weight : array-like of shape (n_samples,), default=None
234.             Weights applied to individual samples (1. for unweighted).
235. 
236.         Returns
237.         -------
238.         self : object
239.         """
        X, y = check_X_y(X, y)
        if sample_weight is not None:
            sample_weight = _check_sample_weight(sample_weight, X)

        # If the ratio of data variance between dimensions is too small, it
        # will cause numerical errors. To address this, we artificially
        # boost the variance by epsilon, a small fraction of the standard
        # deviation of the largest dimension.
        self.epsilon_ = self.var_smoothing * np.var(X, axis=0).max()

        if _refit:
            self.classes_ = None

        if _check_partial_fit_first_call(self, classes):
            # This is the first call to partial_fit:
            # initialize various cumulative counters
            n_features = X.shape[1]
            n_classes = len(self.classes_)
            self.theta_ = np.zeros((n_classes, n_features))
            self.sigma_ = np.zeros((n_classes, n_features))
            self.class_count_ = np.zeros(n_classes, dtype=np.float64)

            # Initialise the class prior, taking any user-supplied priors
            # into account
            if self.priors is not None:
                priors = np.asarray(self.priors)
                # Check that the provided priors match the number of classes
                if len(priors) != n_classes:
                    raise ValueError('Number of priors must match number of'
                                     ' classes.')
                # Check that the priors sum to 1
                if not np.isclose(priors.sum(), 1.0):
                    raise ValueError('The sum of the priors should be 1.')
                # Check that the priors are non-negative
                if (priors < 0).any():
                    raise ValueError('Priors must be non-negative.')
                self.class_prior_ = priors
            else:
                # Initialize the priors to zeros for each class
                self.class_prior_ = np.zeros(len(self.classes_),
                                             dtype=np.float64)
        else:
            if X.shape[1] != self.theta_.shape[1]:
                msg = "Number of features %d does not match previous data %d."
                raise ValueError(msg % (X.shape[1], self.theta_.shape[1]))
            # Put epsilon back in each time: remove it here, re-add it after
            # the per-class statistics have been updated
            self.sigma_[:, :] -= self.epsilon_

        classes = self.classes_

        unique_y = np.unique(y)
        unique_y_in_classes = np.in1d(unique_y, classes)

        if not np.all(unique_y_in_classes):
            raise ValueError("The target label(s) %s in y do not exist in the "
                             "initial classes %s" %
                             (unique_y[~unique_y_in_classes], classes))

        for y_i in unique_y:
            i = classes.searchsorted(y_i)
            X_i = X[y == y_i, :]

            if sample_weight is not None:
                sw_i = sample_weight[y == y_i]
                N_i = sw_i.sum()
            else:
                sw_i = None
                N_i = X_i.shape[0]

            # Merge this chunk's statistics into the running per-class
            # mean and variance
            new_theta, new_sigma = self._update_mean_variance(
                self.class_count_[i], self.theta_[i, :], self.sigma_[i, :],
                X_i, sw_i)

            self.theta_[i, :] = new_theta
            self.sigma_[i, :] = new_sigma
            self.class_count_[i] += N_i

        self.sigma_[:, :] += self.epsilon_

        # Update the priors only when no user-supplied priors were given
        if self.priors is None:
            # Empirical prior, with sample_weight taken into account
            self.class_prior_ = self.class_count_ / self.class_count_.sum()

        return self

    def _joint_log_likelihood(self, X):
        joint_log_likelihood = []
        for i in range(np.size(self.classes_)):
            # Log prior of class i plus the sum of per-feature Gaussian
            # log densities
            jointi = np.log(self.class_prior_[i])
            n_ij = -0.5 * np.sum(np.log(2. * np.pi * self.sigma_[i, :]))
            n_ij -= 0.5 * np.sum(((X - self.theta_[i, :]) ** 2) /
                                 (self.sigma_[i, :]), 1)
            joint_log_likelihood.append(jointi + n_ij)

        joint_log_likelihood = np.array(joint_log_likelihood).T
        return joint_log_likelihood
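
The partial_fit path above is what enables out-of-core training: each call merges the chunk's per-class counts, means, and variances into the running statistics via _update_mean_variance. A minimal editorial sketch of the usage (not part of the sklearn source; the synthetic data and the 5-way chunking are invented for illustration):

import numpy as np
from sklearn.naive_bayes import GaussianNB

# Synthetic 3-class data, invented for illustration
rng = np.random.RandomState(0)
X = rng.randn(300, 4) + np.repeat([[0.], [3.], [6.]], 100, axis=0)
y = np.repeat([0, 1, 2], 100)

clf = GaussianNB()
for X_chunk, y_chunk in zip(np.array_split(X, 5), np.array_split(y, 5)):
    # `classes` is required on the first call; it may be omitted afterwards
    clf.partial_fit(X_chunk, y_chunk, classes=np.array([0, 1, 2]))

print(clf.theta_.round(2))   # per-class feature means, roughly 0, 3, 6
print(clf.class_count_)      # [100. 100. 100.]
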
class MultinomialNB Found at: sklearn.naive_bayes

class MultinomialNB(_BaseDiscreteNB):
    """
    Naive Bayes classifier for multinomial models

    The multinomial Naive Bayes classifier is suitable for classification
    with discrete features (e.g., word counts for text classification). The
    multinomial distribution normally requires integer feature counts.
    However, in practice, fractional counts such as tf-idf may also work.

    Read more in the :ref:`User Guide <multinomial_naive_bayes>`.

    Parameters
    ----------
    alpha : float, default=1.0
        Additive (Laplace/Lidstone) smoothing parameter
        (0 for no smoothing).

    fit_prior : bool, default=True
        Whether to learn class prior probabilities or not.
        If false, a uniform prior will be used.

    class_prior : array-like of shape (n_classes,), default=None
        Prior probabilities of the classes. If specified, the priors are not
        adjusted according to the data.

    Attributes
    ----------
    class_count_ : ndarray of shape (n_classes,)
        Number of samples encountered for each class during fitting. This
        value is weighted by the sample weight when provided.

    class_log_prior_ : ndarray of shape (n_classes,)
        Smoothed empirical log probability for each class.

    classes_ : ndarray of shape (n_classes,)
        Class labels known to the classifier.

    coef_ : ndarray of shape (n_classes, n_features)
        Mirrors ``feature_log_prob_`` for interpreting MultinomialNB
        as a linear model.

    feature_count_ : ndarray of shape (n_classes, n_features)
        Number of samples encountered for each (class, feature)
        during fitting. This value is weighted by the sample weight when
        provided.

    feature_log_prob_ : ndarray of shape (n_classes, n_features)
        Empirical log probability of features
        given a class, ``P(x_i|y)``.

    intercept_ : ndarray of shape (n_classes,)
        Mirrors ``class_log_prior_`` for interpreting MultinomialNB
        as a linear model.

    n_features_ : int
        Number of features of each sample.

    Examples
    --------
    >>> import numpy as np
    >>> rng = np.random.RandomState(1)
    >>> X = rng.randint(5, size=(6, 100))
    >>> y = np.array([1, 2, 3, 4, 5, 6])
    >>> from sklearn.naive_bayes import MultinomialNB
    >>> clf = MultinomialNB()
    >>> clf.fit(X, y)
    MultinomialNB()
    >>> print(clf.predict(X[2:3]))
    [3]

    Notes
    -----
    For the rationale behind the names `coef_` and `intercept_`, i.e.
    naive Bayes as a linear classifier, see J. Rennie et al. (2003),
    Tackling the poor assumptions of naive Bayes text classifiers, ICML.

    References
    ----------
    C.D. Manning, P. Raghavan and H. Schuetze (2008). Introduction to
    Information Retrieval. Cambridge University Press, pp. 234-265.
    https://nlp.stanford.edu/IR-book/html/htmledition/naive-bayes-text-classification-1.html
    """
    @_deprecate_positional_args
    def __init__(self, *, alpha=1.0, fit_prior=True, class_prior=None):
        self.alpha = alpha
        self.fit_prior = fit_prior
        self.class_prior = class_prior

    def _more_tags(self):
        return {'requires_positive_X': True}

    def _count(self, X, Y):
        """Count and smooth feature occurrences."""
        check_non_negative(X, "MultinomialNB (input X)")
        self.feature_count_ += safe_sparse_dot(Y.T, X)
        self.class_count_ += Y.sum(axis=0)

    def _update_feature_log_prob(self, alpha):
        """Apply smoothing to raw counts and recompute log probabilities."""
        smoothed_fc = self.feature_count_ + alpha
        smoothed_cc = smoothed_fc.sum(axis=1)
        self.feature_log_prob_ = (np.log(smoothed_fc) -
                                  np.log(smoothed_cc.reshape(-1, 1)))

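    # (Editorial note, not part of the sklearn source) After the Laplace
    # smoothing above, each row of exp(feature_log_prob_) is a proper
    # probability distribution over the vocabulary, e.g.:
    #     np.exp(clf.feature_log_prob_).sum(axis=1)   # -> array of ones
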
    def _joint_log_likelihood(self, X):
        """Calculate the posterior log probability of the samples X."""
        return (safe_sparse_dot(X, self.feature_log_prob_.T) +
                self.class_log_prior_)
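
To tie this back to the news-text task, here is a minimal editorial sketch (the toy corpus and labels below are invented; this is not the article's pipeline) of MultinomialNB on bag-of-words counts, plus a check that the decision function really is the linear form computed in _joint_log_likelihood:

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Invented toy corpus: 0 = finance, 1 = sports
docs = ["stocks fall as banks report losses",
        "bank shares drop on weak earnings",
        "team wins the championship final",
        "striker scores twice in the cup final"]
labels = np.array([0, 0, 1, 1])

vec = CountVectorizer()
X = vec.fit_transform(docs)                 # sparse word-count matrix
clf = MultinomialNB(alpha=1.0).fit(X, labels)

X_new = vec.transform(["bank earnings fall"])
print(clf.predict(X_new))                   # expected: [0]

# The classifier is linear in the counts:
#     argmax_c ( log P(c) + sum_i x_i * log P(w_i | c) )
jll = X_new @ clf.feature_log_prob_.T + clf.class_log_prior_
print(clf.classes_[np.asarray(jll).argmax(axis=1)])  # matches predict()
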
class BernoulliNB Found at: sklearn.naive_bayes

class BernoulliNB(_BaseDiscreteNB):
    """Naive Bayes classifier for multivariate Bernoulli models.

    Like MultinomialNB, this classifier is suitable for discrete data. The
    difference is that while MultinomialNB works with occurrence counts,
    BernoulliNB is designed for binary/boolean features.

    Read more in the :ref:`User Guide <bernoulli_naive_bayes>`.

    Parameters
    ----------
    alpha : float, default=1.0
        Additive (Laplace/Lidstone) smoothing parameter
        (0 for no smoothing).

    binarize : float or None, default=0.0
        Threshold for binarizing (mapping to booleans) of sample features.
        If None, input is presumed to already consist of binary vectors.

    fit_prior : bool, default=True
        Whether to learn class prior probabilities or not.
        If false, a uniform prior will be used.

    class_prior : array-like of shape (n_classes,), default=None
        Prior probabilities of the classes. If specified, the priors are not
        adjusted according to the data.

    Attributes
    ----------
    class_count_ : ndarray of shape (n_classes,)
        Number of samples encountered for each class during fitting. This
        value is weighted by the sample weight when provided.

    class_log_prior_ : ndarray of shape (n_classes,)
        Log probability of each class (smoothed).

    classes_ : ndarray of shape (n_classes,)
        Class labels known to the classifier.

    feature_count_ : ndarray of shape (n_classes, n_features)
        Number of samples encountered for each (class, feature)
        during fitting. This value is weighted by the sample weight when
        provided.

    feature_log_prob_ : ndarray of shape (n_classes, n_features)
        Empirical log probability of features given a class, P(x_i|y).

    n_features_ : int
        Number of features of each sample.

    Examples
    --------
    >>> import numpy as np
    >>> rng = np.random.RandomState(1)
    >>> X = rng.randint(5, size=(6, 100))
    >>> Y = np.array([1, 2, 3, 4, 4, 5])
    >>> from sklearn.naive_bayes import BernoulliNB
    >>> clf = BernoulliNB()
    >>> clf.fit(X, Y)
    BernoulliNB()
    >>> print(clf.predict(X[2:3]))
    [3]

    References
    ----------
    C.D. Manning, P. Raghavan and H. Schuetze (2008). Introduction to
    Information Retrieval. Cambridge University Press, pp. 234-265.
    https://nlp.stanford.edu/IR-book/html/htmledition/the-bernoulli-model-1.html

    A. McCallum and K. Nigam (1998). A comparison of event models for naive
    Bayes text classification. Proc. AAAI/ICML-98 Workshop on Learning for
    Text Categorization, pp. 41-48.

    V. Metsis, I. Androutsopoulos and G. Paliouras (2006). Spam filtering with
    naive Bayes -- Which naive Bayes? 3rd Conf. on Email and Anti-Spam (CEAS).
    """
    @_deprecate_positional_args
    def __init__(self, *, alpha=1.0, binarize=.0, fit_prior=True,
                 class_prior=None):
        self.alpha = alpha
        self.binarize = binarize
        self.fit_prior = fit_prior
        self.class_prior = class_prior

    def _check_X(self, X):
        X = super()._check_X(X)
        if self.binarize is not None:
            X = binarize(X, threshold=self.binarize)
        return X

    def _check_X_y(self, X, y):
        X, y = super()._check_X_y(X, y)
        if self.binarize is not None:
            X = binarize(X, threshold=self.binarize)
        return X, y

    def _count(self, X, Y):
        """Count and smooth feature occurrences."""
        self.feature_count_ += safe_sparse_dot(Y.T, X)
        self.class_count_ += Y.sum(axis=0)

    def _update_feature_log_prob(self, alpha):
        """Apply smoothing to raw counts and recompute log probabilities."""
        smoothed_fc = self.feature_count_ + alpha
        smoothed_cc = self.class_count_ + alpha * 2
        self.feature_log_prob_ = (np.log(smoothed_fc) -
                                  np.log(smoothed_cc.reshape(-1, 1)))

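    # (Editorial note, not part of the sklearn source) The denominator uses
    # class_count_ + 2 * alpha because each binarized feature has exactly two
    # outcomes (present/absent), so alpha is added once per outcome.
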
    def _joint_log_likelihood(self, X):
        """Calculate the posterior log probability of the samples X."""
        n_classes, n_features = self.feature_log_prob_.shape
        n_samples, n_features_X = X.shape

        if n_features_X != n_features:
            raise ValueError(
                "Expected input with %d features, got %d instead" %
                (n_features, n_features_X))

        neg_prob = np.log(1 - np.exp(self.feature_log_prob_))
        # Compute  neg_prob · (1 - X).T  as  ∑neg_prob - X · neg_prob
        jll = safe_sparse_dot(X, (self.feature_log_prob_ - neg_prob).T)
        jll += self.class_log_prior_ + neg_prob.sum(axis=1)

        return jll
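
BernoulliNB models word presence/absence rather than counts, so raw features are binarized first, and the absent features also contribute log(1 - p) terms. A short editorial sketch (not from the article's code; the random data is invented) that exercises binarize and verifies the rearrangement used in the comment above:

import numpy as np
from sklearn.naive_bayes import BernoulliNB

rng = np.random.RandomState(0)
X = rng.randint(5, size=(8, 20))          # counts; binarize=0.0 maps >0 to 1
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

clf = BernoulliNB(alpha=1.0, binarize=0.0).fit(X, y)

Xb = (X > 0).astype(float)
log_p = clf.feature_log_prob_
neg = np.log(1 - np.exp(log_p))

# Direct form: sum_i [ x_i*log p_ic + (1 - x_i)*log(1 - p_ic) ] + log P(c)
jll_direct = Xb @ log_p.T + (1 - Xb) @ neg.T + clf.class_log_prior_
# Rearranged form used above, which keeps X in sparse-friendly products
jll_fast = Xb @ (log_p - neg).T + neg.sum(axis=1) + clf.class_log_prior_

print(np.allclose(jll_direct, jll_fast))                                # True
print((clf.classes_[jll_fast.argmax(axis=1)] == clf.predict(X)).all())  # True
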

