Contents
Binary classification on the Titanic dataset with the CatBoost algorithm
Related articles
ML - CatBoost: a detailed guide to the CatBoost algorithm - introduction, installation, and worked examples
ML - CatboostC: binary classification on the Titanic dataset with the CatBoost algorithm
ML - CatboostC: binary classification on the Titanic dataset with the CatBoost algorithm (implementation)
Binary classification on the Titanic dataset with the CatBoost algorithm
Design approach
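(Outline reconstructed from the output below.) The pipeline has four steps: read the Titanic training data and keep the Pclass, Sex, Age, SibSp, Parch, and Survived columns; detect the object-typed (categorical) feature, which is Sex at column index 1 (hence "object_features_ID: [1]" in the log); split the data into train and test sets; and train a CatBoostClassifier for 100 iterations with the test set as eval_set and use_best_model enabled, so the final model is shrunk to the best iteration (iteration 37, i.e. the first 38 trees). A code sketch of this pipeline follows the output below.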
Output
   Pclass     Sex   Age  SibSp  Parch  Survived
0       3    male  22.0      1      0         0
1       1  female  38.0      1      0         1
2       3  female  26.0      0      0         1
3       1  female  35.0      1      0         1
4       3    male  35.0      0      0         0
Pclass        int64
Sex          object
Age         float64
SibSp         int64
Parch         int64
Survived      int64
dtype: object
object_features_ID: [1]
0: learn: 0.5469469 test: 0.5358272 best: 0.5358272 (0) total: 98.1ms remaining: 9.71s
1: learn: 0.4884967 test: 0.4770551 best: 0.4770551 (1) total: 98.7ms remaining: 4.84s
2: learn: 0.4459496 test: 0.4453159 best: 0.4453159 (2) total: 99.3ms remaining: 3.21s
3: learn: 0.4331858 test: 0.4352757 best: 0.4352757 (3) total: 99.8ms remaining: 2.4s
4: learn: 0.4197131 test: 0.4266055 best: 0.4266055 (4) total: 100ms remaining: 1.91s
5: learn: 0.4085381 test: 0.4224953 best: 0.4224953 (5) total: 101ms remaining: 1.58s
6: learn: 0.4063807 test: 0.4209804 best: 0.4209804 (6) total: 102ms remaining: 1.35s
7: learn: 0.4007713 test: 0.4155077 best: 0.4155077 (7) total: 102ms remaining: 1.17s
8: learn: 0.3971064 test: 0.4135872 best: 0.4135872 (8) total: 103ms remaining: 1.04s
9: learn: 0.3943774 test: 0.4105674 best: 0.4105674 (9) total: 103ms remaining: 928ms
10: learn: 0.3930801 test: 0.4099915 best: 0.4099915 (10) total: 104ms remaining: 839ms
11: learn: 0.3904409 test: 0.4089840 best: 0.4089840 (11) total: 104ms remaining: 764ms
12: learn: 0.3890830 test: 0.4091666 best: 0.4089840 (11) total: 105ms remaining: 701ms
13: learn: 0.3851196 test: 0.4108839 best: 0.4089840 (11) total: 105ms remaining: 647ms
14: learn: 0.3833366 test: 0.4106298 best: 0.4089840 (11) total: 106ms remaining: 600ms
15: learn: 0.3792283 test: 0.4126097 best: 0.4089840 (11) total: 106ms remaining: 558ms
16: learn: 0.3765680 test: 0.4114997 best: 0.4089840 (11) total: 107ms remaining: 522ms
17: learn: 0.3760966 test: 0.4112166 best: 0.4089840 (11) total: 107ms remaining: 489ms
18: learn: 0.3736951 test: 0.4122305 best: 0.4089840 (11) total: 108ms remaining: 461ms
19: learn: 0.3719966 test: 0.4101199 best: 0.4089840 (11) total: 109ms remaining: 435ms
20: learn: 0.3711460 test: 0.4097299 best: 0.4089840 (11) total: 109ms remaining: 411ms
21: learn: 0.3707144 test: 0.4093512 best: 0.4089840 (11) total: 110ms remaining: 389ms
22: learn: 0.3699238 test: 0.4083409 best: 0.4083409 (22) total: 110ms remaining: 370ms
23: learn: 0.3670864 test: 0.4071850 best: 0.4071850 (23) total: 111ms remaining: 351ms
24: learn: 0.3635514 test: 0.4038399 best: 0.4038399 (24) total: 111ms remaining: 334ms
25: learn: 0.3627657 test: 0.4025837 best: 0.4025837 (25) total: 112ms remaining: 319ms
26: learn: 0.3621028 test: 0.4018449 best: 0.4018449 (26) total: 113ms remaining: 304ms
27: learn: 0.3616121 test: 0.4011693 best: 0.4011693 (27) total: 113ms remaining: 291ms
28: learn: 0.3614262 test: 0.4011820 best: 0.4011693 (27) total: 114ms remaining: 278ms
29: learn: 0.3610673 test: 0.4005475 best: 0.4005475 (29) total: 114ms remaining: 267ms
30: learn: 0.3588062 test: 0.4002801 best: 0.4002801 (30) total: 115ms remaining: 256ms
31: learn: 0.3583703 test: 0.3997255 best: 0.3997255 (31) total: 116ms remaining: 246ms
32: learn: 0.3580553 test: 0.4001878 best: 0.3997255 (31) total: 116ms remaining: 236ms
33: learn: 0.3556808 test: 0.4004169 best: 0.3997255 (31) total: 118ms remaining: 228ms
34: learn: 0.3536833 test: 0.4003229 best: 0.3997255 (31) total: 119ms remaining: 220ms
35: learn: 0.3519948 test: 0.4008047 best: 0.3997255 (31) total: 119ms remaining: 212ms
36: learn: 0.3515452 test: 0.4000576 best: 0.3997255 (31) total: 120ms remaining: 204ms
37: learn: 0.3512962 test: 0.3997214 best: 0.3997214 (37) total: 120ms remaining: 196ms
38: learn: 0.3507648 test: 0.4001569 best: 0.3997214 (37) total: 121ms remaining: 189ms
39: learn: 0.3489575 test: 0.4009203 best: 0.3997214 (37) total: 121ms remaining: 182ms
40: learn: 0.3480966 test: 0.4014031 best: 0.3997214 (37) total: 122ms remaining: 175ms
41: learn: 0.3477613 test: 0.4009293 best: 0.3997214 (37) total: 122ms remaining: 169ms
42: learn: 0.3472945 test: 0.4006602 best: 0.3997214 (37) total: 123ms remaining: 163ms
43: learn: 0.3465271 test: 0.4007531 best: 0.3997214 (37) total: 124ms remaining: 157ms
44: learn: 0.3461538 test: 0.4010608 best: 0.3997214 (37) total: 124ms remaining: 152ms
45: learn: 0.3455060 test: 0.4012489 best: 0.3997214 (37) total: 125ms remaining: 146ms
46: learn: 0.3449922 test: 0.4013439 best: 0.3997214 (37) total: 125ms remaining: 141ms
47: learn: 0.3445333 test: 0.4010754 best: 0.3997214 (37) total: 126ms remaining: 136ms
48: learn: 0.3443186 test: 0.4011180 best: 0.3997214 (37) total: 126ms remaining: 132ms
49: learn: 0.3424633 test: 0.4016071 best: 0.3997214 (37) total: 127ms remaining: 127ms
50: learn: 0.3421565 test: 0.4013135 best: 0.3997214 (37) total: 128ms remaining: 123ms
51: learn: 0.3417523 test: 0.4009993 best: 0.3997214 (37) total: 128ms remaining: 118ms
52: learn: 0.3415669 test: 0.4009101 best: 0.3997214 (37) total: 129ms remaining: 114ms
53: learn: 0.3413867 test: 0.4010833 best: 0.3997214 (37) total: 130ms remaining: 110ms
54: learn: 0.3405166 test: 0.4014830 best: 0.3997214 (37) total: 130ms remaining: 107ms
55: learn: 0.3401535 test: 0.4015556 best: 0.3997214 (37) total: 131ms remaining: 103ms
56: learn: 0.3395217 test: 0.4021097 best: 0.3997214 (37) total: 132ms remaining: 99.4ms
57: learn: 0.3393024 test: 0.4023377 best: 0.3997214 (37) total: 132ms remaining: 95.8ms
58: learn: 0.3389909 test: 0.4019616 best: 0.3997214 (37) total: 133ms remaining: 92.3ms
59: learn: 0.3388494 test: 0.4019746 best: 0.3997214 (37) total: 133ms remaining: 88.9ms
60: learn: 0.3384901 test: 0.4017470 best: 0.3997214 (37) total: 134ms remaining: 85.6ms
61: learn: 0.3382250 test: 0.4018783 best: 0.3997214 (37) total: 134ms remaining: 82.4ms
62: learn: 0.3345761 test: 0.4039633 best: 0.3997214 (37) total: 135ms remaining: 79.3ms
63: learn: 0.3317548 test: 0.4050218 best: 0.3997214 (37) total: 136ms remaining: 76.3ms
64: learn: 0.3306501 test: 0.4036656 best: 0.3997214 (37) total: 136ms remaining: 73.3ms
65: learn: 0.3292310 test: 0.4034339 best: 0.3997214 (37) total: 137ms remaining: 70.5ms
66: learn: 0.3283600 test: 0.4033661 best: 0.3997214 (37) total: 137ms remaining: 67.6ms
67: learn: 0.3282389 test: 0.4034237 best: 0.3997214 (37) total: 138ms remaining: 64.9ms
68: learn: 0.3274603 test: 0.4039310 best: 0.3997214 (37) total: 138ms remaining: 62.2ms
69: learn: 0.3273430 test: 0.4041663 best: 0.3997214 (37) total: 139ms remaining: 59.6ms
70: learn: 0.3271585 test: 0.4044144 best: 0.3997214 (37) total: 140ms remaining: 57.1ms
71: learn: 0.3268457 test: 0.4046981 best: 0.3997214 (37) total: 140ms remaining: 54.6ms
72: learn: 0.3266497 test: 0.4042724 best: 0.3997214 (37) total: 141ms remaining: 52.1ms
73: learn: 0.3259684 test: 0.4048797 best: 0.3997214 (37) total: 141ms remaining: 49.7ms
74: learn: 0.3257845 test: 0.4044766 best: 0.3997214 (37) total: 142ms remaining: 47.3ms
75: learn: 0.3256157 test: 0.4047031 best: 0.3997214 (37) total: 143ms remaining: 45.1ms
76: learn: 0.3251433 test: 0.4043698 best: 0.3997214 (37) total: 144ms remaining: 42.9ms
77: learn: 0.3247743 test: 0.4041652 best: 0.3997214 (37) total: 144ms remaining: 40.6ms
78: learn: 0.3224876 test: 0.4058880 best: 0.3997214 (37) total: 145ms remaining: 38.5ms
79: learn: 0.3223339 test: 0.4058139 best: 0.3997214 (37) total: 145ms remaining: 36.3ms
80: learn: 0.3211858 test: 0.4060056 best: 0.3997214 (37) total: 146ms remaining: 34.2ms
81: learn: 0.3200423 test: 0.4067103 best: 0.3997214 (37) total: 147ms remaining: 32.2ms
82: learn: 0.3198329 test: 0.4069039 best: 0.3997214 (37) total: 147ms remaining: 30.1ms
83: learn: 0.3196561 test: 0.4067853 best: 0.3997214 (37) total: 148ms remaining: 28.1ms
84: learn: 0.3193160 test: 0.4072288 best: 0.3997214 (37) total: 148ms remaining: 26.1ms
85: learn: 0.3184463 test: 0.4077451 best: 0.3997214 (37) total: 149ms remaining: 24.2ms
86: learn: 0.3175777 test: 0.4086243 best: 0.3997214 (37) total: 149ms remaining: 22.3ms
87: learn: 0.3173824 test: 0.4082013 best: 0.3997214 (37) total: 150ms remaining: 20.4ms
88: learn: 0.3172840 test: 0.4083946 best: 0.3997214 (37) total: 150ms remaining: 18.6ms
89: learn: 0.3166252 test: 0.4086761 best: 0.3997214 (37) total: 151ms remaining: 16.8ms
90: learn: 0.3164144 test: 0.4083237 best: 0.3997214 (37) total: 151ms remaining: 15ms
91: learn: 0.3162137 test: 0.4083699 best: 0.3997214 (37) total: 152ms remaining: 13.2ms
92: learn: 0.3155611 test: 0.4091627 best: 0.3997214 (37) total: 152ms remaining: 11.5ms
93: learn: 0.3153976 test: 0.4089484 best: 0.3997214 (37) total: 153ms remaining: 9.76ms
94: learn: 0.3139281 test: 0.4116939 best: 0.3997214 (37) total: 154ms remaining: 8.08ms
95: learn: 0.3128878 test: 0.4146652 best: 0.3997214 (37) total: 154ms remaining: 6.42ms
96: learn: 0.3127863 test: 0.4145767 best: 0.3997214 (37) total: 155ms remaining: 4.78ms
97: learn: 0.3126696 test: 0.4142118 best: 0.3997214 (37) total: 155ms remaining: 3.17ms
98: learn: 0.3120048 test: 0.4140831 best: 0.3997214 (37) total: 156ms remaining: 1.57ms
99: learn: 0.3117563 test: 0.4138267 best: 0.3997214 (37) total: 156ms remaining: 0us

bestTest = 0.3997213503
bestIteration = 37

Shrink model to first 38 iterations.
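The log above can be reproduced with a short script along the following lines. This is a minimal sketch, not the article's verbatim code: the CSV path titanic_train.csv, the train/test split ratio, and random_state are assumptions; the column list, the categorical feature index, and the 100-iteration / use_best_model setup are taken from the output.

# Minimal sketch of the pipeline behind the log above.
# Assumptions: file path, split ratio, and random_state are illustrative.
import pandas as pd
from sklearn.model_selection import train_test_split
from catboost import CatBoostClassifier

cols = ['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Survived']
data = pd.read_csv('titanic_train.csv')[cols]   # hypothetical path
print(data.head())
print(data.dtypes)

X, y = data.drop('Survived', axis=1), data['Survived']

# 'Sex' is the only object-typed column -> categorical feature index 1,
# matching "object_features_ID: [1]" in the log. Missing Age values are
# handled natively by CatBoost (nan_mode defaults to 'Min').
object_features_ID = [i for i, d in enumerate(X.dtypes) if d == 'object']
print('object_features_ID:', object_features_ID)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = CatBoostClassifier(iterations=100, use_best_model=True)  # 100 trees, as in the log
model.fit(X_train, y_train,
          cat_features=object_features_ID,
          eval_set=(X_test, y_test))  # shrinks to the best iteration (37 -> first 38 trees)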
Core code
# class CatBoostClassifier, found at: catboost.core

class CatBoostClassifier(CatBoost):
    _estimator_type = 'classifier'

    """
    Implementation of the scikit-learn API for CatBoost classification.

    Parameters
    ----------
    iterations : int, [default=500]
        Max count of trees.
        range: [1,+inf]
    learning_rate : float, [default value is selected automatically for binary classification
        with other parameters set to default; in all other cases the default is 0.03]
        Step size shrinkage used in update to prevent overfitting.
        range: (0,1]
    depth : int, [default=6]
        Depth of a tree. All trees are the same depth.
        range: [1,+inf]
    l2_leaf_reg : float, [default=3.0]
        Coefficient at the L2 regularization term of the cost function.
        range: [0,+inf]
    model_size_reg : float, [default=None]
        Model size regularization coefficient.
        range: [0,+inf]
    rsm : float, [default=None]
        Subsample ratio of columns when constructing each tree.
        range: (0,1]
    loss_function : string or object, [default='Logloss']
        The metric to use in training; also selects the machine learning problem to solve.
        If string, the name of a supported metric, optionally suffixed with a
        parameter description.
        If object, it shall provide methods 'calc_ders_range' or 'calc_ders_multi'.
    border_count : int, [default=254 for training on CPU or 128 for training on GPU]
        The number of partitions in numeric features binarization. Used in the
        preliminary calculation.
        range: [1,65535] on CPU, [1,255] on GPU
    feature_border_type : string, [default='GreedyLogSum']
        The binarization mode in numeric features binarization. Used in the
        preliminary calculation.
        Possible values:
        - 'Median'
        - 'Uniform'
        - 'UniformAndQuantiles'
        - 'GreedyLogSum'
        - 'MaxLogSum'
        - 'MinEntropy'
    per_float_feature_quantization : list of strings, [default=None]
        List of float binarization descriptions.
        Format : described in documentation on catboost.ai
        Example 1: ['0:1024'] means that feature 0 will have 1024 borders.
        Example 2: ['0:border_count=1024', '1:border_count=1024', ...] means that the
        first two features have 1024 borders.
        Example 3: ['0:nan_mode=Forbidden,border_count=32,border_type=GreedyLogSum',
        '1:nan_mode=Forbidden,border_count=32,border_type=GreedyLogSum'] defines more
        quantization properties for the first two features.
    input_borders : string, [default=None]
        Input file with borders used in numeric features binarization.
    output_borders : string, [default=None]
        Output file for borders that were used in numeric features binarization.
    fold_permutation_block : int, [default=1]
        To accelerate the learning.
        The recommended value is within [1, 256]. On small samples, must be set to 1.
        range: [1,+inf]
    od_pval : float, [default=None]
        Use overfitting detector to stop training when reaching a specified threshold.
        Can be used only with eval_set.
        range: [0,1]
    od_wait : int, [default=None]
        Number of iterations which overfitting detector will wait after new best error.
    od_type : string, [default=None]
        Type of overfitting detector which will be used in program.
        Possible values:
        - 'IncToDec'
        - 'Iter'
        For 'Iter' type od_pval must not be set.
        If None, then od_type=IncToDec.
    nan_mode : string, [default=None]
        Way to process missing values for numeric features.
        Possible values:
        - 'Forbidden' - raises an exception if there is a missing value for a numeric
          feature in a dataset.
        - 'Min' - each missing value will be processed as the minimum numerical value.
        - 'Max' - each missing value will be processed as the maximum numerical value.
        If None, then nan_mode=Min.
    counter_calc_method : string, [default=None]
        The method used to calculate counters for a dataset with Counter type.
        Possible values:
        - 'PrefixTest' - only objects up to current in the test dataset are considered
        - 'FullTest' - all objects are considered in the test dataset
        - 'SkipTest' - objects from the test dataset are not considered
        - 'Full' - all objects are considered for both learn and test dataset
        If None, then counter_calc_method=PrefixTest.
    leaf_estimation_iterations : int, [default=None]
        The number of steps in the gradient when calculating the values in the leaves.
        If None, then leaf_estimation_iterations=1.
        range: [1,+inf]
    leaf_estimation_method : string, [default=None]
        The method used to calculate the values in the leaves.
        Possible values:
        - 'Newton'
        - 'Gradient'
    thread_count : int, [default=None]
        Number of parallel threads used to run CatBoost.
        If None or -1, then the number of threads is set to the number of CPU cores.
        range: [1,+inf]
    random_seed : int, [default=None]
        Random number seed.
        If None, 0 is used.
        range: [0,+inf]
    use_best_model : bool, [default=None]
        To limit the number of trees in predict() using information about the optimal
        value of the error function. Can be used only with eval_set.
    best_model_min_trees : int, [default=None]
        The minimal number of trees the best model should have.
    verbose : bool
        When set to True, logging_level is set to 'Verbose'.
        When set to False, logging_level is set to 'Silent'.
    silent : bool, synonym for verbose
    logging_level : string, [default='Verbose']
        Possible values:
        - 'Silent'
        - 'Verbose'
        - 'Info'
        - 'Debug'
    metric_period : int, [default=1]
        The frequency of iterations to print the information to stdout.
        The value should be a positive integer.
    simple_ctr : list of strings, [default=None]
        Binarization settings for categorical features.
        Format : see documentation
        Example: ['Borders:CtrBorderCount=5:Prior=0:Prior=0.5',
        'BinarizedTargetMeanValue:TargetBorderCount=10:TargetBorderType=MinEntropy', ...]
        CTR types:
        CPU and GPU
        - 'Borders'
        - 'Buckets'
        CPU only
        - 'BinarizedTargetMeanValue'
        - 'Counter'
        GPU only
        - 'FloatTargetMeanValue'
        - 'FeatureFreq'
        Number of borders, binarization type, target borders and binarizations, and
        priors are optional parameters.
    combinations_ctr : list of strings, [default=None]
    per_feature_ctr : list of strings, [default=None]
    ctr_target_border_count : int, [default=None]
        Maximum number of borders used in target binarization for categorical features
        that need it. If TargetBorderCount is specified in the 'simple_ctr',
        'combinations_ctr' or 'per_feature_ctr' option, it overrides this value.
        range: [1, 255]
    ctr_leaf_count_limit : int, [default=None]
        The maximum number of leaves with categorical features.
        If the number of leaves exceeds the specified limit, some leaves are discarded.
        The leaves to be discarded are selected as follows:
        - The leaves are sorted by the frequency of the values.
        - The top N leaves are selected, where N is the value specified in the parameter.
        - All leaves starting from N+1 are discarded.
        This option reduces the resulting model size and the amount of memory required
        for training. Note that the resulting quality of the model can be affected.
        range: [1,+inf] (for zero limit use ignored_features)
    store_all_simple_ctr : bool, [default=None]
        Ignore categorical features, which are not used in feature combinations,
        when choosing candidates for exclusion.
        Use this parameter with ctr_leaf_count_limit only.
    max_ctr_complexity : int, [default=4]
        The maximum number of Categ features that can be combined.
        range: [0,+inf]
    has_time : bool, [default=False]
        To use the order in which objects are represented in the input data
        (do not perform a random permutation of the dataset at the preprocessing stage).
    allow_const_label : bool, [default=False]
        To allow a constant label value in the dataset.
    target_border : float, [default=None]
        Border for target binarization.
    classes_count : int, [default=None]
        The upper limit for the numeric class label.
        Defines the number of classes for multiclassification.
        Only non-negative integers can be specified.
        The given integer should be greater than any of the target values.
        If this parameter is specified, the labels for all classes in the input dataset
        should be smaller than the given value.
        If several of the 'classes_count', 'class_weights', 'class_names' parameters
        are defined, the numbers of classes specified by each of them must be equal.
    class_weights : list or dict, [default=None]
        Classes weights. The values are used as multipliers for the object weights.
        If None, all classes are supposed to have weight one.
        If list - class weights in order of class_names, or sequential classes if
        class_names is undefined.
        If dict - dict of class_name -> class_weight.
        If several of the 'classes_count', 'class_weights', 'class_names' parameters
        are defined, the numbers of classes specified by each of them must be equal.
    auto_class_weights : string, [default=None]
        Enables automatic class weights calculation. Possible values:
        - Balanced  # weight = maxSummaryClassWeight / summaryClassWeight,
          statistics determined from train pool
        - SqrtBalanced  # weight = sqrt(maxSummaryClassWeight / summaryClassWeight)
    class_names : list of strings, [default=None]
        Class names. Allows to redefine the default values for class labels
        (integer numbers).
        If several of the 'classes_count', 'class_weights', 'class_names' parameters
        are defined, the numbers of classes specified by each of them must be equal.
    one_hot_max_size : int, [default=None]
        Convert the feature to float if the number of different values that it takes
        exceeds the specified value. Ctrs are not calculated for such features.
    random_strength : float, [default=1]
        Score standard deviation multiplier.
    name : string, [default='experiment']
        The name that should be displayed in the visualization tools.
    ignored_features : list, [default=None]
        Indices or names of features that should be excluded when training.
    train_dir : string, [default=None]
        The directory in which to record the files generated in the process of learning.
    custom_metric : string or list of strings, [default=None]
        To use your own metric function.
    custom_loss : alias to custom_metric
    eval_metric : string or object, [default=None]
        To optimize your custom metric in loss.
    bagging_temperature : float, [default=None]
        Controls intensity of Bayesian bagging. The higher the temperature, the more
        aggressive the bagging is.
        Typical values are in range [0, 1] (0 - no bagging, 1 - default).
    save_snapshot : bool, [default=None]
        Enable progress snapshotting for restoring progress after crashes or interruptions.
    snapshot_file : string, [default=None]
        Learn progress snapshot file path; if None, the default filename is used.
    snapshot_interval : int, [default=600]
        Interval between saving snapshots (seconds).
    fold_len_multiplier : float, [default=None]
        Fold length multiplier. Should be greater than 1.
    used_ram_limit : string or number, [default=None]
        Set a limit on memory consumption (value like '1.2gb' or 1.2e9).
        WARNING: Currently this option affects CTR memory usage only.
    gpu_ram_part : float, [default=0.95]
        Fraction of the GPU RAM to use for training, a value from (0, 1].
    pinned_memory_size : int, [default=None]
        Size of additional CPU pinned memory used for GPU learning; usually estimated
        automatically, thus usually should not be set.
    allow_writing_files : bool, [default=True]
        If this flag is set to False, no files with diagnostic info will be created
        during training. With this flag no snapshotting can be done, and visualisation
        will not work, because visualisation uses files that are created and updated
        during training.
    final_ctr_computation_mode : string, [default='Default']
        Possible values:
        - 'Default' - Compute final ctrs for all pools.
        - 'Skip' - Skip final ctr computation. WARNING: a model without ctrs can't be applied.
    approx_on_full_history : bool, [default=False]
        If this flag is set to True, each approximated value is calculated using all the
        preceding rows in the fold (slower, more accurate).
        If this flag is set to False, each approximated value is calculated using only
        the beginning 1/fold_len_multiplier fraction of the fold (faster, slightly
        less accurate).
    boosting_type : string, default value depends on object count and feature count
        in the train dataset and on the learning mode.
        Boosting scheme.
        Possible values:
        - 'Ordered' - Gives better quality, but may slow down the training.
        - 'Plain' - The classic gradient boosting scheme. May result in quality
          degradation, but does not slow down the training.
    task_type : string, [default=None]
        The calcer type used to train the model.
        Possible values:
        - 'CPU'
        - 'GPU'
    device_config : string, [default=None], deprecated, use devices instead
    devices : list or string, [default=None], GPU devices to use.
        String format is: '0' for 1 device, or '0:1:3' for multiple devices, or '0-3'
        for a range of devices.
        List format is: [0] for 1 device, or [0,1,3] for multiple devices.
    bootstrap_type : string, Bayesian, Bernoulli, Poisson, MVS.
        Default bootstrap is Bayesian for GPU and MVS for CPU.
        Poisson bootstrap is supported only on GPU.
        MVS bootstrap is supported only on CPU.
    subsample : float, [default=None]
        Sample rate for bagging. This parameter can be used with the Poisson or
        Bernoulli bootstrap types.
    mvs_reg : float, [default is set automatically at each iteration based on
        gradient distribution]
        Regularization parameter for the MVS sampling algorithm.
    monotone_constraints : list or numpy.ndarray or string or dict, [default=None]
        Monotone constraints for features.
    feature_weights : list or numpy.ndarray or string or dict, [default=None]
        Coefficient to multiply split gain with specific feature use.
        Should be non-negative.
    penalties_coefficient : float, [default=1]
        Common coefficient for all penalties. Should be non-negative.
    first_feature_use_penalties : list or numpy.ndarray or string or dict, [default=None]
        Penalties for the first use of a specific feature in the model. Should be
        non-negative.
    per_object_feature_penalties : list or numpy.ndarray or string or dict, [default=None]
        Penalties for the first use of a feature for each object. Should be non-negative.
    sampling_frequency : string, [default=PerTree]
        Frequency to sample weights and objects when building trees.
        Possible values:
        - 'PerTree' - Before constructing each new tree
        - 'PerTreeLevel' - Before choosing each new split of a tree
    sampling_unit : string, [default='Object']
        Possible values:
        - 'Object'
        - 'Group'
        The parameter allows to specify the sampling scheme: sample weights for each
        object individually or for an entire group of objects together.
    dev_score_calc_obj_block_size : int, [default=5000000]
        CPU only. Size of a block of samples in score calculation. Should be > 0.
        Used only for learning speed tuning.
        Changing this parameter can affect results due to numerical accuracy differences.
    dev_efb_max_buckets : int, [default=1024]
        CPU only. Maximum bucket count in an exclusive features bundle.
        Should be an integer between 0 and 65536.
        Used only for learning speed tuning.
    sparse_features_conflict_fraction : float, [default=0.0]
        CPU only. Maximum allowed fraction of conflicting non-default values for
        features in an exclusive features bundle.
        Should be a real value in the [0, 1) interval.
    grow_policy : string, [SymmetricTree,Lossguide,Depthwise], [default=SymmetricTree]
        The tree growing policy. It describes how to perform greedy tree construction.
    min_data_in_leaf : int, [default=1]
        The minimum training samples count in a leaf.
        CatBoost will not search for new splits in leaves with a samples count less
        than min_data_in_leaf.
        This parameter is used only for the Depthwise and Lossguide growing policies.
    max_leaves : int, [default=31]
        The maximum leaf count in the resulting tree.
        This parameter is used only for the Lossguide growing policy.
    score_function : string, possible values L2, Cosine, NewtonL2, NewtonCosine,
        [default=Cosine]
        For growing policy Lossguide default=NewtonL2.
        GPU only. Score that is used during tree construction to select the next
        tree split.
    max_depth : int, synonym for depth.
    n_estimators : int, synonym for iterations.
    num_trees : int, synonym for iterations.
    num_boost_round : int, synonym for iterations.
    colsample_bylevel : float, synonym for rsm.
    random_state : int, synonym for random_seed.
    reg_lambda : float, synonym for l2_leaf_reg.
    objective : string, synonym for loss_function.
    num_leaves : int, synonym for max_leaves.
    min_child_samples : int, synonym for min_data_in_leaf.
    eta : float, synonym for learning_rate.
    max_bin : float, synonym for border_count.
    scale_pos_weight : float, synonym for class_weights.
        Can be used only for binary classification. Sets the weight multiplier for
        class 1 to the scale_pos_weight value.
    metadata : dict, string to string key-value pairs to be stored in the model
        metadata storage.
    early_stopping_rounds : int
        Synonym for od_wait. Only one of these parameters should be set.
    cat_features : list or numpy.ndarray, [default=None]
        If not None, gives the list of Categ features indices or names
        (names are represented as strings).
        If it contains feature names, feature names must be defined for the training
        dataset passed to 'fit'.
    text_features : list or numpy.ndarray, [default=None]
        If not None, gives the list of Text features indices or names
        (names are represented as strings).
        If it contains feature names, feature names must be defined for the training
        dataset passed to 'fit'.
    embedding_features : list or numpy.ndarray, [default=None]
        If not None, gives the list of Embedding features indices or names
        (names are represented as strings).
        If it contains feature names, feature names must be defined for the training
        dataset passed to 'fit'.
    leaf_estimation_backtracking : string, [default=None]
        Type of backtracking during gradient descent.
        Possible values:
        - 'No' - never backtrack; supported on CPU and GPU
        - 'AnyImprovement' - reduce the descent step until the value of the loss
          function is less than before the step; supported on CPU and GPU
        - 'Armijo' - reduce the descent step until the Armijo condition is satisfied;
          supported on GPU only
    model_shrink_rate : float, [default=0]
        This parameter enables shrinkage of the model at the start of each iteration. CPU only.
        For Constant mode the shrinkage coefficient is calculated as
        (1 - model_shrink_rate * learning_rate).
        For Decreasing mode the shrinkage coefficient is calculated as
        (1 - model_shrink_rate / iteration).
        The shrinkage coefficient should be in [0, 1).
    model_shrink_mode : string, [default=None]
        Mode of shrinkage coefficient calculation. CPU only.
        Possible values:
        - 'Constant' - Shrinkage coefficient is constant at each iteration.
        - 'Decreasing' - Shrinkage coefficient decreases at each iteration.
    langevin : bool, [default=False]
        Enables Stochastic Gradient Langevin Boosting. CPU only.
    diffusion_temperature : float, [default=0]
        Langevin boosting diffusion temperature. CPU only.
    posterior_sampling : bool, [default=False]
        Sets a group of parameters for further use in Uncertainty prediction:
        - Langevin = True
        - Model Shrink Rate = 1/(2N), where N is the dataset size
        - Model Shrink Mode = Constant
        - Diffusion-temperature = N, where N is the dataset size. CPU only.
    boost_from_average : bool, [default=True for RMSE, False for other losses]
        Enables initializing approx values by the best constant value for the
        specified loss function.
        Available for RMSE, Logloss, CrossEntropy, Quantile and MAE.
    tokenizers : list of dicts,
        Each dict is a tokenizer description. Example:
        ```
        [
            {
                'tokenizer_id': 'Tokenizer',  # Tokenizer identifier.
                'lowercasing': 'false',  # Possible values: 'true', 'false'.
                'number_process_policy': 'LeaveAsIs',  # Possible values: 'Skip', 'LeaveAsIs', 'Replace'.
                'number_token': '%',  # Rarely used character. Used in conjunction with Replace NumberProcessPolicy.
                'separator_type': 'ByDelimiter',  # Possible values: 'ByDelimiter', 'BySense'.
                'delimiter': ' ',  # Used in conjunction with ByDelimiter SeparatorType.
                'split_by_set': 'false',  # Each single character in delimiter used as an individual delimiter.
                'skip_empty': 'true',  # Possible values: 'true', 'false'.
                'token_types': ['Word', 'Number', 'Unknown'],  # Used in conjunction with BySense SeparatorType.
                    # Possible values: 'Word', 'Number', 'Punctuation', 'SentenceBreak', 'ParagraphBreak', 'Unknown'.
                'subtokens_policy': 'SingleToken',  # Possible values:
                    # 'SingleToken' - All subtokens are interpreted as a single token.
                    # 'SeveralTokens' - All subtokens are interpreted as several tokens.
            },
            ...
        ]
        ```
    dictionaries : list of dicts,
        Each dict is a dictionary description. Example:
        ```
        [
            {
                'dictionary_id': 'Dictionary',  # Dictionary identifier.
                'token_level_type': 'Word',  # Possible values: 'Word', 'Letter'.
                'gram_order': '1',  # 1 for Unigram, 2 for Bigram, ...
                'skip_step': '0',  # 1 for 1-skip-gram, ...
                'end_of_word_token_policy': 'Insert',  # Possible values: 'Insert', 'Skip'.
                'end_of_sentence_token_policy': 'Skip',  # Possible values: 'Insert', 'Skip'.
                'occurrence_lower_bound': '3',  # The lower bound of token occurrences in the text to include it in the dictionary.
                'max_dictionary_size': '50000',  # The max dictionary size.
            },
            ...
        ]
        ```
    feature_calcers : list of strings,
        Each string is a calcer description. Example:
        ```
        [
            'NaiveBayes',
            'BM25',
            'BoW:top_tokens_count=2000',
        ]
        ```
    text_processing : dict,
        Text processing description.
    """
    def __init__(
        self,
        iterations=None, learning_rate=None, depth=None, l2_leaf_reg=None,
        model_size_reg=None, rsm=None, loss_function=None, border_count=None,
        feature_border_type=None, per_float_feature_quantization=None, input_borders=None,
        output_borders=None, fold_permutation_block=None, od_pval=None, od_wait=None,
        od_type=None, nan_mode=None, counter_calc_method=None, leaf_estimation_iterations=None,
        leaf_estimation_method=None, thread_count=None, random_seed=None, use_best_model=None,
        best_model_min_trees=None, verbose=None, silent=None, logging_level=None,
        metric_period=None, ctr_leaf_count_limit=None, store_all_simple_ctr=None,
        max_ctr_complexity=None, has_time=None, allow_const_label=None, target_border=None,
        classes_count=None, class_weights=None, auto_class_weights=None, class_names=None,
        one_hot_max_size=None, random_strength=None, name=None, ignored_features=None,
        train_dir=None, custom_loss=None, custom_metric=None, eval_metric=None,
        bagging_temperature=None, save_snapshot=None, snapshot_file=None, snapshot_interval=None,
        fold_len_multiplier=None, used_ram_limit=None, gpu_ram_part=None, pinned_memory_size=None,
        allow_writing_files=None, final_ctr_computation_mode=None, approx_on_full_history=None,
        boosting_type=None, simple_ctr=None, combinations_ctr=None, per_feature_ctr=None,
        ctr_description=None, ctr_target_border_count=None, task_type=None, device_config=None,
        devices=None, bootstrap_type=None, subsample=None, mvs_reg=None, sampling_unit=None,
        sampling_frequency=None, dev_score_calc_obj_block_size=None, dev_efb_max_buckets=None,
        sparse_features_conflict_fraction=None, max_depth=None, n_estimators=None,
        num_boost_round=None, num_trees=None, colsample_bylevel=None, random_state=None,
        reg_lambda=None, objective=None, eta=None, max_bin=None, scale_pos_weight=None,
        gpu_cat_features_storage=None, data_partition=None, metadata=None,
        early_stopping_rounds=None, cat_features=None, grow_policy=None, min_data_in_leaf=None,
        min_child_samples=None, max_leaves=None, num_leaves=None, score_function=None,
        leaf_estimation_backtracking=None, ctr_history_unit=None, monotone_constraints=None,
        feature_weights=None, penalties_coefficient=None, first_feature_use_penalties=None,
        per_object_feature_penalties=None, model_shrink_rate=None, model_shrink_mode=None,
        langevin=None, diffusion_temperature=None, posterior_sampling=None,
        boost_from_average=None, text_features=None, tokenizers=None, dictionaries=None,
        feature_calcers=None, text_processing=None, embedding_features=None
    ):
        params = {}
        not_params = ["not_params", "self", "params", "__class__"]
        for key, value in iteritems(locals().copy()):
            if key not in not_params and value is not None:
                params[key] = value
        super(CatBoostClassifier, self).__init__(params)

    def fit(self, X, y=None, cat_features=None, text_features=None, embedding_features=None,
            sample_weight=None, baseline=None, use_best_model=None, eval_set=None,
            verbose=None, logging_level=None, plot=False, column_description=None,
            verbose_eval=None, metric_period=None, silent=None, early_stopping_rounds=None,
            save_snapshot=None, snapshot_file=None, snapshot_interval=None, init_model=None):
        """
        Fit the CatBoostClassifier model.

        Parameters
        ----------
        X : catboost.Pool or list or numpy.ndarray or pandas.DataFrame or pandas.Series
            If not catboost.Pool, a 2-dimensional feature matrix or a string - a file
            with the dataset.
        y : list or numpy.ndarray or pandas.DataFrame or pandas.Series, optional (default=None)
            Labels, 1-dimensional array like.
            Use only if X is not catboost.Pool.
        cat_features : list or numpy.ndarray, optional (default=None)
            If not None, gives the list of Categ columns indices.
            Use only if X is not catboost.Pool.
        text_features : list or numpy.ndarray, optional (default=None)
            If not None, gives the list of Text columns indices.
            Use only if X is not catboost.Pool.
        embedding_features : list or numpy.ndarray, optional (default=None)
            If not None, gives the list of Embedding columns indices.
            Use only if X is not catboost.Pool.
        sample_weight : list or numpy.ndarray or pandas.DataFrame or pandas.Series,
            optional (default=None)
            Instance weights, 1-dimensional array like.
        baseline : list or numpy.ndarray, optional (default=None)
            If not None, gives 2-dimensional array like data.
            Use only if X is not catboost.Pool.
        use_best_model : bool, optional (default=None)
            Flag to use the best model.
        eval_set : catboost.Pool or list, optional (default=None)
            A list of (X, y) tuple pairs to use as a validation set for early-stopping.
        metric_period : int
            Frequency of evaluating metrics.
        verbose : bool or int
            If verbose is bool, then if set to True, logging_level is set to Verbose;
            if set to False, logging_level is set to Silent.
            If verbose is int, it determines the frequency of writing metrics to output,
            and logging_level is set to Verbose.
        silent : bool
            If silent is True, logging_level is set to Silent.
            If silent is False, logging_level is set to Verbose.
        logging_level : string, optional (default=None)
            Possible values:
            - 'Silent'
            - 'Verbose'
            - 'Info'
            - 'Debug'
        plot : bool, optional (default=False)
            If True, draw train and eval error in a Jupyter notebook.
        verbose_eval : bool or int
            Synonym for verbose. Only one of these parameters should be set.
        early_stopping_rounds : int
            Activates the Iter overfitting detector with od_wait set to
            early_stopping_rounds.
        save_snapshot : bool, [default=None]
            Enable progress snapshotting for restoring progress after crashes or
            interruptions.
        snapshot_file : string, [default=None]
            Learn progress snapshot file path; if None, the default filename is used.
        snapshot_interval : int, [default=600]
            Interval between saving snapshots (seconds).
        init_model : CatBoost class or string, [default=None]
            Continue training starting from the existing model.
            If this parameter is a string, load the initial model from the path
            specified by this string.

        Returns
        -------
        model : CatBoost
        """
        params = self._init_params.copy()
        _process_synonyms(params)
        if 'loss_function' in params:
            self._check_is_classification_objective(params['loss_function'])
        self._fit(X, y, cat_features, text_features, embedding_features, None,
                  sample_weight, None, None, None, None, baseline, use_best_model,
                  eval_set, verbose, logging_level, plot, column_description,
                  verbose_eval, metric_period, silent, early_stopping_rounds,
                  save_snapshot, snapshot_file, snapshot_interval, init_model)
        return self

    def predict(self, data, prediction_type='Class', ntree_start=0, ntree_end=0,
                thread_count=-1, verbose=None):
        """
        Predict with data.

        Parameters
        ----------
        data : catboost.Pool or list of features or list of lists or numpy.ndarray
            or pandas.DataFrame or pandas.Series or catboost.FeaturesData
            Data to apply the model on.
            If data is a simple list (not a list of lists) or a one-dimensional
            numpy.ndarray, it is interpreted as a list of features for a single object.
        prediction_type : string, optional (default='Class')
            Can be:
            - 'RawFormulaVal' : return raw formula value.
            - 'Class' : return class label.
            - 'Probability' : return probability for every class.
            - 'LogProbability' : return log probability for every class.
        ntree_start : int, optional (default=0)
            The model is applied on the interval [ntree_start, ntree_end)
            (zero-based indexing).
        ntree_end : int, optional (default=0)
            The model is applied on the interval [ntree_start, ntree_end)
            (zero-based indexing).
            If the value equals 0, this parameter is ignored and ntree_end is set
            to tree_count_.
        thread_count : int (default=-1)
            The number of threads to use when applying the model.
            Allows you to optimize the speed of execution. This parameter doesn't
            affect results.
            If -1, then the number of threads is set to the number of CPU cores.
        verbose : bool, optional (default=False)
            If True, writes the evaluation metric measured set to stderr.

        Returns
        -------
        prediction :
            If data is for a single object, the return value depends on the
            prediction_type value:
            - 'RawFormulaVal' : return raw formula value.
            - 'Class' : return class label.
            - 'Probability' : return a one-dimensional numpy.ndarray with the
              probability for every class.
            - 'LogProbability' : return a one-dimensional numpy.ndarray with the
              log probability for every class.
            otherwise a numpy.ndarray, with values that depend on the prediction_type value:
            - 'RawFormulaVal' : one-dimensional array of raw formula values for each object.
            - 'Class' : one-dimensional array of class labels for each object.
            - 'Probability' : two-dimensional numpy.ndarray with shape
              (number_of_objects x number_of_classes) with the probability for every
              class for each object.
            - 'LogProbability' : two-dimensional numpy.ndarray with shape
              (number_of_objects x number_of_classes) with the log probability for
              every class for each object.
        """
        return self._predict(data, prediction_type, ntree_start, ntree_end,
                             thread_count, verbose, 'predict')

    def predict_proba(self, data, ntree_start=0, ntree_end=0, thread_count=-1, verbose=None):
        """
        Predict class probability with data.

        Parameters
        ----------
        data : catboost.Pool or list of features or list of lists or numpy.ndarray
            or pandas.DataFrame or pandas.Series or catboost.FeaturesData
            Data to apply the model on.
            If data is a simple list (not a list of lists) or a one-dimensional
            numpy.ndarray, it is interpreted as a list of features for a single object.
        ntree_start : int, optional (default=0)
            The model is applied on the interval [ntree_start, ntree_end)
            (zero-based indexing).
        ntree_end : int, optional (default=0)
            The model is applied on the interval [ntree_start, ntree_end)
            (zero-based indexing).
            If the value equals 0, this parameter is ignored and ntree_end is set
            to tree_count_.
        thread_count : int (default=-1)
            The number of threads to use when applying the model.
            Allows you to optimize the speed of execution. This parameter doesn't
            affect results.
            If -1, then the number of threads is set to the number of CPU cores.
        verbose : bool
            If True, writes the evaluation metric measured set to stderr.

        Returns
        -------
        prediction :
            If data is for a single object,
                return a one-dimensional numpy.ndarray with the probability for every class;
            otherwise,
                return a two-dimensional numpy.ndarray with shape
                (number_of_objects x number_of_classes)
                with the probability for every class for each object.
        """
        return self._predict(data, 'Probability', ntree_start, ntree_end,
                             thread_count, verbose, 'predict_proba')

    def predict_log_proba(self, data, ntree_start=0, ntree_end=0, thread_count=-1, verbose=None):
        """
        Predict class log probability with data.

        Parameters
        ----------
        data : catboost.Pool or list of features or list of lists or numpy.ndarray
            or pandas.DataFrame or pandas.Series or catboost.FeaturesData
            Data to apply the model on.
            If data is a simple list (not a list of lists) or a one-dimensional
            numpy.ndarray, it is interpreted as a list of features for a single object.
        ntree_start : int, optional (default=0)
            The model is applied on the interval [ntree_start, ntree_end)
            (zero-based indexing).
        ntree_end : int, optional (default=0)
            The model is applied on the interval [ntree_start, ntree_end)
            (zero-based indexing).
            If the value equals 0, this parameter is ignored and ntree_end is set
            to tree_count_.
        thread_count : int (default=-1)
            The number of threads to use when applying the model.
            Allows you to optimize the speed of execution. This parameter doesn't
            affect results.
            If -1, then the number of threads is set to the number of CPU cores.
        verbose : bool
            If True, writes the evaluation metric measured set to stderr.

        Returns
        -------
        prediction :
            If data is for a single object,
                return a one-dimensional numpy.ndarray with the log probability for
                every class;
            otherwise,
                return a two-dimensional numpy.ndarray with shape
                (number_of_objects x number_of_classes)
                with the log probability for every class for each object.
        """
        return self._predict(data, 'LogProbability', ntree_start, ntree_end,
                             thread_count, verbose, 'predict_log_proba')

    def staged_predict(self, data, prediction_type='Class', ntree_start=0, ntree_end=0,
                       eval_period=1, thread_count=-1, verbose=None):
        """
        Predict target at each stage for data.

        Parameters
        ----------
        data : catboost.Pool or list of features or list of lists or numpy.ndarray
            or pandas.DataFrame or pandas.Series or catboost.FeaturesData
            Data to apply the model on.
            If data is a simple list (not a list of lists) or a one-dimensional
            numpy.ndarray, it is interpreted as a list of features for a single object.
        prediction_type : string, optional (default='Class')
            Can be:
            - 'RawFormulaVal' : return raw formula value.
            - 'Class' : return class label.
            - 'Probability' : return probability for every class.
            - 'LogProbability' : return log probability for every class.
        ntree_start : int, optional (default=0)
            The model is applied on the interval [ntree_start, ntree_end) with the
            step eval_period (zero-based indexing).
        ntree_end : int, optional (default=0)
            The model is applied on the interval [ntree_start, ntree_end) with the
            step eval_period (zero-based indexing).
            If the value equals 0, this parameter is ignored and ntree_end is set
            to tree_count_.
        eval_period : int, optional (default=1)
            The model is applied on the interval [ntree_start, ntree_end) with the
            step eval_period (zero-based indexing).
        thread_count : int (default=-1)
            The number of threads to use when applying the model.
            Allows you to optimize the speed of execution. This parameter doesn't
            affect results.
            If -1, then the number of threads is set to the number of CPU cores.
        verbose : bool
            If True, writes the evaluation metric measured set to stderr.

        Returns
        -------
        prediction : generator that, for each iteration, generates:
            If data is for a single object, the return value depends on the
            prediction_type value:
            - 'RawFormulaVal' : return raw formula value.
            - 'Class' : return majority vote class.
            - 'Probability' : return a one-dimensional numpy.ndarray with the
              probability for every class.
            - 'LogProbability' : return a one-dimensional numpy.ndarray with the
              log probability for every class.
            otherwise a numpy.ndarray, with values that depend on the prediction_type value:
            - 'RawFormulaVal' : one-dimensional array of raw formula values for each object.
            - 'Class' : one-dimensional array of class labels for each object.
            - 'Probability' : two-dimensional numpy.ndarray with shape
              (number_of_objects x number_of_classes) with the probability for every
              class for each object.
            - 'LogProbability' : two-dimensional numpy.ndarray with shape
              (number_of_objects x number_of_classes) with the log probability for
              every class for each object.
        """
        return self._staged_predict(data, prediction_type, ntree_start, ntree_end,
                                    eval_period, thread_count, verbose, 'staged_predict')

    def staged_predict_proba(self, data, ntree_start=0, ntree_end=0, eval_period=1,
                             thread_count=-1, verbose=None):
        """
        Predict classification target at each stage for data.

        Parameters
        ----------
        data : catboost.Pool or list of features or list of lists or numpy.ndarray
            or pandas.DataFrame or pandas.Series or catboost.FeaturesData
            Data to apply the model on.
            If data is a simple list (not a list of lists) or a one-dimensional
            numpy.ndarray, it is interpreted as a list of features for a single object.
        ntree_start : int, optional (default=0)
            The model is applied on the interval [ntree_start, ntree_end) with the
            step eval_period (zero-based indexing).
        ntree_end : int, optional (default=0)
            The model is applied on the interval [ntree_start, ntree_end) with the
            step eval_period (zero-based indexing).
            If the value equals 0, this parameter is ignored and ntree_end is set
            to tree_count_.
        eval_period : int, optional (default=1)
            The model is applied on the interval [ntree_start, ntree_end) with the
            step eval_period (zero-based indexing).
        thread_count : int (default=-1)
            The number of threads to use when applying the model.
            Allows you to optimize the speed of execution. This parameter doesn't
            affect results.
            If -1, then the number of threads is set to the number of CPU cores.
        verbose : bool
            If True, writes the evaluation metric measured set to stderr.

        Returns
        -------
        prediction : generator that, for each iteration, generates:
            If data is for a single object,
                return a one-dimensional numpy.ndarray with the probability for every class;
            otherwise,
                return a two-dimensional numpy.ndarray with shape
                (number_of_objects x number_of_classes)
                with the probability for every class for each object.
        """
        return self._staged_predict(data, 'Probability', ntree_start, ntree_end,
                                    eval_period, thread_count, verbose, 'staged_predict_proba')

    def staged_predict_log_proba(self, data, ntree_start=0, ntree_end=0, eval_period=1,
                                 thread_count=-1, verbose=None):
        """
        Predict classification target at each stage for data.

        Parameters
        ----------
        data : catboost.Pool or list of features or list of lists or numpy.ndarray
            or pandas.DataFrame or pandas.Series or catboost.FeaturesData
            Data to apply the model on.
            If data is a simple list (not a list of lists) or a one-dimensional
            numpy.ndarray, it is interpreted as a list of features for a single object.
        ntree_start : int, optional (default=0)
            The model is applied on the interval [ntree_start, ntree_end) with the
            step eval_period (zero-based indexing).
        ntree_end : int, optional (default=0)
            The model is applied on the interval [ntree_start, ntree_end) with the
            step eval_period (zero-based indexing).
            If the value equals 0, this parameter is ignored and ntree_end is set
            to tree_count_.
        eval_period : int, optional (default=1)
            The model is applied on the interval [ntree_start, ntree_end) with the
            step eval_period (zero-based indexing).
        thread_count : int (default=-1)
            The number of threads to use when applying the model.
            Allows you to optimize the speed of execution. This parameter doesn't
            affect results.
            If -1, then the number of threads is set to the number of CPU cores.
        verbose : bool
            If True, writes the evaluation metric measured set to stderr.

        Returns
        -------
        prediction : generator that, for each iteration, generates:
            If data is for a single object,
                return a one-dimensional numpy.ndarray with the log probability for
                every class;
            otherwise,
                return a two-dimensional numpy.ndarray with shape
                (number_of_objects x number_of_classes)
                with the log probability for every class for each object.
        """
        return self._staged_predict(data, 'LogProbability', ntree_start, ntree_end,
                                    eval_period, thread_count, verbose,
                                    'staged_predict_log_proba')

    def score(self, X, y=None):
        """
        Calculate accuracy.

        Parameters
        ----------
        X : catboost.Pool or list or numpy.ndarray or pandas.DataFrame or pandas.Series
            Data to apply the model on.
        y : list or numpy.ndarray
            True labels.

        Returns
        -------
        accuracy : float
        """
        if isinstance(X, Pool):
            if y is not None:
                raise CatBoostError("Wrong initializing y: X is catboost.Pool object, "
                                    "y must be initialized inside catboost.Pool.")
            y = X.get_label()
            if y is None:
                raise CatBoostError("Label in X has not initialized.")
        if isinstance(y, DataFrame):
            if len(y.columns) != 1:
                raise CatBoostError("y is DataFrame and has {} columns, but must have "
                                    "exactly one.".format(len(y.columns)))
            y = y[y.columns[0]]
        elif y is None:
            raise CatBoostError("y should be specified.")
        y = np.array(y)
        predicted_classes = self._predict(
            X, prediction_type='Class', ntree_start=0, ntree_end=0,
            thread_count=-1, verbose=None, parent_method_name='score').reshape(-1)
        if np.issubdtype(predicted_classes.dtype, np.number):
            if np.issubdtype(y.dtype, np.character):
                raise CatBoostError('predicted classes have numeric type but specified y '
                                    'contains strings')
        elif np.issubdtype(y.dtype, np.number):
            raise CatBoostError('predicted classes have string type but specified y is numeric')
        elif np.issubdtype(y.dtype, np.bool_):
            raise CatBoostError('predicted classes have string type but specified y is boolean')
        return np.mean(np.array(predicted_classes) == np.array(y))

    def _check_is_classification_objective(self, loss_function):
        if isinstance(loss_function, str) and not self._is_classification_objective(loss_function):
            raise CatBoostError(
                "Invalid loss_function='{}': for classifier use "
                "Logloss, CrossEntropy, MultiClass, MultiClassOneVsAll "
                "or custom objective object".format(loss_function))
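To make the API above concrete, here is a small usage sketch of the main CatBoostClassifier entry points documented in this listing: fit with cat_features and an eval set, predict, predict_proba, and score. The toy data and every parameter value are illustrative assumptions, not taken from the article.

# Usage sketch for the CatBoostClassifier API above (toy data, illustrative values).
from catboost import CatBoostClassifier, Pool

X_train = [[1, 'male'], [2, 'female'], [3, 'female'], [1, 'male']]
y_train = [0, 1, 1, 0]
X_eval = [[2, 'male'], [1, 'female']]
y_eval = [0, 1]

# A Pool bundles features, labels, and categorical feature indices.
train_pool = Pool(X_train, y_train, cat_features=[1])
eval_pool = Pool(X_eval, y_eval, cat_features=[1])

model = CatBoostClassifier(
    iterations=200,                # max tree count
    depth=4,                       # all trees are the same depth
    loss_function='Logloss',       # binary classification objective
    early_stopping_rounds=20,      # synonym for od_wait; activates the Iter detector
)
model.fit(train_pool, eval_set=eval_pool, use_best_model=True, verbose=50)

print(model.predict(X_eval))        # class labels (prediction_type='Class')
print(model.predict_proba(X_eval))  # (n_objects x n_classes) probabilities
print(model.score(eval_pool))       # accuracy; labels are read from the Pool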