Contents
Binary classification on the Titanic dataset with the CatBoost algorithm
Related articles
ML - CatBoost: a detailed guide to the CatBoost algorithm - introduction, installation, and worked examples
ML - CatboostC: binary classification on the Titanic dataset with the CatBoost algorithm
ML - CatboostC: binary classification on the Titanic dataset with the CatBoost algorithm (implementation)
Binary classification on the Titanic dataset with the CatBoost algorithm
Design approach
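(Outline reconstructed from the output below.) The pipeline has four steps: read the Titanic training data and keep the Pclass, Sex, Age, SibSp, Parch, and Survived columns; detect the object-typed (categorical) feature, which is Sex at column index 1 (hence "object_features_ID: [1]" in the log); split the data into train and test sets; and train a CatBoostClassifier for 100 iterations with the test set as eval_set and use_best_model enabled, so the final model is shrunk to the best iteration (iteration 37, i.e. the first 38 trees). A code sketch of this pipeline follows the output below.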
Output
   Pclass     Sex   Age  SibSp  Parch  Survived
0       3    male  22.0      1      0         0
1       1  female  38.0      1      0         1
2       3  female  26.0      0      0         1
3       1  female  35.0      1      0         1
4       3    male  35.0      0      0         0
Pclass        int64
Sex          object
Age         float64
SibSp         int64
Parch         int64
Survived      int64
dtype: object
object_features_ID: [1]
0: learn: 0.5469469 test: 0.5358272 best: 0.5358272 (0) total: 98.1ms remaining: 9.71s
1: learn: 0.4884967 test: 0.4770551 best: 0.4770551 (1) total: 98.7ms remaining: 4.84s
2: learn: 0.4459496 test: 0.4453159 best: 0.4453159 (2) total: 99.3ms remaining: 3.21s
3: learn: 0.4331858 test: 0.4352757 best: 0.4352757 (3) total: 99.8ms remaining: 2.4s
4: learn: 0.4197131 test: 0.4266055 best: 0.4266055 (4) total: 100ms remaining: 1.91s
5: learn: 0.4085381 test: 0.4224953 best: 0.4224953 (5) total: 101ms remaining: 1.58s
6: learn: 0.4063807 test: 0.4209804 best: 0.4209804 (6) total: 102ms remaining: 1.35s
7: learn: 0.4007713 test: 0.4155077 best: 0.4155077 (7) total: 102ms remaining: 1.17s
8: learn: 0.3971064 test: 0.4135872 best: 0.4135872 (8) total: 103ms remaining: 1.04s
9: learn: 0.3943774 test: 0.4105674 best: 0.4105674 (9) total: 103ms remaining: 928ms
10: learn: 0.3930801 test: 0.4099915 best: 0.4099915 (10) total: 104ms remaining: 839ms
11: learn: 0.3904409 test: 0.4089840 best: 0.4089840 (11) total: 104ms remaining: 764ms
12: learn: 0.3890830 test: 0.4091666 best: 0.4089840 (11) total: 105ms remaining: 701ms
13: learn: 0.3851196 test: 0.4108839 best: 0.4089840 (11) total: 105ms remaining: 647ms
14: learn: 0.3833366 test: 0.4106298 best: 0.4089840 (11) total: 106ms remaining: 600ms
15: learn: 0.3792283 test: 0.4126097 best: 0.4089840 (11) total: 106ms remaining: 558ms
16: learn: 0.3765680 test: 0.4114997 best: 0.4089840 (11) total: 107ms remaining: 522ms
17: learn: 0.3760966 test: 0.4112166 best: 0.4089840 (11) total: 107ms remaining: 489ms
18: learn: 0.3736951 test: 0.4122305 best: 0.4089840 (11) total: 108ms remaining: 461ms
19: learn: 0.3719966 test: 0.4101199 best: 0.4089840 (11) total: 109ms remaining: 435ms
20: learn: 0.3711460 test: 0.4097299 best: 0.4089840 (11) total: 109ms remaining: 411ms
21: learn: 0.3707144 test: 0.4093512 best: 0.4089840 (11) total: 110ms remaining: 389ms
22: learn: 0.3699238 test: 0.4083409 best: 0.4083409 (22) total: 110ms remaining: 370ms
23: learn: 0.3670864 test: 0.4071850 best: 0.4071850 (23) total: 111ms remaining: 351ms
24: learn: 0.3635514 test: 0.4038399 best: 0.4038399 (24) total: 111ms remaining: 334ms
25: learn: 0.3627657 test: 0.4025837 best: 0.4025837 (25) total: 112ms remaining: 319ms
26: learn: 0.3621028 test: 0.4018449 best: 0.4018449 (26) total: 113ms remaining: 304ms
27: learn: 0.3616121 test: 0.4011693 best: 0.4011693 (27) total: 113ms remaining: 291ms
28: learn: 0.3614262 test: 0.4011820 best: 0.4011693 (27) total: 114ms remaining: 278ms
29: learn: 0.3610673 test: 0.4005475 best: 0.4005475 (29) total: 114ms remaining: 267ms
30: learn: 0.3588062 test: 0.4002801 best: 0.4002801 (30) total: 115ms remaining: 256ms
31: learn: 0.3583703 test: 0.3997255 best: 0.3997255 (31) total: 116ms remaining: 246ms
32: learn: 0.3580553 test: 0.4001878 best: 0.3997255 (31) total: 116ms remaining: 236ms
33: learn: 0.3556808 test: 0.4004169 best: 0.3997255 (31) total: 118ms remaining: 228ms
34: learn: 0.3536833 test: 0.4003229 best: 0.3997255 (31) total: 119ms remaining: 220ms
35: learn: 0.3519948 test: 0.4008047 best: 0.3997255 (31) total: 119ms remaining: 212ms
36: learn: 0.3515452 test: 0.4000576 best: 0.3997255 (31) total: 120ms remaining: 204ms
37: learn: 0.3512962 test: 0.3997214 best: 0.3997214 (37) total: 120ms remaining: 196ms
38: learn: 0.3507648 test: 0.4001569 best: 0.3997214 (37) total: 121ms remaining: 189ms
39: learn: 0.3489575 test: 0.4009203 best: 0.3997214 (37) total: 121ms remaining: 182ms
40: learn: 0.3480966 test: 0.4014031 best: 0.3997214 (37) total: 122ms remaining: 175ms
41: learn: 0.3477613 test: 0.4009293 best: 0.3997214 (37) total: 122ms remaining: 169ms
42: learn: 0.3472945 test: 0.4006602 best: 0.3997214 (37) total: 123ms remaining: 163ms
43: learn: 0.3465271 test: 0.4007531 best: 0.3997214 (37) total: 124ms remaining: 157ms
44: learn: 0.3461538 test: 0.4010608 best: 0.3997214 (37) total: 124ms remaining: 152ms
45: learn: 0.3455060 test: 0.4012489 best: 0.3997214 (37) total: 125ms remaining: 146ms
46: learn: 0.3449922 test: 0.4013439 best: 0.3997214 (37) total: 125ms remaining: 141ms
47: learn: 0.3445333 test: 0.4010754 best: 0.3997214 (37) total: 126ms remaining: 136ms
48: learn: 0.3443186 test: 0.4011180 best: 0.3997214 (37) total: 126ms remaining: 132ms
49: learn: 0.3424633 test: 0.4016071 best: 0.3997214 (37) total: 127ms remaining: 127ms
50: learn: 0.3421565 test: 0.4013135 best: 0.3997214 (37) total: 128ms remaining: 123ms
51: learn: 0.3417523 test: 0.4009993 best: 0.3997214 (37) total: 128ms remaining: 118ms
52: learn: 0.3415669 test: 0.4009101 best: 0.3997214 (37) total: 129ms remaining: 114ms
53: learn: 0.3413867 test: 0.4010833 best: 0.3997214 (37) total: 130ms remaining: 110ms
54: learn: 0.3405166 test: 0.4014830 best: 0.3997214 (37) total: 130ms remaining: 107ms
55: learn: 0.3401535 test: 0.4015556 best: 0.3997214 (37) total: 131ms remaining: 103ms
56: learn: 0.3395217 test: 0.4021097 best: 0.3997214 (37) total: 132ms remaining: 99.4ms
57: learn: 0.3393024 test: 0.4023377 best: 0.3997214 (37) total: 132ms remaining: 95.8ms
58: learn: 0.3389909 test: 0.4019616 best: 0.3997214 (37) total: 133ms remaining: 92.3ms
59: learn: 0.3388494 test: 0.4019746 best: 0.3997214 (37) total: 133ms remaining: 88.9ms
60: learn: 0.3384901 test: 0.4017470 best: 0.3997214 (37) total: 134ms remaining: 85.6ms
61: learn: 0.3382250 test: 0.4018783 best: 0.3997214 (37) total: 134ms remaining: 82.4ms
62: learn: 0.3345761 test: 0.4039633 best: 0.3997214 (37) total: 135ms remaining: 79.3ms
63: learn: 0.3317548 test: 0.4050218 best: 0.3997214 (37) total: 136ms remaining: 76.3ms
64: learn: 0.3306501 test: 0.4036656 best: 0.3997214 (37) total: 136ms remaining: 73.3ms
65: learn: 0.3292310 test: 0.4034339 best: 0.3997214 (37) total: 137ms remaining: 70.5ms
66: learn: 0.3283600 test: 0.4033661 best: 0.3997214 (37) total: 137ms remaining: 67.6ms
67: learn: 0.3282389 test: 0.4034237 best: 0.3997214 (37) total: 138ms remaining: 64.9ms
68: learn: 0.3274603 test: 0.4039310 best: 0.3997214 (37) total: 138ms remaining: 62.2ms
69: learn: 0.3273430 test: 0.4041663 best: 0.3997214 (37) total: 139ms remaining: 59.6ms
70: learn: 0.3271585 test: 0.4044144 best: 0.3997214 (37) total: 140ms remaining: 57.1ms
71: learn: 0.3268457 test: 0.4046981 best: 0.3997214 (37) total: 140ms remaining: 54.6ms
72: learn: 0.3266497 test: 0.4042724 best: 0.3997214 (37) total: 141ms remaining: 52.1ms
73: learn: 0.3259684 test: 0.4048797 best: 0.3997214 (37) total: 141ms remaining: 49.7ms
74: learn: 0.3257845 test: 0.4044766 best: 0.3997214 (37) total: 142ms remaining: 47.3ms
75: learn: 0.3256157 test: 0.4047031 best: 0.3997214 (37) total: 143ms remaining: 45.1ms
76: learn: 0.3251433 test: 0.4043698 best: 0.3997214 (37) total: 144ms remaining: 42.9ms
77: learn: 0.3247743 test: 0.4041652 best: 0.3997214 (37) total: 144ms remaining: 40.6ms
78: learn: 0.3224876 test: 0.4058880 best: 0.3997214 (37) total: 145ms remaining: 38.5ms
79: learn: 0.3223339 test: 0.4058139 best: 0.3997214 (37) total: 145ms remaining: 36.3ms
80: learn: 0.3211858 test: 0.4060056 best: 0.3997214 (37) total: 146ms remaining: 34.2ms
81: learn: 0.3200423 test: 0.4067103 best: 0.3997214 (37) total: 147ms remaining: 32.2ms
82: learn: 0.3198329 test: 0.4069039 best: 0.3997214 (37) total: 147ms remaining: 30.1ms
83: learn: 0.3196561 test: 0.4067853 best: 0.3997214 (37) total: 148ms remaining: 28.1ms
84: learn: 0.3193160 test: 0.4072288 best: 0.3997214 (37) total: 148ms remaining: 26.1ms
85: learn: 0.3184463 test: 0.4077451 best: 0.3997214 (37) total: 149ms remaining: 24.2ms
86: learn: 0.3175777 test: 0.4086243 best: 0.3997214 (37) total: 149ms remaining: 22.3ms
87: learn: 0.3173824 test: 0.4082013 best: 0.3997214 (37) total: 150ms remaining: 20.4ms
88: learn: 0.3172840 test: 0.4083946 best: 0.3997214 (37) total: 150ms remaining: 18.6ms
89: learn: 0.3166252 test: 0.4086761 best: 0.3997214 (37) total: 151ms remaining: 16.8ms
90: learn: 0.3164144 test: 0.4083237 best: 0.3997214 (37) total: 151ms remaining: 15ms
91: learn: 0.3162137 test: 0.4083699 best: 0.3997214 (37) total: 152ms remaining: 13.2ms
92: learn: 0.3155611 test: 0.4091627 best: 0.3997214 (37) total: 152ms remaining: 11.5ms
93: learn: 0.3153976 test: 0.4089484 best: 0.3997214 (37) total: 153ms remaining: 9.76ms
94: learn: 0.3139281 test: 0.4116939 best: 0.3997214 (37) total: 154ms remaining: 8.08ms
95: learn: 0.3128878 test: 0.4146652 best: 0.3997214 (37) total: 154ms remaining: 6.42ms
96: learn: 0.3127863 test: 0.4145767 best: 0.3997214 (37) total: 155ms remaining: 4.78ms
97: learn: 0.3126696 test: 0.4142118 best: 0.3997214 (37) total: 155ms remaining: 3.17ms
98: learn: 0.3120048 test: 0.4140831 best: 0.3997214 (37) total: 156ms remaining: 1.57ms
99: learn: 0.3117563 test: 0.4138267 best: 0.3997214 (37) total: 156ms remaining: 0us

bestTest = 0.3997213503
bestIteration = 37

Shrink model to first 38 iterations.
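The log above can be reproduced with a short script along the following lines. This is a minimal sketch, not the article's verbatim code: the CSV path titanic_train.csv, the train/test split ratio, and random_state are assumptions; the column list, the categorical feature index, and the 100-iteration / use_best_model setup are taken from the output.

# Minimal sketch of the pipeline behind the log above.
# Assumptions: file path, split ratio, and random_state are illustrative.
import pandas as pd
from sklearn.model_selection import train_test_split
from catboost import CatBoostClassifier

cols = ['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Survived']
data = pd.read_csv('titanic_train.csv')[cols]   # hypothetical path
print(data.head())
print(data.dtypes)

X, y = data.drop('Survived', axis=1), data['Survived']

# 'Sex' is the only object-typed column -> categorical feature index 1,
# matching "object_features_ID: [1]" in the log. Missing Age values are
# handled natively by CatBoost (nan_mode defaults to 'Min').
object_features_ID = [i for i, d in enumerate(X.dtypes) if d == 'object']
print('object_features_ID:', object_features_ID)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = CatBoostClassifier(iterations=100, use_best_model=True)  # 100 trees, as in the log
model.fit(X_train, y_train,
          cat_features=object_features_ID,
          eval_set=(X_test, y_test))  # shrinks to the best iteration (37 -> first 38 trees)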
Core code
# class CatBoostClassifier, found at: catboost.core

class CatBoostClassifier(CatBoost):
    _estimator_type = 'classifier'

    """
    Implementation of the scikit-learn API for CatBoost classification.

    Parameters
    ----------
    iterations : int, [default=500]
        Max count of trees.
        range: [1,+inf]
    learning_rate : float, [default value is selected automatically for binary classification
        with other parameters set to default; in all other cases the default is 0.03]
        Step size shrinkage used in update to prevent overfitting.
        range: (0,1]
    depth : int, [default=6]
        Depth of a tree. All trees are the same depth.
        range: [1,+inf]
    l2_leaf_reg : float, [default=3.0]
        Coefficient at the L2 regularization term of the cost function.
        range: [0,+inf]
    model_size_reg : float, [default=None]
        Model size regularization coefficient.
        range: [0,+inf]
    rsm : float, [default=None]
        Subsample ratio of columns when constructing each tree.
        range: (0,1]
    loss_function : string or object, [default='Logloss']
        The metric to use in training; also selects the machine learning problem to solve.
        If string, the name of a supported metric, optionally suffixed with a
        parameter description.
        If object, it shall provide methods 'calc_ders_range' or 'calc_ders_multi'.
    border_count : int, [default=254 for training on CPU or 128 for training on GPU]
        The number of partitions in numeric features binarization. Used in the
        preliminary calculation.
        range: [1,65535] on CPU, [1,255] on GPU
    feature_border_type : string, [default='GreedyLogSum']
        The binarization mode in numeric features binarization. Used in the
        preliminary calculation.
        Possible values:
        - 'Median'
        - 'Uniform'
        - 'UniformAndQuantiles'
        - 'GreedyLogSum'
        - 'MaxLogSum'
        - 'MinEntropy'
    per_float_feature_quantization : list of strings, [default=None]
        List of float binarization descriptions.
        Format : described in documentation on catboost.ai
        Example 1: ['0:1024'] means that feature 0 will have 1024 borders.
        Example 2: ['0:border_count=1024', '1:border_count=1024', ...] means that the
        first two features have 1024 borders.
        Example 3: ['0:nan_mode=Forbidden,border_count=32,border_type=GreedyLogSum',
        '1:nan_mode=Forbidden,border_count=32,border_type=GreedyLogSum'] defines more
        quantization properties for the first two features.
    input_borders : string, [default=None]
        Input file with borders used in numeric features binarization.
    output_borders : string, [default=None]
        Output file for borders that were used in numeric features binarization.
    fold_permutation_block : int, [default=1]
        To accelerate the learning.
        The recommended value is within [1, 256]. On small samples, must be set to 1.
        range: [1,+inf]
    od_pval : float, [default=None]
        Use overfitting detector to stop training when reaching a specified threshold.
        Can be used only with eval_set.
        range: [0,1]
    od_wait : int, [default=None]
        Number of iterations which overfitting detector will wait after new best error.
    od_type : string, [default=None]
        Type of overfitting detector which will be used in program.
        Possible values:
        - 'IncToDec'
        - 'Iter'
        For 'Iter' type od_pval must not be set.
        If None, then od_type=IncToDec.
    nan_mode : string, [default=None]
        Way to process missing values for numeric features.
        Possible values:
        - 'Forbidden' - raises an exception if there is a missing value for a numeric
          feature in a dataset.
        - 'Min' - each missing value will be processed as the minimum numerical value.
        - 'Max' - each missing value will be processed as the maximum numerical value.
        If None, then nan_mode=Min.
    counter_calc_method : string, [default=None]
        The method used to calculate counters for a dataset with Counter type.
        Possible values:
        - 'PrefixTest' - only objects up to current in the test dataset are considered
        - 'FullTest' - all objects are considered in the test dataset
        - 'SkipTest' - objects from the test dataset are not considered
        - 'Full' - all objects are considered for both learn and test dataset
        If None, then counter_calc_method=PrefixTest.
    leaf_estimation_iterations : int, [default=None]
        The number of steps in the gradient when calculating the values in the leaves.
        If None, then leaf_estimation_iterations=1.
        range: [1,+inf]
    leaf_estimation_method : string, [default=None]
        The method used to calculate the values in the leaves.
        Possible values:
        - 'Newton'
        - 'Gradient'
    thread_count : int, [default=None]
        Number of parallel threads used to run CatBoost.
        If None or -1, then the number of threads is set to the number of CPU cores.
        range: [1,+inf]
    random_seed : int, [default=None]
        Random number seed.
        If None, 0 is used.
        range: [0,+inf]
    use_best_model : bool, [default=None]
        To limit the number of trees in predict() using information about the optimal
        value of the error function. Can be used only with eval_set.
    best_model_min_trees : int, [default=None]
        The minimal number of trees the best model should have.
    verbose : bool
        When set to True, logging_level is set to 'Verbose'.
        When set to False, logging_level is set to 'Silent'.
    silent : bool, synonym for verbose
    logging_level : string, [default='Verbose']
        Possible values:
        - 'Silent'
        - 'Verbose'
        - 'Info'
        - 'Debug'
    metric_period : int, [default=1]
        The frequency of iterations to print the information to stdout.
        The value should be a positive integer.
    simple_ctr : list of strings, [default=None]
        Binarization settings for categorical features.
        Format : see documentation
        Example: ['Borders:CtrBorderCount=5:Prior=0:Prior=0.5',
        'BinarizedTargetMeanValue:TargetBorderCount=10:TargetBorderType=MinEntropy', ...]
        CTR types:
        CPU and GPU
        - 'Borders'
        - 'Buckets'
        CPU only
        - 'BinarizedTargetMeanValue'
        - 'Counter'
        GPU only
        - 'FloatTargetMeanValue'
        - 'FeatureFreq'
        Number of borders, binarization type, target borders and binarizations, and
        priors are optional parameters.
    combinations_ctr : list of strings, [default=None]
    per_feature_ctr : list of strings, [default=None]
    ctr_target_border_count : int, [default=None]
        Maximum number of borders used in target binarization for categorical features
        that need it. If TargetBorderCount is specified in the 'simple_ctr',
        'combinations_ctr' or 'per_feature_ctr' option, it overrides this value.
        range: [1, 255]
    ctr_leaf_count_limit : int, [default=None]
        The maximum number of leaves with categorical features.
        If the number of leaves exceeds the specified limit, some leaves are discarded.
        The leaves to be discarded are selected as follows:
        - The leaves are sorted by the frequency of the values.
        - The top N leaves are selected, where N is the value specified in the parameter.
        - All leaves starting from N+1 are discarded.
        This option reduces the resulting model size and the amount of memory required
        for training. Note that the resulting quality of the model can be affected.
        range: [1,+inf] (for zero limit use ignored_features)
    store_all_simple_ctr : bool, [default=None]
        Ignore categorical features, which are not used in feature combinations,
        when choosing candidates for exclusion.
        Use this parameter with ctr_leaf_count_limit only.
    max_ctr_complexity : int, [default=4]
        The maximum number of Categ features that can be combined.
        range: [0,+inf]
    has_time : bool, [default=False]
        To use the order in which objects are represented in the input data
        (do not perform a random permutation of the dataset at the preprocessing stage).
    allow_const_label : bool, [default=False]
        To allow a constant label value in the dataset.
    target_border : float, [default=None]
        Border for target binarization.
    classes_count : int, [default=None]
        The upper limit for the numeric class label.
        Defines the number of classes for multiclassification.
        Only non-negative integers can be specified.
        The given integer should be greater than any of the target values.
        If this parameter is specified, the labels for all classes in the input dataset
        should be smaller than the given value.
        If several of the 'classes_count', 'class_weights', 'class_names' parameters
        are defined, the numbers of classes specified by each of them must be equal.
    class_weights : list or dict, [default=None]
        Classes weights. The values are used as multipliers for the object weights.
        If None, all classes are supposed to have weight one.
        If list - class weights in order of class_names, or sequential classes if
        class_names is undefined.
        If dict - dict of class_name -> class_weight.
        If several of the 'classes_count', 'class_weights', 'class_names' parameters
        are defined, the numbers of classes specified by each of them must be equal.
    auto_class_weights : string, [default=None]
        Enables automatic class weights calculation. Possible values:
        - Balanced  # weight = maxSummaryClassWeight / summaryClassWeight,
          statistics determined from train pool
        - SqrtBalanced  # weight = sqrt(maxSummaryClassWeight / summaryClassWeight)
    class_names : list of strings, [default=None]
        Class names. Allows to redefine the default values for class labels
        (integer numbers).
        If several of the 'classes_count', 'class_weights', 'class_names' parameters
        are defined, the numbers of classes specified by each of them must be equal.
    one_hot_max_size : int, [default=None]
        Convert the feature to float if the number of different values that it takes
        exceeds the specified value. Ctrs are not calculated for such features.
    random_strength : float, [default=1]
        Score standard deviation multiplier.
    name : string, [default='experiment']
        The name that should be displayed in the visualization tools.
    ignored_features : list, [default=None]
        Indices or names of features that should be excluded when training.
    train_dir : string, [default=None]
        The directory in which to record the files generated in the process of learning.
    custom_metric : string or list of strings, [default=None]
        To use your own metric function.
    custom_loss : alias to custom_metric
    eval_metric : string or object, [default=None]
        To optimize your custom metric in loss.
    bagging_temperature : float, [default=None]
        Controls intensity of Bayesian bagging. The higher the temperature, the more
        aggressive the bagging is.
        Typical values are in range [0, 1] (0 - no bagging, 1 - default).
    save_snapshot : bool, [default=None]
        Enable progress snapshotting for restoring progress after crashes or interruptions.
    snapshot_file : string, [default=None]
        Learn progress snapshot file path; if None, the default filename is used.
    snapshot_interval : int, [default=600]
        Interval between saving snapshots (seconds).
    fold_len_multiplier : float, [default=None]
        Fold length multiplier. Should be greater than 1.
    used_ram_limit : string or number, [default=None]
        Set a limit on memory consumption (value like '1.2gb' or 1.2e9).
        WARNING: Currently this option affects CTR memory usage only.
    gpu_ram_part : float, [default=0.95]
        Fraction of the GPU RAM to use for training, a value from (0, 1].
    pinned_memory_size : int, [default=None]
        Size of additional CPU pinned memory used for GPU learning; usually estimated
        automatically, thus usually should not be set.
    allow_writing_files : bool, [default=True]
        If this flag is set to False, no files with diagnostic info will be created
        during training. With this flag no snapshotting can be done, and visualisation
        will not work, because visualisation uses files that are created and updated
        during training.
    final_ctr_computation_mode : string, [default='Default']
        Possible values:
        - 'Default' - Compute final ctrs for all pools.
        - 'Skip' - Skip final ctr computation. WARNING: a model without ctrs can't be applied.
    approx_on_full_history : bool, [default=False]
        If this flag is set to True, each approximated value is calculated using all the
        preceding rows in the fold (slower, more accurate).
        If this flag is set to False, each approximated value is calculated using only
        the beginning 1/fold_len_multiplier fraction of the fold (faster, slightly
        less accurate).
    boosting_type : string, default value depends on object count and feature count
        in the train dataset and on the learning mode.
        Boosting scheme.
        Possible values:
        - 'Ordered' - Gives better quality, but may slow down the training.
        - 'Plain' - The classic gradient boosting scheme. May result in quality
          degradation, but does not slow down the training.
    task_type : string, [default=None]
        The calcer type used to train the model.
        Possible values:
        - 'CPU'
        - 'GPU'
    device_config : string, [default=None], deprecated, use devices instead
    devices : list or string, [default=None], GPU devices to use.
        String format is: '0' for 1 device, or '0:1:3' for multiple devices, or '0-3'
        for a range of devices.
        List format is: [0] for 1 device, or [0,1,3] for multiple devices.
    bootstrap_type : string, Bayesian, Bernoulli, Poisson, MVS.
        Default bootstrap is Bayesian for GPU and MVS for CPU.
        Poisson bootstrap is supported only on GPU.
        MVS bootstrap is supported only on CPU.
    subsample : float, [default=None]
        Sample rate for bagging. This parameter can be used with the Poisson or
        Bernoulli bootstrap types.
    mvs_reg : float, [default is set automatically at each iteration based on
        gradient distribution]
        Regularization parameter for the MVS sampling algorithm.
    monotone_constraints : list or numpy.ndarray or string or dict, [default=None]
        Monotone constraints for features.
    feature_weights : list or numpy.ndarray or string or dict, [default=None]
        Coefficient to multiply split gain with specific feature use.
        Should be non-negative.
    penalties_coefficient : float, [default=1]
        Common coefficient for all penalties. Should be non-negative.
    first_feature_use_penalties : list or numpy.ndarray or string or dict, [default=None]
        Penalties for the first use of a specific feature in the model. Should be
        non-negative.
    per_object_feature_penalties : list or numpy.ndarray or string or dict, [default=None]
        Penalties for the first use of a feature for each object. Should be non-negative.
    sampling_frequency : string, [default=PerTree]
        Frequency to sample weights and objects when building trees.
        Possible values:
        - 'PerTree' - Before constructing each new tree
        - 'PerTreeLevel' - Before choosing each new split of a tree
    sampling_unit : string, [default='Object']
        Possible values:
        - 'Object'
        - 'Group'
        The parameter allows to specify the sampling scheme: sample weights for each
        object individually or for an entire group of objects together.
    dev_score_calc_obj_block_size : int, [default=5000000]
        CPU only. Size of a block of samples in score calculation. Should be > 0.
        Used only for learning speed tuning.
        Changing this parameter can affect results due to numerical accuracy differences.
    dev_efb_max_buckets : int, [default=1024]
        CPU only. Maximum bucket count in an exclusive features bundle.
        Should be an integer between 0 and 65536.
        Used only for learning speed tuning.
    sparse_features_conflict_fraction : float, [default=0.0]
        CPU only. Maximum allowed fraction of conflicting non-default values for
        features in an exclusive features bundle.
        Should be a real value in the [0, 1) interval.
    grow_policy : string, [SymmetricTree,Lossguide,Depthwise], [default=SymmetricTree]
        The tree growing policy. It describes how to perform greedy tree construction.
    min_data_in_leaf : int, [default=1]
        The minimum training samples count in a leaf.
        CatBoost will not search for new splits in leaves with a samples count less
        than min_data_in_leaf.
        This parameter is used only for the Depthwise and Lossguide growing policies.
    max_leaves : int, [default=31]
        The maximum leaf count in the resulting tree.
        This parameter is used only for the Lossguide growing policy.
    score_function : string, possible values L2, Cosine, NewtonL2, NewtonCosine,
        [default=Cosine]
        For growing policy Lossguide default=NewtonL2.
        GPU only. Score that is used during tree construction to select the next
        tree split.
    max_depth : int, synonym for depth.
    n_estimators : int, synonym for iterations.
    num_trees : int, synonym for iterations.
    num_boost_round : int, synonym for iterations.
    colsample_bylevel : float, synonym for rsm.
    random_state : int, synonym for random_seed.
    reg_lambda : float, synonym for l2_leaf_reg.
    objective : string, synonym for loss_function.
    num_leaves : int, synonym for max_leaves.
    min_child_samples : int, synonym for min_data_in_leaf.
    eta : float, synonym for learning_rate.
    max_bin : float, synonym for border_count.
    scale_pos_weight : float, synonym for class_weights.
        Can be used only for binary classification. Sets the weight multiplier for
        class 1 to the scale_pos_weight value.
    metadata : dict, string to string key-value pairs to be stored in the model
        metadata storage.
    early_stopping_rounds : int
        Synonym for od_wait. Only one of these parameters should be set.
    cat_features : list or numpy.ndarray, [default=None]
        If not None, gives the list of Categ features indices or names
        (names are represented as strings).
        If it contains feature names, feature names must be defined for the training
        dataset passed to 'fit'.
    text_features : list or numpy.ndarray, [default=None]
        If not None, gives the list of Text features indices or names
        (names are represented as strings).
        If it contains feature names, feature names must be defined for the training
        dataset passed to 'fit'.
    embedding_features : list or numpy.ndarray, [default=None]
        If not None, gives the list of Embedding features indices or names
        (names are represented as strings).
        If it contains feature names, feature names must be defined for the training
        dataset passed to 'fit'.
    leaf_estimation_backtracking : string, [default=None]
        Type of backtracking during gradient descent.
        Possible values:
        - 'No' - never backtrack; supported on CPU and GPU
        - 'AnyImprovement' - reduce the descent step until the value of the loss
          function is less than before the step; supported on CPU and GPU
        - 'Armijo' - reduce the descent step until the Armijo condition is satisfied;
          supported on GPU only
    model_shrink_rate : float, [default=0]
        This parameter enables shrinkage of the model at the start of each iteration. CPU only.
        For Constant mode the shrinkage coefficient is calculated as
        (1 - model_shrink_rate * learning_rate).
        For Decreasing mode the shrinkage coefficient is calculated as
        (1 - model_shrink_rate / iteration).
        The shrinkage coefficient should be in [0, 1).
    model_shrink_mode : string, [default=None]
        Mode of shrinkage coefficient calculation. CPU only.
        Possible values:
        - 'Constant' - Shrinkage coefficient is constant at each iteration.
        - 'Decreasing' - Shrinkage coefficient decreases at each iteration.
    langevin : bool, [default=False]
        Enables Stochastic Gradient Langevin Boosting. CPU only.
    diffusion_temperature : float, [default=0]
        Langevin boosting diffusion temperature. CPU only.
    posterior_sampling : bool, [default=False]
        Sets a group of parameters for further use in Uncertainty prediction:
        - Langevin = True
        - Model Shrink Rate = 1/(2N), where N is the dataset size
        - Model Shrink Mode = Constant
        - Diffusion-temperature = N, where N is the dataset size. CPU only.
    boost_from_average : bool, [default=True for RMSE, False for other losses]
        Enables initializing approx values by the best constant value for the
        specified loss function.
        Available for RMSE, Logloss, CrossEntropy, Quantile and MAE.
    tokenizers : list of dicts,
        Each dict is a tokenizer description. Example:
        ```
        [
            {
                'tokenizer_id': 'Tokenizer',  # Tokenizer identifier.
                'lowercasing': 'false',  # Possible values: 'true', 'false'.
                'number_process_policy': 'LeaveAsIs',  # Possible values: 'Skip', 'LeaveAsIs', 'Replace'.
                'number_token': '%',  # Rarely used character. Used in conjunction with Replace NumberProcessPolicy.
                'separator_type': 'ByDelimiter',  # Possible values: 'ByDelimiter', 'BySense'.
                'delimiter': ' ',  # Used in conjunction with ByDelimiter SeparatorType.
                'split_by_set': 'false',  # Each single character in delimiter used as an individual delimiter.
                'skip_empty': 'true',  # Possible values: 'true', 'false'.
                'token_types': ['Word', 'Number', 'Unknown'],  # Used in conjunction with BySense SeparatorType.
                    # Possible values: 'Word', 'Number', 'Punctuation', 'SentenceBreak', 'ParagraphBreak', 'Unknown'.
                'subtokens_policy': 'SingleToken',  # Possible values:
                    # 'SingleToken' - All subtokens are interpreted as a single token.
                    # 'SeveralTokens' - All subtokens are interpreted as several tokens.
            },
            ...
        ]
        ```
    dictionaries : list of dicts,
        Each dict is a dictionary description. Example:
        ```
        [
            {
                'dictionary_id': 'Dictionary',  # Dictionary identifier.
                'token_level_type': 'Word',  # Possible values: 'Word', 'Letter'.
                'gram_order': '1',  # 1 for Unigram, 2 for Bigram, ...
                'skip_step': '0',  # 1 for 1-skip-gram, ...
                'end_of_word_token_policy': 'Insert',  # Possible values: 'Insert', 'Skip'.
                'end_of_sentence_token_policy': 'Skip',  # Possible values: 'Insert', 'Skip'.
                'occurrence_lower_bound': '3',  # The lower bound of token occurrences in the text to include it in the dictionary.
                'max_dictionary_size': '50000',  # The max dictionary size.
            },
            ...
        ]
        ```
    feature_calcers : list of strings,
        Each string is a calcer description. Example:
        ```
        [
            'NaiveBayes',
            'BM25',
            'BoW:top_tokens_count=2000',
        ]
        ```
    text_processing : dict,
        Text processing description.
    """
    def __init__(
        self,
        iterations=None, learning_rate=None, depth=None, l2_leaf_reg=None,
        model_size_reg=None, rsm=None, loss_function=None, border_count=None,
        feature_border_type=None, per_float_feature_quantization=None, input_borders=None,
        output_borders=None, fold_permutation_block=None, od_pval=None, od_wait=None,
        od_type=None, nan_mode=None, counter_calc_method=None, leaf_estimation_iterations=None,
        leaf_estimation_method=None, thread_count=None, random_seed=None, use_best_model=None,
        best_model_min_trees=None, verbose=None, silent=None, logging_level=None,
        metric_period=None, ctr_leaf_count_limit=None, store_all_simple_ctr=None,
        max_ctr_complexity=None, has_time=None, allow_const_label=None, target_border=None,
        classes_count=None, class_weights=None, auto_class_weights=None, class_names=None,
        one_hot_max_size=None, random_strength=None, name=None, ignored_features=None,
        train_dir=None, custom_loss=None, custom_metric=None, eval_metric=None,
        bagging_temperature=None, save_snapshot=None, snapshot_file=None, snapshot_interval=None,
        fold_len_multiplier=None, used_ram_limit=None, gpu_ram_part=None, pinned_memory_size=None,
        allow_writing_files=None, final_ctr_computation_mode=None, approx_on_full_history=None,
        boosting_type=None, simple_ctr=None, combinations_ctr=None, per_feature_ctr=None,
        ctr_description=None, ctr_target_border_count=None, task_type=None, device_config=None,
        devices=None, bootstrap_type=None, subsample=None, mvs_reg=None, sampling_unit=None,
        sampling_frequency=None, dev_score_calc_obj_block_size=None, dev_efb_max_buckets=None,
        sparse_features_conflict_fraction=None, max_depth=None, n_estimators=None,
        num_boost_round=None, num_trees=None, colsample_bylevel=None, random_state=None,
        reg_lambda=None, objective=None, eta=None, max_bin=None, scale_pos_weight=None,
        gpu_cat_features_storage=None, data_partition=None, metadata=None,
        early_stopping_rounds=None, cat_features=None, grow_policy=None, min_data_in_leaf=None,
        min_child_samples=None, max_leaves=None, num_leaves=None, score_function=None,
        leaf_estimation_backtracking=None, ctr_history_unit=None, monotone_constraints=None,
        feature_weights=None, penalties_coefficient=None, first_feature_use_penalties=None,
        per_object_feature_penalties=None, model_shrink_rate=None, model_shrink_mode=None,
        langevin=None, diffusion_temperature=None, posterior_sampling=None,
        boost_from_average=None, text_features=None, tokenizers=None, dictionaries=None,
        feature_calcers=None, text_processing=None, embedding_features=None
    ):
        params = {}
        not_params = ["not_params", "self", "params", "__class__"]
        for key, value in iteritems(locals().copy()):
            if key not in not_params and value is not None:
                params[key] = value
        super(CatBoostClassifier, self).__init__(params)

    def fit(self, X, y=None, cat_features=None, text_features=None, embedding_features=None,
            sample_weight=None, baseline=None, use_best_model=None, eval_set=None,
            verbose=None, logging_level=None, plot=False, column_description=None,
            verbose_eval=None, metric_period=None, silent=None, early_stopping_rounds=None,
            save_snapshot=None, snapshot_file=None, snapshot_interval=None, init_model=None):
        """
        Fit the CatBoostClassifier model.

        Parameters
        ----------
        X : catboost.Pool or list or numpy.ndarray or pandas.DataFrame or pandas.Series
            If not catboost.Pool, a 2-dimensional feature matrix or a string - a file
            with the dataset.
        y : list or numpy.ndarray or pandas.DataFrame or pandas.Series, optional (default=None)
            Labels, 1-dimensional array like.
            Use only if X is not catboost.Pool.
        cat_features : list or numpy.ndarray, optional (default=None)
            If not None, gives the list of Categ columns indices.
            Use only if X is not catboost.Pool.
        text_features : list or numpy.ndarray, optional (default=None)
            If not None, gives the list of Text columns indices.
            Use only if X is not catboost.Pool.
        embedding_features : list or numpy.ndarray, optional (default=None)
            If not None, gives the list of Embedding columns indices.
            Use only if X is not catboost.Pool.
        sample_weight : list or numpy.ndarray or pandas.DataFrame or pandas.Series,
            optional (default=None)
            Instance weights, 1-dimensional array like.
        baseline : list or numpy.ndarray, optional (default=None)
            If not None, gives 2-dimensional array like data.
            Use only if X is not catboost.Pool.
        use_best_model : bool, optional (default=None)
            Flag to use the best model.
        eval_set : catboost.Pool or list, optional (default=None)
            A list of (X, y) tuple pairs to use as a validation set for early-stopping.
        metric_period : int
            Frequency of evaluating metrics.
        verbose : bool or int
            If verbose is bool, then if set to True, logging_level is set to Verbose;
            if set to False, logging_level is set to Silent.
            If verbose is int, it determines the frequency of writing metrics to output,
            and logging_level is set to Verbose.
        silent : bool
            If silent is True, logging_level is set to Silent.
            If silent is False, logging_level is set to Verbose.
        logging_level : string, optional (default=None)
            Possible values:
            - 'Silent'
            - 'Verbose'
            - 'Info'
            - 'Debug'
        plot : bool, optional (default=False)
            If True, draw train and eval error in a Jupyter notebook.
        verbose_eval : bool or int
            Synonym for verbose. Only one of these parameters should be set.
        early_stopping_rounds : int
            Activates the Iter overfitting detector with od_wait set to
            early_stopping_rounds.
        save_snapshot : bool, [default=None]
            Enable progress snapshotting for restoring progress after crashes or
            interruptions.
        snapshot_file : string, [default=None]
            Learn progress snapshot file path; if None, the default filename is used.
        snapshot_interval : int, [default=600]
            Interval between saving snapshots (seconds).
        init_model : CatBoost class or string, [default=None]
            Continue training starting from the existing model.
            If this parameter is a string, load the initial model from the path
            specified by this string.

        Returns
        -------
        model : CatBoost
        """
        params = self._init_params.copy()
        _process_synonyms(params)
        if 'loss_function' in params:
            self._check_is_classification_objective(params['loss_function'])
        self._fit(X, y, cat_features, text_features, embedding_features, None,
                  sample_weight, None, None, None, None, baseline, use_best_model,
                  eval_set, verbose, logging_level, plot, column_description,
                  verbose_eval, metric_period, silent, early_stopping_rounds,
                  save_snapshot, snapshot_file, snapshot_interval, init_model)
        return self

    def predict(self, data, prediction_type='Class', ntree_start=0, ntree_end=0,
                thread_count=-1, verbose=None):
        """
        Predict with data.

        Parameters
        ----------
        data : catboost.Pool or list of features or list of lists or numpy.ndarray
            or pandas.DataFrame or pandas.Series or catboost.FeaturesData
            Data to apply the model on.
            If data is a simple list (not a list of lists) or a one-dimensional
            numpy.ndarray, it is interpreted as a list of features for a single object.
        prediction_type : string, optional (default='Class')
            Can be:
            - 'RawFormulaVal' : return raw formula value.
            - 'Class' : return class label.
            - 'Probability' : return probability for every class.
            - 'LogProbability' : return log probability for every class.
        ntree_start : int, optional (default=0)
            The model is applied on the interval [ntree_start, ntree_end)
            (zero-based indexing).
        ntree_end : int, optional (default=0)
            The model is applied on the interval [ntree_start, ntree_end)
            (zero-based indexing).
            If the value equals 0, this parameter is ignored and ntree_end is set
            to tree_count_.
        thread_count : int (default=-1)
            The number of threads to use when applying the model.
            Allows you to optimize the speed of execution. This parameter doesn't
            affect results.
            If -1, then the number of threads is set to the number of CPU cores.
        verbose : bool, optional (default=False)
            If True, writes the evaluation metric measured set to stderr.

        Returns
        -------
        prediction :
            If data is for a single object, the return value depends on the
            prediction_type value:
            - 'RawFormulaVal' : return raw formula value.
            - 'Class' : return class label.
            - 'Probability' : return a one-dimensional numpy.ndarray with the
              probability for every class.
            - 'LogProbability' : return a one-dimensional numpy.ndarray with the
              log probability for every class.
            otherwise a numpy.ndarray, with values that depend on the prediction_type value:
            - 'RawFormulaVal' : one-dimensional array of raw formula values for each object.
            - 'Class' : one-dimensional array of class labels for each object.
            - 'Probability' : two-dimensional numpy.ndarray with shape
              (number_of_objects x number_of_classes) with the probability for every
              class for each object.
            - 'LogProbability' : two-dimensional numpy.ndarray with shape
              (number_of_objects x number_of_classes) with the log probability for
              every class for each object.
        """
        return self._predict(data, prediction_type, ntree_start, ntree_end,
                             thread_count, verbose, 'predict')

    def predict_proba(self, data, ntree_start=0, ntree_end=0, thread_count=-1, verbose=None):
        """
        Predict class probability with data.

        Parameters
        ----------
        data : catboost.Pool or list of features or list of lists or numpy.ndarray
            or pandas.DataFrame or pandas.Series or catboost.FeaturesData
            Data to apply the model on.
            If data is a simple list (not a list of lists) or a one-dimensional
            numpy.ndarray, it is interpreted as a list of features for a single object.
        ntree_start : int, optional (default=0)
            The model is applied on the interval [ntree_start, ntree_end)
            (zero-based indexing).
        ntree_end : int, optional (default=0)
            The model is applied on the interval [ntree_start, ntree_end)
            (zero-based indexing).
            If the value equals 0, this parameter is ignored and ntree_end is set
            to tree_count_.
        thread_count : int (default=-1)
            The number of threads to use when applying the model.
            Allows you to optimize the speed of execution. This parameter doesn't
            affect results.
            If -1, then the number of threads is set to the number of CPU cores.
        verbose : bool
            If True, writes the evaluation metric measured set to stderr.

        Returns
        -------
        prediction :
            If data is for a single object,
                return a one-dimensional numpy.ndarray with the probability for every class;
            otherwise,
                return a two-dimensional numpy.ndarray with shape
                (number_of_objects x number_of_classes)
                with the probability for every class for each object.
        """
        return self._predict(data, 'Probability', ntree_start, ntree_end,
                             thread_count, verbose, 'predict_proba')

    def predict_log_proba(self, data, ntree_start=0, ntree_end=0, thread_count=-1, verbose=None):
        """
        Predict class log probability with data.

        Parameters
        ----------
        data : catboost.Pool or list of features or list of lists or numpy.ndarray
            or pandas.DataFrame or pandas.Series or catboost.FeaturesData
            Data to apply the model on.
            If data is a simple list (not a list of lists) or a one-dimensional
            numpy.ndarray, it is interpreted as a list of features for a single object.
        ntree_start : int, optional (default=0)
            The model is applied on the interval [ntree_start, ntree_end)
            (zero-based indexing).
        ntree_end : int, optional (default=0)
            The model is applied on the interval [ntree_start, ntree_end)
            (zero-based indexing).
            If the value equals 0, this parameter is ignored and ntree_end is set
            to tree_count_.
        thread_count : int (default=-1)
            The number of threads to use when applying the model.
            Allows you to optimize the speed of execution. This parameter doesn't
            affect results.
            If -1, then the number of threads is set to the number of CPU cores.
        verbose : bool
            If True, writes the evaluation metric measured set to stderr.

        Returns
        -------
        prediction :
            If data is for a single object,
                return a one-dimensional numpy.ndarray with the log probability for
                every class;
            otherwise,
                return a two-dimensional numpy.ndarray with shape
                (number_of_objects x number_of_classes)
                with the log probability for every class for each object.
        """
        return self._predict(data, 'LogProbability', ntree_start, ntree_end,
                             thread_count, verbose, 'predict_log_proba')

    def staged_predict(self, data, prediction_type='Class', ntree_start=0, ntree_end=0,
                       eval_period=1, thread_count=-1, verbose=None):
        """
        Predict target at each stage for data.

        Parameters
        ----------
        data : catboost.Pool or list of features or list of lists or numpy.ndarray
            or pandas.DataFrame or pandas.Series or catboost.FeaturesData
            Data to apply the model on.
            If data is a simple list (not a list of lists) or a one-dimensional
            numpy.ndarray, it is interpreted as a list of features for a single object.
        prediction_type : string, optional (default='Class')
            Can be:
            - 'RawFormulaVal' : return raw formula value.
            - 'Class' : return class label.
            - 'Probability' : return probability for every class.
            - 'LogProbability' : return log probability for every class.
        ntree_start : int, optional (default=0)
            The model is applied on the interval [ntree_start, ntree_end) with the
            step eval_period (zero-based indexing).
        ntree_end : int, optional (default=0)
            The model is applied on the interval [ntree_start, ntree_end) with the
            step eval_period (zero-based indexing).
            If the value equals 0, this parameter is ignored and ntree_end is set
            to tree_count_.
        eval_period : int, optional (default=1)
            The model is applied on the interval [ntree_start, ntree_end) with the
            step eval_period (zero-based indexing).
        thread_count : int (default=-1)
            The number of threads to use when applying the model.
            Allows you to optimize the speed of execution. This parameter doesn't
            affect results.
            If -1, then the number of threads is set to the number of CPU cores.
        verbose : bool
            If True, writes the evaluation metric measured set to stderr.

        Returns
        -------
        prediction : generator that, for each iteration, generates:
            If data is for a single object, the return value depends on the
            prediction_type value:
            - 'RawFormulaVal' : return raw formula value.
            - 'Class' : return majority vote class.
            - 'Probability' : return a one-dimensional numpy.ndarray with the
              probability for every class.
            - 'LogProbability' : return a one-dimensional numpy.ndarray with the
              log probability for every class.
            otherwise a numpy.ndarray, with values that depend on the prediction_type value:
            - 'RawFormulaVal' : one-dimensional array of raw formula values for each object.
            - 'Class' : one-dimensional array of class labels for each object.
            - 'Probability' : two-dimensional numpy.ndarray with shape
              (number_of_objects x number_of_classes) with the probability for every
              class for each object.
            - 'LogProbability' : two-dimensional numpy.ndarray with shape
              (number_of_objects x number_of_classes) with the log probability for
              every class for each object.
        """
        return self._staged_predict(data, prediction_type, ntree_start, ntree_end,
                                    eval_period, thread_count, verbose, 'staged_predict')

    def staged_predict_proba(self, data, ntree_start=0, ntree_end=0, eval_period=1,
                             thread_count=-1, verbose=None):
        """
        Predict classification target at each stage for data.

        Parameters
        ----------
        data : catboost.Pool or list of features or list of lists or numpy.ndarray
            or pandas.DataFrame or pandas.Series or catboost.FeaturesData
            Data to apply the model on.
            If data is a simple list (not a list of lists) or a one-dimensional
            numpy.ndarray, it is interpreted as a list of features for a single object.
        ntree_start : int, optional (default=0)
            The model is applied on the interval [ntree_start, ntree_end) with the
            step eval_period (zero-based indexing).
        ntree_end : int, optional (default=0)
            The model is applied on the interval [ntree_start, ntree_end) with the
            step eval_period (zero-based indexing).
            If the value equals 0, this parameter is ignored and ntree_end is set
            to tree_count_.
        eval_period : int, optional (default=1)
            The model is applied on the interval [ntree_start, ntree_end) with the
            step eval_period (zero-based indexing).
        thread_count : int (default=-1)
            The number of threads to use when applying the model.
            Allows you to optimize the speed of execution. This parameter doesn't
            affect results.
            If -1, then the number of threads is set to the number of CPU cores.
        verbose : bool
            If True, writes the evaluation metric measured set to stderr.

        Returns
        -------
        prediction : generator that, for each iteration, generates:
            If data is for a single object,
                return a one-dimensional numpy.ndarray with the probability for every class;
            otherwise,
                return a two-dimensional numpy.ndarray with shape
                (number_of_objects x number_of_classes)
                with the probability for every class for each object.
        """
        return self._staged_predict(data, 'Probability', ntree_start, ntree_end,
                                    eval_period, thread_count, verbose, 'staged_predict_proba')

    def staged_predict_log_proba(self, data, ntree_start=0, ntree_end=0, eval_period=1,
                                 thread_count=-1, verbose=None):
        """
        Predict classification target at each stage for data.

        Parameters
        ----------
        data : catboost.Pool or list of features or list of lists or numpy.ndarray
            or pandas.DataFrame or pandas.Series or catboost.FeaturesData
            Data to apply the model on.
            If data is a simple list (not a list of lists) or a one-dimensional
            numpy.ndarray, it is interpreted as a list of features for a single object.
        ntree_start : int, optional (default=0)
            The model is applied on the interval [ntree_start, ntree_end) with the
            step eval_period (zero-based indexing).
        ntree_end : int, optional (default=0)
            The model is applied on the interval [ntree_start, ntree_end) with the
            step eval_period (zero-based indexing).
            If the value equals 0, this parameter is ignored and ntree_end is set
            to tree_count_.
        eval_period : int, optional (default=1)
            The model is applied on the interval [ntree_start, ntree_end) with the
            step eval_period (zero-based indexing).
        thread_count : int (default=-1)
            The number of threads to use when applying the model.
            Allows you to optimize the speed of execution. This parameter doesn't
            affect results.
            If -1, then the number of threads is set to the number of CPU cores.
        verbose : bool
            If True, writes the evaluation metric measured set to stderr.

        Returns
        -------
        prediction : generator that, for each iteration, generates:
            If data is for a single object,
                return a one-dimensional numpy.ndarray with the log probability for
                every class;
            otherwise,
                return a two-dimensional numpy.ndarray with shape
                (number_of_objects x number_of_classes)
                with the log probability for every class for each object.
        """
        return self._staged_predict(data, 'LogProbability', ntree_start, ntree_end,
                                    eval_period, thread_count, verbose,
                                    'staged_predict_log_proba')

    def score(self, X, y=None):
        """
        Calculate accuracy.

        Parameters
        ----------
        X : catboost.Pool or list or numpy.ndarray or pandas.DataFrame or pandas.Series
            Data to apply the model on.
        y : list or numpy.ndarray
            True labels.

        Returns
        -------
        accuracy : float
        """
        if isinstance(X, Pool):
            if y is not None:
                raise CatBoostError("Wrong initializing y: X is catboost.Pool object, "
                                    "y must be initialized inside catboost.Pool.")
            y = X.get_label()
            if y is None:
                raise CatBoostError("Label in X has not initialized.")
        if isinstance(y, DataFrame):
            if len(y.columns) != 1:
                raise CatBoostError("y is DataFrame and has {} columns, but must have "
                                    "exactly one.".format(len(y.columns)))
            y = y[y.columns[0]]
        elif y is None:
            raise CatBoostError("y should be specified.")
        y = np.array(y)
        predicted_classes = self._predict(
            X, prediction_type='Class', ntree_start=0, ntree_end=0,
            thread_count=-1, verbose=None, parent_method_name='score').reshape(-1)
        if np.issubdtype(predicted_classes.dtype, np.number):
            if np.issubdtype(y.dtype, np.character):
                raise CatBoostError('predicted classes have numeric type but specified y '
                                    'contains strings')
        elif np.issubdtype(y.dtype, np.number):
            raise CatBoostError('predicted classes have string type but specified y is numeric')
        elif np.issubdtype(y.dtype, np.bool_):
            raise CatBoostError('predicted classes have string type but specified y is boolean')
        return np.mean(np.array(predicted_classes) == np.array(y))

    def _check_is_classification_objective(self, loss_function):
        if isinstance(loss_function, str) and not self._is_classification_objective(loss_function):
            raise CatBoostError(
                "Invalid loss_function='{}': for classifier use "
                "Logloss, CrossEntropy, MultiClass, MultiClassOneVsAll "
                "or custom objective object".format(loss_function))
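To make the API above concrete, here is a small usage sketch of the main CatBoostClassifier entry points documented in this listing: fit with cat_features and an eval set, predict, predict_proba, and score. The toy data and every parameter value are illustrative assumptions, not taken from the article.

# Usage sketch for the CatBoostClassifier API above (toy data, illustrative values).
from catboost import CatBoostClassifier, Pool

X_train = [[1, 'male'], [2, 'female'], [3, 'female'], [1, 'male']]
y_train = [0, 1, 1, 0]
X_eval = [[2, 'male'], [1, 'female']]
y_eval = [0, 1]

# A Pool bundles features, labels, and categorical feature indices.
train_pool = Pool(X_train, y_train, cat_features=[1])
eval_pool = Pool(X_eval, y_eval, cat_features=[1])

model = CatBoostClassifier(
    iterations=200,                # max tree count
    depth=4,                       # all trees are the same depth
    loss_function='Logloss',       # binary classification objective
    early_stopping_rounds=20,      # synonym for od_wait; activates the Iter detector
)
model.fit(train_pool, eval_set=eval_pool, use_best_model=True, verbose=50)

print(model.predict(X_eval))        # class labels (prediction_type='Class')
print(model.predict_proba(X_eval))  # (n_objects x n_classes) probabilities
print(model.score(eval_pool))       # accuracy; labels are read from the Pool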