
ML之CatboostC: Binary classification on the Titanic dataset with the CatBoost algorithm

Contents

Binary classification on the Titanic dataset with CatBoost

Design approach

Output

Core code


 

 

 


Binary classification on the Titanic dataset with CatBoost

Design approach

 

Output


  
   Pclass     Sex   Age  SibSp  Parch  Survived
0       3    male  22.0      1      0         0
1       1  female  38.0      1      0         1
2       3  female  26.0      0      0         1
3       1  female  35.0      1      0         1
4       3    male  35.0      0      0         0
Pclass        int64
Sex          object
Age         float64
SibSp         int64
Parch         int64
Survived      int64
dtype: object
object_features_ID: [ 1]
0: learn: 0.5469469 test: 0.5358272 best: 0.5358272 (0) total: 98.1ms remaining: 9.71s
1: learn: 0.4884967 test: 0.4770551 best: 0.4770551 (1) total: 98.7ms remaining: 4.84s
2: learn: 0.4459496 test: 0.4453159 best: 0.4453159 (2) total: 99.3ms remaining: 3.21s
3: learn: 0.4331858 test: 0.4352757 best: 0.4352757 (3) total: 99.8ms remaining: 2.4s
4: learn: 0.4197131 test: 0.4266055 best: 0.4266055 (4) total: 100ms remaining: 1.91s
5: learn: 0.4085381 test: 0.4224953 best: 0.4224953 (5) total: 101ms remaining: 1.58s
6: learn: 0.4063807 test: 0.4209804 best: 0.4209804 (6) total: 102ms remaining: 1.35s
7: learn: 0.4007713 test: 0.4155077 best: 0.4155077 (7) total: 102ms remaining: 1.17s
8: learn: 0.3971064 test: 0.4135872 best: 0.4135872 (8) total: 103ms remaining: 1.04s
9: learn: 0.3943774 test: 0.4105674 best: 0.4105674 (9) total: 103ms remaining: 928ms
10: learn: 0.3930801 test: 0.4099915 best: 0.4099915 (10) total: 104ms remaining: 839ms
11: learn: 0.3904409 test: 0.4089840 best: 0.4089840 (11) total: 104ms remaining: 764ms
12: learn: 0.3890830 test: 0.4091666 best: 0.4089840 (11) total: 105ms remaining: 701ms
13: learn: 0.3851196 test: 0.4108839 best: 0.4089840 (11) total: 105ms remaining: 647ms
14: learn: 0.3833366 test: 0.4106298 best: 0.4089840 (11) total: 106ms remaining: 600ms
15: learn: 0.3792283 test: 0.4126097 best: 0.4089840 (11) total: 106ms remaining: 558ms
16: learn: 0.3765680 test: 0.4114997 best: 0.4089840 (11) total: 107ms remaining: 522ms
17: learn: 0.3760966 test: 0.4112166 best: 0.4089840 (11) total: 107ms remaining: 489ms
18: learn: 0.3736951 test: 0.4122305 best: 0.4089840 (11) total: 108ms remaining: 461ms
19: learn: 0.3719966 test: 0.4101199 best: 0.4089840 (11) total: 109ms remaining: 435ms
20: learn: 0.3711460 test: 0.4097299 best: 0.4089840 (11) total: 109ms remaining: 411ms
21: learn: 0.3707144 test: 0.4093512 best: 0.4089840 (11) total: 110ms remaining: 389ms
22: learn: 0.3699238 test: 0.4083409 best: 0.4083409 (22) total: 110ms remaining: 370ms
23: learn: 0.3670864 test: 0.4071850 best: 0.4071850 (23) total: 111ms remaining: 351ms
24: learn: 0.3635514 test: 0.4038399 best: 0.4038399 (24) total: 111ms remaining: 334ms
25: learn: 0.3627657 test: 0.4025837 best: 0.4025837 (25) total: 112ms remaining: 319ms
26: learn: 0.3621028 test: 0.4018449 best: 0.4018449 (26) total: 113ms remaining: 304ms
27: learn: 0.3616121 test: 0.4011693 best: 0.4011693 (27) total: 113ms remaining: 291ms
28: learn: 0.3614262 test: 0.4011820 best: 0.4011693 (27) total: 114ms remaining: 278ms
29: learn: 0.3610673 test: 0.4005475 best: 0.4005475 (29) total: 114ms remaining: 267ms
30: learn: 0.3588062 test: 0.4002801 best: 0.4002801 (30) total: 115ms remaining: 256ms
31: learn: 0.3583703 test: 0.3997255 best: 0.3997255 (31) total: 116ms remaining: 246ms
32: learn: 0.3580553 test: 0.4001878 best: 0.3997255 (31) total: 116ms remaining: 236ms
33: learn: 0.3556808 test: 0.4004169 best: 0.3997255 (31) total: 118ms remaining: 228ms
34: learn: 0.3536833 test: 0.4003229 best: 0.3997255 (31) total: 119ms remaining: 220ms
35: learn: 0.3519948 test: 0.4008047 best: 0.3997255 (31) total: 119ms remaining: 212ms
36: learn: 0.3515452 test: 0.4000576 best: 0.3997255 (31) total: 120ms remaining: 204ms
37: learn: 0.3512962 test: 0.3997214 best: 0.3997214 (37) total: 120ms remaining: 196ms
38: learn: 0.3507648 test: 0.4001569 best: 0.3997214 (37) total: 121ms remaining: 189ms
39: learn: 0.3489575 test: 0.4009203 best: 0.3997214 (37) total: 121ms remaining: 182ms
40: learn: 0.3480966 test: 0.4014031 best: 0.3997214 (37) total: 122ms remaining: 175ms
41: learn: 0.3477613 test: 0.4009293 best: 0.3997214 (37) total: 122ms remaining: 169ms
42: learn: 0.3472945 test: 0.4006602 best: 0.3997214 (37) total: 123ms remaining: 163ms
43: learn: 0.3465271 test: 0.4007531 best: 0.3997214 (37) total: 124ms remaining: 157ms
44: learn: 0.3461538 test: 0.4010608 best: 0.3997214 (37) total: 124ms remaining: 152ms
45: learn: 0.3455060 test: 0.4012489 best: 0.3997214 (37) total: 125ms remaining: 146ms
46: learn: 0.3449922 test: 0.4013439 best: 0.3997214 (37) total: 125ms remaining: 141ms
47: learn: 0.3445333 test: 0.4010754 best: 0.3997214 (37) total: 126ms remaining: 136ms
48: learn: 0.3443186 test: 0.4011180 best: 0.3997214 (37) total: 126ms remaining: 132ms
49: learn: 0.3424633 test: 0.4016071 best: 0.3997214 (37) total: 127ms remaining: 127ms
50: learn: 0.3421565 test: 0.4013135 best: 0.3997214 (37) total: 128ms remaining: 123ms
51: learn: 0.3417523 test: 0.4009993 best: 0.3997214 (37) total: 128ms remaining: 118ms
52: learn: 0.3415669 test: 0.4009101 best: 0.3997214 (37) total: 129ms remaining: 114ms
53: learn: 0.3413867 test: 0.4010833 best: 0.3997214 (37) total: 130ms remaining: 110ms
54: learn: 0.3405166 test: 0.4014830 best: 0.3997214 (37) total: 130ms remaining: 107ms
55: learn: 0.3401535 test: 0.4015556 best: 0.3997214 (37) total: 131ms remaining: 103ms
56: learn: 0.3395217 test: 0.4021097 best: 0.3997214 (37) total: 132ms remaining: 99.4ms
57: learn: 0.3393024 test: 0.4023377 best: 0.3997214 (37) total: 132ms remaining: 95.8ms
58: learn: 0.3389909 test: 0.4019616 best: 0.3997214 (37) total: 133ms remaining: 92.3ms
59: learn: 0.3388494 test: 0.4019746 best: 0.3997214 (37) total: 133ms remaining: 88.9ms
60: learn: 0.3384901 test: 0.4017470 best: 0.3997214 (37) total: 134ms remaining: 85.6ms
61: learn: 0.3382250 test: 0.4018783 best: 0.3997214 (37) total: 134ms remaining: 82.4ms
62: learn: 0.3345761 test: 0.4039633 best: 0.3997214 (37) total: 135ms remaining: 79.3ms
63: learn: 0.3317548 test: 0.4050218 best: 0.3997214 (37) total: 136ms remaining: 76.3ms
64: learn: 0.3306501 test: 0.4036656 best: 0.3997214 (37) total: 136ms remaining: 73.3ms
65: learn: 0.3292310 test: 0.4034339 best: 0.3997214 (37) total: 137ms remaining: 70.5ms
66: learn: 0.3283600 test: 0.4033661 best: 0.3997214 (37) total: 137ms remaining: 67.6ms
67: learn: 0.3282389 test: 0.4034237 best: 0.3997214 (37) total: 138ms remaining: 64.9ms
68: learn: 0.3274603 test: 0.4039310 best: 0.3997214 (37) total: 138ms remaining: 62.2ms
69: learn: 0.3273430 test: 0.4041663 best: 0.3997214 (37) total: 139ms remaining: 59.6ms
70: learn: 0.3271585 test: 0.4044144 best: 0.3997214 (37) total: 140ms remaining: 57.1ms
71: learn: 0.3268457 test: 0.4046981 best: 0.3997214 (37) total: 140ms remaining: 54.6ms
72: learn: 0.3266497 test: 0.4042724 best: 0.3997214 (37) total: 141ms remaining: 52.1ms
73: learn: 0.3259684 test: 0.4048797 best: 0.3997214 (37) total: 141ms remaining: 49.7ms
74: learn: 0.3257845 test: 0.4044766 best: 0.3997214 (37) total: 142ms remaining: 47.3ms
75: learn: 0.3256157 test: 0.4047031 best: 0.3997214 (37) total: 143ms remaining: 45.1ms
76: learn: 0.3251433 test: 0.4043698 best: 0.3997214 (37) total: 144ms remaining: 42.9ms
77: learn: 0.3247743 test: 0.4041652 best: 0.3997214 (37) total: 144ms remaining: 40.6ms
78: learn: 0.3224876 test: 0.4058880 best: 0.3997214 (37) total: 145ms remaining: 38.5ms
79: learn: 0.3223339 test: 0.4058139 best: 0.3997214 (37) total: 145ms remaining: 36.3ms
80: learn: 0.3211858 test: 0.4060056 best: 0.3997214 (37) total: 146ms remaining: 34.2ms
81: learn: 0.3200423 test: 0.4067103 best: 0.3997214 (37) total: 147ms remaining: 32.2ms
82: learn: 0.3198329 test: 0.4069039 best: 0.3997214 (37) total: 147ms remaining: 30.1ms
83: learn: 0.3196561 test: 0.4067853 best: 0.3997214 (37) total: 148ms remaining: 28.1ms
84: learn: 0.3193160 test: 0.4072288 best: 0.3997214 (37) total: 148ms remaining: 26.1ms
85: learn: 0.3184463 test: 0.4077451 best: 0.3997214 (37) total: 149ms remaining: 24.2ms
86: learn: 0.3175777 test: 0.4086243 best: 0.3997214 (37) total: 149ms remaining: 22.3ms
87: learn: 0.3173824 test: 0.4082013 best: 0.3997214 (37) total: 150ms remaining: 20.4ms
88: learn: 0.3172840 test: 0.4083946 best: 0.3997214 (37) total: 150ms remaining: 18.6ms
89: learn: 0.3166252 test: 0.4086761 best: 0.3997214 (37) total: 151ms remaining: 16.8ms
90: learn: 0.3164144 test: 0.4083237 best: 0.3997214 (37) total: 151ms remaining: 15ms
91: learn: 0.3162137 test: 0.4083699 best: 0.3997214 (37) total: 152ms remaining: 13.2ms
92: learn: 0.3155611 test: 0.4091627 best: 0.3997214 (37) total: 152ms remaining: 11.5ms
93: learn: 0.3153976 test: 0.4089484 best: 0.3997214 (37) total: 153ms remaining: 9.76ms
94: learn: 0.3139281 test: 0.4116939 best: 0.3997214 (37) total: 154ms remaining: 8.08ms
95: learn: 0.3128878 test: 0.4146652 best: 0.3997214 (37) total: 154ms remaining: 6.42ms
96: learn: 0.3127863 test: 0.4145767 best: 0.3997214 (37) total: 155ms remaining: 4.78ms
97: learn: 0.3126696 test: 0.4142118 best: 0.3997214 (37) total: 155ms remaining: 3.17ms
98: learn: 0.3120048 test: 0.4140831 best: 0.3997214 (37) total: 156ms remaining: 1.57ms
99: learn: 0.3117563 test: 0.4138267 best: 0.3997214 (37) total: 156ms remaining: 0us
bestTest = 0.3997213503
bestIteration = 37
Shrink model to first 38 iterations.

 

Core code


  
# class CatBoostClassifier, found at: catboost.core

class CatBoostClassifier(CatBoost):
    _estimator_type = 'classifier'
    """
    Implementation of the scikit-learn API for CatBoost classification.

    Parameters
    ----------
    iterations : int, [default=500]
        Max count of trees.
        range: [1,+inf]
    learning_rate : float, [default value is selected automatically for binary classification with other parameters set to default. In all other cases default is 0.03]
        Step size shrinkage used in update to prevent overfitting.
        range: (0,1]
    depth : int, [default=6]
        Depth of a tree. All trees are the same depth.
        range: [1,+inf]
    l2_leaf_reg : float, [default=3.0]
        Coefficient at the L2 regularization term of the cost function.
        range: [0,+inf]
    model_size_reg : float, [default=None]
        Model size regularization coefficient.
        range: [0,+inf]
    rsm : float, [default=None]
        Subsample ratio of columns when constructing each tree.
        range: (0,1]
    loss_function : string or object, [default='Logloss']
        The metric to use in training and also selector of the machine learning problem to solve.
        If string, then the name of a supported metric, optionally suffixed with parameter description.
        If object, it shall provide methods 'calc_ders_range' or 'calc_ders_multi'.
    border_count : int, [default = 254 for training on CPU or 128 for training on GPU]
        The number of partitions in numeric features binarization. Used in the preliminary calculation.
        range: [1,65535] on CPU, [1,255] on GPU
    feature_border_type : string, [default='GreedyLogSum']
        The binarization mode in numeric features binarization. Used in the preliminary calculation.
        Possible values:
            - 'Median'
            - 'Uniform'
            - 'UniformAndQuantiles'
            - 'GreedyLogSum'
            - 'MaxLogSum'
            - 'MinEntropy'
    per_float_feature_quantization : list of strings, [default=None]
        List of float binarization descriptions.
        Format : described in documentation on catboost.ai
        Example 1: ['0:1024'] means that feature 0 will have 1024 borders.
        Example 2: ['0:border_count=1024', '1:border_count=1024', ...] means that two first features have 1024 borders.
        Example 3: ['0:nan_mode=Forbidden,border_count=32,border_type=GreedyLogSum',
        '1:nan_mode=Forbidden,border_count=32,border_type=GreedyLogSum'] - defines more quantization properties for first two features.
    input_borders : string, [default=None]
        Input file with borders used in numeric features binarization.
    output_borders : string, [default=None]
        Output file for borders that were used in numeric features binarization.
    fold_permutation_block : int, [default=1]
        To accelerate the learning. The recommended value is within [1, 256]. On small samples, must be set to 1.
        range: [1,+inf]
    od_pval : float, [default=None]
        Use overfitting detector to stop training when reaching a specified threshold. Can be used only with eval_set.
        range: [0,1]
    od_wait : int, [default=None]
        Number of iterations which overfitting detector will wait after new best error.
    od_type : string, [default=None]
        Type of overfitting detector which will be used in program.
        Possible values:
            - 'IncToDec'
            - 'Iter'
        For 'Iter' type od_pval must not be set. If None, then od_type=IncToDec.
    nan_mode : string, [default=None]
        Way to process missing values for numeric features.
        Possible values:
            - 'Forbidden' - raises an exception if there is a missing value for a numeric feature in a dataset.
            - 'Min' - each missing value will be processed as the minimum numerical value.
            - 'Max' - each missing value will be processed as the maximum numerical value.
        If None, then nan_mode=Min.
    counter_calc_method : string, [default=None]
        The method used to calculate counters for dataset with Counter type.
        Possible values:
            - 'PrefixTest' - only objects up to current in the test dataset are considered
            - 'FullTest' - all objects are considered in the test dataset
            - 'SkipTest' - objects from test dataset are not considered
            - 'Full' - all objects are considered for both learn and test dataset
        If None, then counter_calc_method=PrefixTest.
    leaf_estimation_iterations : int, [default=None]
        The number of steps in the gradient when calculating the values in the leaves.
        If None, then leaf_estimation_iterations=1.
        range: [1,+inf]
    leaf_estimation_method : string, [default=None]
        The method used to calculate the values in the leaves.
        Possible values:
            - 'Newton'
            - 'Gradient'
    thread_count : int, [default=None]
        Number of parallel threads used to run CatBoost.
        If None or -1, then the number of threads is set to the number of CPU cores.
        range: [1,+inf]
    random_seed : int, [default=None]
        Random number seed. If None, 0 is used.
        range: [0,+inf]
    use_best_model : bool, [default=None]
        To limit the number of trees in predict() using information about the optimal value of the error function.
        Can be used only with eval_set.
    best_model_min_trees : int, [default=None]
        The minimal number of trees the best model should have.
    verbose : bool
        When set to True, logging_level is set to 'Verbose'.
        When set to False, logging_level is set to 'Silent'.
    silent : bool, synonym for verbose
    logging_level : string, [default='Verbose']
        Possible values:
            - 'Silent'
            - 'Verbose'
            - 'Info'
            - 'Debug'
    metric_period : int, [default=1]
        The frequency of iterations to print the information to stdout. The value should be a positive integer.
    simple_ctr : list of strings, [default=None]
        Binarization settings for categorical features.
        Format : see documentation
        Example: ['Borders:CtrBorderCount=5:Prior=0:Prior=0.5',
        'BinarizedTargetMeanValue:TargetBorderCount=10:TargetBorderType=MinEntropy', ...]
        CTR types:
            CPU and GPU
            - 'Borders'
            - 'Buckets'
            CPU only
            - 'BinarizedTargetMeanValue'
            - 'Counter'
            GPU only
            - 'FloatTargetMeanValue'
            - 'FeatureFreq'
        Number_of_borders, binarization type, target borders and binarizations, priors are optional parameters.
    combinations_ctr : list of strings, [default=None]
    per_feature_ctr : list of strings, [default=None]
    ctr_target_border_count : int, [default=None]
        Maximum number of borders used in target binarization for categorical features that need it.
        If TargetBorderCount is specified in 'simple_ctr', 'combinations_ctr' or 'per_feature_ctr' option it overrides this value.
        range: [1, 255]
    ctr_leaf_count_limit : int, [default=None]
        The maximum number of leaves with categorical features.
        If the number of leaves exceeds the specified limit, some leaves are discarded.
        The leaves to be discarded are selected as follows:
            - The leaves are sorted by the frequency of the values.
            - The top N leaves are selected, where N is the value specified in the parameter.
            - All leaves starting from N+1 are discarded.
        This option reduces the resulting model size and the amount of memory required for training.
        Note that the resulting quality of the model can be affected.
        range: [1,+inf] (for zero limit use ignored_features)
    store_all_simple_ctr : bool, [default=None]
        Ignore categorical features, which are not used in feature combinations, when choosing candidates for exclusion.
        Use this parameter with ctr_leaf_count_limit only.
    max_ctr_complexity : int, [default=4]
        The maximum number of Categ features that can be combined.
        range: [0,+inf]
    has_time : bool, [default=False]
        To use the order in which objects are represented in the input data
        (do not perform a random permutation of the dataset at the preprocessing stage).
    allow_const_label : bool, [default=False]
        To allow the constant label value in dataset.
    target_border : float, [default=None]
        Border for target binarization.
    classes_count : int, [default=None]
        The upper limit for the numeric class label. Defines the number of classes for multiclassification.
        Only non-negative integers can be specified. The given integer should be greater than any of the target values.
        If this parameter is specified the labels for all classes in the input dataset should be smaller than the given value.
        If several of 'classes_count', 'class_weights', 'class_names' parameters are defined
        the numbers of classes specified by each of them must be equal.
    class_weights : list or dict, [default=None]
        Classes weights. The values are used as multipliers for the object weights.
        If None, all classes are supposed to have weight one.
        If list - class weights in order of class_names or sequential classes if class_names is undefined.
        If dict - dict of class_name -> class_weight.
        If several of 'classes_count', 'class_weights', 'class_names' parameters are defined
        the numbers of classes specified by each of them must be equal.
    auto_class_weights : string, [default=None]
        Enables automatic class weights calculation. Possible values:
            - Balanced # weight = maxSummaryClassWeight / summaryClassWeight, statistics determined from train pool
            - SqrtBalanced # weight = sqrt(maxSummaryClassWeight / summaryClassWeight)
    class_names : list of strings, [default=None]
        Class names. Allows to redefine the default values for class labels (integer numbers).
        If several of 'classes_count', 'class_weights', 'class_names' parameters are defined
        the numbers of classes specified by each of them must be equal.
    one_hot_max_size : int, [default=None]
        Convert the feature to float if the number of different values that it takes exceeds the specified value.
        Ctrs are not calculated for such features.
    random_strength : float, [default=1]
        Score standard deviation multiplier.
    name : string, [default='experiment']
        The name that should be displayed in the visualization tools.
    ignored_features : list, [default=None]
        Indices or names of features that should be excluded when training.
    train_dir : string, [default=None]
        The directory in which you want to record generated in the process of learning files.
    custom_metric : string or list of strings, [default=None]
        To use your own metric function.
    custom_loss : alias to custom_metric
    eval_metric : string or object, [default=None]
        To optimize your custom metric in loss.
    bagging_temperature : float, [default=None]
        Controls intensity of Bayesian bagging. The higher the temperature the more aggressive bagging is.
        Typical values are in range [0, 1] (0 - no bagging, 1 - default).
    save_snapshot : bool, [default=None]
        Enable progress snapshotting for restoring progress after crashes or interruptions.
    snapshot_file : string, [default=None]
        Learn progress snapshot file path, if None will use default filename.
    snapshot_interval : int, [default=600]
        Interval between saving snapshots (seconds).
    fold_len_multiplier : float, [default=None]
        Fold length multiplier. Should be greater than 1.
    used_ram_limit : string or number, [default=None]
        Set a limit on memory consumption (value like '1.2gb' or 1.2e9).
        WARNING: Currently this option affects CTR memory usage only.
    gpu_ram_part : float, [default=0.95]
        Fraction of the GPU RAM to use for training, a value from (0, 1].
    pinned_memory_size : int, [default=None]
        Size of additional CPU pinned memory used for GPU learning,
        usually is estimated automatically, thus usually should not be set.
    allow_writing_files : bool, [default=True]
        If this flag is set to False, no files with different diagnostic info will be created during training.
        With this flag no snapshotting can be done. Plus visualisation will not work,
        because visualisation uses files that are created and updated during training.
    final_ctr_computation_mode : string, [default='Default']
        Possible values:
            - 'Default' - Compute final ctrs for all pools.
            - 'Skip' - Skip final ctr computation. WARNING: model without ctrs can't be applied.
    approx_on_full_history : bool, [default=False]
        If this flag is set to True, each approximated value is calculated using all the preceding rows in the fold (slower, more accurate).
        If this flag is set to False, each approximated value is calculated using only the beginning 1/fold_len_multiplier fraction of the fold (faster, slightly less accurate).
    boosting_type : string, default value depends on object count and feature count in train dataset and on learning mode.
        Boosting scheme.
        Possible values:
            - 'Ordered' - Gives better quality, but may slow down the training.
            - 'Plain' - The classic gradient boosting scheme. May result in quality degradation, but does not slow down the training.
    task_type : string, [default=None]
        The calcer type used to train the model.
        Possible values:
            - 'CPU'
            - 'GPU'
    device_config : string, [default=None], deprecated, use devices instead
    devices : list or string, [default=None], GPU devices to use.
        String format is: '0' for 1 device or '0:1:3' for multiple devices or '0-3' for range of devices.
        List format is : [0] for 1 device or [0,1,3] for multiple devices.
    bootstrap_type : string, Bayesian, Bernoulli, Poisson, MVS.
        Default bootstrap is Bayesian for GPU and MVS for CPU.
        Poisson bootstrap is supported only on GPU.
        MVS bootstrap is supported only on CPU.
    subsample : float, [default=None]
        Sample rate for bagging. This parameter can be used with Poisson or Bernoulli bootstrap types.
    mvs_reg : float, [default is set automatically at each iteration based on gradient distribution]
        Regularization parameter for MVS sampling algorithm.
    monotone_constraints : list or numpy.ndarray or string or dict, [default=None]
        Monotone constraints for features.
    feature_weights : list or numpy.ndarray or string or dict, [default=None]
        Coefficient to multiply split gain with specific feature use. Should be non-negative.
    penalties_coefficient : float, [default=1]
        Common coefficient for all penalties. Should be non-negative.
    first_feature_use_penalties : list or numpy.ndarray or string or dict, [default=None]
        Penalties for the first use of a specific feature in the model. Should be non-negative.
    per_object_feature_penalties : list or numpy.ndarray or string or dict, [default=None]
        Penalties for the first use of a feature for each object. Should be non-negative.
    sampling_frequency : string, [default=PerTree]
        Frequency to sample weights and objects when building trees.
        Possible values:
            - 'PerTree' - Before constructing each new tree
            - 'PerTreeLevel' - Before choosing each new split of a tree
    sampling_unit : string, [default='Object']
        Possible values:
            - 'Object'
            - 'Group'
        The parameter allows to specify the sampling scheme:
        sample weights for each object individually or for an entire group of objects together.
    dev_score_calc_obj_block_size : int, [default=5000000]
        CPU only. Size of block of samples in score calculation. Should be > 0.
        Used only for learning speed tuning.
        Changing this parameter can affect results due to numerical accuracy differences.
    dev_efb_max_buckets : int, [default=1024]
        CPU only. Maximum bucket count in exclusive features bundle. Should be an integer between 0 and 65536.
        Used only for learning speed tuning.
    sparse_features_conflict_fraction : float, [default=0.0]
        CPU only. Maximum allowed fraction of conflicting non-default values for features in exclusive features bundle.
        Should be a real value in [0, 1) interval.
    grow_policy : string, [SymmetricTree,Lossguide,Depthwise], [default=SymmetricTree]
        The tree growing policy. It describes how to perform greedy tree construction.
    min_data_in_leaf : int, [default=1]
        The minimum training samples count in leaf.
        CatBoost will not search for new splits in leaves with samples count less than min_data_in_leaf.
        This parameter is used only for Depthwise and Lossguide growing policies.
    max_leaves : int, [default=31]
        The maximum leaf count in resulting tree.
        This parameter is used only for Lossguide growing policy.
    score_function : string, possible values L2, Cosine, NewtonL2, NewtonCosine, [default=Cosine]
        For growing policy Lossguide default=NewtonL2.
        GPU only. Score that is used during tree construction to select the next tree split.
    max_depth : int, synonym for depth.
    n_estimators : int, synonym for iterations.
    num_trees : int, synonym for iterations.
    num_boost_round : int, synonym for iterations.
    colsample_bylevel : float, synonym for rsm.
    random_state : int, synonym for random_seed.
    reg_lambda : float, synonym for l2_leaf_reg.
    objective : string, synonym for loss_function.
    num_leaves : int, synonym for max_leaves.
    min_child_samples : int, synonym for min_data_in_leaf.
    eta : float, synonym for learning_rate.
    max_bin : float, synonym for border_count.
    scale_pos_weight : float, synonym for class_weights.
        Can be used only for binary classification. Sets weight multiplier for class 1 to scale_pos_weight value.
    metadata : dict, string to string key-value pairs to be stored in model metadata storage.
    early_stopping_rounds : int
        Synonym for od_wait. Only one of these parameters should be set.
    cat_features : list or numpy.ndarray, [default=None]
        If not None, giving the list of Categ features indices or names (names are represented as strings).
        If it contains feature names, feature names must be defined for the training dataset passed to 'fit'.
    text_features : list or numpy.ndarray, [default=None]
        If not None, giving the list of Text features indices or names (names are represented as strings).
        If it contains feature names, feature names must be defined for the training dataset passed to 'fit'.
    embedding_features : list or numpy.ndarray, [default=None]
        If not None, giving the list of Embedding features indices or names (names are represented as strings).
        If it contains feature names, feature names must be defined for the training dataset passed to 'fit'.
    leaf_estimation_backtracking : string, [default=None]
        Type of backtracking during gradient descent.
        Possible values:
            - 'No' - never backtrack; supported on CPU and GPU
            - 'AnyImprovement' - reduce the descent step until the value of loss function is less than before the step; supported on CPU and GPU
            - 'Armijo' - reduce the descent step until Armijo condition is satisfied; supported on GPU only
    model_shrink_rate : float, [default=0]
        This parameter enables shrinkage of model at the start of each iteration. CPU only.
        For Constant mode shrinkage coefficient is calculated as (1 - model_shrink_rate * learning_rate).
        For Decreasing mode shrinkage coefficient is calculated as (1 - model_shrink_rate / iteration).
        Shrinkage coefficient should be in [0, 1).
    model_shrink_mode : string, [default=None]
        Mode of shrinkage coefficient calculation. CPU only.
        Possible values:
            - 'Constant' - Shrinkage coefficient is constant at each iteration.
            - 'Decreasing' - Shrinkage coefficient decreases at each iteration.
    langevin : bool, [default=False]
        Enables the Stochastic Gradient Langevin Boosting. CPU only.
    diffusion_temperature : float, [default=0]
        Langevin boosting diffusion temperature. CPU only.
    posterior_sampling : bool, [default=False]
        Set group of parameters for further use in Uncertainty prediction:
            - Langevin = True
            - Model Shrink Rate = 1/(2N), where N is dataset size
            - Model Shrink Mode = Constant
            - Diffusion-temperature = N, where N is dataset size. CPU only.
    boost_from_average : bool, [default=True for RMSE, False for other losses]
        Enables to initialize approx values by best constant value for specified loss function.
        Available for RMSE, Logloss, CrossEntropy, Quantile and MAE.
    tokenizers : list of dicts
        Each dict is a tokenizer description. Example:
        ```
        [
            {
                'tokenizer_id': 'Tokenizer',  # Tokenizer identifier.
                'lowercasing': 'false',  # Possible values: 'true', 'false'.
                'number_process_policy': 'LeaveAsIs',  # Possible values: 'Skip', 'LeaveAsIs', 'Replace'.
                'number_token': '%',  # Rarely used character. Used in conjunction with Replace NumberProcessPolicy.
                'separator_type': 'ByDelimiter',  # Possible values: 'ByDelimiter', 'BySense'.
                'delimiter': ' ',  # Used in conjunction with ByDelimiter SeparatorType.
                'split_by_set': 'false',  # Each single character in delimiter used as individual delimiter.
                'skip_empty': 'true',  # Possible values: 'true', 'false'.
                'token_types': ['Word', 'Number', 'Unknown'],  # Used in conjunction with BySense SeparatorType.
                    # Possible values: 'Word', 'Number', 'Punctuation', 'SentenceBreak', 'ParagraphBreak', 'Unknown'.
                'subtokens_policy': 'SingleToken',  # Possible values:
                    # 'SingleToken' - All subtokens are interpreted as a single token.
                    # 'SeveralTokens' - All subtokens are interpreted as several tokens.
            },
            ...
        ]
        ```
    dictionaries : list of dicts
        Each dict is a dictionary description. Example:
        ```
        [
            {
                'dictionary_id': 'Dictionary',  # Dictionary identifier.
                'token_level_type': 'Word',  # Possible values: 'Word', 'Letter'.
                'gram_order': '1',  # 1 for Unigram, 2 for Bigram, ...
                'skip_step': '0',  # 1 for 1-skip-gram, ...
  516. 'end_of_word_token_policy': 'Insert', # Possible values:
  517. 'Insert', 'Skip'.
  518. 'end_of_sentence_token_policy': 'Skip', # Possible values:
  519. 'Insert', 'Skip'.
  520. 'occurrence_lower_bound': '3', # The lower bound of
  521. token occurrences in the text to include it in the dictionary.
  522. 'max_dictionary_size': '50000', # The max dictionary size.
  523. },
  524. ...
  525. ]
  526. ```
  527. feature_calcers : list of strings,
  528. Each string is a calcer description. Example:
  529. ```
  530. [
  531. 'NaiveBayes',
  532. 'BM25',
  533. 'BoW:top_tokens_count=2000',
  534. ]
  535. ```
  536. text_processing : dict,
  537. Text processging description.
  538. """
    def __init__(
        self,
        iterations=None,
        learning_rate=None,
        depth=None,
        l2_leaf_reg=None,
        model_size_reg=None,
        rsm=None,
        loss_function=None,
        border_count=None,
        feature_border_type=None,
        per_float_feature_quantization=None,
        input_borders=None,
        output_borders=None,
        fold_permutation_block=None,
        od_pval=None,
        od_wait=None,
        od_type=None,
        nan_mode=None,
        counter_calc_method=None,
        leaf_estimation_iterations=None,
        leaf_estimation_method=None,
        thread_count=None,
        random_seed=None,
        use_best_model=None,
        best_model_min_trees=None,
        verbose=None,
        silent=None,
        logging_level=None,
        metric_period=None,
        ctr_leaf_count_limit=None,
        store_all_simple_ctr=None,
        max_ctr_complexity=None,
        has_time=None,
        allow_const_label=None,
        target_border=None,
        classes_count=None,
        class_weights=None,
        auto_class_weights=None,
        class_names=None,
        one_hot_max_size=None,
        random_strength=None,
        name=None,
        ignored_features=None,
        train_dir=None,
        custom_loss=None,
        custom_metric=None,
        eval_metric=None,
        bagging_temperature=None,
        save_snapshot=None,
        snapshot_file=None,
        snapshot_interval=None,
        fold_len_multiplier=None,
        used_ram_limit=None,
        gpu_ram_part=None,
        pinned_memory_size=None,
        allow_writing_files=None,
        final_ctr_computation_mode=None,
        approx_on_full_history=None,
        boosting_type=None,
        simple_ctr=None,
        combinations_ctr=None,
        per_feature_ctr=None,
        ctr_description=None,
        ctr_target_border_count=None,
        task_type=None,
        device_config=None,
        devices=None,
        bootstrap_type=None,
        subsample=None,
        mvs_reg=None,
        sampling_unit=None,
        sampling_frequency=None,
        dev_score_calc_obj_block_size=None,
        dev_efb_max_buckets=None,
        sparse_features_conflict_fraction=None,
        max_depth=None,
        n_estimators=None,
        num_boost_round=None,
        num_trees=None,
        colsample_bylevel=None,
        random_state=None,
        reg_lambda=None,
        objective=None,
        eta=None,
        max_bin=None,
        scale_pos_weight=None,
        gpu_cat_features_storage=None,
        data_partition=None,
        metadata=None,
        early_stopping_rounds=None,
        cat_features=None,
        grow_policy=None,
        min_data_in_leaf=None,
        min_child_samples=None,
        max_leaves=None,
        num_leaves=None,
        score_function=None,
        leaf_estimation_backtracking=None,
        ctr_history_unit=None,
        monotone_constraints=None,
        feature_weights=None,
        penalties_coefficient=None,
        first_feature_use_penalties=None,
        per_object_feature_penalties=None,
        model_shrink_rate=None,
        model_shrink_mode=None,
        langevin=None,
        diffusion_temperature=None,
        posterior_sampling=None,
        boost_from_average=None,
        text_features=None,
        tokenizers=None,
        dictionaries=None,
        feature_calcers=None,
        text_processing=None,
        embedding_features=None
    ):
        params = {}
        not_params = ["not_params", "self", "params", "__class__"]
        for key, value in iteritems(locals().copy()):
            if key not in not_params and value is not None:
                params[key] = value
        super(CatBoostClassifier, self).__init__(params)
    def fit(self, X, y=None, cat_features=None, text_features=None,
            embedding_features=None, sample_weight=None, baseline=None,
            use_best_model=None, eval_set=None, verbose=None,
            logging_level=None, plot=False, column_description=None,
            verbose_eval=None, metric_period=None, silent=None,
            early_stopping_rounds=None, save_snapshot=None,
            snapshot_file=None, snapshot_interval=None, init_model=None):
        """
        Fit the CatBoostClassifier model.

        Parameters
        ----------
        X : catboost.Pool or list or numpy.ndarray or pandas.DataFrame or pandas.Series
            If not catboost.Pool, 2 dimensional Feature matrix or string - file with dataset.
        y : list or numpy.ndarray or pandas.DataFrame or pandas.Series, optional (default=None)
            Labels, 1 dimensional array like.
            Use only if X is not catboost.Pool.
        cat_features : list or numpy.ndarray, optional (default=None)
            If not None, giving the list of Categ columns indices.
            Use only if X is not catboost.Pool.
        text_features : list or numpy.ndarray, optional (default=None)
            If not None, giving the list of Text columns indices.
            Use only if X is not catboost.Pool.
        embedding_features : list or numpy.ndarray, optional (default=None)
            If not None, giving the list of Embedding columns indices.
            Use only if X is not catboost.Pool.
        sample_weight : list or numpy.ndarray or pandas.DataFrame or pandas.Series, optional (default=None)
            Instance weights, 1 dimensional array like.
        baseline : list or numpy.ndarray, optional (default=None)
            If not None, giving 2 dimensional array like data.
            Use only if X is not catboost.Pool.
        use_best_model : bool, optional (default=None)
            Flag to use best model
        eval_set : catboost.Pool or list, optional (default=None)
            A list of (X, y) tuple pairs to use as a validation set for early-stopping
        metric_period : int
            Frequency of evaluating metrics.
        verbose : bool or int
            If verbose is bool, then if set to True, logging_level is set to Verbose,
            if set to False, logging_level is set to Silent.
            If verbose is int, it determines the frequency of writing
            metrics to output and logging_level is set to Verbose.
        silent : bool
            If silent is True, logging_level is set to Silent.
            If silent is False, logging_level is set to Verbose.
        logging_level : string, optional (default=None)
            Possible values:
                - 'Silent'
                - 'Verbose'
                - 'Info'
                - 'Debug'
        plot : bool, optional (default=False)
            If True, draw train and eval error in Jupyter notebook
        verbose_eval : bool or int
            Synonym for verbose. Only one of these parameters should be set.
        early_stopping_rounds : int
            Activates Iter overfitting detector with od_wait set to early_stopping_rounds.
        save_snapshot : bool, [default=None]
            Enable progress snapshotting for restoring progress after crashes or interruptions
        snapshot_file : string, [default=None]
            Learn progress snapshot file path, if None will use default filename
        snapshot_interval : int, [default=600]
            Interval between saving snapshots (seconds)
        init_model : CatBoost class or string, [default=None]
            Continue training starting from the existing model.
            If this parameter is a string, load the initial model from the path
            specified by this string.

        Returns
        -------
        model : CatBoost
        """
        params = self._init_params.copy()
        _process_synonyms(params)
        if 'loss_function' in params:
            self._check_is_classification_objective(params['loss_function'])
        self._fit(X, y, cat_features, text_features, embedding_features,
                  None, sample_weight, None, None, None, None, baseline,
                  use_best_model, eval_set, verbose, logging_level, plot,
                  column_description, verbose_eval, metric_period, silent,
                  early_stopping_rounds, save_snapshot, snapshot_file,
                  snapshot_interval, init_model)
        return self
    def predict(self, data, prediction_type='Class', ntree_start=0,
                ntree_end=0, thread_count=-1, verbose=None):
        """
        Predict with data.

        Parameters
        ----------
        data : catboost.Pool or list of features or list of lists or numpy.ndarray
               or pandas.DataFrame or pandas.Series or catboost.FeaturesData
            Data to apply model on.
            If data is a simple list (not list of lists) or a one-dimensional
            numpy.ndarray it is interpreted as a list of features for a single object.
        prediction_type : string, optional (default='Class')
            Can be:
            - 'RawFormulaVal' : return raw formula value.
            - 'Class' : return class label.
            - 'Probability' : return probability for every class.
            - 'LogProbability' : return log probability for every class.
        ntree_start : int, optional (default=0)
            Model is applied on the interval [ntree_start, ntree_end) (zero-based indexing).
        ntree_end : int, optional (default=0)
            Model is applied on the interval [ntree_start, ntree_end) (zero-based indexing).
            If value equals to 0 this parameter is ignored and ntree_end equal to tree_count_.
        thread_count : int (default=-1)
            The number of threads to use when applying the model.
            Allows you to optimize the speed of execution. This parameter doesn't affect results.
            If -1, then the number of threads is set to the number of CPU cores.
        verbose : bool, optional (default=False)
            If True, writes the evaluation metric measured set to stderr.

        Returns
        -------
        prediction :
            If data is for a single object, the return value depends on prediction_type value:
            - 'RawFormulaVal' : return raw formula value.
            - 'Class' : return class label.
            - 'Probability' : return one-dimensional numpy.ndarray with probability for every class.
            - 'LogProbability' : return one-dimensional numpy.ndarray with
              log probability for every class.
            otherwise numpy.ndarray, with values that depend on prediction_type value:
            - 'RawFormulaVal' : one-dimensional array of raw formula value for each object.
            - 'Class' : one-dimensional array of class label for each object.
            - 'Probability' : two-dimensional numpy.ndarray with shape
              (number_of_objects x number_of_classes) with probability for every class for each object.
            - 'LogProbability' : two-dimensional numpy.ndarray with shape
              (number_of_objects x number_of_classes) with log probability for every class for each object.
        """
        return self._predict(data, prediction_type, ntree_start, ntree_end,
                             thread_count, verbose, 'predict')

    def predict_proba(self, data, ntree_start=0, ntree_end=0,
                      thread_count=-1, verbose=None):
        """
        Predict class probability with data.

        Parameters
        ----------
        data : catboost.Pool or list of features or list of lists or numpy.ndarray
               or pandas.DataFrame or pandas.Series or catboost.FeaturesData
            Data to apply model on.
            If data is a simple list (not list of lists) or a one-dimensional
            numpy.ndarray it is interpreted as a list of features for a single object.
        ntree_start : int, optional (default=0)
            Model is applied on the interval [ntree_start, ntree_end) (zero-based indexing).
        ntree_end : int, optional (default=0)
            Model is applied on the interval [ntree_start, ntree_end) (zero-based indexing).
            If value equals to 0 this parameter is ignored and ntree_end equal to tree_count_.
        thread_count : int (default=-1)
            The number of threads to use when applying the model.
            Allows you to optimize the speed of execution. This parameter doesn't affect results.
            If -1, then the number of threads is set to the number of CPU cores.
        verbose : bool
            If True, writes the evaluation metric measured set to stderr.

        Returns
        -------
        prediction :
            If data is for a single object
                return one-dimensional numpy.ndarray with probability for every class.
            otherwise
                return two-dimensional numpy.ndarray with shape
                (number_of_objects x number_of_classes)
                with probability for every class for each object.
        """
        return self._predict(data, 'Probability', ntree_start, ntree_end,
                             thread_count, verbose, 'predict_proba')

    def predict_log_proba(self, data, ntree_start=0, ntree_end=0,
                          thread_count=-1, verbose=None):
        """
        Predict class log probability with data.

        Parameters
        ----------
        data : catboost.Pool or list of features or list of lists or numpy.ndarray
               or pandas.DataFrame or pandas.Series or catboost.FeaturesData
            Data to apply model on.
            If data is a simple list (not list of lists) or a one-dimensional
            numpy.ndarray it is interpreted as a list of features for a single object.
        ntree_start : int, optional (default=0)
            Model is applied on the interval [ntree_start, ntree_end) (zero-based indexing).
        ntree_end : int, optional (default=0)
            Model is applied on the interval [ntree_start, ntree_end) (zero-based indexing).
            If value equals to 0 this parameter is ignored and ntree_end equal to tree_count_.
        thread_count : int (default=-1)
            The number of threads to use when applying the model.
            Allows you to optimize the speed of execution. This parameter doesn't affect results.
            If -1, then the number of threads is set to the number of CPU cores.
        verbose : bool
            If True, writes the evaluation metric measured set to stderr.

        Returns
        -------
        prediction :
            If data is for a single object
                return one-dimensional numpy.ndarray with log probability for every class.
            otherwise
                return two-dimensional numpy.ndarray with shape
                (number_of_objects x number_of_classes)
                with log probability for every class for each object.
        """
        return self._predict(data, 'LogProbability', ntree_start, ntree_end,
                             thread_count, verbose, 'predict_log_proba')
    def staged_predict(self, data, prediction_type='Class', ntree_start=0,
                       ntree_end=0, eval_period=1, thread_count=-1, verbose=None):
        """
        Predict target at each stage for data.

        Parameters
        ----------
        data : catboost.Pool or list of features or list of lists or numpy.ndarray
               or pandas.DataFrame or pandas.Series or catboost.FeaturesData
            Data to apply model on.
            If data is a simple list (not list of lists) or a one-dimensional
            numpy.ndarray it is interpreted as a list of features for a single object.
        prediction_type : string, optional (default='Class')
            Can be:
            - 'RawFormulaVal' : return raw formula value.
            - 'Class' : return class label.
            - 'Probability' : return probability for every class.
            - 'LogProbability' : return log probability for every class.
        ntree_start : int, optional (default=0)
            Model is applied on the interval [ntree_start, ntree_end)
            with the step eval_period (zero-based indexing).
        ntree_end : int, optional (default=0)
            Model is applied on the interval [ntree_start, ntree_end)
            with the step eval_period (zero-based indexing).
            If value equals to 0 this parameter is ignored and ntree_end equal to tree_count_.
        eval_period : int, optional (default=1)
            Model is applied on the interval [ntree_start, ntree_end)
            with the step eval_period (zero-based indexing).
        thread_count : int (default=-1)
            The number of threads to use when applying the model.
            Allows you to optimize the speed of execution. This parameter doesn't affect results.
            If -1, then the number of threads is set to the number of CPU cores.
        verbose : bool
            If True, writes the evaluation metric measured set to stderr.

        Returns
        -------
        prediction : generator for each iteration that generates:
            If data is for a single object, the return value depends on prediction_type value:
            - 'RawFormulaVal' : return raw formula value.
            - 'Class' : return majority vote class.
            - 'Probability' : return one-dimensional numpy.ndarray with probability for every class.
            - 'LogProbability' : return one-dimensional numpy.ndarray with
              log probability for every class.
            otherwise numpy.ndarray, with values that depend on prediction_type value:
            - 'RawFormulaVal' : one-dimensional array of raw formula value for each object.
            - 'Class' : one-dimensional array of class label for each object.
            - 'Probability' : two-dimensional numpy.ndarray with shape
              (number_of_objects x number_of_classes) with probability for every class for each object.
            - 'LogProbability' : two-dimensional numpy.ndarray with shape
              (number_of_objects x number_of_classes) with log probability for every class for each object.
        """
        return self._staged_predict(data, prediction_type, ntree_start, ntree_end,
                                    eval_period, thread_count, verbose, 'staged_predict')

    def staged_predict_proba(self, data, ntree_start=0, ntree_end=0,
                             eval_period=1, thread_count=-1, verbose=None):
        """
        Predict classification target at each stage for data.

        Parameters
        ----------
        data : catboost.Pool or list of features or list of lists or numpy.ndarray
               or pandas.DataFrame or pandas.Series or catboost.FeaturesData
            Data to apply model on.
            If data is a simple list (not list of lists) or a one-dimensional
            numpy.ndarray it is interpreted as a list of features for a single object.
        ntree_start : int, optional (default=0)
            Model is applied on the interval [ntree_start, ntree_end)
            with the step eval_period (zero-based indexing).
        ntree_end : int, optional (default=0)
            Model is applied on the interval [ntree_start, ntree_end)
            with the step eval_period (zero-based indexing).
            If value equals to 0 this parameter is ignored and ntree_end equal to tree_count_.
        eval_period : int, optional (default=1)
            Model is applied on the interval [ntree_start, ntree_end)
            with the step eval_period (zero-based indexing).
        thread_count : int (default=-1)
            The number of threads to use when applying the model.
            Allows you to optimize the speed of execution. This parameter doesn't affect results.
            If -1, then the number of threads is set to the number of CPU cores.
        verbose : bool
            If True, writes the evaluation metric measured set to stderr.

        Returns
        -------
        prediction : generator for each iteration that generates:
            If data is for a single object
                return one-dimensional numpy.ndarray with probability for every class.
            otherwise
                return two-dimensional numpy.ndarray with shape
                (number_of_objects x number_of_classes)
                with probability for every class for each object.
        """
        return self._staged_predict(data, 'Probability', ntree_start, ntree_end,
                                    eval_period, thread_count, verbose,
                                    'staged_predict_proba')

    def staged_predict_log_proba(self, data, ntree_start=0, ntree_end=0,
                                 eval_period=1, thread_count=-1, verbose=None):
        """
        Predict classification target at each stage for data.

        Parameters
        ----------
        data : catboost.Pool or list of features or list of lists or numpy.ndarray
               or pandas.DataFrame or pandas.Series or catboost.FeaturesData
            Data to apply model on.
            If data is a simple list (not list of lists) or a one-dimensional
            numpy.ndarray it is interpreted as a list of features for a single object.
        ntree_start : int, optional (default=0)
            Model is applied on the interval [ntree_start, ntree_end)
            with the step eval_period (zero-based indexing).
        ntree_end : int, optional (default=0)
            Model is applied on the interval [ntree_start, ntree_end)
            with the step eval_period (zero-based indexing).
            If value equals to 0 this parameter is ignored and ntree_end equal to tree_count_.
        eval_period : int, optional (default=1)
            Model is applied on the interval [ntree_start, ntree_end)
            with the step eval_period (zero-based indexing).
        thread_count : int (default=-1)
            The number of threads to use when applying the model.
            Allows you to optimize the speed of execution. This parameter doesn't affect results.
            If -1, then the number of threads is set to the number of CPU cores.
        verbose : bool
            If True, writes the evaluation metric measured set to stderr.

        Returns
        -------
        prediction : generator for each iteration that generates:
            If data is for a single object
                return one-dimensional numpy.ndarray with log probability for every class.
            otherwise
                return two-dimensional numpy.ndarray with shape
                (number_of_objects x number_of_classes)
                with log probability for every class for each object.
        """
        return self._staged_predict(data, 'LogProbability', ntree_start, ntree_end,
                                    eval_period, thread_count, verbose,
                                    'staged_predict_log_proba')
    def score(self, X, y=None):
        """
        Calculate accuracy.

        Parameters
        ----------
        X : catboost.Pool or list or numpy.ndarray or pandas.DataFrame or pandas.Series
            Data to apply model on.
        y : list or numpy.ndarray
            True labels.

        Returns
        -------
        accuracy : float
        """
        if isinstance(X, Pool):
            if y is not None:
                raise CatBoostError("Wrong initializing y: X is catboost.Pool object, "
                                    "y must be initialized inside catboost.Pool.")
            y = X.get_label()
            if y is None:
                raise CatBoostError("Label in X has not been initialized.")
        if isinstance(y, DataFrame):
            if len(y.columns) != 1:
                raise CatBoostError("y is DataFrame and has {} columns, "
                                    "but must have exactly one.".format(len(y.columns)))
            y = y[y.columns[0]]
        elif y is None:
            raise CatBoostError("y should be specified.")
        y = np.array(y)
        predicted_classes = self._predict(
            X, prediction_type='Class', ntree_start=0, ntree_end=0,
            thread_count=-1, verbose=None, parent_method_name='score').reshape(-1)
        if np.issubdtype(predicted_classes.dtype, np.number):
            if np.issubdtype(y.dtype, np.character):
                raise CatBoostError('predicted classes have numeric type '
                                    'but specified y contains strings')
        elif np.issubdtype(y.dtype, np.number):
            raise CatBoostError('predicted classes have string type '
                                'but specified y is numeric')
        elif np.issubdtype(y.dtype, np.bool_):
            raise CatBoostError('predicted classes have string type '
                                'but specified y is boolean')
        return np.mean(np.array(predicted_classes) == np.array(y))

    def _check_is_classification_objective(self, loss_function):
        if isinstance(loss_function, str) and not self._is_classification_objective(loss_function):
            raise CatBoostError(
                "Invalid loss_function='{}': for classifier use "
                "Logloss, CrossEntropy, MultiClass, MultiClassOneVsAll "
                "or custom objective object".format(loss_function))
Reposted from: https://blog.csdn.net/qq_41185868/article/details/114907282