
General Steps for Random Forest Analysis with the Caret and randomForest Packages (1)


General Steps for Building a Machine-Learning Workflow with Caret

Caret relies on the trainControl function to set up the cross-validation (resampling) parameters and on the train function to actually train and evaluate models. The workflow is: choose the set of tuning parameters and parameter-value combinations to evaluate; define the resampling scheme; train and evaluate a model for each combination in a loop and compute its average performance across resamples; select the best parameter combination according to the chosen metric; and finally train the model on the full training set with the optimal parameter combination.
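In code, this pattern boils down to a few lines (a minimal sketch with placeholder predictors x and response y, not the analysis below):

library(caret)
# 1. Resampling scheme: 10-fold cross-validation repeated 3 times
trControl <- trainControl(method = "repeatedcv", number = 10, repeats = 3)
# 2. Candidate parameter combinations to evaluate (for random forest: mtry)
tuneGrid <- expand.grid(mtry = c(2, 50, 200))
# 3. train() loops over the grid, resamples, averages performance,
#    picks the best combination by `metric`, then refits on all of x and y
model <- train(x = x, y = y, method = "rf", metric = "Accuracy",
               tuneGrid = tuneGrid, trControl = trControl)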

General Steps for Random Forest Analysis with the Caret and randomForest Packages

createDataPartition splits the data into training and test sets. For classification data, it samples proportionally within each class; for regression data, it first bins the response values into n groups and then samples proportionally within each bin.
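A quick way to see the stratified behavior is the built-in iris data (an illustration, not part of the analysis below):

library(caret)
set.seed(1)
idx <- createDataPartition(iris$Species, p = 0.75, list = FALSE)
# Each of the three species contributes ~75% of its 50 samples, so the
# class proportions in the training split match those of the full data
table(iris$Species[idx])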


   
library(caret)

# Split the data into training and test sets
seed <- 1
set.seed(seed)
train_index <- createDataPartition(metadata[[group]], p=0.75, list=F)
train_data <- expr_mat[train_index,]
train_data_group <- metadata[[group]][train_index]
test_data <- expr_mat[-train_index,]
test_data_group <- metadata[[group]][-train_index]

# Create model with default parameters
trControl <- trainControl(method="repeatedcv", number=10, repeats=3)

# Train the model
if(file.exists('rda/rf_default.rda')){
  rf_default <- readRDS("rda/rf_default.rda")
} else {
  # Set the random seed so the results are reproducible
  seed <- 1
  set.seed(seed)
  rf_default <- train(x=train_data, y=train_data_group, method="rf",
                      trControl=trControl)
  saveRDS(rf_default, "rda/rf_default.rda")
}
print(rf_default)

   
## Random Forest
##
## 59 samples
## 7070 predictors
##    2 classes: 'DLBCL', 'FL'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 53, 53, 54, 53, 53, 54, ...
## Resampling results across tuning parameters:
##
##   mtry  Accuracy   Kappa
##      2  0.7707937  0.09159664
##    118  0.8914286  0.62016807
##   7069  0.9390476  0.77675070
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 7069.

Accuracy across the default tuning grid

plot(rf_default)

Kappa across the default tuning grid

# See ?plot.train for more usage information
plot(rf_default, metric="Kappa")

#str(rf_default)

Comparing the Performance of Different Algorithms with Caret

This is where the uniformity of the Caret workflow pays off. I had never used the ranger or parRF packages before and did not know the details of how to use them, but with Caret I could quickly build preliminary models with them and compare the results against randomForest.
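Before training, caret's modelLookup() shows which tuning parameters each method exposes, which is handy when trying an unfamiliar package (a quick check, not in the original post):

modelLookup("rf")      # mtry
modelLookup("ranger")  # mtry, splitrule, min.node.size
modelLookup("RRF")     # mtry, coefReg, coefImp
modelLookup("parRF")   # mtry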


   
# Too slow
# RRF: Regularized Random Forest
if(file.exists('rda/RRF_default.rda')){
  RRF_default <- readRDS("rda/RRF_default.rda")
} else {
  set.seed(1)
  RRF_default <- train(x=train_data, y=train_data_group, method="RRF",
                       trControl=trControl)
  saveRDS(RRF_default, "rda/RRF_default.rda")
}
RRF_default

   
## Regularized Random Forest
##
## 59 samples
## 7070 predictors
##    2 classes: 'DLBCL', 'FL'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 53, 53, 54, 53, 53, 54, ...
## Resampling results across tuning parameters:
##
##   mtry  coefReg  coefImp  Accuracy   Kappa
##      2  0.010    0.0      0.6612698   0.06472400
##      2  0.010    0.5      0.6300000  -0.02007385
##      2  0.010    1.0      0.7317460   0.26162083
##      2  0.505    0.0      0.6984127   0.08957347
##      2  0.505    0.5      0.7711111   0.33546603
##      2  0.505    1.0      0.7436508   0.28044267
##      2  1.000    0.0      0.8269841   0.48410091
##      2  1.000    0.5      0.8450794   0.47661766
##      2  1.000    1.0      0.7195238   0.22447669
##    118  0.010    0.0      0.6668254   0.06598972
##    118  0.010    0.5      0.7785714   0.36287284
##    118  0.010    1.0      0.8485714   0.55917306
##    118  0.505    0.0      0.8036508   0.42708196
##    118  0.505    0.5      0.8422222   0.52480669
##    118  0.505    1.0      0.8388889   0.53413165
##    118  1.000    0.0      0.9061905   0.70898014
##    118  1.000    0.5      0.8550794   0.55650040
##    118  1.000    1.0      0.8101587   0.49044569
##   7069  0.010    0.0      0.7260317   0.29956225
##   7069  0.010    0.5      0.7250794   0.24852437
##   7069  0.010    1.0      0.7390476   0.28045078
##   7069  0.505    0.0      0.7982540   0.43476213
##   7069  0.505    0.5      0.7560317   0.31467344
##   7069  0.505    1.0      0.7457143   0.28034256
##   7069  1.000    0.0      0.8950794   0.65746499
##   7069  1.000    0.5      0.8600000   0.55006709
##   7069  1.000    1.0      0.7715873   0.34952193
##
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were mtry = 118, coefReg = 1 and coefImp
## = 0.

ranger is a fast implementation of random forest, particularly suited to high-dimensional data; it supports classification, regression, and survival analysis.


   
# ranger
# ranger is a fast implementation of random forests (Breiman 2001) or recursive
# partitioning, particularly suited for high dimensional data. Classification,
# regression, and survival forests are supported. Classification and regression
# forests are implemented as in the original Random Forest (Breiman 2001),
# survival forests as in Random Survival Forests (Ishwaran et al. 2008).
if(file.exists('rda/ranger_default.rda')){
  ranger_default <- readRDS("rda/ranger_default.rda")
} else {
  set.seed(1)
  ranger_default <- train(x=train_data, y=train_data_group, method="ranger",
                          trControl=trControl)
  saveRDS(ranger_default, "rda/ranger_default.rda")
}
ranger_default

   
## Random Forest
##
## 59 samples
## 7070 predictors
##    2 classes: 'DLBCL', 'FL'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 52, 52, 54, 53, 53, 53, ...
## Resampling results across tuning parameters:
##
##   mtry  splitrule   Accuracy   Kappa
##      2  gini        0.7626984  0.05238095
##      2  extratrees  0.7504762  0.00000000
##    118  gini        0.8777778  0.58852454
##    118  extratrees  0.8441270  0.42128852
##   7069  gini        0.9434921  0.78263305
##   7069  extratrees  0.9200000  0.73072829
##
## Tuning parameter 'min.node.size' was held constant at a value of 1
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were mtry = 7069, splitrule = gini
## and min.node.size = 1.

parRF is a parallel implementation of the random forest algorithm.


   
# parRF: Parallel Random Forest
if(file.exists('rda/parRF_default.rda')){
  parRF_default <- readRDS("rda/parRF_default.rda")
} else {
  library(doParallel)
  cl <- makePSOCKcluster(2)
  registerDoParallel(cl)
  set.seed(1)
  parRF_default <- train(x=train_data, y=train_data_group, method="parRF",
                         trControl=trControl)
  # When you are done:
  stopCluster(cl)
  saveRDS(parRF_default, "rda/parRF_default.rda")
}
parRF_default

   
## Parallel Random Forest
##
## 59 samples
## 7070 predictors
##    2 classes: 'DLBCL', 'FL'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 52, 54, 54, 53, 52, 53, ...
## Resampling results across tuning parameters:
##
##   mtry  Accuracy   Kappa
##      2  0.7560317  0.01904762
##    118  0.8615873  0.53526865
##   7069  0.9161905  0.72790935
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 7069.

Combining the Model Results from the Different Packages


   
resamps <- resamples(list(RRF = RRF_default,
                          rf = rf_default,
                          parRF = parRF_default,
                          ranger = ranger_default))
resamps

   
##
## Call:
## resamples.default(x = list(RRF = RRF_default, rf = rf_default, parRF
##  = parRF_default, ranger = ranger_default))
##
## Models: RRF, rf, parRF, ranger
## Number of resamples: 30
## Performance metrics: Accuracy, Kappa
## Time estimates for: everything, final model fit
Plotting the Performance Differences Among the Models


   
theme1 <- trellis.par.get()
theme1$plot.symbol$col = rgb(.2, .2, .2, .4)
theme1$plot.symbol$pch = 16
theme1$plot.line$col = rgb(1, 0, 0, .7)
theme1$plot.line$lwd <- 2
trellis.par.set(theme1)
# This plot is sometimes right and sometimes wrong; it greatly overestimates
# the Kappa values, and I have not yet figured out why
bwplot(resamps)

This result is consistent with the summary matrix output above.

dotplot(resamps)

Statistical Tests of Whether the Differences Between Models Are Significant


   
diff.value <- diff(resamps)
summary(diff.value)

   
##
## Call:
## summary.diff.resamples(object = diff.value)
##
## p-value adjustment: bonferroni
## Upper diagonal: estimates of the difference
## Lower diagonal: p-value for H0: difference = 0
##
## Accuracy
##        RRF     rf         parRF      ranger
## RRF            -0.032857  -0.010000  -0.037302
## rf     0.5193              0.022857  -0.004444
## parRF  1.0000  1.0000                -0.027302
## ranger 1.0000  1.0000     1.0000
##
## Kappa
##        RRF  rf         parRF      ranger
## RRF         -0.067771  -0.018929  -0.073653
## rf     1               0.048841   -0.005882
## parRF  1    1                     -0.054724
## ranger 1    1          1

Training the Final Model with Caret


   
if(file.exists('rda/rf_finaltest.rda')){
  rf_final <- readRDS("rda/rf_finaltest.rda")
} else {
  # method="none": no more resampling; train on the full training set
  trControl <- trainControl(method="none", classProbs=T)
  # Set the random seed so the results are reproducible
  seed <- 1
  set.seed(seed)
  rf_final <- train(x=train_data, y=train_data_group, method="rf",
                    # use the tuning parameter values of the best model
                    tuneGrid = rf_default$bestTune,
                    trControl=trControl)
  saveRDS(rf_final, "rda/rf_finaltest.rda")
}
print(rf_final)

   
## Random Forest
##
## 59 samples
## 7070 predictors
##    2 classes: 'DLBCL', 'FL'
##
## No pre-processing
## Resampling: None
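For reference, rf_default$bestTune is a one-row data frame holding the winning parameter combination from the earlier grid search (mtry = 7069 above), which is why train() with method = "none" fits exactly one model:

# One-row data frame of the winning parameters (mtry = 7069 above)
rf_default$bestTune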

Predicting on the Test Set with the Model


   
# See ?predict.train
# type: raw (returns the predicted class or regression value) or
#       prob (returns the probability of assignment to each class)
predict(rf_final, newdata=head(test_data))

   
## [1] DLBCL DLBCL DLBCL DLBCL DLBCL DLBCL
## Levels: DLBCL FL

Although all six samples are classified as DLBCL, their classification probabilities differ; DLBCL_11, for example, hovers near the decision boundary.


   
# See ?predict.train
# type: raw (returns the predicted class or regression value) or
#       prob (returns the probability of assignment to each class)
predict(rf_final, newdata=head(test_data), type="prob")

   
##          DLBCL    FL
## DLBCL_2  0.960 0.040
## DLBCL_5  0.934 0.066
## DLBCL_11 0.742 0.258
## DLBCL_13 0.892 0.108
## DLBCL_16 0.866 0.134
## DLBCL_17 0.838 0.162

Obtaining the Confusion Matrix for Evaluating Model Results


   
predictions <- predict(rf_final, newdata=test_data)
confusionMatrix(predictions, test_data_group)

   
## Confusion Matrix and Statistics
##
##           Reference
## Prediction DLBCL FL
##      DLBCL    14  2
##      FL        0  2
##
##                Accuracy : 0.8889
##                  95% CI : (0.6529, 0.9862)
##     No Information Rate : 0.7778
##     P-Value [Acc > NIR] : 0.2022
##
##                   Kappa : 0.6087
##
##  Mcnemar's Test P-Value : 0.4795
##
##             Sensitivity : 1.0000
##             Specificity : 0.5000
##          Pos Pred Value : 0.8750
##          Neg Pred Value : 1.0000
##              Prevalence : 0.7778
##          Detection Rate : 0.7778
##    Detection Prevalence : 0.8889
##       Balanced Accuracy : 0.7500
##
##        'Positive' Class : DLBCL
##
Plotting the ROC Curve


   
prediction_prob <- predict(rf_final, newdata=test_data, type="prob")
library(pROC)
roc <- roc(test_data_group, prediction_prob[,1])
roc

   
##
## Call:
## roc.default(response = test_data_group, predictor = prediction_prob[, 1])
##
## Data: prediction_prob[, 1] in 14 controls (test_data_group DLBCL) > 4 cases (test_data_group FL).
## Area under the curve: 0.9821

   
library(ggplot2)
ROC_data <- data.frame(FPR = 1 - roc$specificities, TPR = roc$sensitivities)
ROC_data <- ROC_data[order(ROC_data$FPR),]
p <- ggplot(data=ROC_data, mapping=aes(x=FPR, y=TPR)) +
  geom_step(color="red", size=1, direction="mid") +
  geom_segment(aes(x=0, xend=1, y=0, yend=1)) + theme_classic() +
  xlab("False positive rate") +
  ylab("True positive rate") + coord_fixed(1) + xlim(0, 1) + ylim(0, 1) +
  annotate('text', x=0.5, y=0.25, label=paste('AUC=', round(roc$auc, 2)))
p
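As a quicker alternative to building the data frame manually, pROC's own plot method can draw the curve and print the AUC directly (a convenience option, not used in the original post):

# legacy.axes=TRUE labels the x axis as 1 - specificity (i.e., FPR)
plot(roc, print.auc = TRUE, legacy.axes = TRUE)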

Machine-Learning Tutorial Series

Starting from random forest, step by step work through the concepts and practice of decision trees, random forest, ROC/AUC, datasets, and cross-validation.

Use text where text can explain it, pictures where pictures can show it, formulas where descriptions fall short, and simple code where even formulas are unclear, sorting out each step and concept one at a time.

Then move on to applying mature code, tuning models, comparing models, and evaluating models, covering the knowledge and skills the whole machine-learning workflow requires.

  1. Machine-Learning Algorithms - Random Forest: A First Look at Decision Trees (1)

  2. Machine-Learning Algorithms - Random Forest: Brute-Force Implementation of a Decision Tree from Scratch in R (2)

  3. Machine-Learning Algorithms - Random Forest: Brute-Force Implementation of a Decision Tree from Scratch in R (3)

  4. Machine-Learning Algorithms - Random Forest: Theoretical Overview

  5. After much delay, random forest finally reaches practice. First, a collection of cancer expression datasets suitable for machine learning: https://file.biolab.si/biolab/supp/bi-cancer/projections/.

  6. Machine-Learning Algorithms - A First Look at Random Forest (1)

  7. Machine Learning Model-Evaluation Metrics - ROC Curves and AUC Values

  8. Machine Learning - Training, Validation, and Test Sets

  9. Machine Learning - Manual 10-Fold Cross-Validation with Random Forest

  10. One Function Unifies 238 Machine-Learning R Packages - How Great Is That


Reprinted from: https://blog.csdn.net/qazplm12_3/article/details/113733552