小言_互联网的博客

【python3】基于随机森林的气温预测

259人阅读  评论(0)

前言

这个项目实战系列主要是跟着网络上的教程来做的,主要参考《跟着迪哥学习机器学习》中的思路和具体实现代码,但是书中使用到的应该是python2的版本,有一些代码也有问题,有的是省略了一些关键的步骤,有的是语法的问题,总之就是并不是直接照着敲就能够一路运行下来的。这里整理了能够运行的代码和数据集(链接容易挂,需要请私聊)。

系列导航

思路

随机森林建模->特征选择->效率对比->参数调优

import pandas as pd

features = pd.read_csv("temps.csv")
features.head(5)
year month day week temp_2 temp_1 average actual friend
0 2019 1 1 Fri 45 45 45.6 45 29
1 2019 1 2 Sat 44 45 45.7 44 61
2 2019 1 3 Sun 45 44 45.8 41 56
3 2019 1 4 Mon 44 41 45.9 40 53
4 2019 1 5 Tues 41 40 46.0 44 41

具体任务

  1. 基本建模任务:
    • 数据预处理
    • 特征展示
    • 建模+可视化
  2. 数据样本量的个数对结果的影响:增阿吉数据样本个数,观察结果变化,重新考虑特征工程,引入性特征后,观察结果
  3. 对算法进行调参,找到最合适的参数:两种经典调仓方法

特征可视化与预处理

观察数据地规模

print("数据维度:",features.shape)
数据维度: (348, 9)
features.describe()
year month day temp_2 temp_1 average actual friend
count 348.0 348.000000 348.000000 348.000000 348.000000 348.000000 348.000000 348.000000
mean 2019.0 6.477011 15.514368 62.652299 62.701149 59.760632 62.543103 60.034483
std 0.0 3.498380 8.772982 12.165398 12.120542 10.527306 11.794146 15.626179
min 2019.0 1.000000 1.000000 35.000000 35.000000 45.100000 35.000000 28.000000
25% 2019.0 3.000000 8.000000 54.000000 54.000000 49.975000 54.000000 47.750000
50% 2019.0 6.000000 15.000000 62.500000 62.500000 58.200000 62.500000 60.000000
75% 2019.0 10.000000 23.000000 71.000000 71.000000 69.025000 71.000000 71.000000
max 2019.0 12.000000 31.000000 117.000000 117.000000 77.400000 92.000000 95.000000

数据:

  1. count结果展示数据集是否有缺少值
  2. mean、std观察数据的分布特征
# 时间转化,用标准时间格式方便后续工作
import datetime
years = features['year']
months = features['month']
days = features['day']
dates = [str(int(year)) + '-' + str(int(month)) + '-' + str(int(day)) for year, month, day in zip(years,months,days)]
dates = [datetime.datetime.strptime(date,'%Y-%m-%d') for date in dates]
dates[:5]
[datetime.datetime(2019, 1, 1, 0, 0),
 datetime.datetime(2019, 1, 2, 0, 0),
 datetime.datetime(2019, 1, 3, 0, 0),
 datetime.datetime(2019, 1, 4, 0, 0),
 datetime.datetime(2019, 1, 5, 0, 0)]
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use('fivethirtyeight') # 绘图风格

可视化

布局视图,初步分析需要四项指标:分别为最高气温的标签值、前天、昨天、朋友预测的气温最高值,四个图。

fig,((ax1,ax2),(ax3,ax4)) = plt.subplots(nrows=2,ncols=2,figsize=(10,10))
fig.autofmt_xdate(rotation=45)
# 最高气温的标签值
ax1.plot(dates,features['actual'])
ax1.set_xlabel('');ax1.set_ylabel('Temperature');ax1.set_title('Max Temp')
# 昨天的最高温度值
ax2.plot(dates,features['temp_1'])
ax2.set_xlabel('');ax2.set_ylabel('Temperature');ax2.set_title('Yesterday Max Temp')
# 前天的最高温度值
ax3.plot(dates,features['temp_2'])
ax3.set_xlabel('');ax3.set_ylabel('Temperature');ax3.set_title('Two Days Prior Max Temp')
# 朋友预测的最高温度值
ax4.plot(dates,features['friend'])
ax4.set_xlabel('');ax4.set_ylabel('Temperature');ax4.set_title('Friend Forcast')
plt.tight_layout(pad=2)

预处理

对于一周的标签,并不是数值特征,需要将其转换为特征编码

# 独热编码
features = pd.get_dummies(features) #自动转换,自动添加后缀
features.head(5)
year month day temp_2 temp_1 average actual friend week_Fri week_Mon week_Sat week_Sun week_Thurs week_Tues week_Wed
0 2019 1 1 45 45 45.6 45 29 1 0 0 0 0 0 0
1 2019 1 2 44 45 45.7 44 61 0 0 1 0 0 0 0
2 2019 1 3 45 44 45.8 41 56 0 0 0 1 0 0 0
3 2019 1 4 44 41 45.9 40 53 0 1 0 0 0 0 0
4 2019 1 5 41 40 46.0 44 41 0 0 0 0 0 1 0
print(help(pd.get_dummies))
Help on function get_dummies in module pandas.core.reshape.reshape:

get_dummies(data, prefix=None, prefix_sep='_', dummy_na=False, columns=None, sparse=False, drop_first=False, dtype=None) -> 'DataFrame'
    Convert categorical variable into dummy/indicator variables.
    
    Parameters
    ----------
    data : array-like, Series, or DataFrame
        Data of which to get dummy indicators.
    prefix : str, list of str, or dict of str, default None
        String to append DataFrame column names.
        Pass a list with length equal to the number of columns
        when calling get_dummies on a DataFrame. Alternatively, `prefix`
        can be a dictionary mapping column names to prefixes.
    prefix_sep : str, default '_'
        If appending prefix, separator/delimiter to use. Or pass a
        list or dictionary as with `prefix`.
    dummy_na : bool, default False
        Add a column to indicate NaNs, if False NaNs are ignored.
    columns : list-like, default None
        Column names in the DataFrame to be encoded.
        If `columns` is None then all the columns with
        `object` or `category` dtype will be converted.
    sparse : bool, default False
        Whether the dummy-encoded columns should be backed by
        a :class:`SparseArray` (True) or a regular NumPy array (False).
    drop_first : bool, default False
        Whether to get k-1 dummies out of k categorical levels by removing the
        first level.
    dtype : dtype, default np.uint8
        Data type for new columns. Only a single dtype is allowed.
    
        .. versionadded:: 0.23.0
    
    Returns
    -------
    DataFrame
        Dummy-coded data.
    
    See Also
    --------
    Series.str.get_dummies : Convert Series to dummy codes.
    
    Examples
    --------
    >>> s = pd.Series(list('abca'))
    
    >>> pd.get_dummies(s)
       a  b  c
    0  1  0  0
    1  0  1  0
    2  0  0  1
    3  1  0  0
    
    >>> s1 = ['a', 'b', np.nan]
    
    >>> pd.get_dummies(s1)
       a  b
    0  1  0
    1  0  1
    2  0  0
    
    >>> pd.get_dummies(s1, dummy_na=True)
       a  b  NaN
    0  1  0    0
    1  0  1    0
    2  0  0    1
    
    >>> df = pd.DataFrame({'A': ['a', 'b', 'a'], 'B': ['b', 'a', 'c'],
    ...                    'C': [1, 2, 3]})
    
    >>> pd.get_dummies(df, prefix=['col1', 'col2'])
       C  col1_a  col1_b  col2_a  col2_b  col2_c
    0  1       1       0       0       1       0
    1  2       0       1       1       0       0
    2  3       1       0       0       0       1
    
    >>> pd.get_dummies(pd.Series(list('abcaa')))
       a  b  c
    0  1  0  0
    1  0  1  0
    2  0  0  1
    3  1  0  0
    4  1  0  0
    
    >>> pd.get_dummies(pd.Series(list('abcaa')), drop_first=True)
       b  c
    0  0  0
    1  1  0
    2  0  1
    3  0  0
    4  0  0
    
    >>> pd.get_dummies(pd.Series(list('abc')), dtype=float)
         a    b    c
    0  1.0  0.0  0.0
    1  0.0  1.0  0.0
    2  0.0  0.0  1.0

None
# 数据与标签
import numpy as np
# 标签
labels = np.array(features['actual'])
# 特征中去除标签
features = features.drop('actual',axis=1) # 按照列去掉
# 名字单独保留
feature_list = list(features.columns)
# 转换为合适的格式
features = np.array(features)
# 数据集切分
from sklearn.model_selection import train_test_split
train_features, test_features, train_labels, test_labels = train_test_split(features,labels,test_size=0.25,random_state=42)
print('训练集特征:',train_features.shape)
print('训练集标签:',train_labels.shape)
print('测试集标签:',test_features.shape)
print('测试机标签:',test_labels.shape)
训练集特征: (261, 14)
训练集标签: (261,)
测试集标签: (87, 14)
测试机标签: (87,)

随机森林回归模型

先建立1000棵决策树尝试,参数默认

先使用MAPE指标进行评估,这个评估是平均绝对百分误差,由于样本量比较少,训练模型的速率较快

from sklearn.ensemble import RandomForestRegressor
rf = RandomForestRegressor(n_estimators=1000,random_state=42)
rf.fit(train_features,train_labels)
predictions = rf.predict(test_features)
errors = abs(predictions - test_labels)
mape = 100 * (errors / test_labels)
print('MAPE:',np.mean(mape))
MAPE: 6.016378550202468

树模型可视化

from sklearn.tree import export_graphviz
import pydot
import os
# os.environ["PATH"] += os.pathsep + 'S:/Graphviz/bin/'
tree = rf.estimators_[5]
export_graphviz(tree,out_file="tree.dot",feature_names=feature_list,rounded=True,precision=1)
(graph,) = pydot.graph_from_dot_file('./tree.dot')
graph.write_png('./tree.png')

tree.png
# 限制树模型
rf_small = RandomForestRegressor(n_estimators=10,max_depth=3,random_state=42)
rf_small.fit(train_features,train_labels)
tree_small = rf_small.estimators_[5]
export_graphviz(tree_small,out_file='small_tree.dot',feature_names=feature_list,rounded=True,precision=1)
(graph,) = pydot.graph_from_dot_file('small_tree.dot')
graph.write_png('small_tree.png')

small_tree.png
# 决策树特征重要性
importances = list(rf.feature_importances_)
# 格式转换
feature_importances = [(feature, round(importance, 2)) for feature, importance in zip(feature_list,importances)]
feature_importances = sorted(feature_importances,key=lambda x:x[1],reverse=True)
# 打印
[print('Variable:{:20} importance: {}'.format(*pair)) for pair in feature_importances]
Variable:temp_1               importance: 0.69
Variable:average              importance: 0.2
Variable:day                  importance: 0.03
Variable:friend               importance: 0.03
Variable:temp_2               importance: 0.02
Variable:month                importance: 0.01
Variable:year                 importance: 0.0
Variable:week_Fri             importance: 0.0
Variable:week_Mon             importance: 0.0
Variable:week_Sat             importance: 0.0
Variable:week_Sun             importance: 0.0
Variable:week_Thurs           importance: 0.0
Variable:week_Tues            importance: 0.0
Variable:week_Wed             importance: 0.0





[None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None]
# 绘制为直方图
x_values = list(range(len(importances)))
plt.bar(x_values,importances,orientation='vertical')
plt.xticks(x_values,feature_list,rotation='vertical')
plt.ylabel('Importance');plt.xlabel('Variable');plt.title('Variable Importances')
Text(0.5, 1.0, 'Variable Importances')

观察发现,使用重要性最好的特征建模,可能会有更好的结果,虽然不一定成功,但是速度一定会更快

# 尝试使用最重要的两个特征
rf_most_important = RandomForestRegressor(n_estimators=1000,random_state=42)
# 最重要特征
important_indices = [feature_list.index('temp_1'),feature_list.index('average')]
train_important = train_features[:,important_indices]
test_important = test_features[:,important_indices]
# 重新训练模型
rf_most_important.fit(train_important,train_labels)
# 预测结果
predictions = rf_most_important.predict(test_important)
errors = abs(predictions-test_labels)
# 评估结果,保留两位小数
print('Mean Absolute Error:',round(np.mean(errors),2),'%')
mape = np.mean(100*(errors/test_labels))
print('mape:',mape)
Mean Absolute Error: 3.92 %
mape: 6.243108595734665

这里发现,mape的值从6.0上升到6.2,并没有下降,说明不能只选择最重要的特征

首次预测

# 日期
months = features[:,feature_list.index('month')]
days = features[:,feature_list.index('day')]
years = features[:,feature_list.index('year')]
# 转换日期
dates = [str(int(year))+'-'+str(int(month))+'-'+str(int(day)) for year, month, day in zip(years,months,days)]
dates = [datetime.datetime.strptime(date,'%Y-%m-%d') for date in dates]
# 创建表格保存日期和其对应的标签数据
true_data = pd.DataFrame(data={
   'date':dates,'actual':labels})
# 另一个表格表示日期和对应预测值
months = test_features[:,feature_list.index('month')]
days = test_features[:,feature_list.index('day')]
years = test_features[:,feature_list.index('year')]
test_dates = [str(int(year))+'-'+str(int(month))+'-'+str(int(day)) for year,month,day in zip(years,months,days)]
test_dates = [datetime.datetime.strptime(date,'%Y-%m-%d') for date in test_dates]
predictions_data = pd.DataFrame(data = {
   'date':test_dates,'prediction':predictions})
# 真实值
plt.plot(true_data['date'],true_data['actual'],'b-',label='actual')
# 预测值
plt.plot(predictions_data['date'],predictions_data['prediction'],'ro',label='prediction')
plt.xticks(rotation='60')
(array([17897., 17956., 18017., 18078., 18140., 18201., 18262.]),
 [Text(0, 0, ''),
  Text(0, 0, ''),
  Text(0, 0, ''),
  Text(0, 0, ''),
  Text(0, 0, ''),
  Text(0, 0, ''),
  Text(0, 0, '')])

深入数据分析

  1. 如果可以利用的数据量增大,会对结果产生什么影响呢
  2. 加入的新特征会改进模型效果吗,此时的时间效率又怎么样

新数据集预测

import pandas as pd
features = pd.read_csv('temps_extended.csv')
features.head(5)
year month day weekday ws_1 prcp_1 snwd_1 temp_2 temp_1 average actual friend
0 2011 1 1 Sat 4.92 0.00 0 36 37 45.6 40 40
1 2011 1 2 Sun 5.37 0.00 0 37 40 45.7 39 50
2 2011 1 3 Mon 6.26 0.00 0 40 39 45.8 42 42
3 2011 1 4 Tues 5.59 0.00 0 39 42 45.9 38 59
4 2011 1 5 Wed 3.80 0.03 0 42 38 46.0 45 39
print('数据规模',features.shape)
数据规模 (2191, 12)

日期处理

# 时间转化,用标准时间格式方便后续工作
import datetime
years = features['year']
months = features['month']
days = features['day']
dates = [str(int(year)) + '-' + str(int(month)) + '-' + str(int(day)) for year, month, day in zip(years,months,days)]
dates = [datetime.datetime.strptime(date,'%Y-%m-%d') for date in dates]
dates[:5]
[datetime.datetime(2011, 1, 1, 0, 0),
 datetime.datetime(2011, 1, 2, 0, 0),
 datetime.datetime(2011, 1, 3, 0, 0),
 datetime.datetime(2011, 1, 4, 0, 0),
 datetime.datetime(2011, 1, 5, 0, 0)]

特征可视化

# 对新特征进行可视化展示
fig,((ax1,ax2),(ax3,ax4)) = plt.subplots(nrows=2,ncols=2,figsize=(15,10))
fig.autofmt_xdate(rotation=45)
# 平均最高气温
ax1.plot(dates,features['average'])
ax1.set_xlabel('');ax1.set_ylabel('Tempertature (F)');ax1.set_title('Historical Avg Max Temp')
# 风速
ax2.plot(dates,features['ws_1'],'r-')
ax2.set_xlabel('');ax2.set_ylabel('Wind Speed (mph))');ax2.set_title('Prior Wind Speed')
# 降水
ax3.plot(dates,features['prcp_1'],'r-')
ax3.set_xlabel('Date');ax3.set_ylabel('Precipitation (in)');ax3.set_title('Prior Precipitation')
# 积雪
ax4.plot(dates,features['snwd_1'],'ro')
ax4.set_xlabel('Date');ax4.set_ylabel('Snow Depth (in)');ax4.set_title('Prior Snow Depth')

plt.tight_layout(pad=2)

特征工程

  1. 尽可能多的选择有价值的特征,初始阶段的信息越多,建模可以利用的信息就越多
  2. 通常分析过程中可能会发现新的可以利用的特征,这个时候又放回去加入,过程比较繁琐,同时还需要实验进行对比。
  3. 因此需要在初始阶段就尽可能完善预处理与特征提取工作

关于这份数据:

  1. 天气变换与季节因素有关,然而数据集中并没有体现季节的特征,可以自己创建
# 季节变量
seasons = []
for month in features['month']:
    if month in [1,2,12]:
        seasons.append('winter')
    elif month in [3,4,5]:
        seasons.append('spring')
    elif month in [6,7,8]:
        seasons.append('summer')
    elif month in [9,10,11]:
        seasons.append('fall')
reduced_features = features[['temp_1','prcp_1','average','actual']]
reduced_features['season'] = seasons
<ipython-input-23-b690764ce900>:13: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  reduced_features['season'] = seasons
import seaborn as sns
sns.set(style='ticks',color_codes=True)
# 主题
palette = sns.xkcd_palette(['dark blue','dark green','gold','orange'])
# pairplot绘图
sns.pairplot(reduced_features,hue='season',diag_kind='kde',palette=palette,plot_kws=dict(alpha=0.7),diag_kws=dict(shade=True))
<seaborn.axisgrid.PairGrid at 0x232067fa190>

分析的方法:不同的颜色代表不同的季节(hue),主对角线是同一个参数,表示数值在不同季节的分布。其他都是散点图,代表了各个特征之间的相关性

数据量对结果的影响分析

# 独热编码
features = pd.get_dummies(features)
# 提取特征和标签
labels = features['actual']
features = features.drop('actual',axis=1)
# 特征名字留着备用
feature_list = list(features.columns)
# 转换为所需格式
import numpy as np
features = np.array(features)
labels = np.array(labels)
# 数据集切分
from sklearn.model_selection import train_test_split
train_features,test_features,train_labels,test_labels = train_test_split(features,labels,test_size=0.25,random_state=0)
print("训练集特征:",train_features.shape)
print("训练集标签:",train_labels.shape)
print("测试集特征:",test_features.shape)
print("测试集标签:",test_labels.shape)
训练集特征: (1643, 17)
训练集标签: (1643,)
测试集特征: (548, 17)
测试集标签: (548,)

虽然是新样本,但是需要使用相同的测试集来对比结果。

# 对之前数据惊喜能够处理,对比
import pandas as pd
import numpy as np
# 统一特征
original_feature_indices = [feature_list.index(feature) for feature in feature_list if feature not in ['ws_1','prcp_1','snwd_1']]
# 重新读取老数据
original_features = pd.read_csv('temps.csv')
original_features = pd.get_dummies(original_features)
# 数据标签转换
original_labels = np.array(original_features['actual'])
original_features = original_features.drop('actual',axis=1)
original_feature_list = list(original_features.columns)
original_features = np.array(original_features)

先前数据切分

# 数据集切分
from sklearn.model_selection import train_test_split
original_train_features,original_test_features,original_train_labels,original_test_labels = train_test_split(original_features,original_labels,test_size=0.25,random_state=42)
# 数据建模
from sklearn.ensemble import RandomForestRegressor
# 同样参数和随机种子
rf = RandomForestRegressor(n_estimators=100,random_state=0)
# 老数据集
rf.fit(original_train_features,original_train_labels)
# 统一使用一个测试集,为了公平
predictions = rf.predict(test_features[:,original_feature_indices])
errors = abs(predictions-test_labels)
print('平均温度误差:',round(np.mean(errors),2),'°')
mape = 100 *(errors/test_labels)
# 为了观察设定准确率
accuracy = 100 -np.mean(mape)
print('Accuracy:',round(accuracy,2),'%')
平均温度误差: 4.68 °
Accuracy: 92.19 %

上面是样本较少的结果,这次观察样本增多会更好吗

from sklearn.ensemble import RandomForestRegressor
# 保证标签一致 这里改变了原白哦钱的维度方便计算
original_train_changeed_features = train_features[:,original_feature_indices]
original_test_changed_features = test_features[:,original_feature_indices]
rf = RandomForestRegressor(n_estimators=100,random_state=0)
rf.fit(original_train_changeed_features,train_labels)
# 预测
baseline_predictions = rf.predict(original_test_changed_features)
# 结果
baseline_errors = abs(baseline_predictions-test_labels)
print('平均温度误差:',round(np.mean(baseline_errors),2),'%')
baseline_mape = 100 * np.mean(baseline_errors/test_labels)
# 准确率
baseline_accuracy = 100 - baseline_mape
print('Accuracy:',round(baseline_accuracy,2),'%')
平均温度误差: 4.2 %
Accuracy: 93.12 %

特征的数量对结果影响

加入两次比较没有比较的新的天气特征

from sklearn.ensemble import RandomForestRegressor
rf_exp = RandomForestRegressor(n_estimators=100,random_state=0)
rf_exp.fit(train_features,train_labels)
# 同一测试集
predictions = rf_exp.predict(test_features)
# 评估
errors = abs(predictions - test_labels)
print('平均温度误差:',round(np.mean(errors),2),"%")
mape = np.mean(100*(errors/test_labels))
improvement_baseline = 100 * abs(mape-baseline_mape) / baseline_mape
print('特征增多以后模型效果变化:',round(improvement_baseline,2),'%')
# 准确率
accuracy = 100 - mape
print('Accuracy:',round(accuracy,2),'%')
平均温度误差: 4.05 %
特征增多以后模型效果变化: 3.34 %
Accuracy: 93.35 %
importances = list(rf_exp.feature_importances_)
# 名字和数值拼接在一起
feature_importances = [(feature,round(importance,2)) for feature,importance in zip(feature_list,importances)]
# 排序
feature_importances = sorted(feature_importances,key=lambda x:x[1],reverse=True)
# 打印结果
[print('Variable:{:20} Importance: {}'.format(*pair)) for pair in feature_importances]
Variable:temp_1               Importance: 0.85
Variable:average              Importance: 0.05
Variable:ws_1                 Importance: 0.02
Variable:friend               Importance: 0.02
Variable:year                 Importance: 0.01
Variable:month                Importance: 0.01
Variable:day                  Importance: 0.01
Variable:prcp_1               Importance: 0.01
Variable:temp_2               Importance: 0.01
Variable:snwd_1               Importance: 0.0
Variable:weekday_Fri          Importance: 0.0
Variable:weekday_Mon          Importance: 0.0
Variable:weekday_Sat          Importance: 0.0
Variable:weekday_Sun          Importance: 0.0
Variable:weekday_Thurs        Importance: 0.0
Variable:weekday_Tues         Importance: 0.0
Variable:weekday_Wed          Importance: 0.0





[None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None]
# 可视化重要指标
plt.style.use('fivethirtyeight')
x_values = list(range(len(importances)))
plt.bar(x_values,importances,orientation="vertical",color="r",edgecolor="k",linewidth=1.2)
plt.xticks(x_values,feature_list,rotation='vertical')
plt.ylabel('Importance')
plt.xlabel('Variable')
plt.title('Variable Importances')
Text(0.5, 1.0, 'Variable Importances')

具体选择几个特征还是比较模糊,可以先把特征按照其重要性进行排序后再计算累计值,设置阈值,观察需要多少特征累加在一起后,特征重要性的累加值才能超过阈值

sorted_importances = [importance[1] for importance  in feature_importances]
sorted_features = [importance[0] for importance in feature_importances]
# 累计重要性
cumulative_importances = np.cumsum(sorted_importances)
# 绘制折线图
plt.plot(x_values,cumulative_importances,'g-')
plt.hlines(y=0.95,xmin=0,xmax=len(sorted_importances),color='r',linestyles='dashed')
plt.xticks(x_values,sorted_features,rotation='vertical')
plt.xlabel('Variable');plt.ylabel('Cumulative Importance')
plt.title('Cumulative Importances')
Text(0.5, 1.0, 'Cumulative Importances')

当使用前五个特征,总体的准确率累加值超过0.95,如果只使用这5个特征建模,观察结果

important_feature_names = [feature[0] for feature in feature_importances[0:5]]
# 名字
important_indices = [feature_list.index(feature) for feature in important_feature_names]
# 训练集
important_train_features = train_features[:,important_indices]
important_test_features = test_features[:,important_indices]
# 数据维度
print("important train features shape:",important_train_features.shape)
print("important test features shape:",important_test_features.shape)
# 训练模型
rf_exp.fit(important_train_features,train_labels)
# 同样的测试集
predictions = rf_exp.predict(important_test_features)
# 评估
errors = abs(predictions-test_labels)
print('平均温度误差:',round(np.mean(errors),2),"°")
mape = 100*(errors/test_labels)
accuracy = 100 - np.mean(mape)
print('Accuracy:',round(accuracy,2),"%")
important train features shape: (1643, 5)
important test features shape: (548, 5)
平均温度误差: 4.11 °
Accuracy: 93.28 %

虽然没有提升效率,那么观察一下在模型时间效率上面有没有提高·

import time
all_features_time = []
for _ in range(10):
    start_time = time.time()
    rf_exp.fit(train_features,train_labels)
    all_features_predictions = rf_exp.predict(test_features)
    end_time = time.time()
    all_features_time.append(end_time-start_time)
    
all_features_time = np.mean(all_features_time)
print("使用所有特征与测试的平均时间消耗:",round(all_features_time,2),'s')
使用所有特征与测试的平均时间消耗: 0.69 s
# 只选用重要特征训练时
reduced_features_time = []
for _ in range(10):
    start_time = time.time()
    rf_exp.fit(important_train_features,train_labels)
    reduced_features_predictions = rf_exp.predict(important_test_features)
    end_time = time.time()
    reduced_features_time.append(end_time-start_time)
    
reduced_features_time = np.mean(reduced_features_time)
print("使用重要特征与测试的平均时间消耗:",round(reduced_features_time,2),'s')
使用重要特征与测试的平均时间消耗: 0.42 s
# 原始模型时间效率
original_features_time =[]
for _ in range(10):
    start_time =time.time()
    rf.fit(original_train_features,original_train_labels)
    original_features_predictions =rf.predict(test_features[:,original_feature_indices])
    end_time =time.time()
    original_features_time.append(end_time -start_time)
original_features_time =np.mean(original_features_time)

print("使用原始模型测试的平均时间消耗:",round(original_features_time,2),'s')
使用原始模型测试的平均时间消耗: 0.18 s
# 对比展示
all_accuracy = 100 * (1-np.mean(abs(all_features_predictions-test_labels)/test_labels))
reduced_accuracy = 100 * (1-np.mean(abs(reduced_features_predictions-test_labels)/test_labels))

# 保存结果并展示
comparision = pd.DataFrame({
   'features':['all(17)','reduced(5)'],
                           'runtime':[round(all_features_time,2),round(reduced_features_time,2)],
                           'accuracy':[round(all_accuracy,2),round(reduced_accuracy,2)]})
comparision[['features','accuracy','runtime']]
features accuracy runtime
0 all(17) 93.35 0.69
1 reduced(5) 93.28 0.42

单独两个结果比较

准确率只是为了方面安眠自己定义,用于对比分析。结果如上表所见,准确率基本没有发生明显变化,时间效率上倒是有很大的提升。

# 时间效率可能会比准确率更加优先考虑
relative_accuracy_decrease =  100 * (all_accuracy - reduced_accuracy) / all_accuracy
print('相对accuracy提升:',round(relative_accuracy_decrease,3),"%")
relative_runtime_decrease = 100 * (all_features_time - reduced_features_time) / all_features_time
print("相对时间效率提升:",round(relative_runtime_decrease,3),"%")
相对accuracy提升: 0.071 %
相对时间效率提升: 39.17 %

各个实验结果

# 原模型的预测温度对比
original_mae = np.mean(abs(original_features_predictions -test_labels))
# 所有特征预测温度对比
exp_all_mae = np.mean(abs(all_features_predictions -test_labels))
# 重要特征预测温度对比
exp_reduced_mae = np.mean(abs(reduced_features_predictions -test_labels))
# 原模型的准确率
original_accuracy = 100 * (1 - np.mean(abs(original_features_predictions - test_labels) /test_labels))
model_comparison = pd.DataFrame({
   'model': ['original', 'exp_all', 'exp_reduced'],
                                'error (degrees)': [original_mae, exp_all_mae, exp_reduced_mae],
                                'accuracy': [original_accuracy, all_accuracy, reduced_accuracy],
                                'run_time (s)': [original_features_time, all_features_time, reduced_features_time]})
# 汇聚所有实验结果
fig, (ax1,ax2,ax3) = plt.subplots(nrows=1,ncols=3,figsize=(16,5),sharex=True)
# X轴
x_values = [0,1,2]
labels = list(model_comparison['model'])
plt.xticks(x_values,labels)
# 字体大小
fontdict = {
   'fontsize':18}
fontdict_yaxis = {
   'fontsize':14}
# 预测温度和真实温度的比对比
ax1.bar(x_values,model_comparison['error (degrees)'], color=['b','r','g'],edgecolor='k',linewidth=1.5)
ax1.set_ylim(bottom=3.5, top=4.5)
ax1.set_ylabel('Error (degree) (F)',fontdict=fontdict_yaxis)
ax1.set_title('Model Error Comparison',fontdict=fontdict)
# 准确率对比
ax2.bar(x_values,model_comparison['accuracy'],color=['b','r','g'],edgecolor='k',linewidth=1.5)
ax2.set_ylim(bottom=92, top=94)
ax2.set_ylabel('Accuracy (%)',fontdict=fontdict_yaxis)
ax2.set_title('Model Accuracy Comparision',fontdict=fontdict)
# 时间效率对比
ax3.bar(x_values,model_comparison['run_time (s)'], color=['b','r','g'],edgecolor='k',linewidth=1.5)
ax3.set_ylim(bottom=0,top=1)
ax3.set_ylabel('run_time (s)',fontdict=fontdict_yaxis)
ax3.set_title('Model Run-Time Comparison',fontdict=fontdict)
plt.show()

模型调参

from sklearn.ensemble import RandomForestRegressor
from pprint import pprint
rf = RandomForestRegressor(random_state=42)
pprint(rf.get_params())
{'bootstrap': True,
 'ccp_alpha': 0.0,
 'criterion': 'mse',
 'max_depth': None,
 'max_features': 'auto',
 'max_leaf_nodes': None,
 'max_samples': None,
 'min_impurity_decrease': 0.0,
 'min_impurity_split': None,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'n_estimators': 100,
 'n_jobs': None,
 'oob_score': False,
 'random_state': 42,
 'verbose': 0,
 'warm_start': False}

自动随机调参

所选择的参数值的候选根据实际情况,每一个参数的取值范围都需要好好把控

from sklearn.model_selection import RandomizedSearchCV
# 建立树的个数
n_estimators = [int(x) for x in np.linspace(start=200,stop=2000,num=10)]
# 最大特征的选择方法
max_features = ['auto','sqrt']
# 树最大深度
max_depth = [int(x) for x in np.linspace(10,20,num=2)]
max_depth.append(None)
# 节点最小分裂所需要的样本个数
min_samples_split = [2,5,10]
# 叶子节点最小的样本数
min_samples_leaf = [1,2,4]
# 样本采样方法
bootstrap = [True,False]
# 随机参数空间
random_grid ={
   'n_estimators':n_estimators,
             'max_features':max_features,
             'max_depth':max_depth,
             'min_samples_split':min_samples_split,
             'min_samples_leaf':min_samples_leaf,
             'bootstrap':bootstrap}

随机组合参数

rf = RandomForestRegressor()
rf_random = RandomizedSearchCV(estimator=rf, # 指定调参模型
                            param_distributions=random_grid, # 指定候选参数列表
                            n_iter=100, # 随机选择参数组合的个数,这里是随机选择100组,找这中间最好的
                            scoring='neg_mean_absolute_error', # 评估方法
                            cv=3, # 交叉验证
                            verbose=2, # 打印信息的数量
                            random_state=42, # 随机种子,随便选
                            n_jobs=-1) # 多线程数目,如果-1代表使用所有线程
# 寻找开始
rf_random.fit(train_features,train_labels)
rf_random.best_params_
# 评估结果
def evaluate(model,test_features,test_labels):
    predictions = model.predict(test_features)
    errors = abs(predictions - test_labels)
    mape = 100 * np.mean(errors / test_labels)
    accuracy = 100 - mape
    print('平均气温误差:',np.mean(errors))
    print('Accuracy = {:0.2f}%'.format(accuracy))
# 默认参数结果
base_model = RandomForestRegressor(random_state=42)
base_model.fit(train_features,train_labels)
evaluate(base_model,test_features,test_labels)
# 新配方
best_random = rf_random.best_estimator_
evaluate(best_random,test_features,test_labels)

网络参数搜索

地毯式搜索最优参数,注意这里的地毯也是根据之前的随机调参得到的大致范围

{
   'n_estimators': 1800,
 'min_samples_split': 10,
 'min_samples_leaf': 4,
 'max_features': 'auto',
 'max_depth': None,
 'bootstrap': True}
from sklearn.model_selection import GridSearchCV
# 候选参数空间
param_grid = {
   
    'n_estimators':[1600,1700,1800,1900,2000],
    'max_features':['auto'],
    'max_depth':[8,10,12],
    'min_samples_split':[3,5,7],
    'min_samples_leaf':[2.3,4,5,6],
    'bootstrap':[True]
}
# 基本算法模型
rf = RandomForestRegressor()
# 网格搜索
grid_search = GridSearchCV(estimator=rf,
                          param_grid=param_grid,
                          scoring='neg_mean_absolute_error',
                          cv=3,
                          n_jobs=-1,
                          verbose=2)
# 搜索开始
grid_search.fit(train_features,train_labels)

网格搜索的时候,如果参数空间过大,遍历次数过多时,将所有可能分成不同的小组进行,比较各个组的最优

贝叶斯优化

from hyperopt import fmin, tpe, hp, STATUS_OK, Trials
from sklearn.model_selection import cross_val_score
def hyperopt_train_test(params):
    clf = RandomForestRegressor(**params)
    return cross_val_score(clf,train_features,train_labels).mean()
max_depth = [i for i in range(10,20)]
# max_depth.append(None)
space4rf = {
   
    'max_depth': hp.choice('max_depth', max_depth),
    'max_features': hp.choice('max_features', ['auto','sqrt']),
    'min_samples_split':hp.choice('min_samples_split',range(5,20)),
    'min_samples_leaf':hp.choice('min_samples_leaf',range(2,10)),
    'n_estimators': hp.choice('n_estimators', range(1000,2000)),
    'bootstrap':hp.choice('bootstrap',[True,False])
}

best = 0
def f(params):
    global best
    acc = hyperopt_train_test(params)
    if acc > best:
        best = acc
        print('new best:', best, params)
    return {
   'loss': -acc, 'status': STATUS_OK}

trials = Trials()
best = fmin(f, space4rf, algo=tpe.suggest, max_evals=100, trials=trials)
print("best:",best)
new best:                                                                                                              
0.8670547390932652                                                                                                     
{'bootstrap': True, 'max_depth': 18, 'max_features': 'auto', 'min_samples_leaf': 4, 'min_samples_split': 11, 'n_estimators': 1942}
new best:                                                                                                              
0.8679298658104889                                                                                                     
{'bootstrap': True, 'max_depth': 16, 'max_features': 'auto', 'min_samples_leaf': 7, 'min_samples_split': 10, 'n_estimators': 1734}
new best:                                                                                                              
0.8684034111523674                                                                                                     
{'bootstrap': True, 'max_depth': 14, 'max_features': 'auto', 'min_samples_leaf': 9, 'min_samples_split': 19, 'n_estimators': 1766}
new best:                                                                                                              
0.8685636302610934                                                                                                     
{'bootstrap': True, 'max_depth': 16, 'max_features': 'auto', 'min_samples_leaf': 9, 'min_samples_split': 5, 'n_estimators': 1439}
new best:                                                                                                              
0.8685767383919801                                                                                                     
{'bootstrap': True, 'max_depth': 16, 'max_features': 'auto', 'min_samples_leaf': 9, 'min_samples_split': 5, 'n_estimators': 1404}
new best:                                                                                                              
0.8685919671759731                                                                                                     
{'bootstrap': True, 'max_depth': 19, 'max_features': 'auto', 'min_samples_leaf': 9, 'min_samples_split': 15, 'n_estimators': 1830}
new best:                                                                                                              
0.8686049353034605                                                                                                     
{'bootstrap': True, 'max_depth': 19, 'max_features': 'auto', 'min_samples_leaf': 9, 'min_samples_split': 16, 'n_estimators': 1099}
new best:                                                                                                              
0.8686240725941452                                                                                                     
{'bootstrap': True, 'max_depth': 19, 'max_features': 'auto', 'min_samples_leaf': 9, 'min_samples_split': 16, 'n_estimators': 1088}
100%|█████████████████████████████████████████████| 100/100 [39:13<00:00, 26.47s/trial, best loss: -0.8686240725941452]
best: {'bootstrap': 0, 'max_depth': 9, 'max_features': 0, 'min_samples_leaf': 7, 'min_samples_split': 11, 'n_estimators': 88}

转载:https://blog.csdn.net/weixin_43645287/article/details/116380002
查看评论
* 以上用户言论只代表其个人观点,不代表本网站的观点或立场