Preface
This hands-on series follows tutorials found online, mainly the approach and code from the book 《跟着迪哥学习机器学习》. The book's code appears to target Python 2, and parts of it are problematic: some key steps are omitted and some lines have syntax errors, so it cannot simply be typed in and run end to end. This post collects working code and the dataset (the download link tends to expire; message me privately if you need it).
Series navigation
Approach
Random forest modeling -> feature selection -> efficiency comparison -> hyperparameter tuning
import pandas as pd
features = pd.read_csv("temps.csv")
features.head(5)
| | year | month | day | week | temp_2 | temp_1 | average | actual | friend |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2019 | 1 | 1 | Fri | 45 | 45 | 45.6 | 45 | 29 |
| 1 | 2019 | 1 | 2 | Sat | 44 | 45 | 45.7 | 44 | 61 |
| 2 | 2019 | 1 | 3 | Sun | 45 | 44 | 45.8 | 41 | 56 |
| 3 | 2019 | 1 | 4 | Mon | 44 | 41 | 45.9 | 40 | 53 |
| 4 | 2019 | 1 | 5 | Tues | 41 | 40 | 46.0 | 44 | 41 |
Tasks
- Basic modeling:
  - data preprocessing
  - feature visualization
  - modeling + visualization
- Effect of the number of samples on the results: increase the sample size, observe how the results change, revisit feature engineering, introduce new features, and observe the results again
- Tune the algorithm's hyperparameters to find the best settings, using two classic tuning methods
Feature visualization and preprocessing
Check the size of the data.
print("Data dimensions:",features.shape)
Data dimensions: (348, 9)
features.describe()
| | year | month | day | temp_2 | temp_1 | average | actual | friend |
|---|---|---|---|---|---|---|---|---|
| count | 348.0 | 348.000000 | 348.000000 | 348.000000 | 348.000000 | 348.000000 | 348.000000 | 348.000000 |
| mean | 2019.0 | 6.477011 | 15.514368 | 62.652299 | 62.701149 | 59.760632 | 62.543103 | 60.034483 |
| std | 0.0 | 3.498380 | 8.772982 | 12.165398 | 12.120542 | 10.527306 | 11.794146 | 15.626179 |
| min | 2019.0 | 1.000000 | 1.000000 | 35.000000 | 35.000000 | 45.100000 | 35.000000 | 28.000000 |
| 25% | 2019.0 | 3.000000 | 8.000000 | 54.000000 | 54.000000 | 49.975000 | 54.000000 | 47.750000 |
| 50% | 2019.0 | 6.000000 | 15.000000 | 62.500000 | 62.500000 | 58.200000 | 62.500000 | 60.000000 |
| 75% | 2019.0 | 10.000000 | 23.000000 | 71.000000 | 71.000000 | 69.025000 | 71.000000 | 71.000000 |
| max | 2019.0 | 12.000000 | 31.000000 | 117.000000 | 117.000000 | 77.400000 | 92.000000 | 95.000000 |
Notes on the summary:
- the count row shows whether the dataset has missing values (every column has 348 entries here, so none are missing)
- mean and std describe each feature's distribution
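As an extra sanity check (a minimal sketch using standard pandas calls on the features DataFrame loaded above), missing values and column types can also be inspected directly:
# count missing values per column (all zeros means no imputation is needed)
print(features.isnull().sum())
# check column dtypes before encoding
print(features.dtypes)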
# Convert to standard datetime format to simplify the later steps
import datetime
years = features['year']
months = features['month']
days = features['day']
dates = [str(int(year)) + '-' + str(int(month)) + '-' + str(int(day)) for year, month, day in zip(years,months,days)]
dates = [datetime.datetime.strptime(date,'%Y-%m-%d') for date in dates]
dates[:5]
[datetime.datetime(2019, 1, 1, 0, 0),
datetime.datetime(2019, 1, 2, 0, 0),
datetime.datetime(2019, 1, 3, 0, 0),
datetime.datetime(2019, 1, 4, 0, 0),
datetime.datetime(2019, 1, 5, 0, 0)]
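The same conversion can also be done in one vectorized call; a short sketch relying on pandas' ability to assemble datetimes from year/month/day columns:
# vectorized alternative to the list comprehensions above
dates_series = pd.to_datetime(features[['year', 'month', 'day']])
dates_series.head()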
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use('fivethirtyeight') # plotting style
Visualization
Lay out a 2x2 grid for a first look at four series: the actual max temperature (the label), yesterday's max temperature, the max temperature from two days prior, and the friend's prediction.
fig,((ax1,ax2),(ax3,ax4)) = plt.subplots(nrows=2,ncols=2,figsize=(10,10))
fig.autofmt_xdate(rotation=45)
# actual max temperature (the label)
ax1.plot(dates,features['actual'])
ax1.set_xlabel('');ax1.set_ylabel('Temperature');ax1.set_title('Max Temp')
# yesterday's max temperature
ax2.plot(dates,features['temp_1'])
ax2.set_xlabel('');ax2.set_ylabel('Temperature');ax2.set_title('Yesterday Max Temp')
# max temperature two days prior
ax3.plot(dates,features['temp_2'])
ax3.set_xlabel('');ax3.set_ylabel('Temperature');ax3.set_title('Two Days Prior Max Temp')
# friend's predicted max temperature
ax4.plot(dates,features['friend'])
ax4.set_xlabel('');ax4.set_ylabel('Temperature');ax4.set_title('Friend Forecast')
plt.tight_layout(pad=2)
Preprocessing
The day-of-week column is not numeric, so it needs to be converted into encoded features.
# one-hot encoding
features = pd.get_dummies(features) # converts object/category columns automatically and uses the original column name as a prefix for the new columns
features.head(5)
| | year | month | day | temp_2 | temp_1 | average | actual | friend | week_Fri | week_Mon | week_Sat | week_Sun | week_Thurs | week_Tues | week_Wed |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2019 | 1 | 1 | 45 | 45 | 45.6 | 45 | 29 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 2019 | 1 | 2 | 44 | 45 | 45.7 | 44 | 61 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 2 | 2019 | 1 | 3 | 45 | 44 | 45.8 | 41 | 56 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 3 | 2019 | 1 | 4 | 44 | 41 | 45.9 | 40 | 53 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 4 | 2019 | 1 | 5 | 41 | 40 | 46.0 | 44 | 41 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
print(help(pd.get_dummies))
Help on function get_dummies in module pandas.core.reshape.reshape:
get_dummies(data, prefix=None, prefix_sep='_', dummy_na=False, columns=None, sparse=False, drop_first=False, dtype=None) -> 'DataFrame'
Convert categorical variable into dummy/indicator variables.
Parameters
----------
data : array-like, Series, or DataFrame
Data of which to get dummy indicators.
prefix : str, list of str, or dict of str, default None
String to append DataFrame column names.
Pass a list with length equal to the number of columns
when calling get_dummies on a DataFrame. Alternatively, `prefix`
can be a dictionary mapping column names to prefixes.
prefix_sep : str, default '_'
If appending prefix, separator/delimiter to use. Or pass a
list or dictionary as with `prefix`.
dummy_na : bool, default False
Add a column to indicate NaNs, if False NaNs are ignored.
columns : list-like, default None
Column names in the DataFrame to be encoded.
If `columns` is None then all the columns with
`object` or `category` dtype will be converted.
sparse : bool, default False
Whether the dummy-encoded columns should be backed by
a :class:`SparseArray` (True) or a regular NumPy array (False).
drop_first : bool, default False
Whether to get k-1 dummies out of k categorical levels by removing the
first level.
dtype : dtype, default np.uint8
Data type for new columns. Only a single dtype is allowed.
.. versionadded:: 0.23.0
Returns
-------
DataFrame
Dummy-coded data.
See Also
--------
Series.str.get_dummies : Convert Series to dummy codes.
Examples
--------
>>> s = pd.Series(list('abca'))
>>> pd.get_dummies(s)
a b c
0 1 0 0
1 0 1 0
2 0 0 1
3 1 0 0
>>> s1 = ['a', 'b', np.nan]
>>> pd.get_dummies(s1)
a b
0 1 0
1 0 1
2 0 0
>>> pd.get_dummies(s1, dummy_na=True)
a b NaN
0 1 0 0
1 0 1 0
2 0 0 1
>>> df = pd.DataFrame({'A': ['a', 'b', 'a'], 'B': ['b', 'a', 'c'],
... 'C': [1, 2, 3]})
>>> pd.get_dummies(df, prefix=['col1', 'col2'])
C col1_a col1_b col2_a col2_b col2_c
0 1 1 0 0 1 0
1 2 0 1 1 0 0
2 3 1 0 0 0 1
>>> pd.get_dummies(pd.Series(list('abcaa')))
a b c
0 1 0 0
1 0 1 0
2 0 0 1
3 1 0 0
4 1 0 0
>>> pd.get_dummies(pd.Series(list('abcaa')), drop_first=True)
b c
0 0 0
1 1 0
2 0 1
3 0 0
4 0 0
>>> pd.get_dummies(pd.Series(list('abc')), dtype=float)
a b c
0 1.0 0.0 0.0
1 0.0 1.0 0.0
2 0.0 0.0 1.0
None
# features and labels
import numpy as np
# labels
labels = np.array(features['actual'])
# drop the label from the features
features = features.drop('actual',axis=1) # axis=1 drops the column
# keep the feature names for later
feature_list = list(features.columns)
# convert to a suitable format
features = np.array(features)
# train/test split
from sklearn.model_selection import train_test_split
train_features, test_features, train_labels, test_labels = train_test_split(features,labels,test_size=0.25,random_state=42)
print('Training features:',train_features.shape)
print('Training labels:',train_labels.shape)
print('Test features:',test_features.shape)
print('Test labels:',test_labels.shape)
Training features: (261, 14)
Training labels: (261,)
Test features: (87, 14)
Test labels: (87,)
Random forest regression model
Start by fitting 1,000 trees with default parameters.
Evaluate with MAPE (mean absolute percentage error) first; since the sample size is small, training is fast.
from sklearn.ensemble import RandomForestRegressor
rf = RandomForestRegressor(n_estimators=1000,random_state=42)
rf.fit(train_features,train_labels)
predictions = rf.predict(test_features)
errors = abs(predictions - test_labels)
mape = 100 * (errors / test_labels)
print('MAPE:',np.mean(mape))
MAPE: 6.016378550202468
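Because the same error/MAPE/accuracy computation is repeated throughout this post, a small hypothetical helper (just a convenience sketch, not from the book) can wrap it:
def report_metrics(predictions, test_labels):
    # mean absolute error, mean absolute percentage error, and the derived accuracy used later on
    errors = abs(predictions - test_labels)
    mape = 100 * np.mean(errors / test_labels)
    print('MAE:', round(np.mean(errors), 2))
    print('MAPE:', round(mape, 2), '%')
    print('Accuracy:', round(100 - mape, 2), '%')
report_metrics(predictions, test_labels)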
Visualizing a tree from the forest
from sklearn.tree import export_graphviz
import pydot
import os
# os.environ["PATH"] += os.pathsep + 'S:/Graphviz/bin/'
tree = rf.estimators_[5]
export_graphviz(tree,out_file="tree.dot",feature_names=feature_list,rounded=True,precision=1)
(graph,) = pydot.graph_from_dot_file('./tree.dot')
graph.write_png('./tree.png')
# a smaller, depth-limited forest so the tree stays readable
rf_small = RandomForestRegressor(n_estimators=10,max_depth=3,random_state=42)
rf_small.fit(train_features,train_labels)
tree_small = rf_small.estimators_[5]
export_graphviz(tree_small,out_file='small_tree.dot',feature_names=feature_list,rounded=True,precision=1)
(graph,) = pydot.graph_from_dot_file('small_tree.dot')
graph.write_png('small_tree.png')
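If Graphviz is not installed locally, scikit-learn's built-in plot_tree (available since scikit-learn 0.21) can draw the same small tree without the pydot/Graphviz dependency; a minimal sketch:
from sklearn.tree import plot_tree
fig, ax = plt.subplots(figsize=(12, 8))
# render the depth-limited tree directly with matplotlib
plot_tree(tree_small, feature_names=feature_list, rounded=True, precision=1, ax=ax)
plt.show()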
# feature importances from the forest
importances = list(rf.feature_importances_)
# pair each feature name with its importance
feature_importances = [(feature, round(importance, 2)) for feature, importance in zip(feature_list,importances)]
feature_importances = sorted(feature_importances,key=lambda x:x[1],reverse=True)
# print them
[print('Variable:{:20} importance: {}'.format(*pair)) for pair in feature_importances]
Variable:temp_1 importance: 0.69
Variable:average importance: 0.2
Variable:day importance: 0.03
Variable:friend importance: 0.03
Variable:temp_2 importance: 0.02
Variable:month importance: 0.01
Variable:year importance: 0.0
Variable:week_Fri importance: 0.0
Variable:week_Mon importance: 0.0
Variable:week_Sat importance: 0.0
Variable:week_Sun importance: 0.0
Variable:week_Thurs importance: 0.0
Variable:week_Tues importance: 0.0
Variable:week_Wed importance: 0.0
# bar chart of the feature importances
x_values = list(range(len(importances)))
plt.bar(x_values,importances,orientation='vertical')
plt.xticks(x_values,feature_list,rotation='vertical')
plt.ylabel('Importance');plt.xlabel('Variable');plt.title('Variable Importances')
Text(0.5, 1.0, 'Variable Importances')
Building a model with only the most important features might give comparable results; it is not guaranteed to, but it will certainly be faster.
# try a model with only the two most important features
rf_most_important = RandomForestRegressor(n_estimators=1000,random_state=42)
# indices of those features
important_indices = [feature_list.index('temp_1'),feature_list.index('average')]
train_important = train_features[:,important_indices]
test_important = test_features[:,important_indices]
# retrain the model
rf_most_important.fit(train_important,train_labels)
# predict
predictions = rf_most_important.predict(test_important)
errors = abs(predictions-test_labels)
# evaluate, rounded to two decimals
print('Mean Absolute Error:',round(np.mean(errors),2),'°')
mape = np.mean(100*(errors/test_labels))
print('mape:',mape)
Mean Absolute Error: 3.92 °
mape: 6.243108595734665
MAPE went up from about 6.0 to 6.2 instead of down, which shows that keeping only the most important features is not enough by itself.
Plotting the first predictions
# dates
months = features[:,feature_list.index('month')]
days = features[:,feature_list.index('day')]
years = features[:,feature_list.index('year')]
# convert to datetime
dates = [str(int(year))+'-'+str(int(month))+'-'+str(int(day)) for year, month, day in zip(years,months,days)]
dates = [datetime.datetime.strptime(date,'%Y-%m-%d') for date in dates]
# a table holding the dates and the corresponding true labels
true_data = pd.DataFrame(data={
'date':dates,'actual':labels})
# another table holding the dates and the corresponding predictions
months = test_features[:,feature_list.index('month')]
days = test_features[:,feature_list.index('day')]
years = test_features[:,feature_list.index('year')]
test_dates = [str(int(year))+'-'+str(int(month))+'-'+str(int(day)) for year,month,day in zip(years,months,days)]
test_dates = [datetime.datetime.strptime(date,'%Y-%m-%d') for date in test_dates]
predictions_data = pd.DataFrame(data = {
'date':test_dates,'prediction':predictions})
# true values
plt.plot(true_data['date'],true_data['actual'],'b-',label='actual')
# predicted values
plt.plot(predictions_data['date'],predictions_data['prediction'],'ro',label='prediction')
plt.xticks(rotation='60')
Digging deeper into the data
- What happens to the results if more data becomes available?
- Do new features improve the model, and what does that do to the runtime?
Predictions on the extended dataset
import pandas as pd
features = pd.read_csv('temps_extended.csv')
features.head(5)
| | year | month | day | weekday | ws_1 | prcp_1 | snwd_1 | temp_2 | temp_1 | average | actual | friend |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2011 | 1 | 1 | Sat | 4.92 | 0.00 | 0 | 36 | 37 | 45.6 | 40 | 40 |
| 1 | 2011 | 1 | 2 | Sun | 5.37 | 0.00 | 0 | 37 | 40 | 45.7 | 39 | 50 |
| 2 | 2011 | 1 | 3 | Mon | 6.26 | 0.00 | 0 | 40 | 39 | 45.8 | 42 | 42 |
| 3 | 2011 | 1 | 4 | Tues | 5.59 | 0.00 | 0 | 39 | 42 | 45.9 | 38 | 59 |
| 4 | 2011 | 1 | 5 | Wed | 3.80 | 0.03 | 0 | 42 | 38 | 46.0 | 45 | 39 |
print('Data size:',features.shape)
Data size: (2191, 12)
Date handling
# Convert to standard datetime format to simplify the later steps
import datetime
years = features['year']
months = features['month']
days = features['day']
dates = [str(int(year)) + '-' + str(int(month)) + '-' + str(int(day)) for year, month, day in zip(years,months,days)]
dates = [datetime.datetime.strptime(date,'%Y-%m-%d') for date in dates]
dates[:5]
[datetime.datetime(2011, 1, 1, 0, 0),
datetime.datetime(2011, 1, 2, 0, 0),
datetime.datetime(2011, 1, 3, 0, 0),
datetime.datetime(2011, 1, 4, 0, 0),
datetime.datetime(2011, 1, 5, 0, 0)]
Feature visualization
# visualize the new features
fig,((ax1,ax2),(ax3,ax4)) = plt.subplots(nrows=2,ncols=2,figsize=(15,10))
fig.autofmt_xdate(rotation=45)
# historical average max temperature
ax1.plot(dates,features['average'])
ax1.set_xlabel('');ax1.set_ylabel('Temperature (F)');ax1.set_title('Historical Avg Max Temp')
# wind speed
ax2.plot(dates,features['ws_1'],'r-')
ax2.set_xlabel('');ax2.set_ylabel('Wind Speed (mph)');ax2.set_title('Prior Wind Speed')
# precipitation
ax3.plot(dates,features['prcp_1'],'r-')
ax3.set_xlabel('Date');ax3.set_ylabel('Precipitation (in)');ax3.set_title('Prior Precipitation')
# snow depth
ax4.plot(dates,features['snwd_1'],'ro')
ax4.set_xlabel('Date');ax4.set_ylabel('Snow Depth (in)');ax4.set_title('Prior Snow Depth')
plt.tight_layout(pad=2)
Feature engineering
- Select as many potentially useful features as possible up front; the more information available at the start, the more the model has to work with.
- New usable features are often discovered mid-analysis and have to be added back in, which is tedious and forces the comparison experiments to be rerun.
- So it pays to make preprocessing and feature extraction as complete as possible at the initial stage.
About this dataset:
- Weather varies with the season, but the dataset has no explicit season feature, so we can build one ourselves from the month.
# season variable
seasons = []
for month in features['month']:
    if month in [1,2,12]:
        seasons.append('winter')
    elif month in [3,4,5]:
        seasons.append('spring')
    elif month in [6,7,8]:
        seasons.append('summer')
    elif month in [9,10,11]:
        seasons.append('fall')
reduced_features = features[['temp_1','prcp_1','average','actual']].copy() # .copy() avoids the SettingWithCopyWarning on the next assignment
reduced_features['season'] = seasons
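The same season column can also be built without an explicit loop; a sketch using a plain month-to-season dict and Series.map (the dict name is just for illustration):
# month -> season lookup table, mapped over the month column
season_map = {12: 'winter', 1: 'winter', 2: 'winter',
              3: 'spring', 4: 'spring', 5: 'spring',
              6: 'summer', 7: 'summer', 8: 'summer',
              9: 'fall', 10: 'fall', 11: 'fall'}
seasons_alt = features['month'].map(season_map)
print((seasons_alt == reduced_features['season']).all()) # should agree with the loop above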
import seaborn as sns
sns.set(style='ticks',color_codes=True)
# color palette
palette = sns.xkcd_palette(['dark blue','dark green','gold','orange'])
# pairplot绘图
sns.pairplot(reduced_features,hue='season',diag_kind='kde',palette=palette,plot_kws=dict(alpha=0.7),diag_kws=dict(shade=True))
<seaborn.axisgrid.PairGrid at 0x232067fa190>
How to read the plot: colors distinguish the seasons (hue); the diagonal shows each variable's distribution across seasons, and the off-diagonal panels are scatter plots showing how pairs of features relate to each other.
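To put numbers on the pairwise relationships the scatter panels suggest, the plain correlation matrix of the same numeric columns can be printed (a short sketch):
# Pearson correlations between the numeric columns shown in the pairplot
print(reduced_features[['temp_1', 'prcp_1', 'average', 'actual']].corr())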
Effect of the amount of data on the results
# one-hot encoding
features = pd.get_dummies(features)
# separate features and labels
labels = features['actual']
features = features.drop('actual',axis=1)
# keep the feature names for later
feature_list = list(features.columns)
# convert to the required format
import numpy as np
features = np.array(features)
labels = np.array(labels)
# train/test split
from sklearn.model_selection import train_test_split
train_features,test_features,train_labels,test_labels = train_test_split(features,labels,test_size=0.25,random_state=0)
print("训练集特征:",train_features.shape)
print("训练集标签:",train_labels.shape)
print("测试集特征:",test_features.shape)
print("测试集标签:",test_labels.shape)
训练集特征: (1643, 17)
训练集标签: (1643,)
测试集特征: (548, 17)
测试集标签: (548,)
Although these are new samples, the same test set has to be used so that the comparisons stay fair.
# apply the same processing to the earlier dataset for comparison
import pandas as pd
import numpy as np
# indices (in the new feature list) of the features that also exist in the original dataset
original_feature_indices = [feature_list.index(feature) for feature in feature_list if feature not in ['ws_1','prcp_1','snwd_1']]
# re-read the original data
original_features = pd.read_csv('temps.csv')
original_features = pd.get_dummies(original_features)
# separate the labels and convert the features
original_labels = np.array(original_features['actual'])
original_features = original_features.drop('actual',axis=1)
original_feature_list = list(original_features.columns)
original_features = np.array(original_features)
Splitting the original data
# train/test split
from sklearn.model_selection import train_test_split
original_train_features,original_test_features,original_train_labels,original_test_labels = train_test_split(original_features,original_labels,test_size=0.25,random_state=42)
# modeling
from sklearn.ensemble import RandomForestRegressor
# same parameters and random seed
rf = RandomForestRegressor(n_estimators=100,random_state=0)
# fit on the original (smaller) dataset
rf.fit(original_train_features,original_train_labels)
# use the shared test set so the comparison is fair
predictions = rf.predict(test_features[:,original_feature_indices])
errors = abs(predictions-test_labels)
print('Mean temperature error:',round(np.mean(errors),2),'°')
mape = 100 *(errors/test_labels)
# define an accuracy figure for easier comparison
accuracy = 100 -np.mean(mape)
print('Accuracy:',round(accuracy,2),'%')
Mean temperature error: 4.68 °
Accuracy: 92.19 %
The result above used the smaller sample; now check whether more samples do better.
from sklearn.ensemble import RandomForestRegressor
# keep the feature columns consistent: restrict the new data to the columns present in the original feature set
original_train_changed_features = train_features[:,original_feature_indices]
original_test_changed_features = test_features[:,original_feature_indices]
rf = RandomForestRegressor(n_estimators=100,random_state=0)
rf.fit(original_train_changed_features,train_labels)
# predict
baseline_predictions = rf.predict(original_test_changed_features)
# results
baseline_errors = abs(baseline_predictions-test_labels)
print('Mean temperature error:',round(np.mean(baseline_errors),2),'°')
baseline_mape = 100 * np.mean(baseline_errors/test_labels)
# accuracy
baseline_accuracy = 100 - baseline_mape
print('Accuracy:',round(baseline_accuracy,2),'%')
Mean temperature error: 4.2 °
Accuracy: 93.12 %
Effect of the number of features on the results
Now add the new weather features (wind speed, precipitation, snow depth) that the previous comparisons did not use.
from sklearn.ensemble import RandomForestRegressor
rf_exp = RandomForestRegressor(n_estimators=100,random_state=0)
rf_exp.fit(train_features,train_labels)
# same test set
predictions = rf_exp.predict(test_features)
# evaluate
errors = abs(predictions - test_labels)
print('Mean temperature error:',round(np.mean(errors),2),'°')
mape = np.mean(100*(errors/test_labels))
improvement_baseline = 100 * abs(mape-baseline_mape) / baseline_mape
print('Relative change in MAPE after adding features:',round(improvement_baseline,2),'%')
# accuracy
accuracy = 100 - mape
print('Accuracy:',round(accuracy,2),'%')
Mean temperature error: 4.05 °
Relative change in MAPE after adding features: 3.34 %
Accuracy: 93.35 %
importances = list(rf_exp.feature_importances_)
# pair feature names with importances
feature_importances = [(feature,round(importance,2)) for feature,importance in zip(feature_list,importances)]
# sort by importance
feature_importances = sorted(feature_importances,key=lambda x:x[1],reverse=True)
# print the results
[print('Variable:{:20} Importance: {}'.format(*pair)) for pair in feature_importances]
Variable:temp_1 Importance: 0.85
Variable:average Importance: 0.05
Variable:ws_1 Importance: 0.02
Variable:friend Importance: 0.02
Variable:year Importance: 0.01
Variable:month Importance: 0.01
Variable:day Importance: 0.01
Variable:prcp_1 Importance: 0.01
Variable:temp_2 Importance: 0.01
Variable:snwd_1 Importance: 0.0
Variable:weekday_Fri Importance: 0.0
Variable:weekday_Mon Importance: 0.0
Variable:weekday_Sat Importance: 0.0
Variable:weekday_Sun Importance: 0.0
Variable:weekday_Thurs Importance: 0.0
Variable:weekday_Tues Importance: 0.0
Variable:weekday_Wed Importance: 0.0
# visualize the importances
plt.style.use('fivethirtyeight')
x_values = list(range(len(importances)))
plt.bar(x_values,importances,orientation="vertical",color="r",edgecolor="k",linewidth=1.2)
plt.xticks(x_values,feature_list,rotation='vertical')
plt.ylabel('Importance')
plt.xlabel('Variable')
plt.title('Variable Importances')
Text(0.5, 1.0, 'Variable Importances')
It is still unclear how many features to keep. One option: sort the features by importance, compute the cumulative sum, set a threshold, and see how many features have to be added together before the cumulative importance crosses it.
sorted_importances = [importance[1] for importance in feature_importances]
sorted_features = [importance[0] for importance in feature_importances]
# cumulative importances
cumulative_importances = np.cumsum(sorted_importances)
# line plot of the cumulative importance
plt.plot(x_values,cumulative_importances,'g-')
plt.hlines(y=0.95,xmin=0,xmax=len(sorted_importances),color='r',linestyles='dashed')
plt.xticks(x_values,sorted_features,rotation='vertical')
plt.xlabel('Variable');plt.ylabel('Cumulative Importance')
plt.title('Cumulative Importances')
Text(0.5, 1.0, 'Cumulative Importances')
With the top five features, the cumulative importance already reaches 0.95. Let's build a model with just these five and see how it does.
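The cutoff can also be computed instead of read off the plot; a small sketch using the cumulative importances from above:
# number of features needed before the cumulative importance reaches 0.95
n_keep = int(np.argmax(cumulative_importances >= 0.95)) + 1
print('Features needed for 0.95 cumulative importance:', n_keep)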
important_feature_names = [feature[0] for feature in feature_importances[0:5]]
# indices of those feature names
important_indices = [feature_list.index(feature) for feature in important_feature_names]
# training and test subsets with only those columns
important_train_features = train_features[:,important_indices]
important_test_features = test_features[:,important_indices]
# data dimensions
print("important train features shape:",important_train_features.shape)
print("important test features shape:",important_test_features.shape)
# retrain the model
rf_exp.fit(important_train_features,train_labels)
# same test set
predictions = rf_exp.predict(important_test_features)
# evaluate
errors = abs(predictions-test_labels)
print('Mean temperature error:',round(np.mean(errors),2),"°")
mape = 100*(errors/test_labels)
accuracy = 100 - np.mean(mape)
print('Accuracy:',round(accuracy,2),"%")
important train features shape: (1643, 5)
important test features shape: (548, 5)
Mean temperature error: 4.11 °
Accuracy: 93.28 %
Accuracy did not improve, so check instead whether there is at least a gain in runtime.
import time
all_features_time = []
for _ in range(10):
    start_time = time.time()
    rf_exp.fit(train_features,train_labels)
    all_features_predictions = rf_exp.predict(test_features)
    end_time = time.time()
    all_features_time.append(end_time-start_time)
all_features_time = np.mean(all_features_time)
print("Average time for training and prediction with all features:",round(all_features_time,2),'s')
Average time for training and prediction with all features: 0.69 s
# timing when only the important features are used
reduced_features_time = []
for _ in range(10):
    start_time = time.time()
    rf_exp.fit(important_train_features,train_labels)
    reduced_features_predictions = rf_exp.predict(important_test_features)
    end_time = time.time()
    reduced_features_time.append(end_time-start_time)
reduced_features_time = np.mean(reduced_features_time)
print("Average time for training and prediction with the important features:",round(reduced_features_time,2),'s')
Average time for training and prediction with the important features: 0.42 s
# timing for the original (small-data) model
original_features_time = []
for _ in range(10):
    start_time = time.time()
    rf.fit(original_train_features,original_train_labels)
    original_features_predictions = rf.predict(test_features[:,original_feature_indices])
    end_time = time.time()
    original_features_time.append(end_time - start_time)
original_features_time = np.mean(original_features_time)
print("Average time for training and prediction with the original model:",round(original_features_time,2),'s')
Average time for training and prediction with the original model: 0.18 s
# comparison
all_accuracy = 100 * (1-np.mean(abs(all_features_predictions-test_labels)/test_labels))
reduced_accuracy = 100 * (1-np.mean(abs(reduced_features_predictions-test_labels)/test_labels))
# save and display the results
comparison = pd.DataFrame({
    'features':['all(17)','reduced(5)'],
    'runtime':[round(all_features_time,2),round(reduced_features_time,2)],
    'accuracy':[round(all_accuracy,2),round(reduced_accuracy,2)]})
comparison[['features','accuracy','runtime']]
| | features | accuracy | runtime |
|---|---|---|---|
| 0 | all(17) | 93.35 | 0.69 |
| 1 | reduced(5) | 93.28 | 0.42 |
Comparing the two results directly
The accuracy figure is self-defined purely for convenient comparison. As the table shows, accuracy barely changes, while the runtime improves considerably.
# runtime may matter more than a tiny accuracy difference
relative_accuracy_decrease = 100 * (all_accuracy - reduced_accuracy) / all_accuracy
print('Relative accuracy drop:',round(relative_accuracy_decrease,3),"%")
relative_runtime_decrease = 100 * (all_features_time - reduced_features_time) / all_features_time
print("Relative runtime reduction:",round(relative_runtime_decrease,3),"%")
Relative accuracy drop: 0.071 %
Relative runtime reduction: 39.17 %
Summary of all experiments
# MAE of the original model
original_mae = np.mean(abs(original_features_predictions - test_labels))
# MAE with all features
exp_all_mae = np.mean(abs(all_features_predictions - test_labels))
# MAE with the reduced feature set
exp_reduced_mae = np.mean(abs(reduced_features_predictions - test_labels))
# accuracy of the original model
original_accuracy = 100 * (1 - np.mean(abs(original_features_predictions - test_labels) /test_labels))
model_comparison = pd.DataFrame({
'model': ['original', 'exp_all', 'exp_reduced'],
'error (degrees)': [original_mae, exp_all_mae, exp_reduced_mae],
'accuracy': [original_accuracy, all_accuracy, reduced_accuracy],
'run_time (s)': [original_features_time, all_features_time, reduced_features_time]})
# plot all experiment results together
fig, (ax1,ax2,ax3) = plt.subplots(nrows=1,ncols=3,figsize=(16,5),sharex=True)
# x axis
x_values = [0,1,2]
labels = list(model_comparison['model'])
plt.xticks(x_values,labels)
# font sizes
fontdict = {
'fontsize':18}
fontdict_yaxis = {
'fontsize':14}
# prediction error comparison
ax1.bar(x_values,model_comparison['error (degrees)'], color=['b','r','g'],edgecolor='k',linewidth=1.5)
ax1.set_ylim(bottom=3.5, top=4.5)
ax1.set_ylabel('Error (degree) (F)',fontdict=fontdict_yaxis)
ax1.set_title('Model Error Comparison',fontdict=fontdict)
# accuracy comparison
ax2.bar(x_values,model_comparison['accuracy'],color=['b','r','g'],edgecolor='k',linewidth=1.5)
ax2.set_ylim(bottom=92, top=94)
ax2.set_ylabel('Accuracy (%)',fontdict=fontdict_yaxis)
ax2.set_title('Model Accuracy Comparison',fontdict=fontdict)
# runtime comparison
ax3.bar(x_values,model_comparison['run_time (s)'], color=['b','r','g'],edgecolor='k',linewidth=1.5)
ax3.set_ylim(bottom=0,top=1)
ax3.set_ylabel('run_time (s)',fontdict=fontdict_yaxis)
ax3.set_title('Model Run-Time Comparison',fontdict=fontdict)
plt.show()
Hyperparameter tuning
from sklearn.ensemble import RandomForestRegressor
from pprint import pprint
rf = RandomForestRegressor(random_state=42)
pprint(rf.get_params())
{'bootstrap': True,
'ccp_alpha': 0.0,
'criterion': 'mse',
'max_depth': None,
'max_features': 'auto',
'max_leaf_nodes': None,
'max_samples': None,
'min_impurity_decrease': 0.0,
'min_impurity_split': None,
'min_samples_leaf': 1,
'min_samples_split': 2,
'min_weight_fraction_leaf': 0.0,
'n_estimators': 100,
'n_jobs': None,
'oob_score': False,
'random_state': 42,
'verbose': 0,
'warm_start': False}
Randomized search
The candidate values should be chosen with the actual problem in mind; the range of every parameter needs careful thought.
from sklearn.model_selection import RandomizedSearchCV
# number of trees
n_estimators = [int(x) for x in np.linspace(start=200,stop=2000,num=10)]
# how the number of features per split is chosen
max_features = ['auto','sqrt']
# maximum tree depth
max_depth = [int(x) for x in np.linspace(10,20,num=2)]
max_depth.append(None)
# minimum number of samples required to split a node
min_samples_split = [2,5,10]
# minimum number of samples per leaf
min_samples_leaf = [1,2,4]
# whether to bootstrap the samples
bootstrap = [True,False]
# the randomized search space
random_grid ={
'n_estimators':n_estimators,
'max_features':max_features,
'max_depth':max_depth,
'min_samples_split':min_samples_split,
'min_samples_leaf':min_samples_leaf,
'bootstrap':bootstrap}
Randomly combining the parameters
rf = RandomForestRegressor()
rf_random = RandomizedSearchCV(estimator=rf, # the model to tune
                               param_distributions=random_grid, # candidate parameter distributions
                               n_iter=100, # number of random parameter combinations to try (100 here; keep the best of them)
                               scoring='neg_mean_absolute_error', # scoring metric
                               cv=3, # 3-fold cross-validation
                               verbose=2, # how much information to print
                               random_state=42, # random seed
                               n_jobs=-1) # number of parallel jobs; -1 uses all cores
# start the search
rf_random.fit(train_features,train_labels)
rf_random.best_params_
# evaluation helper
def evaluate(model,test_features,test_labels):
    predictions = model.predict(test_features)
    errors = abs(predictions - test_labels)
    mape = 100 * np.mean(errors / test_labels)
    accuracy = 100 - mape
    print('Mean temperature error:',np.mean(errors))
    print('Accuracy = {:0.2f}%'.format(accuracy))
# results with default parameters
base_model = RandomForestRegressor(random_state=42)
base_model.fit(train_features,train_labels)
evaluate(base_model,test_features,test_labels)
# the tuned model
best_random = rf_random.best_estimator_
evaluate(best_random,test_features,test_labels)
Grid search
Exhaustive, carpet-style search for the best parameters; note that the grid here is built around the rough range returned by the randomized search above, which gave:
{
'n_estimators': 1800,
'min_samples_split': 10,
'min_samples_leaf': 4,
'max_features': 'auto',
'max_depth': None,
'bootstrap': True}
from sklearn.model_selection import GridSearchCV
# candidate parameter grid
param_grid = {
'n_estimators':[1600,1700,1800,1900,2000],
'max_features':['auto'],
'max_depth':[8,10,12],
'min_samples_split':[3,5,7],
'min_samples_leaf':[2,3,4,5,6],
'bootstrap':[True]
}
# base model
rf = RandomForestRegressor()
# grid search
grid_search = GridSearchCV(estimator=rf,
param_grid=param_grid,
scoring='neg_mean_absolute_error',
cv=3,
n_jobs=-1,
verbose=2)
# start the search
grid_search.fit(train_features,train_labels)
When the parameter space is too large to grid-search in one go, the candidates can be split into smaller groups, each group searched separately, and the best results of the groups compared; a rough sketch of that idea follows.
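One way to implement that idea (purely an illustrative sketch, not from the original post) is to split the candidate grid into smaller sub-grids, run GridSearchCV on each, and keep the overall best:
from sklearn.model_selection import GridSearchCV
# hypothetical split of the n_estimators candidates into two smaller groups
sub_grids = [
    {'n_estimators': [1600, 1700], 'max_depth': [8, 10, 12], 'min_samples_split': [3, 5, 7]},
    {'n_estimators': [1800, 1900, 2000], 'max_depth': [8, 10, 12], 'min_samples_split': [3, 5, 7]},
]
best_score, best_params = -np.inf, None
for sub_grid in sub_grids:
    search = GridSearchCV(RandomForestRegressor(random_state=0), param_grid=sub_grid,
                          scoring='neg_mean_absolute_error', cv=3, n_jobs=-1)
    search.fit(train_features, train_labels)
    if search.best_score_ > best_score:
        best_score, best_params = search.best_score_, search.best_params_
print(best_params, best_score)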
Bayesian optimization (Hyperopt)
from hyperopt import fmin, tpe, hp, STATUS_OK, Trials
from sklearn.model_selection import cross_val_score
def hyperopt_train_test(params):
    clf = RandomForestRegressor(**params)
    return cross_val_score(clf,train_features,train_labels).mean()
max_depth = [i for i in range(10,20)]
# max_depth.append(None)
space4rf = {
'max_depth': hp.choice('max_depth', max_depth),
'max_features': hp.choice('max_features', ['auto','sqrt']),
'min_samples_split':hp.choice('min_samples_split',range(5,20)),
'min_samples_leaf':hp.choice('min_samples_leaf',range(2,10)),
'n_estimators': hp.choice('n_estimators', range(1000,2000)),
'bootstrap':hp.choice('bootstrap',[True,False])
}
best = 0
def f(params):
    global best
    acc = hyperopt_train_test(params)
    if acc > best:
        best = acc
        print('new best:', best, params)
    return {'loss': -acc, 'status': STATUS_OK}
trials = Trials()
best = fmin(f, space4rf, algo=tpe.suggest, max_evals=100, trials=trials)
print("best:",best)
new best: 0.8670547390932652 {'bootstrap': True, 'max_depth': 18, 'max_features': 'auto', 'min_samples_leaf': 4, 'min_samples_split': 11, 'n_estimators': 1942}
new best: 0.8679298658104889 {'bootstrap': True, 'max_depth': 16, 'max_features': 'auto', 'min_samples_leaf': 7, 'min_samples_split': 10, 'n_estimators': 1734}
new best: 0.8684034111523674 {'bootstrap': True, 'max_depth': 14, 'max_features': 'auto', 'min_samples_leaf': 9, 'min_samples_split': 19, 'n_estimators': 1766}
new best: 0.8685636302610934 {'bootstrap': True, 'max_depth': 16, 'max_features': 'auto', 'min_samples_leaf': 9, 'min_samples_split': 5, 'n_estimators': 1439}
new best: 0.8685767383919801 {'bootstrap': True, 'max_depth': 16, 'max_features': 'auto', 'min_samples_leaf': 9, 'min_samples_split': 5, 'n_estimators': 1404}
new best: 0.8685919671759731 {'bootstrap': True, 'max_depth': 19, 'max_features': 'auto', 'min_samples_leaf': 9, 'min_samples_split': 15, 'n_estimators': 1830}
new best: 0.8686049353034605 {'bootstrap': True, 'max_depth': 19, 'max_features': 'auto', 'min_samples_leaf': 9, 'min_samples_split': 16, 'n_estimators': 1099}
new best: 0.8686240725941452 {'bootstrap': True, 'max_depth': 19, 'max_features': 'auto', 'min_samples_leaf': 9, 'min_samples_split': 16, 'n_estimators': 1088}
100%|█████████████████████████████████████████████| 100/100 [39:13<00:00, 26.47s/trial, best loss: -0.8686240725941452]
best: {'bootstrap': 0, 'max_depth': 9, 'max_features': 0, 'min_samples_leaf': 7, 'min_samples_split': 11, 'n_estimators': 88}
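Note that with hp.choice, fmin returns the index of each chosen option rather than the value itself, which is why the final best shows 'max_depth': 9 (the 10th entry of range(10,20), i.e. 19) and 'n_estimators': 88 (i.e. 1088). hyperopt's space_eval maps the indices back to values; a short sketch:
from hyperopt import space_eval
# convert the index-based result of fmin back into actual parameter values
print(space_eval(space4rf, best))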
Reposted from: https://blog.csdn.net/weixin_43645287/article/details/116380002