The previous post covered data preprocessing and EDA: preprocessing improves data usability, while EDA uncovers patterns that guide feature construction. In machine learning competitions there is a saying: "features determine the ceiling a task can reach; models, algorithms, and parameter tuning merely approach that ceiling." The importance of feature engineering is self-evident. Because our task is a time-series problem, many steps differ from other problem types: feature construction must include historical (lag) features as well as time-window features (the window's sum, mean, median, std, and so on), and the offline validation split must follow the timeline instead of ordinary cross-validation.
First we construct the label. The training data covers January 2013 through October 2015, and we must predict November 2015, so we shift each series one step back along the timeline: January's label is February's sales, and so on:
train_monthly['item_cnt_month'] = train_monthly.sort_values('date_block_num').groupby(['shop_id', 'item_id'])['item_cnt'].shift(-1) # next month's sales
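The grouped `shift(-1)` can be illustrated on a small hypothetical frame (toy data, not the competition's): each (shop, item) series gets the following month's sales as its label, and the last month of each group has no "next month", so its label is NaN.

```python
import pandas as pd

# Toy monthly sales for two shops selling the same item.
df = pd.DataFrame({
    'date_block_num': [0, 1, 2, 0, 1, 2],
    'shop_id':        [1, 1, 1, 2, 2, 2],
    'item_id':        [10, 10, 10, 10, 10, 10],
    'item_cnt':       [3, 5, 2, 7, 1, 4],
})

# Sort by month, then shift each (shop, item) group back by one month,
# so every row's label is that pair's sales in the following month.
df['item_cnt_month'] = (df.sort_values('date_block_num')
                          .groupby(['shop_id', 'item_id'])['item_cnt']
                          .shift(-1))
```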
Next, construct the unit-price feature for each item, i.e. revenue floor-divided by sales volume, filling missing values with zero:
train_monthly['item_price_unit'] = train_monthly['item_price'] // train_monthly['item_cnt']
train_monthly['item_price_unit'].fillna(0, inplace=True)
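One caveat worth noting: for a month with zero sales, floor division by zero produces an infinite or NaN value depending on the pandas/numpy version, and `fillna(0)` does not touch infinities. A defensive sketch (toy data, hypothetical column values) that handles both cases:

```python
import numpy as np
import pandas as pd

# Toy rows: the second month has zero sales, so floor division by zero
# yields inf or NaN depending on the library version.
m = pd.DataFrame({'item_price': [100.0, 80.0, 50.0],
                  'item_cnt':   [4.0, 0.0, 2.0]})
m['item_price_unit'] = m['item_price'] // m['item_cnt']
# Map any infinities to NaN first, then zero-fill everything missing.
m['item_price_unit'] = m['item_price_unit'].replace([np.inf, -np.inf], np.nan)
m['item_price_unit'] = m['item_price_unit'].fillna(0)
```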
For each item, compute price-volatility features: the historical minimum and maximum price, plus how far the current price has risen above the minimum and fallen below the maximum.
gp_item_price = train_monthly.sort_values('date_block_num').groupby(['item_id'], as_index=False).agg({'item_price': ['min', 'max']})
gp_item_price.columns = ['item_id', 'hist_min_item_price', 'hist_max_item_price']
train_monthly = pd.merge(train_monthly, gp_item_price, on='item_id', how='left')
train_monthly['price_increase'] = train_monthly['item_price'] - train_monthly['hist_min_item_price']
train_monthly['price_decrease'] = train_monthly['hist_max_item_price'] - train_monthly['item_price']
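The aggregate-then-merge pattern above can be sketched on toy data (hypothetical prices): aggregate each item's historical price extremes, merge them back onto the monthly rows, then derive the two deltas.

```python
import pandas as pd

# Toy monthly prices for two items.
tm = pd.DataFrame({'item_id':    [1, 1, 2],
                   'item_price': [10.0, 14.0, 7.0]})

# Historical min/max price per item.
gp = tm.groupby('item_id', as_index=False).agg(
    hist_min_item_price=('item_price', 'min'),
    hist_max_item_price=('item_price', 'max'))

# Merge back and compute how far each price sits from the extremes.
tm = tm.merge(gp, on='item_id', how='left')
tm['price_increase'] = tm['item_price'] - tm['hist_min_item_price']
tm['price_decrease'] = tm['hist_max_item_price'] - tm['item_price']
```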
A look at the data at this point:
Next we build time-window features with a window size of 3. Rolling windows smooth the data and also carry some historical information. Here we derive min, max, mean, and std features from the window and zero-fill missing values.
# Min value
f_min = lambda x: x.rolling(window=3, min_periods=1).min()
# Max value
f_max = lambda x: x.rolling(window=3, min_periods=1).max()
# Mean value
f_mean = lambda x: x.rolling(window=3, min_periods=1).mean()
# Standard deviation
f_std = lambda x: x.rolling(window=3, min_periods=1).std()
function_list = [f_min, f_max, f_mean, f_std]
function_name = ['min', 'max', 'mean', 'std']
for i in range(len(function_list)):
    train_monthly[('item_cnt_%s' % function_name[i])] = train_monthly.sort_values('date_block_num').groupby(['shop_id', 'item_category_id', 'item_id'])['item_cnt'].apply(function_list[i])
# Fill the empty std features with 0
train_monthly['item_cnt_std'].fillna(0, inplace=True)
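On a toy series (hypothetical values) the effect of `min_periods=1` is easy to see: the first months still produce a value instead of NaN, and only the std is undefined for a single observation, which is why it alone needs the zero-fill.

```python
import pandas as pd

s = pd.Series([2.0, 4.0, 6.0, 8.0])
roll = s.rolling(window=3, min_periods=1)

# The window grows until it reaches size 3, then slides.
means = roll.mean()   # [2.0, 3.0, 4.0, 6.0]
stds = roll.std()     # first entry is NaN: std needs >= 2 observations
```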
Construct lagged historical features, i.e. shift each of the previous three months' sales onto the current row:
lag_list = [1, 2, 3]
for lag in lag_list:
    ft_name = ('item_cnt_shifted%s' % lag)
    train_monthly[ft_name] = train_monthly.sort_values('date_block_num').groupby(['shop_id', 'item_category_id', 'item_id'])['item_cnt'].shift(lag)
    # Fill the empty shifted features with 0
    train_monthly[ft_name].fillna(0, inplace=True)
Construct a sales-trend feature from the change relative to the lagged features:
train_monthly['item_trend'] = train_monthly['item_cnt']
for lag in lag_list:
    ft_name = ('item_cnt_shifted%s' % lag)
    train_monthly['item_trend'] -= train_monthly[ft_name]
train_monthly['item_trend'] /= len(lag_list) + 1
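The lag-and-trend construction above can be traced on a single toy series (hypothetical counts): the trend is (current − lag1 − lag2 − lag3) / 4, i.e. the current month compared against its zero-filled recent history.

```python
import pandas as pd

df = pd.DataFrame({'item_cnt': [1.0, 2.0, 3.0, 4.0]})
lag_list = [1, 2, 3]

# Lag features: previous months' sales shifted onto the current row,
# with missing history filled by zero.
for lag in lag_list:
    df['item_cnt_shifted%s' % lag] = df['item_cnt'].shift(lag).fillna(0)

# Trend: current sales minus the three lags, scaled by 4.
df['item_trend'] = df['item_cnt']
for lag in lag_list:
    df['item_trend'] -= df['item_cnt_shifted%s' % lag]
df['item_trend'] /= len(lag_list) + 1
```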
Now we split off the training, validation, and test sets. Because the lag and rolling-window features need three months of history, we drop the first three months; rows with date_block_num from 28 to 32 become the offline validation set, and month 33 is held out as the test month.
train_set = train_monthly.query('date_block_num >= 3 and date_block_num < 28').copy()
validation_set = train_monthly.query('date_block_num >= 28 and date_block_num < 33').copy()
test_set = train_monthly.query('date_block_num == 33').copy()
train_set.dropna(subset=['item_cnt_month'], inplace=True)
validation_set.dropna(subset=['item_cnt_month'], inplace=True)
train_set.dropna(inplace=True)
validation_set.dropna(inplace=True)
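A quick sanity check of this time-based split on a toy timeline: the three `query` ranges partition the months, so nothing leaks across sets and the test month stays strictly after validation and training.

```python
import pandas as pd

# Toy timeline of 34 month indices, mirroring date_block_num 0..33.
months = pd.DataFrame({'date_block_num': range(34)})
train = months.query('date_block_num >= 3 and date_block_num < 28')
valid = months.query('date_block_num >= 28 and date_block_num < 33')
test = months.query('date_block_num == 33')
```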
Construct mean encodings of the target for shop, item, shop-item pair, year, and month.
# Shop mean encoding.
gp_shop_mean = train_set.groupby(['shop_id']).agg({'item_cnt_month': ['mean']})
gp_shop_mean.columns = ['shop_mean']
gp_shop_mean.reset_index(inplace=True)
# Item mean encoding.
gp_item_mean = train_set.groupby(['item_id']).agg({'item_cnt_month': ['mean']})
gp_item_mean.columns = ['item_mean']
gp_item_mean.reset_index(inplace=True)
# Shop with item mean encoding.
gp_shop_item_mean = train_set.groupby(['shop_id', 'item_id']).agg({'item_cnt_month': ['mean']})
gp_shop_item_mean.columns = ['shop_item_mean']
gp_shop_item_mean.reset_index(inplace=True)
# Year mean encoding.
gp_year_mean = train_set.groupby(['year']).agg({'item_cnt_month': ['mean']})
gp_year_mean.columns = ['year_mean']
gp_year_mean.reset_index(inplace=True)
# Month mean encoding.
gp_month_mean = train_set.groupby(['month']).agg({'item_cnt_month': ['mean']})
gp_month_mean.columns = ['month_mean']
gp_month_mean.reset_index(inplace=True)
# Add mean encoding features to train set.
train_set = pd.merge(train_set, gp_shop_mean, on=['shop_id'], how='left')
train_set = pd.merge(train_set, gp_item_mean, on=['item_id'], how='left')
train_set = pd.merge(train_set, gp_shop_item_mean, on=['shop_id', 'item_id'], how='left')
train_set = pd.merge(train_set, gp_year_mean, on=['year'], how='left')
train_set = pd.merge(train_set, gp_month_mean, on=['month'], how='left')
# Add mean encoding features to validation set.
validation_set = pd.merge(validation_set, gp_shop_mean, on=['shop_id'], how='left')
validation_set = pd.merge(validation_set, gp_item_mean, on=['item_id'], how='left')
validation_set = pd.merge(validation_set, gp_shop_item_mean, on=['shop_id', 'item_id'], how='left')
validation_set = pd.merge(validation_set, gp_year_mean, on=['year'], how='left')
validation_set = pd.merge(validation_set, gp_month_mean, on=['month'], how='left')
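One of these encodings in isolation, on toy data: the per-shop mean of the target is computed on the training rows only (so no validation information leaks into the encoding) and then left-merged onto both sets.

```python
import pandas as pd

# Toy training rows with the target, and validation rows without it.
tr = pd.DataFrame({'shop_id': [1, 1, 2], 'item_cnt_month': [2.0, 4.0, 6.0]})
va = pd.DataFrame({'shop_id': [1, 2]})

# Mean target per shop, computed from training data only.
gp = tr.groupby('shop_id', as_index=False)['item_cnt_month'].mean()
gp.columns = ['shop_id', 'shop_mean']

# Merge the encoding onto both sets.
tr = tr.merge(gp, on='shop_id', how='left')
va = va.merge(gp, on='shop_id', how='left')
```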
Separate the inputs X and labels Y for the training and validation sets.
# Create train and validation sets and labels.
X_train = train_set.drop(['item_cnt_month', 'date_block_num'], axis=1)
Y_train = train_set['item_cnt_month'].astype(int)
X_validation = validation_set.drop(['item_cnt_month', 'date_block_num'], axis=1)
Y_validation = validation_set['item_cnt_month'].astype(int)
Cast the integer features.
# Integer features (used by catboost model).
int_features = ['shop_id', 'item_id', 'year', 'month']
X_train[int_features] = X_train[int_features].astype('int32')
X_validation[int_features] = X_validation[int_features].astype('int32')
Fill in the test set's missing features from each pair's most recent month. We build latest_records, the latest feature row for every (shop_id, item_id) combination: if a combination appears in the last month of the validation set, its record comes from that month; otherwise the most recent month in which it appears is used.
latest_records = pd.concat([train_set, validation_set]).drop_duplicates(subset=['shop_id', 'item_id'], keep='last')
X_test = pd.merge(test, latest_records, on=['shop_id', 'item_id'], how='left', suffixes=['', '_']) # give every pair its most recent record
pd.concat([X_test.head(), X_test.tail()])
X_test['year'] = 2015
X_test['month'] = 9
X_test.drop('item_cnt_month', axis=1, inplace=True)
X_test[int_features] = X_test[int_features].astype('int32')
X_test = X_test[X_train.columns]
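The `drop_duplicates(..., keep='last')` trick relies on row order: with the training rows before (and within months, up to) the validation rows, keeping the last occurrence of each pair yields its most recent record. A toy sketch (hypothetical history):

```python
import pandas as pd

# Toy history: shop 1 has records for months 31 and 32, shop 2 only month 30.
hist = pd.DataFrame({'shop_id':        [1, 1, 2],
                     'item_id':        [10, 10, 10],
                     'date_block_num': [31, 32, 30],
                     'item_cnt':       [5.0, 7.0, 2.0]})

# Keep only the last (most recent) row per (shop_id, item_id) pair.
latest = hist.drop_duplicates(subset=['shop_id', 'item_id'], keep='last')

# Left-merge so every test pair picks up its latest known features.
test_pairs = pd.DataFrame({'shop_id': [1, 2], 'item_id': [10, 10]})
xt = test_pairs.merge(latest, on=['shop_id', 'item_id'], how='left')
```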
Fill the remaining missing values with the per-shop median.
sets = [X_train, X_validation, X_test]
for dataset in sets:
    for shop_id in dataset['shop_id'].unique():
        for column in dataset.columns:
            shop_median = dataset[(dataset['shop_id'] == shop_id)][column].median()
            dataset.loc[(dataset[column].isnull()) & (dataset['shop_id'] == shop_id), column] = shop_median
# Fill remaining missing values on test set with mean.
X_test.fillna(X_test.mean(), inplace=True)
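The triple loop above is correct but slow on a frame this size. A vectorized equivalent (toy data) uses a grouped `transform`, since `median` skips NaN by default:

```python
import numpy as np
import pandas as pd

# Toy frame with per-shop gaps in a feature column.
df = pd.DataFrame({'shop_id': [1, 1, 1, 2, 2],
                   'feat':    [1.0, 3.0, np.nan, 5.0, np.nan]})

# Fill each shop's NaNs with that shop's own median of the column.
df['feat'] = df['feat'].fillna(df.groupby('shop_id')['feat'].transform('median'))
```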
Drop the id column that is no longer useful.
X_train.drop(['item_category_id'], axis=1, inplace=True)
X_validation.drop(['item_category_id'], axis=1, inplace=True)
X_test.drop(['item_category_id'], axis=1, inplace=True)
A look at the test and validation data:
Test set:
Validation set:
Reposted from: https://blog.csdn.net/wlx19970505/article/details/101032156