Feidao's Blog

Yipay (翼支付) Cup Big Data Modeling Competition: Third-Place Solution



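Hi everyone, this is Xiao Ai (小爱同学).
First, the competition link:
Link: link.
This is a competition on identifying risky users. If you are interested, read on:
I. Understanding the Task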

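The competition provides three tables: basic user information, user operation records, and user transaction records. The evaluation metric is AUC, which depends only on how the model ranks positive samples against negative ones, so we did not take any special measures against class imbalance. Because the task is to identify user credit risk, time, amount, and region are the keys to our feature construction.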
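
The point about AUC can be checked on a toy example. The sketch below is not from the competition code: `pairwise_auc` is a hypothetical helper that computes AUC as the share of (positive, negative) pairs ranked correctly, and the labels and scores are made up. It suggests that AUC stays the same when every score is rescaled monotonically or when the negative class is made ten times larger.

```python
def pairwise_auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# toy data, made up for illustration
labels = [1, 0, 1, 0, 0, 0]
scores = [0.9, 0.2, 0.4, 0.5, 0.1, 0.3]

base = pairwise_auc(labels, scores)

# a monotone transform (halving every score) keeps the ranking, hence the AUC
halved = pairwise_auc(labels, [s / 2 for s in scores])

# make the data more imbalanced by repeating every negative sample 10 times
neg_labels = [y for y in labels if y == 0] * 10
neg_scores = [s for y, s in zip(labels, scores) if y == 0] * 10
imbalanced = pairwise_auc([1, 1] + neg_labels, [0.9, 0.4] + neg_scores)

print(base, halved, imbalanced)
```

Imbalance can still influence how a model trains; the sketch only shows that the metric itself does not call for rebalancing.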

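II. Data Preprocessing
1. Missing-Value Handling
Because of the nature of this task, we do not impute missing values in the usual way. Instead, we treat missingness as a category of its own: categorical features get the placeholder '\N' and numeric features get -1.
The missing-value handling code is shown below: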

    # Missing-value handling: treat "missing" as its own category
    cols = ['sex', 'balance_avg', 'balance1_avg', 'provider', 'province', 'city', 'level']
    for col in cols:
        # assign the result back; chained fillna(..., inplace=True) may not update `data` in newer pandas
        data[col] = data[col].fillna(r'\N')
    # level-like numeric columns use -1 as the missing marker
    cols = ['balance_avg', 'balance1_avg', 'level']
    for col in cols:
        data[col] = data[col].replace({r'\N': -1})

2. Encoding
(1) Unordered low-cardinality categorical features (such as sex): we encode them with a Label Encoder.
The code is as follows:

    # Unordered low-cardinality categorical features
    cols = ['sex', 'provider', 'verified', 'regist_type', 'agreement1', 'agreement2', 'agreement3', 'agreement4', 'province', 'city', 'service3']
    for col in cols:
        if data[col].dtype == 'object':
            # make mixed-type columns pure strings before encoding
            data[col] = data[col].astype(str)
    # labelEncoder_df is a helper from the project code (not shown in this post);
    # call it once after all casts instead of once per object column
    labelEncoder_df(data, cols)
    data.info()  # info() prints by itself and returns None
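
The `labelEncoder_df` helper is not included in the post. As a rough idea of what such a helper might do, here is a minimal sketch built on scikit-learn's `LabelEncoder`. The name `label_encode_columns` and the toy data are assumptions for illustration, not the authors' implementation.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder


def label_encode_columns(df, cols):
    """Hypothetical stand-in for labelEncoder_df: integer-encode each listed column."""
    for col in cols:
        encoder = LabelEncoder()
        # classes are sorted, so codes follow the sorted order of the string values
        df[col] = encoder.fit_transform(df[col].astype(str))
    return df


# toy data, made up for illustration; r'\N' is the missing-value placeholder
toy = pd.DataFrame({
    "sex": ["male", "female", r"\N", "male"],
    "provider": ["a", "b", "a", "c"],
})
toy = label_encode_columns(toy, ["sex", "provider"])
print(toy)
```

Because the placeholder `\N` is encoded like any other value, missing entries end up as their own integer class, which matches the missing-value strategy described earlier.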

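(2) Unordered high-cardinality categorical features (such as city and province): we use target encoding. To reduce overfitting, we follow a 5-fold cross-validation scheme: each training row is encoded with label means computed on the other four folds. The original post illustrates this with a figure.

The code is as follows: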

import numpy as np
from sklearn.model_selection import StratifiedKFold

# Target encoding with out-of-fold label means
def kfold_stats_feature(train, test, feats, k):
    # ideally use the same K-fold split as the later model cross-validation
    folds = StratifiedKFold(n_splits=k, shuffle=True, random_state=44)

    # record each row's validation fold by position
    # (train.loc[val_idx] would assume a default RangeIndex)
    fold_ids = np.zeros(len(train), dtype=int)
    for fold_, (trn_idx, val_idx) in enumerate(folds.split(train, train['label'])):
        fold_ids[val_idx] = fold_
    train['fold'] = fold_ids

    global_mean = train['label'].mean()  # fallback for categories unseen in the other folds
    for feat in feats:
        colname = feat + '_label_kfold_mean'
        train[colname] = np.nan
        for fold_ in range(k):
            in_fold = train['fold'] == fold_
            # label mean per category, computed only on the other folds
            order_label = train[~in_fold].groupby(feat)['label'].mean()
            train.loc[in_fold, colname] = train.loc[in_fold, feat].map(order_label)
        train[colname] = train[colname].fillna(global_mean).astype(float)

        # the test set is encoded with label means from the full training set
        order_label = train.groupby(feat)['label'].mean()
        test[colname] = test[feat].map(order_label).fillna(global_mean).astype(float)
    del train['fold']
    return train, test

(3) Level-type and continuous features: convert them to integer values.
The code is as follows:

    # Convert level-type and continuous features to integers
    cols_int = [f for f in data.columns if
                f in ['level', 'balance', 'balance_avg', 'balance1', 'balance1_avg', 'balance2', 'balance2_avg',
                      'product1_amount', 'product2_amount', 'product3_amount', 'product4_amount', 'product5_amount',
                      'product6_amount']]
    for col in cols_int:
        for i in range(0, 50):
            # values arrive as strings such as "category 3" or "level 3"
            data.loc[data[col] == "category %d" % i, col] = i
            data.loc[data[col] == "level %d" % i, col] = i
        print(col, 'missing:', data[col].isnull().sum())  # sanity check before the cast
        # astype returns a new Series, so assign it back; the cast fails if NaN values remain
        data[col] = data[col].astype(int)

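III. Feature Engineering
1. Time Feature Analysis

Comparing how operation volume is distributed over time in train and test, we can assume that both sets start from the same point in time. Positive and negative samples also differ in their hourly transaction-volume distributions, so constructing time-window features is well justified.
For example: statistics of a user's transaction amounts on weekday n, statistics of the amounts after day n, and statistics of the amounts after n o'clock each day.
The code is as follows: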

# statistics computed for every window; 'med' is the median and 'cnt' the row count
AMOUNT_STATS = {'mean': 'mean', 'std': 'std', 'max': 'max', 'min': 'min',
                'sum': 'sum', 'med': 'median', 'cnt': 'count'}

def _user_amount_stats(df, suffix):
    # named aggregation; the dict-renaming form of agg() was removed in pandas 1.0
    named = {'user_amount_{}_{}'.format(name, suffix): func for name, func in AMOUNT_STATS.items()}
    return df.groupby('user')['amount'].agg(**named).reset_index()

def gen_user_window_amount_features(df, window):
    # rows whose days_diff is greater than `window`
    return _user_amount_stats(df[df['days_diff'] > window], '{}d'.format(window))

def gen_user_window_amount_hour_features(df, window):
    # rows whose hour of day is greater than `window`
    return _user_amount_stats(df[df['hour'] > window], '{}h'.format(window))

def gen_user_window_amount_week_features(df, window):
    # rows that fall on weekday number `window`
    return _user_amount_stats(df[df['week'] == window], '{}w'.format(window))
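
The window functions expect `days_diff`, `hour` and `week` columns, which the post does not show being built. The following is a minimal sketch under two assumptions: that the raw table stores each event as a time-offset string in a column called `tm_diff` (a hypothetical name and format), and that the weekday can be approximated as the day offset modulo 7.

```python
import pandas as pd

# toy transactions, made up for illustration; `tm_diff` and its format are assumptions
trans = pd.DataFrame({
    "user": ["u1", "u1", "u2"],
    "amount": [10.0, 25.0, 7.5],
    "tm_diff": ["0 days 08:15:00", "9 days 23:05:30", "15 days 00:00:10"],
})

offset = pd.to_timedelta(trans["tm_diff"])
trans["days_diff"] = offset.dt.days              # whole days since the start point
trans["hour"] = offset.dt.components["hours"]    # hour of day, 0 to 23
trans["week"] = trans["days_diff"] % 7           # position in a 7-day cycle, not a calendar weekday
print(trans[["user", "days_diff", "hour", "week"]])
```

Since the data only carries offsets from an unknown start point, `week` here marks a position in a repeating 7-day cycle rather than a true calendar weekday.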

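2. RFM Features
While researching the problem, we came across the RFM model, an important tool for measuring customer value and a customer's ability to generate profit. Based on it we constructed many useful features. The original post summarizes them in a figure.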
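
Since the figure is not reproduced here, the following is a rough sketch of what per-user RFM features can look like: recency (how recently the user last transacted), frequency (how many transactions) and monetary value (how much was transacted). The column names `user`, `days_diff` and `amount` follow the time-window code, while `gen_user_rfm_features`, `end_day` and the output column names are assumptions for illustration, not the authors' actual feature set.

```python
import pandas as pd


def gen_user_rfm_features(trans, end_day):
    """Hypothetical helper: one row per user with recency, frequency and monetary columns."""
    rfm = trans.groupby("user").agg(
        last_day=("days_diff", "max"),
        rfm_frequency=("amount", "count"),
        rfm_monetary=("amount", "sum"),
    ).reset_index()
    # recency: days between the user's last transaction and the end of the observation window
    rfm["rfm_recency"] = end_day - rfm["last_day"]
    return rfm.drop(columns="last_day")


# toy data, made up for illustration
trans = pd.DataFrame({
    "user": ["u1", "u1", "u2", "u2", "u2"],
    "days_diff": [1, 20, 3, 4, 29],
    "amount": [100.0, 50.0, 10.0, 20.0, 30.0],
})
rfm = gen_user_rfm_features(trans, end_day=30)
print(rfm)
```

The resulting frame can be merged onto the user table by `user`, in the same way as the other per-user feature tables.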


3. TF-IDF Features
We extract TF-IDF features from each user's operation modes and operation types.
The code is as follows:

import pandas as pd
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

def gen_user_tfidf_features(df, value):
    # fill first: astype(str) would turn NaN into the string 'nan' and make fillna a no-op
    df[value] = df[value].fillna('-1').astype(str)
    # join each user's values (e.g. op_mode) into one comma-separated "document"
    group_df = df.groupby('user')[value].agg(','.join).reset_index()
    group_df.columns = ['user', 'list']
    enc_vec = TfidfVectorizer()
    tfidf_vec = enc_vec.fit_transform(group_df['list'])  # sparse user-by-token TF-IDF matrix
    # TruncatedSVD accepts sparse input and lets us choose how many dimensions to keep
    svd_enc = TruncatedSVD(n_components=10, n_iter=20, random_state=2020)
    vec_svd = pd.DataFrame(svd_enc.fit_transform(tfidf_vec))
    vec_svd.columns = ['svd_tfidf_{}_{}'.format(value, i) for i in range(10)]
    group_df = pd.concat([group_df, vec_svd], axis=1)
    del group_df['list']
    return group_df

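IV. Model Fusion
We blend three models, LightGBM, XGBoost and CatBoost, each trained with several parameter settings.
The original post shows the correlation analysis between the models in a figure.

The LightGBM training code is as follows: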

import lightgbm as lgb
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedKFold

def lgb_model(train, target, test, k):

    feats = [f for f in train.columns if f not in ['user', 'label']]
    print('Current num of features:', len(feats))
    target = np.asarray(target)  # fold indices below are positional

    oof_probs = np.zeros(train.shape[0])
    output_preds = 0
    offline_score = []
    feature_importance_df = pd.DataFrame()
    parameters = {
        'learning_rate': 0.01,
        'boosting_type': 'gbdt',
        'objective': 'binary',
        'metric': 'auc',
        'num_leaves': 68,
        'feature_fraction': 0.4,
        'bagging_fraction': 0.8,
        'min_data_in_leaf': 25,
        'verbose': -1,
        'nthread': 8,
        'max_depth': 8
    }

    seeds = [2020]
    for seed in seeds:
        folds = StratifiedKFold(n_splits=k, shuffle=True, random_state=seed)
        for i, (train_index, test_index) in enumerate(folds.split(train, target)):
            train_y, test_y = target[train_index], target[test_index]
            train_X, test_X = train[feats].iloc[train_index, :], train[feats].iloc[test_index, :]

            dtrain = lgb.Dataset(train_X, label=train_y)
            dval = lgb.Dataset(test_X, label=test_y)
            # callbacks replace the early_stopping_rounds / verbose_eval arguments,
            # which lgb.train no longer accepts as of LightGBM 4.0
            model = lgb.train(
                parameters,
                dtrain,
                num_boost_round=5000,
                valid_sets=[dval],
                callbacks=[lgb.early_stopping(200), lgb.log_evaluation(100)],
            )
            # += keeps the average correct when more than one seed is used
            oof_probs[test_index] += model.predict(test_X, num_iteration=model.best_iteration) / len(seeds)
            offline_score.append(model.best_score['valid_0']['auc'])
            output_preds += model.predict(test[feats], num_iteration=model.best_iteration) / folds.n_splits / len(seeds)
            print(offline_score)
            # feature importance (gain) of this fold
            fold_importance_df = pd.DataFrame()
            fold_importance_df["feature"] = feats
            fold_importance_df["importance"] = model.feature_importance(importance_type='gain')
            fold_importance_df["fold"] = i + 1
            feature_importance_df = pd.concat([feature_importance_df, fold_importance_df], axis=0)
    print('OOF-MEAN-AUC:%.6f, OOF-STD-AUC:%.6f' % (np.mean(offline_score), np.std(offline_score)))
    # mean importance across folds, most important first
    importance = feature_importance_df.groupby(['feature'])['importance'].mean().sort_values(ascending=False)
    print('feature importance:')
    print(importance.head(310))
    importance.head(457).to_csv('../importance/08_26_452.csv')
    return output_preds, oof_probs, np.mean(offline_score)
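
The blending step itself is not shown in the post. As a rough sketch under that caveat, one common approach is to check how correlated the models' out-of-fold predictions are and then average their rank-normalized predictions. The arrays below are random stand-ins for the outputs of the three models, and rank averaging is an assumed choice, not necessarily the authors' method.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
signal = rng.random(200)  # stand-in for the part of the prediction all models share

# made-up out-of-fold predictions for the three models
oof = pd.DataFrame({
    "lgb": signal + 0.05 * rng.standard_normal(200),
    "xgb": signal + 0.05 * rng.standard_normal(200),
    "cat": signal + 0.10 * rng.standard_normal(200),
})

# Pearson correlation of the ranks equals Spearman correlation of the raw predictions;
# lower correlation between models usually leaves more to gain from blending
rank_corr = oof.rank().corr()
print(rank_corr)


def rank_average(preds):
    """Average the per-model percentile ranks; suits a rank-based metric such as AUC."""
    return preds.rank(pct=True).mean(axis=1)


blend = rank_average(oof)
print(blend.head())
```

In practice the same `rank_average` call would be applied to the test-set predictions of each model, with the out-of-fold blend used to estimate the offline AUC.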

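V. Rule-Based Score Boost
Our analysis found that when a user's balance level is 1 and product amount level is 21, the user's risk rate is extremely high. For these users we set the predicted result to 1/2 of its original value.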
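
A minimal sketch of such a post-processing rule is shown below. The post does not say which columns hold the two levels, so `balance` and `product1_amount` are placeholder choices, and `apply_rule` is a hypothetical helper rather than the authors' code.

```python
import numpy as np
import pandas as pd


def apply_rule(test, preds, balance_col, product_col):
    """Halve the prediction for rows with balance level 1 and product amount level 21."""
    mask = (test[balance_col] == 1) & (test[product_col] == 21)
    adjusted = np.asarray(preds, dtype=float).copy()  # leave the original predictions untouched
    adjusted[mask.to_numpy()] *= 0.5
    return adjusted


# toy data, made up for illustration; the column names are assumptions
test = pd.DataFrame({"balance": [1, 1, 3], "product1_amount": [21, 5, 21]})
preds = np.array([0.8, 0.6, 0.4])
adjusted = apply_rule(test, preds, "balance", "product1_amount")
print(adjusted)
```

Only the first row satisfies both conditions, so only its score changes. Because the rule rescales a subset of rows rather than all of them, it can change the ranking and therefore the AUC.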

The full code is available at https://github.com/poplar1hhh/yipay. A star would be much appreciated, and don't forget to like this post!


Reposted from: https://blog.csdn.net/weixin_45966291/article/details/109819437