China Merchants Bank (CMB) FinTech Elite Challenge: top 6% solution, open-sourced
I'm a complete beginner; this is purely a record-and-share post, so please go easy on me.
To be honest, I gave it this title so you wouldn't skip over the content; it's pure clickbait.
Final result
0.7805 on the B board, around 50th place out of just under 1,000 participants (900+).
1 Problem Introduction
Competition rules
1. Schedule: April 29, 11:00 - May 12, 17:00;
2. The event runs as a data competition. From April 29, 11:00 to May 9, 24:00 the A-board data is open, with up to 5 submissions per day; from May 10, 00:00 to May 12, 17:00 the B-board data is open, with up to 3 submissions per day. After submitting, be sure to click the "Run" button to see your current ranking. The final ranking is based on the B-board score;
3. Be sure to submit the code behind your final run before the deadline.
Important: do not share the competition problems. The platform monitors the whole competition and reviews submitted results. Any cheating leads to disqualification and a record in CMB's campus-recruiting integrity file.
Problem statement
[CMB 2020 FinTech Elite Training Camp, data track: important notice]
Dear participants:
As announced earlier, the platform has finished updating the dataset and adjusted the contents of some fields. Please review the data carefully before use, analyze it thoroughly, and re-submit your results.
For questions, contact kaoshi@nowcoder.com or QQ group 732244713 (when joining, provide "name + registration phone number"). Good luck!
CMB FinTech Elite Training Camp project team
April 30, 2020
Credit Risk Scoring
I. Competition Background
In today's big-data era, credit scores are no longer limited to financial scenarios such as credit cards and loans. Similar scoring products now touch every corner of daily life, from deposit-free power-bank rentals and ride-first-pay-later taxi services to recruiting and even matchmaking.
As a fintech pioneer, CMB has an app with over 100 million monthly active users. The app covers financial scenarios such as payments, wealth management and credit, and extends to non-financial ones such as meal vouchers, movie tickets, transport and news. This makes it possible to build a credit score for each user and use it to provide better, more convenient services.
II. Task Requirements
The competition provides two datasets (a training set and a scoring set) containing user tag data, 60 days of transaction data, and 30 days of APP behavior data. Based on the training set, participants should extract effective features, build a credit-default prediction model, apply it to the scoring set, and output a default probability for every user in the scoring set.
III. Evaluation Metric
$$\mathrm{AUC} = \frac{1}{|D^+|\,|D^-|}\sum_{x^+\in D^+}\sum_{x^-\in D^-} I\big(f(x^+) > f(x^-)\big)$$
where $D^+$ and $D^-$ are the sets of users in the scoring dataset who did and did not default, $|D^+|$ and $|D^-|$ are the sizes of those sets, $f(x)$ is the contestant's estimated default probability for user $x$, and $I$ is the indicator function.
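This is the standard AUC, so offline validation can score the same way with sklearn. A minimal sketch (y_true and y_score are placeholder names for the true flags and the predicted probabilities):
from sklearn.metrics import roc_auc_score
# AUC: fraction of (defaulter, non-defaulter) pairs where the defaulter gets the higher score
auc = roc_auc_score(y_true, y_score)   # y_true: 0/1 default flags, y_score: f(x)
print('offline AUC:', auc)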
IV. Data Description
1. 训练数据集_tag.csv and 评分数据集_tag.csv provide the user tag data for the training and scoring datasets;
2. 训练数据集_trd.csv and 评分数据集_trd.csv provide 60 days of user transaction data for the training and scoring datasets;
3. 训练数据集_beh.csv and 评分数据集_beh.csv provide 30 days of user APP behavior data for the training and scoring datasets;
4. 数据说明.xlsx describes the dataset fields and gives data samples;
5. Submission format:
5.1 Submit a single txt file encoded as UTF-8 without BOM.
5.2 Output the predicted default probability for every user in the scoring dataset. The output fields are the user id and the predicted default probability, separated by \t, one user per line. Make sure no rows are missing and no extra rows are added.
From https://www.nowcoder.com/activity/2020cmb/index?type=2
2 Approach Overview
Three tables: tag, beh, trd.
The training set has 39,923 rows, the A-board test set 6,000 rows, and the B-board test set 4,000 rows.
tag and trd cover everyone; beh only covers about a third of the users (roughly 11,000 rows in the training set, about 2,000 rows in the A-board test set, and about 1,200 rows in the B-board test set).
For me, adding beh always dropped the score, and I never found a good way to use it (a rough sketch of what I tried is below). So the features are mainly built on tag and trd: tag alone gets to about 0.69, trd alone to about 0.72. The trd table has a time series, so there are more features to build from it; in fintech data, time-series data tends to carry the most value. The code below walks through the approach.
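For reference, the beh features I tried looked roughly like this (a minimal sketch, assuming the beh table has columns id, page_no and page_tm as in the data dictionary, and a file layout mirroring tag/trd):
import pandas as pd

Train_beh = pd.read_csv('../data/train_data/train_beh.csv')
# simple per-user aggregates over the 30-day APP window
beh_fea = Train_beh.groupby('id').agg(
    beh_count=('page_tm', 'count'),           # total page visits
    beh_page_nunique=('page_no', 'nunique'),  # distinct pages visited
).reset_index()
# a left merge keeps the ~2/3 of users without beh records as NaN,
# which doubles as a "has APP behavior" indicator
#Train_data = pd.merge(Train_data, beh_fea, on='id', how='left')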
3 Code
# -*- coding: utf-8 -*-
'''
Features from all three tables
xgb prediction
'''
### Basic utility imports
import numpy as np
import pandas as pd
import warnings
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.special import jn
from IPython.display import display, clear_output
import time
from dateutil.parser import parse
import datetime
### For model prediction
from sklearn import linear_model
from sklearn import preprocessing
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor,GradientBoostingRegressor
## For dimensionality reduction
from sklearn.decomposition import PCA,FastICA,FactorAnalysis,SparsePCA
import lightgbm as lgb
import xgboost as xgb
from xgboost import plot_importance
from matplotlib import pyplot
## For parameter search and evaluation
from sklearn.model_selection import GridSearchCV,cross_val_score,StratifiedKFold,train_test_split
from sklearn.metrics import mean_squared_error, mean_absolute_error
### Step 1: read the data with pandas (a very friendly data-reading library)
data_path = '../data/'
Train_data = pd.read_csv(data_path+'train_data/train_tag.csv')
TestA_data = pd.read_csv(data_path+'test_data/test_tag_b.csv')
Train_data = Train_data.replace('\\N',-2)
TestA_data = TestA_data.replace('\\N',-2)
num_ori = len(TestA_data.columns)
### Step 2: engineer features from the trd table
Train_trd = pd.read_csv(data_path+'train_data/train_trd.csv').sort_values(["id",'trx_tm'])
TestA_trd = pd.read_csv(data_path+'test_data/test_trd_b.csv').sort_values(["id",'trx_tm'])
def fea_cishu(df, name):  # per-id transaction counts ("cishu" = count)
    tmp = df.groupby(['id'], as_index=False)['trx_tm'].count()
    tmp = tmp.rename(columns={'trx_tm': (name + 'count')})
    return tmp

def amt_sum(df, name):  # per-id amount sum and mean
    tmp_sum = df.groupby(['id'], as_index=False)['cny_trx_amt'].sum()
    tmp_sum = tmp_sum.rename(columns={'cny_trx_amt': (name + 'sum')})
    tmp_mean = df.groupby(['id'], as_index=False)['cny_trx_amt'].mean()
    tmp_mean = tmp_mean.rename(columns={'cny_trx_amt': (name + 'mean')})
    tmp = pd.merge(tmp_sum, tmp_mean, on='id', how='left')
    return tmp

def get_fea_cishu(TestA_trd_fea, df, name):  # merge count features onto the feature frame
    TestA_trd_fea_zhichu = fea_cishu(df, name + '_trd_')
    TestA_trd_fea = pd.merge(TestA_trd_fea, TestA_trd_fea_zhichu, on='id', how='left')
    return TestA_trd_fea

def get_amt_sum(TestA_trd_fea, df, name):  # merge amount features onto the feature frame
    TestA_trd_fea_zhichu = amt_sum(df, name + '_trd_amt_')
    TestA_trd_fea = pd.merge(TestA_trd_fea, TestA_trd_fea_zhichu, on='id', how='left')
    return TestA_trd_fea
def get_long_trd_fea(TestA_trd, pre_name):
    # overall transaction count per id
    TestA_trd_fea = fea_cishu(TestA_trd, pre_name + 'trd_count')
    # per-id amount totals
    tmp = amt_sum(TestA_trd, pre_name + 'trd_amt_sum')
    TestA_trd_fea = pd.merge(TestA_trd_fea, tmp, on='id', how='left')
    ## Split spending (zhichu, flag B) vs income (shouru, flag C)
    TestA_trd_zhichu = TestA_trd[TestA_trd['Dat_Flg1_Cd'] == 'B']
    TestA_trd_shouru = TestA_trd[TestA_trd['Dat_Flg1_Cd'] == 'C']
    # transaction counts
    TestA_trd_fea = get_fea_cishu(TestA_trd_fea, TestA_trd_zhichu, pre_name + 'zhichu')
    TestA_trd_fea = get_fea_cishu(TestA_trd_fea, TestA_trd_shouru, pre_name + 'shouru')
    # transaction amounts
    TestA_trd_fea = get_amt_sum(TestA_trd_fea, TestA_trd_zhichu, pre_name + 'zhichu')
    TestA_trd_fea = get_amt_sum(TestA_trd_fea, TestA_trd_shouru, pre_name + 'shouru')
    ## Split by payment method (Dat_Flg3_Cd)
    TestA_trd_A = TestA_trd[TestA_trd['Dat_Flg3_Cd'] == 'A']
    TestA_trd_B = TestA_trd[TestA_trd['Dat_Flg3_Cd'] == 'B']
    TestA_trd_C = TestA_trd[TestA_trd['Dat_Flg3_Cd'] == 'C']
    # transaction counts
    TestA_trd_fea = get_fea_cishu(TestA_trd_fea, TestA_trd_A, pre_name + 'A')
    TestA_trd_fea = get_fea_cishu(TestA_trd_fea, TestA_trd_B, pre_name + 'B')
    TestA_trd_fea = get_fea_cishu(TestA_trd_fea, TestA_trd_C, pre_name + 'C')
    # transaction amounts
    TestA_trd_fea = get_amt_sum(TestA_trd_fea, TestA_trd_A, pre_name + 'A')
    TestA_trd_fea = get_amt_sum(TestA_trd_fea, TestA_trd_B, pre_name + 'B')
    TestA_trd_fea = get_amt_sum(TestA_trd_fea, TestA_trd_C, pre_name + 'C')
    ## Split by level-1 transaction category (Trx_Cod1_Cd)
    TestA_trd_cod1_1 = TestA_trd[TestA_trd['Trx_Cod1_Cd'] == 1]
    TestA_trd_cod1_2 = TestA_trd[TestA_trd['Trx_Cod1_Cd'] == 2]
    TestA_trd_cod1_3 = TestA_trd[TestA_trd['Trx_Cod1_Cd'] == 3]
    # transaction counts
    TestA_trd_fea = get_fea_cishu(TestA_trd_fea, TestA_trd_cod1_1, pre_name + 'cod1_1')
    TestA_trd_fea = get_fea_cishu(TestA_trd_fea, TestA_trd_cod1_2, pre_name + 'cod1_2')
    TestA_trd_fea = get_fea_cishu(TestA_trd_fea, TestA_trd_cod1_3, pre_name + 'cod1_3')
    # transaction amounts
    TestA_trd_fea = get_amt_sum(TestA_trd_fea, TestA_trd_cod1_1, pre_name + 'cod1_1')
    TestA_trd_fea = get_amt_sum(TestA_trd_fea, TestA_trd_cod1_2, pre_name + 'cod1_2')
    TestA_trd_fea = get_amt_sum(TestA_trd_fea, TestA_trd_cod1_3, pre_name + 'cod1_3')
    ## Split by level-2 transaction category (the most frequent codes)
    TestA_trd_cod2_136 = TestA_trd[TestA_trd['Trx_Cod2_Cd'] == 136]
    TestA_trd_cod2_132 = TestA_trd[TestA_trd['Trx_Cod2_Cd'] == 132]
    TestA_trd_cod2_309 = TestA_trd[TestA_trd['Trx_Cod2_Cd'] == 309]
    TestA_trd_cod2_308 = TestA_trd[TestA_trd['Trx_Cod2_Cd'] == 308]
    TestA_trd_cod2_213 = TestA_trd[TestA_trd['Trx_Cod2_Cd'] == 213]
    TestA_trd_cod2_111 = TestA_trd[TestA_trd['Trx_Cod2_Cd'] == 111]
    TestA_trd_cod2_103 = TestA_trd[TestA_trd['Trx_Cod2_Cd'] == 103]
    TestA_trd_cod2_117 = TestA_trd[TestA_trd['Trx_Cod2_Cd'] == 117]
    TestA_trd_cod2_208 = TestA_trd[TestA_trd['Trx_Cod2_Cd'] == 208]
    TestA_trd_cod2_102 = TestA_trd[TestA_trd['Trx_Cod2_Cd'] == 102]
    # transaction counts
    TestA_trd_fea = get_fea_cishu(TestA_trd_fea, TestA_trd_cod2_136, pre_name + 'cod2_136')
    TestA_trd_fea = get_fea_cishu(TestA_trd_fea, TestA_trd_cod2_132, pre_name + 'cod2_132')
    TestA_trd_fea = get_fea_cishu(TestA_trd_fea, TestA_trd_cod2_309, pre_name + 'cod2_309')
    TestA_trd_fea = get_fea_cishu(TestA_trd_fea, TestA_trd_cod2_308, pre_name + 'cod2_308')
    TestA_trd_fea = get_fea_cishu(TestA_trd_fea, TestA_trd_cod2_213, pre_name + 'cod2_213')
    TestA_trd_fea = get_fea_cishu(TestA_trd_fea, TestA_trd_cod2_111, pre_name + 'cod2_111')
    TestA_trd_fea = get_fea_cishu(TestA_trd_fea, TestA_trd_cod2_103, pre_name + 'cod2_103')
    TestA_trd_fea = get_fea_cishu(TestA_trd_fea, TestA_trd_cod2_117, pre_name + 'cod2_117')
    TestA_trd_fea = get_fea_cishu(TestA_trd_fea, TestA_trd_cod2_208, pre_name + 'cod2_208')
    TestA_trd_fea = get_fea_cishu(TestA_trd_fea, TestA_trd_cod2_102, pre_name + 'cod2_102')
    # transaction amounts
    TestA_trd_fea = get_amt_sum(TestA_trd_fea, TestA_trd_cod2_136, pre_name + 'cod2_136')
    TestA_trd_fea = get_amt_sum(TestA_trd_fea, TestA_trd_cod2_132, pre_name + 'cod2_132')
    TestA_trd_fea = get_amt_sum(TestA_trd_fea, TestA_trd_cod2_309, pre_name + 'cod2_309')
    TestA_trd_fea = get_amt_sum(TestA_trd_fea, TestA_trd_cod2_308, pre_name + 'cod2_308')
    TestA_trd_fea = get_amt_sum(TestA_trd_fea, TestA_trd_cod2_213, pre_name + 'cod2_213')
    TestA_trd_fea = get_amt_sum(TestA_trd_fea, TestA_trd_cod2_111, pre_name + 'cod2_111')
    TestA_trd_fea = get_amt_sum(TestA_trd_fea, TestA_trd_cod2_103, pre_name + 'cod2_103')
    TestA_trd_fea = get_amt_sum(TestA_trd_fea, TestA_trd_cod2_117, pre_name + 'cod2_117')
    TestA_trd_fea = get_amt_sum(TestA_trd_fea, TestA_trd_cod2_208, pre_name + 'cod2_208')
    TestA_trd_fea = get_amt_sum(TestA_trd_fea, TestA_trd_cod2_102, pre_name + 'cod2_102')
    return TestA_trd_fea
def get_short_trd_fea(TestA_trd, pre_name):
    # same idea as get_long_trd_fea but with fewer splits, for the sparser sub-windows
    TestA_trd_fea = fea_cishu(TestA_trd, pre_name + 'trd_count')
    # per-id amount totals
    tmp = amt_sum(TestA_trd, pre_name + 'trd_amt_sum')
    TestA_trd_fea = pd.merge(TestA_trd_fea, tmp, on='id', how='left')
    ## Split spending vs income
    TestA_trd_zhichu = TestA_trd[TestA_trd['Dat_Flg1_Cd'] == 'B']
    TestA_trd_shouru = TestA_trd[TestA_trd['Dat_Flg1_Cd'] == 'C']
    # transaction counts
    TestA_trd_fea = get_fea_cishu(TestA_trd_fea, TestA_trd_zhichu, pre_name + 'zhichu')
    TestA_trd_fea = get_fea_cishu(TestA_trd_fea, TestA_trd_shouru, pre_name + 'shouru')
    # transaction amounts
    TestA_trd_fea = get_amt_sum(TestA_trd_fea, TestA_trd_zhichu, pre_name + 'zhichu')
    TestA_trd_fea = get_amt_sum(TestA_trd_fea, TestA_trd_shouru, pre_name + 'shouru')
    ## Split by payment method
    TestA_trd_B = TestA_trd[TestA_trd['Dat_Flg3_Cd'] == 'B']
    # transaction counts
    TestA_trd_fea = get_fea_cishu(TestA_trd_fea, TestA_trd_B, pre_name + 'B')
    # transaction amounts
    TestA_trd_fea = get_amt_sum(TestA_trd_fea, TestA_trd_B, pre_name + 'B')
    ## Split by level-1 transaction category
    TestA_trd_cod1_1 = TestA_trd[TestA_trd['Trx_Cod1_Cd'] == 1]
    TestA_trd_cod1_3 = TestA_trd[TestA_trd['Trx_Cod1_Cd'] == 3]
    # transaction counts
    TestA_trd_fea = get_fea_cishu(TestA_trd_fea, TestA_trd_cod1_1, pre_name + 'cod1_1')
    TestA_trd_fea = get_fea_cishu(TestA_trd_fea, TestA_trd_cod1_3, pre_name + 'cod1_3')
    # transaction amounts
    TestA_trd_fea = get_amt_sum(TestA_trd_fea, TestA_trd_cod1_1, pre_name + 'cod1_1')
    TestA_trd_fea = get_amt_sum(TestA_trd_fea, TestA_trd_cod1_3, pre_name + 'cod1_3')
    ## Split by level-2 transaction category
    TestA_trd_cod2_309 = TestA_trd[TestA_trd['Trx_Cod2_Cd'] == 309]
    # transaction counts
    TestA_trd_fea = get_fea_cishu(TestA_trd_fea, TestA_trd_cod2_309, pre_name + 'cod2_309')
    # transaction amounts
    TestA_trd_fea = get_amt_sum(TestA_trd_fea, TestA_trd_cod2_309, pre_name + 'cod2_309')
    return TestA_trd_fea
# Recent-window features: last 15 days / 1 month / 45 days
TestA_trd['trx_tm'] = TestA_trd['trx_tm'].apply(lambda x: parse(x))
I15day_TestA_trd = TestA_trd[TestA_trd['trx_tm']>datetime.datetime(2019, 6, 15, 23, 59, 59)]
I1month_TestA_trd = TestA_trd[TestA_trd['trx_tm']>datetime.datetime(2019, 5, 30, 23, 59, 59)]
I45day_TestA_trd = TestA_trd[TestA_trd['trx_tm']>datetime.datetime(2019, 5, 15, 23, 59, 59)]
Train_trd['trx_tm'] = Train_trd['trx_tm'].apply(lambda x: parse(x))
I15day_Train_trd = Train_trd[Train_trd['trx_tm']>datetime.datetime(2019, 6, 15, 23, 59, 59)]
I1month_Train_trd = Train_trd[Train_trd['trx_tm']>datetime.datetime(2019, 5, 30, 23, 59, 59)]
I45day_Train_trd = Train_trd[Train_trd['trx_tm']>datetime.datetime(2019, 5, 15, 23, 59, 59)]
I15day_Train_trd_fea = get_short_trd_fea(I15day_Train_trd,'I15d_')
I1month_Train_trd_fea = get_long_trd_fea(I1month_Train_trd,'I1m_')
I45day_Train_trd_fea = get_long_trd_fea(I45day_Train_trd,'I45d_')
I15day_TestA_trd_fea = get_short_trd_fea(I15day_TestA_trd,'I15d_')
I1month_TestA_trd_fea = get_long_trd_fea(I1month_TestA_trd,'I1m_')
I45day_TestA_trd_fea = get_long_trd_fea(I45day_TestA_trd,'I45d_')
# Features for four time-of-day periods
Train_trd['hour'] = Train_trd['trx_tm'].apply(lambda x: x.hour)
TestA_trd['hour'] = TestA_trd['trx_tm'].apply(lambda x: x.hour)
# use >= on period boundaries so transactions at hours 6/12/18 are not dropped
Per_1_Train_trd = Train_trd[Train_trd['hour'] < 6]
Per_2_Train_trd = Train_trd[(Train_trd['hour'] >= 6) & (Train_trd['hour'] < 12)]
Per_3_Train_trd = Train_trd[(Train_trd['hour'] >= 12) & (Train_trd['hour'] < 18)]
Per_4_Train_trd = Train_trd[Train_trd['hour'] >= 18]
Per_1_TestA_trd = TestA_trd[TestA_trd['hour'] < 6]
Per_2_TestA_trd = TestA_trd[(TestA_trd['hour'] >= 6) & (TestA_trd['hour'] < 12)]
Per_3_TestA_trd = TestA_trd[(TestA_trd['hour'] >= 12) & (TestA_trd['hour'] < 18)]
Per_4_TestA_trd = TestA_trd[TestA_trd['hour'] >= 18]
Per_1_Train_trd_fea = get_short_trd_fea(Per_1_Train_trd,'Per_1_')
Per_2_Train_trd_fea = get_short_trd_fea(Per_2_Train_trd,'Per_2_')
Per_3_Train_trd_fea = get_short_trd_fea(Per_3_Train_trd,'Per_3_')
Per_4_Train_trd_fea = get_short_trd_fea(Per_4_Train_trd,'Per_4_')
Per_1_TestA_trd_fea = get_short_trd_fea(Per_1_TestA_trd,'Per_1_')
Per_2_TestA_trd_fea = get_short_trd_fea(Per_2_TestA_trd,'Per_2_')
Per_3_TestA_trd_fea = get_short_trd_fea(Per_3_TestA_trd,'Per_3_')
Per_4_TestA_trd_fea = get_short_trd_fea(Per_4_TestA_trd,'Per_4_')
TestA_trd['weekend_bool'] = TestA_trd['trx_tm'].apply(lambda x: x.dayofweek in [5, 6])
Train_trd['weekend_bool'] = Train_trd['trx_tm'].apply(lambda x: x.dayofweek in [5, 6])
TestA_trd_weekday = TestA_trd[TestA_trd['weekend_bool'] == False]
TestA_trd_weekend = TestA_trd[TestA_trd['weekend_bool'] == True]
Train_trd_weekday = Train_trd[Train_trd['weekend_bool'] == False]
Train_trd_weekend = Train_trd[Train_trd['weekend_bool'] == True]
weekday_TestA_trd_fea = get_short_trd_fea(TestA_trd_weekday,'weekday_')
weekend_TestA_trd_fea = get_short_trd_fea(TestA_trd_weekend,'weekend_')
weekday_Train_trd_fea = get_short_trd_fea(Train_trd_weekday,'weekday_')
weekend_Train_trd_fea = get_short_trd_fea(Train_trd_weekend,'weekend_')
## Features over the full 60-day window
Train_trd_fea = get_long_trd_fea(Train_trd,'')
TestA_trd_fea = get_long_trd_fea(TestA_trd,'')
Train_data = pd.merge(Train_data,Train_trd_fea,on = 'id',how = 'left')
Train_data = pd.merge(Train_data,I1month_Train_trd_fea,on = 'id',how = 'left')
Train_data = pd.merge(Train_data,I15day_Train_trd_fea,on = 'id',how = 'left')
Train_data = pd.merge(Train_data,I45day_Train_trd_fea,on = 'id',how = 'left')
Train_data = pd.merge(Train_data,Per_1_Train_trd_fea,on = 'id',how = 'left')
Train_data = pd.merge(Train_data,Per_2_Train_trd_fea,on = 'id',how = 'left')
Train_data = pd.merge(Train_data,Per_3_Train_trd_fea,on = 'id',how = 'left')
Train_data = pd.merge(Train_data,Per_4_Train_trd_fea,on = 'id',how = 'left')
Train_data = pd.merge(Train_data,weekday_Train_trd_fea,on = 'id',how = 'left')
Train_data = pd.merge(Train_data,weekend_Train_trd_fea,on = 'id',how = 'left')
TestA_data = pd.merge(TestA_data,TestA_trd_fea,on = 'id',how = 'left')
TestA_data = pd.merge(TestA_data,I1month_TestA_trd_fea,on = 'id',how = 'left')
TestA_data = pd.merge(TestA_data,I15day_TestA_trd_fea,on = 'id',how = 'left')
TestA_data = pd.merge(TestA_data,I45day_TestA_trd_fea,on = 'id',how = 'left')
TestA_data = pd.merge(TestA_data,Per_1_TestA_trd_fea,on = 'id',how = 'left')
TestA_data = pd.merge(TestA_data,Per_2_TestA_trd_fea,on = 'id',how = 'left')
TestA_data = pd.merge(TestA_data,Per_3_TestA_trd_fea,on = 'id',how = 'left')
TestA_data = pd.merge(TestA_data,Per_4_TestA_trd_fea,on = 'id',how = 'left')
TestA_data = pd.merge(TestA_data,weekday_TestA_trd_fea,on = 'id',how = 'left')
TestA_data = pd.merge(TestA_data,weekend_TestA_trd_fea,on = 'id',how = 'left')
'''
Interval features: time gaps between consecutive transactions
'''
### Step 2 (cont.): features from the time intervals in the trd table
def get_jiange(TestA_trd):  # "jiange" = interval between consecutive transactions
    list_tmp = list(TestA_trd['trx_tm'])
    list_tmp.insert(0, list_tmp[0])
    list_tmp = list_tmp[:-1]
    TestA_trd['trx_tm_2'] = list_tmp  # previous row's timestamp
    TestA_trd['jiange_trx_tm'] = TestA_trd['trx_tm'] - TestA_trd['trx_tm_2']
    # keep only positive intervals (drops the frame's first row and zero-gap duplicates)
    TestA_trd = TestA_trd[TestA_trd['jiange_trx_tm'] > datetime.timedelta(seconds=0.01)]
    TestA_trd['jiange_trx_tm'] = TestA_trd['jiange_trx_tm'].apply(lambda x: x.days + round(x.seconds / 86400, 2))
    del TestA_trd['trx_tm_2']
    return TestA_trd
TestA_trd_jiange = get_jiange(TestA_trd)
Train_trd_jiange = get_jiange(Train_trd)
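One thing to note: get_jiange shifts over the whole sorted frame, so each user's first transaction gets paired with the previous user's last one (the positive-interval filter only hides part of that). A per-user sketch of the cleaner alternative using groupby-diff:
def get_jiange_groupby(trd):
    trd = trd.sort_values(['id', 'trx_tm']).copy()
    trd['trx_tm'] = pd.to_datetime(trd['trx_tm'])
    # diff within each id: every user's first row gets NaT instead of a cross-user gap
    trd['jiange_trx_tm'] = trd.groupby('id')['trx_tm'].diff()
    trd = trd.dropna(subset=['jiange_trx_tm'])
    trd = trd[trd['jiange_trx_tm'] > datetime.timedelta(seconds=0.01)]
    trd['jiange_trx_tm'] = trd['jiange_trx_tm'].dt.total_seconds() / 86400  # gap in days
    return trd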
def jiange_sta(df, name):  # mean/max/min of the transaction gaps per id
    tmp_mean = df.groupby(['id'], as_index=False)['jiange_trx_tm'].mean()
    tmp_mean = tmp_mean.rename(columns={'jiange_trx_tm': (name + 'jiange_mean')})
    tmp_max = df.groupby(['id'], as_index=False)['jiange_trx_tm'].max()
    tmp_max = tmp_max.rename(columns={'jiange_trx_tm': (name + 'jiange_max')})
    tmp_min = df.groupby(['id'], as_index=False)['jiange_trx_tm'].min()
    tmp_min = tmp_min.rename(columns={'jiange_trx_tm': (name + 'jiange_min')})
    tmp = pd.merge(tmp_mean, tmp_max, on='id', how='left')
    tmp = pd.merge(tmp, tmp_min, on='id', how='left')
    return tmp

def get_jiange_sta(TestA_trd_fea, df, name):  # merge gap statistics onto the feature frame
    TestA_trd_fea_zhichu = jiange_sta(df, name + '_trd_')
    TestA_trd_fea = pd.merge(TestA_trd_fea, TestA_trd_fea_zhichu, on='id', how='left')
    return TestA_trd_fea
def get_jiange_fea(TestA_trd_jiange, pre_name):
    # overall gap statistics first
    TestA_trd_jiange_fea = jiange_sta(TestA_trd_jiange, pre_name + 'trd_count')
    ## Split spending vs income
    TestA_trd_jiange_zhichu = TestA_trd_jiange[TestA_trd_jiange['Dat_Flg1_Cd'] == 'B']
    TestA_trd_jiange_shouru = TestA_trd_jiange[TestA_trd_jiange['Dat_Flg1_Cd'] == 'C']
    TestA_trd_jiange_fea = get_jiange_sta(TestA_trd_jiange_fea, TestA_trd_jiange_zhichu, pre_name + 'zhichu')
    TestA_trd_jiange_fea = get_jiange_sta(TestA_trd_jiange_fea, TestA_trd_jiange_shouru, pre_name + 'shouru')
    ## Split by payment method
    TestA_trd_jiange_A = TestA_trd_jiange[TestA_trd_jiange['Dat_Flg3_Cd'] == 'A']
    TestA_trd_jiange_B = TestA_trd_jiange[TestA_trd_jiange['Dat_Flg3_Cd'] == 'B']
    TestA_trd_jiange_C = TestA_trd_jiange[TestA_trd_jiange['Dat_Flg3_Cd'] == 'C']
    TestA_trd_jiange_fea = get_jiange_sta(TestA_trd_jiange_fea, TestA_trd_jiange_A, pre_name + 'A')
    TestA_trd_jiange_fea = get_jiange_sta(TestA_trd_jiange_fea, TestA_trd_jiange_B, pre_name + 'B')
    TestA_trd_jiange_fea = get_jiange_sta(TestA_trd_jiange_fea, TestA_trd_jiange_C, pre_name + 'C')
    ## Split by level-1 transaction category
    TestA_trd_jiange_cod1_1 = TestA_trd_jiange[TestA_trd_jiange['Trx_Cod1_Cd'] == 1]
    TestA_trd_jiange_cod1_2 = TestA_trd_jiange[TestA_trd_jiange['Trx_Cod1_Cd'] == 2]
    TestA_trd_jiange_cod1_3 = TestA_trd_jiange[TestA_trd_jiange['Trx_Cod1_Cd'] == 3]
    TestA_trd_jiange_fea = get_jiange_sta(TestA_trd_jiange_fea, TestA_trd_jiange_cod1_1, pre_name + 'cod1_1')
    TestA_trd_jiange_fea = get_jiange_sta(TestA_trd_jiange_fea, TestA_trd_jiange_cod1_2, pre_name + 'cod1_2')
    TestA_trd_jiange_fea = get_jiange_sta(TestA_trd_jiange_fea, TestA_trd_jiange_cod1_3, pre_name + 'cod1_3')
    return TestA_trd_jiange_fea
Train_trd_jiange_fea = get_jiange_fea(Train_trd_jiange,'')
TestA_trd_jiange_fea = get_jiange_fea(TestA_trd_jiange,'')
jiange_fea_cols =['id', 'trd_countjiange_mean', 'trd_countjiange_max',
'trd_countjiange_min', 'zhichu_trd_jiange_mean',
'zhichu_trd_jiange_max', 'zhichu_trd_jiange_min',
'shouru_trd_jiange_mean', 'shouru_trd_jiange_max',
'shouru_trd_jiange_min', 'B_trd_jiange_mean', 'B_trd_jiange_max',
'B_trd_jiange_min', 'cod1_1_trd_jiange_mean', 'cod1_1_trd_jiange_max',
'cod1_1_trd_jiange_min',
'cod1_3_trd_jiange_mean', 'cod1_3_trd_jiange_max',
'cod1_3_trd_jiange_min']
Train_trd_jiange_fea = Train_trd_jiange_fea[jiange_fea_cols]
TestA_trd_jiange_fea = TestA_trd_jiange_fea[jiange_fea_cols]
Train_data = pd.merge(Train_data,Train_trd_jiange_fea,on = 'id',how = 'left')
TestA_data = pd.merge(TestA_data,TestA_trd_jiange_fea,on = 'id',how = 'left')
num_trd = len(TestA_data.columns)
### Step 3: construct some features on the tag table
def jin_fea(quan_data, col_a, col_b):  # pairwise interactions between the amount-level fields
    # jia/jian/cheng/chu = add/subtract/multiply/divide
    dict_jin = {'oneyear': 'l1y_crd_card_csm_amt_dlm_cd', 'yongjiu': 'perm_crd_lmt_cd',
                'aum': 'l6mon_daim_aum_cd', 'qianli': 'pot_ast_lvl_cd'}
    quan_data['jin_' + col_a + '_jia_' + col_b] = quan_data[dict_jin[col_a]] + quan_data[dict_jin[col_b]]
    quan_data['jin_' + col_a + '_jian_' + col_b] = quan_data[dict_jin[col_a]] - quan_data[dict_jin[col_b]]
    quan_data['jin_' + col_a + '_cheng_' + col_b] = quan_data[dict_jin[col_a]] * quan_data[dict_jin[col_b]]
    quan_data['jin_' + col_a + '_chu_' + col_b] = quan_data[dict_jin[col_a]] / (quan_data[dict_jin[col_b]] + 3)  # +3 avoids division by zero
    return quan_data
def tag_fea(quan_data):
    print('tag rows:', len(quan_data))
    # Convert whatever columns can be converted to numeric
    for col in quan_data.columns:
        try:
            quan_data[col] = quan_data[col].apply(float)
        except:
            print('Cannot convert column: {}'.format(col))
    ### Within-group interactions
    quan_data['crecnt_chu_lvl'] = quan_data['cur_credit_cnt'] / (quan_data['hld_crd_card_grd_cd'] + 5)  # credit card count / grade
    #quan_data['debcnt_chu_lvl'] = quan_data['cur_credit_cnt']/(quan_data['hld_crd_card_grd_cd']+5)  # debit card count / grade
    # Card counts
    quan_data['cre_chu_deb_cnt'] = (quan_data['cur_credit_cnt'] + 1) / (quan_data['cur_debit_cnt'] + 1)  # credit cards / debit cards
    quan_data['zhang_deb_jia_cre'] = quan_data['cur_debit_cnt'] + quan_data['cur_credit_cnt']
    quan_data['zhang_deb_jian_cre'] = quan_data['cur_debit_cnt'] - quan_data['cur_credit_cnt']
    # Days since the earliest account was opened
    quan_data['tian_deb_jia_cre'] = quan_data['cur_debit_min_opn_dt_cnt'] + quan_data['cur_credit_min_opn_dt_cnt']
    quan_data['tian_deb_jian_cre'] = quan_data['cur_debit_min_opn_dt_cnt'] - quan_data['cur_credit_min_opn_dt_cnt']
    quan_data['tian_cre_chu_cre'] = quan_data['cur_credit_min_opn_dt_cnt'] / (quan_data['cur_debit_min_opn_dt_cnt'] + 3)
    # Card grades
    quan_data['deng_deb_jia_cre'] = quan_data['cur_debit_crd_lvl'] + quan_data['hld_crd_card_grd_cd']
    quan_data['deng_deb_jian_cre'] = quan_data['cur_debit_crd_lvl'] - quan_data['hld_crd_card_grd_cd']
    quan_data['deng_cre_chu_cre'] = quan_data['cur_debit_crd_lvl'] / (quan_data['hld_crd_card_grd_cd'] + 3)
    # Interactions among the amount-level fields
    quan_data = jin_fea(quan_data, 'oneyear', 'yongjiu')
    quan_data = jin_fea(quan_data, 'oneyear', 'aum')
    quan_data = jin_fea(quan_data, 'oneyear', 'qianli')
    quan_data = jin_fea(quan_data, 'yongjiu', 'aum')
    quan_data = jin_fea(quan_data, 'yongjiu', 'qianli')
    quan_data = jin_fea(quan_data, 'aum', 'qianli')
    ### Cross-group interactions
    # grade */ days, grade */ card count
    quan_data['deng_cre_tian_chu_deng'] = quan_data['cur_credit_min_opn_dt_cnt'] / (quan_data['hld_crd_card_grd_cd'] + 3)
    quan_data['deng_cre_tian_cheng_deng'] = quan_data['cur_credit_min_opn_dt_cnt'] * quan_data['hld_crd_card_grd_cd']
    quan_data['deng_deb_tian_chu_deng'] = quan_data['cur_debit_min_opn_dt_cnt'] / (quan_data['cur_debit_crd_lvl'] + 3)
    quan_data['deng_deb_tian_cheng_deng'] = quan_data['cur_debit_min_opn_dt_cnt'] * quan_data['cur_debit_crd_lvl']
    quan_data['deng_cre_zhang_cheng_deng'] = quan_data['cur_credit_cnt'] * quan_data['hld_crd_card_grd_cd']
    quan_data['deng_cre_zhang_chu_deng'] = quan_data['cur_credit_cnt'] / (quan_data['hld_crd_card_grd_cd'] + 3)
    quan_data['deng_deb_zhang_cheng_deng'] = quan_data['cur_debit_cnt'] * quan_data['hld_crd_card_grd_cd']
    quan_data['deng_deb_zhang_chu_deng'] = quan_data['cur_debit_cnt'] / (quan_data['hld_crd_card_grd_cd'] + 3)
    # 1-year credit card consumption level: grade +-*/ amount, card count +-*/ amount
    quan_data['jin_oy_cre_deng_cheng_jin'] = quan_data['hld_crd_card_grd_cd'] * quan_data['l1y_crd_card_csm_amt_dlm_cd']
    quan_data['jin_oy_cre_jin_chu_deng'] = quan_data['l1y_crd_card_csm_amt_dlm_cd'] / (quan_data['hld_crd_card_grd_cd'] + 3)
    quan_data['jin_oy_cre_deng_jia_jin'] = quan_data['hld_crd_card_grd_cd'] + quan_data['l1y_crd_card_csm_amt_dlm_cd']
    quan_data['jin_oy_cre_deng_jian_jin'] = quan_data['hld_crd_card_grd_cd'] - quan_data['l1y_crd_card_csm_amt_dlm_cd']
    quan_data['jin_oy_cre_zhang_cheng_jin'] = quan_data['cur_credit_cnt'] * quan_data['l1y_crd_card_csm_amt_dlm_cd']
    quan_data['jin_oy_cre_jin_chu_zhang'] = quan_data['l1y_crd_card_csm_amt_dlm_cd'] / (quan_data['cur_credit_cnt'] + 3)
    quan_data['jin_oy_cre_zhang_jia_jin'] = quan_data['cur_credit_cnt'] + quan_data['l1y_crd_card_csm_amt_dlm_cd']
    quan_data['jin_oy_cre_zhang_jian_jin'] = quan_data['cur_credit_cnt'] - quan_data['l1y_crd_card_csm_amt_dlm_cd']
    # Permanent credit limit level: limit +-* grade, limit +-* card count, days */ limit
    quan_data['jin_yj_cre_deng_cheng_jin'] = quan_data['hld_crd_card_grd_cd'] * quan_data['perm_crd_lmt_cd']
    quan_data['jin_yj_cre_deng_jia_jin'] = quan_data['hld_crd_card_grd_cd'] + quan_data['perm_crd_lmt_cd']
    quan_data['jin_yj_cre_deng_jian_jin'] = quan_data['hld_crd_card_grd_cd'] - quan_data['perm_crd_lmt_cd']
    quan_data['jin_yj_cre_zhang_cheng_jin'] = quan_data['cur_credit_cnt'] * quan_data['perm_crd_lmt_cd']
    quan_data['jin_yj_cre_zhang_jia_jin'] = quan_data['cur_credit_cnt'] + quan_data['perm_crd_lmt_cd']
    quan_data['jin_yj_cre_zhang_jian_jin'] = quan_data['cur_credit_cnt'] - quan_data['perm_crd_lmt_cd']
    quan_data['jin_yj_cre_tian_c***'] = quan_data['cur_debit_min_opn_dt_cnt'] / (quan_data['perm_crd_lmt_cd'] + 3)
    quan_data['jin_yj_cre_tian_cheng_jin'] = quan_data['cur_debit_min_opn_dt_cnt'] * quan_data['perm_crd_lmt_cd']
    # job_year
    quan_data['job_cre_tian_chu_job'] = quan_data['cur_credit_min_opn_dt_cnt'] / ((quan_data['job_year'] + 0.5) * 360)
    quan_data['job_deb_tian_chu_job'] = quan_data['cur_debit_min_opn_dt_cnt'] / ((quan_data['job_year'] + 0.5) * 360)
    # dnl_mbl_bnk_ind: app download, binding and card-activation indicators
    quan_data['xia_jia_huo'] = quan_data['dnl_mbl_bnk_ind'] + quan_data['crd_card_act_ind']
    quan_data['xia_jia_bang'] = quan_data['dnl_mbl_bnk_ind'] + quan_data['dnl_bind_cmb_lif_ind']
    quan_data['huo_jia_bang'] = quan_data['crd_card_act_ind'] + quan_data['dnl_bind_cmb_lif_ind']
    quan_data['xia_jia_huo_jia_bang'] = quan_data['dnl_mbl_bnk_ind'] + quan_data['crd_card_act_ind'] + quan_data['dnl_bind_cmb_lif_ind']
    # car and house ownership
    quan_data['car_jia_house'] = quan_data['hav_car_grp_ind'] + quan_data['hav_hou_grp_ind']
    print('tag rows:', len(quan_data))
    return quan_data
Train_data = tag_fea(Train_data)
TestA_data = tag_fea(TestA_data)
num_tag = len(TestA_data.columns)
for col in Train_data.columns:
    try:
        Train_data[col] = Train_data[col].apply(float)
        TestA_data[col] = TestA_data[col].apply(float)
    except:
        print('Cannot convert column: {}'.format(col))
#### Step 4: feature and label construction
### 4.1 Extract the numeric feature column names
numerical_cols = Train_data.select_dtypes(exclude = 'object').columns
print(numerical_cols)
categorical_cols = Train_data.select_dtypes(include = 'object').columns
print(categorical_cols)
# columns excluded from the feature set (flag is the label, id the key)
shaoqu_0503 = ['id', 'flag', 'ic_ind', 'fr_or_sh_ind',
               'hav_hou_grp_ind', 'cust_inv_rsk_endu_lvl_cd', 'l12mon_buy_fin_mng_whl_tms',
               'l12_mon_fnd_buy_whl_tms', 'l12_mon_insu_buy_whl_tms', 'l12_mon_gld_buy_whl_tms',
               'loan_act_ind', 'pl_crd_lmt_cd', 'ovd_30d_loan_tot_cnt', 'his_lng_ovd_day']
# columns marked for attention (defined here but not used below)
jihao = ['cur_credit_min_opn_dt_cnt', 'l1y_crd_card_csm_amt_dlm_cd', 'gdr_cd', 'edu_deg_cd']
### 4.2 Build the training and test samples
## Select the feature columns
numerical_cols = [col for col in numerical_cols if col not in shaoqu_0503]
print(numerical_cols)
categorical_cols = [col for col in categorical_cols if col not in shaoqu_0503]
print(categorical_cols)
feature_cols = numerical_cols + categorical_cols
feature_cols = [col for col in feature_cols if 'Type' not in col]
## Build training and test samples from the feature and label columns
X_data = Train_data[feature_cols].replace('\\N',-2)
Y_data = Train_data['flag']
X_test = TestA_data[feature_cols].replace('\\N',-2)
## Define a SMOTE model; random_state acts as the random seed
#smo = SMOTE(random_state=42)
#X_smo, y_smo = smo.fit_sample(X_data, Y_data)
# Type conversion so the data fits xgb
def label_encoder_tag(tag_data):
    # map the categorical codes to integers
    gdr_dict = {'F': 1, 'M': 0}
    tag_data['gdr_cd'] = tag_data['gdr_cd'].map(gdr_dict)
    mrg_situ_dict = {'A': 1, 'B': 2, 'O': 3, 'Z': 4}
    tag_data['mrg_situ_cd'] = tag_data['mrg_situ_cd'].map(mrg_situ_dict)
    edu_dict = {'A': 1, 'B': 2, 'C': 3, 'D': 4, 'F': 5, 'G': 6, 'J': 7, 'K': 8, 'L': 9, 'M': 10, 'Z': 11}
    tag_data['edu_deg_cd'] = tag_data['edu_deg_cd'].map(edu_dict)
    acdm_deg_dict = {'C': 1, 'D': 2, 'F': 3, 'G': 4, 'Z': 5, 30: 6, 31: 7}
    tag_data['acdm_deg_cd'] = tag_data['acdm_deg_cd'].map(acdm_deg_dict)
    deg_dict = {'A': 1, 'B': 2, 'C': 3, 'D': 4, 'Z': 5}
    tag_data['deg_cd'] = tag_data['deg_cd'].map(deg_dict)
    return tag_data
try:
    X_data = label_encoder_tag(X_data)
    X_test = label_encoder_tag(X_test)
except:
    print('Problem with the category conversion')
# This hand-written mapping follows the book's example code, but Python style says there should be a built-in for this. To check later.
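# (The built-in I had in mind is probably sklearn's LabelEncoder, available from the
# preprocessing module already imported above, or pd.factorize. A sketch of the same
# conversion, left commented out since the dict version above already ran; note that
# LabelEncoder assigns codes alphabetically rather than in a hand-picked order.)
#for col in ['gdr_cd', 'mrg_situ_cd', 'edu_deg_cd', 'acdm_deg_cd', 'deg_cd']:
#    le = preprocessing.LabelEncoder()
#    # fit on train+test together so both sets share one mapping
#    le.fit(pd.concat([X_data[col], X_test[col]]).astype(str))
#    X_data[col] = le.transform(X_data[col].astype(str))
#    X_test[col] = le.transform(X_test[col].astype(str))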
num_oht = len(X_test.columns)
## A small helper to print summary statistics
def Sta_inf(data):
    print('_min', np.min(data))
    print('_max:', np.max(data))
    print('_mean', np.mean(data))
    print('_ptp', np.ptp(data))  # range (max - min)
    print('_std', np.std(data))
    print('_var', np.var(data))
print('Sta of label:')
Sta_inf(Y_data)
## Plot the label histogram to check the class distribution
plt.hist(Y_data)
plt.show()
plt.close()
### 4.3 Feature selection
# Drop near-constant columns (one value covering more than 85% of rows) and high-missing columns (missing rate above 80%)
col_shan = []
def shan_que_chang(col_shan, X_test):  # collect near-constant and high-missing columns to drop
    yuzhi = 0.85          # near-constant threshold: one value covers >85% of rows
    queshi_yuzhi = 0.8    # missing-rate threshold: >80% missing
    test_hang = 4000      # rows in the B-board test set
    num = X_test.isna().sum()
    df_queshi = pd.DataFrame()
    df_queshi['col_name'] = num.index
    df_queshi['rate'] = list(num / test_hang)
    df_shan = df_queshi[df_queshi['rate'] > queshi_yuzhi]
    col_shan = col_shan + list(df_shan['col_name'])
    for col in X_test.columns:
        vc_test = pd.DataFrame(X_test[col].value_counts() / test_hang)
        if vc_test.iloc[0, 0] > yuzhi:
            col_shan.append(col)
    return col_shan
col_shan = shan_que_chang(col_shan,X_test)
X_data = X_data.drop(columns = col_shan)
X_test = X_test.drop(columns = col_shan)
## Remove collinear features
# Threshold for removing correlated variables
threshold = 0.97
# Absolute value correlation matrix
corr_matrix = X_data.corr().abs()
# Upper triangle of correlations
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))  # np.bool is deprecated; plain bool works
# Select columns with correlations above threshold
to_drop = [column for column in upper.columns if any(upper[column] > threshold)]
print('There are %d columns to remove.' % (len(to_drop)))
X_data = X_data.drop(columns = to_drop)
X_test = X_test.drop(columns = to_drop)
## Feature-importance-based selection
# Initialize an empty array to hold feature importances
feature_importances = np.zeros(X_data.shape[1])
# Create the model with several hyperparameters
model_fi = xgb.XGBRegressor(n_estimators=120, learning_rate=0.1, gamma=0, subsample=0.8,\
colsample_bytree=0.9, max_depth=7)
# Fit the model twice to avoid overfitting
for i in range(2):
    # Split into training and validation set
    train_features, valid_features, train_y, valid_y = train_test_split(X_data, Y_data, test_size=0.25, random_state=i)
    # Train using early stopping
    model_fi.fit(train_features, train_y, early_stopping_rounds=100, eval_set=[(valid_features, valid_y)],
                 eval_metric='auc', verbose=200)
    # Record the feature importances
    feature_importances += model_fi.feature_importances_
# Make sure to average feature importances!
feature_importances = feature_importances / 2
feature_importances = pd.DataFrame({'feature': list(X_data.columns), 'importance': feature_importances}).sort_values('importance', ascending = False)
# Find the features with zero importance
zero_features = list(feature_importances[feature_importances['importance'] == 0.0]['feature'])
print('There are %d features with 0.0 importance' % len(zero_features))
X_data = X_data.drop(columns = zero_features)
X_test = X_test.drop(columns = zero_features)
print(zero_features)
#### Step 5: model training and prediction
### 5.1 Five-fold cross-validation with lgb to sanity-check the parameters
## lgb model (the variable is named lgr, but this is LightGBM)
lgr = lgb.LGBMRegressor(objective='regression',num_leaves=130,learning_rate=0.05,n_estimators=150)
#,objective ='reg:squarederror'
scores_train = []
scores = []
## 5-fold cross-validation
sk = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for train_ind, val_ind in sk.split(X_data, Y_data):
    train_x = X_data.iloc[train_ind].values
    train_y = Y_data.iloc[train_ind]
    val_x = X_data.iloc[val_ind].values
    val_y = Y_data.iloc[val_ind]
    lgr.fit(train_x, train_y)
    pred_train_xgb = lgr.predict(train_x)
    pred_xgb = lgr.predict(val_x)
    score_train = mean_absolute_error(train_y, pred_train_xgb)
    scores_train.append(score_train)
    score = mean_absolute_error(val_y, pred_xgb)
    scores.append(score)
print('Train mae:', np.mean(scores_train))
print('Val mae:', np.mean(scores))
## Define the xgb and lgb model builders
def build_model_xgb(x_train, y_train, r_seed):
    model = xgb.XGBRegressor(n_estimators=150, seed=r_seed, learning_rate=0.1, gamma=0, subsample=0.8,
                             colsample_bytree=0.9, max_depth=7)  # , objective='reg:squarederror'
    model.fit(x_train, y_train)
    return model

def build_model_lgb(x_train, y_train, r_seed):
    estimator = lgb.LGBMRegressor(num_leaves=127, seed=r_seed, n_estimators=150)
    param_grid = {
        'learning_rate': [0.01, 0.05, 0.1, 0.2],
    }
    gbm = GridSearchCV(estimator, param_grid)
    gbm.fit(x_train, y_train)
    return gbm
## Split the data (train/val) for model training, evaluation and prediction
x_train,x_val,y_train,y_val = train_test_split(X_data,Y_data,test_size=0.3)
print('Train xgb...')
model_xgb = build_model_xgb(x_train,y_train,123)
val_xgb = model_xgb.predict(x_val)
MAE_xgb = mean_absolute_error(y_val,val_xgb)
print('MAE of val with xgb:',MAE_xgb)
print('Predict xgb...')
model_xgb_pre = build_model_xgb(X_data,Y_data,123)
subA_xgb = model_xgb_pre.predict(X_test)
print('Sta of Predict xgb:')
Sta_inf(subA_xgb)
print('Predict xgb...')
model_xgb_pre_1 = build_model_xgb(X_data,Y_data,1)
subA_xgb_1 = model_xgb_pre_1.predict(X_test)
print('Sta of Predict xgb:')
Sta_inf(subA_xgb_1)
print('Train lgb...')
model_lgb = build_model_lgb(x_train,y_train,123)
val_lgb = model_lgb.predict(x_val)
MAE_lgb = mean_absolute_error(y_val,val_lgb)
print('MAE of val with lgb:',MAE_lgb)
print('Predict lgb...')
model_lgb_pre = build_model_lgb(X_data,Y_data,123)
subA_lgb = model_lgb_pre.predict(X_test)
print('Sta of Predict lgb:')
Sta_inf(subA_lgb)
print('Predict lgb...')
model_lgb_pre_1 = build_model_lgb(X_data,Y_data,1)
subA_lgb_1 = model_lgb_pre_1.predict(X_test)
print('Sta of Predict lgb:')
Sta_inf(subA_lgb_1)
# plot feature importance
#plot_importance(model_xgb_pre)
#pyplot.show()
# Print how many features have non-zero importance
feature_score_dict = {}
for fn, s in zip(X_data.columns, model_xgb_pre.feature_importances_):
    feature_score_dict[fn] = s
m = 0
for k in feature_score_dict:
    if feature_score_dict[k] == 0.0:
        m += 1
print('number of non-zero features: ' + str(len(feature_score_dict) - m))
# Print the feature importances
feature_score_dict_sorted = sorted(feature_score_dict.items(),
key=lambda d: d[1], reverse=True)
print ('xgb_feature_importance:')
for ii in range(len(feature_score_dict_sorted)):
    print(feature_score_dict_sorted[ii][0], feature_score_dict_sorted[ii][1])
print('\n')
f = open('../eda/sub_0511_4_xgb_feature_importance.txt', 'w')
f.write('Rank\tFeature Name\tFeature Importance\n')
for i in range(len(feature_score_dict_sorted)):
    f.write(str(i) + '\t' + str(feature_score_dict_sorted[i][0]) + '\t' + str(feature_score_dict_sorted[i][1]) + '\n')
f.close()
## A simple MAE-weighted blend of the two models (the model with the lower validation MAE gets the larger weight), followed by rank fusion below
val_Weighted = (1 - MAE_lgb / (MAE_xgb + MAE_lgb)) * val_lgb + (1 - MAE_xgb / (MAE_xgb + MAE_lgb)) * val_xgb
val_Weighted[val_Weighted < 0] = 0  # some predictions come out negative; a default probability cannot be negative, so clip to 0
print('MAE of val with Weighted ensemble:',mean_absolute_error(y_val,val_Weighted))
sub_Weighted = (1-MAE_lgb/(MAE_xgb+MAE_lgb))*subA_lgb+(1-MAE_xgb/(MAE_xgb+MAE_lgb))*subA_xgb
tmps = pd.DataFrame()
tmps['id'] = TestA_data.id
tmps['tag_weight'] = sub_Weighted
tmps['tag_xgb'] = subA_xgb
tmps['tag_xgb_1'] = subA_xgb_1
tmps['tag_lgb'] = subA_lgb
tmps['tag_lgb_1'] = subA_lgb_1
for met in ['weight','xgb','lgb','xgb_1','lgb_1']:
tmps[met+'_rank'] = tmps['tag_'+met].rank(method = 'min')
tmps.head()
tmps['score_rank'] = tmps[[m+'_rank' for m in ['weight','xgb','lgb','xgb_1','lgb_1']]].sum(1)
max_min_scaler = lambda x:(x-np.min(x))/(np.max(x)-np.min(x))
tmps['score'] = tmps[['score_rank']].apply(max_min_scaler)
## Check the distribution of the predictions
plt.hist(sub_Weighted)
plt.show()
plt.close()
sub = pd.DataFrame()
sub['id'] = TestA_data.id
sub['tag'] = tmps['score']
sub['tag'] = sub['tag'].apply(lambda r: max(r,0))
sub['tag'] = sub['tag'].apply(lambda r: min(r,1))
sub = sub[['id','tag']]
#sub.to_csv('../result/sub_0511_ronghe_4_b.txt', sep='\t', index=False,header = None ,encoding = 'utf-8')
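# To actually write the submission in the required format (UTF-8 without BOM, tab-separated,
# no header, one user per line), uncommenting the line above is enough; a sketch with an
# extra sanity check (the output path here is just an example):
assert len(sub) == len(TestA_data)  # no missing or extra users
sub.to_csv('../result/sub_final_b.txt', sep='\t', index=False, header=None, encoding='utf-8')  # pandas 'utf-8' writes no BOM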
