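Data source: the Alibaba Tianchi learning competition 【教学赛】金融数据分析赛题1:银行客户认购产品预测 (Teaching Competition, Financial Data Analysis Task 1: Predicting Whether Bank Customers Subscribe to a Product).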

Straight to the code:

import pandas as pd
from catboost import CatBoostClassifier
from sklearn.model_selection import train_test_split

# Load the training data and work on a copy
filename = r'train.csv'
train = pd.read_csv(filename)
data = train.copy()

# Encode the target as 0/1, then collect the categorical (object-dtype) columns
subscribe_dict = {'yes':1,'no':0}
data['subscribe'] = data['subscribe'].map(subscribe_dict)
features_list = list(data.select_dtypes(include=['object']).columns)

# Features: everything between the first column (id) and the last column (subscribe)
X = data.iloc[:,1:-1]
y = data.iloc[:,-1]
x_train,x_test,y_train,y_test = train_test_split(X,y,random_state=15,shuffle=True)

model = CatBoostClassifier(iterations=400,
                           learning_rate=0.2,
                           max_depth=10,
                           loss_function='Logloss',
                           one_hot_max_size=13,
                           eval_metric='AUC')

model.fit(x_train,y_train,cat_features=features_list,eval_set=(x_test,y_test),verbose=False,use_best_model=True)
importance = list(zip(model.feature_names_,model.feature_importances_))
pred = model.predict(x_test)
# Accuracy on the training split and on the hold-out split
print(model.score(x_train,y_train))
print(model.score(x_test,y_test))
print(sorted(importance,key=lambda x:x[1],reverse=True))


filename1 = r'test.csv'
test = pd.read_csv(filename1)
result = pd.DataFrame()
result['id'] = test['id']
subscribe_dict1 = {1:'yes',0:'no'}
pre = model.predict(test.iloc[:,1:])
result['subscribe'] = pre
result['subscribe'] = result['subscribe'].map(subscribe_dict1)
result.to_csv(r'result.csv',index=False)

Model scores and feature importances

0.9222518518518519
0.8787555555555555
[('duration', 31.562158049271915),
 ('emp_var_rate', 12.698946664792766),
 ('month', 8.367504457082596),
 ('pdays', 5.792981302062174),
 ('campaign', 5.037045154538229),
 ('age', 4.735792398997339),
 ('lending_rate3m', 3.560042990101733),
 ('nr_employed', 3.1324586348858263),
 ('cons_conf_index', 3.1090591899823705),
 ('cons_price_index', 2.9085820533638276),
 ('previous', 2.8527892013204585),
 ('contact', 2.7002900783006725),
 ('loan', 2.568280006753603),
 ('marital', 2.271147360645339),
 ('day_of_week', 2.150930241458051),
 ('poutcome', 1.9447035460281297),
 ('default', 1.7135445811448156),
 ('job', 1.3418564919909857),
 ('housing', 0.8936975936995399),
 ('education', 0.6581900035796278)]
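
The twenty importance values listed above add up to about 100, so a running total shows how much of the model's attribution the top features carry. A minimal sketch, using the first five values copied (and rounded) from the output; the variable names are illustrative, not part of the original script:

```python
import pandas as pd

# Top five importances, rounded from the printed output above
top = pd.Series({
    "duration": 31.562,
    "emp_var_rate": 12.699,
    "month": 8.368,
    "pdays": 5.793,
    "campaign": 5.037,
})

# Running total of importance, in the same units as the printed list
cumulative = top.cumsum().round(3)
print(cumulative)
```

On these numbers, the five features together account for a little under two thirds of the total importance.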
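  • Submission score: accuracy 0.9529, rank 116.
  • Almost no EDA or feature engineering was done. The data quality is good, so the categorical variables were simply identified and passed straight to the model, which already reaches an accuracy of about 0.95.
  • The only tuning was a manual adjustment of one_hot_max_size. CatBoost applies one-hot encoding to categorical features whose number of unique values is at most this setting, and the default is small (the CatBoost documentation lists 2 for ordinary CPU training). The feature with the most unique values here is job, with 12, and the training and test data contain the same set of values, so the parameter was set to 13, which raised accuracy from 0.9295 to 0.9529.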
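
The cardinality check behind the choice of 13 can be sketched roughly as follows. This is a minimal illustration on tiny made-up frames, not the competition data: the helper name `max_cardinality`, the `*_demo` frames, and their values are all assumptions. The helper selects non-numeric columns, which for this kind of data matches the object-dtype selection used in the script.

```python
import pandas as pd

def max_cardinality(df: pd.DataFrame) -> pd.Series:
    """Unique-value count per non-numeric (categorical) column, largest first."""
    cat_cols = df.select_dtypes(exclude=["number"]).columns
    return df[cat_cols].nunique().sort_values(ascending=False)

# Toy stand-ins for train.csv / test.csv (hypothetical values)
train_demo = pd.DataFrame({
    "job": ["admin.", "technician", "services", "admin."],
    "marital": ["married", "single", "married", "divorced"],
    "age": [30, 41, 25, 52],
})
test_demo = pd.DataFrame({
    "job": ["services", "admin."],
    "marital": ["single", "married"],
    "age": [33, 47],
})

card = max_cardinality(train_demo)

# Categories that occur in the test frame but never in the training frame
unseen = {c: set(test_demo[c]) - set(train_demo[c]) for c in card.index}

# Every categorical column is one-hot encoded once one_hot_max_size
# is at least the largest unique-value count
suggested = int(card.max())
print(card.to_dict(), unseen, suggested)
```

Run against the real train.csv and test.csv, the same check would be expected to report 12 for job and empty `unseen` sets, if the observations in the notes above hold.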