Data Mining: Predicting Bank Customer Product Subscription
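The data comes from the Alibaba Tianchi learning competition: [Teaching Competition] Financial Data Analysis Task 1: Bank Customer Product Subscription Prediction.
Here is the code: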
import pandas as pd
from catboost import CatBoostClassifier
from sklearn.model_selection import train_test_split
# Load the training data and work on a copy
filename = r'train.csv'
train = pd.read_csv(filename)
data = train.copy()
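# Encode the target: yes -> 1, no -> 0
subscribe_dict = {'yes': 1, 'no': 0}
data['subscribe'] = data['subscribe'].map(subscribe_dict)
# The remaining object-typed columns are the categorical features
features_list = list(data.select_dtypes(include=['object']).columns)
# Features: everything except the first column (id) and the last column (subscribe)
X = data.iloc[:, 1:-1]
y = data.iloc[:, -1]
x_train, x_test, y_train, y_test = train_test_split(X, y, random_state=15, shuffle=True)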
model = CatBoostClassifier(iterations=400,
                           learning_rate=0.2,
                           max_depth=10,
                           loss_function='Logloss',
                           one_hot_max_size=13,
                           eval_metric='AUC')
# Keep the iteration with the best AUC on the hold-out split
model.fit(x_train, y_train,
          cat_features=features_list,
          eval_set=(x_test, y_test),
          verbose=False,
          use_best_model=True)
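# Pair each feature name with its importance
importance = list(zip(model.feature_names_, model.feature_importances_))
pred = model.predict(x_test)
# Accuracy on the training split, then on the hold-out split
print(model.score(x_train, y_train))
print(model.score(x_test, y_test))
# Feature importances, largest first
print(sorted(importance, key=lambda x: x[1], reverse=True))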
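# Predict on the competition test set and write the submission file
filename1 = r'test.csv'
test = pd.read_csv(filename1)
result = pd.DataFrame()
result['id'] = test['id']
subscribe_dict1 = {1: 'yes', 0: 'no'}
# Skip the first column (id) so the features match the training columns
pre = model.predict(test.iloc[:, 1:])
result['subscribe'] = pre
# Map the 0/1 predictions back to the no/yes labels
result['subscribe'] = result['subscribe'].map(subscribe_dict1)
result.to_csv(r'result.csv', index=False)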
Model scores (training accuracy, then hold-out accuracy) and feature importances
0.9222518518518519
0.8787555555555555
[('duration', 31.562158049271915),
('emp_var_rate', 12.698946664792766),
('month', 8.367504457082596),
('pdays', 5.792981302062174),
('campaign', 5.037045154538229),
('age', 4.735792398997339),
('lending_rate3m', 3.560042990101733),
('nr_employed', 3.1324586348858263),
('cons_conf_index', 3.1090591899823705),
('cons_price_index', 2.9085820533638276),
('previous', 2.8527892013204585),
('contact', 2.7002900783006725),
('loan', 2.568280006753603),
('marital', 2.271147360645339),
('day_of_week', 2.150930241458051),
('poutcome', 1.9447035460281297),
('default', 1.7135445811448156),
('job', 1.3418564919909857),
('housing', 0.8936975936995399),
('education', 0.6581900035796278)]
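The importances in this output add up to 100, so each value can be read as a percentage share. As a quick way to read the list, here is a minimal sketch that computes the running total for the top five features. The values are copied from the output above and rounded to two decimals; the variable names are only illustrative.

```python
from itertools import accumulate

# Top five entries from the printed importance list, rounded to two decimals.
importance = [
    ("duration", 31.56),
    ("emp_var_rate", 12.70),
    ("month", 8.37),
    ("pdays", 5.79),
    ("campaign", 5.04),
]

# Sort largest first (the printed list is already in this order).
ranked = sorted(importance, key=lambda item: item[1], reverse=True)

# Running total of the importance shares.
cumulative = list(accumulate(score for _, score in ranked))

for (name, score), running in zip(ranked, cumulative):
    print(f"{name:<14}{score:>7.2f}{running:>8.2f}")
```

Read this way, duration, emp_var_rate and month together account for a little over half of the total importance.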
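- Submission score: accuracy 0.9529, rank 116.
- There was essentially no EDA or feature engineering. The data quality is good, so I only identified the categorical variables and passed them straight to the model, which already reaches an accuracy of about 0.95.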
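- Tuning was limited to manually adjusting one_hot_max_size. CatBoost one-hot encodes only those categorical features whose number of distinct values does not exceed this parameter, and the default is small (2 for ordinary CPU training according to the CatBoost documentation). Here the feature with the most distinct values is job, with 12, and the training and test data contain the same category values. Setting the parameter to 13 one-hot encodes every categorical feature and raised accuracy from 0.9295 to 0.9529.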
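The two facts behind that setting (the largest number of distinct values, and train and test sharing the same categories) can be checked in code. Below is a minimal sketch, assuming small made-up frames in place of train.csv and test.csv. The helper name categorical_cardinality is hypothetical and not part of the original post.

```python
import pandas as pd


def categorical_cardinality(train: pd.DataFrame, test: pd.DataFrame) -> pd.DataFrame:
    """Summarise the object-typed columns that appear in both frames."""
    cat_cols = [c for c in train.select_dtypes(include=["object"]).columns
                if c in test.columns]
    rows = []
    for col in cat_cols:
        train_values = set(train[col].dropna().unique())
        test_values = set(test[col].dropna().unique())
        rows.append({
            "feature": col,
            "n_unique_train": len(train_values),
            # Categories present in test but never seen in train
            "unseen_in_test": sorted(test_values - train_values),
        })
    summary = pd.DataFrame(rows, columns=["feature", "n_unique_train", "unseen_in_test"])
    return summary.sort_values("n_unique_train", ascending=False).reset_index(drop=True)


# Toy data (assumption): stands in for the competition files.
toy_train = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "job": ["admin.", "technician", "services", "admin."],
    "marital": ["married", "single", "married", "single"],
    "subscribe": ["no", "yes", "no", "no"],
})
toy_test = pd.DataFrame({
    "id": [5, 6],
    "job": ["services", "student"],
    "marital": ["single", "married"],
})

summary = categorical_cardinality(toy_train, toy_test)
print(summary)

# Any one_hot_max_size at or above this value one-hot encodes every categorical feature.
largest_cardinality = int(summary["n_unique_train"].max())
print(largest_cardinality)
```

On the real data, passing train and test as loaded by pd.read_csv should report 12 for job and empty unseen_in_test lists if the description above holds.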