- November 25, 2020
- saptrxuy_learnit
- Python
Machine Learning with Credit Card Default data – Part 1
1. Introduction
2. Load data
3. Left Blank
4. Print details of loaded dataset
5. Find unique values of some columns
6. Change the columns
7. Rename pay columns
8. Set ‘default’ as target and create a new dataset X by dropping this column from the original dataset
9. Use RobustScaler to transform X
10. Create a dataset Y with only target column
11. Create train and test data
12. Create a dataframe to store result of different models
13. Create Logistic Regression Model
14. Fit the data into model
15. Update result of model into model metrics
16. Confusion Matrix
17. Implement Bagging Model
18. RandomForest Model
19. Boosting Model
20. Save Models into a file
1.Introduction
Download the data from https://raw.githubusercontent.com/MLWave/Black-Boxxy/master/credit-card-default.csv
See the implementation by Vladimir G. Drugov at: https://rstudio-pubs-static.s3.amazonaws.com/281390_8a4ea1f1d23043479814ec4a38dbbfd9.html
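2.Load data
2.1.Code
import pandas as pd

# Path to the local copy of the downloaded CSV; adjust it to your environment.
data_path = 'F:/data/input/credit_card_default.csv'
# Use the ID column as the row index.
ccdefaults = pd.read_csv(data_path, index_col="ID")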
3.Left Blank
4.Print details of loaded dataset
4.1.Code
print("\n", 50 * "-", "\n", "ccdefaults.head(10)", "\n", ccdefaults.head(10))
print("\n", 50 * "-", "\n", "ccdefaults.columns", "\n", ccdefaults.columns)
print("\n", 50 * "-", "\n", "ccdefaults.shape", "\n", ccdefaults.shape)

print("\n", 50 * "-", "\n", "\nLower the column names")

# Convert every column name to lower case.
ccdefaults.rename(columns=lambda x: x.lower(), inplace=True)
print("\n", 50 * "-", "\n", "ccdefaults.columns", "\n", ccdefaults.columns)

print("\n", 50 * "-", "\n", "\nChange the column names pay_0 and default payment next month")
# pay_0 becomes pay_1 so the pay columns run from 1 to 6; the target gets a short name.
ccdefaults.rename(columns={'pay_0':'pay_1','default payment next month':'default'}, inplace=True)
print("\n", 50 * "-", "\n", "ccdefaults.columns", "\n", ccdefaults.columns)
4.2.Output
--------------------------------------------------
ccdefaults.head(10)
    LIMIT_BAL  SEX  EDUCATION  ...  PAY_AMT5  PAY_AMT6  default payment next month
ID                             ...
1       20000    2          2  ...         0         0                           1
2      120000    2          2  ...         0      2000                           1
3       90000    2          2  ...      1000      5000                           0
4       50000    2          2  ...      1069      1000                           0
5       50000    1          2  ...       689       679                           0
6       50000    1          1  ...      1000       800                           0
7      500000    1          1  ...     13750     13770                           0
8      100000    2          2  ...      1687      1542                           0
9      140000    2          3  ...      1000      1000                           0
10      20000    1          3  ...      1122         0                           0

[10 rows x 24 columns]

--------------------------------------------------
ccdefaults.columns
Index(['LIMIT_BAL', 'SEX', 'EDUCATION', 'MARRIAGE', 'AGE', 'PAY_0', 'PAY_2',
       'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6', 'BILL_AMT1', 'BILL_AMT2',
       'BILL_AMT3', 'BILL_AMT4', 'BILL_AMT5', 'BILL_AMT6', 'PAY_AMT1',
       'PAY_AMT2', 'PAY_AMT3', 'PAY_AMT4', 'PAY_AMT5', 'PAY_AMT6',
       'default payment next month'],
      dtype='object')

--------------------------------------------------
ccdefaults.shape
(30000, 24)

--------------------------------------------------
Lower the column names

--------------------------------------------------
ccdefaults.columns
Index(['limit_bal', 'sex', 'education', 'marriage', 'age', 'pay_0', 'pay_2',
       'pay_3', 'pay_4', 'pay_5', 'pay_6', 'bill_amt1', 'bill_amt2',
       'bill_amt3', 'bill_amt4', 'bill_amt5', 'bill_amt6', 'pay_amt1',
       'pay_amt2', 'pay_amt3', 'pay_amt4', 'pay_amt5', 'pay_amt6',
       'default payment next month'],
      dtype='object')

--------------------------------------------------
Change the column names pay_0 and default payment next month

--------------------------------------------------
ccdefaults.columns
Index(['limit_bal', 'sex', 'education', 'marriage', 'age', 'pay_1', 'pay_2',
       'pay_3', 'pay_4', 'pay_5', 'pay_6', 'bill_amt1', 'bill_amt2',
       'bill_amt3', 'bill_amt4', 'bill_amt5', 'bill_amt6', 'pay_amt1',
       'pay_amt2', 'pay_amt3', 'pay_amt4', 'pay_amt5', 'pay_amt6',
       'default'],
      dtype='object')

--------------------------------------------------
5.Find unique values of some columns
5.1.Code
print("\n", "ccdefaults['education'].unique()", "\n", ccdefaults['education'].unique())
print("\n", "ccdefaults['marriage'].unique()", "\n", ccdefaults['marriage'].unique())
5.2.Output
ccdefaults['education'].unique()
[2 1 3 5 4 6 0]

ccdefaults['marriage'].unique()
[1 2 3 0]
6.Change the columns
6.1.Code
print("\n", 50 * "-", "\n", "\nTransform the values for education and marital status")

# Base values: female, other_education, not_married
# Each new column is a 0/1 indicator; rows with all indicators at 0 fall into the base category.
ccdefaults['grad_school'] = (ccdefaults['education'] == 1).astype('int')
ccdefaults['university'] = (ccdefaults['education'] == 2).astype('int')
ccdefaults['high_school'] = (ccdefaults['education'] == 3).astype('int')
ccdefaults['male'] = (ccdefaults['sex']==1).astype('int')
ccdefaults['married'] = (ccdefaults['marriage'] == 1).astype('int')

# The original coded columns are no longer needed.
ccdefaults.drop(['sex','marriage', 'education'], axis=1, inplace=True)
print("\n", 50 * "-", "\n", "ccdefaults.head(10)", "\n", ccdefaults.head(10))
print("\n", 50 * "-", "\n", "ccdefaults.columns", "\n", ccdefaults.columns)
6.2.Output
--------------------------------------------------
Transform the values for education and marital status

--------------------------------------------------
ccdefaults.head(10)
    limit_bal  age  pay_1  pay_2  ...  university  high_school  male  married
ID                                ...
1       20000   24      2      2  ...           1            0     0        1
2      120000   26     -1      2  ...           1            0     0        0
3       90000   34      0      0  ...           1            0     0        0
4       50000   37      0      0  ...           1            0     0        1
5       50000   57     -1      0  ...           1            0     1        1
6       50000   37      0      0  ...           0            0     1        0
7      500000   29      0      0  ...           0            0     1        0
8      100000   23      0     -1  ...           1            0     0        0
9      140000   28      0      0  ...           0            1     0        1
10      20000   35     -2     -2  ...           0            1     1        0

[10 rows x 26 columns]

--------------------------------------------------
ccdefaults.columns
Index(['limit_bal', 'age', 'pay_1', 'pay_2', 'pay_3', 'pay_4', 'pay_5',
       'pay_6', 'bill_amt1', 'bill_amt2', 'bill_amt3', 'bill_amt4',
       'bill_amt5', 'bill_amt6', 'pay_amt1', 'pay_amt2', 'pay_amt3',
       'pay_amt4', 'pay_amt5', 'pay_amt6', 'default', 'grad_school',
       'university', 'high_school', 'male', 'married'],
      dtype='object')
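The indicator columns leave every education code other than 1, 2 and 3 in an implicit base category. The following is a minimal sketch on a made-up toy frame (the values and the names `toy`, `indicator_sum` and `base_codes` are assumptions for illustration, not part of the original script) showing which codes end up there.

```python
import pandas as pd

# Toy stand-in for the real data; the values are made up for illustration.
toy = pd.DataFrame({"education": [1, 2, 3, 0, 4, 5, 6, 2]})

# Same pattern as the listing: one 0/1 indicator per named education level.
toy["grad_school"] = (toy["education"] == 1).astype("int")
toy["university"] = (toy["education"] == 2).astype("int")
toy["high_school"] = (toy["education"] == 3).astype("int")

# Rows where all three indicators are 0 form the implicit base category.
indicator_sum = toy[["grad_school", "university", "high_school"]].sum(axis=1)
base_codes = sorted(toy.loc[indicator_sum == 0, "education"].unique().tolist())
print(base_codes)  # [0, 4, 5, 6]
```

Because the indicators are mutually exclusive, at most one of them is 1 in any row, which is why no separate column for the base category is needed.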
7.Rename pay columns
7.1.Code
print("\n", 50 * "-", "\n", "\nLower the column names for pay delay")

# For pay_i features: if >0 then it means the customer was delayed i months ago
pay_features = ['pay_' + str(i) for i in range(1,7)]
for p in pay_features:
    # Replace each pay_i column with a 0/1 flag; all other columns stay unchanged.
    ccdefaults[p] = (ccdefaults[p] > 0).astype(int)

print("\n", 50 * "-", "\n", "pay_features", "\n", pay_features)
print("\n", 50 * "-", "\n", "ccdefaults.head(10)", "\n", ccdefaults.head(10))
print("\n", 50 * "-", "\n", "ccdefaults.columns", "\n", ccdefaults.columns)
7.2.Output
Lower the column names for pay delay

--------------------------------------------------
pay_features
['pay_1', 'pay_2', 'pay_3', 'pay_4', 'pay_5', 'pay_6']

--------------------------------------------------
ccdefaults.head(10)
    limit_bal  age  pay_1  pay_2  ...  university  high_school  male  married
ID                                ...
1       20000   24      1      1  ...           1            0     0        1
2      120000   26      0      1  ...           1            0     0        0
3       90000   34      0      0  ...           1            0     0        0
4       50000   37      0      0  ...           1            0     0        1
5       50000   57      0      0  ...           1            0     1        1
6       50000   37      0      0  ...           0            0     1        0
7      500000   29      0      0  ...           0            0     1        0
8      100000   23      0      0  ...           1            0     0        0
9      140000   28      0      0  ...           0            1     0        1
10      20000   35      0      0  ...           0            1     1        0

[10 rows x 26 columns]

--------------------------------------------------
ccdefaults.columns
Index(['limit_bal', 'age', 'pay_1', 'pay_2', 'pay_3', 'pay_4', 'pay_5',
       'pay_6', 'bill_amt1', 'bill_amt2', 'bill_amt3', 'bill_amt4',
       'bill_amt5', 'bill_amt6', 'pay_amt1', 'pay_amt2', 'pay_amt3',
       'pay_amt4', 'pay_amt5', 'pay_amt6', 'default', 'grad_school',
       'university', 'high_school', 'male', 'married'],
      dtype='object')

--------------------------------------------------
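The output above shows that only the pay_i columns turn into 0/1 flags while limit_bal and age keep their values, so the comparison has to be applied per column. A small sketch on a made-up frame (all values and the names `toy`, `per_column`, `vectorised` and `whole_frame` are hypothetical) contrasts column-wise assignment with comparing the whole frame at once.

```python
import pandas as pd

# Made-up rows; only the column names mirror the dataset.
toy = pd.DataFrame({
    "limit_bal": [20000, 120000, 90000],
    "pay_1": [2, -1, 0],
    "pay_2": [2, 2, -2],
})
pay_features = ["pay_1", "pay_2"]

# Column-wise: only the listed columns become 0/1 delay flags.
per_column = toy.copy()
for p in pay_features:
    per_column[p] = (per_column[p] > 0).astype(int)

# Equivalent form that handles all selected columns in one assignment.
vectorised = toy.copy()
vectorised[pay_features] = (vectorised[pay_features] > 0).astype(int)

# Comparing the whole frame instead flattens every column, limit_bal included.
whole_frame = (toy > 0).astype(int)

print(per_column)
print(whole_frame)
```

In `whole_frame` the credit limit is reduced to 1 for every row, which would discard a feature the later models rely on.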
8.Set ‘default’ as target and create a new dataset X by dropping this column from the original dataset
8.1.Code
target_name = 'default'
# X holds every column except the target.
X = ccdefaults.drop(target_name, axis=1)
print("\n", "X.head(10)", "\n", X.head(10))
print("\n", "X.columns", "\n", X.columns)
9.Use RobustScaler to transform X
9.1.Code
from sklearn.preprocessing import RobustScaler

robust_scaler = RobustScaler()
# Keep the column names, because fit_transform returns a plain NumPy array.
feature_names = X.columns
X = robust_scaler.fit_transform(X)

print("\n", "after robust_scaler.fit_transform(X), X is as follows:", "\n", X)
9.2.Output
X.head(10)
    limit_bal  age  pay_1  pay_2  ...  university  high_school  male  married
ID                                ...
1       20000   24      1      1  ...           1            0     0        1
2      120000   26      0      1  ...           1            0     0        0
3       90000   34      0      0  ...           1            0     0        0
4       50000   37      0      0  ...           1            0     0        1
5       50000   57      0      0  ...           1            0     1        1
6       50000   37      0      0  ...           0            0     1        0
7      500000   29      0      0  ...           0            0     1        0
8      100000   23      0      0  ...           1            0     0        0
9      140000   28      0      0  ...           0            1     0        1
10      20000   35      0      0  ...           0            1     1        0

[10 rows x 25 columns]

X.columns
Index(['limit_bal', 'age', 'pay_1', 'pay_2', 'pay_3', 'pay_4', 'pay_5',
       'pay_6', 'bill_amt1', 'bill_amt2', 'bill_amt3', 'bill_amt4',
       'bill_amt5', 'bill_amt6', 'pay_amt1', 'pay_amt2', 'pay_amt3',
       'pay_amt4', 'pay_amt5', 'pay_amt6', 'grad_school', 'university',
       'high_school', 'male', 'married'],
      dtype='object')

after robust_scaler.fit_transform(X), X is as follows:
[[-0.63157895 -0.76923077  1.         ...  0.          0.          1.        ]
 [-0.10526316 -0.61538462  0.         ...  0.          0.          0.        ]
 [-0.26315789  0.          0.         ...  0.          0.          0.        ]
 ...
 [-0.57894737  0.23076923  1.         ...  0.          1.          0.        ]
 [-0.31578947  0.53846154  1.         ...  1.          1.          1.        ]
 [-0.47368421  0.92307692  0.         ...  0.          1.          1.        ]]
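With its default settings, RobustScaler subtracts each column's median and divides by its interquartile range (75th minus 25th percentile). The sketch below reproduces that by hand on five made-up credit limits and compares the result with the scaler. The values are hypothetical, chosen so that the median is 140000 and the IQR is 190000, which is consistent with the scaled limit_bal entries printed above (20000 maps to about -0.6316 and 120000 to about -0.1053).

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Five made-up credit limits in a single column (hypothetical values).
values = np.array([[20000.0], [50000.0], [140000.0], [240000.0], [500000.0]])

# Manual version: centre on the median, scale by the interquartile range.
median = np.median(values, axis=0)
q1, q3 = np.percentile(values, [25, 75], axis=0)
manual = (values - median) / (q3 - q1)

# Library version with default settings.
scaled = RobustScaler().fit_transform(values)

print(manual.ravel())
print(scaled.ravel())
```

Unlike mean and standard deviation, the median and IQR barely move when a few extreme values are present, which suits skewed columns such as bill and payment amounts.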
10.Create a dataset Y with only target column
10.1.Code
y = ccdefaults[target_name]
print("\n", "y", "\n", y.head(10))
10.2.Output
y
ID
1     1
2     1
3     0
4     0
5     0
6     0
7     0
8     0
9     0
10    0
Name: default, dtype: int64
11.Create train and test data
11.1.Code
from sklearn.model_selection import train_test_split

# Hold out 15% of the rows for testing; stratify=y keeps the default rate similar in both parts.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15, random_state=55, stratify=y)

print("\n", "X_train", "\n", X_train)
print("\n", "X_test", "\n", X_test)

print("\n", "y_train", "\n", y_train.head(10))
print("\n", "y_test", "\n", y_test.head(10))
11.2.Output
X_train
[[-0.31578947 -0.69230769  1.         ...  0.          0.          0.        ]
 [-0.47368421 -0.92307692  0.         ...  0.          1.          0.        ]
 [ 0.31578947  0.38461538  0.         ...  0.          0.          1.        ]
 ...
 [-0.47368421 -0.92307692  0.         ...  0.          1.          0.        ]
 [-0.63157895  0.          0.         ...  0.          1.          0.        ]
 [ 1.89473684  1.15384615  0.         ...  0.          1.          1.        ]]

X_test
[[ 0.36842105 -0.53846154  0.         ...  0.          0.          0.        ]
 [-0.47368421 -0.84615385  0.         ...  0.          0.          0.        ]
 [ 0.42105263 -0.61538462  0.         ...  0.          0.          0.        ]
 ...
 [ 0.         -0.46153846  0.         ...  0.          1.          0.        ]
 [-0.63157895 -0.69230769  0.         ...  0.          1.          0.        ]
 [ 0.63157895  1.53846154  0.         ...  1.          0.          1.        ]]

y_train
ID
18737    1
23949    0
12307    0
4023     0
27774    1
14158    0
3247     0
5478     0
12982    0
29966    0
Name: default, dtype: int64

y_test
ID
18786    0
3878     0
27816    0
29680    0
19370    0
8996     0
23983    0
11830    1
16718    1
25556    0
Name: default, dtype: int64
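The script scales X before splitting, so the scaler's medians and interquartile ranges are computed with the test rows included. A common alternative, sketched here on synthetic data, is to split first and let a Pipeline fit the scaler on the training rows only. The data and the names `X_demo`, `y_demo` and `model` are assumptions for illustration, not part of the original script.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler

# Synthetic stand-in data with an imbalanced binary target.
rng = np.random.default_rng(55)
X_demo = rng.normal(size=(400, 5))
y_demo = (X_demo[:, 0] + rng.normal(scale=0.5, size=400) > 0.8).astype(int)

# Split the raw features first, with the same settings as the listing.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.15, random_state=55, stratify=y_demo
)

# fit() learns the scaling statistics from X_tr only;
# score() reuses those statistics to transform X_te.
model = make_pipeline(
    RobustScaler(),
    LogisticRegression(solver="liblinear", random_state=55),
)
model.fit(X_tr, y_tr)
test_accuracy = model.score(X_te, y_te)

# Stratification keeps the positive-class share close in both parts.
train_share = y_tr.mean()
test_share = y_te.mean()
print(test_accuracy, train_share, test_share)
```

On a dataset of 30000 rows the difference in the fitted statistics is probably small, but the pipeline form also keeps the scaler and the model together when they are saved or reused.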
12.Create a dataframe to store result of different models
12.1.Code
print("\n", 50 * "-", "\nCreating Data Frame Evaluation Matrix")
# Data frame for evaluation metrics
metrics = pd.DataFrame(index=['accuracy', 'precision' ,'recall'],
                       columns=['LogisticReg', 'Bagging', 'RandomForest', 'Boosting'])

print("\n", "metrics:", "\n", metrics)
12.2.Output
Creating Data Frame Evaluation Matrix

metrics:
          LogisticReg Bagging RandomForest Boosting
accuracy          NaN     NaN          NaN      NaN
precision         NaN     NaN          NaN      NaN
recall            NaN     NaN          NaN      NaN

--------------------------------------------------
13.Create Logistic Regression Model
13.1.Code
print("\n", "import LogisticRegression", "\n", 50 * "-")
from sklearn.linear_model import LogisticRegression

print("\n", "create an instance of LogisticRegression", "\n", 50 * "-")
logistic_regression = LogisticRegression(solver='liblinear', random_state=55)
print("logistic_regression", "\n", 50 * "-", "\n", logistic_regression)
13.2.Output
import LogisticRegression
--------------------------------------------------

create an instance of LogisticRegression
--------------------------------------------------
logistic_regression
--------------------------------------------------
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=55, solver='liblinear', tol=0.0001, verbose=0,
                   warm_start=False)
14.Fit the data into model
14.1.Code
print("\n", "Use the training data to train the estimator")
logistic_regression.fit(X_train, y_train)
# fit() does not modify its inputs; the prints below show the same training data as before.
print("\n", "After training, X_train", "\n", 50 * "-", "\n", X_train)
print("\n", "After training, y_train", "\n", 50 * "-", "\n", y_train)
14.2.Output
Use the training data to train the estimator

After training, X_train
--------------------------------------------------
[[-0.31578947 -0.69230769  1.         ...  0.          0.          0.        ]
 [-0.47368421 -0.92307692  0.         ...  0.          1.          0.        ]
 [ 0.31578947  0.38461538  0.         ...  0.          0.          1.        ]
 ...
 [-0.47368421 -0.92307692  0.         ...  0.          1.          0.        ]
 [-0.63157895  0.          0.         ...  0.          1.          0.        ]
 [ 1.89473684  1.15384615  0.         ...  0.          1.          1.        ]]

After training, y_train
--------------------------------------------------
ID
18737    1
23949    0
12307    0
4023     0
27774    1
        ..
9827     1
21215    0
23936    0
11066    0
29735    1
Name: default, Length: 25500, dtype: int64
15.Update result of model into model metrics
15.1.Code
from sklearn.metrics import accuracy_score, precision_score, recall_score, confusion_matrix, precision_recall_curve

print("\n", "Evaluate Model and update metrics")
# Predict on the held-out test set and store the three scores in the LogisticReg column.
y_pred_test = logistic_regression.predict(X_test)
metrics.loc['accuracy','LogisticReg'] = accuracy_score(y_pred=y_pred_test, y_true=y_test)
metrics.loc['precision','LogisticReg'] = precision_score(y_pred=y_pred_test, y_true=y_test)
metrics.loc['recall','LogisticReg'] = recall_score(y_pred=y_pred_test, y_true=y_test)

print("\n", "metrics", "\n", 50 * "-", "\n", metrics)
15.2.Output
Evaluate Model and update metrics

metrics
--------------------------------------------------
          LogisticReg Bagging RandomForest Boosting
accuracy     0.805778     NaN          NaN      NaN
precision    0.620758     NaN          NaN      NaN
recall       0.312563     NaN          NaN      NaN
16.Confusion Matrix
16.1.Code
print("\n", "Confusion matrix")
# Rows are the true classes, columns the predicted classes.
CM = confusion_matrix(y_pred=y_pred_test, y_true=y_test)
print("\n", "confusion_matrix", "\n", 50 * "-", "\n", CM)


def CMatrix(CM, labels=['pay','default']):
    # Wrap the raw matrix in a labelled DataFrame and add row and column totals.
    df = pd.DataFrame(data=CM, index=labels, columns=labels)
    df.index.name='TRUE'
    df.columns.name='PREDICTION'
    df.loc['Total'] = df.sum()
    df['Total'] = df.sum(axis=1)
    return df

CMatrix(CM)

print("\n", "Print more informative confusion matrix", "\n", 50 * "-", "\n", CMatrix(CM))
16.2.Output
confusion_matrix
--------------------------------------------------
[[3315  190]
 [ 684  311]]

Print more informative confusion matrix
--------------------------------------------------
PREDICTION   pay  default  Total
TRUE
pay         3315      190   3505
default      684      311    995
Total       3999      501   4500
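The three scores stored for logistic regression can be recomputed from the four counts in this confusion matrix. The short sketch below does the arithmetic by hand; the variable names are chosen for illustration, and the counts are copied from the output above.

```python
# Counts copied from the logistic regression confusion matrix above
# (rows are true classes, columns are predictions; "default" is the positive class).
tn, fp = 3315, 190
fn, tp = 684, 311

total = tn + fp + fn + tp
accuracy = (tp + tn) / total   # share of all predictions that are correct
precision = tp / (tp + fp)     # share of predicted defaults that really defaulted
recall = tp / (tp + fn)        # share of actual defaults that were caught

print(round(accuracy, 6), round(precision, 6), round(recall, 6))
# 0.805778 0.620758 0.312563
```

The values agree with the LogisticReg column of the metrics table. The low recall means that 684 of the 995 actual defaults in the test set were predicted as paying.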
17.Implement Bagging Model
17.1.Code
print("\n", "import BaggingClassifier")
from sklearn.ensemble import BaggingClassifier

print("create instance of LogisticRegression and BaggingClassifier")
log_reg_for_bagging = LogisticRegression(solver = 'liblinear')
# Note: scikit-learn 1.2 renamed base_estimator to estimator, and 1.4 removed the old name.
bagging = BaggingClassifier(base_estimator=log_reg_for_bagging, n_estimators=10,
                            random_state=55, n_jobs=-1)

print("log_reg_for_bagging", "\n", 50 * "-", "\n", log_reg_for_bagging)
print("bagging", "\n", 50 * "-", "\n", bagging)

print("\n", "Use the training data to train the estimator")
bagging.fit(X_train, y_train)
print("\n", "After training, X_train", "\n", 50 * "-", "\n", X_train)
print("\n", "After training, y_train", "\n", 50 * "-", "\n", y_train)

print("\n", "Evaluate Model")
y_pred_test = bagging.predict(X_test)
metrics.loc['accuracy','Bagging'] = accuracy_score(y_pred=y_pred_test, y_true=y_test)
metrics.loc['precision','Bagging'] = precision_score(y_pred=y_pred_test, y_true=y_test)
metrics.loc['recall','Bagging'] = recall_score(y_pred=y_pred_test, y_true=y_test)
print("\n", "models after evaluating", "\n", 50 * "-", "\n", metrics)

#Confusion matrix
print("\n", "Confusion matrix")
CM = confusion_matrix(y_pred=y_pred_test, y_true=y_test)
print("\n", "confusion_matrix", "\n", 50 * "-", "\n", CM)

CMatrix(CM)

print("\n", "Print more informative confusion matrix", "\n", 50 * "-", "\n", CMatrix(CM))
17.2.Output
import BaggingClassifier
create instance of LogisticRegression and BaggingClassifier
log_reg_for_bagging
--------------------------------------------------
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=None, solver='liblinear', tol=0.0001, verbose=0,
                   warm_start=False)
bagging
--------------------------------------------------
BaggingClassifier(base_estimator=LogisticRegression(C=1.0, class_weight=None,
                                                    dual=False,
                                                    fit_intercept=True,
                                                    intercept_scaling=1,
                                                    l1_ratio=None, max_iter=100,
                                                    multi_class='warn',
                                                    n_jobs=None, penalty='l2',
                                                    random_state=None,
                                                    solver='liblinear',
                                                    tol=0.0001, verbose=0,
                                                    warm_start=False),
                  bootstrap=True, bootstrap_features=False, max_features=1.0,
                  max_samples=1.0, n_estimators=10, n_jobs=-1, oob_score=False,
                  random_state=55, verbose=0, warm_start=False)

Use the training data to train the estimator

After training, X_train
--------------------------------------------------
[[-0.31578947 -0.69230769  1.         ...  0.          0.          0.        ]
 [-0.47368421 -0.92307692  0.         ...  0.          1.          0.        ]
 [ 0.31578947  0.38461538  0.         ...  0.          0.          1.        ]
 ...
 [-0.47368421 -0.92307692  0.         ...  0.          1.          0.        ]
 [-0.63157895  0.          0.         ...  0.          1.          0.        ]
 [ 1.89473684  1.15384615  0.         ...  0.          1.          1.        ]]

After training, y_train
--------------------------------------------------
ID
18737    1
23949    0
12307    0
4023     0
27774    1
        ..
9827     1
21215    0
23936    0
11066    0
29735    1
Name: default, Length: 25500, dtype: int64

Evaluate Model

models after evaluating
--------------------------------------------------
          LogisticReg   Bagging RandomForest Boosting
accuracy     0.805778  0.805333          NaN      NaN
precision    0.620758  0.617822          NaN      NaN
recall       0.312563  0.313568          NaN      NaN

Confusion matrix

confusion_matrix
--------------------------------------------------
[[3312  193]
 [ 683  312]]

Print more informative confusion matrix
--------------------------------------------------
PREDICTION   pay  default  Total
TRUE
pay         3312      193   3505
default      683      312    995
Total       3995      505   4500
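The printed estimator shows a `base_estimator` argument, which matches the older scikit-learn release used for this run. Newer releases name the same argument `estimator`. A minimal sketch of one way to stay compatible with both is shown below; the helper `make_bagging` and the synthetic data are hypothetical and not part of the original script.

```python
import inspect

from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.linear_model import LogisticRegression


def make_bagging(base_model, **kwargs):
    """Hypothetical helper: pass the base model under the keyword this version accepts."""
    params = inspect.signature(BaggingClassifier.__init__).parameters
    keyword = "estimator" if "estimator" in params else "base_estimator"
    return BaggingClassifier(**{keyword: base_model}, **kwargs)


# Synthetic stand-in data, only to show that the ensemble fits and predicts.
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=55)

bagging_demo = make_bagging(
    LogisticRegression(solver="liblinear"), n_estimators=10, random_state=55
)
bagging_demo.fit(X_demo, y_demo)
demo_predictions = bagging_demo.predict(X_demo)
print(len(bagging_demo.estimators_), demo_predictions.shape)
```

Each of the ten fitted members is a logistic regression trained on a bootstrap sample of the rows, and the ensemble combines their predictions.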
18.RandomForest Model
18.1.Code
print("\n", "import RandomForestClassifier", "\n", 50 * "-")
from sklearn.ensemble import RandomForestClassifier

print("create instance of RandomForestClassifier")
RF = RandomForestClassifier(n_estimators=35, max_depth=20, random_state=55, max_features='sqrt',
                            n_jobs=-1)

print("RF", "\n", 50 * "-", "\n", RF)

print("\n", "Use the training data to train the estimator")
RF.fit(X_train, y_train)
print("\n", "After training, X_train", "\n", 50 * "-", "\n", X_train)
print("\n", "After training, y_train", "\n", 50 * "-", "\n", y_train)

print("\n", "Evaluate Model")
y_pred_test = RF.predict(X_test)
metrics.loc['accuracy','RandomForest'] = accuracy_score(y_pred=y_pred_test, y_true=y_test)
metrics.loc['precision','RandomForest'] = precision_score(y_pred=y_pred_test, y_true=y_test)
metrics.loc['recall','RandomForest'] = recall_score(y_pred=y_pred_test, y_true=y_test)
print("\n", "metrics after evaluating", "\n", 50 * "-", "\n", metrics)

#Confusion matrix
print("\n", "Confusion matrix")
CM = confusion_matrix(y_pred=y_pred_test, y_true=y_test)
print("\n", "confusion_matrix", "\n", 50 * "-", "\n", CM)

CMatrix(CM)

print("\n", "Print more informative confusion matrix", "\n", 50 * "-", "\n", CMatrix(CM))
18.2.Output
import RandomForestClassifier
--------------------------------------------------
create instance of RandomForestClassifier
RF
--------------------------------------------------
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                       max_depth=20, max_features='sqrt', max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=35, n_jobs=-1,
                       oob_score=False, random_state=55, verbose=0,
                       warm_start=False)

Use the training data to train the estimator

After training, X_train
--------------------------------------------------
[[-0.31578947 -0.69230769  1.         ...  0.          0.          0.        ]
 [-0.47368421 -0.92307692  0.         ...  0.          1.          0.        ]
 [ 0.31578947  0.38461538  0.         ...  0.          0.          1.        ]
 ...
 [-0.47368421 -0.92307692  0.         ...  0.          1.          0.        ]
 [-0.63157895  0.          0.         ...  0.          1.          0.        ]
 [ 1.89473684  1.15384615  0.         ...  0.          1.          1.        ]]

After training, y_train
--------------------------------------------------
ID
18737    1
23949    0
12307    0
4023     0
27774    1
        ..
9827     1
21215    0
23936    0
11066    0
29735    1
Name: default, Length: 25500, dtype: int64

Evaluate Model

metrics after evaluating
--------------------------------------------------
          LogisticReg   Bagging RandomForest Boosting
accuracy     0.805778  0.805333     0.810222      NaN
precision    0.620758  0.617822      0.61809      NaN
recall       0.312563  0.313568     0.370854      NaN

Confusion matrix

confusion_matrix
--------------------------------------------------
[[3277  228]
 [ 626  369]]

Print more informative confusion matrix
--------------------------------------------------
PREDICTION   pay  default  Total
TRUE
pay         3277      228   3505
default      626      369    995
Total       3903      597   4500
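The script stores `feature_names = X.columns` before scaling but does not use it afterwards. One possible use, sketched here on synthetic data, is to label a fitted forest's `feature_importances_`. The data and the names `demo_names`, `forest` and `importances` are assumptions for illustration; with the real model, `RF` and `feature_names` would take their place.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data with six features.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=6, n_informative=3, random_state=55
)
demo_names = [f"feature_{i}" for i in range(X_demo.shape[1])]

# Same hyperparameters as the listing, without n_jobs.
forest = RandomForestClassifier(
    n_estimators=35, max_depth=20, max_features="sqrt", random_state=55
)
forest.fit(X_demo, y_demo)

# Pair each importance with its column name and sort from most to least important.
importances = pd.Series(forest.feature_importances_, index=demo_names)
importances = importances.sort_values(ascending=False)
print(importances)
```

The importances sum to 1, so each value can be read as that feature's share of the forest's total impurity reduction.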
19.Boosting Model
19.1.Code
print("\n", "import AdaBoostClassifier", "\n", 50 * "-")
from sklearn.ensemble import AdaBoostClassifier

print("create instance of AdaBoostClassifier")
boosting = AdaBoostClassifier(n_estimators=50, learning_rate=0.1, random_state=55)
print("boosting", "\n", 50 * "-", "\n", boosting)

print("\n", "Use the training data to train the estimator")
boosting.fit(X_train, y_train)
print("\n", "After training, X_train", "\n", 50 * "-", "\n", X_train)
print("\n", "After training, y_train", "\n", 50 * "-", "\n", y_train)

print("\n", "Evaluate Model")
y_pred_test = boosting.predict(X_test)
metrics.loc['accuracy','Boosting'] = accuracy_score(y_pred=y_pred_test, y_true=y_test)
metrics.loc['precision','Boosting'] = precision_score(y_pred=y_pred_test, y_true=y_test)
metrics.loc['recall','Boosting'] = recall_score(y_pred=y_pred_test, y_true=y_test)
print("\n", "metrics after evaluating", "\n", 50 * "-", "\n", metrics)

#Confusion matrix
print("\n", "Confusion matrix")
CM = confusion_matrix(y_pred=y_pred_test, y_true=y_test)
print("\n", "confusion_matrix", "\n", 50 * "-", "\n", CM)

CMatrix(CM)

print("\n", "Print more informative confusion matrix", "\n", 50 * "-", "\n", CMatrix(CM))
19.2.Output
import AdaBoostClassifier
--------------------------------------------------
create instance of AdaBoostClassifier
boosting
--------------------------------------------------
AdaBoostClassifier(algorithm='SAMME.R', base_estimator=None, learning_rate=0.1,
                   n_estimators=50, random_state=55)

Use the training data to train the estimator

After training, X_train
--------------------------------------------------
[[-0.31578947 -0.69230769  1.         ...  0.          0.          0.        ]
 [-0.47368421 -0.92307692  0.         ...  0.          1.          0.        ]
 [ 0.31578947  0.38461538  0.         ...  0.          0.          1.        ]
 ...
 [-0.47368421 -0.92307692  0.         ...  0.          1.          0.        ]
 [-0.63157895  0.          0.         ...  0.          1.          0.        ]
 [ 1.89473684  1.15384615  0.         ...  0.          1.          1.        ]]

After training, y_train
--------------------------------------------------
ID
18737    1
23949    0
12307    0
4023     0
27774    1
        ..
9827     1
21215    0
23936    0
11066    0
29735    1
Name: default, Length: 25500, dtype: int64

Evaluate Model

metrics after evaluating
--------------------------------------------------
          LogisticReg   Bagging RandomForest  Boosting
accuracy     0.805778  0.805333     0.810222     0.804
precision    0.620758  0.617822      0.61809  0.631702
recall       0.312563  0.313568     0.370854  0.272362

Confusion matrix

confusion_matrix
--------------------------------------------------
[[3347  158]
 [ 724  271]]

Print more informative confusion matrix
--------------------------------------------------
PREDICTION   pay  default  Total
TRUE
pay         3347      158   3505
default      724      271    995
Total       4071      429   4500

--------------------------------------------------
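In the final table, every model's recall (0.27 to 0.37) is well below its precision (about 0.62 to 0.63). The script imports `precision_recall_curve` but never calls it. The idea behind that curve can be sketched without any library: lowering the probability cut-off for predicting a default trades precision for recall. The scores and labels below are made up for illustration, and `precision_recall_at` is a hypothetical helper.

```python
# Hypothetical predicted default probabilities and true labels (1 = default).
scores = [0.05, 0.10, 0.20, 0.30, 0.35, 0.45, 0.55, 0.60, 0.80, 0.90]
labels = [0, 0, 0, 1, 0, 1, 0, 1, 1, 1]


def precision_recall_at(threshold):
    """Predict default when the score reaches the threshold, then score the result."""
    predicted = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for p, y in zip(predicted, labels) if p == 1 and y == 1)
    fp = sum(1 for p, y in zip(predicted, labels) if p == 1 and y == 0)
    fn = sum(1 for p, y in zip(predicted, labels) if p == 0 and y == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


default_cut = precision_recall_at(0.5)  # (0.75, 0.6)
lower_cut = precision_recall_at(0.3)    # recall rises to 1.0, precision falls to 5/7
print(default_cut, lower_cut)
```

With real models, the scores would come from `predict_proba(X_test)[:, 1]`, and the cut-off would be chosen according to how costly a missed default is compared with a false alarm.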
20.Save Models into a file
20.1.Code
import pickle

filename = '../data/all_metrics.sav'

print("\n", 50 * "-", "\nDumping metrics to", filename)

# This saves the metrics DataFrame; the with block closes the file once writing is done.
with open(filename, 'wb') as metrics_file:
    pickle.dump(metrics, metrics_file)
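To read the saved metrics back later, the file can be opened in binary mode and passed to `pickle.load`. The sketch below shows a round trip in a temporary directory so that it runs anywhere. The frame `demo_metrics` is a hypothetical stand-in that reuses the LogisticReg values from the table above, and the temporary path is an assumption, not the script's `../data/all_metrics.sav`.

```python
import os
import pickle
import tempfile

import pandas as pd

# Stand-in for the metrics DataFrame, using the LogisticReg scores shown earlier.
demo_metrics = pd.DataFrame(
    {"LogisticReg": [0.805778, 0.620758, 0.312563]},
    index=["accuracy", "precision", "recall"],
)

with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "all_metrics.sav")

    # Write the frame; the with block closes the file afterwards.
    with open(demo_path, "wb") as fh:
        pickle.dump(demo_metrics, fh)

    # Read it back in binary mode.
    with open(demo_path, "rb") as fh:
        restored = pickle.load(fh)

print(restored)
```

Pickle files should only be loaded from sources you trust, because unpickling can execute arbitrary code.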