Machine Learning with Credit Card Default data – Part 1

1. Introduction

2. Load data

3. Left Blank

4. Print details of loaded dataset

5. Find unique values of some columns

6. Change the columns

7. Rename pay columns

8. Set 'default' as target and create a new dataset X by dropping this column from the original dataset

9. Use RobustScaler to transform X

10. Create a dataset Y with only target column

11. Create train and test data

12. Create a dataframe to store result of different models

13. Create Logistic Regression Model

14. Fit the data into model

15. Update result of model into model metrics

16. Confusion Matrix

17. Implement Bagging Model

18. RandomForest Model

19. Boosting Model

20. Save Models into a file

1.Introduction▲

Download the data from:
https://raw.githubusercontent.com/MLWave/Black-Boxxy/master/credit-card-default.csv

See implementation by Vladimir G. Drugov at:
https://rstudio-pubs-static.s3.amazonaws.com/281390_8a4ea1f1d23043479814ec4a38dbbfd9.html

2.Load data▲

2.1.Code▲

import pandas as pd

data_path = 'F:/data/input/credit_card_default.csv'
ccdefaults = pd.read_csv(data_path, index_col="ID")

3.Left Blank▲

4.Print details of loaded dataset▲

4.1.Code▲

print("\n", 50 * "-", "\n", "ccdefaults.head(10)", "\n", ccdefaults.head(10))
print("\n", 50 * "-", "\n", "ccdefaults.columns", "\n", ccdefaults.columns)
print("\n", 50 * "-", "\n", "ccdefaults.shape", "\n", ccdefaults.shape)

print("\n", 50 * "-", "\n", "\nLower the column names")

ccdefaults.rename(columns=lambda x: x.lower(), inplace=True)
print("\n", 50 * "-", "\n", "ccdefaults.columns", "\n", ccdefaults.columns)

print("\n", 50 * "-", "\n", "\nChange the column names pay_0 and default payment next month")
ccdefaults.rename(columns={'pay_0':'pay_1','default payment next month':'default'}, inplace=True)
print("\n", 50 * "-", "\n", "ccdefaults.columns", "\n", ccdefaults.columns)

4.2.Output▲

-------------------------------------------------- 
 ccdefaults.head(10) 
     LIMIT_BAL  SEX  EDUCATION  ...  PAY_AMT5  PAY_AMT6  default payment next month
ID                             ...                                                
1       20000    2          2  ...         0         0                           1
2      120000    2          2  ...         0      2000                           1
3       90000    2          2  ...      1000      5000                           0
4       50000    2          2  ...      1069      1000                           0
5       50000    1          2  ...       689       679                           0
6       50000    1          1  ...      1000       800                           0
7      500000    1          1  ...     13750     13770                           0
8      100000    2          2  ...      1687      1542                           0
9      140000    2          3  ...      1000      1000                           0
10      20000    1          3  ...      1122         0                           0

[10 rows x 24 columns]

 -------------------------------------------------- 
 ccdefaults.columns 
 Index(['LIMIT_BAL', 'SEX', 'EDUCATION', 'MARRIAGE', 'AGE', 'PAY_0', 'PAY_2',
       'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6', 'BILL_AMT1', 'BILL_AMT2',
       'BILL_AMT3', 'BILL_AMT4', 'BILL_AMT5', 'BILL_AMT6', 'PAY_AMT1',
       'PAY_AMT2', 'PAY_AMT3', 'PAY_AMT4', 'PAY_AMT5', 'PAY_AMT6',
       'default payment next month'],
      dtype='object')

 -------------------------------------------------- 
 ccdefaults.shape 
 (30000, 24)

 -------------------------------------------------- 
 
Lower the column names

 -------------------------------------------------- 
 ccdefaults.columns 
 Index(['limit_bal', 'sex', 'education', 'marriage', 'age', 'pay_0', 'pay_2',
       'pay_3', 'pay_4', 'pay_5', 'pay_6', 'bill_amt1', 'bill_amt2',
       'bill_amt3', 'bill_amt4', 'bill_amt5', 'bill_amt6', 'pay_amt1',
       'pay_amt2', 'pay_amt3', 'pay_amt4', 'pay_amt5', 'pay_amt6',
       'default payment next month'],
      dtype='object')

 -------------------------------------------------- 
 
Change the column names pay_0 and default payment next month

 -------------------------------------------------- 
 ccdefaults.columns 
 Index(['limit_bal', 'sex', 'education', 'marriage', 'age', 'pay_1', 'pay_2',
       'pay_3', 'pay_4', 'pay_5', 'pay_6', 'bill_amt1', 'bill_amt2',
       'bill_amt3', 'bill_amt4', 'bill_amt5', 'bill_amt6', 'pay_amt1',
       'pay_amt2', 'pay_amt3', 'pay_amt4', 'pay_amt5', 'pay_amt6', 'default'],
      dtype='object')

 --------------------------------------------------

5.Find unique values of some columns▲

5.1.Code▲

print("\n", "ccdefaults['education'].unique()", "\n", ccdefaults['education'].unique())
print("\n", "ccdefaults['marriage'].unique()", "\n", ccdefaults['marriage'].unique())

5.2.Output▲

ccdefaults['education'].unique() 
 [2 1 3 5 4 6 0]

 ccdefaults['marriage'].unique() 
 [1 2 3 0]

6.Change the columns▲

6.1.Code▲

print("\n", 50 * "-", "\n", "\nTransform the values for education and marital status")

# Base values: female, other_education, not_married
ccdefaults['grad_school'] = (ccdefaults['education'] == 1).astype('int')
ccdefaults['university'] = (ccdefaults['education'] == 2).astype('int')
ccdefaults['high_school'] = (ccdefaults['education'] == 3).astype('int')
ccdefaults['male'] = (ccdefaults['sex']==1).astype('int')
ccdefaults['married'] = (ccdefaults['marriage'] == 1).astype('int')

ccdefaults.drop(['sex','marriage', 'education'], axis=1, inplace=True)
print("\n", 50 * "-", "\n", "ccdefaults.head(10)", "\n", ccdefaults.head(10))
print("\n", 50 * "-", "\n", "ccdefaults.columns", "\n", ccdefaults.columns)

6.2.Output▲

-------------------------------------------------- 
 
Transform the values for education and marital status

 -------------------------------------------------- 
 ccdefaults.head(10) 
     limit_bal  age  pay_1  pay_2  ...  university  high_school  male  married
ID                                ...                                        
1       20000   24      2      2  ...           1            0     0        1
2      120000   26     -1      2  ...           1            0     0        0
3       90000   34      0      0  ...           1            0     0        0
4       50000   37      0      0  ...           1            0     0        1
5       50000   57     -1      0  ...           1            0     1        1
6       50000   37      0      0  ...           0            0     1        0
7      500000   29      0      0  ...           0            0     1        0
8      100000   23      0     -1  ...           1            0     0        0
9      140000   28      0      0  ...           0            1     0        1
10      20000   35     -2     -2  ...           0            1     1        0

[10 rows x 26 columns]

 -------------------------------------------------- 
 ccdefaults.columns 
 Index(['limit_bal', 'age', 'pay_1', 'pay_2', 'pay_3', 'pay_4', 'pay_5',
       'pay_6', 'bill_amt1', 'bill_amt2', 'bill_amt3', 'bill_amt4',
       'bill_amt5', 'bill_amt6', 'pay_amt1', 'pay_amt2', 'pay_amt3',
       'pay_amt4', 'pay_amt5', 'pay_amt6', 'default', 'grad_school',
       'university', 'high_school', 'male', 'married'],
      dtype='object')

7.Rename pay columns▲

7.1.Code▲

print("\n", 50 * "-", "\n", "\nLower the column names for pay delay")

# For pay_i features: a value > 0 means the customer's payment was delayed i months ago
pay_features = ['pay_' + str(i) for i in range(1,7)]
for p in pay_features:
    # Binarize each pay column: 1 if the payment was delayed, 0 otherwise
    ccdefaults[p] = (ccdefaults[p] > 0).astype(int)

print("\n", 50 * "-", "\n", "pay_features", "\n", pay_features)
print("\n", 50 * "-", "\n", "ccdefaults.head(10)", "\n", ccdefaults.head(10))
print("\n", 50 * "-", "\n", "ccdefaults.columns", "\n", ccdefaults.columns)

7.2.Output▲

Lower the column names for pay delay

 -------------------------------------------------- 
 pay_features 
 ['pay_1', 'pay_2', 'pay_3', 'pay_4', 'pay_5', 'pay_6']

 -------------------------------------------------- 
 ccdefaults.head(10) 
     limit_bal  age  pay_1  pay_2  ...  university  high_school  male  married
ID                                ...                                        
1       20000   24      1      1  ...           1            0     0        1
2      120000   26      0      1  ...           1            0     0        0
3       90000   34      0      0  ...           1            0     0        0
4       50000   37      0      0  ...           1            0     0        1
5       50000   57      0      0  ...           1            0     1        1
6       50000   37      0      0  ...           0            0     1        0
7      500000   29      0      0  ...           0            0     1        0
8      100000   23      0      0  ...           1            0     0        0
9      140000   28      0      0  ...           0            1     0        1
10      20000   35      0      0  ...           0            1     1        0

[10 rows x 26 columns]

 -------------------------------------------------- 
 ccdefaults.columns 
 Index(['limit_bal', 'age', 'pay_1', 'pay_2', 'pay_3', 'pay_4', 'pay_5',
       'pay_6', 'bill_amt1', 'bill_amt2', 'bill_amt3', 'bill_amt4',
       'bill_amt5', 'bill_amt6', 'pay_amt1', 'pay_amt2', 'pay_amt3',
       'pay_amt4', 'pay_amt5', 'pay_amt6', 'default', 'grad_school',
       'university', 'high_school', 'male', 'married'],
      dtype='object')

 --------------------------------------------------
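
The same pay-delay flags can be computed for several columns in one vectorized assignment instead of a loop. The sketch below is only an illustration on a toy frame: the name `toy` and the reduced column list are assumptions, and the values mirror rows 1, 2, 3 and 10 of the earlier `head(10)` output.

```python
import pandas as pd

# Toy frame with two pay columns (hypothetical stand-in for ccdefaults)
toy = pd.DataFrame({"pay_1": [2, -1, 0, -2], "pay_2": [2, 2, 0, -2]})
pay_cols = ["pay_1", "pay_2"]

# Positive values mean a payment delay; map them to 1 and everything else to 0
toy[pay_cols] = (toy[pay_cols] > 0).astype(int)

print(toy)
```

Negative and zero codes both collapse to 0, which is why rows 2 and 10 lose their -1 and -2 values in the output above.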

8.Set 'default' as target and create a new dataset X by dropping this column from the original dataset▲

8.1.Code▲

target_name = 'default'
X = ccdefaults.drop('default', axis=1)
print("\n", "X.head(10)", "\n", X.head(10))
print("\n", "X.columns", "\n", X.columns)

9.Use RobustScaler to transform X▲

9.1.Code▲

from sklearn.preprocessing import RobustScaler

robust_scaler = RobustScaler()
feature_names = X.columns
X = robust_scaler.fit_transform(X)

print("\n", "after robust_scaler.fit_transform(X), X is as follows:", "\n", X)

9.2.Output▲

X.head(10) 
     limit_bal  age  pay_1  pay_2  ...  university  high_school  male  married
ID                                ...                                        
1       20000   24      1      1  ...           1            0     0        1
2      120000   26      0      1  ...           1            0     0        0
3       90000   34      0      0  ...           1            0     0        0
4       50000   37      0      0  ...           1            0     0        1
5       50000   57      0      0  ...           1            0     1        1
6       50000   37      0      0  ...           0            0     1        0
7      500000   29      0      0  ...           0            0     1        0
8      100000   23      0      0  ...           1            0     0        0
9      140000   28      0      0  ...           0            1     0        1
10      20000   35      0      0  ...           0            1     1        0

[10 rows x 25 columns]

 X.columns 
 Index(['limit_bal', 'age', 'pay_1', 'pay_2', 'pay_3', 'pay_4', 'pay_5',
       'pay_6', 'bill_amt1', 'bill_amt2', 'bill_amt3', 'bill_amt4',
       'bill_amt5', 'bill_amt6', 'pay_amt1', 'pay_amt2', 'pay_amt3',
       'pay_amt4', 'pay_amt5', 'pay_amt6', 'grad_school', 'university',
       'high_school', 'male', 'married'],
      dtype='object')

 after robust_scaler.fit_transform(X), X is as follows: 
 [[-0.63157895 -0.76923077  1.         ...  0.          0.
   1.        ]
 [-0.10526316 -0.61538462  0.         ...  0.          0.
   0.        ]
 [-0.26315789  0.          0.         ...  0.          0.
   0.        ]
 ...
 [-0.57894737  0.23076923  1.         ...  0.          1.
   0.        ]
 [-0.31578947  0.53846154  1.         ...  1.          1.
   1.        ]
 [-0.47368421  0.92307692  0.         ...  0.          1.
   1.        ]]
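
With its default settings, RobustScaler centers each column on its median and divides by the interquartile range (75th minus 25th percentile). The NumPy-only sketch below reproduces that arithmetic on a toy column. The five values are made up for illustration and are not claimed to be the dataset's statistics.

```python
import numpy as np

# Toy single-feature column (hypothetical values, for illustration only)
toy = np.array([20000.0, 50000.0, 140000.0, 240000.0, 500000.0])

# Default RobustScaler arithmetic: (x - median) / (Q3 - Q1)
median = np.median(toy)
q1, q3 = np.percentile(toy, [25, 75])
scaled = (toy - median) / (q3 - q1)

print(scaled)
```

Because the median and quartiles are barely affected by extreme values, a few very large credit limits do not compress the rest of the column the way min-max scaling would.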

10.Create a dataset Y with only target column▲

10.1.Code▲

y = ccdefaults[target_name]
print("\n", "y", "\n", y.head(10))

10.2.Output▲

y 
 ID
1     1
2     1
3     0
4     0
5     0
6     0
7     0
8     0
9     0
10    0
Name: default, dtype: int64

11.Create train and test data▲

11.1.Code▲

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15, random_state=55, stratify=y)

print("\n", "X_train", "\n", X_train)
print("\n", "X_test", "\n", X_test)

print("\n", "y_train", "\n", y_train.head(10))
print("\n", "y_test", "\n", y_test.head(10))

11.2.Output▲

X_train 
 [[-0.31578947 -0.69230769  1.         ...  0.          0.
   0.        ]
 [-0.47368421 -0.92307692  0.         ...  0.          1.
   0.        ]
 [ 0.31578947  0.38461538  0.         ...  0.          0.
   1.        ]
 ...
 [-0.47368421 -0.92307692  0.         ...  0.          1.
   0.        ]
 [-0.63157895  0.          0.         ...  0.          1.
   0.        ]
 [ 1.89473684  1.15384615  0.         ...  0.          1.
   1.        ]]

 X_test 
 [[ 0.36842105 -0.53846154  0.         ...  0.          0.
   0.        ]
 [-0.47368421 -0.84615385  0.         ...  0.          0.
   0.        ]
 [ 0.42105263 -0.61538462  0.         ...  0.          0.
   0.        ]
 ...
 [ 0.         -0.46153846  0.         ...  0.          1.
   0.        ]
 [-0.63157895 -0.69230769  0.         ...  0.          1.
   0.        ]
 [ 0.63157895  1.53846154  0.         ...  1.          0.
   1.        ]]

 y_train 
 ID
18737    1
23949    0
12307    0
4023     0
27774    1
14158    0
3247     0
5478     0
12982    0
29966    0
Name: default, dtype: int64

 y_test 
 ID
18786    0
3878     0
27816    0
29680    0
19370    0
8996     0
23983    0
11830    1
16718    1
25556    0
Name: default, dtype: int64
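
The `stratify=y` argument asks `train_test_split` to keep the class proportions the same in both splits, which matters when defaults are the minority class. A minimal sketch on synthetic labels follows. The arrays `X_toy` and `y_toy` and the 20% positive rate are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic imbalanced labels (hypothetical): 200 positives out of 1000 rows
y_toy = np.array([1] * 200 + [0] * 800)
X_toy = np.arange(1000).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.15, random_state=55, stratify=y_toy
)

# With stratification, both splits keep about the original positive rate
print(len(y_tr), y_tr.mean())
print(len(y_te), y_te.mean())
```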

12.Create a dataframe to store result of different models▲

12.1.Code▲

print("\n", 50 * "-", "\nCreating Data Frame Evaluation Matrix")
# Data frame for evaluation metrics
metrics = pd.DataFrame(index=['accuracy', 'precision' ,'recall'],
                       columns=['LogisticReg', 'Bagging', 'RandomForest', 'Boosting'])

print("\n", "metrics:", "\n", metrics)

12.2.Output▲

Creating Data Frame Evaluation Matrix

 metrics: 
           LogisticReg Bagging RandomForest Boosting
accuracy          NaN     NaN          NaN      NaN
precision         NaN     NaN          NaN      NaN
recall            NaN     NaN          NaN      NaN

 --------------------------------------------------

13.Create Logistic Regression Model▲

13.1.Code▲

print("\n", "import LogisticRegression", "\n", 50 * "-")
from sklearn.linear_model import LogisticRegression

print("\n", "create an instance of LogisticRegression", "\n", 50 * "-")
logistic_regression = LogisticRegression(solver='liblinear', random_state=55)
print("logistic_regression", "\n", 50 * "-", "\n", logistic_regression)

13.2.Output▲

import LogisticRegression 
 --------------------------------------------------

 create an instance of LogisticRegression 
 --------------------------------------------------
logistic_regression 
 -------------------------------------------------- 
 LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=55, solver='liblinear', tol=0.0001, verbose=0,
                   warm_start=False)

14.Fit the data into model▲

14.1.Code▲

print("\n", "Use the training data to train the estimator")
logistic_regression.fit(X_train, y_train)
print("\n", "After training, X_train", "\n", 50 * "-", "\n", X_train)
print("\n", "After training, y_train", "\n", 50 * "-", "\n", y_train)

14.2.Output▲

Use the training data to train the estimator

 After training, X_train 
 -------------------------------------------------- 
 [[-0.31578947 -0.69230769  1.         ...  0.          0.
   0.        ]
 [-0.47368421 -0.92307692  0.         ...  0.          1.
   0.        ]
 [ 0.31578947  0.38461538  0.         ...  0.          0.
   1.        ]
 ...
 [-0.47368421 -0.92307692  0.         ...  0.          1.
   0.        ]
 [-0.63157895  0.          0.         ...  0.          1.
   0.        ]
 [ 1.89473684  1.15384615  0.         ...  0.          1.
   1.        ]]

 After training, y_train 
 -------------------------------------------------- 
 ID
18737    1
23949    0
12307    0
4023     0
27774    1
        ..
9827     1
21215    0
23936    0
11066    0
29735    1
Name: default, Length: 25500, dtype: int64

15.Update result of model into model metrics▲

15.1.Code▲

from sklearn.metrics import accuracy_score, precision_score, recall_score, confusion_matrix, precision_recall_curve

print("\n", "Evaluate Model and update metrics")
y_pred_test = logistic_regression.predict(X_test)
metrics.loc['accuracy','LogisticReg'] = accuracy_score(y_pred=y_pred_test, y_true=y_test)
metrics.loc['precision','LogisticReg'] = precision_score(y_pred=y_pred_test, y_true=y_test)
metrics.loc['recall','LogisticReg'] = recall_score(y_pred=y_pred_test, y_true=y_test)

print("\n", "metrics", "\n", 50 * "-", "\n", metrics)

15.2.Output▲

Evaluate Model and update metrics

 metrics 
 -------------------------------------------------- 
           LogisticReg Bagging RandomForest Boosting
accuracy     0.805778     NaN          NaN      NaN
precision    0.620758     NaN          NaN      NaN
recall       0.312563     NaN          NaN      NaN

16.Confusion Matrix▲

16.1.Code▲

print("\n", "Confusion matrix")
CM = confusion_matrix(y_pred=y_pred_test, y_true=y_test)
print("\n", "confusion_matrix", "\n", 50 * "-", "\n", CM)


# Wrap a raw confusion matrix in a labelled DataFrame with row and column totals
def CMatrix(CM, labels=['pay','default']):
    df = pd.DataFrame(data=CM, index=labels, columns=labels)
    df.index.name='TRUE'
    df.columns.name='PREDICTION'
    df.loc['Total'] = df.sum()
    df['Total'] = df.sum(axis=1)
    return df

CMatrix(CM)

print("\n", "Print more informative confusion matrix", "\n", 50 * "-", "\n", CMatrix(CM))

16.2.Output▲

confusion_matrix 
 -------------------------------------------------- 
 [[3315  190]
 [ 684  311]]

 Print more informative confusion matrix 
 -------------------------------------------------- 
 PREDICTION   pay  default  Total
TRUE                            
pay         3315      190   3505
default      684      311    995
Total       3999      501   4500
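
The three scores stored for logistic regression follow directly from the four cells of the confusion matrix above. The short sketch below recomputes them by hand. The variable names are illustrative, and "default" is treated as the positive class.

```python
# Counts copied from the logistic regression confusion matrix above
# (rows are true classes, columns are predicted classes)
tn, fp = 3315, 190   # true "pay" row
fn, tp = 684, 311    # true "default" row

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # of predicted defaults, the share that truly defaulted
recall = tp / (tp + fn)      # of true defaults, the share the model caught

print(f"accuracy={accuracy:.6f} precision={precision:.6f} recall={recall:.6f}")
# accuracy=0.805778 precision=0.620758 recall=0.312563
```

The low recall shows that the model misses roughly two thirds of the actual defaulters even though overall accuracy is about 80%.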

17.Implement Bagging Model▲

17.1.Code▲

print("\n", "import BaggingClassifier")
from sklearn.ensemble import BaggingClassifier

print("create instance of LogisticRegression and BaggingClassifier")
log_reg_for_bagging = LogisticRegression(solver = 'liblinear')
# Note: scikit-learn 1.2 renamed base_estimator to estimator; the older name
# is kept here because it matches the version that produced the output below.
bagging = BaggingClassifier(base_estimator=log_reg_for_bagging, n_estimators=10,
                            random_state=55, n_jobs=-1)

print("log_reg_for_bagging", "\n", 50 * "-", "\n", log_reg_for_bagging)
print("bagging", "\n", 50 * "-", "\n", bagging)

print("\n", "Use the training data to train the estimator")
bagging.fit(X_train, y_train)
print("\n", "After training, X_train", "\n", 50 * "-", "\n", X_train)
print("\n", "After training, y_train", "\n", 50 * "-", "\n", y_train)

print("\n", "Evaluate Model")
y_pred_test = bagging.predict(X_test)
metrics.loc['accuracy','Bagging'] = accuracy_score(y_pred=y_pred_test, y_true=y_test)
metrics.loc['precision','Bagging'] = precision_score(y_pred=y_pred_test, y_true=y_test)
metrics.loc['recall','Bagging'] = recall_score(y_pred=y_pred_test, y_true=y_test)
print("\n", "models after evaluating", "\n", 50 * "-", "\n", metrics)

# Confusion matrix
print("\n", "Confusion matrix")
CM = confusion_matrix(y_pred=y_pred_test, y_true=y_test)
print("\n", "confusion_matrix", "\n", 50 * "-", "\n", CM)

CMatrix(CM)

print("\n", "Print more informative confusion matrix", "\n", 50 * "-", "\n", CMatrix(CM))

17.2.Output▲

import BaggingClassifier
create instance of LogisticRegression and BaggingClassifier
log_reg_for_bagging 
 -------------------------------------------------- 
 LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=None, solver='liblinear', tol=0.0001, verbose=0,
                   warm_start=False)
bagging 
 -------------------------------------------------- 
 BaggingClassifier(base_estimator=LogisticRegression(C=1.0, class_weight=None,
                                                    dual=False,
                                                    fit_intercept=True,
                                                    intercept_scaling=1,
                                                    l1_ratio=None, max_iter=100,
                                                    multi_class='warn',
                                                    n_jobs=None, penalty='l2',
                                                    random_state=None,
                                                    solver='liblinear',
                                                    tol=0.0001, verbose=0,
                                                    warm_start=False),
                  bootstrap=True, bootstrap_features=False, max_features=1.0,
                  max_samples=1.0, n_estimators=10, n_jobs=-1, oob_score=False,
                  random_state=55, verbose=0, warm_start=False)

 Use the training data to train the estimator

 After training, X_train 
 -------------------------------------------------- 
 [[-0.31578947 -0.69230769  1.         ...  0.          0.
   0.        ]
 [-0.47368421 -0.92307692  0.         ...  0.          1.
   0.        ]
 [ 0.31578947  0.38461538  0.         ...  0.          0.
   1.        ]
 ...
 [-0.47368421 -0.92307692  0.         ...  0.          1.
   0.        ]
 [-0.63157895  0.          0.         ...  0.          1.
   0.        ]
 [ 1.89473684  1.15384615  0.         ...  0.          1.
   1.        ]]

 After training, y_train 
 -------------------------------------------------- 
 ID
18737    1
23949    0
12307    0
4023     0
27774    1
        ..
9827     1
21215    0
23936    0
11066    0
29735    1
Name: default, Length: 25500, dtype: int64

 Evaluate Model

 models after evaluating 
 -------------------------------------------------- 
           LogisticReg   Bagging RandomForest Boosting
accuracy     0.805778  0.805333          NaN      NaN
precision    0.620758  0.617822          NaN      NaN
recall       0.312563  0.313568          NaN      NaN

 Confusion matrix

 confusion_matrix 
 -------------------------------------------------- 
 [[3312  193]
 [ 683  312]]

 Print more informative confusion matrix 
 -------------------------------------------------- 
 PREDICTION   pay  default  Total
TRUE                            
pay         3312      193   3505
default      683      312    995
Total       3995      505   4500

18.RandomForest Model▲

18.1.Code▲

print("\n", "import RandomForestClassifier", "\n", 50 * "-")
from sklearn.ensemble import RandomForestClassifier

print("create instance of RandomForestClassifier")
RF = RandomForestClassifier(n_estimators=35, max_depth=20, random_state=55, max_features='sqrt',
                            n_jobs=-1)

print("RF", "\n", 50 * "-", "\n", RF)

print("\n", "Use the training data to train the estimator")
RF.fit(X_train, y_train)
print("\n", "After training, X_train", "\n", 50 * "-", "\n", X_train)
print("\n", "After training, y_train", "\n", 50 * "-", "\n", y_train)

print("\n", "Evaluate Model")
y_pred_test = RF.predict(X_test)
metrics.loc['accuracy','RandomForest'] = accuracy_score(y_pred=y_pred_test, y_true=y_test)
metrics.loc['precision','RandomForest'] = precision_score(y_pred=y_pred_test, y_true=y_test)
metrics.loc['recall','RandomForest'] = recall_score(y_pred=y_pred_test, y_true=y_test)
print("\n", "metrics after evaluating", "\n", 50 * "-", "\n", metrics)

# Confusion matrix
print("\n", "Confusion matrix")
CM = confusion_matrix(y_pred=y_pred_test, y_true=y_test)
print("\n", "confusion_matrix", "\n", 50 * "-", "\n", CM)

CMatrix(CM)

print("\n", "Print more informative confusion matrix", "\n", 50 * "-", "\n", CMatrix(CM))

18.2.Output▲

import RandomForestClassifier 
 --------------------------------------------------
create instance of RandomForestClassifier
RF 
 -------------------------------------------------- 
 RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                       max_depth=20, max_features='sqrt', max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=35, n_jobs=-1,
                       oob_score=False, random_state=55, verbose=0,
                       warm_start=False)

 Use the training data to train the estimator

 After training, X_train 
 -------------------------------------------------- 
 [[-0.31578947 -0.69230769  1.         ...  0.          0.
   0.        ]
 [-0.47368421 -0.92307692  0.         ...  0.          1.
   0.        ]
 [ 0.31578947  0.38461538  0.         ...  0.          0.
   1.        ]
 ...
 [-0.47368421 -0.92307692  0.         ...  0.          1.
   0.        ]
 [-0.63157895  0.          0.         ...  0.          1.
   0.        ]
 [ 1.89473684  1.15384615  0.         ...  0.          1.
   1.        ]]

 After training, y_train 
 -------------------------------------------------- 
 ID
18737    1
23949    0
12307    0
4023     0
27774    1
        ..
9827     1
21215    0
23936    0
11066    0
29735    1
Name: default, Length: 25500, dtype: int64

 Evaluate Model

 metrics after evaluating 
 -------------------------------------------------- 
           LogisticReg   Bagging RandomForest Boosting
accuracy     0.805778  0.805333     0.810222      NaN
precision    0.620758  0.617822      0.61809      NaN
recall       0.312563  0.313568     0.370854      NaN

 Confusion matrix

 confusion_matrix 
 -------------------------------------------------- 
 [[3277  228]
 [ 626  369]]

 Print more informative confusion matrix 
 -------------------------------------------------- 
 PREDICTION   pay  default  Total
TRUE                            
pay         3277      228   3505
default      626      369    995
Total       3903      597   4500

19.Boosting Model▲

19.1.Code▲

print("\n", "import AdaBoostClassifier", "\n", 50 * "-")
from sklearn.ensemble import AdaBoostClassifier

print("create instance of AdaBoostClassifier")
boosting = AdaBoostClassifier(n_estimators=50, learning_rate=0.1, random_state=55)
print("boosting", "\n", 50 * "-", "\n", boosting)

print("\n", "Use the training data to train the estimator")
boosting.fit(X_train, y_train)
print("\n", "After training, X_train", "\n", 50 * "-", "\n", X_train)
print("\n", "After training, y_train", "\n", 50 * "-", "\n", y_train)

print("\n", "Evaluate Model")
y_pred_test = boosting.predict(X_test)
metrics.loc['accuracy','Boosting'] = accuracy_score(y_pred=y_pred_test, y_true=y_test)
metrics.loc['precision','Boosting'] = precision_score(y_pred=y_pred_test, y_true=y_test)
metrics.loc['recall','Boosting'] = recall_score(y_pred=y_pred_test, y_true=y_test)
print("\n", "metrics after evaluating", "\n", 50 * "-", "\n", metrics)

# Confusion matrix
print("\n", "Confusion matrix")
CM = confusion_matrix(y_pred=y_pred_test, y_true=y_test)
print("\n", "confusion_matrix", "\n", 50 * "-", "\n", CM)

CMatrix(CM)

print("\n", "Print more informative confusion matrix", "\n", 50 * "-", "\n", CMatrix(CM))

19.2.Output▲

import AdaBoostClassifier 
 --------------------------------------------------
create instance of AdaBoostClassifier
boosting 
 -------------------------------------------------- 
 AdaBoostClassifier(algorithm='SAMME.R', base_estimator=None, learning_rate=0.1,
                   n_estimators=50, random_state=55)

 Use the training data to train the estimator

 After training, X_train 
 -------------------------------------------------- 
 [[-0.31578947 -0.69230769  1.         ...  0.          0.
   0.        ]
 [-0.47368421 -0.92307692  0.         ...  0.          1.
   0.        ]
 [ 0.31578947  0.38461538  0.         ...  0.          0.
   1.        ]
 ...
 [-0.47368421 -0.92307692  0.         ...  0.          1.
   0.        ]
 [-0.63157895  0.          0.         ...  0.          1.
   0.        ]
 [ 1.89473684  1.15384615  0.         ...  0.          1.
   1.        ]]

 After training, y_train 
 -------------------------------------------------- 
 ID
18737    1
23949    0
12307    0
4023     0
27774    1
        ..
9827     1
21215    0
23936    0
11066    0
29735    1
Name: default, Length: 25500, dtype: int64

 Evaluate Model

 metrics after evaluating 
 -------------------------------------------------- 
           LogisticReg   Bagging RandomForest  Boosting
accuracy     0.805778  0.805333     0.810222     0.804
precision    0.620758  0.617822      0.61809  0.631702
recall       0.312563  0.313568     0.370854  0.272362

 Confusion matrix

 confusion_matrix 
 -------------------------------------------------- 
 [[3347  158]
 [ 724  271]]

 Print more informative confusion matrix 
 -------------------------------------------------- 
 PREDICTION   pay  default  Total
TRUE                            
pay         3347      158   3505
default      724      271    995
Total       4071      429   4500

 --------------------------------------------------
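
With all four columns filled in, the best model per metric can be read off programmatically. The sketch below copies the scores from the final metrics table into a fresh frame. The name `summary` is an assumption, used so the tutorial's own `metrics` object is left untouched. That object was created empty, so it probably holds object dtype and may need `.astype(float)` before a call like this.

```python
import pandas as pd

# Test-set scores copied from the final metrics table above
summary = pd.DataFrame(
    {
        "LogisticReg": [0.805778, 0.620758, 0.312563],
        "Bagging": [0.805333, 0.617822, 0.313568],
        "RandomForest": [0.810222, 0.618090, 0.370854],
        "Boosting": [0.804000, 0.631702, 0.272362],
    },
    index=["accuracy", "precision", "recall"],
)

# Best-scoring model for each metric (row-wise argmax over the columns)
best = summary.idxmax(axis=1)
print(best)
```

On these numbers the random forest leads on accuracy and recall, while boosting has the highest precision. The differences are small enough that a single train/test split may not separate the models reliably.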

20.Save Models into a file▲

20.1.Code▲

import pickle

filename = '../data/all_metrics.sav'

print("\n", 50 * "-", "\nDumping metrics to", filename)

# Use a context manager so the file is closed once the dump completes
with open(filename, 'wb') as f:
    pickle.dump(metrics, f)
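
To reuse the saved metrics in a later session, the file can be read back with `pickle.load`. The round-trip sketch below writes to a temporary directory so it runs anywhere. The `toy_metrics` frame and the temporary path are assumptions for illustration, not the tutorial's actual file.

```python
import os
import pickle
import tempfile

import pandas as pd

# Small stand-in for the metrics frame (hypothetical values)
toy_metrics = pd.DataFrame(
    {"LogisticReg": [0.8, 0.6, 0.3]},
    index=["accuracy", "precision", "recall"],
)

# Temporary location, used here only so the sketch is self-contained
path = os.path.join(tempfile.mkdtemp(), "all_metrics.sav")

with open(path, "wb") as fh:
    pickle.dump(toy_metrics, fh)

with open(path, "rb") as fh:
    restored = pickle.load(fh)

print(restored.equals(toy_metrics))
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.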