- November 25, 2020
- saptrxuy_learnit
Machine Learning with Diamond Data – Part 1
1. Introduction
2. Load data
3. Print list of members and methods of loaded data object
4. Print details of loaded dataset
5. Find unique values of some columns
6. Convert the nominal values of cut into dummy (one-hot) numeric variables
7. Convert nominal values of color and clarity into dummy variables as well
8. Set price as target and create a new dataset X by dropping this column from original dataset
9. Use RobustScaler to transform X
10. Create a dataset y with only target column
11. Create train and test data
12. Create a dataframe to store result of different models
13. Create KNN model
14. Fit the data into the model
15. Update the model's result in the models matrix
16. Implement Bagging Model
17. RandomForest Model
18. Boosting Model
19. Save Models into a file
1. Introduction
Download data from https://raw.githubusercontent.com/tidyverse/ggplot2/master/data-raw/diamonds.csv
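If the file has not been saved locally yet, pandas can also read the CSV straight from the URL; a minimal sketch (the local path is the one used in section 2, and caching a copy is optional):

import pandas as pd

url = 'https://raw.githubusercontent.com/tidyverse/ggplot2/master/data-raw/diamonds.csv'
diamonds = pd.read_csv(url)  # download and parse in one step
diamonds.to_csv('F:/data/input/diamonds.csv', index=False)  # optional: cache a local copy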
2. Load data
2.1. Code
# Download data from https://raw.githubusercontent.com/tidyverse/ggplot2/master/data-raw/diamonds.csv
import pandas as pd

data_path = 'F:/data/input/diamonds.csv'
diamonds = pd.read_csv(data_path)

print("\n", 50 * "-", "\nProgram Over")
3. Print list of members and methods of loaded data object
3.1. Code
print(dir(diamonds.__class__))
3.2. Output
['T', '_AXIS_ALIASES', '_AXIS_IALIASES', '_AXIS_LEN', '_AXIS_NAMES', '_AXIS_NUMBERS', '_AXIS_ORDERS', '_AXIS_REVERSED', '__abs__', '__add__', '__and__', '__array__', '__array_priority__', '__array_wrap__', '__bool__', '__class__', '__contains__', '__copy__', '__deepcopy__', '__delattr__', '__delitem__', '__dict__', '__dir__', '__div__', '__doc__', '__eq__', '__finalize__', '__floordiv__', '__format__', '__ge__', '__getattr__', '__getattribute__', '__getitem__', '__getstate__', '__gt__', '__hash__', '__iadd__', '__iand__', '__ifloordiv__', '__imod__', '__imul__', '__init__', '__init_subclass__', '__invert__', '__ior__', '__ipow__', '__isub__', '__iter__', '__itruediv__', '__ixor__', '__le__', '__len__', '__lt__', '__matmul__', '__mod__', '__module__', '__mul__', '__ne__', '__neg__', '__new__', '__nonzero__', '__or__', '__pos__', '__pow__', '__radd__', '__rand__', '__rdiv__', '__reduce__', '__reduce_ex__', '__repr__', '__rfloordiv__', '__rmatmul__', '__rmod__', '__rmul__', '__ror__', '__round__', '__rpow__', '__rsub__', '__rtruediv__', '__rxor__', '__setattr__', '__setitem__', '__setstate__', '__sizeof__', '__str__', '__sub__', '__subclasshook__', '__truediv__', '__weakref__', '__xor__', '_accessors', '_add_numeric_operations', '_add_series_only_operations', '_add_series_or_dataframe_operations', '_agg_by_level', '_agg_examples_doc', '_agg_summary_and_see_also_doc', '_aggregate', '_aggregate_multiple_funcs', '_align_frame', '_align_series', '_box_col_values', '_box_item_values', '_builtin_table', '_check_inplace_setting', '_check_is_chained_assignment_possible', '_check_label_or_level_ambiguity', '_check_percentile', '_check_setitem_copy', '_clear_item_cache', '_clip_with_one_bound', '_clip_with_scalar', '_combine_const', '_combine_frame', '_combine_match_columns', '_combine_match_index', '_consolidate', '_consolidate_inplace', '_construct_axes_dict', '_construct_axes_dict_from', '_construct_axes_from_arguments', '_constructor', '_constructor_expanddim', '_constructor_sliced', '_convert', '_count_level', '_create_indexer', '_cython_table', '_data', '_deprecations', '_dir_additions', '_dir_deletions', '_drop_axis', '_drop_labels_or_levels', '_ensure_valid_index', '_find_valid_index', '_from_arrays', '_from_axes', '_get_agg_axis', '_get_axis', '_get_axis_name', '_get_axis_number', '_get_axis_resolvers', '_get_block_manager_axis', '_get_bool_data', '_get_cacher', '_get_index_resolvers', '_get_item_cache', '_get_label_or_level_values', '_get_numeric_data', '_get_space_character_free_column_resolvers', '_get_value', '_get_values', '_getitem_bool_array', '_getitem_frame', '_getitem_multilevel', '_gotitem', '_iget_item_cache', '_indexed_same', '_info_axis', '_info_axis_name', '_info_axis_number', '_info_repr', '_init_mgr', '_internal_get_values', '_internal_names', '_internal_names_set', '_is_builtin_func', '_is_cached', '_is_copy', '_is_cython_func', '_is_datelike_mixed_type', '_is_homogeneous_type', '_is_label_or_level_reference', '_is_label_reference', '_is_level_reference', '_is_mixed_type', '_is_numeric_mixed_type', '_is_view', '_ix', '_ixs', '_join_compat', '_maybe_cache_changed', '_maybe_update_cacher', '_metadata', '_needs_reindex_multi', '_obj_with_exclusions', '_protect_consolidate', '_reduce', '_reindex_axes', '_reindex_columns', '_reindex_index', '_reindex_multi', '_reindex_with_indexers', '_repr_data_resource_', '_repr_fits_horizontal_', '_repr_fits_vertical_', '_repr_html_', '_repr_latex_', '_reset_cache', '_reset_cacher', '_sanitize_column', '_selected_obj', '_selection', 
'_selection_list', '_selection_name', '_series', '_set_as_cached', '_set_axis', '_set_axis_name', '_set_is_copy', '_set_item', '_set_value', '_setitem_array', '_setitem_frame', '_setitem_slice', '_setup_axes', '_shallow_copy', '_slice', '_stat_axis', '_stat_axis_name', '_stat_axis_number', '_to_dict_of_blocks', '_try_aggregate_string_function', '_typ', '_unpickle_frame_compat', '_unpickle_matrix_compat', '_update_inplace', '_validate_dtype', '_values', '_where', '_xs', 'abs', 'add', 'add_prefix', 'add_suffix', 'agg', 'aggregate', 'align', 'all', 'any', 'append', 'apply', 'applymap', 'as_blocks', 'as_matrix', 'asfreq', 'asof', 'assign', 'astype', 'at', 'at_time', 'axes', 'between_time', 'bfill', 'blocks', 'bool', 'boxplot', 'clip', 'clip_lower', 'clip_upper', 'columns', 'combine', 'combine_first', 'compound', 'copy', 'corr', 'corrwith', 'count', 'cov', 'cummax', 'cummin', 'cumprod', 'cumsum', 'describe', 'diff', 'div', 'divide', 'dot', 'drop', 'drop_duplicates', 'droplevel', 'dropna', 'dtypes', 'duplicated', 'empty', 'eq', 'equals', 'eval', 'ewm', 'expanding', 'explode', 'ffill', 'fillna', 'filter', 'first', 'first_valid_index', 'floordiv', 'from_dict', 'from_items', 'from_records', 'ftypes', 'ge', 'get', 'get_dtype_counts', 'get_ftype_counts', 'get_value', 'get_values', 'groupby', 'gt', 'head', 'hist', 'iat', 'idxmax', 'idxmin', 'iloc', 'index', 'infer_objects', 'info', 'insert', 'interpolate', 'is_copy', 'isin', 'isna', 'isnull', 'items', 'iteritems', 'iterrows', 'itertuples', 'ix', 'join', 'keys', 'kurt', 'kurtosis', 'last', 'last_valid_index', 'le', 'loc', 'lookup', 'lt', 'mad', 'mask', 'max', 'mean', 'median', 'melt', 'memory_usage', 'merge', 'min', 'mod', 'mode', 'mul', 'multiply', 'ndim', 'ne', 'nlargest', 'notna', 'notnull', 'nsmallest', 'nunique', 'pct_change', 'pipe', 'pivot', 'pivot_table', 'plot', 'pop', 'pow', 'prod', 'product', 'quantile', 'query', 'radd', 'rank', 'rdiv', 'reindex', 'reindex_like', 'rename', 'rename_axis', 'reorder_levels', 'replace', 'resample', 'reset_index', 'rfloordiv', 'rmod', 'rmul', 'rolling', 'round', 'rpow', 'rsub', 'rtruediv', 'sample', 'select_dtypes', 'sem', 'set_axis', 'set_index', 'set_value', 'shape', 'shift', 'size', 'skew', 'slice_shift', 'sort_index', 'sort_values', 'sparse', 'squeeze', 'stack', 'std', 'style', 'sub', 'subtract', 'sum', 'swapaxes', 'swaplevel', 'tail', 'take', 'to_clipboard', 'to_csv', 'to_dense', 'to_dict', 'to_excel', 'to_feather', 'to_gbq', 'to_hdf', 'to_html', 'to_json', 'to_latex', 'to_msgpack', 'to_numpy', 'to_parquet', 'to_period', 'to_pickle', 'to_records', 'to_sparse', 'to_sql', 'to_stata', 'to_string', 'to_timestamp', 'to_xarray', 'transform', 'transpose', 'truediv', 'truncate', 'tshift', 'tz_convert', 'tz_localize', 'unstack', 'update', 'values', 'var', 'where', 'xs']
4. Print details of loaded dataset
4.1. Code
#print(dir(diamonds.__class__))

print("\n", "diamonds.head(10)", "\n", diamonds.head(10))
print("\n", "diamonds.columns", "\n", diamonds.columns)

print("\n", "diamonds.shape", "\n", diamonds.shape)
4.2. Output
diamonds.head(10)
    carat        cut color clarity  depth  table  price     x     y     z
0    0.23      Ideal     E     SI2   61.5   55.0    326  3.95  3.98  2.43
1    0.21    Premium     E     SI1   59.8   61.0    326  3.89  3.84  2.31
2    0.23       Good     E     VS1   56.9   65.0    327  4.05  4.07  2.31
3    0.29    Premium     I     VS2   62.4   58.0    334  4.20  4.23  2.63
4    0.31       Good     J     SI2   63.3   58.0    335  4.34  4.35  2.75
5    0.24  Very Good     J    VVS2   62.8   57.0    336  3.94  3.96  2.48
6    0.24  Very Good     I    VVS1   62.3   57.0    336  3.95  3.98  2.47
7    0.26  Very Good     H     SI1   61.9   55.0    337  4.07  4.11  2.53
8    0.22       Fair     E     VS2   65.1   61.0    337  3.87  3.78  2.49
9    0.23  Very Good     H     VS1   59.4   61.0    338  4.00  4.05  2.39

diamonds.columns
Index(['carat', 'cut', 'color', 'clarity', 'depth', 'table', 'price', 'x',
       'y', 'z'],
      dtype='object')

diamonds.shape
(53940, 10)
5. Find unique values of some columns
5.1. Code
print("\n", "diamonds['cut'].unique()", "\n", diamonds['cut'].unique())
print("\n", "diamonds['color'].unique()", "\n", diamonds['color'].unique())
print("\n", "diamonds['clarity'].unique()", "\n", diamonds['clarity'].unique())
5.2. Output
diamonds['cut'].unique()
['Ideal' 'Premium' 'Good' 'Very Good' 'Fair']

diamonds['color'].unique()
['E' 'I' 'J' 'H' 'F' 'G' 'D']

diamonds['clarity'].unique()
['SI2' 'SI1' 'VS1' 'VS2' 'VVS2' 'VVS1' 'I1' 'IF']
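Beyond the distinct levels, it helps to see how often each level occurs before choosing an encoding; a small optional check, not part of the original script:

print(diamonds['cut'].value_counts())      # frequency of each cut grade
print(diamonds['color'].value_counts())    # frequency of each color grade
print(diamonds['clarity'].value_counts())  # frequency of each clarity grade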
6. Convert the nominal values of cut into dummy (one-hot) numeric variables
6.1. Code
dummysCut = pd.get_dummies(diamonds['cut'], prefix='cut', drop_first=True)
print("\n", "dummysCut = pd.get_dummies(diamonds['cut'], prefix='cut', drop_first=True)\n",
      "dummysCut.head(10)", "\n", dummysCut.head(10))

diamonds = pd.concat([diamonds, dummysCut], axis=1)
print("\n", "After concatenating dummysCut with diamonds \ndiamonds.head(10)", "\n", diamonds.head(10))
print("\n", "diamonds.columns", "\n", diamonds.columns)
6.2. Output
dummysCut = pd.get_dummies(diamonds['cut'], prefix='cut', drop_first=True)
dummysCut.head(10)
   cut_Good  cut_Ideal  cut_Premium  cut_Very Good
0         0          1            0              0
1         0          0            1              0
2         1          0            0              0
3         0          0            1              0
4         1          0            0              0
5         0          0            0              1
6         0          0            0              1
7         0          0            0              1
8         0          0            0              0
9         0          0            0              1

After concatenating dummysCut with diamonds
diamonds.head(10)
   carat        cut color  ...  cut_Ideal  cut_Premium  cut_Very Good
0   0.23      Ideal     E  ...          1            0              0
1   0.21    Premium     E  ...          0            1              0
2   0.23       Good     E  ...          0            0              0
3   0.29    Premium     I  ...          0            1              0
4   0.31       Good     J  ...          0            0              0
5   0.24  Very Good     J  ...          0            0              1
6   0.24  Very Good     I  ...          0            0              1
7   0.26  Very Good     H  ...          0            0              1
8   0.22       Fair     E  ...          0            0              0
9   0.23  Very Good     H  ...          0            0              1

[10 rows x 14 columns]

diamonds.columns
Index(['carat', 'cut', 'color', 'clarity', 'depth', 'table', 'price', 'x',
       'y', 'z', 'cut_Good', 'cut_Ideal', 'cut_Premium', 'cut_Very Good'],
      dtype='object')
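Strictly speaking, get_dummies produces one-hot (dummy) columns rather than a single ordinal column. Since cut has a natural quality order, a true ordinal encoding would look like the following sketch; the mapping dict is an assumption based on the usual Fair < Good < Very Good < Premium < Ideal ordering and is not used in the rest of this post:

# Hypothetical ordinal encoding of cut (order assumed: Fair < Good < Very Good < Premium < Ideal)
cut_order = {'Fair': 1, 'Good': 2, 'Very Good': 3, 'Premium': 4, 'Ideal': 5}
diamonds['cut_ord'] = diamonds['cut'].map(cut_order)
print(diamonds[['cut', 'cut_ord']].head())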
7. Convert nominal values of color and clarity into dummy variables as well
7.1. Code
# Note: this re-encodes 'cut' even though section 6 already concatenated the cut_* dummies.
# If both sections run in the same session, the cut_* columns are duplicated, which is why
# X in section 8 ends up with 27 feature columns. (The head/columns output below, with 24
# columns, appears to come from a run without the section-6 concat.)
diamonds = pd.concat([diamonds, pd.get_dummies(diamonds['cut'], prefix='cut', drop_first=True)], axis=1)
diamonds = pd.concat([diamonds, pd.get_dummies(diamonds['color'], prefix='color', drop_first=True)], axis=1)
diamonds = pd.concat([diamonds, pd.get_dummies(diamonds['clarity'], prefix='clarity', drop_first=True)], axis=1)
diamonds.drop(['cut', 'color', 'clarity'], axis=1, inplace=True)

print("\n", "diamonds.head(10)", "\n", diamonds.head(10))
print("\n", "diamonds.columns", "\n", diamonds.columns)
7.2. Output
diamonds.head(10)
   carat  depth  table  ...  clarity_VS2  clarity_VVS1  clarity_VVS2
0   0.23   61.5   55.0  ...            0             0             0
1   0.21   59.8   61.0  ...            0             0             0
2   0.23   56.9   65.0  ...            0             0             0
3   0.29   62.4   58.0  ...            1             0             0
4   0.31   63.3   58.0  ...            0             0             0
5   0.24   62.8   57.0  ...            0             0             1
6   0.24   62.3   57.0  ...            0             1             0
7   0.26   61.9   55.0  ...            0             0             0
8   0.22   65.1   61.0  ...            1             0             0
9   0.23   59.4   61.0  ...            0             0             0

[10 rows x 24 columns]

diamonds.columns
Index(['carat', 'depth', 'table', 'price', 'x', 'y', 'z', 'cut_Good',
       'cut_Ideal', 'cut_Premium', 'cut_Very Good', 'color_E', 'color_F',
       'color_G', 'color_H', 'color_I', 'color_J', 'clarity_IF',
       'clarity_SI1', 'clarity_SI2', 'clarity_VS1', 'clarity_VS2',
       'clarity_VVS1', 'clarity_VVS2'],
      dtype='object')
8. Set price as target and create a new dataset X by dropping this column from original dataset
8.1. Code
target_name = 'price'
X = diamonds.drop('price', axis=1)

print("\n", "X.columns", "\n", X.columns)
print("\n", "X.head(5)", "\n", X.head(5))
print(X.describe())
8.2. Output
X.columns
Index(['carat', 'depth', 'table', 'x', 'y', 'z', 'cut_Good', 'cut_Ideal',
       'cut_Premium', 'cut_Very Good', 'cut_Good', 'cut_Ideal', 'cut_Premium',
       'cut_Very Good', 'color_E', 'color_F', 'color_G', 'color_H', 'color_I',
       'color_J', 'clarity_IF', 'clarity_SI1', 'clarity_SI2', 'clarity_VS1',
       'clarity_VS2', 'clarity_VVS1', 'clarity_VVS2'],
      dtype='object')

X.head(5)
   carat  depth  table  ...  clarity_VS2  clarity_VVS1  clarity_VVS2
0   0.23   61.5   55.0  ...            0             0             0
1   0.21   59.8   61.0  ...            0             0             0
2   0.23   56.9   65.0  ...            0             0             0
3   0.29   62.4   58.0  ...            1             0             0
4   0.31   63.3   58.0  ...            0             0             0

[5 rows x 27 columns]

              carat         depth  ...  clarity_VVS1  clarity_VVS2
count  53940.000000  53940.000000  ...  53940.000000  53940.000000
mean       0.797940     61.749405  ...      0.067760      0.093919
std        0.474011      1.432621  ...      0.251337      0.291719
min        0.200000     43.000000  ...      0.000000      0.000000
25%        0.400000     61.000000  ...      0.000000      0.000000
50%        0.700000     61.800000  ...      0.000000      0.000000
75%        1.040000     62.500000  ...      0.000000      0.000000
max        5.010000     79.000000  ...      1.000000      1.000000

[8 rows x 27 columns]
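Notice that cut_Good, cut_Ideal, cut_Premium and cut_Very Good appear twice in X.columns: section 7 re-encoded cut even though section 6 had already concatenated the same dummies. The post continues with all 27 columns, but if you want to drop the duplicates before scaling, a one-line optional fix is:

X = X.loc[:, ~X.columns.duplicated()]  # keep the first occurrence of each column name
print(X.shape)  # would be (53940, 23) instead of (53940, 27)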
9. Use RobustScaler to transform X
9.1. Code
from sklearn.preprocessing import RobustScaler

#robust_scaler = RobustScaler()
robust_scaler = RobustScaler(with_centering=True, with_scaling=True,
                             quantile_range=(25.0, 75.0), copy=True)

X = robust_scaler.fit_transform(X)
print("\n", "after robust_scaler.fit_transform(X), X is as follows:", "\n", X.shape)
9.2. Output
after robust_scaler.fit_transform(X), X is as follows:
(53940, 27)
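For intuition: with these defaults, RobustScaler subtracts each column's median and divides by its interquartile range (IQR), so outliers (e.g. in carat) distort the scale less than with mean/std scaling. A minimal sketch of the equivalent computation on one column, for illustration only:

import numpy as np

carat = diamonds['carat'].to_numpy()
q25, q50, q75 = np.percentile(carat, [25, 50, 75])
carat_scaled = (carat - q50) / (q75 - q25)  # the same per-column transform RobustScaler applies
print(carat_scaled[:5])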
10. Create a dataset y with only target column
10.1. Code
y = diamonds[target_name]
print("\n", "y", "\n", y.head(10))
10.2. Output
y
0    326
1    326
2    327
3    334
4    335
5    336
6    336
7    337
8    337
9    338
Name: price, dtype: int64
11. Create train and test data
11.1. Code
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=55)

print("\n", "X_train.shape", "\n", X_train.shape)
print("\n", "X_test.shape", "\n", X_test.shape)

print("\n", "y_train.shape", "\n", y_train.shape)
print("\n", "y_test.shape", "\n", y_test.shape)
11.2. Output
X_train.shape
(43152, 27)

X_test.shape
(10788, 27)

y_train.shape
(43152,)

y_test.shape
(10788,)
12. Create a dataframe to store result of different models
12.1. Code
models = pd.DataFrame(index=['train_mse', 'test_mse'],
                      columns=['KNN', 'Bagging', 'RandomForest', 'Boosting'])

print("\n", "models", "\n", models)
12.2. Output
models
           KNN Bagging RandomForest Boosting
train_mse  NaN     NaN          NaN      NaN
test_mse   NaN     NaN          NaN      NaN
13. Create KNN model
13.1. Code
print("\n", "import KNeighborsRegressor")
from sklearn.neighbors import KNeighborsRegressor

print("create instance of KNeighborsRegressor")
knn = KNeighborsRegressor(n_neighbors=20, weights='distance', metric='euclidean', n_jobs=-1)
print("knn", "\n", 50 * "-", "\n", knn)
13.2. Output
import KNeighborsRegressor
create instance of KNeighborsRegressor
knn
--------------------------------------------------
KNeighborsRegressor(algorithm='auto', leaf_size=30, metric='euclidean',
                    metric_params=None, n_jobs=-1, n_neighbors=20, p=2,
                    weights='distance')
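n_neighbors=20 is a fixed choice here; if you wanted to tune it, a small grid search over the training data is a common approach. A sketch with an illustrative parameter grid (the candidate values are assumptions, not from the original post; refitting KNN several times is slow on this dataset):

from sklearn.model_selection import GridSearchCV

# Candidate neighbor counts below are illustrative, not tuned values
grid = GridSearchCV(KNeighborsRegressor(weights='distance', metric='euclidean'),
                    param_grid={'n_neighbors': [5, 10, 20, 40]},
                    scoring='neg_mean_squared_error', cv=3, n_jobs=-1)
grid.fit(X_train, y_train)
print(grid.best_params_)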
14. Fit the data into the model
14.1. Code
print("\n", "Use the training data to train the estimator")
knn.fit(X_train, y_train)
print("\n", "After training, X_train", "\n", 50 * "-", "\n", X_train)
print("\n", "After training, y_train", "\n", 50 * "-", "\n", y_train)
14.2. Output
Use the training data to train the estimator

After training, X_train
--------------------------------------------------
[[ 0.015625   -0.06666667  0.         ...  0.          0.          0.        ]
 [ 2.046875    0.33333333  1.         ...  0.          0.          0.        ]
 [ 0.625       0.46666667  0.         ...  0.          0.          0.        ]
 ...
 [ 0.984375   -0.2         0.33333333 ...  0.          0.          0.        ]
 [-0.609375    0.33333333  1.         ...  1.          0.          0.        ]
 [ 0.3125      1.8         1.33333333 ...  0.          0.          0.        ]]

After training, y_train
--------------------------------------------------
51408     2370
25582    14426
8877      4484
17084     6811
35353      898
         ...
10213     4742
16253     6501
17352     6963
28967      435
4762      3689
Name: price, Length: 43152, dtype: int64
15. Update the model's result in the models matrix
15.1. Code
from sklearn.metrics import mean_squared_error

# Update the models matrix with train and test MSE
models.loc['train_mse', 'KNN'] = mean_squared_error(y_pred=knn.predict(X_train),
                                                    y_true=y_train)

models.loc['test_mse', 'KNN'] = mean_squared_error(y_pred=knn.predict(X_test),
                                                   y_true=y_test)

print("\n", "models after evaluating", "\n", 50 * "-", "\n", models)
15.2. Output
models after evaluating
--------------------------------------------------
              KNN Bagging RandomForest Boosting
train_mse  78.503     NaN          NaN      NaN
test_mse   774504     NaN          NaN      NaN
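The huge gap between train_mse (about 78.5) and test_mse (about 774,504) is expected with weights='distance': on the training set each point is its own zero-distance neighbor, so the prediction nearly memorizes the target, and only the test error is meaningful. Taking the square root also makes the number easier to read, since RMSE is in the same units as price:

import numpy as np

# RMSE is in US dollars, the same unit as the price column
print("KNN test RMSE:", np.sqrt(models.loc['test_mse', 'KNN']))  # about 880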
16. Implement Bagging Model
16.1. Code
print("\n", "import BaggingRegressor")
#from sklearn.neighbors import KNeighborsRegressor  # already imported
from sklearn.ensemble import BaggingRegressor

print("create instance of KNeighborsRegressor and BaggingRegressor")
#knn = KNeighborsRegressor(n_neighbors=20, weights='distance', metric='euclidean', n_jobs=-1)

knn_for_bagging = KNeighborsRegressor(n_neighbors=20, weights='distance', metric='euclidean')

bagging = BaggingRegressor(base_estimator=knn_for_bagging, n_estimators=15, max_features=0.75,
                           random_state=55, n_jobs=-1)

print("knn_for_bagging", "\n", 50 * "-", "\n", knn_for_bagging)
print("bagging", "\n", 50 * "-", "\n", bagging)

print("\n", "Use the training data to train the estimator")
bagging.fit(X_train, y_train)
print("\n", "After training, X_train", "\n", 50 * "-", "\n", X_train)
print("\n", "After training, y_train", "\n", 50 * "-", "\n", y_train)

# Evaluate the model
models.loc['train_mse', 'Bagging'] = mean_squared_error(y_pred=bagging.predict(X_train),
                                                        y_true=y_train)

models.loc['test_mse', 'Bagging'] = mean_squared_error(y_pred=bagging.predict(X_test),
                                                       y_true=y_test)

print("\n", "models after evaluating", "\n", 50 * "-", "\n", models)

print("\n", 50 * "-", "\nProgram Over")
16.2. Output
import BaggingRegressor
create instance of KNeighborsRegressor and BaggingRegressor
knn_for_bagging
--------------------------------------------------
KNeighborsRegressor(algorithm='auto', leaf_size=30, metric='euclidean',
                    metric_params=None, n_jobs=None, n_neighbors=20, p=2,
                    weights='distance')
bagging
--------------------------------------------------
BaggingRegressor(base_estimator=KNeighborsRegressor(algorithm='auto',
                                                    leaf_size=30,
                                                    metric='euclidean',
                                                    metric_params=None,
                                                    n_jobs=None,
                                                    n_neighbors=20, p=2,
                                                    weights='distance'),
                 bootstrap=True, bootstrap_features=False, max_features=0.75,
                 max_samples=1.0, n_estimators=15, n_jobs=-1, oob_score=False,
                 random_state=55, verbose=0, warm_start=False)

Use the training data to train the estimator

After training, X_train
--------------------------------------------------
[[ 0.015625   -0.06666667  0.         ...  0.          0.          0.        ]
 [ 2.046875    0.33333333  1.         ...  0.          0.          0.        ]
 [ 0.625       0.46666667  0.         ...  0.          0.          0.        ]
 ...
 [ 0.984375   -0.2         0.33333333 ...  0.          0.          0.        ]
 [-0.609375    0.33333333  1.         ...  1.          0.          0.        ]
 [ 0.3125      1.8         1.33333333 ...  0.          0.          0.        ]]

After training, y_train
--------------------------------------------------
51408     2370
25582    14426
8877      4484
17084     6811
35353      898
         ...
10213     4742
16253     6501
17352     6963
28967      435
4762      3689
Name: price, Length: 43152, dtype: int64

models after evaluating
--------------------------------------------------
              KNN Bagging RandomForest Boosting
train_mse  78.503  125735          NaN      NaN
test_mse   774504  752601          NaN      NaN

--------------------------------------------------
17. RandomForest Model
Warning!!!
It takes 10-20 minutes.
17.1. Code
print("\n", "import RandomForestRegressor")
from sklearn.ensemble import RandomForestRegressor

RF = RandomForestRegressor(n_estimators=50, max_depth=16, random_state=55, n_jobs=-1)

print("RF", "\n", 50 * "-", "\n", RF)

print("\n", "Use the training data to train the estimator")
RF.fit(X_train, y_train)
print("\n", "After training, X_train", "\n", 50 * "-", "\n", X_train)
print("\n", "After training, y_train", "\n", 50 * "-", "\n", y_train)

models.loc['train_mse', 'RandomForest'] = mean_squared_error(y_pred=RF.predict(X_train),
                                                             y_true=y_train)

models.loc['test_mse', 'RandomForest'] = mean_squared_error(y_pred=RF.predict(X_test),
                                                            y_true=y_test)

print("\n", "models after evaluating", "\n", 50 * "-", "\n", models)
17.2. Output
import RandomForestRegressor
RF
--------------------------------------------------
RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=16,
                      max_features='auto', max_leaf_nodes=None,
                      min_impurity_decrease=0.0, min_impurity_split=None,
                      min_samples_leaf=1, min_samples_split=2,
                      min_weight_fraction_leaf=0.0, n_estimators=50,
                      n_jobs=-1, oob_score=False, random_state=55, verbose=0,
                      warm_start=False)

Use the training data to train the estimator

After training, X_train
--------------------------------------------------
[[ 0.015625   -0.06666667  0.         ...  0.          0.          0.        ]
 [ 2.046875    0.33333333  1.         ...  0.          0.          0.        ]
 [ 0.625       0.46666667  0.         ...  0.          0.          0.        ]
 ...
 [ 0.984375   -0.2         0.33333333 ...  0.          0.          0.        ]
 [-0.609375    0.33333333  1.         ...  1.          0.          0.        ]
 [ 0.3125      1.8         1.33333333 ...  0.          0.          0.        ]]

After training, y_train
--------------------------------------------------
51408     2370
25582    14426
8877      4484
17084     6811
35353      898
         ...
10213     4742
16253     6501
17352     6963
28967      435
4762      3689
Name: price, Length: 43152, dtype: int64

models after evaluating
--------------------------------------------------
              KNN Bagging RandomForest Boosting
train_mse  78.503  125735       142396      NaN
test_mse   774504  752601       374999      NaN

--------------------------------------------------
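A fitted RandomForestRegressor exposes feature_importances_, which shows which inputs drive the price predictions. An optional sketch; it assumes diamonds still holds the encoded columns in the same order used to build X (duplicate cut_* columns included):

feature_names = diamonds.drop('price', axis=1).columns
importances = pd.Series(RF.feature_importances_, index=feature_names)
print(importances.sort_values(ascending=False).head(10))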
18. Boosting Model
Warning!!!
It takes 20-30 minutes.
18.1. Code
print("\n", "import AdaBoostRegressor")
from sklearn.ensemble import AdaBoostRegressor

print("create instance of AdaBoostRegressor")
#knn = KNeighborsRegressor(n_neighbors=20, weights='distance', metric='euclidean', n_jobs=-1)
#knn_for_bagging = KNeighborsRegressor(n_neighbors=20, weights='distance', metric='euclidean')
#bagging = BaggingRegressor(base_estimator=knn_for_bagging, n_estimators=15, max_features=0.75,
#                           random_state=55, n_jobs=-1)
#RF = RandomForestRegressor(n_estimators=50, max_depth=16, random_state=55, n_jobs=-1)
boosting = AdaBoostRegressor(n_estimators=50, learning_rate=0.05, random_state=55)

print("boosting", "\n", 50 * "-", "\n", boosting)

print("\n", "Use the training data to train the estimator")
boosting.fit(X_train, y_train)
print("\n", "After training, X_train", "\n", 50 * "-", "\n", X_train)
print("\n", "After training, y_train", "\n", 50 * "-", "\n", y_train)

models.loc['train_mse', 'Boosting'] = mean_squared_error(y_pred=boosting.predict(X_train),
                                                         y_true=y_train)

models.loc['test_mse', 'Boosting'] = mean_squared_error(y_pred=boosting.predict(X_test),
                                                        y_true=y_test)

print("\n", "models after evaluating", "\n", 50 * "-", "\n", models)
18.2. Output
import AdaBoostRegressor
create instance of AdaBoostRegressor
boosting
--------------------------------------------------
AdaBoostRegressor(base_estimator=None, learning_rate=0.05, loss='linear',
                  n_estimators=50, random_state=55)

Use the training data to train the estimator

After training, X_train
--------------------------------------------------
[[ 0.015625   -0.06666667  0.         ...  0.          0.          0.        ]
 [ 2.046875    0.33333333  1.         ...  0.          0.          0.        ]
 [ 0.625       0.46666667  0.         ...  0.          0.          0.        ]
 ...
 [ 0.984375   -0.2         0.33333333 ...  0.          0.          0.        ]
 [-0.609375    0.33333333  1.         ...  1.          0.          0.        ]
 [ 0.3125      1.8         1.33333333 ...  0.          0.          0.        ]]

After training, y_train
--------------------------------------------------
51408     2370
25582    14426
8877      4484
17084     6811
35353      898
         ...
10213     4742
16253     6501
17352     6963
28967      435
4762      3689
Name: price, Length: 43152, dtype: int64

models after evaluating
--------------------------------------------------
              KNN Bagging RandomForest     Boosting
train_mse  78.503  125735       142396  1.82036e+06
test_mse   774504  752601       374999  1.81305e+06

--------------------------------------------------
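AdaBoost here boosts its default base estimator, a shallow decision tree, with a small learning rate and only 50 rounds, which helps explain why it trails the other models by a wide margin. One alternative worth trying, not covered in the original post, is gradient boosting; a sketch with illustrative, untuned hyperparameters:

from sklearn.ensemble import GradientBoostingRegressor

# Hyperparameters below are assumptions, not tuned values
gbr = GradientBoostingRegressor(n_estimators=200, learning_rate=0.1,
                                max_depth=4, random_state=55)
gbr.fit(X_train, y_train)
print("GBR test MSE:", mean_squared_error(y_true=y_test, y_pred=gbr.predict(X_test)))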
19. Save Models into a file
Warning!!!
It takes 20-40 minutes.
19.1. Code
import pickle

filename = 'output/all_models.sav'

print("\n", 50 * "-", "\nDumping models to", filename)

# Note: 'models' here is the results DataFrame, not the trained estimators;
# the fitted models (knn, bagging, RF, boosting) would need to be pickled separately.
pickle.dump(models, open(filename, 'wb'))
19.2. Output
models after evaluating
--------------------------------------------------
              KNN Bagging RandomForest     Boosting
train_mse  78.503  125735       142396  1.82036e+06
test_mse   774504  752601       374999  1.81305e+06

--------------------------------------------------
Dumping models to output/all_models.sav

--------------------------------------------------
Program Over
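To reuse the saved table later, load the pickle back. Note again that the file holds the results DataFrame, not the fitted estimators; those would have to be dumped separately (the rf filename below is illustrative):

import pickle

with open('output/all_models.sav', 'rb') as f:
    models_loaded = pickle.load(f)
print(models_loaded)

# To persist a trained estimator itself, e.g. the random forest:
# pickle.dump(RF, open('output/rf_model.sav', 'wb'))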