Analysis
Definitions
Let us first define some functions used throughout the notebook. We don’t include the imports for the sake of space, but please look at the notebook nbs/analysis.ipynb for details.
def draw_tree(t, df, size=10, ratio=0.6, precision=0, **kwargs):
    s = export_graphviz(t, out_file=None, feature_names=df.columns, filled=True, rounded=True,
                        special_characters=True, rotate=False, precision=precision, **kwargs)
    return graphviz.Source(re.sub('Tree {', f'Tree {{ size={size}; ratio={ratio}', s))

def cluster_columns(df, figsize=(10,6), font_size=12):
    corr = np.round(scipy.stats.spearmanr(df).correlation, 4)
    corr_condensed = hc.distance.squareform(1 - corr)
    z = hc.linkage(corr_condensed, method='average')
    fig = plt.figure(figsize=figsize)
    hc.dendrogram(z, labels=df.columns, orientation='left', leaf_font_size=font_size)
    plt.show()
Let us also define some variables:
path = Path('data/datasets/autos.csv')
results_path = Path('results/used_cars')
results_path.mkdir(parents=True, exist_ok=True)

target_variable = 'sale_duration'
outliers = dict(numeric_variables_high=0.99,
                numeric_variables_low=0.01,
                date_created=True)
add_datepart_flag = True
cd ..
/home/jaumeamllo/workspace/mine/tabularml
Load data
The first step is to load the data and have a quick look at it:
df = pd.read_csv(path, encoding="ISO-8859-1", parse_dates=['dateCrawled', 'dateCreated', 'lastSeen'])
df.head()
dateCrawled | name | seller | offerType | price | abtest | vehicleType | yearOfRegistration | gearbox | powerPS | model | kilometer | monthOfRegistration | fuelType | brand | notRepairedDamage | dateCreated | nrOfPictures | postalCode | lastSeen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-24 11:52:17 | Golf_3_1.6 | privat | Angebot | 480 | test | NaN | 1993 | manuell | 0 | golf | 150000 | 0 | benzin | volkswagen | NaN | 2016-03-24 | 0 | 70435 | 2016-04-07 03:16:57 |
1 | 2016-03-24 10:58:45 | A5_Sportback_2.7_Tdi | privat | Angebot | 18300 | test | coupe | 2011 | manuell | 190 | NaN | 125000 | 5 | diesel | audi | ja | 2016-03-24 | 0 | 66954 | 2016-04-07 01:46:50 |
2 | 2016-03-14 12:52:21 | Jeep_Grand_Cherokee_"Overland" | privat | Angebot | 9800 | test | suv | 2004 | automatik | 163 | grand | 125000 | 8 | diesel | jeep | NaN | 2016-03-14 | 0 | 90480 | 2016-04-05 12:47:46 |
3 | 2016-03-17 16:54:04 | GOLF_4_1_4__3TÜRER | privat | Angebot | 1500 | test | kleinwagen | 2001 | manuell | 75 | golf | 150000 | 6 | benzin | volkswagen | nein | 2016-03-17 | 0 | 91074 | 2016-03-17 17:40:17 |
4 | 2016-03-31 17:25:20 | Skoda_Fabia_1.4_TDI_PD_Classic | privat | Angebot | 3600 | test | kleinwagen | 2008 | manuell | 69 | fabia | 90000 | 7 | diesel | skoda | nein | 2016-03-31 | 0 | 60437 | 2016-04-06 10:17:21 |
Cleaning and EDA
We first gather the numerical and categorical columns:
numerical_variables = [
    x for x in df.dtypes.index
    if (str(df.dtypes[x]).startswith('int') or str(df.dtypes[x]).startswith('float')) and x != 'postalCode'
]
numerical_variables
['price',
'yearOfRegistration',
'powerPS',
'kilometer',
'monthOfRegistration',
'nrOfPictures']
categorical_variables = set(df.columns).difference(numerical_variables)
categorical_variables
{'abtest',
'brand',
'dateCrawled',
'dateCreated',
'fuelType',
'gearbox',
'lastSeen',
'model',
'name',
'notRepairedDamage',
'offerType',
'postalCode',
'seller',
'vehicleType'}
categorical_variables = categorical_variables.difference({'dateCrawled', 'dateCreated', 'name', 'lastSeen'})
categorical_variables = list(categorical_variables)
categorical_variables
Outliers
Next, we remove outliers found in the data. For that purpose, we first examine the statistics of numerical variables:
df.describe()
price | yearOfRegistration | powerPS | kilometer | monthOfRegistration | nrOfPictures | postalCode | |
---|---|---|---|---|---|---|---|
count | 3.715280e+05 | 371528.000000 | 371528.000000 | 371528.000000 | 371528.000000 | 371528.0 | 371528.00000 |
mean | 1.729514e+04 | 2004.577997 | 115.549477 | 125618.688228 | 5.734445 | 0.0 | 50820.66764 |
std | 3.587954e+06 | 92.866598 | 192.139578 | 40112.337051 | 3.712412 | 0.0 | 25799.08247 |
min | 0.000000e+00 | 1000.000000 | 0.000000 | 5000.000000 | 0.000000 | 0.0 | 1067.00000 |
25% | 1.150000e+03 | 1999.000000 | 70.000000 | 125000.000000 | 3.000000 | 0.0 | 30459.00000 |
50% | 2.950000e+03 | 2003.000000 | 105.000000 | 150000.000000 | 6.000000 | 0.0 | 49610.00000 |
75% | 7.200000e+03 | 2008.000000 | 150.000000 | 150000.000000 | 9.000000 | 0.0 | 71546.00000 |
max | 2.147484e+09 | 9999.000000 | 20000.000000 | 150000.000000 | 12.000000 | 0.0 | 99998.00000 |
- We see outliers in:
- Price: minimum value 0, and maximum value > 2000M euros
- Year of registration: minimum value 1000, and maximum value > 3000
- Month of registration: minimum value 0
- powerPS: minimum value 0, maximum value 20K
- We also see that nrOfPictures is always zero, so it is not relevant
We have two possibilities:
- Remove rows with outliers
- Treat outlier values as missing values
In this analysis, we go for the second option.
Typical outlier detection methods:
- Based on percentile
- Based on z-score
In this analysis, we go for the first option, based on percentiles:
df_clean = df.copy()
df_clean[numerical_variables] = df_clean[numerical_variables].apply(
    lambda x: x.mask((x < x.quantile(outliers['numeric_variables_low'])) | (x > x.quantile(outliers['numeric_variables_high'])))
)
# see if there are rows that have all values missing and need to be removed, and
# check the maximum ratio of missing values in one row
df_clean[numerical_variables].isna().all(axis=1).any(), df_clean[numerical_variables].isna().mean(axis=1).max()
(False, 0.5)
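For reference, the z-score alternative that we do not use here would look roughly like the following sketch (the 3-standard-deviations threshold is an arbitrary assumption):

```python
# Sketch of the z-score alternative (not used in this analysis): mask values
# further than 3 standard deviations from the column mean.
z = (df[numerical_variables] - df[numerical_variables].mean()) / df[numerical_variables].std()
df_z = df.copy()
df_z[numerical_variables] = df[numerical_variables].mask(z.abs() > 3)
```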
Remove nrOfPictures
We remove the column nrOfPictures, since it is not relevant:
df_clean = df_clean.drop(columns='nrOfPictures')
numerical_variables.remove('nrOfPictures')
dateCreated
We examine the column dateCreated:
df_clean.dateCreated.max() - df_clean.dateCreated.min()
Timedelta('759 days 00:00:00')
The time span is more than 2 years. However, we see below that there are only a few cases where the ad was posted much earlier; the rest are just a few months old. To see this, we transform dateCreated into the offset, in days, since the earliest ad in the dataset:
df_clean['offset_date'] = (df_clean['dateCreated'] - df_clean['dateCreated'].min()).dt.days
df_clean.offset_date.hist()
<Axes: >
Clearly, almost all the ads are in one bin. Let’s look at this:
number_ads, bins = np.histogram(df_clean.offset_date)
pd.DataFrame([number_ads, bins[1:]], index=['number_ads', 'bins'])
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | |
---|---|---|---|---|---|---|---|---|---|---|
number_ads | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 4.0 | 5.0 | 53.0 | 371464.0 |
bins | 75.9 | 151.8 | 227.7 | 303.6 | 379.5 | 455.4 | 531.3 | 607.2 | 683.1 | 759.0 |
In the table above, we see that almost all the ads fall in the last bin, followed by the second-to-last bin with 53 ads. The first five bins contain only 2 ads in total.
very_old_ads = df_clean[df_clean.offset_date < 450]
other_ads = df_clean[df_clean.offset_date >= 450]

print('oldest ads created between: ', very_old_ads.dateCreated.min(), very_old_ads.dateCreated.max())
print('remaining ads created between: ', other_ads.dateCreated.min(), other_ads.dateCreated.max())
oldest ads created between: 2014-03-10 00:00:00 2015-03-20 00:00:00
remaining ads created between: 2015-06-11 00:00:00 2016-04-07 00:00:00
We have five options:
- no outlier
- the oldest ad (2014) is an outlier
- the two oldest ads (created before mid-2015) are outliers
- all but the ones in the last two bins (with 53 and 371464 rows) are outliers
- use a quantile, as done for the other numerical variables
In this analysis, we use the third option: treat the two oldest ads as outliers:
if outliers['date_created']:
    df_clean = df_clean[df_clean.dateCreated >= '2015-06-01']
df_clean = df_clean.drop(columns='offset_date')
yearOfRegistration + monthOfRegistration
We use yearOfRegistration and monthOfRegistration to calculate the age of the vehicle, in days, relative to the most recent ad creation date. We first check whether the outlier removal got rid of the suspiciously low and high values:
df_clean[['monthOfRegistration', 'yearOfRegistration']].describe().loc[['min', 'max']]
monthOfRegistration | yearOfRegistration | |
---|---|---|
min | 0.0 | 1978.0 |
max | 12.0 | 2018.0 |
There are still months with value 0; we treat them as missing values and impute them with the middle of the year, month 6.
df_clean.loc[df_clean.monthOfRegistration == 0, 'monthOfRegistration'] = 6
df_clean['age'] = pd.to_datetime(df_clean.dateCreated.max()) - pd.to_datetime({
    'month': df_clean.monthOfRegistration,
    'year': df_clean.yearOfRegistration,
    'day': np.tile(15, df_clean.shape[0])
})
df_clean['age'] = df_clean['age'].dt.days
numerical_variables.remove('yearOfRegistration')
numerical_variables.remove('monthOfRegistration')
numerical_variables.append('age')
Target variable
df_clean[target_variable] = (df_clean['lastSeen'] - df_clean['dateCreated']).dt.days
Date information
Depending on whether the ad was posted on a weekend, on holidays, etc., more people might be able to look at it and it might be sold more quickly. We add this information here:
if add_datepart_flag:
    country_holidays = holidays.country_holidays('DE')
    df_clean['holidays'] = [int(day in country_holidays) for day in df_clean.dateCreated]
    df_clean['day_of_week'] = df_clean.dateCreated.dt.dayofweek.values

numerical_variables.append('day_of_week')
categorical_variables.append('holidays')
joblib.dump([df_clean, numerical_variables, categorical_variables], results_path / 'df_clean.pkl')
['results/used_cars/df_clean.pkl']
Histograms
We look at the resulting histograms to see if there are clear outliers. This doesn’t seem to be the case.
df_clean['price'].hist()
<Axes: >
df_clean['age'].hist()
<Axes: >
df_clean['powerPS'].hist()
<Axes: >
df_clean['kilometer'].hist()
<Axes: >
Split data and preprocess
The first step is to split the data into training and validation sets. For simplicity, we don’t use a test set here, although we should if we wanted to estimate the final accuracy after selecting our best model on the validation set.
from sklearn.model_selection import train_test_split
train_idx, valid_idx = train_test_split(np.arange(df_clean.shape[0]), test_size=0.30, random_state=42)
joblib.dump([train_idx, valid_idx], results_path / 'indexes.pkl')
['results/used_cars/indexes.pkl']
The second step is to preprocess the data. We do that with the help of TabularPandas in the fastai library:
- Using Categorify, we replace the columns in the categorical_variables list with numeric categorical columns. We use a single discrete value per category instead of a one-hot encoding; this tends to work better for decision trees and random forests, as explored in “Splitting on Categorical Predictors in Random Forests”.
- Using FillMissing, we replace missing values with the median, and we create a boolean column that is True for any row where the value was missing.
procs = [Categorify, FillMissing]
tabular = TabularPandas(df_clean, procs, categorical_variables, numerical_variables, y_names=target_variable, splits=(list(train_idx), list(valid_idx)))
path = Path('data/datasets')
joblib.dump(tabular, path / 'tabular.pkl')
['data/datasets/tabular.pkl']
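As a quick sanity check (a sketch, not part of the original analysis), we can peek at the processed training data:

```python
# Quick look at the preprocessed training data (sketch).
# Categorical columns are stored as integer codes, and FillMissing adds
# *_na indicator columns for the variables that had missing values.
tabular.train.xs.head()
```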
Decision Tree
We start by fitting a very simple yet powerful type of model, the decision tree. A good characteristic of this type of model is that it is interpretable and allows us to analyze the data and the important variables.
tabular = joblib.load(path / 'tabular.pkl')
X, y = tabular.train.xs, tabular.train.y
valid_X, valid_y = tabular.valid.xs, tabular.valid.y
Visualization
We can visualize on which basis the tree splits the data. We see that:
- The most important predictor is the `price`: the lower the price, the shorter the sale duration.
- Most of the error comes from sales with long durations. This might be due to the few cases with a very long duration, which could almost be considered outliers.
m = DecisionTreeRegressor(max_leaf_nodes=4)
m.fit(X, y);
draw_tree(m, X, size=10, leaves_parallel=True, precision=2)
We can visualize the same using the dtreeviz library:
samp_idx = np.random.permutation(len(y))[:500]
viz = dtreeviz.model(m, X.iloc[samp_idx], y.iloc[samp_idx], feature_names=X.columns, target_name=target_variable)
viz.view(fontname='DejaVu Sans', scale=2, label_fontsize=16, orientation='LR')
Training
Let’s start by training a full-size decision tree.
dt_model = DecisionTreeRegressor()
dt_model.fit(X, y);
Evaluation
We use the root mean squared error (RMSE) as the evaluation metric.
def r_mse(pred, y):
    return round(math.sqrt(((pred-y)**2).mean()), 6)

def m_rmse(dt_model, X, y):
    return r_mse(dt_model.predict(X), y)
Let’s look at the error on the training set:
m_rmse (dt_model, X, y)
0.283855
… and on the validation set:
m_rmse (dt_model, valid_X, valid_y)
11.614361
We can see that we are clearly overfitting. Let’s look at the number of leaves, and compare it against the total number of observations:
dt_model.get_n_leaves(), len(X)
(201472, 260068)
The number of leaves is similar to the number of observations. We need to reduce the size of the decision tree:
dt_model = DecisionTreeRegressor(min_samples_leaf=25)
dt_model.fit(tabular.train.xs, tabular.train.y)
m_rmse(dt_model, X, y), m_rmse(dt_model, valid_X, valid_y)
(7.683234, 8.817406)
We get a much better error on the validation set; let’s look at the number of leaves:
dt_model.get_n_leaves()
7949
Random Forest
We have built a baseline using decision trees. Let us now explore the use of Random Forests.
Training
def fitted_rf(X, y, n_estimators=128, max_features=0.5, min_samples_leaf=5, **kwargs):
    return RandomForestRegressor(
        n_jobs=-1, n_estimators=n_estimators, max_features=max_features,
        min_samples_leaf=min_samples_leaf, oob_score=True, **kwargs
    ).fit(X, y)

rf_model = fitted_rf(X, y)
joblib.dump(rf_model, results_path / 'rf_model1.pkl')
Evaluation
m_rmse (rf_model, X, y), m_rmse(rf_model, valid_X, valid_y), r_mse (rf_model.oob_prediction_, y)
(6.048382, 8.272029, 8.215691)
We check whether using another rule, such as sqrt, for calculating max_features per tree improves the model:
rf_model_2 = fitted_rf(X, y, max_features='sqrt')
m_rmse(rf_model_2, X, y), m_rmse(rf_model_2, valid_X, valid_y), r_mse(rf_model_2.oob_prediction_, y)
(6.649456, 8.296415, 8.233054)
It doesn’t improve. We could play with the ratio of max features, the minimum samples per leaf, and other hyper-parameters. We leave this as future work.
Visualization
Let’s see the impact of n_estimators on the performance:
import warnings
warnings.filterwarnings('ignore')
preds = np.stack([t.predict(valid_X) for t in rf_model.estimators_])
plt.plot([r_mse(preds[:i+1].mean(0), valid_y) for i in range(128)]);
We see that the error plateaus from around 80 estimators onwards.
Interpretation
Confidence
We look at the standard deviation of the predictions across the different trees. For sales where the standard deviation is low, most of the trees agree on the sale duration estimate. This is useful in production: we might avoid providing a prediction for requests where there is little agreement on the sale duration.
preds = np.stack([t.predict(valid_X) for t in rf_model.estimators_])
preds_std = preds.std(0)
plt.hist(preds_std);
We see that the standard deviation is generally high for most sales, except for a few cases.
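As a sketch of how this could be used in production (the 90th-percentile cutoff is an arbitrary assumption, not something tuned in this analysis):

```python
# Flag validation predictions where the trees disagree strongly (sketch).
threshold = np.quantile(preds_std, 0.9)   # arbitrary cutoff: top 10% most uncertain
uncertain = preds_std > threshold
print(f'{uncertain.mean():.1%} of validation predictions flagged as low-confidence')
```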
Feature importance
Let’s look at what features are most important for the regression task:
def rf_feat_importance(rf_model, df):
    return pd.DataFrame({'cols': df.columns, 'imp': rf_model.feature_importances_}).sort_values('imp', ascending=False)

importance = rf_feat_importance(rf_model, X)
importance
cols | imp | |
---|---|---|
7 | postalCode | 0.193067 |
14 | price | 0.184392 |
17 | age | 0.153438 |
15 | powerPS | 0.099052 |
6 | model | 0.088209 |
0 | brand | 0.061477 |
18 | day_of_week | 0.060955 |
3 | vehicleType | 0.042012 |
16 | kilometer | 0.036495 |
9 | abtest | 0.023295 |
4 | fuelType | 0.016574 |
8 | gearbox | 0.015988 |
5 | notRepairedDamage | 0.015318 |
10 | holidays | 0.006917 |
11 | price_na | 0.001413 |
13 | age_na | 0.000947 |
12 | powerPS_na | 0.000451 |
1 | seller | 0.000000 |
2 | offerType | 0.000000 |
We see that the most important features are:
- postalCode: probably due to the higher density of population, and thus buyers, in certain areas.
- price: cheaper cars are sold faster.
- age: newer is better
Surprisingly, the day of the week plays a more important role than the number of kilometers; maybe people browse ads more on certain days of the week (a quick check is sketched below).
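One quick way to eyeball the day-of-week effect (a sketch that is not part of the original analysis; it assumes day_of_week keeps pandas’ 0 = Monday encoding) is to compare the mean sale duration per day the ad was created:

```python
# Mean sale duration per day of the week the ad was created (sketch).
# day_of_week comes from pandas' dayofweek: 0 = Monday ... 6 = Sunday.
X.assign(sale_duration=y.values).groupby('day_of_week')['sale_duration'].mean()
```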
Let’s look at the feature importance graphically:
def plot_feature_importance(importance):
    return importance.plot('cols', 'imp', 'barh', figsize=(12,7), legend=False)

plot_feature_importance(importance);
Removing variables
We see if we can improve the performance by removing those features that have very little importance and might be adding noise:
def evaluate_removing_variables(importance, X, y, valid_X, valid_y, threshold):
    to_keep = importance[importance.imp > threshold].cols
    X_imp = X[to_keep]
    valid_X_imp = valid_X[to_keep]

    rf_model = fitted_rf(X_imp, y)
    print(m_rmse(rf_model, X_imp, y), m_rmse(rf_model, valid_X_imp, valid_y), r_mse(rf_model.oob_prediction_, y))
    return rf_model, X_imp, valid_X_imp, to_keep

rf_model, X_imp, valid_X_imp, to_keep = evaluate_removing_variables(importance, X, y, valid_X, valid_y, 0.0001)
joblib.dump(rf_model, results_path / 'removing_low_importance.pkl')
6.045863 8.268369 8.211243
['results/used_cars/removing_low_importance.pkl']
There is almost no improvement; let’s try slightly more aggressive pruning:
rf_model2, X_imp2, valid_X_imp2, to_keep2 = evaluate_removing_variables(importance, X, y, valid_X, valid_y, 0.001)
6.050095 8.272465 8.213242
rf_model2, X_imp2, valid_X_imp2, to_keep2 = evaluate_removing_variables(importance, X, y, valid_X, valid_y, 0.005)
6.010566 8.271691 8.210651
The best threshold seems to be 0.0001. Let’s use it and look at the feature importance again, after removing variables with low importance:
importance = rf_feat_importance(rf_model, X_imp)
plot_feature_importance(importance);
Removing redundant variables
We see if there are variables that might be closely correlated, in terms of ranking:
cluster_columns(X_imp)
powerPS and price seem to be quite correlated. We check what happens when we remove each of the variables in turn. To do this quickly, we use the out-of-bag (OOB) score of a random forest where each tree is trained on a smaller subset of the data, so the OOB score is measured on the remaining rows. This score indicates how well the model generalizes; higher is better.
def get_oob(df, y):
    m = RandomForestRegressor(n_estimators=40, min_samples_leaf=15,
                              max_samples=50000, max_features=0.5, n_jobs=-1, oob_score=True)
    m.fit(df, y)
    return m.oob_score_
Let’s see the score of the original features:
def fitted_rf2(X, y):
    return RandomForestRegressor(
        n_estimators=40, min_samples_leaf=15,
        max_samples=50000, max_features=0.5, n_jobs=-1, oob_score=True).fit(X, y)
original_oob = get_oob(X_imp, y)
original_oob
0.06008218588804892
oob_scores = {c: get_oob(X_imp.drop(c, axis=1), y) for c in importance.cols}
oob_scores
{'postalCode': 0.058280710022504234,
'price': 0.0415493436634512,
'age': 0.05244495695648832,
'powerPS': 0.059019327633660135,
'model': 0.0593482908544406,
'brand': 0.058739696982127,
'day_of_week': 0.05945691943436404,
'vehicleType': 0.05943518125387026,
'kilometer': 0.05984718680711576,
'abtest': 0.06071818811404983,
'fuelType': 0.060080938385732696,
'gearbox': 0.06072952180624125,
'notRepairedDamage': 0.06106010397906847,
'holidays': 0.05685565776346668,
'price_na': 0.060948957015006355,
'age_na': 0.06027832457888893,
'powerPS_na': 0.06106806907313633}
{k: oob_scores[k] for k in oob_scores if oob_scores[k] > original_oob}
{'abtest': 0.06071818811404983,
'gearbox': 0.06072952180624125,
'notRepairedDamage': 0.06106010397906847,
'price_na': 0.060948957015006355,
'age_na': 0.06027832457888893,
'powerPS_na': 0.06106806907313633}
It seems that dropping notRepairedDamage, abtest, and powerPS_na might be beneficial.
def evaluate_removing_redundant(to_drop, X_imp, y, valid_X_imp, valid_y):
    print('OOB score: ', get_oob(X_imp.drop(to_drop, axis=1), y))
    X_final = X_imp.drop(to_drop, axis=1)
    valid_X_final = valid_X_imp.drop(to_drop, axis=1)
    rf_model = fitted_rf(X_final, y)
    print('Regression errors: ', m_rmse(rf_model, X_final, y), m_rmse(rf_model, valid_X_final, valid_y), r_mse(rf_model.oob_prediction_, y))
    return rf_model, X_final, valid_X_final

rf_model, X_final, valid_X_final = evaluate_removing_redundant(['abtest', 'notRepairedDamage', 'powerPS_na'], X_imp, y, valid_X_imp, valid_y)
joblib.dump([rf_model, X_final, valid_X_final], results_path / 'removing_redundant.pkl')
OOB score: 0.0611047486037124
Regression errors: 6.050136 8.261787 8.202955
['results/used_cars/removing_redundant.pkl']
rf_model2, X_final2, valid_X_final2 = evaluate_removing_redundant(['notRepairedDamage', 'powerPS_na'], X_imp, y, valid_X_imp, valid_y)
OOB score: 0.0606092565973716
Regression errors: 6.065446 8.26426 8.207189
rf_model2, X_final2, valid_X_final2 = evaluate_removing_redundant(['notRepairedDamage'], X_imp, y, valid_X_imp, valid_y)
OOB score: 0.060027249156653584
Regression errors: 6.03148 8.262977 8.207328
rf_model2, X_final2, valid_X_final2 = evaluate_removing_redundant(['abtest'], X_imp, y, valid_X_imp, valid_y)
OOB score: 0.060528790706093516
Regression errors: 6.037068 8.266976 8.205677
We select the model without [‘abtest’, ‘notRepairedDamage’, ‘powerPS_na’]. The result is slightly better, with a lower number of variables, which tends to be good.
Neural Network
We use a similar preprocessing to the one used for the RF, but with two important differences:
- The numerical variables are standard-normalized.
- The categorical variables will be passed to an embedding layer in the Neural Network, which maps discrete integers to embeddings that are learned to optimally represent each category of the variable. This tends to be a bit more effective and efficient than using a one-hot encoding.
selected_columns = set(X_final.columns).intersection(df_clean.columns)
df_nn = df_clean[list(selected_columns) + [target_variable]].copy()
df_nn[target_variable] = df_nn[target_variable].astype(np.float32)
categorical_variables = set(selected_columns).intersection(categorical_variables)
numerical_variables = set(selected_columns).intersection(numerical_variables)
procs = [Categorify, FillMissing, Normalize]
tabular = TabularPandas(df_nn, procs, list(categorical_variables), list(numerical_variables), y_names=target_variable, splits=(list(train_idx), list(valid_idx)))
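Out of curiosity, we can inspect the embedding sizes that fastai will assign to each categorical variable (a sketch relying on fastai’s get_emb_sz heuristic; this step is not required for training):

```python
# Embedding sizes fastai derives for each categorical variable (sketch).
# Each tuple is roughly (number of category codes, embedding dimension).
emb_szs = get_emb_sz(tabular)
list(zip(tabular.cat_names, emb_szs))
```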
We use a relatively large batch size, since tabular data does not take up much memory.
dls = tabular.dataloaders(1024)
y = tabular.train.y
learn = tabular_learner(dls, y_range=(np.floor(y.min()), np.ceil(y.max())), layers=[500,250],
                        n_out=1, loss_func=F.mse_loss)
learn.lr_find()
SuggestedLRs(valley=0.001737800776027143)
We try first with learning rate 1e-2 and 10 epochs:
learn.fit_one_cycle(10, 1e-2)
epoch | train_loss | valid_loss | time |
---|---|---|---|
0 | 602.610474 | 76.169884 | 00:11 |
1 | 78.452293 | 79.093147 | 00:10 |
2 | 72.395233 | 75.213257 | 00:10 |
3 | 68.382301 | 73.556313 | 00:10 |
4 | 64.683945 | 75.480438 | 00:10 |
5 | 61.083923 | 77.244652 | 00:10 |
6 | 50.728420 | 79.223137 | 00:11 |
7 | 43.104126 | 84.684929 | 00:10 |
8 | 34.933899 | 85.208206 | 00:10 |
9 | 30.915373 | 87.217155 | 00:10 |
We see that training for around 4 or 5 epochs works better.
learn = tabular_learner(dls, y_range=(np.floor(y.min()), np.ceil(y.max())), layers=[500,250],
                        n_out=1, loss_func=F.mse_loss)
learn.fit_one_cycle(4, 1e-2)
preds, targs = learn.get_preds()
r_mse(preds, targs)
epoch | train_loss | valid_loss | time |
---|---|---|---|
0 | 345.358978 | 78.925217 | 00:11 |
1 | 74.916069 | 75.385986 | 00:10 |
2 | 65.404991 | 74.644577 | 00:10 |
3 | 51.674995 | 76.773880 | 00:11 |
8.762071
The validation error is 8.76, higher than the one obtained by the RF. Let’s try a lower learning rate:
learn2 = tabular_learner(dls, y_range=(np.floor(y.min()), np.ceil(y.max())), layers=[500,250],
                         n_out=1, loss_func=F.mse_loss)
learn2.fit_one_cycle(10, 1e-3)
epoch | train_loss | valid_loss | time |
---|---|---|---|
0 | 12781.711914 | 7655.472656 | 00:11 |
1 | 333.679382 | 107.907936 | 00:11 |
2 | 68.636696 | 75.976501 | 00:11 |
3 | 60.806812 | 77.812447 | 00:10 |
4 | 54.039669 | 81.903381 | 00:10 |
5 | 46.701462 | 79.440315 | 00:10 |
6 | 37.569061 | 79.386604 | 00:10 |
7 | 29.916273 | 80.873795 | 00:10 |
8 | 24.533495 | 81.103035 | 00:11 |
9 | 21.414110 | 81.744980 | 00:10 |
We see that the loss is no better than with the larger learning rate. Let’s try an even larger one:
learn2 = tabular_learner(dls, y_range=(np.floor(y.min()), np.ceil(y.max())), layers=[500,250],
                         n_out=1, loss_func=F.mse_loss)
learn2.fit_one_cycle(4, 1e-1)
epoch | train_loss | valid_loss | time |
---|---|---|---|
0 | 198.138016 | 131.202820 | 00:10 |
1 | 75.276131 | 75.907112 | 00:11 |
2 | 113.341293 | 151.957809 | 00:11 |
3 | 150.795166 | 151.957809 | 00:10 |
Clearly, the best learning rate seems to be 0.01.
We could now examine other hyper-parameters, like regularization, dropout, adding more layers or units per layer, adding batch-normalization, etc.
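As an illustration of that direction, here is a sketch of one possible regularized configuration (the dropout probabilities, weight decay, and number of epochs are arbitrary, untuned assumptions):

```python
# Sketch: dropout on the hidden layers and embeddings, plus weight decay.
config = tabular_config(ps=[0.1, 0.1], embed_p=0.05)
learn_reg = tabular_learner(dls, y_range=(np.floor(y.min()), np.ceil(y.max())),
                            layers=[500, 250], n_out=1, loss_func=F.mse_loss,
                            config=config, wd=0.1)
learn_reg.fit_one_cycle(4, 1e-2)
```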
Ensemble of RF and NN
We see if using an ensemble of RF and NN improves the results:
preds, targs = learn.get_preds()

rf_preds = rf_model.predict(valid_X_final)
ens_preds = (to_np(preds.squeeze()) + rf_preds) / 2
r_mse(ens_preds, valid_y)
8.339824
While the ensemble improves on the NN alone, the RF by itself seems to be the better option on this data.
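A simple extension (a sketch, not part of the original comparison) is to search over the mixing weight instead of using a fixed 50/50 average:

```python
# Sketch: evaluate different RF/NN mixing weights; w = 1 corresponds to the RF alone.
nn_preds = to_np(preds.squeeze())
for w in np.linspace(0, 1, 11):
    print(round(w, 1), r_mse(w * rf_preds + (1 - w) * nn_preds, valid_y))
```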
Future lines of work
- As indicated above, we should now examine other hyper-parameters, like regularization, dropout, adding more layers or units per layer, adding batch-normalization, etc.
- Given that a random forest seems to be the best type of model for this data, it is a good idea to explore XGBoost, CatBoost, or other types of tree ensembles; a minimal sketch of a gradient-boosted baseline is included below.
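As a starting point for that last line, here is a minimal sketch using sklearn’s HistGradientBoostingRegressor (chosen here to avoid adding dependencies; the hyper-parameters are arbitrary assumptions, not tuned):

```python
from sklearn.ensemble import HistGradientBoostingRegressor

# Gradient-boosted trees baseline on the final feature set (sketch, untuned).
gb_model = HistGradientBoostingRegressor(max_iter=300, learning_rate=0.1, random_state=42)
gb_model.fit(X_final, y)
m_rmse(gb_model, X_final, y), m_rmse(gb_model, valid_X_final, valid_y)
```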