Gradient Boosting the Naive Bayes algorithm
0. Intro
In the last post, I included a general implementation of the Naive Bayes algorithm. This implementation differed from `sklearn`'s Naive Bayes algorithms, as discussed here.
It is well known that Naive Bayes does well on text-related classification tasks. However, it is not routinely successful in other areas. In this post, I will boost the Naive Bayes algorithm, aiming to end up with a stronger classifier that performs well more consistently.
I organized this post as follows:
- The first part imports the code from a file named nb.py and demonstrates its usage.
- The second part tries the code on three publicly available datasets from the UCI ML repository.
- The last part discusses the implementation by diving deeper into the code.
Before we get started, let’s have a quick refresher on the idea behind boosting: very roughly, it is the idea of stacking so-called weak classifiers on top of each other, in such a way that each new one learns from the combined mistakes of the previous ones. In this notebook, our weak learners are Naive Bayes classifiers. Traditionally though, weak learners are decision stumps, in other words very shallow decision trees.
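To make that refresher concrete, here is a rough schematic of a reweighting-style boosting loop, which is the flavor implemented later in this post. The helper callables (`make_weak_learner`, `compute_weights`, `combine`) are placeholders for illustration, not functions from nb.py:

```python
def boosting_sketch(X, y, make_weak_learner, compute_weights, combine, n_iter=10):
    """Schematic only: fit weak learners on successively reweighted data."""
    ensemble = []
    weights = None  # uniform weights on the first round
    for _ in range(n_iter):
        # upweight the samples the current ensemble gets wrong
        weights = compute_weights(ensemble, X, y, weights)
        learner = make_weak_learner()
        learner.fit(X, y, weights)
        # blend the new learner into the ensemble (e.g. via a line search)
        ensemble = combine(ensemble, learner)
    return ensemble
```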
Disclaimer: Just as in the previous posts, the goal is to get to a working implementation, so the code is sub-optimal.
Let’s dive right into it…
1. Baby steps
Let’s try the boosted NB on the simple, toy “play tennis” dataset. This is the same dataset as in the previous NB post.
import pandas as pd
# # Data is from below. Hardcoding it in order to remove dependency
# data = pd.read_csv(
# 'https://raw.githubusercontent.com/petehunt/c4.5-compiler/master/example/tennis.csv',
# usecols=['outlook', 'temp', 'humidity', 'wind', 'play']
# )
data = [
['Sunny', 'Hot', 'High', 'Weak', 'No'],
['Sunny', 'Hot', 'High', 'Strong', 'No'],
['Overcast', 'Hot', 'High', 'Weak', 'Yes'],
['Rain', 'Mild', 'High', 'Weak', 'Yes'],
['Rain', 'Cool', 'Normal', 'Weak', 'Yes'],
['Rain', 'Cool', 'Normal', 'Strong', 'No'],
['Overcast', 'Cool', 'Normal', 'Strong', 'Yes'],
['Sunny', 'Mild', 'High', 'Weak', 'No'],
['Sunny', 'Cool', 'Normal', 'Weak', 'Yes'],
['Rain', 'Mild', 'Normal', 'Weak', 'Yes'],
['Sunny', 'Mild', 'Normal', 'Strong', 'Yes'],
['Overcast', 'Mild', 'High', 'Strong', 'Yes'],
['Overcast', 'Hot', 'Normal', 'Weak', 'Yes'],
['Rain', 'Mild', 'High', 'Strong', 'No'],
]
data = pd.DataFrame(data, columns='Outlook,Temperature,Humidity,Wind,PlayTennis'.split(','))
X = data[data.columns[:-1]]
y = data.PlayTennis == 'Yes'
data
  | Outlook | Temperature | Humidity | Wind | PlayTennis |
---|---|---|---|---|---|
0 | Sunny | Hot | High | Weak | No |
1 | Sunny | Hot | High | Strong | No |
2 | Overcast | Hot | High | Weak | Yes |
3 | Rain | Mild | High | Weak | Yes |
4 | Rain | Cool | Normal | Weak | Yes |
5 | Rain | Cool | Normal | Strong | No |
6 | Overcast | Cool | Normal | Strong | Yes |
7 | Sunny | Mild | High | Weak | No |
8 | Sunny | Cool | Normal | Weak | Yes |
9 | Rain | Mild | Normal | Weak | Yes |
10 | Sunny | Mild | Normal | Strong | Yes |
11 | Overcast | Mild | High | Strong | Yes |
12 | Overcast | Hot | Normal | Weak | Yes |
13 | Rain | Mild | High | Strong | No |
So we have 4 predictor fields, 14 samples and a binary classification problem. Let’s get the boosted NB classifier from nb.py and try it out. The class is called `NaiveBayesBoostingClassifier`. The API is again very `sklearn`-like, with methods such as `fit`, `predict`, `predict_proba`, `decision_function`, etc. Let’s also compare the boosted NB with the vanilla NB, `NaiveBayesClassifier`, which is the same as before.
We will fit 2 versions of the boosted NB:
- The first with only 1 iteration, that is, 0 boosting iterations. This case is therefore equivalent to vanilla NB.
- The second with 2 iterations: the initial vanilla NB fit, plus 1 boosting iteration on top of that.
The goal is to observe the difference in the calculated posterior probabilities. We will also include the `NaiveBayesClassifier`, and observe that it produces the same posteriors as the trivial boosting case.
%run nb.py
nb = NaiveBayesClassifier()
# n_iter=1 means no boosting, so this should produce exactly the same probas as vanilla NB.
fake_boosting = NaiveBayesBoostingClassifier(n_iter=1)
# n_iter=2 means 1 boosting iteration on top of the vanilla fit, so the posteriors should differ a bit;
real_boosting = NaiveBayesBoostingClassifier(n_iter=2)
fake_boosting.fit(X, y)
real_boosting.fit(X, y)
nb.fit(X, y)
out = pd.Series(nb.predict_proba(X)[1], name='vanilla')
out = pd.DataFrame(out)
cols = ['vanilla', 'boost_fake', 'boost_once', 'truth']
out['boost_once'] = real_boosting.decision_function(X)
out['boost_fake'] = fake_boosting.decision_function(X)
out['truth'] = y
out = out[cols]
out
  | vanilla | boost_fake | boost_once | truth |
---|---|---|---|---|
0 | 0.312031 | 0.312031 | 0.313928 | False |
1 | 0.162746 | 0.162746 | 0.164083 | False |
2 | 0.751472 | 0.751472 | 0.752014 | True |
3 | 0.573354 | 0.573354 | 0.573302 | True |
4 | 0.875858 | 0.875858 | 0.870270 | True |
5 | 0.751472 | 0.751472 | 0.745584 | False |
6 | 0.918955 | 0.918955 | 0.915018 | True |
7 | 0.430499 | 0.430499 | 0.432647 | False |
8 | 0.798736 | 0.798736 | 0.794659 | True |
9 | 0.854638 | 0.854638 | 0.851980 | True |
10 | 0.586325 | 0.586325 | 0.584993 | True |
11 | 0.683522 | 0.683522 | 0.684268 | True |
12 | 0.929719 | 0.929719 | 0.928607 | True |
13 | 0.365459 | 0.365459 | 0.365355 | False |
Observe that boosting with 1 iteration, i.e. the trivial, non-boosting case, produces the same posterior probabilities as the vanilla NB. The first non-trivial boosting iteration, on the other hand, differs from them a tiny bit: the posteriors have been adjusted according to the gradient information from the vanilla NB fit.
Now let’s apply the boosted NB to some publicly available ML datasets. Note that the current version of the boosted NB supports binary classification only.
1.b Framework for comparing the performance
I’ll start this subsection by importing the necessary tools and defining functions that make it easy to compare the performance of several ML classification algorithms, including `NaiveBayesClassifier` and `NaiveBayesBoostingClassifier`. The other ones, all from `sklearn`, are:
- `GaussianNB`
- `AdaBoostClassifier`
- `GradientBoostingClassifier`, which is gradient boosted decision trees
The classifiers listed above are included with their default parameters, without any parameter tuning, as that is beyond the scope of this post. On the other hand, I included two versions of the boosted NB, one with 10 boosting iterations and one with 20. This will give us a feel for how the classifier progresses with more iterations.
import pandas as pd
from sklearn.model_selection import ShuffleSplit
from sklearn.naive_bayes import GaussianNB
from collections import defaultdict
from sklearn.pipeline import Pipeline
from datetime import datetime
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
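# note: in newer sklearn versions, Imputer has been removed; sklearn.impute.SimpleImputer is the replacement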
from sklearn.preprocessing import Imputer
def make_pipe(est):
pipe = [('imputer', Imputer()),
('estimator', est)]
return Pipeline(pipe)
def make_boost_pipe(n):
pipe_boost = [('preprocessor', NaiveBayesPreprocessor(bins=20)),
('nbb', NaiveBayesBoostingClassifier(n_iter=n))]
pipe_boost = Pipeline(pipe_boost)
return pipe_boost
%run nb.py
pipe = [('preprocessor', NaiveBayesPreprocessor(bins=20)),
('nb', NaiveBayesClassifier())]
pipe = Pipeline(pipe)
algos = {'nbayes': pipe,
'gnb': make_pipe(GaussianNB()),
'ada': make_pipe(AdaBoostClassifier()),
'gbt': make_pipe(GradientBoostingClassifier())}
for i in [10, 20]:
algos['nbayes+gboost {}'.format(i)] = make_boost_pipe(i)
def compare(X, y):
rs = ShuffleSplit(test_size=0.25, n_splits=3)
accuracies = defaultdict(list)
rocs = defaultdict(list)
times = {}
for train_index, test_index in rs.split(X):
X_train, X_test = X.loc[train_index], X.loc[test_index]
y_train, y_test = y.loc[train_index], y.loc[test_index]
for key, est in algos.items():
then = datetime.now()
est.fit(X_train, y_train)
accuracies[key].append(est.score(X_test, y_test))
# now the roc auc
if not isinstance(est.steps[-1][-1], GaussianNB):
ys = est.decision_function(X_test)
if len(ys.shape) == 2 and ys.shape[1] == 2:
ys = ys[1]
else:
ys = est.predict_proba(X_test)[:, 1]
rocs[key].append(roc_auc_score(y_test, ys))
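# note: this records only the last split's fit/score time for each algorithm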
times[key] = datetime.now() - then
accuracies = pd.DataFrame(accuracies)
rocs = pd.DataFrame(rocs)
times = pd.Series(times)
accuracies.index.name = 'accuracy'
rocs.index.name = 'roc-auc'
times.index.name = 'time'
return accuracies, rocs, times
def go(X, y):
accuracies, rocs, times = compare(X, y)
print(accuracies.to_string())
print('-'*80)
print(rocs.to_string())
print('-'*80)
print(times.to_string())
A quick look at the last line of the `compare` function tells us that it returns the accuracies, ROC AUCs, and the times it took for all the algorithms it tried. The variable `algos` contains the algorithms we are trying. Now we are in a good position to apply these algorithms to various binary classification problems and see what happens.
2. Trying it out
In this section, the code cells contain URLs that refer to the datasets being used. For more information about them, you can follow those links.
2.1 Spamdata
The first choice is the spam/non-spam binary classification problem for emails, a problem that is well known to be a good use case for the classical Naive Bayes approach, although that sort of implies the dataset has bag-of-words type features and the NB algorithm in question is either `MultinomialNB` or `BernoulliNB`.
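For contrast, the classical text setup that sentence alludes to would look roughly like the sketch below. This is for illustration only and is not run here; the spambase data we load next is already featurized, and `raw_emails`/`labels` are hypothetical raw-text inputs.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

text_nb = Pipeline([
    ('bow', CountVectorizer()),  # raw emails -> bag-of-words counts
    ('clf', MultinomialNB()),    # multinomial NB over the word counts
])
# text_nb.fit(raw_emails, labels)  # hypothetical raw-text inputs
```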
Let’s load the dataset, create our `X` and `y`, and take a quick glance at them.
spamdata = pd.read_csv(
'https://archive.ics.uci.edu/ml/machine-learning-databases/spambase/spambase.data',
header=None)
X = spamdata[spamdata.columns[:-1]].rename(columns={i: 'col{}'.format(i) for i in spamdata})
y = spamdata[spamdata.columns[-1]]
print(y.value_counts())
print('-'*80)
print(X.info())
X.sample(3)
0 2788
1 1813
Name: 57, dtype: int64
--------------------------------------------------------------------------------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4601 entries, 0 to 4600
Data columns (total 57 columns):
col0 4601 non-null float64
col1 4601 non-null float64
col2 4601 non-null float64
col3 4601 non-null float64
col4 4601 non-null float64
col5 4601 non-null float64
col6 4601 non-null float64
col7 4601 non-null float64
col8 4601 non-null float64
col9 4601 non-null float64
col10 4601 non-null float64
col11 4601 non-null float64
col12 4601 non-null float64
col13 4601 non-null float64
col14 4601 non-null float64
col15 4601 non-null float64
col16 4601 non-null float64
col17 4601 non-null float64
col18 4601 non-null float64
col19 4601 non-null float64
col20 4601 non-null float64
col21 4601 non-null float64
col22 4601 non-null float64
col23 4601 non-null float64
col24 4601 non-null float64
col25 4601 non-null float64
col26 4601 non-null float64
col27 4601 non-null float64
col28 4601 non-null float64
col29 4601 non-null float64
col30 4601 non-null float64
col31 4601 non-null float64
col32 4601 non-null float64
col33 4601 non-null float64
col34 4601 non-null float64
col35 4601 non-null float64
col36 4601 non-null float64
col37 4601 non-null float64
col38 4601 non-null float64
col39 4601 non-null float64
col40 4601 non-null float64
col41 4601 non-null float64
col42 4601 non-null float64
col43 4601 non-null float64
col44 4601 non-null float64
col45 4601 non-null float64
col46 4601 non-null float64
col47 4601 non-null float64
col48 4601 non-null float64
col49 4601 non-null float64
col50 4601 non-null float64
col51 4601 non-null float64
col52 4601 non-null float64
col53 4601 non-null float64
col54 4601 non-null float64
col55 4601 non-null int64
col56 4601 non-null int64
dtypes: float64(55), int64(2)
memory usage: 2.0 MB
None
  | col0 | col1 | col2 | col3 | col4 | col5 | col6 | col7 | col8 | col9 | ... | col47 | col48 | col49 | col50 | col51 | col52 | col53 | col54 | col55 | col56 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1783 | 0.33 | 0.84 | 0.67 | 0.0 | 0.67 | 0.33 | 0.67 | 0.0 | 0.33 | 0.0 | ... | 0.0 | 0.0 | 0.183 | 0.000 | 0.156 | 0.104 | 0.026 | 6.500 | 525 | 858 |
3351 | 0.00 | 0.00 | 0.00 | 0.0 | 0.00 | 0.00 | 0.00 | 0.0 | 0.00 | 0.0 | ... | 0.0 | 0.0 | 0.000 | 0.751 | 0.000 | 0.000 | 0.000 | 1.428 | 4 | 10 |
2601 | 0.00 | 0.00 | 0.00 | 0.0 | 0.00 | 0.00 | 0.00 | 0.0 | 0.00 | 0.0 | ... | 0.0 | 0.0 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 1.769 | 8 | 23 |
3 rows × 57 columns
go(X, y)
ada gbt gnb nbayes nbayes+gboost 10 nbayes+gboost 20
accuracy
0 0.936577 0.946134 0.810599 0.893136 0.919201 0.932233
1 0.934839 0.947003 0.820156 0.894005 0.933970 0.943527
2 0.942659 0.947871 0.817550 0.905300 0.931364 0.942659
--------------------------------------------------------------------------------
ada gbt gnb nbayes nbayes+gboost 10 nbayes+gboost 20
roc-auc
0 0.975604 0.983871 0.939048 0.565013 0.975606 0.978455
1 0.977105 0.987682 0.942894 0.545387 0.980848 0.984901
2 0.979726 0.988510 0.956382 0.527922 0.981500 0.984293
--------------------------------------------------------------------------------
time
ada 00:00:00.250069
gbt 00:00:00.527525
gnb 00:00:00.006865
nbayes 00:00:04.577352
nbayes+gboost 10 00:00:28.502304
nbayes+gboost 20 00:00:59.384143
Evaluation:
- `GaussianNB` produced around 94% ROC AUC, which is not a bad number.
- Vanilla NB produced a 56% ROC AUC. That somehow translated to a higher accuracy than `GaussianNB`'s, which suggests a threshold problem with `GaussianNB`.
- Classical boosting algorithms got to 98% ROC AUC.
- Boosted NB with 10 iterations improved significantly over vanilla NB, and it achieved 98% ROC AUC. 20 iterations improved slightly over 10 iterations, which is not a bad sign.
So in this case, boosted NB performed comparably to AdaBoost and gradient boosted DTs. Moreover, we observed a good improvement provided by the boosting iterations over the vanilla NB algorithm. We hope this will be a recurring theme.
2.2 Bankruptcy Data
Next up, I would like to switch to a more business-oriented problem. Here we have data on Polish companies, with labels indicating which ones went bankrupt. Let’s again load the data and take a quick look. Afterwards, let’s apply the algorithms and compare.
# Polish bankruptcy data from
# http://archive.ics.uci.edu/ml/datasets/Polish+companies+bankruptcy+data
from scipy.io import arff
def load_arff(fn):
X = arff.loadarff(fn)[0]
X = pd.DataFrame(X).applymap(lambda r: r if r != '?' else None)
# X.dropna(how='any', inplace=True)
y = X.pop('class')
y = (y == b'1').astype(int)
return X, y
fns = !ls /home/taylanbil/Downloads/*year.arff
_ = [load_arff(fn) for fn in fns]
X = pd.concat([X for X, y in _]).reset_index(drop=True)
y = pd.concat([y for X, y in _]).reset_index(drop=True)
del _
print(y.value_counts())
print('-'*80)
print(X.info())
X.sample(3)
0 41314
1 2091
Name: class, dtype: int64
--------------------------------------------------------------------------------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 43405 entries, 0 to 43404
Data columns (total 64 columns):
Attr1 43397 non-null float64
Attr2 43397 non-null float64
Attr3 43397 non-null float64
Attr4 43271 non-null float64
Attr5 43316 non-null float64
Attr6 43397 non-null float64
Attr7 43397 non-null float64
Attr8 43311 non-null float64
Attr9 43396 non-null float64
Attr10 43397 non-null float64
Attr11 43361 non-null float64
Attr12 43271 non-null float64
Attr13 43278 non-null float64
Attr14 43397 non-null float64
Attr15 43369 non-null float64
Attr16 43310 non-null float64
Attr17 43311 non-null float64
Attr18 43397 non-null float64
Attr19 43277 non-null float64
Attr20 43278 non-null float64
Attr21 37551 non-null float64
Attr22 43397 non-null float64
Attr23 43278 non-null float64
Attr24 42483 non-null float64
Attr25 43397 non-null float64
Attr26 43310 non-null float64
Attr27 40641 non-null float64
Attr28 42593 non-null float64
Attr29 43397 non-null float64
Attr30 43278 non-null float64
Attr31 43278 non-null float64
Attr32 43037 non-null float64
Attr33 43271 non-null float64
Attr34 43311 non-null float64
Attr35 43397 non-null float64
Attr36 43397 non-null float64
Attr37 24421 non-null float64
Attr38 43397 non-null float64
Attr39 43278 non-null float64
Attr40 43271 non-null float64
Attr41 42651 non-null float64
Attr42 43278 non-null float64
Attr43 43278 non-null float64
Attr44 43278 non-null float64
Attr45 41258 non-null float64
Attr46 43270 non-null float64
Attr47 43108 non-null float64
Attr48 43396 non-null float64
Attr49 43278 non-null float64
Attr50 43311 non-null float64
Attr51 43397 non-null float64
Attr52 43104 non-null float64
Attr53 42593 non-null float64
Attr54 42593 non-null float64
Attr55 43404 non-null float64
Attr56 43278 non-null float64
Attr57 43398 non-null float64
Attr58 43321 non-null float64
Attr59 43398 non-null float64
Attr60 41253 non-null float64
Attr61 43303 non-null float64
Attr62 43278 non-null float64
Attr63 43271 non-null float64
Attr64 42593 non-null float64
dtypes: float64(64)
memory usage: 21.2 MB
None
  | Attr1 | Attr2 | Attr3 | Attr4 | Attr5 | Attr6 | Attr7 | Attr8 | Attr9 | Attr10 | ... | Attr55 | Attr56 | Attr57 | Attr58 | Attr59 | Attr60 | Attr61 | Attr62 | Attr63 | Attr64 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
28436 | 0.027486 | 0.28592 | -0.070966 | 0.64665 | -68.132 | 0.048296 | 0.033511 | 2.22740 | 1.06920 | 0.63686 | ... | -3472.5 | 0.064732 | 0.043158 | 0.93527 | 0.1336 | 15.0500 | 14.057 | 90.126 | 4.0499 | 0.93478 |
31905 | 0.148640 | 0.18727 | 0.594540 | 10.38300 | 353.680 | 0.000000 | 0.183680 | 4.33980 | 0.74969 | 0.81273 | ... | 3424.4 | 0.232490 | 0.182890 | 0.75900 | 0.0000 | 19.1100 | 34.970 | 30.850 | 11.8320 | 2.19150 |
11506 | 0.001574 | 0.50511 | 0.084972 | 1.17170 | -32.080 | 0.000000 | 0.002360 | 0.97975 | 2.94260 | 0.49489 | ... | 108.0 | -0.004278 | 0.003180 | 0.99921 | 0.0000 | 8.6374 | 12.808 | 61.386 | 5.9459 | 7.00370 |
3 rows × 64 columns
go(X, y)
ada gbt gnb nbayes nbayes+gboost 10 nbayes+gboost 20
accuracy
0 0.954386 0.970052 0.078787 0.727055 0.906008 0.932547
1 0.957704 0.971250 0.065334 0.730280 0.911076 0.943421
2 0.956782 0.971618 0.067822 0.731755 0.902046 0.946461
--------------------------------------------------------------------------------
ada gbt gnb nbayes nbayes+gboost 10 nbayes+gboost 20
roc-auc
0 0.879490 0.912178 0.497303 0.733217 0.760798 0.794332
1 0.890159 0.921132 0.503383 0.776884 0.786405 0.830294
2 0.888399 0.925583 0.495057 0.758146 0.777545 0.813152
--------------------------------------------------------------------------------
time
ada 00:00:12.529434
gbt 00:00:15.847193
gnb 00:00:00.074148
nbayes 00:00:52.060308
nbayes+gboost 10 00:04:27.274454
nbayes+gboost 20 00:08:38.005423
Evaluation:
- Once again, `GaussianNB` performed the worst of the bunch. Not only that, but this time it provided no predictive power, with 50% ROC AUC.
- Classical boosting algorithms did well, with `AdaBoost` around 89% and `GradientBoostingClassifier` around 92% ROC AUC.
- Vanilla NB had around 75% ROC AUC.
- Boosted NB with 10 iterations got to 78% ROC AUC. Not a groundbreaking improvement, but an improvement nonetheless.
- 20 iterations got us to 81% ROC AUC. This suggests a consistent increase in performance.
In this example, we have another case of boosting iterations improving the performance. As with classical boosting algorithms, the number of iterations is a hyperparameter one needs to tune for a given problem. It does not seem that we have hit a performance ceiling here. However, tuning those parameters is not the goal of this post, so we’ll skip that discussion.
2.3 Credit Card Defaults Data
Continuing with the finance theme, the third dataset is the credit card default dataset, covering consumers in Taiwan. Let’s dive in.
# https://archive.ics.uci.edu/ml/datasets/default+of+credit+card+clients
credit = pd.read_excel('/home/taylanbil/Downloads/default of credit card clients.xls', skiprows=1)
y = credit.pop('default payment next month')
X = credit
print(y.value_counts())
print('-'*80)
print(X.info())
X.sample(3)
0 23364
1 6636
Name: default payment next month, dtype: int64
--------------------------------------------------------------------------------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30000 entries, 0 to 29999
Data columns (total 24 columns):
ID 30000 non-null int64
LIMIT_BAL 30000 non-null int64
SEX 30000 non-null int64
EDUCATION 30000 non-null int64
MARRIAGE 30000 non-null int64
AGE 30000 non-null int64
PAY_0 30000 non-null int64
PAY_2 30000 non-null int64
PAY_3 30000 non-null int64
PAY_4 30000 non-null int64
PAY_5 30000 non-null int64
PAY_6 30000 non-null int64
BILL_AMT1 30000 non-null int64
BILL_AMT2 30000 non-null int64
BILL_AMT3 30000 non-null int64
BILL_AMT4 30000 non-null int64
BILL_AMT5 30000 non-null int64
BILL_AMT6 30000 non-null int64
PAY_AMT1 30000 non-null int64
PAY_AMT2 30000 non-null int64
PAY_AMT3 30000 non-null int64
PAY_AMT4 30000 non-null int64
PAY_AMT5 30000 non-null int64
PAY_AMT6 30000 non-null int64
dtypes: int64(24)
memory usage: 5.5 MB
None
  | ID | LIMIT_BAL | SEX | EDUCATION | MARRIAGE | AGE | PAY_0 | PAY_2 | PAY_3 | PAY_4 | ... | BILL_AMT3 | BILL_AMT4 | BILL_AMT5 | BILL_AMT6 | PAY_AMT1 | PAY_AMT2 | PAY_AMT3 | PAY_AMT4 | PAY_AMT5 | PAY_AMT6 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
22959 | 22960 | 210000 | 2 | 1 | 2 | 30 | 0 | 0 | 0 | 0 | ... | 89173 | 91489 | 93463 | 95021 | 4600 | 3789 | 3811 | 3465 | 3186 | 4389 |
24953 | 24954 | 60000 | 1 | 2 | 2 | 30 | 0 | 0 | 0 | 0 | ... | 6733 | 7662 | 8529 | 9884 | 1500 | 1300 | 1200 | 1000 | 1500 | 800 |
10352 | 10353 | 360000 | 1 | 3 | 1 | 58 | -1 | -1 | -1 | -1 | ... | 1090 | 780 | 390 | 388 | 554 | 1096 | 780 | 0 | 388 | 887 |
3 rows × 24 columns
go(X, y)
ada gbt gnb nbayes nbayes+gboost 10 nbayes+gboost 20
accuracy
0 0.813600 0.815467 0.410800 0.757333 0.788133 0.791867
1 0.818933 0.823200 0.365067 0.758667 0.790133 0.795333
2 0.814667 0.821867 0.371867 0.760933 0.793467 0.799333
--------------------------------------------------------------------------------
ada gbt gnb nbayes nbayes+gboost 10 nbayes+gboost 20
roc-auc
0 0.762167 0.770514 0.658582 0.483075 0.745817 0.739989
1 0.779408 0.783143 0.677352 0.479125 0.762096 0.754629
2 0.779661 0.787814 0.659249 0.483573 0.758837 0.750321
--------------------------------------------------------------------------------
time
ada 00:00:01.701464
gbt 00:00:02.663667
gnb 00:00:00.023219
nbayes 00:00:13.379899
nbayes+gboost 10 00:02:21.470685
nbayes+gboost 20 00:04:33.042267
Evaluation:
- Similar to the above, `AdaBoost` and `GBT` are the best ones.
- `GaussianNB` did OK.
- Vanilla NB had atrocious performance: 48% ROC AUC.
- Boosted NB with 10 iterations had 76% ROC AUC. Pretty good performance.
- 20 iterations had slightly worse performance: 75% ROC AUC.
This case illustrates the importance of boosting very well. With 1 iteration, NB has absolutely no predictive value. However, once we start boosting, we get to a pretty good performance, competing with the classical boosting algorithms! That is a striking difference. 20 iterations being worse than 10 suggests that the optimal value is less than 20 (it could be less than 10 too; we did not check that).
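For what it’s worth, a cheap way to check would be to fit once with a generous number of iterations and score every boosting stage on a held-out set, using the `staged` option of `decision_function_df`. The sketch below assumes the `make_boost_pipe` helper defined earlier; it is illustrative and was not run for this post.

```python
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)
pipe = make_boost_pipe(20)
pipe.fit(X_train, y_train)
# transform the held-out set, then get the per-stage posteriors
X_test_t = pipe.named_steps['preprocessor'].transform(X_test)
staged = pipe.named_steps['nbb'].decision_function_df(X_test_t, staged=True)
stage_aucs = staged.apply(lambda col: roc_auc_score(y_test, col))
print(stage_aucs)  # pick the stage where the held-out AUC peaks
```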
All in all, we saw the improvement offered by boosting the basic NB algorithm. Although it did not beat the boosted DTs in these examples, there may be cases where it does. There seems to be value in adding this algorithm to one’s ML toolkit.
Now let’s get down to the nitty-gritty and see how this all works.
3. Code
The code can be found here. I will paste bits and pieces from that file and try to explain how it comes together.
3.1. Review of vanilla NB
First of all, the boosted NB builds on the classes introduced in the previous post. For convenience, here’s the code from that discussion.
As a quick reminder, the code contains three major parts:
- `NaiveBayesClassifier`, which implements the NB algorithm.
- `NaiveBayesPreprocessor`, which fits and/or transforms a dataset so that the output is suitable for applying the `NaiveBayesClassifier`.
- `Discretizer`, which takes a continuous `pandas` Series and bins it. It also remembers the bins for future use (typically at scoring time). This is employed by the `NaiveBayesPreprocessor`. A tiny usage sketch is given right after the code below.
from collections import defaultdict
from operator import itemgetter
from bisect import bisect_right
import numpy as np
import pandas as pd
class Discretizer(object):
def __init__(self, upperlim=20, bottomlim=0, mapping=False):
self.mapping = mapping
self.set_lims(upperlim, bottomlim)
@property
def cutoffs(self):
return [i[0] for i in self.mapping]
def set_lims(self, upperlim, bottomlim):
if not self.mapping:
self.bottomlim = bottomlim
self.upperlim = upperlim
else:
vals = sorted(set(map(itemgetter(1), self.mapping)))
self.bottomlim = vals[0]
self.upperlim = vals[-1]
assert self.bottomlim < self.upperlim
def fit(self, continuous_series, subsample=None):
self.mapping = []
continuous_series = pd.Series(continuous_series).reset_index(drop=True).dropna()
if subsample is not None:
n = int(len(continuous_series)*subsample) if subsample < 1 else subsample
continuous_series = np.random.choice(continuous_series, n, replace=False)
ranked = pd.Series(continuous_series).rank(pct=1, method='average')
ranked *= self.upperlim - self.bottomlim
ranked += self.bottomlim
ranked = ranked.map(round)
nvals = sorted(np.unique(ranked)) # sorted in case numpy changes
for nval in nvals:
cond = ranked == nval
self.mapping.append((continuous_series[cond].min(), int(nval)))
def transform_single(self, val):
if not self.mapping:
raise NotImplementedError('Haven\'t been fitted yet')
elif pd.isnull(val):
return None
i = bisect_right(self.cutoffs, val) - 1
if i == -1:
return 0
return self.mapping[i][1]
def transform(self, vals):
if isinstance(vals, float):
return self.transform_single(vals)
elif vals is None:
return None
return pd.Series(vals).map(self.transform_single)
def fit_transform(self, vals):
self.fit(vals)
return self.transform(vals)
class NaiveBayesPreprocessor(object):
"""
Don't pass in Nans. fill with keyword.
"""
OTHER = '____OTHER____'
FILLNA = '____NA____'
def __init__(self, alpha=1.0, min_freq=0.01, bins=20):
self.alpha = alpha # Laplace smoothing
self.min_freq = min_freq # drop values occurring less frequently than this
self.bins = bins # number of bins for continuous fields
def learn_continuous_transf(self, series):
D = Discretizer(upperlim=self.bins)
D.fit(series)
return D
def learn_discrete_transf(self, series):
vcs = series.value_counts(dropna=False, normalize=True)
vcs = vcs[vcs >= self.min_freq]
keep = set(vcs.index)
transf = lambda r: r if r in keep else self.OTHER
return transf
def learn_transf(self, series):
if series.dtype == np.float64:
return self.learn_continuous_transf(series)
else:
return self.learn_discrete_transf(series)
def fit(self, X_orig, y=None):
"""
Expects pandas series and pandas DataFrame
"""
# get dtypes
self.dtypes = defaultdict(set)
for fld, dtype in X_orig.dtypes.iteritems():
self.dtypes[dtype].add(fld)
X = X_orig
# X = X_orig.fillna(self.FILLNA)
# get transfs
self.transformations = {
fld: self.learn_transf(series)
for fld, series in X.iteritems()}
def transform(self, X_orig, y=None):
"""
Expects pandas series and pandas DataFrame
"""
X = X_orig.copy()
# X = X_orig.fillna(self.FILLNA)
for fld, func in self.transformations.items():
if isinstance(func, Discretizer):
X[fld] = func.transform(X[fld])
else:
X[fld] = X[fld].map(func)
return X
def fit_transform(self, X, y=None):
self.fit(X)
return self.transform(X)
class NaiveBayesClassifier(object):
def __init__(self, alpha=1.0, class_priors=None, **kwargs):
self.alpha = alpha
self.class_priors = class_priors
def get_class_log_priors(self, y):
self.classes_ = y.unique()
if self.class_priors is None:
self.class_priors = y.value_counts(normalize=1)
elif isinstance(self.class_priors, str) and self.class_priors == 'equal':
raise NotImplementedError
self.class_log_priors = self.class_priors.map(np.log)
def get_log_likelihoods(self, fld):
table = self.groups[fld].value_counts(dropna=False).unstack(fill_value=0)
table += self.alpha
sums = table.sum(axis=1)
likelihoods = table.apply(lambda r: r/sums, axis=0)
log_likelihoods = likelihoods.applymap(np.log)
return {k if pd.notnull(k) else None: v for k, v in log_likelihoods.items()}
def fit(self, X, y):
y = pd.Series(y)
self.get_class_log_priors(y)
self.groups = X.groupby(y)
self.log_likelihoods = {
fld: self.get_log_likelihoods(fld)
for fld, series in X.iteritems()
}
def get_approx_log_posterior(self, series, class_):
log_posterior = self.class_log_priors[class_] # prior
for fld, val in series.iteritems():
# there are cases where the `val` is not seen before
# as in having a `nan` in the scoring dataset,
# but no `nan`s in the training set
# in those cases, we want to not add anything to the log_posterior
# This is to handle the Nones and np.nans etc.
val = val if pd.notnull(val) else None
if val not in self.log_likelihoods[fld]:
continue
log_posterior += self.log_likelihoods[fld][val][class_]
return log_posterior
def decision_function_series(self, series):
approx_log_posteriors = [
self.get_approx_log_posterior(series, class_)
for class_ in self.classes_]
return pd.Series(approx_log_posteriors, index=self.classes_)
def decision_function_df(self, df):
return df.apply(self.decision_function_series, axis=1)
def decision_function(self, X):
"""
returns the log posteriors
"""
if isinstance(X, pd.DataFrame):
return self.decision_function_df(X)
elif isinstance(X, pd.Series):
return self.decision_function_series(X)
elif isinstance(X, dict):
return self.decision_function_series(pd.Series(X))
def predict_proba(self, X, normalize=True):
"""
returns the (normalized) posterior probability
normalization is just division by the evidence. doesn't change the argmax.
"""
log_post = self.decision_function(X)
if isinstance(log_post, pd.Series):
post = log_post.map(np.exp)
elif isinstance(log_post, pd.DataFrame):
post = log_post.applymap(np.exp)
else:
raise NotImplementedError('type of X is "{}"'.format(type(X)))
if normalize:
if isinstance(post, pd.Series):
post /= post.sum()
elif isinstance(post, pd.DataFrame):
post = post.div(post.sum(axis=1), axis=0)
return post
def predict(self, X):
probas = self.decision_function(X)
if isinstance(probas, pd.Series):
return probas.idxmax()
return probas.idxmax(axis=1)
def score(self, X, y):
preds = self.predict(X)
return np.mean(np.array(y) == preds.values)
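Before moving on, here is a tiny usage sketch of the `Discretizer` on a made-up series, just to make the rank-based binning concrete (the numbers are arbitrary):

```python
d = Discretizer(upperlim=5)
d.fit(pd.Series([0.1, 0.4, 0.5, 2.3, 9.9, 3.2]))   # learn rank-based bins and their cutoffs
d.transform(pd.Series([0.3, 5.0]))                 # map unseen values into the learned bins
```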
3.2. Quick Review of boosting
The idea of boosting consists of the following major parts:
- A weak (or “base”) learner which fits to the dataset.
- A loss function which evaluates the predictions and the truth values.
- A new weak learner that is trained on the new “truth values”, which are adjusted so that the past mistakes are weighed more heavily. The idea is to try to correct the mistakes iteratively.
- A method (usually a line search) that defines how to blend the freshly trained weak learner into the mix of the past weak learners.
To implement this with the NB classifier as the weak learner, we first need a new version of our NB classifier that can be trained with arbitrary values as the truth set. But the NB algorithm is naturally a classifier. The classical boosting algorithms such as `AdaBoost` and `GBT`s use "regression trees", which are decision trees that output continuous values, as their weak learners. Modifying NB to do that seems a bit awkward. Therefore, I chose not to touch the natural "classifier" aspect of NB. Instead, I coded a version of the NB classifier with sample weights. That is, the samples that the past weak learners got wrong are weighted more heavily in the next iteration.
Our loss function will be the same as the `BinomialDeviance` loss function that `GBT`s use in `sklearn`.
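In other words, for a true label $y \in \{0, 1\}$ and a predicted positive-class probability $p$, the per-sample deviance is

$$L(y, p) = -2\,\big(y \log p + (1 - y)\log(1 - p)\big),$$

which, writing $f = \log\frac{p}{1-p}$ for the log-odds, equals $-2\,\big(yf - \log(1 + e^{f})\big)$. That log-odds form is the one that appears in the `loss` method below.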
3.3. Code for the Boosted NB
First, let’s introduce the weak learner, i.e. the `WeightedNaiveBayesClassifier`. This inherits from the `NaiveBayesClassifier` and modifies the key methods to accommodate sample weights. The code differs only in the way it computes and stores the log likelihoods and the class priors; the scoring/predicting methods are unchanged, since they only read from the stored log likelihoods and class priors.
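Concretely, with sample weights $w_i$, the weighted, Laplace-smoothed likelihood that `get_log_likelihood` estimates for a feature value $v$ and class $c$ is

$$\hat{P}(x = v \mid y = c) = \frac{\sum_{i:\, y_i = c,\; x_i = v} w_i + \alpha}{\sum_{i:\, y_i = c} w_i + |V|\,\alpha},$$

where $|V|$ is the number of distinct values observed for that feature and $\alpha$ is the smoothing parameter. With all weights equal to 1, this reduces to the usual smoothed frequency estimate.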
class WeightedNaiveBayesClassifier(NaiveBayesClassifier):
def get_class_log_priors(self, y, weights):
self.classes_ = y.unique()
self.class_priors = {
class_: ((y == class_)*weights).mean()
for class_ in self.classes_}
self.class_log_priors = {
class_: np.log(prior)
for class_, prior in self.class_priors.items()}
def get_log_likelihood(self, y, class_, series, val, vals, weights):
# indices of `series` and `y` and `weights` must be the same
# XXX: what to do with alpha? should we smooth it out somehow?
cond = y == class_
num = ((series.loc[cond] == val) * weights.loc[cond]).sum() + self.alpha
denom = (weights.loc[cond]).sum() + len(vals) * self.alpha
return np.log(num/denom)
def get_log_likelihoods(self, series, y, weights):
vals = series.unique()
y_ = y.reset_index(drop=True, inplace=False)
series_ = series.reset_index(drop=True, inplace=False)
weights_ = weights.reset_index(drop=True, inplace=False)
log_likelihoods = {
val: {class_: self.get_log_likelihood(y_, class_, series_, val, vals, weights_)
for class_ in self.classes_}
for val in vals}
return log_likelihoods
def fit(self, X, y, weights=None):
if weights is None:
return super(WeightedNaiveBayesClassifier, self).fit(X, y)
weights *= len(weights) / weights.sum()
y = pd.Series(y)
self.get_class_log_priors(y, weights)
self.log_likelihoods = {
fld: self.get_log_likelihoods(series, y, weights)
for fld, series in X.iteritems()
}
Now we’re ready for the final piece, the `NaiveBayesBoostingClassifier`. Here’s the code.
class NaiveBayesBoostingClassifier(object):
"""
For now, this is Binary classification only.
`y` needs to consist of 0 or 1
"""
TRUTHFLD = '__'
def __init__(self, alpha=1.0, n_iter=5, learning_rate=0.1):
self.n_iter = n_iter
self.alpha = alpha
self.learning_rate = learning_rate
def loss(self, y, pred, vectorize=False):
"""
This uses the `BinomialDeviance` loss function.
That means, this implementation of NaiveBayesBoostingClassifier
is for binary classification only. Change this function
to implement multiclass classification
Compute the deviance (= 2 * negative log-likelihood).
note that `pred` here is the actual predicted posterior proba.
`pred` in sklearn is log odds
"""
logodds = np.log(pred/(1-pred))
# logaddexp(0, v) == log(1.0 + exp(v))
ans = -2.0 * ((y * logodds) - np.logaddexp(0.0, logodds))
return ans if vectorize else np.mean(ans)
def adjust_weights(self, X, y, weights):
if weights is None:
return pd.Series([1]*len(y), index=y.index)
# return -- this errors out bc defers to non weighted nb above
pred = self.predicted_posteriors
return self.loss(y, pred, vectorize=True)
def line_search_helper(self, new_preds, step):
if self.predicted_posteriors is None:
return new_preds
ans = self.predicted_posteriors * (1-step)
ans += step * new_preds
return ans
def line_search(self, new_preds, y):
# TODO: This can be done using scipy.optimize.line_search
# but for now, we'll just try 10 values and pick the best one
if not self.line_search_results:
self.line_search_results = [1]
return 1
steps_to_try = -1 * np.arange(
-self.learning_rate, 0, self.learning_rate/10)
step = min(
steps_to_try,
key=lambda s: self.loss(
y, self.line_search_helper(new_preds, s))
)
self.line_search_results.append(step)
return step
def _predict_proba_1(self, est, X):
# XXX: another place where we assume binary clf
# TODO: need a robust way to get 1
return est.predict_proba(X)[1]
def fit(self, X, y, weights=None):
self.stages = []
self.predicted_posteriors = None
self.line_search_results = []
weights = None
for i in range(self.n_iter):
weights = self.adjust_weights(X, y, weights)
nbc = WeightedNaiveBayesClassifier(alpha=self.alpha)
nbc.fit(X, y, weights)
new_preds = self._predict_proba_1(nbc, X)
self.stages.append(nbc)
self.line_search(new_preds, y)
self.predicted_posteriors = self.line_search_helper(
new_preds, self.line_search_results[-1])
def decision_function_df(self, X, staged=False):
stage_posteriors = [
self._predict_proba_1(est, X) for est in self.stages]
posteriors = 0
if staged:
staged_posteriors = dict()
for i, (sp, step) in enumerate(zip(stage_posteriors, self.line_search_results)):
posteriors = (1-step)*posteriors + step*sp
if staged:
staged_posteriors[i] = posteriors
return posteriors if not staged else pd.DataFrame(staged_posteriors)
def decision_function(self, X):
if isinstance(X, pd.DataFrame):
return self.decision_function_df(X)
else:
raise NotImplementedError
def predict(self, X, thr=0.5):
probas = self.decision_function(X)
return (probas >= thr).astype(int)
# # draft for multiclass approach below
# if isinstance(probas, pd.Series):
# return np.argmax(probas)
# return probas.apply(np.argmax, axis=1)
def score(self, X, y):
preds = self.predict(X)
return np.mean(np.array(y) == preds.values)
Perhaps the most interesting part of this code is the `fit` method. Here we see the for loop which builds the NB classifiers for each boosting iteration. The `adjust_weights` method computes the new sample weights according to the loss function, which are then fed to the `WeightedNaiveBayesClassifier` for the next weak learner.
The way the loss function and the `adjust_weights` method are coded restricts them to the binary classification case. One would need to replace the binomial deviance loss with a multinomial deviance loss and modify the weight adjustment accordingly to accommodate the multiclass case.
Then comes the line search part. Although it is possible to do an actual line search here, I opted for a simple "try 10 things and pick the best one" method. Not the best approach, for sure.
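As an aside, one could replace that grid with a proper one-dimensional minimization over the step size, for instance via scipy's bounded scalar minimizer. The subclass below is a sketch of that idea (untested, and `NaiveBayesBoostingClassifier2` is just a hypothetical name); it keeps the same blending logic and the same (0, learning_rate] range as the original.

```python
from scipy.optimize import minimize_scalar

class NaiveBayesBoostingClassifier2(NaiveBayesBoostingClassifier):
    def line_search(self, new_preds, y):
        # first stage: take the new predictions as-is, like the original
        if not self.line_search_results:
            self.line_search_results = [1]
            return 1
        # minimize the deviance over the step size instead of trying 10 values
        res = minimize_scalar(
            lambda s: self.loss(y, self.line_search_helper(new_preds, s)),
            bounds=(0.0, self.learning_rate), method='bounded')
        self.line_search_results.append(res.x)
        return res.x
```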
Feel free to contact me (@taylanbil on Twitter, LinkedIn, Facebook, etc.) for any questions or comments.