
Gradient Boosting the Naive Bayes algorithm

0. Intro

In the last post, I included a general implementation of the Naive Bayes algorithm. That implementation differs from sklearn’s Naive Bayes algorithms, as discussed here.

It is well known that Naive Bayes does well with text-related classification tasks. However, it is not routinely successful in other areas. In this post, I will try to boost the Naive Bayes algorithm in order to end up with a stronger algorithm that does well on a broader range of problems.

I organized this post as follows: first, a sanity check of the boosted classifier on a toy dataset; then, a comparison against other classifiers on several publicly available datasets; and finally, a walkthrough of the code.

Before we get started, let’s have a quick refresher on the idea behind boosting. Very roughly, it is the idea of stacking so-called weak classifiers on top of each other, in such a way that each new one learns from the mistakes of the previous ones combined. In this notebook, our weak learners are Naive Bayes classifiers. Traditionally though, weak learners are decision stumps, in other words very shallow decision trees.
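
To make this concrete, here is a minimal, self-contained sketch of the generic gradient boosting loop, using decision stumps and squared loss on a toy regression problem. This is not the NB-based setup of this post, and the names X_toy and y_toy are mine; it only shows the bare mechanics of fitting each new weak learner to the errors of the ensemble built so far.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X_toy = rng.uniform(-3, 3, size=(200, 1))
y_toy = np.sin(X_toy).ravel() + rng.normal(scale=0.1, size=200)

learning_rate = 0.1
pred = np.zeros_like(y_toy)
stumps = []
for _ in range(50):
    residuals = y_toy - pred                       # negative gradient of the squared loss
    stump = DecisionTreeRegressor(max_depth=1).fit(X_toy, residuals)
    stumps.append(stump)
    pred += learning_rate * stump.predict(X_toy)   # shrink the correction and add it to the ensemble

print(np.mean((y_toy - pred) ** 2))                # training MSE drops as stages accumulate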

Disclaimer: Just as in the previous posts, the goal is to get to a working implementation, so the code is sub-optimal.
Let’s dive right into it…


1. Intro: Baby steps

Let’s try the boosted NB on the simple, toy “play tennis” dataset. This is the same dataset as in the previous NB post.

import pandas as pd

# # Data is from below. Hardcoding it in order to remove dependency
# data = pd.read_csv(
#     'https://raw.githubusercontent.com/petehunt/c4.5-compiler/master/example/tennis.csv',
#     usecols=['outlook', 'temp', 'humidity', 'wind', 'play']
# )

data = [
    ['Sunny', 'Hot', 'High', 'Weak', 'No'],
    ['Sunny', 'Hot', 'High', 'Strong', 'No'],
    ['Overcast', 'Hot', 'High', 'Weak', 'Yes'],
    ['Rain', 'Mild', 'High', 'Weak', 'Yes'],
    ['Rain', 'Cool', 'Normal', 'Weak', 'Yes'],
    ['Rain', 'Cool', 'Normal', 'Strong', 'No'],
    ['Overcast', 'Cool', 'Normal', 'Strong', 'Yes'],
    ['Sunny', 'Mild', 'High', 'Weak', 'No'],
    ['Sunny', 'Cool', 'Normal', 'Weak', 'Yes'],
    ['Rain', 'Mild', 'Normal', 'Weak', 'Yes'],
    ['Sunny', 'Mild', 'Normal', 'Strong', 'Yes'],
    ['Overcast', 'Mild', 'High', 'Strong', 'Yes'],
    ['Overcast', 'Hot', 'Normal', 'Weak', 'Yes'],
    ['Rain', 'Mild', 'High', 'Strong', 'No'],
]
data = pd.DataFrame(data, columns='Outlook,Temperature,Humidity,Wind,PlayTennis'.split(','))
X = data[data.columns[:-1]]
y = data.PlayTennis == 'Yes'
data
Outlook Temperature Humidity Wind PlayTennis
0 Sunny Hot High Weak No
1 Sunny Hot High Strong No
2 Overcast Hot High Weak Yes
3 Rain Mild High Weak Yes
4 Rain Cool Normal Weak Yes
5 Rain Cool Normal Strong No
6 Overcast Cool Normal Strong Yes
7 Sunny Mild High Weak No
8 Sunny Cool Normal Weak Yes
9 Rain Mild Normal Weak Yes
10 Sunny Mild Normal Strong Yes
11 Overcast Mild High Strong Yes
12 Overcast Hot Normal Weak Yes
13 Rain Mild High Strong No

So we have 4 predictor fields, 14 samples and a binary classification problem. Let’s get the boosted NB classifier from nb.py and try it out. The class is called NaiveBayesBoostingClassifier. The API is again very sklearn-like, with methods such as fit, predict, predict_proba, decision_function, etc. Let’s also compare the boosted NB with the vanilla NB, NaiveBayesClassifier, which is the same as before.

We will have 2 versions of the boosted NB.

  1. First one with only 1 iteration, that is, 0 boosting iterations. This case is therefore equivalent to vanilla NB.
  2. Second one with 2 iterations. The first vanilla NB iteration, and 1 boosting iteration on top of that.

The goal is to observe the difference in the calculated posterior probabilities. We will also include the NaiveBayesClassifier, and observe that it produces the same posteriors as the trivial boosting case.

%run nb.py

nb = NaiveBayesClassifier()
# n_iter=1 means a single NB fit and no boosting; this should spit out exactly the same probas as the vanilla NB.
fake_boosting = NaiveBayesBoostingClassifier(n_iter=1)
# n_iter=2 adds one real boosting iteration on top, so the posteriors should differ a bit.
real_boosting = NaiveBayesBoostingClassifier(n_iter=2)
fake_boosting.fit(X, y)
real_boosting.fit(X, y)
nb.fit(X, y)

out = pd.Series(nb.predict_proba(X)[1], name='vanilla')
out = pd.DataFrame(out)
cols = ['vanilla', 'boost_fake', 'boost_once', 'truth']
out['boost_once'] = real_boosting.decision_function(X)
out['boost_fake'] = fake_boosting.decision_function(X)
out['truth'] = y
out = out[cols]
out
vanilla boost_fake boost_once truth
0 0.312031 0.312031 0.313928 False
1 0.162746 0.162746 0.164083 False
2 0.751472 0.751472 0.752014 True
3 0.573354 0.573354 0.573302 True
4 0.875858 0.875858 0.870270 True
5 0.751472 0.751472 0.745584 False
6 0.918955 0.918955 0.915018 True
7 0.430499 0.430499 0.432647 False
8 0.798736 0.798736 0.794659 True
9 0.854638 0.854638 0.851980 True
10 0.586325 0.586325 0.584993 True
11 0.683522 0.683522 0.684268 True
12 0.929719 0.929719 0.928607 True
13 0.365459 0.365459 0.365355 False

Observe that boosting with 1 iteration, i.e. the trivial, non-boosting case, produces the same posterior probabilities as the vanilla NB. On the other hand, the first non-trivial boosting iteration differs from them a tiny bit: its predictions have been adjusted according to the loss gradient computed from the vanilla NB’s predictions.
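
As a quick numerical check (assuming the cells above have been run), the first two columns agree while the boosted column moves slightly:

import numpy as np

# the 1-iteration "boosted" model matches the vanilla NB posteriors
assert np.allclose(out['vanilla'], out['boost_fake'])
# the 2-iteration model shifts them a little
print((out['boost_once'] - out['vanilla']).abs().max())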

Now let’s apply the boosted NB to some of the publicly available ML datasets. Note that the current version of the boosted NB supports binary classification only.

1.1 Framework for comparing the performance

I’ll start this subsection by importing the necessary tools and defining functions which make it easy to compare the performance of several ML classification algorithms, including NaiveBayesClassifier and NaiveBayesBoostingClassifier. The other ones are sklearn’s GaussianNB, AdaBoostClassifier, and GradientBoostingClassifier.

The classifiers listed above are included with their default parameters, without any tuning, as that is beyond the scope of this post. On the other hand, I included two versions of the boosted NB, one with 10 boosting iterations and one with 20. This will give us a feel for how the classifier progresses with more iterations.

import pandas as pd
from sklearn.model_selection import ShuffleSplit
from sklearn.naive_bayes import GaussianNB
from collections import defaultdict
from sklearn.pipeline import Pipeline
from datetime import datetime
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import Imputer


def make_pipe(est):
    pipe = [('imputer', Imputer()),
            ('estimator', est)]
    return Pipeline(pipe)

def make_boost_pipe(n):
    pipe_boost = [('preprocessor', NaiveBayesPreprocessor(bins=20)),
                  ('nbb', NaiveBayesBoostingClassifier(n_iter=n))]
    pipe_boost = Pipeline(pipe_boost)
    return pipe_boost

%run nb.py
pipe = [('preprocessor', NaiveBayesPreprocessor(bins=20)),
        ('nb', NaiveBayesClassifier())]
pipe = Pipeline(pipe)
algos = {'nbayes': pipe,
        'gnb': make_pipe(GaussianNB()),
        'ada': make_pipe(AdaBoostClassifier()),
        'gbt': make_pipe(GradientBoostingClassifier())}
for i in [10, 20]:
    algos['nbayes+gboost {}'.format(i)] = make_boost_pipe(i)


def compare(X, y):
    rs = ShuffleSplit(test_size=0.25, n_splits=3)
    accuracies = defaultdict(list)
    rocs = defaultdict(list)
    times = {}
    for train_index, test_index in rs.split(X):
        X_train, X_test = X.loc[train_index], X.loc[test_index]
        y_train, y_test = y.loc[train_index], y.loc[test_index]
        for key, est in algos.items():
            then = datetime.now()
            est.fit(X_train, y_train)
            accuracies[key].append(est.score(X_test, y_test))
            # now the roc auc
            if not isinstance(est.steps[-1][-1], GaussianNB):
                ys = est.decision_function(X_test)
                if len(ys.shape) == 2 and ys.shape[1] == 2:
                    ys = ys[1]
            else:
                ys = est.predict_proba(X_test)[:, 1]
            rocs[key].append(roc_auc_score(y_test, ys))
            times[key] = datetime.now() - then
    accuracies = pd.DataFrame(accuracies)
    rocs = pd.DataFrame(rocs)
    times = pd.Series(times)
    accuracies.index.name = 'accuracy'
    rocs.index.name = 'roc-auc'
    times.index.name = 'time'
    return accuracies, rocs, times


def go(X, y):
    accuracies, rocs, times = compare(X, y)
    print(accuracies.to_string())
    print('-'*80)
    print(rocs.to_string())
    print('-'*80)
    print(times.to_string())

A quick look at the last line of the compare function tells us that it returns the accuracies, ROC-AUCs and the times it took for all the algorithms it tried. The variable algos contains the algorithms we are comparing. Now we are in a good position to apply these algorithms to various binary classification problems and see what happens.


2. Trying it out

In this section, the code cells contain URLs which refer to the datasets being used. For more information about them, you can follow those links.

2.1 Spamdata

The first choice is the spam/non-spam binary classification problem for emails. This problem is well known to be a good use case for the classical Naive Bayes approach, although that sort of implies the dataset has bag-of-words type features and that the NB algorithm in question is either MultinomialNB or BernoulliNB.
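
For context (this is not part of the comparison below, and the spambase features are already extracted statistics rather than raw text), the textbook bag-of-words setup would look something like the sketch below; raw_emails and labels are hypothetical placeholders.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# hypothetical: raw_emails is a list of strings, labels marks spam vs. non-spam
text_clf = make_pipeline(CountVectorizer(), MultinomialNB())
# text_clf.fit(raw_emails, labels)
# text_clf.predict(['free money, click now'])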

Let’s load the dataset, create our X and y, and take a quick glance at them.

spamdata = pd.read_csv(
    'https://archive.ics.uci.edu/ml/machine-learning-databases/spambase/spambase.data',
    header=None)
X = spamdata[spamdata.columns[:-1]].rename(columns={i: 'col{}'.format(i) for i in spamdata})
y = spamdata[spamdata.columns[-1]]
print(y.value_counts())
print('-'*80)
print(X.info())
X.sample(3)
0    2788
1    1813
Name: 57, dtype: int64
--------------------------------------------------------------------------------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4601 entries, 0 to 4600
Data columns (total 57 columns):
col0     4601 non-null float64
col1     4601 non-null float64
col2     4601 non-null float64
col3     4601 non-null float64
col4     4601 non-null float64
col5     4601 non-null float64
col6     4601 non-null float64
col7     4601 non-null float64
col8     4601 non-null float64
col9     4601 non-null float64
col10    4601 non-null float64
col11    4601 non-null float64
col12    4601 non-null float64
col13    4601 non-null float64
col14    4601 non-null float64
col15    4601 non-null float64
col16    4601 non-null float64
col17    4601 non-null float64
col18    4601 non-null float64
col19    4601 non-null float64
col20    4601 non-null float64
col21    4601 non-null float64
col22    4601 non-null float64
col23    4601 non-null float64
col24    4601 non-null float64
col25    4601 non-null float64
col26    4601 non-null float64
col27    4601 non-null float64
col28    4601 non-null float64
col29    4601 non-null float64
col30    4601 non-null float64
col31    4601 non-null float64
col32    4601 non-null float64
col33    4601 non-null float64
col34    4601 non-null float64
col35    4601 non-null float64
col36    4601 non-null float64
col37    4601 non-null float64
col38    4601 non-null float64
col39    4601 non-null float64
col40    4601 non-null float64
col41    4601 non-null float64
col42    4601 non-null float64
col43    4601 non-null float64
col44    4601 non-null float64
col45    4601 non-null float64
col46    4601 non-null float64
col47    4601 non-null float64
col48    4601 non-null float64
col49    4601 non-null float64
col50    4601 non-null float64
col51    4601 non-null float64
col52    4601 non-null float64
col53    4601 non-null float64
col54    4601 non-null float64
col55    4601 non-null int64
col56    4601 non-null int64
dtypes: float64(55), int64(2)
memory usage: 2.0 MB
None
col0 col1 col2 col3 col4 col5 col6 col7 col8 col9 ... col47 col48 col49 col50 col51 col52 col53 col54 col55 col56
1783 0.33 0.84 0.67 0.0 0.67 0.33 0.67 0.0 0.33 0.0 ... 0.0 0.0 0.183 0.000 0.156 0.104 0.026 6.500 525 858
3351 0.00 0.00 0.00 0.0 0.00 0.00 0.00 0.0 0.00 0.0 ... 0.0 0.0 0.000 0.751 0.000 0.000 0.000 1.428 4 10
2601 0.00 0.00 0.00 0.0 0.00 0.00 0.00 0.0 0.00 0.0 ... 0.0 0.0 0.000 0.000 0.000 0.000 0.000 1.769 8 23

3 rows × 57 columns

go(X, y)
               ada       gbt       gnb    nbayes  nbayes+gboost 10  nbayes+gboost 20
accuracy                                                                            
0         0.936577  0.946134  0.810599  0.893136          0.919201          0.932233
1         0.934839  0.947003  0.820156  0.894005          0.933970          0.943527
2         0.942659  0.947871  0.817550  0.905300          0.931364          0.942659
--------------------------------------------------------------------------------
              ada       gbt       gnb    nbayes  nbayes+gboost 10  nbayes+gboost 20
roc-auc                                                                            
0        0.975604  0.983871  0.939048  0.565013          0.975606          0.978455
1        0.977105  0.987682  0.942894  0.545387          0.980848          0.984901
2        0.979726  0.988510  0.956382  0.527922          0.981500          0.984293
--------------------------------------------------------------------------------
time
ada                00:00:00.250069
gbt                00:00:00.527525
gnb                00:00:00.006865
nbayes             00:00:04.577352
nbayes+gboost 10   00:00:28.502304
nbayes+gboost 20   00:00:59.384143

Evaluation:

So in this case, the boosted NB performed comparably to AdaBoost and Gradient Boosted DTs. Moreover, we observed a good improvement from the boosting iterations over the vanilla NB algorithm. We hope this will be a recurring theme.


2.2 Bankruptcy Data

Next up, I would like to switch to a more business-oriented problem. Here we have data on Polish companies, labeled by which ones went bankrupt. Let’s again load the data and take a quick look. Afterwards, let’s apply the algorithms and compare.

# Polish bankruptcy data from
# http://archive.ics.uci.edu/ml/datasets/Polish+companies+bankruptcy+data

from scipy.io import arff


def load_arff(fn):
    X = arff.loadarff(fn)[0]
    X = pd.DataFrame(X).applymap(lambda r: r if r != '?' else None)
    # X.dropna(how='any', inplace=True)
    y = X.pop('class')
    y = (y == b'1').astype(int)
    return X, y


fns = !ls /home/taylanbil/Downloads/*year.arff
_ = [load_arff(fn) for fn in fns]
X = pd.concat([X for X, y in _]).reset_index(drop=True)
y = pd.concat([y for X, y in _]).reset_index(drop=True)
del _
print(y.value_counts())
print('-'*80)
print(X.info())
X.sample(3)
0    41314
1     2091
Name: class, dtype: int64
--------------------------------------------------------------------------------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 43405 entries, 0 to 43404
Data columns (total 64 columns):
Attr1     43397 non-null float64
Attr2     43397 non-null float64
Attr3     43397 non-null float64
Attr4     43271 non-null float64
Attr5     43316 non-null float64
Attr6     43397 non-null float64
Attr7     43397 non-null float64
Attr8     43311 non-null float64
Attr9     43396 non-null float64
Attr10    43397 non-null float64
Attr11    43361 non-null float64
Attr12    43271 non-null float64
Attr13    43278 non-null float64
Attr14    43397 non-null float64
Attr15    43369 non-null float64
Attr16    43310 non-null float64
Attr17    43311 non-null float64
Attr18    43397 non-null float64
Attr19    43277 non-null float64
Attr20    43278 non-null float64
Attr21    37551 non-null float64
Attr22    43397 non-null float64
Attr23    43278 non-null float64
Attr24    42483 non-null float64
Attr25    43397 non-null float64
Attr26    43310 non-null float64
Attr27    40641 non-null float64
Attr28    42593 non-null float64
Attr29    43397 non-null float64
Attr30    43278 non-null float64
Attr31    43278 non-null float64
Attr32    43037 non-null float64
Attr33    43271 non-null float64
Attr34    43311 non-null float64
Attr35    43397 non-null float64
Attr36    43397 non-null float64
Attr37    24421 non-null float64
Attr38    43397 non-null float64
Attr39    43278 non-null float64
Attr40    43271 non-null float64
Attr41    42651 non-null float64
Attr42    43278 non-null float64
Attr43    43278 non-null float64
Attr44    43278 non-null float64
Attr45    41258 non-null float64
Attr46    43270 non-null float64
Attr47    43108 non-null float64
Attr48    43396 non-null float64
Attr49    43278 non-null float64
Attr50    43311 non-null float64
Attr51    43397 non-null float64
Attr52    43104 non-null float64
Attr53    42593 non-null float64
Attr54    42593 non-null float64
Attr55    43404 non-null float64
Attr56    43278 non-null float64
Attr57    43398 non-null float64
Attr58    43321 non-null float64
Attr59    43398 non-null float64
Attr60    41253 non-null float64
Attr61    43303 non-null float64
Attr62    43278 non-null float64
Attr63    43271 non-null float64
Attr64    42593 non-null float64
dtypes: float64(64)
memory usage: 21.2 MB
None
Attr1 Attr2 Attr3 Attr4 Attr5 Attr6 Attr7 Attr8 Attr9 Attr10 ... Attr55 Attr56 Attr57 Attr58 Attr59 Attr60 Attr61 Attr62 Attr63 Attr64
28436 0.027486 0.28592 -0.070966 0.64665 -68.132 0.048296 0.033511 2.22740 1.06920 0.63686 ... -3472.5 0.064732 0.043158 0.93527 0.1336 15.0500 14.057 90.126 4.0499 0.93478
31905 0.148640 0.18727 0.594540 10.38300 353.680 0.000000 0.183680 4.33980 0.74969 0.81273 ... 3424.4 0.232490 0.182890 0.75900 0.0000 19.1100 34.970 30.850 11.8320 2.19150
11506 0.001574 0.50511 0.084972 1.17170 -32.080 0.000000 0.002360 0.97975 2.94260 0.49489 ... 108.0 -0.004278 0.003180 0.99921 0.0000 8.6374 12.808 61.386 5.9459 7.00370

3 rows × 64 columns

go(X, y)
               ada       gbt       gnb    nbayes  nbayes+gboost 10  nbayes+gboost 20
accuracy                                                                            
0         0.954386  0.970052  0.078787  0.727055          0.906008          0.932547
1         0.957704  0.971250  0.065334  0.730280          0.911076          0.943421
2         0.956782  0.971618  0.067822  0.731755          0.902046          0.946461
--------------------------------------------------------------------------------
              ada       gbt       gnb    nbayes  nbayes+gboost 10  nbayes+gboost 20
roc-auc                                                                            
0        0.879490  0.912178  0.497303  0.733217          0.760798          0.794332
1        0.890159  0.921132  0.503383  0.776884          0.786405          0.830294
2        0.888399  0.925583  0.495057  0.758146          0.777545          0.813152
--------------------------------------------------------------------------------
time
ada                00:00:12.529434
gbt                00:00:15.847193
gnb                00:00:00.074148
nbayes             00:00:52.060308
nbayes+gboost 10   00:04:27.274454
nbayes+gboost 20   00:08:38.005423

Evaluation:

In this example, we have another case of boosting iterations improving performance. As with classical boosting algorithms, the number of iterations is a hyperparameter one needs to tune for a given problem. It does not seem that we have hit a performance ceiling here. However, tuning those parameters is not the goal of this post, so we’ll skip that discussion.
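
If we wanted to probe this, the decision_function_df method already accepts a staged=True flag (see section 3.3), so the ROC-AUC after each boosting iteration can be inspected from a single fit. A rough sketch, assuming X_train, X_test, y_train, y_test come from a split like the one inside compare:

# fit a 20-iteration boosted NB once and score every intermediate stage
pipe_staged = make_boost_pipe(20)
pipe_staged.fit(X_train, y_train)
prep = pipe_staged.named_steps['preprocessor']
nbb = pipe_staged.named_steps['nbb']
staged = nbb.decision_function_df(prep.transform(X_test), staged=True)
stage_aucs = pd.Series({i: roc_auc_score(y_test, staged[i]) for i in staged.columns})
print(stage_aucs)   # how the test ROC-AUC evolves per boosting iteration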

2.3 Credit Card Defaults Data

Continuing with the finance theme, the third dataset is the credit card default dataset of consumers in Taiwan. Let’s dive in.

# https://archive.ics.uci.edu/ml/datasets/default+of+credit+card+clients
credit = pd.read_excel('/home/taylanbil/Downloads/default of credit card clients.xls', skiprows=1)
y = credit.pop('default payment next month')
X = credit
print(y.value_counts())
print('-'*80)
print(X.info())
X.sample(3)
0    23364
1     6636
Name: default payment next month, dtype: int64
--------------------------------------------------------------------------------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30000 entries, 0 to 29999
Data columns (total 24 columns):
ID           30000 non-null int64
LIMIT_BAL    30000 non-null int64
SEX          30000 non-null int64
EDUCATION    30000 non-null int64
MARRIAGE     30000 non-null int64
AGE          30000 non-null int64
PAY_0        30000 non-null int64
PAY_2        30000 non-null int64
PAY_3        30000 non-null int64
PAY_4        30000 non-null int64
PAY_5        30000 non-null int64
PAY_6        30000 non-null int64
BILL_AMT1    30000 non-null int64
BILL_AMT2    30000 non-null int64
BILL_AMT3    30000 non-null int64
BILL_AMT4    30000 non-null int64
BILL_AMT5    30000 non-null int64
BILL_AMT6    30000 non-null int64
PAY_AMT1     30000 non-null int64
PAY_AMT2     30000 non-null int64
PAY_AMT3     30000 non-null int64
PAY_AMT4     30000 non-null int64
PAY_AMT5     30000 non-null int64
PAY_AMT6     30000 non-null int64
dtypes: int64(24)
memory usage: 5.5 MB
None
ID LIMIT_BAL SEX EDUCATION MARRIAGE AGE PAY_0 PAY_2 PAY_3 PAY_4 ... BILL_AMT3 BILL_AMT4 BILL_AMT5 BILL_AMT6 PAY_AMT1 PAY_AMT2 PAY_AMT3 PAY_AMT4 PAY_AMT5 PAY_AMT6
22959 22960 210000 2 1 2 30 0 0 0 0 ... 89173 91489 93463 95021 4600 3789 3811 3465 3186 4389
24953 24954 60000 1 2 2 30 0 0 0 0 ... 6733 7662 8529 9884 1500 1300 1200 1000 1500 800
10352 10353 360000 1 3 1 58 -1 -1 -1 -1 ... 1090 780 390 388 554 1096 780 0 388 887

3 rows × 24 columns

go(X, y)
               ada       gbt       gnb    nbayes  nbayes+gboost 10  nbayes+gboost 20
accuracy                                                                            
0         0.813600  0.815467  0.410800  0.757333          0.788133          0.791867
1         0.818933  0.823200  0.365067  0.758667          0.790133          0.795333
2         0.814667  0.821867  0.371867  0.760933          0.793467          0.799333
--------------------------------------------------------------------------------
              ada       gbt       gnb    nbayes  nbayes+gboost 10  nbayes+gboost 20
roc-auc                                                                            
0        0.762167  0.770514  0.658582  0.483075          0.745817          0.739989
1        0.779408  0.783143  0.677352  0.479125          0.762096          0.754629
2        0.779661  0.787814  0.659249  0.483573          0.758837          0.750321
--------------------------------------------------------------------------------
time
ada                00:00:01.701464
gbt                00:00:02.663667
gnb                00:00:00.023219
nbayes             00:00:13.379899
nbayes+gboost 10   00:02:21.470685
nbayes+gboost 20   00:04:33.042267

Evaluation:

This case illustrates the importance of boosting very well. With 1 iteration, NB has essentially no predictive value. However, once we start boosting, we get to a pretty good performance, competing with the classical boosting algorithms! That is a striking difference. 20 iterations doing worse than 10 suggests that the optimal number of iterations is below 20 (it could be below 10 as well; we did not check).


All in all, we saw the improvement offered by boosting the basic NB algorithm. Although it did not beat the boosted DTs in these examples, there may be cases where it does. There seems to be value in adding this algorithm to one’s ML toolkit.

Now let’s get down to the nitty-gritty and see how this all works.

3. Code

The code can be found here. I will paste bits and pieces from that file and try to explain how it comes together.

3.1. Review of vanilla NB

First of all, the boosted NB builds on the classes introduced in the previous post. For convenience, here’s the code from that discussion.

As a quick reminder, the code contains three major parts:

  1. NaiveBayesClassifier, which implements the NB algorithm.
  2. NaiveBayesPreprocessor, which fits and/or transforms a dataset so that the output is suitable for the NaiveBayesClassifier.
  3. Discretizer, which takes a continuous pandas Series and bins it. It also remembers the bins for future use (typically at scoring time). This is employed by the NaiveBayesPreprocessor.

from collections import defaultdict
from operator import itemgetter
from bisect import bisect_right

import numpy as np
import pandas as pd


class Discretizer(object):

    def __init__(self, upperlim=20, bottomlim=0, mapping=False):
        self.mapping = mapping
        self.set_lims(upperlim, bottomlim)

    @property
    def cutoffs(self):
        return [i[0] for i in self.mapping]

    def set_lims(self, upperlim, bottomlim):
        if not self.mapping:
            self.bottomlim = bottomlim
            self.upperlim = upperlim
        else:
            vals = sorted(np.unique(list(map(itemgetter(1), self.mapping))))  # list() so np.unique works on Python 3
            self.bottomlim = vals[0]
            self.upperlim = vals[-1]
        assert self.bottomlim < self.upperlim

    def fit(self, continuous_series, subsample=None):
        self.mapping = []
        continuous_series = pd.Series(continuous_series).reset_index(drop=True).dropna()
        if subsample is not None:
            n = int(len(continuous_series)*subsample) if subsample < 1 else int(subsample)
            continuous_series = np.random.choice(continuous_series, n, replace=False)
        ranked = pd.Series(continuous_series).rank(pct=1, method='average')
        ranked *= self.upperlim - self.bottomlim
        ranked += self.bottomlim
        ranked = ranked.map(round)
        nvals = sorted(np.unique(ranked))  # sorted in case numpy changes
        for nval in nvals:
            cond = ranked == nval
            self.mapping.append((continuous_series[cond].min(), int(nval)))

    def transform_single(self, val):
        if not self.mapping:
            raise NotImplementedError('Haven\'t been fitted yet')
        elif pd.isnull(val):
            return None
        i = bisect_right(self.cutoffs, val) - 1
        if i == -1:
            return 0
        return self.mapping[i][1]

    def transform(self, vals):
        if isinstance(vals, float):
            return self.transform_single(vals)
        elif vals is None:
            return None
        return pd.Series(vals).map(self.transform_single)

    def fit_transform(self, vals):
        self.fit(vals)
        return self.transform(vals)


class NaiveBayesPreprocessor(object):
    """
    Don't pass in Nans. fill with keyword.
    """

    OTHER = '____OTHER____'
    FILLNA = '____NA____'

    def __init__(self, alpha=1.0, min_freq=0.01, bins=20):
        self.alpha = alpha  # Laplace smoothing
        self.min_freq = min_freq  # drop values occurring less frequently than this
        self.bins = bins  # number of bins for continuous fields

    def learn_continuous_transf(self, series):
        D = Discretizer(upperlim=self.bins)
        D.fit(series)
        return D

    def learn_discrete_transf(self, series):
        vcs = series.value_counts(dropna=False, normalize=True)
        vcs = vcs[vcs >= self.min_freq]
        keep = set(vcs.index)
        transf = lambda r: r if r in keep else self.OTHER
        return transf

    def learn_transf(self, series):
        if series.dtype == np.float64:
            return self.learn_continuous_transf(series)
        else:
            return self.learn_discrete_transf(series)

    def fit(self, X_orig, y=None):
        """
        Expects pandas series and pandas DataFrame
        """
        # get dtypes
        self.dtypes = defaultdict(set)
        for fld, dtype in X_orig.dtypes.iteritems():
            self.dtypes[dtype].add(fld)

        X = X_orig
        # X = X_orig.fillna(self.FILLNA)
        # get transfs
        self.transformations = {
            fld: self.learn_transf(series)
            for fld, series in X.iteritems()}

    def transform(self, X_orig, y=None):
        """
        Expects pandas series and pandas DataFrame
        """
        X = X_orig.copy()
        # X = X_orig.fillna(self.FILLNA)
        for fld, func in self.transformations.items():
            if isinstance(func, Discretizer):
                X[fld] = func.transform(X[fld])
            else:
                X[fld] = X[fld].map(func)
        return X

    def fit_transform(self, X, y=None):
        self.fit(X)
        return self.transform(X)


class NaiveBayesClassifier(object):

    def __init__(self, alpha=1.0, class_priors=None, **kwargs):
        self.alpha = alpha
        self.class_priors = class_priors

    def get_class_log_priors(self, y):
        self.classes_ = y.unique()
        if self.class_priors is None:
            self.class_priors = y.value_counts(normalize=1)
        elif isinstance(self.class_priors, str) and self.class_priors == 'equal':
            raise NotImplementedError
        self.class_log_priors = self.class_priors.map(np.log)

    def get_log_likelihoods(self, fld):
        table = self.groups[fld].value_counts(dropna=False).unstack(fill_value=0)
        table += self.alpha
        sums = table.sum(axis=1)
        likelihoods = table.apply(lambda r: r/sums, axis=0)
        log_likelihoods = likelihoods.applymap(np.log)
        return {k if pd.notnull(k) else None: v for k, v in log_likelihoods.items()}

    def fit(self, X, y):
        y = pd.Series(y)
        self.get_class_log_priors(y)
        self.groups = X.groupby(y)
        self.log_likelihoods = {
            fld: self.get_log_likelihoods(fld)
            for fld, series in X.iteritems()
        }

    def get_approx_log_posterior(self, series, class_):
        log_posterior = self.class_log_priors[class_]  # prior
        for fld, val in series.iteritems():
            # there are cases where the `val` is not seen before
            # as in having a `nan` in the scoring dataset,
            #   but no `nans in the training set
            # in those cases, we want to not add anything to the log_posterior

            # This is to handle the Nones and np.nans etc.
            val = val if pd.notnull(val) else None

            if val not in self.log_likelihoods[fld]:
                continue
            log_posterior += self.log_likelihoods[fld][val][class_]
        return log_posterior

    def decision_function_series(self, series):
        approx_log_posteriors = [
            self.get_approx_log_posterior(series, class_)
            for class_ in self.classes_]
        return pd.Series(approx_log_posteriors, index=self.classes_)

    def decision_function_df(self, df):
        return df.apply(self.decision_function_series, axis=1)

    def decision_function(self, X):
        """
        returns the log posteriors
        """
        if isinstance(X, pd.DataFrame):
            return self.decision_function_df(X)
        elif isinstance(X, pd.Series):
            return self.decision_function_series(X)
        elif isinstance(X, dict):
            return self.decision_function_series(pd.Series(X))

    def predict_proba(self, X, normalize=True):
        """
        returns the (normalized) posterior probability

        normalization is just division by the evidence. doesn't change the argmax.
        """
        log_post = self.decision_function(X)
        if isinstance(log_post, pd.Series):
            post = log_post.map(np.exp)
        elif isinstance(log_post, pd.DataFrame):
            post = log_post.applymap(np.exp)
        else:
            raise NotImplementedError('type of X is "{}"'.format(type(X)))
        if normalize:
            if isinstance(post, pd.Series):
                post /= post.sum()
            elif isinstance(post, pd.DataFrame):
                post = post.div(post.sum(axis=1), axis=0)
        return post

    def predict(self, X):
        probas = self.decision_function(X)
        if isinstance(probas, pd.Series):
            return np.argmax(probas)
        return probas.apply(np.argmax, axis=1)

    def score(self, X, y):
        preds = self.predict(X)
        return np.mean(np.array(y) == preds.values)
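
Before moving on, here is a quick usage sketch of the Discretizer in isolation (not from the original notebook), just to show the quantile-style binning the preprocessor relies on:

import numpy as np
import pandas as pd

rng = np.random.RandomState(0)
series = pd.Series(rng.exponential(size=1000))

disc = Discretizer(upperlim=5)
disc.fit(series)
print(disc.cutoffs)                                         # left edges of the learned bins
print(disc.transform(series).value_counts().sort_index())   # how the samples fall into the bins
print(disc.transform_single(series.median()))               # a single value maps to its bin label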

3.2. Quick Review of boosting

The idea of boosting consists of the following major parts:

  1. A weak (or “base”) learner which fits to the dataset.
  2. A loss function which evaluates the predictions against the truth values.
  3. A new weak learner trained on new “truth values”, adjusted so that past mistakes weigh more heavily. The idea is to correct the mistakes iteratively.
  4. A method (usually a line search) that defines how to blend the freshly trained weak learner into the mix of the past weak learners.

To implement this with the NB classifier as the weak learner, we first need a new version of our NB classifier that can be trained with arbitrary values as the truth set. But the NB algorithm is naturally a classifier. Classical boosting algorithms such as AdaBoost and GBTs use “regression trees”, which are decision trees that output continuous values, as their weak learners. Modifying NB to do that seems a bit awkward. Therefore, I chose not to touch the natural “classifier” aspect of NB. Instead, I coded a version of the NB classifier with sample weights; that is, the samples that the past weak learners got wrong are weighted more heavily in the next iteration.

Our loss function will be the same as the BinomialDeviance loss function that GBTs use in sklearn.
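
In other words, writing p for the predicted posterior probability of the positive class and f = log(p / (1 - p)) for the corresponding log odds, the per-sample deviance is -2 * (y * f - log(1 + exp(f))); the loss method below averages this over the samples unless it is asked for the vectorized, per-sample version.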

3.3. Code for the Boosted NB

First, let’s introduce the weak learner, i.e. the WeightedNaiveBayesClassifier. This inherits from NaiveBayesClassifier and modifies the key methods to accommodate sample weights. The code differs only in the way it stores the log likelihoods and the class priors; the scoring / predicting methods are the same, as they only read from the stored log likelihoods and class priors.

class WeightedNaiveBayesClassifier(NaiveBayesClassifier):

    def get_class_log_priors(self, y, weights):
        self.classes_ = y.unique()
        self.class_priors = {
            class_: ((y == class_)*weights).mean()
            for class_ in self.classes_}
        self.class_log_priors = {
            class_: np.log(prior)
            for class_, prior in self.class_priors.items()}

    def get_log_likelihood(self, y, class_, series, val, vals, weights):
        # indices of `series` and `y` and `weights` must be the same
        # XXX: what to do with alpha? should we smooth it out somehow?
        cond = y == class_
        num = ((series.loc[cond] == val) * weights.loc[cond]).sum() + self.alpha
        denom = (weights.loc[cond]).sum() + len(vals) * self.alpha
        return np.log(num/denom)

    def get_log_likelihoods(self, series, y, weights):
        vals = series.unique()
        y_ = y.reset_index(drop=True, inplace=False)
        series_ = series.reset_index(drop=True, inplace=False)
        weights_ = weights.reset_index(drop=True, inplace=False)

        log_likelihoods = {
            val: {class_: self.get_log_likelihood(y_, class_, series_, val, vals, weights_)
                  for class_ in self.classes_}
            for val in vals}
        return log_likelihoods

    def fit(self, X, y, weights=None):
        if weights is None:
            return super(WeightedNaiveBayesClassifier, self).fit(X, y)
        weights *= len(weights) / weights.sum()
        y = pd.Series(y)
        self.get_class_log_priors(y, weights)
        self.log_likelihoods = {
            fld: self.get_log_likelihoods(series, y, weights)
            for fld, series in X.iteritems()
        }
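
As a small illustration (not in the original post), we can point the weighted variant back at the toy tennis data from section 1 and upweight row 5, the one the vanilla NB put on the wrong side of 0.5:

# assumes X and y are the toy "play tennis" frames from section 1
weights = pd.Series(1.0, index=y.index)
weights[5] = 3.0   # hypothetical choice: make the misclassified row count three times as much

wnb = WeightedNaiveBayesClassifier()
wnb.fit(X, y, weights)
print(wnb.predict_proba(X))   # the posteriors shift relative to the unweighted fit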

Now we’re ready for the final piece, the NaiveBayesBoostingClassifier. Here’s the code.

class NaiveBayesBoostingClassifier(object):
    """
    For now, this is Binary classification only.
    `y` needs to consist of 0 or 1
    """

    TRUTHFLD = '__'

    def __init__(self, alpha=1.0, n_iter=5, learning_rate=0.1):
        self.n_iter = n_iter
        self.alpha = alpha
        self.learning_rate = learning_rate

    def loss(self, y, pred, vectorize=False):
        """
        This uses the `BinomialDeviance` loss function.
        That means, this implementation of NaiveBayesBoostingClassifier
            is for binary classification only. Change this function
            to implement multiclass classification

        Compute the deviance (= 2 * negative log-likelihood).

        note that `pred` here is the actual predicted posterior proba.
        `pred` in sklearn is log odds
        """
        logodds = np.log(pred/(1-pred))
        # logaddexp(0, v) == log(1.0 + exp(v))
        ans = -2.0 * ((y * logodds) - np.logaddexp(0.0, logodds))
        return ans if vectorize else np.mean(ans)

    def adjust_weights(self, X, y, weights):
        if weights is None:
            return pd.Series([1]*len(y), index=y.index)
            # return -- this errors out bc defers to non weighted nb above

        pred = self.predicted_posteriors
        return self.loss(y, pred, vectorize=True)

    def line_search_helper(self, new_preds, step):
        if self.predicted_posteriors is None:
            return new_preds
        ans = self.predicted_posteriors * (1-step)
        ans += step * new_preds
        return ans

    def line_search(self, new_preds, y):
        # TODO: This can be done using scipy.optimize.line_search
        # but for now, we'll just try 10 values and pick the best one
        if not self.line_search_results:
            self.line_search_results = [1]
            return 1
        steps_to_try = -1 * np.arange(
            -self.learning_rate, 0, self.learning_rate/10)
        step = min(
            steps_to_try,
            key=lambda s: self.loss(
                y, self.line_search_helper(new_preds, s))
        )
        self.line_search_results.append(step)
        return step

    def _predict_proba_1(self, est, X):
        # XXX: another place where we assume binary clf
        # TODO: need a robust way to get 1
        return est.predict_proba(X)[1]

    def fit(self, X, y, weights=None):
        self.stages = []
        self.predicted_posteriors = None
        self.line_search_results = []
        weights = None
        for i in range(self.n_iter):
            weights = self.adjust_weights(X, y, weights)
            nbc = WeightedNaiveBayesClassifier(alpha=self.alpha)
            nbc.fit(X, y, weights)
            new_preds = self._predict_proba_1(nbc, X)
            self.stages.append(nbc)
            self.line_search(new_preds, y)
            self.predicted_posteriors = self.line_search_helper(
                new_preds, self.line_search_results[-1])

    def decision_function_df(self, X, staged=False):
        stage_posteriors = [
            self._predict_proba_1(est, X) for est in self.stages]
        posteriors = 0
        if staged:
            staged_posteriors = dict()
        for i, (sp, step) in enumerate(zip(stage_posteriors, self.line_search_results)):
            posteriors = (1-step)*posteriors + step*sp
            if staged:
                staged_posteriors[i] = posteriors
        return posteriors if not staged else pd.DataFrame(staged_posteriors)

    def decision_function(self, X):
        if isinstance(X, pd.DataFrame):
            return self.decision_function_df(X)
        else:
            raise NotImplementedError

    def predict(self, X, thr=0.5):
        probas = self.decision_function(X)
        return (probas >= thr).astype(int)
        # # draft for multiclass approach below
        # if isinstance(probas, pd.Series):
        #     return np.argmax(probas)
        # return probas.apply(np.argmax, axis=1)

    def score(self, X, y):
        preds = self.predict(X)
        return np.mean(np.array(y) == preds.values)

Perhaps the most interesting part of this code is the fit method. There we see the for loop which builds an NB classifier for each boosting iteration. The adjust_weights method computes the new sample weights according to the loss function, which are then fed to the WeightedNaiveBayesClassifier for the next weak learner.

The way the loss function and the adjust_weights method are coded restricts them to the binary classification case. One would need to replace the binomial deviance loss with a multinomial deviance loss and modify the weight adjustment accordingly to accommodate the multiclass case.
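
For reference, the multiclass analogue in sklearn’s GBTs is the multinomial deviance, proportional to -sum_k 1{y = k} * log p_k with p_k the predicted class probabilities; sklearn fits one weak learner per class in each iteration, and a multiclass boosted NB would need a similar treatment.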

Then comes the line search part. Although it is possible to do an actual line search here, I opted for a simple “try 10 values and pick the best one” method. Not the best approach, for sure.
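
For reference, here is a sketch of what a more principled step selection could look like, swapping the 10-point grid for scipy’s bounded scalar minimization. This is my own untested variant, not part of nb.py.

from scipy.optimize import minimize_scalar

def line_search(self, new_preds, y):
    # first stage: take the new predictions as-is, matching the original behavior
    if not self.line_search_results:
        self.line_search_results = [1]
        return 1
    # minimize the deviance over step sizes in [0, learning_rate]
    res = minimize_scalar(
        lambda s: self.loss(y, self.line_search_helper(new_preds, s)),
        bounds=(0.0, self.learning_rate), method='bounded')
    self.line_search_results.append(res.x)
    return res.x

One would attach this as NaiveBayesBoostingClassifier.line_search (or define it in a subclass) to replace the grid version.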

Feel free to contact me (@taylanbil on Twitter, LinkedIn, Facebook, etc.) with any questions or comments.